Comparison of Table Formats for Data Warehouse
Arjun Reddy Lingala
arjunreddy.lingala@gmail.com
Abstract—Modern data warehouses are built on distributed file systems and object storage, which offer scalability, data availability, and performance. Table formats define how data files are organized and stored on the file system. The evolution of data warehousing has given rise to diverse table formats, each with its own architecture and capabilities, aimed at query performance, scalability, and storage optimization. The Hive table format, a foundational component of the Hadoop ecosystem, relies on a centralized metastore and manual partitioning, but its query performance suffers for workloads that require incremental updates or complex query patterns. Its fixed schema structure requires downtime and manual intervention for schema changes, and query planning is slow for tables with a very large number of partitions. The Iceberg table format addresses these issues with decentralized metadata management, snapshot isolation, and hidden partitioning. Iceberg supports dynamic schema adjustments with version control and backward compatibility, and its atomic commits ensure consistency in highly concurrent environments. This paper discusses how data files are stored and how read and write patterns work, identifies the pain points of the Hive table format, and describes the Iceberg table format in detail: how it manages files on the file system and how it addresses the challenges of the Hive format. The comparison and overview aim to guide organizations in transitioning to table formats that align with modern analytics requirements while ensuring long-term scalability and performance.
Keywords—Hive, Iceberg, Table Formats, Data Warehousing, Apache Hadoop, Schema Evolution, Performance, Scalability