Apache Parquet (Columnar Storage)
Parquet is a columnar storage format optimized for analytical queries on large datasets. By storing data column-by-column rather than row-by-row, Parquet enables efficient compression and fast queries that read only the columns needed.
MIME Type
application/vnd.apache.parquet
Type
Binary
Compression
Lossless
Advantages
- + Excellent compression through columnar encoding
- + Fast analytical queries — reads only needed columns
- + Predicate pushdown skips irrelevant row groups entirely
- + Standard in Spark, DuckDB, Pandas, and cloud data lakes
Disadvantages
- − Not suited for transactional row-level updates
- − More complex to write than CSV or JSON
- − Schema evolution is limited — new columns can be added, but renaming or changing a column's type typically means rewriting files
When to Use .parquet
Use Parquet for data lakes, analytics workloads, Spark/Pandas processing, and any large dataset where columnar queries are dominant.
Technical Details
Parquet files contain row groups, each divided into column chunks that are stored as pages with encodings such as dictionary, run-length (RLE), and delta. Per-column min/max statistics in the file footer enable predicate pushdown, and nested data is supported via Dremel-style record shredding (repetition and definition levels).
History
Twitter and Cloudera created Parquet in 2013, inspired by Google's Dremel paper. It became an Apache project and is now the default format for data lakes, Spark, and modern analytics platforms.