
Why use Parquet Format in Spark

Parquet File Format – Advantages:

·      Nested data: Parquet is well suited to storing nested data. If the data stored in HDFS has a deep hierarchy, Parquet serves it well because it preserves that hierarchy as a tree-like, nested column structure.

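For example, here is a minimal sketch of writing and reading nested records, run in spark-shell (where `spark` and its implicits are predefined); the case classes and the /tmp path are illustrative, not from the original post:

```scala
// Nested case classes map to Parquet's nested (group) types.
case class Address(city: String, zip: String)
case class Customer(id: Long, name: String, address: Address)

val customers = Seq(
  Customer(1L, "Alice", Address("Berlin", "10115")),
  Customer(2L, "Bob",   Address("Paris",  "75001"))
).toDS()

// The nested struct is preserved in the Parquet schema on disk.
customers.write.mode("overwrite").parquet("/tmp/customers_parquet")

// Nested fields can be selected with dot notation when reading back.
spark.read.parquet("/tmp/customers_parquet")
  .select("id", "address.city")
  .show()
```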
·      Predicate pushdown efficiency: Parquet is very useful for query optimization. If you have to apply filtering logic to a huge amount of data, the Spark Catalyst optimizer takes advantage of the file format and pushes much of that filtering down into it. As part of its metadata, Parquet stores statistics such as the minimum and maximum values per column chunk and row group, which lets the reader skip whole blocks of data using the metadata alone.

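A quick way to see this is to filter a Parquet-backed DataFrame and inspect the physical plan, where the pushed predicate appears under PushedFilters. A small sketch, again assuming a spark-shell session (the path and column name are hypothetical):

```scala
import org.apache.spark.sql.functions.col

val sales = spark.read.parquet("/data/sales_parquet")   // hypothetical dataset

// Catalyst pushes this predicate down to the Parquet reader, which can then
// skip row groups whose min/max statistics rule out matching rows.
val recent = sales.filter(col("year") >= 2020)

// The physical plan lists the pushed predicate, e.g. PushedFilters: [GreaterThanOrEqual(year,2020)]
recent.explain()
```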
·      Compression: Parquet stores data more compactly than other common file formats such as Avro, JSON, or CSV, so the same data takes up less space on disk.

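The codec can be chosen per write or session-wide; snappy is Spark's default for Parquet. A brief sketch, reusing the illustrative path from the nested-data example above:

```scala
val df = spark.read.parquet("/tmp/customers_parquet")

// Write the same DataFrame with different codecs; the on-disk sizes will differ.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/customers_snappy")
df.write.mode("overwrite").option("compression", "gzip").parquet("/tmp/customers_gzip")

// Or set the session-wide default codec for all Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
```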
·   Less disk IO: Parquet with compression reduces data storage by about 75% on average, i.e., 1 TB scale-factor data files materialize as only about 250 GB on disk. This significantly reduces the input data needed by your Spark SQL applications. Starting with Spark 1.6.0, Parquet readers also use push-down filters to further reduce disk IO; push-down filters allow data-selection decisions to be made before the data is even read into Spark.

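Push-down of Parquet filters is governed by a Spark SQL setting that is enabled by default; toggling it and comparing plans is a simple way to observe its effect. A rough sketch (the path and column name are hypothetical):

```scala
import org.apache.spark.sql.functions.col

// Enabled by default; shown here only to make the setting explicit.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

val q = spark.read.parquet("/data/sales_parquet").filter(col("year") >= 2020)
q.explain()   // PushedFilters appears in the plan when pushdown is enabled
```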
·   Higher scan throughput in Spark 1.6.0: The Databricks Spark 1.6.0 release blog mentioned significantly higher Parquet scan throughput because a “more optimized code path” is used. To see this in the real world, we ran query 97 in Spark 1.5.1 and in 1.6.0 and captured the results; the improvement is very obvious.

·   Efficient Spark execution graph: In addition to smarter readers such as Parquet's, the data format also directly impacts the Spark execution graph, because one major input to the scheduler is the RDD count.

·   Spark SQL is much faster with Parquet! Comparing the sum of execution times of 24 queries running in Spark 1.5.1: the queries took about 12 hours to complete using flat CSV files vs. less than 1 hour using Parquet, roughly an 11X performance improvement.

·   Spark SQL works better at large scale with Parquet: a poor choice of storage format often causes exceptions that are difficult to diagnose and fix. At the 1 TB scale factor, for example, at least a third of all runnable queries failed to complete using flat CSV files, but all of them completed using Parquet files.

·   Data security, since the data is not human-readable.
·   Low storage consumption.
·   Efficient reads: columnar storage lets queries fetch only the columns they need, minimizing latency.
·   Supports advanced nested data structures and is optimized for queries that process large volumes of data.
·   Higher execution speed than other standard file formats such as Avro and JSON, and less disk space consumed than either (see the conversion sketch below).
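Tying these points together, converting an existing CSV dataset to Parquet is a one-time step, after which jobs query the columnar copy. A rough end-to-end sketch, assuming a spark-shell session (paths and column names are hypothetical):

```scala
// Read the CSV source once and persist it as Parquet.
val csv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/events_csv")               // hypothetical input path

csv.write.mode("overwrite").parquet("/data/events_parquet")

// Subsequent jobs read only the columns they need from the columnar files.
spark.read.parquet("/data/events_parquet")
  .groupBy("event_type")
  .count()
  .show()
```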
