
Spark SQL parquet file


May 17, 2021 Spark Programming guide


Parquet file

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files, automatically preserving the schema of the original data.

Load the data

// sqlContext from the previous example is used in this example.
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")

// Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a SchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
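
Because Parquet files are self-describing, the schema can be inspected directly on the loaded SchemaRDD. A minimal sketch, assuming the people.parquet file written above and the Person case class (with name and age fields) from the previous example:

// Print the schema recovered from the self-describing Parquet file;
// no external schema definition is needed.
parquetFile.printSchema()

// The recovered schema is also available programmatically as a StructType.
val recoveredSchema = parquetFile.schema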

Configuration

Parquet can be configured using the setConf method on SQLContext, or by running SET key=value commands in SQL, as in the sketch after the table below.

Property Name Default Meaning
spark.sql.parquet.binaryAsString false Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as strings to provide compatibility with these systems.
spark.sql.parquet.cacheMetadata true Turns on caching of Parquet schema metadata. Can speed up queries over static data.
spark.sql.parquet.compression.codec gzip Sets the compression codec used when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo.
spark.sql.parquet.filterPushdown false Turns on Parquet filter pushdown optimization. This feature is turned off by default because of a known bug in Parquet. However, if your table contains no nullable string or binary columns, it is still safe to turn this feature on.
spark.sql.hive.convertMetastoreParquet true When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
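
A minimal sketch of both configuration routes, reusing the sqlContext from the examples above; the property values shown here are illustrative choices, not required settings:

// Set a Parquet option programmatically via setConf on SQLContext.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// The same kind of option can also be set with a SQL SET command.
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")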