
Spark SQL parquet file


May 17, 2021 Spark Programming guide


Parquet file

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files, automatically preserving the schema of the original data.

Load the data

// sqlContext from the previous example is used in this example.
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")

// Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a SchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
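
Because Parquet files are self-describing, the schema can be inspected directly on the loaded SchemaRDD. A minimal sketch, assuming the people.parquet file written above and the Person case class (with name and age fields) from the previous example:

// Print the schema recovered from the self-describing Parquet file;
// no external schema definition is needed.
parquetFile.printSchema()

// The recovered schema is also available programmatically as a StructType.
val recoveredSchema = parquetFile.schema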

Configuration

Parquet can be configured using the setConf method on SQLContext, or by running SET key=value commands in SQL, as in the sketch after the table below.

Property Name Default Meaning
spark.sql.parquet.binaryAsString false Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as strings to provide compatibility with these systems.
spark.sql.parquet.cacheMetadata true Turns on caching of Parquet schema metadata. Can speed up queries over static data.
spark.sql.parquet.compression.codec gzip Sets the compression codec used when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo.
spark.sql.parquet.filterPushdown false Turns on Parquet filter pushdown optimization. This feature is turned off by default because of a known bug in Parquet. However, if your table contains no nullable string or binary columns, it is still safe to turn this feature on.
spark.sql.hive.convertMetastoreParquet true When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
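
A minimal sketch of both configuration routes, reusing the sqlContext from the examples above; the property values shown here are illustrative choices, not required settings:

// Set a Parquet option programmatically via setConf on SQLContext.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// The same kind of option can also be set with a SQL SET command.
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")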