May 17, 2021 Spark Programming guide
Parquet is a columnar format supported by many other data processing systems. Spark SQL provides support for reading and writing Parquet files that automatically preserves the schema of the original data.
// sqlContext from the previous example is used in this example.
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a SchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")
// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Parquet can be configured using the setConf method on SQLContext, or by running
SET key=value
commands when using SQL.
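As a minimal sketch of the two configuration styles described above (assuming the sqlContext from the earlier examples, and using the compression codec property from the table below as an illustration):

```scala
// Programmatic configuration via setConf on the existing SQLContext.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// Equivalent configuration via a SQL SET command.
sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")
```

Both forms set the same session-level property; the SQL form is convenient when working from a SQL shell rather than Scala code.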
Property Name | Default | Meaning |
---|---|---|
spark.sql.parquet.binaryAsString | false | Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not distinguish between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as strings to provide compatibility with these systems. |
spark.sql.parquet.cacheMetadata | true | Turns on caching of Parquet schema metadata, which can speed up queries against static data. |
spark.sql.parquet.compression.codec | gzip | Sets the compression codec used when writing Parquet files. Acceptable values: uncompressed, snappy, gzip, lzo. |
spark.sql.parquet.filterPushdown | false | Turns on Parquet filter pushdown optimization. This feature is turned off by default because of a known bug in Parquet. However, if your table does not contain any nullable string or binary columns, it is still safe to turn this feature on. |
spark.sql.hive.convertMetastoreParquet | true | When set to false, Spark SQL uses the Hive SerDe for Parquet tables instead of the built-in support. |
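For example, a session that reads Parquet files written by Impala would enable binaryAsString before loading the data, so that binary columns come back as strings. This is a sketch assuming the sqlContext from the earlier examples and a hypothetical file path:

```scala
// Impala writes strings as plain binary in the Parquet schema;
// ask Spark SQL to interpret binary columns as strings.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Hypothetical path to Parquet data produced by Impala.
val impalaData = sqlContext.parquetFile("hdfs://.../impala_table.parquet")
impalaData.registerTempTable("impalaData")
```

Without the flag, string-typed columns from such files would surface as raw byte arrays rather than strings.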