May 17, 2021 Spark Programming Guide
For some workloads, you can improve performance by caching data in memory or by turning on some experimental options.
Spark SQL can cache tables in an in-memory columnar format by calling
sqlContext.cacheTable("tableName")
Spark will then scan only the required columns and automatically compress the data to reduce memory usage and garbage-collection pressure.
You can remove a table from the in-memory cache with the
sqlContext.uncacheTable("tableName")
method.
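A minimal sketch of this workflow, assuming an existing SQLContext named sqlContext and a registered table named "records" (both names are hypothetical):

```scala
// Cache the table "records" in an in-memory columnar format.
// Spark scans only the columns a query needs and compresses them.
sqlContext.cacheTable("records")

// Queries against the cached table are now served from memory.
val count = sqlContext.sql("SELECT COUNT(*) FROM records")

// Remove the table from the in-memory cache when it is no longer needed.
sqlContext.uncacheTable("records")
```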
Note that if you call
schemaRDD.cache()
instead of
sqlContext.cacheTable(...)
the table will not be cached in columnar format. In this case,
sqlContext.cacheTable(...)
is strongly recommended.
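The difference can be sketched as follows, assuming a SchemaRDD named schemaRDD and a hypothetical table name "records":

```scala
// Not recommended: caches the RDD as generic row objects,
// NOT in the columnar format, so column pruning and compression are lost.
schemaRDD.cache()

// Recommended: register the table, then cache it in columnar format.
schemaRDD.registerTempTable("records")
sqlContext.cacheTable("records")
```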
You can configure the in-memory cache using the setConf method on SQLContext, or by running a
SET key=value
command in SQL.
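Both forms are sketched below, using the properties from the table that follows (the values shown are just the defaults):

```scala
// Programmatic configuration via setConf on SQLContext.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

// Equivalent SQL form.
sqlContext.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=10000")
```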
Property Name | Default | Meaning |
---|---|---|
spark.sql.inMemoryColumnarStorage.compressed | true | When set to true, Spark SQL automatically selects a compression algorithm for each column based on statistics. |
spark.sql.inMemoryColumnarStorage.batchSize | 10000 | Controls the batch size for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OutOfMemoryErrors when caching data. |
The following options can also be used to tune the performance of query execution. These options may be deprecated in a future release as more optimizations are performed automatically.
Property Name | Default | Meaning |
---|---|---|
spark.sql.autoBroadcastJoinThreshold | 10485760 (10 MB) | Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. Set this value to -1 to disable broadcasting. Note that statistics are currently only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run. |
spark.sql.codegen | false | When true, code for expression evaluation in a particular query is generated dynamically at runtime. For some queries with complex expressions, this option can yield a significant speedup. However, for simple queries it can slow down execution. |
spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
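For example, a tuning pass might lower the broadcast threshold and raise the shuffle parallelism; the specific values here are illustrative, not recommendations:

```scala
// Broadcast only tables smaller than 5 MB, instead of the 10 MB default.
// A value of -1 would disable broadcast joins entirely.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (5 * 1024 * 1024).toString)

// Use more shuffle partitions for large joins and aggregations.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
```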