May 17, 2021 Spark Programming guide
Most of the configuration options for Spark on YARN are the same as those available for other deployment modes. The following are the options specific to Spark on YARN.
Property Name | Default | Meaning |
---|---|---|
spark.yarn.applicationMaster.waitTries | 10 | The number of times the ApplicationMaster waits for the Spark master, which is also the number of SparkContext initialization attempts |
spark.yarn.submit.file.replication | HDFS default replication (usually 3) | HDFS replication level for files uploaded to HDFS for the application. These include the Spark jar, the app jar, and any distributed cache files/archives |
spark.yarn.preserve.staging.files | false | Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than deleting them |
spark.yarn.scheduler.heartbeat.interval-ms | 5000 | The interval (in ms) at which the Spark application master heartbeats into the YARN ResourceManager |
spark.yarn.max.executor.failures | numExecutors * 2, with minimum of 3 | The maximum number of executor failures before failing the application |
spark.yarn.historyServer.address | (none) | The address of the Spark history server, e.g. host.com:18080. The address should not contain a scheme (http://). It is unset by default because the history server is an optional service. This address is given to the YARN ResourceManager when the Spark application finishes, to link the application from the ResourceManager UI to the Spark history server UI |
spark.yarn.dist.archives | (none) | Comma-separated list of archives to be extracted into the working directory of each executor |
spark.yarn.dist.files | (none) | Comma-separated list of files to be placed in the working directory of each executor |
spark.yarn.executor.memoryOverhead | executorMemory * 0.07, with minimum of 384 | The amount of off-heap memory (in MB) to be allocated per executor. This accounts for things like VM overheads, interned strings, and other native overheads, and tends to grow with the executor size (typically 6-10%) |
spark.yarn.driver.memoryOverhead | driverMemory * 0.07, with minimum of 384 | The amount of off-heap memory (in MB) to be allocated to the driver. This accounts for things like VM overheads, interned strings, and other native overheads, and tends to grow with the container size (typically 6-10%) |
spark.yarn.queue | default | The name of the YARN queue to which the application is submitted |
spark.yarn.jar | (none) | The location of the Spark jar file, overriding the default location. By default, Spark on YARN uses the locally installed Spark jar, but the Spark jar can also be placed at a common location on HDFS. This allows YARN to cache it on the nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, use e.g. "hdfs:///some/path" |
spark.yarn.access.namenodes | (none) | A comma-separated list of HDFS namenodes your Spark application will access. For example, spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032. The Spark application must have access to the namenodes listed, and Kerberos must be properly configured to be able to reach them. Spark acquires security tokens for each of the namenodes so that the application can access those remote HDFS clusters |
spark.yarn.containerLauncherMaxThreads | 25 | The maximum number of threads used by the application master to launch executor containers |
spark.yarn.appMasterEnv.[EnvironmentVariableName] | (none) | Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN. Users can specify multiple of these to set multiple environment variables. In yarn-cluster mode this controls the environment of the Spark driver; in yarn-client mode it only controls the environment of the executor launcher |
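Instead of passing each of these properties on the command line with --conf, they can also be set once in conf/spark-defaults.conf. A sketch with hypothetical values (queue name and overhead size are examples, not defaults):

```
spark.yarn.queue                      thequeue
spark.yarn.executor.memoryOverhead    512
spark.yarn.preserve.staging.files     true
spark.yarn.max.executor.failures      6
```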
Make sure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory containing the (client side) configuration files for the Hadoop cluster. These configs are used to write data to HDFS and to connect to the YARN ResourceManager.
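For example (the path below is a hypothetical Hadoop client-config location; substitute your cluster's config directory):

```shell
# Point Spark at the Hadoop/YARN client-side configuration.
# /etc/hadoop/conf is a hypothetical path; use your cluster's config dir.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR="$HADOOP_CONF_DIR"
```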
There are two deployment modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process that is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Unlike Spark standalone and Mesos modes, in which the master's address is specified in the "master" parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the master parameter is simply yarn-client or yarn-cluster.
To launch a Spark application in yarn-cluster mode:
./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
Example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
The above starts a YARN client program which launches the default Application Master; SparkPi then runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates and displays them in the console. The client exits once your application has finished running.
To launch a Spark application in yarn-client mode, do the same, but replace yarn-cluster with yarn-client. For example, to run the Spark shell:
$ ./bin/spark-shell --master yarn-client
In yarn-cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar will not work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command:
$ ./bin/spark-submit --class my.main.Class \
--master yarn-cluster \
--jars my-other-jar.jar,my-other-other-jar.jar \
my-main-jar.jar \
app_arg1 app_arg2
Note that the local directories used by Spark executors on YARN are the local directories configured for YARN (yarn.nodemanager.local-dirs); if the user specifies spark.local.dir, it will be ignored.
The --files and --archives options support specifying file names with #, similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt; this uploads the file localtest.txt to HDFS, but links to it by the name appSees.txt. Your application should use the name appSees.txt to refer to the file when running on YARN.
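Put together, a launch command using the # alias might look like the following (the class and jar names are hypothetical placeholders):

```shell
$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --files localtest.txt#appSees.txt \
    my-main-jar.jar
```

Inside the application, opening appSees.txt then reads the uploaded localtest.txt from the container's working directory.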
The --jars option allows SparkContext.addJar to work when you are using it with local files and running in yarn-cluster mode. It is not needed if you are using it with HDFS, HTTP, HTTPS, or FTP files.