
Run Spark on YARN


May 17, 2021 Spark Programming guide


Run Spark on YARN

Configuration

Most of the configurations for Spark on YARN are the same as for other deployment modes. The following are specific to Spark on YARN.

Spark properties

Property Name | Default | Meaning
spark.yarn.applicationMaster.waitTries | 10 | The number of times the ApplicationMaster waits for the Spark master, and also the number of tries it waits for the SparkContext to be initialized.
spark.yarn.submit.file.replication | HDFS default replication (usually 3) | HDFS replication level for the files uploaded to HDFS for the application. These include the Spark jar, the app jar, and any distributed cache files/archives.
spark.yarn.preserve.staging.files | false | Set to true to preserve the staged files (Spark jar, app jar, and any distributed cache files) at the end of the job instead of deleting them.
spark.yarn.scheduler.heartbeat.interval-ms | 5000 | The interval in milliseconds at which the Spark application master sends heartbeats to the YARN ResourceManager.
spark.yarn.max.executor.failures | numExecutors * 2, with a minimum of 3 | The maximum number of executor failures before the application is failed.
spark.yarn.historyServer.address | (none) | The address of the Spark history server, e.g. host.com:18080. The address should not contain a scheme (http://). The value is not set by default because the history server is optional. This address is given to the YARN ResourceManager when the Spark application finishes, to link the application from the ResourceManager UI to the Spark history server UI.
spark.yarn.dist.archives | (none) | Comma-separated list of archives to be extracted into the working directory of each executor.
spark.yarn.dist.files | (none) | Comma-separated list of files to be placed in the working directory of each executor.
spark.yarn.executor.memoryOverhead | executorMemory * 0.07, with a minimum of 384 | The amount of off-heap memory, in MB, allocated to each executor. It accounts for VM overheads, interned strings, and other native overheads. It tends to grow with the executor size (typically 6%-10%).
spark.yarn.driver.memoryOverhead | driverMemory * 0.07, with a minimum of 384 | The amount of off-heap memory, in MB, allocated to the driver. It accounts for VM overheads, interned strings, and other native overheads. It tends to grow with the container size (typically 6%-10%).
spark.yarn.queue | default | The name of the YARN queue to which the application is submitted.
spark.yarn.jar | (none) | The location of the Spark jar file, overriding the default location. By default, Spark on YARN uses a locally installed Spark jar, but the Spark jar can also be placed in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it does not have to be distributed each time an application runs. To point to a jar on HDFS, use a value such as "hdfs:///some/path".
spark.yarn.access.namenodes | (none) | The list of HDFS namenodes your Spark application is going to access, for example spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032. The Spark application must have access to the namenodes listed, and Kerberos must be properly configured to access them. Spark acquires security tokens for each of the namenodes so that the Spark application can access those remote HDFS clusters.
spark.yarn.containerLauncherMaxThreads | 25 | The maximum number of threads used by the application master to launch executor containers.
spark.yarn.appMasterEnv.[EnvironmentVariableName] | (none) | Adds the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN. Users can specify multiple of these to set multiple environment variables. In yarn-cluster mode this controls the environment of the Spark driver; in yarn-client mode it only controls the environment of the executor launcher.
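
These properties can also be passed on the command line with the --conf flag of spark-submit. A minimal sketch, where the queue name and the 512 MB overhead are only illustrative values:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --conf spark.yarn.queue=thequeue \
    --conf spark.yarn.executor.memoryOverhead=512 \
    lib/spark-examples*.jar \
    10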

Start Spark on YARN

Make sure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory that contains the (client-side) configuration files for the Hadoop cluster. These configurations are used to write data to HDFS and to connect to the YARN ResourceManager.
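
For example, if the Hadoop client configuration lives under /etc/hadoop/conf (an assumed path; substitute the location used on your cluster), the variables can be exported before running spark-submit:

$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ export YARN_CONF_DIR=/etc/hadoop/conf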

There are two deployment modes that you can use to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside the application master process, which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is used only to request resources from YARN.

Unlike Spark standalone mode and Mesos mode, where the master address is specified by the "master" parameter, in YARN mode the ResourceManager address is obtained from the Hadoop configuration. The master parameter is therefore simply yarn-client or yarn-cluster.

To start a Spark application in yarn-cluster mode:

./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]

Example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10

The above starts a YARN client program which launches the default Application Master, and SparkPi then runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates and displays them in the console. The client exits once your application has finished running.

To start a Spark application in yarn-client mode, for example to run spark-shell:

$ ./bin/spark-shell --master yarn-client
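
The usual resource options of spark-submit can be passed to spark-shell as well; a sketch that requests two executors with 1 GB of memory each (the numbers are illustrative only):

$ ./bin/spark-shell --master yarn-client \
    --num-executors 2 \
    --executor-memory 1g \
    --executor-cores 1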

Add another jar

In yarn-cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar will not work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2

Precautions

  • Prior to Hadoop 2.2, YARN did not support resource requests for container cores. Therefore, when running against an earlier version, the number of cores specified by the command-line arguments cannot be passed to YARN. Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
  • The local directories used by the executors are the local directories configured for YARN (yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it is ignored.
  • The --files and --archives options support specifying file names with #, similar to Hadoop. For example, you can pass --files localtest.txt#appSees.txt: this uploads the local file localtest.txt to HDFS, but it is linked to by the name appSees.txt, and your application should use the name appSees.txt to reference the file when running on YARN (see the sketch after this list).
  • The --jars option allows SparkContext.addJar to work when you use it with local files in yarn-cluster mode. It is not needed if you use SparkContext.addJar with HDFS, HTTP, HTTPS, or FTP files.
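
A sketch of the # renaming mentioned above, reusing the placeholder class and jar names from the earlier example:

$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --files localtest.txt#appSees.txt \
    my-main-jar.jar

Inside the application, the file is then referenced as appSees.txt rather than localtest.txt.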