
Run Spark on YARN


May 17, 2021 Spark Programming guide


Run Spark on YARN

Configuration

Most of the configurations for Spark on YARN are the same as for other deployment modes. The following are specific to Spark on YARN.

Spark properties

Property Name | Default | Meaning
spark.yarn.applicationMaster.waitTries | 10 | The number of times the ApplicationMaster waits for the Spark master, and also the number of tries it waits for the SparkContext to be initialized.
spark.yarn.submit.file.replication | HDFS default replication (usually 3) | HDFS replication level for the files uploaded to HDFS for the application. These include the Spark jar, the app jar, and any distributed cache files/archives.
spark.yarn.preserve.staging.files | false | Set to true to preserve the staged files (Spark jar, app jar, and any distributed cache files) at the end of the job instead of deleting them.
spark.yarn.scheduler.heartbeat.interval-ms | 5000 | The interval in milliseconds at which the Spark application master sends heartbeats to the YARN ResourceManager.
spark.yarn.max.executor.failures | numExecutors * 2, with a minimum of 3 | The maximum number of executor failures before the application is failed.
spark.yarn.historyServer.address | (none) | The address of the Spark history server, e.g. host.com:18080. The address should not contain a scheme (http://). The value is not set by default because the history server is optional. This address is given to the YARN ResourceManager when the Spark application finishes, to link the application from the ResourceManager UI to the Spark history server UI.
spark.yarn.dist.archives | (none) | Comma-separated list of archives to be extracted into the working directory of each executor.
spark.yarn.dist.files | (none) | Comma-separated list of files to be placed in the working directory of each executor.
spark.yarn.executor.memoryOverhead | executorMemory * 0.07, with a minimum of 384 | The amount of off-heap memory, in MB, allocated to each executor. It accounts for VM overheads, interned strings, and other native overheads. It tends to grow with the executor size (typically 6%-10%).
spark.yarn.driver.memoryOverhead | driverMemory * 0.07, with a minimum of 384 | The amount of off-heap memory, in MB, allocated to the driver. It accounts for VM overheads, interned strings, and other native overheads. It tends to grow with the container size (typically 6%-10%).
spark.yarn.queue | default | The name of the YARN queue to which the application is submitted.
spark.yarn.jar | (none) | The location of the Spark jar file, overriding the default location. By default, Spark on YARN uses a locally installed Spark jar, but the Spark jar can also be placed in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it does not have to be distributed each time an application runs. To point to a jar on HDFS, use a value such as "hdfs:///some/path".
spark.yarn.access.namenodes | (none) | The list of HDFS namenodes your Spark application is going to access, for example spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032. The Spark application must have access to the namenodes listed, and Kerberos must be properly configured to access them. Spark acquires security tokens for each of the namenodes so that the Spark application can access those remote HDFS clusters.
spark.yarn.containerLauncherMaxThreads | 25 | The maximum number of threads used by the application master to launch executor containers.
spark.yarn.appMasterEnv.[EnvironmentVariableName] | (none) | Adds the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN. Users can specify multiple of these to set multiple environment variables. In yarn-cluster mode this controls the environment of the Spark driver; in yarn-client mode it only controls the environment of the executor launcher.
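
These properties can also be passed on the command line with the --conf flag of spark-submit. A minimal sketch, where the queue name and the 512 MB overhead are only illustrative values:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --conf spark.yarn.queue=thequeue \
    --conf spark.yarn.executor.memoryOverhead=512 \
    lib/spark-examples*.jar \
    10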

Start Spark on YARN

Make sure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory that contains the (client-side) configuration files for the Hadoop cluster. These configurations are used to write data to HDFS and to connect to the YARN ResourceManager.
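
For example, if the Hadoop client configuration lives under /etc/hadoop/conf (an assumed path; substitute the location used on your cluster), the variables can be exported before running spark-submit:

$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ export YARN_CONF_DIR=/etc/hadoop/conf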

There are two deployment modes that you can use to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside the application master process, which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is used only to request resources from YARN.

Unlike Spark standalone mode and Mesos mode, where the master address is specified by the "master" parameter, in YARN mode the ResourceManager address is obtained from the Hadoop configuration. The master parameter is therefore simply yarn-client or yarn-cluster.

To start a Spark application in yarn-cluster mode:

./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]

Example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10

The above starts a YARN client program which launches the default Application Master, and SparkPi then runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates and displays them in the console. The client exits once your application has finished running.

To start a Spark application in yarn-client mode, for example to run spark-shell:

$ ./bin/spark-shell --master yarn-client
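
The usual resource options of spark-submit can be passed to spark-shell as well; a sketch that requests two executors with 1 GB of memory each (the numbers are illustrative only):

$ ./bin/spark-shell --master yarn-client \
    --num-executors 2 \
    --executor-memory 1g \
    --executor-cores 1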

Add another jar

In yarn-cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar will not work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2

Precautions

  • Prior to Hadoop 2.2, YARN did not support resource requests for container cores. Therefore, when running against an earlier version, the number of cores specified by the command-line arguments cannot be passed to YARN. Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
  • The local directories used by the executors are the local directories configured for YARN (yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it is ignored.
  • The --files and --archives options support specifying file names with #, similar to Hadoop. For example, you can pass --files localtest.txt#appSees.txt: this uploads the local file localtest.txt to HDFS, but it is linked to by the name appSees.txt, and your application should use the name appSees.txt to reference the file when running on YARN (see the sketch after this list).
  • The --jars option allows SparkContext.addJar to work when you use it with local files in yarn-cluster mode. It is not needed if you use SparkContext.addJar with HDFS, HTTP, HTTPS, or FTP files.
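
A sketch of the # renaming mentioned above, reusing the placeholder class and jar names from the earlier example:

$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --files localtest.txt#appSees.txt \
    my-main-jar.jar

Inside the application, the file is then referenced as appSees.txt rather than localtest.txt.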