May 17, 2021 Spark Programming guide

Spark GraphX: Submitting Applications

The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of the cluster managers that Spark supports through a uniform interface, so you don't have to configure your application separately for each one.

Start the application with spark-submit

bin/spark-submit takes care of setting up the classpath with Spark and its dependencies, and it supports the different cluster managers and deploy modes that Spark supports.

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some commonly used options are:

  • --class The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master The master URL of the cluster (for example, a spark:// URL; see Master URLs below)
  • --deploy-mode Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client). The default is client.
  • --conf Any Spark configuration property in key=value format.
  • application-jar The path to a jar containing your application and the packages it depends on. The URL must be globally visible inside your cluster, for example an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments Arguments passed to the main method of your main class, if any.
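To make the ordering of these options concrete, here is a minimal Python sketch that assembles a spark-submit argument list. The helper name build_spark_submit_command is hypothetical and not part of Spark, and the property spark.executor.memory=2g and the argument 100 are only illustrative values. The sketch builds the command without running it.

```python
import shlex


def build_spark_submit_command(app_jar, main_class=None, master=None,
                               deploy_mode=None, conf=None, app_args=()):
    """Assemble a spark-submit argument list (hypothetical helper, not part of Spark)."""
    cmd = ["./bin/spark-submit"]
    if main_class:
        cmd += ["--class", main_class]
    if master:
        cmd += ["--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    # Each configuration property becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # The application jar comes after all options, followed by its arguments.
    cmd.append(app_jar)
    cmd += [str(arg) for arg in app_args]
    return cmd


if __name__ == "__main__":
    command = build_spark_submit_command(
        "/path/to/examples.jar",
        main_class="org.apache.spark.examples.SparkPi",
        master="local[8]",
        conf={"spark.executor.memory": "2g"},  # illustrative value
        app_args=[100],                         # illustrative argument
    )
    # Print a shell-quoted form for inspection.
    print(" ".join(shlex.quote(part) for part in command))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to something like subprocess.run.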

A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines. In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console. This mode is therefore especially suitable for applications that involve the REPL.

Alternatively, if your application is submitted from a machine that is far from the worker machines, it is common to use cluster mode to minimize network latency between the driver and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or Python applications.

Some of the available options are specific to the cluster manager in use. For example, on a Spark Standalone cluster in cluster deploy mode, you can specify --supervise to make sure that the driver is restarted automatically if it fails with a non-zero exit code. To list all the available options for spark-submit, run it with --help.

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar

# Run on a Spark Standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark:// \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar

# Run on a Spark Standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark:// \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar

# Run on a YARN cluster (use `yarn-client` instead for client mode)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar

# Run a Python application on a Spark Standalone cluster
./bin/spark-submit \
  --master spark:// \
  examples/src/main/python/

Master URLs

The master URL passed to Spark can be in one of the following formats:

Master URL          Meaning
local               Run Spark locally with one worker thread.
local[K]            Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*]            Run Spark locally with as many worker threads as there are logical cores on your machine.
spark://HOST:PORT   Connect to the given Spark standalone cluster master. The port must be the one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT   Connect to the given Mesos cluster.
yarn-client         Connect to a YARN cluster in client mode. The cluster location is found from the HADOOP_CONF_DIR variable.
yarn-cluster        Connect to a YARN cluster in cluster mode. The cluster location is found from the HADOOP_CONF_DIR variable.
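As a rough illustration of how the URL forms in the table above differ, the following Python sketch classifies a master URL string. The function parse_master_url and the dictionary keys it returns are hypothetical; this is not how Spark itself parses the value, only a way to see the formats side by side.

```python
import re


def parse_master_url(url):
    """Classify a master URL string (hypothetical helper, not Spark's own parser)."""
    if url == "local":
        return {"manager": "local", "threads": 1}
    if url == "local[*]":
        # "all" stands in for "one thread per available core".
        return {"manager": "local", "threads": "all"}
    match = re.fullmatch(r"local\[(\d+)\]", url)
    if match:
        return {"manager": "local", "threads": int(match.group(1))}
    if url in ("yarn-client", "yarn-cluster"):
        # The cluster location comes from HADOOP_CONF_DIR, not from the URL.
        return {"manager": "yarn", "deploy_mode": url.split("-")[1]}
    match = re.fullmatch(r"(spark|mesos)://([^:/]+):(\d+)", url)
    if match:
        scheme, host, port = match.groups()
        manager = "standalone" if scheme == "spark" else "mesos"
        return {"manager": manager, "host": host, "port": int(port)}
    raise ValueError(f"unrecognized master URL: {url!r}")


if __name__ == "__main__":
    for example in ("local", "local[8]", "local[*]", "yarn-client"):
        print(example, "->", parse_master_url(example))
```

The sketch requires an explicit port for spark:// and mesos:// URLs, matching the HOST:PORT form shown in the table.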