
Spark GraphX: Submitting Applications


May 17, 2021 Spark Programming guide



The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface, so you don't have to configure your application separately for each one.

Launching applications with spark-submit

bin/spark-submit takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes that Spark can run under.

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some commonly used options are:

  • --class The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master The master URL of the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client). The default is client.
  • --conf Any Spark configuration property, in key=value format (see the sketch after this list).
  • application-jar Path to a jar containing your application and all of its dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments Arguments passed to the main method of your main class, if any.
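
As a quick illustration of --conf, the sketch below sets two standard Spark properties on the command line. The class name and jar path are placeholders; spark.executor.memory and spark.serializer are real configuration keys.

# Hypothetical app: pass Spark properties with --conf in key=value form.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --conf spark.executor.memory=4g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  /path/to/my-app.jar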

A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines. In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console, so this mode is particularly suitable for applications that involve a REPL.
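
The Spark shell is the typical REPL case; a minimal sketch, reusing the placeholder master address from the examples on this page:

# Start an interactive shell in client mode: the driver runs in this
# process, and REPL input/output stay attached to the console.
./bin/spark-shell --master spark://207.184.161.138:7077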

Alternatively, if your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize network latency between the driver and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or Python applications.

There are also a few options specific to each cluster manager. For example, with a Spark standalone cluster in cluster deploy mode, you can pass --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code. To list all the options available to spark-submit, run it with --help.
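
For instance:

./bin/spark-submit --help

Here are a few examples of common options: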

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark Standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark Standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster (yarn-cluster; use yarn-client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark Standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

Master URLs

The master URL passed to Spark can take one of the following formats:

Master URL           Meaning
local                Run Spark locally with one worker thread (i.e. no parallelism at all)
local[K]             Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine)
local[*]             Run Spark locally with as many worker threads as logical cores on your machine
spark://HOST:PORT    Connect to the given Spark standalone cluster master. The port must be the one your master is configured to use, 7077 by default
mesos://HOST:PORT    Connect to the given Mesos cluster
yarn-client          Connect to a YARN cluster in client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable
yarn-cluster         Connect to a YARN cluster in cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR variable
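
As a closing sketch, the same SparkPi example can be pointed at different cluster managers simply by changing the master URL. The Mesos address below is a placeholder; 5050 is the usual Mesos master port.

# Only the --master flag changes between cluster managers.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master local[4] /path/to/examples.jar 100

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:5050 /path/to/examples.jar 100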