May 17, 2021 Spark Programming guide
The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a single, uniform interface, so you don't have to configure your application specially for each one.
bin/spark-submit takes care of setting up the classpath with Spark and its dependencies, and it supports the different cluster managers and deploy modes that Spark offers.
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Some commonly used options are:
- --class: the entry point for your application (e.g. org.apache.spark.examples.SparkPi)
- --master: the master URL of the cluster (e.g. spark://23.195.26.187:7077)
- --deploy-mode: whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client). The default is client.
- --conf: an arbitrary Spark configuration property in key=value format (see the combined example after this list).
- application-jar: the path to a bundled jar containing your application and all of its dependencies. This URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
- application-arguments: arguments passed to the main method of your main class, if any.
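As a rough sketch of how these options combine (the master address and jar path below are illustrative placeholders, not values prescribed by this guide), a submission might look like:

./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://23.195.26.187:7077 \
--deploy-mode client \
--conf spark.eventLog.enabled=false \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails" \
/path/to/examples.jar \
100

Configuration values that contain spaces, such as the extraJavaOptions setting above, should be wrapped in quotes.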
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines. In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console, so this mode is especially suitable for applications that involve a REPL, such as the Spark shell.
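For instance, Spark's interactive shell runs its driver in client mode and can be attached to a standalone cluster simply by pointing it at the master URL (the address below is the same placeholder used in the examples further down):

./bin/spark-shell --master spark://207.184.161.138:7077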
Alternatively, if your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize network latency between the driver and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or Python applications.
There are also a few options specific to the cluster manager being used. For example, with a Spark Standalone cluster in cluster deploy mode, you can pass --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code.
To list all the available options for spark-submit, run it with --help.
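For instance:

./bin/spark-submit --help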
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
# Run on a Spark Standalone cluster in client deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
# Run on a Spark Standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
# Run a Python application on a Spark Standalone cluster
./bin/spark-submit \
--master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py \
1000
The master URL passed to Spark can take one of the following forms:
Master URL | Meaning |
---|---|
local | Run Spark locally with one worker thread (i.e. no parallelism at all) |
local[K] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine) |
local[*] | Run Spark locally with as many worker threads as there are logical cores on your machine |
spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be the one the master is configured to use, which is 7077 by default |
mesos://HOST:PORT | Connect to the given Mesos cluster |
yarn-client | Connect to a YARN cluster in client mode. The cluster location is found from the HADOOP_CONF_DIR variable |
yarn-cluster | Connect to a YARN cluster in cluster mode. The cluster location is found from the HADOOP_CONF_DIR variable |
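As a small sketch of the local[*] form, the Python Pi example from above can also be run entirely on the local machine (reusing the same example path):

# Run the Pi example locally with as many worker threads as logical cores
./bin/spark-submit \
--master local[*] \
examples/src/main/python/pi.py \
100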