
Spark stand-alone applications


May 17, 2021 Spark Programming guide


Stand-alone applications

Now suppose we want to write a self-contained application using the Spark API. We'll learn by writing a simple application in Scala (with sbt); the same can be done in Java (with Maven) and Python.

We'll start by creating a very simple Spark application in Scala. So simple, in fact, that it's named SimpleApp.scala:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

This program simply counts the number of lines containing 'a' and the number of lines containing 'b' in the Spark README. Note that you will need to replace YOUR_SPARK_HOME with the path where Spark is installed on your system. Unlike the earlier Spark shell example, where a SparkContext is initialized for us, here we make initializing the SparkContext part of the program.
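For comparison, the same counts in the interactive shell would look roughly like this (a sketch only; in spark-shell the SparkContext is pre-created and bound to the variable sc, so no SparkConf setup is needed):

// In spark-shell, sc already exists
val logData = sc.textFile("YOUR_SPARK_HOME/README.md").cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()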

We pass a SparkConf object, which contains information about our application, to the SparkContext constructor.
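SparkConf can also set the master URL and other Spark properties directly in code. The sketch below is for illustration and is not part of the original example; when launching with spark-submit it is usually better to leave the master unset in code and pass --master on the command line instead:

import org.apache.spark.SparkConf

// Illustrative configuration; the values here are only examples
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[4]")                // run locally with 4 worker threads
  .set("spark.executor.memory", "512m") // a standard Spark property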

Our application depends on the Spark API, so we also need an sbt configuration file, simple.sbt, which declares Spark as a dependency:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
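
The %% operator tells sbt to append the project's Scala binary version to the artifact name, so this dependency resolves to spark-core_2.10.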

For sbt to work correctly, we need to lay out SimpleApp.scala and simple.sbt according to the standard directory structure. Once that is done, we can package the application's code into a JAR, and then use spark-submit to run the program.

# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23
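
Here --master local[4] runs the application locally with four worker threads. Note that sbt package bundles only your own classes into the JAR; Spark itself is provided by spark-submit at runtime. The exact counts will vary with the version of the README the program reads.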