May 17, 2021
Spark Programming Guide
Now suppose we want to write a self-contained application using the Spark API. We'll walk through a simple application in Scala (with sbt), Java (with Maven), and Python.
We'll create a very simple Spark application in Scala. So simple, in fact, that it is named SimpleApp.scala:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
This program simply counts the number of lines containing 'a' and the number containing 'b' in the Spark README. Note that you'll need to replace YOUR_SPARK_HOME with the path where Spark is installed.
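The filter-then-count pattern above behaves much like the same operations on plain Scala collections, just distributed over an RDD. As a minimal sketch of the logic without Spark (the sample lines below are hypothetical stand-ins for the README's contents):

```scala
object FilterCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical lines standing in for the README file
    val logData = Seq(
      "Apache Spark is a fast engine",
      "built for big data",
      "it runs on clusters"
    )
    // Count lines containing 'a' and lines containing 'b',
    // mirroring logData.filter(...).count() on an RDD
    val numAs = logData.count(line => line.contains("a"))
    val numBs = logData.count(line => line.contains("b"))
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
```

With these three sample lines the program prints `Lines with a: 2, Lines with b: 1`; on the real README the counts will of course differ.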
Unlike the earlier Spark shell example, where a SparkContext was initialized for us, here we initialize the SparkContext as part of the program. We pass the SparkContext constructor a SparkConf object, which contains information about our application.
Our program depends on the Spark API, so we also need an sbt configuration file, simple.sbt, which declares Spark as a dependency:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
For sbt to work correctly, we need to lay out SimpleApp.scala and simple.sbt according to the standard directory structure. Once that is in place, we can package the application's code into a JAR, and then use spark-submit to run our program.
# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23