Coding With Fun

Spark Streaming example


May 17, 2021 Spark Programming guide


A quick Spark Streaming example

Before we get into the details of how to write a Spark Streaming program, let's take a quick look at a simple example. In this example, the program receives text data from a data server listening on a TCP socket and counts the number of words in the text. It works as follows:

First, we import the Spark Streaming classes and some implicit conversions from StreamingContext into our environment; these add useful methods to other classes we need, such as DStream. StreamingContext is the main entry point for all Spark Streaming functionality. We then create a local StreamingContext with two execution threads and a batch interval of 1 second, which divides the input stream into one-second batches.

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
// Create a local StreamingContext with two working thread and batch interval of 1 second
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

Using this context, we can create a DStream that represents streaming data received from a TCP source (hostname localhost, port 9999).

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

This lines is a DStream representing the stream of data that will be received from the data server. Each record in this DStream is a line of text. Next, we need to split each line of text in the DStream into words.

// Split each line into words
val words = lines.flatMap(_.split(" "))

flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record of the source DStream. In this example, each line of text is split into multiple words, and the resulting stream of words is represented by the words DStream. Next, we want to count these words.

import org.apache.spark.streaming.StreamingContext._
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

The words DStream is mapped (a one-to-one transformation) to a new DStream of (word, 1) pairs, which is then reduced by key to obtain the frequency of each word in every batch of data. Finally, wordCounts.print() prints the word counts computed every second to the console.
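The per-batch logic can be sketched on plain Scala collections, which is a convenient way to check it without a running Spark cluster. The batch contents below are hypothetical, and groupBy with a sum stands in for reduceByKey; flatMap and map behave the same way on each batch of a DStream:

```scala
// One hypothetical 1-second batch of input lines
val batch = Seq("to be or not to be", "to be")

// Same steps as the DStream pipeline, applied to a local collection
val words = batch.flatMap(_.split(" "))   // one-to-many: lines -> words
val pairs = words.map(word => (word, 1))  // one-to-one: word -> (word, 1)
val wordCounts = pairs
  .groupBy(_._1)                          // local stand-in for reduceByKey
  .map { case (word, ps) => (word, ps.map(_._2).sum) }

println(wordCounts)  // e.g. Map(to -> 3, be -> 3, or -> 1, not -> 1)
```

In the real program the same transformations run on every batch as it arrives, rather than once on a fixed collection.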

It's important to note that when the above code is executed, Spark Streaming has only set up the computation it will perform; no real processing has started yet. To start processing after all the transformations have been set up, we call:

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

The complete code can be found in the NetworkWordCount example.

If you've downloaded and built Spark, you can run this example as follows. First, run Netcat (a small utility found on most Unix-like systems) as a data server:

$ nc -lk 9999

Then, in a different terminal, you can start the example with:

$ ./bin/run-example streaming.NetworkWordCount localhost 9999
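For example, any lines typed into the Netcat terminal are counted and printed once per second by the example. Typing a line such as `hello world` produces output in the second terminal roughly of the following form (the timestamp shown is illustrative):

```
-------------------------------------------
Time: 1357008430000 ms
-------------------------------------------
(hello,1)
(world,1)
```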