
Input DStreams


May 17, 2021 Spark Programming guide


Input DStreams and receivers

Input DStreams are the DStreams that represent the stream of input data received from a data source. In the quick example, lines was an input DStream representing the data stream received from the netcat server. Every input DStream is associated with a Receiver object that receives the data from the source and stores it in memory for processing.

Input DStreams represent the raw stream of data received from a data source. Spark Streaming provides two categories of data sources:

  • Basic sources: These sources are directly available in the StreamingContext API. Examples are file systems, socket connections, and Akka actors.
  • Advanced sources: These sources include Kafka, Flume, Kinesis, Twitter, and others. They require linking against additional classes; the required dependencies are discussed in the related section.

Note that if you want to receive multiple streams of data in parallel in a streaming application, you can create multiple input DStreams (this is described in the Performance Tuning section). Doing so creates multiple receivers that receive multiple streams at the same time. However, each receiver runs as a long-running task in a Spark worker or executor, so it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to allocate enough cores (or threads, if running locally) to the Spark Streaming application to both process the received data and run the receivers.

A few points to note:

  • If the number of cores allocated to the application is less than or equal to the number of input DStreams/receivers, the system will only be able to receive data, not process it.
  • When running locally, if the master URL is set to "local", there is only one core available to run tasks. This is not enough, because the receiver will occupy that core, leaving no core free to process the received data (see the sketch below).
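
For example, a minimal sketch of a local setup with enough threads; the application name and batch interval are only illustrative:

    // A minimal sketch: "local[2]" runs with two threads, so one can run the
    // receiver and the other can process the received data. A plain "local"
    // master would leave no thread free for processing.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))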

Basic sources

As we saw in the quick example, the ssc.socketTextStream(...) method is used to create a DStream from text data received over a TCP socket. Besides sockets, the StreamingContext API also supports creating DStreams with files and Akka actors as input sources.
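
For reference, a minimal sketch of the socket case, assuming ssc is the StreamingContext created in the earlier sketch and that a netcat server is running on the given host and port (both placeholders):

    // Sketch only: build a DStream of text lines from a TCP source, e.g. a
    // netcat server started with `nc -lk 9999`. Host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).print()   // print a few elements of each batch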

  • File streams: For reading data from files on any file system compatible with the HDFS API, a DStream can be created as follows:
streamingContext.fileStream[keyClass, valueClass, inputFormatClass](dataDirectory)

Spark Streaming monitors the dataDirectory and processes any files created in that directory (nested directories are not supported). There are three points to note:

1. All files must have the same data format.
2. All files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
3. Once moved, the files must not be changed. So if a file is continuously appended to, the new data will not be read.

For simple text files, there is an easier method: streamingContext.textFileStream(dataDirectory). File streams do not need to run a receiver, so there is no need to allocate cores for them.
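
A hedged sketch of the file-based case, again assuming the ssc from the earlier sketch; the HDFS directory path is only a placeholder:

    // Sketch only: each new text file atomically moved into the watched
    // directory becomes part of one batch of the resulting DStream.
    val fileLines = ssc.textFileStream("hdfs://namenode:8020/user/logs/incoming")
    fileLines.count().print()

    // The typed fileStream variant from above, for other Hadoop input formats:
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    val typedLines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://namenode:8020/user/logs/incoming")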

As of Spark 1.2, fileStream is not available in the Python API; only textFileStream is available.

  • Streams based on custom actors: DStreams can be created from data streams received through Akka actors by calling the streamingContext.actorStream(actorProps, actor-name) method. See the Custom Receiver guide for more information. actorStream is not available in the Python API.
  • Queue of RDDs as a stream: To test a Spark Streaming application with test data, a DStream can be created from a queue of RDDs using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue is treated as a batch of data in the DStream and processed like a stream, as shown below.
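
A small sketch of the queue-based test stream, again assuming the ssc from the earlier sketch:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Sketch only: every RDD pushed onto the queue is consumed as one batch.
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val testStream = ssc.queueStream(rddQueue)
    testStream.count().print()

    ssc.start()
    rddQueue += ssc.sparkContext.makeRDD(1 to 100)   // becomes the next batch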

For more details on streams from sockets, files, and actors, see the API documentation for StreamingContext and JavaStreamingContext.

Advanced sources

These sources require interfacing with external, non-Spark libraries, and some of them have complex dependencies (such as Kafka and Flume). To reduce dependency version conflicts, the ability to create DStreams from these sources has been moved into separate libraries; see the linking section for details. For example, to create a DStream using streaming data from Twitter, the following steps are required:

  • Linking: Add the artifact spark-streaming-twitter_2.10 to the dependencies of the SBT or Maven project (a build.sbt sketch follows this list).
  • Programming: Import the TwitterUtils class and create a DStream with TwitterUtils.createStream, as shown below:

    import org.apache.spark.streaming.twitter._
    TwitterUtils.createStream(ssc)
  • Deploying: Package the application and all its dependencies, including the spark-streaming-twitter_2.10 dependency and its transitive dependencies, into a JAR, and then deploy it. This is explained further in the Deploying section.
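
For the Linking step above, the dependency is typically declared in build.sbt along these lines; the Spark version shown is only an example and should match the version in use:

    // build.sbt (sketch): the artifact named in the Linking step above.
    // "%%" appends the Scala version suffix (e.g. _2.10).
    libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.2.0"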

It is important to note that these advanced sources cannot be used in the spark-shell, so applications based on these sources cannot be tested in the shell.

The advanced sources covered by these separate libraries include Kafka, Flume, Kinesis, and Twitter.

Custom sources

As of Spark 1.2, custom sources are not supported in the Python API. Input DStreams can also be created from custom data sources. All you need to do is implement a user-defined receiver that receives data from the custom source and pushes it into Spark. See the Custom Receiver guide for more details.
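
As a rough illustration of what such a receiver can look like (a simplified socket reader, not the full guide; class and variable names are illustrative):

    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Sketch of a custom receiver: read lines from a socket and push them into Spark.
    class LineReceiver(host: String, port: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Start a thread that receives the data; onStart() must not block.
        new Thread("Line Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {
        // Nothing to do here; the receiving thread exits when receive() returns.
      }

      private def receive(): Unit = {
        try {
          val socket = new Socket(host, port)
          val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
          var line = reader.readLine()
          while (!isStopped() && line != null) {
            store(line)            // hand each received record to Spark
            line = reader.readLine()
          }
          reader.close()
          socket.close()
          restart("Trying to connect again")
        } catch {
          case e: Throwable => restart("Error receiving data", e)
        }
      }
    }

    // Use it like any other input DStream:
    val customLines = ssc.receiverStream(new LineReceiver("localhost", 9999))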

Receiver reliability

There are two types of data sources based on their reliability. Some sources (e.g. Kafka, Flume) allow the transferred data to be acknowledged. If the system receiving data from these reliable sources acknowledges the received data correctly, it can be ensured that the data is not lost under any failure. Accordingly, there are two types of receivers:

  • Reliable Receiver: A reliable receiver correctly sends an acknowledgment to a reliable source once the data has been received and stored in Spark with replication.
  • Unreliable Receiver: An unreliable receiver does not send acknowledgments to the source. Even for a reliable source, a developer may implement an unreliable receiver that does not acknowledge the received data.

Details of how to write a reliable receiver are covered in the Custom Receiver guide.