May 17, 2021 Spark Streaming Programming Guide
Input DStreams are DStreams that represent the stream of input data received from a streaming source. In the quick example, `lines` was an input DStream, representing the stream of data received from the netcat server. Every input DStream is associated with a `Receiver` object, which receives the data from the source and stores it in Spark's memory for processing.
Spark Streaming provides two categories of data sources: basic sources, which are directly available through the StreamingContext API (for example files, socket connections, and Akka actors), and advanced sources (for example Kafka, Flume, and Twitter), which are available through extra utility classes.
Note that if you want to receive multiple data streams in parallel in a streaming application, you can create multiple input DStreams (this is described in the Performance Tuning section). This creates multiple receivers that receive multiple streams at the same time. However, a receiver runs as a long-running task in a Spark worker or executor, and therefore occupies one of the cores allocated to the Spark Streaming application. It is therefore important to allocate enough cores (or threads, if running locally) to the Spark Streaming application both to process the received data and to run the receiver(s).
A few points to note: if an input DStream is based on a receiver, the receiver will take up a core, so enough cores must remain to process the data. In particular, when running locally with a single thread, that one thread runs the receiver, leaving no remaining core to process the data, as shown in the sketch below.
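For example, a minimal local setup that leaves a thread free for processing might look like the following sketch (the app name, host, and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]" provides one thread for the receiver and one for processing;
// "local" or "local[1]" would leave no thread to process the received data.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// A receiver-based input DStream, as in the quick example.
val lines = ssc.socketTextStream("localhost", 9999)
```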
As we have seen in the quick example, the `ssc.socketTextStream(...)` method creates a DStream from text data received over a TCP socket. In addition to sockets, the StreamingContext API also supports creating DStreams with files and Akka actors as input sources.
A DStream that reads files can be created as follows:

streamingContext.fileStream[keyClass, valueClass, inputFormatClass](dataDirectory)

Spark Streaming monitors `dataDirectory` and processes any files created under that directory (nested directories are not supported).
There are three points to note:

1. All files must have the same data format.
2. All files must be created in `dataDirectory` by atomically moving or renaming them into the data directory (see the sketch after this list).
3. Once moved, the files must not be modified. So if data is continuously appended to a file, the new data will not be read.
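As an illustration of point 2, a file can be written to a staging location first and then moved atomically into the monitored directory; a minimal sketch with hypothetical local paths:

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

// Write the file outside dataDirectory first, then rename it in atomically,
// so Spark Streaming never picks up a half-written file. An atomic move
// requires both paths to be on the same file system.
Files.move(
  Paths.get("/tmp/staging/part-0001.txt"),    // hypothetical staging path
  Paths.get("/data/streaming/part-0001.txt"), // hypothetical dataDirectory
  StandardCopyOption.ATOMIC_MOVE)
```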
For simple text files, there is an easier method that can be called:

streamingContext.textFileStream(dataDirectory)

File streams do not need to run a receiver, so there is no need to allocate cores for them.
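For instance, a minimal word-count sketch over a monitored directory (reusing the `ssc` from above; the directory path is a placeholder):

```scala
// Count words in every new text file that appears in the directory.
val fileLines = ssc.textFileStream("/data/streaming") // hypothetical directory
val counts = fileLines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()
```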
In Spark 1.2, `fileStream` is not available in the Python API; only `textFileStream` is.
A DStream can be created from data streams received through Akka actors by calling:

streamingContext.actorStream(actorProps, actor-name)

For more information, see the Custom Receiver Guide. `actorStream` is not available in the Python API.
You can also create DStreams based on a queue of RDDs by calling:

streamingContext.queueStream(queueOfRDDs)

Each RDD pushed into the queue is treated as a batch of data in the DStream and processed like a stream.
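This is handy for feeding a streaming application with test data; a minimal sketch (the RDD contents are arbitrary):

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Each RDD pushed into the queue becomes one batch of the DStream.
val rddQueue = new mutable.Queue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

ssc.start()
for (_ <- 1 to 3) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  }
  Thread.sleep(1000)
}
ssc.stop()
```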
For more details on getting streams from sockets, files, and actors, see StreamingContext and JavaStreamingContext.
Such sources require interfacing with non-Spark libraries, and some of them have complex dependencies (such as Kafka and Flume). To reduce dependency version conflicts, the ability to create DStreams from these sources has been moved to separate libraries; you can view the details in the Linking section. For example, if you want to create a DStream with streaming data from Twitter, you need to follow these steps:
1. Linking: add the artifact spark-streaming-twitter_2.10 to the dependencies of your SBT or Maven project.
2. Programming: import the `TwitterUtils` class and create a DStream with the `TwitterUtils.createStream` method, as shown below.
import org.apache.spark.streaming.twitter._
TwitterUtils.createStream(ssc)
It is important to note that these advanced sources cannot be used in the spark-shell, so applications based on these sources cannot be tested in the shell.
Here is a look at some of the advanced sources:

Twitter: Spark Streaming uses Twitter4j 3.0.3 to get the public stream of tweets, which is available through the Twitter Streaming API. Authentication information can be provided by any method supported by the Twitter4J library. You can get both public and keyword-filtered streams. You can view the API documentation (Scala and Java) and the examples (TwitterPopularTags and TwitterAlgebirdCMS).
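For instance, a keyword-filtered stream can be created by passing filter terms; a minimal sketch, assuming Twitter4J OAuth credentials are configured as `twitter4j.oauth.*` system properties and with placeholder keywords:

```scala
import org.apache.spark.streaming.twitter._

// Passing None makes Twitter4J build its OAuth credentials from
// the twitter4j.oauth.* system properties.
val filters = Seq("spark", "streaming") // placeholder keywords
val tweets = TwitterUtils.createStream(ssc, None, filters)

// Pull out hashtags from the text of each tweet.
val hashtags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
hashtags.print()
```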
In Spark 1.2, these sources are not supported by the Python API.

Input DStreams can also be created from custom sources. All you need to do is implement a user-defined receiver that receives data from the custom source and pushes it into Spark. Learn more in the Custom Receiver Guide; a sketch of such a receiver follows.
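A minimal sketch of a user-defined receiver, assuming a plain TCP text source with a hypothetical host and port (see the Custom Receiver Guide for the full contract):

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive data on a separate thread so onStart() returns immediately.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: the receiving thread stops itself once isStopped()
    // returns true or the connection is closed.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line) // push each received line into Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}
```

The stream is then created with `ssc.receiverStream(new SocketLineReceiver("localhost", 9999))`.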
There are two types of data sources, classified by reliability. Reliable sources (e.g. Kafka, Flume) allow the transferred data to be acknowledged. If the system that gets the data from these reliable sources correctly acknowledges the received data, it is ensured that the data is not lost under any circumstances. Thus, there are two types of receiver: a reliable receiver correctly acknowledges to a reliable source that the data has been received and stored in Spark, while an unreliable receiver sends no acknowledgement to the source.
The details of how to write a reliable receiver are discussed in the Custom Receiver Guide.