May 17, 2021 Spark Programming Guide
Spark can create distributed datasets from any Hadoop-supported storage source, including your local file system, HDFS, Cassandra, HBase, Amazon S3, and more. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example call:
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
Once created, distFile can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the map and reduce operations as follows:
distFile.map(s => s.length).reduce((a, b) => a + b)
Some notes on reading files with Spark: textFile supports file directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
The textFile method also takes an optional second argument for controlling the number of slices of the file. By default, Spark creates one slice for each block of the file (blocks are 64MB by default in HDFS), but you can also request a higher number of slices by passing a larger value. Note that you cannot have fewer slices than blocks.
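For instance, a call like the following would ask Spark to split the file into at least eight slices (the file name and slice count here are purely illustrative):
scala> val distFile = sc.textFile("data.txt", 8)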
In addition to text files, Spark's Scala API supports several other data formats:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This contrasts with textFile, which returns one record per line in each file.
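As a rough sketch (the directory path is illustrative), reading a directory of small files and computing each file's size could look like:
val files = sc.wholeTextFiles("/my/directory")  // RDD of (filename, content) pairs
val sizes = files.map { case (name, content) => (name, content.length) }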
For SequenceFiles, use SparkContext's sequenceFile[K, V] method, where K and V correspond to the key and value types in the file. Like IntWritable and Text, they must be subclasses of Hadoop's Writable interface. In addition, for several common Writables, Spark allows you to specify native types instead. For example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
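A minimal sketch of such a call (the file path is hypothetical):
val seqData = sc.sequenceFile[Int, String]("/my/data.seq")  // hypothetical SequenceFile path
seqData.take(5).foreach(println)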
For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf along with the InputFormat class, key type, and value type. Set these up the same way you would for a Hadoop job with your input source.
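As an illustrative sketch using the old MapReduce API (the input path is hypothetical), reading plain text through hadoopRDD might look like this:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

val jobConf = new JobConf()
FileInputFormat.setInputPaths(jobConf, "/my/directory")  // illustrative input path
val records = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
                           classOf[LongWritable], classOf[Text])
val lines = records.map { case (_, text) => text.toString }  // keep only the line text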
You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the new MapReduce API (org.apache.hadoop.mapreduce).
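A corresponding sketch with the new API; note that the input-directory property name and the path are assumptions for illustration, not values taken from this guide:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.inputdir", "/my/directory")  // assumed property name
val newRecords = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat],
                                    classOf[LongWritable], classOf[Text])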
Finally, RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
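For example, a round trip through this format could look like the following sketch (the output path is illustrative):
val nums = sc.parallelize(1 to 100)
nums.saveAsObjectFile("/tmp/nums")              // writes serialized Java objects
val restored = sc.objectFile[Int]("/tmp/nums")  // reads them back as an RDD[Int]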