
Spark external dataset


May 17, 2021 Spark Programming guide



External datasets

Spark can create distributed datasets from any Hadoop-supported storage source, including your local file system, HDFS, Cassandra, HBase, Amazon S3, and more. Spark supports text files, SequenceFiles, and other Hadoop InputFormats.

Text file RDDs can be created using the textFile method. This method takes the URI of the file (a local path, or a URI such as hdfs:// or s3n://) and reads it as a collection of lines. Here's an example of a call:

scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08

Once created, distFile supports dataset operations. For example, we can add up the lengths of all lines using map and reduce operations in the following way: distFile.map(s => s.length).reduce((a, b) => a + b)

Note that when Spark reads a file:

  • If you use a local file system path, the file must be accessible at the same path on the worker nodes. Either copy the file to all workers or share the file system over the network.
  • All of Spark's file-based input methods, including textFile, support directories, compressed files, and wildcards. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz")
  • The textFile method also takes an optional second parameter to control the number of slices. By default, Spark creates one slice for each block of the file (the default block size in HDFS is 64MB), but you can request a higher number of slices by passing a larger value. Note that you cannot set a slice count lower than the number of blocks. (See the sketch after this list.)
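
As a brief illustration of the points above, here is a minimal sketch of these textFile variants, assuming a SparkContext named sc (as provided by spark-shell); the paths and the namenode address are hypothetical placeholders:

// Assumes a SparkContext `sc`; all paths are hypothetical placeholders.
val wholeDir = sc.textFile("/my/directory")        // every file in the directory
val onlyTxt  = sc.textFile("/my/directory/*.txt")  // wildcard match
val gzipped  = sc.textFile("/my/directory/*.gz")   // compressed files are decompressed on read

// Optional second argument: the minimum number of slices (partitions).
// It may be larger than the number of HDFS blocks, but not smaller.
val sliced = sc.textFile("hdfs://namenode:9000/data.txt", 8)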

In addition to text files, Spark's Scala API supports several other data formats:

  • SparkContext.wholeTextFiles lets you read a directory containing multiple small text files and returns each of them as a (filename, content) pair. This contrasts with textFile, which returns one record per line in each file.
  • For SequenceFiles, use SparkContext's sequenceFile[K, V] method, where K and V are the types of the keys and values in the file. These must be subclasses of Hadoop's Writable interface, like IntWritable and Text. In addition, for several common Writables, Spark allows you to specify native types in their place. For example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
  • For other Hadoop InputFormats, you can use SparkContext.hadoopRDD, which takes an arbitrary JobConf along with the InputFormat class, key class, and value class. Set these up the same way you would set up a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the new MapReduce API (org.apache.hadoop.mapreduce).
  • RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD. (A brief sketch follows this list.)
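
The following minimal sketch shows the wholeTextFiles, sequenceFile, and object-file APIs described above, again assuming an existing SparkContext sc; every path is a hypothetical placeholder:

// Assumes a SparkContext `sc`; all paths are hypothetical placeholders.

// wholeTextFiles: one (filename, content) record per file in the directory
val filePairs = sc.wholeTextFiles("/my/small-files-dir")

// sequenceFile: native Int and String stand in for IntWritable and Text
val seqData = sc.sequenceFile[Int, String]("/my/sequence-file")

// Object files: plain Java serialization, easy to use but not compact
seqData.saveAsObjectFile("/my/object-file-output")
val restored = sc.objectFile[(Int, String)]("/my/object-file-output")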