
Spark RDD operations


May 17, 2021 · Spark Programming Guide



RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
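
To make the distinction concrete, here is a minimal sketch, assuming an existing SparkContext named sc (the sample data is made up for illustration):

val nums = sc.parallelize(Seq(1, 2, 3, 4))
// map is a transformation: it returns a new RDD and runs nothing yet
val squares = nums.map(x => x * x)
// reduce is an action: it runs the computation and returns a single value (30) to the driver
val sum = squares.reduce((a, b) => a + b)
// reduceByKey, by contrast, is a transformation: the result is another distributed RDD
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey((a, b) => a + b)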

In Spark, all transformations are lazy: they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (such as a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design lets Spark run more efficiently. For example, Spark can recognize that a dataset created through map will only be used in a reduce, and return just the result of the reduce to the driver rather than the entire, larger mapped dataset.
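
One way to observe this laziness is that transformations succeed even when the underlying data cannot be read; the failure only surfaces when an action runs. A small sketch, with a deliberately hypothetical file name:

val lines = sc.textFile("no-such-file.txt")   // returns immediately, even if the file does not exist
val upper = lines.map(_.toUpperCase)          // also returns immediately: nothing has run yet
val n = upper.count()                         // only this action reads the file; a missing file fails here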

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the relevant elements around on the cluster for much faster access the next time you query it. Persisting RDDs on disk, or replicating them across multiple nodes, is also supported.
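
The persist method accepts a storage level, so the same API covers in-memory caching, spilling to disk, and replication. A brief sketch (the file name is hypothetical):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("data.txt").map(_.length)
data.persist(StorageLevel.MEMORY_ONLY)        // keep in memory only (equivalent to data.cache())
// data.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions to disk if they do not fit in memory
// data.persist(StorageLevel.MEMORY_ONLY_2)   // replicate each partition on two cluster nodes

data.count()  // the first action computes the RDD and caches it
data.count()  // the second action is served from the cache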

Basics

To illustrate the basics of RDDs, consider the following simple program:

val lines = sc.textFile("data.txt")                   // define an RDD from an external file
val lineLengths = lines.map(s => s.length)            // transformation: not computed yet
val totalLength = lineLengths.reduce((a, b) => a + b) // action: triggers the computation

The first line defines a base RDD from an external file. This dataset is not loaded into memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not computed immediately, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on multiple machines; each machine runs its part of the map and a local reduction, returning only its answer to the driver.

If we also wanted to use lineLengths again later, we could add:

lineLengths.persist()

before the reduce, which would cause lineLengths to be saved in memory after the first time it is computed.
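
Putting it together, here is a sketch of how the cached RDD would be reused across two actions (the second action, maxLength, is added here purely for illustration):

val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
lineLengths.persist()

val totalLength = lineLengths.reduce((a, b) => a + b)        // first action: computes and caches lineLengths
val maxLength = lineLengths.reduce((a, b) => math.max(a, b)) // second action: reuses the cached data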