May 17, 2021 Spark Programming Guide
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver (although there is also a parallel reduceByKey that returns a distributed dataset).
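As a rough illustration of these semantics using plain Scala collections (not Spark itself; the data here is made up), map yields a new dataset, reduce collapses everything into one value returned to the caller, and a groupBy-and-sum stands in for reduceByKey:

```scala
// Plain Scala collections used as a stand-in for RDDs (illustrative only).
object OpKinds {
  val nums = List(1, 2, 3, 4)

  // Transformation-like: produces a new dataset from the existing one.
  val doubled: List[Int] = nums.map(_ * 2)

  // Action-like: aggregates all elements and returns one value.
  val sum: Int = nums.reduce(_ + _)

  // reduceByKey analogue: aggregate the values belonging to each key.
  val pairs = List(("a", 1), ("b", 2), ("a", 3))
  val byKey: Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
}
```

The real Spark operations have the same shape, but run distributed across a cluster rather than on a local collection.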
In Spark, all transformations are lazy: they do not compute their results right away. Instead, they simply remember which transformations were applied to some base dataset (such as a file). The transformations are computed only when an action requires a result to be returned to the driver. This design lets Spark run more efficiently. For example, Spark can realize that a dataset created through map will only be used in a reduce, and return just the result of the reduce to the driver, rather than the entire, larger mapped dataset.
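The laziness itself can be sketched with an ordinary Scala lazy view (a loose local analogy, not Spark): the mapping function is recorded but not executed until a terminal operation forces it, just as a transformation is not computed until an action runs.

```scala
// Lazy-evaluation analogy using a Scala collection view (not an RDD).
object Laziness {
  var mapCalls = 0

  // Like a transformation: recorded, but nothing is computed yet.
  val lengths = List("spark", "is", "lazy").view.map { s =>
    mapCalls += 1
    s.length
  }

  val callsBeforeAction = mapCalls   // still 0: nothing has been forced

  // Like an action: forces the computation and returns a result.
  val total = lengths.sum

  val callsAfterAction = mapCalls    // now 3: one call per element
}
```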
By default, each transformed RDD may be recomputed each time you run an action on it. However, you can also persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the relevant elements on the cluster for much faster access the next time you query it. Persisting RDDs on disk, or replicating them across multiple nodes, is also supported.
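The effect of persisting can be mimicked with a Scala lazy val (again only an analogy, cached in local memory rather than across a cluster): the value is computed once on first use and reused afterwards instead of being recomputed.

```scala
// Caching analogy: a lazy val is computed once and then reused, roughly
// how a persisted RDD is materialized once and then served from memory.
object Caching {
  var computations = 0

  lazy val lineLengths: List[Int] = {
    computations += 1                        // count how often we recompute
    List("hello spark", "rdd basics").map(_.length)
  }

  val first  = lineLengths.sum               // triggers the computation
  val second = lineLengths.sum               // served from the cached value
}
```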
To illustrate the basics of RDD, consider the following simple procedure:
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines an RDD from an external file. This dataset is not loaded into memory or otherwise acted upon: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not computed immediately, due to laziness. Finally, we call reduce, which is an action. At this point, Spark breaks the computation into tasks and runs them on multiple machines; each machine runs its own part of the map and a local reduce, returning only its answer to the driver.
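For readers without a Spark cluster at hand, the same three steps can be mimicked with plain Scala collections (replacing data.txt with a hypothetical in-memory list of lines; note this version is eager, unlike the real RDD pipeline):

```scala
object PipelineMock {
  // Stand-in for sc.textFile("data.txt"): hypothetical file contents.
  val lines = List("hello spark", "rdd basics")

  // map step: one length per line.
  val lineLengths = lines.map(_.length)

  // reduce step: the total length, returned to the caller.
  val totalLength = lineLengths.reduce(_ + _)
}
```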
If we wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the first time it is computed.