May 17, 2021 Spark Programming guide
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection (any Scala Seq) in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection from an array holding the numbers 1 to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we can call distData.reduce((a, b) => a + b) to add up the elements of the array. We'll describe operations on distributed datasets later.
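Putting these steps together, a minimal end-to-end sketch might look like the following. It assumes a SparkContext named sc is already available, as it is in the Spark shell:

```scala
// Assumes an existing SparkContext `sc`, as provided by the Spark shell.
val data = Array(1, 2, 3, 4, 5)

// Copy the local array into a distributed dataset (an RDD).
val distData = sc.parallelize(data)

// Sum the elements in parallel across the cluster.
val sum = distData.reduce((a, b) => a + b)
println(sum) // 15
```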
An important parameter for parallelized collections is the number of slices to cut the dataset into. Spark runs one task on the cluster for each slice. Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
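As a sketch of how the slice count takes effect (again assuming an existing SparkContext sc): the number passed as the second argument becomes the RDD's number of partitions, and Spark runs one task per partition.

```scala
// Assumes an existing SparkContext `sc`.
val data = Array(1, 2, 3, 4, 5)

// Ask Spark to split the dataset into 10 slices (partitions).
val distData = sc.parallelize(data, 10)

// Each slice becomes one partition; Spark schedules one task per partition.
println(distData.partitions.length) // 10
```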