May 17, 2021 Spark Programming guide
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection (any Scala Seq) in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection from an array holding the numbers 1 to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we can call distData.reduce((a, b) => a + b) to add up the elements of the array. We'll describe operations on distributed datasets later.
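Putting these steps together, a minimal end-to-end sketch might look like the following. It assumes a SparkContext named sc is already available, as it is in the Spark shell:

```scala
// Assumes an existing SparkContext `sc`, as provided by the Spark shell.
val data = Array(1, 2, 3, 4, 5)

// Copy the local array into a distributed dataset (an RDD).
val distData = sc.parallelize(data)

// Sum the elements in parallel across the cluster.
val sum = distData.reduce((a, b) => a + b)
println(sum) // 15
```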
An important parameter for parallelized collections is the number of slices to cut the dataset into. Spark runs one task on the cluster for each slice. Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
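As a sketch of how the slice count takes effect (again assuming an existing SparkContext sc): the number passed as the second argument becomes the RDD's number of partitions, and Spark runs one task per partition.

```scala
// Assumes an existing SparkContext `sc`.
val data = Array(1, 2, 3, 4, 5)

// Ask Spark to split the dataset into 10 slices (partitions).
val distData = sc.parallelize(data, 10)

// Each slice becomes one partition; Spark schedules one task per partition.
println(distData.partitions.length) // 10
```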