
Spark parallel collection


May 17, 2021 · Spark Programming Guide



Parallelized collections are created by calling SparkContext's parallelize method on an existing collection (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Once created, the distributed dataset distData can be operated on in parallel. For example, we can call distData.reduce((a, b) => a + b) to add up the elements of the array. We will describe operations on distributed datasets later.
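As a minimal sketch, continuing the shell session above (this assumes a live SparkContext named sc, as provided by spark-shell):

// reduce combines elements pairwise across partitions; because addition
// is associative and commutative, the result is the total sum
val sum = distData.reduce((a, b) => a + b)
println(sum)  // prints 15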

An important parameter for parallel collections is the number of slices to cut the dataset into. Spark runs one task on the cluster for each slice. Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize, e.g. sc.parallelize(data, 10).
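For instance, here is a quick sketch of setting the slice count explicitly and verifying it through the RDD's partitions field (distData10 is just an illustrative name):

// ask Spark to split the dataset into 10 slices (partitions)
val distData10 = sc.parallelize(data, 10)
println(distData10.partitions.length)  // prints 10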