
Why is PySpark running on top of Spark?


Asked by Tripp Henson on Dec 10, 2021 · Spark Programming guide



Spark SQL provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine. Running on top of Spark, the streaming feature in Apache Spark enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault-tolerance characteristics.
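To make the DataFrame abstraction and the SQL engine concrete, here is a minimal sketch; the application name, sample rows and column names are made up for illustration.

from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession, the entry point to the DataFrame API.
spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Build a small DataFrame from local rows (hypothetical sample data).
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

# The same data can also be queried through the distributed SQL engine.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()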
One may also ask,
What this means is that when you use the PySpark API, even though the actual data processing is done by Python processes, data persistence and transfer are still handled by the Spark JVM. Things like scheduling (both DAG and task), broadcast, networking, fault recovery, etc. are all reused from the core Scala Spark package.
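One way to see this split is to contrast a Python UDF, whose values are shipped out to Python worker processes, with a built-in Column expression that is evaluated entirely inside the JVM. This is only an illustrative sketch; the column name and sample data are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("python-vs-jvm").getOrCreate()
df = spark.range(5).toDF("n")

# A Python UDF: each value is serialized out to a Python worker for evaluation.
@F.udf(returnType=LongType())
def plus_one_py(n):
    return n + 1

# A built-in expression: planned and executed entirely inside the Spark JVM.
df.select(plus_one_py("n").alias("from_python"), (F.col("n") + 1).alias("from_jvm")).show()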
Next, PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.
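The lines below sketch the kind of session you might type into the PySpark shell (started with the pyspark command), where a SparkSession named spark is already created for you; the input path and the status column are placeholders, not taken from the original answer.

# Inside the pyspark shell, `spark` already exists; no builder call is needed.
df = spark.read.json("events.json")        # hypothetical input file
df.printSchema()
df.groupBy("status").count().show()        # assumes a `status` column exists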
Similarly,
Starting PySpark also launches… a JVM programme! You can see the running Python and JVM processes by using ps aux. So what PySpark does is allow you to use a Python programme to send commands to a JVM programme named Spark. Confused? Well, the key point here is that Spark is written in Java and Scala, not in Python.
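As a rough illustration of that Python-to-JVM bridge, PySpark exposes the py4j gateway through the underscore-prefixed _jvm attribute of the SparkContext. This is an internal attribute, not a public API, and is shown here only to make the two processes visible.

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jvm-peek").getOrCreate()
sc = spark.sparkContext

# This print runs in the Python driver process...
print("Python driver pid:", os.getpid())

# ...while this call is answered by the JVM process, via the py4j gateway.
# `_jvm` is internal to PySpark; used here purely to illustrate the bridge.
print("JVM clock (ms):", sc._jvm.java.lang.System.currentTimeMillis())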
Also,
Plain Python will definitely perform better than PySpark on smaller data sets; you will see the difference in PySpark's favour when you are dealing with larger data sets. By default, when you run Spark through a SQLContext or HiveContext, shuffles use 200 partitions.
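That 200 figure is the default value of the spark.sql.shuffle.partitions setting, which you can inspect and lower for small data sets, roughly as sketched below (the value 8 is just an example).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-sketch").getOrCreate()

# Check the partition count used for shuffles in DataFrame/SQL operations.
print(spark.conf.get("spark.sql.shuffle.partitions"))   # "200" unless overridden

# On a small data set, fewer shuffle partitions avoid many tiny tasks.
spark.conf.set("spark.sql.shuffle.partitions", "8")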