
How does Apache Arrow work in Apache Spark?


Asked by Marley Tran on Nov 29, 2021 (Spark Programming guide)



Adding Arrow support to sparklyr makes Spark perform the row-format-to-column-format conversion in parallel on the Spark side. The data is then transferred through the socket with no custom serialization; all the R process has to do is copy the data from the socket into its heap, transform it, and copy it back to the socket connection.
Additionally, Apache Arrow is an in-memory columnar data format used in Spark to efficiently transfer data between JVM and Python processes. This is currently most beneficial to Python users who work with pandas/NumPy data.
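As a minimal sketch (not taken from the article), the Arrow path in PySpark is switched on with a single configuration flag; the app name and data size below are illustrative only:

from pyspark.sql import SparkSession

# Spark 3.x configuration key shown; Spark 2.x used
# spark.sql.execution.arrow.enabled instead.
spark = (
    SparkSession.builder
    .appName("arrow-demo")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

df = spark.range(1_000_000)

# With Arrow enabled, toPandas() ships the columns to the Python process in
# Arrow's columnar format instead of pickling rows one by one.
pdf = df.toPandas()
print(pdf.head())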
In fact, one such library in the data processing and data science space is Apache Arrow. Arrow is used by open-source projects like Apache Parquet, Apache Spark, and pandas, as well as many commercial or closed-source services. Among other things, it provides an IPC and RPC framework for data exchange between processes and nodes, respectively.
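The IPC piece can be sketched with pyarrow alone: one side streams a record batch into a buffer, as a process would before sending it over a socket, and the other side reads the same bytes back. The column names here are made up for illustration:

import pyarrow as pa

# Writer side: serialize a record batch into a buffer.
batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array([0.1, 0.5, 0.9])],
    names=["id", "score"],
)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Reader side: the receiving process parses the same bytes back into the
# identical columnar structure, with no per-row deserialization.
reader = pa.ipc.open_stream(buf)
table = reader.read_all()
print(table)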
And, in simple terms, Arrow facilitates communication between many components: for example, reading a Parquet file with Python (pandas) and converting it to a Spark DataFrame, Falcon Data Visualization, or Cassandra, without worrying about conversion. A good follow-up question is: what does the data look like in memory?
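A hedged sketch of that pandas-to-Spark interchange, assuming a local Parquet file named events.parquet (a placeholder path) and Arrow enabled in the Spark session:

import pandas as pd
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# "events.parquet" is a placeholder input file for this sketch.
pdf = pd.read_parquet("events.parquet")

# createDataFrame() can use Arrow to hand the pandas columns to the JVM
# without a slow row-by-row conversion.
sdf = spark.createDataFrame(pdf)
sdf.printSchema()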
In this manner, parts of the data are of course still loaded into RAM, but your data does not have to fit into memory. Arrow memory-maps its files to load only as much data into memory as is necessary and possible. The heart of Apache Arrow is its columnar data format. What does columnar data mean?
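One way to see both ideas, memory mapping and the columnar layout, is a small pyarrow sketch; the file name and column names are illustrative, not from the article:

import pyarrow as pa
import pyarrow.ipc as ipc

# Write a tiny Arrow IPC file so the sketch is self-contained.
table = pa.table({"user_id": [1, 2, 3], "amount": [9.99, 4.50, 12.00]})
with pa.OSFile("data.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file: only the pages that are actually touched get loaded
# into RAM, so the file does not have to fit into memory.
with pa.memory_map("data.arrow", "r") as source:
    loaded = ipc.open_file(source).read_all()

# Columnar layout: each column is a contiguous buffer that can be read on
# its own.
print(loaded.column("amount"))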