Why is Apache-Spark - Python so slow locally as compared to pandas?

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-17T14:35:52+0000

In your case, pandas beat PySpark by a wide margin in terms of latency. The reasons for this observation are as follows:

Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of these properties has a significant cost.
pandas does purely in-memory, in-core processing, which is orders of magnitude faster than the disk and network I/O (even local) that Spark involves.
Parallelism (and distributed processing) adds significant overhead, and even an optimal, embarrassingly parallel workload does not guarantee any performance improvement.
Local mode is not designed for performance. It is intended for testing.
Finally, 2 cores running on 393 MB are not enough to see any performance improvement, and a single node provides no opportunity for distribution.
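
To get a feel for the overhead point, here is a rough, standard-library-only sketch. It is not Spark and it does not run anything in parallel; it only imitates one cost a distributed engine pays, namely serializing each partition on its way to a worker and serializing the result on the way back. The function names, the data size, and the partition count are all made up for illustration.

```python
import pickle
import time


def in_process_sum(values):
    """Sum the values directly in memory, as a single-process library would."""
    return sum(values)


def partitioned_sum(values, num_partitions=4):
    """Sum the same values while paying a serialization round trip per partition.

    This only imitates the cost of shipping data to workers; nothing here
    actually runs in parallel or uses Spark.
    """
    size = max(1, -(-len(values) // num_partitions))  # ceiling division
    partials = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        shipped = pickle.loads(pickle.dumps(chunk))  # "send" the partition to a worker
        partial = sum(shipped)
        partials.append(pickle.loads(pickle.dumps(partial)))  # "send" the result back
    return sum(partials)


def timed(func, *args):
    """Return the function result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


data = list(range(200_000))  # hypothetical small dataset
direct_result, direct_seconds = timed(in_process_sum, data)
shipped_result, shipped_seconds = timed(partitioned_sum, data)
print(f"in-process:  {direct_seconds:.4f}s")
print(f"partitioned: {shipped_seconds:.4f}s")
```

Both paths return the same total. The partitioned path does the same arithmetic plus the extra copying, so on a small dataset it has no way to win that time back. A real cluster adds scheduling, JVM-to-Python transfer, and fault-tolerance bookkeeping on top of this.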
