in Big Data Hadoop & Spark by (11.4k points)

A Spark newbie here. I recently started playing around with Spark on my local machine on two cores by using the command:

pyspark --master local[2]


I have a 393 MB text file with almost a million rows, and I want to perform some data manipulation on it. I am using PySpark's built-in DataFrame functions for simple operations such as groupBy, sum, max, and stddev.

However, when I run the exact same operations in pandas on the exact same dataset, pandas beats PySpark by a huge margin in terms of latency.

I was wondering what could be a possible reason for this.

1 Answer

by (32.3k points)

In your case, pandas beats PySpark by a huge margin in terms of latency. The reasons for this observation are as follows:

 

  • Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of these properties has significant cost.

  • pandas does in-memory, in-core processing, which is orders of magnitude faster than the disk and network I/O (even when local) that Spark involves.

  • Parallelism (and distributed processing) adds significant overhead, and even an optimal, embarrassingly parallel workload is not guaranteed to see any performance improvement.

  • Local mode is not designed for performance. It is intended for testing.

  • Two cores working on 393 MB of data are not enough to show any performance improvement, and a single node provides no opportunity for distribution.
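
As a rough illustration of the overhead points above, here is a toy sketch in plain Python. It is not Spark and does not measure Spark. The function names `group_sum` and `group_sum_partitioned` are made up for this example, the data is synthetic, and the `pickle` round trips are only an assumed stand-in for the serialization and data movement a distributed engine has to do even on one machine.

```python
import pickle
import time
from collections import defaultdict


def group_sum(rows):
    """Sum values per key in a single in-memory pass."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def group_sum_partitioned(rows, num_partitions=2):
    """Same aggregation, with the extra steps a partitioned engine adds.

    Each partition is serialized and deserialized (a stand-in for shipping
    it to a worker), aggregated on its own, and the partial result is
    serialized back before being merged.
    """
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    merged = defaultdict(int)
    for part in partitions:
        shipped = pickle.loads(pickle.dumps(part))        # data to "worker"
        partial = group_sum(shipped)                      # per-partition work
        returned = pickle.loads(pickle.dumps(partial))    # result to "driver"
        for key, value in returned.items():
            merged[key] += value
    return dict(merged)


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Synthetic data: 200,000 (key, value) pairs spread over 10 keys.
rows = [(i % 10, i) for i in range(200_000)]

direct, t_direct = timed(group_sum, rows)
split, t_split = timed(group_sum_partitioned, rows)

print(f"in-process:  {t_direct:.4f} s")
print(f"partitioned: {t_split:.4f} s")
print("same result:", direct == split)
```

Both versions compute the same totals. The partitioned one simply does extra work around the aggregation, and on a dataset this small that extra work is not offset by anything, which is the situation described for a 393 MB file on two local cores. Actual timings will vary by machine.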

...