

A Spark newbie here. I recently started playing around with Spark on my local machine, on two cores, using the command:

pyspark --master local[2]


I have a 393 MB text file with almost a million rows, and I wanted to perform some data manipulation on it. I am using PySpark's built-in DataFrame functions to perform simple operations like groupBy, sum, max, and stddev.
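
For reference, here is a minimal sketch of the kind of code in question (the file path and the column names "category" and "value" are placeholders for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Read the text file as a CSV with a header row (assumed format)
df = spark.read.csv("data.txt", header=True, inferSchema=True)

# Simple aggregations: sum, max, and standard deviation per group
(df.groupBy("category")
   .agg(F.sum("value").alias("total"),
        F.max("value").alias("max_value"),
        F.stddev("value").alias("std_value"))
   .show())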

However, when I perform the exact same operations in pandas on the exact same dataset, pandas seems to beat PySpark by a huge margin in terms of latency.
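
The pandas equivalent (same placeholder names) looks like:

import pandas as pd

df = pd.read_csv("data.txt")

# The same aggregations, computed entirely in memory in one process
result = df.groupby("category")["value"].agg(["sum", "max", "std"])
print(result)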

I was wondering what could be a possible reason for this.

1 Answer


When you perform the same operations via pandas, pandas beats PySpark by a huge margin in terms of latency. The reasons for this observation are as follows:

 

  • Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of these properties comes at a significant cost.

  • pandas does all of its processing in memory, in-core, which is orders of magnitude faster than the disk and network I/O (even local) that Spark performs.

  • Parallelism (and distributed processing) adds significant overhead, and even an optimal, embarrassingly parallel workload does not guarantee any performance improvement.

  • Local mode is not designed for performance; it is meant for testing.

  • One more reason is that 2 cores working on 393 MB is not enough to see any performance improvement, and a single node provides no opportunity for distribution. A rough way to verify this is to time both versions, as in the sketch below.
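
A sketch of such a timing comparison (the file path and column names are placeholders, and actual timings will vary by machine). Note that Spark evaluates lazily, so an action such as collect() must run before you stop the clock:

import time
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# --- PySpark in local mode with 2 cores ---
spark = SparkSession.builder.master("local[2]").getOrCreate()
start = time.time()
sdf = spark.read.csv("data.txt", header=True, inferSchema=True)
# collect() forces execution; Spark does no work until an action is called
sdf.groupBy("category").agg(F.sum("value"), F.max("value"), F.stddev("value")).collect()
print("PySpark: %.2f s" % (time.time() - start))

# --- pandas in a single process, entirely in memory ---
start = time.time()
pdf = pd.read_csv("data.txt")
pdf.groupby("category")["value"].agg(["sum", "max", "std"])
print("pandas: %.2f s" % (time.time() - start))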
