Spark newbie here. I recently started playing around with Spark on my local machine, running on two cores, using the command:
pyspark --master local[2]
I have a 393 MB text file with almost a million rows, and I want to perform some data manipulation on it. I am using PySpark's built-in DataFrame functions for simple operations like groupBy, sum, max, and stddev.
However, when I run the exact same operations in pandas on the exact same dataset, pandas finishes far faster than PySpark.
I was wondering what could be a possible reason for this.
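For concreteness, here is a minimal sketch of the kind of comparison I mean. The column names `key` and `value`, the tiny in-memory sample, and the helper names are placeholders for illustration, not my real schema or code. The Spark import is kept inside its function so the pandas half runs on its own.

```python
import time

import pandas as pd


def make_sample():
    # Tiny stand-in for the real file; "key" and "value" are hypothetical columns.
    return pd.DataFrame(
        {"key": ["a", "a", "b", "b", "b"], "value": [1.0, 3.0, 2.0, 4.0, 6.0]}
    )


def pandas_agg(pdf):
    # groupby + sum / max / sample standard deviation, evaluated eagerly.
    return pdf.groupby("key")["value"].agg(["sum", "max", "std"]).reset_index()


def spark_agg(sdf):
    # Same aggregation with PySpark's built-in DataFrame functions.
    # Imported lazily so the pandas half works without a Spark install.
    from pyspark.sql import functions as F

    return sdf.groupBy("key").agg(
        F.sum("value").alias("sum"),
        F.max("value").alias("max"),
        F.stddev("value").alias("std"),  # sample stddev, like pandas' default
    )


def timed(fn, *args):
    # Wall-clock time of a single call; returns (result, seconds).
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

On the pandas side I time `timed(pandas_agg, pdf)`. On the Spark side, with `sdf = spark.createDataFrame(pdf)` in the pyspark shell, I time `timed(lambda: spark_agg(sdf).collect())`. Since Spark DataFrames are evaluated lazily, the action (`collect()` here) has to be inside the timed call, otherwise only the query plan construction gets measured.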