Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I'm a beginner of Spark-DataFrame API.

I use this code to load csv tab-separated into Spark Dataframe

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l : l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts,schemaData)

 

Suppose I create DataFrame with Spark from new files, and convert it to pandas using built-in method toPandas(),

  • Does it store the Pandas object to local memory?
  • Does Pandas low-level computation handled all by Spark?
  • Does it exposed all pandas dataframe functionality?(I guess yes)
  • Can I convert it toPandas and just be done with it, without so much touching DataFrame API?

1 Answer

0 votes
by (32.3k points)

Using spark to read in a CSV file to pandas, where the end goal is to read a CSV file into memory is quite a winding method.

In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. Here, no type inference can be performed. So, if you want to sum a column of numbers in your CSV file, you won't be able to do that because according to Spark they are still strings.

You should use pandas.read_csv and read the whole CSV into memory and pandas will automatically infer the type of each column.

Answers to your questions, when you create DataFrame with Spark from new files, and convert it to pandas using built-in method toPandas():

Does it store the Pandas object to local memory?

Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.

Does Pandas low-level computation handled all by Spark?

No. Pandas runs its own computations, there is no interplay between spark and pandas except some API compatibility.

Does it exposed all pandas dataframe functionality?

No. There are many methods and functions that are in the pandas API that are not in the PySpark API, e.g. Series objects have an interpolate method which isn't available in PySpark Column objects. 

Can I convert it toPandas and just be done with it, without so much touching DataFrame API?

Absolutely. In fact, in this case, you probably shouldn't even use Spark at all.

Till you're not working with a huge amount of data, pandas.read_csv will likely handle your use case.

Browse Categories

...