I am using the randomSplit function to get a small subset of a DataFrame for development purposes, and I end up just taking the first DataFrame that it returns.

val df_subset = data.randomSplit(Array(0.00000001, 0.01), seed = 12345)(0)


If I use df.take(1000), I end up with an array of rows, not a DataFrame, so that won't work for me.

Is there a better, simpler way to take, say, the first 1000 rows of the df and store them as another df?

1 Answer

I would suggest using the limit method, passing the number of rows you want (10 here, 1000 in your case):

yourDataFrame.limit(10)

Calling limit() on your df returns a new DataFrame. It is a transformation, so it is evaluated lazily and does not collect any data to the driver.

In contrast, when you call:

yourDataFrame.take(10)

you get back an Array of Rows. take is an action: it runs a job and collects the rows to the driver (similar to collect).
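To make the difference concrete, here is a minimal sketch that puts both calls side by side. The object name LimitVsTake, the helper firstRows, the local SparkSession settings, and the generated 100,000-row stand-in for data are all assumptions for illustration; they are not from the question.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object LimitVsTake {
  // Hypothetical helper: limit is a transformation, so this only builds a
  // new DataFrame and does not run a Spark job by itself.
  def firstRows(df: DataFrame, n: Int): DataFrame = df.limit(n)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (app name and master are assumptions).
    val spark = SparkSession.builder()
      .appName("limit-vs-take")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the asker's `data`: a single-column DataFrame of ids.
    val data: DataFrame = spark.range(100000).toDF("id")

    // Still a DataFrame, so it can be filtered, joined, cached, or written out.
    val dfSubset: DataFrame = firstRows(data, 1000)

    // An action: the rows are materialized on the driver as a local array.
    val rows: Array[Row] = data.take(1000)

    println(s"dfSubset rows: ${dfSubset.count()}") // count() triggers the job
    println(s"take rows: ${rows.length}")

    spark.stop()
  }
}
```

Note that without an explicit orderBy, Spark does not guarantee which rows limit returns, so "the first 1000" is only well defined if you sort the DataFrame before limiting.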
