I am using the randomSplit function to get a small subset of a DataFrame for development purposes, and I end up just taking the first DataFrame that this function returns.

val df_subset = data.randomSplit(Array(0.00000001, 0.01), seed = 12345)(0)

If I use df.take(1000), I end up with an array of rows, not a DataFrame, so that won't work for me.

Is there a better, simpler way to take, say, the first 1000 rows of the df and store them as another df?

1 Answer

I would suggest using the limit method in your program, like this:

val df_subset = data.limit(1000)

Applying limit() to your df returns a new DataFrame. This is a transformation, so it does not collect the data.

Whereas when you do:

data.take(1000)

the result is an Array of Rows. This is an action, so it collects the data to the driver (similar to collect).
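To make the difference concrete, here is a minimal sketch that runs both calls side by side. It assumes a local SparkSession and a made-up DataFrame built with spark.range; the names LimitVsTake, firstRows, and the "id" column are only for illustration and are not from the question.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object LimitVsTake {
  // Hypothetical helper: returns a DataFrame of at most n rows.
  // limit is a transformation, so nothing is computed until an action runs.
  def firstRows(df: DataFrame, n: Int): DataFrame = df.limit(n)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("limit-vs-take")
      .getOrCreate()

    // Stand-in for the real data: a single-column DataFrame of 100000 ids.
    val data: DataFrame = spark.range(0, 100000).toDF("id")

    // Still a DataFrame, so it can be filtered, joined, cached, and so on.
    val dfSubset: DataFrame = firstRows(data, 1000)

    // An action: the rows are brought to the driver as a local array.
    val rows: Array[Row] = data.take(1000)

    println(s"limit gives a DataFrame with ${dfSubset.count()} rows")
    println(s"take gives a local array with ${rows.length} rows")

    spark.stop()
  }
}
```

Note that without an explicit orderBy, Spark does not promise which rows limit picks, so if you need the same rows on every run you would probably want to sort first.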

