0 votes
in Big Data Hadoop & Spark by (11.4k points)

I'm trying to draw samples from two DataFrames while preserving the ratio of their row counts. For example:

df1.count() = 10
df2.count() = 1000

noOfSamples = 10

I want to sample the data so that I get 10 samples of 101 rows each (1 from df1 and 100 from df2).

Now while doing so,

var newSample = df1.sample(true, df1.count() / noOfSamples)

What does the fraction here imply? Can it be greater than 1?

Also, is there any way to specify the number of rows to be sampled?

1 Answer

0 votes
by (32.3k points)

The fraction parameter is the approximate fraction of the dataset's rows that will be returned. For instance, if you set it to 0.1, roughly 10% (1/10) of the rows will be returned. For your case, I believe you want the following:

val newSample = df1.sample(true, 1D*noOfSamples/df1.count)

However, you may notice that newSample.count returns a different number each time you run it. That's because the fraction is not an exact quota: it is used as a threshold against a randomly generated value for each row, so the size of the resulting dataset varies from run to run.
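To make the variation concrete, here is a minimal plain-Scala sketch (no Spark needed) of threshold-style sampling. It is only an illustration of the idea described above, not Spark's actual implementation, and the names `BernoulliSamplingSketch` and `bernoulliSample` are hypothetical:

```scala
import scala.util.Random

object BernoulliSamplingSketch {
  // Keep each row independently when a random draw falls below `fraction`.
  // This mimics threshold-based sampling without replacement (an assumption,
  // not Spark's real code path).
  def bernoulliSample[A](rows: Seq[A], fraction: Double, rng: Random): Seq[A] =
    rows.filter(_ => rng.nextDouble() < fraction)

  def main(args: Array[String]): Unit = {
    val rows = 1 to 1000
    val rng = new Random(42L)
    // Five runs with the same fraction: the sizes cluster around 100
    // (1000 * 0.1) but are not guaranteed to equal it.
    val sizes = Seq.fill(5)(bernoulliSample(rows, 0.1, rng).size)
    println(sizes.mkString(", "))
  }
}
```

Each run makes an independent keep/drop decision per row, which is why only the expected size, not the exact size, is controlled by the fraction.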

A workaround is to sample about twice as many rows as you need and then cap the result with limit:

val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit(noOfSamples)

Some scalability observations

Note that calling df1.count can be expensive, because it evaluates the whole DataFrame, and that costs you one of the benefits of sampling in the first place.

Therefore, depending on the context of your application, you may want to use an already known total row count, or an approximation of it.

val newSample = df1.sample(true, 1D*noOfSamples/knownNoOfSamples)

Or, if your DataFrame is huge, I would still use a guessed fraction and then use limit to force the number of rows.

val guessedFraction = 0.1

val newSample = df1.sample(true, guessedFraction).limit(noOfSamples)

As for your questions:

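Can it be greater than 1?

Not when sampling without replacement: there it is a fraction between 0 and 1, and setting it to 1 brings back roughly 100% of the rows, so a larger value makes no sense. When sampling with replacement (the first argument is true, as in your call), Spark interprets the value as the expected number of times each row is chosen, so a value above 1 is accepted and returns more rows than the input, with duplicates. For what you are trying to do, you still want a value between 0 and 1.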

Also, is there any way to specify the number of rows to be sampled?

You can use a fraction somewhat larger than the one that matches the number of rows you want and then apply limit, as I show in the second example. Note that limit guarantees at most that many rows, not exactly that many. Maybe there is another way, but this is the approach I use.
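The over-sample-then-limit approach can be wrapped in a small helper. This is only a sketch under a few assumptions: it targets Spark 2.x or later, `sampleAtMost`, `approxTotal` and `oversample` are hypothetical names, and it samples without replacement (unlike the snippets above) so that the fraction can be capped at 1 and no row is returned twice:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleAtMostSketch {
  // Hypothetical helper: over-sample by `oversample`, then cap at n rows.
  // `approxTotal` is a known or estimated row count, so no df.count is needed.
  def sampleAtMost(df: DataFrame, n: Int, approxTotal: Long,
                   oversample: Double = 2.0): DataFrame = {
    // Without replacement the fraction must stay within [0, 1].
    val fraction = math.min(1.0, oversample * n / approxTotal)
    df.sample(withReplacement = false, fraction).limit(n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-sketch")
      .getOrCreate()

    val df2 = spark.range(1000).toDF("id")
    // Returns at most 100 rows; fewer only if the random sample came up short.
    val picked = sampleAtMost(df2, n = 100, approxTotal = 1000L)
    println(picked.count())

    spark.stop()
  }
}
```

A larger `oversample` factor lowers the chance of getting fewer than n rows, at the cost of scanning and keeping more rows before the limit is applied.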
