Sample unique rows from a column in a dataframe without replacement

Question

asked Jul 27, 2019 in Data Science by sourav (17.6k points)

I have a dataframe in which the first column contains unique row IDs, and the second column contains values that are often not unique between rows. Below is a simplified example using iris data:

> df <- as.data.frame(iris$Sepal.Length)
> id <- rownames(df)
> df <- cbind(id, df)
> colnames(df) <- c("id", "Sepal.Length")
> nrow(df)
[1] 150
> length(unique(df$id))
[1] 150
> length(unique(df$Sepal.Length))
[1] 35
> head(df,10)
   id Sepal.Length
1   1          5.1
2   2          4.9
3   3          4.7
4   4          4.6
5   5          5.0
6   6          5.4
7   7          4.6
8   8          5.0
9   9          4.4
10 10          4.9

I would like to randomly sample from df$Sepal.Length without replacement so that the rows in the sampled data have unique row ID values.

> set.seed(22)
> df_sample <- df[sample(df$Sepal.Length, 10, replace=FALSE),]

However, replace=FALSE still gives me rows with duplicate IDs:

> duplicated(df_sample$id)
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE

Is there a way to sample this data without replacement so that it returns unique rows? I am trying to specifically sample df$Sepal.Length because I would also like to supply a probability vector for this column. Thank you!

1 Answer

Shlok Pandey · Answer 1 · 2019-08-01T06:29:09+0000

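The problem is that sample(df$Sepal.Length, 10) draws values from the column, and those values are then used as row indices. R truncates non-integer indices, so values between 4.3 and 7.9 can only ever select rows 4 through 7, which is why the same rows come back repeatedly. Sample the row positions instead: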

> df <- data.frame(id = 1:nrow(iris), Sepal.Length = iris$Sepal.Length)
> df_sample <- df[sample(nrow(df), 10, replace = FALSE), ]
> duplicated(df_sample$id)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
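
The question also asks about supplying a probability vector. A minimal sketch of one way to do that: pass the weights through the prob argument of sample() while still sampling row positions. The weight vector w below is purely hypothetical (here proportional to Sepal.Length); substitute whatever per-row weights you actually need.

```r
# Rebuild the example data frame used in the answer.
df <- data.frame(id = 1:nrow(iris), Sepal.Length = iris$Sepal.Length)

# Hypothetical weights: rows with a larger Sepal.Length are more likely to be drawn.
# Any non-negative vector with one entry per row could be used instead.
w <- df$Sepal.Length / sum(df$Sepal.Length)

set.seed(22)
# Sample row positions (not column values), weighted by w, without replacement.
idx <- sample(nrow(df), 10, replace = FALSE, prob = w)
df_sample <- df[idx, ]

# Check that no id appears more than once in the sample.
anyDuplicated(df_sample$id) == 0
```

Because the indices come from 1:nrow(df) and replace = FALSE, each row can be picked at most once, regardless of how many rows share the same Sepal.Length value.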
