• Articles
  • Tutorials
  • Interview Questions
  • Webinars

Data Manipulation in R with Dplyr Package

Data Manipulation

Data manipulation involves modifying data to make it easier to read and to be more organized. We manipulate data for analysis and visualization. It is also used with the term ‘data exploration’ which involves organizing data using available sets of variables.
At times, the data collection process done by machines involves a lot of errors and inaccuracies in reading. Data manipulation is also used to remove these inaccuracies and make data more accurate and precise.

For example:
We will use the default iris table in R, as follows:

#To load datasets package
library("datasets")
#To load iris dataset
data(iris)
summary(iris)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min.   :4.300 Min.   :2.000 Min.   :1.000 Min.   :0.100 setosa: 50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 versicolor:0.300 versicolor:50
Median: 5.800 Median: 3.000 Median: 4.350 Median: 1.300 Virginica: 50
Mean: 5.843 Mean: 3.057 Mean: 3.758 Mean: 1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max.   :7.900 Max.   :4.400 Max.   :6.900 Max.   :2.500

So after going through what data manipulation in R is, we are going to cover the following topics in this tutorial:

  • Data Manipulation in R
  • Data Manipulation in R With dplyr Package.
  • Grouping
  • Pipe Operator

Sample()

It is used to generate a sample of a specific size from a vector or a dataset, either with or without replacement.
The basic syntax of sample() function is as follows:

sample(data, size, replace = FALSE, prob = NULL)

For example:

#To return 5 random rows
index<-sample(1:nrow(iris), 5)
index
iris[index,]

Output:

Sl. No. Sepal.Length Sepal.Width Petal.Length Petal.Width Species
137 6.3 3.4 5.6 2.4 Virginica
85 5.4 3.0 4.5 1.5 Versicolor
14 4.3 3.0 1.1 0.1 Setosa
54 5.5 2.3 4.0 1.3 Versicolor
4 4.6 3.1 1.5 0.2 Setosa

Table()

It is used to create a frequency table to calculate the occurrences of unique values of a variable.
The table() function generates an object of the table class.
For example:

#To find the frequency distribution of Species in iris table
data(iris)
freq.table <- table(iris$Species)
head(freq.table)

Output:

setosa versicolor virginica
50 50 50

Become a Data Science Architect

 

Data Manipulation in R With dplyr Package

There are different ways to perform data manipulation in R, such as using Base R functions like subset(), with(), within(), etc., Packages like data.table, ggplot2, reshape2, readr, etc., and different Machine Learning algorithms.
However, in this tutorial, we are going to use the dplyr package to perform data manipulation in R.
The dplyr package consists of many functions specifically used for data manipulation. These functions process data faster than Base R functions and are known the best for data exploration and transformation, as well.
Following are some of the important functions included in the dplyr package
select() :- To select columns (variables)
filter() :-To filter (subset) rows.
mutate() :-To create new variables
summarise() :- To summarize (or aggregate) data
group_by() :- To group data
arrange() :- To sort data
join() :- To join data frames.
To install the dplyr package, run the following command:

install.packages("dplyr")

 

#To load dplyr package
library("dplyr")
#To load datasets package
library("datasets")
#To load iris dataset
data(iris)
summary(iris)

 Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min.   :4.300 Min.   :2.000 Min.   :1.000 Min.   :0.100 setosa: 50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 versicolor:0.300 versicolor:50
Median: 5.800 Median: 3.000 Median: 4.350 Median: 1.300 virginica: 50
Mean: 5.843 Mean: 3.057 Mean: 3.758 Mean: 1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max.   :7.900 Max.   :4.400 Max.   :6.900 Max.   :2.500

It contains 150 samples of three plant species (setosa, virginica, and versicolor) and four features measured for each sample.

Select()

It is used to select data by its column name. We can select any number of columns in a number of ways.
For example:

#To select the following columns
selected <- select(iris, Sepal.Length, Sepal.Width, Petal.Length)
head(selected)
#To select all columns from Sepal.Length to Petal.Length
selected1 <- select(iris, Sepal.Length:Petal.Length)
#To print first four rows
head(selected1, 4)                           
#To select columns with numeric indexes
selected1 <- select(iris,c(3:5))
head(selected1)

Output:

Sl.No. Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7

Output:

Sl.No. Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5

Output:

Sl.No. Petal.Length Petal.Width Species
1 1.4 0.2 Setosa
2 1.4 0.2 Setosa
3 1.3 0.2 Setosa
4 1.5 0.2 Setosa
5 1.4 0.2 Setosa
6 1.7 0.4 Setosa

 

#We use(-)to hide a particular column
selected <- select(iris, -Sepal.Length, -Sepal.Width)
head(selected)

Output:

Sl.No. Petal.Length Petal.Width Species
1 1.4 0.2 Setosa
2 1.4 0.2 Setosa
3 1.3 0.2 Setosa
4 1.5 0.2 Setosa
5 1.4 0.2 Setosa
6 1.7 0.4 Setosa

Filter()

It is used to find rows with matching criteria. It also works like the select() function, i.e., we pass a data frame along with a condition separated by a comma.
For example:

#To select the first 3 rows with Species as setosa
filtered <- filter(iris, Species == "setosa" )
head(filtered,3)

Output:

Sl. No. Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 Setosa
2 4.9 3.0 1.4 0.2 Setosa
3 4.7 3.2 1.3 0.2 Setosa

 

#To select the last 5 rows with Species as versicolor and Sepal width more than 3
filtered1 <- filter(iris, Species == "versicolor", Sepal.Width > 3)
tail(filtered1)

Output:

Sl. No. Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4 6.3 3.3 4.7 1.6 Versicolor
5 6.7 3.1 4.4 1.4 Versicolor
6 5.9 3.2 4.8 1.8 Versicolor
7 6.0 3.4 4.5 1.6 Versicolor
8 6.7 3.1 4.7 1.5 Versicolor

Mutate()

It creates new columns and preserves the existing columns in a dataset.
For example:

#To create a column “Greater.Half” which stores TRUE if given condition
is TRUE
col1 <- mutate(iris, Greater.Half = Sepal.Width > 0.5 * Sepal.Length)
tail(col1)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species Greater.Half
145 6.7 3.3 5.7 2.5 Virginica FALSE
146 6.7 3.0 5.2 2.3 Virginica FALSE
147 6.3 2.5 5.0 1.9 Virginica FALSE
148 6.5 3.0 5.2 2.0 Virginica FALSE
149 6.2 3.4 5.4 2.3 Virginica TRUE
150 5.9 3.0 5.1 1.8 Virginica TRUE

 

#To check how many flowers satisfy this condition
table(col1$Greater.Half)

Output:
FALSE=84  TRUE=66

Arrange()

It is used to sort rows by variables in both an ascending and descending order.
For example:

#To arrange Sepal Width in ascending order
arranged <- arrange(col1, Sepal.Width)
head(arranged)
#To arrange Sepal Width in descending order
arranged <- arrange(col1, desc(Sepal.Width))
head(arranged)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species Greater.Half
1 5.0 2.0 3.5 1.0 Versicolor FALSE
2 6.0 2.2 4.0 1.0 Versicolor FALSE
3 6.2 2.2 4.5 1.5 Versicolor FALSE
4 6.0 2.2 5.0 1.5 Virginica FALSE
5 4.5 2.3 1.3 0.3 Setosa TRUE
6 5.5 2.3 4.0 1.3 Versicolor FALSE
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Greater.Half
1 5.7 4.4 1.5 0.4 Setosa TRUE
2 5.5 4.2 1.4 0.2 Setosa TRUE
3 5.2 4.1 1.5 0.1 Setosa TRUE
4 5.8 4.0 1.2 0.2 Setosa TRUE
5 5.4 3.9 1.7 0.4 Setosa TRUE
6 5.4 3.9 1.3 0.4 Setosa TRUE

Summarise()

It is used to find insights(mean, median, mode, etc.) from a dataset. It reduces multiple values down to a single value.
For example:

summarised <- summarise(arranged, Mean.Width = mean(Sepal.Width))
head(summarised)

Output:

Mean.Width
1   3.057333

Grouping10

It is done to group observations within a dataset by one or more variables. Most data operations are performed on groups defined by variables.
For example:

#To find mean sepal width by Species, we use grouping as follows
gp <- group_by(iris,Species)
mn <- summarise(gp,Mean.Sepal = mean(Sepal.Width))
head(mn)

Output:

Sl. No. Species
<fct>
Mean.Sepal
<dbl>
1 setosa 3.43
2 versicolor 2.77
3 virginica 2.97

Pipe Operator

Pipe operator lets us wrap multiple functions together. It is denoted as %>% . It can be used with functions like filter(), select(), arrange(), summarise(), group_by(), etc.
For example:

#To get rows with the following conditions
iris %>% filter(Species == "setosa",Sepal.Width > 3.8)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.4 3.9 1.7 0.4 Setosa
5.8 4.0 1.2 0.2 Setosa
5.7 4.4 1.5 0.4 Setosa
5.4 3.9 1.3 0.4 Setosa
5.2 4.1 1.5 0.1 Setosa
5.5 4.2 1.4 0.2 Setosa

 

#To find mean Sepal Length by Species, we use pipe operator as follows
iris  %>% group_by(Species) %>% summarise(Mean.Length = mean(Sepal.Length))

Output:

Species
<fct>
Mean.Length
<dbl>
setosa 5.01
versicolor 5.94
virginica 6.59

In this tutorial we were talking about what data manipulation in R is, data manipulation in R using functions in the dplyr package, grouping, and using the pipe operator to tie multiple functions together. In the next section, we are going to cover data visualization in R.

Course Schedule

Name Date Details
R Programming Course 14 Dec 2024(Sat-Sun) Weekend Batch View Details
21 Dec 2024(Sat-Sun) Weekend Batch
28 Dec 2024(Sat-Sun) Weekend Batch

About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Aakash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.