Data Manipulation in R with the dplyr Package

Data Manipulation

Data manipulation means modifying data to make it more organized and easier to read. We manipulate data to prepare it for analysis and visualization. The term is often used alongside ‘data exploration’, which involves organizing data using the available set of variables.
Data collected by machines often contains errors and inaccurate readings. Data manipulation is also used to remove these inaccuracies and make the data more accurate.

For example:
We will use the default iris table in R, as follows:

# To load the datasets package
library("datasets")
# To load the iris dataset
data(iris)
summary(iris)

Output:

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
Min. :4.300	Min. :2.000	Min. :1.000	Min. :0.100	setosa :50
1st Qu.:5.100	1st Qu.:2.800	1st Qu.:1.600	1st Qu.:0.300	versicolor:50
Median :5.800	Median :3.000	Median :4.350	Median :1.300	virginica :50
Mean :5.843	Mean :3.057	Mean :3.758	Mean :1.199
3rd Qu.:6.400	3rd Qu.:3.300	3rd Qu.:5.100	3rd Qu.:1.800
Max. :7.900	Max. :4.400	Max. :6.900	Max. :2.500

Now that we have seen what data manipulation in R is, this tutorial covers the following topics:

Data Manipulation in R
Data Manipulation in R With dplyr Package
Grouping
Pipe Operator

sample()

The sample() function generates a random sample of a specific size from a vector or a dataset, either with or without replacement.
The basic syntax of the sample() function is as follows:

sample(data, size, replace = FALSE, prob = NULL)

For example:

# To return 5 random rows
index <- sample(1:nrow(iris), 5)
index
iris[index, ]

Output (the rows drawn vary from run to run because the sample is random):

Sl. No.	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
137	6.3	3.4	5.6	2.4	virginica
85	5.4	3.0	4.5	1.5	versicolor
14	4.3	3.0	1.1	0.1	setosa
54	5.5	2.3	4.0	1.3	versicolor
4	4.6	3.1	1.5	0.2	setosa
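
The replace and prob arguments in the syntax above are not exercised by the example, so here is a minimal sketch of how they might be used. The seed value 42 and the coin vector are illustrative choices, not part of the original tutorial.

```r
# Fix the random seed so the same sample is drawn on every run
set.seed(42)  # 42 is an arbitrary, illustrative seed

# Sampling without replacement (the default): every index is unique
idx_unique <- sample(1:nrow(iris), 5)

# Sampling with replacement: the same index may be drawn more than once
idx_repeat <- sample(1:nrow(iris), 5, replace = TRUE)

# Weighted sampling: prob gives the relative chance of each element
coin <- sample(c("heads", "tails"), 10, replace = TRUE, prob = c(0.7, 0.3))

iris[idx_unique, ]
table(coin)
```

Calling set.seed() before sample() is a common way to make a random draw repeatable when results need to be shared or checked.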

table()

It is used to create a frequency table to calculate the occurrences of unique values of a variable.
The table() function generates an object of the table class.
For example:

# To find the frequency distribution of Species in the iris table
data(iris)
freq.table <- table(iris$Species)
head(freq.table)

Output:

setosa	versicolor	virginica
50	50	50
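
As a small extension of the example above, the sketch below checks the class of the result, converts the counts to proportions with prop.table(), and cross-tabulates two variables. The variable names props and cross are illustrative.

```r
data(iris)

# One-way frequency table, as in the example above
freq.table <- table(iris$Species)
class(freq.table)  # an object of class "table"

# Convert the counts to proportions of the total
props <- prop.table(freq.table)
props

# Two-way table: species against a logical condition on sepal width
cross <- table(iris$Species, iris$Sepal.Width > 3)
cross
```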

Data Manipulation in R With dplyr Package

There are different ways to perform data manipulation in R, such as using base R functions like subset(), with(), and within(), packages like data.table and reshape2 (with readr commonly used to import the data first), and even machine learning algorithms.
However, in this tutorial, we are going to use the dplyr package to perform data manipulation in R.
The dplyr package provides many functions designed specifically for data manipulation. These functions are often faster than their base R equivalents and are widely used for data exploration and transformation.
The following are some of the important functions included in the dplyr package:
select() :- To select columns (variables)
filter() :- To filter (subset) rows
mutate() :- To create new variables
summarise() :- To summarize (or aggregate) data
group_by() :- To group data
arrange() :- To sort data
inner_join(), left_join(), etc. :- To join data frames
To install the dplyr package, run the following command:

install.packages("dplyr")

# To load the dplyr package
library("dplyr")
# To load the datasets package
library("datasets")
# To load the iris dataset
data(iris)
summary(iris)

Output:

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
Min. :4.300	Min. :2.000	Min. :1.000	Min. :0.100	setosa :50
1st Qu.:5.100	1st Qu.:2.800	1st Qu.:1.600	1st Qu.:0.300	versicolor:50
Median :5.800	Median :3.000	Median :4.350	Median :1.300	virginica :50
Mean :5.843	Mean :3.057	Mean :3.758	Mean :1.199
3rd Qu.:6.400	3rd Qu.:3.300	3rd Qu.:5.100	3rd Qu.:1.800
Max. :7.900	Max. :4.400	Max. :6.900	Max. :2.500

The iris dataset contains 150 samples of three iris species (setosa, versicolor, and virginica) and four features measured for each sample.
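
Joining data frames is the one operation from the function list that the sections below do not demonstrate, so here is a minimal sketch using dplyr's left_join() and inner_join(). The species_codes lookup table and its Code values are hypothetical and were invented for this illustration.

```r
library(dplyr)

# One row per species, built from the factor levels of iris$Species
species_levels <- data.frame(
  Species = levels(iris$Species),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table; it deliberately has no row for virginica
species_codes <- data.frame(
  Species = c("setosa", "versicolor"),
  Code = c("SET", "VER"),
  stringsAsFactors = FALSE
)

# left_join() keeps every row of the first table; unmatched rows get NA
left <- left_join(species_levels, species_codes, by = "Species")
left

# inner_join() keeps only the rows that have a match in both tables
inner <- inner_join(species_levels, species_codes, by = "Species")
inner
```

Here left has three rows, with Code missing for virginica, while inner has only the two matched rows.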

select()

It is used to select data by its column name. We can select any number of columns in a number of ways.
For example:

# To select the following columns
selected <- select(iris, Sepal.Length, Sepal.Width, Petal.Length)
head(selected)
# To select all columns from Sepal.Length to Petal.Length
selected1 <- select(iris, Sepal.Length:Petal.Length)
# To print the first four rows
head(selected1, 4)
# To select columns by their numeric indexes
selected1 <- select(iris, c(3:5))
head(selected1)

Output:

Sl.No.	Sepal.Length	Sepal.Width	Petal.Length
1	5.1	3.5	1.4
2	4.9	3.0	1.4
3	4.7	3.2	1.3
4	4.6	3.1	1.5
5	5.0	3.6	1.4
6	5.4	3.9	1.7

Output:

Sl.No.	Sepal.Length	Sepal.Width	Petal.Length
1	5.1	3.5	1.4
2	4.9	3.0	1.4
3	4.7	3.2	1.3
4	4.6	3.1	1.5

Output:

Sl.No.	Petal.Length	Petal.Width	Species
1	1.4	0.2	setosa
2	1.4	0.2	setosa
3	1.3	0.2	setosa
4	1.5	0.2	setosa
5	1.4	0.2	setosa
6	1.7	0.4	setosa

# Use a minus sign (-) to exclude particular columns
selected <- select(iris, -Sepal.Length, -Sepal.Width)
head(selected)

Output:

Sl.No.	Petal.Length	Petal.Width	Species
1	1.4	0.2	setosa
2	1.4	0.2	setosa
3	1.3	0.2	setosa
4	1.5	0.2	setosa
5	1.4	0.2	setosa
6	1.7	0.4	setosa
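
Beyond names, ranges, and indexes, columns can also be selected by pattern. The sketch below uses the selection helpers starts_with(), ends_with(), and contains() that dplyr makes available inside select(); the result names are illustrative.

```r
library(dplyr)

# Columns whose names begin with "Sepal"
sepal_cols <- select(iris, starts_with("Sepal"))
names(sepal_cols)

# Columns whose names end with "Width"
width_cols <- select(iris, ends_with("Width"))
names(width_cols)

# Helpers can be mixed with plain column names
petal_species <- select(iris, contains("Petal"), Species)
head(petal_species, 3)
```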

filter()

It returns the rows that match the given criteria. Like select(), it takes a data frame as its first argument, followed by one or more conditions separated by commas; a row is kept only if it satisfies all of them.
For example:

# To select the first 3 rows with Species as setosa
filtered <- filter(iris, Species == "setosa")
head(filtered, 3)

Output:

Sl. No.	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa

# To select the last 5 rows with Species as versicolor and Sepal.Width more than 3
filtered1 <- filter(iris, Species == "versicolor", Sepal.Width > 3)
# tail() returns 6 rows by default, so ask for 5 explicitly
tail(filtered1, 5)

Output:

Sl. No.	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
4	6.3	3.3	4.7	1.6	versicolor
5	6.7	3.1	4.4	1.4	versicolor
6	5.9	3.2	4.8	1.8	versicolor
7	6.0	3.4	4.5	1.6	versicolor
8	6.7	3.1	4.7	1.5	versicolor
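
Conditions separated by commas are combined with a logical AND. As a sketch of the alternatives, the example below shows the equivalent & form, an OR condition with |, and membership testing with %in%. The thresholds and result names are illustrative.

```r
library(dplyr)

# Comma-separated conditions and & give the same rows
both <- filter(iris, Species == "versicolor", Sepal.Width > 3)
same <- filter(iris, Species == "versicolor" & Sepal.Width > 3)

# | keeps rows that satisfy at least one of the conditions
either <- filter(iris, Sepal.Width > 4 | Petal.Length > 6.5)

# %in% keeps rows whose value appears in a set of allowed values
two_species <- filter(iris, Species %in% c("setosa", "virginica"))

nrow(both)
nrow(two_species)
```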

mutate()

It creates new columns and preserves the existing columns in a dataset.
For example:

# To create a column "Greater.Half" which stores TRUE if the given condition is TRUE
col1 <- mutate(iris, Greater.Half = Sepal.Width > 0.5 * Sepal.Length)
tail(col1)

Output:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	Greater.Half
145	6.7	3.3	5.7	2.5	virginica	FALSE
146	6.7	3.0	5.2	2.3	virginica	FALSE
147	6.3	2.5	5.0	1.9	virginica	FALSE
148	6.5	3.0	5.2	2.0	virginica	FALSE
149	6.2	3.4	5.4	2.3	virginica	TRUE
150	5.9	3.0	5.1	1.8	virginica	TRUE

# To check how many flowers satisfy this condition
table(col1$Greater.Half)

Output:

FALSE	TRUE
84	66
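
A single mutate() call can also create several columns at once, and a later column may refer to one defined earlier in the same call. The sketch below illustrates this; the Sepal.Ratio and Wide.Sepal columns are hypothetical names chosen for the example.

```r
library(dplyr)

ratios <- mutate(
  iris,
  # Width as a fraction of length
  Sepal.Ratio = Sepal.Width / Sepal.Length,
  # Uses the Sepal.Ratio column created on the line above
  Wide.Sepal = Sepal.Ratio > 0.5
)

# The original five columns are kept and two new ones are appended
names(ratios)
head(ratios, 3)
```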

arrange()

It is used to sort rows by one or more variables, in either ascending or descending order.
For example:

# To arrange Sepal.Width in ascending order
arranged <- arrange(col1, Sepal.Width)
head(arranged)
# To arrange Sepal.Width in descending order
arranged <- arrange(col1, desc(Sepal.Width))
head(arranged)

Output (ascending order):

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	Greater.Half
1	5.0	2.0	3.5	1.0	versicolor	FALSE
2	6.0	2.2	4.0	1.0	versicolor	FALSE
3	6.2	2.2	4.5	1.5	versicolor	FALSE
4	6.0	2.2	5.0	1.5	virginica	FALSE
5	4.5	2.3	1.3	0.3	setosa	TRUE
6	5.5	2.3	4.0	1.3	versicolor	FALSE

Output (descending order):

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	Greater.Half
1	5.7	4.4	1.5	0.4	setosa	TRUE
2	5.5	4.2	1.4	0.2	setosa	TRUE
3	5.2	4.1	1.5	0.1	setosa	TRUE
4	5.8	4.0	1.2	0.2	setosa	TRUE
5	5.4	3.9	1.7	0.4	setosa	TRUE
6	5.4	3.9	1.3	0.4	setosa	TRUE
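
Sorting is not limited to one variable. As a minimal sketch, the example below sorts by Species first and then, within each species, by Sepal.Width in descending order; the name by_species is illustrative.

```r
library(dplyr)

# Later arguments break ties left by the earlier ones
by_species <- arrange(iris, Species, desc(Sepal.Width))

# The first rows are the setosa flowers with the widest sepals
head(by_species, 3)
```

Because Species is a factor, its rows are ordered by factor level (setosa, versicolor, virginica).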

summarise()

It is used to compute summary statistics (mean, median, and so on) for a dataset. It reduces multiple values to a single value.
For example:

summarised <- summarise(arranged, Mean.Width = mean(Sepal.Width))
head(summarised)

Output:

  Mean.Width
1   3.057333
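
Several statistics can be computed in one summarise() call. The sketch below adds the median, the maximum, and the row count from dplyr's n() helper; the column names are illustrative.

```r
library(dplyr)

stats <- summarise(
  iris,
  Mean.Width = mean(Sepal.Width),
  Median.Width = median(Sepal.Width),
  Max.Width = max(Sepal.Width),
  Count = n()  # number of rows being summarised
)

# One row, with one column per statistic
stats
```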

Grouping

It is done to group observations within a dataset by one or more variables. Most data operations are performed on groups defined by variables.
For example:

# To find the mean sepal width by Species, we use grouping as follows
gp <- group_by(iris, Species)
mn <- summarise(gp, Mean.Sepal = mean(Sepal.Width))
head(mn)

Output:

Sl. No.	Species <fct>	Mean.Sepal <dbl>
1	setosa	3.43
2	versicolor	2.77
3	virginica	2.97
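
Once the data is grouped, every statistic inside summarise() is computed separately for each group. The sketch below returns the group size and two statistics per species; the Count and Max.Sepal column names are illustrative.

```r
library(dplyr)

gp <- group_by(iris, Species)

per_species <- summarise(
  gp,
  Count = n(),                     # rows in each group
  Mean.Sepal = mean(Sepal.Width),  # same statistic as the example above
  Max.Sepal = max(Sepal.Width)
)

# One row per species
per_species
```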

Pipe Operator

The pipe operator lets us chain multiple functions together: it passes the result on its left as the first argument of the function on its right. It is written as %>%. It can be used with functions such as filter(), select(), arrange(), summarise(), and group_by().
For example:

# To get rows that satisfy the following conditions
iris %>% filter(Species == "setosa", Sepal.Width > 3.8)

Output:

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.4	3.9	1.7	0.4	setosa
5.8	4.0	1.2	0.2	setosa
5.7	4.4	1.5	0.4	setosa
5.4	3.9	1.3	0.4	setosa
5.2	4.1	1.5	0.1	setosa
5.5	4.2	1.4	0.2	setosa

# To find the mean Sepal.Length by Species, we use the pipe operator as follows
iris %>% group_by(Species) %>% summarise(Mean.Length = mean(Sepal.Length))

Output:

Species <fct>	Mean.Length <dbl>
setosa	5.01
versicolor	5.94
virginica	6.59
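
To show what the pipe buys us, the sketch below writes the same sequence of steps twice: once as nested function calls, which must be read from the inside out, and once as a pipeline, which reads top to bottom. The particular steps chosen are illustrative.

```r
library(dplyr)

# Nested calls: the first step (filter) is buried in the middle
nested <- head(
  arrange(
    select(filter(iris, Species == "setosa"), Sepal.Length, Sepal.Width),
    desc(Sepal.Width)
  ),
  3
)

# The same steps as a pipeline, in the order they are applied
piped <- iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Width)) %>%
  head(3)

# Both forms produce the same result
all.equal(nested, piped)
```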

In this tutorial, we covered what data manipulation in R is, how to manipulate data with functions from the dplyr package, grouping, and how to use the pipe operator to chain multiple functions together. In the next section, we are going to cover data visualization in R.
