Data Manipulation
Data manipulation means modifying data so that it is better organized and easier to read, usually in preparation for analysis and visualization. It goes hand in hand with data exploration, which involves organizing data using the available variables.
Data collected automatically by machines often contains errors and inaccurate readings. Data manipulation is also used to remove these inaccuracies and make the data more accurate and precise.
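As a minimal sketch of this kind of cleanup, assume a hypothetical table of sensor readings in which a missing value is stored as NA and a faulty reading shows up as a physically impossible number. The data frame, its column names, and the -50 threshold are all made up for illustration.

```r
# Hypothetical sensor readings (illustrative data, not part of this tutorial)
readings <- data.frame(
  sensor = c("a", "b", "c", "d", "e"),
  temp_c = c(21.4, NA, 22.1, -999, 20.8)
)

# Keep only rows whose reading is present and above an assumed plausibility limit
clean <- readings[!is.na(readings$temp_c) & readings$temp_c > -50, ]
clean
```

Only the rows for sensors a, c, and e remain; the NA row and the -999 row are dropped.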
For example, we will use the built-in iris dataset in R, as follows:
#To load datasets package
library("datasets")
#To load iris dataset
data(iris)
summary(iris)
Output:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species
Min.: 4.300 | Min.: 2.000 | Min.: 1.000 | Min.: 0.100 | setosa: 50
1st Qu.: 5.100 | 1st Qu.: 2.800 | 1st Qu.: 1.600 | 1st Qu.: 0.300 | versicolor: 50
Median: 5.800 | Median: 3.000 | Median: 4.350 | Median: 1.300 | virginica: 50
Mean: 5.843 | Mean: 3.057 | Mean: 3.758 | Mean: 1.199 |
3rd Qu.: 6.400 | 3rd Qu.: 3.300 | 3rd Qu.: 5.100 | 3rd Qu.: 1.800 |
Max.: 7.900 | Max.: 4.400 | Max.: 6.900 | Max.: 2.500 |
Now that we have seen what data manipulation in R is, we are going to cover the following topics in this tutorial:
- Data Manipulation in R
- Data Manipulation in R With dplyr Package.
- Grouping
- Pipe Operator
sample()
It generates a random sample of a given size from a vector, either with or without replacement. To sample rows of a dataset, we sample its row indices.
The basic syntax of the sample() function is as follows:
sample(x, size, replace = FALSE, prob = NULL)
For example:
#To return 5 random rows
index <- sample(1:nrow(iris), 5)
index
iris[index,]
Output (the rows differ from run to run because the sample is random):
Row | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species
137 | 6.3 | 3.4 | 5.6 | 2.4 | virginica
85 | 5.4 | 3.0 | 4.5 | 1.5 | versicolor
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa
54 | 5.5 | 2.3 | 4.0 | 1.3 | versicolor
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa
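Because sample() is random, the rows change on every run. The sketch below shows one common way to make a draw repeatable with set.seed(), and what replace = TRUE changes. The seed value and the variable names are arbitrary choices for illustration.

```r
# Fix the random seed so the same rows are drawn on every run
set.seed(42)  # the seed value is arbitrary

# Draw 5 row indices without replacement (the default), so no index repeats
index <- sample(1:nrow(iris), 5)
iris[index, ]

# Draw 10 values from a two-element vector; this only works with replacement
draws <- sample(c("heads", "tails"), size = 10, replace = TRUE)
draws
```

Without replace = TRUE, asking for more values than the vector contains is an error, since each element can be drawn only once.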
table()
It creates a frequency table that counts how often each unique value of a variable occurs.
The table() function generates an object of the table class.
For example:
#To find the frequency distribution of Species in iris table
data(iris)
freq.table <- table(iris$Species)
head(freq.table)
Output:
setosa | versicolor | virginica
50 | 50 | 50
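As a small extension of the frequency table above, the sketch below converts counts to proportions with prop.table() and builds a two-way table from two variables. The 3 cm threshold and the variable names are arbitrary choices for illustration.

```r
# Share of each species instead of raw counts
species_counts <- table(iris$Species)
species_share <- prop.table(species_counts)
species_share

# Two-way table: species against whether the sepal is wider than 3 cm
wide_sepal <- iris$Sepal.Width > 3
two_way <- table(iris$Species, wide_sepal)
two_way
```

The proportions sum to 1, and the two-way table has one row per species and one column each for FALSE and TRUE.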
Data Manipulation in R With dplyr Package
There are different ways to perform data manipulation in R, such as using base R functions like subset(), with(), and within(), or packages like data.table and reshape2.
However, in this tutorial, we are going to use the dplyr package to perform data manipulation in R.
The dplyr package provides many functions designed specifically for data manipulation. They are often faster than the equivalent base R code and are widely used for data exploration and transformation as well.
Following are some of the important functions included in the dplyr package:
select() :- To select columns (variables)
filter() :- To filter (subset) rows
mutate() :- To create new variables
summarise() :- To summarize (or aggregate) data
group_by() :- To group data
arrange() :- To sort data
inner_join(), left_join(), right_join(), full_join() :- To join data frames
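Joins are the only item in this list that the rest of the tutorial does not demonstrate, so here is a minimal sketch using left_join(). It assumes dplyr is already installed, and the lookup table species_info with its Code column is hypothetical, invented purely for illustration.

```r
library(dplyr)

# Hypothetical lookup table (illustrative only; not part of the iris data)
species_info <- data.frame(
  Species = factor(c("setosa", "versicolor", "virginica")),
  Code = c("SE", "VE", "VI")
)

# left_join() keeps every row of iris and adds the matching Code value
joined <- left_join(iris, species_info, by = "Species")
head(joined, 3)
```

The result still has 150 rows, with one extra column taken from the lookup table.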
To install the dplyr package, run the following command:
install.packages("dplyr")
#To load dplyr package
library("dplyr")
#To load datasets package
library("datasets")
#To load iris dataset
data(iris)
summary(iris)
Output:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species
Min.: 4.300 | Min.: 2.000 | Min.: 1.000 | Min.: 0.100 | setosa: 50
1st Qu.: 5.100 | 1st Qu.: 2.800 | 1st Qu.: 1.600 | 1st Qu.: 0.300 | versicolor: 50
Median: 5.800 | Median: 3.000 | Median: 4.350 | Median: 1.300 | virginica: 50
Mean: 5.843 | Mean: 3.057 | Mean: 3.758 | Mean: 1.199 |
3rd Qu.: 6.400 | 3rd Qu.: 3.300 | 3rd Qu.: 5.100 | 3rd Qu.: 1.800 |
Max.: 7.900 | Max.: 4.400 | Max.: 6.900 | Max.: 2.500 |
The iris dataset contains 150 samples from three iris species (setosa, versicolor, and virginica), with four features measured for each sample.
select()
It selects columns by name or position. We can select any number of columns in several ways: by listing them, by giving a range of names, or by numeric index.
For example:
#To select the following columns
selected <- select(iris, Sepal.Length, Sepal.Width, Petal.Length)
head(selected)
#To select all columns from Sepal.Length to Petal.Length
selected1 <- select(iris, Sepal.Length:Petal.Length)
#To print first four rows
head(selected1, 4)
#To select columns with numeric indexes
selected1 <- select(iris, c(3:5))
head(selected1)
Output (columns selected by name):
Row | Sepal.Length | Sepal.Width | Petal.Length
1 | 5.1 | 3.5 | 1.4
2 | 4.9 | 3.0 | 1.4
3 | 4.7 | 3.2 | 1.3
4 | 4.6 | 3.1 | 1.5
5 | 5.0 | 3.6 | 1.4
6 | 5.4 | 3.9 | 1.7
Output (columns selected by range, first four rows):
Row | Sepal.Length | Sepal.Width | Petal.Length
1 | 5.1 | 3.5 | 1.4
2 | 4.9 | 3.0 | 1.4
3 | 4.7 | 3.2 | 1.3
4 | 4.6 | 3.1 | 1.5
Output (columns selected by numeric index):
Row | Petal.Length | Petal.Width | Species
1 | 1.4 | 0.2 | setosa
2 | 1.4 | 0.2 | setosa
3 | 1.3 | 0.2 | setosa
4 | 1.5 | 0.2 | setosa
5 | 1.4 | 0.2 | setosa
6 | 1.7 | 0.4 | setosa
#We use (-) to hide a particular column
selected <- select(iris, -Sepal.Length, -Sepal.Width)
head(selected)
Output:
Row | Petal.Length | Petal.Width | Species
1 | 1.4 | 0.2 | setosa
2 | 1.4 | 0.2 | setosa
3 | 1.3 | 0.2 | setosa
4 | 1.5 | 0.2 | setosa
5 | 1.4 | 0.2 | setosa
6 | 1.7 | 0.4 | setosa
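Besides names, ranges, and indexes, select() also accepts helper functions that match column names by pattern. The sketch below uses two of them, starts_with() and ends_with(); the variable names petals and widths are arbitrary.

```r
library(dplyr)

# starts_with() picks every column whose name begins with the given prefix
petals <- select(iris, starts_with("Petal"))
names(petals)   # "Petal.Length" "Petal.Width"

# ends_with() does the same for a suffix; Species is listed first, so it comes first
widths <- select(iris, Species, ends_with("Width"))
names(widths)   # "Species" "Sepal.Width" "Petal.Width"
```

Helpers like these are handy when a dataset has many similarly named columns and listing each one would be tedious.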
filter()
It keeps the rows that match the given criteria. Like select(), it takes a data frame as its first argument, followed by one or more conditions separated by commas; a row is kept only if it satisfies all of them.
For example:
#To select the first 3 rows with Species as setosa
filtered <- filter(iris, Species == "setosa")
head(filtered, 3)
Output:
Row | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa
#To select the last 5 rows with Species as versicolor and Sepal width more than 3
filtered1 <- filter(iris, Species == "versicolor", Sepal.Width > 3)
tail(filtered1, 5)
Output:
Row | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species
4 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor
5 | 6.7 | 3.1 | 4.4 | 1.4 | versicolor
6 | 5.9 | 3.2 | 4.8 | 1.8 | versicolor
7 | 6.0 | 3.4 | 4.5 | 1.6 | versicolor
8 | 6.7 | 3.1 | 4.7 | 1.5 | versicolor
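Conditions separated by commas are combined with AND. As a sketch of the other common cases, the example below uses %in% to match any of several values and | to combine conditions with OR. The thresholds and variable names are arbitrary.

```r
library(dplyr)

# %in% keeps rows whose Species is any of the listed values
subset1 <- filter(iris, Species %in% c("setosa", "virginica"))
nrow(subset1)   # 100, since each species has 50 rows

# | keeps rows that satisfy at least one of the two conditions
subset2 <- filter(iris, Sepal.Length > 7.5 | Petal.Width < 0.2)
nrow(subset2)
```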
mutate()
It creates new columns and preserves the existing columns in a dataset.
For example:
#To create a column "Greater.Half" which stores TRUE if the given condition is TRUE
col1 <- mutate(iris, Greater.Half = Sepal.Width > 0.5 * Sepal.Length)
tail(col1)
Output:
Row | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Greater.Half
145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica | FALSE
146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | FALSE
147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | FALSE
148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | FALSE
149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | TRUE
150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | TRUE
#To check how many flowers satisfy this condition
table(col1$Greater.Half)
Output:
FALSE | TRUE
84 | 66
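mutate() can also add several columns in one call, and a later column may refer to one created earlier in the same call. The sketch below illustrates this; the column name Sepal.Ratio is made up for the example.

```r
library(dplyr)

# Add two columns at once; Greater.Half is computed from the new Sepal.Ratio column
ratios <- mutate(
  iris,
  Sepal.Ratio = Sepal.Width / Sepal.Length,
  Greater.Half = Sepal.Ratio > 0.5
)
head(ratios, 3)
```

The original five columns are kept, so the result has seven columns in total.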
arrange()
It sorts rows by one or more variables, in ascending order by default or in descending order with desc().
For example:
#To arrange Sepal Width in ascending order
arranged <- arrange(col1, Sepal.Width)
head(arranged)
#To arrange Sepal Width in descending order
arranged <- arrange(col1, desc(Sepal.Width))
head(arranged)
Output (ascending order):
Row | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Greater.Half
1 | 5.0 | 2.0 | 3.5 | 1.0 | versicolor | FALSE
2 | 6.0 | 2.2 | 4.0 | 1.0 | versicolor | FALSE
3 | 6.2 | 2.2 | 4.5 | 1.5 | versicolor | FALSE
4 | 6.0 | 2.2 | 5.0 | 1.5 | virginica | FALSE
5 | 4.5 | 2.3 | 1.3 | 0.3 | setosa | TRUE
6 | 5.5 | 2.3 | 4.0 | 1.3 | versicolor | FALSE
Output (descending order):
Row | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Greater.Half
1 | 5.7 | 4.4 | 1.5 | 0.4 | setosa | TRUE
2 | 5.5 | 4.2 | 1.4 | 0.2 | setosa | TRUE
3 | 5.2 | 4.1 | 1.5 | 0.1 | setosa | TRUE
4 | 5.8 | 4.0 | 1.2 | 0.2 | setosa | TRUE
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | TRUE
6 | 5.4 | 3.9 | 1.3 | 0.4 | setosa | TRUE
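arrange() accepts more than one sort key; later keys break ties among earlier ones. A minimal sketch, with an arbitrary variable name:

```r
library(dplyr)

# Sort by Species first, then by Sepal.Width from largest to smallest within each species
by_species <- arrange(iris, Species, desc(Sepal.Width))
head(by_species, 3)
```

Because Species is a factor, its rows are ordered by factor level, so the setosa rows come first, starting with the widest sepal.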
summarise()
It computes summary statistics (such as the mean or median) for a dataset, reducing multiple values down to a single value.
For example:
summarised <- summarise(arranged, Mean.Width = mean(Sepal.Width))
head(summarised)
Output:
Row | Mean.Width
1 | 3.057333
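summarise() can compute several statistics in one call. The sketch below is one way to do that on the plain iris data; the output column names are arbitrary, and n() is the dplyr helper that counts rows.

```r
library(dplyr)

# Several summary statistics in one call; the result is a single row
stats <- summarise(
  iris,
  Mean.Width = mean(Sepal.Width),
  Median.Width = median(Sepal.Width),
  Max.Width = max(Sepal.Width),
  Count = n()
)
stats
```

The values agree with the Sepal.Width column of the summary(iris) output shown earlier in this tutorial.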
Grouping
It is done to group observations within a dataset by one or more variables. Most data operations are performed on groups defined by variables.
For example:
#To find mean sepal width by Species, we use grouping as follows
gp <- group_by(iris, Species)
mn <- summarise(gp, Mean.Sepal = mean(Sepal.Width))
head(mn)
Output:
Row | Species <fct> | Mean.Sepal <dbl>
1 | setosa | 3.43
2 | versicolor | 2.77
3 | virginica | 2.97
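Once the data is grouped, every statistic passed to summarise() is computed per group. The sketch below adds a row count and a second statistic to the example above; the output column names are arbitrary.

```r
library(dplyr)

gp <- group_by(iris, Species)

# One output row per species, with a count and two statistics
per_species <- summarise(
  gp,
  Count = n(),
  Mean.Sepal = mean(Sepal.Width),
  Max.Petal = max(Petal.Length)
)
per_species
```

The result has three rows, one per species, and Count is 50 in each because the species are equally represented.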
Pipe Operator
The pipe operator, written %>%, lets us chain multiple functions together: it passes the result on its left as the first argument of the function on its right. It can be used with functions like filter(), select(), arrange(), summarise(), group_by(), etc.
For example:
#To get rows with the following conditions
iris %>% filter(Species == "setosa", Sepal.Width > 3.8)
Output:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species
5.4 | 3.9 | 1.7 | 0.4 | setosa
5.8 | 4.0 | 1.2 | 0.2 | setosa
5.7 | 4.4 | 1.5 | 0.4 | setosa
5.4 | 3.9 | 1.3 | 0.4 | setosa
5.2 | 4.1 | 1.5 | 0.1 | setosa
5.5 | 4.2 | 1.4 | 0.2 | setosa
#To find mean Sepal Length by Species, we use pipe operator as follows
iris %>% group_by(Species) %>% summarise(Mean.Length = mean(Sepal.Length))
Output:
Species <fct> | Mean.Length <dbl>
setosa | 5.01
versicolor | 5.94
virginica | 6.59
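The pipe is most useful when several steps follow one another. The sketch below chains the functions from this tutorial into one pipeline. The Petal.Area column is a made-up quantity (length times width) used only to have something to summarise.

```r
library(dplyr)

result <- iris %>%
  # keep flowers with a sepal wider than 3 cm
  filter(Sepal.Width > 3) %>%
  # illustrative derived column, not a real botanical measure
  mutate(Petal.Area = Petal.Length * Petal.Width) %>%
  group_by(Species) %>%
  summarise(Mean.Area = mean(Petal.Area), Count = n()) %>%
  # largest mean area first
  arrange(desc(Mean.Area))
result
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied, without intermediate variables.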
In this tutorial, we covered what data manipulation in R is, how to manipulate data with functions from the dplyr package, grouping, and using the pipe operator to chain multiple functions together. In the next section, we are going to cover data visualization in R.