Data Manipulation
Data manipulation involves modifying data to make it easier to read and to be more organized. We manipulate data for analysis and visualization. It is also used with the term ‘data exploration’ which involves organizing data using available sets of variables.
At times, the data collection process done by machines involves a lot of errors and inaccuracies in reading. Data manipulation is also used to remove these inaccuracies and make data more accurate and precise.
For example:
We will use the default iris table in R, as follows:
#To load datasets package
library("datasets")
#To load iris dataset
data(iris)
summary(iris)
Output:
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
Min. :4.300 |
Min. :2.000 |
Min. :1.000 |
Min. :0.100 |
setosa: 50 |
1st Qu.:5.100 |
1st Qu.:2.800 |
1st Qu.:1.600 |
versicolor:0.300 |
versicolor:50 |
Median: 5.800 |
Median: 3.000 |
Median: 4.350 |
Median: 1.300 |
Virginica: 50 |
Mean: 5.843 |
Mean: 3.057 |
Mean: 3.758 |
Mean: 1.199 |
|
3rd Qu.:6.400 |
3rd Qu.:3.300 |
3rd Qu.:5.100 |
3rd Qu.:1.800 |
|
Max. :7.900 |
Max. :4.400 |
Max. :6.900 |
Max. :2.500 |
|
So after going through what data manipulation in R is, we are going to cover the following topics in this tutorial:
- Data Manipulation in R
- Data Manipulation in R With dplyr Package.
- Grouping
- Pipe Operator
Sample()
It is used to generate a sample of a specific size from a vector or a dataset, either with or without replacement.
The basic syntax of sample() function is as follows:
sample(data, size, replace = FALSE, prob = NULL)
For example:
#To return 5 random rows
index<-sample(1:nrow(iris), 5)
index
iris[index,]
Output:
Sl. No. |
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
137 |
6.3 |
3.4 |
5.6 |
2.4 |
Virginica |
85 |
5.4 |
3.0 |
4.5 |
1.5 |
Versicolor |
14 |
4.3 |
3.0 |
1.1 |
0.1 |
Setosa |
54 |
5.5 |
2.3 |
4.0 |
1.3 |
Versicolor |
4 |
4.6 |
3.1 |
1.5 |
0.2 |
Setosa |
Table()
It is used to create a frequency table to calculate the occurrences of unique values of a variable.
The table() function generates an object of the table class.
For example:
#To find the frequency distribution of Species in iris table
data(iris)
freq.table <- table(iris$Species)
head(freq.table)
Output:
setosa |
versicolor |
virginica |
50 |
50 |
50 |
Data Manipulation in R With dplyr Package
There are different ways to perform data manipulation in R, such as using Base R functions like subset(), with(), within(), etc., Packages like data.table, ggplot2, reshape2, readr, etc., and different Machine Learning algorithms.
However, in this tutorial, we are going to use the dplyr package to perform data manipulation in R.
The dplyr package consists of many functions specifically used for data manipulation. These functions process data faster than Base R functions and are known the best for data exploration and transformation, as well.
Following are some of the important functions included in the dplyr package
select() :- To select columns (variables)
filter() :-To filter (subset) rows.
mutate() :-To create new variables
summarise() :- To summarize (or aggregate) data
group_by() :- To group data
arrange() :- To sort data
join() :- To join data frames.
To install the dplyr package, run the following command:
install.packages("dplyr")
#To load dplyr package
library("dplyr")
#To load datasets package
library("datasets")
#To load iris dataset
data(iris)
summary(iris)
Output:
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
Min. :4.300 |
Min. :2.000 |
Min. :1.000 |
Min. :0.100 |
setosa: 50 |
1st Qu.:5.100 |
1st Qu.:2.800 |
1st Qu.:1.600 |
versicolor:0.300 |
versicolor:50 |
Median: 5.800 |
Median: 3.000 |
Median: 4.350 |
Median: 1.300 |
virginica: 50 |
Mean: 5.843 |
Mean: 3.057 |
Mean: 3.758 |
Mean: 1.199 |
|
3rd Qu.:6.400 |
3rd Qu.:3.300 |
3rd Qu.:5.100 |
3rd Qu.:1.800 |
|
Max. :7.900 |
Max. :4.400 |
Max. :6.900 |
Max. :2.500 |
|
It contains 150 samples of three plant species (setosa, virginica, and versicolor) and four features measured for each sample.
Select()
It is used to select data by its column name. We can select any number of columns in a number of ways.
For example:
#To select the following columns
selected <- select(iris, Sepal.Length, Sepal.Width, Petal.Length)
head(selected)
#To select all columns from Sepal.Length to Petal.Length
selected1 <- select(iris, Sepal.Length:Petal.Length)
#To print first four rows
head(selected1, 4)
#To select columns with numeric indexes
selected1 <- select(iris,c(3:5))
head(selected1)
Output:
Sl.No. |
Sepal.Length |
Sepal.Width |
Petal.Length |
1 |
5.1 |
3.5 |
1.4 |
2 |
4.9 |
3.0 |
1.4 |
3 |
4.7 |
3.2 |
1.3 |
4 |
4.6 |
3.1 |
1.5 |
5 |
5.0 |
3.6 |
1.4 |
6 |
5.4 |
3.9 |
1.7 |
Output:
Sl.No. |
Sepal.Length |
Sepal.Width |
Petal.Length |
1 |
5.1 |
3.5 |
1.4 |
2 |
4.9 |
3.0 |
1.4 |
3 |
4.7 |
3.2 |
1.3 |
4 |
4.6 |
3.1 |
1.5 |
Output:
Sl.No. |
Petal.Length |
Petal.Width |
Species |
1 |
1.4 |
0.2 |
Setosa |
2 |
1.4 |
0.2 |
Setosa |
3 |
1.3 |
0.2 |
Setosa |
4 |
1.5 |
0.2 |
Setosa |
5 |
1.4 |
0.2 |
Setosa |
6 |
1.7 |
0.4 |
Setosa |
#We use(-)to hide a particular column
selected <- select(iris, -Sepal.Length, -Sepal.Width)
head(selected)
Output:
Sl.No. |
Petal.Length |
Petal.Width |
Species |
1 |
1.4 |
0.2 |
Setosa |
2 |
1.4 |
0.2 |
Setosa |
3 |
1.3 |
0.2 |
Setosa |
4 |
1.5 |
0.2 |
Setosa |
5 |
1.4 |
0.2 |
Setosa |
6 |
1.7 |
0.4 |
Setosa |
Filter()
It is used to find rows with matching criteria. It also works like the select() function, i.e., we pass a data frame along with a condition separated by a comma.
For example:
#To select the first 3 rows with Species as setosa
filtered <- filter(iris, Species == "setosa" )
head(filtered,3)
Output:
Sl. No. |
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
1 |
5.1 |
3.5 |
1.4 |
0.2 |
Setosa |
2 |
4.9 |
3.0 |
1.4 |
0.2 |
Setosa |
3 |
4.7 |
3.2 |
1.3 |
0.2 |
Setosa |
#To select the last 5 rows with Species as versicolor and Sepal width more than 3
filtered1 <- filter(iris, Species == "versicolor", Sepal.Width > 3)
tail(filtered1)
Output:
Sl. No. |
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
4 |
6.3 |
3.3 |
4.7 |
1.6 |
Versicolor |
5 |
6.7 |
3.1 |
4.4 |
1.4 |
Versicolor |
6 |
5.9 |
3.2 |
4.8 |
1.8 |
Versicolor |
7 |
6.0 |
3.4 |
4.5 |
1.6 |
Versicolor |
8 |
6.7 |
3.1 |
4.7 |
1.5 |
Versicolor |
Mutate()
It creates new columns and preserves the existing columns in a dataset.
For example:
#To create a column “Greater.Half” which stores TRUE if given condition
is TRUE
col1 <- mutate(iris, Greater.Half = Sepal.Width > 0.5 * Sepal.Length)
tail(col1)
Output:
|
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
Greater.Half |
145 |
6.7 |
3.3 |
5.7 |
2.5 |
Virginica |
FALSE |
146 |
6.7 |
3.0 |
5.2 |
2.3 |
Virginica |
FALSE |
147 |
6.3 |
2.5 |
5.0 |
1.9 |
Virginica |
FALSE |
148 |
6.5 |
3.0 |
5.2 |
2.0 |
Virginica |
FALSE |
149 |
6.2 |
3.4 |
5.4 |
2.3 |
Virginica |
TRUE |
150 |
5.9 |
3.0 |
5.1 |
1.8 |
Virginica |
TRUE |
#To check how many flowers satisfy this condition
table(col1$Greater.Half)
Output:
FALSE=84 TRUE=66
Arrange()
It is used to sort rows by variables in both an ascending and descending order.
For example:
#To arrange Sepal Width in ascending order
arranged <- arrange(col1, Sepal.Width)
head(arranged)
#To arrange Sepal Width in descending order
arranged <- arrange(col1, desc(Sepal.Width))
head(arranged)
Output:
|
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
Greater.Half |
1 |
5.0 |
2.0 |
3.5 |
1.0 |
Versicolor |
FALSE |
2 |
6.0 |
2.2 |
4.0 |
1.0 |
Versicolor |
FALSE |
3 |
6.2 |
2.2 |
4.5 |
1.5 |
Versicolor |
FALSE |
4 |
6.0 |
2.2 |
5.0 |
1.5 |
Virginica |
FALSE |
5 |
4.5 |
2.3 |
1.3 |
0.3 |
Setosa |
TRUE |
6 |
5.5 |
2.3 |
4.0 |
1.3 |
Versicolor |
FALSE |
|
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
Greater.Half |
1 |
5.7 |
4.4 |
1.5 |
0.4 |
Setosa |
TRUE |
2 |
5.5 |
4.2 |
1.4 |
0.2 |
Setosa |
TRUE |
3 |
5.2 |
4.1 |
1.5 |
0.1 |
Setosa |
TRUE |
4 |
5.8 |
4.0 |
1.2 |
0.2 |
Setosa |
TRUE |
5 |
5.4 |
3.9 |
1.7 |
0.4 |
Setosa |
TRUE |
6 |
5.4 |
3.9 |
1.3 |
0.4 |
Setosa |
TRUE |
Summarise()
It is used to find insights(mean, median, mode, etc.) from a dataset. It reduces multiple values down to a single value.
For example:
summarised <- summarise(arranged, Mean.Width = mean(Sepal.Width))
head(summarised)
Output:
Mean.Width
1 3.057333
Grouping10
It is done to group observations within a dataset by one or more variables. Most data operations are performed on groups defined by variables.
For example:
#To find mean sepal width by Species, we use grouping as follows
gp <- group_by(iris,Species)
mn <- summarise(gp,Mean.Sepal = mean(Sepal.Width))
head(mn)
Output:
Sl. No. |
Species
<fct> |
Mean.Sepal
<dbl> |
1 |
setosa |
3.43 |
2 |
versicolor |
2.77 |
3 |
virginica |
2.97 |
Pipe Operator
Pipe operator lets us wrap multiple functions together. It is denoted as %>% . It can be used with functions like filter(), select(), arrange(), summarise(), group_by(), etc.
For example:
#To get rows with the following conditions
iris %>% filter(Species == "setosa",Sepal.Width > 3.8)
Output:
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
5.4 |
3.9 |
1.7 |
0.4 |
Setosa |
5.8 |
4.0 |
1.2 |
0.2 |
Setosa |
5.7 |
4.4 |
1.5 |
0.4 |
Setosa |
5.4 |
3.9 |
1.3 |
0.4 |
Setosa |
5.2 |
4.1 |
1.5 |
0.1 |
Setosa |
5.5 |
4.2 |
1.4 |
0.2 |
Setosa |
#To find mean Sepal Length by Species, we use pipe operator as follows
iris %>% group_by(Species) %>% summarise(Mean.Length = mean(Sepal.Length))
Output:
Species
<fct> |
Mean.Length
<dbl> |
setosa |
5.01 |
versicolor |
5.94 |
virginica |
6.59 |
In this tutorial we were talking about what data manipulation in R is, data manipulation in R using functions in the dplyr package, grouping, and using the pipe operator to tie multiple functions together. In the next section, we are going to cover data visualization in R.