Data Manipulation

Data manipulation involves modifying data to make it easier to read and better organized, typically in preparation for analysis and visualization. The term is often used alongside ‘data exploration’, which involves organizing data using the available variables.
Machine-collected data often contains errors and inaccurate readings. Data manipulation is also used to remove these inaccuracies and make the data more accurate.

For example:
We will use the built-in iris dataset in R, loaded as follows:

#To load datasets package
library("datasets")
#To load iris dataset
data(iris)
summary(iris)

Output:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Having seen what data manipulation in R is, we are going to cover the following topics in this tutorial:

  • Data Manipulation in R
  • Data Manipulation in R With dplyr Package
  • Grouping
  • Pipe Operator

sample()

It generates a random sample of a given size from a vector or a dataset, either with or without replacement.
The basic syntax of the sample() function is as follows:

sample(data, size, replace = FALSE, prob = NULL)

For example:

#To return 5 random rows
index<-sample(1:nrow(iris), 5)
index
iris[index,]

Output (the rows vary from run to run because the sample is random):

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
137          6.3         3.4          5.6         2.4  virginica
85           5.4         3.0          4.5         1.5 versicolor
14           4.3         3.0          1.1         0.1     setosa
54           5.5         2.3          4.0         1.3 versicolor
4            4.6         3.1          1.5         0.2     setosa
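
The replace and prob arguments from the syntax above are not exercised by the example, so here is a minimal sketch of how they might be used. The seed value and the variable names (idx_unique, idx_repeat, weighted) are illustrative assumptions, not part of the original tutorial.

```r
# Fix the random seed so the same values are drawn on every run
set.seed(42)

# Draw 5 row indices without replacement: no index can repeat
idx_unique <- sample(1:nrow(iris), 5)

# Draw 10 values with replacement: repeats are allowed
idx_repeat <- sample(1:3, 10, replace = TRUE)

# Weighted draw: prob supplies one weight per element of the input vector
weighted <- sample(c("a", "b"), 20, replace = TRUE, prob = c(0.9, 0.1))

# Use the unique indices to pull the matching rows
iris[idx_unique, ]
```

Calling set.seed() before sample() is a common way to make a random selection repeatable, which helps when results need to be compared across runs.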

table()

It creates a frequency table that counts the occurrences of each unique value of a variable.
The table() function returns an object of class table.
For example:

#To find the frequency distribution of Species in iris table
data(iris)
freq.table <- table(iris$Species)
head(freq.table)

Output:

    setosa versicolor  virginica
        50         50         50
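
As a small extension of the frequency table above, the sketch below converts counts to proportions with prop.table() and builds a two-way table. The threshold of 3 for a "wide" sepal and the names species_share, wide_sepal, and cross_tab are assumptions chosen for illustration.

```r
data(iris)

# Relative frequencies: each count divided by the total number of rows
species_share <- prop.table(table(iris$Species))
species_share

# Two-way table: Species against a logical flag for wide sepals
wide_sepal <- iris$Sepal.Width > 3
cross_tab <- table(iris$Species, wide_sepal)
cross_tab
```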

Data Manipulation in R With dplyr Package

There are different ways to perform data manipulation in R, such as Base R functions like subset(), with(), and within(), and packages like data.table and reshape2.
However, in this tutorial, we are going to use the dplyr package to perform data manipulation in R.
The dplyr package consists of many functions designed specifically for data manipulation. These functions are often faster than their Base R equivalents on large data frames and are widely used for data exploration and transformation.
Following are some of the important functions included in the dplyr package:
select() :- To select columns (variables)
filter() :- To filter (subset) rows
mutate() :- To create new variables
summarise() :- To summarize (or aggregate) data
group_by() :- To group data
arrange() :- To sort data
inner_join(), left_join(), right_join(), full_join() :- To join data frames
To install the dplyr package, run the following command:

install.packages("dplyr")

In this tutorial, we are going to use the iris dataset from the datasets package, which can be loaded as follows:

#To load dplyr package
library("dplyr")
#To load datasets package
library("datasets")
#To load iris dataset
data(iris)
summary(iris)

Output:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

It contains 150 samples of three plant species (setosa, virginica, and versicolor) and four features measured for each sample.
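
dplyr does not provide a single join() function; joining is done with a family of functions such as left_join() and inner_join(). The sketch below shows the difference between the two, assuming a small hypothetical lookup table (species_info) that is not part of the datasets package.

```r
library(dplyr)

# Hypothetical lookup table (an assumption for illustration only)
species_info <- data.frame(
  Species = c("setosa", "versicolor"),
  Label   = c("S", "V")
)

# Species is a factor in iris; convert it to character so both key columns share a type
iris_chr <- mutate(iris, Species = as.character(Species))

# left_join keeps every row of iris_chr; species missing from the lookup get NA in Label
joined_left <- left_join(iris_chr, species_info, by = "Species")

# inner_join keeps only the rows whose Species appears in both tables
joined_inner <- inner_join(iris_chr, species_info, by = "Species")

nrow(joined_left)
nrow(joined_inner)
```

Because virginica is absent from the lookup table, the left join keeps those rows with a missing Label, while the inner join drops them.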

select()

It is used to select data by its column name. We can select any number of columns in a number of ways.
For example:

#To select the following columns
selected <- select(iris, Sepal.Length, Sepal.Width, Petal.Length)
head(selected)
#To select all columns from Sepal.Length to Petal.Length
selected1 <- select(iris, Sepal.Length:Petal.Length)
#To print first four rows
head(selected1, 4)                           
#To select columns with numeric indexes
selected1 <- select(iris,c(3:5))
head(selected1)

Output:

  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7

Output:

  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5

Output:

  Petal.Length Petal.Width Species
1          1.4         0.2  setosa
2          1.4         0.2  setosa
3          1.3         0.2  setosa
4          1.5         0.2  setosa
5          1.4         0.2  setosa
6          1.7         0.4  setosa

 

#We use (-) to drop a particular column
selected <- select(iris, -Sepal.Length, -Sepal.Width)
head(selected)

Output:

  Petal.Length Petal.Width Species
1          1.4         0.2  setosa
2          1.4         0.2  setosa
3          1.3         0.2  setosa
4          1.5         0.2  setosa
5          1.4         0.2  setosa
6          1.7         0.4  setosa
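
Besides names, ranges, and indexes, columns can also be selected by pattern. The sketch below uses the starts_with(), ends_with(), and contains() helpers that dplyr makes available inside select(); the result variable names are illustrative assumptions.

```r
library(dplyr)

# starts_with(): keep columns whose names begin with "Petal"
petal_cols <- select(iris, starts_with("Petal"))

# ends_with(): keep columns whose names end with "Width"
width_cols <- select(iris, ends_with("Width"))

# contains(): keep columns whose names contain "Sepal", plus Species
sepal_species <- select(iris, contains("Sepal"), Species)

names(petal_cols)
names(width_cols)
names(sepal_species)
```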

filter()

It is used to find the rows that match given criteria. It is called like the select() function, i.e., we pass a data frame followed by one or more conditions separated by commas; a row is kept only if it satisfies all of them.
For example:

#To select the first 3 rows with Species as setosa
filtered <- filter(iris, Species == "setosa" )
head(filtered,3)

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

 

#To select the last 5 rows with Species as versicolor and Sepal width more than 3
filtered1 <- filter(iris, Species == "versicolor", Sepal.Width > 3)
tail(filtered1, 5)

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
4          6.3         3.3          4.7         1.6 versicolor
5          6.7         3.1          4.4         1.4 versicolor
6          5.9         3.2          4.8         1.8 versicolor
7          6.0         3.4          4.5         1.6 versicolor
8          6.7         3.1          4.7         1.5 versicolor
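
Comma-separated conditions are combined with AND. For other combinations, a sketch like the following may help: it uses the | operator for OR and %in% to match any value from a set. The thresholds and variable names here are assumptions chosen only for illustration.

```r
library(dplyr)

# OR condition: keep rows where either measurement is unusually large
extremes <- filter(iris, Sepal.Width > 4 | Petal.Length > 6.5)

# %in% keeps rows whose Species matches any value in the given set
two_species <- filter(iris, Species %in% c("setosa", "virginica"))

nrow(two_species)
```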

mutate()

It creates new columns and preserves the existing columns in a dataset.
For example:

#To create a column "Greater.Half" which stores TRUE if Sepal.Width is more than half of Sepal.Length
col1 <- mutate(iris, Greater.Half = Sepal.Width > 0.5 * Sepal.Length)
tail(col1)

Output:

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species Greater.Half
145          6.7         3.3          5.7         2.5 virginica        FALSE
146          6.7         3.0          5.2         2.3 virginica        FALSE
147          6.3         2.5          5.0         1.9 virginica        FALSE
148          6.5         3.0          5.2         2.0 virginica        FALSE
149          6.2         3.4          5.4         2.3 virginica         TRUE
150          5.9         3.0          5.1         1.8 virginica         TRUE

 

#To check how many flowers satisfy this condition
table(col1$Greater.Half)

Output:

FALSE  TRUE
   84    66
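
mutate() can also create several columns in one call, and a later column may refer to one created earlier in the same call. The following is a minimal sketch of that behaviour; Sepal.Ratio is a hypothetical column name introduced here for illustration.

```r
library(dplyr)

ratios <- mutate(
  iris,
  Sepal.Ratio  = Sepal.Width / Sepal.Length,  # new numeric column
  Greater.Half = Sepal.Ratio > 0.5            # uses the column created just above
)

head(ratios, 3)
```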

arrange()

It is used to sort rows by one or more variables, in either ascending or descending order.
For example:

#To arrange Sepal Width in ascending order
arranged <- arrange(col1, Sepal.Width)
head(arranged)
#To arrange Sepal Width in descending order
arranged <- arrange(col1, desc(Sepal.Width))
head(arranged)

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Greater.Half
1          5.0         2.0          3.5         1.0 versicolor        FALSE
2          6.0         2.2          4.0         1.0 versicolor        FALSE
3          6.2         2.2          4.5         1.5 versicolor        FALSE
4          6.0         2.2          5.0         1.5  virginica        FALSE
5          4.5         2.3          1.3         0.3     setosa         TRUE
6          5.5         2.3          4.0         1.3 versicolor        FALSE

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Greater.Half
1          5.7         4.4          1.5         0.4  setosa         TRUE
2          5.5         4.2          1.4         0.2  setosa         TRUE
3          5.2         4.1          1.5         0.1  setosa         TRUE
4          5.8         4.0          1.2         0.2  setosa         TRUE
5          5.4         3.9          1.7         0.4  setosa         TRUE
6          5.4         3.9          1.3         0.4  setosa         TRUE
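
Sorting by more than one variable works by listing the columns in order of priority. The sketch below, with an illustrative variable name by_species, sorts by Species first and then by Sepal.Width from largest to smallest within each species.

```r
library(dplyr)

# Species is sorted in factor-level order; ties are broken by descending Sepal.Width
by_species <- arrange(iris, Species, desc(Sepal.Width))

head(by_species, 3)
```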

summarise()

It is used to compute summary statistics (mean, median, etc.) from a dataset. It reduces multiple values down to a single value.
For example:

summarised <- summarise(arranged, Mean.Width = mean(Sepal.Width))
head(summarised)

Output:

Mean.Width
1   3.057333
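
Several statistics can be computed in a single summarise() call. The following sketch assumes the column names Median.Width, Max.Width, and Rows purely for illustration; n() is the dplyr helper that counts rows.

```r
library(dplyr)

stats <- summarise(
  iris,
  Mean.Width   = mean(Sepal.Width),    # average sepal width
  Median.Width = median(Sepal.Width),  # middle value
  Max.Width    = max(Sepal.Width),     # largest value
  Rows         = n()                   # number of rows summarised
)

stats
```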

Grouping

It is done to group observations within a dataset by one or more variables. Most data operations are performed on groups defined by variables.
For example:

#To find mean sepal width by Species, we use grouping as follows
gp <- group_by(iris,Species)
mn <- summarise(gp,Mean.Sepal = mean(Sepal.Width))
head(mn)

Output:

  Species    Mean.Sepal
  <fct>           <dbl>
1 setosa           3.43
2 versicolor       2.77
3 virginica        2.97
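
Grouping by more than one variable follows the same pattern. The sketch below groups by Species and by a hypothetical logical column Wide (an assumption for illustration, defined as Sepal.Width > 3), then counts the rows in each combination. The .groups = "drop" argument, available in recent dplyr versions, returns an ungrouped result.

```r
library(dplyr)

# Add a logical flag to group on alongside Species
wide_flagged <- mutate(iris, Wide = Sepal.Width > 3)

# One group per Species/Wide combination
gp2 <- group_by(wide_flagged, Species, Wide)

# Count the rows in each group and drop the grouping afterwards
counts <- summarise(gp2, Count = n(), .groups = "drop")

counts
```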

Pipe Operator

The pipe operator lets us chain multiple functions together by passing the result of one call as the first argument of the next. It is written as %>%. It can be used with functions like filter(), select(), arrange(), summarise(), group_by(), etc.
For example:

#To get rows with the following conditions
iris %>% filter(Species == "setosa",Sepal.Width > 3.8)

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.4         3.9          1.7         0.4  setosa
2          5.8         4.0          1.2         0.2  setosa
3          5.7         4.4          1.5         0.4  setosa
4          5.4         3.9          1.3         0.4  setosa
5          5.2         4.1          1.5         0.1  setosa
6          5.5         4.2          1.4         0.2  setosa

 

#To find mean Sepal Length by Species, we use pipe operator as follows
iris  %>% group_by(Species) %>% summarise(Mean.Length = mean(Sepal.Length))

Output:

  Species    Mean.Length
  <fct>            <dbl>
1 setosa            5.01
2 versicolor        5.94
3 virginica         6.59
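
The pipe becomes more useful as the number of steps grows. The sketch below chains five of the functions covered in this tutorial; the filter threshold and the names Sepal.Ratio, Mean.Ratio, and Count are assumptions made for illustration.

```r
library(dplyr)

result <- iris %>%
  filter(Petal.Length > 1.5) %>%                          # keep longer petals only
  mutate(Sepal.Ratio = Sepal.Width / Sepal.Length) %>%    # add a derived column
  group_by(Species) %>%                                   # one group per species
  summarise(Mean.Ratio = mean(Sepal.Ratio), Count = n()) %>%
  arrange(desc(Mean.Ratio))                               # largest mean ratio first

result
```

Written without the pipe, the same steps would have to be nested inside one another or stored in intermediate variables, which is usually harder to read.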

In this tutorial, we covered what data manipulation in R is, how to perform it using functions from the dplyr package, how to group data, and how to use the pipe operator to chain multiple functions together. In the next section, we are going to cover data visualization in R.
