R for Data Science

Data Science is changing the analytics industry. Its techniques and methodologies help in exploring data and identifying which parameters are worth visualizing. These visualizations are then created in the form of graphs, histograms, and various other plots.

Data Science with R
  • Updated on 28th May, 20

Now, let us have a glance at the topics and concepts covered in this blog on R for Data Science:

What is R?

R is a programming language designed for statistical computing and data analysis. It is used for statistical analysis, data visualization, and finding insights in data. R is also useful for building Machine Learning models, which makes it a good fit for Data Science projects.

There are various packages and libraries in R that help create visualizations for understanding the patterns in data. R for Data Science is applied in many fields, such as the IT industry, banking and finance, media and entertainment, and healthcare. Now, let us understand why we use R for Data Science.


Why do we use R for Data Science?

As the demand for Data Science grows, so does the need for Data Scientists in the analytics industry. One of the most widely used tools for Data Analytics is the R programming language. It offers more than 10,000 packages that help us perform statistical analysis, visualization, data manipulation, exploratory data analysis (EDA), and Machine Learning. R is also relatively easy to learn, which lets us work efficiently with a wide range of Data Science techniques.

Now, let us look at some of its features:

  • R helps in solving complex real-world problems through statistical analysis and modeling.
  • It provides the facility for the customization of libraries and packages. Developers can easily create libraries and packages in R as per their requirements.
  • R provides several tools for statistical analysis. For this reason, it is widely used in research and development.
  • R is well suited to data wrangling because it offers ready-made packages for cleaning and reshaping data.
  • The ggplot2 package in R helps in smartly visualizing data. It is one of the popular packages used in the Data Science industry for data visualization.

Beyond data manipulation, visualization, and wrangling, R helps in building Machine Learning models as well. There are numerous libraries and packages for building regression, classification, and clustering models. This wide range of applications makes it one of the most popular programming languages for Data Science today.

In the use case for this blog, we will work with the ‘US pollution’ dataset documented by the US EPA for the years 2000–2016. It consists of 28 fields and covers four main pollutants (Ozone, Carbon Monoxide, Nitrogen Dioxide, and Sulphur Dioxide), which we will visualize.


Problem Statement

Here is the US pollution dataset from which we have to understand the trends of the Air Quality Index for NO2, SO2, and CO. Also, we will implement Machine Learning algorithms such as linear regression and k-means clustering.

Let us first load the dataset using the read.csv method. For this, use the path where you have saved the US pollution dataset.

p_data <- read.csv("C:/Users/Intellipaat-Team/Documents/R for Data Science/pollution_dataset.csv")

Now, we will have a look at the data and will try to understand the variables.

View(p_data)

Now, we will have a look at the first six rows of the dataset using the head function.

head(p_data)

Next, in this R for Data Science blog, we will load the ‘ggplot2’ package that we will use later for data visualization. Then, we will load the ‘readr’ package to read CSV files. Also, we will use ‘TSA’ and ‘tseries’ for time series analysis.


install.packages("tseries")  # run once, before loading the package
library(ggplot2)
library(readr)
library(TSA)
library(tseries)

Now, extract the NO2, O3, SO2, and CO AQI (Air Quality Index) values for New York (Queens county):

New_York<- subset(p_data, City == "New York" & County == "Queens", select = c(City, Date.Local, NO2.AQI,O3.AQI,SO2.AQI,CO.AQI,County))

head(New_York)

tail(New_York)
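As a minimal sketch of how subset() combines a row condition with a column selection, here is the same pattern on a small made-up data frame. The toy values and the reduced column list are assumptions for illustration only; they are not taken from the US pollution dataset.

```r
# Made-up rows standing in for the pollution data
toy <- data.frame(
  City    = c("New York", "New York", "Boston"),
  County  = c("Queens", "Bronx", "Suffolk"),
  NO2.AQI = c(30, 28, 22),
  CO.AQI  = c(5, 6, 4)
)

# Keep rows matching both conditions, and only the listed columns
queens <- subset(toy, City == "New York" & County == "Queens",
                 select = c(City, County, NO2.AQI))
print(queens)
```

The first argument after the data frame filters rows, while select picks columns by their unquoted names.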

In this blog on Data Science with R, we are dealing with the US pollution dataset, which contains NA values. Let’s have a view of the New York data:

View(New_York)

In this data, we do not have complete data for all the years, in particular 2010 and 2016, so we will restrict the dataset to the years 2011 to 2015. Also, we will check for NA values in the dataset and eliminate them.

sum(is.na(New_York)) #if the result of the sum is 0, then there is no "NA" value.

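This number shows that there are 24,177 NA values in the data. Let us remove them with the help of the na.omit function.

New_York <- na.omit(New_York)  # the subset we checked above
p_data <- na.omit(p_data)      # the full dataset used in the later plots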
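The NA check and removal can be tried on a small made-up data frame first. This is only a sketch: the toy values are assumptions, and counting NA values per column with colSums() is an optional extra step that the original walkthrough does not use.

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  City    = c("A", "A", "B", "B"),
  NO2.AQI = c(30, NA, 25, 40),
  SO2.AQI = c(5, 7, NA, 3)
)

# Count missing values per column rather than in total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Drop every row that has a missing value in any column
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Note that na.omit() drops a whole row as soon as one field is missing, so it can remove more rows than the raw NA count suggests.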

In this blog on R for Data Science, we are working with dates in the data as well. Therefore, we have to make sure that the dates are of the class ‘Date’.

New_York$Date.Local<- as.Date(New_York$Date.Local)

Now, we will keep only the data from 2011 to 2015 and drop everything outside that range.

New_York<- with(New_York, New_York[(Date.Local>= "2011-01-01" &Date.Local<= "2015-12-31"),]) 

# ordering the date by Date.Local

New_York<- New_York[order(New_York$Date.Local),] 
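The date handling above can be sketched on a few made-up rows. The toy dates and values are assumptions; the example converts the bounds with as.Date() explicitly, which is one way to make the comparison unambiguous.

```r
# Made-up readings, deliberately out of order and partly out of range
toy <- data.frame(
  Date.Local = c("2016-02-01", "2011-06-15", "2010-12-31", "2015-12-31"),
  NO2.AQI    = c(20, 35, 28, 31),
  stringsAsFactors = FALSE
)

# Convert the text column to the Date class
toy$Date.Local <- as.Date(toy$Date.Local)

# Keep rows whose date falls inside the 2011-2015 window
in_range <- toy$Date.Local >= as.Date("2011-01-01") &
  toy$Date.Local <= as.Date("2015-12-31")
toy <- toy[in_range, ]

# Sort the remaining rows chronologically
toy <- toy[order(toy$Date.Local), ]
print(toy)
```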


As there are several observations having the same value, we will remove the repeated values and make them unique using the unique function.

head(New_York)

New_York<- unique(New_York)

Next, we will average the daily values by month to analyze the trend of the time series.

To do this, we build a month key for every date by formatting it as a character string and pasting a day of "01" onto it. Then, we calculate the mean for each month key.

We will use the following functions:

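  • as.POSIXlt: Converts a date or date-time to the ‘POSIXlt’ class, which stores it as a list of components (year, month, day, and so on)
  • POSIXct: The companion class that stores a date-time as the number of seconds since the start of 1970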

yyyymm <- paste(format(as.POSIXlt(New_York$Date.Local), format = "%Y-%m"), "01", sep = "-")
monthly_mean<- tapply(New_York$NO2.AQI, yyyymm, mean)
monthly_mean<- as.data.frame(monthly_mean)
str(monthly_mean)

#time series

mean.ts<- ts(data = monthly_mean, start = c(2011, 1), frequency = 12)

tsp(mean.ts)

#Plotting a graph for the Air Quality Index of NO2

plot(mean.ts, ylab = "NO2 AQI", main = "NO2 AQI Time Series")
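To see what the monthly averaging and the ts() call do, here is a minimal sketch on four made-up daily readings. The values are assumptions for illustration. The sketch formats the Date column directly, since format() also accepts Date objects.

```r
# Made-up daily readings spanning two months
toy <- data.frame(
  Date.Local = as.Date(c("2011-01-05", "2011-01-20", "2011-02-03", "2011-02-25")),
  NO2.AQI    = c(30, 40, 20, 30)
)

# Build a month key such as "2011-01-01" for every row
month_key <- paste(format(toy$Date.Local, "%Y-%m"), "01", sep = "-")

# Average the readings that share a month key
monthly_mean <- tapply(toy$NO2.AQI, month_key, mean)
print(monthly_mean)

# Turn the monthly means into a monthly time series starting in January 2011
toy_ts <- ts(as.numeric(monthly_mean), start = c(2011, 1), frequency = 12)
print(tsp(toy_ts))
```

tsp() reports the start time, end time, and frequency of the series, which is a quick way to confirm that the series covers the months you expect.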

# For O3.AQI

yyyymm <- paste(format(as.POSIXlt(New_York$Date.Local), format = "%Y-%m"), "01", sep = "-")
monthly_mean_O3 <- tapply(New_York$O3.AQI, yyyymm, mean)
monthly_mean_O3 <- as.data.frame(monthly_mean_O3)

#time series

mean.ts1 <- ts(data = monthly_mean_O3, start = c(2011, 1), frequency = 12)
tsp(mean.ts1)
plot(mean.ts1, ylab = "O3 AQI", main = "O3 AQI Time Series")


Now, moving further in this blog on R programming for Data Science, we will create visualizations for the Air Quality Index of NO2, SO2, and CO over the years.

library(scales)
library(ggplot2)
#Assumption: Month is the first day of each record's month, derived from Date.Local
p_data$Month <- as.Date(format(as.Date(p_data$Date.Local), "%Y-%m-01"))
#Plot for AQI of NO2 (the fun argument needs ggplot2 3.3.0 or later; older versions use fun.y)
ggplot(data = p_data, aes(Month, NO2.AQI)) + ggtitle("NO2 AQI over the years") + stat_summary(fun = mean, geom = "line") + scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 months") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
#Plot for AQI of SO2
ggplot(data = p_data, aes(Month, SO2.AQI)) + ggtitle("SO2 AQI over the years") + stat_summary(fun = mean, geom = "line") + scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 months") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#Plot for AQI of CO

ggplot(data = p_data, aes(Month, CO.AQI)) + ggtitle("CO AQI over the years") + stat_summary(fun = mean, geom = "line") + scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 months") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#AQI of NO2 by state

ggplot(data = p_data, aes(y = NO2.AQI, x = State)) + ggtitle("NO2 AQI by State") + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#AQI of SO2 by state
ggplot(data = p_data, aes(y = SO2.AQI, x = State)) + ggtitle("SO2 AQI by State") + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
#AQI of CO by state
ggplot(data = p_data, aes(y = CO.AQI, x = State, fill = State)) + ggtitle("CO AQI by State") + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Further, we will look into the distribution of the Air Quality Index by state using a bar plot, a scatter plot, a box plot, and a histogram.

#Bar-plot
ggplot(data = p_data, aes(x = NO2.AQI, fill = State)) + geom_bar(position = "dodge") + labs(x = "Air Quality Index of NO2", y = "Count", title = "Distribution of AQI by State")
#Scatter-plot
ggplot(data = p_data, aes(x = NO2.AQI, y = NO2.Mean, col = State)) + geom_point(alpha = 0.6, size = 2) + labs(x = "Air Quality Index of NO2", y = "Mean NO2", title = "Distribution of AQI by State")
#Box-plot

ggplot(data = p_data, aes(x = "Count", y = NO2.AQI, fill = State)) + geom_boxplot()

#Histogram
ggplot(data = p_data, aes(x = O3.AQI, col = State)) + geom_histogram(bins = 50) + labs(title = "Air Quality Index of O3") + theme_bw()
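Alongside the plots, a numeric summary per state can make the comparison easier to read. The following is a minimal sketch using base R's aggregate() on made-up values; the toy data frame and the by_state name are assumptions for illustration and are not part of the original walkthrough.

```r
# Made-up AQI readings for three states
toy <- data.frame(
  State   = c("New York", "New York", "Texas", "Texas", "Ohio"),
  NO2.AQI = c(30, 40, 20, 24, 18)
)

# Mean NO2 AQI per state
by_state <- aggregate(NO2.AQI ~ State, data = toy, FUN = mean)

# Highest average first
by_state <- by_state[order(-by_state$NO2.AQI), ]
print(by_state)
```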



Implementing Linear Regression

So far, we have visualized the US pollution data. Next, we will implement Machine Learning algorithms such as linear regression and k-means clustering.

First, we will be loading the caTools package for implementing linear regression algorithms.

library(caTools)

#We will use the set.seed() function so that the random split is reproducible
set.seed(111)

#Splitting the data into a 70:30 ratio using NO2.1st.Max.Value
sample.split(p_data$NO2.1st.Max.Value, SplitRatio = 0.7) -> split_tag

#Creating the train and test sets
subset(p_data, split_tag == TRUE) -> train
subset(p_data, split_tag == FALSE) -> test

#Building the linear regression model using the train dataset
l_model <- lm(NO2.AQI ~ NO2.1st.Max.Value, data = train)

#Setting scipen makes R print numbers in fixed notation rather than scientific notation
options(scipen = 999)

#Making predictions using the test dataset
pred_val <- predict(l_model, newdata = test)
head(pred_val)
#Binding the actual and predicted values
cbind(Actual = test$NO2.AQI, Predicted = pred_val) -> final_data
View(final_data)
#As the data is in the form of a matrix, we will convert it into a dataframe
final_data <- as.data.frame(final_data)
final_data
#Calculating the error and binding it to the actual and predicted values
final_data$Actual - final_data$Predicted -> error
View(final_data)
cbind(final_data, error) -> final_data
head(final_data)
#Calculating the root mean square error (RMSE). A lower RMSE means the predictions are closer to the actual values
sqrt(mean((final_data$error)^2))
plot(p_data$NO2.1st.Max.Value, p_data$NO2.AQI)
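The same split, fit, predict, and RMSE workflow can be sketched without the dataset. The example below uses simulated data and base R's sample() for the 70:30 split instead of caTools; the variable names and the simulated relationship are assumptions chosen only to illustrate the steps.

```r
set.seed(111)

# Simulated data: y depends roughly linearly on x, plus noise
n <- 200
toy <- data.frame(x = runif(n, 0, 100))
toy$y <- 2 + 0.9 * toy$x + rnorm(n, sd = 5)

# 70:30 split using base R sampling
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- lm(y ~ x, data = train)

# Predict on the held-out rows and measure the error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))

print(coef(fit))
print(rmse)
```

Because the RMSE is computed only on rows the model never saw, it gives a fairer picture of predictive quality than the error on the training set.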


Implementing K-means Clustering

Further in this R tutorial for Data Science, we will now implement the k-means clustering algorithm to understand the structure of the data. For this, we will try to cluster the data into different groups.

Now, let us start implementing the algorithm.

First, we will load the required packages for implementing the algorithm.

library(dplyr)

Next, we will make a group of NO2, SO2, and O3 using the select() function.

p_data %>% select("NO2.AQI","O3.AQI","SO2.AQI")->AQI_cluster
plot_clus_coord(AQI_cluster, p_data)

Let us view the selected columns.

View(AQI_cluster)

Now, we will create separate clusters for O3, NO2, and SO2. We will start by creating the clusters of O3.

kmeans(AQI_cluster$O3.AQI,3)->cluster_O3     
cbind(O3=AQI_cluster$O3.AQI,Cluster=cluster_O3$cluster)->cluster_group     
View(cluster_group)
#We will convert the matrix of clusters into a dataframe
as.data.frame(cluster_group)->cluster_group

Let us now separate all the clusters of O3 by using the filter function.

cluster_group %>% filter(Cluster==1)->cluster_group_1
cluster_group %>% filter(Cluster==2)->cluster_group_2
cluster_group %>% filter(Cluster==3)->cluster_group_3

View(cluster_group_1)
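As a compact alternative to one filter call per cluster, base R's split() returns all the groups at once. This is a minimal sketch on made-up values; the o3 vector, the nstart setting, and the use of split() are assumptions for illustration rather than the approach used in this walkthrough.

```r
set.seed(1)

# Made-up AQI values in three well-separated bands
o3 <- c(10, 12, 11, 50, 52, 51, 90, 92, 91)

# Cluster the values into three groups; nstart retries several random starts
km <- kmeans(o3, centers = 3, nstart = 25)

# One list element per cluster, holding the values assigned to it
groups <- split(o3, km$cluster)
print(lengths(groups))
print(sort(as.numeric(km$centers)))
```

Cluster numbers are arbitrary labels, so cluster 1 in one run may correspond to cluster 3 in another unless the seed is fixed.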

Similarly, we will create clusters for NO2 and further separate the clusters.

kmeans(AQI_cluster$NO2.AQI,3)->cluster_NO2  
cbind(NO2=AQI_cluster$NO2.AQI,Cluster=cluster_NO2$cluster)->cluster_group_NO2 
View(cluster_group_NO2)
as.data.frame(cluster_group_NO2)->cluster_group_NO2
cluster_group_NO2 %>% filter(Cluster==1)->cluster_group1_NO2
cluster_group_NO2 %>% filter(Cluster==2)->cluster_group2_NO2
cluster_group_NO2 %>% filter(Cluster==3)->cluster_group3_NO2
View(cluster_group1_NO2)
View(cluster_group2_NO2)
View(cluster_group3_NO2)

Finally, we will create clusters for SO2.

kmeans(AQI_cluster$SO2.AQI,3)->cluster_SO2 
View(cluster_SO2)
cbind(SO2=AQI_cluster$SO2.AQI,Cluster=cluster_SO2$cluster)->cluster_group_SO2 
View(cluster_group_SO2)
as.data.frame(cluster_group_SO2)->cluster_group_SO2
cluster_group_SO2 %>%
filter(Cluster==1)->cluster_group1_SO2
cluster_group_SO2 %>%
filter(Cluster==2)->cluster_group2_SO2
cluster_group_SO2 %>%
filter(Cluster==3)->cluster_group3_SO2
View(cluster_group1_SO2)
View(cluster_group2_SO2)
View(cluster_group3_SO2)

Let us plot the clusters of NO2, O3, and SO2 using the plot function.

plot(AQI_cluster[c("NO2.AQI","O3.AQI","SO2.AQI")],
col=AQI_cluster$NO2.AQI)

This plot shows a 3 x 3 grid of nine panels, in which NO2, O3, and SO2 are plotted against each other.

Now, let us have a look at a clustering plot based on a small made-up dataset and mark the cluster centers.

#Clustering plots
X <- data.frame(c1 = c(0,1,2,4,5,4,6,7), c2 = c(0,1,2,3,3,4,5,5))
km <- kmeans(X, centers = 2)
plot(X, col = km$cluster)
#Marking the two cluster centers with stars
points(km$centers, col = 1:2, pch = 8, cex = 1)

Finally, we will create the clustering plots using just three centers for NO2, O3, and SO2.

X <- data.frame(AQI_cluster)
km <- kmeans(X, centers = 3)
plot(X, col = km$cluster)

The above plot shows the Air Quality Index values for NO2, O3, and SO2 plotted against each other. Here, the bottom-right panel shows SO2 against NO2. Most of its points lie between 0 and 60, which is considered acceptable air quality. Similarly, most of the values for SO2 against O3 and for NO2 against O3 lie between 0 and 50, which is again considered acceptable air quality.
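To make the reading of AQI values more concrete, the values can be binned into named categories with cut(). This is a sketch that assumes the standard US EPA AQI bands (0 to 50 Good, 51 to 100 Moderate, and so on); those bands and the sample values are not part of the original walkthrough.

```r
# Made-up AQI readings
aqi <- c(12, 50, 51, 60, 100, 135)

# Bin each reading into a category; intervals are closed on the right,
# so a value of exactly 50 falls into "Good"
category <- cut(
  aqi,
  breaks = c(-Inf, 50, 100, 150, 200, 300, Inf),
  labels = c("Good", "Moderate", "Unhealthy for Sensitive Groups",
             "Unhealthy", "Very Unhealthy", "Hazardous")
)
print(table(category))
```

Applied to a column such as NO2.AQI, the same call would show how many readings fall into each band.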

In this blog on Data Science with R programming, we worked on understanding the data through visualizations. Then, we implemented the linear regression algorithm for the dataset. Finally, we implemented the k-means clustering algorithm to build clusters of the data for the comparative analysis of the Air Quality Index for NO2, SO2, and O3.
