R Programming Tutorial

Now, let us have a glance at the topics and concepts covered in this blog on R for Data Science:

What is R?
Why do we use R for Data Science?
Problem Statement for US Pollution Dataset
Data Visualization
Implementing Linear Regression
Implementing K-means Clustering
Frequently Asked Questions (FAQs)


What is R?

R is a programming language developed for statistical computing and Data Analytics. It is used for statistical analysis, visualization of data, and finding insights in data. R is also helpful for creating Machine Learning models, which makes it a good fit for Data Science projects.

There are various packages and libraries in R that help create visualizations for understanding the patterns in data. R for Data Science also plays a major role in many fields, such as the IT industry, banking and finance, media and entertainment, and healthcare. Now, let us understand why we use R for Data Science.

Why do we use R for Data Science?

As Data Science is in great demand, the need for Data Scientists has increased across the analytics industry, and R is one of the most widely used tools for Data Analytics. It offers more than 10,000 packages that help us perform statistical analysis, visualization, data manipulation, exploratory data analysis (EDA), and Machine Learning model building. Also, R is built around data analysis tasks, which lets us work efficiently with various Data Science techniques.

Now, let us look at some of its features:

R helps in solving complex real-world problems through statistical analysis and modeling.
It supports customization: developers can create their own libraries and packages in R as per their requirements.
R provides several tools for statistical analysis. For this reason, it is widely used in the field of research and development.
R is well suited to data wrangling, as it offers ready-made packages for cleaning and reshaping data.
The ggplot2 package in R helps in visualizing data. It is one of the popular packages used in the Data Science industry for data visualization.

Other than in data manipulation, visualization, and wrangling, R helps in building Machine Learning models as well. There are numerous libraries and packages for building models of regression, classification, clustering, etc. Its wide range of applications makes it one of the most popular programming languages for Data Science.

In the Data Science use case for this blog, we will work on the ‘US pollution’ dataset documented by the US EPA for the years 2000–2016. It consists of 28 fields, including measurements for four main pollutants (Ozone, Carbon Monoxide, Nitrogen Dioxide, and Sulphur Dioxide), which we will be visualizing.

Problem Statement for US Pollution Dataset

Here is the US pollution dataset from which we have to understand the trends of the Air Quality Index for NO₂, SO₂, and CO. Also, we will implement Machine Learning algorithms such as linear regression and k-means clustering.

Let us first load the dataset using the read.csv function. For this, use the path where you have saved the US pollution dataset.

p_data <- read.csv("C:/Users/Intellipaat-Team/Documents/R for Data //Science/pollution_dataset.csv")

Now, we will have a look at the data and will try to understand the variables.

View(p_data)

Now, we will have a look at the first six rows of the dataset using the head function.

head(p_data)
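Before going further, it can help to check the size and column types of the loaded data. The sketch below uses a small made-up data frame (toy_data and all of its values are assumptions for illustration, not rows from the pollution dataset) to show what head(), dim(), and str() report.

```r
# Toy stand-in for the pollution data; the values are invented for illustration.
toy_data <- data.frame(
  City       = c("New York", "New York", "Boston", "Boston",
                 "Chicago", "Chicago", "Denver"),
  Date.Local = c("2011-01-01", "2011-01-02", "2011-01-01", "2011-01-02",
                 "2011-01-01", "2011-01-02", "2011-01-01"),
  NO2.AQI    = c(35, 41, 28, NA, 33, 30, 25),
  stringsAsFactors = FALSE
)

first_rows <- head(toy_data)  # first six rows by default
dims       <- dim(toy_data)   # number of rows, then number of columns
str(toy_data)                 # column names, types, and a preview of the values
```

Running the same three calls on p_data gives a quick picture of the real dataset without opening the full viewer.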

Data Visualization

Next, in this R for Data Science blog, we will load the ‘ggplot2’ package that we will use later for data visualization. Then, we will load the ‘readr’ package to read the CSV files. Also, we will use ‘TSA’ and ‘tseries’ for time series analysis.

# Install tseries once if it is not already available, then load the packages
install.packages("tseries")
library(ggplot2)
library(readr)
library(TSA)
library(tseries)

Now, extract the NO2.AQI (Air Quality Index), O3.AQI, SO2.AQI, and CO.AQI columns for Queens County in New York.

New_York <- subset(p_data, City == "New York" & County == "Queens", select = c(City, Date.Local, NO2.AQI, O3.AQI, SO2.AQI, CO.AQI, County))

head(New_York)

tail(New_York)
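To see how subset() combines a row condition with a column selection, here is a minimal sketch on made-up data. The toy_data frame and its values are assumptions for illustration only.

```r
# Toy data; the values are invented for illustration.
toy_data <- data.frame(
  City    = c("New York", "New York", "New York", "Boston"),
  County  = c("Queens", "Bronx", "Queens", "Suffolk"),
  NO2.AQI = c(35, 41, 28, 30),
  O3.AQI  = c(20, 22, 19, 25),
  stringsAsFactors = FALSE
)

# Rows must satisfy both conditions; select keeps only the named columns.
queens <- subset(toy_data,
                 City == "New York" & County == "Queens",
                 select = c(City, County, NO2.AQI))
print(queens)
```

Only the two Queens rows remain, and the O3.AQI column is dropped because it is not listed in select.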

In this blog on Data Science with R, we are dealing with the US pollution dataset that contains NA values. Let’s have a view of the dataset of New York:

View(New_York)

In this data, we do not have complete data for every year, so we will keep only the years 2011 to 2015 and recreate the dataset from them. Also, we will check for NA values in the dataset and eliminate them.

sum(is.na(New_York)) # if the result of the sum is 0, then there is no "NA" value

This number shows that there are 24,177 NA values in the data. Let us remove them with the help of the na.omit function.

# na.omit drops every row that contains at least one NA
New_York <- na.omit(New_York)
p_data <- na.omit(p_data)
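The NA check and removal can be tried on a small example first. The sketch below uses a made-up data frame (toy_data and its values are assumptions) and also counts the NA values per column, which is a small addition to the steps described above.

```r
# Toy data with one NA in each column; the values are invented for illustration.
toy_data <- data.frame(
  NO2.AQI = c(35, NA, 28, 30),
  SO2.AQI = c(5, 7, NA, 4),
  CO.AQI  = c(3, 2, 6, NA)
)

total_na  <- sum(is.na(toy_data))      # total number of NA cells
na_by_col <- colSums(is.na(toy_data))  # NA count for each column
clean     <- na.omit(toy_data)         # keeps only rows with no NA at all

print(total_na)
print(na_by_col)
print(clean)
```

Note that na.omit removes a whole row as soon as one of its columns is NA, so three NA cells here cost three of the four rows.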

In this blog on R for Data Science, we are working with dates in the data as well. Therefore, we have to make sure that the dates are of the class ‘Date’.

New_York$Date.Local <- as.Date(New_York$Date.Local)

Now, we will keep only the data for the years 2011 to 2015.

New_York <- with(New_York, New_York[Date.Local >= as.Date("2011-01-01") & Date.Local <= as.Date("2015-12-31"), ])

# ordering the data by Date.Local

New_York <- New_York[order(New_York$Date.Local), ]
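The date conversion, range filter, and ordering can be sketched on a few made-up rows. The toy_data frame and its values are assumptions for illustration.

```r
# Toy data; the dates and values are invented for illustration.
toy_data <- data.frame(
  Date.Local = c("2016-02-01", "2011-03-15", "2010-12-31",
                 "2015-12-31", "2013-07-04"),
  NO2.AQI    = c(30, 35, 40, 28, 33),
  stringsAsFactors = FALSE
)

# Strings in the "YYYY-MM-DD" form are parsed by as.Date without a format argument.
toy_data$Date.Local <- as.Date(toy_data$Date.Local)

# Keep rows from 2011-01-01 to 2015-12-31 (both ends included).
in_range <- toy_data$Date.Local >= as.Date("2011-01-01") &
            toy_data$Date.Local <= as.Date("2015-12-31")
toy_data <- toy_data[in_range, ]

# Sort the remaining rows by date.
toy_data <- toy_data[order(toy_data$Date.Local), ]
print(toy_data)
```

The rows dated 2010-12-31 and 2016-02-01 fall outside the range and are dropped, and the remaining three rows come out in date order.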

As there are several observations having the same values, we will remove the repeated rows using the unique function, so that each row appears only once.

head(New_York)

New_York <- unique(New_York)
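It is worth knowing that unique() on a data frame removes a row only when it matches an earlier row in every column. A minimal sketch on made-up data (toy_data is an assumption) shows this, using duplicated() to count the repeats first.

```r
# Toy data; the values are invented for illustration.
toy_data <- data.frame(
  Date.Local = as.Date(c("2011-01-01", "2011-01-01", "2011-01-02", "2011-01-01")),
  NO2.AQI    = c(35, 35, 41, 36)
)

n_dup   <- sum(duplicated(toy_data))  # rows identical to an earlier row
deduped <- unique(toy_data)           # first occurrence of each distinct row

print(n_dup)
print(deduped)
```

The second row is an exact repeat and is dropped. The fourth row shares a date with the first but has a different NO2.AQI, so it is kept.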

Next, by averaging the values by month, we will try to analyze the trend of the time series.

Then, for each year, we will calculate the monthly averages. To do that, we will format each date as a year-month string and paste a day of ‘01’ onto it, so that all observations from the same month share one key.

We will use the following functions:

as.POSIXlt: Converts a date to the ‘POSIXlt’ class, which stores the date and time as separate components (year, month, day, and so on)
as.POSIXct: Converts a date to the ‘POSIXct’ class, which stores the date and time as the number of seconds since 1970-01-01

# %Y gives the four-digit year, so the keys read as 2011-01-01, 2011-02-01, ...
yyyymm <- paste(format(as.POSIXlt(New_York$Date.Local), format = "%Y-%m"), "01", sep = "-")
monthly_mean <- tapply(New_York$NO2.AQI, yyyymm, mean)
monthly_mean <- as.data.frame(monthly_mean)
str(monthly_mean)
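The monthly averaging step can be checked by hand on a few made-up values. In the sketch below, dates and aqi are assumptions for illustration, and format() is applied to the Date vector directly.

```r
# Toy daily readings; the values are invented for illustration.
dates <- as.Date(c("2011-01-05", "2011-01-20", "2011-02-10",
                   "2011-02-25", "2011-03-01"))
aqi   <- c(30, 40, 20, 30, 50)

# Build one key per month, e.g. "2011-01-01" for every day in January 2011.
month_key <- paste(format(dates, "%Y-%m"), "01", sep = "-")

# tapply groups the readings by key and applies mean to each group.
monthly_mean <- tapply(aqi, month_key, mean)
print(monthly_mean)
```

January averages to 35, February to 25, and March to 50, with the results ordered by the month key.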

#time series

mean.ts <- ts(data = monthly_mean, start = c(2011, 1), frequency = 12)

tsp(mean.ts)

#Plotting a graph for the Air Quality Index of NO2

plot(mean.ts, ylab = "NO2 AQI", main = "NO2 AQI Time Series")

# For O3.AQI

yyyymm <- paste(format(as.POSIXlt(New_York$Date.Local), format = "%Y-%m"), "01", sep = "-")
monthly_mean_O3 <- tapply(New_York$O3.AQI, yyyymm, mean)
monthly_mean_O3 <- as.data.frame(monthly_mean_O3)

#time series

# start = c(2011, 1) covers every month from January 2011; end = 2015 would stop at January 2015
mean.ts1 <- ts(data = monthly_mean_O3, start = c(2011, 1), frequency = 12)
tsp(mean.ts1)
plot(mean.ts1, ylab = "O3 AQI", main = "O3 AQI Time Series")
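The start, end, and frequency arguments of ts() decide which months the series covers, so it is worth seeing their effect on a small example. The sketch below uses 60 made-up monthly values (monthly_values is an assumption standing in for five years of monthly means).

```r
# Sixty placeholder values standing in for 5 years of monthly means.
monthly_values <- seq_len(60)

# Start in January 2011 and let the length of the data decide the end.
full_series <- ts(monthly_values, start = c(2011, 1), frequency = 12)

# With end = 2015 the series stops at January 2015, so later values are dropped.
short_series <- ts(monthly_values, start = 2011, end = 2015, frequency = 12)

print(tsp(full_series))      # start time, end time, frequency
print(end(full_series))      # year and month of the last observation
print(length(short_series))  # number of observations that were kept
```

The full series runs to December 2015, while the series built with end = 2015 keeps only 49 of the 60 values.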

Now, we will create visualizations for the Air Quality Index of NO2, SO2, and CO over the years.

library(scales)
library(ggplot2)

# The plots below need a Month column of class Date. The original code does not show
# how it was created; as an assumption, we derive it here as the first day of each month.
p_data$Month <- as.Date(format(as.Date(p_data$Date.Local), "%Y-%m-01"))

# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
ggplot(data = p_data, aes(Month, NO2.AQI)) +
  ggtitle("NO2 AQI over the years") +
  stat_summary(fun = mean, geom = "line") +
  scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#Plot for AQI of SO2

ggplot(data = p_data, aes(Month, SO2.AQI)) +
  ggtitle("SO2 AQI over the years") +
  stat_summary(fun = mean, geom = "line") +
  scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#Plot for AQI of CO

ggplot(data = p_data, aes(Month, CO.AQI)) +
  ggtitle("CO AQI over the years") +
  stat_summary(fun = mean, geom = "line") +
  scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
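In the plots above, stat_summary with the mean function collapses all observations that share the same Month into a single average before drawing the line. The sketch below computes the same kind of summary in base R on made-up data (toy_data and its values are assumptions), which can be handy for checking the numbers behind a plot.

```r
# Toy daily readings; the values are invented for illustration.
toy_data <- data.frame(
  Date.Local = as.Date(c("2011-01-05", "2011-01-20", "2011-02-10", "2011-02-25")),
  NO2.AQI    = c(30, 40, 20, 30)
)

# Map every date to the first day of its month.
toy_data$Month <- as.Date(format(toy_data$Date.Local, "%Y-%m-01"))

# One mean per month: the values a mean-summary line would pass through.
monthly <- aggregate(NO2.AQI ~ Month, data = toy_data, FUN = mean)
print(monthly)
```

Each month ends up with one row, here 35 for January and 25 for February.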

#AQI of NO2 by state

ggplot(data = p_data, aes(y = NO2.AQI, x = State, colour = Month)) +
  ggtitle("NO2 AQI by State") +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#AQI of SO2 by state

# The + must end each line; a line that starts with + is a syntax error in R
ggplot(data = p_data, aes(y = SO2.AQI, x = State, colour = Month)) +
  ggtitle("SO2 AQI by State") +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#AQI of CO by state

ggplot(data = p_data, aes(y = CO.AQI, x = State, fill = State, colour = Month)) +
  ggtitle("CO AQI by State") +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))


Further, we will look into the distribution of the Air Quality Index by state using a bar plot, a scatter plot, a box plot, and a histogram.

#Bar-plot

ggplot(data = p_data, aes(x = NO2.AQI, fill = State)) +
  geom_bar(position = "dodge") +
  labs(x = "Air Quality Index of NO2", y = "Count", title = "Distribution of AQI by State")

#Scatter-plot

# The y-axis shows NO2.Mean, so it is labelled accordingly
ggplot(data = p_data, aes(x = NO2.AQI, y = NO2.Mean, col = State)) +
  geom_point(alpha = 0.6, size = 2) +
  labs(x = "Air Quality Index of NO2", y = "Mean of NO2", title = "Distribution of AQI by State")

#Box-plot

ggplot(data = p_data, aes(x = "Count", y = NO2.AQI, fill = State)) + geom_boxplot()

#Histogram

ggplot(data = p_data, aes(x = O3.AQI, col = State)) +
  geom_histogram(bins = 50) +
  labs(title = "Air Quality Index of O3") +
  theme_bw()

Implementing Linear Regression

Till now, we have visualized the data of US pollution levels. Next, we will be implementing Machine Learning algorithms such as linear regression and k-means clustering.

First, we will load the caTools package, which provides the sample.split function used to split the data into training and test sets.

library(caTools)

#We will use the set.seed() function so that the same random split is generated on every run
set.seed(111)

#Splitting the data into a 70:30 ratio using NO2.1st.Max.Value
sample.split(p_data$NO2.1st.Max.Value, SplitRatio = 0.7) -> split_tag

#Creating the train and test sets
subset(p_data, split_tag == TRUE) -> train
subset(p_data, split_tag == FALSE) -> test
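To make the idea of a 70:30 split concrete, here is a minimal sketch in base R on made-up data. It uses sample() for a simple random split, which is an assumption for illustration and not how sample.split from caTools works internally; toy_data is also invented.

```r
set.seed(111)  # same random split on every run

# Toy data; the values are invented for illustration.
toy_data <- data.frame(x = 1:10, y = 2 * (1:10) + 1)

# Pick 70% of the row numbers at random for training.
train_idx <- sample(nrow(toy_data), size = floor(0.7 * nrow(toy_data)))

train <- toy_data[train_idx, ]   # rows that were picked
test  <- toy_data[-train_idx, ]  # all remaining rows

print(nrow(train))
print(nrow(test))
```

Every row lands in exactly one of the two sets, so the model never sees the test rows while it is being fitted.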

#Building the linear regression model using the train dataset
l_model <- lm(NO2.AQI ~ NO2.1st.Max.Value, data = train)

#Setting scipen to a high value makes R print numbers in fixed notation instead of scientific notation
options(scipen = 999)

#Making predictions using the test dataset
pred_val <- predict(l_model, newdata = test)
head(pred_val)

#Binding the actual and predicted values
cbind(Actual = test$NO2.AQI, Predicted = pred_val) -> final_data
View(final_data)

#As the data is in the form of a matrix, we will convert it into a dataframe
final_data <- as.data.frame(final_data)
final_data

#Calculating the error and binding it to the actual and predicted values
final_data$Actual - final_data$Predicted -> error
View(final_data)
cbind(final_data, error) -> final_data
head(final_data)

#Calculating the root mean square error (RMSE). A lower RMSE means the
#predictions are closer to the actual values

sqrt(mean((final_data$error)^2))

plot(p_data$NO2.1st.Max.Value, p_data$NO2.AQI)
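The whole fit, predict, and RMSE sequence can be tried end to end on synthetic data where the true relationship is known. In the sketch below, toy_data, the slope of 2, and the noise level are all assumptions chosen for illustration.

```r
set.seed(111)

# Synthetic data: y depends on x with slope 2, plus random noise.
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 1)
toy_data <- data.frame(x = x, y = y)

# Simple random 70:30 split.
train_idx <- sample(nrow(toy_data), size = 35)
train <- toy_data[train_idx, ]
test  <- toy_data[-train_idx, ]

# Fit on the training rows, predict on the held-out rows.
model <- lm(y ~ x, data = train)
pred  <- predict(model, newdata = test)

# RMSE: square the errors, average them, then take the square root.
rmse <- sqrt(mean((test$y - pred)^2))

print(coef(model))
print(rmse)
```

Because the data was generated with a slope of 2, the fitted coefficient for x should come out close to 2, and the RMSE should be of the same order as the noise that was added.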

Implementing K-means Clustering

Further in this R tutorial for Data Science, we will now implement the k-means clustering algorithm to understand the structure of the data. For this, we will try to cluster the different groups of the data.

Now, let us start implementing the algorithm.

First, we will load the required packages for implementing the algorithm.

library(dplyr)

Next, we will make a group of NO₂, SO₂, and O₃ using the select() function.

p_data %>% select("NO2.AQI", "O3.AQI", "SO2.AQI") -> AQI_cluster
# plot_clus_coord() is not part of dplyr; define it or load the package that provides it before running this line
# plot_clus_coord(AQI_cluster, p_data)

We will have a view of the cluster that is created.

View(AQI_cluster)

Now, we will create separate clusters for O₃, NO₂, and SO₂. We will start by creating the clusters of O₃.

kmeans(AQI_cluster$O3.AQI, 3) -> cluster_O3
cbind(O3 = AQI_cluster$O3.AQI, Cluster = cluster_O3$cluster) -> cluster_group
View(cluster_group)

#We will convert the matrix of clusters into a dataframe
as.data.frame(cluster_group) -> cluster_group

Let us now separate all the clusters of O₃ by using the filter function.

cluster_group %>% filter(Cluster == 1) -> cluster_group_1
cluster_group %>% filter(Cluster == 2) -> cluster_group_2
cluster_group %>% filter(Cluster == 3) -> cluster_group_3

View(cluster_group_1)
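The steps above (cluster one column, attach the cluster label, then separate the groups) can be sketched on a handful of made-up values. Here o3 is invented data, split() from base R stands in for the three filter calls, and nstart = 25 is an addition that reruns k-means from several random starts.

```r
set.seed(111)

# Toy O3 readings forming three obvious groups; invented for illustration.
o3 <- c(10, 12, 11, 40, 42, 41, 80, 82, 81)

# Cluster the single column into three groups.
fit <- kmeans(o3, centers = 3, nstart = 25)

# Attach each reading's cluster label.
cluster_group <- data.frame(O3 = o3, Cluster = fit$cluster)

# Separate the readings by cluster label (one list element per cluster).
groups <- split(cluster_group$O3, cluster_group$Cluster)

print(sort(as.numeric(fit$centers)))
print(groups)
```

The cluster numbers 1, 2, and 3 are arbitrary labels, so which group receives which number can change from run to run unless the seed is fixed.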

Similarly, we will create clusters for NO₂ and further separate the clusters.

kmeans(AQI_cluster$NO2.AQI, 3) -> cluster_NO2
cbind(NO2 = AQI_cluster$NO2.AQI, Cluster = cluster_NO2$cluster) -> cluster_group_NO2

View(cluster_group_NO2)

as.data.frame(cluster_group_NO2) -> cluster_group_NO2
cluster_group_NO2 %>% filter(Cluster == 1) -> cluster_group1_NO2
cluster_group_NO2 %>% filter(Cluster == 2) -> cluster_group2_NO2
cluster_group_NO2 %>% filter(Cluster == 3) -> cluster_group3_NO2

View(cluster_group1_NO2)
View(cluster_group2_NO2)
View(cluster_group3_NO2)

Finally, we will create clusters for SO₂.

kmeans(AQI_cluster$SO2.AQI, 3) -> cluster_SO2

View(cluster_SO2)

cbind(SO2 = AQI_cluster$SO2.AQI, Cluster = cluster_SO2$cluster) -> cluster_group_SO2

View(cluster_group_SO2)

as.data.frame(cluster_group_SO2) -> cluster_group_SO2

cluster_group_SO2 %>% filter(Cluster == 1) -> cluster_group1_SO2

cluster_group_SO2 %>% filter(Cluster == 2) -> cluster_group2_SO2

cluster_group_SO2 %>% filter(Cluster == 3) -> cluster_group3_SO2

View(cluster_group1_SO2)
View(cluster_group2_SO2)
View(cluster_group3_SO2)

Let us plot NO₂, O₃, and SO₂ against each other using the plot function, colouring each point by its NO₂ cluster.

plot(AQI_cluster[c("NO2.AQI", "O3.AQI", "SO2.AQI")],
col = cluster_NO2$cluster)

This scatterplot matrix shows every pair of pollutants, with the three colours marking the three NO₂ clusters. The same plot can be repeated with the O₃ and SO₂ clusters, which gives nine clusters in total (three each for NO₂, SO₂, and O₃).

Now, let us have a look at the clustering plots based on some data and allocate the centers.

#Clustering plots

X <- data.frame(c1 = c(0, 1, 2, 4, 5, 4, 6, 7), c2 = c(0, 1, 2, 3, 3, 4, 5, 5))
km <- kmeans(X, centers = 2)
plot(X, col = km$cluster)

points(km$centers, col = 1:2, pch = 8, cex = 1)

Finally, we will create the clustering plots using just three centers for NO₂, O₃, and SO₂.

X <- data.frame(AQI_cluster)
km <- kmeans(X, centers = 3)
plot(X, col = km$cluster)

The above plot shows the values of the Air Quality Index for NO₂, O₃, and SO₂. Here, the bottom right plot shows the visualization for SO₂ and NO₂. In this, most of the points lie between 0 and 60, which is considered acceptable air quality. Similarly, most of the values of the Air Quality Index for the SO₂ and O₃ pair and for the NO₂ and O₃ pair lie between 0 and 50, which is again considered acceptable air quality.
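Besides the plot, the fitted k-means object itself can be inspected to see where the cluster centers lie and how many observations fall into each cluster. The sketch below does this on made-up data (toy_aqi and its values are assumptions, and nstart = 25 is an addition that tries several random starts).

```r
set.seed(111)

# Toy AQI values forming three well-separated groups; invented for illustration.
toy_aqi <- data.frame(
  NO2.AQI = c(10, 12, 11, 50, 52, 51, 90, 92, 91),
  O3.AQI  = c(20, 21, 19, 40, 41, 39, 60, 61, 59),
  SO2.AQI = c(2, 3, 1, 12, 13, 11, 22, 23, 21)
)

km <- kmeans(toy_aqi, centers = 3, nstart = 25)

print(km$centers)        # one row per cluster, one column per pollutant
print(km$size)           # number of observations in each cluster
print(table(km$cluster)) # the same counts, taken from the cluster labels
```

Each row of km$centers is the average NO2.AQI, O3.AQI, and SO2.AQI of one cluster, which makes it easier to describe a cluster in words than reading it off the scatterplot matrix.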

In this blog on Data Science with R programming, we worked on understanding the data through visualizations. Then, we implemented the linear regression algorithm for the dataset. Finally, we implemented the k-means clustering algorithm to build clusters of the data for the comparative analysis of the Air Quality Index for NO₂, SO₂, and O₃. This is all about this use case of Data Science with R.


Frequently Asked Questions (FAQs)

What is meant by R in data science?

R is a programming language and environment used for statistical computing, data analysis, and graphical representation in data science.

What is R and why is it used?

R is a statistical computing language used for data analysis, visualization, and statistical modeling, aiding in informed decision-making and predictions.

Which is better, R or Python?

Both are valuable; Python is more versatile, while R is great for statistical analysis and visualization. The choice depends on the project needs.

What does R stand for in R data?

The name ‘R’ derives from the initials of the two authors (Ross Ihaka and Robert Gentleman) and is a play on the language S, which it succeeded.

Why is R used for data?

R is used for its statistical analysis capabilities, data visualization, and a comprehensive set of packages for specialized analysis.

Which is easier, R or Python?

Python is often seen as easier due to its readability and broad application. R has a steeper learning curve but excels in statistics and visualization.

What is R in simple terms?

R is a programming language used for statistical analysis, data visualization, and data modeling.

What is the basic concept of R?

R is centered around statistical computing, providing tools for data analysis, modeling, and visualization to derive insights from data.

What is the use of R in Python?

R and Python can be used together for data analysis, with libraries like rpy2 allowing integration of R functionalities within Python scripts.

What is the future of R programming?

R continues to be valuable in academia and industries requiring advanced statistical analysis, though Python’s versatility may overshadow it in broader data science applications.