- Updated on 28th May, 20

Now, let us have a glance at the topics and concepts covered in this blog on Data Science with R:

- What is R?
- Why do we use R for Data Science?
- Problem Statement for US Pollution Dataset
- Data Visualization
- Implementing Linear Regression
- Implementing K-means Clustering

**What is R?**

R is a programming language developed for statistical computing and Data Analytics. It is used for statistical analysis, data visualization, and finding insights in data. R is also useful for building Machine Learning models, which makes it a good fit for Data Science projects.

There are various packages and libraries in R that help create visualizations for understanding the patterns in data. Data Science with R also plays a major role in many fields, such as the IT industry, banking and finance, media and entertainment, and healthcare. Now, let us understand why we use R for Data Science.

**Why do we use R for Data Science?**

Nowadays, as Data Science is in great demand, the need for Data Scientists has increased as well. One of the most widely used tools for Data Analytics is the R programming language. More than 10,000 packages are available for it, covering statistical analysis, visualization, data manipulation, exploratory data analysis (EDA), and building Machine Learning models. R is also relatively easy to learn, which lets us work efficiently with various Data Science techniques.

**Now, let us look at some of its features:**

- R helps in solving complex real-world problems through statistical analysis and modeling.
- It provides the facility for the customization of libraries and packages. Developers can easily create libraries and packages in R as per their requirements.
- R provides several tools for statistical analysis. For this reason, it is widely used in the field of research and development.
- R is well suited to data wrangling, with packages such as dplyr designed for manipulating and cleaning data.
- The ggplot2 package in R helps in smartly visualizing data. It is one of the popular packages used in the Data Science industry for data visualization.

Besides data manipulation, visualization, and wrangling, R helps in building Machine Learning models as well. There are numerous libraries and packages for building regression, classification, and clustering models. This wide range of applications makes it one of the most popular programming languages for Data Science.

In the Data Science use case we have used in this blog, we will work on the ‘US pollution’ dataset documented by US EPA for the years 2000–2016. It consists of 28 fields, along with four main pollutants (Ozone, Carbon Monoxide, Nitrogen Dioxide, and Sulphur Dioxide), for which we will be visualizing the dataset.

**Problem Statement**

Here is the US pollution dataset, from which we have to understand the trends of the Air Quality Index for NO_{2}, SO_{2}, and CO. Also, we will implement Machine Learning algorithms such as linear regression and k-means clustering.

Let us first load the dataset using the read.csv method. For this, use the path where you have saved the US pollution dataset.

p_data <- read.csv("C:/Users/Intellipaat-Team/Documents/R for Data Science/pollution_dataset.csv")

Now, we will have a look at the data and will try to understand the variables.

View(p_data)

Now, we will have a look at the first six rows of the dataset using the head function.

head(p_data)
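As a self-contained illustration of this loading step, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the name `toy` are assumptions for demonstration only; the real dataset has far more rows and columns. It also shows that read.csv turns spaces in header names into dots by default, which is one way column names such as Date.Local can arise.

```r
# Toy stand-in for the pollution CSV (values are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "City,Date Local,NO2 AQI",
  "New York,2011-01-01,34",
  "New York,2011-01-02,41",
  "Phoenix,2011-01-01,52"
), csv_path)

# read.csv() makes syntactically valid names, so "Date Local" becomes Date.Local
toy <- read.csv(csv_path, stringsAsFactors = FALSE)

head(toy)  # first rows (up to six)
str(toy)   # column names and types
```

Running str() right after loading is a quick way to confirm that each column has the type you expect before any analysis.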

Next, in this Data Science with R blog, we will load the ‘ggplot2’ package that we will use later for data visualization. Then, we will load the ‘readr’ package to read the CSV files. Also, we will use ‘TSA’ and ‘tseries’ for time series analysis.

# Install tseries first if it is not already installed (needed only once)

install.packages("tseries")

library(ggplot2)

library(readr)

library(TSA)

library(tseries)

Now, we will extract the NO2, O3, SO2, and CO AQI (Air Quality Index) values for New York (Queens County).

New_York<- subset(p_data, City == "New York" & County == "Queens", select = c(City, Date.Local, NO2.AQI,O3.AQI,SO2.AQI,CO.AQI,County))

head(New_York)

tail(New_York)
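To make the behavior of subset concrete, here is a minimal sketch on a small made-up data frame. The data frame `toy` and its values are assumptions for illustration; only the pattern of a row condition plus a `select` argument mirrors the call above.

```r
# Toy data frame (made-up values) mimicking a few columns of the dataset
toy <- data.frame(
  City    = c("New York", "New York", "Phoenix", "New York"),
  County  = c("Queens", "Bronx", "Maricopa", "Queens"),
  NO2.AQI = c(34, 29, 52, 41),
  O3.AQI  = c(20, 25, 38, 18)
)

# Keep rows matching both conditions and only the selected columns
queens <- subset(toy, City == "New York" & County == "Queens",
                 select = c(City, NO2.AQI))
queens
```

The row condition and the column selection are independent, so either can be changed without touching the other.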

In this blog on Data Science with R, we are dealing with the US pollution dataset, which contains NA values. Let us view the New York subset:

View(New_York)

In this data, we do not have complete data for the years 2010 and 2016, so we will remove them. This leaves a dataset covering the years 2011 to 2015. We will also check for NA values in the dataset and eliminate them.

sum(is.na(New_York)) #if the result of the sum is 0, then there is no "NA" value.

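This number shows that there are 24,177 NA values in the data. Let us remove them with the help of the **na.omit** function.

# Dropping rows with NA values from the full dataset and from the New York subset

p_data <- na.omit(p_data)

New_York <- na.omit(New_York)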
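The following sketch shows, on a made-up data frame, how the NA count and na.omit relate to each other. The name `toy` and its values are hypothetical and stand in for the much larger pollution data.

```r
# Toy data with a few missing values (made up for illustration)
toy <- data.frame(
  NO2.AQI = c(34, NA, 41, 29),
  SO2.AQI = c(NA, 3, 7, 4)
)

total_na   <- sum(is.na(toy))      # total number of NA cells
na_per_col <- colSums(is.na(toy))  # NA count per column
clean      <- na.omit(toy)         # drops every row that contains an NA

total_na
na_per_col
nrow(clean)
```

Note that na.omit removes whole rows, so a single column with many missing values can shrink the dataset considerably. Checking the per-column counts first shows where the missing values come from.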

In this blog on R for Data Science, we are working with dates in the data as well. Therefore, we have to make sure that the dates are of the class ‘Date’.

New_York$Date.Local<- as.Date(New_York$Date.Local)

Now, we will keep only the data from 2011 to 2015.

New_York <- New_York[New_York$Date.Local >= as.Date("2011-01-01") & New_York$Date.Local <= as.Date("2015-12-31"), ]

# Ordering the data by Date.Local

New_York <- New_York[order(New_York$Date.Local), ]

As there are several observations having the same value, we will remove the repeated values and make them unique using the **unique** function.

head(New_York)

New_York<- unique(New_York)
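The date conversion, range filter, ordering, and de-duplication steps above can be sketched together on a small made-up data frame. The name `toy` and its values are assumptions for illustration only.

```r
# Made-up rows: out-of-range dates, unsorted order, and one exact duplicate
toy <- data.frame(
  Date.Local = c("2016-02-01", "2011-03-05", "2011-01-10",
                 "2011-03-05", "2010-12-31"),
  NO2.AQI    = c(30, 42, 35, 42, 28),
  stringsAsFactors = FALSE
)

toy$Date.Local <- as.Date(toy$Date.Local)  # character -> Date

# Keep only dates from 2011 through 2015
in_range <- toy$Date.Local >= as.Date("2011-01-01") &
            toy$Date.Local <= as.Date("2015-12-31")
toy <- toy[in_range, ]

toy <- toy[order(toy$Date.Local), ]  # sort chronologically
toy <- unique(toy)                   # drop exact duplicate rows
toy
```

Here unique compares entire rows, so two rows count as duplicates only when every column matches.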

Next, we will average the daily values by month to analyze the trend of the time series.

To do this, we build a year-month label for each date (a character string with "01" pasted on as the day) and then calculate the mean of the values that share each label.

**We will use the following functions:**

- **as.POSIXlt**: Used to manipulate objects of classes ‘POSIXlt’
- **POSIXct**: Used to represent calendar dates and times

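# Labelling each date with the first day of its month, e.g. "2011-01-01"

yyyymm <- paste(format(as.POSIXlt(New_York$Date.Local), format = "%Y-%m"), "01", sep = "-")

# Mean NO2 AQI for each month

monthly_mean <- tapply(New_York$NO2.AQI, yyyymm, mean)

monthly_mean <- as.data.frame(monthly_mean)

str(monthly_mean)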

#time series

mean.ts<- ts(data = monthly_mean, start = c(2011, 1), frequency = 12)

tsp(mean.ts)

#Plotting a graph for the Air Quality Index of NO2

plot(mean.ts, ylab = "NO2 AQI", main = "NO2 AQI Time Series")
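As a rough, self-contained illustration of the monthly averaging and the conversion to a time series, the sketch below uses synthetic daily values in place of the EPA data. The names `dates`, `no2`, and `mean_ts` are hypothetical.

```r
set.seed(1)  # reproducible made-up values

# Two years of synthetic daily observations (not the EPA data)
dates <- seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "day")
no2   <- round(runif(length(dates), min = 10, max = 60))

# Label each day with the first day of its month, e.g. "2011-01-01"
yyyymm <- format(dates, "%Y-%m-01")

# Average the daily values within each month
monthly_mean <- tapply(no2, yyyymm, mean)

# Monthly time series starting in January 2011
mean_ts <- ts(as.numeric(monthly_mean), start = c(2011, 1), frequency = 12)

length(mean_ts)  # one value per month
tsp(mean_ts)     # start, end, and frequency of the series
```

A four-digit year in the label keeps the groups in chronological order, because tapply returns its results sorted by label.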

# For O3.AQI

yyyymm <- paste(format(as.POSIXlt(New_York$Date.Local), format = "%Y-%m"), "01", sep = "-")

monthly_mean_O3 <- tapply(New_York$O3.AQI, yyyymm, mean)

monthly_mean_O3 <- as.data.frame(monthly_mean_O3)

#time series

# Starting in January 2011, as for the NO2 series

mean.ts1 <- ts(data = monthly_mean_O3, start = c(2011, 1), frequency = 12)

tsp(mean.ts1)

plot(mean.ts1, ylab = "O3 AQI", main = "O3 AQI Time Series")

Now, moving further in this blog on Data Science with R, we will create visualizations for the Air Quality Index of NO_{2}, SO_{2}, and CO over the years.

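library(scales)

library(ggplot2)

# The Month column is not created earlier in this blog; here it is assumed to be the first day of each observation's month, stored as a Date

p_data$Month <- as.Date(format(as.Date(p_data$Date.Local), "%Y-%m-01"))

# fun replaces the deprecated fun.y argument in ggplot2 3.3.0 and later

ggplot(data = p_data, aes(Month, NO2.AQI)) + ggtitle("NO2 AQI over the years") + stat_summary(fun = mean, geom = "line") + scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 months") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))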

#Plot for AQI of SO2

ggplot(data = p_data, aes(Month, SO2.AQI)) + ggtitle("SO2 AQI over the years") + stat_summary(fun = mean, geom = "line") + scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 months") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#Plot for AQI of CO

ggplot(data = p_data, aes(Month, CO.AQI)) + ggtitle("CO AQI over the years") + stat_summary(fun = mean, geom = "line") + scale_x_date(labels = date_format("%Y-%m"), date_breaks = "6 months") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
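The line in each of these plots connects the mean AQI at each value of Month. To see the numbers behind such a line, the same per-month means can be computed in base R with aggregate. This is only a sketch on made-up values, and the data frame `toy` is hypothetical.

```r
# Synthetic observations: several readings per month (made-up values)
toy <- data.frame(
  Month   = as.Date(c("2011-01-01", "2011-01-01", "2011-02-01",
                      "2011-02-01", "2011-02-01")),
  NO2.AQI = c(30, 40, 20, 25, 45)
)

# Mean NO2 AQI per month: the values a mean summary line would connect
monthly <- aggregate(NO2.AQI ~ Month, data = toy, FUN = mean)
monthly
```

Inspecting the aggregated table can help when a plotted trend looks surprising, since it shows exactly which monthly values are being drawn.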

#AQI of NO2 by State

ggplot(data = p_data, aes(y = NO2.AQI, x = State)) + ggtitle("NO2 AQI by State") + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#AQI of SO2 by State

ggplot(data = p_data, aes(y = SO2.AQI, x = State)) + ggtitle("SO2 AQI by State") + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#AQI of CO by State

ggplot(data = p_data, aes(y = CO.AQI, x = State, fill = State)) + ggtitle("CO AQI by State") + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Further, we will look into the distribution of the Air Quality Index by state using a bar plot, a scatter plot, a box plot, and a histogram.

#Bar-plot

ggplot(data = p_data, aes(x = NO2.AQI, fill = State)) + geom_bar(position = "dodge") + labs(x = "Air Quality Index of NO2", y = "Count", title = "Distribution of AQI by State" )

#Scatter-plot

ggplot(data = p_data, aes(x = NO2.AQI, y = NO2.Mean, col = State)) + geom_point(alpha = 0.6, size = 2) + labs(x = "Air Quality Index of NO2", y = "Mean NO2", title = "NO2 AQI vs. Mean NO2 by State")

#Box-plot

ggplot(data = p_data, aes(x = "Count",y = NO2.AQI, fill = State)) + geom_boxplot()

#Histogram

ggplot(data = p_data, aes(x = O3.AQI,col=State)) + geom_histogram(bins = 50) + labs(title = "Air Quality Index of O3") + theme_bw()

**Implementing Linear Regression**

Till now, we have visualized the data of US pollution levels. Next, we will be implementing Machine Learning algorithms such as linear regression and k-means clustering.

First, we will be loading the **caTools** package for implementing linear regression algorithms.

library(caTools)

# We will use the set.seed() function so that the same random split is generated on every run

set.seed(111)

# Splitting the data into a 70:30 ratio using NO2.1st.Max.Value

sample.split(p_data$NO2.1st.Max.Value, SplitRatio = 0.7) -> split_tag

# Creating the train and test sets

subset(p_data, split_tag == TRUE) -> train

subset(p_data, split_tag == FALSE) -> test

# Building the linear regression model using the train dataset

l_model <- lm(NO2.AQI ~ NO2.1st.Max.Value, data = train)

# Setting scipen makes R print numbers in fixed notation instead of scientific notation

options(scipen = 999)

# Making predictions using the test dataset

pred_val <- predict(l_model, newdata = test)

head(pred_val)

# Binding the actual and predicted values

cbind(Actual = test$NO2.AQI, Predicted = pred_val) -> final_data

View(final_data)

# As the data is in the form of a matrix, we will convert it into a dataframe

final_data <- as.data.frame(final_data)

final_data

# Calculating the error and binding it to the actual and predicted values

final_data$Actual - final_data$Predicted -> error

View(final_data)

cbind(final_data, error) -> final_data

head(final_data)

# Calculating the root mean square error (RMSE). A lower RMSE means the predictions are closer to the actual values

sqrt(mean((final_data$error)^2))

# Plotting NO2.AQI against NO2.1st.Max.Value

plot(p_data$NO2.1st.Max.Value, p_data$NO2.AQI)
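The whole workflow of split, fit, predict, and RMSE can be tried end to end on synthetic data. The sketch below is not the blog's model: it uses base R's sample for the 70:30 split instead of caTools, and the variables `x` and `y` are made up.

```r
set.seed(111)  # reproducible random data and split

# Synthetic data: y depends linearly on x plus noise (not the EPA data)
n   <- 200
toy <- data.frame(x = runif(n, 0, 100))
toy$y <- 5 + 0.9 * toy$x + rnorm(n, sd = 4)

# 70:30 train/test split using base R sampling
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows, predict on the held-out rows
model <- lm(y ~ x, data = train)
pred  <- predict(model, newdata = test)

# Root mean square error on the test set
rmse <- sqrt(mean((test$y - pred)^2))

coef(model)  # estimated intercept and slope
rmse
```

Because the noise here has a standard deviation of 4, an RMSE in that neighborhood is about the best a linear fit can do on this toy data. RMSE is expressed in the units of the response, so it is most useful when compared with the spread of the response itself.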

**Implementing K-means Clustering**

We will now implement the k-means clustering algorithm to understand the structure of the data. For this, we will try to cluster the different groups of the data.

Now, let us start implementing the algorithm.

First, we will load the required packages for implementing the algorithm.

library(dplyr)

Next, we will make a group of NO_{2}, SO_{2}, and O_{3} using the **select()** function.

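p_data %>% select("NO2.AQI", "O3.AQI", "SO2.AQI") -> AQI_cluster

# plot_clus_coord() is not part of dplyr; this call works only if the function has been defined or loaded separately

plot_clus_coord(AQI_cluster, p_data)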

Let us view the subset of columns that we have just created.

View(AQI_cluster)

Now, we will create separate clusters for O_{3}, NO_{2}, and SO_{2}. We will start by creating the clusters of O_{3}.

kmeans(AQI_cluster$O3.AQI, 3) -> cluster_O3

cbind(O3 = AQI_cluster$O3.AQI, Cluster = cluster_O3$cluster) -> cluster_group

View(cluster_group)

# We will convert the matrix of clusters into a dataframe

as.data.frame(cluster_group) -> cluster_group

Let us now separate all the clusters of O_{3} by using the filter function.

cluster_group %>% filter(Cluster == 1) -> cluster_group_1

cluster_group %>% filter(Cluster == 2) -> cluster_group_2

cluster_group %>% filter(Cluster == 3) -> cluster_group_3

View(cluster_group_1)
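To see what clustering a single column and then separating the groups looks like, here is a minimal base R sketch on made-up values. It uses split instead of dplyr's filter, and the vector `o3` is hypothetical.

```r
set.seed(42)  # k-means starts from random centers

# Made-up AQI values forming three loose groups
o3 <- c(10, 12, 15, 11, 40, 42, 45, 38, 80, 85, 78, 82)

# nstart = 10 reruns the algorithm from several random starts and keeps the best
km <- kmeans(o3, centers = 3, nstart = 10)

# Pair each value with its cluster label, then split the values by label
grouped <- data.frame(O3 = o3, Cluster = km$cluster)
groups  <- split(grouped$O3, grouped$Cluster)

km$centers  # one center per cluster
groups      # values belonging to each cluster
```

The cluster numbers themselves are arbitrary: cluster 1 in one run may be cluster 3 in another, so it is safer to interpret clusters by their centers than by their labels.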

Similarly, we will create clusters for NO_{2} and further separate the clusters.

kmeans(AQI_cluster$NO2.AQI, 3) -> cluster_NO2

cbind(NO2 = AQI_cluster$NO2.AQI, Cluster = cluster_NO2$cluster) -> cluster_group_NO2

View(cluster_group_NO2)

as.data.frame(cluster_group_NO2) -> cluster_group_NO2

cluster_group_NO2 %>% filter(Cluster == 1) -> cluster_group1_NO2

cluster_group_NO2 %>% filter(Cluster == 2) -> cluster_group2_NO2

cluster_group_NO2 %>% filter(Cluster == 3) -> cluster_group3_NO2

View(cluster_group1_NO2)

View(cluster_group2_NO2)

View(cluster_group3_NO2)

Finally, we will create clusters for SO_{2}.

kmeans(AQI_cluster$SO2.AQI,3)->cluster_SO2

View(cluster_SO2)

cbind(SO2=AQI_cluster$SO2.AQI,Cluster=cluster_SO2$cluster)->cluster_group_SO2

View(cluster_group_SO2)

as.data.frame(cluster_group_SO2)->cluster_group_SO2

cluster_group_SO2 %>% filter(Cluster==1)->cluster_group1_SO2

cluster_group_SO2 %>% filter(Cluster==2)->cluster_group2_SO2

cluster_group_SO2 %>% filter(Cluster==3)->cluster_group3_SO2

View(cluster_group1_SO2)

View(cluster_group2_SO2)

View(cluster_group3_SO2)

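Let us plot the NO_{2}, O_{3}, and SO_{2} values against each other using the **plot** function.

# Colouring the points by their NO2 cluster

plot(AQI_cluster[c("NO2.AQI","O3.AQI","SO2.AQI")], col = cluster_NO2$cluster)

This plot is a 3 × 3 scatter plot matrix: each off-diagonal panel plots one pair of the NO_{2}, O_{3}, and SO_{2} indices, and the colours mark the three NO_{2} clusters.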

Now, let us have a look at the clustering plots based on some data and allocate the centers.

#Clustering plots

X <- data.frame(c1 = c(0,1,2,4,5,4,6,7), c2 = c(0,1,2,3,3,4,5,5))

km <- kmeans(X, centers = 2)

plot(X, col = km$cluster)

# Marking the cluster centers

points(km$centers, col = 1:2, pch = 8, cex = 1)

Finally, we will create the clustering plots using just three centers for NO_{2}, O_{3}, and SO_{2}.

X <- data.frame(AQI_cluster)

km <- kmeans(X, centers = 3)

plot(X, col = km$cluster)
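Before reading such a plot, it can help to look at the cluster sizes and centers directly. The sketch below does this on synthetic three-column data standing in for the AQI columns; the values are made up, and `nstart = 10` is an added choice to make the result less sensitive to the random starting centers.

```r
set.seed(7)  # reproducible synthetic data and clustering

# Synthetic three-column data standing in for NO2, O3 and SO2 AQI
toy <- data.frame(
  NO2.AQI = c(rnorm(30, 20, 3), rnorm(30, 45, 3), rnorm(30, 70, 3)),
  O3.AQI  = c(rnorm(30, 30, 3), rnorm(30, 15, 3), rnorm(30, 40, 3)),
  SO2.AQI = c(rnorm(30,  5, 1), rnorm(30, 10, 1), rnorm(30, 20, 1))
)

km <- kmeans(toy, centers = 3, nstart = 10)

km$size     # number of rows in each cluster
km$centers  # cluster means, one row per cluster

# Scatter plot matrix coloured by cluster label
plot(toy, col = km$cluster)
```

Each row of `km$centers` gives the average NO2, O3, and SO2 value of one cluster, which makes it easier to describe a cluster in words (for example, "high NO2, high SO2") than reading it off the scatter plots alone.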

The above plot shows the values of the Air Quality Index for NO_{2}, O_{3}, and SO_{2}. Here, the bottom right plot shows the visualization for SO_{2} and NO_{2}. In it, most of the points lie between 0 and 60, which is considered acceptable air quality. Similarly, most of the Air Quality Index values for the SO_{2} and O_{3} pair and for the NO_{2} and O_{3} pair lie between 0 and 50, which is again considered acceptable air quality.

In this blog on Data Science with R, we worked on understanding the data through visualizations. Then, we implemented the linear regression algorithm for the dataset. Finally, we implemented the k-means clustering algorithm to build clusters of the data for the comparative analysis of the Air Quality Index for NO_{2}, SO_{2}, and O_{3}. This is all about this use case of Data Science with R.
