Data Extraction in R
In data extraction, the initial step is data pre-processing or data cleaning. In data cleaning, the task is to transform the dataset into a basic form that makes it easy to work with. One characteristic of a clean/tidy dataset is that it has one observation per row and one variable per column.
The next step in data extraction is data manipulation. In data manipulation, the task is to modify the data to make it easier to read and more organized. We manipulate the data for data analysis and data visualization. Data manipulation is also used with the term ‘data exploration’ which involves organizing data using the available sets of variables.
At times, the data collection process done by machines involves lots of errors and inaccuracies in reading. Data manipulation is also used to remove these inaccuracies and make data more accurate and precise.
Interested in learning R Programming? Go through this R Tutorial!
Check out this Intellipaat R Programming full course video:
Exploring Raw Data
Raw data is data collected from a source, which has not yet been processed for usage.Typically, the readily available data is not in a state in which it can be used efficiently for data extraction. Such data is difficult to manipulate and often needs to be processed in some way, before it can be used for data analysis and data extraction in general, and is referred to as raw data or source data.
In this blog, we are going to perform data extraction and data manipulation in R on the ‘Census Income’ dataset from the UCI Machine Learning Repository, which contains the income information of over 48,000 individuals, taken from the 1994 US census.
census <- read.csv("C:\\Users\\Intellipaat-Team\\Desktop\\census-income.csv")
To understand the structure of the dataset, we often use the following functions in R Programming:
- a) class() – class of data object
- b) str() – compact display of the internal structure of data
- c) summary() – summary of data
- d) names() – column names
- e) dim() – dimensions of data
- f) head() – view of the data from the top
- g) tail() – view of the data from the bottom
class(census)  "data.frame" > dim(census)  30162 15 > names(census)  "age" "workclass" "fnlwgt" "education" "education.num"  "marital.status" "occupation" "relationship" "race" "sex"  "capital.gain" "capital.loss" "hours.per.week" "native.country" "X" > head(census) #First six rows and all columns age workclass fnlwgt education education.num 1 39 State-gov 77516 Bachelors 13 2 50 Self-emp-not-inc 83311 Bachelors 13 3 38 Private 215646 HS-grad 9 4 53 Private 234721 11th 7 5 28 Private 338409 Bachelors 13 6 37 Private 284582 Masters 14
Go for this in-depth job-oriented Data Science Training Course now!
Watch this K Means Clustering Tutorial video
Removing Leading and Trailing Whitespaces
While dealing with character columns in a dataset, there can be situations where there are whitespaces at the beginning and/or end of the strings. These whitespaces can cause issues when attempting to perform data extraction and various operations such as sorting, sub-setting, etc.
In this blog, we will use various functions from the stringr and the dplyr packages to remove any whitespaces from the columns.
If the whitespaces exist in a factor column that contains characters, then we first need to convert the factor columns to character columns before we can remove those whitespaces. After the whitespaces are removed, we can convert those columns back to factors.
In the ‘census’ dataset, the factor columns that contain whitespaces are first converted to character columns before removing the whitespaces.
#Convert factor columns to character columns census$workclass <- as.character(census$workclass) census$occupation <- as.character(census$occupation) census$native.country <- as.character(census$native.country) census$education <- as.character(census$education) census$marital.status <- as.character(census$marital.status) census$relationship <- as.character(census$relationship) census$race <- as.character(census$race) census$sex <- as.character(census$sex) census$X <- as.character(census$X)
To remove whitespaces from the above columns, we will use the mutate_if and the str_trim functions from the dplyr and the stringr packages, respectively.
library("dplyr") library(stringr) census %>% mutate_if(is.character, str_trim) -> census
After performing the above operation, all the leading and trailing whitespaces will be removed.
Now, we need to convert the above columns back to factors to get back to the original structure of the dataset.
#Convert character columns back to factors census$workclass <- as.factor(census$workclass) census$occupation <- as.factor(census$occupation) census$native.country <- as.factor(census$native.country) census$education <- as.factor(census$education) census$marital.status <- as.factor(census$marital.status) census$relationship <- as.factor(census$relationship) census$race <- as.factor(census$race) census$sex <- as.factor(census$sex) census$X <- as.factor(census$X)
Missing and Special Values
While performing data extraction, we might encounter some missing values or some special values in the dataset. When any information is not available, we call it a missing value.
Often times in data analysis, we want to get a sense of how many complete observations we have. This can be helpful in determining how we handle observations with missing data points.
In R, the missing values are represented as NA, but these missing values may appear in other forms as well.
There are also certain special values that we often encounter in our dataset.
- Inf() – infinite values
- 1/0 – numerical values divided by zero
- NaN – not a number
- 0/0 – zero divided by zero
To find missing values in a data frame, we can use the following functions:
- is.na() – indicates which elements are missing
- any(is.na()) – indicates if there are any NAs
- sum(is.na()) – counts the number of NAs
- summary() – finds NAs
To deal with NAs, we can use the following functions:
- complete.cases() – to find rows with no missing values
- na.omit() – to remove rows with NAs
df <- data.frame(A = c(2,NA,9,NA), B = c(4,NA, 55,66), C = c(3,44,1,22)) is.na(df) A B C [1,] FALSE FALSE FALSE [2,] TRUE TRUE FALSE [3,] FALSE FALSE FALSE [4,] TRUE FALSE FALSE > any(is.na(df))  TRUE > sum(is.na(df))  3 > complete.cases(df)  TRUE FALSE TRUE FALSE df[complete.cases(df),] A B C 1 2 4 3 3 9 55 1 > na.omit(df) A B C 1 2 4 3 3 9 55 1
In the ‘census’ dataset, the missing values are represented by the “?” symbol. So, in order to remove these values, we will first convert them to NA values and then perform any one of the above operations to remove the NA values using the following code:
#Replace "?" with NA census[census == "?"] <- NA #Remove NA census <- na.omit(census) sum(is.na(census))  0
Become a master of Data Science with R by going through this online Data Science Course in London!
Watch this Decision Tree Machine Learning Tutorial video
Data Extraction in R
After completing the data cleaning step, our data is now in a clean and tidy format that can be used to perform data extraction and data analysis.
First, we will use the base R functions to extract rows and columns from a data frame. While performing data analysis or working on Data Science projects, these commands come in handy to extract information from a dataset.
In this blog, we will use the indexing features in R to perform data extraction on the ‘census’ dataset.
#select columns age, education, sex mycol <- c("age", "education", "sex") > census[mycol] age education sex 1 39 Bachelors Male 2 50 Bachelors Male 3 38 HS-grad Male 4 53 11th Male 5 28 Bachelors Female 6 37 Masters Female # First Row and 2nd and third column census[1, 2:3] workclass fnlwgt 1 State-gov 77516 # First 6 Rows and Second Column as a data frame as.data.frame( census[1:6,2], drop=false) census[1:10, 2] 1 State-gov 2 Self-emp-not-inc 3 Private 4 Private 5 Private 6 Private #Element at 5th row, tenth column census[5,10] Female Levels: Female Male # exclude columns age, occupation, and race mycols <- names(census) %in% c("age", "occupation", "race") newdata <- census[!mycols] # exclude 3rd and 5th column newdata <- census[c(-3,-5)] # delete columns fnlwgt and education.num census$fnlwgt <- census$education.num <- NULL #selecting rows based on column values newdata <- census[ which(census$sex=='Female' & census$age > 65), ]
Wish to get certified in Data Science! Learn Data Science from top Data Science experts and excel in your career with Intellipaat’s Data Science certification!
We can also use the subset() function from the base R package to select variables and observations.
#selecting rows where age is either greater than 50 or less than 20, and only age and education columns. newdata <- subset(census, age >= 50 | age < 20, select=c(age, education)) #selecting rows where education is Bachelors and Yearly Income(X) is >50K, and only education and capital.gain column. > newdata <- subset(census, education == "Bachelors" & X == ">50K", select = c(education, capital.gain)) > head(newdata) education capital.gain 10 Bachelors 5178 12 Bachelors 0 26 Bachelors 0 46 Bachelors 0 54 Bachelors 0 73 Bachelors 0
We use the sample() function from the base R package to extract a random sample of size n from a dataset.
#To sample 20 random rows from the “census” dataset without replacement. mysample <- census[sample(1:nrow(census), 20, replace=FALSE),]
We use the table() function to calculate the occurrences of unique values of a variable and display the frequency table of the categories of that variable.
#To print a frequency distribution of the “workclass” column. freq.table <- table(census$workclass) freq.table Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc 943 2067 22286 1074 2499 State-gov Without-pay 1279 14
Watch this insightful video of a Data Science Project:
Data Extraction in R with dplyr
The dplyr package contains various functions that are specifically designed for data extraction and data manipulation. These functions are preferred over the base R functions because the former process data at a faster rate and are known as the best for data extraction, exploration, and transformation.
Some of the important functions for data manipulation in R are as follows:
- select(): to select columns (variables)
- filter(): to filter (subset) rows
- mutate(): to create new columns
- summarise(): to summarize (or aggregate) data
- group_by(): to group data
- arrange(): to sort data
- join(): to join data frames
Now, let’s explore some examples of data extraction on the ‘census’ dataset using the dplyr package.
#To select the following columns mycols <- select(census, age, education,occupation) head(mycols) #To select all columns from education to relationship mycols <- select(census, education:relationship) #To print first 5 rows head(mycols, 5) #To select columns with numeric indexes mycols <- select(mycols,c(6:9)) head(mycols)
age education occupation 1 39 Bachelors Adm-clerical 2 50 Bachelors Exec-managerial 3 38 HS-grad Handlers-cleaners 4 53 11th Handlers-cleaners 5 28 Bachelors Prof-specialty 6 37 Masters Exec-managerial
education education.num marital.status occupation relationship 1 Bachelors 13 Never-married Adm-clerical Not-in-family 2 Bachelors 13 Married-civ-spouse Exec-managerial Husband 3 HS-grad 9 Divorced Handlers-cleaners Not-in-family 4 11th 7 Married-civ-spouse Handlers-cleaners Husband 5 Bachelors 13 Married-civ-spouse Prof-specialty Wife
marital.status occupation relationship race 1 Never-married Adm-clerical Not-in-family White 2 Married-civ-spouse Exec-managerial Husband White 3 Divorced Handlers-cleaners Not-in-family White 4 Married-civ-spouse Handlers-cleaners Husband Black 5 Married-civ-spouse Prof-specialty Wife Black 6 Married-civ-spouse Exec-managerial Wife White
#To print the last 5 rows with education as Bachelors and sex as Male myrows <- filter(census, education == "Bachelors", sex == "Male") tail(myrows,5)
#To print first 5 rows with occupation as Sales and race as White using pipe operator and select only the first 3 columns census %>% filter(occupation == "Sales", race == "White") %>% select(1:3) -> myrows head(myrows,5)
#To arrange age in ascending order and display only the age column census %>% arrange(age) %>% select(age) #To arrange hours.per.week in descending order census %>% arrange(desc(hours.per.week)) %>% select(hours.per.week)
Watch this Data Science for Beginners Tutorial video
#To get the count of “workclass” column census %>% count(workclass)
# A tibble: 7 x 2 workclass n <fct> <int> 1 Federal-gov 943 2 Local-gov 2067 3 Private 22286 4 Self-emp-inc 1074 5 Self-emp-not-inc 2499 6 State-gov 1279 7 Without-pay 14
Check out the top Data Science Interview Questions to learn what is expected from Data Science professionals!
#To find the mean of capital.gain column census %>% summarise(Mean= mean(capital.gain)) Mean 1 1092.008
Group By Function
#To find the mean of capital.gain column grouped according to the “workclass” column census %>% group_by(workclass) %>% summarise(Mean= mean(capital.gain))
# A tibble: 7 x 2 workclass Mean <fct> <dbl> 1 Federal-gov 832. 2 Local-gov 829. 3 Private 880. 4 Self-emp-inc 4811. 5 Self-emp-not-inc 1913. 6 State-gov 684. 7 Without-pay 488.
In this blog, we have learned the data extraction process using R programming and the different steps involved in the data extraction process. In the first step, we discussed the process of cleaning data in R using different techniques that are used to transform a dirty dataset into a clean or tidy one, making it easy to work with. After data cleaning, in the next step, we performed various operations for data manipulation in R and also data manipulation in R with dplyr package.
Still have queries? Come to Intellipaat’s Data Science Community, clarify all your doubts and excel in your career!