Data Extraction in R

In data extraction, the initial step is data pre-processing or data cleaning. In data cleaning, the task is to transform the dataset into a basic form that makes it easy to work with. One characteristic of a clean/tidy dataset is that it has one observation per row and one variable per column.

The next step in data extraction is data manipulation. In data manipulation, the task is to modify the data to make it easier to read and more organized. We manipulate the data for data analysis and data visualization. Data manipulation is also used with the term ‘data exploration’ which involves organizing data using the available sets of variables.

At times, the data collection process done by machines involves lots of errors and inaccuracies in reading. Data manipulation is also used to remove these inaccuracies and make data more accurate and precise.

Interested in learning R Programming? Go through this R Tutorial!

Watch Data Extraction in R programming Tutorial

Data Extraction

Exploring Raw Data

Raw data is data collected from a source, which has not yet been processed for usage.Typically, the readily available data is not in a state in which it can be used efficiently for data extraction. Such data is difficult to manipulate and often needs to be processed in some way, before it can be used for data analysis and data extraction in general, and is referred to as raw data or source data.

In this blog, we are going to perform data extraction and data manipulation in R on the ‘Census Income’ dataset from the UCI Machine Learning Repository, which contains the income information of over 48,000 individuals, taken from the 1994 US census.

To import the dataset:

census <- read.csv("C:\\Users\\Intellipaat-Team\\Desktop\\census-income.csv")

To understand the structure of the dataset, we often use the following functions in R Programming:

    1. a) class() – class of data object
    1. b) str() – compact display of the internal structure of data
    1. c) summary() – summary of data
    1. d) names() – column names
    1. e) dim() – dimensions of data
    1. f) head() – view of the data from the top
    1. g) tail() – view of the data from the bottom

For example:


[1] "data.frame"

> dim(census)

[1] 30162    15

> names(census) 

[1] "age"            "workclass"      "fnlwgt"         "education"      "education.num"  

[6] "marital.status" "occupation"     "relationship"   "race"           "sex"           

[11] "capital.gain"   "capital.loss"   "hours.per.week" "" "X"     

> head(census)                #First six rows and all columns  

  age        workclass fnlwgt education education.num 
1  39        State-gov  77516 Bachelors            13  
2  50 Self-emp-not-inc  83311 Bachelors            13
3  38          Private 215646   HS-grad             9  
4  53          Private 234721      11th             7 
5  28          Private 338409 Bachelors            13 
6  37          Private 284582   Masters            14

Go for this in-depth job-oriented Data Science Training Course now!

Removing Leading and Trailing Whitespaces

While dealing with character columns in a dataset, there can be situations where there are whitespaces at the beginning and/or end of the strings. These whitespaces can cause issues when attempting to perform data extraction and various operations such as sorting, sub-setting, etc.

In this blog, we will use various functions from the stringr and the dplyr packages to remove any whitespaces from the columns.

If the whitespaces exist in a factor column that contains characters, then we first need to convert the factor columns to character columns before we can remove those whitespaces. After the whitespaces are removed, we can convert those columns back to factors. Studying the Machine Learning Course will give you a better insight on the project.

For example:

In the ‘census’ dataset, the factor columns that contain whitespaces are first converted to character columns before removing the whitespaces.

#Convert factor columns to character columns
census$workclass <- as.character(census$workclass)
census$occupation <- as.character(census$occupation)
census$ <- as.character(census$
census$education <- as.character(census$education)
census$marital.status <- as.character(census$marital.status)
census$relationship <- as.character(census$relationship)
census$race <- as.character(census$race)
census$sex <- as.character(census$sex)
census$X <- as.character(census$X)

To remove whitespaces from the above columns, we will use the mutate_if and the str_trim functions from the dplyr and the stringr packages, respectively.

census %>%  

mutate_if(is.character, str_trim) -> census

After performing the above operation, all the leading and trailing whitespaces will be removed.

Now, we need to convert the above columns back to factors to get back to the original structure of the dataset.

#Convert character columns back to factors
census$workclass <- as.factor(census$workclass)
census$occupation <- as.factor(census$occupation)
census$ <- as.factor(census$
census$education <- as.factor(census$education)
census$marital.status <- as.factor(census$marital.status)
census$relationship <- as.factor(census$relationship)
census$race <- as.factor(census$race)
census$sex <- as.factor(census$sex)
census$X <- as.factor(census$X)

Become a Data Science Architect IBM

Missing and Special Values

While performing data extraction, we might encounter some missing values or some special values in the dataset. When any information is not available, we call it a missing value.

Often times in data analysis, we want to get a sense of how many complete observations we have. This can be helpful in determining how we handle observations with missing data points.

In R, the missing values are represented as NA, but these missing values may appear in other forms as well.

There are also certain special values that we often encounter in our dataset.

For example:

  • Inf() – infinite values
  • 1/0 – numerical values divided by zero
  • NaN – not a number
  • 0/0 – zero divided by zero

To find missing values in a data frame, we can use the following functions:

  • – indicates which elements are missing
  • any( – indicates if there are any NAs
  • sum( – counts the number of NAs
  • summary() – finds NAs

To deal with NAs, we can use the following functions:

  • complete.cases() – to find rows with no missing values
  • na.omit() – to remove rows with NAs

For example:

df <- data.frame(A = c(2,NA,9,NA),                 

B = c(4,NA, 55,66),                 

C = c(3,44,1,22))  
       A     B     C
> any(
[1] TRUE 
> sum(
[1] 3 
> complete.cases(df)
  A  B C
1 2  4 3
3 9 55 1  

> na.omit(df)
A  B C
1 2  4 3
3 9 55 1

In the ‘census’ dataset, the missing values are represented by the “?” symbol. So, in order to remove these values, we will first convert them to NA values and then perform any one of the above operations to remove the NA values using the following code:

#Replace "?" with NA

census[census == "?"] <- NA 

#Remove NA

census <- na.omit(census) 


[1] 0

Become a master of Data Science with R by going through this online Data Science Course in London!

Watch Data Science Project Tutorial

Data Extraction

Learn about Apriori Algorithm in Python with our informative blog.

Data Extraction in R

After completing the data cleaning step, our data is now in a clean and tidy format that can be used to perform data extraction and data analysis.

First, we will use the base R functions to extract rows and columns from a data frame. While performing data analysis or working on Data Science projects, these commands come in handy to extract information from a dataset.

In this blog, we will use the indexing features in R to perform data extraction on the ‘census’ dataset.

For example:

#select columns age, education, sex

mycol <- c("age", "education", "sex")

> census[mycol]    

     age    education    sex
1    39    Bachelors   Male
2    50    Bachelors   Male
3    38      HS-grad   Male
4    53         11th   Male
5    28    Bachelors Female
6    37      Masters Female 

# First Row and 2nd and third column

census[1, 2:3]

workclass fnlwgt

1 State-gov  77516 

# First 6 Rows and Second Column as a data frame census[1:6,2], drop=false)    

census[1:10, 2]

1         State-gov
2  Self-emp-not-inc
3           Private
4           Private
5           Private
6           Private  

#Element at 5th row, tenth column

Levels: Female Male

# exclude columns age, occupation, and race
mycols <- names(census) %in% c("age", "occupation", "race") 

newdata <- census[!mycols] 

# exclude 3rd and 5th column 
newdata <- census[c(-3,-5)] 

# delete columns fnlwgt and education.num
census$fnlwgt <- census$education.num <- NULL

#selecting rows based on column values
newdata <- census[ which(census$sex=='Female'                          
& census$age > 65), ]

Subset Function

We can also use the subset() function from the base R package to select variables and observations.

#selecting rows where age is either greater than 50 or less than 20, and only age and education columns. 

newdata <- subset(census, age >= 50 | age < 20, select=c(age, education))
#selecting rows where education is Bachelors and Yearly Income(X) is >50K, and only education and capital.gain column.

> newdata <- subset(census, education == "Bachelors" & X == ">50K", select = c(education, capital.gain))

> head(newdata)   

   education    capital.gain
10 Bachelors         5178
12 Bachelors            0
26 Bachelors            0
46 Bachelors            0
54 Bachelors            0
73 Bachelors            0

Sample Function

We use the sample() function from the base R package to extract a random sample of size n from a dataset.

#To sample 20 random rows from the “census” dataset without replacement. 
mysample <- census[sample(1:nrow(census), 20, replace=FALSE),]

Table Function

We use the table() function to calculate the occurrences of unique values of a variable and display the frequency table of the categories of that variable.

#To print a frequency distribution of the “workclass” column.

freq.table <- table(census$workclass)

Federal-gov     Local-gov      Private   Self-emp-inc Self-emp-not-inc         

       943             2067       22286               1074                  2499        
State-gov      Without-pay
       1279                   14

Watch this insightful video of a Data Science Project:

Data Extraction

Data Extraction in R with dplyr

The dplyr package contains various functions that are specifically designed for data extraction and data manipulation. These functions are preferred over the base R functions because the former process data at a faster rate and are known as the best for data extraction, exploration, and transformation.

Some of the important functions for data manipulation in R are as follows:

  • select(): to select columns (variables)
  • filter(): to filter (subset) rows
  • mutate(): to create new columns
  • summarise(): to summarize (or aggregate) data
  • group_by(): to group data
  • arrange(): to sort data
  • join(): to join data frames

Now, let’s explore some examples of data extraction on the ‘census’ dataset using the dplyr package.

Select Function

#To select the following columns
mycols <- select(census, age, education,occupation)

#To select all columns from education to relationship
mycols <- select(census, education:relationship)

#To print first 5 rows 
head(mycols, 5)

#To select columns with numeric indexes
mycols <- select(mycols,c(6:9))


age  education        occupation

1  39 Bachelors      Adm-clerical

2  50 Bachelors   Exec-managerial

3  38   HS-grad Handlers-cleaners

4  53      11th Handlers-cleaners

5  28 Bachelors    Prof-specialty

6  37   Masters   Exec-managerial


education  education.num     marital.status        occupation  relationship

1 Bachelors            13      Never-married      Adm-clerical Not-in-family

2 Bachelors            13 Married-civ-spouse   Exec-managerial       Husband

3   HS-grad             9           Divorced Handlers-cleaners Not-in-family

4      11th             7 Married-civ-spouse Handlers-cleaners       Husband

5 Bachelors            13 Married-civ-spouse    Prof-specialty          Wife


marital.status        occupation  relationship  race

1      Never-married      Adm-clerical Not-in-family White

2 Married-civ-spouse   Exec-managerial       Husband White

3           Divorced Handlers-cleaners Not-in-family White

4 Married-civ-spouse Handlers-cleaners       Husband Black

5 Married-civ-spouse    Prof-specialty          Wife Black

6 Married-civ-spouse   Exec-managerial          Wife White

Filter Function

#To print the last 5 rows with education as Bachelors and sex as Male

myrows <- filter(census, education == "Bachelors", sex == "Male")

#To print first 5 rows with occupation as Sales and race as White using pipe operator and select only the first 3 columns

census  %>% filter(occupation == "Sales", race == "White") %>% select(1:3) -> myrows

Arrange Function

#To arrange age in ascending order and display only the age column

census %>% arrange(age) %>% select(age)  

#To arrange hours.per.week in descending order

census %>% arrange(desc(hours.per.week)) %>% select(hours.per.week)

Count Function

#To get the count of “workclass” column

census %>% count(workclass)


# A tibble: 7 x 2

workclass            n

<fct>            <int>

1 Federal-gov        943

2 Local-gov         2067

3 Private          22286

4 Self-emp-inc      1074

5 Self-emp-not-inc  2499

6 State-gov         1279

7 Without-pay         14

Check out the top Data Science Interview Questions to learn what is expected from Data Science professionals!

Summarise Function

#To find the mean of capital.gain column 

census %>% summarise(Mean= mean(capital.gain))      


1 1092.008

Group By Function

#To find the mean of capital.gain column grouped according to the “workclass” column 

census %>% group_by(workclass) %>% summarise(Mean= mean(capital.gain))


# A tibble: 7 x 2

workclass         Mean

<fct>            <dbl>

1 Federal-gov       832.

2 Local-gov         829.

3 Private           880.

4 Self-emp-inc     4811.

5 Self-emp-not-inc 1913.

6 State-gov         684.

7 Without-pay       488.

Watch Data Science Interview Questions for Interview Preparation

Data Extraction


In this blog, we have learned the data extraction process using R programming and the different steps involved in the data extraction process. In the first step, we discussed the process of cleaning data in R using different techniques that are used to transform a dirty dataset into a clean or tidy one, making it easy to work with. After data cleaning, in the next step, we performed various operations for data manipulation in R and also data manipulation in R with dplyr package.

Still have queries? Come to Intellipaat’s Data Science Community, clarify all your doubts and excel in your career!

Course Schedule

Name Date
Data Science Architect 2021-01-16 2021-01-17
(Sat-Sun) Weekend batch
View Details
Data Science Architect 2021-01-23 2021-01-24
(Sat-Sun) Weekend batch
View Details
Data Science Architect 2021-01-30 2021-01-31
(Sat-Sun) Weekend batch
View Details

Leave a Reply

Your email address will not be published. Required fields are marked *