What is Data Extraction?
Data extraction is when data is obtained from a database or SaaS platform for replication of the data to a destination like a data warehouse that is designed to support online analytical processing (OLAP). It is the first step in a data ingestion process (ETL). ETL prepares the data for analysis or Business Intelligence (BI).
An ETL tool can extract data from data sources like social media mentions, online reviews, online transactions, etc., and load them into a data warehouse. The data is then analyzed and mined for insights to gain brand perception.
Data Extraction Tools
While extracting data might seem like a daunting task, most companies and organizations take the help of these tools for the management of the extraction process from end to end. An ETL tool automates and streamlines the extraction process in order to deploy resources toward other priorities.
Following are the benefits of using a data extraction tool:
- Control – It allows data migration from outside sources into a user’s own databases. This prevents the data siloed by outdated software or application licenses. Extraction lets one have more control of their data.
- Agility – It allows the consolidation of information into a centralized system to unify multiple data sets.
- Simplified Sharing – It is an easy way to not only provide helpful but also limited data access when needed. It helps to share data in a common and usable format.
- Accuracy and Precision – Errors are more likely to occur with manual processes and hand-coding. Entering, editing, and re-entering huge volumes of data negatively affect data integrity. It automates the processes and helps reduce errors, and thus, the time spent on resolving them.
Watch Data Extraction in R programming Tutorial
Types of Data Extraction
Data can be extracted in the following ways:
- Update Notification
The easiest way to extract data from a source system is by having that system issue a notification whenever there is a change in a record. Most databases provide a mechanism for this to allow database replication (change data capture or binary logs) Many SaaS applications offer webhooks with conceptually similar functionality.
- Incremental Extraction
Some data sources are able to identify the records that have been modified and provide an extract of those records. The data extraction code needs to identify and propagate changes during subsequent ETL steps. One drawback of this method is that detection of deleted records in source data may not be possible.
- Full Extraction
A full extraction is necessary during the first replication of the source. Some data sources are not able to identify changed data. In those cases, reloading a whole table is the only way to extract data from that source. As full extraction involves high data transfer volume, it’s not the best option as it can put a load on the network.
Process of Data Extraction
We already know that data extraction is the first step in the three-step ETL process:
- Extraction: Data is taken from one or multiple sources or systems. This process locates and identifies relevant data, and prepares it for processing or transformation. Extraction allows various data to be combined and eventually mined for business intelligence.
- Transformation: After successfully extracting data, the data is ready to be refined. During this phase, data is sorted, organized, and cleansed. For example, missing values will be removed or enriched, duplicate entries will be deleted, and audits will be performed for reliable, consistent, and usable data.
- Loading: The transformed and high-quality data is then delivered to a single and unified target location for the purpose of storage and analysis.
In data extraction, the initial step is data pre-processing or data cleaning. In data cleaning, the task is to transform the dataset into a basic form that makes it easy to work with. One characteristic of a clean/tidy dataset is that it has one observation per row and one variable per column.
The next step in this process is data manipulation. In data manipulation, the task is to modify the data to make it easier to read and more organized. We manipulate the data for data analysis and data visualization. Data manipulation is also used with the term ‘data exploration’ which involves organizing data using the available sets of variables.
At times, the data collection process done by machines involves lots of errors and inaccuracies in reading. Data manipulation is also used to remove these inaccuracies and make data more accurate and precise.
Interested in learning R Programming? Go through this R Tutorial!
Exploring Raw Data
Raw data is data collected from a source, which has not yet been processed for usage. Typically, the readily available data is not in a state in which it can be used efficiently for data extraction. Such data is difficult to manipulate and often needs to be processed in some way before it can be used for data analysis and data extraction in general and is referred to as raw data or source data.
In this blog, we are going to perform data extraction and data manipulation in R on the ‘Census Income’ dataset from the UCI Machine Learning Repository, which contains the income information of over 48,000 individuals, taken from the 1994 US census.
To import the dataset:
census <- read.csv("C:\\Users\\Intellipaat-Team\\Desktop\\census-income.csv")
To understand the structure of the dataset, we often use the following functions in R Programming:
a) class() – class of data object
b) str() – compact display of the internal structure of data
c) summary() – summary of data
d) names() – column names
e) dim() – dimensions of data
f) head() – view of the data from the top
g) tail() – view of the data from the bottom
 30162 15
 "age" "workclass" "fnlwgt" "education" "education.num"
 "marital.status" "occupation" "relationship" "race" "sex"
 "capital.gain" "capital.loss" "hours.per.week" "native.country" "X"
> head(census) #First six rows and all columns
age workclass fnlwgt education education.num
1 39 State-gov 77516 Bachelors 13
2 50 Self-emp-not-inc 83311 Bachelors 13
3 38 Private 215646 HS-grad 9
4 53 Private 234721 11th 7
5 28 Private 338409 Bachelors 13
6 37 Private 284582 Masters 14
Go for this in-depth job-oriented Data Science Courses now!
Get 100% Hike!
Master Most in Demand Skills Now !
Removing Leading and Trailing Whitespaces
While dealing with character columns in a dataset, there can be situations where there are whitespaces at the beginning and/or end of the strings. These whitespaces can cause issues when attempting to perform data extraction and various operations such as sorting, sub-setting, etc.
In this blog, we will use various functions from the stringr and the dplyr packages to remove any whitespaces from the columns.
If the whitespaces exist in a factor column that contains characters, then we first need to convert the factor columns to character columns before we can remove those whitespaces. After the whitespaces are removed, we can convert those columns back to factors. Studying the Machine Learning Course will give you a better insight on the project.
In the ‘census’ dataset, the factor columns that contain whitespaces are first converted to character columns before removing the whitespaces.
#Convert factor columns to character columns
census$workclass <- as.character(census$workclass)
census$occupation <- as.character(census$occupation)
census$native.country <- as.character(census$native.country)
census$education <- as.character(census$education)
census$marital.status <- as.character(census$marital.status)
census$relationship <- as.character(census$relationship)
census$race <- as.character(census$race)
census$sex <- as.character(census$sex)
census$X <- as.character(census$X)
To remove whitespaces from the above columns, we will use the mutate_if and the str_trim functions from the dplyr and the stringr packages, respectively.
mutate_if(is.character, str_trim) -> census
After performing the above operation, all the leading and trailing whitespaces will be removed.
Now, we need to convert the above columns back to factors to get back to the original structure of the dataset.
#Convert character columns back to factors
census$workclass <- as.factor(census$workclass)
census$occupation <- as.factor(census$occupation)
census$native.country <- as.factor(census$native.country)
census$education <- as.factor(census$education)
census$marital.status <- as.factor(census$marital.status)
census$relationship <- as.factor(census$relationship)
census$race <- as.factor(census$race)
census$sex <- as.factor(census$sex)
census$X <- as.factor(census$X)
Missing and Special Values
While performing data extraction, we might encounter some missing values or some special values in the dataset. When any information is not available, we call it a missing value.
Often times in data analysis, we want to get a sense of how many complete observations we have. This can be helpful in determining how we handle observations with missing data points.
In R, the missing values are represented as NA, but these missing values may appear in other forms as well.
There are also certain special values that we often encounter in our dataset.
- Inf() – infinite values
- 1/0 – numerical values divided by zero
- NaN – not a number
- 0/0 – zero divided by zero
To find missing values in a data frame
, we can use the following functions:
- is.na() – indicates which elements are missing
- any(is.na()) – indicates if there are any NAs
- sum(is.na()) – counts the number of NAs
- summary() – finds NAs
To deal with NAs, we can use the following functions:
- complete.cases() – to find rows with no missing values
- na.omit() – to remove rows with NAs
df <- data.frame(A = c(2,NA,9,NA),
B = c(4,NA, 55,66),
C = c(3,44,1,22))
A B C
[1,] FALSE FALSE FALSE
[2,] TRUE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] TRUE FALSE FALSE
 TRUE FALSE TRUE FALSE
A B C
1 2 4 3
3 9 55 1
A B C
1 2 4 3
3 9 55 1
In the ‘census’ dataset, the missing values are represented by the “?” symbol. So, in order to remove these values, we will first convert them to NA values and then perform any one of the above operations to remove the NA values using the following code:
#Replace "?" with NA
census[census == "?"] <- NA
census <- na.omit(census)
Learn about Apriori Algorithm in Python with our informative blog.
Data Extraction in R
After completing the data cleaning step, our data is now in a clean and tidy format that can be used to perform data extraction and data analysis.
First, we will use the base R functions to extract rows and columns from a data frame. While performing data analysis or working on Data Science projects, these commands come in handy to extract information from a dataset.
In this blog, we will use the indexing features in R to perform data extraction on the ‘census’ dataset.
#select columns age, education, sex
mycol <- c("age", "education", "sex")
age education sex
1 39 Bachelors Male
2 50 Bachelors Male
3 38 HS-grad Male
4 53 11th Male
5 28 Bachelors Female
6 37 Masters Female
# First Row and 2nd and third column
1 State-gov 77516
# First 6 Rows and Second Column as a data frame
as.data.frame( census[1:6,2], drop=false)
#Element at 5th row, tenth column
Levels: Female Male
# exclude columns age, occupation, and race
mycols <- names(census) %in% c("age", "occupation", "race")
newdata <- census[!mycols]
# exclude 3rd and 5th column
newdata <- census[c(-3,-5)]
# delete columns fnlwgt and education.num
census$fnlwgt <- census$education.num <- NULL
#selecting rows based on column values
newdata <- census[ which(census$sex=='Female'
& census$age > 65), ]
We can also use the subset() function from the base R package to select variables and observations.
#selecting rows where age is either greater than 50 or less than 20, and only age and education columns.
newdata <- subset(census, age >= 50 | age < 20, select=c(age, education))
#selecting rows where education is Bachelors and Yearly Income(X) is >50K, and only education and capital.gain column.
> newdata <- subset(census, education == "Bachelors" & X == ">50K", select = c(education, capital.gain))
10 Bachelors 5178
12 Bachelors 0
26 Bachelors 0
46 Bachelors 0
54 Bachelors 0
73 Bachelors 0
We use the sample() function from the base R package to extract a random sample of size n from a dataset.
#To sample 20 random rows from the “census” dataset without replacement.
mysample <- census[sample(1:nrow(census), 20, replace=FALSE),]
We use the table() function to calculate the occurrences of unique values of a variable and display the frequency table of the categories of that variable.
#To print a frequency distribution of the “workclass” column.
freq.table <- table(census$workclass)
Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc
943 2067 22286 1074 2499
Interested to learn Data Science using R? Check out this Data Science course in Pune Today!
Check out this Data Science Course video to learn more about its basic concepts:
Data Extraction in R with dplyr
The dplyr package contains various functions that are specifically designed for data extraction and data manipulation. These functions are preferred over the base R functions because the former process data at a faster rate and are known as the best for data extraction, exploration, and transformation.
Some of the important functions for data manipulation in R are as follows:
- select(): to select columns (variables)
- filter(): to filter (subset) rows
- mutate(): to create new columns
- summarise(): to summarize (or aggregate) data
- group_by(): to group data
- arrange(): to sort data
- join(): to join data frames
Now, let’s explore some examples of data extraction on the ‘census’ dataset using the dplyr package.
#To select the following columns
mycols <- select(census, age, education,occupation)
#To select all columns from education to relationship
mycols <- select(census, education:relationship)
#To print first 5 rows
#To select columns with numeric indexes
mycols <- select(mycols,c(6:9))
age education occupation
1 39 Bachelors Adm-clerical
2 50 Bachelors Exec-managerial
3 38 HS-grad Handlers-cleaners
4 53 11th Handlers-cleaners
5 28 Bachelors Prof-specialty
6 37 Masters Exec-managerial
education education.num marital.status occupation relationship
1 Bachelors 13 Never-married Adm-clerical Not-in-family
2 Bachelors 13 Married-civ-spouse Exec-managerial Husband
3 HS-grad 9 Divorced Handlers-cleaners Not-in-family
4 11th 7 Married-civ-spouse Handlers-cleaners Husband
5 Bachelors 13 Married-civ-spouse Prof-specialty Wife
marital.status occupation relationship race
1 Never-married Adm-clerical Not-in-family White
2 Married-civ-spouse Exec-managerial Husband White
3 Divorced Handlers-cleaners Not-in-family White
4 Married-civ-spouse Handlers-cleaners Husband Black
5 Married-civ-spouse Prof-specialty Wife Black
6 Married-civ-spouse Exec-managerial Wife White
#To print the last 5 rows with education as Bachelors and sex as Male
myrows <- filter(census, education == "Bachelors", sex == "Male")
#To print first 5 rows with occupation as Sales and race as White using pipe operator and select only the first 3 columns
census %>% filter(occupation == "Sales", race == "White") %>% select(1:3) -> myrows
#To arrange age in ascending order and display only the age column
census %>% arrange(age) %>% select(age)
#To arrange hours.per.week in descending order
census %>% arrange(desc(hours.per.week)) %>% select(hours.per.week)
#To get the count of “workclass” column
census %>% count(workclass)
# A tibble: 7 x 2
1 Federal-gov 943
2 Local-gov 2067
3 Private 22286
4 Self-emp-inc 1074
5 Self-emp-not-inc 2499
6 State-gov 1279
7 Without-pay 14
Check out the top Data Science Interview Questions to learn what is expected from Data Science professionals!
#To find the mean of capital.gain column
census %>% summarise(Mean= mean(capital.gain))
Group By Function
#To find the mean of capital.gain column grouped according to the “workclass” column
census %>% group_by(workclass) %>% summarise(Mean= mean(capital.gain))
# A tibble: 7 x 2
1 Federal-gov 832.
2 Local-gov 829.
3 Private 880.
4 Self-emp-inc 4811.
5 Self-emp-not-inc 1913.
6 State-gov 684.
7 Without-pay 488.
In this blog, we have learned the data extraction process using R programming and the different steps involved in the data extraction process. In the first step, we discussed the process of cleaning data in R using different techniques that are used to transform a dirty dataset into a clean or tidy one, making it easy to work with. After data cleaning, in the next step, we performed various operations for data manipulation in R and also data manipulation in R with dplyr package.
Still have queries? Come to Intellipaat’s Data Science Community, clarify all your doubts and excel in your career!