Data Extraction - A Step-by-Step Guide

Table of content

Show More

What is Data Extraction?

Data extraction is the process of obtaining raw data from a database or SaaS platform for replication of the data to a destination like a data warehouse that is designed to support online analytical processing (OLAP). It is the first step in a data ingestion process (ETL). ETL prepares the data for analysis or Business Intelligence (BI).

An ETL tool can extract data from data sources like social media, online reviews, online transactions, etc., and load them into a data warehouse. The data is then analyzed and mined for insights to gain brand perception.

Types of Data Extraction

Data can be extracted in the following ways:

  1. Update Notification
    The easiest way to extract data from a source system is by having that system issue a notification whenever there is a change in a record. Most databases provide a mechanism for this to allow database replication (change data capture or binary logs) Many SaaS applications offer webhooks with conceptually similar functionality.
  2. Incremental Extraction
    Some data sources can identify the records that have been modified and provide an extract of those records. The data extraction code needs to identify and propagate changes during subsequent ETL steps. One drawback of this method is that detection of deleted records in source data may not be possible.
  3. Full Extraction
    A full extraction is necessary during the first replication of the source. Some data sources are not able to identify changed data. In those cases, reloading a whole table is the only way to extract data from that source. As full extraction involves high data transfer volume, it’s not the best option as it can put a load on the network.

ETL Lifecycle

The ETL (Extract, Transform, Load) lifecycle is a fundamental data management and analytics process. It involves extracting data from multiple sources, transforming it into a consistent format, and loading it into a target system for analysis and reporting. The ETL process ensures data is accurate, consistent, and ready for decision-making. Let’s look at the steps involved in the ETL lifecycle.

Extract, Transform, Load (ETL) Lifecycle:

  • Extraction: Gather data from various sources like databases, files, or APIs.
  • Transformation: Clean, filter, and convert the extracted data into a consistent format suitable for analysis.
  • Data Quality Checks: Validate data integrity, accuracy, and completeness to ensure high-quality information.
  • Integration: Combine and merge data from different sources into a unified dataset.
  • Loading: Load the transformed data into a target system or database for storage and analysis.
  • Schedule and Automation: Establish regular ETL processes with automated workflows for efficient data updates.
  • Monitoring: Continuously monitor ETL jobs, data quality, and performance for timely troubleshooting and improvements.
  • Documentation: Document the ETL process, including data sources, transformations, and load operations for future reference.
  • Maintenance: Regularly review and optimize the ETL process to align with changing data requirements and business needs.

Data Extraction Tools

While extracting data might seem like a daunting task, most companies and organizations take the help of Apache NiFi, Talend, Informatica, Microsoft SQL Server Integration Services (SSIS), and IBM InfoSphere DataStage for the management of the extraction process from end to end. An ETL tool automates and streamlines the extraction process in order to deploy resources toward other priorities.

Let’s understand these tools a bit:

  • Power BI: Power BI helps you analyze data, create interactive dashboards, and make data-driven decisions.
  • Tableau: Tableau visualizes data, uncovering insights through interactive charts and maps.
  • Excel: Excel organizes data, performs calculations, and generates reports.
  • Google Analytics: Google Analytics tracks website traffic, user behavior, and marketing performance.
  • QlikView: QlikView explores and analyzes data with interactive dashboards and powerful search.
  • SAS: SAS extracts insights from data, enabling informed decision-making.
  • IBM Cognos: IBM Cognos offers reporting, analytics, and scorecard for performance monitoring.
  • Apache Hadoop: Hadoop processes and stores large datasets for efficient big data analytics.
  • Microsoft SQL Server: SQL Server manages data storage, integration, reporting, and advanced analytics.
  • Alteryx: Alteryx blends, analyzes, and automates data preparation without coding.

Benefits of Data Extraction Tool

The following are the benefits of using a data extraction tool:

  • Control – It allows data migration from outside sources into a user’s own databases. This prevents the data siloed by outdated software or application licenses. Extraction lets one have more control of their data.
  • Agility – It allows the consolidation of information into a centralized system to unify multiple data sets.
  • Simplified Sharing – It is an easy way to not only provide helpful but also limited data access when needed. It helps to share data in a common and usable format.
  • Accuracy and Precision – Errors are more likely to occur with manual processes and hand-coding. Entering, editing, and re-entering huge volumes of data negatively affect data integrity. It automates the processes and helps reduce errors, and thus, the time spent on resolving them.

Watch this Tutorial video on Data Extraction using Python:

Video Thumbnail

Exploring Raw Data

Raw data is data collected from a source, which has not yet been processed for usage. Typically, the readily available data is not in a state in which it can be used efficiently for data extraction. Such data is difficult to manipulate and often needs to be processed in some way before it can be used for data analysis and data extraction in general and is referred to as raw data or source data.

In this blog, we are going to perform data extraction and data manipulation in R on the ‘Census Income’ dataset from the UCI Machine Learning Repository, which contains the income information of over 48,000 individuals, taken from the 1994 US census.

To import the dataset:

census <- read.csv("C:\\Users\\Intellipaat-Team\\Desktop\\census-income.csv")

To understand the structure of the dataset, we often use the following functions in R Programming:

    1. a) class() – class of data object
    1. b) str() – compact display of the internal structure of data
    1. c) summary() – summary of data
    1. d) names() – column names
    1. e) dim() – dimensions of data
    1. f) head() – view of the data from the top
    1. g) tail() – view of the data from the bottom

For example:

class(census)
[1] "data.frame"
> dim(census)
[1] 30162    15
> names(census) 
[1] "age"            "workclass"      "fnlwgt"         "education"      "education.num"  
[6] "marital.status" "occupation"     "relationship"   "race"           "sex"           
[11] "capital.gain"   "capital.loss"   "hours.per.week" "native.country" "X"     
> head(census)                #First six rows and all columns  
  age        workclass fnlwgt education education.num 
1  39        State-gov  77516 Bachelors            13  
2  50 Self-emp-not-inc  83311 Bachelors            13
3  38          Private 215646   HS-grad             9  
4  53          Private 234721      11th             7
5  28          Private 338409 Bachelors            13
6  37          Private 284582   Masters            14

Get 100% Hike!

Master Most in Demand Skills Now!

Removing Leading and Trailing Whitespaces

While dealing with character columns in a dataset, there can be situations where there are whitespaces at the beginning and/or end of the strings. These whitespaces can cause issues when attempting to perform data extraction and various operations such as sorting, sub-setting, etc.

In this blog, we will use various functions from the stringr and the dplyr packages to remove any whitespaces from the columns.

If the whitespaces exist in a factor column that contains characters, then we first need to convert the factor columns to character columns before we can remove those whitespaces. After the whitespaces are removed, we can convert those columns back to factors. Studying the Machine Learning Course will give you a better insight on the project.

For example:

In the ‘census’ dataset, the factor columns that contain whitespaces are first converted to character columns before removing the whitespaces.

#Convert factor columns to character columns
census$workclass <- as.character(census$workclass)
census$occupation <- as.character(census$occupation)
census$native.country <- as.character(census$native.country)
census$education <- as.character(census$education)
census$marital.status <- as.character(census$marital.status)
census$relationship <- as.character(census$relationship)
census$race <- as.character(census$race)
census$sex <- as.character(census$sex)
census$X <- as.character(census$X)

To remove whitespaces from the above columns, we will use the mutate_if and the str_trim functions from the dplyr and the stringr packages, respectively.

library("dplyr")
library(stringr)
census %>%  
mutate_if(is.character, str_trim) -> census

After performing the above operation, all the leading and trailing whitespaces will be removed.

Now, we need to convert the above columns back to factors to get back to the original structure of the dataset.

#Convert character columns back to factors
census$workclass <- as.factor(census$workclass)
census$occupation <- as.factor(census$occupation)
census$native.country <- as.factor(census$native.country)
census$education <- as.factor(census$education)
census$marital.status <- as.factor(census$marital.status)
census$relationship <- as.factor(census$relationship)
census$race <- as.factor(census$race)
census$sex <- as.factor(census$sex)
census$X <- as.factor(census$X)

Missing and Special Values

While performing data extraction, we might encounter some missing values or some special values in the dataset. When any information is not available, we call it a missing value.

Often times in data analysis, we want to get a sense of how many complete observations we have. This can be helpful in determining how we handle observations with missing data points.

In R, the missing values are represented as NA, but these missing values may appear in other forms as well.

There are also certain special values that we often encounter in our dataset.

For example:

  • Inf() – infinite values
  • 1/0 – numerical values divided by zero
  • NaN – not a number
  • 0/0 – zero divided by zero


To find missing values in a data frame, we can use the following functions:

  • is.na() – indicates which elements are missing
  • any(is.na()) – indicates if there are any NAs
  • sum(is.na()) – counts the number of NAs
  • summary() – finds NAs

To deal with NAs, we can use the following functions:

  • complete.cases() – to find rows with no missing values
  • na.omit() – to remove rows with NAs

For example:

df <- data.frame(A = c(2,NA,9,NA),                 
B = c(4,NA, 55,66),                 
C = c(3,44,1,22)) 
is.na(df)  
       A     B     C
[1,] FALSE FALSE FALSE
[2,]  TRUE  TRUE FALSE
[3,] FALSE FALSE FALSE
[4,]  TRUE FALSE FALSE
> any(is.na(df))
[1] TRUE
> sum(is.na(df))
[1] 3
> complete.cases(df)
[1]  TRUE FALSE  TRUE FALSE
df[complete.cases(df),] 
  A  B C
1 2  4 3
3 9 55 1  
> na.omit(df)
A  B C
1 2  4 3
3 9 55 1

In the ‘census’ dataset, the missing values are represented by the “?” symbol. So, in order to remove these values, we will first convert them to NA values and then perform any one of the above operations to remove the NA values using the following code:

#Replace "?" with NA
census[census == "?"] <- NA 
#Remove NA
census <- na.omit(census) 
sum(is.na(census))
[1] 0

Data Extraction in R

After completing the data cleaning step, our data is now in a clean and tidy format that can be used to perform data extraction and data analysis.

First, we will use the base R functions to extract rows and columns from a data frame. While performing data analysis or working on Data Science projects, these commands come in handy to extract information from a dataset.

In this blog, we will use the indexing features in R to perform data extraction on the ‘census’ dataset.

For example:

#select columns age, education, sex
mycol <- c("age", "education", "sex")
> census[mycol]    
     age    education    sex
1    39    Bachelors   Male
2    50    Bachelors   Male
3    38      HS-grad   Male
4    53         11th   Male
5    28    Bachelors Female
6    37      Masters Female 
# First Row and 2nd and third column
census[1, 2:3]
workclass fnlwgt
1 State-gov  77516 
# First 6 Rows and Second Column as a data frame
as.data.frame( census[1:6,2], drop=false)    
census[1:10, 2]
1         State-gov
2  Self-emp-not-inc
3           Private
4           Private
5           Private
6           Private  
#Element at 5th row, tenth column
census[5,10][1] 
Female
Levels: Female Male
# exclude columns age, occupation, and race
mycols <- names(census) %in% c("age", "occupation", "race") 
newdata <- census[!mycols] 
# exclude 3rd and 5th column
newdata <- census[c(-3,-5)] 
# delete columns fnlwgt and education.num
census$fnlwgt <- census$education.num <- NULL
#selecting rows based on column values
newdata <- census[ which(census$sex=='Female'
& census$age > 65), ]

Subset Function

We can also use the subset() function from the base R package to select variables and observations.

#selecting rows where age is either greater than 50 or less than 20, and only age and education columns. 
newdata <- subset(census, age >= 50 | age < 20, select=c(age, education))
#selecting rows where education is Bachelors and Yearly Income(X) is >50K, and only education and capital.gain column.
> newdata <- subset(census, education == "Bachelors" & X == ">50K", select = c(education, capital.gain))
> head(newdata)   
   education    capital.gain
10 Bachelors         5178
12 Bachelors            0
26 Bachelors            0
46 Bachelors            0
54 Bachelors            0
73 Bachelors            0

Sample Function

We use the sample() function from the base R package to extract a random sample of size n from a dataset.

#To sample 20 random rows from the “census” dataset without replacement. 
mysample <- census[sample(1:nrow(census), 20, replace=FALSE),]

Table Function

We use the table() function to calculate the occurrences of unique values of a variable and display the frequency table of the categories of that variable.

#To print a frequency distribution of the “workclass” column.
freq.table <- table(census$workclass)
freq.table
Federal-gov     Local-gov      Private   Self-emp-inc Self-emp-not-inc         
       943             2067       22286               1074                  2499
State-gov      Without-pay
       1279                   14

Check out this Data Science Course video to learn more about its basic concepts:

Video Thumbnail

Data Extraction in R with dplyr

The dplyr package contains various functions that are specifically designed for data extraction and data manipulation. These functions are preferred over the base R functions because the former process data at a faster rate and are known as the best for data extraction, exploration, and transformation.

Some of the important functions for data manipulation in R are as follows:

  • select(): to select columns (variables)
  • filter(): to filter (subset) rows
  • mutate(): to create new columns
  • summarise(): to summarize (or aggregate) data
  • group_by(): to group data
  • arrange(): to sort data
  • join(): to join data frames

Now, let’s explore some examples of data extraction on the ‘census’ dataset using the dplyr package.

Select Function

#To select the following columns
mycols <- select(census, age, education,occupation)
head(mycols) 
#To select all columns from education to relationship
mycols <- select(census, education:relationship)
#To print first 5 rows
head(mycols, 5)
#To select columns with numeric indexes
mycols <- select(mycols,c(6:9))
head(mycols)

Output:

age  education        occupation
1  39 Bachelors      Adm-clerical
2  50 Bachelors   Exec-managerial
3  38   HS-grad Handlers-cleaners
4  53      11th Handlers-cleaners
5  28 Bachelors    Prof-specialty
6  37   Masters   Exec-managerial

Output:

education  education.num     marital.status        occupation  relationship
1 Bachelors            13      Never-married      Adm-clerical Not-in-family
2 Bachelors            13 Married-civ-spouse   Exec-managerial       Husband
3   HS-grad             9           Divorced Handlers-cleaners Not-in-family
4      11th             7 Married-civ-spouse Handlers-cleaners       Husband
5 Bachelors            13 Married-civ-spouse    Prof-specialty          Wife

Output:

marital.status        occupation  relationship  race
1      Never-married      Adm-clerical Not-in-family White
2 Married-civ-spouse   Exec-managerial       Husband White
3           Divorced Handlers-cleaners Not-in-family White
4 Married-civ-spouse Handlers-cleaners       Husband Black
5 Married-civ-spouse    Prof-specialty          Wife Black
6 Married-civ-spouse   Exec-managerial          Wife White

Filter Function

#To print the last 5 rows with education as Bachelors and sex as Male
myrows <- filter(census, education == "Bachelors", sex == "Male")
tail(myrows,5)
#To print first 5 rows with occupation as Sales and race as White using pipe operator and select only the first 3 columns
census  %>% filter(occupation == "Sales", race == "White") %>% select(1:3) -> myrows
head(myrows,5)

Become a Data Science Architect

Arrange Function

#To arrange age in ascending order and display only the age column
census %>% arrange(age) %>% select(age)  
#To arrange hours.per.week in descending order
census %>% arrange(desc(hours.per.week)) %>% select(hours.per.week)

Count Function

#To get the count of “workclass” column
census %>% count(workclass)

Output:

# A tibble: 7 x 2
workclass            n
<fct>            <int>
1 Federal-gov        943
2 Local-gov         2067
3 Private          22286
4 Self-emp-inc      1074
5 Self-emp-not-inc  2499
6 State-gov         1279
7 Without-pay         14

Summarise Function

#To find the mean of capital.gain column 
census %>% summarise(Mean= mean(capital.gain))      
Mean
1 1092.008

Group By Function

#To find the mean of capital.gain column grouped according to the “workclass” column 
census %>% group_by(workclass) %>% summarise(Mean= mean(capital.gain))

Output:

# A tibble: 7 x 2
workclass         Mean
<fct>             <dbl>
1 Federal-gov       832.
2 Local-gov         829.
3 Private           880.
4 Self-emp-inc     4811.
5 Self-emp-not-inc 1913.
6 State-gov         684.
7 Without-pay       488.

Conclusion

In this blog, we have learned the data extraction process using R programming and the different steps involved in the data extraction process. In the first step, we discussed the process of cleaning data in R using different techniques that are used to transform a dirty dataset into a clean or tidy one, making it easy to work with. After data cleaning, in the next step, we performed various operations for data manipulation in R and also data manipulation in R with dplyr package.

 

About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Aakash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.