What is Data Extraction?
Data extraction is the process of obtaining raw data from a database or SaaS platform so that it can be replicated to a destination, such as a data warehouse, that is designed to support online analytical processing (OLAP). It is the first step of the ETL (Extract, Transform, Load) process, which prepares data for analysis or Business Intelligence (BI).
An ETL tool can extract data from sources like social media, online reviews, and online transactions, and load it into a data warehouse. The data can then be analyzed and mined for insights, for example, to gauge brand perception.
Types of Data Extraction
Data can be extracted in the following ways:
- Update Notification
The easiest way to extract data from a source system is by having that system issue a notification whenever a record changes. Most databases provide a mechanism for this to support replication (change data capture or binary logs), and many SaaS applications offer webhooks with conceptually similar functionality.
- Incremental Extraction
Some data sources can identify the records that have been modified and provide an extract of just those records. The extraction code then needs to identify and propagate these changes through the subsequent ETL steps (see the sketch after this list). One drawback of this method is that detecting records deleted from the source may not be possible.
- Full Extraction
A full extraction is necessary during the first replication of the source. Some data sources cannot identify changed data; in those cases, reloading the whole table is the only way to extract data from that source. Since full extraction involves high data-transfer volumes, it is best avoided for recurring loads, as it can put a heavy strain on the network.
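To make incremental extraction concrete, here is a minimal sketch in R, assuming a hypothetical source table source_data with a modified_at timestamp column and a saved last_run timestamp; these names are illustrative, not part of any particular tool:
library(dplyr)
#Hypothetical timestamp of the previous extraction run
last_run <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")
#Keep only records created or modified since the last run
changed_records <- source_data %>% filter(modified_at > last_run)
#Record the new high-water mark for the next incremental run
last_run <- max(changed_records$modified_at)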
ETL Lifecycle
The ETL (Extract, Transform, Load) lifecycle is a fundamental data management and analytics process. It involves extracting data from multiple sources, transforming it into a consistent format, and loading it into a target system for analysis and reporting. The ETL process ensures data is accurate, consistent, and ready for decision-making. Let’s look at the steps involved in the ETL lifecycle.
Extract, Transform, Load (ETL) Lifecycle:
- Extraction: Gather data from various sources like databases, files, or APIs.
- Transformation: Clean, filter, and convert the extracted data into a consistent format suitable for analysis.
- Data Quality Checks: Validate data integrity, accuracy, and completeness to ensure high-quality information.
- Integration: Combine and merge data from different sources into a unified dataset.
- Loading: Load the transformed data into a target system or database for storage and analysis.
- Schedule and Automation: Establish regular ETL processes with automated workflows for efficient data updates.
- Monitoring: Continuously monitor ETL jobs, data quality, and performance for timely troubleshooting and improvements.
- Documentation: Document the ETL process, including data sources, transformations, and load operations for future reference.
- Maintenance: Regularly review and optimize the ETL process to align with changing data requirements and business needs.
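To make the lifecycle concrete, here is a minimal end-to-end ETL sketch in R; the file names (sales.csv, warehouse.sqlite) and the amount column are hypothetical, chosen only for illustration, and the load step assumes the DBI and RSQLite packages are installed:
library(dplyr)
library(DBI)
#Extract: read raw data from a source file (hypothetical path)
raw <- read.csv("sales.csv")
#Transform: clean and standardize into a consistent format
clean <- raw %>% mutate(amount = as.numeric(amount)) %>% filter(!is.na(amount))
#Load: write the transformed data into a target database
con <- dbConnect(RSQLite::SQLite(), "warehouse.sqlite")
dbWriteTable(con, "sales", clean, overwrite = TRUE)
dbDisconnect(con)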
Data Extraction Tools
While extracting data might seem like a daunting task, most companies and organizations rely on tools such as Apache NiFi, Talend, Informatica, Microsoft SQL Server Integration Services (SSIS), and IBM InfoSphere DataStage to manage the extraction process from end to end. An ETL tool automates and streamlines the extraction process so that resources can be deployed toward other priorities.
Beyond dedicated ETL tools, a number of analytics and BI platforms are commonly used alongside the extraction process. Let's look at a few of them:
- Power BI: Power BI helps you analyze data, create interactive dashboards, and make data-driven decisions.
- Tableau: Tableau visualizes data, uncovering insights through interactive charts and maps.
- Excel: Excel organizes data, performs calculations, and generates reports.
- Google Analytics: Google Analytics tracks website traffic, user behavior, and marketing performance.
- QlikView: QlikView explores and analyzes data with interactive dashboards and powerful search.
- SAS: SAS extracts insights from data, enabling informed decision-making.
- IBM Cognos: IBM Cognos offers reporting, analytics, and scorecards for performance monitoring.
- Apache Hadoop: Hadoop processes and stores large datasets for efficient big data analytics.
- Microsoft SQL Server: SQL Server manages data storage, integration, reporting, and advanced analytics.
- Alteryx: Alteryx blends, analyzes, and automates data preparation without coding.
Benefits of Data Extraction Tools
The following are the benefits of using a data extraction tool:
- Control – It allows data migration from outside sources into a user’s own databases, preventing data from being siloed by outdated software or application licenses. Extraction gives users more control over their data.
- Agility – It allows the consolidation of information into a centralized system to unify multiple data sets.
- Simplified Sharing – It offers an easy way to provide helpful yet appropriately limited data access when needed, and to share data in a common, usable format.
- Accuracy and Precision – Errors are more likely with manual processes and hand-coding; entering, editing, and re-entering huge volumes of data by hand hurts data integrity. A tool automates these processes, reducing errors and the time spent resolving them.
Exploring Raw Data
Raw data is data collected from a source that has not yet been processed for use. Readily available data is typically not in a state in which it can be used efficiently: it is difficult to manipulate and usually needs some processing before it can support data analysis and extraction. Such data is referred to as raw data or source data.
In this blog, we are going to perform data extraction and data manipulation in R on the ‘Census Income’ dataset from the UCI Machine Learning Repository, which contains the income information of over 48,000 individuals, taken from the 1994 US census.
To import the dataset:
census <- read.csv("C:\\Users\\Intellipaat-Team\\Desktop\\census-income.csv")
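Note that since R 4.0, read.csv() defaults to stringsAsFactors = FALSE, so character columns are no longer turned into factors automatically. To reproduce the factor columns assumed later in this blog on a recent R version, pass the argument explicitly:
census <- read.csv("C:\\Users\\Intellipaat-Team\\Desktop\\census-income.csv", stringsAsFactors = TRUE)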
To understand the structure of the dataset, we often use the following functions in R Programming:
- class() – class of data object
- str() – compact display of the internal structure of data
- summary() – summary of data
- names() – column names
- dim() – dimensions of data
- head() – view of the data from the top
- tail() – view of the data from the bottom
For example:
> class(census)
[1] "data.frame"
> dim(census)
[1] 30162 15
> names(census)
[1] "age" "workclass" "fnlwgt" "education" "education.num"
[6] "marital.status" "occupation" "relationship" "race" "sex"
[11] "capital.gain" "capital.loss" "hours.per.week" "native.country" "X"
> head(census) #First six rows (output truncated to the first five columns)
age workclass fnlwgt education education.num
1 39 State-gov 77516 Bachelors 13
2 50 Self-emp-not-inc 83311 Bachelors 13
3 38 Private 215646 HS-grad 9
4 53 Private 234721 11th 7
5 28 Private 338409 Bachelors 13
6 37 Private 284582 Masters 14
Removing Leading and Trailing Whitespaces
While dealing with character columns in a dataset, there can be situations where whitespaces appear at the beginning and/or end of the strings. These whitespaces can cause issues when performing data extraction and various operations such as sorting and subsetting.
In this blog, we will use various functions from the stringr and the dplyr packages to remove any whitespaces from the columns.
If the whitespaces exist in a factor column that contains characters, then we first need to convert the factor column to a character column before we can remove those whitespaces. After the whitespaces are removed, we can convert those columns back to factors.
For example:
In the ‘census’ dataset, the factor columns that contain whitespaces are first converted to character columns before removing the whitespaces.
#Convert factor columns to character columns
census$workclass <- as.character(census$workclass)
census$occupation <- as.character(census$occupation)
census$native.country <- as.character(census$native.country)
census$education <- as.character(census$education)
census$marital.status <- as.character(census$marital.status)
census$relationship <- as.character(census$relationship)
census$race <- as.character(census$race)
census$sex <- as.character(census$sex)
census$X <- as.character(census$X)
To remove whitespaces from the above columns, we will use the mutate_if and the str_trim functions from the dplyr and the stringr packages, respectively.
library("dplyr")
library(stringr)
census %>%
mutate_if(is.character, str_trim) -> census
After performing the above operation, all the leading and trailing whitespaces will be removed.
Now, we need to convert the above columns back to factors to get back to the original structure of the dataset.
#Convert character columns back to factors
census$workclass <- as.factor(census$workclass)
census$occupation <- as.factor(census$occupation)
census$native.country <- as.factor(census$native.country)
census$education <- as.factor(census$education)
census$marital.status <- as.factor(census$marital.status)
census$relationship <- as.factor(census$relationship)
census$race <- as.factor(census$race)
census$sex <- as.factor(census$sex)
census$X <- as.factor(census$X)
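As a side note, the convert-trim-convert round trip above can be collapsed into a single step. Here is a minimal sketch of that alternative, using the same dplyr and stringr functions:
#Trim whitespaces in all factor columns in one pass
census %>%
  mutate_if(is.factor, ~ factor(str_trim(as.character(.)))) -> census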
Missing and Special Values
While performing data extraction, we might encounter missing values or special values in the dataset. When a piece of information is not available, we call it a missing value.
Often, in data analysis, we want to get a sense of how many complete observations we have. This helps in deciding how to handle observations with missing data points.
In R, the missing values are represented as NA, but these missing values may appear in other forms as well.
There are also certain special values that we often encounter in a dataset, as the short sketch after this list shows.
For example:
- Inf – an infinite value, e.g., the result of 1/0
- NaN – not a number, e.g., the result of 0/0
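Here is a quick illustration of these special values at the R console:
> 1/0
[1] Inf
> 0/0
[1] NaN
> is.nan(0/0)
[1] TRUE
> is.finite(1/0)
[1] FALSE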
To find missing values in a data frame, we can use the following functions:
- is.na() – indicates which elements are missing
- any(is.na()) – indicates if there are any NAs
- sum(is.na()) – counts the number of NAs
- summary() – finds NAs
To deal with NAs, we can use the following functions:
- complete.cases() – to find rows with no missing values
- na.omit() – to remove rows with NAs
For example:
df <- data.frame(A = c(2, NA, 9, NA),
                 B = c(4, NA, 55, 66),
                 C = c(3, 44, 1, 22))
> is.na(df)
A B C
[1,] FALSE FALSE FALSE
[2,] TRUE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] TRUE FALSE FALSE
> any(is.na(df))
[1] TRUE
> sum(is.na(df))
[1] 3
> complete.cases(df)
[1] TRUE FALSE TRUE FALSE
> df[complete.cases(df),]
A B C
1 2 4 3
3 9 55 1
> na.omit(df)
A B C
1 2 4 3
3 9 55 1
In the ‘census’ dataset, the missing values are represented by the “?” symbol. So, in order to remove these values, we will first convert them to NA values and then perform any one of the above operations to remove the NA values using the following code:
#Replace "?" with NA
census[census == "?"] <- NA
#Remove NA
census <- na.omit(census)
> sum(is.na(census))
[1] 0
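Alternatively, read.csv() can treat such symbols as missing at import time through its na.strings argument; strip.white = TRUE additionally trims the surrounding whitespaces discussed earlier. A minimal sketch, assuming the same file path used above:
#Treat "?" as NA while importing, and trim whitespaces
census <- read.csv("C:\\Users\\Intellipaat-Team\\Desktop\\census-income.csv", na.strings = "?", strip.white = TRUE)
census <- na.omit(census)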
Data Extraction in R
After completing the data cleaning step, our data is now in a clean and tidy format that can be used to perform data extraction and data analysis.
First, we will use the base R functions to extract rows and columns from a data frame. While performing data analysis or working on Data Science projects, these commands come in handy to extract information from a dataset.
In this blog, we will use the indexing features in R to perform data extraction on the ‘census’ dataset.
For example:
#select columns age, education, sex
mycol <- c("age", "education", "sex")
> head(census[mycol])
age education sex
1 39 Bachelors Male
2 50 Bachelors Male
3 38 HS-grad Male
4 53 11th Male
5 28 Bachelors Female
6 37 Masters Female
# First row, and second and third columns
> census[1, 2:3]
workclass fnlwgt
1 State-gov 77516
# First 6 rows and the second column, as a data frame (drop = FALSE keeps the data frame structure)
> census[1:6, 2, drop = FALSE]
workclass
1 State-gov
2 Self-emp-not-inc
3 Private
4 Private
5 Private
6 Private
#Element at 5th row, tenth column
> census[5, 10]
Female
Levels: Female Male
# exclude columns age, occupation, and race
mycols <- names(census) %in% c("age", "occupation", "race")
newdata <- census[!mycols]
# exclude 3rd and 5th column
newdata <- census[c(-3,-5)]
# delete columns fnlwgt and education.num
census$fnlwgt <- census$education.num <- NULL
#selecting rows based on column values
newdata <- census[which(census$sex == 'Female' & census$age > 65), ]
Subset Function
We can also use the subset() function from the base R package to select variables and observations.
#selecting rows where age is at least 50 or less than 20, keeping only the age and education columns
newdata <- subset(census, age >= 50 | age < 20, select=c(age, education))
#selecting rows where education is Bachelors and yearly income (X) is >50K, keeping only the education and capital.gain columns
newdata <- subset(census, education == "Bachelors" & X == ">50K", select = c(education, capital.gain))
> head(newdata)
education capital.gain
10 Bachelors 5178
12 Bachelors 0
26 Bachelors 0
46 Bachelors 0
54 Bachelors 0
73 Bachelors 0
Sample Function
We use the sample() function from the base R package to extract a random sample of size n from a dataset.
#To sample 20 random rows from the “census” dataset without replacement.
mysample <- census[sample(1:nrow(census), 20, replace=FALSE),]
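Since sample() is random, every run will pick different rows. To make the sample reproducible, set a seed first; the seed value below is arbitrary:
#Fix the random seed for reproducible sampling
set.seed(123)
mysample <- census[sample(1:nrow(census), 20, replace=FALSE),]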
Table Function
We use the table() function to calculate the occurrences of unique values of a variable and display the frequency table of the categories of that variable.
#To print a frequency distribution of the “workclass” column.
freq.table <- table(census$workclass)
freq.table
Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc
943 2067 22286 1074 2499
State-gov Without-pay
1279 14
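To view the same distribution as proportions instead of raw counts, the base R prop.table() function can be applied to the frequency table:
#Relative frequencies of the workclass categories, rounded to 3 decimals
round(prop.table(freq.table), 3)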
Data Extraction in R with dplyr
The dplyr package contains various functions that are specifically designed for data extraction and data manipulation. These functions are often preferred over their base R counterparts because they process data faster and provide a consistent grammar for data extraction, exploration, and transformation.
Some of the important functions for data manipulation in R are as follows:
- select(): to select columns (variables)
- filter(): to filter (subset) rows
- mutate(): to create new columns
- summarise(): to summarize (or aggregate) data
- group_by(): to group data
- arrange(): to sort data
- left_join(), inner_join(), etc.: to join data frames (see the join sketch at the end of this section)
Now, let’s explore some examples of data extraction on the ‘census’ dataset using the dplyr package.
Select Function
#To select the following columns
mycols <- select(census, age, education, occupation)
head(mycols)
Output:
age education occupation
1 39 Bachelors Adm-clerical
2 50 Bachelors Exec-managerial
3 38 HS-grad Handlers-cleaners
4 53 11th Handlers-cleaners
5 28 Bachelors Prof-specialty
6 37 Masters Exec-managerial
#To select all columns from education to relationship
mycols <- select(census, education:relationship)
#To print first 5 rows
head(mycols, 5)
Output:
education education.num marital.status occupation relationship
1 Bachelors 13 Never-married Adm-clerical Not-in-family
2 Bachelors 13 Married-civ-spouse Exec-managerial Husband
3 HS-grad 9 Divorced Handlers-cleaners Not-in-family
4 11th 7 Married-civ-spouse Handlers-cleaners Husband
5 Bachelors 13 Married-civ-spouse Prof-specialty Wife
#To select columns with numeric indexes (columns 6 to 9 of the census dataset)
mycols <- select(census, 6:9)
head(mycols)
Output:
marital.status occupation relationship race
1 Never-married Adm-clerical Not-in-family White
2 Married-civ-spouse Exec-managerial Husband White
3 Divorced Handlers-cleaners Not-in-family White
4 Married-civ-spouse Handlers-cleaners Husband Black
5 Married-civ-spouse Prof-specialty Wife Black
6 Married-civ-spouse Exec-managerial Wife White
Filter Function
#To print the last 5 rows with education as Bachelors and sex as Male
myrows <- filter(census, education == "Bachelors", sex == "Male")
tail(myrows,5)
#To print first 5 rows with occupation as Sales and race as White using pipe operator and select only the first 3 columns
census %>% filter(occupation == "Sales", race == "White") %>% select(1:3) -> myrows
head(myrows,5)
Arrange Function
#To arrange age in ascending order and display only the age column
census %>% arrange(age) %>% select(age)
#To arrange hours.per.week in descending order
census %>% arrange(desc(hours.per.week)) %>% select(hours.per.week)
Count Function
#To get the count of “workclass” column
census %>% count(workclass)
Output:
# A tibble: 7 x 2
workclass n
<fct> <int>
1 Federal-gov 943
2 Local-gov 2067
3 Private 22286
4 Self-emp-inc 1074
5 Self-emp-not-inc 2499
6 State-gov 1279
7 Without-pay 14
Summarise Function
#To find the mean of capital.gain column
census %>% summarise(Mean= mean(capital.gain))
Mean
1 1092.008
Group By Function
#To find the mean of capital.gain column grouped according to the “workclass” column
census %>% group_by(workclass) %>% summarise(Mean= mean(capital.gain))
Output:
# A tibble: 7 x 2
workclass Mean
<fct> <dbl>
1 Federal-gov 832.
2 Local-gov 829.
3 Private 880.
4 Self-emp-inc 4811.
5 Self-emp-not-inc 1913.
6 State-gov 684.
7 Without-pay 488.
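Join Functions
dplyr does not provide a single join() function; instead, it offers a family of joins such as left_join(), inner_join(), and full_join(). Here is a minimal sketch of left_join() using a small, made-up lookup table; the sector labels are purely illustrative and not part of the census dataset:
#A hypothetical lookup table mapping workclass to a broader sector
lookup <- data.frame(workclass = c("Private", "State-gov", "Federal-gov"),
                     sector = c("Private", "Government", "Government"))
#Attach the sector label to each census row; unmatched rows get NA
census_joined <- left_join(census, lookup, by = "workclass")
head(select(census_joined, workclass, sector))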
Conclusion
In this blog, we walked through the data extraction process using R programming and the different steps involved. First, we cleaned the data using techniques that transform a dirty dataset into a tidy one that is easy to work with. Then, we performed various data manipulation operations in R, both with base R functions and with the dplyr package.