Techniques for Scrubbing or Cleaning Data in Data Science

Data obtained from real sources usually contains inconsistencies, errors, odd characters, missing values, or other problems. You therefore have to scrub, or clean, the data before using it.

The following techniques are used to scrub data in Data Science:

  • Filter lines
  • Extract certain columns or words
  • Replace values
  • Handle missing values
  • Convert data from one format to another

Filtering Lines

The first scrubbing operation is to filter lines: every line of the input data is evaluated to determine whether it should be passed on as output.

  • Based on location

The simplest way to filter lines is by their location. This is useful when you want to inspect, say, the top 5 lines of a file, or when you want to extract a particular row from the output of another command-line tool.
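
As a minimal sketch of location-based filtering, the standard tools head, tail, sed, and awk can each select lines by position. The file name lines.txt and its contents are made up for illustration.

```shell
# Create a small sample file with ten numbered lines (hypothetical data).
seq -f 'Line %g' 10 > lines.txt

# Print the first 3 lines.
head -n 3 lines.txt

# Print the last 3 lines.
tail -n 3 lines.txt

# Print only lines 4 through 6, using either sed or awk.
sed -n '4,6p' lines.txt
awk 'NR >= 4 && NR <= 6' lines.txt

# Print only line 7, e.g. to pick one row out of another tool's output.
seq -f 'Line %g' 10 | sed -n '7p'
```

sed and awk are handy when the range of interest sits in the middle of the input, while head and tail cover the common "top N" and "bottom N" cases.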

  • Based on pattern

To extract or remove lines based on their contents, use grep, the canonical command-line tool for filtering lines. It prints every line that matches a given pattern or regular expression.
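
The following sketch shows a few common ways to filter by pattern with grep. The file people.txt and its records are invented for this example; the options used (-F, -E, -v, -c, -i) are standard grep options.

```shell
# Hypothetical sample data: one comma-separated record per line.
printf '%s\n' 'alice,30,Paris' 'bob,25,London' 'carol,41,Paris' 'dave,,Berlin' > people.txt

# Keep lines that contain the fixed string "Paris".
grep -F 'Paris' people.txt

# Keep lines that match a regular expression: names starting with a, b, or c.
grep -E '^[abc]' people.txt

# Remove matching lines instead (-v inverts the match):
# here, drop records with an empty field.
grep -v ',,' people.txt

# Count matching lines (-c), ignoring case (-i).
grep -c -i 'paris' people.txt
```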

  • Based on randomness

When you are building a data pipeline and have a lot of data, debugging the pipeline can be cumbersome. In that case, sampling from the data can help. The main purpose of the command-line tool sample is to get a subset of the data by outputting only a certain percentage of the input on a line-by-line basis.
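
Where the sample tool is not available, the same idea can be approximated with awk. The function below is a rough sketch, not the sample tool itself: sample_lines is a hypothetical helper name, and the rate and seed arguments are assumptions made for this example.

```shell
# sample_lines RATE [SEED]: pass each input line through with probability RATE.
# Hypothetical helper; RATE is a fraction between 0 and 1.
sample_lines() {
  awk -v rate="$1" -v seed="${2:-1}" 'BEGIN { srand(seed) } rand() < rate'
}

# Keep roughly 10% of 1000 lines and show the first few.
# Which lines are kept depends on the awk implementation and the seed.
seq 1000 | sample_lines 0.1 42 | head -n 5

# GNU coreutils alternative: a fixed number of random lines
# instead of a percentage.
# seq 1000 | shuf -n 5
```

Fixing the seed makes a debugging run repeatable on the same machine, which is usually what you want while the pipeline is still changing.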

Replacing and Deleting Values

The command-line tool tr, which stands for translate, can be used to replace individual characters. For example, spaces can be replaced by commas as follows:

$ echo 'hello world!' | tr ' ' ','
hello,world!

If more than one character needs to be replaced, give tr two sets of equal length; each character in the first set is replaced by the character at the same position in the second:

$ echo 'hello world!' | tr ' !' ',?'
hello,world?

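tr can also delete individual characters with the -d option. Combined with -c, which takes the complement of the set, it deletes every character that is not in the set. The following keeps only lowercase letters (the trailing newline is deleted as well):

$ echo 'hello world!' | tr -d -c 'a-z'
helloworld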
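
A few further tr idioms, sketched here with made-up input strings, cover ranges, squeezing repeats, and POSIX character classes:

```shell
# Convert lowercase to uppercase by mapping one range onto another.
echo 'hello world!' | tr 'a-z' 'A-Z'
# HELLO WORLD!

# Squeeze runs of repeated spaces into a single space with -s.
echo 'hello    world!' | tr -s ' '
# hello world!

# Delete digits and punctuation using POSIX character classes.
echo 'h3llo, w0rld!' | tr -d '[:digit:][:punct:]'
# hllo wrld
```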

Working with CSV

The command-line tools used to scrub plain text, such as grep and tr, cannot always be applied to CSV, because they have no notion of headers, bodies, and columns. Three helper tools make it possible to leverage ordinary command-line tools for CSV: body, header, and cols. They are helper scripts rather than standard Unix utilities.
The first tool is body. It lets you apply any command-line tool to the body of a CSV file, that is, everything except the header.
For example:

$ echo -e "value\n7\n2\n5" | body sort -n
value
2
5
7
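
To make the idea behind body concrete, here is a rough sketch of how such a tool could be written as a shell function. The name body_sketch is hypothetical, and this is an illustration of the technique rather than the actual body script.

```shell
# body_sketch COMMAND [ARGS...]: print the first input line unchanged,
# then run COMMAND on the remaining lines.
body_sketch() {
  IFS= read -r header_line       # consume the header from standard input
  printf '%s\n' "$header_line"   # emit the header untouched
  "$@"                           # the command reads the rest of the input
}

# Sort only the body numerically; the header stays on top.
printf 'value\n7\n2\n5\n' | body_sketch sort -n
```

The trick is that read consumes only the first line from the shared standard input, so the command that follows sees just the body.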

The second tool, header, lets you manipulate the header of a CSV file. The third tool, cols, is similar to header and body: it lets you apply a command to only a subset of the columns.
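
If these helper tools are not installed, comparable effects can be approximated with sed, tail, and awk. The sketch below uses an invented file, scores.csv, and assumes a simple CSV with no quoted fields or embedded commas; it does not reproduce the header or cols tools themselves.

```shell
# Hypothetical two-column CSV.
printf 'name,score\nalice,7\nbob,2\n' > scores.csv

# Replace the header line only (line 1), leaving the body untouched.
sed '1s/.*/student,points/' scores.csv

# Drop the header entirely.
tail -n +2 scores.csv

# Transform a single column only: uppercase column 1 of the body.
awk 'BEGIN { FS = OFS = "," } NR > 1 { $1 = toupper($1) } { print }' scores.csv
```

For CSV files with quoted fields, a CSV-aware tool is a safer choice than splitting on commas.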
