Once the data has been scrubbed, you are ready to explore it. You can explore your data from three perspectives:
- Inspect the data and its properties
- Derive statistics from your data
- Create interesting visualizations
Inspect the data and its properties
If you want to examine the raw data, avoid cat, because it prints all the data to the screen in one go. To examine the raw data at your own pace, we recommend using less with the -S option:
$ less -S file.csv
The -S option ensures that long lines are not wrapped when they don’t fit in the terminal. Another advantage of less is that it does not load the entire file into memory, which makes it well suited to viewing large files.
- Feature Names and Data Types
To gain insight into the data set, it is useful to print the feature names and study them, since the names may indicate what each feature means. The following sed expression replaces every comma on the header line with a newline and then quits, so only the feature names are printed:
$ < data/iris.csv sed -e 's/,/\n/g;q'
sepal_length
sepal_width
petal_length
petal_width
species
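The same idea can be sketched with head and tr, which behave the same way across sed implementations. This is only an illustration: the sample file and its contents are made up, and splitting on every comma assumes that no header name contains a quoted comma.

```shell
# Create a tiny sample CSV (hypothetical data, for illustration only).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
EOF

# Take only the header line and put each comma-separated name on its own line.
feature_names=$(head -n 1 "$tmpdir/sample.csv" | tr ',' '\n')
printf '%s\n' "$feature_names"
```

Because head stops after the first line, this also works on files too large to read in full.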
- Unique Identifiers, Continuous Variables, and Factors
To find out whether a feature should be treated as a unique identifier or categorical variable, count the number of unique values for a particular column:
$ cat data/iris.csv | csvcut -c species | body "sort | uniq | wc -l"
species
3
If the number of unique values is small compared to the number of rows, the feature may be treated as categorical. If it is equal to the number of rows, the feature may be a unique identifier.
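The rule of thumb above can be sketched with standard tools only (no csvcut or body). The sample data is hypothetical, and the cutoff for "small" (at most half as many unique values as rows) is an arbitrary assumption chosen for this illustration, not a rule from the text.

```shell
# Hypothetical sample data: an id column and a species column.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.csv" <<'EOF'
id,species
1,setosa
2,setosa
3,versicolor
4,virginica
5,virginica
6,setosa
EOF

# Number of data rows (everything after the header line).
rows=$(tail -n +2 "$tmpdir/sample.csv" | wc -l | tr -d ' ')

# For each column, count the distinct values and compare with the row count.
report=$(
  for col in 1 2; do
    name=$(head -n 1 "$tmpdir/sample.csv" | cut -d, -f"$col")
    uniq_count=$(tail -n +2 "$tmpdir/sample.csv" | cut -d, -f"$col" | sort -u | wc -l | tr -d ' ')
    if [ "$uniq_count" -eq "$rows" ]; then
      kind="possible unique identifier"
    elif [ $((uniq_count * 2)) -le "$rows" ]; then
      # Assumed cutoff: "small" means at most half as many unique values as rows.
      kind="possible categorical feature"
    else
      kind="unclear, possibly continuous"
    fi
    echo "$name: $uniq_count unique of $rows rows ($kind)"
  done
)
echo "$report"
```

On real data the labels are only hints; a continuous measurement with few repeated values can also look like an identifier.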
Derive statistics from your data
- Using csvstat
The command-line tool csvstat gives a lot of information. For each feature it shows:
- The data type in Python terms
- The number of unique values
- Whether it has any missing values (Nulls)
- Descriptive statistics (min, max, sum, standard deviation, mean, and median) for those features for which they are appropriate.
$ csvstat data/datatypes.csv
<type 'int'> Nulls: False Values: 2, 66, 42
<type 'float'> Nulls: True Values: 0.0, 3.1415
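To see what a few of these statistics amount to, here is a rough sketch that computes min, max, sum, and mean for a single numeric column with awk. It is not how csvstat works internally. The values 2, 66, and 42 are taken from the output above, while the column name a and the file name are made up for the example.

```shell
# One numeric column with a header; the header name "a" is hypothetical.
tmpdir=$(mktemp -d)
cat > "$tmpdir/values.csv" <<'EOF'
a
2
66
42
EOF

# Skip the header, then track min, max, and sum in a single pass.
stats=$(tail -n +2 "$tmpdir/values.csv" | awk '
  NR == 1 { min = $1; max = $1 }
  { if ($1 < min) min = $1; if ($1 > max) max = $1; sum += $1 }
  END { if (NR > 0) printf "min=%s max=%s sum=%s mean=%.2f\n", min, max, sum, sum / NR }
')
echo "$stats"
```

For these three values the sketch prints min=2 max=66 sum=110 mean=36.67. It assumes every row holds a number; missing values would need extra handling.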
Create interesting visualizations
Two software packages are commonly used to create visualizations from the command line: Gnuplot and ggplot2.
Introducing Gnuplot and feedgnuplot
The first software package for creating visualizations is Gnuplot. It differs from most of the command-line tools we’ve been using in two ways. First, it uses a script instead of command-line arguments. Second, its output is always written to a file rather than printed to standard output. Gnuplot is nevertheless able to produce visualizations for the command line; that is, it can print its output to the terminal without the need for a GUI. Even then, you still need to set up a script.
Feedgnuplot, a command-line tool, can help us set up a script for Gnuplot. It is fully configurable through command-line arguments and reads from standard input.
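As a minimal sketch, the following pipes a small series of numbers into feedgnuplot and asks for a text rendering in the terminal. It assumes that both feedgnuplot and Gnuplot are installed. The options --terminal, --lines, and --exit are used as I understand them from the feedgnuplot documentation, so check them against your installed version. The data is a made-up series of squares.

```shell
# Stand-in data: the squares of 1..10, one value per line.
series=$(awk 'BEGIN { for (i = 1; i <= 10; i++) print i * i }')

# Plot only if feedgnuplot is available; otherwise just print the numbers.
if command -v feedgnuplot >/dev/null 2>&1; then
  # 'dumb 80,25' selects a text terminal of 80 columns by 25 rows,
  # --lines connects the points, and --exit quits once the data is plotted.
  printf '%s\n' "$series" | feedgnuplot --terminal 'dumb 80,25' --lines --exit
else
  printf '%s\n' "$series"
fi
```

No Gnuplot script has to be written by hand here, because feedgnuplot generates one from the command-line arguments.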
The second package, ggplot2, is a more modern software package for creating visualizations. It is an implementation of the grammar of graphics in R. When used through Rio, it is a very convenient way of creating visualizations from the command line.
Rio stands for R input/output, because it enables you to use R as a filter on the command line. You simply pipe CSV data into Rio and you specify the R commands that you want to run on it. Rio can execute multiple R commands that are separated by semicolons.
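A small sketch of that workflow follows. It assumes that Rio is installed, that it accepts R commands through an -e option, and that it exposes the piped CSV as a data frame named df. These details are assumptions on my part, so verify them against your copy of Rio. The CSV content and the column names x, y, and z are made up.

```shell
# Hypothetical CSV input held in a shell variable.
csv='x,y
1,2
2,4
3,6'

if command -v Rio >/dev/null 2>&1; then
  # Two R commands separated by a semicolon: add a column, then return
  # the data frame (assumed to be available under the name df).
  printf '%s\n' "$csv" | Rio -e 'df$z <- df$x + df$y; df'
else
  echo "Rio not found; skipping the example"
fi
```

The first command adds a column and the second returns the modified data frame, which shows how semicolons chain several R commands in one call.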