Pandas is a simple yet powerful Python library for data analysis, and one of the most widely used Python packages. It provides data structures and tools for effective data manipulation and analysis. Python with Pandas is used in many fields, including commerce, academia, economics, finance, analytics, and statistics. If you are going to work with data using Python, you need to learn Pandas as well.
Some of the key features of Pandas are:
In this tutorial, we will use Pandas to analyse data on product reviews from Amazon, a popular eCommerce website. The dataset consists of information about various product reviews on the Amazon website.
While analysing the product reviews, we will learn how to implement key Pandas concepts like indexing, plotting etc.
The data is in the CSV (Comma Separated Values) format: the values in each record are separated by a comma "," and rows are separated by a new line. There are approximately 1841 rows, including a header row, and 10 columns in the file.
Before we get started, a quick note on the software environment: in this tutorial, we will be using Python 3.5. Our examples will be done using Jupyter notebook.
So, let’s begin with
The very first and most important operation to keep in mind is importing the Pandas library properly.
By convention, we abbreviate it as pd while importing.
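As a minimal sketch, the import with the conventional alias looks like this. Printing the version is only a quick check that the import worked and is not part of the original tutorial.

```python
import pandas as pd  # "pd" is the conventional alias for pandas

# Printing the installed version confirms that the import succeeded
print(pd.__version__)
```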
A Series can contain any type of data, including mixed types. Now let’s have a look at how we can create series objects in Pandas with some examples.
To make sure that the object we have just created is indeed a Series object, we can use type() on it.
Further, we can specify the index of the Series object as shown below:
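A small sketch of creating Series objects and checking their type could look like the following. The names series1 and series2 and the values are made up for illustration.

```python
import pandas as pd

# A Series built from a plain Python list (made-up values)
series1 = pd.Series([1, 2, 3, 4])
print(series1)

# Mixed types are allowed; pandas then stores the values with dtype "object"
series2 = pd.Series([1, "two", 3.0])
print(series2.dtype)

# Confirm that the object really is a Series
print(type(series1))
```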
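One way to set the index, sketched with made-up labels and values, is the index argument of pd.Series:

```python
import pandas as pd

# The index argument replaces the default 0, 1, 2, ... labels
series3 = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(series3)

# Values can now be looked up by label
print(series3["b"])
```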
DataFrames in pandas are defined as two-dimensional labeled data structures with columns of potentially different types.
Create a DataFrame from a List: Let’s take a List of integers and then create a DataFrame from that List.
Create a DataFrame from a List of Lists:
Now we will create a DataFrame from a List of Lists.
Create a DataFrame from a Dict: We can also create DataFrames with the help of a Dictionary.
Note: Now that we are familiar with DataFrame and Series objects, keep in mind that each column in a DataFrame is a Series object.
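A minimal sketch with a made-up list of integers (the variable names are illustrative):

```python
import pandas as pd

numbers = [1, 2, 3, 4, 5]

# Each list element becomes one row; the single column gets the default label 0
df_from_list = pd.DataFrame(numbers)
print(df_from_list)
```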
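As a sketch, each inner list becomes one row. The data and the column names passed through the columns argument are made up for illustration.

```python
import pandas as pd

rows = [[1, "a"], [2, "b"], [3, "c"]]

# Each inner list is a row; columns names the resulting columns
df_from_rows = pd.DataFrame(rows, columns=["number", "letter"])
print(df_from_rows)
```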
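The dictionary approach and the note above can be sketched together. The dictionary keys become column labels, and selecting a single column gives back a Series. The data here is made up.

```python
import pandas as pd

data = {"fruit": ["apple", "banana", "cherry"], "count": [3, 5, 7]}

# Keys become column labels, values become the column contents
df_from_dict = pd.DataFrame(data)
print(df_from_dict)

# A single column of a DataFrame is a Series
print(type(df_from_dict["count"]))
```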
Here, we will first read the data. The data is stored as a csv, that is, comma-separated values, where each column is separated by a comma "," and each row by a new line. Here are the first few rows of the Amazon_Products_Review.csv file:
As you can see, each row in the data represents a single product that was reviewed on Amazon. Here, we also have a leading column that contains row index values. For now, we will not discuss this column, but later we’ll dive into what index values are. To work with the data in Python, the first step is to import the file into a Pandas DataFrame. A DataFrame is a way to represent and work with tabular data, that is, data with rows and columns.
Our file is in .csv format, so the pd.read_csv() function is going to help us read the data stored in that file. This function takes a csv file as input and returns a DataFrame as output.
Let’s inspect the type of Product_Review by using the type() function.
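The reading step can be sketched as follows. Because the real file is not reproduced here, the sketch reads a few made-up rows from an in-memory string; with the actual file, the call would be pd.read_csv("Amazon_Products_Review.csv").

```python
import io
import pandas as pd

# Made-up rows standing in for Amazon_Products_Review.csv (not the real data).
# The empty first header cell mimics the leading index column of the file.
csv_text = """,Product_Title,Product_Category,Product_Rating,Product_Launch_year
0,Shoe A,Footwear,4.5,2016
1,Phone C,Electronics,3.8,2016
2,Sandal D,Footwear,5.0,2014
"""

# read_csv accepts a file path or any file-like object
Product_Review = pd.read_csv(io.StringIO(csv_text))

print(type(Product_Review))
print(Product_Review.head())
```

Note that pandas labels a column with an empty header as "Unnamed: 0".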
For file types other than .csv, the importing conventions are mentioned below:
Working with DataFrame:
Now that the DataFrame is ready, let’s have a look at some of the operations in pandas.
Notice that everything has been read in properly: we have 1840 rows and 11 columns, counting the leading index column.
One of the big advantages of Pandas over NumPy is that Pandas allows us to have columns with different data types. Here, Product_Review has columns that store float values, like Product_Rating; string values, like Product_Review_Phrase; and integers, like Product_Launch_year.
Now that the data is read properly, we will work on indexing Product_Review, so that we can get the rows and columns we need.
Now, let’s say we want to select and have a look at a chunk of data from our DataFrame. There are two ways of achieving this: first, selecting by position, and second, selecting by label.
Selecting by Position: Using iloc we can retrieve rows and columns by position. Here we need to specify the positions of the rows and columns.
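One way to check the number of rows and columns is the shape attribute, sketched here on a small made-up DataFrame standing in for the real data:

```python
import pandas as pd

# Made-up sample rows; the real DataFrame comes from Amazon_Products_Review.csv
Product_Review = pd.DataFrame({
    "Product_Title": ["Shoe A", "Shoe B", "Phone C", "Sandal D", "Watch E", "Boot F"],
    "Product_Category": ["Footwear", "Footwear", "Electronics", "Footwear", "Accessories", "Footwear"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 3.0, 4.1],
    "Product_Launch_year": [2016, 2015, 2016, 2014, 2016, 2013],
})

print(Product_Review.head())  # first rows (up to five by default)
print(Product_Review.shape)   # (number of rows, number of columns)
```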
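The per-column data types can be inspected with the dtypes attribute. The sketch below uses made-up rows with the column names mentioned above.

```python
import pandas as pd

# Made-up sample rows using column names from the tutorial
Product_Review = pd.DataFrame({
    "Product_Review_Phrase": ["Great fit", "Too small", "Good value"],
    "Product_Rating": [4.5, 2.0, 3.8],
    "Product_Launch_year": [2016, 2015, 2016],
})

# One dtype per column: text, float and integer columns live side by side
print(Product_Review.dtypes)
```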
Suppose we want only the first column out of the DataFrame. Then we would use iloc on the DataFrame as shown below:
This snippet of code shows that we want to have a look at all the rows of the first column. Keep in mind that positions start at 0, so the first column (or first row) is at position 0. As we wanted all the rows, we specified just a colon ":" without mentioning any position.
Again, say we want to have a look at the first 5 rows of the 4th column. We need to specify the position of the rows as 0:5, which means that we want to view the first 5 rows, from position 0 to 4 (note that position 5 is excluded).
Instead of writing 0:5, we can also leave off the first position value and write :5, which has the same meaning. If we write 0: instead, it means everything from position 0 to the last position.
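A sketch of this selection on made-up rows follows. The "Unnamed: 0" column stands in for the leading index column of the real file.

```python
import pandas as pd

# Made-up sample rows; "Unnamed: 0" mimics the leading index column
Product_Review = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Product_Title": ["Shoe A", "Phone C", "Sandal D"],
    "Product_Rating": [4.5, 3.8, 5.0],
})

# ":" selects all rows, 0 selects the first column
first_column = Product_Review.iloc[:, 0]
print(first_column)
```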
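The slicing rules above can be sketched on made-up rows. Here the 4th column (position 3) happens to be Product_Rating.

```python
import pandas as pd

# Made-up sample rows; "Unnamed: 0" mimics the leading index column
Product_Review = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4, 5],
    "Product_Title": ["Shoe A", "Shoe B", "Phone C", "Sandal D", "Watch E", "Boot F"],
    "Product_Category": ["Footwear", "Footwear", "Electronics", "Footwear", "Accessories", "Footwear"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 3.0, 4.1],
    "Product_Launch_year": [2016, 2015, 2016, 2014, 2016, 2013],
})

first_five = Product_Review.iloc[0:5, 3]    # rows at positions 0..4 of the 4th column
same_five = Product_Review.iloc[:5, 3]      # leaving off the start means "from 0"
all_rows = Product_Review.iloc[0:, 3]       # leaving off the end means "to the last row"

print(first_five)
```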
Now let’s update our DataFrame by removing the first column, which contains no useful information.
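One possible way to drop the first column by position is to keep everything from position 1 onward. This is a sketch on made-up rows; the original tutorial may have used a different call.

```python
import pandas as pd

# Made-up sample rows; "Unnamed: 0" mimics the leading index column
Product_Review = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Product_Title": ["Shoe A", "Phone C", "Sandal D"],
    "Product_Rating": [4.5, 3.8, 5.0],
})

# Keep all rows, and all columns from position 1 to the end
Product_Review = Product_Review.iloc[:, 1:]
print(Product_Review.head())
```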
If you are familiar with NumPy indexing, you might have noticed that it is quite similar to Pandas indexing by position. But unlike NumPy, each of the columns and rows in Pandas has a label. Selecting by position is easy, but for large DataFrames, keeping track of columns and their corresponding positions becomes complicated. That’s when our next method of indexing comes in handy.
Selecting by Label:
The second method is selecting by the labels of the rows and columns. .loc allows us to index using labels instead of positions. Unlike iloc, a slice in .loc includes its end label. Let’s illustrate this method with some examples.
(Selecting some rows of one column)
Display the first five rows of the Product_Title using .loc method like this:
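A sketch on made-up rows follows. It assumes the default row labels 0, 1, 2, and so on. Because .loc slices include the end label, :4 returns the five rows labelled 0 to 4.

```python
import pandas as pd

# Made-up sample rows with the default row labels 0..5
Product_Review = pd.DataFrame({
    "Product_Title": ["Shoe A", "Shoe B", "Phone C", "Sandal D", "Watch E", "Boot F"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 3.0, 4.1],
})

# Label-based slice: the end label 4 is included, giving five rows
titles = Product_Review.loc[:4, "Product_Title"]
print(titles)
```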
(selecting some rows of more than one column)
Display the first five rows of the column Product_Title and Product_Rating
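To select more than one column, a list of column labels can be passed to .loc. Again, this is a sketch on made-up rows with the default row labels.

```python
import pandas as pd

# Made-up sample rows with the default row labels 0..5
Product_Review = pd.DataFrame({
    "Product_Title": ["Shoe A", "Shoe B", "Phone C", "Sandal D", "Watch E", "Boot F"],
    "Product_Category": ["Footwear", "Footwear", "Electronics", "Footwear", "Accessories", "Footwear"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 3.0, 4.1],
})

# Rows labelled 0..4 (end label included) of two columns
subset = Product_Review.loc[:4, ["Product_Title", "Product_Rating"]]
print(subset)
```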
Apart from indexing, another very simple yet useful feature offered by Pandas is sorting a DataFrame. To get a clear idea of the sorting feature, let’s look at the following examples.
We can sort the DataFrame based on the values of a column. Say we want to sort by the Product_Rating column; the sort_values() method does this in ascending order by default.
To sort by the Product_Rating column in descending order instead, pass ascending=False.
There are some special methods available in Pandas which make our calculations easier. Let’s apply those methods to our Product_Review DataFrame.
1) Mean of each column in our DataFrame.
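Both sort orders can be sketched with sort_values on made-up rows. The variable names are illustrative.

```python
import pandas as pd

# Made-up sample rows
Product_Review = pd.DataFrame({
    "Product_Title": ["Shoe A", "Shoe B", "Phone C", "Sandal D"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0],
})

# Ascending order is the default; a new, sorted DataFrame is returned
sorted_asc = Product_Review.sort_values("Product_Rating")

# ascending=False gives descending order
sorted_desc = Product_Review.sort_values("Product_Rating", ascending=False)

print(sorted_asc)
print(sorted_desc)
```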
2) Median of each column in our DataFrame.
3) Standard deviation of each column in our DataFrame.
4) Maximum value of each column in our DataFrame.
5) Minimum of each column in our DataFrame
6) Number of non-null values in each DataFrame column
7) Summary statistics for numerical columns
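The seven methods listed above can be sketched as follows on made-up rows. Passing numeric_only=True is an assumption made here so that text columns are skipped; the original tutorial may have called the methods without it.

```python
import pandas as pd

# Made-up sample rows
Product_Review = pd.DataFrame({
    "Product_Title": ["Shoe A", "Shoe B", "Phone C", "Sandal D", "Watch E", "Boot F"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 3.0, 4.1],
    "Product_Launch_year": [2016, 2015, 2016, 2014, 2016, 2013],
})

means = Product_Review.mean(numeric_only=True)      # 1) mean
medians = Product_Review.median(numeric_only=True)  # 2) median
stds = Product_Review.std(numeric_only=True)        # 3) sample standard deviation
maxima = Product_Review.max(numeric_only=True)      # 4) maximum
minima = Product_Review.min(numeric_only=True)      # 5) minimum
counts = Product_Review.count()                     # 6) non-null values per column
summary = Product_Review.describe()                 # 7) summary statistics, numeric columns

print(summary)
```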
We can also perform mathematical operations on Series Objects or DataFrame objects.
For example, we can divide every value in the Product Rating column by 2.
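Note: The common arithmetic operators that work in Python, like +, -, *, /, and ** (exponentiation), also work element-wise on a DataFrame or a Series object. Keep in mind that in Python ^ is the bitwise XOR operator, not a power operator.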
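A short sketch of element-wise arithmetic on a made-up Product_Rating column:

```python
import pandas as pd

# Made-up sample rows
Product_Review = pd.DataFrame({"Product_Rating": [4.5, 2.0, 3.8, 5.0]})

# The operation is applied to every value in the column
half_rating = Product_Review["Product_Rating"] / 2
squared_rating = Product_Review["Product_Rating"] ** 2  # ** is exponentiation

print(half_rating)
```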
Now that we have learnt about mathematical operations in Pandas, let’s have a look at the filtering methods available in Pandas and use them in our DataFrames.
Say, I want to find all the Footwear that has a Product Rating greater than 3.
First, let us generate a Boolean Series with our filtering condition and see the first 5 results.
Now that we have the Boolean Series, we use it to select only the rows of the DataFrame where the Series contains the value True. This gives us the rows in Product_Review where Product_Rating is greater than 3:
Let’s make it a bit more complicated by adding more than one condition. Since we wanted to have a look at the Footwear that has a Product Rating greater than 3, we will add a second condition on the Product_Category column of the DataFrame.
In the example shown above we have seen filtering conditions combined with the AND Boolean operator (&). Similarly, the OR operator (|) can also be applied when necessary.
Till now we have learnt how to do data manipulation using the Pandas library. The Pandas library also offers data visualization features for a better understanding of the data. Let us see how data visualization with Pandas works.
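These two steps can be sketched on made-up rows. The names is_high and high_rated are illustrative.

```python
import pandas as pd

# Made-up sample rows
Product_Review = pd.DataFrame({
    "Product_Title": ["Shoe A", "Shoe B", "Phone C", "Sandal D", "Watch E", "Boot F"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 3.0, 4.1],
})

# Step 1: a Boolean Series, True where the condition holds
is_high = Product_Review["Product_Rating"] > 3
print(is_high.head())

# Step 2: use it to keep only the rows marked True
high_rated = Product_Review[is_high]
print(high_rated)
```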
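Combining conditions can be sketched as follows on made-up rows. Each condition needs its own parentheses, because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Made-up sample rows
Product_Review = pd.DataFrame({
    "Product_Title": ["Shoe A", "Shoe B", "Phone C", "Sandal D", "Watch E", "Boot F"],
    "Product_Category": ["Footwear", "Footwear", "Electronics", "Footwear", "Accessories", "Footwear"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 3.0, 4.1],
})

is_footwear = Product_Review["Product_Category"] == "Footwear"
is_high = Product_Review["Product_Rating"] > 3

# AND: Footwear rated above 3
footwear_high = Product_Review[(Product_Review["Product_Category"] == "Footwear")
                               & (Product_Review["Product_Rating"] > 3)]

# OR: rows that are Footwear, rated above 3, or both
footwear_or_high = Product_Review[is_footwear | is_high]

print(footwear_high)
```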
Data visualization with Pandas can be carried out in the following ways:
1. Histogram
2. Scatter Plot
Note: Call %matplotlib inline to set up plotting inside the Jupyter notebook.
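A histogram of Footwear ratings can be sketched as follows on made-up rows, assuming matplotlib is installed. The Agg backend line is an assumption added so the sketch also runs outside a notebook; inside Jupyter, %matplotlib inline takes its place.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import pandas as pd

# Made-up sample rows
Product_Review = pd.DataFrame({
    "Product_Category": ["Footwear", "Footwear", "Electronics", "Footwear", "Footwear"],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 4.1],
})

# Keep only Footwear, then plot the frequency of each rating range
footwear = Product_Review[Product_Review["Product_Category"] == "Footwear"]
ax = footwear["Product_Rating"].plot(kind="hist")
ax.set_xlabel("Product_Rating")
```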
The histogram shown above gives the frequency of Footwear products for each Product Rating.
Analysis: Let us read the histogram. It appears that Footwear with a Product Rating of 5 is the most frequent. We can also say that Footwear with a low Product Rating is much less common.
2. Scatter Plot:
Now we will have a look at the scatter plot of the Product Ratings based on the Product Launch Year.
Note: Here both the x and y columns need to be numeric.
Analysis: From the scatter plot shown above we can analyze how the product rating varies for the products launched in the year 2016. It appears that low Product Ratings are fewer in number, as the density of points near the low Product Ratings is lower.
This brings us to the end of the Python Pandas Library Tutorial. In this tutorial, we have learnt how different features in the Pandas library work and how to use them for a better understanding of our data.
The Pandas library plays a very important role in Data Science and Data Analysis. To gain a deep understanding of Python libraries like Pandas and NumPy, our Data Science in Python Course is a must. It covers the various techniques for using Python in Data Science and works with various libraries for data cleaning, data manipulation, and data analysis in much more depth.
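A scatter plot of rating against launch year can be sketched as follows on made-up rows, assuming matplotlib is installed. The Agg backend line is an assumption added so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import pandas as pd

# Made-up sample rows; both plotted columns are numeric
Product_Review = pd.DataFrame({
    "Product_Launch_year": [2016, 2015, 2016, 2014, 2016, 2013],
    "Product_Rating": [4.5, 2.0, 3.8, 5.0, 3.0, 4.1],
})

# x and y are given as column labels
ax = Product_Review.plot(kind="scatter", x="Product_Launch_year", y="Product_Rating")
```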