What is Pandas in Python?

Pandas is a Python library that provides simple yet powerful tools for data science. It is one of the most widely used Python packages and includes data structures and tools for effective data manipulation and analysis. Pandas is used in both commercial and academic settings, in fields such as economics, finance, analytics, and statistics. If we are going to work with data in Python, Pandas is well worth learning.
Some of the key features of Python Pandas are as follows:

  • It provides DataFrame objects with default and customized indexing which is very fast and efficient.
  • There are tools available for loading data of different file formats into in-memory data objects.
  • It is easy to perform data alignment and integrated handling of missing data in Python Pandas.
  • It is very simple to perform pivoting and reshaping of data sets in Pandas.
  • It also provides indexing, label-based slicing, and sub-setting of large data sets.
  • We can easily insert and delete columns from a data structure.
  • Data aggregation and transformations can be done using group by.
  • High-performance merging and joining of data can be done using Pandas.
  • It also provides time series functionality.
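
As a small illustration of two of the features listed above (group-by aggregation and built-in handling of missing data), here is a minimal sketch. The `sales` DataFrame and its values are made up for this example and are not part of the tutorial's data set.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial's data set); None marks a missing rating
sales = pd.DataFrame({
    "category": ["Footwear", "Footwear", "Books", "Books"],
    "rating": [4.0, None, 3.0, 5.0],
})

# Group rows by category and average the ratings;
# missing values are skipped by default during aggregation
mean_by_category = sales.groupby("category")["rating"].mean()
print(mean_by_category)

# Count how many ratings are missing
missing_count = sales["rating"].isna().sum()
print(missing_count)
```

The missing rating does not break the calculation; it is simply left out of the average for its group.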


In this tutorial, we will use Pandas to analyze a data set of product reviews from Amazon, a popular e-commerce website. The data set contains the following information for each product:

  • Product_Review_Phrase: Description of a product according to its review
  • Product_Title: Product name
  • Website_URL: Link of the product on the website
  • Platform: Whether a product is available on the website/mobile app or on both
  • Product_Rating: Average product rating as per customers
  • Product_Category: Category of a product
  • Is_Amazon_Advantage_Product: Whether a product is a premium product or not
  • Product_Launch_Year: Year of launch of a product on the website
  • Product_Launch_Month: Month of launch of a product on the website
  • Product_Launch_Day: Day of launch of a product on the website


While analyzing the product reviews, we will learn how to apply key Pandas concepts such as indexing and plotting.

The data is in CSV (comma-separated values) format: the values within each record are separated by commas, and records (rows) are separated by new lines. There are approximately 1,841 rows, including a header row, and 10 columns in the file.

Before we start, a quick note on the software environment: this tutorial uses Python 3.5, and the examples are run in Jupyter Notebook.


Let’s begin!

How to Import Pandas in Python?

The first step is to import the Pandas library.

import pandas as pd

By convention, Pandas is imported under the alias pd.

Pandas Series Objects

A series object can contain any type of data, including mixed types. Now, let’s have a look at how we can create series objects in Python Pandas with some examples.

Example 1:

series1 = pd.Series([1, 2, 3, 4])
series1

To confirm that the object we have just created is indeed a series object, we can call type() on it.

type(series1)

Example 2:
Further, we can specify the index of the series object as shown below:

series2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
series2
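
Once a series has a custom index, its values can be looked up by label as well as by position. The following sketch reuses series2 from Example 2; the other variable names are chosen only for illustration.

```python
import pandas as pd

series2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Look up a value by its index label
by_label = series2["b"]

# Look up the same value by its integer position
by_position = series2.iloc[1]

# Slicing by label includes the end label, so this returns a, b, and c
label_slice = series2["a":"c"]

print(by_label, by_position)
print(label_slice)
```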


Python Pandas DataFrames

DataFrames in Pandas are defined as 2-dimensional labeled data structures with columns of potentially different Python Data types.

Creating a DataFrame from a list: 

Let’s take a list of integers and then create a DataFrame from that Python list.

list1 = [1, 2, 3, 4, 5]
df = pd.DataFrame(list1)  # a single column, labeled 0 by default
df

Creating a DataFrame from a list of lists:

Now, we will create a DataFrame using a list of lists.

list_of_lists = [['apple', 10], ['mango', 12], ['banana', 13]]
df = pd.DataFrame(list_of_lists, columns=['fruit', 'count'])
df

Creating a DataFrame from a dictionary:

We can also create DataFrames with the help of Python dictionaries.

dict1 = {'fruit': ['apple', 'mango', 'banana'], 'count': [10, 12, 13]}
df = pd.DataFrame(dict1)
df

Note: Now that we are familiar with DataFrames and series objects, keep in mind that each column in a DataFrame is a series object.
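
The note above can be checked directly. This is a minimal sketch using the fruit DataFrame from the dictionary example; fruit_column and total are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "mango", "banana"], "count": [10, 12, 13]})

# Selecting a single column with square brackets returns a series object
fruit_column = df["fruit"]
print(type(fruit_column))

# Series methods are therefore available on any column.
# Bracket notation is used for "count" because df.count is already a DataFrame method.
total = df["count"].sum()
print(total)
```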

Importing Data with Pandas in Python

Here, we will first read the data. The data is stored in the Amazon_Products_Review.csv file in CSV (comma-separated values) format, where the values in each record are separated by commas and each row is on a new line.
Each row in the data represents a single product reviewed on Amazon. The file also has a leading column that contains row index values. We will not discuss this column for now; later on, we’ll dive into what index values are. To work with data in Python, the first step is to import the file into a Pandas DataFrame. A DataFrame is a way to represent and work with tabular data, that is, data arranged in rows and columns.
Our file is in .csv format, so the pd.read_csv() function will help us read the data stored in it. This function takes a CSV file as input and returns a DataFrame.

import pandas as pd
Product_Review = pd.read_csv("Amazon_Products_Review.csv")

Let’s inspect the type of Product_Review using the type() function.

type(Product_Review)

For file types other than .csv, the corresponding reader functions are listed below:

  • pd.read_table("filename")
  • pd.read_excel("filename")
  • pd.read_sql(query, connection_object)
  • pd.read_json(json_string)
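
To try pd.read_csv() without the tutorial's file at hand, the sketch below reads a tiny, made-up CSV from an in-memory string (the two product rows are hypothetical). It also shows the effect of the optional index_col parameter on a file that, like ours, starts with a column of row index values.

```python
import io
import pandas as pd

# Hypothetical two-row file with a leading, unnamed index column
csv_text = ",Product_Title,Product_Rating\n0,Shoe A,4.5\n1,Shoe B,3.0\n"

# Default behavior: the leading column is kept as an ordinary data column
with_extra_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells read_csv to use the leading column as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra_column.shape)
print(indexed.shape)
```

Whether to read the leading column as data and drop it later, or to pass index_col=0 up front, is a matter of preference; this tutorial takes the first route.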

Working with DataFrames:

Now that our DataFrame is ready, let’s have a look at some of the operations in Pandas.

  • head(): This returns the first five rows of the DataFrame.
  • tail(): This returns the last five rows of the DataFrame.
  • shape: This attribute gives the number of rows and columns. In our DataFrame, we have 1,840 rows and 11 columns.
  • info(): This prints information about the index, column data types, and memory usage of the DataFrame.
  • describe(): This gives summary statistics for numerical columns.
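
The operations listed above can be tried on any DataFrame. The sketch below uses a small, made-up frame named toy (an assumption for illustration) in place of Product_Review.

```python
import pandas as pd

# Hypothetical seven-row stand-in for Product_Review
toy = pd.DataFrame({
    "Product_Title": ["A", "B", "C", "D", "E", "F", "G"],
    "Product_Rating": [4.5, 3.0, 5.0, 2.5, 4.0, 3.5, 4.5],
})

first_rows = toy.head()      # first five rows by default
last_two = toy.tail(2)       # pass a number to change how many rows are returned
n_rows, n_cols = toy.shape   # shape is an attribute, so no parentheses
toy.info()                   # prints a summary of the index, dtypes, and memory usage
stats = toy.describe()       # covers only the numeric columns by default

print(n_rows, n_cols)
print(stats)
```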

One of the big advantages of Python Pandas over Python NumPy is that Pandas in Python allows us to have columns with different data types. Here, in Product_Review, we have columns that store float values like Product_Rating, string values like Product_Review_Phrase, and integers like Product_Launch_Year.
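
The data type of each column can be inspected through the dtypes attribute. The following sketch uses a made-up two-row frame with the same three column names; the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows using three of the tutorial's column names
toy = pd.DataFrame({
    "Product_Rating": [4.5, 3.0],
    "Product_Review_Phrase": ["comfortable", "too tight"],
    "Product_Launch_Year": [2015, 2016],
})

# One data type per column: float, text, and integer live side by side
print(toy.dtypes)

rating_is_float = pd.api.types.is_float_dtype(toy["Product_Rating"])
year_is_integer = pd.api.types.is_integer_dtype(toy["Product_Launch_Year"])
```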

Now, as the data is read properly, we will work on indexing Product_Review so that we can get the rows and columns as per our requirement.

Indexing DataFrames with Pandas in Python

Now, let’s say, we want to select and have a look at a chunk of data from our DataFrame. There are two ways of achieving the same: First, selecting by position and, second, by label.

Selecting by Position: Using iloc, we can retrieve rows and columns by position. Here, we need to specify the positions of rows and columns.

Example 1:
Suppose, we want only the first column out of the DataFrame. Then, we would use iloc on the DataFrame as shown below:

Product_Review.iloc[:,0]

This snippet of code shows that we want to have a look at all rows of the first column. Keep in mind that the position of the first column (or the first row) always starts with 0. As we needed all the rows, we specified just a colon (:) without mentioning any position.

Example 2:

Again, imagine that we want to have a look at the first five rows of the fifth column, which is at position 4. We need to specify the position of the rows as 0:5, which means that we want to view the first five rows from the position 0 to the position 4 (note that the position 5 is excluded here). Also, instead of writing 0:5, we can leave off the first position value and write :5 (but if we write 0:, it means from position 0 to the last position).

Product_Review.iloc[0:5,4]

Similar examples:

  • iloc[:,:] – To view the entire DataFrame
  • iloc[6:,4:] – To view from row position 6 and column position 4 onward
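
The position-based selections above can be tried on a small frame. This sketch uses a hypothetical four-row DataFrame named toy rather than the real data set.

```python
import pandas as pd

# Hypothetical stand-in for Product_Review
toy = pd.DataFrame({
    "Product_Title": ["A", "B", "C", "D"],
    "Product_Rating": [4.5, 3.0, 5.0, 2.5],
    "Product_Launch_Year": [2014, 2015, 2016, 2016],
})

first_column = toy.iloc[:, 0]   # all rows of the column at position 0
top_two = toy.iloc[0:2, 1]      # rows at positions 0 and 1 of the column at position 1
block = toy.iloc[2:, 1:]        # from row position 2 and column position 1 onward

print(top_two)
print(block)
```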

Now, let’s update our DataFrame by removing the first column, which contains no useful information:

Product_Review = Product_Review.iloc[:, 1:]
Product_Review.head()

If you are familiar with NumPy indexing, you will notice that it is quite similar to Pandas indexing by position. But unlike NumPy, each of the columns and rows in Pandas has a label. Selecting by position is easy, but for large DataFrames, keeping track of columns and their corresponding positions becomes complicated. That’s when our next method of indexing comes in handy.

Selecting by Label:

The second method is selecting by the label of columns. The .loc method allows us to index using labels instead of positions. Let’s illustrate this method with some examples.

Example 1:

(Selecting some rows of one column)
Displaying the first five rows of the Product_Title column using .loc (unlike iloc, a .loc slice includes its end label, so :4 covers the labels 0 through 4):

Product_Review.loc[:4, "Product_Title"]

Example 2:
(Selecting some rows of more than one column)

Displaying the first five rows of the columns Product_Title and Product_Rating (the column labels are passed as a list):

Product_Review.loc[:4, ["Product_Title", "Product_Rating"]]
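
One detail worth checking when switching between the two indexing methods is how slices are bounded: a .loc slice includes its end label, while an iloc slice excludes its end position. The sketch below demonstrates this on a hypothetical six-row frame with the default integer index.

```python
import pandas as pd

# Hypothetical stand-in for Product_Review with the default index 0..5
toy = pd.DataFrame({
    "Product_Title": ["A", "B", "C", "D", "E", "F"],
    "Product_Rating": [4.5, 3.0, 5.0, 2.5, 4.0, 3.5],
})

# Label-based: the slice :4 includes label 4, giving five rows
by_label = toy.loc[:4, "Product_Title"]

# Position-based: the slice :4 stops before position 4, giving four rows
by_position = toy.iloc[:4, 0]

# Several columns are selected by passing a list of labels
two_columns = toy.loc[:4, ["Product_Title", "Product_Rating"]]

print(len(by_label), len(by_position))
```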

Sorting DataFrames with Pandas in Python

Apart from indexing, another very simple yet useful feature offered by Pandas in Python is the sorting of DataFrames. To get a clear idea of the sorting feature, let’s look at the following examples.

Sorting the DataFrame based on the values of a column:

Say, we want to sort the DataFrame by the values of the Product_Rating column.

Product_Review.sort_values(by='Product_Rating')

Now, suppose we want to sort the DataFrame by Product_Rating in descending order:

Product_Review.sort_values('Product_Rating', ascending=False)
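
A point that is easy to miss is that sort_values() returns a new, sorted DataFrame and leaves the original untouched. The following sketch shows this on a hypothetical three-row frame named toy.

```python
import pandas as pd

# Hypothetical stand-in for Product_Review
toy = pd.DataFrame({
    "Product_Title": ["A", "B", "C"],
    "Product_Rating": [3.0, 5.0, 4.0],
})

ascending = toy.sort_values(by="Product_Rating")
descending = toy.sort_values("Product_Rating", ascending=False)

print(descending)

# The original frame keeps its row order; assign the result to keep the sorted version
print(toy)
```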

Pandas in Python DataFrame Methods

There are some built-in methods in Pandas that make our calculations easier. Let’s apply them to our Product_Review DataFrame.
1) Mean of each numeric column in our DataFrame

Product_Review.mean(numeric_only=True)

Because Product_Review also contains text columns, numeric_only=True restricts the calculation to the numeric columns; recent Pandas versions raise an error otherwise.

2) Median of each numeric column in our DataFrame

Product_Review.median(numeric_only=True)

3) Standard deviation of each numeric column in our DataFrame

Product_Review.std(numeric_only=True)

4) Maximum value of each column in our DataFrame

Product_Review.max()

5) Minimum value of each column in our DataFrame

Product_Review.min()

6) Number of non-null values in each DataFrame column

Product_Review.count()

7) Summary statistics for numerical columns

Product_Review.describe()
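
The following sketch runs a few of these methods on a hypothetical frame that mixes a text column with numeric ones. It passes numeric_only=True to mean() and median() so that the text column is left out of the calculation.

```python
import pandas as pd

# Hypothetical stand-in for Product_Review with one text and two numeric columns
toy = pd.DataFrame({
    "Product_Title": ["A", "B", "C"],
    "Product_Rating": [3.0, 5.0, 4.0],
    "Product_Launch_Year": [2014, 2016, 2016],
})

means = toy.mean(numeric_only=True)      # one value per numeric column
medians = toy.median(numeric_only=True)
counts = toy.count()                     # non-null counts for every column

print(means)
print(counts)
```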

Mathematical Operations in Pandas Python

We can also perform mathematical operations on series objects or DataFrame objects.

For example, for dividing every value in the Product_Rating column by 2, we use the following code:

Product_Review["Product_Rating"] / 2

Note: All common arithmetic operators that work in Python, like +, -, *, /, and ** (exponentiation), will also work on a DataFrame or a series object.
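
Arithmetic on a series is applied to every element at once. The sketch below uses a hypothetical ratings series; note that exponentiation in Python is written with **.

```python
import pandas as pd

# Hypothetical ratings
ratings = pd.Series([4.0, 3.0, 5.0])

halved = ratings / 2      # divide every value by 2
shifted = ratings + 1     # add 1 to every value
squared = ratings ** 2    # raise every value to the power of 2

print(halved)
print(squared)
```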

Filtering DataFrames in Python Pandas

Now that we have learned about how to do mathematical operations in Pandas, let’s have a look at the filtering methods available in Pandas Python and use them in our DataFrame.

Say, we want to find footwear that has a Product_Rating greater than 3.
First, let us generate a Boolean series with our filtering condition and see the first five results.

filter1 = Product_Review["Product_Rating"] > 3
filter1.head()

Now that we have got the Boolean series, we use it to select only rows in a DataFrame where the series contains the value True so that we get the rows in Product_Review where Product_Rating is greater than 3.

filtered_new = Product_Review[filter1]
filtered_new.head()

Let’s make it a bit more complex by adding a second condition. Since we want to look at footwear with a Product_Rating greater than 3, we will now add a condition on the Product_Category column of the DataFrame.

filter2 = (Product_Review["Product_Rating"] > 3) & (Product_Review["Product_Category"] == "Footwear")
filtered_review = Product_Review[filter2]
filtered_review.head()

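In the example shown above, we combined filtering conditions with the AND Boolean operator (&). Similarly, the OR operator (|) can be applied when necessary. Each condition is wrapped in parentheses because & and | bind more tightly than comparison operators.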
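
Here is a minimal sketch of the OR operator, together with the NOT operator (~), on a hypothetical four-row frame. Storing each condition in its own variable, as done here, is one way to keep combined filters readable.

```python
import pandas as pd

# Hypothetical stand-in for Product_Review
toy = pd.DataFrame({
    "Product_Category": ["Footwear", "Footwear", "Books", "Books"],
    "Product_Rating": [4.5, 2.0, 3.5, 1.0],
})

high = toy["Product_Rating"] > 3
footwear = toy["Product_Category"] == "Footwear"

both = toy[high & footwear]     # rows that satisfy both conditions
either = toy[high | footwear]   # rows that satisfy at least one condition
not_footwear = toy[~footwear]   # rows where the condition is False

print(either)
```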
So far, we have learned how to manipulate data using the Pandas library. Pandas also offers data visualization features for a better understanding of the data. Let us see how data visualization with Pandas works.

Data Visualization Using Pandas Python

In this tutorial, we look at two kinds of plots available in Pandas:

  1. Histogram
  2. Scatter Plot

Note: Call %matplotlib inline to set up plotting inside Jupyter Notebook.

1. Histogram:

%matplotlib inline
Product_Review[Product_Review["Product_Category"] == "Footwear"]["Product_Rating"].plot(kind="hist")

The histogram shows how many footwear products fall into each Product_Rating range.

Analysis: Let us now analyze the data from the histogram. It appears that footwear with a Product_Rating of 5 is the most common, while footwear with a low Product_Rating is rare.
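
The counts behind a histogram like this can also be inspected as plain numbers, which is handy for confirming what the plot suggests. The sketch below does so with value_counts() on a hypothetical ratings series; the values are made up and do not come from the data set.

```python
import pandas as pd

# Hypothetical footwear ratings (not taken from the data set)
ratings = pd.Series([5, 4, 5, 3, 5, 4, 1])

# Number of products per rating value, ordered by rating
counts = ratings.value_counts().sort_index()
print(counts)

# The rating that occurs most often
most_common = counts.idxmax()
print(most_common)
```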

2. Scatter Plot:
Now, we will have a look at the scatter plot of Product_Rating against Product_Launch_Year.
Note: Here, both x and y columns need to be numeric.

Product_Review.plot.scatter(x="Product_Launch_Year",y="Product_Rating")

Analysis: From the scatter plot shown above, we can see how the Product_Rating values of products launched in the year 2016 are distributed. Low ratings appear to be rare, as the density of points near the low Product_Rating values is low.
This brings us to the end of the Python Pandas Library Tutorial. In this tutorial, we have learned how different features of the Pandas library work and how to use them to better understand our data.
