**What is Pandas in Python?**

**Pandas** is a Python library that is a simple yet powerful tool for data science, and it is one of the most widely used Python packages. It provides data structures and tools for effective data manipulation and analysis. Pandas is used in both commercial and academic settings, in fields such as economics, finance, analytics, and statistics. If we are going to work with data using Python, we need to learn Pandas as well.

## Features of Python Pandas

Some of the key features of Python Pandas are as follows:

- It provides DataFrame objects with default and customized indexing which is very fast and efficient.
- There are tools available for loading data of different file formats into in-memory data objects.
- It is easy to perform data alignment and integrated handling of missing data in Python Pandas.
- It is very simple to perform pivoting and reshaping of data sets in Pandas.
- It also provides indexing, label-based slicing, and sub-setting of large data sets.
- We can easily insert and delete columns from a data structure.
- Data aggregation and transformations can be done using group by.
- High-performance merging and joining of data can be done using Pandas.
- It also provides time-series functionality.
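To make two of these features concrete, here is a minimal sketch of group-by aggregation and missing-data handling. The `sales` DataFrame and its column names are made up for illustration and are not part of the Amazon data set used later in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
sales = pd.DataFrame({
    "category": ["Footwear", "Footwear", "Books", "Books"],
    "rating": [4.0, np.nan, 3.0, 5.0],
})

# Group by category and average the ratings.
# Missing values (NaN) are skipped by default when aggregating.
mean_by_category = sales.groupby("category")["rating"].mean()
print(mean_by_category)

# Replace the missing rating with the mean of the available ratings.
filled = sales["rating"].fillna(sales["rating"].mean())
print(filled.tolist())
```

Whether filling with the mean is appropriate depends on the data; it is used here only to show that missing values can be handled in one line.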


In this tutorial, we will use Pandas in Python to analyze a data set of product reviews from Amazon, a popular e-commerce website. This data set contains the following information about each reviewed product:

- Product_Review_Phrase: Description of a product according to its review
- Product_Title: Product name
- Website_URL: Link of the product on the website
- Platform: Whether a product is available on the website/mobile app or both
- Product_Rating: Average product rating as per customers
- Product_Category: Category of a product
- Is_Amazon_Advantage_Product: Whether a product is a premium product or not
- Product_Launch_Year: Year of the launch of a product on the website
- Product_Launch_Month: Month of the launch of a product on the website
- Product_Launch_Day: Date of launch of a product on the website

While analyzing the product reviews, we will learn how to implement key Pandas in Python concepts like indexing, plotting, etc.

The data is in the **csv** (comma-separated values) format: the values within a row are separated by a comma (**‘,’**), and rows are separated by a *new line*. There are 1,841 rows in the file, including a header row, and 10 data columns.

Before we start, a quick note on the software environment: in this tutorial, we are using Python version 3.5, and the examples are run in **Jupyter Notebook**.

**Let’s have a look at the topics covered in this module:**

- How to Import Pandas in Python?
- Pandas Series Objects
- Python Pandas DataFrames
- Importing Data with Pandas
- Indexing DataFrames with Pandas
- Sorting DataFrames with Pandas
- Python Pandas DataFrame Methods
- Mathematical Operations with Pandas Python
- Filtering DataFrames in Python Pandas
- Data Visualization using Pandas Python

Let’s begin!

**How to Import Pandas in Python?**

The first and most important step is to import the Pandas library:

import pandas as pd

By convention, we abbreviate the library name as *pd* when importing it.

**Pandas Series Objects**

A series object is a one-dimensional labeled array. It can contain any type of data, including mixed types. Now, let’s have a look at how we can create series objects in Python Pandas with some examples.

**Example 1:**

series1 = pd.Series([1, 2, 3, 4])
series1

Just to make sure that the object we have created just now is indeed a series object, we can use type() on the object.

type(series1)

**Example 2:**

Further, we can specify the index of the series object as shown below:

series2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
series2
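Once a series has a custom index, its values can be read either by label or by position. The following is a small sketch of both, plus a mixed-type series; the variable names `s` and `mixed` are chosen just for this illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

print(s["b"])          # access by label
print(s.iloc[1])       # access by position (positions start at 0)
print(s[["a", "d"]])   # several labels at once returns a smaller series

# A series holding mixed types falls back to the generic "object" dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)     # object
```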

**Python Pandas DataFrames**

A DataFrame in Pandas is a 2-dimensional labeled data structure whose columns can hold different data types.

**Creating a DataFrame from a list:**

Let’s take a list of integers and then create a DataFrame from that Python list.

list1 = [1, 2, 3, 4, 5]
df = pd.DataFrame(list1)
df

**Creating a DataFrame from a list of lists**:

Now, we will create a DataFrame using a list of lists.

list_of_lists = [['apple', 10], ['mango', 12], ['banana', 13]]
df = pd.DataFrame(list_of_lists, columns=['fruit', 'count'])
df

**Creating a DataFrame from a dictionary:**

We can also create DataFrames with the help of Python dictionaries.

dict1 = {'fruit': ['apple', 'mango', 'banana'], 'count': [10, 12, 13]}
df = pd.DataFrame(dict1)
df

**Note:** Now that we are familiar with DataFrames and series objects, keep in mind that each column in a DataFrame is a series object.
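We can check this relationship ourselves. The sketch below rebuilds the fruit DataFrame and pulls out one column. Note that the column is read with square brackets: `df.count` would refer to the DataFrame's `count()` method rather than to the column named count.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "mango", "banana"], "count": [10, 12, 13]})

# Selecting a single column returns a series object.
fruit_col = df["fruit"]
print(type(fruit_col))

# Series methods are therefore available on any column.
print(df["count"].sum())
```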

**Importing Data with Pandas in Python**

Here, we will first read the data. The data is stored in the csv format, i.e., comma-separated values, where the values in each row are separated by a comma ‘,’ and each row is on a new line. Here are the first few rows of the Amazon_Products_Review.csv file:

As we can see, each row in the data represents a single product that was reviewed on Amazon. Here, we also have a leading column that contains row index values. For now, we will not discuss this column; later on, we’ll dive into what index values are. To work with data in Python, the first step is to import the file into a Pandas DataFrame. A DataFrame is a way to represent and work with tabular data, that is, data arranged in rows and columns.

Our file is in the .csv format, so the **pd.read_csv()** function is going to help us read the data stored in it. This function takes a csv file as input and returns a DataFrame.

import pandas as pd
Product_Review = pd.read_csv("Amazon_Products_Review.csv")

Let’s inspect the type of **Product_Review** using the **type()** function.

type(Product_Review)
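If the Amazon file is not at hand, the same call can be tried on a small in-memory sample. The sketch below is only an illustration: the three rows are invented, and `io.StringIO` stands in for a file on disk. It also shows the optional `index_col=0` argument, which tells `read_csv` to use a leading index column as the row index instead of loading it as a data column.

```python
import io

import pandas as pd

# Invented sample rows that mimic the layout described above,
# including a leading, unnamed row-index column.
csv_text = """,Product_Title,Product_Rating,Product_Category
0,Running Shoes,4.5,Footwear
1,Desk Lamp,3.0,Home
2,Sandals,2.5,Footwear
"""

# Default behaviour: the leading column is loaded as an ordinary column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# With index_col=0, the leading column becomes the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(type(sample))
print(sample.shape)
print(sample.dtypes)
```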

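For file types other than .csv, Pandas provides similar functions:

- pd.read_table("filename"): Reads a general delimited text file (tab-separated by default)
- pd.read_excel("filename"): Reads an Excel file
- pd.read_sql(query, connection_object): Reads the result of a SQL query
- pd.read_json("filename"): Reads a JSON file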

**Working with DataFrames:**

Now that our DataFrame is ready, let’s have a look at some of the operations in Pandas.

- head(): This returns the first five rows of the DataFrame.
- tail(): This returns the last five rows of the DataFrame.
- shape: This attribute gives the number of rows and columns. In our DataFrame, we have 1,840 rows and 11 columns.
- info(): This prints information about the index, data types, and memory usage of the DataFrame.
- describe(): This gives summary statistics for numerical columns.
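The operations listed above can be tried on any DataFrame. Here is a short sketch on a made-up seven-row DataFrame; the column names follow the tutorial's data set, but the values are invented for illustration.

```python
import pandas as pd

# Invented values; only the column names follow the tutorial's data set.
df = pd.DataFrame({
    "Product_Title": ["A", "B", "C", "D", "E", "F", "G"],
    "Product_Rating": [4.5, 3.0, 2.5, 5.0, 4.0, 3.5, 1.0],
    "Product_Launch_Year": [2014, 2015, 2016, 2016, 2017, 2015, 2016],
})

first = df.head()     # first five rows by default
last = df.tail(2)     # an optional argument changes the number of rows
print(df.shape)       # (7, 3): shape is an attribute, so no parentheses

df.info()             # prints index, column dtypes, and memory usage

summary = df.describe()   # statistics for the numeric columns only
print(summary.loc["mean", "Product_Rating"])

print(df.dtypes)      # one data type per column
```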

One of the big advantages of Pandas over NumPy is that a Pandas DataFrame allows us to have columns with different data types. Here, in Product_Review, we have columns that store float values like Product_Rating, string values like Product_Review_Phrase, and integers like Product_Launch_Year.

Now that the data has been read properly, we will work on indexing Product_Review so that we can get the rows and columns as per our requirements.


**Indexing DataFrames with Pandas in Python**

Now, let’s say we want to select and have a look at a chunk of data from our DataFrame. There are two ways of achieving this: first, selecting by position and, second, selecting by label.

**Selecting by Position**

Using *iloc*, we can retrieve rows and columns by position. Here, we need to specify the positions of rows and columns.

**Example 1:**

Suppose we want only the first column out of the DataFrame. Then, we would use *iloc* on the DataFrame as shown below:

Product_Review.iloc[:,0]

This snippet of code shows that we want to have a look at all rows of the first column. Keep in mind that positions are counted from 0, so the first column (or the first row) is at position 0. As we needed all the rows, we specified just a colon (:) without mentioning any position.

**Example 2:**

Again, imagine that we want to have a look at the first five rows of the fifth column, which is at position 4. We need to specify the position of the rows as **0:5**, which means that we want to view the rows from position 0 to position 4 (note that position 5 is excluded here).

Product_Review.iloc[0:5,4]

Also, instead of writing 0:5, we can leave off the first position value and write **:5**. This has the same meaning. But if we write **0:**, it means from position 0 to the last position.

Similar examples:

- iloc[:,:] – To view the entire DataFrame
- iloc[6:,4:] – To view from Row 6 and Column 4 onward
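The position rules are easiest to see on a tiny DataFrame where every value is known. The following sketch uses a made-up 4 x 3 table of integers; the names `df`, `first_col`, `top_two_b`, and `tail_block` are illustrative only.

```python
import pandas as pd

# A small made-up table so that every selected value is easy to check.
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    columns=["a", "b", "c"],
)

first_col = df.iloc[:, 0]      # all rows, column at position 0
top_two_b = df.iloc[0:2, 1]    # rows at positions 0 and 1 (2 is excluded), column 1
tail_block = df.iloc[2:, 1:]   # from row position 2 and column position 1 onward

print(first_col.tolist())
print(top_two_b.tolist())
print(tail_block)
```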

Now, let’s update our DataFrame by removing the first column, which contains no useful information:

Product_Review = Product_Review.iloc[:, 1:]
Product_Review.head()

Now, since we are aware of the **NumPy** indexing methodologies, we can notice that they are quite similar to Pandas indexing by position. But unlike NumPy, each of the columns and rows in Pandas has a label. Yes, selecting by position is easy. But for large DataFrames, keeping track of columns and their corresponding positions becomes complicated. That’s when our next method of indexing comes in handy.

**Selecting by Label:**

The second method is selecting by label. The **.loc** indexer allows us to index using labels instead of positions. Let’s illustrate this method with some examples.

**Example 1:**

**(Selecting some rows of one column)**

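Displaying the first five rows of the *Product_Title* column using the *.loc* method. Unlike *iloc*, a label slice in *.loc* includes its end label, so with the default integer index, **:4** selects the rows labeled 0 to 4:

Product_Review.loc[:4, "Product_Title"]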

**Example 2:**

**(selecting some rows of more than one column)**

Displaying the first five rows (labels 0 to 4) of the columns *Product_Title* and *Product_Rating*. To select more than one column, we pass the column labels as a list:

Product_Review.loc[:4, ["Product_Title", "Product_Rating"]]
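One detail worth checking by hand is that a label slice with *.loc* includes its end label, while a position slice with *iloc* excludes its end position. Here is a small sketch on an invented seven-row DataFrame with the default integer index:

```python
import pandas as pd

# Invented values; the default index labels are 0, 1, ..., 6.
df = pd.DataFrame({
    "Product_Title": ["A", "B", "C", "D", "E", "F", "G"],
    "Product_Rating": [4.5, 3.0, 2.5, 5.0, 4.0, 3.5, 1.0],
})

# Labels 0 through 4, end label included: five rows.
by_label = df.loc[:4, ["Product_Title", "Product_Rating"]]

# Positions 0 through 4, end position 5 excluded: the same five rows.
by_position = df.iloc[:5, [0, 1]]

print(len(by_label), len(by_position))

# A label slice up to 5 therefore returns six rows, not five.
print(len(df.loc[:5, "Product_Title"]))
```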

**Sorting DataFrames with Pandas in Python**

Apart from indexing, another very simple yet useful feature offered by Pandas in Python is the sorting of DataFrames. To get a clear idea of the sorting feature, let’s look at the following examples.

Sorting the DataFrame based on the values of a column:

Say, we want to sort the DataFrame by the values of the Product_Rating column.

Product_Review.sort_values(by='Product_Rating')

Now, suppose we want to sort the DataFrame by the Product_Rating column in descending order:

Product_Review.sort_values('Product_Rating', ascending=False)
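Sorting can also use more than one column, and it returns a new DataFrame rather than changing the original. The sketch below, on invented values, sorts by rating from high to low and breaks ties by launch year from old to new.

```python
import pandas as pd

# Invented values for illustration.
df = pd.DataFrame({
    "Product_Title": ["A", "B", "C", "D"],
    "Product_Rating": [3.0, 5.0, 4.0, 5.0],
    "Product_Launch_Year": [2016, 2015, 2017, 2014],
})

# Rating descending first; among equal ratings, launch year ascending.
sorted_df = df.sort_values(
    by=["Product_Rating", "Product_Launch_Year"],
    ascending=[False, True],
)
print(sorted_df)

# sort_values returns a new DataFrame; df keeps its original order.
print(df["Product_Title"].tolist())
```

To keep the sorted result, we assign it to a variable, as done with `sorted_df` here.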

**Pandas in Python DataFrame Methods**

Pandas provides several built-in methods that make common calculations easier. Let’s apply those methods to our *Product_Review* DataFrame.

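1) Mean of each numeric column in our DataFrame (numeric_only=True skips text columns such as Product_Title, for which a mean is not defined)

Product_Review.mean(numeric_only=True)

2) Median of each numeric column in our DataFrame

Product_Review.median(numeric_only=True)

3) Standard deviation of each numeric column in our DataFrame

Product_Review.std(numeric_only=True)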

4) Maximum value of each column in our DataFrame

Product_Review.max()

5) Minimum of each column in our DataFrame

Product_Review.min()

6) Number of non-null values in each DataFrame column

Product_Review.count()

7) Summary statistics for numerical columns

Product_Review.describe()
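The methods above can be tried on a small DataFrame with a text column to see how non-numeric data is treated. This is a sketch on invented values; `numeric_only=True` restricts the calculation to numeric columns, and `agg` is one way to compute several statistics of a single column in one call.

```python
import pandas as pd

# Invented values for illustration.
df = pd.DataFrame({
    "Product_Title": ["A", "B", "C", "D"],
    "Product_Rating": [2.0, 4.0, 4.0, 5.0],
    "Product_Launch_Year": [2014, 2015, 2016, 2016],
})

# Mean of the numeric columns only; the text column is left out.
means = df.mean(numeric_only=True)
print(means)

# Several statistics of one column in a single call.
stats = df["Product_Rating"].agg(["mean", "median", "std", "min", "max"])
print(stats)

# count() works on every column, because it only counts non-null values.
counts = df.count()
print(counts)
```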

**Mathematical Operations in Pandas Python**

We can also perform mathematical operations on series objects or DataFrame objects.

For example, for dividing every value in the Product_Rating column by 2, we use the following code:

Product_Review["Product_Rating"] / 2

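**Note**: The common arithmetic operators that work in Python, like +, −, *, /, and ** (exponentiation), also work element by element on a DataFrame or a series object. Keep in mind that ^ is the bitwise XOR operator in Python, not a power operator.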
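Here is a short sketch of these element-wise operations on an invented series of ratings. The last lines show one way to store a result as a new column; the column name `Half_Rating` is hypothetical.

```python
import pandas as pd

ratings = pd.Series([2.0, 3.0, 4.0])

halved = ratings / 2                            # each value divided by 2
squared = ratings ** 2                          # ** is exponentiation
total = ratings + pd.Series([1.0, 1.0, 1.0])    # series are added value by value

print(halved.tolist())
print(squared.tolist())
print(total.tolist())

# The result of an operation can be stored as a new column.
df = pd.DataFrame({"Product_Rating": [2.0, 3.0, 4.0]})
df["Half_Rating"] = df["Product_Rating"] / 2    # hypothetical column name
print(df)
```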

**Filtering DataFrames in Python Pandas**

Now that we have learned about how to do mathematical operations in Pandas, let’s have a look at the filtering methods available in Pandas Python and use them in our DataFrame.

Say, we want to find footwear that has a Product_Rating greater than 3.

First, let us generate a Boolean series with our filtering condition and see the first five results.

filter1 = Product_Review["Product_Rating"] > 3
filter1.head()

Now that we have the Boolean series, we use it to select only the rows of the DataFrame where the series contains the value True. This gives us the rows in Product_Review where Product_Rating is greater than 3.

filtered_new = Product_Review[filter1]
filtered_new.head()

Let’s make it a bit more complicated by adding a second condition. Since we want to look at footwear with a Product_Rating greater than 3, we will now add a condition on the Product_Category column of the DataFrame.

filter2 = (Product_Review["Product_Rating"] > 3) & (Product_Review["Product_Category"] == "Footwear")
filtered_review = Product_Review[filter2]
filtered_review.head()

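In the example shown above, we combined filtering conditions with the AND Boolean operator (**&**). Similarly, the OR operator (**|**) can be applied when necessary. Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators such as > and ==.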
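For completeness, here is a small sketch of the OR operator, together with two related tools: the NOT operator (~) and the `isin()` method. The four-row DataFrame is invented for illustration.

```python
import pandas as pd

# Invented values for illustration.
df = pd.DataFrame({
    "Product_Title": ["A", "B", "C", "D"],
    "Product_Rating": [4.5, 2.0, 3.5, 1.0],
    "Product_Category": ["Footwear", "Footwear", "Books", "Toys"],
})

# OR: rows that satisfy at least one of the two conditions.
either = df[(df["Product_Rating"] > 3) | (df["Product_Category"] == "Footwear")]

# NOT: rows where the condition is False.
not_footwear = df[~(df["Product_Category"] == "Footwear")]

# isin: rows whose category is any of the listed values.
in_set = df[df["Product_Category"].isin(["Books", "Toys"])]

print(either["Product_Title"].tolist())
print(not_footwear["Product_Title"].tolist())
print(in_set["Product_Title"].tolist())
```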

So far, we have learned how to do data manipulation using the Pandas library. The Pandas library also offers data visualization features for a better understanding of the data. Let us see how data visualization with Pandas works.


**Data Visualization Using Pandas Python**

Data visualization with Pandas in Python is carried out in the following ways:

- Histogram
- Scatter Plot

**Note**: Call %matplotlib inline to set up plotting inside Jupyter Notebook.

**1. Histogram**

%matplotlib inline
Product_Review[Product_Review["Product_Category"] == "Footwear"]["Product_Rating"].plot(kind="hist")

The histogram shows how many footwear products fall into each range of Product_Rating.

**Analysis:** Let us now analyze the data from the histogram. It appears that footwear with a Product_Rating of 5 is the most common, while footwear with a low Product_Rating is rare.
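A reading taken from a histogram can be cross-checked numerically by counting how often each rating occurs. The sketch below does this with `value_counts()` on a handful of invented rows, so the counts it prints say nothing about the real data set; it only shows the technique.

```python
import pandas as pd

# Invented rows; the counts below are not those of the Amazon data set.
df = pd.DataFrame({
    "Product_Category": ["Footwear", "Footwear", "Footwear", "Books", "Footwear"],
    "Product_Rating": [5.0, 4.0, 5.0, 1.0, 3.0],
})

# Ratings of footwear products only.
footwear = df.loc[df["Product_Category"] == "Footwear", "Product_Rating"]

# Number of products per rating value, ordered by rating.
counts = footwear.value_counts().sort_index()
print(counts)
```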

**2. Scatter Plot**

Now, we will have a look at the scatter plot of the Product_Rating based on the Product_Launch_Year.

**Note:** Here, both the x and y columns need to be numeric.

Product_Review.plot.scatter(x="Product_Launch_Year",y="Product_Rating")

**Analysis:** From the scatter plot shown above, we can analyze how the Product_Rating of products launched in a given year, such as 2016, is distributed. Low Product_Rating values appear to be less common, as the points are less dense near the low end of the Product_Rating axis.

This brings us to the end of the **Python Pandas Library Tutorial**. In this tutorial, we have learned how different features in the Pandas library work and how to use them for a better understanding of our data.
