Introduction to Pandas

Pandas is a simple yet powerful Python library for data analysis, and one of the most widely used Python packages. It provides data structures and tools for effective data manipulation and analysis. Python with Pandas is used in many fields, including commerce, academia, economics, finance, analytics, and statistics. If you are going to work with data in Python, you need to learn Pandas.

Python Pandas Library Tutorial

Some of the key features of Pandas are:

  • It provides a fast and efficient DataFrame object with default and customised indexing.
  • It offers tools for loading data from different file formats into in-memory data objects.
  • It makes data alignment easy and has integrated handling of missing data.
  • It makes pivoting and reshaping of data sets simple.
  • It provides indexing, label-based slicing, and sub-setting of large data sets.
  • Columns can easily be inserted into and deleted from a data structure.
  • Data can be aggregated and transformed using group by.
  • It supports high-performance merging and joining of data.
  • It provides Time Series functionality.

In this tutorial, we will use Pandas to analyse data on product reviews from Amazon, a popular eCommerce website. The dataset contains information about various product reviews on the Amazon website, with the following columns:

  • Product-Review-Phrase – Description of a product according to its reviews
  • Product-Title – Product name
  • Website-URL – Link to the product on the website
  • Platform – Whether the product is available on the website, the mobile app, or both
  • Product-Rating – Average product rating given by customers
  • Product-Category – Category of the product
  • Is-Amazon-Advantage-Product – Whether or not the product is a premium product
  • Product-Launch-Year – Year the product was launched on the website
  • Product-Launch-Month – Month the product was launched on the website
  • Product-Launch-Day – Day the product was launched on the website

While analysing the product reviews, we will learn how to apply key Pandas concepts such as indexing, filtering, and plotting.

The data is in CSV (Comma Separated Values) format: the fields of each record are separated by a comma “,” and the rows are separated by a new line. There are approximately 1841 rows, including a header row, and 10 columns in the file.

Before we get started, a quick note on the software environment: in this tutorial, we will be using Python 3.5, and our examples will be run in a Jupyter notebook.

So, let’s begin with

Importing Conventions for Pandas:

The very first and most important operation to keep in mind is importing the Pandas library properly.

import pandas as pd

By convention, we abbreviate it as pd while importing.


Series Objects and creating Series Objects:

A Series is a one-dimensional labeled array that can contain any type of data, including mixed types. Let’s look at how to create Series objects in Pandas with some examples.

Example1:

series1 = pd.Series([1, 2, 3, 4])
series1

To make sure that the object we have just created is indeed a Series object, we can use type() on it.

type(series1)

Example 2:

Further, we can specify the index of the Series object as shown below:

series2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
series2
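
As a small additional sketch (the values and variable names here are made up for illustration and are not part of the tutorial's dataset), the snippet below shows how an indexed Series can be read by label or by position, and what data type Pandas reports when the values are of mixed types:

```python
import pandas as pd

# A Series with an explicit index (illustrative values only).
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]        # look a value up by its index label
by_position = s.iloc[0]  # look a value up by its integer position

# A Series holding mixed types falls back to the generic "object" dtype.
mixed = pd.Series([1, "two", 3.0])

print(by_label, by_position, mixed.dtype)
```

Looking values up by label is what separates a Series from a plain Python list, where only positions are available.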


DataFrames and creating DataFrames:

DataFrames in pandas are defined as two-dimensional labeled data structures with columns of potentially different types.

Create a DataFrame from a List: Let’s take a List of integers and then create a DataFrame from that List.

list1 = [1, 2, 3, 4, 5]
df = pd.DataFrame(list1)
df

Create a DataFrame from a List of Lists: Now we will create a DataFrame by using a List of Lists.

list_of_lists = [['apple', 10], ['mango', 12], ['banana', 13]]
df = pd.DataFrame(list_of_lists, columns=['fruit', 'count'])
df

Create a DataFrame from a Dict: We can also create DataFrames from a Dictionary, where each key becomes a column name.

dict1 = {'fruit': ['apple', 'mango', 'banana'], 'count': [10, 12, 13]}
df = pd.DataFrame(dict1)
df


Note: Now that we are familiar with DataFrame and Series objects, keep in mind that each column in a DataFrame is a Series object.
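
The note above can be checked directly. The following minimal sketch reuses the fruit data from the Dict example and confirms that selecting a single column returns a Series:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "mango", "banana"], "count": [10, 12, 13]})

# Selecting one column by name returns a Series, not a DataFrame.
column = df["count"]
is_series = isinstance(column, pd.Series)

# Series methods are therefore available on the column.
total = int(column.sum())

print(type(column), total)
```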


Importing Data with Pandas

Here, we will first read the data. The data is stored as a CSV (comma-separated values) file, where each column is separated by a comma “,” and each row by a new line. Here are the first few rows of the Amazon-Products-Review.csv file:

[Image: first few rows of the Amazon-Products-Review.csv file]

As you can see, each row in the data represents a single product reviewed on Amazon. There is also a leading column that contains row index values. We will not discuss this column for now, but later we will dive into what index values are. To work with the data in Python, the first step is to import the file into a Pandas DataFrame. A DataFrame is a way to represent and work with tabular data, that is, data with rows and columns.

Our file is in .csv format, so the pd.read_csv() function is going to help us read the data stored in it. This function takes a CSV file as input and returns a DataFrame as output.

import pandas as pd
Product_Review = pd.read_csv("Amazon-Products-Review.csv")

Let’s inspect the type of Product_Review by using the type() function.

type(Product_Review)
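
Since the real CSV file is not reproduced here, the sketch below reads a tiny stand-in from an in-memory string instead. The three rows are invented for illustration; only the column names follow the dataset description. It also shows index_col=0, which is one possible way (an alternative to what this tutorial does later) to treat a leading unnamed column as the row index while reading:

```python
import io
import pandas as pd

# Hypothetical sample rows; the leading unnamed column mimics saved row index values.
csv_text = """,Product-Title,Product-Rating,Product-Category
0,Running Shoe,4.5,Footwear
1,Desk Lamp,3.0,Home
2,Sandal,2.5,Footwear
"""

# Default behaviour: the unnamed leading column is kept as an ordinary column.
raw = pd.read_csv(io.StringIO(csv_text))

# Alternative: tell read_csv to use the first column as the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.shape, indexed.shape)
print(type(raw))
```

With the default call, the leading column shows up as an extra column named "Unnamed: 0", which is why the DataFrame has one more column than the list of named fields.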

For file types other than .csv, the corresponding reading functions are listed below:

• pd.read_table("filename")
• pd.read_excel("filename")
• pd.read_sql(query, connection_object)
• pd.read_json(json_string)

Working with DataFrame:

Now that the DataFrame is ready, let’s have a look at some of the operations in pandas.

  • Product_Review.head() – Returns the first 5 rows of the DataFrame.
  • Product_Review.tail() – Returns the last 5 rows of the DataFrame.
  • Product_Review.shape – Gives the number of rows and columns. Our DataFrame has 1840 rows and 11 columns.
  • Product_Review.info() – Gives information about the index, data types, and memory usage of the DataFrame.
  • Product_Review.describe() – Gives summary statistics for the numerical columns.

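Notice that everything has been read in properly: we have 1840 rows and 11 columns. That is the 1841 lines of the file minus the header row, and the 10 data columns plus the leading index column.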
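
To get a feel for these inspection tools without the full file, here is a minimal sketch on a four-row DataFrame. The rows are hypothetical; only the column names come from the dataset description:

```python
import pandas as pd

# Hypothetical stand-in for the review data.
df = pd.DataFrame({
    "Product-Title": ["Running Shoe", "Desk Lamp", "Sandal", "Backpack"],
    "Product-Rating": [4.5, 3.0, 2.5, 5.0],
    "Product-Launch-Year": [2016, 2015, 2016, 2017],
})

shape = df.shape          # (number of rows, number of columns)
first_two = df.head(2)    # head() accepts the number of rows to return
last_one = df.tail(1)
summary = df.describe()   # statistics for the numerical columns only

df.info()                 # prints index, dtype, and memory information
print(shape)
print(summary)
```

Note that shape is an attribute and is written without parentheses, while head(), tail(), info(), and describe() are methods and need them.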

One of the big advantages of Pandas over NumPy is that Pandas allows us to have columns with different data types. Here, Product_Review has columns that store float values, like Product-Rating, string values, like Product-Review-Phrase, and integers, like Product-Launch-Year.

Now that the data has been read properly, we will work on indexing Product_Review, so that we can get the rows and columns we need.

Indexing the DataFrames with Pandas:

Now, let’s say we want to select and look at a chunk of data from our DataFrame. There are two ways of achieving this: first, selecting by position, and second, selecting by label.
Selecting by Position: Using iloc, we can retrieve rows and columns by position. Here we need to specify the positions of the rows and columns.

Example1:

Suppose we want only the first column of the DataFrame. Then we would use iloc on the DataFrame as shown below:

Product_Review.iloc[:, 0]

This snippet selects all the rows of the first column. Keep in mind that the position of the first column (or first row) is always 0. As we wanted all the rows, we specified just a colon “:” without mentioning any position.

Example 2:

Again, say we want to look at the first 5 rows of the column at position 4 (the fifth column, since positions start at 0). We need to specify the position of the rows as 0:5, which means that we want to view the rows from position 0 to 4 (note that position 5 is excluded).

Product_Review.iloc[0:5, 4]

Also, in the example shown above, instead of writing 0:5 we can leave off the first position value and write :5. This has the same meaning. If we write 0: instead, it means from position 0 to the last position.

Similar examples:

  • Product_Review.iloc[:, :] -> View the entire DataFrame
  • Product_Review.iloc[6:, 4:] -> View from row position 6 and column position 4 onwards
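
The position-based selections described above can be tried on a small frame. This is only a sketch with invented rows; the slicing rules are the same as for the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for the review data.
df = pd.DataFrame({
    "Product-Title": ["Running Shoe", "Desk Lamp", "Sandal", "Backpack"],
    "Product-Rating": [4.5, 3.0, 2.5, 5.0],
    "Product-Launch-Year": [2016, 2015, 2016, 2017],
})

first_col = df.iloc[:, 0]    # all rows, column at position 0
top_two = df.iloc[0:2, 1]    # row positions 0 and 1 (2 is excluded), column position 1
same_top_two = df.iloc[:2, 1]  # leaving off the start means "from position 0"
lower_right = df.iloc[2:, 1:]  # from row position 2 and column position 1 onwards

print(top_two.tolist())
print(lower_right)
```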

Now let’s update our DataFrame by removing the first column, which contains no useful information.

Product_Review = Product_Review.iloc[:, 1:]
Product_Review.head()

If you are familiar with NumPy indexing, you might have noticed that it is quite similar to Pandas indexing by position. But unlike NumPy, each of the columns and rows in Pandas has a label. Selecting by position is easy, but for large DataFrames, keeping track of columns and their corresponding positions becomes complicated. That’s when our next method of indexing comes in handy.

Selecting by Label:

The second method is selecting by label. .loc allows us to index using labels instead of positions. Let’s illustrate this method with some examples.

Example1:

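(Selecting some rows of one column)

Display the first five rows of the Product-Title column using the .loc method like this. Note that, unlike iloc, a .loc slice includes its end label, so :4 selects the rows labelled 0 to 4:

Product_Review.loc[:4, "Product-Title"]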

Example2:

(Selecting some rows of more than one column)

Display the first five rows of the columns Product-Title and Product-Rating by passing the column labels as a list:

Product_Review.loc[:4, ["Product-Title", "Product-Rating"]]
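
One detail that is easy to miss is that label slices with .loc include the end label, while position slices with .iloc exclude the end position. The sketch below, on invented rows, makes the difference visible and also shows a lookup on a non-numeric index:

```python
import pandas as pd

# Hypothetical stand-in for the review data.
df = pd.DataFrame({
    "Product-Title": ["Running Shoe", "Desk Lamp", "Sandal", "Backpack"],
    "Product-Rating": [4.5, 3.0, 2.5, 5.0],
})

by_label = df.loc[:2, "Product-Title"]   # labels 0, 1 and 2: three rows
by_position = df.iloc[:2, 0]             # positions 0 and 1: two rows

# Several columns are selected by passing a list of labels.
two_cols = df.loc[:1, ["Product-Title", "Product-Rating"]]

# Labels do not have to be numbers: here the titles become the row labels.
relabeled = df.set_index("Product-Title")
lamp_rating = relabeled.loc["Desk Lamp", "Product-Rating"]

print(len(by_label), len(by_position), lamp_rating)
```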

Sorting DataFrames with Pandas:

Apart from indexing, another simple yet useful feature offered by Pandas is the sorting of DataFrames. To get a clear idea of the sorting feature, let’s look at the following examples.

To sort the DataFrame based on the values of a column, say the Product-Rating column, in ascending order:

Product_Review.sort_values(by='Product-Rating')

To sort by the Product-Rating column in descending order:

Product_Review.sort_values('Product-Rating', ascending=False)
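
A short sketch of both sort orders on invented rows follows. It also illustrates that sort_values returns a new, sorted DataFrame and leaves the original untouched unless the result is assigned back:

```python
import pandas as pd

# Hypothetical stand-in for the review data.
df = pd.DataFrame({
    "Product-Title": ["Running Shoe", "Desk Lamp", "Sandal", "Backpack"],
    "Product-Rating": [4.5, 3.0, 2.5, 5.0],
})

ascending = df.sort_values(by="Product-Rating")
descending = df.sort_values("Product-Rating", ascending=False)

print(ascending["Product-Title"].tolist())
print(descending["Product-Title"].tolist())
print(df["Product-Title"].tolist())  # the original order is unchanged
```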

Pandas DataFrame methods:

There are some special methods available in Pandas that make our calculations easier. Let’s apply those methods to our Product_Review DataFrame.

1) Mean of each numerical column in our DataFrame. (Recent versions of Pandas need numeric_only=True here, because our DataFrame also has string columns.)

Product_Review.mean(numeric_only=True)

2) Median of each numerical column in our DataFrame.

Product_Review.median(numeric_only=True)

3) Standard deviation of each numerical column in our DataFrame.

Product_Review.std(numeric_only=True)

4) Maximum value of each column in our DataFrame.

Product_Review.max()

5) Minimum value of each column in our DataFrame.

Product_Review.min()

6) Number of non-null values in each DataFrame column.

Product_Review.count()

7) Summary statistics for numerical columns.

Product_Review.describe()
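
The sketch below runs a few of these methods on invented rows, with one rating deliberately left missing. It is meant to show two behaviours: numeric_only=True skips the string column, and count() reports non-null values, so the missing rating is not counted:

```python
import pandas as pd

# Hypothetical stand-in for the review data; one rating is missing on purpose.
df = pd.DataFrame({
    "Product-Title": ["Running Shoe", "Desk Lamp", "Sandal", "Backpack"],
    "Product-Rating": [4.0, 3.0, None, 5.0],
})

means = df.mean(numeric_only=True)   # only numerical columns are averaged
counts = df.count()                  # non-null values per column
highest = df["Product-Rating"].max() # missing values are skipped by default

print(means)
print(counts)
print(highest)
```

The mean here is computed over the three available ratings only, since missing values are skipped by default.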

Mathematical Operations in Pandas:

We can also perform mathematical operations on Series Objects or DataFrame objects.

For example, we can divide every value in the Product-Rating column by 2.

Product_Review["Product-Rating"] / 2

Note: All the common mathematical operators that work in Python, like +, -, *, /, and **, will also work on a DataFrame or a Series object.
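
As a quick sketch on invented ratings, the operators are applied element by element, and two Series can also be combined with each other:

```python
import pandas as pd

ratings = pd.Series([4.5, 3.0, 2.5, 5.0])  # hypothetical ratings

halved = ratings / 2          # every value divided by 2
squared = ratings ** 2        # power is written ** in Python
doubled = ratings + ratings   # element-wise addition of two Series

print(halved.tolist())
print(squared.tolist())
print(doubled.tolist())
```

Each operation returns a new Series and leaves ratings itself unchanged.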

Filtering DataFrames:

Now that we have learnt about mathematical operations in Pandas, let’s have a look at the filtering methods available in Pandas and use them in our DataFrames.

Say we want to find all the Footwear products that have a Product-Rating greater than 3.

First, let us generate a Boolean Series with our filtering condition and look at the first 5 results.

filter1 = Product_Review["Product-Rating"] > 3
filter1.head()

Now that we have the Boolean Series, we use it to select only the rows of the DataFrame where the Series contains the value True. This gives us the rows in Product_Review where Product-Rating is greater than 3:

filtered_new = Product_Review[filter1]
filtered_new.head()

Let’s make it a bit more complicated by adding more than one condition. Since we want to look at the Footwear products that have a Product-Rating greater than 3, we will add a second condition on the Product-Category column of the DataFrame.

filter2 = (Product_Review["Product-Rating"] > 3) & (Product_Review["Product-Category"] == "Footwear")
filtered_review = Product_Review[filter2]
filtered_review.head()

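In the example shown above, we combined filtering conditions with the AND Boolean operator (&). Similarly, the OR operator (|) can be applied when necessary. Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators like > and ==.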
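
The following sketch, on invented rows, compares the AND and OR combinations side by side and adds the NOT operator (~), which inverts a Boolean Series:

```python
import pandas as pd

# Hypothetical stand-in for the review data.
df = pd.DataFrame({
    "Product-Title": ["Running Shoe", "Desk Lamp", "Sandal", "Backpack"],
    "Product-Rating": [4.5, 3.0, 2.5, 5.0],
    "Product-Category": ["Footwear", "Home", "Footwear", "Bags"],
})

high = df["Product-Rating"] > 3
footwear = df["Product-Category"] == "Footwear"

both = df[high & footwear]        # rows meeting both conditions
either = df[high | footwear]      # rows meeting at least one condition
not_footwear = df[~footwear]      # rows where the condition is False

print(both["Product-Title"].tolist())
print(either["Product-Title"].tolist())
print(not_footwear["Product-Title"].tolist())
```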

So far we have learnt how to do data manipulation using the Pandas library. Pandas also offers data visualization features for a better understanding of the data. Let us see how data visualization with Pandas works.

Data Visualization using Pandas:

In this tutorial, we will look at data visualization with Pandas in the following two ways.

1. Histogram
2. Scatter Plot

Note: Call %matplotlib inline to set up plotting inside the Jupyter notebook.

1. Histogram:

%matplotlib inline
Product_Review[Product_Review["Product-Category"] == "Footwear"]["Product-Rating"].plot(kind="hist")

The histogram above shows the frequency of Footwear products for each range of Product-Rating.

Analysis: From the histogram, it appears that Footwear products with a Product-Rating of 5 are the most frequent. We can also say that Footwear products with a low Product-Rating are very few in number.
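
A histogram reading like this can also be checked numerically. The sketch below counts ratings per bin with NumPy's histogram function, without drawing anything. The rows and the bin edges (0 to 5 in steps of 1) are assumptions made for illustration; the plot above uses its own default binning, so the exact bars may differ:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the review data.
df = pd.DataFrame({
    "Product-Category": ["Footwear", "Footwear", "Home", "Footwear", "Footwear", "Footwear"],
    "Product-Rating": [4.5, 5.0, 3.0, 2.5, 4.0, 5.0],
})

footwear_ratings = df[df["Product-Category"] == "Footwear"]["Product-Rating"]

# Each bin covers [left, right), except the last one, which also includes 5.
edges = [0, 1, 2, 3, 4, 5]
counts, _ = np.histogram(footwear_ratings, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")
```

Here the highest bin holds most of the Footwear ratings, which is the kind of pattern the analysis above reads off the plot.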

2. Scatter Plot:

Now we will have a look at the scatter plot of the Product-Rating values against the Product-Launch-Year.

Note: Here both the x and y columns need to be numeric.

Product_Review.plot.scatter(x="Product-Launch-Year", y="Product-Rating")

Analysis: From the scatter plot shown above, we can analyze how the ratings of the products launched in the year 2016 vary. It appears that low Product Ratings are fewer in number, as the density of points near the low ratings is lower.

This brings us to the end of the Python Pandas Library Tutorial. In this tutorial, we have learnt how different features of the Pandas library work and how to use them for a better understanding of our data.
