Web Scraping

Web scraping has been around for a while, but it has become especially popular in the past decade. Python makes web scraping easy: with it, extracting data from a web page can be automated. In this module, we will discuss web scraping from scratch and walk step by step through your first web scraping project.

Following is the list of all the topics that we will cover in this module, in case you need to jump to a specific one.

  • What Is Web Scraping?
  • Why Web Scraping?
  • Is Web Scraping Legal?
  • Web Scraping Rules
  • How to Perform Web Scraping Using Python?
    • Web Scraping Workflow
    • Setting up Python Web Scraper
  • Demo: Web Scraping Wikipedia

So, without further ado, let’s get started.

What Is Web Scraping?

Web scraping is the automated process of collecting data from the web. All we need to start is the URL, or web address, of the page we want to scrape. The fetched data arrives in an unstructured form (raw HTML), so to use it or draw insights from it we transform it into a structured form. Once converted, the data is stored for further processing. This whole process is called web scraping.

Why Web Scraping?

Now that we are familiar with what web scraping is, let us discuss why we perform it. Data has become a commodity in the 21st century, and data-driven technologies have grown significantly. An abundance of data is generated from different sources every day. But how do we collect that data in order to make use of it? Web scraping is one answer.

Some business scenarios in which web scraping is useful:

  • Data Science: Studying data science requires large amounts of data. Web scraping can fill the data gap.
  • Market Research: Before launching a product or service, web scraping can help study the market in advance.
  • Track Competitive Pricing: Web scraping can help study how competitors price their products and services, in order to stay ahead in the market.
  • Monitor Brand Value: Web scraping can be used to build brand intelligence and monitor how customers feel about a product or service.
  • Lead Generation: With the help of web scraping, businesses can gather contact details of other businesses or individuals.

Is Web Scraping Legal?

This is one of the most common questions about web scraping, which is also known as data scraping. The answer cannot be summed up in one word, because not all web scraping is legal. Scraping publicly available data is generally considered legal, but it can still cause legal issues in some cases, just as any tool or technique can be used for good as well as for bad. For example, scraping non-public data, which is not accessible to everyone on the web, can be unethical and can also invite legal trouble. So, avoid doing that. Let us look at some cases where web scrapers broke the rules and try to learn from them.

Some of the legal cases that found web scraping on the wrong side of the law:

[Figure: legal cases involving web scraping]

That is why, in order to perform ethical web scraping, web scrapers need to follow some rules. Let us discuss them before scraping the web.

Web Scraping Rules:

Before we start scraping the web, there are some rules that we must follow to avoid legal issues. They are:

  • Check the Terms and Conditions of the website before you scrape it. The Legal Use of Data section describes which data we are allowed to use. Usually, the data you scrape should not be used for commercial purposes.
  • Check the robots.txt file. Every website keeps its crawling rules in a robots.txt file, which can usually be viewed by appending /robots.txt to the site's root URL. We should inspect what is allowed and, most importantly, what is disallowed. For example, let us inspect the robots.txt file of Twitter.

[Figure: the robots.txt file of Twitter]

  • Keep the pace low. If your bot or program requests data from the website too aggressively, it may be considered spamming. Add wait time between requests so that the program behaves more like a human.
  • Use public content only.
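
The first two rules can be checked in code. The following is a minimal sketch, not a complete crawler: it uses Python's standard urllib.robotparser module to decide which URLs may be fetched and pauses between requests. The robots.txt text, the bot name tutorial-bot, the example.com URLs, and the stand-in fetch function are all assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "tutorial-bot"  # hypothetical bot name


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def polite_fetch(urls, rules, fetch, delay_seconds):
    """Fetch only the permitted URLs, pausing between consecutive requests."""
    pages = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # skip anything the site disallows
        if pages:
            time.sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages


rules = load_rules(ROBOTS_TXT)
suggested_delay = rules.crawl_delay(USER_AGENT)  # delay requested by the site, if any

urls = [
    "https://example.com/public/page.html",
    "https://example.com/private/data.html",
]
# A stand-in fetch function keeps the sketch offline; a real scraper would
# pass something like: lambda u: requests.get(u, timeout=10).text
pages = polite_fetch(urls, rules, fetch=lambda u: "<html></html>", delay_seconds=0.01)
print(list(pages))
```

For a real site, RobotFileParser can also download the file itself through set_url() and read(), and the delay would normally be at least the value the site suggests rather than the shortened one used here.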

How to Perform Web Scraping Using Python?

Now that we are familiar with what web scraping is and why it is used, we are all set to dive into how a web scraping project is carried out. Let us take a look at the workflow of a web scraping project before moving ahead with the actual hands-on.

Web Scraping Workflow:

The workflow of a web scraping project is commonly divided into three steps. First, fetch the web pages that you want to retrieve data from. Second, apply web scraping techniques to extract the data. Third, store the data in a structured form. The image below depicts the process.

[Figure: web scraping workflow]
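
To make the three steps concrete, here is a small sketch that runs entirely offline. It is an illustration under stated assumptions, not the method used in the demo: a hard-coded page stands in for the fetched one, Python's built-in html.parser is swapped in for BeautifulSoup so that nothing needs to be installed, and the item names, the data-price attribute, and the ItemParser class are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (fetch): a tiny hard-coded page stands in for a downloaded one.
PAGE = """
<html><body>
  <ul>
    <li data-price="10">Notebook</li>
    <li data-price="3">Pencil</li>
  </ul>
</body></html>
"""


class ItemParser(HTMLParser):
    """Step 2 (extract): collect each <li> as a {name, price} record."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._price = None  # set only while we are inside an <li> with a price

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._price = dict(attrs).get("data-price")

    def handle_data(self, data):
        if self._price is not None and data.strip():
            self.rows.append({"name": data.strip(), "price": self._price})

    def handle_endtag(self, tag):
        if tag == "li":
            self._price = None


parser = ItemParser()
parser.feed(PAGE)

# Step 3 (store): write the records as CSV (to memory here, to a file in practice).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(parser.rows)
print(buffer.getvalue())
```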

Setting up Python Web Scraper:

We will be using Python 3 and Jupyter Notebook throughout the hands-on, along with two main packages:

  • requests, for performing HTTP requests (import requests)
  • BeautifulSoup, for handling all of the HTML processing (from bs4 import BeautifulSoup)

The demo also uses the lxml parser, pandas, and xlsxwriter, so these need to be installed as well.
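
Before starting, it can help to confirm that the packages are importable. The sketch below is one possible check using the standard importlib module. The package list is an assumption based on what the demo uses: lxml is the parser passed to BeautifulSoup, xlsxwriter is the Excel engine passed to pandas, and openpyxl is what recent pandas versions typically use to read .xlsx files.

```python
from importlib.util import find_spec

# Import names assumed to be needed by the demo (see the note above).
REQUIRED = ["requests", "bs4", "lxml", "pandas", "xlsxwriter", "openpyxl"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All packages are available.")
```

Note that BeautifulSoup is installed under the name beautifulsoup4 but imported as bs4.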

Demo: Step-by-Step Guide to Web Scraping a Wikipedia Page

In this demonstration, we will walk you through your first web scraping project. We will scrape a Wikipedia page to fetch the list of Indian billionaires published by Forbes in the year 2018. With the same Python web scraping program, we can fetch the list again even after it gets updated for the year 2019. Exciting, right? Let us move ahead and get our hands dirty.

Step 1: First, fetch the web page and get its HTML as text with the help of the Python requests library.

#import the Python requests library to query a website
import requests
#specify the URL you want to scrape
Link = "https://en.wikipedia.org/wiki/Forbes_list_of_Indian_billionaires"
#download the web page and get its HTML as text
Link_text = requests.get(Link).text
print(Link_text)

Output:

[Screenshot: the raw HTML of the page, printed as text]
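
Step 1 calls requests.get without a timeout or a status check, so a slow server or an error page would go unnoticed. A slightly more defensive version might look like the following sketch. The helper name fetch_html, its get parameter, and the _FakeResponse class are assumptions added for illustration; the fake response only exists so that the example runs without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


class _FakeResponse:
    """Minimal stand-in for requests.Response, so the sketch runs offline."""

    text = "<html><title>demo</title></html>"

    def raise_for_status(self):
        pass


# Offline demonstration: inject a fake get() instead of hitting the network.
html = fetch_html("https://example.com", get=lambda url, timeout: _FakeResponse())
print(html)
```

In the demo, the same helper could be used as Link_text = fetch_html(Link), which keeps the rest of the steps unchanged.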

Step 2: In order to fetch useful information, convert Link_text (which is a string) into a BeautifulSoup object. Import the BeautifulSoup library from bs4.

#import the BeautifulSoup library to pull data out of HTML and XML files
from bs4 import BeautifulSoup
#to convert Link_text into a BeautifulSoup Object
soup = BeautifulSoup(Link_text, 'lxml')
print(soup)

Output:

[Screenshot: the HTML of the page, printed from the BeautifulSoup object]

Step 3: Use the prettify() function to print the HTML with proper indentation.

#make the indentation proper
print(soup.prettify())

Output:

[Screenshot: the same HTML, printed with indentation]

Step 4: To fetch the web page title, use soup.title.

#To take a look at the title of the web page
print(soup.title)

Output: The first title tag is printed.

<title>Forbes list of Indian billionaires - Wikipedia</title>

Step 5: But we want only the string part of the title, not the tags.

#Only the string not the tags
print(soup.title.string)

Output:

Forbes list of Indian billionaires - Wikipedia

Step 6: We can also explore the <a></a> tags in the soup object.

#First <a></a> tag
soup.a

Output: First <a></a> tag can be seen here.

<a id="top"></a>

Step 7: Explore all <a></a> tags.

#all the <a> </a> tags
soup.find_all('a')

Output:

[Screenshot: the list of all <a> tags on the page]
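
The tags returned by find_all('a') are usually more useful once their text and link target are pulled out. The following sketch shows one way to do that on a small hand-written snippet. The HTML and the names Person A and Person B are placeholders rather than data from the real page, and the built-in html.parser is used here so that lxml is not required.

```python
from bs4 import BeautifulSoup

# A small hand-written snippet standing in for the downloaded page.
HTML = """
<p>
  <a id="top"></a>
  <a href="/wiki/Person_A" title="Person A">Person A</a>
  <a href="/wiki/Person_B" title="Person B">Person B</a>
</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")  # None when the attribute is absent
    if href is None:
        continue  # e.g. <a id="top"></a> has no link target
    links.append((a.get_text(strip=True), href))

print(links)
```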

Step 8: Just as we fetched the title tag and the <a> tags, we will now fetch all the table tags.

#Fetch all the table tags
all_table = soup.find_all('table')
print(all_table)

Output:

[Screenshot: all the table tags on the page]

Step 9: Since our aim is to get the list of billionaires from the wiki page, we need to find out the class name of the table. Go to the web page, right-click the table, and choose Inspect (or Inspect Element) to open it in the browser's developer tools.

[Screenshot: the table element in the browser's inspector]

So, our table class name is ‘wikitable sortable’. Let us move ahead and fetch the list.
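
The class names can also be listed from Python instead of the browser. The sketch below uses a hand-written two-table snippet as a stand-in for the real page; the infobox table is a hypothetical example. It also shows a CSS selector as an alternative way to match the table: matching on the exact string 'wikitable sortable' depends on the class attribute being written exactly that way, while the selector matches any table that carries both classes.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a page with more than one table.
HTML = """
<table class="infobox"><tr><td>summary</td></tr></table>
<table class="wikitable sortable"><tr><td>data</td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# class is a multi-valued attribute, so get("class") returns a list of names.
table_classes = [table.get("class") for table in soup.find_all("table")]
print(table_classes)

# The selector matches on both class names, regardless of their order
# or of any extra classes on the tag.
target = soup.select_one("table.wikitable.sortable")
print(target.td.string)
```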

Step 10: Now, fetch the table tag with the class name “wikitable sortable”. Note that find() returns only the first matching table.

#fetch the first table tag with class name="wikitable sortable"
our_table = soup.find('table', class_='wikitable sortable')
print(our_table)

Output:

[Screenshot: the HTML of the selected table]

Step 11: As you can see, the information that we want to retrieve from the table is inside <a> tags. So, find all the <a> tags in our_table.

#In the table that we fetched, find all the <a></a> tags
table_links = our_table.find_all('a')
print(table_links)

Output:

[Screenshot: the list of <a> tags found in the table]

Step 12: To put the titles into a list, iterate over table_links and append each title by using the get() function.

#put the titles into a list
billionaires = []
for link in table_links:
    billionaires.append(link.get('title'))
print(billionaires)

Output:

[Screenshot: the list of titles collected from the table links]
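
Two details are worth guarding against here: find() returns None when no table matches, and get('title') returns None for links that have no title attribute. The sketch below shows one way to handle both, and it also skips repeated titles. It runs on a hand-written table with placeholder names (Person A, Person B) and a made-up citation link, not on the real page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the table on the page.
HTML = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/Person_A" title="Person A">Person A</a></td>
      <td><a href="#cite_note-1">[1]</a></td></tr>
  <tr><td><a href="/wiki/Person_B" title="Person B">Person B</a></td>
      <td><a href="/wiki/Person_A" title="Person A">Person A</a></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="wikitable sortable")

titles = []
if table is not None:  # find() returns None when nothing matches
    for link in table.find_all("a"):
        title = link.get("title")
        if title and title not in titles:  # skip missing titles and repeats
            titles.append(title)

print(titles)
```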

Step 13: Now that we have our required data in the form of a list, we will use the pandas library to save the data in an Excel file. Before that, we have to convert the list into a DataFrame.

#Convert the list into a DataFrame
import pandas as pd
df = pd.DataFrame(billionaires)
print(df)

Output:

[Screenshot: the DataFrame of titles]
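
A DataFrame built from a plain list gets a column named 0 and keeps any empty or repeated entries. As an optional refinement, the sketch below names the column and cleans the values before saving. The sample values and the column name Name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical scraped values, including a missing entry and a repeat.
names = ["Person A", None, "Person B", "Person A"]

df = (
    pd.DataFrame({"Name": names})
    .dropna()                 # remove entries with no title
    .drop_duplicates()        # keep each name once
    .reset_index(drop=True)   # renumber the rows from 0
)
print(df)
```

When writing such a DataFrame to Excel, passing index=False to to_excel leaves the row numbers out of the sheet.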

Step 14: Use the following method to write the DataFrame to an Excel file.

#To save the data into an Excel file
#(the with block saves and closes the file; writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('indian_billionaires.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='List')

Now our data has been saved into an Excel workbook with the name ‘indian_billionaires.xlsx’, inside a sheet named ‘List’.

Step 15: To make sure that the Excel workbook has been saved, read the file back using read_excel.

#check whether the file was written correctly
df1 = pd.read_excel('indian_billionaires.xlsx')
df1

Output:

[Screenshot: the DataFrame read back from the Excel file]

Congratulations! You have successfully created your first web scraping program.

In this tutorial, we have discussed web scraping from scratch. We have also covered some of the must-follow rules of web scraping. The demonstration at the end was a quick walkthrough of your first web scraping project. Now you have learned how to collect data. In our next sessions, we will discuss processing the collected data and making use of it. See you there.
