Web Scraping Using Python
Web scraping with Python has been around for a while, and it has become much more popular in the past decade. Python makes it straightforward to extract data from web pages automatically. In this module, we will discuss web scraping in Python from scratch and then walk step by step through our first Python web scraping project.
What Is Web Scraping in Python?
Web scraping is the process of automatically collecting data from the web. To fetch the data, all we need is the URL, or web address, of the page that we want to scrape. The fetched data usually arrives in an unstructured form (raw HTML). In order to make use of the data or collect useful insights from it, we transform it into a structured form and then store it for further processing. This whole process is called web scraping.
Why Web Scraping Using Python?
Now that we are familiar with what web scraping in Python is, let us discuss why we perform it and for which business scenarios it is useful. Data has become a commodity in the 21st century: data-driven technologies have experienced a significant rise, and an abundance of data is generated from different sources on a daily basis. But how do we collect that data in order to make use of it? Web scraping is one answer.
Let us look at some of the industrial applications and business scenarios in which web scraping is used:
Data Science
For learning Data Science, we need large amounts of data. Web scraping with Python can help fulfill this requirement.
Market Research
Before launching a product or service, companies can study the market in advance with the help of web scraping.
Tracking Competitive Pricing
Web scraping with Python can help us study the service or product pricing of competitors in order to stay ahead in the market.
Monitoring Brand Value
Web scraping can be used in order to build brand intelligence and monitor how customers feel about a product or a service.
Lead Generation
With the help of web scraping, businesses can grow their lead generation by gathering contact details of businesses or individuals.
Is Web Scraping with Python Legal?
Well, this is one of the most common questions that arise when it comes to web scraping (also known as data scraping), and the answer can't be summed up in one word. Not all web scraping is legal. Scraping publicly available data is generally considered acceptable, but the rules depend on the website's terms and on local law. Just as any tool or technique can be used for good as well as for bad, web scraping can also cause legal issues. For example, scraping non-public data, which is not accessible to everyone on the web, can be unethical and can invite legal trouble, so it is advised to avoid doing that. Let us take a look at some of the cases where web scrapers broke the rules and try to learn from them.
Some of the legal cases that found web scraping to be on the wrong side of the law:
This is why, in order to perform ethical web scraping, web scrapers need to follow some rules. Let us discuss them before scraping the web.
Python Web Scraping Rules:
Before we start scraping the web, there are some rules that we must follow to avoid legal issues. They are:
- Check the Terms and Conditions of the website before we scrape it. The Legal Use of Data section will have the information about the data that we can use. Usually, the data we scrape should not be used for commercial purposes.
- Check the robots.txt file. Every website keeps its crawling rules in a text file named robots.txt at the root of the site (for example, https://twitter.com/robots.txt). We should inspect it to find the things that are allowed and, most importantly, the things that are not allowed.
- Keep the pace low. If our bot or program requests data from the website too aggressively, it might be treated as spamming. Add a wait time between requests so that the program behaves more like a human visitor.
- Use public content only.
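The robots.txt and pacing rules above can be checked from Python itself. The following is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content and the example.com URLs are hypothetical stand-ins, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent="*", default=1.0):
    """Return the crawl delay the site asks for, or a default pause in seconds."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a generic bot ("*") may fetch each (hypothetical) URL.
allowed_public = parser.can_fetch("*", "https://example.com/public/page.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")
pause = polite_delay(parser)

print(allowed_public, allowed_private, pause)
```

For a real site, we would load the live file with parser.set_url(...) followed by parser.read(), and call time.sleep(pause) between requests.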
How to Perform Web Scraping Using Python?
Now that we are familiar with what web scraping is and why it is used, we are all set to learn how to carry out a Python web scraping project. Let us take a look at the workflow of such a project before moving ahead with the actual hands-on.
Web Scraping Python Workflow:
A Python web scraping project workflow is commonly divided into three steps: first, fetch the web pages that we want to retrieve data from; second, parse the pages and extract the data we need; and finally, store the data in a structured form. The below image depicts the process of a web scraping project.
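The three steps can be sketched end to end in a few lines. This is only an illustration: a hard-coded HTML snippet with made-up names stands in for a downloaded page, the built-in html.parser is used so that no extra parser is required, and the helper names extract_rows and rows_to_csv are our own.

```python
import csv
import io

from bs4 import BeautifulSoup

# Step 1 (fetch): a hard-coded, hypothetical HTML snippet stands in for a page.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Asha</td><td>Pune</td></tr>
  <tr><td>Ravi</td><td>Delhi</td></tr>
</table>
</body></html>
"""


def extract_rows(html):
    """Step 2 (scrape): turn each table row into a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Step 3 (store): serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```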
Setting up Python Web Scraper:
We will be using Python 3 and Jupyter Notebook throughout the hands-on. We will be importing two main packages:
- For performing HTTP requests: requests
- For handling all of the HTML processing: BeautifulSoup from bs4
The later steps also use the lxml parser, pandas, and XlsxWriter, so these packages need to be installed as well.
Demo: A Step-by-step Guide on Python Web Scraping a Wikipedia Page
In this demonstration, we will be walking through our first Python web scraping project. We will be scraping the Wikipedia page to fetch the List of Indian Billionaires published by Forbes in the year 2018. We can fetch the List of Billionaires even after it gets updated for the year 2019 with the help of the same Python web scraping program. Exciting, right? Let us move ahead and get our hands dirty.
Step 1: Fetch the web page and get its HTML as text with the help of the Python requests library
#import the requests library to query a website
import requests
#specify the url we want to scrape from
Link = "https://en.wikipedia.org/wiki/Forbes_list_of_Indian_billionaires"
#convert the web page to text
Link_text = requests.get(Link).text
print(Link_text)
Output:
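The one-line fetch in Step 1 works when everything goes well. As a hedged sketch, the same fetch can be wrapped in a helper that sets a timeout and raises on HTTP errors. The name fetch_html, the optional session parameter, and the User-Agent string are our own assumptions, not part of the original program.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Use the given session-like object if provided, else the requests module.
    http = session or requests
    response = http.get(
        url,
        timeout=timeout,  # give up instead of waiting forever
        headers={"User-Agent": "tutorial-scraper/0.1"},  # hypothetical identifier
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Example call (needs network access, so it is left commented out here):
# Link_text = fetch_html("https://en.wikipedia.org/wiki/Forbes_list_of_Indian_billionaires")
```

Some sites reject requests that do not identify themselves, so sending a descriptive User-Agent is a reasonable habit.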
Step 2: In order to fetch useful information, convert Link_text (which is of string data type) into a BeautifulSoup object. Import the BeautifulSoup class from bs4. The 'lxml' parser used here requires the lxml package to be installed
#import the BeautifulSoup library to pull data out of HTML and XML files
from bs4 import BeautifulSoup
#to convert Link_text into a BeautifulSoup Object
soup = BeautifulSoup(Link_text, 'lxml')
print(soup)
Output:
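If the lxml package is not installed, BeautifulSoup raises a FeatureNotFound error for the 'lxml' parser. A small sketch of a fallback to Python's built-in html.parser follows. The helper name make_soup is our own, and the two parsers can differ slightly on malformed HTML.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Build a BeautifulSoup object, preferring lxml when it is available."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so fall back to the standard-library parser.
        return BeautifulSoup(html, "html.parser")


sample_soup = make_soup("<p>Hello</p>")
print(sample_soup.p.string)
```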
Step 3: Use the prettify() method to print the HTML with proper indentation
#print the HTML with one tag per line and proper indentation
print(soup.prettify())
Output:
Step 4: To fetch the web page title, use soup.title
#To take a look at the title of the web page
print(soup.title)
Output: The first title tag is printed.
<title>Forbes list of Indian billionaires - Wikipedia</title>
Step 5: We want only the string part of the title, not the tags
#Only the string not the tags
print(soup.title.string)
Output:
Forbes list of Indian billionaires - Wikipedia
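One thing worth knowing about .string: it returns None when a tag has more than one child, for example when the text contains nested tags. In that case, get_text() is the safer choice. A short sketch on a made-up heading:

```python
from bs4 import BeautifulSoup

# Hypothetical heading whose text is split by a nested <i> tag.
html = "<h1>Forbes list of <i>Indian</i> billionaires</h1>"
heading = BeautifulSoup(html, "html.parser").h1

via_string = heading.string        # None, because <h1> has several children
via_get_text = heading.get_text()  # concatenates all the text inside <h1>

print(via_string)
print(via_get_text)
```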
Step 6: We can also explore <a></a> tags in the soup object
#First <a></a> tag
soup.a
Output: First <a></a> tag can be seen here.
<a id="top"></a>
Step 7: Explore all <a></a> tags
#all the <a> </a> tags
soup.find_all('a')
Output:
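Often we want the link targets rather than the whole tags. The sketch below, run on a small hypothetical snippet, keeps only the <a> tags that have an href attribute and turns relative links into absolute ones with urljoin.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical snippet: one anchor without href, one relative and one absolute link.
SAMPLE_HTML = (
    '<a id="top"></a>'
    '<a href="/wiki/India" title="India">India</a>'
    '<a href="https://example.com/x">X</a>'
)
BASE_URL = "https://en.wikipedia.org/wiki/Forbes_list_of_Indian_billionaires"

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True skips anchors that have no href attribute at all.
links = [urljoin(BASE_URL, a["href"]) for a in sample_soup.find_all("a", href=True)]
print(links)
```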
Step 8: Again, just the way we fetched all <a> tags, we will fetch all table tags
#Fetch all the table tags
all_table = soup.find_all('table')
print(all_table)
Output:
Step 9: Since our aim is to get the List of Billionaires from the wiki page, we need to find out the table class name. Go to the web page, right-click the table, and choose Inspect to view the element in the browser's developer tools.
So, our table class name is ‘wikitable sortable’. Let us move ahead and fetch the list.
Step 10: Now, fetch the first table tag with the class name ‘wikitable sortable’ (find() returns only the first match)
#fetch the first table tag with class="wikitable sortable"
our_table = soup.find('table', class_= 'wikitable sortable')
print(our_table)
Output:
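Passing the two-word string 'wikitable sortable' as class_ matches only tables whose class attribute is exactly that string. If the page adds another class or reorders them, the search comes back empty. A CSS selector is more forgiving, as this sketch on hypothetical markup shows:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second table has the same classes, reordered, plus one more.
html = """
<table class="wikitable sortable"><tr><td>A</td></tr></table>
<table class="sortable wikitable plainrowheaders"><tr><td>B</td></tr></table>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: finds only the first table.
exact = sample_soup.find_all("table", class_="wikitable sortable")

# CSS selector: finds every table that carries both classes, in any order.
by_css = sample_soup.select("table.wikitable.sortable")

print(len(exact), len(by_css))
```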
Step 11: We can see that the information that we want to retrieve from the table sits inside <a> tags. So, find all the <a> tags in our_table.
#in the table that we fetched, find all the <a></a> tags
table_links = our_table.find_all('a')
print(table_links)
Output:
Step 12: In order to put the titles into a list, iterate over table_links and append each title using the get() method
#put the title into a list
billionaires = []
for link in table_links:
    billionaires.append(link.get('title'))
print(billionaires)
Output:
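Because get('title') returns None for links that have no title attribute (footnote links, for instance), the list can contain None entries. One way to skip them, sketched here on a hypothetical table fragment, is to ask find_all for only the <a> tags that carry a title:

```python
from bs4 import BeautifulSoup

# Hypothetical table fragment: one titled link and one footnote link without a title.
html = (
    "<table><tr>"
    '<td><a href="/wiki/A" title="Person A">Person A</a></td>'
    '<td><a href="#cite_note-1">[1]</a></td>'
    "</tr></table>"
)
sample_table = BeautifulSoup(html, "html.parser").find("table")

# get('title') yields None for the footnote link.
all_titles = [a.get("title") for a in sample_table.find_all("a")]

# title=True keeps only the tags that have a title attribute.
clean_titles = [a["title"] for a in sample_table.find_all("a", title=True)]

print(all_titles)
print(clean_titles)
```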
Step 13: Now that we have our required data in the form of a list, we will use the Python pandas library to save the data in an Excel file. Before that, we have to convert the list into a DataFrame
#Convert the list into a dataframe
import pandas as pd
df = pd.DataFrame(billionaires)
print(df)
Output:
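By default, the single column of this DataFrame is labelled 0. As an optional refinement, we can give it a readable name and start the row numbers at 1. The column name "Name" and the placeholder entries below are our own assumptions for illustration.

```python
import pandas as pd

# Placeholder entries standing in for the scraped list.
sample_names = ["Person A", "Person B"]

# A dict maps the column label to its values.
sample_df = pd.DataFrame({"Name": sample_names})

# Number the rows from 1 instead of 0.
sample_df.index = sample_df.index + 1

print(sample_df)
```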
Step 14: Use the following method to write data into an Excel file.
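#To save the data into an Excel file (requires the XlsxWriter package)
#the with block saves and closes the workbook; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('indian_billionaires.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='List')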
Now our data has been saved into an Excel workbook with the name ‘indian_billionaires.xlsx’ and inside a sheet named ‘List’.
Step 15: To make sure that the Excel workbook was saved correctly, read the file back using read_excel
#check whether the file was written correctly
df1= pd.read_excel('indian_billionaires.xlsx')
df1
Output:
Congratulations! We have successfully created our first web scraping program.
In this Python tutorial, we have discussed web scraping using Python from scratch. We have also covered some of the must-follow rules for web scraping with Python. The demonstration at the end of the tutorial was a quick walk-through of our first web scraping project.