Web scraping with Python has been around for a while, but it has grown considerably more popular in the past decade. With Python, extracting data from a web page can be automated with just a few lines of code. In this module, we will discuss web scraping in Python from scratch, and walk through a step-by-step demonstration of our first Python web scraping project.
So, without further ado, let’s get started.
Web scraping is the process of collecting data from the web, and web scraping in Python automates that process. To fetch web data, all we need is the URL of the page we want to scrape. The fetched data arrives in an unstructured form; to make use of it or collect useful insights from it, we transform it into a structured form, and then store it for further processing. This end-to-end process is what we call web scraping.
Now that we are familiar with what web scraping in Python is, let us discuss why we perform it and in which business scenarios it is useful. Data has become a commodity in the 21st century: data-driven technologies have risen sharply, and an abundance of data is generated from different sources every day. But how do we collect that data in order to make use of it?
Web scraping has many industrial applications. Let us discuss some of the business scenarios in which it can be used.
Well, this is one of the most common questions that arise when it comes to web scraping (also known as data scraping), and the answer cannot be summed up in one word. Not all web scraping is legal. Web scraping services that extract publicly available data are legal. But, like any tool or technique in the world, web scraping can be used for good as well as for bad. For example, scraping non-public data, which is not accessible to everyone on the web, can be unethical and may invite legal trouble, so it is advised to avoid doing that. Let us take a look at some of the cases where web scrapers broke the rules and try to learn from them.
Some of the legal cases that found web scraping to be on the wrong side of the law:
This is why, in order to perform ethical web scraping, web scrapers need to follow some rules. Let us discuss them before scraping the web.
Before we start scraping the web, there are some rules that we must follow to avoid legal issues. They are:
Now that we are familiar with what web scraping is and why web scraping is used, we are all set to dive into how to carry out a Python web scraping project. Let us take a look at the workflow of such a project before moving ahead with the actual hands-on.
Web Scraping Python Workflow:
A Python web scraping project workflow is commonly categorized into three steps: first, fetch the web pages we want to retrieve data from; second, extract the data with web scraping tools; and finally, store the data in a structured form. The image below depicts this process.
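The three-step workflow above can be sketched in miniature. To keep the sketch self-contained, the "fetch" step is simulated with a hard-coded HTML snippet (in a real project this string would come from a live request, as shown later in this tutorial), and parsing uses BeautifulSoup with Python's built-in 'html.parser' backend; the names in the snippet are purely illustrative.

```python
from bs4 import BeautifulSoup

# Step 1: "fetch" -- a hard-coded page stands in for the text of a live request
html = """
<html><body>
  <table class="wikitable sortable">
    <tr><td><a title="Mukesh Ambani">Mukesh Ambani</a></td></tr>
    <tr><td><a title="Azim Premji">Azim Premji</a></td></tr>
  </table>
</body></html>
"""

# Step 2: extract -- parse the page and pull out the pieces we want
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")
names = [link.get("title") for link in table.find_all("a")]

# Step 3: store -- keep the result in a structured form (here, a list)
print(names)  # ['Mukesh Ambani', 'Azim Premji']
```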
Setting up Python Web Scraper:
We will be using Python 3 and Jupyter Notebook throughout the hands-on. We will be importing two packages as well.
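The packages used in this tutorial can be installed with pip; the command below assumes a standard PyPI setup and also pulls in lxml (the parser backend used later), pandas, and xlsxwriter for the final export.

```shell
# install the libraries used in this tutorial
pip install requests beautifulsoup4 lxml pandas xlsxwriter
```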
In this demonstration, we will be walking through our first Python web scraping project. We will be scraping the Wikipedia page to fetch the List of Indian Billionaires published by Forbes in the year 2018. We can fetch the List of Billionaires even after it gets updated for the year 2019 with the help of the same Python web scraping program. Exciting, right? Let us move ahead and get our hands dirty.
Step 1: Fetch the web page and convert the HTML page into text with the help of the Python Requests library
#import the Python Requests library to query a website
import requests

#specify the URL we want to scrape
Link = "https://en.wikipedia.org/wiki/Forbes_list_of_Indian_billionaires"

#convert the web page to text
Link_text = requests.get(Link).text
print(Link_text)
Step 2: In order to extract useful information, convert Link_text (which is a string) into a BeautifulSoup object. Import the BeautifulSoup library from bs4
#import the BeautifulSoup library to pull data out of HTML and XML files
from bs4 import BeautifulSoup

#convert Link_text into a BeautifulSoup object
soup = BeautifulSoup(Link_text, 'lxml')
print(soup)
Step 3: With the help of the prettify() function, make the indentation proper
#make the indentation proper
print(soup.prettify())
Step 4: To fetch the web page title, use soup.title
#take a look at the title of the web page
print(soup.title)
Output: The page's <title> tag is printed.
<title>Forbes list of Indian billionaires - Wikipedia</title>
Step 5: We want only the string part of the title, not the tags
#only the string, not the tags
print(soup.title.string)
Forbes list of Indian billionaires - Wikipedia
Step 6: We can also explore <a></a> tags in the soup object
#first <a></a> tag
soup.a
Output: First <a></a> tag can be seen here.
Step 7: Explore all <a></a> tags
#all the <a></a> tags
soup.find_all('a')
Step 8: Again, just the way we fetched title tags, we will fetch all table tags
#fetch all the table tags
all_table = soup.find_all('table')
print(all_table)
Step 9: Since our aim is to get the List of Billionaires from the wiki page, we need to find out the table's class name. Go to the web page, place the cursor over the table, right-click, and choose Inspect (or press Ctrl+Shift+I) to inspect the element.
So, our table class name is ‘wikitable sortable’. Let us move ahead and fetch the list.
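We can also find the class name from code rather than the browser, by listing the class attribute of every table in the soup. A small self-contained sketch (using a hard-coded HTML snippet in place of the live page, and the built-in 'html.parser' backend):

```python
from bs4 import BeautifulSoup

# stand-in for the live page fetched earlier
html = """
<table class="infobox"><tr><td>sidebar</td></tr></table>
<table class="wikitable sortable"><tr><td>data</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# list the class attribute of every table to spot the one we want
classes = [t.get("class") for t in soup.find_all("table")]
print(classes)  # [['infobox'], ['wikitable', 'sortable']]
```

Note that BeautifulSoup returns the class attribute as a list, because an element can carry several classes at once.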
Step 10: Now, fetch all table tags with the class name ‘wikitable sortable’
#fetch the first table tag with class name "wikitable sortable"
our_table = soup.find('table', class_='wikitable sortable')
print(our_table)
Step 11: We can see that the information we want to retrieve from the table sits inside <a> tags. So, find all the <a> tags in our_table and store them in table_links.
#find the <a></a> tags in the table we fetched
table_links = our_table.find_all('a')
print(table_links)
Step 12: To put the titles in a list, iterate over table_links and append each title using the get() function
#put the titles into a list
billionaires = []
for links in table_links:
    billionaires.append(links.get('title'))
print(billionaires)
Step 13: Now that we have our required data in the form of a list, we will be using Python Pandas library to save the data in an Excel file. Before that, we have to convert the list into a DataFrame
#convert the list into a DataFrame
import pandas as pd
df = pd.DataFrame(billionaires)
print(df)
Step 14: Use the following method to write data into an Excel file.
#save the data into an Excel file
writer = pd.ExcelWriter('indian_billionaires.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='List')
writer.save()
Now our data has been saved into an Excel workbook with the name ‘indian_billionaires.xlsx’ and inside a sheet named ‘List’.
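One caveat: in pandas 2.0 and later, ExcelWriter.save() has been removed, so the pattern above only works on older versions. Using the writer as a context manager saves the file automatically and works across versions; the sketch below assumes the xlsxwriter engine is installed and uses a small illustrative DataFrame.

```python
import pandas as pd

# a small illustrative DataFrame standing in for the scraped list
df = pd.DataFrame({'name': ['Mukesh Ambani', 'Azim Premji']})

# the context manager closes (and saves) the workbook automatically
with pd.ExcelWriter('indian_billionaires.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='List', index=False)
```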
Step 15: To verify that the Excel workbook was saved, read the file back using read_excel
#check that the file was written correctly
df1 = pd.read_excel('indian_billionaires.xlsx')
df1
Congratulations! We have successfully created our first web scraping program.
In this Python tutorial, we have discussed web scraping using Python from scratch, and mentioned some of the must-follow rules for performing it. The demonstration at the end of the tutorial was a quick walkthrough of our first web scraping project.
If you are interested in doing an end-to-end certification course in Python, you can go through our Python Training.