Web scraping has been around for a while, but it has become very popular in the past decade, and Python has made it very easy: with its help, extracting data from a web page can be done automatically. In this module, we will discuss web scraping from scratch, and this tutorial will also guide you through a step-by-step demonstration of your first web scraping project.
Following is the list of all the topics that we will cover in this module, in case you need to jump to a specific one.
So, without further ado, let’s get started.
Web scraping is the process of collecting data from the web by automating the retrieval of data from web pages. To fetch the data, all we need is the URL, the web address of the page we want to scrape. The fetched data arrives in an unstructured form; to make use of it or to collect useful insights, we transform it into a structured form, and once converted, we store the data for further processing. This whole process is called web scraping.
Now that we are familiar with what web scraping is, let us discuss why we perform it, and in which business scenarios it is useful. Data has become a commodity in the 21st century: data-driven technologies have experienced a significant rise, and an abundance of data is generated from different sources on a daily basis. But how do we collect that data in order to make use of it?
Here are some of the industrial applications and business scenarios in which web scraping is used:
Well, this is one of the most common questions that arises when it comes to web scraping, also known as data scraping, and the answer cannot be summed up in one word. Not all web scraping is legal. Web scraping services that extract publicly available data are generally legal, but scraping can still lead to legal issues: like any tool or technique, it can be used for good as well as for bad. For example, scraping non-public data that is not accessible to everyone on the web can be unethical and an invitation to legal trouble, so avoid doing that. Let us take a look at some cases where web scrapers broke the rules, and try to learn from them.
Some legal cases that found web scraping on the wrong side of the law:
That is why in order to perform ethical web scraping, web scrapers need to follow some rules. Let us discuss them before scraping the web.
Before we start scraping the web, there are some rules we must follow to avoid legal issues. They are:
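One rule commonly cited is to respect the site’s robots.txt file, which states which paths crawlers may fetch. As a minimal sketch, Python’s standard-library urllib.robotparser can check a URL before you scrape it; the rules and URLs below are hypothetical, fed in directly so the example is self-contained (a real scraper would point set_url() at the live site’s robots.txt and call read()):

```python
#check robots.txt rules before scraping
#(hypothetical rules parsed directly; a real scraper would use
# rp.set_url("https://example.com/robots.txt") followed by rp.read())
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/wiki/Some_page"))   # True: allowed
print(rp.can_fetch("*", "https://example.com/private/secret"))   # False: disallowed
```

If can_fetch() returns False for a path, an ethical scraper simply skips it.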
Now that we are familiar with what web scraping is and why it is used, we are all set to dive into understanding how to carry out a web scraping project. Let us take a look at the workflow of a web scraping project before moving ahead with the actual hands-on.
Web Scraping Workflow:
The web scraping project workflow is commonly categorized into three steps. First, fetch the web pages you want to retrieve data from; second, apply web scraping techniques to extract the data; and last, store the data in a structured form. The image below depicts the process of a web scraping project.
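The three steps can be sketched end to end in a few lines. To keep the sketch self-contained, the snippet below parses a small inline HTML fragment instead of fetching a live page (the table and names are made up for illustration); a real project would obtain the HTML with the requests library, as the demonstration later shows:

```python
#a minimal sketch of the workflow: fetch -> parse -> store
#(the inline HTML below is a made-up stand-in for a fetched page)
import csv
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable">
  <tr><th>Name</th></tr>
  <tr><td><a title="Person A">Person A</a></td></tr>
  <tr><td><a title="Person B">Person B</a></td></tr>
</table>
"""

#step 2: parse the unstructured HTML into structured data
soup = BeautifulSoup(html, "html.parser")
names = [a.get("title") for a in soup.find_all("a")]
print(names)   # ['Person A', 'Person B']

#step 3: store the structured data (here, a CSV file)
with open("names.csv", "w", newline="") as f:
    csv.writer(f).writerows([[n] for n in names])
```

Each step maps directly onto the project we build below, where the stored form is an Excel workbook instead of a CSV file.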
Setting up Python Web Scraper:
We will be using Python 3 and Jupyter notebook throughout the hands-on. We will be importing two packages as well.
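If the packages are not already available in your environment, they can be installed with pip. The package names below are assumed from the imports used in this tutorial (lxml is the parser passed to BeautifulSoup, and xlsxwriter is the Excel engine used at the end):

```shell
pip install requests beautifulsoup4 lxml pandas xlsxwriter
```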
In this demonstration, we will walk you through your first web scraping project: scraping a Wikipedia page to fetch the list of Indian billionaires published by Forbes in the year 2018. The same Python program can fetch the list even after it gets updated for the year 2019. Exciting, right? Let us move ahead and get our hands dirty.
Step 1: First, fetch the web page and convert the HTML page into text with the help of the Python requests library.
#import the python requests library to query a website
import requests

#specify the url you want to scrape from
Link = "https://en.wikipedia.org/wiki/Forbes_list_of_Indian_billionaires"

#convert the web page to text
Link_text = requests.get(Link).text
print(Link_text)
Step 2: In order to extract useful information, convert Link_text (which is of string datatype) into a BeautifulSoup object. Import the BeautifulSoup library from bs4.
#import BeautifulSoup library to pull data out of HTML and XML files
from bs4 import BeautifulSoup

#convert Link_text into a BeautifulSoup object
soup = BeautifulSoup(Link_text, 'lxml')
print(soup)
Step 3: Use the prettify() method to print the parse tree with proper indentation.
#make the indentation proper
print(soup.prettify())
Step 4: To fetch the web page title, use soup.title.
#To take a look at the title of the web page
print(soup.title)
Output: The first title tag is returned:
<title>Forbes list of Indian billionaires - Wikipedia</title>
Step 5: But we want only the string part of the title, not the tags. Use soup.title.string.
#Only the string not the tags
print(soup.title.string)
Forbes list of Indian billionaires - Wikipedia
Step 6: We can also explore the <a></a> tags in the soup object.
#First <a></a> tag
soup.a
Output: The first <a></a> tag can be seen here.
Step 7: Explore all <a></a> tags.
#all the <a></a> tags
soup.find_all('a')
Step 8: Again, just the way we fetched the title tag and the <a> tags, we will fetch all the table tags.
#Fetch all the table tags
all_table = soup.find_all('table')
print(all_table)
Step 9: Since our aim is to get the list of billionaires from the wiki page, we need to find out the table’s class name. Go to the web page, place the cursor over the table, right-click, and choose Inspect (or open the browser developer tools with Ctrl+Shift+I) to inspect the element.
So, our table class name is ‘wikitable sortable’. Let us move ahead and fetch the list.
Step 10: Now, fetch the first table with the class name “wikitable sortable”.
#fetch the first table with class name="wikitable sortable"
our_table = soup.find('table', class_='wikitable sortable')
print(our_table)
Step 11: As you can see, the information we want to retrieve from the table is inside <a> tags. So, find all the <a> tags in our_table and store them in table_links.
#find all the <a></a> tags in the table we fetched
table_links = our_table.find_all('a')
print(table_links)
Step 12: In order to put the titles into a list, iterate over table_links and append each title by using the get() method.
#put the titles into a list
billionaires = []
for links in table_links:
    billionaires.append(links.get('title'))
print(billionaires)
Step 13: Now that we have our required data in the form of a list, we will use the Python pandas library to save the data in an Excel file. Before that, we have to convert the list into a DataFrame.
#Convert the list into a dataframe
import pandas as pd
df = pd.DataFrame(billionaires)
print(df)
Step 14: Use the following method to write the DataFrame to an Excel file.
#To save the data into an excel file
#(a context manager saves and closes the file automatically)
with pd.ExcelWriter('indian_billionaires.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='List')
Now our data has been saved into an Excel workbook named ‘indian_billionaires.xlsx’, inside a sheet named ‘List’.
Step 15: Just to make sure the Excel workbook was saved correctly, read the file back using read_excel.
#check if it’s done right or not
df1 = pd.read_excel('indian_billionaires.xlsx')
df1
Congratulations! You have successfully created your first web scraping program.
In this tutorial, we have discussed web scraping from scratch, covered some of the must-follow rules of web scraping, and walked through your first web scraping project in the demonstration at the end. Now you have learnt how to collect data. In our next sessions, we will discuss processing and making use of the collected data. See you there.