in Data Science by (18.4k points)

I am learning data science online, and I am trying to pull a table from an HTML page into a Jupyter notebook. The problem is that when I select with class='table', I get the contents of all the tabs and every table on the page, which is very messy.

This is the code I am using:

import requests
import lxml.html as lh
import pandas as pd
import csv
from bs4 import BeautifulSoup

url = 'https://www.worldometers.info/coronavirus/#countries'
page = requests.get(url)
print(page.status_code)  # Check the HTTP response status code; it should be 200

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

all_tables = soup.find_all("table")
right_table = soup.find('table', {'class': 'table'})
col_headers = [th.getText() for th in right_table.findAll('th')]
data = [[td.getText() for td in right_table.findAll('td')] for tr in right_table()]

The table has 13 columns, but when I combine the data with col_headers, it says I have 2990 columns. Please help me solve this.

1 Answer

0 votes
by (36.8k points)

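The list comprehension in your code "flattens" the table: the inner loop calls right_table.findAll('td'), which returns every <td> in the whole table regardless of the current row, so each row ends up holding all the cells (2990 is consistent with 230 data rows of 13 cells each). Use a nested list that iterates over the rows and collects only the cells of each row, as shown below: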

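# Iterate over the rows, then collect only the <td> cells of each row
data = [[td.text for td in tr.find_all("td")] for tr in right_table.find_all("tr")]
df = pd.DataFrame(data, columns=col_headers)  # col_headers as defined in the question
print(df.shape)  # (231, 13)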
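
To see the difference between the two comprehensions without fetching the page, here is a minimal sketch on a tiny made-up table. It uses the standard library's xml.etree.ElementTree in place of BeautifulSoup so that it runs without installing anything; the table contents and the variable names (html, flattened, nested, rows) are assumptions for illustration only, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the scraped table: 1 header row, 2 data rows, 3 columns.
html = """
<table>
  <tr><th>Country</th><th>Cases</th><th>Deaths</th></tr>
  <tr><td>A</td><td>10</td><td>1</td></tr>
  <tr><td>B</td><td>20</td><td>2</td></tr>
</table>
""".strip()
table = ET.fromstring(html)

headers = [th.text for th in table.iter("th")]

# Flattened (the bug): the inner loop ignores `tr` and walks the whole table,
# so every "row" holds all 6 cells.
flattened = [[td.text for td in table.iter("td")] for tr in table.iter("tr")]

# Nested (the fix): the inner loop only looks at the cells of the current row.
nested = [[td.text for td in tr.findall("td")] for tr in table.iter("tr")]

# The header row contains only <th> cells, so it yields an empty list; drop it.
rows = [row for row in nested if row]

print(len(flattened[0]))  # 6 cells in every flattened "row"
print(rows)               # one list of 3 cells per data row
```

With BeautifulSoup the same pattern applies: tr.find_all("td") plays the role of tr.findall("td") here. Filtering out the empty header row before building the DataFrame is optional, but it avoids a leading row of missing values.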
