in Data Science by (18.4k points)

I'm trying to scrape a page that makes little use of h1, h2, h3 headings and relies mostly on the strong tag instead. I want to search for a specific word (in a p tag) and, if I find it, also collect the text of the levels above it (tagged with strong) ...

I noticed that the lists I create with i.find_previous_siblings('strong') come back empty, whereas soup.body.findAll('strong') works and returns a huge list of items (which is what I need).

How can I get the list of strong tags using find_previous_siblings?

Example that works (it prints a huge list):

import requests
from bs4 import BeautifulSoup

url = 'http://www.mpsp.mp.br/portal/page/portal/DO_Estado/2020/DO_20-06-2020.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Print the text of every strong tag anywhere inside body
for i in soup.body.findAll('strong'):
    print(i.text.strip())

Example that does not work (it prints empty lists):

import requests
from bs4 import BeautifulSoup, element

url = 'http://www.mpsp.mp.br/portal/page/portal/DO_Estado/2020/DO_20-06-2020.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Walk only the direct children of body
for i in soup.body.contents:
    if isinstance(i, element.NavigableString):
        continue
    if isinstance(i, element.Tag):
        texts = i.text
        if texts == 'HELENA BONILHA DE TOLEDO LEITE':
            print(i.find_previous_siblings('h1'))
            print(i.find_previous_siblings('strong'))
            print(i)

1 Answer

by (36.8k points)

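find_previous_siblings only returns earlier elements that share the same parent as the tag you call it on. The strong tags are not siblings of your matching tag because each strong is nested inside another paragraph tag, p, so the result is an empty list.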
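To make the nesting issue concrete, here is a minimal sketch on a small, made-up HTML fragment. The markup, the section names, and the id="target" attribute are assumptions for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the nesting described above:
# each <strong> lives inside its own <p>, not next to the target <p>.
html = """
<body>
<p><strong>SECTION A</strong></p>
<p>first entry</p>
<p><strong>SECTION B</strong></p>
<p id="target">HELENA BONILHA DE TOLEDO LEITE</p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
target = soup.find("p", id="target")

# Siblings share the same parent (<body>), so only the <p> tags qualify.
sibling_strongs = target.find_previous_siblings("strong")
sibling_paragraphs = target.find_previous_siblings("p")

# find_previous walks backward through the whole document, ignoring nesting.
nearest_strong = target.find_previous("strong")

print(len(sibling_strongs))
print(len(sibling_paragraphs))
print(nearest_strong.get_text(strip=True))
```

The sibling search finds the earlier p tags but no strong tags, while find_previous reaches the strong nested in the paragraph just above the target.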

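I think you want find_previous, which searches backward through the document regardless of nesting and returns the nearest match (find_all_previous returns every match as a list), like: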

from bs4 import BeautifulSoup, element
import requests

url = 'http://www.mpsp.mp.br/portal/page/portal/DO_Estado/2020/DO_20-06-2020.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Walk only the direct children of body
for i in soup.body.contents:
    if isinstance(i, element.NavigableString):
        continue
    if isinstance(i, element.Tag):
        texts = i.text
        if texts == 'HELENA BONILHA DE TOLEDO LEITE':
            # Nearest preceding h1 / strong in document order, at any nesting level
            print(i.find_previous('h1'))
            print(i.find_previous('strong'))
            print(i)
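Since the question asks for a list of strong tags rather than a single one, find_all_previous may be closer to what is needed. The sketch below is one possible approach, not the only one: the helper name headings_above and the sample markup are hypothetical, and it assumes the search word appears inside a p tag as described in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page will differ.
html = """
<body>
<p><strong>SECTION A</strong></p>
<p>first entry</p>
<p><strong>SECTION B</strong></p>
<p>HELENA BONILHA DE TOLEDO LEITE</p>
<p><strong>SECTION C</strong></p>
</body>
"""

def headings_above(soup, needle):
    """Return the text of every <strong> that precedes the first <p>
    containing `needle`, ordered from the top of the document down."""
    for p in soup.find_all("p"):
        if needle in p.get_text():
            # find_all_previous yields the nearest match first, so reverse it.
            previous = p.find_all_previous("strong")
            return [s.get_text(strip=True) for s in reversed(previous)]
    return []

soup = BeautifulSoup(html, "html.parser")
headings = headings_above(soup, "HELENA")
print(headings)
```

Only the strong tags that come before the matching paragraph are returned; anything after it, such as SECTION C in the sample, is left out. Matching with `in` rather than `==` also tolerates surrounding whitespace in the paragraph text.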
