
I am currently following thenewboston's video on writing a web crawler in Python. For some reason, I'm getting an SSLError, and I don't know why the error is occurring:

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    # requests.get('https://www.thenewboston.com/', verify=True)
    while page <= max_pages:
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)
        page += 1

creepy_crawly(1)

1 Answer


You can build the web crawler with urllib instead; it is part of the standard library and has no problem fetching HTTPS pages. Below is a usage example:

import urllib.request

link = 'https://www.intellipaat.com'
html = urllib.request.urlopen(link).read().decode('utf-8')  # fetch the page and decode the response bytes
print(html)

A few lines are all you need to grab the HTML from a page.

I also recommend running a regex over the HTML to grab other links. An example (using the re library and urllib.parse) would be:

import re
from urllib.parse import urlparse

for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # search the HTML for other URLs
    # Strip any #fragment; prepend the scheme and host of the original page to relative links
    found = url.split("#", 1)[0] if url.startswith("http") \
        else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(link)) + url.split("#", 1)[0]
    print(found)
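Putting the two snippets together, here is a minimal sketch of the original creepy_crawly rewritten on top of urllib, assuming the same thenewboston search URL from the question (the 'item-name' class filtering from the BeautifulSoup version is replaced by the generic regex above):

import re
import urllib.request
from urllib.parse import urlparse

def creepy_crawly(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        html = urllib.request.urlopen(url).read().decode('utf-8')  # fetch and decode the page
        for found in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):
            # Strip fragments; prepend the page's scheme and host to relative links
            print(found.split("#", 1)[0] if found.startswith("http")
                  else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(url)) + found.split("#", 1)[0])
        page += 1

creepy_crawly(1)

Unlike the BeautifulSoup version, this prints every link the regex can find rather than only the anchors with class 'item-name'.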

