
I am currently working through thenewboston's video on writing a web crawler in Python. For some reason, I'm getting an SSLError and I don't know why the error is occurring:

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    # requests.get('https://www.thenewboston.com/', verify=True)
    while page <= max_pages:
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")  # explicit parser avoids a bs4 warning
        for link in soup.find_all('a', {'class': 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)
        page += 1

creepy_crawly(1)

1 Answer


You can build the web crawler with urllib instead; it tends to be more robust here and has no problem fetching HTTPS pages. Below is a usage example of that library (in Python 3 the function lives in urllib.request):

from urllib.request import urlopen

link = 'https://www.intellipaat.com'
html = urlopen(link).read().decode('utf-8')  # fetch the page and decode the response bytes
print(html)

A few lines are all you need to grab the HTML from a page.
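If urllib raises the same SSLError, the usual cause is a missing or outdated CA certificate bundle rather than the crawler code itself. As a minimal sketch (assuming the third-party certifi package is installed), you can pass urlopen an SSL context that trusts certifi's bundle:

import ssl
import certifi
from urllib.request import urlopen

link = 'https://www.intellipaat.com'
# build a default SSL context backed by certifi's up-to-date CA bundle
context = ssl.create_default_context(cafile=certifi.where())
html = urlopen(link, context=context).read().decode('utf-8')
print(html)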

I also recommend running a regex over the HTML to grab the other links it contains; an example of that (using the re library) would be:

import re
from urllib.parse import urlparse

for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # search the HTML for other URLs
    # keep absolute URLs as-is (minus any #fragment); resolve relative ones
    # against the scheme and host of the original page (`link` from above)
    found = url.split("#", 1)[0] if url.startswith("http") \
        else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(link)) + url.split("#", 1)[0]
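Putting the pieces together, here is a minimal sketch of the question's creepy_crawly loop rebuilt on top of urlopen and the regex above (the search URL comes from the question; everything else is just the two snippets combined):

import re
from urllib.parse import urlparse
from urllib.request import urlopen

def creepy_crawly(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        html = urlopen(url).read().decode('utf-8')  # fetch the page over HTTPS
        for match in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):
            # resolve relative links against the current page's scheme and host
            found = match.split("#", 1)[0] if match.startswith("http") \
                else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(url)) + match.split("#", 1)[0]
            print(found)
        page += 1

creepy_crawly(1)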

