
I am currently working through thenewboston's video on writing a web crawler in Python. For some reason, I'm getting an SSLError and I don't know why the error is occurring:

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    # requests.get('https://www.thenewboston.com/', verify=True)
    while page <= max_pages:
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")  # explicit parser avoids a bs4 warning
        for link in soup.find_all('a', {'class': 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)
        page += 1

creepy_crawly(1)

1 Answer


You can build the web crawler with urllib instead; it tends to be more robust here and has no problem fetching HTTPS pages. Below is a usage example of that library (in Python 3 the function lives in urllib.request):

from urllib.request import urlopen

link = 'https://www.intellipaat.com'
html = urlopen(link).read().decode('utf-8')  # fetch the page and decode the response bytes
print(html)

A few lines are all you need to grab the HTML from a page.
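If urllib raises the same SSLError, the usual cause is a missing or outdated CA certificate bundle rather than the crawler code itself. As a minimal sketch (assuming the third-party certifi package is installed), you can pass urlopen an SSL context that trusts certifi's bundle:

import ssl
import certifi
from urllib.request import urlopen

link = 'https://www.intellipaat.com'
# build a default SSL context backed by certifi's up-to-date CA bundle
context = ssl.create_default_context(cafile=certifi.where())
html = urlopen(link, context=context).read().decode('utf-8')
print(html)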

I also recommend running a regex over the HTML to grab the other links it contains; an example of that (using the re library) would be:

import re
from urllib.parse import urlparse

for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # search the HTML for other URLs
    # keep absolute URLs as-is (minus any #fragment); resolve relative ones
    # against the scheme and host of the original page (`link` from above)
    found = url.split("#", 1)[0] if url.startswith("http") \
        else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(link)) + url.split("#", 1)[0]
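Putting the pieces together, here is a minimal sketch of the question's creepy_crawly loop rebuilt on top of urlopen and the regex above (the search URL comes from the question; everything else is just the two snippets combined):

import re
from urllib.parse import urlparse
from urllib.request import urlopen

def creepy_crawly(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        html = urlopen(url).read().decode('utf-8')  # fetch the page over HTTPS
        for match in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):
            # resolve relative links against the current page's scheme and host
            found = match.split("#", 1)[0] if match.startswith("http") \
                else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(url)) + match.split("#", 1)[0]
            print(found)
        page += 1

creepy_crawly(1)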

