in Python by (16.4k points)

I have followed a few online guides trying to put together a script that can identify and download all the PDFs from a website, to save me from doing it manually. Here is my code so far:

from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

# connect to website and get list of all pdfs
url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
response = request.urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))

# clean the pdf link names
url_list = []
for el in links:
    url_list.append(("http://www.gatsby.ucl.ac.uk/teaching/courses/" + el['href']))
#print(url_list)

# download the pdfs to a specified location
for url in url_list:
    print(url)
    fullfilename = os.path.join('E:\webscraping', url.replace("http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/", "").replace(".pdf",""))
    print(fullfilename)
    request.urlretrieve(url, fullfilename)

The code seems to find all the pdfs (uncomment print(url_list) to see this). However, it fails at the download stage. In particular, I get the following error and I'm not able to work out what has gone wrong:

E:\webscraping>python get_pdfs.py
http://www.gatsby.ucl.ac.uk/teaching/courses/http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf
E:\webscraping\http://www.gatsby.ucl.ac.uk/teaching/courses/cribsheet
Traceback (most recent call last):
  File "get_pdfs.py", line 26, in <module>
    request.urlretrieve(url, fullfilename)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Can someone help me?

1 Answer

by (26.4k points)

Take a look at the following implementation. I've used the requests module instead of urllib to do the download, and the .select() method instead of .find_all() so there's no need for re. The 404 in your traceback comes from prepending the base path to hrefs that are already full URLs (you can see the doubled address in your printed output); resolving each href with urljoin avoids that.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
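
For reference, here is a minimal sketch of why urljoin fixes the 404 from your original script. The absolute href below is the one shown in your printed output; the relative form is just a hypothetical example:

from urllib.parse import urljoin

base = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"

# An href that is already a full URL is returned unchanged, instead of being
# glued onto the base path (which is what produced the doubled URL and the 404).
print(urljoin(base, "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf"))
# http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf

# A relative href (hypothetical) is resolved against the page's directory.
print(urljoin(base, "ml1-2016/cribsheet.pdf"))
# http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf

If you also want the script to fail loudly on a broken link instead of silently writing an error page to disk, you could call .raise_for_status() on each requests.get() result before writing its content.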
