in Python by (12.7k points)

I have followed a few online guides while trying to build a script that can identify and download all the PDFs from a website, to save me from doing it manually. Here is my code so far:

from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

# connect to website and get list of all pdfs
url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
response = request.urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))

# clean the pdf link names
url_list = []
for el in links:
    url_list.append(("http://www.gatsby.ucl.ac.uk/teaching/courses/" + el['href']))
#print(url_list)

# download the pdfs to a specified location
for url in url_list:
    print(url)
    fullfilename = os.path.join('E:\webscraping', url.replace("http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/", "").replace(".pdf", ""))
    print(fullfilename)
    request.urlretrieve(url, fullfilename)

The code seems to find all the PDFs (uncomment print(url_list) to see this). However, it fails at the download stage. Specifically, I get the following error and I cannot figure out what has gone wrong:

E:\webscraping>python get_pdfs.py
http://www.gatsby.ucl.ac.uk/teaching/courses/http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf
E:\webscraping\http://www.gatsby.ucl.ac.uk/teaching/courses/cribsheet
Traceback (most recent call last):
  File "get_pdfs.py", line 26, in <module>
    request.urlretrieve(url, fullfilename)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Can someone help me?

1 Answer

by (26.4k points)

Take a look at the following implementation. I've used the requests module instead of urllib to do the download, and the .select() method instead of .find_all() so that re is no longer needed.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
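
For reference, here is why the original script hit the 404. Judging from the URL printed in the traceback, the href pulled from the page already contains the full http://www.gatsby.ucl.ac.uk/teaching/courses/... address, so prepending the base path a second time produces a doubled URL the server cannot resolve (it also produces a destination path like E:\webscraping\http://..., which is not a valid Windows filename). urljoin handles this cleanly: it resolves relative hrefs against the page URL and leaves absolute hrefs untouched. A minimal sketch, using a sample href taken from the traceback:

from urllib.parse import urljoin

page_url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"

# Sample href as suggested by the traceback (assumption: the anchors on this
# page carry absolute URLs; a relative href works just as well below)
href = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf"

# Naive concatenation, as in the original script, doubles the prefix:
print("http://www.gatsby.ucl.ac.uk/teaching/courses/" + href)

# urljoin resolves the href against the page URL instead:
print(urljoin(page_url, href))
# -> http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf
print(urljoin(page_url, "ml1-2016/cribsheet.pdf"))
# -> http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf

Naming each file with link['href'].split('/')[-1], as in the code above, sidesteps the invalid Windows path in the same way.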

