0 votes
in Python by (16.4k points)

I have followed a few online guides trying to put together a script that can identify and download all PDFs from a website, to save me doing it manually. Here is my code so far:

from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

# connect to website and get list of all pdfs
response = request.urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))

# clean the pdf link names
url_list = []
for el in links:
    url_list.append(("" + el['href']))
# print(url_list)

# download the pdfs to a specified location
for url in url_list:
    fullfilename = os.path.join('E:\webscraping', url.replace("", "").replace(".pdf",""))
    request.urlretrieve(url, fullfilename)

The code seems to find all the PDFs (uncomment the print(url_list) to see this). However, it fails at the download stage. Specifically, I get the following error and I can't work out what has gone wrong:



Traceback (most recent call last):
  File "", line 26, in <module>
    request.urlretrieve(url, fullfilename)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\", line 223, in urlopen
    return, data, timeout)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\", line 532, in open
    response = meth(req, response)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\", line 570, in error
    return self._call_chain(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Can someone help me?

1 Answer

0 votes
by (26.4k points)

Take a look at the following implementation. I've used the requests module instead of urllib to do the download, and the .select() method instead of .find_all() so that re isn't needed.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = ""

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

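The answer imports urljoin for a reason: hrefs scraped from a page are often relative paths, and passing them straight to a downloader is the likely source of the question's HTTP Error 404. urljoin resolves each href against the page URL; a quick sketch with a made-up base URL:

```python
from urllib.parse import urljoin

base_url = "http://example.com/papers/index.html"  # illustrative page URL

# A root-relative href is resolved against the site root...
print(urljoin(base_url, "/docs/report.pdf"))  # http://example.com/docs/report.pdf

# ...while a bare relative href is resolved against the page's directory.
print(urljoin(base_url, "notes.pdf"))         # http://example.com/papers/notes.pdf
```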
