in Python by (12.7k points)

I have followed a few online guides trying to build a script that can find and download all PDFs from a website, to save me from doing it manually. Here is my code so far:

from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

# connect to website and get list of all pdfs
response = request.urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))

# clean the pdf link names
url_list = []
for el in links:
    url_list.append(("" + el['href']))
# print(url_list)

# download the pdfs to a specified location
for url in url_list:
    fullfilename = os.path.join('E:\webscraping', url.replace("", "").replace(".pdf", ""))
    request.urlretrieve(url, fullfilename)

The code seems to find all the PDFs (uncomment the print(url_list) to see this). However, it fails at the download stage. Specifically, I get the error below and I can't work out what has gone wrong:



Traceback (most recent call last):
  File "", line 26, in <module>
    request.urlretrieve(url, fullfilename)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Can someone help me?

1 Answer

by (26.4k points)

Take a look at the following implementation. I've used the requests module instead of urllib to do the download, and the .select() method instead of .find_all() to avoid having to use re.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = ""

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

Welcome to Intellipaat Community. Get your technical queries answered by top developers!
