
I am trying to scrape this website, which has multiple pages, using Scrapy. The problem is that I can't find the next-page URL. Do you have any idea how to scrape a website with multiple pages using Scrapy, or how to fix the error I'm getting with my code?

I tried the code below but it's not working:

class AbcdspiderSpider(scrapy.Spider):
    """
    Class docstring
    """
    name = 'abcdspider'
    allowed_domains = ['abcd-terroir.smartrezo.com']
    alphabet = list(string.ascii_lowercase)
    url = "https://abcd-terroir.smartrezo.com/n31-france/annuaireABCD.html?page=1&spe=1&anIDS=31&search="
    start_urls = [url + letter for letter in alphabet]
    main_url = "https://abcd-terroir.smartrezo.com/n31-france/"
    crawl_datetime = str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    start_time = datetime.datetime.now()

    def parse(self, response):
        self.crawler.stats.set_value("start_time", self.start_time)
        try:
            page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
            page_max = get_num_page(page)
            for index in range(page_max):
                producer_list = response.xpath('//div[@class="clearfix encart_ann"]/@onclick').getall()
                for producer in producer_list:
                    link_producer = self.main_url + producer
                    yield scrapy.Request(url=link_producer, callback=self.parse_details)
                next_page_url = "/annuaireABCD.html?page={}&spe=1&anIDS=31&search=".format(index)
                if next_page_url is not None:
                    yield scrapy.Request(response.urljoin(self.main_url + next_page_url))
        except Exception as e:
            self.crawler.stats.set_value("error", e.args)

I am getting this error:

'error': ('range() integer end argument expected, got unicode.',)

1 Answer


The error is caused by these lines:

page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
page_max = get_num_page(page)
for index in range(page_max):

Here .get() returns the text of the span as a unicode string, not an integer, so the page variable holds a string.

The error message shows that page_max is still a unicode string when it is passed to range(), which is what raises the exception.
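As a small illustration of the point above, the snippet below reproduces the failure outside Scrapy. It is only a sketch: the value "12" is a hypothetical stand-in for whatever get_num_page returned, and under Python 3 the exception is a TypeError with different wording from the Python 2 message in the question.

```python
# Hypothetical value: a string where range() needs an integer.
page_max = "12"

try:
    range(page_max)          # fails: range() does not accept strings
except TypeError as exc:
    message = str(exc)       # keep the message so it can be inspected

# Converting with int() first makes the same call succeed.
pages = list(range(int(page_max)))
```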

The range function expects an integer. To fix this, split the unicode string and convert the number with int(). For example, if the page count follows '/ ' in the span text, you can use:

page = response.xpath('//div[@class="pageStuff"]/span/text()').get().split('/ ')[1]
for index in range(int(page)):
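A slightly more defensive variant is sketched below. It assumes the pager text looks something like "1 / 12" (the real format on the site is not shown in the question), and the helper names parse_page_count and page_urls are made up for this example. It also numbers pages from 1, since range(n) starts at 0 and the loop in the question would therefore request page=0.

```python
import re
from urllib.parse import urlencode


def parse_page_count(text, default=1):
    """Return the last integer in a pager label such as '1 / 12'.

    Falls back to `default` when the label is missing or has no digits,
    which avoids an IndexError or AttributeError on unexpected pages.
    """
    if not text:
        return default
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else default


def page_urls(base_url, letter, page_max):
    """Yield one listing URL per page, numbered 1..page_max inclusive."""
    for page in range(1, page_max + 1):
        # Query parameters copied from the URL in the question.
        query = urlencode({"page": page, "spe": 1, "anIDS": 31, "search": letter})
        yield "{}?{}".format(base_url, query)
```

Inside parse, the text returned by .get() could be passed to parse_page_count, and each URL from page_urls could then be wrapped in a scrapy.Request. Whether that matches the site's real pagination is untested here.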
