
I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:

  • starts with a product_list page with 10 products
  • a click on "next" button loads the next 10 products (URL doesn't change between the two pages)
  • I use LinkExtractor to follow each product link into the product page and get all the information I need

I tried to replicate the next-button AJAX call but couldn't get it working, so I'm giving Selenium a try. I can run Selenium's web driver in a separate script, but I don't know how to integrate it with scrapy. Where should I put the Selenium part in my scrapy spider?

My spider is pretty standard, like the following:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.log import INFO
from scrapy.selector import HtmlXPathSelector


class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
             callback='parse_product'),
    ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=INFO)
        hxs = HtmlXPathSelector(response)
        # actual data follows

Any ideas are appreciated. Thank you!


1 Answer


Scraping is fun, but when a page loads its content via AJAX it turns into tedious JavaScript reverse engineering.

Selenium is an automation testing suite used to drive a real browser from your favorite programming language. Although it was developed for testing, that makes it well suited to scraping pages that render their content with JavaScript.

So, when I hit a dynamic page, this is what I did:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Start the WebDriver and load the page
wd = webdriver.Firefox()
wd.get(URL)

# Wait for the dynamically loaded elements to show up
WebDriverWait(wd, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "pricerow")))

# And grab the page HTML source
html_page = wd.page_source
wd.quit()

# Now you can use html_page as you like
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, "html.parser")

You could even do the scraping with Selenium, but I load the HTML into BeautifulSoup.
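To make the BeautifulSoup step concrete, here is a minimal, self-contained sketch of pulling product names out of the rendered HTML. The markup below is invented for illustration and stands in for `wd.page_source`; the `pricerow` class echoes the wait condition above, and the `name` span is a hypothetical example.

```python
from bs4 import BeautifulSoup

# Stand-in for wd.page_source; this markup is made up for illustration.
html_page = """
<div id="productList">
  <div class="pricerow"><span class="name">Widget A</span> $9.99</div>
  <div class="pricerow"><span class="name">Widget B</span> $19.99</div>
</div>
"""

soup = BeautifulSoup(html_page, "html.parser")
# Collect the text of each product-name span inside a pricerow
names = [row.find("span", class_="name").get_text()
         for row in soup.find_all("div", class_="pricerow")]
print(names)  # ['Widget A', 'Widget B']
```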

The Selenium API is pragmatic, a bit too pragmatic even, and not Pythonic at all. Yes, you pass a tuple to visibility_of_element_located...
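As for the original question of *where* the Selenium part goes: one common pattern is to create the driver in the spider, render and paginate in `parse`, and feed the rendered HTML back into Scrapy's own selectors. This is only a sketch under assumptions: it presumes Firefox is installed, reuses the product-list XPath from the question, and the `"next"` link-text locator is a guess you would adjust to the real page.

```python
import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    start_urls = ['http://example.com/shanghai']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.wd = webdriver.Firefox()  # one browser for the whole crawl

    def parse(self, response):
        self.wd.get(response.url)
        while True:
            # Re-parse the rendered DOM with Scrapy's own selectors
            sel = scrapy.Selector(text=self.wd.page_source)
            xpath = '//div[@id="productList"]//dl[@class="t2"]//dt//a/@href'
            for href in sel.xpath(xpath).getall():
                yield response.follow(href, callback=self.parse_product)
            try:
                # Hypothetical locator for the "next" button; in real code,
                # add a WebDriverWait after the click before re-reading the DOM
                self.wd.find_element(By.LINK_TEXT, "next").click()
            except NoSuchElementException:
                break  # no more pages
        self.wd.quit()

    def parse_product(self, response):
        # actual data extraction follows
        pass
```

The product pages themselves are plain requests, so only the AJAX-paginated listing goes through the browser; everything else stays on Scrapy's fast HTTP path.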

