Back

Explore Courses Blog Tutorials Interview Questions
+1 vote
2 views
in DevOps and Agile by (19.7k points)

I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:

from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'

SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium

# (first need to download Chrome webdriver, or a firefox webdriver, etc)

driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)

driver.get(URL)

time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search

seq_query_field = driver.find_element_by_id("seq")

seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")

blast_button.click()

time.sleep(60)

At that point, I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html and does not leave me with a local file that looks like it does in the web browser, with images, etc.

content = driver.page_source

with open('webpage.html', 'w') as f:

    f.write(content)

Is there a simple way to 'save [full page] as' using python? Ideally, I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?

UPDATE - Follow up question for James' answer So I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that's hidden in the manually save'd page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on right).enter image description here

This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, but the text at issue only appears in the browser-rendered page on the one saved by James' script:

<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">

These options control formatting of alignments in results pages. The

default is HTML, but other formats (including plain text) are available.

PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST. 

The Advanced view option allows the database descriptions to be sorted by various indices in a table.

</p>

Any idea why this is happening? 

1 Answer

0 votes
by (62.9k points)

As you noted, selenium cannot interact with the browser's context menu to use Save as..., thus instead to do so, you may use an external automation library like pyautogui.

pyautogui.hotkey('ctrl', 's')

time.sleep(1)

pyautogui.typewrite(SEQUENCE + '.html')

pyautogui.hotkey('enter')

This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some additional work with the tab and arrow keys.

Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.

Full code, in which I also added conditional waits to improve speed:

# open page with selenium

# (first need to download Chrome webdriver, or a firefox webdriver, etc)

driver = webdriver.Chrome()

driver.get(URL)

# enter sequence into the query field and hit 'blast' button to search

seq_query_field = driver.find_element_by_id("seq")

seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")

blast_button.click()

# wait until results are loaded

WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

# open 'Save as...' to save html and assets

pyautogui.hotkey('ctrl', 's')

time.sleep(1)

pyautogui.typewrite(SEQUENCE + '.html')

pyautogui.hotkey('enter')

To learn in-depth about Selenium, sign up for an industry based Selenium certification

Interested to learn more about Python? Come & join our Python online course

Browse Categories

...