How to get html with javascript rendered sourcecode by using selenium

Question

asked Jul 19, 2019 in DevOps and Agile by Han Zhyang (19.7k points)

I run a query on one web page, then I get a result URL. If I right-click see HTML source, I can see the HTML code generated by JS. If I simply use urllib, python cannot get the JS code. So I see some solutions using selenium. Here's my code:

from selenium import webdriver
url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
driver = webdriver.PhantomJS(executable_path='C:\python27\scripts\phantomjs.exe')
driver.get(url)
print driver.page_source
>>> <html><head></head><body></body></html>

Obviously, It's not right!!

Here's the source code I need in right-click windows, (I want the INFORMATION part)

</script></div><div class="searchColRight"><div id="topActions" class="clearfix
noPrint"><div id="breadcrumbs" class="left"><a title="Results Summary"
href="Default.aspx? _act=VitalSearchR ...... <<INFORMATION I NEED>> ...
to view the entire record.</p></div><script xmlns:msxsl="urn:schemas-microsoft-com:xslt">
jQuery(document).ready(function() {
jQuery(".ancestry-information-tooltip").actooltip({
href: "#AncestryInformationTooltip", orientation: "bottomleft"});
});

=========== So my question is =============== How to get the information generated by JS?

1 Answer

Prabhpreet Kaur · Answer 1 · 2019-07-19T12:00:14+0000

I have had the same problem and I solved it using the below code:

First, execute_script

driver=webdriver.Chrome()
driver.get(urls)
innerHTML = driver.execute_script("return document.body")
#print(driver.page_source)

2. Second, parse HTML using beautifulsoup(You can Downloaded beautifulsoup by pip command)

import bs4 #import beautifulsoup
import re
from time import sleep
sleep(1) #wait one second
root=bs4.BeautifulSoup(innerHTML,"lxml") #parse HTML using beautifulsoup
viewcount=root.find_all("span",attrs={'class':'short-view-count style-scope yt-view-count-renderer'}) #find the value which you need.

3. Third, print out the value you need

for span in viewcount:
print(span.string)

Complete code

from selenium import webdriver
import lxml
urls="http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2"
driver = webdriver.PhantomJS()
##driver=webdriver.Chrome()
driver.get(urls)
innerHTML = driver.execute_script("return document.body")
##print(driver.page_source)
import bs4
import re
from time import sleep
sleep(1)
root=bs4.BeautifulSoup(innerHTML,"lxml")
viewcount=root.find_all("span",attrs={'class':'short-view-count style-scope yt-view-count-renderer'})
for span in viewcount:
print(span.string)
driver.quit()

I hope this helps!

If you are interested to learn Selenium on a much deeper level and want to become a professional in the testing domain, check out Intellipaat’s Selenium course!

How to get html with javascript rendered sourcecode by using selenium

1 Answer

Related questions

Browse Categories