in DevOps and Agile by (19.7k points)

I want to get the value of a variable declared inside a JavaScript block in the HTML, but the script has no id and is not attached to any element. How can I get this data?

Because there is no element to address, only a variable name, I don't know how to do it.

Website HTML:


<script type="text/javascript">

var imgInfoData = 'data which i want to crawl'

</script>

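My Python code:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# set url
HOMEPAGE = "https://land.naver.com/info/complexGallery.nhn?newComplex=Y&startImage=Y&rletNo=102235"

# open web
driver = webdriver.Firefox()
driver.wait = WebDriverWait(driver, 2)
driver.get(HOMEPAGE)

# try to get text from html
time.sleep(1)
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, '//script["var"]'))).text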

1 Answer


I checked the site you are scraping, and the script is already included in the HTML page, so you don't need webdriver; you can just use requests and BeautifulSoup.

Get the HTML using requests:

res = requests.get(url, headers=headers, params=params)

Then parse the HTML text with BeautifulSoup, collect the script tags, and find the one that contains var imgInfoData:

    # fragment from inside the function shown under "Full code"
    soup = BeautifulSoup(res.text, "html5lib")
    scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
    for script in scripts:
        if "var imgInfoData" in script.text:  # script with imgInfoData captured
            return script.text.replace("var imgInfoData =", "").strip()[:-1]

Just remove the "var imgInfoData =" prefix and the trailing ";" from the text to get the string value, or use a regex to extract the JSON string from the text.
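The regex route mentioned above could look roughly like the sketch below. The sample_html string and the extract_img_info helper are hypothetical stand-ins (the real page content is not shown in this thread), and the pattern assumes the assignment ends with a ";" at the end of a line.

```python
import re

# Hypothetical stand-in for res.text; the real payload is not shown in the thread.
sample_html = """
<script type="text/javascript">
var imgInfoData = {"images": [{"id": 1, "title": "front"}]};
</script>
"""

# Capture everything between "var imgInfoData =" and the first ";" that ends a line.
# DOTALL lets the value span several lines; MULTILINE lets "$" match at each line end.
PATTERN = re.compile(r"var\s+imgInfoData\s*=\s*(.*?);\s*$", re.MULTILINE | re.DOTALL)


def extract_img_info(html_text):
    """Return the raw text assigned to imgInfoData, or None if it is not found."""
    match = PATTERN.search(html_text)
    return match.group(1) if match else None


print(extract_img_info(sample_html))
```

Compared with replace() plus slicing off the last character, this does not depend on imgInfoData being the only statement in the script tag, though it would still stop early if a line inside the value happened to end with ";".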

Full code:

# requires the requests, beautifulsoup4 and html5lib packages
import requests
from bs4 import BeautifulSoup

def getimgInfoData():
    url = "https://land.naver.com/info/complexGallery.nhn"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    params = {"newComplex": "Y",
              "startImage": "Y",
              "rletNo": "102235"}
    # fetch the page; the script block is already present in the static HTML
    res = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(res.text, "html5lib")
    scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
    for script in scripts:
        if "var imgInfoData" in script.text:  # script with imgInfoData captured
            # drop the "var imgInfoData =" prefix and the trailing ";"
            return script.text.replace("var imgInfoData =", "").strip()[:-1]
    return None

print(getimgInfoData())

Then, if you want, parse the string returned by getimgInfoData() into a Python object with json.loads().
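A minimal sketch of that last step, assuming the extracted value is valid JSON. The raw string and the parse_img_info helper are hypothetical; a JavaScript literal that uses single quotes or unquoted keys is not valid JSON and would come back as None here.

```python
import json

# Hypothetical return value of getimgInfoData(); the real payload is not shown in the thread.
raw = '{"images": [{"id": 1, "title": "front"}]}'


def parse_img_info(raw_text):
    """Return the decoded object, or None when the text is missing or not valid JSON."""
    if raw_text is None:
        return None
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        return None


data = parse_img_info(raw)
print(data["images"][0]["title"])
```

Returning None for both a missing script and unparseable text keeps the caller's check simple; raising instead would be reasonable if the two cases need to be told apart.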
