
I want to get a variable declared inside a JavaScript block in the HTML, but there are no ids or elements to select. How can I get this data?

Since there is no element address, only a variable name, I don't know how to do it.

Website HTML:


<script type="text/javascript">

var imgInfoData = 'data which i want to crawl'

</script>

My python Code:

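import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# set url
HOMEPAGE = "https://land.naver.com/info/complexGallery.nhn?newComplex=Y&startImage=Y&rletNo=102235"

# open web
driver = webdriver.Firefox()
driver.wait = WebDriverWait(driver, 2)
driver.get(HOMEPAGE)

# try to get text from html
time.sleep(1)
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, '//script["var"]'))).text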

1 Answer


I checked the site you are scraping, and the script is already included in the HTML page, so you don't need webdriver; you can just use requests and BeautifulSoup. Get the HTML using requests:

res = requests.get(url, headers=headers, params=params)

Then parse the HTML text with BeautifulSoup, collect the script tags, and find the one that contains var imgInfoData:

# inside the function, after the request above
soup = BeautifulSoup(res.text, "html5lib")
scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
for script in scripts:
    if "var imgInfoData" in script.text:  # script with imgInfoData captured
        return script.text.replace("var imgInfoData =", "").strip()[:-1]

Just remove the leading "var imgInfoData =" and the trailing ";" from the text to get the string value, or use a regex to pull the JSON string out of the text.
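The regex route mentioned above could look like the sketch below. It assumes the assignment sits on a single line and may or may not end with a semicolon; the helper name extract_js_var and the sample HTML are hypothetical and only mirror the snippet from the question, not the live page.

```python
import re

def extract_js_var(html, name):
    """Return the raw right-hand side of `var <name> = ...`, or None if absent."""
    # Capture everything after '=' up to an optional trailing ';' at the end
    # of the line. Only single-line assignments are handled.
    pattern = re.compile(
        r"var\s+" + re.escape(name) + r"\s*=\s*(.+?)\s*;?\s*$",
        re.MULTILINE,
    )
    match = pattern.search(html)
    return match.group(1) if match else None

# Hypothetical sample, copied from the question's HTML snippet.
sample_html = """
<script type="text/javascript">
var imgInfoData = 'data which i want to crawl'
</script>
"""

raw_value = extract_js_var(sample_html, "imgInfoData")
print(raw_value)
```

Unlike slicing off the last character, this does not depend on the script ending with a semicolon, but it would need adjusting if the value spans several lines.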

Full Code:

import requests
from bs4 import BeautifulSoup

def getimgInfoData():
    url = "https://land.naver.com/info/complexGallery.nhn"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    params = {"newComplex": "Y",
              "startImage": "Y",
              "rletNo": "102235"}

    # fetch the page; the script is part of the static HTML
    res = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(res.text, "html5lib")
    scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
    for script in scripts:
        if "var imgInfoData" in script.text:  # script with imgInfoData captured
            # drop the declaration prefix and the trailing ';'
            return script.text.replace("var imgInfoData =", "").strip()[:-1]
    return None

print(getimgInfoData())

Then convert the result of getimgInfoData() to JSON (for example with json.loads) if you want.
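As a rough illustration of that last step, the sketch below decodes a captured value with the standard json module. The sample payload and the helper name parse_js_value are made up for illustration. It assumes the real value is a valid JSON literal; a JavaScript literal with single quotes or unquoted keys is not, so the helper falls back to returning the cleaned text.

```python
import json

def parse_js_value(raw):
    """Try to decode a captured JavaScript value as JSON; return the text on failure."""
    # Tolerate surrounding whitespace and a trailing ';' left over from the script.
    text = raw.strip().rstrip(";").strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Not valid JSON (e.g. a single-quoted JS string): hand back the cleaned text.
        return text

sample = '[{"imgNo": 1, "imgUrl": "a.jpg"}];'  # hypothetical payload
data = parse_js_value(sample)
print(data[0]["imgUrl"])
```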
