Back

Explore Courses Blog Tutorials Interview Questions
0 votes
3 views
in BI by (17.6k points)

I want to scrap data from this page (and pages similar to it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx

This page uses Power BI. Unfortunately, finding a way to scrap Power BI is hard, because everyone wants to scrap using/into Power BI, not from it. The closest answer was this question. Yet unrelated.

Firstly, I used Apache tika, and soon I realized the table data is been loading after loading the page. I need the rendered version of the page.

Therefore, I used Selenium. I wanted to Select All at the begining (sending Ctrl+A key combination), but it doesn't work. Maybe it is restricted by the page events (I also tried to remove all the events using developer tools, yet still Ctrl+A doesn't work.

I also tried to read the HTML contents, but Power BI puts div elements on the screen using position:absolute and distinguishing the location of a div in the table (both row and column) is an effortful activity.

Since Power BI uses JSON, I tried to read data from there. However, it is so complicated I couldn't find out the rules. It seems it puts keywords somewhere and uses their indices in the table.

Note: I realized that all of the data is not loaded and even shown at the same time. A div of class scroll-bar-part-bar is responsible to act as a scroll bar and moving that loads/show other parts of the data.

The code I used to read data is as follows. As mentioned, the order of the produced data differs from what is rendered on the browser:

from selenium import webdriver from selenium.webdriver.common.keys import Keys options = webdriver.ChromeOptions() options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe" driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe") driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9") parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div') children = parent.find_elements_by_xpath('.//*') values = [child.get_attribute('title') for child in children]

I appreciate solutions for any of the above problems. The most interesting for me though is the convention of storing Power BI data in JSON format. 

1 Answer

0 votes
by (47.2k points)
  • Putting the scroll part and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is done in the question):

parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')

children = parent.find_elements_by_xpath('.//*')

  • Then sort them using their location:

x = [child.location['x'] for child in children]

y = [child.location['y'] for child in children]

index = np.lexsort((x,y))

  • To sort what we have read in different lines, this code may help:

rows = []

row = []

last_line = y[index[0]]

for i in index:

    if last_line != y[i]:

        row.append[children[i].get_attribute('title')]

    else:

        rows.append(row)

        row = list([children[i].get_attribute('title')]

rows.append(row)

Want to learn more about Power BI, Come & join Power BI tutorial. Also, enroll in Power BI training to become proficient in this BI tool.

Related questions

0 votes
0 answers
asked Apr 27, 2021 in R Programming by Aytan (160 points)
0 votes
1 answer
asked Nov 12, 2020 in Python by ashely (50.2k points)
0 votes
1 answer
asked Nov 12, 2020 in Python by ashely (50.2k points)
Welcome to Intellipaat Community. Get your technical queries answered by top developers!

30.5k questions

32.5k answers

500 comments

108k users

Browse Categories

...