Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Python by (19.9k points)

Problem: I have the following XML snippet:

...snip...

<p class="p_cat_heading">DEFINITION</p>

<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>

<p class="p_cat_heading">PRONUNCIATION </p>

..snip...

I need to search the totality of the XML, find the heading that has text DEFINITION, and print the associated definitions. The format of the definitions is not consistent and can change attributes/elements so the only reliable way of capturing all of it is to read until the next element with attribute p_cat_heading.

Right now I am using the following code to find all of the headers:

for heading in root.findall(".//*[@class='p_cat_heading']"):

    if heading.text == "DEFINITION":

        <WE FOUND THE CORRECT HEADER - TAKE ACTION HERE>

Things I have tried:

Using lxml's getnext method. This gets the next sibling which has the attribute "p_cat_heading" which isn't what I want.

following_sibling - lxml is supposed to support this but it throws "following-sibling is not found in prefix-map"

My Solution:

I haven't finished it, but because my XML is short I was just going to get a list of all elements, iterate until the one with the DEFINITION attribute, and then iterate until the next element with the p_cat_heading attribute. This solution is horrible and ugly, but I can't seem to find a clean alternative.

What I'm looking for:

A more Pythonic way of printing the definition which is "this, these" in our case. Solution may use either xpath or some alternative. Python-native solutions preferred, but anything will do.

1 Answer

0 votes
by (25.1k points)

You can use xpath:

//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]

Or you can use lxml:

from lxml import html

 

data = [your snippet above]

exp = "//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]"

 

tree = html.fromstring(data)

target = tree.xpath(exp)

 

for i in target:

    print(i.text_content())

Related questions

0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
0 votes
1 answer

Browse Categories

...