in Web Technology by (40.7k points)

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
attr="5"
/>
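To make this concrete, here is a minimal Python sketch comparing a line-by-line regex scan with the standard library's xml.etree.ElementTree parser on the tag above. The variable names and the naive pattern are my own, picked only for illustration:

```python
import re
import xml.etree.ElementTree as ET

# The multi-line tag from above, one piece per line.
doc = '<tag\n\nattr="5"\n\n/>'

# Naive approach: look for a complete tag on each individual line.
naive = [m.group(0)
         for line in doc.splitlines()
         for m in re.finditer(r"<\w+[^>]*>", line)]

# A real parser does not care where the line breaks fall.
root = ET.fromstring(doc)

print(naive)                   # no single line holds a complete tag
print(root.tag, root.attrib)   # the parser still sees one element
```

The line-based scan finds nothing because no single line contains both the opening `<` and the closing `>`.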

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src="imgtag.gif" alt="<img>" />
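A small sketch of how this bites, assuming the usual "a tag ends at the next >" pattern. `ImgCollector` is a hypothetical helper built on Python's standard html.parser module, not something from the question:

```python
import re
from html.parser import HTMLParser

doc = '<img src="imgtag.gif" alt="<img>" />'

# Naive pattern: a tag runs from "<img" up to the next ">".
naive = re.findall(r"<img[^>]*>", doc)


class ImgCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgCollector()
collector.feed(doc)
collector.close()

print(naive)              # the match is cut off at the ">" inside alt
print(collector.images)   # the parser keeps the quoted value intact
```

The regex stops at the `>` inside the quoted alt value, so it returns a truncated tag, while the parser reports both attributes in full.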

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id="outer"><span id="inner">foo</span></span> 
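Here is a rough sketch of the nesting problem, using a typical non-greedy pattern of my own choosing against xml.etree.ElementTree:

```python
import re
import xml.etree.ElementTree as ET

doc = '<span id="outer"><span id="inner">foo</span></span>'

# The non-greedy match pairs the OUTER start tag with the INNER end tag.
naive = re.findall(r"<span[^>]*>(.*?)</span>", doc)

# A parser tracks nesting, so each span knows its own children.
root = ET.fromstring(doc)
nesting = [(el.get("id"), [child.get("id") for child in el])
           for el in root.iter("span")]

print(naive)     # "content" of the outer span, with an unclosed inner tag
print(nesting)   # outer contains inner; inner contains nothing
```

Switching to a greedy `.*` happens to work on this one input, but it then swallows everything between the first and last span as soon as two sibling spans appear.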

People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):

<span class="phonenum">(<span class="area code">703</span>)
<span class="prefix">348</span>-<span class="linenum">3020</span></span>
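As a sketch of one way around this: parse first, then run the regex on the extracted text rather than on the markup. The `PHONE` pattern below is an illustrative US-style pattern I made up, not a general phone number matcher:

```python
import re
import xml.etree.ElementTree as ET

doc = ('<span class="phonenum">(<span class="area code">703</span>)\n'
       '<span class="prefix">348</span>-<span class="linenum">3020</span></span>')

# Illustrative pattern only: "(ddd)", optional whitespace, "ddd-dddd".
PHONE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")

# Searching the raw markup finds nothing: the digits are split by tags.
on_markup = PHONE.findall(doc)

# Let the parser strip the markup, then search the plain text.
text = "".join(ET.fromstring(doc).itertext())
on_text = PHONE.findall(text)

print(on_markup)
print(on_text)
```

A regex is a perfectly good tool for the content once a parser has dealt with the structure.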

Comments may contain poorly formatted or incomplete tags:

<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>
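A quick sketch of what a naive link extractor does with that snippet. `LinkCollector` is a hypothetical helper on top of Python's html.parser, and the pattern is just the obvious first attempt:

```python
import re
from html.parser import HTMLParser

doc = ('<a href="foo">foo</a>\n'
       '<!-- FIXME:\n'
       '    <a href="\n'
       '-->\n'
       '<a href="bar">bar</a>')

# Naive pattern: grab whatever sits between href=" and the next quote.
naive = re.findall(r'<a href="([^"]*)"', doc)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of real <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed(doc)
collector.close()

print(naive)             # the half-written tag in the comment eats the real one
print(collector.hrefs)   # the parser skips the comment entirely
```

The unterminated `href="` inside the comment makes the regex run on until the next quote, so the real "bar" link is lost and a chunk of markup is reported as a URL instead.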

What other gotchas are you aware of?

1 Answer

0 votes
by (20.3k points)

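Here is an XML document that packs in several more gotchas: an entity whose value contains ]>, a > inside an attribute value, a CDATA section full of fake tags, and a processing instruction that appears to open a comment: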

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
<a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>
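If you want to see how far off a tag-hunting regex gets on this document, here is a small sketch using xml.etree.ElementTree as the reference. The pattern is my own simple guess at what such a regex usually looks like:

```python
import re
import xml.etree.ElementTree as ET

doc = """\
<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
<a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>
"""

# Naive approach: anything that looks like "<name" is a start tag.
naive_tags = re.findall(r"<([A-Za-z]\w*)", doc)

# Reference: the elements an actual XML parser reports.
root = ET.fromstring(doc)
real_tags = [el.tag for el in root.iter()]

print(naive_tags)          # includes "tags" from the CDATA section and the PI
print(real_tags)           # only the elements that really exist
print(root[0].get("b"))    # attribute value after entity expansion
```

The regex reports extra a and b "tags" that live inside the CDATA section and the processing instruction, while the parser sees only the two real elements and expands the entity in the attribute.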

HTML offers similar traps, for example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
    <!ENTITY % e "href='hello'">
    <!ENTITY e "<a %e;>">
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

The code below is also valid HTML 4.01. It relies on SGML's short-tag syntax, which few real-world parsers implement:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<HTML/
  <HEAD/
    <TITLE/>/
    <P/>
