Can you provide some examples of why it is hard to parse XML and HTML with a regex?


asked Aug 30, 2019 in Web Technology by Soni Kumari (40.6k points)

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
attr="5"
/>
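As a rough illustration of the line-based trap, here is a small Python sketch that scans the snippet above line by line with a regex and then feeds the same text to the standard library's `html.parser`. The names `DOC`, `TAG_ON_ONE_LINE`, and `TagCollector` are made up for this example.

```python
import re
from html.parser import HTMLParser

# The three-line tag from the example above.
DOC = '<tag\nattr="5"\n/>'

# Naive approach: look for a complete tag on each individual line.
TAG_ON_ONE_LINE = re.compile(r"<tag\b[^>]*>")
line_hits = [
    m.group(0)
    for line in DOC.splitlines()
    for m in TAG_ON_ONE_LINE.finditer(line)
]


class TagCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags via the default handle_startendtag.
        self.seen.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(DOC)
collector.close()

print(line_hits)       # matches found by the per-line regex
print(collector.seen)  # tags found by the parser
```

The per-line scan finds nothing because no single line holds the whole tag, while the parser reports one `tag` element with `attr="5"`.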

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src="imgtag.gif" alt="<img>" />
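To make this one concrete, a minimal Python sketch (the pattern choices and the `ImgCollector` class are assumptions for illustration, not anyone's recommended approach) compares two common regex shortcuts with the standard `html.parser` on the exact snippet above.

```python
import re
from html.parser import HTMLParser

SNIPPET = '<img src="imgtag.gif" alt="<img>" />'

# Shortcut 1: count "<img" occurrences as if each one started a tag.
naive_starts = re.findall(r"<img", SNIPPET)

# Shortcut 2: assume a tag runs up to the first ">".
naive_tag = re.search(r"<img\b[^>]*>", SNIPPET).group(0)


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> element."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


parser = ImgCollector()
parser.feed(SNIPPET)
parser.close()

print(len(naive_starts))  # how many "tags" the first shortcut sees
print(naive_tag)          # where the second shortcut thinks the tag ends
print(parser.images)      # what the parser sees
```

The first shortcut counts two tags, the second cuts the tag off inside the `alt` value, and the parser reports a single image whose `alt` attribute is the literal text `<img>`.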

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id="outer"><span id="inner">foo</span></span>
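A short Python sketch of the nesting problem, using made-up `<div>` samples (`NESTED` and `SIBLINGS` are illustrative strings, not taken from any real page): a lazy pattern pairs the outer start tag with the inner end tag, a greedy pattern swallows sibling elements, and a parser that simply counts depth gets both right.

```python
import re
from html.parser import HTMLParser

NESTED = "<div>a<div>b</div>c</div>"
SIBLINGS = "<div>x</div><div>y</div>"

# Lazy match: stops at the first </div>, which belongs to the inner element.
lazy = re.search(r"<div>(.*?)</div>", NESTED).group(1)

# Greedy match: runs to the last </div>, merging two separate elements.
greedy = re.search(r"<div>(.*)</div>", SIBLINGS).group(1)


class DepthTracker(HTMLParser):
    """Tracks how deeply <div> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


def max_div_depth(markup):
    tracker = DepthTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.max_depth


print(repr(lazy), repr(greedy))
print(max_div_depth(NESTED), max_div_depth(SIBLINGS))
```

Neither quantifier works for both inputs, because matching balanced pairs needs a counter (or a stack), which is what the parser-based version keeps.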

People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):

(703)
348-3020
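As a hedged illustration, suppose the number is marked up like the hypothetical `MARKUP` string below (the tags are invented for this sketch). A phone-number regex run on the raw source misses it, while the same regex run on the text extracted by `html.parser` finds it.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: renders as "(703) 348-3020" in a browser.
MARKUP = "<p><b>(703)</b> 348-<i>3020</i></p>"

PHONE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")


class TextExtractor(HTMLParser):
    """Collects only the character data, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)


extractor = TextExtractor()
extractor.feed(MARKUP)
extractor.close()
visible_text = "".join(extractor.pieces)

raw_match = PHONE.search(MARKUP)          # regex against the raw source
text_match = PHONE.search(visible_text)   # regex against the extracted text

print(raw_match)
print(text_match.group(0) if text_match else None)
```

The regex itself is fine here; the point is that it has to run on the text a parser hands back, not on the markup.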

Comments may contain poorly formatted or incomplete tags:

<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>
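Here is a minimal Python sketch of how a half-written tag inside a comment can derail a link-extracting regex. The `PAGE` string is a made-up sample, and `LinkCollector` is a hypothetical helper built on the standard `html.parser`.

```python
import re
from html.parser import HTMLParser

# Made-up page: the comment holds an unfinished <a> tag.
PAGE = '''<a href="foo">foo</a>
<!-- FIXME: <a href=" -->
<a href="bar">bar</a>'''

# Naive approach: grab whatever sits between href=" and the next quote.
regex_links = re.findall(r'href="([^"]*)"', PAGE)


class LinkCollector(HTMLParser):
    """Collects href values from real <a> elements only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(PAGE)
collector.close()
parser_links = collector.links

print(regex_links)
print(parser_links)
```

The dangling quote in the comment makes the regex swallow the start of the next real link, so `bar` is never reported; the parser skips the comment and returns both links.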

What other gotchas are you aware of?

1 Answer

Tech4ever · Answer 1 · 2019-08-30T10:57:18+0000

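The following XML packs in several constructs that defeat regex-based tag matching: an entity whose value contains "]>", a ">" inside an attribute value, a CDATA section, and a processing instruction: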

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
<a b="&y;>" />
<![CDATA[[a>b <a>b <a]]>
<?x <a>  d
</x>
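A rough Python sketch of what a real XML parser makes of this kind of input. `XML_DOC` is a reduced variant of the document above (the processing-instruction line is left out, which is a simplification made for this sketch), parsed with the standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Reduced variant of the example: entity, ">" in an attribute, CDATA section.
XML_DOC = '''<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
<a b="&y;>" />
<![CDATA[[a>b <a>b <a]]>
</x>'''

# Naive approach: every "<a" followed by a word boundary is an <a> element.
naive_count = len(re.findall(r"<a\b", XML_DOC))

root = ET.fromstring(XML_DOC)
children = list(root)
a_elem = children[0]

print(naive_count)          # "<a" occurrences the regex sees
print(len(children))        # actual child elements of <x>
print(a_elem.attrib["b"])   # attribute value after entity expansion
print(repr(a_elem.tail))    # CDATA content arrives as plain text
```

The regex counts three `<a` starts, two of them inside the CDATA section, while the parser reports a single `<a>` element whose attribute expands to `a]>b>`.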

HTML has its own pitfalls, as in this example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
<!ENTITY % e "href='hello'">
<!ENTITY e "<a %e;>">
]>
<title>x</TITLE>
</head>


&amp 
 < -->
&e link </a>
</body>

The following is also valid HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<HTML/
<HEAD/
<TITLE/>/
