0 votes
1 view
in Java by (3.9k points)
I code a lot of parsers. Up until now, I was using HtmlUnit headless browser for parsing and browser automation.

Now, I want to separate both the tasks.

As 80% of my work involves just parsing, I want to use a light HTML parser because it takes much time in HtmlUnit to first load a page, then get the source and then parse it.

I want to know which HTML parser is the best. The parser would be better if it is close to HtmlUnit parser.

1 Answer

0 votes
by (41.5k points)

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

String html = "<html><head><title>First parse</title></head>"

  + "<body><p>Parsed HTML into a doc.</p></body></html>";

Document doc = Jsoup.parse(html);

Elements links = doc.select("a");

Element head = doc.select("head").first();

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

Related questions

0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
Welcome to Intellipaat Community. Get your technical queries answered by top developers !