in Java by (3.9k points)

Is there a good way to remove HTML from a Java string? A simple regex like

 replaceAll("\\<.*?>","") 

will work, but entities such as &amp; won't be decoded, and any non-HTML text that happens to sit between two angle brackets will be removed as well (i.e. whatever the .*? in the regex matches will disappear).
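To make the two problems concrete, here is a small sketch that uses only the standard library. The class and method names (RegexStripDemo, stripWithRegex) are made up for illustration and are not part of the original question.

```java
// Hypothetical demo class; only java.lang.String is used.
class RegexStripDemo {

    // The naive approach from the question: drop everything from a '<'
    // up to the next '>'.
    static String stripWithRegex(String input) {
        return input.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        // The tags are removed, but the entity is left undecoded.
        System.out.println(stripWithRegex("<p>Fish &amp; chips</p>"));
        // prints: Fish &amp; chips

        // Plain text that merely contains angle brackets gets damaged.
        System.out.println(stripWithRegex("if a < b and c > d then"));
        // prints: if a  d then
    }
}
```

In the second call the regex cannot tell a comparison from a tag, so "< b and c >" is deleted along with it.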

1 Answer

by (46k points)

Use an HTML parser rather than regex. This is dead simple with Jsoup.

import org.jsoup.Jsoup;

// Parses the HTML and returns only its text content.
public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very helpful if you want to allow only e.g. <b>, <i> and <u>.
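A minimal sketch of that whitelist approach, assuming a recent Jsoup release on the classpath in which the class is named Safelist (older releases call it Whitelist, with the same methods as far as I know). The class and method names AllowListDemo and keepBasicFormatting are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; assumes a Jsoup version that ships Safelist.
class AllowListDemo {

    // Keeps only <b>, <i> and <u>; other tags are dropped, their text is kept.
    static String keepBasicFormatting(String html) {
        Safelist allowed = new Safelist().addTags("b", "i", "u");
        return Jsoup.clean(html, allowed);
    }

    public static void main(String[] args) {
        String html = "Hello <b>bold</b> <a href=\"https://example.com\">link</a>";
        // The <b> element survives; the <a> tag is stripped, its text remains.
        System.out.println(keepBasicFormatting(html));
    }
}
```

Starting from an empty Safelist keeps the allowed set explicit. Jsoup also ships prepared lists such as Safelist.basic() if a broader default is acceptable.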
