in AI and Deep Learning by (19k points)

I am crawling news websites and want to extract the news title, the news abstract (first paragraph), etc.

I plugged into the WebKit parser code so that I can navigate each web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (without the HTML tags; WebKit provides an API for this). I then run a diff algorithm comparing the text of various articles from the same website, which eliminates the text they have in common. That leaves me with the content minus the shared navigation text, etc.
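
For illustration, the diff step described above could be sketched roughly as follows. This is a minimal Python sketch using the standard `difflib` module, not the WebKit-based code actually used; the function name `remove_common_lines` and the choice to compare line by line are assumptions made for the example.

```python
import difflib


def remove_common_lines(article: str, other: str) -> str:
    """Return the lines of `article` that do not also appear, in order, in `other`.

    Hypothetical helper: both arguments are the tag-free text of two pages
    from the same website.
    """
    a = article.splitlines()
    b = other.splitlines()
    # autojunk=False disables difflib's "popular element" heuristic, which
    # could otherwise ignore frequently repeated lines on long pages.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    kept = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" mark the lines that exist only in `article`;
        # "equal" marks the shared (navigation-like) lines that are dropped.
        if tag in ("replace", "delete"):
            kept.extend(a[i1:i2])
    return "\n".join(kept)


page_a = "Home\nSports\nTitle A\nBody of article A.\nCopyright"
page_b = "Home\nSports\nTitle B\nBody of article B.\nCopyright"
print(remove_common_lines(page_a, page_b))
```

A pairwise comparison like this only removes text that the two pages share verbatim, so per-article junk (related links, bylines, ad text) survives it.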

Despite this approach, I still get quite a lot of junk in the final text, which results in an incorrect news abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in

Can you help with the following?

  • Can you suggest an alternative strategy for extracting the pure content?
  • Would learning natural language processing help in extracting the correct abstract from these articles?
  • How would you approach the above problem?
  • Are there any research papers on this?

Regards

1 Answer

by (42.2k points)

You can have a look at the boilerpipe project, which you can test on pages of your choice using its live web app on Google AppEngine.

Automatic abstract generation is not a mature field. It is usually referred to as 'sentence selection', because the standard approach right now is simply to select entire sentences.
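
To make 'sentence selection' concrete, here is a minimal sketch in Python. It is only one naive way to do it, not a reference method: the function name `select_sentences`, the regex-based sentence splitting, and the word-frequency score are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter


def select_sentences(text: str, k: int = 1) -> list[str]:
    """Pick the k sentences whose words are, on average, most frequent in the text."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (no stop-word filtering in this sketch).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentences by score, keep the top k, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


sample = "The storm hit the coast. The storm caused floods on the coast. Subscribe now!"
print(select_sentences(sample, k=1))
```

Because whole sentences are copied rather than rewritten, the quality of the abstract depends heavily on how clean the extracted article text is in the first place.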

...