in Machine Learning by (19k points)

I would like to extract a specific type of information from web pages in Python, for example postal addresses. An address can take thousands of forms, yet it is still somehow recognizable. Because there are so many forms, it would probably be very difficult to write a regular expression, or even a grammar for a parser generator, to parse it out.

So I think the way to go is machine learning. If I understand it correctly, I should prepare a sample of data in which I mark the expected result, and then something can learn from that sample how to recognize the result by itself. This is all I know about machine learning. Maybe I could use some natural language processing, but probably not much, since most libraries work mainly with English and I need this for Czech.

1 Answer

by (33.1k points)

Your task falls under information extraction (IE), which is an active area of research.

There are two ways to start working on this task:

  • You can extract the information directly from an HTML page or a website with a fixed template. In that case, the best way is to look at the HTML code of the pages and craft the corresponding XPath or DOM selectors to reach the right information. The disadvantage of this approach is that it does not generalize to new websites, since you have to do it for each website one by one.

  • You can also build a model that extracts the same information from many websites within one domain. In this case, you create features for an ML approach and let the IE algorithm "understand the content of pages". The most common features are the DOM path, the format of the value (attribute) to be extracted, the layout (bold, italic, etc.), and the surrounding context words. You label some values (you need at least 100-300 pages, depending on the domain, to reach reasonable quality) and then train a model on the labeled pages. Alternatively, an algorithm can work without labeling, in which case it tries to find repetitive patterns across pages.
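The first approach (a hand-written selector for one fixed template) could look roughly like the sketch below. The page markup, the class names "contact" and "address", and the function name extract_address are all made up for illustration. The sketch uses only the standard library, whose ElementTree module supports a limited XPath subset and requires well-formed markup; real-world HTML usually needs a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page that follows one fixed template (well-formed markup assumed).
PAGE = """<html><body>
  <div class="header">Firma s.r.o.</div>
  <div class="contact">
    <p class="address">Dlouhá 12, 110 00 Praha 1</p>
    <p class="phone">+420 123 456 789</p>
  </div>
</body></html>"""


def extract_address(markup):
    """Return the address text for this one template, or None if it is absent."""
    root = ET.fromstring(markup)
    # Selector crafted by hand after inspecting the (hypothetical) template.
    node = root.find(".//div[@class='contact']/p[@class='address']")
    if node is None or not node.text:
        return None
    return node.text.strip()


print(extract_address(PAGE))
```

The selector is tied to this single template, which is exactly the limitation described above: a site with different markup needs its own selector.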

You need to work with the DOM tree and generate the right features. Labeling the data correctly also takes deliberate effort. For ML models, have a look at CRF (conditional random fields), 2D CRF, and semi-Markov CRF.
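As a rough illustration of the feature generation step, the sketch below walks an HTML page with the standard library parser and emits one feature dictionary per text node: DOM path, layout (bold), value format, and preceding context words. The feature names, the postal code pattern (the common Czech "110 00" style) and the sample page are assumptions chosen for the example, not a recommended feature set. Training a sequence model such as a CRF on these dictionaries together with your labels would be a separate step that is not shown here.

```python
import re
from html.parser import HTMLParser

# Assumed value-format feature: Czech postal codes are commonly written "110 00".
POSTCODE = re.compile(r"\b\d{3}\s?\d{2}\b")
BOLD_TAGS = {"b", "strong"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


class FeatureExtractor(HTMLParser):
    """Collects one feature dict per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, i.e. the DOM path
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.nodes.append({
            "text": text,
            "dom_path": "/".join(self.stack),
            "is_bold": any(t in BOLD_TAGS for t in self.stack),
            "has_postcode": bool(POSTCODE.search(text)),
            "digit_ratio": sum(c.isdigit() for c in text) / len(text),
        })


def page_features(html):
    """Return feature dicts for all text nodes, with context words added."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    nodes = parser.nodes
    for i, node in enumerate(nodes):
        prev_text = nodes[i - 1]["text"] if i > 0 else ""
        # Surrounding context: last two words of the preceding text node.
        node["prev_words"] = prev_text.lower().split()[-2:]
    return nodes


SAMPLE = ("<html><body><div><b>Adresa:</b>"
          "<p>Dlouhá 12, 110 00 Praha 1</p></div></body></html>")
for features in page_features(SAMPLE):
    print(features)
```

Each dictionary describes one candidate node, so the labeling task reduces to marking which nodes hold the value you want to extract.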

In the general case, this is at the cutting edge of IE research, not a hack that you can finish in a few evenings.

I hope this answer helps.
