in AI and Deep Learning by (50.2k points)

I am new to AI. I am working on an application that performs text classification via machine learning. The application needs to classify the different parts of an HTML document. For example, most webpages have a head, menu, sidebar, footer, main content, etc. I want to use a text classifier to classify these parts of an HTML document and to identify the different types of forms on the page.

  1. It would be very helpful if anyone could provide detailed guidance on this subject.

  2. Examples of a similar application would also be very helpful.

I am looking for more technical suggestions relating to code and implementation.

I can assign labels to HTML tag attributes, such as class or id:

<div class="menu-1"> 

<div id="entry"> 

<div id="content"> 

<div id="footer"> 

<div id="comment-12"> 

<div id="comment-title">

For example, for the first item:

TrainClassifier(label: "Menu", value: "menu-1", attribute: "class", position-in-string: "21%", tag: "div");


  1. "menu-1" (attribute value)

  2. "class" (attribute name)

  3. "21" (tag position in string)

  4. "div" (tag name)


  1. "Menu" (classified as label)

Which neural network library can take the above inputs and classify them into labels (e.g. Menu)?

Not all users can write regex or XPath, so they need an easier approach. That is why it is important to make the software intelligent: the user highlights the part of the HTML document he/she needs in a web browser control and trains the software until it can work on its own.

However, I don't know how to make the software learn using AI.

The AI I am looking for should be able to accept various inputs and classify on the basis of them. As I said, I am new to AI and don't know much about it.

It would be helpful to get an answer to the question I actually asked: which library I should use and how to implement it. Please don't post answers suggesting XPath, regex, or other methods; too often you get every suggestion except the one you need.

1 Answer

by (108k points)

Text classification in Artificial Intelligence is the process of assigning documents/data to predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is a primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, such as producing summaries, answering questions, or extracting data.

The usual steps are:

  • Identify as many attributes/features as you can get (and a set of labels).

  • Collect data as a set of records { Label, Attribute1, Attribute2, Attribute3, ... }

  • Select a minimal set of important attributes using feature selection algorithms (also available in the WEKA toolkit)

  • Train the classifier using a standard algorithm

  • Test your system on held-out data and iterate until you reach the desired accuracy, recall, or other metrics.
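
As a rough illustration of the steps above, here is a minimal Python sketch that uses only the standard library. It is not WEKA and not a neural network: it swaps in a small hand-written naive Bayes classifier over categorical features, simply because that is easy to show in full. The feature names follow the inputs listed in the question (tag name, attribute name, attribute value, position in the string). Everything else is an assumption made for the example: the helper names (`extract_features`, `NaiveBayes`), the trailing-number normalisation, the position buckets, and all labels other than "Menu". The feature selection step is skipped.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class _TagFeatures(HTMLParser):
    """Emit one feature record per class/id attribute found on a start tag."""

    def __init__(self, html):
        super().__init__()
        self.length = max(len(html), 1)
        # Absolute offset at which each line starts, to turn getpos() into an offset.
        self.line_starts = [0]
        for line in html.split("\n"):
            self.line_starts.append(self.line_starts[-1] + len(line) + 1)
        self.records = []

    def handle_starttag(self, tag, attrs):
        lineno, col = self.getpos()
        offset = self.line_starts[lineno - 1] + col
        for name, value in attrs:
            if name in ("class", "id"):
                self.records.append({
                    "tag": tag,
                    "attribute": name,
                    # Assumption: drop a trailing number so "menu-1" and "menu-7" match.
                    "value": re.sub(r"[-_]?\d+$", "", value or ""),
                    # Relative position of the tag, bucketed into tenths of the document.
                    "position": int(10 * offset / self.length),
                })


def extract_features(html):
    """Hypothetical helper: return the list of feature records for an HTML string."""
    parser = _TagFeatures(html)
    parser.feed(html)
    parser.close()
    return parser.records


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.vocab = defaultdict(set)             # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[(label, feature)][value] += 1
                self.vocab[feature].add(value)
        return self

    def predict(self, row):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for feature, value in row.items():
                seen = self.value_counts[(label, feature)][value]
                # The extra +1 in the denominator reserves mass for unseen values.
                score += math.log((seen + 1) / (count + len(self.vocab[feature]) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Training data: tags taken from the question; labels other than "Menu" are made up.
TRAIN_HTML = ('<div class="menu-1"></div><div id="content"></div>'
              '<div id="comment-12"></div><div id="footer"></div>')
TRAIN_LABELS = ["Menu", "Content", "Comment", "Footer"]
model = NaiveBayes().fit(extract_features(TRAIN_HTML), TRAIN_LABELS)

# Held-out examples (also made up) to estimate accuracy.
HELD_OUT = [('<div class="menu-7"></div>', "Menu"),
            ('<div id="comment-99"></div>', "Comment")]
predictions = [model.predict(extract_features(html)[0]) for html, _ in HELD_OUT]
accuracy = sum(p == label for p, (_, label) in zip(predictions, HELD_OUT)) / len(HELD_OUT)
print(f"held-out accuracy: {accuracy:.2f}")
```

With only a handful of training records the position feature carries very little signal, so a real application would need many labelled pages (for example, collected from the user's highlights) before the accuracy figure means anything. The same feature records could be fed to WEKA or to another classifier library instead of this hand-written one.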

For the WEKA content, refer to the following link:

For text classification using algorithms, refer to the following link:
