
I'm working on document classification tasks in Java.

Both the Porter and Lancaster stemming algorithms came highly recommended. What are the benefits and disadvantages of each, and which is more commonly used in the literature for Natural Language Processing tasks?

1 Answer

The main difference between the Porter and Lancaster stemming algorithms is that the Lancaster stemmer is significantly more aggressive than the Porter stemmer.

The three major stemming algorithms in use today are:

  • Porter Stemmer

  • Snowball Stemmer

  • Lancaster Stemmer

Porter is the least aggressive of the three and Lancaster the most. The full specification of each algorithm is fairly lengthy and technical, so here is a short summary of each.

Porter: The most commonly used stemmer today. It is one of the few stemmers with readily available Java support, though it is also the most computationally intensive of the three. It is the oldest of the three as well, dating from 1980.
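To give a feel for how Porter's suffix rules look, here is a minimal sketch of only the first step (step 1a, plural handling) as described in Porter's paper. It assumes lowercase input, and the class and method names are made up for illustration. It is not a full stemmer, so for real work you would use an existing library implementation.

```java
// Sketch of Porter step 1a only (plural suffixes). Hypothetical class name;
// the remaining steps of the algorithm are deliberately omitted.
final class PorterStep1aSketch {

    // Applies the four step-1a rules in order; the first matching rule wins.
    static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                 // ss -> ss (unchanged)
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(w + " -> " + step1a(w));
        }
    }
}
```

Note that "ponies" becomes "poni": stems do not have to be dictionary words, they only have to be consistent across the variants of a word.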

Snowball: An improvement over Porter (its English stemmer is also known as Porter2). It is slightly faster than Porter and has a reasonably large community around it.

Lancaster: A very aggressive stemming algorithm. With Porter and Snowball, the stemmed forms are usually intuitive to a reader. With Lancaster they often are not, as many shorter words become totally confusing. It is the fastest algorithm of the three and will shrink your working set of words hugely, but if you need to keep more distinction between words, it is not the tool you want.
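The trade-off described above (a smaller vocabulary versus less readable stems) can be illustrated with a toy example. The two suffix lists below are invented for this sketch and are NOT the real Porter, Snowball, or Lancaster rules. They only show how stripping more suffixes merges more words and can mangle short ones. All names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy comparison of a gentle and an aggressive suffix stripper.
// The suffix lists are illustrative assumptions, not real stemmer rules.
final class StemAggressivenessSketch {

    static final List<String> GENTLE = Arrays.asList("ing", "ed", "s");
    static final List<String> AGGRESSIVE =
            Arrays.asList("ions", "ion", "ive", "ing", "ed", "s");

    // Never leave a stem shorter than this many characters.
    static final int MIN_STEM = 2;

    static final List<String> WORDS = Arrays.asList(
            "connect", "connected", "connecting",
            "connection", "connections", "connective");

    // Removes the first suffix in the list that matches (single pass).
    static String strip(String word, List<String> suffixes) {
        for (String suffix : suffixes) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= MIN_STEM) {
                return word.substring(0, stemLength);
            }
        }
        return word;
    }

    // Collects the distinct stems produced for a list of words, sorted.
    static Set<String> vocabulary(List<String> words, List<String> suffixes) {
        Set<String> stems = new TreeSet<>();
        for (String word : words) {
            stems.add(strip(word, suffixes));
        }
        return stems;
    }

    public static void main(String[] args) {
        System.out.println(vocabulary(WORDS, GENTLE));     // [connect, connection, connective]
        System.out.println(vocabulary(WORDS, AGGRESSIVE)); // [connect]
        System.out.println(strip("onion", AGGRESSIVE));    // on
    }
}
```

The aggressive list collapses all six words into one stem, which is great for shrinking the feature space, but it also turns "onion" into "on". That kind of over-stemming is what makes aggressive stemmers risky when you need to tell words apart.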

For document classification, I’d suggest Snowball over both Porter and Lancaster: it is a little faster than Porter and far less aggressive than Lancaster.

Hope this answer helps.
