Understanding min_df and max_df in scikit CountVectorizer

Question

1 Answer

Anurag · Answer 1 · 2019-07-03T05:55:22+0000

max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words".

For example:

max_df = 0.50 (a float, read as a proportion) means "ignore terms that appear in more than 50% of the documents".
max_df = 25 (an integer, read as an absolute count) means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus the default setting does not ignore any terms.
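To make the max_df behaviour concrete, here is a small sketch using scikit-learn's CountVectorizer. The four toy sentences are made up for illustration and are not from the original question; the default tokenizer is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only): "the" occurs in all 4 documents,
# "sat" and "on" occur in 2, every other term occurs in 1.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
    "the fish swam in the pond",
]

# Default max_df=1.0: no term is dropped for being too frequent.
default_vocab = set(CountVectorizer().fit(docs).vocabulary_)

# max_df=0.5: drop terms found in more than 50% of the documents.
half_vocab = set(CountVectorizer(max_df=0.5).fit(docs).vocabulary_)

# max_df=2: drop terms found in more than 2 documents (absolute count).
count_vocab = set(CountVectorizer(max_df=2).fit(docs).vocabulary_)

removed = sorted(default_vocab - half_vocab)
print(removed)
```

With four documents, 50% is the same cutoff as 2 documents, so both settings drop only "the". Terms such as "sat", which appear in exactly 2 documents, are kept because the threshold is "more than", not "at least".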

min_df is used for removing terms that appear too infrequently.

For example:

min_df = 0.01 (a float, read as a proportion) means "ignore terms that appear in fewer than 1% of the documents".
min_df = 5 (an integer, read as an absolute count) means "ignore terms that appear in fewer than 5 documents".

The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Since every term in the vocabulary appears in at least one document, the default setting does not ignore any terms.
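The two thresholds can also be sketched in plain Python. This is a simplified illustration of the rule described above, not scikit-learn's actual implementation; the helper names document_frequency and filter_by_df are hypothetical, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): count each term once per document
    return df

def filter_by_df(docs, min_df=1, max_df=1.0):
    """Hypothetical helper: keep terms whose document frequency is in range."""
    n_docs = len(docs)
    # A float is a proportion of the documents; an int is an absolute count.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if low <= count <= high)

# Toy corpus (illustrative only): apple is in 4 documents, banana in 2,
# cherry and date in 1 each.
docs = ["apple banana cherry", "apple banana", "apple date", "apple"]

print(filter_by_df(docs))                           # defaults keep every term
print(filter_by_df(docs, min_df=2))                 # drops cherry and date
print(filter_by_df(docs, min_df=0.5, max_df=0.75))  # keeps only banana
```

In the last call, min_df=0.5 requires a term to be in at least 2 of the 4 documents and max_df=0.75 allows at most 3, so only "banana" survives.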

Hope this answer helps.
