in Machine Learning by (19k points)

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df on the CountVectorizer instance, what exactly does the min/max document frequency mean? Is it the frequency of a word within its particular text file, or is it the frequency of the word across the entire corpus (all 5 text files)?

How is it different when min_df and max_df are provided as integers or as floats?

The documentation doesn't seem to provide a thorough explanation nor does it supply an example to demonstrate the use of min_df and/or max_df. Could someone provide an explanation or example demonstrating min_df or max_df?

1 Answer

by (33.1k points)

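max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". Here, the document frequency of a term is the number of documents in the corpus that contain it, not the number of times it occurs inside a single file.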

For example:

  • max_df = 0.50 means "ignore terms that appear in more than 50% of the documents" (a float is read as a proportion of the documents).

  • max_df = 25 means "ignore terms that appear in more than 25 documents" (an integer is read as an absolute number of documents).

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". No term can appear in more than 100% of the documents, so the default setting does not ignore any terms.
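As a rough illustration of the max_df behaviour described above, here is a small sketch. The five toy documents and the helper kept_terms are made up for this example (they stand in for the five text files) and are not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five toy documents standing in for the five text files (made-up data).
docs = [
    "apple banana cherry",
    "apple banana",
    "apple cherry date",
    "apple banana elder",
    "apple fig",
]


def kept_terms(**kwargs):
    """Hypothetical helper: fit a CountVectorizer and return its sorted vocabulary."""
    vectorizer = CountVectorizer(**kwargs)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


# Document frequencies (number of documents containing each term):
# apple=5, banana=3, cherry=2, date=1, elder=1, fig=1
kept_default = kept_terms()           # default max_df=1.0: nothing is dropped
kept_float = kept_terms(max_df=0.5)   # drop terms in more than 50% of 5 docs (more than 2.5)
kept_int = kept_terms(max_df=3)       # drop terms in more than 3 documents

print(kept_default)  # ['apple', 'banana', 'cherry', 'date', 'elder', 'fig']
print(kept_float)    # ['cherry', 'date', 'elder', 'fig']
print(kept_int)      # ['banana', 'cherry', 'date', 'elder', 'fig']
```

With max_df=0.5 both "apple" (5 documents) and "banana" (3 documents) are dropped, while max_df=3 drops only "apple", since "banana" appears in exactly 3 documents and the cutoff is "more than".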

min_df is used for removing terms that appear too infrequently.

For example:

  • min_df = 0.01 means "ignore terms that appear in fewer than 1% of the documents".

  • min_df = 5 means "ignore terms that appear in fewer than 5 documents".

The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Every term in the vocabulary appears in at least one document, so the default setting does not ignore any terms.
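The min_df side can be sketched the same way. The toy documents and the helper kept_terms below are again invented for illustration. Note that "fig" occurs four times but only in one document, so its document frequency is 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five toy documents standing in for the five text files (made-up data).
# "fig" is repeated inside a single document on purpose.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple cherry date",
    "apple banana elder",
    "apple fig fig fig fig",
]


def kept_terms(**kwargs):
    """Hypothetical helper: fit a CountVectorizer and return its sorted vocabulary."""
    vectorizer = CountVectorizer(**kwargs)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


# Document frequencies: apple=5, banana=3, cherry=2, date=1, elder=1, fig=1
kept_int = kept_terms(min_df=2)      # drop terms in fewer than 2 documents
kept_float = kept_terms(min_df=0.5)  # drop terms in fewer than 50% of 5 docs (fewer than 2.5)

print(kept_int)    # ['apple', 'banana', 'cherry']
print(kept_float)  # ['apple', 'banana']
```

"fig" is removed by min_df=2 even though it occurs four times in total, because the threshold counts documents containing the term, not occurrences within a file.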

Hope this answer helps.
