I want to learn a Naive Bayes model for a problem where the class is boolean (takes on one of two values). Some of the features are boolean, but other features are categorical and can take on a small number of values (~5).

If all my features were boolean then I would want to use sklearn.naive_bayes.BernoulliNB. It seems clear that sklearn.naive_bayes.MultinomialNB is not what I want.

One solution is to split up my categorical features into boolean features. For instance, if a variable "X" takes on values "red", "green", "blue", I can have three variables: "X is red", "X is green", "X is blue". That violates the assumption of conditional independence of the variables given the class, so it seems totally inappropriate.

Another possibility is to encode the variable as a real-valued variable where 0.0 means red, 1.0 means green, and 2.0 means blue. Feeding that encoding to GaussianNB also seems totally inappropriate, since it imposes an ordering and a distance on categories that have neither.

What I'm trying to do doesn't seem weird, but I don't understand how to fit it into the Naive Bayes models that sklearn gives me. It's easy to code up myself, but I would prefer to use sklearn if possible, mostly to avoid bugs.

1 Answer

Consider the general case, where your dataset mixes several kinds of features:

  1. Categorical
  2. Bernoulli
  3. Normal

Naive Bayes assumes that these features are conditionally independent given the class, so each group can be modelled separately. Consequently, you can do the following:

  1. Build a separate NB classifier for each categorical feature, using its dummy variables and a multinomial NB.
  2. Build a single NB classifier for all of the Bernoulli features at once. This works because sklearn's BernoulliNB is simply a shortcut for several single-feature Bernoulli NBs.
  3. As in step 2, build a single Gaussian NB classifier for all of the normal features.

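The class probability for an instance is then proportional to the product of the class probabilities given by these classifiers. Because each classifier already includes the class prior, divide that product by the prior once for every classifier beyond the first, and then renormalize so the class probabilities sum to one.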
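The steps above can be sketched roughly as follows for the question's case of categorical plus boolean features. This is a minimal illustration rather than a definitive implementation: the helper names `fit_mixed_nb` and `predict_proba_mixed` and the toy data are made up for this example, and it assumes every classifier is fitted on the same labels with the default learned priors, so the extra copies of the prior can be subtracted in log space.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.preprocessing import OneHotEncoder


def fit_mixed_nb(X_cat, X_bool, y):
    """Fit one MultinomialNB per categorical column (on its dummy
    variables) and one BernoulliNB for all boolean columns.
    Hypothetical helper, not part of sklearn."""
    cat_models = []
    for j in range(X_cat.shape[1]):
        enc = OneHotEncoder(handle_unknown="ignore")
        dummies = enc.fit_transform(X_cat[:, [j]])
        cat_models.append((enc, MultinomialNB().fit(dummies, y)))
    bern = BernoulliNB().fit(X_bool, y)
    return cat_models, bern


def predict_proba_mixed(cat_models, bern, X_cat, X_bool):
    """Combine the classifiers by summing their log posteriors."""
    log_post = bern.predict_log_proba(X_bool)
    for j, (enc, nb) in enumerate(cat_models):
        log_post += nb.predict_log_proba(enc.transform(X_cat[:, [j]]))
    # Every classifier counted the class prior once; keep a single copy.
    log_post -= len(cat_models) * bern.class_log_prior_
    # Renormalize so each row sums to one.
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    return np.exp(log_post)


# Toy data (invented for illustration): one colour feature, two boolean features.
X_cat = np.array([["red"], ["red"], ["red"], ["blue"], ["green"], ["blue"]])
X_bool = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

cat_models, bern = fit_mixed_nb(X_cat, X_bool, y)
proba = predict_proba_mixed(cat_models, bern, X_cat, X_bool)
print(proba.round(3))
```

Working in log space avoids underflow when many classifiers are multiplied together. Note also that scikit-learn 0.22 and later include `sklearn.naive_bayes.CategoricalNB`, which accepts integer-encoded categorical features directly and may make the per-feature multinomial step unnecessary.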
