
I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier but my problem is that I have a mix of categorical data (ex: "Registered online", "Accepts email notifications" etc) and continuous data (ex: "Age", "Length of membership" etc). I haven't used scikit much before but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

by (33.1k points)

You can't feed a mix of categorical and continuous features directly into a single naive Bayes model, because each variant assumes one kind of distribution for all of its features. One option is to transform all your data into a categorical representation: compute percentiles for each continuous variable, then bin the continuous variables using those percentiles as bin boundaries.

For example, to bin the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set.

The naive Bayes classifiers won't do this binning for you; it is part of data preprocessing, which you carry out yourself before fitting the model. The Pandas library is well suited to this kind of manipulation: for instance, pandas.qcut splits a column into quantile-based bins.
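As a rough sketch of the percentile binning described above, something like the following could work. The column name `height_cm` and the toy data are made up for illustration; the only library calls used are `pandas.qcut` and `pandas.cut`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training data; "height_cm" is an illustrative column name.
rng = np.random.default_rng(0)
df = pd.DataFrame({"height_cm": rng.normal(170, 10, size=1000)})

labels = ["very small", "small", "regular", "big", "very big"]

# q=5 puts the bin edges at the 20th/40th/60th/80th percentiles,
# so each bin holds roughly 20% of the training rows.
binned, edges = pd.qcut(df["height_cm"], q=5, labels=labels, retbins=True)
df["height_bin"] = binned

# Reuse the training-set edges for new data; open up the outer edges
# so values outside the training range still fall into a bin.
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
new_heights = pd.Series([150.0, 170.0, 195.0])
new_bins = pd.cut(new_heights, bins=open_edges, labels=labels)
```

The bin edges are computed on the training set only and then reused, so that training and test rows are discretized the same way.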

Alternatively, you can independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the predict_proba method) as new features:

np.hstack((multinomial_probas, gaussian_probas))

and then refit a new model (e.g. a new Gaussian NB) on the new features.
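Put together, the two-model approach might look like the sketch below. The data is synthetic and the feature meanings (age, membership length, two yes/no flags) are assumptions for illustration; the flags are encoded as 0/1 so that MultinomialNB can treat them as counts.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Synthetic stand-in data (assumption): y is the class to predict.
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Continuous part, e.g. age and length of membership.
X_cont = np.column_stack([
    rng.normal(35 + 5 * y, 8),
    rng.normal(3 + y, 1.5),
])

# Categorical part, e.g. two yes/no flags encoded as 0/1.
X_cat = np.column_stack([
    rng.binomial(1, 0.3 + 0.4 * y),
    rng.binomial(1, 0.6 - 0.3 * y),
])

# Fit one model per feature type, independently.
gnb = GaussianNB().fit(X_cont, y)
mnb = MultinomialNB().fit(X_cat, y)

# Class assignment probabilities become the new features.
gaussian_probas = gnb.predict_proba(X_cont)
multinomial_probas = mnb.predict_proba(X_cat)
X_meta = np.hstack((multinomial_probas, gaussian_probas))

# Refit a new model on the stacked probabilities.
final = GaussianNB().fit(X_meta, y)
pred = final.predict(X_meta)
```

To predict on new rows, run them through the same two fitted models and stack the probabilities in the same column order before calling `final.predict`.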