

I am trying to model the score that a post receives based on both the text of the post and other features (time of day, length of post, etc.).

I am wondering how best to combine these different types of features into one model. Right now, I have something like the following (stolen from here and here).

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer

def features(p):
    # p is a (message, feature_1, feature_2) tuple
    terms = vectorizer(p[0])
    d = {'feature_1': p[1], 'feature_2': p[2]}
    for t in terms:
        d[t] = d.get(t, 0) + 1
    return d

posts = pd.read_csv('path/to/csv')

# Create tokenizer for the features function to use
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2)).build_tokenizer()

y = posts["score"].values.astype(np.float32)

vect = DictVectorizer()

# This is the part I want to fix
temp = zip(list(posts.message), list(posts.feature_1), list(posts.feature_2))
tokenized = map(lambda x: features(x), temp)
X = vect.fit_transform(tokenized)

It seems very silly to extract all of the features I want out of the pandas dataframe, just to zip them all back together. Is there a better way of doing this step?

The CSV looks something like the following:

ID,message,feature_1,feature_2
1,'This is the text',4,7
2,'This is more text',3,2

1 Answer


You can simply pass the columns to map with a lambda, instead of zipping them first:

tokenized = map(lambda msg, ft1, ft2: features([msg, ft1, ft2]),
                posts.message, posts.feature_1, posts.feature_2)
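
If you would rather avoid zip and map entirely, the same list of feature dicts can be built row-wise with DataFrame.apply. This is an alternative sketch, not part of the original answer; it reuses the features function and the DictVectorizer vect from the question:

# Alternative sketch: build one feature dict per row with apply
tokenized = posts.apply(
    lambda row: features([row['message'], row['feature_1'], row['feature_2']]),
    axis=1)
X = vect.fit_transform(tokenized)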


 

Alternatively, you can convert the messages into a CountVectorizer sparse matrix and stack it horizontally with the numeric feature columns from the posts dataframe.

For example:

import pandas as pd
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv('post.csv')

# Vectorizer for the message text (binary counts of unigrams and bigrams)
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))

y = posts["score"].values.astype(np.float32)

# Stack the sparse text matrix and the numeric feature columns side by side
X = sp.hstack((vectorizer.fit_transform(posts.message),
               posts[['feature_1', 'feature_2']].values), format='csr')

# Column names: n-grams first, then the numeric feature columns
# (newer scikit-learn renames get_feature_names to get_feature_names_out)
X_columns = vectorizer.get_feature_names() + posts[['feature_1', 'feature_2']].columns.tolist()


 

posts
Out[38]: 
   ID              message  feature_1  feature_2  score
0   1   'This is the text'          4          7     10
1   2  'This is more text'          3          2      9
2   3   'More random text'          3          2      9

X_columns
Out[39]: 
[u'is',
 u'is more',
 u'is the',
 u'more',
 u'more random',
 u'more text',
 u'random',
 u'random text',
 u'text',
 u'the',
 u'the text',
 u'this',
 u'this is',
 'feature_1',
 'feature_2']

X.toarray()
Out[40]: 
array([[1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 4, 7],
       [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 3, 2],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 3, 2]])
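
Once X and y are built this way, any scikit-learn estimator that accepts sparse input can be fit on them directly. A minimal sketch (the Ridge regressor here is an illustrative assumption, not part of the original answer):

from sklearn.linear_model import Ridge

# Illustrative only: any sparse-friendly regressor would work here
model = Ridge()
model.fit(X, y)                 # X is the hstacked CSR matrix, y the post scores
predictions = model.predict(X)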

The sklearn-pandas package also has a DataFrameMapper that does what you're looking for:

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (['feature_1', 'feature_2'], None),
    ('message', CountVectorizer(binary=True, ngram_range=(1, 2)))
])
X = mapper.fit_transform(posts)

X
Out[71]: 
array([[4, 7, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
       [3, 2, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1],
       [3, 2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]])

X_columns = mapper.features[0][0] + mapper.features[1][1].get_feature_names()

X_columns
Out[76]: 
['feature_1',
 'feature_2',
 u'is',
 u'is more',
 u'is the',
 u'more',
 u'more random',
 u'more text',
 u'random',
 u'random text',
 u'text',
 u'the',
 u'the text',
 u'this',
 u'this is']
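
Because DataFrameMapper behaves like a regular scikit-learn transformer, it can also be dropped into a Pipeline so that feature extraction and the score model are fit in one call. A minimal sketch (again, the Ridge regressor is an illustrative assumption):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

# Illustrative: chain the mapper with any regressor
pipeline = Pipeline([
    ('featurize', mapper),
    ('model', Ridge()),
])
pipeline.fit(posts, posts['score'])
predictions = pipeline.predict(posts)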


Hope this answer helps.
