

I am trying to model the score that a post receives based on both the text of the post and other features (time of day, length of post, etc.).

I am wondering how best to combine these different types of features into one model. Right now, I have something like the following (stolen from here and here).

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer

def features(p):
    # p is a (message, feature_1, feature_2) tuple
    terms = vectorizer(p[0])
    d = {'feature_1': p[1], 'feature_2': p[2]}
    for t in terms:
        d[t] = d.get(t, 0) + 1
    return d

posts = pd.read_csv('path/to/csv')

# Create tokenizer for the features function to use
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2)).build_tokenizer()

y = posts["score"].values.astype(np.float32)

vect = DictVectorizer()

# This is the part I want to fix
temp = zip(list(posts.message), list(posts.feature_1), list(posts.feature_2))
tokenized = map(lambda x: features(x), temp)
X = vect.fit_transform(tokenized)

It seems very silly to extract all of the features I want out of the pandas dataframe, just to zip them all back together. Is there a better way of doing this step?

The CSV looks something like the following:

ID,message,feature_1,feature_2
1,'This is the text',4,7
2,'This is more text',3,2

1 Answer


You can simply pass the columns to map with a lambda, instead of zipping them first:

tokenized = map(lambda msg, ft1, ft2: features([msg, ft1, ft2]),
                posts.message, posts.feature_1, posts.feature_2)
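
If you would rather avoid zip and map entirely, the same list of feature dicts can be built row-wise with DataFrame.apply. This is an alternative sketch, not part of the original answer; it reuses the features function and the DictVectorizer vect from the question:

# Alternative sketch: build one feature dict per row with apply
tokenized = posts.apply(
    lambda row: features([row['message'], row['feature_1'], row['feature_2']]),
    axis=1)
X = vect.fit_transform(tokenized)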


 

Alternatively, you can convert the messages into a CountVectorizer sparse matrix and stack it horizontally with the numeric feature columns from the posts dataframe.

For example:

import pandas as pd
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv('post.csv')

# Vectorizer for the message text (binary counts of unigrams and bigrams)
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))

y = posts["score"].values.astype(np.float32)

# Stack the sparse text matrix and the numeric feature columns side by side
X = sp.hstack((vectorizer.fit_transform(posts.message),
               posts[['feature_1', 'feature_2']].values), format='csr')

# Column names: n-grams first, then the numeric feature columns
# (newer scikit-learn renames get_feature_names to get_feature_names_out)
X_columns = vectorizer.get_feature_names() + posts[['feature_1', 'feature_2']].columns.tolist()


 

posts
Out[38]: 
   ID              message  feature_1  feature_2  score
0   1   'This is the text'          4          7     10
1   2  'This is more text'          3          2      9
2   3   'More random text'          3          2      9

X_columns
Out[39]: 
[u'is',
 u'is more',
 u'is the',
 u'more',
 u'more random',
 u'more text',
 u'random',
 u'random text',
 u'text',
 u'the',
 u'the text',
 u'this',
 u'this is',
 'feature_1',
 'feature_2']

X.toarray()
Out[40]: 
array([[1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 4, 7],
       [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 3, 2],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 3, 2]])
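
Once X and y are built this way, any scikit-learn estimator that accepts sparse input can be fit on them directly. A minimal sketch (the Ridge regressor here is an illustrative assumption, not part of the original answer):

from sklearn.linear_model import Ridge

# Illustrative only: any sparse-friendly regressor would work here
model = Ridge()
model.fit(X, y)                 # X is the hstacked CSR matrix, y the post scores
predictions = model.predict(X)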

The sklearn-pandas package also has a DataFrameMapper that does what you're looking for:

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (['feature_1', 'feature_2'], None),
    ('message', CountVectorizer(binary=True, ngram_range=(1, 2)))
])
X = mapper.fit_transform(posts)

X
Out[71]: 
array([[4, 7, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
       [3, 2, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1],
       [3, 2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]])

X_columns = mapper.features[0][0] + mapper.features[1][1].get_feature_names()

X_columns
Out[76]: 
['feature_1',
 'feature_2',
 u'is',
 u'is more',
 u'is the',
 u'more',
 u'more random',
 u'more text',
 u'random',
 u'random text',
 u'text',
 u'the',
 u'the text',
 u'this',
 u'this is']
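
Because DataFrameMapper behaves like a regular scikit-learn transformer, it can also be dropped into a Pipeline so that feature extraction and the score model are fit in one call. A minimal sketch (again, the Ridge regressor is an illustrative assumption):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

# Illustrative: chain the mapper with any regressor
pipeline = Pipeline([
    ('featurize', mapper),
    ('model', Ridge()),
])
pipeline.fit(posts, posts['score'])
predictions = pipeline.predict(posts)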


Hope this answer helps.
