
I am trying to model the score that a post receives, based on both the text of the post and other features (time of day, length of post, etc.).

I am wondering how to best combine these different types of features into one model. Right now, I have something like the following (stolen from here and here).

import pandas as pd

...

def features(p):

    terms = vectorizer(p[0])

    d = {'feature_1': p[1], 'feature_2': p[2]}

    for t in terms:

        d[t] = d.get(t, 0) + 1

    return d

posts = pd.read_csv('path/to/csv')

# Create vectorizer for function to use

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2)).build_tokenizer()

y = posts["score"].values.astype(np.float32) 

vect = DictVectorizer()

# This is the part I want to fix

temp = zip(list(posts.message), list(posts.feature_1), list(posts.feature_2))

tokenized = map(lambda x: features(x), temp)

X = vect.fit_transform(tokenized)

It seems very silly to extract all of the features I want out of the pandas dataframe, just to zip them all back together. Is there a better way of doing this step?

The CSV looks something like the following:

ID,message,feature_1,feature_2

1,'This is the text',4,7

2,'This is more text',3,2

1 Answer


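To avoid the explicit zip, you can pass the columns directly to map, which accepts several iterables and feeds one item from each to the lambda:

tokenized = map(lambda msg, ft1, ft2: features([msg, ft1, ft2]), posts.message, posts.feature_1, posts.feature_2)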
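A self-contained sketch of the multi-iterable map call, for illustration only. It assumes a two-row DataFrame built inline in place of the CSV, and a default CountVectorizer tokenizer stored under the hypothetical name tokenize (so no n-grams and no lowercasing):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: inline stand-in for the CSV from the question
posts = pd.DataFrame({
    "message": ["This is the text", "This is more text"],
    "feature_1": [4, 3],
    "feature_2": [7, 2],
})

# Plain word tokenizer; it splits text into tokens of two or more characters
tokenize = CountVectorizer().build_tokenizer()

def features(p):
    """Build one feature dict from (message, feature_1, feature_2)."""
    d = {"feature_1": p[1], "feature_2": p[2]}
    for t in tokenize(p[0]):
        d[t] = d.get(t, 0) + 1
    return d

# map walks the three columns in parallel, so no zip is needed
tokenized = list(map(
    lambda msg, ft1, ft2: features([msg, ft1, ft2]),
    posts.message, posts.feature_1, posts.feature_2,
))

vect = DictVectorizer()
X = vect.fit_transform(tokenized)
```

Wrapping map in list is optional here; it just makes the intermediate dicts easy to inspect under Python 3, where map is lazy.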


 

A cleaner approach is to let CountVectorizer turn the messages into a sparse matrix, then horizontally stack that matrix with the numeric feature columns from the posts DataFrame.

For example:

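import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv('post.csv')

# Vectorize the message text as binary unigram and bigram indicators
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))

y = posts["score"].values.astype(np.float32)

# Stack the sparse text matrix and the numeric columns side by side
X = sparse.hstack((vectorizer.fit_transform(posts.message), posts[['feature_1', 'feature_2']].values), format='csr')

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_columns = list(vectorizer.get_feature_names_out()) + ['feature_1', 'feature_2']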


 

posts

Out[38]: 

   ID              message  feature_1  feature_2  score
0   1   'This is the text'          4          7     10
1   2  'This is more text'          3          2      9
2   3   'More random text'          3          2      9

X_columns

Out[39]: 

[u'is',

 u'is more',

 u'is the',

 u'more',

 u'more random',

 u'more text',

 u'random',

 u'random text',

 u'text',

 u'the',

 u'the text',

 u'this',

 u'this is',

 'feature_1',

 'feature_2']

X.toarray()

Out[40]: 

array([[1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 4, 7],

       [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 3, 2],

       [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 3, 2]])
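Because the stacked X is a SciPy sparse matrix, it can be passed directly to estimators that accept sparse input. The sketch below rebuilds the three example rows inline and fits a Ridge regressor; the choice of Ridge and its alpha value are assumptions for illustration, since the question does not name a model:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Assumption: inline copy of the three example rows shown above
posts = pd.DataFrame({
    "message": ["This is the text", "This is more text", "More random text"],
    "feature_1": [4, 3, 3],
    "feature_2": [7, 2, 2],
    "score": [10, 9, 9],
})

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
text_matrix = vectorizer.fit_transform(posts["message"])
numeric = posts[["feature_1", "feature_2"]].values

# Text columns first, numeric columns last, as in the answer
X = sparse.hstack((text_matrix, numeric), format="csr")
y = posts["score"].values.astype(np.float32)

# Assumption: Ridge is only a placeholder regressor
model = Ridge(alpha=1.0)
model.fit(X, y)
predictions = model.predict(X)
```

Three rows are far too few to learn anything meaningful; the point is only that the combined matrix plugs into fit and predict without being densified.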

The sklearn-pandas package also provides DataFrameMapper, which does the same thing:

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (['feature_1', 'feature_2'], None),
    ('message', CountVectorizer(binary=True, ngram_range=(1, 2)))
])

X = mapper.fit_transform(posts)

X

Out[71]: 

array([[4, 7, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],

       [3, 2, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1],

       [3, 2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]])

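# On scikit-learn 1.2+, use get_feature_names_out() in place of the removed get_feature_names()
X_columns = mapper.features[0][0] + list(mapper.features[1][1].get_feature_names_out())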

X_columns

Out[76]: 

['feature_1',

 'feature_2',

 u'is',

 u'is more',

 u'is the',

 u'more',

 u'more random',

 u'more text',

 u'random',

 u'random text',

 u'text',

 u'the',

 u'the text',

 u'this',

 u'this is']


Hope this answer helps.
