
I will be putting the max bounty on this as I am struggling to learn these concepts! I am trying to use some ranking data in logistic regression. I want to use machine learning to make a simple classifier as to whether a webpage is "good" or not. It's just a learning exercise so I don't expect great results; just hoping to learn the "process" and coding techniques.

I have put my data in a .csv as follows :

In my Test CSV we have :

A label is a binary classification indicating "good" with 1 or "bad" with 0.

I currently have my LR running using only the website text; which I run a TF-IDF on.

I have two questions which I need help with. I'll be putting a max bounty on this question and awarding it to the best answer as this is something I'd like some good help with so I, and others may learn.

How can I normalize my ranking data for AlexaRank? I have a set of 10,000 web pages, and I have the Alexa rank for all of them; however, they aren't ranked 1-10,000. They are ranked out of the entire Internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in scikit-learn to get the best possible results from my data?

I am running my Logistic Regression in this way; I am nearly sure I have done this incorrectly. I am trying to do the TF-IDF on the website text, then add the two other relevant columns and fit the Logistic Regression. I'd appreciate it if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any feedback on how I can improve myself would also be appreciated here.

loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')

tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode', analyzer='word',

token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1,sublinear_tf=1)

rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True,    intercept_scaling=1.0, class_weight=None, random_state=None)

X_all = traindata + testdata

lentrain = len(traindata)

print "fitting pipeline"

tfv.fit(X_all)

print "transforming data"

X_all = tfv.transform(X_all)

X = X_all[:lentrain]

X_test = X_all[lentrain:]

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:,3])#Not sure if I am doing this correctly. Expecting it to contain AlexaRank and GooglePageRank columns.

X = np.append(X, AllAlexaAndGoogleInfo, 1) #Think I have done this incorrectly.

print "training on full data"

rd.fit(X,y)

pred = rd.predict_proba(X_test)[:,1]

testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)

pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])

pred_df.to_csv('benchmark.csv')

print "submission file created.."

Thank you very much for all the feedback - please post if you need any further information!

by (108k points)

sklearn.preprocessing.StandardScaler would be the first thing you want to try. StandardScaler modifies all of your features into Mean-0-Std-1 features.

This addresses your first problem. After scaling, AlexaRank is centered on 0 with unit variance, so its values no longer span many orders of magnitude. The results will of course not be integers between 1 and 10,000, but they keep the same order as the original ranks. Keeping the rank on a normalized scale also helps with your second problem, as follows.
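As a rough illustration of what the scaler does to a rank column, here is a minimal sketch. The rank values are made up for the example and are not taken from the asker's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Alexa ranks spanning several orders of magnitude
# (one column, one row per web page).
ranks = np.array([[1.0], [250.0], [40000.0], [9000000.0], [83904803.0]])

# Learn the column mean and standard deviation, then rescale.
scaler = StandardScaler().fit(ranks)
scaled = scaler.transform(ranks)

# The scaled column has mean 0 and standard deviation 1,
# and the pages keep their original ordering.
print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

The scaled values are no longer integers, but sorting by them gives the same order as sorting by the raw ranks.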

To know why normalization would help in LR, let's visit the logit formulation of LR.
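The equation itself appears to be missing here. Given the five variables named just below, the logit form presumably looks like the following, where the symbols $\alpha$ (intercept) and $\beta_i$ (coefficients) are notation assumed for this sketch:

```latex
\operatorname{logit}(y) = \log\frac{P(y=1)}{1 - P(y=1)}
  = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5
```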

In your case, X1, X2, X3 are three TF-IDF features and X4, X5 are the Alexa/Google rank-related features.

Now, the linear form of the equation means that each coefficient represents the change in the logit of y for a one-unit change in the corresponding variable. Imagine what happens when X4 takes a massive rank value, say 83904803289480. In that case, the Alexa Rank variable dominates your LR fit, and a small change in a TF-IDF value has almost no effect on it. One might think that the coefficients could simply adjust to small or large values to account for the differences between these features. Not in this case: it is not only the magnitude of the variables that matters but also their range. Alexa Rank has a very large range and would dominate your LR fit. Therefore, I expect that normalizing all variables with StandardScaler to adjust their range will improve the fit.

Here is how you can scale the X matrix.

sc = preprocessing.StandardScaler().fit(X)

X = sc.transform(X)

Don't forget to use the same scaler to transform X_test.

X_test = sc.transform(X_test)

Now you can use the fitting procedure etc.

rd.fit(X, y)

rd.predict_proba(X_test)
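Put together, the scale, fit, and predict steps might look like the sketch below. It uses a small dense feature matrix with invented values. A sparse TF-IDF matrix would need different handling, because StandardScaler with its default centering rejects sparse input.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# Hypothetical dense features: [a small-valued text score, Alexa rank].
X = np.array([[0.9, 120.0],
              [0.1, 9000000.0],
              [0.8, 4500.0],
              [0.2, 75000000.0]])
y = np.array([1, 0, 1, 0])
X_test = np.array([[0.7, 300.0],
                   [0.3, 50000000.0]])

# Fit the scaler on the training rows only, then reuse it for the test rows.
sc = preprocessing.StandardScaler().fit(X)
X_scaled = sc.transform(X)
X_test_scaled = sc.transform(X_test)

# Fit the classifier on the scaled features and predict P(label == 1).
rd = LogisticRegression(C=1.0)
rd.fit(X_scaled, y)
pred = rd.predict_proba(X_test_scaled)[:, 1]
print(pred)
```

Fitting the scaler on the training rows and only transforming the test rows keeps both sets on the same scale without letting test statistics leak into training.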

Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html

The parsing and column-merging part can be done easily with pandas; there is no need to convert the matrices into lists and then append them. Moreover, pandas data frames can be indexed directly by their column names.
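A minimal sketch of that pandas approach follows. The in-memory table stands in for train.tsv, and the column names (`AlexaRank`, `GooglePageRank`, and the rest) are assumptions based on the question, not the real file header.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.tsv; the column names are assumed.
tsv = io.StringIO(
    "url\tboilerplate\tAlexaRank\tGooglePageRank\tlabel\n"
    "http://a.example\tsome page text\t120\t7\t1\n"
    "http://b.example\tother page text\t9000000\t2\t0\n"
)

# header=0 takes the column names from the first row of the file.
df = pd.read_table(tsv, header=0)

# Select both rank columns by name and get a plain 2-D NumPy array.
rank_features = df[["AlexaRank", "GooglePageRank"]].values
print(rank_features)
```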

Notice that we are passing header=0 argument to read_table to maintain original header names from the tsv file. And also note how we can index using the whole set of columns. Finally, you can stack this new matrix with X with the help of numpy.hstack.

hstack horizontally combines two multi-dimensional array-like structures, provided they have the same number of rows.
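For illustration, the stacking step could look like this sketch, with invented values. The sparse variant using `scipy.sparse.hstack` is an addition for the case where the TF-IDF output is a sparse matrix, which `numpy.hstack` does not handle.

```python
import numpy as np
from scipy import sparse

# Hypothetical TF-IDF features (2 pages x 3 terms) and scaled rank columns.
X_text = np.array([[0.1, 0.0, 0.3],
                   [0.0, 0.5, 0.2]])
ranks = np.array([[-0.7, 1.0],
                  [0.7, -1.0]])

# Dense case: both arrays have 2 rows, so the result has shape (2, 5).
X_combined = np.hstack((X_text, ranks))

# Sparse case: stack a sparse TF-IDF matrix with the rank columns.
X_text_sparse = sparse.csr_matrix(X_text)
X_combined_sparse = sparse.hstack(
    [X_text_sparse, sparse.csr_matrix(ranks)]
).tocsr()

print(X_combined.shape, X_combined_sparse.shape)
```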