I will be putting the max bounty on this as I am struggling to learn these concepts! I am trying to use some ranking data in logistic regression. I want to use machine learning to make a simple classifier as to whether a webpage is "good" or not. It's just a learning exercise, so I don't expect great results; I'm just hoping to learn the "process" and coding techniques.

I have put my data in a .csv with the following columns:

URL WebsiteText AlexaRank GooglePageRank

In my test CSV I have:

URL WebsiteText AlexaRank GooglePageRank Label

The Label is a binary classification indicating "good" with 1 or "bad" with 0.
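For concreteness, here are a couple of rows in that layout (the values are made up, just to illustrate the shape of the data):

```python
import pandas as pd

# Two invented rows illustrating the layout of my test CSV.
df = pd.DataFrame(
    [["http://www.google.com", "search engine homepage", 1, 9, 1],
     ["http://www.notasite.com", "spammy text here", 83904803289480, 0, 0]],
    columns=["URL", "WebsiteText", "AlexaRank", "GooglePageRank", "Label"],
)
```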

I currently have my logistic regression running using only the website text, on which I run TF-IDF.

I have two questions which I need help with. I'll be awarding the bounty to the best answer, as this is something I'd like good help with so that I, and others, may learn.

How can I normalize my AlexaRank data? I have a set of 10,000 web pages and the Alexa rank of each of them; however, they aren't ranked 1-10,000. They are ranked against the entire Internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in scikit-learn to get the best possible results from my data?

I am running my logistic regression in the way shown below; I am nearly sure I have done this incorrectly. I am trying to do TF-IDF on the website text, then add the two other relevant columns and fit the logistic regression. I'd appreciate it if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any feedback on how I can improve would also be appreciated.

import numpy as np
import pandas as p
from sklearn import cross_validation
from sklearn import linear_model as lm
from sklearn.feature_extraction.text import TfidfVectorizer

loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')

print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])  # Reading WebsiteText column for TF-IDF.
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
y = np.array(p.read_table('train.tsv'))[:, -1]  # Reading Label column.

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                      use_idf=1, smooth_idf=1, sublinear_tf=1)

rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1,
                           fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None)

X_all = traindata + testdata
lentrain = len(traindata)

print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

# Add the two integer columns.
AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:, 3])  # Not sure if I am doing this correctly. Expecting it to contain the AlexaRank and GooglePageRank columns.
AlexaAndGoogleTestData = list(np.array(p.read_table('test.tsv'))[2:, 3])
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData + AlexaAndGoogleTrainData

# Add the two columns to X.
X = np.append(X, AllAlexaAndGoogleInfo, 1)  # Think I have done this incorrectly.

print "training on full data"
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]

testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."

Thank you very much for all the feedback - please post if you need any further information!