in AI and Deep Learning by (43.2k points)

I use a linear SVM (LSVM) to predict the sentiment of tweets. The LSVM classifies the tweets as negative or positive. I use a Pipeline to (in order) clean, vectorize and classify the tweets. But when predicting the sentiment I only get a 0 (for neg) or a 4 (for pos). I want to get prediction scores between -1 and 1, as decimals, to get a better scale/understanding of 'how' positive or negative the tweets are.

The code:

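import re
import string

import pandas as pd
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# tok is not defined in the post; WordPunctTokenizer is assumed from the comment in tweet_cleaner
tok = WordPunctTokenizer()

# read in influential Twitter users on the stock market
twitter_users = pd.read_csv('core/infl_users.csv', encoding="ISO-8859-1")
twitter_users.columns = ['users']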

# MODEL TRAINING
# read the training set for the model: csv to dataframe
df = pd.read_csv("../trainingset.csv", encoding='latin-1')

# label the training set dataframe columns
df.columns = ["target", "id", "data", "query", "user", "text"]

# remove unnecessary columns (the keyword form replaces the positional axis argument, which newer pandas rejects)
df = df.drop(columns=["id", "data", "query", "user"])

pat1 = r'@[A-Za-z0-9_]+'        # remove @ mentions from tweets
pat2 = r'https?://[^ ]+'        # remove http(s) URLs from tweets
combined_pat = r'|'.join((pat1, pat2))  # matches either pat1 or pat2
www_pat = r'www\.[^ ]+'         # remove www. URLs from tweets (dot escaped so it matches a literal ".")

# expand contractions such as isn't to is not
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner(text):  # clean a single tweet
    soup = BeautifulSoup(text, 'lxml')    # parse the tweet as HTML
    souped = soup.get_text()              # keep only the text, dropping any markup
    # Python 3 strings have no .decode(); strip a leading BOM and replace U+FFFD directly
    bom_removed = souped.lstrip("\ufeff").replace("\ufffd", "?")
    stripped = re.sub(combined_pat, '', bom_removed)  # remove @ mentions and http(s) URLs
    stripped = re.sub(www_pat, '', stripped)          # remove www. URLs
    lower_case = stripped.lower()                     # convert everything to lower case
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)  # isn't -> is not
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)  # replace every non-letter (e.g. #) with a space
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]  # tokenize and keep only words longer than 1 character
    return (" ".join(words)).strip()  # join the words back into one string

# Build a list of stopwords to use as a filter
stopwords = list(STOP_WORDS)

# Use the punctuation characters of the string module
punctuations = string.punctuation

# Create a spaCy parser
parser = English()

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    return text.strip().lower()

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    #mytokens = [word.lemma_.lower().strip() for word in mytokens]
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    #mytokens = preprocess2(mytokens)
    return mytokens

# Vectorization
# Convert a collection of text documents to a matrix of token counts.
# n-grams: extension of the unigram model that takes n words together.
# Big advantage: it preserves context -> words that appear together in the text also appear together in an n-gram.
# n-grams can increase the accuracy in classifying pos & neg (ngram_range=(1, 1) below uses unigrams only).
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))

# Linear Support Vector Classification.
# Similar to SVC with kernel='linear', but with more flexibility in the choice of penalties and loss functions;
# it should scale better to large numbers of samples.
# LinearSVC takes two arrays as input: X of shape [n_samples, n_features] holding the training samples,
# and y of shape [n_samples] holding the class labels (strings or integers).
classifier = LinearSVC(C=0.5)

# Using tf-idf
tfvectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)

# put the tweet text in X and the target in ylabels to train the model
X = df['text']
ylabels = df['target']

# The next step is to split the data into training and test datasets.
# 80% of the dataset is used for the model. That 80% is split again 80-20: 80% to train the model, 20% to test the results.
# The remaining 20% (X_kast, y_kast) is set aside for the final model.
X_tr, X_kast, y_tr, y_kast = train_test_split(X, ylabels, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)

# Create the pipeline to clean, tokenize, vectorize, and classify.
# Tying together the different pieces of the ML process is known as a pipeline.
# Each stage of a pipeline is fed the data processed by its preceding stage.
# Pipelines only transform the observed data (X).
# A pipeline can be used to chain multiple estimators into one.
# The pipeline object is built from (key, value) pairs:
# the key is a string naming a particular step, the value is the transformer or estimator for that step.
# Fitting runs all the transforms one after the other on the data, then fits the final estimator on the transformed data.
pipe_tfid = Pipeline([("cleaner", predictors()),
                      ('vectorizer', tfvectorizer),
                      ('classifier', classifier)])

# Fit our data; fit = training the model
pipe_tfid.fit(X_train, y_train)

# Predicting with a test dataset
#sample_prediction1 = pipe_tfid.predict(X_test)
accur = pipe_tfid.score(X_test, y_test)

When predicting a sentiment score I do:

pipe_tfid.predict(['textoftweet'])  # predict expects a list of texts; a bare string would be iterated character by character

1 Answer

by (92.8k points)

During training, an SVM computes the weights w (and a bias) so that the margin separating the classes is as large as possible. For a binary classifier, predictions are then made with the rule:

Choose A1 if w^T x + bias > 0, else choose A2

Here A1 and A2 are the two class labels the classifier can predict.
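
As a rough illustration of that rule, here is a minimal sketch using scikit-learn's LinearSVC on made-up 2-D data (the arrays X and y are hypothetical and only stand in for vectorized tweets; the labels 0 and 4 follow the question). It shows that decision_function returns the value of w^T x + bias and that the predicted label follows from its sign:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy, linearly separable data (hypothetical; stands in for vectorized tweets).
# Labels follow the question's convention: 0 = negative, 4 = positive.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y = np.array([0, 0, 4, 4])

clf = LinearSVC(C=0.5).fit(X, y)

# decision_function returns w^T x + bias for each sample
scores = clf.decision_function(X)

# The same quantity computed by hand from the learned weights and bias
manual = X @ clf.coef_.ravel() + clf.intercept_[0]

# A positive score selects classes_[1] (here 4), otherwise classes_[0] (here 0)
labels_from_sign = np.where(scores > 0, clf.classes_[1], clf.classes_[0])
```

These scores are signed distances to the separating hyperplane (up to scaling by the norm of w), so they are not bounded to the range -1 to 1 and are not probabilities. With the pipeline from the question, the equivalent call should be pipe_tfid.decision_function(['textoftweet']), since a Pipeline forwards decision_function to its final estimator.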

An SVM does not return probabilities because it is not a probabilistic model. There are some probabilistic interpretations of SVM output, but if you want to know the confidence of a prediction you can use a standard probabilistic model instead (such as Naive Bayes or LogisticRegression).
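
For example, a sketch along these lines swaps the classifier for LogisticRegression and rescales the predicted probability of the positive class to the range -1 to 1. The tiny corpus, the helper name sentiment_score, and the plain TfidfVectorizer (without the custom cleaner and tokenizer from the question) are all assumptions made to keep the example short:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only; 0 = negative, 4 = positive.
texts = ["great stock, love it", "awful day, terrible loss",
         "love this rally", "terrible awful crash"]
labels = [4, 0, 4, 0]

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
pipe.fit(texts, labels)

def sentiment_score(model, tweets):
    """Hypothetical helper: map P(positive) in [0, 1] to a score in [-1, 1]."""
    proba = model.predict_proba(tweets)       # one column per class
    pos_col = list(model.classes_).index(4)   # column of the positive class
    return 2.0 * proba[:, pos_col] - 1.0

scores = sentiment_score(pipe, ["love it", "terrible"])
```

A score near 1 means the model is confident the tweet is positive, near -1 confident it is negative, and near 0 undecided. If you would rather keep LinearSVC, scikit-learn's CalibratedClassifierCV can wrap it to provide predict_proba, which could then be rescaled the same way.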
