Predict sentiment score with LinearSVM as an integer/double value

Question

asked Jul 2, 2019 in AI and Deep Learning by ashely (50.2k points)

I use a linear SVM to predict the sentiment of tweets. The LSVM classifies the tweets as negative or positive. I use a Pipeline to (in order) clean, vectorize and classify the tweets. But when predicting the sentiment I'm only able to get a 0 (for neg) or a 4 (for pos). I want to get prediction scores between -1 and 1 as decimal values, to get a better scale/understanding of 'how' positive or negative the tweets are.

The code:

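import re
import string

import pandas as pd
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# assumption: `tok` is never defined in the post; the "Word Punct Tokenize" comment below suggests this tokenizer
tok = WordPunctTokenizer()

# read in influential twitter users on stock market
twitter_users = pd.read_csv('core/infl_users.csv', encoding="ISO-8859-1")
twitter_users.columns = ['users']

# MODEL TRAINING
# read training set for model: csv to dataframe
df = pd.read_csv("../trainingset.csv", encoding='latin-1')
# label the training set dataframe columns
df.columns = ["target", "id", "data", "query", "user", "text"]
# remove unnecessary columns
df = df.drop(columns=["id", "data", "query", "user"])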
pat1 = r'@[A-Za-z0-9_]+'  # remove @ mentions from tweets
pat2 = r'https?://[^ ]+'  # remove URLs from tweets
combined_pat = r'|'.join((pat1, pat2))  # match either pat1 or pat2
www_pat = r'www\.[^ ]+'  # remove URLs that start with "www."
# expand contractions such as "isn't" to "is not"
negations_dic = {"isn't": "is not", "aren't": "are not", "wasn't": "was not", "weren't": "were not",
                 "haven't": "have not", "hasn't": "has not", "hadn't": "had not", "won't": "will not",
                 "wouldn't": "would not", "don't": "do not", "doesn't": "does not", "didn't": "did not",
                 "can't": "can not", "couldn't": "could not", "shouldn't": "should not", "mightn't": "might not",
                 "mustn't": "must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner(text):  # clean a single tweet
    soup = BeautifulSoup(text, 'lxml')  # parse the tweet as HTML
    souped = soup.get_text()  # keep only the text of the tweet
    try:
        # only byte strings have .decode(); a Python 3 str falls through to the except branch
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")  # remove the utf-8-sig BOM
    except (AttributeError, UnicodeError):
        bom_removed = souped
    stripped = re.sub(combined_pat, '', bom_removed)  # remove @ mentions and http(s) URLs
    stripped = re.sub(www_pat, '', stripped)  # remove www URLs
    lower_case = stripped.lower()  # convert everything to lower case
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)  # expand contractions like isn't to is not
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)  # replace every non-letter character with a space
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]  # Word Punct Tokenize and only keep words longer than 1 character
    return (" ".join(words)).strip()  # join the words
# Build a list of stopwords to use to filter
stopwords = list(STOP_WORDS)
# Use the punctuation characters from the string module
punctuations = string.punctuation
# Create a spaCy parser
parser = English()

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    return text.strip().lower()

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    # mytokens = [word.lemma_.lower().strip() for word in mytokens]
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    # mytokens = preprocess2(mytokens)
    return mytokens
# Vectorization
# Convert a collection of text documents to a matrix of token counts
# n-grams: extension of the unigram model that takes n words together
# big advantage: it preserves context -> words that appear together in the text also appear together in an n-gram
# n-grams can increase the accuracy when classifying pos & neg
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
# Linear Support Vector Classification.
# "Similar" to SVC with parameter kernel='linear', but with
# more flexibility in the choice of penalties and loss functions; it should scale better to large numbers of samples.
# LinearSVC takes as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers), size [n_samples]:
classifier = LinearSVC(C=0.5)
# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
# put tweet text in X and target in ylabels to train the model
X = df['text']
ylabels = df['target']
# The next step is to load the data and split it into training and test datasets. In this example,
# we use 80% of the dataset to train the model. This 80% is then split again 80-20: 80% to train the model, 20% to test the results.
# The remaining 20% is kept to train the final model
X_tr, X_kast, y_tr, y_kast = train_test_split(X, ylabels, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)
# Create the pipeline to clean, tokenize, vectorize, and classify
# Tying together different pieces of the ML process is known as a pipeline.
# Each stage of a pipeline is fed data processed by its preceding stage
# Pipelines only transform the observed data (X).
# A pipeline can be used to chain multiple estimators into one.
# The pipeline object is built from (key, value) pairs.
# The key is a string that names a particular step
# The value is the transformer or estimator object for that step.
# Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
pipe_tfid = Pipeline([("cleaner", predictors()),
                      ('vectorizer', tfvectorizer),
                      ('classifier', classifier)])
# Fit our data, fit = training the model
pipe_tfid.fit(X_train, y_train)
# Predicting with a test dataset
# sample_prediction1 = pipe_tfid.predict(X_test)
accur = pipe_tfid.score(X_test, y_test)

When predicting a sentiment score I do:

pipe_tfid.predict(['textoftweet'])  # pass a list of tweets; a bare string would be iterated character by character

1 Answer

vinita · Answer 1 · 2019-07-02T08:39:00+0000

During training, an SVM computes the weights w that maximize the margin separating the classes. For a binary classifier, predictions are then made with the rule:

Choose A1 if w^T x + bias > 0, else choose A2

Here A1 and A2 are the two possible predictions.
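
This rule can be checked directly in scikit-learn: for a binary `LinearSVC`, `decision_function` returns the signed value w^T x + bias for each sample, and `predict` only reports which side of zero it falls on. A minimal sketch on made-up toy data (the tweets, labels, and variable names are illustrative assumptions, not the asker's dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration; 0 = negative, 4 = positive as in the question.
texts = ["great stock love it", "awful loss hate it",
         "love this great rally", "hate this awful crash"]
labels = [4, 0, 4, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LinearSVC(C=0.5).fit(X, labels)

# Signed score w^T x + bias, one unbounded float per sample.
scores = clf.decision_function(X)

# The same quantity computed by hand from the learned weights and intercept.
manual = X.toarray() @ clf.coef_.ravel() + clf.intercept_[0]

# predict() returns classes_[1] when the score is > 0, otherwise classes_[0].
by_sign = np.where(scores > 0, clf.classes_[1], clf.classes_[0])
```

The magnitude of each score grows with the sample's distance from the separating hyperplane, so it gives a graded signal, but it is not bounded to [-1, 1] and is not a probability.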

An SVM cannot return probabilities because it is not a probabilistic model. There are some probabilistic interpretations of SVMs, but if you want to know the confidence of a prediction you can use a standard probabilistic model instead (such as Naive Bayes or Logistic Regression).
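
As one way to act on this suggestion, here is a minimal sketch that uses `LogisticRegression` in place of `LinearSVC` and maps the positive-class probability p to a score 2p - 1 in [-1, 1]. The toy tweets, the `sentiment_score` helper, and the 2p - 1 mapping are assumptions made for illustration; they are not part of the original question or answer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration; 0 = negative, 4 = positive.
texts = ["great stock love it", "awful loss hate it",
         "love this great rally", "hate this awful crash"]
labels = [4, 0, 4, 0]

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
pipe.fit(texts, labels)

def sentiment_score(model, tweets, positive_label=4):
    """Hypothetical helper: map P(positive) to a score in [-1, 1]."""
    # The column order of predict_proba follows model.classes_.
    pos_col = list(model.classes_).index(positive_label)
    p_pos = model.predict_proba(tweets)[:, pos_col]
    return 2.0 * p_pos - 1.0

# Pass a list of documents, not a bare string.
scores = sentiment_score(pipe, ["love this stock", "hate this crash"])
```

Scores close to 0 then mean the model is unsure either way, while scores near -1 or 1 mean a confident negative or positive prediction.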
