in Machine Learning by (19k points)
I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out of the ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from the USA on Chrome. So if she suddenly logs into my web service from Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.

I am using CouchDB already, and I often see people online saying that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However, I am at a complete loss for where to start. I have not found much documentation on relatively simple tracking of outlying events for a web service, let alone on storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.

Any insight would be much appreciated. Thanks!

1 Answer

by (33.1k points)

Let's say the variation in log-in details for a given user is low, so any large deviation from the usual pattern should trigger your alert.

For example, encode each log-in detail (city, state, country, user-agent, and so on) as a single number, and then combine those numbers into a log-in detail vector for each user.

Every time a user logs in, create this detail array and store it. Once you have accumulated a large enough set of log-in data, you can try running some ML routines on it.
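As a rough illustration of the encoding step, here is a minimal sketch that turns each categorical log-in field into an integer code, one dimension per field. The `LoginEncoder` class, the field names, and the sample log-ins are all hypothetical and only meant to show the shape of the data; they are not from any library.

```python
class LoginEncoder:
    """Maps each categorical log-in field to an integer code, one column per field."""

    def __init__(self, fields):
        self.fields = list(fields)
        # One {value: code} table per field; codes are assigned in order of first appearance.
        self.codes = {field: {} for field in self.fields}

    def encode(self, login):
        """Return the log-in detail vector for one log-in (a dict of field -> value)."""
        vector = []
        for field in self.fields:
            table = self.codes[field]
            value = login[field]
            if value not in table:
                table[value] = len(table)  # first time this value is seen
            vector.append(table[value])
        return vector


# Hypothetical field names and log-in history, for illustration only.
encoder = LoginEncoder(["country", "state", "city", "browser"])
history = [
    {"country": "US", "state": "NY", "city": "New York", "browser": "Chrome"},
    {"country": "US", "state": "NY", "city": "New York", "browser": "Chrome"},
    {"country": "UA", "state": "Kyiv", "city": "Kyiv", "browser": "Firefox"},
]
train_data = [encoder.encode(login) for login in history]
```

Keep in mind that integer codes impose an arbitrary ordering on the categories (country 1 is not "closer" to country 0 than country 5 is), so a one-hot encoding may work better with distance-based models. The code tables would also need to be stored alongside the vectors so that later log-ins are encoded the same way.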

So, we have a user and a set of log-in data corresponding to successful log-ins. We can now train a one-class Support Vector Machine to recognize this user's log-in pattern:

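from sklearn import svm

# Training data: one row per past log-in, e.g. [[11.0, 2, 2], [11.3, 2, 2] ... etc]
# (my_training_data is a placeholder for however you load the stored vectors)
train_data = my_training_data()

# Create and fit the model on this user's normal log-ins
clf = svm.OneClassSVM()
clf.fit(train_data)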

Then, every time a new log-in event occurs, create a single log-in detail array and pass it to the SVM:

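# log_in_data is a single 1-D detail array; predict expects a 2-D array
# (one row per sample) and returns -1 for outliers, +1 for inliers
if clf.predict([log_in_data])[0] == -1:
    fire_alert_event()
else:
    print('log in ok')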

If the SVM finds the new data point sufficiently different from its training set, it will fire the alarm.

A good training set is what matters most. There are many more ML techniques that may be better suited to your task, but creating your training sets and training the routines will be the most significant part of the work.

If you also have examples of bad log-in attempts, you can make use of them by switching to a two-class SVM that you train on both good and bad log-ins. Also, instead of using an array of disparate 'location' values, you could find the Euclidean distance between the different log-ins and use that!
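To make the distance idea concrete, here is a small sketch that collapses location into one feature: how far a new log-in is from the nearest place the user has logged in before. It assumes each city has already been geocoded to a (latitude, longitude) pair, which the question's data does not include, and it swaps plain Euclidean distance for the haversine (great-circle) formula, since raw latitude/longitude differences distort distances on a sphere. The function names and coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_from_history(new_login, past_logins):
    """Distance in km from a new log-in to the nearest previously seen location."""
    return min(haversine_km(new_login, past) for past in past_logins)


# Hypothetical coordinates: past log-ins near New York, new log-in from Kyiv.
past = [(40.7128, -74.0060), (40.7357, -74.1724)]
new = (50.4501, 30.5234)
feature = distance_from_history(new, past)  # several thousand km
```

The resulting number can replace the separate city, state, and country codes in the detail vector, so that a log-in from the next town over scores very differently from one on another continent.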


Hope this answer helps you!
