0 votes
2 views
in Machine Learning by (19k points)

I'm working on a TREC task that involves machine learning techniques, where the dataset consists of more than 5 terabytes of web documents from which I plan to extract bag-of-words vectors. scikit-learn has a nice set of functionalities that seems to fit my needs, but I don't know whether it will scale well to handle big data. For example, can HashingVectorizer handle 5 terabytes of documents, and is it feasible to parallelize it? Also, what are some alternatives for large-scale machine learning tasks?

1 Answer

0 votes
by (33.1k points)

You can use HashingVectorizer; it will work if you iteratively feed your data in batches of, for instance, 10k or 100k documents that fit in memory.

You should pass each batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier), and then simply iterate over new batches, as in the sketch below.
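
Here is a minimal sketch of that loop. The iter_document_batches() generator is a hypothetical helper (not part of scikit-learn) that yields (texts, labels) batches of roughly 10k documents read from disk:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so there is no vocabulary to fit
vectorizer = HashingVectorizer(n_features=2**20)
clf = SGDClassifier()            # any estimator with partial_fit works
all_classes = [0, 1]             # partial_fit needs the full label set up front

for texts, labels in iter_document_batches(batch_size=10_000):  # hypothetical helper
    X = vectorizer.transform(texts)   # sparse bag-of-words for this batch only
    clf.partial_fit(X, labels, classes=all_classes)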

Evaluate the model on a held-out validation set (e.g. a slice of documents set aside before training), so you can check the accuracy of the partially trained model as you go, without waiting for all samples to be processed.
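
For example, a quick check like the following can be run every few batches; val_texts and val_labels are assumed to be the held-out slice you set aside before training:

# hypothetical held-out slice: val_texts, val_labels
X_val = vectorizer.transform(val_texts)
print("validation accuracy:", clf.score(X_val, val_labels))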

If you run this procedure in parallel on several partitions of the data, you can then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
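
As a rough sketch, assuming models is a list of SGDClassifier instances, each partially fitted on a different partition of the corpus (for example, one per machine):

import numpy as np

# take the first fitted model as a container and overwrite its weights
final = models[0]
final.coef_ = np.mean([m.coef_ for m in models], axis=0)
final.intercept_ = np.mean([m.intercept_ for m in models], axis=0)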

Study Scikit-learn and Datasets for Machine Learning to gain more insight into the aforementioned topics.

I hope this answer helps you!
