in Machine Learning by (19k points)

I need to run various machine learning techniques on a big dataset (10-100 billion records). The problems are mostly around text mining/information extraction and include various kernel techniques, but are not restricted to them (we use some Bayesian methods, bootstrapping, gradient boosting, regression trees -- many different problems and ways to solve them).

What would be the best implementation? I'm experienced in ML but don't have much experience doing it on huge datasets. Are there any extensible, customizable machine learning libraries that utilize MapReduce infrastructure? I have a strong preference for C++, but Java and Python are OK. Should we use Amazon, Azure, or our own datacenter (we can afford it)?

1 Answer

by (33.2k points)

If the state space of the classification dataset is extremely large, there is likely redundancy in the data, because samples are recorded and extracted from many different sources. You can therefore train a machine learning model on a random sample of the full dataset rather than on all of it, and use cross-validation on that sample to check that the model generalizes.
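A minimal sketch of that idea, using only the Python standard library: reservoir sampling draws a uniform random sample from a stream too large to hold in memory, and a simple k-fold splitter supports cross-validation on the sample. The function names and sizes here are illustrative, not from any specific library.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k records sampled uniformly from a stream of unknown length
    (Algorithm R), so the full dataset never has to fit in memory."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample

def kfold_indices(n, folds, seed=0):
    """Yield (train_idx, test_idx) pairs for cross-validation on the sample."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    fold_size = n // folds
    for f in range(folds):
        test = idx[f * fold_size:(f + 1) * fold_size]
        train = idx[:f * fold_size] + idx[(f + 1) * fold_size:]
        yield train, test

# Draw a 1,000-record training sample from a simulated 1M-record stream.
sample = reservoir_sample(range(1_000_000), 1_000)
for train, test in kfold_indices(len(sample), 5):
    pass  # fit and evaluate the model on sample[train] / sample[test]
```

In practice you would replace `range(1_000_000)` with an iterator over your record store, and fit whichever model (kernel method, boosted trees, etc.) inside the fold loop.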

You can also use Hadoop or other MapReduce-based frameworks to handle really large data.
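To show what the MapReduce model looks like for a text-mining task, here is a local, single-process simulation of the map, shuffle, and reduce phases (a token count, the canonical example). This is only a sketch of the programming model; on a real cluster the shuffle and distribution would be handled by Hadoop (e.g. via Hadoop Streaming), not by this code.

```python
from collections import defaultdict
from itertools import chain

def mapper(record):
    """Map phase: emit (key, value) pairs -- here, one (token, 1) per word."""
    for token in record.lower().split():
        yield token, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: aggregate all values emitted for one key."""
    return key, sum(values)

def run_mapreduce(records):
    mapped = chain.from_iterable(mapper(r) for r in records)
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())

counts = run_mapreduce(["the cat sat", "the dog ran", "The cat ran"])
# counts == {'the': 3, 'cat': 2, 'sat': 1, 'dog': 1, 'ran': 2}
```

The same mapper/reducer pair can be run unchanged on a cluster via Hadoop Streaming, which is why the model scales to billions of records: each phase only ever sees one record or one key group at a time.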

Hope this answer helps.
