I need to run various machine learning techniques on a big dataset (10-100 billion records) The problems are mostly around text mining/information extraction and include various kernel techniques but are not restricted to them (we use some Bayesian methods, bootstrapping, gradient boosting, regression trees -- many different problems and ways to solve them)
What would be the best implementation? I'm experienced in ML but do not have much experience how to do it for huge datasets Is there any extendable and customizable Machine Learning libraries utilizing MapReduce infrastructure Strong preference to c++, but Java and python are ok Amazon Azure or own datacenter (we can afford it)?