What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?


Best answer

Let's suppose our training data set is represented by T and each record in it has M features:

T = {(X1,y1), (X2,y2), ... (Xn, yn)}

where Xi is the input vector {xi1, xi2, ... xiM} and yi is its actual label.

The **Random Forests** algorithm is a classifier based primarily on two methods:

Bagging

Random subspace method.

Suppose we use S trees in our random forest. We first create S datasets, each of the same size as the original, by randomly resampling the data in T with replacement (n draws for each dataset). This results in the datasets {T1, T2, ... TS}. Each of them is called a **bootstrap dataset**.
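The resampling step above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `make_bootstrap_datasets`, the `seed` argument, and the toy data are my own assumptions, not part of any library.

```python
import random

def make_bootstrap_datasets(T, S, seed=0):
    """Return S bootstrap datasets, each drawn from T with replacement."""
    rng = random.Random(seed)
    n = len(T)
    # rng.choices samples with replacement; k=n keeps the original size.
    return [rng.choices(T, k=n) for _ in range(S)]

# Toy training set of (X_i, y_i) pairs (made-up values).
T = [((0.1, 1.0), 0), ((0.4, 0.8), 0), ((0.6, 0.3), 1), ((0.9, 0.2), 1)]
bootstraps = make_bootstrap_datasets(T, S=3)
for k, Tk in enumerate(bootstraps, start=1):
    print(f"T{k}:", Tk)
```

Each printed dataset has the same length as T, but some records typically appear more than once and others not at all.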

Because the sampling is done **with replacement**, every dataset Ti can contain duplicate records and can be missing several records from the original dataset. This is called **bootstrapping**.
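To get a feel for how many records go missing, here is the standard back-of-the-envelope calculation, assuming the n draws are independent and uniform over T (which is what sampling with replacement means):

```latex
P\big((X_i, y_i) \notin T_k\big)
  = \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow{\,n \to \infty\,}\; e^{-1} \approx 0.368
```

So for a reasonably large n, each bootstrap dataset leaves out roughly 37% of the original records and contains roughly 63% of them at least once.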

**Bagging** is the process of drawing bootstrap datasets and then aggregating the models learned on each of them.

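Random Forest builds S trees and, instead of using all M features, considers only m randomly selected features (typically m = sqrt(M) or m = floor(log2(M) + 1)) when growing a tree; in Breiman's formulation a fresh random subset is drawn at every split. This is called the **random subspace method**.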
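As a rough sketch of the feature-selection step, the snippet below computes m and draws one random subset of feature indices, as would be done once per tree (or once per split in Breiman's variant). The helper name `subset_size` is hypothetical, rounding sqrt(M) down is my assumption, and the logarithmic rule is taken in base 2, which is also an assumption about the intended base.

```python
import math
import random

def subset_size(M, rule="sqrt"):
    """Number of features m to consider out of M (hypothetical helper)."""
    if rule == "sqrt":
        return max(1, math.isqrt(M))         # floor(sqrt(M)), at least 1
    return math.floor(math.log2(M) + 1)      # logarithmic rule, base 2 assumed

M = 16
m = subset_size(M)                           # 4 for M = 16
rng = random.Random(0)
# Sample m distinct feature indices without replacement.
features = rng.sample(range(M), m)
print(m, sorted(features))
```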

So for each bootstrap dataset Ti, you create a tree Ki. To classify some input data D = {x1, x2, ..., xM}, you pass it through each tree and get S outputs, which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote on this set.
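The voting step is simple enough to show directly. In this small sketch the list `Y` stands in for the S tree outputs and the labels are made up; ties go to whichever label was seen first, which is just a property of `Counter.most_common` and not something the algorithm prescribes.

```python
from collections import Counter

def majority_vote(Y):
    """Return the most frequent label among the tree outputs Y."""
    return Counter(Y).most_common(1)[0][0]

# Outputs of S = 5 hypothetical trees for one input D.
Y = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(Y))  # cat
```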

**Out-of-bag error:**

After building the classifiers (S trees), for each (Xi, yi) in the original training set T, select all bootstrap datasets Tk that do not include (Xi, yi). Note that this subset is a set of bootstrap datasets that do not contain a particular record from the original dataset; for those datasets, (Xi, yi) is an out-of-bag example. There are n such subsets (one for each record in the original dataset T). The OOB classifier for (Xi, yi) is the aggregation of votes ONLY over the trees Kk whose bootstrap dataset Tk does not contain (Xi, yi).

The out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (that is, its OOB predictions are compared with the known yi's).
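Here is how the whole procedure might look from scratch. To keep it short, a one-feature threshold "stump" (`fit_stump`) stands in for a real decision tree; that stand-in, the toy 1-D data, and the function names are all my own assumptions for illustration. The part that matters is the loop in `oob_error`, which only lets a tree vote on a record that was not in its bootstrap dataset.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Toy stand-in for a tree: threshold halfway between the class means.
    Assumes 1-D inputs, labels 0/1, and that class 1 has the larger values."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    if not xs0 or not xs1:                   # bootstrap holds a single class
        only = 0 if xs0 else 1
        return lambda x: only
    thr = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > thr)

def oob_error(T, S, seed=0):
    rng = random.Random(seed)
    n = len(T)
    bags, trees = [], []
    for _ in range(S):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap indices
        bags.append(set(idx))
        trees.append(fit_stump([T[i] for i in idx]))
    wrong = counted = 0
    for i, (x, y) in enumerate(T):
        # Only trees whose bootstrap dataset does NOT contain record i vote.
        votes = [tree(x) for tree, bag in zip(trees, bags) if i not in bag]
        if not votes:                        # record was in every bag; skip it
            continue
        counted += 1
        pred = Counter(votes).most_common(1)[0][0]
        wrong += pred != y
    return wrong / counted if counted else float("nan")

T = [(0.1, 0), (0.2, 0), (0.3, 0), (0.35, 0),
     (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
err = oob_error(T, S=25)
print(f"OOB error: {err:.3f}")
```

Records that happen to be in every bootstrap dataset get no OOB prediction, so this sketch leaves them out of the error rate; with more trees that case becomes very unlikely.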

Breiman's study of error estimates for bagged classifiers gives empirical evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
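If you use scikit-learn, you can check this yourself by comparing the OOB error with the error on a held-out set of the same size. This is only a sketch on synthetic data: the dataset, the 50/50 split, and the choice of 200 trees are arbitrary assumptions, and the exact numbers will depend on them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, split so the test set is as large as the training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# oob_score=True asks the forest to compute the OOB accuracy while fitting.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_error = 1.0 - forest.oob_score_
test_error = 1.0 - forest.score(X_test, y_test)
print(f"OOB error:  {oob_error:.3f}")
print(f"Test error: {test_error:.3f}")
```

The two numbers should usually land close to each other, although on a single small dataset they will not match exactly.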

Hope this answer helps.