
TL;DR:

Is there something I can flag in the original randomForest call to avoid having to re-run the predict function to get predicted categorical probabilities, instead of just the likely category?

Details:

I am using the randomForest package.

I have a model something like:

model <- randomForest(x = out.data[train.rows, feature.cols],
                      y = out.data[train.rows, response.col],
                      xtest = out.data[test.rows, feature.cols],
                      ytest = out.data[test.rows, response.col],
                      importance = TRUE)

where out.data is a data frame, feature.cols is a mix of numeric and categorical features, and response.col is a TRUE/FALSE binary variable that I coerced into a factor so that randomForest treats it as categorical.

Everything runs well, and model is returned to me properly. However, I cannot find a flag or parameter to pass to randomForest so that the model comes back with the probabilities of TRUE or FALSE; instead, I get only the predicted classes. That is, if I look at model$predicted, I see something like:

FALSE
FALSE
TRUE
TRUE
FALSE
.
.
.

Instead, I want to see something like:

   FALSE  TRUE
1  0.84   0.16
2  0.66   0.34
3  0.11   0.89
4  0.17   0.83
5  0.92   0.08
.    .      .

I can get the above, but in order to do so, I need to do something like:

tmp <- predict(model, out.data[test.rows, feature.cols], type = "prob")

[test.rows holds the row numbers of the rows used during model testing. The details are not shown here, but they are simple, since the test-row IDs are stored in model.]

Then everything works fine. The problem is that the model is big and takes a very long time to build, and even the prediction itself takes a while. Since the prediction should be entirely unnecessary (I simply want to compute the ROC curve on the test data set, whose predictions should already have been calculated during the fit), I was hoping to skip this step. Is there something I can flag in the original randomForest call to avoid having to re-run predict()?
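For reference, the downstream ROC step itself is cheap; here is a minimal base-R sketch of what I want to compute, with toy probabilities standing in for the real model output:

```r
# Base-R sketch of the downstream step: ROC points and AUC from a vector of
# predicted TRUE-class probabilities (e.g. the "TRUE" column of the
# probability matrix) plus the true labels. Toy values stand in for the
# real model output.
prob.true <- c(0.16, 0.34, 0.89, 0.83, 0.08)
labels    <- c(FALSE, FALSE, TRUE, TRUE, FALSE)

# Sort by descending probability, then accumulate rates as the threshold drops.
ord <- order(prob.true, decreasing = TRUE)
tpr <- cumsum(labels[ord]) / sum(labels)    # true positive rate
fpr <- cumsum(!labels[ord]) / sum(!labels)  # false positive rate

# Trapezoidal AUC along the (0, 0) -> (1, 1) path.
auc <- sum(diff(c(0, fpr)) * (tpr + c(0, head(tpr, -1))) / 2)
auc  # 1 for these toy values: the two TRUE rows get the highest probabilities
```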

1 Answer


You should understand that model$predicted is not the same as the output returned by predict(). When the probability of the TRUE or FALSE class is needed, you must either run predict() or read the vote fractions already stored in the fitted model.

For example:

model <- randomForest(x, y, xtest = x, ytest = y)

where x = out.data[, feature.cols] and y = out.data[, response.col].

model$predicted returns, for each record, the class that received the larger value in model$votes.

predict() with type = "prob" returns the estimated probability for each class, i.e. the fraction of trees that voted for that class.
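A toy illustration of that relationship (plain base R, no model needed): the predicted class for each record is simply the column of the votes matrix with the larger fraction.

```r
# Toy votes matrix: one row per record, one column per class.
votes <- rbind(c(0.84, 0.16),
               c(0.11, 0.89),
               c(0.66, 0.34))
colnames(votes) <- c("FALSE", "TRUE")

# The predicted class is the column holding each row's maximum vote share.
predicted <- colnames(votes)[max.col(votes)]
predicted  # "FALSE" "TRUE" "FALSE"
```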

This holds whether you call randomForest(x, y, xtest = x, ytest = y), pass a formula, or simply use randomForest(x, y): in each case predict() can return the probability for each class.

If randomForest(x, y, xtest = x, ytest = y) is used and you want to call predict() afterwards, the keep.forest flag must be set to TRUE (it defaults to FALSE when a test set is supplied):

model <- randomForest(x, y, xtest = x, ytest = y, keep.forest = TRUE)
prob <- predict(model, x, type = "prob")

Here prob is equivalent to model$test$votes, since the test input in both cases is the same x. In other words, when you supply xtest/ytest in the original call, the test-set class probabilities are already stored in model$test$votes, and you can compute your ROC curve from them without re-running predict().
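Putting it together, here is a short sketch (it requires the randomForest package; the built-in iris data stands in for out.data) showing that the probabilities stored by the original call should match a later predict() pass:

```r
library(randomForest)

x <- iris[, 1:4]
y <- factor(iris$Species == "setosa")  # binary TRUE/FALSE response as a factor

set.seed(1)
model <- randomForest(x, y, xtest = x, ytest = y, keep.forest = TRUE)

# Test-set class probabilities were already computed during the fit:
head(model$test$votes)

# A second predict() pass gives the same numbers, since the test input is x:
prob <- predict(model, x, type = "prob")
all.equal(as.matrix(model$test$votes), prob, check.attributes = FALSE)
```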

For more details, see the randomForest package documentation.

Hope this answer helps.
