
I am trying to find out how the C4.5 algorithm determines the threshold value for numeric attributes. I have searched but still cannot understand it. Most sources give the following explanation:

The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v_1, v_2, …, v_m}. Any threshold value lying between v_i and v_(i+1) will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v_1, v_2, …, v_i} and those whose value is in {v_(i+1), v_(i+2), …, v_m}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.

It is usual to choose the midpoint of each interval, (v_i + v_(i+1))/2, as the representative threshold. C4.5 instead chooses the smaller value v_i of each interval {v_i, v_(i+1)} as the threshold, rather than the midpoint itself.
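To make the quoted description concrete, here is a minimal sketch of how the m-1 candidate thresholds could be enumerated under both conventions (midpoint, and the smaller value of each interval). The function name candidate_thresholds is hypothetical, and the sample values are the sunny-state humidity readings mentioned later in this question.

```python
def candidate_thresholds(values):
    """Return (midpoints, lower_values) for each pair of adjacent distinct values."""
    distinct = sorted(set(values))             # v_1 < v_2 < ... < v_m
    pairs = list(zip(distinct, distinct[1:]))  # the m-1 intervals (v_i, v_(i+1))
    midpoints = [(low + high) / 2 for low, high in pairs]
    lower_values = [low for low, _ in pairs]
    return midpoints, lower_values


midpoints, lower_values = candidate_thresholds([70, 85, 90, 95])
print(midpoints)     # [77.5, 87.5, 92.5]
print(lower_values)  # [70, 85, 90]
```

Note that neither list contains 75, which is exactly what makes the example confusing.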

I am studying the Play/Don't Play example and do not understand where the number 75 for the humidity attribute comes from when the state is sunny, because the humidity values for the sunny state are {70, 85, 90, 95}.

Does anyone know?

by (108k points)

To use a continuous attribute, we need to convert its values to nominal ones. C4.5 does this with a binary split on a threshold value: cases at or below the threshold go to one branch and the rest go to the other. The threshold should be the candidate value that offers the maximum gain for that attribute. For a step-by-step C4.5 decision tree example, refer to the following link:

https://sefiks.com/2018/05/13/a-step-by-step-c4-5-decision-tree-example/
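As for the 75 in the question, one likely explanation is that Quinlan's C4.5 does not report the midpoint but the largest attribute value in the entire training set (not only the sunny subset) that does not exceed the midpoint. The sketch below illustrates this under stated assumptions: the humidity readings and class labels are taken from the commonly circulated version of the Play/Don't Play data and are not given in this thread, and the helper names are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result


def best_split(values, labels):
    """Return (gain, low, high) for the best boundary between adjacent distinct values."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        left = [lab for v, lab in zip(values, labels) if v <= low]
        right = [lab for v, lab in zip(values, labels) if v > low]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if best is None or gain > best[0]:
            best = (gain, low, high)
    return best


def snap_to_observed(midpoint, all_values):
    """Largest value seen anywhere in the training set that does not exceed the midpoint."""
    return max(v for v in all_values if v <= midpoint)


# Assumed data: humidity and class for the five sunny cases.
sunny_humidity = [85, 90, 95, 70, 70]
sunny_play = ["no", "no", "no", "yes", "yes"]

# Assumed data: humidity for all 14 cases of the full training set.
all_humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]

gain, low, high = best_split(sunny_humidity, sunny_play)
midpoint = (low + high) / 2                      # 77.5, between 70 and 85
threshold = snap_to_observed(midpoint, all_humidity)
print(low, high, midpoint, threshold)            # 70 85 77.5 75
```

Under these assumptions the best split in the sunny subset lies between 70 and 85, and the value 75 comes from a non-sunny case in the full data set. Any threshold in that interval separates the sunny cases identically, so 70, 75 and 77.5 all produce the same tree behaviour on the training data.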