Top down “Greedy” algorithm used to partition data by repeatedly splitting into two subsets.
Method
- We start with all observations in a single region and we split the points into two regions.
- We take one of those two regions and we split that region again
- Repeat
Regression
The goal is to find a good split of the predictor space. We would like to minimize the RSS.
For regions for
In other words, across all regions, we would like to have the smallest residuals resulting from a split. Note, is the mean response for observations in the region.
Classification
We use the Gini Index as a substitute for RSS (in the case of regression) for classification. We could also use Cross Entropy or Residual Mean Deviance
Determining the First Split
We set some as the first predictor we would like to divide. In the two resulting spaces, we would like to minimize the mean response.
We have the two regions and
and
such that we want to minimize the RSS from the mean response in the resulting regions.
We want to find split some predictor at some point so that we have the smallest RSS.
This process is then repeated for each of the resulting regions. The region chosen for the split should have the smallest RSS and this process is recursively repeated for on child of the split.
Stopping Criterion
We could stop if there are points in each region. We could stop if the .
Limitations
This approach does not ensure that the best tree will be found.