Top down “Greedy” algorithm used to partition data by repeatedly splitting into two subsets.

Method

  1. We start with all observations in a single region and we split the points into two regions.
  2. We take one of those two regions and we split that region again
  3. Repeat

Regression

The goal is to find a good split of the predictor space. We would like to minimize the RSS.

For regions for

In other words, across all regions, we would like to have the smallest residuals resulting from a split. Note, is the mean response for observations in the region.

Classification

We use the Gini Index as a substitute for RSS (in the case of regression) for classification. We could also use Cross Entropy or Residual Mean Deviance

Determining the First Split

We set some as the first predictor we would like to divide. In the two resulting spaces, we would like to minimize the mean response.

We have the two regions and

and

such that we want to minimize the RSS from the mean response in the resulting regions.

We want to find split some predictor at some point so that we have the smallest RSS.

This process is then repeated for each of the resulting regions. The region chosen for the split should have the smallest RSS and this process is recursively repeated for on child of the split.

Stopping Criterion

We could stop if there are points in each region. We could stop if the .

Limitations

This approach does not ensure that the best tree will be found.