Recursive Binary Splitting

Top down “Greedy” algorithm used to partition data by repeatedly splitting into two subsets.

Method

We start with all observations in a single region and we split the points into two regions.
We take one of those two regions and we split that region again
Repeat

Regression

The goal is to find a good split of the predictor space. We would like to minimize the RSS.

$RSS = \sum_{j = i}^{J} \sum_{i \in R_{j}} (y_{i} - \overset{y}{^}_{R_{j}})^{2}$ For regions $R_{i}$ for $i \in 1, 2, ..., J$

In other words, across all regions, we would like to have the smallest residuals resulting from a split. Note, $\overset{y}{^}_{R_{j}}$ is the mean response for observations in the $j^{th}$ region.

Classification

We use the Gini Index as a substitute for RSS (in the case of regression) for classification. We could also use Cross Entropy or Residual Mean Deviance

Determining the First Split

We set some $X_{j}$ as the first predictor we would like to divide. In the two resulting spaces, we would like to minimize the mean response.

We have the two regions $R_{1}, R_{2}$ and $\forall (j, s)$

$R_{1} (j, s) = X ∣ X_{j} < s$ and $R_{2} (j, s) = X ∣ X_{j} \geq s$

such that we want $(j, s)$ to minimize the RSS from the mean response in the resulting regions.

We want to find split some predictor $X_{j}$ at some point $s$ so that we have the smallest RSS.

This process is then repeated for each of the resulting regions. The region chosen for the split should have the smallest RSS and this process is recursively repeated for on child of the split.

Stopping Criterion

We could stop if there are $< 5$ points in each region. We could stop if the $Δ RSS < ϵ$ .

Limitations

This approach does not ensure that the best tree will be found.

Explorer

Recursive Binary Splitting

Method

Regression

Classification

Determining the First Split

Stopping Criterion

Limitations

Graph View

Table of Contents

Backlinks