Climbing trees 1: what are decision trees?

Published: Jan 7, 2025

This is the first in a series of posts about decision trees in the context of machine learning. The goal here is to provide a foundational understanding of decision trees and to implement them.

Climbing trees series

* Climbing trees 1: what are decision trees? (you are here)
* Climbing trees 2: implementing decision trees
* Climbing trees 3: from trees to forests

---

Decision trees are not amazing algorithms by themselves. They have limitations that can result in suboptimal and even weird predictions. And yet, they have become extremely popular. Some would even say they are the de facto go-to algorithm for many machine learning domains. This is due to bagging and boosting, techniques that turned subpar decision trees into state-of-the-art algorithms. We'll explore them in the future.

First, we'll build an intuition for what decision trees are and define them mathematically. Then, we'll explore how decision trees are built. This will allow us to grasp their main characteristics, advantages and disadvantages. I will try to introduce complexity gradually, but I will assume you have some knowledge of mathematical notation, statistics and basic machine learning concepts. If things become too complicated, try to read the provided references. I've drawn upon various sources instrumental to my understanding of decision trees, including books, documentation, articles, blog posts and lectures. Even if you understand everything, check the references: there is great content there.

What is a decision tree?

Imagine you're trying to decide whether to take an umbrella when leaving home. You might ask questions like: "Are there clouds?" If yes, you might then ask "What's the humidity level?" Each question helps you narrow down the decision. This is how a decision tree works. Let's simulate this weather example:

[Figure: scatter plot of humidity and cloud cover showing that rain is more likely when it's cloudy and humid]

A decision tree can be thought of as making consecutive decisions by asking a series of questions about our data. Each internal tree node uses a certain feature (in our example, cloud cover or humidity) to divide its region into two using a split value. Each new region can be further divided into two. A node that divides its region into two is called an internal node. Leaf (or terminal) nodes don't ask any more questions, but rather provide a prediction for their region. In our example, a leaf might say "Rain" or "No rain". More precisely, it assigns a probability to each outcome.

Let's take a look at a decision tree fitted to our weather example:

[Figure: graph of the decision tree splits]

This decision tree was kept intentionally small. From top to bottom, this graph represents all decision boundaries of our simple tree. Each internal node produces two branches, that is, two paths that can be followed. These branches recursively partition the feature space so that samples with the same labels or similar target values are grouped together. For instance, we predict that it'll rain if humidity is above 59% and cloud cover is above 45% (rightmost path in the graph) because most points (instances) in this region are of the "Rain" class.

The class shown is only relevant for leaf nodes, that is, those at the bottom row. The value property shows how many samples there are of each class in each region. A pure node has only instances of one class in its region. Since all nodes contain at least one instance of both classes (that is, "Rain" and "No rain"), all nodes are impure. The objective during training is to reduce impurity as much as possible. If all nodes are pure, then the tree has zero error on the training data set. This is how decision trees learn.
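As a quick illustration, here is a minimal sketch of fitting a small classification tree with scikit-learn. The data is synthetic and only loosely mimics the weather example, so the learned splits won't match the figures exactly:

```python
# A minimal sketch of fitting a small classification tree with scikit-learn.
# The data is synthetic and only loosely mimics the weather example.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
n = 300
humidity = rng.uniform(0, 100, n)      # feature 1: humidity (%)
cloud_cover = rng.uniform(0, 100, n)   # feature 2: cloud cover (%)

# Rain is more likely when it's both humid and cloudy (plus some noise).
p_rain = 1 / (1 + np.exp(-(0.06 * humidity + 0.06 * cloud_cover - 6)))
rain = rng.random(n) < p_rain

X = np.column_stack([humidity, cloud_cover])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, rain)

# Print the learned splits as text, a rougher version of the tree graph above.
print(export_text(tree, feature_names=["humidity", "cloud_cover"]))
```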
This visualization of the tree is very easy to interpret. We can follow each path and clearly see why a prediction was made; that is, the model is easily explainable (the opposite of a black-box model). This is one of the reasons why decision trees are popular. Simple decision trees can even be applied in some practical settings (e.g. medicine) without machine assistance.

We can also visualize the decision boundaries of the tree by overlaying them onto the scatter plot of our data:

[Figure: scatter plot of humidity and cloud cover showing the prediction regions of the decision tree]

We can see the straight boundaries between regions. More formally, a decision tree is a hierarchical structure that recursively divides the feature space into cuboid regions. Since we have 2 features (2 dimensions) in our example, the cuboid regions are rectangles.

Types of trees

Broadly, there are two types of decision trees: classification and regression trees. While they share the same fundamental structure and splitting methodology, they differ in their output and how they make predictions.

Classification trees are designed to predict categorical outcomes: they assign input data to predefined classes or categories. At each leaf node, the tree predicts the most common class among the training samples that reached that node. Our weather example involves a classification tree and the leaves predict whether it will rain or not. The prediction is made by counting the proportion of training samples of each class at the leaf node and selecting the majority class. Think of it as the tree asking a series of yes/no questions about the input features until it can make an educated guess about which category the input belongs to.

Regression trees, on the other hand, predict continuous numerical values rather than categories. Instead of predicting a class at each leaf node, regression trees typically output the average value of all training samples that reached that node. For instance, a regression tree might predict a house's price based on features like square footage, number of bedrooms, and location. Each split in the tree tries to group together similar numerical values. When a new example comes in, the tree guides it to a leaf node containing training examples with similar target values and uses their average as the prediction.
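To make the contrast concrete, here is a minimal sketch of what a leaf prediction amounts to in each case, using made-up samples that reached a hypothetical leaf:

```python
# A minimal sketch of how leaf nodes turn the training samples that reached
# them into predictions. The samples below are made up for illustration.
import numpy as np

# Classification leaf: the labels of the training samples in this leaf's region.
leaf_labels = np.array(["Rain", "Rain", "No rain", "Rain"])
classes, counts = np.unique(leaf_labels, return_counts=True)
for cls, count in zip(classes, counts):
    print(cls, count / counts.sum())          # class proportions in the leaf
print("prediction:", classes[np.argmax(counts)])  # majority class -> "Rain"

# Regression leaf: the target values of the training samples in this leaf's region.
leaf_targets = np.array([310_000.0, 295_000.0, 342_000.0])
print("prediction:", leaf_targets.mean())     # average target value
```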
Decision tree algorithms

Decision trees have been widely explored and there is a wide range of different decision tree algorithms. The most influential ones are ID3 (Iterative Dichotomiser 3), C4.5 and CART (Classification and Regression Trees). ID3 is one of the earliest decision tree algorithms and supports only classification with categorical features. Nodes can be split into more than two children, depending on the number of categorical levels of each feature. C4.5 extends the ID3 algorithm to support numerical features and missing values. CART is similar to C4.5 but also adds support for regression. The CART approach only performs binary splits, resulting in taller trees when compared to ID3 and C4.5. Although the principles discussed here apply to most decision tree algorithms, I am focused on CART, which is also the algorithm we'll implement.

Mathematical definition

Mathematically, a decision tree can be described as:

$$f(x) = \sum_{m=1}^{M} c_m \, \mathbb{I}(x \in R_m)$$

Where $x$ are the input features, $R_m$ is the $m$-th region and $c_m$ is the value of this region. $\mathbb{I}(x \in R_m)$ is 1 if $x$ is contained in the $m$-th region, 0 otherwise. The values ($c$) are typically a constant (regression) or a vector of probabilities (classification). The combination of regions and values defines the decision tree.

The regions cannot assume arbitrary boundaries, though. They are always parallel to some axis and can only divide a previous region. This can be a limitation, but it greatly reduces the computational complexity of constructing a decision tree. In our weather example, it's trivial to find the following decision boundary using other methods (I have used logistic regression):

[Figure: weather scatter plot with a diagonal decision boundary]

However, axis-parallel splits (single feature) are much easier to compute than oblique splits (multiple features). Finding the best split of a single feature involves sorting the data and evaluating splits. Since the latter is negligible compared to sorting, this operation has a time complexity of $O(n \log n)$, where $n$ is the number of data points. To find the best oblique split combining two features, however, we must first consider all possible $O(n^2)$ lines formed by pairs of points. For each line, we need to evaluate which side each point falls on: $O(n)$. This amounts to a total time complexity of $O(n^3)$. More generally, an oblique split has a time complexity of $O(n^{d+1})$, in which $d$ is the number of features. Therefore, we compromise on using only axis-parallel splits, which define cuboid regions.

Each region $R_m$ is defined by a path from the root to the $m$-th leaf of the tree. For instance, consider the path that defines the region $R = \{(\text{Humidity},\ \text{Cloud})\ |\ \text{Humidity} > 59 \ \text{and} \ \text{Cloud} > 45\}$.

[Figure: path between the root node and a leaf node]

This region can be visualized on the scatter plot:

[Figure: the defined region highlighted in the scatter plot]

Unlike linear models, decision trees do not model the entire data distribution. Rather, each region has independent predicted values. More formally, adding all independent regions defines a piecewise function that can approximate any pattern in the data, but it may struggle to represent smooth or continuous functions properly.
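As a minimal sketch, the region from the example path can be written directly as one term of this piecewise function. The 59/45 thresholds come from the example tree; the probabilities assigned to each region are made up for illustration:

```python
# A minimal sketch of f(x) = sum_m c_m * I(x in R_m) with axis-aligned regions.
# The 59/45 thresholds match the example path; the region values are made up.
def predict_rain_probability(humidity: float, cloud: float) -> float:
    # Each (region test, value) pair is one term c_m * I(x in R_m).
    regions = [
        (lambda h, c: h > 59 and c > 45, 0.9),   # humid and cloudy: likely rain
        (lambda h, c: h > 59 and c <= 45, 0.4),  # humid but mostly clear
        (lambda h, c: h <= 59, 0.1),             # dry: unlikely to rain
    ]
    # The regions partition the feature space, so exactly one indicator is 1.
    return sum(value for test, value in regions if test(humidity, cloud))

print(predict_rain_probability(humidity=70, cloud=80))  # 0.9
print(predict_rain_probability(humidity=30, cloud=80))  # 0.1
```

Note how the prediction jumps abruptly between regions: this is the piecewise, non-smooth behavior described above.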
Bias-variance tradeoff

Like all other machine learning algorithms, trees are haunted by the bias-variance tradeoff. If this concept is new to you, I highly recommend reading about it first (check the references), but come back later.

Bias-variance tradeoff refresher

In summary, bias refers to the error that a model makes due to oversimplification of the relationships between features (underfitting). Variance measures how sensitive model predictions are to small fluctuations in the training set. High variance means that the model is capturing noise rather than true relationships (overfitting). Reducing bias tends to increase variance and vice-versa. Finding an optimal bias-variance balance is crucial to achieve good prediction accuracy.

Decision tree variance

Due to the non-linear and non-smooth nature of trees, they easily capture noise. Thus, deeper trees present more variance. If you fully grow a tree, it will partition the feature space until the error is zero^1. The model will have effectively memorized the training set, including noise, which results in overfitting. If we rebuild our example tree until the error is 0, we get the following regions:

[Figure: scatter plot of humidity and cloud cover showing the prediction regions of a deep decision tree]

It has perfect accuracy, but it has very unusual boundaries due to noise (high variance). This model doesn't generalize well, that is, it would score poorly on new data. There are different ways to limit tree variance, namely:

* Limiting depth
* Requiring a minimum number of points per node
* Requiring a minimum decrease in loss to split the node (usually not a good idea)

The number of samples required to fill a tree doubles for each additional level (depth) of the tree. These limits ensure leaf nodes are not overly sparse. Another possibility is to fully grow the tree and later prune it to balance complexity and accuracy.

Let's build decision trees with increasing depths on a toy dataset:

[Interactive figure: decision trees of increasing depth fitted to a toy dataset; use the slider to change the depth]

On the left we can see the training set^2 with two features and two classes. We're building classification trees, whose decision boundary is overlaid onto the left plot. The plot on the right shows training and test error, as well as the test error of a logistic regression model as a baseline. The test set was sampled from the same distribution, but is different from the training set. Use the slider to compare depths. We can see that the best test error happens with a depth of 3. Increasing the depth further results in overfitting.

Decision tree bias

Even with a depth of three in the example above, the error is larger than the logistic regression baseline. This is an example where linear models outperform decision trees due to the additive structure of the data. The logistic regression equation in this case is:

$$\text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

Where $p$ is the probability of the outcome being of class 1. Ignoring constants and the noise ($\varepsilon$), the probability is determined by a linear combination of both features. The relationship between features is not hierarchical, therefore decision trees require multiple, sometimes redundant, splits to approximate this concept, leading to deep or overly complex trees.

There are other concepts that fall under the same category, such as:

* XOR (exclusive OR): the output is of one class if exactly one (but not both) of two inputs is true.
* Parity problem: the output differs depending on whether the number of true values in a set of inputs is even or odd.
* Multiplexer problem: a multiplexer selects one of several input values based on a separate "selector" input.

These may seem overly specific, but they're used as benchmarks to test the capabilities of machine learning algorithms. Additive structure, XOR and parity involve global dependencies that require multiple inputs to be considered together. Decision trees split data based on one feature at a time, and are therefore inefficient at capturing these relationships. Multiplexer problems require conditional rules which are not hierarchical, making the tree very large to account for all combinations of selector-input pairs.
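A quick way to see the XOR case in practice is the sketch below (synthetic data; exact accuracies vary from run to run): a depth-1 tree is stuck near chance because no single axis-aligned split helps, and performance only recovers once several levels of splits are stacked.

```python
# A small sketch of why XOR-like structure is awkward for shallow trees.
# Synthetic data: class 1 when exactly one of the two features is positive.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] > 0) ^ (X[:, 1] > 0)  # XOR of the two sign bits

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 2, 4, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, round(tree.score(X_test, y_test), 2))
# Depth 1 hovers near 0.5 (chance); accuracy improves only as more
# axis-aligned splits are stacked to carve out the four quadrants.
```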
In more complex real-world datasets, it's quite likely that multiple non-hierarchical concepts are present, leading to suboptimal bias. With subpar variance and bias, decision trees may look like a poor choice of algorithm. Indeed, they rarely shine on their own, but rather as ensembles (with bagging or boosting), which we'll cover in the future.

The staircase effect

When decision trees attempt to model additive structure, they face a structural constraint that is common enough for me to name it the "staircase" effect^3. This limitation manifests differently in classification and regression tasks, but stems from axis-aligned splits.

In classification problems, the staircase effect becomes apparent when the optimal decision boundary between classes involves a linear combination of features (an oblique boundary). Consider a simple case where the optimal boundary is a diagonal line in a two-dimensional feature space. Because decision trees can only make splits parallel to the feature axes, they must approximate this diagonal boundary through a series of rectangular regions, creating a stair-like pattern.

[Figure: demonstration of the staircase effect in classification]

In regression settings, the staircase effect is even more prominent because regression trees must approximate continuous functions using discrete, constant-valued regions. When the underlying relationship is smooth, whether linear, polynomial, or any other continuous function, the tree's prediction surface has a stair-like structure with abrupt changes between adjacent regions. Increasing tree depth brings the prediction surface closer to the training set, yet the surface does not necessarily become smooth, as it captures feature noise.

[Figure: demonstration of the staircase effect in regression. There is only one feature (x) and the vertical axis is the outcome.]

Objective functions

We need to define objective functions to optimize during training. Classification trees use either Gini impurity or entropy as objective functions, while regression trees usually use the squared loss. In all examples, consider the data $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\},\ y_i \in \{1, \dots, c\}$, where $c$ is the number of classes.

Misclassification rate

$$MR(\mathcal{D}) = \frac{\sum_{i=1}^{n} \mathbb{I}(\hat{y}_i \neq y_i)}{n}$$

This is a very intuitive objective for classification. It measures the proportion of misclassified examples. The prediction $\hat{y}$ is the majority vote of the node. Our goal is to decrease leaf node impurity, which is aligned with the misclassification rate function.

Gini impurity

The Gini impurity^4 over a set of class probabilities $p$ is defined as:

$$G(\mathcal{D}) = \sum_{k=1}^{c} p_k (1 - p_k) = 1 - \sum_{k=1}^{c} p_k^2$$

It measures how often a randomly chosen element of a set would be incorrectly labeled if it were labeled randomly and independently according to the distribution of labels in the set. When the node is pure, one class has a probability of 1 and the rest 0, so the Gini impurity is also 0.
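Translating these two definitions into code is straightforward; here is a minimal sketch that works on the vector of class counts of a node:

```python
# Minimal sketches of the misclassification rate and the Gini impurity of a
# node, computed from the class counts of the samples that reached it.
import numpy as np

def misclassification_rate(counts: np.ndarray) -> float:
    # Majority vote: everything that is not the most common class is an error.
    probs = counts / counts.sum()
    return 1.0 - probs.max()

def gini_impurity(counts: np.ndarray) -> float:
    # G = 1 - sum_k p_k^2
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

print(gini_impurity(np.array([10, 0])))           # pure node -> 0.0
print(gini_impurity(np.array([5, 5])))            # 50/50 node -> 0.5
print(misclassification_rate(np.array([10, 5])))  # 5 of 15 misclassified -> 0.333...
```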
The worst Gini impurity arises when the probabilities are uniform among classes, that is, a node with maximum Gini impurity is no better than a coin toss at classifying examples. It can be shown that the Gini impurity approaches an upper bound of 1 as the number of classes grows. To evaluate the quality of a split ($\mathcal{S}$), we calculate the Gini impurity of each child node and take the weighted average:

$$G(\mathcal{D}|\mathcal{S}) = \frac{|\mathcal{D}_L|}{|\mathcal{D}|} G(\mathcal{D}_L) + \frac{|\mathcal{D}_R|}{|\mathcal{D}|} G(\mathcal{D}_R)$$

Where $\frac{|\mathcal{D}_L|}{|\mathcal{D}|}$ is the fraction of inputs in the left child node and $\frac{|\mathcal{D}_R|}{|\mathcal{D}|}$ the fraction in the right.

Entropy

$$H(\mathcal{D}) = - \sum_{k=1}^{c} p_k \log_2 p_k$$

The entropy criterion is a concept from information theory (Shannon entropy). It measures the average level of uncertainty of a set of probabilities. In other words, it's the expected amount of information required to describe the potential outcomes. When probabilities are uniformly distributed, entropy is maximal, just like Gini impurity.

To understand the connection between information and probabilities, consider a set with two classes that are equally likely (let's say red and blue). If you draw a sample, you have the least possible amount of certainty about which color you've drawn. On the other hand, if red has a 95% probability, you can be pretty confident about which color you'll draw: this has much lower uncertainty. When probabilities are uniform, you need extra information to accurately convey the result of a series of draws.

The range of the entropy criterion is from 0 to $\log_2(c)$. We also take the weighted average of child nodes to evaluate the quality of a split:

$$H(\mathcal{D}|\mathcal{S}) = \frac{|\mathcal{D}_L|}{|\mathcal{D}|} H(\mathcal{D}_L) + \frac{|\mathcal{D}_R|}{|\mathcal{D}|} H(\mathcal{D}_R)$$

It's also common to see the objective function expressed in terms of information gain, which is the decrease in entropy yielded by a node split:

$$IG(\mathcal{D}) = H(\mathcal{D}) - H(\mathcal{D}|\mathcal{S})$$

It represents the amount of information gained about our target variable given our split. In these terms, the objective is to maximize information gain.
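Here is a matching sketch for entropy and information gain, again computed from per-class counts (the counts in the examples are made up):

```python
# A minimal sketch of entropy and information gain for a binary split,
# computed from per-class counts in the parent and the two children.
import numpy as np

def entropy(counts: np.ndarray) -> float:
    probs = counts / counts.sum()
    probs = probs[probs > 0]              # 0 * log2(0) is taken as 0
    return -np.sum(probs * np.log2(probs))

def information_gain(left: np.ndarray, right: np.ndarray) -> float:
    parent = left + right
    n_left, n_right, n = left.sum(), right.sum(), parent.sum()
    children = (n_left / n) * entropy(left) + (n_right / n) * entropy(right)
    return entropy(parent) - children

print(entropy(np.array([5, 5])))     # 1.0 bit: two equally likely classes
print(entropy(np.array([9, 1])))     # ~0.47 bits: much less uncertainty
print(information_gain(np.array([7, 1]), np.array([1, 7])))  # ~0.46: a useful split
```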
Comparing classification objectives

Although misclassification rate is intuitive and commonly used as a model metric, it's not used as an optimization objective. We can come up with splits ($\mathcal{S}$) that have the same misclassification rate, but one has a pure node while the other doesn't. For instance:

$$\mathcal{S} = \{A: 10, B: 0\},\ \{A: 10, B: 5\}$$

$$\mathcal{S}' = \{A: 10, B: 2\},\ \{A: 10, B: 3\}$$

Both splits misclassify 5 samples of class B (same misclassification rate), however the first has a pure node. From an optimization perspective, we should favor pure nodes as they reduce the number of subsequent splits. You can check that indeed both the entropy and the Gini impurity of the second split are higher.

Checking that the claim holds

Misclassification rate:

$$MR(\mathcal{D}|\mathcal{S}) = \frac{10}{25} \cdot 0 + \frac{15}{25} \cdot \frac{5}{15} = 0.2$$

$$MR(\mathcal{D}|\mathcal{S}') = \frac{12}{25} \cdot \frac{2}{12} + \frac{13}{25} \cdot \frac{3}{13} = 0.2$$

$$MR(\mathcal{D}|\mathcal{S}) = MR(\mathcal{D}|\mathcal{S}')$$

Gini impurity:

$$G(\mathcal{D}|\mathcal{S}) = \frac{10}{25} \cdot 0 + \frac{15}{25} \cdot 0.444 = 0.266$$

$$G(\mathcal{D}|\mathcal{S}') = \frac{12}{25} \cdot 0.277 + \frac{13}{25} \cdot 0.355 = 0.318$$

$$G(\mathcal{D}|\mathcal{S}) < G(\mathcal{D}|\mathcal{S}')$$
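The same check can be reproduced numerically; here is a minimal sketch using the weighted-average formulas above, with each child node written as its class counts (class A first):

```python
# Quick numerical check of the split comparison above.
# Each child node is given as (count of class A, count of class B).
import numpy as np

def gini(counts):
    probs = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(probs ** 2)

def misclassification(counts):
    probs = np.asarray(counts) / np.sum(counts)
    return 1.0 - probs.max()

def weighted(metric, left, right):
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * metric(left) + (n_right / n) * metric(right)

split = [(10, 0), (10, 5)]    # S: has a pure node
split_p = [(10, 2), (10, 3)]  # S': no pure node

print(weighted(misclassification, *split))    # 0.2
print(weighted(misclassification, *split_p))  # 0.2   -> same misclassification rate
print(weighted(gini, *split))                 # ~0.267
print(weighted(gini, *split_p))               # ~0.318 -> S' is the worse split
```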