Classification and Regression

Decision Trees can be used for both.

Classification

Spam / not Spam

Admit to ICU /not

Lend money / deny

Intrusion detections

Regression

Predict stock returns

Pricing a house or a car

Weather predictions (temp, rain fall etc)

Economic growth predictions

Predicting sports scores

Decision Trees

The general idea is that we will segment the space into a number of simple regions.

The segmentation can be illustrated as a tree

The end nodes can have a category (classification) or a continuous number (regression)

These methods, while quite simple are very powerful.

Visualizing Classification as a Tree

Metrics

Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items.

Different algorithms use different metrics for measuring “best"

These metrics measure how similar a region or a node is. They are said to measure the impurity of a region.

Larger these impurity metrics the larger the “dissimilarity” of a nodes/regions data.

Examples: Gini impurity, Entropy, Variance

Algorithms for building Decision Trees

Popular ones include

CART (Classification And Regression Tree)

CHAID (CHi-squared Automatic Interaction Detector)

C4.5

We will focus on CART, that uses the Gini impurity as its impurity measure.

Gini impurity

Used by the CART

Is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.

Splitting using Gini impurity

When splitting, the Gini impurity of the two resulting nodes are combined using a weighted average.

With weights being the fraction of data on each node.

The CART algorithm simply chooses the right “split” by finding the split that maximizes the “decrease in Gini impurity” - also called the Gini Gain.

Decision trees are prone to ‘overfitting’

Decision Tree is a powerful algorithm that can adapt well and capture various patterns in the data

If allowed to grow fully, they become over-complex & tend to fit even the noise 10

Thus, a fully grown tree may not ‘generalize’ well on test or new unseen data

Pruning

Ideally we would like a tree that does not over-fit the given data

One popular and simple way to prune a decision tree is by limiting the depth of the tree to avoid over fitting.

For example the tree on the right below is generated with a max depth of 2 while the tree on the left has no depth restriction (and hence overfits the data)

Pre-Pruning

Stop growing the tree before it grows too big

This can be achieved by bounding hyperparameters 12 Hyperparameters in Decision Trees

- max_depth - The maximum depth of the tree. If set to ‘None’, then nodes are

expanded until all leaves are pure. Higher the value, more complex the tree

- min_samples_split - The minimum number of samples required to split a node.

Doesn’t split any node that is smaller than this number. Higher the values, less

complex the tree

- min_samples_leaf - The minimum number of samples required at a leaf node. All

leaf nodes have at least these many data points. Higher the value, less

complex the tree

Hyperparameter Tuning using Grid Search

Grid Search is a process of searching the best combination of hyperparameters from a predefined set of values

A parameter grid (Hyperparameters and corresponding values) is provided as an input to the Grid-search function

It tries all the combinations of the values passed and evaluates the model for each combination

It returns the combination of hyperparameter values that works best as per the metric provided for model evaluation

GridSearchCV( ) is an implementation of Grid Search with Cross Validation.

Cross Validation

- Cross Validation is a common Machine Learning technique that splits the data into n non-overlapping groups, and runs n experiments:

In each experiment, n-1 groups are used to train a model and the model is tested on the left out group.

The results are summarized over the n experiments.

- It gives a mechanism that allows us to test a model repeatedly on data that was not used to build the model.

- For Decision Trees, a very common approach is simply to choose the tree with minimum cross validation error.

Post-Pruning: Cost-complexity pruning

- Starting from the Full tree, create a sequence of trees that are sequentially smaller (pruned)

- At each step the algorithm

try removing each possible subtree

find the ‘relative error decrease per node’ for that subtree - Complexity parameter,

And remove the subtree with the minimum

- With the list of subtrees, one usually reverts back to using crossvalidation errors to find the best final pruned tree

Few Additional Thoughts

Other impurity measures

Regression Trees

Pros and Cons

Impurity Measures in Decision Trees

Regression Trees

Pros and Cons of Decision Trees

Pros -

Easy to understand and interpret

Useful in data exploration as it gives the splitting based on the significance of variables • Not influenced by the outlier/Null values and hence requires less data cleaning. Requires less time and effort during data pre-processing than other algorithms.

Can handle both continuous and categorical variables

Does not require any underlying assumptions in data. Works with both linearly and nonlinearly related variables.

Cons-

A small change in the data-set can result in large change in the structure of the decision tree causing instability in the model.

Large trees can be difficult to interpret.

Tends to overfit.