[Decision Tree] Experimenting and visualizing classification and regression trees with different depths

3 minute read

1. Classification Decision Tree

(1) Importing modules and data (iris data)

from sklearn.datasets import load_iris
from sklearn import tree
from os import system
system("pip install graphviz")  # install graphviz if it is not already available
import graphviz
iris=load_iris()
iris.feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

(2) Splitting train dataset and test dataset

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, stratify=iris.target, random_state=123)
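Because stratify=iris.target is passed, the split preserves the class proportions of the full dataset. A quick check of my own (not in the original post) using np.bincount; the 38/37/37 training counts match the confusion matrices below:

import numpy as np
print(np.bincount(y_train))  # per-class counts in the training set: 38, 37, 37
print(np.bincount(y_test))   # per-class counts in the test set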

(3) Making a classification decision tree model and comparing accuracy scores with different depths

from sklearn.metrics import accuracy_score, confusion_matrix

def compare_depth(max_depth):
    # Fit a classification tree at the given depth
    clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=max_depth, random_state=1)
    clf = clf.fit(X_train, y_train)

    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)

    # Accuracy and confusion matrix on the training set, then on the test set
    print(accuracy_score(y_train, y_train_pred))
    print(confusion_matrix(y_train, y_train_pred))
    print(accuracy_score(y_test, y_test_pred))
    print(confusion_matrix(y_test, y_test_pred))
compare_depth(3)
0.9642857142857143
[[38  0  0]
 [ 0 36  1]
 [ 0  3 34]]
0.9210526315789473
[[12  0  0]
 [ 0 12  1]
 [ 0  2 11]]
compare_depth(5)
0.9910714285714286
[[38  0  0]
 [ 0 37  0]
 [ 0  1 36]]
0.9736842105263158
[[12  0  0]
 [ 0 12  1]
 [ 0  0 13]]
compare_depth(7)
1.0
[[38  0  0]
 [ 0 37  0]
 [ 0  0 37]]
0.9736842105263158
[[12  0  0]
 [ 0 12  1]
 [ 0  0 13]]

As shown above, I made a function to compare accuracy scores across different depths. The deeper the tree, the higher the accuracy on the training set: the model explains the training data better as it learns it in more detail. Yet when I compared the models with depth 5 and depth 7, the test accuracy did not improve while the training accuracy rose to 100%. This implies that overfitting occurs somewhere between depth 5 and depth 7.
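To pin down where overfitting sets in without touching the test set, one could also cross-validate over the depths. A minimal sketch of mine (not part of the original analysis) using sklearn's cross_val_score on the training set:

from sklearn.model_selection import cross_val_score

for depth in [3, 5, 7]:
    clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=1)
    # Mean 5-fold cross-validation accuracy on the training data
    print(depth, cross_val_score(clf, X_train, y_train, cv=5).mean())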

(4) Visualizing classification trees

At the root node, each class has the same or a similar number of samples: setosa (38), versicolor (37), virginica (37). The value field in each node shows how many samples belong to each class.
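The decisiontree_comparision helper called below is not shown in the post. A minimal reconstruction of what it presumably does (my sketch, not the original code) is to fit a tree at the given depth and render it with graphviz:

def decisiontree_comparision(max_depth):
    # Same settings as compare_depth above
    clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=max_depth, random_state=1)
    clf.fit(X_train, y_train)
    # Export the fitted tree to DOT format and render it inline
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names,
                                    filled=True, rounded=True,
                                    special_characters=True)
    return graphviz.Source(dot_data)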

decisiontree_comparision(3)
decisiontree_comparision(5)

decisiontree_comparision(7)

Among the three trees above, to alleviate the overfitting problem I would choose the second classification decision tree, whose depth is 5.


2. Regression Decision Tree

(1) Importing modules and making random data

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Create a noisy sine curve: 80 sorted points on [0, 5),
# with extra noise added to every 5th target
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

(2) Making regression trees with different depths

regr1 = DecisionTreeRegressor(max_depth=2)
regr2 = DecisionTreeRegressor(max_depth=5)
regr1.fit(X, y)
regr2.fit(X, y)
DecisionTreeRegressor(max_depth=5)
y_1 = regr1.predict(X)  # predictions on the training inputs
y_2 = regr2.predict(X)
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y, y_1))
0.12967126328231798
print(mean_squared_error(y, y_2))
0.025236948989861896

The (training) MSE is smaller with depth 5 than with depth 2: the deeper tree can fit the data it was trained on more closely.
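Both values above are computed on the same data the trees were fitted on, so the deeper tree is bound to score at least as well. A fairer comparison (my addition, not in the original) holds out part of the data:

# train_test_split was imported in part 1 of this notebook
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=123)
for depth in [2, 5]:
    regr = DecisionTreeRegressor(max_depth=depth).fit(X_tr, y_tr)
    # Training MSE vs. held-out MSE for each depth
    print(depth,
          mean_squared_error(y_tr, regr.predict(X_tr)),
          mean_squared_error(y_te, regr.predict(X_te)))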

(3) Visualizing how accurately the regression trees predict the data with different depths

# Dense grid over the input range, used only for plotting the fitted curves
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]

y_pred_1 = regr1.predict(X_test)
y_pred_2 = regr2.predict(X_test)

The first figure below shows the prediction when the tree depth is 2, while the second one shows the depth-5 tree.

plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_pred_1, color="cornflowerblue", label="max_depth=2", linewidth=2)

plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_pred_2, color="yellowgreen", label="max_depth=5", linewidth=2)

plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
### Plotting both curves together to see the difference between depth 2 and depth 5

plt.figure()
plt.scatter(X, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test, y_pred_1, color="cornflowerblue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_pred_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

The regression model with depth=2 captures the general trend, while the one with depth=5 predicts in more detail. However, the latter may react too sensitively to outliers.
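One common way to keep the detail while blunting that sensitivity (my suggestion, not part of the original post) is the min_samples_leaf parameter, which forces every leaf to average over several points so that a single outlier can no longer get a leaf of its own:

# Same depth-5 tree, but each leaf must contain at least 5 samples
regr3 = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5)
regr3.fit(X, y)
y_pred_3 = regr3.predict(X_test)

plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_pred_3, color="red", label="max_depth=5, min_samples_leaf=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()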

(4) Visualizing regression trees

reg_tree1 = tree.export_graphviz(regr1, out_file=None, 
                                filled=True, rounded=True,  
                                special_characters=True)
g_reg_tree1 = graphviz.Source(reg_tree1)
g_reg_tree1
reg_tree2 = tree.export_graphviz(regr2, out_file=None, 
                                filled=True, rounded=True,  
                                special_characters=True)
g_reg_tree2 = graphviz.Source(reg_tree2)
g_reg_tree2
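In a notebook the Source object renders inline; to keep the diagrams as files, one could additionally call render (my addition; this writes reg_tree1.pdf and reg_tree2.pdf):

g_reg_tree1.render("reg_tree1")  # saves the DOT source and a PDF rendering
g_reg_tree2.render("reg_tree2")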

Each graph shows the regression tree with depth 2 and depth 5, respectively. In a regression tree, value refers to a representative value of the samples in a node (for instance, the mean of the samples' targets).
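This can be verified directly: sklearn stores the per-node values in the fitted tree_ attribute, and the root's value should equal the mean of all training targets (a small check of my own):

# Node 0 is the root; its stored value is the mean of all targets
print(regr1.tree_.value[0])
print(y.mean())  # should match the root value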

