Scikit-Learn : Decision Trees

In this guide, we will learn about decision trees, a supervised learning method available in Scikit-learn.

Decision trees (DTs) are a powerful non-parametric supervised learning method. They can be used for both classification and regression tasks. The main goal of DTs is to create a model that predicts the value of a target variable by learning simple decision rules deduced from the data features. A decision tree has three kinds of nodes: the root node, where the data is split for the first time; internal decision nodes, where the data is split further; and leaf nodes, where we get the final output.
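To make the idea of learned decision rules concrete, the sketch below fits a tree on a tiny made-up data set and prints its rules with sklearn.tree.export_text. The feature names, values and labels are illustrative assumptions and are not part of this guide's data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [temperature, humidity] -> activity
X = [[30, 40], [28, 45], [12, 85], [10, 90]]
y = ["play", "play", "stay", "stay"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the fitted tree as nested threshold rules,
# one line per node, ending in the class predicted at each leaf
rules = export_text(clf, feature_names=["temperature", "humidity"])
print(rules)
```

Each printed line that ends in a class label is a leaf; the lines above it are the threshold tests on the path from the root.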

Decision Tree Algorithms

Different Decision Tree algorithms are explained below −

ID3

It was developed by Ross Quinlan in 1986. It is also called Iterative Dichotomiser 3. The main goal of this algorithm is to find those categorical features, for every node, that will yield the largest information gain for categorical targets.

It lets the tree grow to its maximum size and then, to improve the tree’s ability to generalise to unseen data, applies a pruning step. The output of this algorithm is a multiway tree.
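The information gain that ID3 maximises can be sketched in a few lines of plain Python. This is only an illustration of the quantity, not Quinlan's implementation, and the small outlook/windy/play data set is a made-up assumption.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction from a multiway split on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the child nodes, one child per feature value
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical categorical data
outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(outlook, play)  # outlook separates the classes
gain_windy = information_gain(windy, play)      # windy carries no information
print(gain_outlook, gain_windy)
```

ID3 would pick outlook for this node because it yields the larger gain.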

C4.5

It is the successor to ID3. It removes the restriction that features must be categorical by dynamically defining a discrete attribute that partitions the continuous attribute value into a discrete set of intervals. It converts the trained tree (the output of the ID3 algorithm) into sets of ‘IF-THEN’ rules.

The accuracy of each rule is then evaluated to determine the order in which the rules should be applied.

C5.0

It works similarly to C4.5, but it uses less memory and builds smaller rulesets while being more accurate.

CART

CART stands for Classification and Regression Trees. It constructs binary trees using, at each node, the feature and threshold that yield the largest information gain. For classification, the impurity of a node is commonly measured with the Gini index.

Homogeneity depends upon the Gini index: the lower the value of the Gini index, the more homogeneous the node, and a value of 0 means that all samples in the node belong to one class. CART is similar to the C4.5 algorithm, but the difference is that it does not compute rule sets and it does support numerical target variables (regression). Scikit-learn uses an optimised version of the CART algorithm.
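The Gini index and its use in scoring a binary threshold split can be sketched as follows. This is a plain-Python illustration under the usual definition (one minus the sum of squared class proportions), not scikit-learn's internal code, and the height data is a made-up assumption.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the binary split `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    total = len(labels)
    return (len(left) / total) * gini_impurity(left) + (len(right) / total) * gini_impurity(right)

# Hypothetical data: one numeric feature and two classes
heights = [150, 160, 170, 180]
labels = ["Woman", "Woman", "Man", "Man"]

pure = gini_impurity(["Man", "Man"])               # single class in the node
mixed = gini_impurity(labels)                      # two classes, evenly mixed
good_split = split_impurity(heights, labels, 165)  # both children are pure
poor_split = split_impurity(heights, labels, 155)  # right child stays mixed
print(pure, mixed, good_split, poor_split)
```

A CART-style learner would prefer the threshold 165 here, because it gives the lowest weighted impurity in the child nodes.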

Classification with decision trees

In this case, the target variable is categorical.

Sklearn Module − The Scikit-learn library provides the DecisionTreeClassifier class, in the sklearn.tree module, for performing multiclass classification on a dataset.

Parameters

The following table lists the parameters of sklearn.tree.DecisionTreeClassifier −

Sr.No.   Parameter & Description
1    criterion − string, optional, default = “gini”
     It represents the function to measure the quality of a split. Supported criteria are “gini” and “entropy”. The default is “gini”, which is for Gini impurity, while “entropy” is for the information gain.
2    splitter − string, optional, default = “best”
     It tells the model which strategy, “best” or “random”, to use to choose the split at each node.
3    max_depth − int or None, optional, default = None
     This parameter decides the maximum depth of the tree. The default value is None, which means the nodes will expand until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
4    min_samples_split − int, float, optional, default = 2
     This parameter provides the minimum number of samples required to split an internal node.
5    min_samples_leaf − int, float, optional, default = 1
     This parameter provides the minimum number of samples required to be at a leaf node.
6    min_weight_fraction_leaf − float, optional, default = 0.0
     With this parameter, the model gets the minimum weighted fraction of the sum of weights required to be at a leaf node.
7    max_features − int, float, string or None, optional, default = None
     It gives the model the number of features to be considered when looking for the best split.
8    random_state − int, RandomState instance or None, optional, default = None
     This parameter controls the randomness of the estimator: the features are randomly permuted at each split. The options are as follows −
       • int − In this case, random_state is the seed used by the random number generator.
       • RandomState instance − In this case, random_state is the random number generator.
       • None − In this case, the random number generator is the RandomState instance used by np.random.
9    max_leaf_nodes − int or None, optional, default = None
     This parameter grows a tree with at most max_leaf_nodes leaves in best-first fashion. The default is None, which means there is no limit on the number of leaf nodes.
10   min_impurity_decrease − float, optional, default = 0.0
     This value works as a criterion for a node to split: the model will split a node if the split induces a decrease of the impurity greater than or equal to min_impurity_decrease.
11   min_impurity_split − float, default = 1e-7
     It represents the threshold for early stopping in tree growth. This parameter was deprecated in favour of min_impurity_decrease and has been removed from recent scikit-learn releases.
12   class_weight − dict, list of dicts, “balanced” or None, default = None
     It represents the weights associated with classes. The form is {class_label: weight}. With the default option, all classes are supposed to have weight one. If you choose class_weight = “balanced”, it uses the values of y to automatically adjust weights inversely proportional to the class frequencies.
13   presort − bool, optional, default = False
     It tells the model whether to presort the data to speed up the finding of best splits in fitting. The default is False; if set to True, it may slow down the training process on large datasets. This parameter was deprecated and has been removed from recent scikit-learn releases.
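As a rough illustration of how the growth-limiting parameters above change the fitted tree, the sketch below compares an unrestricted tree with a restricted one. The Iris data set loaded with load_iris is an assumption made for the example and is not used elsewhere in this guide.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Default settings: nodes expand until all leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Restricted growth: entropy criterion, at most 2 levels,
# and at least 5 samples in every leaf
small = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=2,
    min_samples_leaf=5,
    random_state=0,
).fit(X, y)

print("full tree :", full.get_depth(), "levels,", full.get_n_leaves(), "leaves")
print("small tree:", small.get_depth(), "levels,", small.get_n_leaves(), "leaves")
```

Fixing random_state in both estimators keeps the comparison repeatable between runs.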

Attributes

The following table lists the attributes of sklearn.tree.DecisionTreeClassifier −

Sr.No.   Attribute & Description
1    feature_importances_ − array of shape = [n_features]
     This attribute returns the importance of each feature.
2    classes_ − array of shape = [n_classes] or a list of such arrays
     It represents the class labels (single-output problem), or a list of arrays of class labels (multi-output problem).
3    max_features_ − int
     It represents the deduced value of the max_features parameter.
4    n_classes_ − int or list
     It represents the number of classes (single-output problem), or a list containing the number of classes for every output (multi-output problem).
5    n_features_ − int
     It gives the number of features seen when the fit() method is performed. Recent scikit-learn releases expose this value as n_features_in_ instead.
6    n_outputs_ − int
     It gives the number of outputs when the fit() method is performed.
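The attributes above become available once fit() has been called. The sketch below reads them from a tree fitted on the Iris data set, which is an assumption made for the example. It leaves out n_features_ because recent scikit-learn releases no longer provide it.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_features=2, random_state=0).fit(X, y)

print(clf.classes_)              # class labels seen during fit
print(clf.n_classes_)            # number of classes
print(clf.max_features_)         # resolved value of max_features
print(clf.n_outputs_)            # number of outputs (1 for a single target)
print(clf.feature_importances_)  # one importance value per feature
```

The importances are normalised, so they sum to one across the four Iris features.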

Methods

The following table lists the methods of sklearn.tree.DecisionTreeClassifier −

Sr.No.   Method & Description
1    apply(self, X[, check_input])
     This method returns the index of the leaf that each sample ends up in.
2    decision_path(self, X[, check_input])
     As the name suggests, this method returns the decision path in the tree.
3    fit(self, X, y[, sample_weight, …])
     The fit() method builds a decision tree classifier from the given training set (X, y).
4    get_depth(self)
     As the name suggests, this method returns the depth of the decision tree.
5    get_n_leaves(self)
     As the name suggests, this method returns the number of leaves of the decision tree.
6    get_params(self[, deep])
     We can use this method to get the parameters of the estimator.
7    predict(self, X[, check_input])
     It predicts the class value for X.
8    predict_log_proba(self, X)
     It predicts the class log-probabilities of the input samples X.
9    predict_proba(self, X[, check_input])
     It predicts the class probabilities of the input samples X.
10   score(self, X, y[, sample_weight])
     As the name implies, the score() method returns the mean accuracy on the given test data and labels.
11   set_params(self, **params)
     We can set the parameters of the estimator with this method.
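A short sketch of several of these methods used together is given below. It again assumes the Iris data set as a stand-in and scores the model on its own training data, so the accuracy it prints is optimistic.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

sample = X[:1]                    # first flower, kept as a 2-D array

leaf_ids = clf.apply(sample)      # index of the leaf the sample lands in
path = clf.decision_path(sample)  # sparse matrix marking the visited nodes
visited = path.indices            # node ids on the path from root to leaf

proba = clf.predict_proba(sample)
accuracy = clf.score(X, y)        # mean accuracy on the training data
params = clf.get_params()

print("leaf:", leaf_ids, "path:", visited)
print("prediction:", clf.predict(sample), "probabilities:", proba)
print("accuracy:", accuracy, "max_depth:", params["max_depth"])
```

The node with id 0 is the root, so it appears on every decision path.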

Implementation Example

The Python script below uses the sklearn.tree.DecisionTreeClassifier class to construct a classifier that predicts male or female from our data set of 25 samples and two features, namely ‘height’ and ‘length of hair’ −

from sklearn import tree
from sklearn.model_selection import train_test_split

# 25 samples, each given as [height, length of hair]
X = [[165,19],[175,32],[136,35],[174,65],[141,28],[176,15],
     [131,32],[166,6],[128,32],[179,10],[136,34],[186,2],
     [126,25],[176,28],[112,38],[169,9],[171,36],[116,25],
     [196,25],[196,38],[126,40],[197,20],[150,25],[140,32],[136,35]]
Y = ['Man','Woman','Woman','Man','Woman','Man','Woman','Man',
     'Woman','Man','Woman','Man','Woman','Woman','Woman','Man',
     'Woman','Woman','Man','Woman','Woman','Man','Man','Woman','Woman']
data_feature_names = ['height', 'length of hair']

# Hold out 30% of the samples; the classifier below is still fit on all 25 samples
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

DTclf = tree.DecisionTreeClassifier()
DTclf = DTclf.fit(X, Y)  # fit() returns the fitted estimator itself
prediction = DTclf.predict([[135,29]])
print(prediction)

Output

['Woman']

We can also predict the probability of each class by using the predict_proba() method as follows −

Example

prediction = DTclf.predict_proba([[135,29]])
print(prediction)

Output

[[0. 1.]]
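The script above creates a train/test split but fits the classifier on all 25 samples. As a sketch of how the held-out part could be used, the variant below fits on the training portion only and measures accuracy on the test portion with score(). The random_state=0 passed to the classifier is an added assumption to make the run repeatable.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same 25 samples as above: [height, length of hair]
X = [[165,19],[175,32],[136,35],[174,65],[141,28],[176,15],
     [131,32],[166,6],[128,32],[179,10],[136,34],[186,2],
     [126,25],[176,28],[112,38],[169,9],[171,36],[116,25],
     [196,25],[196,38],[126,40],[197,20],[150,25],[140,32],[136,35]]
Y = ['Man','Woman','Woman','Man','Woman','Man','Woman','Man',
     'Woman','Man','Woman','Man','Woman','Woman','Woman','Man',
     'Woman','Woman','Man','Woman','Woman','Man','Man','Woman','Woman']

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1)

# Fit on the training portion only, then score on the unseen portion
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(len(X_train), len(X_test), accuracy)
```

With so few samples the accuracy estimate is noisy, so it should be read as an illustration of the workflow rather than a measure of model quality.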

Regression with decision trees

In this case, the target variable is continuous.

Sklearn Module − The Scikit-learn library provides the DecisionTreeRegressor class, in the sklearn.tree module, for applying decision trees to regression problems.

Parameters

Parameters used by DecisionTreeRegressor are almost the same as those used by DecisionTreeClassifier. The difference lies in the ‘criterion’ parameter. For DecisionTreeRegressor, the ‘criterion: string, optional default= “mse”’ parameter has the following values −

  • mse − It stands for the mean squared error. It is equal to variance reduction as a feature selection criterion. It minimises the L2 loss using the mean of each terminal node.
  • friedman_mse − It also uses the mean squared error, but with Friedman’s improvement score.
  • mae − It stands for the mean absolute error. It minimises the L1 loss using the median of each terminal node.

From scikit-learn 1.0 onwards, mse and mae are named squared_error and absolute_error respectively.

Another difference is that it does not have the ‘class_weight’ parameter.

Attributes

Attributes of DecisionTreeRegressor are also the same as those of DecisionTreeClassifier. The difference is that it does not have the ‘classes_’ and ‘n_classes_’ attributes.

Methods

Methods of DecisionTreeRegressor are also the same as those of DecisionTreeClassifier. The difference is that it does not have the ‘predict_log_proba()’ and ‘predict_proba()’ methods.

Implementation Example

The fit() method of the decision tree regression model takes floating point values of y. Let’s see a simple implementation example using sklearn.tree.DecisionTreeRegressor −

from sklearn import tree

# Two training samples with continuous targets
X = [[1, 1], [5, 5]]
y = [0.1, 1.5]
DTreg = tree.DecisionTreeRegressor()
DTreg = DTreg.fit(X, y)  # fit() returns the fitted estimator itself

Once fitted, we can use this regression model to make predictions as follows −

DTreg.predict([[4, 5]])

Output

array([1.5])
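The result above is exactly one of the training targets because a regression tree predicts a constant value per leaf. The sketch below refits the same two-sample model and queries one point on each side of the split; the query points [2, 2] and [4, 4] are chosen for illustration.

```python
from sklearn.tree import DecisionTreeRegressor

X = [[1, 1], [5, 5]]
y = [0.1, 1.5]
reg = DecisionTreeRegressor(random_state=0).fit(X, y)

# The only possible split lies midway between the two samples, so every
# query falls into one of two leaves and receives that leaf's target value
predictions = reg.predict([[2, 2], [4, 4]])
print(predictions, reg.get_n_leaves())
```

Deeper trees fitted on more data produce more leaves and therefore a finer piecewise-constant approximation of the target.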
