Decision tree learners are powerful classifiers that use a tree structure to model the relationship between the features and the potential outcomes. The tree has a root node and decision nodes where choices are made. These choices split the data across branches that indicate the potential outcomes of a decision. The tree ends in leaf nodes (or terminal nodes) that denote the action to be taken as the result of the series of decisions.
After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works or doesn't work well for a particular task. It also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in which the results need to be shared with others in order to inform business practices.
Decision tree models can handle numeric or nominal features, as well as missing data. One known weakness is that they are often biased towards splits on features having a large number of levels.
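To make the bias towards many-level features concrete, here is a minimal base R sketch. It assumes an entropy-based split criterion (plain information gain); the helper names `entropy` and `info_gain` are hypothetical and written only for this illustration, and real implementations such as C5.0 apply further corrections that are not reproduced here.

```r
# Entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain obtained by splitting `labels` on every level of `feature`.
info_gain <- function(feature, labels) {
  groups  <- split(labels, feature)
  weights <- sapply(groups, length) / length(labels)
  entropy(labels) - sum(weights * sapply(groups, entropy))
}

labels <- iris$Species

# A meaningless feature with one level per row (a row ID).
gain_id <- info_gain(seq_len(nrow(iris)), labels)

# A meaningful two-level feature.
gain_binary <- info_gain(iris$Petal.Length < 2.5, labels)

c(row_id = gain_id, petal_split = gain_binary)
```

The row ID receives the largest possible gain because every one of its groups contains a single case, even though it says nothing useful about new data.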
Some potential uses of this algorithm include:
- Credit scoring models
- Marketing studies of customer behavior
- Diagnosis of medical conditions based on laboratory measurements
There are numerous implementations of decision trees; here we will use the C5.0 algorithm.
- Create train and test data sets:
We will divide the iris data set into two different sets: a train data set that will be used to build the model and a test data set that will be used to estimate the predictive accuracy of the model.
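The data set will be divided into train (70%) and test (30%) sets. We create the data sets using the createDataPartition function from the caret package:
library(caret) # provides createDataPartition() and confusionMatrix()
train_ind = createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train = iris[train_ind, ]
head(train)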
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 8 5.0 3.4 1.5 0.2 setosa
test = iris[-train_ind, ]
head(test)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 4.7 3.2 1.3 0.2 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 23 4.6 3.6 1.0 0.2 setosa
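For readers without caret installed, the same kind of split can be approximated in base R. This is only a sketch of a stratified 70/30 split (sampling within each species), not a reproduction of caret's internals; the seed and the names `train_sketch` and `test_sketch` are assumptions made for illustration, so the selected rows will differ from the output shown above.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Row indices grouped by species, then 70% sampled from each group.
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}))

train_sketch <- iris[train_rows, ]
test_sketch  <- iris[-train_rows, ]

table(train_sketch$Species)  # class counts in the training part
table(test_sketch$Species)   # class counts in the test part
```

Sampling within each class keeps the class proportions of the full data set in both parts, which matters more when the classes are unbalanced than it does for iris.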
- Training a model on the data
str(train)
## 'data.frame': 105 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.6 5 5.4 5 4.9 5.4 4.8 4.8 ...
## $ Sepal.Width : num 3.5 3 3.1 3.6 3.9 3.4 3.1 3.7 3.4 3 ...
## $ Petal.Length: num 1.4 1.4 1.5 1.4 1.7 1.5 1.5 1.5 1.6 1.4 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.4 0.2 0.1 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
library(C50) # provides C5.0() and C5imp()
model = C5.0(train[c(1:4)], train$Species) # (predictors from the training set, class labels)
model
## Call:
## C5.0.default(x = train[c(1:4)], y = train$Species)
## Classification Tree
## Number of samples: 105
## Number of predictors: 4
## Tree size: 4
## Non-standard options: attempt to group attributes
summary(model)
## Call:
## C5.0.default(x = train[c(1:4)], y = train$Species)
## C5.0 [Release 2.07 GPL Edition] Thu Aug 31 10:23:13 2017
## -------------------------------
## Class specified by attribute `outcome'
## Read 105 cases (5 attributes) from
## Decision tree:
## Petal.Length <= 1.7: setosa (35)
## Petal.Length > 1.7:
## :...Petal.Width > 1.7: virginica (32/1)
## Petal.Width <= 1.7:
## :...Petal.Length <= 5: versicolor (34/1)
## Petal.Length > 5: virginica (4/1)
## Evaluation on training data (105 cases):
## Decision Tree
## ----------------
## Size Errors
## 4 3( 2.9%) <<
##    (a)   (b)   (c)    <-classified as
##   ----  ----  ----
##     35                (a): class setosa
##           33     2    (b): class versicolor
##            1    34    (c): class virginica
## Attribute usage:
## 100.00% Petal.Length
##  66.67% Petal.Width
## Time: 0.0 secs
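The decision tree printed above can be read as a short chain of nested rules. The following sketch transcribes those rules by hand into a plain R function; `classify_iris` is a hypothetical helper written for illustration, and the thresholds are copied from this particular run, so a different random split may produce a different tree.

```r
# Hand-written transcription of the printed tree (thresholds from the output above).
classify_iris <- function(petal_length, petal_width) {
  if (petal_length <= 1.7) {
    return("setosa")
  }
  if (petal_width > 1.7) {
    return("virginica")
  }
  if (petal_length <= 5) "versicolor" else "virginica"
}

classify_iris(1.4, 0.2)  # short petals: first rule applies
classify_iris(4.5, 1.5)  # narrow, medium-length petals
classify_iris(6.0, 2.5)  # wide petals
```

Only the two petal measurements appear in the rules, which matches the attribute usage reported in the summary.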
- Evaluating model performance
predictions = predict(model, test) # (model fitted on the training set, data set to be predicted)
predictions
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa versicolor setosa setosa setosa versicolor
## [13] setosa setosa setosa versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica virginica virginica virginica virginica virginica
## [37] versicolor virginica virginica virginica virginica virginica
## [43] virginica virginica virginica
## Levels: setosa versicolor virginica
library(gmodels) # provides CrossTable()
CrossTable(test$Species, predictions, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c("actual", "predicted"))
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
## Total Observations in Table: 45
## | predicted
## actual | setosa | versicolor | virginica | Row Total |
## -------------|------------|------------|------------|------------|
## setosa | 13 | 2 | 0 | 15 |
## | 0.289 | 0.044 | 0.000 | |
## -------------|------------|------------|------------|------------|
## versicolor | 0 | 15 | 0 | 15 |
## | 0.000 | 0.333 | 0.000 | |
## -------------|------------|------------|------------|------------|
## virginica | 0 | 1 | 14 | 15 |
## | 0.000 | 0.022 | 0.311 | |
## -------------|------------|------------|------------|------------|
## Column Total | 13 | 18 | 14 | 45 |
## -------------|------------|------------|------------|------------|
confu = confusionMatrix(predictions, test$Species, positive = 'setosa')
confu
## Confusion Matrix and Statistics
## Reference
## Prediction setosa versicolor virginica
## setosa 13 0 0
## versicolor 2 15 1
## virginica 0 0 14
## Overall Statistics
## Accuracy : 0.9333
## 95% CI : (0.8173, 0.986)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
## Kappa : 0.9
## Mcnemar's Test P-Value : NA
## Statistics by Class:
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 0.8667 1.0000 0.9333
## Specificity 1.0000 0.9000 1.0000
## Pos Pred Value 1.0000 0.8333 1.0000
## Neg Pred Value 0.9375 1.0000 0.9677
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.2889 0.3333 0.3111
## Detection Prevalence 0.2889 0.4000 0.3111
## Balanced Accuracy 0.9333 0.9500 0.9667
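The headline statistics above can be checked by hand from the confusion matrix. The sketch below re-enters the counts from the printed table (rows are predictions, columns are the reference classes) and recomputes accuracy, kappa, and the per-class sensitivity, specificity, and positive predictive value in base R; the variable names are chosen for illustration only.

```r
classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the table above; matrix() fills column by column.
cm <- matrix(c(13, 2, 0,
                0, 15, 0,
                0, 1, 14),
             nrow = 3,
             dimnames = list(prediction = classes, reference = classes))

n  <- sum(cm)
tp <- diag(cm)                              # correctly predicted cases per class
accuracy <- sum(tp) / n

sensitivity    <- tp / colSums(cm)          # share of each true class that was found
pos_pred_value <- tp / rowSums(cm)          # share of each prediction that was right
true_neg       <- n - rowSums(cm) - colSums(cm) + tp
specificity    <- true_neg / (n - colSums(cm))

# Cohen's kappa: agreement corrected for chance agreement.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa), 4)
round(rbind(sensitivity, specificity, pos_pred_value), 4)
```

The results agree with the values reported by confusionMatrix, for example 42 correct predictions out of 45 for the accuracy.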
Importance of each variable in the model:
# metric = 'usage' or 'splits'
# pct = logical, whether importance values are converted to a 0-100 scale
C5imp(model, metric = 'usage', pct = TRUE)
## Overall
## Petal.Length 100.00
## Petal.Width 66.67
## Sepal.Length 0.00
## Sepal.Width 0.00
C5imp(model, metric = 'splits', pct = TRUE)
## Overall
## Petal.Length 66.66667
## Petal.Width 33.33333
## Sepal.Length 0.00000
## Sepal.Width 0.00000
C5imp(model, metric = 'usage', pct = FALSE)
## Overall
## Petal.Length 100.00
## Petal.Width 66.67
## Sepal.Length 0.00
## Sepal.Width 0.00
C5imp(model, metric = 'splits', pct = FALSE)
## Overall
## Petal.Length 2
## Petal.Width 1
## Sepal.Length 0
## Sepal.Width 0
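Both importance metrics can be recovered by hand from the printed tree, which may help in reading them. The sketch below assumes that 'splits' counts how many splits use each variable and that 'usage' is the share of training cases that reach a split on that variable; the counts are copied from the tree output of this run and the variable names are illustrative.

```r
# Number of splits per variable, read off the printed tree.
split_counts <- c(Petal.Length = 2, Petal.Width = 1,
                  Sepal.Length = 0, Sepal.Width = 0)
split_pct <- 100 * split_counts / sum(split_counts)

# Training cases that pass through a split on each variable.
# All 105 cases meet the root split on Petal.Length; the 32 + 34 + 4 cases
# with Petal.Length > 1.7 go on to meet the split on Petal.Width.
n_train <- 105
cases_reaching <- c(Petal.Length = 105, Petal.Width = 32 + 34 + 4,
                    Sepal.Length = 0, Sepal.Width = 0)
usage_pct <- 100 * cases_reaching / n_train

round(split_pct, 2)
round(usage_pct, 2)
```

Under these assumptions the hand computation gives the same percentages as the C5imp output above.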
Plot the fitted model:
plot(model) #entire tree
plot(model, subtree = 3) # subtree starting at the selected node (in this case, node = 5)
plot(model, subtree = 4) # subtree starting at the selected node (in this case, node = 6)