
Decision Tree Algorithm using iris data set

Decision tree learners are powerful classifiers that use a tree structure to model the relationship between the features and the potential outcomes. The tree has a root node and decision nodes where choices are made. These choices split the data across branches that indicate the potential outcomes of a decision. The tree ends in leaf nodes (or terminal nodes) that denote the action to be taken as the result of the series of decisions.
After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides insight into how and why the model works, or doesn't work well, for a particular task. It also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in which the results need to be shared with others in order to inform business practices.
Decision trees can handle numeric or nominal features, as well as missing data. One known weakness is that they are often biased towards splits on features that have a large number of levels.
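To make the idea of a "good split" more concrete, here is a minimal base R sketch of entropy and information gain, the kind of quantity that entropy-based tree learners evaluate when choosing a split. This is not the C5.0 implementation, whose actual criterion is more involved. The helper names entropy and info_gain and the two thresholds are assumptions chosen purely for illustration.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain obtained by splitting `labels` with a logical condition
info_gain <- function(labels, condition) {
  n <- length(labels)
  left <- labels[condition]
  right <- labels[!condition]
  entropy(labels) -
    (length(left) / n) * entropy(left) -
    (length(right) / n) * entropy(right)
}

# Entropy of the full iris data set (three balanced classes)
base_entropy <- entropy(iris$Species)

# Illustrative thresholds, not taken from any fitted model
gain_petal <- info_gain(iris$Species, iris$Petal.Length <= 2.45)
gain_sepal <- info_gain(iris$Species, iris$Sepal.Width <= 3.0)

c(base = base_entropy, petal = gain_petal, sepal = gain_sepal)
```

A split with a higher gain leaves purer groups behind. In this sketch the petal-based split should score higher than the sepal-based one, because it isolates one species completely.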
Some potential uses of this algorithm include:
  • Credit scoring model
  • Marketing studies of customer behavior
  • Diagnosis of medical conditions based on laboratory measurements
There are numerous implementations of decision trees. Here we will use the C5.0 algorithm.
  1. Create train and test data sets:
We will divide the iris data set into two different sets: a train data set that will be used to build the model and a test data set that will be used to estimate the predictive accuracy of the model.
The data set will be divided into train (70%) and test (30%) sets. We create them using the createDataPartition function from the caret package:
library(caret)
set.seed(12345)

train_ind = createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train = iris[train_ind,]
head(train)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 8          5.0         3.4          1.5         0.2  setosa
test = iris[-train_ind,]
head(test)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3           4.7         3.2          1.3         0.2  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 23          4.6         3.6          1.0         0.2  setosa
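createDataPartition samples within each class, so the class proportions are preserved in both sets. The following is a minimal base R sketch of the same idea, without caret. The names idx_by_class, train_rows, train_base and test_base are made up for this example, and the rows selected will not match the ones chosen by caret above.

```r
set.seed(12345)

# Group the row indices of iris by species
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Within each species, sample 70% of the rows for training
train_rows <- unlist(
  lapply(idx_by_class, function(rows) sample(rows, size = round(0.7 * length(rows)))),
  use.names = FALSE
)

train_base <- iris[train_rows, ]
test_base <- iris[-train_rows, ]

# Each species contributes the same number of rows to each set
table(train_base$Species)
table(test_base$Species)
```

Stratifying by class matters most when the classes are unbalanced. With iris the three species are equally frequent, so each one ends up with 35 training rows and 15 test rows.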
  2. Training a model on the data
#install.packages('C50')
library(C50)
str(train)
## 'data.frame':    105 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.6 5 5.4 5 4.9 5.4 4.8 4.8 ...
##  $ Sepal.Width : num  3.5 3 3.1 3.6 3.9 3.4 3.1 3.7 3.4 3 ...
##  $ Petal.Length: num  1.4 1.4 1.5 1.4 1.7 1.5 1.5 1.5 1.6 1.4 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.4 0.2 0.1 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
model = C5.0(train[c(1:4)], train$Species) # (predictors from the training set, outcome)
model
## 
## Call:
## C5.0.default(x = train[c(1:4)], y = train$Species)
## 
## Classification Tree
## Number of samples: 105 
## Number of predictors: 4 
## 
## Tree size: 4 
## 
## Non-standard options: attempt to group attributes
summary(model)
## 
## Call:
## C5.0.default(x = train[c(1:4)], y = train$Species)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Aug 31 10:23:13 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 105 cases (5 attributes) from undefined.data
## 
## Decision tree:
## 
## Petal.Length <= 1.7: setosa (35)
## Petal.Length > 1.7:
## :...Petal.Width > 1.7: virginica (32/1)
##     Petal.Width <= 1.7:
##     :...Petal.Length <= 5: versicolor (34/1)
##         Petal.Length > 5: virginica (4/1)
## 
## 
## Evaluation on training data (105 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       4    3( 2.9%)   <<
## 
## 
##     (a)   (b)   (c)    <-classified as
##    ----  ----  ----
##      35                (a): class setosa
##            33     2    (b): class versicolor
##             1    34    (c): class virginica
## 
## 
##  Attribute usage:
## 
##  100.00% Petal.Length
##   66.67% Petal.Width
## 
## 
## Time: 0.0 secs
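The printed decision tree can be read as a set of nested if/else rules. As a rough sketch, the same rules can be written by hand in base R. The function name classify_by_tree is made up for illustration and is not part of C50. The thresholds are copied from the tree printed by summary(model).

```r
# Hand-written version of the tree printed by summary(model)
classify_by_tree <- function(petal_length, petal_width) {
  ifelse(petal_length <= 1.7, "setosa",
    ifelse(petal_width > 1.7, "virginica",
      ifelse(petal_length <= 5, "versicolor", "virginica")))
}

# Apply the rules to every row of iris
rule_pred <- classify_by_tree(iris$Petal.Length, iris$Petal.Width)

# Compare the hand-written rules with the true species
table(actual = iris$Species, predicted = rule_pred)
```

Note that only the two petal measurements appear in the rules. The sepal measurements are never consulted.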
  3. Evaluating model performance
predictions = predict(model, test) # (model fitted on the training data set, data set to be predicted)
predictions
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     versicolor setosa     setosa     setosa     versicolor
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] versicolor virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.1
CrossTable(test$Species, predictions, prop.chisq = FALSE, prop.c= FALSE, prop.r = FALSE, dnn= c("actual", "predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  45 
## 
##  
##              | predicted 
##       actual |     setosa | versicolor |  virginica |  Row Total | 
## -------------|------------|------------|------------|------------|
##       setosa |         13 |          2 |          0 |         15 | 
##              |      0.289 |      0.044 |      0.000 |            | 
## -------------|------------|------------|------------|------------|
##   versicolor |          0 |         15 |          0 |         15 | 
##              |      0.000 |      0.333 |      0.000 |            | 
## -------------|------------|------------|------------|------------|
##    virginica |          0 |          1 |         14 |         15 | 
##              |      0.000 |      0.022 |      0.311 |            | 
## -------------|------------|------------|------------|------------|
## Column Total |         13 |         18 |         14 |         45 | 
## -------------|------------|------------|------------|------------|
## 
## 
library(caret)
confu = confusionMatrix(predictions, test$Species, positive = 'setosa')
confu
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         13          0         0
##   versicolor      2         15         1
##   virginica       0          0        14
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9333         
##                  95% CI : (0.8173, 0.986)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9            
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 0.8667            1.0000           0.9333
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              0.9375            1.0000           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.2889            0.3333           0.3111
## Detection Prevalence        0.2889            0.4000           0.3111
## Balanced Accuracy           0.9333            0.9500           0.9667
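The statistics reported by confusionMatrix can be checked by hand. The sketch below re-enters the counts from the confusion matrix above into a plain matrix and recomputes accuracy, kappa, sensitivity, specificity and positive predictive value with base R. The variable names are illustrative only.

```r
classes <- c("setosa", "versicolor", "virginica")

# Rows are predictions, columns are the reference (actual) classes.
# matrix() fills column by column, so each line below is one reference class.
cm <- matrix(c(13,  2,  0,    # actual setosa
                0, 15,  0,    # actual versicolor
                0,  1, 14),   # actual virginica
             nrow = 3,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
tp <- diag(cm)                    # correctly classified cases per class
fp <- rowSums(cm) - tp            # predicted as the class, but actually another
fn <- colSums(cm) - tp            # actually the class, but predicted as another
tn <- n - tp - fp - fn

accuracy <- sum(tp) / n
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa <- (accuracy - expected) / (1 - expected)

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
pos_pred_value <- tp / (tp + fp)

round(rbind(sensitivity, specificity, pos_pred_value), 4)
```

These values agree with the confusionMatrix output: for example, versicolor has a sensitivity of 1 because all 15 versicolor flowers were found, but a positive predictive value of 15/18 because three other flowers were also labelled versicolor.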
Importance of each variable in the model:
#metric = 'usage' or 'splits'
#pct = logical, whether the importance values are converted to a 0-100 scale
C5imp(model, metric = 'usage', pct = TRUE)
##              Overall
## Petal.Length  100.00
## Petal.Width    66.67
## Sepal.Length    0.00
## Sepal.Width     0.00
C5imp(model, metric = 'splits', pct = TRUE)
##               Overall
## Petal.Length 66.66667
## Petal.Width  33.33333
## Sepal.Length  0.00000
## Sepal.Width   0.00000
C5imp(model, metric = 'usage', pct = FALSE)
##              Overall
## Petal.Length  100.00
## Petal.Width    66.67
## Sepal.Length    0.00
## Sepal.Width     0.00
C5imp(model, metric = 'splits', pct = FALSE)
##              Overall
## Petal.Length       2
## Petal.Width        1
## Sepal.Length       0
## Sepal.Width        0
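Both metrics can be traced back to the printed tree. Judging from the output above, 'usage' seems to be the share of training cases that pass through a split on the predictor, and 'splits' the share of splits that use it. The sketch below redoes that arithmetic from the leaf counts shown by summary(model). The variable names are made up for this example, and this is a reading of the output rather than the internal C50 code.

```r
# Number of training cases in each leaf, copied from the printed tree
leaf_counts <- c(setosa = 35, virginica_wide = 32, versicolor = 34, virginica_long = 4)
n <- sum(leaf_counts)

# Petal.Length is tested at the root, so every case passes through it.
# Petal.Width is only tested for cases with Petal.Length > 1.7.
usage <- c(
  Petal.Length = 100 * n / n,
  Petal.Width = 100 * (n - leaf_counts[["setosa"]]) / n
)

# The tree has three splits: two on Petal.Length and one on Petal.Width
split_counts <- c(Petal.Length = 2, Petal.Width = 1)
splits_pct <- 100 * split_counts / sum(split_counts)

round(usage, 2)
round(splits_pct, 2)
```

The sepal measurements score zero on both metrics because the tree never splits on them.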
Plot the fitted model:
plot(model) #entire tree
plot(model, subtree = 3) # subtree rooted at the selected node (here, node 3)
plot(model, subtree = 4) # subtree rooted at the selected node (here, node 4)
