
Decision tree for predicting diabetes based on diagnostic measures


Decision tree learners are powerful classifiers that use a tree structure to model the relationship between the features and the potential outcomes. The tree has a root node and decision nodes where choices are made. These choices split the data across branches that indicate the potential outcomes of a decision. The tree is terminated by leaf nodes (or terminal nodes) that denote the action to be taken as a result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree.
After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works or doesn’t work well for a particular task. It also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in which the results need to be shared with others to inform business practices.
Decision trees can handle numeric or nominal features, as well as missing data. One known weakness is that they are often biased towards splits on features that have a large number of levels.
Some potential uses of this algorithm include:
  • Credit scoring model
  • Marketing studies of customer behavior
  • Diagnosis of medical conditions based on laboratory measurements
There are numerous implementations of decision trees; here we will use the C5.0 algorithm.
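C5.0 chooses its splits using entropy-based measures (information gain and gain ratio). As a rough illustration of that idea, and not the C50 package's internal code, the sketch below computes the entropy of a set of class labels and the information gain of one candidate split. The helper names `entropy` and `info_gain` and the toy data are assumptions made for this example.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by a logical condition `left`
info_gain <- function(labels, left) {
  n <- length(labels)
  weighted <- (sum(left) / n) * entropy(labels[left]) +
              (sum(!left) / n) * entropy(labels[!left])
  entropy(labels) - weighted
}

# Toy data (hypothetical): glucose values and diagnosis
glucose <- c(85, 89, 100, 110, 130, 150, 170, 190)
outcome <- c("negative", "negative", "negative", "negative",
             "positive", "positive", "positive", "negative")

entropy(outcome)                      # impurity before splitting
info_gain(outcome, glucose <= 127)    # gain from the split Glucose <= 127
```

A split that separates the classes well leaves purer subsets, so its information gain is higher. The tree-growing step repeats this comparison across candidate features and thresholds.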

Step 1. Collecting data. Exploring and preparing the data.

diabetes = read.csv("C:/Users/ester/Desktop/BLOG/diabetes.csv", sep = "," , dec = ".", header = TRUE)
head(diabetes)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
str(diabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...
Here, we have to be careful about how the data has been coded. First, we see that the Outcome is numeric while it should be categorical:
diabetes$Outcome = factor(diabetes$Outcome)
summary(diabetes$Outcome)
##   0   1 
## 500 268
levels(diabetes$Outcome) = c('negative', 'positive')
Secondly, many variables contain zeros that do not make sense, for example BloodPressure, BMI, etc. We can assume that these zeros represent missing values and need to be recoded as NA:
diabetes$Glucose[diabetes$Glucose == 0 ] = NA
diabetes$BloodPressure[diabetes$BloodPressure == 0 ] = NA
diabetes$SkinThickness[diabetes$SkinThickness == 0 ] = NA
diabetes$Insulin[diabetes$Insulin == 0 ] = NA
diabetes$BMI[diabetes$BMI == 0 ] = NA
diabetes = na.omit(diabetes)
summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 56.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.:21.00  
##  Median : 2.000   Median :119.0   Median : 70.00   Median :29.00  
##  Mean   : 3.301   Mean   :122.6   Mean   : 70.66   Mean   :29.15  
##  3rd Qu.: 5.000   3rd Qu.:143.0   3rd Qu.: 78.00   3rd Qu.:37.00  
##  Max.   :17.000   Max.   :198.0   Max.   :110.00   Max.   :63.00  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0850           Min.   :21.00  
##  1st Qu.: 76.75   1st Qu.:28.40   1st Qu.:0.2697           1st Qu.:23.00  
##  Median :125.50   Median :33.20   Median :0.4495           Median :27.00  
##  Mean   :156.06   Mean   :33.09   Mean   :0.5230           Mean   :30.86  
##  3rd Qu.:190.00   3rd Qu.:37.10   3rd Qu.:0.6870           3rd Qu.:36.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##      Outcome   
##  negative:262  
##  positive:130  
##                
##                
##                
## 
The data we are going to work with has 392 rows and 9 columns.
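Dropping every row that has at least one NA reduces the data from 768 to 392 rows, so it is worth checking which columns drive the loss before calling `na.omit`. A minimal sketch on a small made-up data frame (the `toy` data frame and its values are hypothetical; only the column names match the dataset):

```r
# Hypothetical mini data frame with zeros standing in for missing values
toy <- data.frame(
  Glucose       = c(148, 85, 0, 89, 137),
  BloodPressure = c(72, 66, 64, 0, 40),
  Insulin       = c(0, 0, 0, 94, 168),
  Pregnancies   = c(6, 1, 8, 1, 0)    # zero is a valid value here
)

# Recode zeros to NA only in columns where a zero is not plausible
zero_as_na <- c("Glucose", "BloodPressure", "Insulin")
toy[zero_as_na] <- lapply(toy[zero_as_na], function(x) replace(x, x == 0, NA))

colSums(is.na(toy))          # missing values per column
sum(complete.cases(toy))     # rows that would survive na.omit()
```

Listing the columns to recode explicitly keeps legitimate zeros, such as zero pregnancies, untouched.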

Step 2. Creating training and testing datasets

We will divide our data into two different sets: a training dataset that will be used to build the model and a test dataset that will be used to estimate the predictive accuracy of the model.
The dataset will be divided into training (70%) and testing (30%) sets. We create them using the caret package:
library(caret)
set.seed(123)

train_ind = createDataPartition(y = diabetes$Outcome, p = 0.7, list = FALSE)
train = diabetes[train_ind,]
head(train)
##    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 4            1      89            66            23      94 28.1
## 5            0     137            40            35     168 43.1
## 14           1     189            60            23     846 30.1
## 15           5     166            72            19     175 25.8
## 19           1     103            30            38      83 43.3
## 20           1     115            70            30      96 34.6
##    DiabetesPedigreeFunction Age  Outcome
## 4                     0.167  21 negative
## 5                     2.288  33 positive
## 14                    0.398  59 positive
## 15                    0.587  51 positive
## 19                    0.183  33 negative
## 20                    0.529  32 positive
test = diabetes[-train_ind,]
head(test)
##    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 7            3      78            50            32      88 31.0
## 9            2     197            70            45     543 30.5
## 17           0     118            84            47     230 45.8
## 21           3     126            88            41     235 39.3
## 25          11     143            94            33     146 36.6
## 26          10     125            70            26     115 31.1
##    DiabetesPedigreeFunction Age  Outcome
## 7                     0.248  26 positive
## 9                     0.158  53 positive
## 17                    0.551  31 positive
## 21                    0.704  27 negative
## 25                    0.254  51 positive
## 26                    0.205  41 positive
The training set has 275 samples, and the testing set has 117 samples.
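`createDataPartition` samples within each class, so both sets keep roughly the original class balance. The idea can be sketched in base R as follows. This is an illustration of stratified sampling, not caret's actual implementation, and the helper `stratified_split` and the toy labels are assumptions.

```r
# Return row indices for a stratified training sample
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(idx_by_class, function(idx) {
    # idx[sample.int(...)] avoids the sample(x) pitfall when a class has one row
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

set.seed(123)
y <- factor(rep(c("negative", "positive"), times = c(20, 10)))  # toy labels
train_idx <- stratified_split(y, p = 0.7)

table(y[train_idx])    # class counts in the training set
table(y[-train_idx])   # class counts in the test set
```

Sampling per class matters here because the positive class is the minority: a purely random split could leave too few positive cases in one of the two sets.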

Step 3. Training a model on the data

#install.packages('C50')
library(C50)
str(train)
## 'data.frame':    275 obs. of  9 variables:
##  $ Pregnancies             : int  1 0 1 5 1 1 3 3 9 1 ...
##  $ Glucose                 : int  89 137 189 166 103 115 158 88 171 103 ...
##  $ BloodPressure           : int  66 40 60 72 30 70 76 58 110 80 ...
##  $ SkinThickness           : int  23 35 23 19 38 30 36 11 24 11 ...
##  $ Insulin                 : int  94 168 846 175 83 96 245 54 240 82 ...
##  $ BMI                     : num  28.1 43.1 30.1 25.8 43.3 34.6 31.6 24.8 45.4 19.4 ...
##  $ DiabetesPedigreeFunction: num  0.167 2.288 0.398 0.587 0.183 ...
##  $ Age                     : int  21 33 59 51 33 32 28 22 54 22 ...
##  $ Outcome                 : Factor w/ 2 levels "negative","positive": 1 2 2 2 1 2 2 1 2 1 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:376] 1 2 3 6 8 10 11 12 13 16 ...
##   .. ..- attr(*, "names")= chr [1:376] "1" "2" "3" "6" ...
model = C5.0(train[c(1:8)], train$Outcome) # (predictors from the training set, class labels)
model
## 
## Call:
## C5.0.default(x = train[c(1:8)], y = train$Outcome)
## 
## Classification Tree
## Number of samples: 275 
## Number of predictors: 8 
## 
## Tree size: 24 
## 
## Non-standard options: attempt to group attributes
summary(model)
## 
## Call:
## C5.0.default(x = train[c(1:8)], y = train$Outcome)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Sep 21 05:33:47 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 275 cases (9 attributes) from undefined.data
## 
## Decision tree:
## 
## Glucose <= 127:
## :...Insulin <= 95:
## :   :...BloodPressure <= 74: negative (61)
## :   :   BloodPressure > 74:
## :   :   :...DiabetesPedigreeFunction <= 0.817: negative (18/1)
## :   :       DiabetesPedigreeFunction > 0.817: positive (2)
## :   Insulin > 95:
## :   :...Age <= 28:
## :       :...SkinThickness <= 31: negative (30)
## :       :   SkinThickness > 31:
## :       :   :...Pregnancies > 3: positive (2)
## :       :       Pregnancies <= 3:
## :       :       :...Glucose <= 104: positive (4/1)
## :       :           Glucose > 104: negative (12/1)
## :       Age > 28:
## :       :...Pregnancies <= 1: positive (8/1)
## :           Pregnancies > 1:
## :           :...DiabetesPedigreeFunction <= 0.559: negative (17/1)
## :               DiabetesPedigreeFunction > 0.559:
## :               :...BMI <= 25.9: negative (2)
## :                   BMI > 25.9:
## :                   :...SkinThickness <= 31: positive (6)
## :                       SkinThickness > 31:
## :                       :...BloodPressure <= 74: negative (2)
## :                           BloodPressure > 74: positive (2)
## Glucose > 127:
## :...Glucose > 165:
##     :...DiabetesPedigreeFunction <= 1.154: positive (27/1)
##     :   DiabetesPedigreeFunction > 1.154: negative (3/1)
##     Glucose <= 165:
##     :...SkinThickness <= 22: negative (13)
##         SkinThickness > 22:
##         :...Insulin > 402: negative (6/1)
##             Insulin <= 402:
##             :...SkinThickness > 31: positive (35/8)
##                 SkinThickness <= 31:
##                 :...SkinThickness <= 24: positive (2)
##                     SkinThickness > 24:
##                     :...DiabetesPedigreeFunction > 0.997: positive (2)
##                         DiabetesPedigreeFunction <= 0.997:
##                         :...DiabetesPedigreeFunction > 0.571: negative (6)
##                             DiabetesPedigreeFunction <= 0.571:
##                             :...Pregnancies <= 3: positive (4)
##                                 Pregnancies > 3: [S1]
## 
## SubTree [S1]
## 
## DiabetesPedigreeFunction <= 0.415: negative (9/1)
## DiabetesPedigreeFunction > 0.415: positive (2)
## 
## 
## Evaluation on training data (275 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      24   17( 6.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     173    11    (a): class negative
##       6    85    (b): class positive
## 
## 
##  Attribute usage:
## 
##  100.00% Glucose
##   84.36% Insulin
##   49.82% SkinThickness
##   37.09% DiabetesPedigreeFunction
##   30.91% BloodPressure
##   30.91% Age
##   25.45% Pregnancies
##    4.36% BMI
## 
## 
## Time: 0.0 secs
plot(model) # the full tree is too big to be plotted legibly on the screen
plot(model, subtree = 2) # to see it better, we plot parts of it with the `subtree` parameter
plot(model, subtree = 18)
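The printed tree can be read as nested if/else rules. As an illustration, the sketch below transcribes by hand only the `Glucose <= 127`, `Insulin <= 95` branch of the tree shown in the summary; every other path returns `NA`. The function `classify_partial` is a hypothetical helper for reading the tree, not something produced by the C50 package.

```r
classify_partial <- function(case) {
  if (case$Glucose <= 127 && case$Insulin <= 95) {
    if (case$BloodPressure <= 74) return("negative")
    if (case$DiabetesPedigreeFunction <= 0.817) return("negative")
    return("positive")
  }
  NA_character_   # branches not transcribed in this sketch
}

# Row 4 of the training data shown earlier
case4 <- list(Glucose = 89, Insulin = 94, BloodPressure = 66,
              DiabetesPedigreeFunction = 0.167)
classify_partial(case4)   # "negative"
```

For real predictions, use `predict(model, newdata)`, which applies the full tree.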

Step 4. Evaluating model performance

predictions = predict(model, test) # (model built on the training set, data set to be predicted)
head(predictions)
## [1] negative positive positive negative positive negative
## Levels: negative positive
library(gmodels)
CrossTable(test$Outcome, predictions, prop.chisq = FALSE, prop.c= FALSE, prop.r = FALSE, dnn= c("actual", "predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  117 
## 
##  
##              | predicted 
##       actual |  negative |  positive | Row Total | 
## -------------|-----------|-----------|-----------|
##     negative |        63 |        15 |        78 | 
##              |     0.538 |     0.128 |           | 
## -------------|-----------|-----------|-----------|
##     positive |        17 |        22 |        39 | 
##              |     0.145 |     0.188 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        80 |        37 |       117 | 
## -------------|-----------|-----------|-----------|
## 
## 
library(caret)
confu1 =confusionMatrix(predictions, test$Outcome , positive = 'positive')
confu1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative       63       17
##   positive       15       22
##                                           
##                Accuracy : 0.7265          
##                  95% CI : (0.6364, 0.8048)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.09981         
##                                           
##                   Kappa : 0.3766          
##  Mcnemar's Test P-Value : 0.85968         
##                                           
##             Sensitivity : 0.5641          
##             Specificity : 0.8077          
##          Pos Pred Value : 0.5946          
##          Neg Pred Value : 0.7875          
##              Prevalence : 0.3333          
##          Detection Rate : 0.1880          
##    Detection Prevalence : 0.3162          
##       Balanced Accuracy : 0.6859          
##                                           
##        'Positive' Class : positive        
## 
The accuracy of the model is 72.65 %, with an error rate of 27.35 %.
The kappa statistic of the model is 0.37662.
The sensitivity of the model is 0.5641, and the specificity of the model is 0.80769.
The precision of the model is 0.59459, and the recall of the model is 0.5641.
The value of the F-measure of the model is 0.5789.
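These figures can be reproduced by hand from the four cells of the test-set confusion matrix. The sketch below uses the standard textbook formulas in base R; the variable names are chosen for this example.

```r
# Counts from the test-set confusion matrix, with 'positive' as the class of interest
TP <- 22; FN <- 17; FP <- 15; TN <- 63
n  <- TP + FN + FP + TN

accuracy    <- (TP + TN) / n
sensitivity <- TP / (TP + FN)          # also called recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)          # positive predictive value
f_measure   <- 2 * precision * sensitivity / (precision + sensitivity)

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed <- accuracy
p_expected <- ((TP + FN) * (TP + FP) + (TN + FP) * (TN + FN)) / n^2
kappa_stat <- (p_observed - p_expected) / (1 - p_expected)

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        precision = precision, f_measure = f_measure, kappa = kappa_stat), 4)
```

Note that the tree misclassified only 17 of the 275 training cases (6.2 %) but 32 of the 117 test cases (27.35 %), which suggests that the unpruned tree overfits the training data.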
