Predicting survival on the Titanic (Decision Tree algorithm)

Decision tree learners are powerful classifiers that use a tree structure to model the relationship between the features and the potential outcomes. The tree has a root node and decision nodes where choices are made. These choices split the data across branches that indicate the potential outcomes of a decision. The tree ends in leaf nodes (or terminal nodes) that denote the action to be taken as the result of the series of decisions. In a predictive model, the leaf nodes give the expected result for the series of events in the tree.
After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This gives considerable insight into how and why the model works, or doesn't work well, for a particular task. It also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in which the results need to be shared with others to inform business practices.
Decision tree models can handle numeric or nominal features, as well as missing data, but they are often biased toward splits on features that have a large number of levels.
Some potential uses of this algorithm include:
  • Credit scoring model
  • Marketing studies of customer behavior
  • Diagnosis of medical conditions based on laboratory measurements
There are numerous implementations of decision trees; here we will use the C5.0 algorithm.

1. Getting the data and initial analysis

We will work with the data from the data set Titanic (https://www.kaggle.com/c/titanic).
To predict which passengers survived, we will apply a decision tree algorithm, which also lets us inspect the classification mechanism.
The data is already divided into train and test data sets, along with a third file (gender_submission), Kaggle's sample submission, whose Survived column is used here as the reference result for the test data set.
  • Train data set:
train = read.csv("C:/Users/ester/Desktop/Titanic/train.csv", sep = "," , dec = ".", header = TRUE)
head(train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q
summary(train)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
#recode some variables:
train$Survived =as.factor(train$Survived)
train$Pclass =as.factor(train$Pclass)
train$SibSp =as.factor(train$SibSp)
train$Parch =as.factor(train$Parch)
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
We can see that some variables, such as PassengerId, Name, Ticket and Cabin, should not be taken into account when running the algorithm, since this information is not useful for prediction.
#We will work only with the following variables:
train1 =train[-c(1,4,9,11)]
str(train1)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
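Dropping columns by position works, but it depends on the column order staying the same. As a minimal sketch, the same selection can be written by name. The `toy` data frame and its values are made up for illustration and are not the Titanic data:

```r
# Toy data frame with the same column names as the Titanic training file
# (the values are invented for illustration).
toy <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("A", "B", "C"),
  Sex         = c("male", "female", "female"),
  Age         = c(22, 38, 26),
  SibSp       = c(1, 1, 0),
  Parch       = c(0, 0, 0),
  Ticket      = c("t1", "t2", "t3"),
  Fare        = c(7.25, 71.28, 7.92),
  Cabin       = c("", "C85", ""),
  Embarked    = c("S", "C", "S")
)

# Columns treated as identifiers rather than predictors
drop_cols <- c("PassengerId", "Name", "Ticket", "Cabin")

# Keep every column whose name is not in drop_cols
toy1 <- toy[, !(names(toy) %in% drop_cols)]
names(toy1)
```

With this form the selection still picks the intended columns if the file gains or reorders columns.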
#4 levels in variable 'embarked':
levels(train1$Embarked)[1] = "Missing"
str(train1)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "Missing","C",..: 4 2 4 4 4 3 4 4 4 2 ...
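The renaming above relies on the empty string being the first level. A small sketch of the same idea that looks the level up by its label instead, using an invented factor `emb` rather than the real column:

```r
# Invented factor with one empty label, standing in for Embarked
emb <- factor(c("S", "C", "", "Q", "S"))

# Rename the empty level wherever it sits in the level vector
levels(emb)[levels(emb) == ""] <- "Missing"
levels(emb)
```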
  • Test data set:
test = read.csv("C:/Users/ester/Desktop/Titanic/test.csv", sep = "," , dec = ".", header = TRUE) #test data set without the "Survived" column
str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
submision = read.csv("C:/Users/ester/Desktop/Titanic/gender_submission.csv", sep = "," , dec = ".", header = TRUE) #has the values of the "Survived" column for the test passengers
str(submision)
## 'data.frame':    418 obs. of  2 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Survived   : int  0 1 0 0 1 0 1 0 1 0 ...
test1 =merge(test,submision, by = "PassengerId") #merge both dataframes (test + submision)

#recode some variables:
test1$Survived =as.factor(test1$Survived)
test1$Pclass =as.factor(test1$Pclass)
test1$SibSp =as.factor(test1$SibSp)
test1$Parch =as.factor(test1$Parch)
str(test1)
## 'data.frame':    418 obs. of  12 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
##  $ Parch      : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
#only columns we are interested to work with:
test2 =test1[-c(1,3,8,10)]
str(test2)
## 'data.frame':    418 obs. of  8 variables:
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age     : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
##  $ Parch   : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
##  $ Fare    : num  7.83 7 9.69 8.66 12.29 ...
##  $ Embarked: Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
The training set has 891 samples, and the testing set has 418 samples.
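The str() outputs above show that Parch has 7 levels in the training set and 8 in the test set, and that Embarked has 4 levels in training and 3 in test. One way to keep the factor encodings consistent is to build the levels from both data sets before converting. This is only a sketch with invented vectors (`train_parch`, `test_parch`); whether a given model function needs it is not something we checked here:

```r
# Toy stand-ins for the Parch columns (values invented for illustration)
train_parch <- c(0, 1, 2, 0, 6)
test_parch  <- c(0, 1, 9, 2)

# Build one level set from both data sets, then apply it to each
all_levels <- sort(unique(c(train_parch, test_parch)))
train_f <- factor(train_parch, levels = all_levels)
test_f  <- factor(test_parch,  levels = all_levels)

# Both factors now share the same levels
identical(levels(train_f), levels(test_f))
```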

2. Training a model on the data

#install.packages('C50')
library(C50)
## Warning: package 'C50' was built under R version 3.4.1
str(train1)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "Missing","C",..: 4 2 4 4 4 3 4 4 4 2 ...
model =C5.0(train1[c(2:8)], train1$Survived) #(predictors from the training set, outcome)
model
## 
## Call:
## C5.0.default(x = train1[c(2:8)], y = train1$Survived)
## 
## Classification Tree
## Number of samples: 891 
## Number of predictors: 7 
## 
## Tree size: 21 
## 
## Non-standard options: attempt to group attributes
summary(model)
## 
## Call:
## C5.0.default(x = train1[c(2:8)], y = train1$Survived)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Aug 29 09:13:31 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 891 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## :   Pclass = 3:
## :   :...Fare > 23.25: 0 (27/3)
## :       Fare <= 23.25:
## :       :...Embarked in {Missing,Q}: 1 (31/8)
## :           Embarked = S:
## :           :...Parch in {0,3,4,5,6}: 0 (46/20)
## :           :   Parch in {1,2}: 1 (17/6)
## :           Embarked = C:
## :           :...Fare > 15.2458: 1 (7)
## :               Fare <= 15.2458:
## :               :...Fare <= 13.8625: 1 (6)
## :                   Fare > 13.8625: 0 (10/2)
## Sex = male:
## :...Pclass in {2,3}:
##     :...Age > 9: 0 (416.2/46.1)
##     :   Age <= 9:
##     :   :...SibSp in {1,2}: 1 (12/0.8)
##     :       SibSp in {3,4,5,8}: 0 (14.4/1)
##     :       SibSp = 0:
##     :       :...Parch in {0,3,4,5,6}: 0 (7.3/0.7)
##     :           Parch in {1,2}: 1 (5)
##     Pclass = 1:
##     :...Parch in {1,3,4,5,6}: 0 (15/4)
##         Parch = 2: 1 (8/3)
##         Parch = 0:
##         :...Fare <= 26: 0 (10)
##             Fare > 26:
##             :...Fare > 36.75: 0 (42/13)
##                 Fare <= 36.75:
##                 :...Age > 53: 0 (17/3.8)
##                     Age <= 53:
##                     :...Fare <= 27: 1 (12.3/1.6)
##                         Fare > 27:
##                         :...Fare <= 29: 0 (4.3)
##                             Fare > 29: 1 (13.5/4.9)
## 
## 
## Evaluation on training data (891 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      21  127(14.3%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     518    31    (a): class 0
##      96   246    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% Sex
##  100.00% Pclass
##   43.55% Age
##   30.98% Parch
##   27.27% Fare
##   14.93% SibSp
##   13.13% Embarked
## 
## 
## Time: 0.0 secs
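Each leaf in the tree printed by summary(model) carries a pair such as (170/9): the number of training cases that reach the leaf and how many of them the leaf misclassifies. Fractional counts such as 416.2/46.1 appear because cases with a missing value (here Age) are split across branches. As a quick check, we can copy the leaf annotations by hand and add them up; the vector names are ours:

```r
# Leaf annotations copied from the tree above: cases reaching each leaf
# and the number of those cases that are misclassified (0 when none is shown).
n_cases <- c(170, 27, 31, 46, 17, 7, 6, 10,
             416.2, 12, 14.4, 7.3, 5, 15, 8, 10, 42, 17, 12.3, 4.3, 13.5)
n_errors <- c(9, 3, 8, 20, 6, 0, 0, 2,
              46.1, 0.8, 1, 0.7, 0, 4, 3, 0, 13, 3.8, 1.6, 0, 4.9)

tree_size    <- length(n_cases)        # number of leaves
total_cases  <- sum(n_cases)           # training cases covered by the leaves
total_errors <- round(sum(n_errors))   # training errors, rounded to whole cases
error_rate   <- round(100 * total_errors / total_cases, 1)

c(tree_size, total_cases, total_errors, error_rate)
```

The totals agree with the reported tree size of 21 and the 127 (14.3%) training errors.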

3. Evaluating model performance

predictions =predict(model, test2) #(model with the training data set, data set to be predicted)
predictions
##   [1] 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0
##  [36] 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1
##  [71] 1 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
## [106] 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0
## [176] 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0
## [211] 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0
## [246] 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0
## [281] 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1
## [316] 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1
## [351] 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
## [386] 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0
## Levels: 0 1
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.1
CrossTable(test2$Survived, predictions, prop.chisq = FALSE, prop.c= FALSE, prop.r = FALSE, dnn= c("actual", "predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  418 
## 
##  
##              | predicted 
##       actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       247 |        19 |       266 | 
##              |     0.591 |     0.045 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        31 |       121 |       152 | 
##              |     0.074 |     0.289 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       278 |       140 |       418 | 
## -------------|-----------|-----------|-----------|
## 
## 
library(caret)
confu1 =confusionMatrix(predictions, test2$Survived , positive = '1')
confu1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 247  31
##          1  19 121
##                                           
##                Accuracy : 0.8804          
##                  95% CI : (0.8454, 0.9099)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : <2e-16
##                                           
##                   Kappa : 0.7371
##  Mcnemar's Test P-Value : 0.1198
##                                           
##             Sensitivity : 0.7961
##             Specificity : 0.9286
##          Pos Pred Value : 0.8643
##          Neg Pred Value : 0.8885
##              Prevalence : 0.3636
##          Detection Rate : 0.2895
##    Detection Prevalence : 0.3349
##       Balanced Accuracy : 0.8623
##                                           
##        'Positive' Class : 1
## 
The accuracy of the model is 88.04 %, with an error rate of 11.96 %.
The kappa statistic of the model is 0.73709.
The sensitivity of the model is 0.79605, and the specificity of the model is 0.92857.
The precision of the model is 0.86429, and the recall of the model is 0.79605.
The value of the F-measure of the model is 0.8288.
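These figures can be recomputed by hand from the four cells of the confusion matrix. The sketch below uses only the counts printed above, with 1 as the positive class; the variable names are ours:

```r
# Counts from the confusion matrix above (positive class = 1)
tp <- 121  # actual 1, predicted 1
fn <- 31   # actual 1, predicted 0
fp <- 19   # actual 0, predicted 1
tn <- 247  # actual 0, predicted 0
n  <- tp + fn + fp + tn

accuracy  <- (tp + tn) / n
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)   # same quantity as sensitivity
f_measure <- 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_expected <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
kappa_stat <- (accuracy - p_expected) / (1 - p_expected)

round(c(accuracy = accuracy, precision = precision, recall = recall,
        f_measure = f_measure, kappa = kappa_stat), 4)
```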

Improving model performance:
If we take a closer look at the variables we have used, we see that Fare and Pclass should be related, so using only one of them should be enough. Also, regardless of what they paid, people in first class got preferential treatment, so we will remove the Fare variable. We also think that the variable Embarked is irrelevant when it comes to surviving, so we will remove it too and see whether we get a better model.
#correlation between Pclass and Fare:
cor(train$Fare, as.numeric(train$Pclass))
## [1] -0.5494996
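A note on as.numeric(train$Pclass): applied to a factor, as.numeric() returns the internal level codes, not the numbers written in the labels. For Pclass the two coincide because the labels are "1", "2", "3", but that is not true in general (SibSp, for example, has a level "8"). A small sketch with invented factors:

```r
# A factor whose labels are numbers but not 1, 2, 3, ...
f <- factor(c("0", "8", "3", "8"))

codes  <- as.numeric(f)                # internal level codes
values <- as.numeric(as.character(f))  # the numbers the labels spell out

# For labels "1", "2", "3" both conversions agree
pclass <- factor(c("3", "1", "2", "3"))
same_for_pclass <- all(as.numeric(pclass) == as.numeric(as.character(pclass)))

list(codes = codes, values = values, same_for_pclass = same_for_pclass)
```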
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
#Now, we will work only with the following variables:
train3 =train[-c(1,4,9,10,11,12)]
str(train3)
## 'data.frame':    891 obs. of  6 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
  • Test data set:
#only columns we are interested to work with:
str(test1)
## 'data.frame':    418 obs. of  12 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
##  $ Parch      : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
test3 =test1[-c(1,3,8,9,10,11)]
str(test3)
## 'data.frame':    418 obs. of  6 variables:
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age     : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
##  $ Parch   : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...

2. Training a model on the data

#install.packages('C50')
library(C50)
str(train3)
## 'data.frame':    891 obs. of  6 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
model2 =C5.0(train1[c(2:6)], train1$Survived) #(predictors from the training set, outcome); train1[c(2:6)] holds the same columns as train3[c(2:6)]
model2
## 
## Call:
## C5.0.default(x = train1[c(2:6)], y = train1$Survived)
## 
## Classification Tree
## Number of samples: 891 
## Number of predictors: 5 
## 
## Tree size: 6 
## 
## Non-standard options: attempt to group attributes
summary(model2)
## 
## Call:
## C5.0.default(x = train1[c(2:6)], y = train1$Survived)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Aug 29 09:13:33 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 891 cases (6 attributes) from undefined.data
## 
## Decision tree:
## 
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## :   Pclass = 3:
## :   :...SibSp in {0,2}: 1 (88/36)
## :       SibSp in {1,3,4,5,8}: 0 (56/20)
## Sex = male:
## :...Age > 9: 0 (536.2/88.9)
##     Age <= 9:
##     :...SibSp in {0,1,2}: 1 (26.4/7.3)
##         SibSp in {3,4,5,8}: 0 (14.4/1)
## 
## 
## Evaluation on training data (891 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       6  156(17.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     504    45    (a): class 0
##     111   231    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% Sex
##   50.84% Age
##   35.24% Pclass
##   33.67% SibSp
## 
## 
## Time: 0.0 secs

3. Evaluating model performance

predictions2 =predict(model2, test3) #(model with the training data set, data set to be predicted)
predictions2
##   [1] 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0
##  [36] 0 1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1
##  [71] 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0
## [106] 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0
## [141] 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0
## [176] 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 1 0
## [211] 0 0 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0
## [246] 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0
## [281] 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1
## [316] 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1
## [351] 1 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0
## [386] 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0
## Levels: 0 1
library(gmodels)
CrossTable(test3$Survived, predictions2, prop.chisq = FALSE, prop.c= FALSE, prop.r = FALSE, dnn= c("actual", "predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  418 
## 
##  
##              | predicted 
##       actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       257 |         9 |       266 | 
##              |     0.615 |     0.022 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        24 |       128 |       152 | 
##              |     0.057 |     0.306 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       281 |       137 |       418 | 
## -------------|-----------|-----------|-----------|
## 
## 
library(caret)
confu2 =confusionMatrix(predictions2, test3$Survived , positive = '1')
confu2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 257  24
##          1   9 128
##                                          
##                Accuracy : 0.9211         
##                  95% CI : (0.8909, 0.945)
##     No Information Rate : 0.6364         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.8257         
##  Mcnemar's Test P-Value : 0.01481        
##                                          
##             Sensitivity : 0.8421         
##             Specificity : 0.9662         
##          Pos Pred Value : 0.9343         
##          Neg Pred Value : 0.9146         
##              Prevalence : 0.3636         
##          Detection Rate : 0.3062         
##    Detection Prevalence : 0.3278         
##       Balanced Accuracy : 0.9041         
##                                          
##        'Positive' Class : 1              
## 
The accuracy of the model is 92.11 %, with an error rate of 7.89 %.
The kappa statistic of the model is 0.82573.
The sensitivity of the model is 0.84211, and the specificity of the model is 0.96617.
The precision of the model is 0.93431, and the recall of the model is 0.84211.
The value of the F-measure of the model is 0.8858.

We see that the second model gives a better result. Looking at the decision tree gives us some more information:
summary(model2)
## 
## Call:
## C5.0.default(x = train1[c(2:6)], y = train1$Survived)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Aug 29 09:13:33 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 891 cases (6 attributes) from undefined.data
## 
## Decision tree:
## 
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## :   Pclass = 3:
## :   :...SibSp in {0,2}: 1 (88/36)
## :       SibSp in {1,3,4,5,8}: 0 (56/20)
## Sex = male:
## :...Age > 9: 0 (536.2/88.9)
##     Age <= 9:
##     :...SibSp in {0,1,2}: 1 (26.4/7.3)
##         SibSp in {3,4,5,8}: 0 (14.4/1)
## 
## 
## Evaluation on training data (891 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       6  156(17.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     504    45    (a): class 0
##     111   231    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% Sex
##   50.84% Age
##   35.24% Pclass
##   33.67% SibSp
## 
## 
## Time: 0.0 secs
We see that the most important factor is the Sex variable. For females, the second most important factor is Pclass, while for males it is Age. This suggests that women and children were saved first, and that women in first and second class got preferential treatment.
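Because the second tree is so small, its rules can be written out as plain conditionals. The function below is a hand-written sketch, not something C5.0 produces; the name `predict_by_rules` is hypothetical, and it assumes Age and SibSp are known (the real model also handles missing ages):

```r
# Hypothetical helper: the rules of the second tree as plain conditionals.
# Assumes sex is "female" or "male", pclass is 1, 2 or 3, and that
# age and sibsp are known numbers.
predict_by_rules <- function(sex, pclass, age, sibsp) {
  if (sex == "female") {
    if (pclass %in% c(1, 2)) return(1)   # Pclass in {1,2}: survived
    if (sibsp %in% c(0, 2)) return(1)    # Pclass = 3, SibSp in {0,2}
    return(0)                            # Pclass = 3, other SibSp values
  }
  # male passengers
  if (age > 9) return(0)                 # Age > 9: did not survive
  if (sibsp %in% c(0, 1, 2)) return(1)   # Age <= 9, SibSp in {0,1,2}
  0                                      # Age <= 9, larger SibSp values
}

first_class_woman <- predict_by_rules("female", 1, 38, 1)
adult_man_third   <- predict_by_rules("male", 3, 22, 1)
boy_three_sibs    <- predict_by_rules("male", 3, 2, 3)
c(first_class_woman, adult_man_third, boy_three_sibs)
```

Reading the tree this way makes the "women and children" pattern easy to see: every branch that predicts survival is either female or aged 9 or under.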
