Decision tree learners are powerful classifiers that use a tree structure to model the relationships between the features and the potential outcomes. The tree has a root node and decision nodes where choices are made; each choice splits the data across branches that represent the possible outcomes of a decision. The tree is terminated by leaf nodes (also called terminal nodes) that denote the action to be taken as the result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree.
After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works, or doesn't work, well for a particular task. It also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in which the results need to be shared with others to inform business practices.
Decision tree models can handle numeric or nominal features as well as missing data, although they tend to be biased toward splits on features having a large number of levels.
Some potential uses of this algorithm include:
- Credit scoring model
- Marketing studies of customer behavior
- Diagnosis of medical conditions based on laboratory measurements
There are numerous implementations of decision trees; here we will use the C5.0 algorithm.
1. Get an initial analysis of the data
We will work with the Titanic data set (https://www.kaggle.com/c/titanic).
To find out which passengers survived, we will apply a decision tree algorithm, which also lets us inspect the classification mechanism.
The data is already divided into train and test data sets, along with a third file (gender_submission) that contains the "Survived" outcome for the test data set.
- Train data set:
train = read.csv("C:/Users/ester/Desktop/Titanic/train.csv", sep = "," , dec = ".", header = TRUE)
head(train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
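Note: the factor columns shown below by str(train) rely on read.csv() converting character strings to factors, which was the default behaviour before R 4.0. On newer R versions the same result must be requested explicitly; a minimal sketch, assuming the same file path:
train = read.csv("C:/Users/ester/Desktop/Titanic/train.csv", sep = "," , dec = ".", header = TRUE, stringsAsFactors = TRUE) #stringsAsFactors = TRUE reproduces the pre-R-4.0 behaviour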
summary(train)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
#recode some variables:
train$Survived =as.factor(train$Survived)
train$Pclass =as.factor(train$Pclass)
train$SibSp =as.factor(train$SibSp)
train$Parch =as.factor(train$Parch)
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
We can see that some variables, such as PassengerId, Name, Ticket or Cabin, do not make sense to take into account when running the algorithm, since this information is irrelevant for prediction. We will work only with the following variables:
train1 =train[-c(1,4,9,11)]
str(train1)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
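Dropping columns by numeric position works, but it silently selects the wrong variables if the column order ever changes. An equivalent, name-based selection (a sketch producing the same train1):
train1 = train[, !(names(train) %in% c("PassengerId", "Name", "Ticket", "Cabin"))] #same columns as train[-c(1,4,9,11)], robust to reordering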
#'Embarked' has 4 levels, one of them empty; rename the empty one as 'Missing':
levels(train1$Embarked)[1] = "Missing"
str(train1)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "Missing","C",..: 4 2 4 4 4 3 4 4 4 2 ...
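The summary above showed 177 missing values in Age. Before training, it is worth quantifying the missingness per column; a quick sketch:
colSums(is.na(train1)) #count of NA values per column; C5.0 handles missing values internally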
- Test data set:
test = read.csv("C:/Users/ester/Desktop/Titanic/test.csv", sep = "," , dec = ".", header = TRUE) #test data set without the "Survived" column
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
submision = read.csv("C:/Users/ester/Desktop/Titanic/gender_submission.csv", sep = "," , dec = ".", header = TRUE) #contains the "Survived" labels for the test passengers
str(submision)
## 'data.frame': 418 obs. of 2 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Survived : int 0 1 0 0 1 0 1 0 1 0 ...
test1 =merge(test,submision, by = "PassengerId") #merge both dataframes (test + submision)
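Because merge() can silently drop or duplicate rows when the key does not match one-to-one, a quick sanity check is worthwhile (a sketch):
stopifnot(nrow(test1) == nrow(test)) #every test passenger should get exactly one "Survived" label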
#recode some variables:
test1$Survived =as.factor(test1$Survived)
test1$Pclass =as.factor(test1$Pclass)
test1$SibSp =as.factor(test1$SibSp)
test1$Parch =as.factor(test1$Parch)
str(test1)
## 'data.frame': 418 obs. of 12 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
## $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
#keep only the columns we want to work with:
test2 =test1[-c(1,3,8,10)]
str(test2)
## 'data.frame': 418 obs. of 8 variables:
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
## $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Embarked: Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
The training set has 891 samples, and the testing set has 418 samples.
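One caveat before modelling: the factor levels should be consistent between the training and the test sets. For instance, train1$Embarked has the extra "Missing" level, and test2$Parch has 8 levels against 7 in train1$Parch. A sketch to surface such mismatches:
setdiff(levels(test2$Parch), levels(train1$Parch)) #Parch levels present in the test set but unseen in training
setdiff(levels(train1$Embarked), levels(test2$Embarked)) #levels present in training only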
2. Training a model on the data
#install.packages('C50')
library(C50)
## Warning: package 'C50' was built under R version 3.4.1
str(train1)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "Missing","C",..: 4 2 4 4 4 3 4 4 4 2 ...
model =C5.0(train1[c(2:8)], train1$Survived) #(variables used for training, target variable)
model
##
## Call:
## C5.0.default(x = train1[c(2:8)], y = train1$Survived)
##
## Classification Tree
## Number of samples: 891
## Number of predictors: 7
##
## Tree size: 21
##
## Non-standard options: attempt to group attributes
summary(model)
##
## Call:
## C5.0.default(x = train1[c(2:8)], y = train1$Survived)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Aug 29 09:13:31 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 891 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## : Pclass = 3:
## : :...Fare > 23.25: 0 (27/3)
## : Fare <= 23.25:
## : :...Embarked in {Missing,Q}: 1 (31/8)
## : Embarked = S:
## : :...Parch in {0,3,4,5,6}: 0 (46/20)
## : : Parch in {1,2}: 1 (17/6)
## : Embarked = C:
## : :...Fare > 15.2458: 1 (7)
## : Fare <= 15.2458:
## : :...Fare <= 13.8625: 1 (6)
## : Fare > 13.8625: 0 (10/2)
## Sex = male:
## :...Pclass in {2,3}:
## :...Age > 9: 0 (416.2/46.1)
## : Age <= 9:
## : :...SibSp in {1,2}: 1 (12/0.8)
## : SibSp in {3,4,5,8}: 0 (14.4/1)
## : SibSp = 0:
## : :...Parch in {0,3,4,5,6}: 0 (7.3/0.7)
## : Parch in {1,2}: 1 (5)
## Pclass = 1:
## :...Parch in {1,3,4,5,6}: 0 (15/4)
## Parch = 2: 1 (8/3)
## Parch = 0:
## :...Fare <= 26: 0 (10)
## Fare > 26:
## :...Fare > 36.75: 0 (42/13)
## Fare <= 36.75:
## :...Age > 53: 0 (17/3.8)
## Age <= 53:
## :...Fare <= 27: 1 (12.3/1.6)
## Fare > 27:
## :...Fare <= 29: 0 (4.3)
## Fare > 29: 1 (13.5/4.9)
##
##
## Evaluation on training data (891 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 21 127(14.3%) <<
##
##
## (a) (b) <-classified ----="" 0.0="" 0="" 100.00="" 13.13="" 14.93="" 1="" 246="" 27.27="" 30.98="" 31="" 43.55="" 518="" 96="" a="" age="" as="" attribute="" b="" class="" code="" embarked="" fare="" parch="" pclass="" secs="" sex="" sibsp="" time:="" usage:="">-classified>
3. Evaluating model performance
predictions =predict(model, test2) #(trained model, data set to be predicted)
predictions
## [1] 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0
## [36] 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1
## [71] 1 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
## [106] 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0
## [176] 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0
## [211] 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0
## [246] 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0
## [281] 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1
## [316] 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1
## [351] 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
## [386] 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0
## Levels: 0 1
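Besides hard class labels, predict() for C5.0 models can also return class membership probabilities, which is useful for inspecting borderline passengers; a sketch:
probs = predict(model, test2, type = "prob") #one probability column per class ("0" and "1")
head(probs)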
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.1
CrossTable(test2$Survived, predictions, prop.chisq = FALSE, prop.c= FALSE, prop.r = FALSE, dnn= c("actual", "predicted"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 418
##
##
## | predicted
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 247 | 19 | 266 |
## | 0.591 | 0.045 | |
## -------------|-----------|-----------|-----------|
## 1 | 31 | 121 | 152 |
## | 0.074 | 0.289 | |
## -------------|-----------|-----------|-----------|
## Column Total | 278 | 140 | 418 |
## -------------|-----------|-----------|-----------|
##
##
library(caret)
confu1 =confusionMatrix(predictions, test2$Survived , positive = '1')
confu1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 247 31
## 1 19 121
##
## Accuracy : 0.8804
## 95% CI : (0.8454, 0.9099)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : <2e-16 0.1198="" 0.2895="" 0.3349="" 0.3636="" 0.7371="" 0.7961="" 0.8623="" 0.8643="" 0.8885="" 0.9286="" 1="" :="" accuracy="" balanced="" class="" code="" detection="" kappa="" mcnemar="" neg="" ositive="" p-value="" pos="" pred="" prevalence="" rate="" s="" sensitivity="" specificity="" test="" value="">2e-16>
The accuracy of the model is 88.04 %, with an error rate of 11.96 %.
The kappa statistic of the model is 0.73709.
The sensitivity of the model is 0.79605, and the specificity is 0.92857.
The precision of the model is 0.86429, and the recall is 0.79605.
The value of the F-measure of the model is 0.8288.
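The F-measure is the harmonic mean of precision and recall, and can be recomputed directly from the caret object confu1 defined above (a sketch):
precision = confu1$byClass["Pos Pred Value"]
recall = confu1$byClass["Sensitivity"]
2 * precision * recall / (precision + recall) #approximately 0.8288, matching the value above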
Improving model performance:
If we take a deeper look at the variables we have used, we see that Fare and Pclass should be related, so using only one of them should be enough; besides, regardless of what they paid, people in first class got preferential treatment, so we will remove the Fare variable. We also think that the Embarked parameter is irrelevant when it comes to surviving, so we will remove it and see if we get a better model.
#correlation between Pclass and Fare:
cor(train$Fare, as.numeric(train$Pclass))
## [1] -0.5494996
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
#Now, we will work only with the following variables:
train3 =train[-c(1,4,9,10,11,12)]
str(train3)
## 'data.frame': 891 obs. of 6 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
- Test data set:
#keep only the columns we want to work with:
str(test1)
## 'data.frame': 418 obs. of 12 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
## $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
test3 =test1[-c(1,3,8,9,10,11)]
str(test3)
## 'data.frame': 418 obs. of 6 variables:
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
## $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ Survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
2. Training a model on the data
#install.packages('C50')
library(C50)
str(train3)
## 'data.frame': 891 obs. of 6 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
model2 =C5.0(train3[c(2:6)], train3$Survived) #(variables used for training, target variable)
model2
##
## Call:
## C5.0.default(x = train3[c(2:6)], y = train3$Survived)
##
## Classification Tree
## Number of samples: 891
## Number of predictors: 5
##
## Tree size: 6
##
## Non-standard options: attempt to group attributes
summary(model2)
##
## Call:
## C5.0.default(x = train3[c(2:6)], y = train3$Survived)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Aug 29 09:13:33 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 891 cases (6 attributes) from undefined.data
##
## Decision tree:
##
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## : Pclass = 3:
## : :...SibSp in {0,2}: 1 (88/36)
## : SibSp in {1,3,4,5,8}: 0 (56/20)
## Sex = male:
## :...Age > 9: 0 (536.2/88.9)
## Age <= 9:
## :...SibSp in {0,1,2}: 1 (26.4/7.3)
## SibSp in {3,4,5,8}: 0 (14.4/1)
##
##
## Evaluation on training data (891 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 156(17.5%) <<
##
##
## (a) (b) <-classified ----="" 0.0="" 0="" 100.00="" 111="" 1="" 231="" 33.67="" 35.24="" 45="" 50.84="" 504="" a="" age="" as="" attribute="" b="" class="" code="" pclass="" secs="" sex="" sibsp="" time:="" usage:="">-classified>
3. Evaluating model performance
predictions2 =predict(model2, test3) #(trained model, data set to be predicted)
predictions2
## [1] 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0
## [36] 0 1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1
## [71] 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0
## [106] 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0
## [141] 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0
## [176] 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 1 0
## [211] 0 0 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0
## [246] 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0
## [281] 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1
## [316] 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1
## [351] 1 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0
## [386] 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0
## Levels: 0 1
library(gmodels)
CrossTable(test3$Survived, predictions2, prop.chisq = FALSE, prop.c= FALSE, prop.r = FALSE, dnn= c("actual", "predicted"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 418
##
##
## | predicted
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 257 | 9 | 266 |
## | 0.615 | 0.022 | |
## -------------|-----------|-----------|-----------|
## 1 | 24 | 128 | 152 |
## | 0.057 | 0.306 | |
## -------------|-----------|-----------|-----------|
## Column Total | 281 | 137 | 418 |
## -------------|-----------|-----------|-----------|
##
##
library(caret)
confu2 =confusionMatrix(predictions2, test3$Survived , positive = '1')
confu2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 257 24
## 1 9 128
##
## Accuracy : 0.9211
## 95% CI : (0.8909, 0.945)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8257
## Mcnemar's Test P-Value : 0.01481
##
## Sensitivity : 0.8421
## Specificity : 0.9662
## Pos Pred Value : 0.9343
## Neg Pred Value : 0.9146
## Prevalence : 0.3636
## Detection Rate : 0.3062
## Detection Prevalence : 0.3278
## Balanced Accuracy : 0.9041
##
## 'Positive' Class : 1
##
The accuracy of the model is 92.11 %, with an error rate of 7.89 %.
The kappa statistic of the model is 0.82573.
The sensitivity of the model is 0.84211, and the specificity is 0.96617.
The precision of the model is 0.93431, and the recall is 0.84211.
The value of the F-measure of the model is 0.8858.
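To compare the two models at a glance, the overall statistics stored by caret can be put side by side (a sketch using confu1 and confu2 from above):
rbind(model1 = confu1$overall[c("Accuracy", "Kappa")], model2 = confu2$overall[c("Accuracy", "Kappa")]) #the second model wins on both measures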
We see that with the second model we get a better result. If we look at the decision tree, we can get some more information:
summary(model2)
##
## Call:
## C5.0.default(x = train3[c(2:6)], y = train3$Survived)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Aug 29 09:13:33 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 891 cases (6 attributes) from undefined.data
##
## Decision tree:
##
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## : Pclass = 3:
## : :...SibSp in {0,2}: 1 (88/36)
## : SibSp in {1,3,4,5,8}: 0 (56/20)
## Sex = male:
## :...Age > 9: 0 (536.2/88.9)
## Age <= 9:
## :...SibSp in {0,1,2}: 1 (26.4/7.3)
## SibSp in {3,4,5,8}: 0 (14.4/1)
##
##
## Evaluation on training data (891 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 156(17.5%) <<
##
##
## (a) (b) <-classified ----="" 0.0="" 0="" 100.00="" 111="" 1="" 231="" 33.67="" 35.24="" 45="" 50.84="" 504="" a="" age="" as="" attribute="" b="" class="" code="" pclass="" secs="" sex="" sibsp="" time:="" usage:="">-classified>
We see that the most important factor is the Sex variable: for females, the second most important factor is Pclass, while for males the second most important factor is Age. This suggests that they tried to save women and children first, and that women in first and second class got preferential treatment.
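Finally, the fitted tree can also be rendered graphically, which makes this structure easier to communicate; a sketch, assuming the installed C50 version supports plotting (it relies on the partykit package):
plot(model2) #draws the final tree; each leaf shows its class distribution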