Decision tree learners are powerful classifiers that use a tree structure to model the relationships between the features and the potential outcomes. The tree has a root node and decision nodes where choices are made; each choice splits the data across branches that represent the possible outcomes of a decision. The tree is terminated by leaf nodes (also called terminal nodes) that denote the action to be taken as the result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree.
After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works, or doesn't work, well for a particular task. It also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in which the results need to be shared with others to inform business practices.
Decision tree models can handle numeric or nominal features as well as missing data, although they tend to be biased toward splits on features having a large number of levels.
Some potential uses of this algorithm include:
- Credit scoring model
- Marketing studies of customer behavior
- Diagnosis of medical conditions based on laboratory measurements
There are numerous implementations of decision trees; here we will use the C5.0 algorithm.
1. Get an initial analysis of the data
We will work with the Titanic data set (https://www.kaggle.com/c/titanic).
To find out which passengers survived, we will apply a decision tree algorithm, which also lets us inspect the classification mechanism.
The data is already divided into train and test data sets, along with a third file (gender_submission) that contains the "Survived" outcome for the test data set.
- Train data set:
train = read.csv("C:/Users/ester/Desktop/Titanic/train.csv", sep = "," , dec = ".", header = TRUE)
head(train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
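Note: the factor columns shown below by str(train) rely on read.csv() converting character strings to factors, which was the default behaviour before R 4.0. On newer R versions the same result must be requested explicitly; a minimal sketch, assuming the same file path:
train = read.csv("C:/Users/ester/Desktop/Titanic/train.csv", sep = "," , dec = ".", header = TRUE, stringsAsFactors = TRUE) #stringsAsFactors = TRUE reproduces the pre-R-4.0 behaviour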
summary(train)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
#recode some variables:
train$Survived =as.factor(train$Survived)
train$Pclass =as.factor(train$Pclass)
train$SibSp =as.factor(train$SibSp)
train$Parch =as.factor(train$Parch)
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
We can see that some variables, such as PassengerId, Name, Ticket or Cabin, do not make sense to take into account when running the algorithm, since this information is irrelevant for prediction. We will work only with the following variables:
train1 =train[-c(1,4,9,11)]
str(train1)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
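Dropping columns by numeric position works, but it silently selects the wrong variables if the column order ever changes. An equivalent, name-based selection (a sketch producing the same train1):
train1 = train[, !(names(train) %in% c("PassengerId", "Name", "Ticket", "Cabin"))] #same columns as train[-c(1,4,9,11)], robust to reordering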
#'Embarked' has 4 levels, one of them empty; rename the empty one as 'Missing':
levels(train1$Embarked)[1] = "Missing"
str(train1)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "Missing","C",..: 4 2 4 4 4 3 4 4 4 2 ...
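The summary above showed 177 missing values in Age. Before training, it is worth quantifying the missingness per column; a quick sketch:
colSums(is.na(train1)) #count of NA values per column; C5.0 handles missing values internally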
- Test data set:
test = read.csv("C:/Users/ester/Desktop/Titanic/test.csv", sep = "," , dec = ".", header = TRUE) #test data set without the "Survived" column
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
submision = read.csv("C:/Users/ester/Desktop/Titanic/gender_submission.csv", sep = "," , dec = ".", header = TRUE) #contains the "Survived" labels for the test passengers
str(submision)
## 'data.frame': 418 obs. of 2 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Survived : int 0 1 0 0 1 0 1 0 1 0 ...
test1 =merge(test,submision, by = "PassengerId") #merge both dataframes (test + submision)
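Because merge() can silently drop or duplicate rows when the key does not match one-to-one, a quick sanity check is worthwhile (a sketch):
stopifnot(nrow(test1) == nrow(test)) #every test passenger should get exactly one "Survived" label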
#recode some variables:
test1$Survived =as.factor(test1$Survived)
test1$Pclass =as.factor(test1$Pclass)
test1$SibSp =as.factor(test1$SibSp)
test1$Parch =as.factor(test1$Parch)
str(test1)
## 'data.frame': 418 obs. of 12 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
## $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
#keep only the columns we want to work with:
test2 =test1[-c(1,3,8,10)]
str(test2)
## 'data.frame': 418 obs. of 8 variables:
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
## $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Embarked: Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
The training set has 891 samples, and the testing set has 418 samples.
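One caveat before modelling: the factor levels should be consistent between the training and the test sets. For instance, train1$Embarked has the extra "Missing" level, and test2$Parch has 8 levels against 7 in train1$Parch. A sketch to surface such mismatches:
setdiff(levels(test2$Parch), levels(train1$Parch)) #Parch levels present in the test set but unseen in training
setdiff(levels(train1$Embarked), levels(test2$Embarked)) #levels present in training only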
2. Training a model on the data
#install.packages('C50')
library(C50)
## Warning: package 'C50' was built under R version 3.4.1
str(train1)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "Missing","C",..: 4 2 4 4 4 3 4 4 4 2 ...
model =C5.0(train1[c(2:8)], train1$Survived) #(variables used for training, target variable)
model
##
## Call:
## C5.0.default(x = train1[c(2:8)], y = train1$Survived)
##
## Classification Tree
## Number of samples: 891
## Number of predictors: 7
##
## Tree size: 21
##
## Non-standard options: attempt to group attributes
summary(model)
##
## Call:
## C5.0.default(x = train1[c(2:8)], y = train1$Survived)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Aug 29 09:13:31 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 891 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## : Pclass = 3:
## : :...Fare > 23.25: 0 (27/3)
## : Fare <= 23.25:
## : :...Embarked in {Missing,Q}: 1 (31/8)
## : Embarked = S:
## : :...Parch in {0,3,4,5,6}: 0 (46/20)
## : : Parch in {1,2}: 1 (17/6)
## : Embarked = C:
## : :...Fare > 15.2458: 1 (7)
## : Fare <= 15.2458:
## : :...Fare <= 13.8625: 1 (6)
## : Fare > 13.8625: 0 (10/2)
## Sex = male:
## :...Pclass in {2,3}:
## :...Age > 9: 0 (416.2/46.1)
## : Age <= 9:
## : :...SibSp in {1,2}: 1 (12/0.8)
## : SibSp in {3,4,5,8}: 0 (14.4/1)
## : SibSp = 0:
## : :...Parch in {0,3,4,5,6}: 0 (7.3/0.7)
## : Parch in {1,2}: 1 (5)
## Pclass = 1:
## :...Parch in {1,3,4,5,6}: 0 (15/4)
## Parch = 2: 1 (8/3)
## Parch = 0:
## :...Fare <= 26: 0 (10)
## Fare > 26:
## :...Fare > 36.75: 0 (42/13)
## Fare <= 36.75:
## :...Age > 53: 0 (17/3.8)
## Age <= 53:
## :...Fare <= 27: 1 (12.3/1.6)
## Fare > 27:
## :...Fare <= 29: 0 (4.3)
## Fare > 29: 1 (13.5/4.9)
##
##
## Evaluation on training data (891 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 21 127(14.3%) <<
##
##
## (a) (b) <-classified ----="" 0.0="" 0="" 100.00="" 13.13="" 14.93="" 1="" 246="" 27.27="" 30.98="" 31="" 43.55="" 518="" 96="" a="" age="" as="" attribute="" b="" class="" code="" embarked="" fare="" parch="" pclass="" secs="" sex="" sibsp="" time:="" usage:="">-classified>
3. Evaluating model performance
predictions =predict(model, test2) #(trained model, data set to be predicted)
predictions
## [1] 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0
## [36] 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1
## [71] 1 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
## [106] 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0
## [176] 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0
## [211] 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0
## [246] 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0
## [281] 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1
## [316] 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1
## [351] 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
## [386] 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0
## Levels: 0 1
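Besides hard class labels, predict() for C5.0 models can also return class membership probabilities, which is useful for inspecting borderline passengers; a sketch:
probs = predict(model, test2, type = "prob") #one probability column per class ("0" and "1")
head(probs)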
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.1
CrossTable(test2$Survived, predictions, prop.chisq = FALSE, prop.c= FALSE, prop.r = FALSE, dnn= c("actual", "predicted"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 418
##
##
## | predicted
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 247 | 19 | 266 |
## | 0.591 | 0.045 | |
## -------------|-----------|-----------|-----------|
## 1 | 31 | 121 | 152 |
## | 0.074 | 0.289 | |
## -------------|-----------|-----------|-----------|
## Column Total | 278 | 140 | 418 |
## -------------|-----------|-----------|-----------|
##
##
library(caret)
confu1 =confusionMatrix(predictions, test2$Survived , positive = '1')
confu1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 247 31
## 1 19 121
##
## Accuracy : 0.8804
## 95% CI : (0.8454, 0.9099)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : <2e-16 0.1198="" 0.2895="" 0.3349="" 0.3636="" 0.7371="" 0.7961="" 0.8623="" 0.8643="" 0.8885="" 0.9286="" 1="" :="" accuracy="" balanced="" class="" code="" detection="" kappa="" mcnemar="" neg="" ositive="" p-value="" pos="" pred="" prevalence="" rate="" s="" sensitivity="" specificity="" test="" value="">2e-16>
The accuracy of the model is 88.04 %, with an error rate of 11.96 %.
The kappa statistic of the model is 0.73709.
The sensitivity of the model is 0.79605, and the specificity is 0.92857.
The precision of the model is 0.86429, and the recall is 0.79605.
The value of the F-measure of the model is 0.8288.
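The F-measure is the harmonic mean of precision and recall, and can be recomputed directly from the caret object confu1 defined above (a sketch):
precision = confu1$byClass["Pos Pred Value"]
recall = confu1$byClass["Sensitivity"]
2 * precision * recall / (precision + recall) #approximately 0.8288, matching the value above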
Improving model performance:
If we take a deeper look at the variables we have used, we see that Fare and Pclass should be related, so using only one of them should be enough; besides, regardless of what they paid, people in first class got preferential treatment, so we will remove the Fare variable. We also think that the Embarked parameter is irrelevant when it comes to surviving, so we will remove it and see if we get a better model.
#correlation between Pclass and Fare:
cor(train$Fare, as.numeric(train$Pclass))
## [1] -0.5494996
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
#Now, we will work only with the following variables:
train3 =train[-c(1,4,9,10,11,12)]
str(train3)
## 'data.frame': 891 obs. of 6 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
- Test data set:
#keep only the columns we want to work with:
str(test1)
## 'data.frame': 418 obs. of 12 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
## $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
test3 =test1[-c(1,3,8,9,10,11)]
str(test3)
## 'data.frame': 418 obs. of 6 variables:
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
## $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ Survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
2. Training a model on the data
#install.packages('C50')
library(C50)
str(train3)
## 'data.frame': 891 obs. of 6 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
model2 =C5.0(train3[c(2:6)], train3$Survived) #(variables used for training, target variable)
model2
##
## Call:
## C5.0.default(x = train3[c(2:6)], y = train3$Survived)
##
## Classification Tree
## Number of samples: 891
## Number of predictors: 5
##
## Tree size: 6
##
## Non-standard options: attempt to group attributes
summary(model2)
##
## Call:
## C5.0.default(x = train3[c(2:6)], y = train3$Survived)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Aug 29 09:13:33 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 891 cases (6 attributes) from undefined.data
##
## Decision tree:
##
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## : Pclass = 3:
## : :...SibSp in {0,2}: 1 (88/36)
## : SibSp in {1,3,4,5,8}: 0 (56/20)
## Sex = male:
## :...Age > 9: 0 (536.2/88.9)
## Age <= 9:
## :...SibSp in {0,1,2}: 1 (26.4/7.3)
## SibSp in {3,4,5,8}: 0 (14.4/1)
##
##
## Evaluation on training data (891 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 156(17.5%) <<
##
##
## (a) (b) <-classified ----="" 0.0="" 0="" 100.00="" 111="" 1="" 231="" 33.67="" 35.24="" 45="" 50.84="" 504="" a="" age="" as="" attribute="" b="" class="" code="" pclass="" secs="" sex="" sibsp="" time:="" usage:="">-classified>
3. Evaluating model performance
predictions2 =predict(model2, test3) #(trained model, data set to be predicted)
predictions2
## [1] 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0
## [36] 0 1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1
## [71] 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0
## [106] 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0
## [141] 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0
## [176] 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 1 0
## [211] 0 0 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0
## [246] 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0
## [281] 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1
## [316] 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1
## [351] 1 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0
## [386] 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0
## Levels: 0 1
library(gmodels)
CrossTable(test3$Survived, predictions2, prop.chisq = FALSE, prop.c= FALSE, prop.r = FALSE, dnn= c("actual", "predicted"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 418
##
##
## | predicted
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 257 | 9 | 266 |
## | 0.615 | 0.022 | |
## -------------|-----------|-----------|-----------|
## 1 | 24 | 128 | 152 |
## | 0.057 | 0.306 | |
## -------------|-----------|-----------|-----------|
## Column Total | 281 | 137 | 418 |
## -------------|-----------|-----------|-----------|
##
##
library(caret)
confu2 =confusionMatrix(predictions2, test3$Survived , positive = '1')
confu2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 257 24
## 1 9 128
##
## Accuracy : 0.9211
## 95% CI : (0.8909, 0.945)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8257
## Mcnemar's Test P-Value : 0.01481
##
## Sensitivity : 0.8421
## Specificity : 0.9662
## Pos Pred Value : 0.9343
## Neg Pred Value : 0.9146
## Prevalence : 0.3636
## Detection Rate : 0.3062
## Detection Prevalence : 0.3278
## Balanced Accuracy : 0.9041
##
## 'Positive' Class : 1
##
The accuracy of the model is 92.11 %, with an error rate of 7.89 %.
The kappa statistic of the model is 0.82573.
The sensitivity of the model is 0.84211, and the specificity is 0.96617.
The precision of the model is 0.93431, and the recall is 0.84211.
The value of the F-measure of the model is 0.8858.
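To compare the two models at a glance, the overall statistics stored by caret can be put side by side (a sketch using confu1 and confu2 from above):
rbind(model1 = confu1$overall[c("Accuracy", "Kappa")], model2 = confu2$overall[c("Accuracy", "Kappa")]) #the second model wins on both measures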
We see that with the second model we get a better result. If we look at the decision tree, we can get some more information:
summary(model2)
##
## Call:
## C5.0.default(x = train3[c(2:6)], y = train3$Survived)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Aug 29 09:13:33 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 891 cases (6 attributes) from undefined.data
##
## Decision tree:
##
## Sex = female:
## :...Pclass in {1,2}: 1 (170/9)
## : Pclass = 3:
## : :...SibSp in {0,2}: 1 (88/36)
## : SibSp in {1,3,4,5,8}: 0 (56/20)
## Sex = male:
## :...Age > 9: 0 (536.2/88.9)
## Age <= 9:
## :...SibSp in {0,1,2}: 1 (26.4/7.3)
## SibSp in {3,4,5,8}: 0 (14.4/1)
##
##
## Evaluation on training data (891 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 156(17.5%) <<
##
##
## (a) (b) <-classified ----="" 0.0="" 0="" 100.00="" 111="" 1="" 231="" 33.67="" 35.24="" 45="" 50.84="" 504="" a="" age="" as="" attribute="" b="" class="" code="" pclass="" secs="" sex="" sibsp="" time:="" usage:="">-classified>
We see that the most important factor is the Sex variable: for females, the second most important factor is Pclass, while for males the second most important factor is Age. This suggests that they tried to save women and children first, and that women in first and second class got preferential treatment.
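Finally, the fitted tree can also be rendered graphically, which makes this structure easier to communicate; a sketch, assuming the installed C50 version supports plotting (it relies on the partykit package):
plot(model2) #draws the final tree; each leaf shows its class distribution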