
k-NN algorithm using different distance functions (breast cancer data set)

General information about the k-NN algorithm can be found here: http://dataworldblog.blogspot.com.es/2017/08/k-nn-algorithm.html
To measure the similarity between two instances, k-NN uses a distance function. There are different ways to calculate distance, but traditionally the k-NN algorithm uses Euclidean distance, which is the “ordinary” or “straight-line” distance between two points.
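As a small illustration of what "distance between two instances" means, the sketch below computes the Euclidean and the Manhattan distance between two points, first by hand and then with base R's dist(). The vectors a and b are made-up values chosen for the example, not rows of the data set.

```r
# Two hypothetical instances described by three numeric features
a <- c(1, 2, 3)
b <- c(4, 6, 3)

# Euclidean ("straight-line") distance: square root of the summed squared differences
euclidean <- sqrt(sum((a - b)^2))

# Manhattan distance: sum of the absolute differences
manhattan <- sum(abs(a - b))

# dist() from base R computes the same quantities for the rows of a matrix
d_euc <- as.numeric(dist(rbind(a, b), method = "euclidean"))
d_man <- as.numeric(dist(rbind(a, b), method = "manhattan"))
```

Both approaches agree; dist() is simply more convenient when many rows have to be compared at once.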
It has been demonstrated that the chosen distance function can affect the classification accuracy of the k-NN classifier.
The distance calculation for k-NN is heavily dependent on the measurement scale of the input features. Since different inputs have different ranges of values, inputs with a larger range will have a larger impact on the distance than inputs with a smaller range.
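To make the scale effect concrete, the sketch below takes area_mean and smoothness_mean for the first two tumors of the breast cancer data set used in this post and measures how much of the Euclidean distance comes from each feature. The variable names tumor1, tumor2, d_both, d_area and share_area are only illustrative.

```r
# area_mean and smoothness_mean of the first two rows of the breast cancer data
tumor1 <- c(area_mean = 1001.0, smoothness_mean = 0.11840)
tumor2 <- c(area_mean = 1326.0, smoothness_mean = 0.08474)

# Euclidean distance using both features
d_both <- sqrt(sum((tumor1 - tumor2)^2))

# Distance using area_mean only
d_area <- abs(tumor1[["area_mean"]] - tumor2[["area_mean"]])

# Share of the squared distance that is contributed by area_mean
share_area <- d_area^2 / d_both^2
```

The two distances are practically identical: without rescaling, smoothness_mean has almost no influence on which neighbors are considered close.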
This could potentially cause problems for our classifier, so we have to apply normalization to rescale the features into a standard range of values. We have seen that with the iris dataset the choice of distance function makes no difference, since the variables in the iris dataset have similar ranges: http://dataworldblog.blogspot.com/2017/08/distances-for-k-nn-algorithm.html 
Here, we will use a data set whose variables have different ranges to see the effect of using different distance functions.
We are going to compare the results on the breast cancer data set. This data set records 31 variables for each breast tumor: the diagnosis and 30 features (radius_mean, texture_mean, etc.). We want to use the k-NN algorithm to predict whether the tumor is benign or malignant (diagnosis) from those 30 features.

1. WITHOUT RESCALING THE FEATURES
1.1. Initial data analysis
breastcancer = read.csv("C:/Users/ester/Downloads/breast-cancer-wisconsin-data-data.csv", sep = "," , dec = ".", header = TRUE)
dim(breastcancer)
## [1] 569  33
head(breastcancer)
##         id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1   842302         M       17.99        10.38         122.80    1001.0
## 2   842517         M       20.57        17.77         132.90    1326.0
## 3 84300903         M       19.69        21.25         130.00    1203.0
## 4 84348301         M       11.42        20.38          77.58     386.1
## 5 84358402         M       20.29        14.34         135.10    1297.0
## 6   843786         M       12.45        15.70          82.57     477.1
##   smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
## 4         0.14250          0.28390         0.2414             0.10520
## 5         0.10030          0.13280         0.1980             0.10430
## 6         0.12780          0.17000         0.1578             0.08089
##   symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1        0.2419                0.07871    1.0950     0.9053        8.589
## 2        0.1812                0.05667    0.5435     0.7339        3.398
## 3        0.2069                0.05999    0.7456     0.7869        4.585
## 4        0.2597                0.09744    0.4956     1.1560        3.445
## 5        0.1809                0.05883    0.7572     0.7813        5.438
## 6        0.2087                0.07613    0.3345     0.8902        2.217
##   area_se smoothness_se compactness_se concavity_se concave.points_se
## 1  153.40      0.006399        0.04904      0.05373           0.01587
## 2   74.08      0.005225        0.01308      0.01860           0.01340
## 3   94.03      0.006150        0.04006      0.03832           0.02058
## 4   27.23      0.009110        0.07458      0.05661           0.01867
## 5   94.44      0.011490        0.02461      0.05688           0.01885
## 6   27.19      0.007510        0.03345      0.03672           0.01137
##   symmetry_se fractal_dimension_se radius_worst texture_worst
## 1     0.03003             0.006193        25.38         17.33
## 2     0.01389             0.003532        24.99         23.41
## 3     0.02250             0.004571        23.57         25.53
## 4     0.05963             0.009208        14.91         26.50
## 5     0.01756             0.005115        22.54         16.67
## 6     0.02165             0.005082        15.47         23.75
##   perimeter_worst area_worst smoothness_worst compactness_worst
## 1          184.60     2019.0           0.1622            0.6656
## 2          158.80     1956.0           0.1238            0.1866
## 3          152.50     1709.0           0.1444            0.4245
## 4           98.87      567.7           0.2098            0.8663
## 5          152.20     1575.0           0.1374            0.2050
## 6          103.40      741.6           0.1791            0.5249
##   concavity_worst concave.points_worst symmetry_worst
## 1          0.7119               0.2654         0.4601
## 2          0.2416               0.1860         0.2750
## 3          0.4504               0.2430         0.3613
## 4          0.6869               0.2575         0.6638
## 5          0.4000               0.1625         0.2364
## 6          0.5355               0.1741         0.3985
##   fractal_dimension_worst  X
## 1                 0.11890 NA
## 2                 0.08902 NA
## 3                 0.08758 NA
## 4                 0.17300 NA
## 5                 0.07678 NA
## 6                 0.12440 NA
summary(breastcancer)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean   
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
##  Median :0.06154   Median :0.03350     Median :0.1792  
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##     area_se        smoothness_se      compactness_se      concavity_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst    X          
##  Min.   :0.1565   Min.   :0.05504         Mode:logical  
##  1st Qu.:0.2504   1st Qu.:0.07146         NA's:569      
##  Median :0.2822   Median :0.08004                       
##  Mean   :0.2901   Mean   :0.08395                       
##  3rd Qu.:0.3179   3rd Qu.:0.09208                       
##  Max.   :0.6638   Max.   :0.20750
breastcan = breastcancer[-c(1,33)] #we don't need the first and last columns
summary(breastcan)
##  diagnosis  radius_mean      texture_mean   perimeter_mean  
##  B:357     Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
##  M:212     1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
##            Median :13.370   Median :18.84   Median : 86.24  
##            Mean   :14.127   Mean   :19.29   Mean   : 91.97  
##            3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
##            Max.   :28.110   Max.   :39.28   Max.   :188.50  
##    area_mean      smoothness_mean   compactness_mean  concavity_mean   
##  Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.00000  
##  1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956  
##  Median : 551.1   Median :0.09587   Median :0.09263   Median :0.06154  
##  Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.08880  
##  3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070  
##  Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.42680  
##  concave.points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.03350     Median :0.1792   Median :0.06154       
##  Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.20120     Max.   :0.3040   Max.   :0.09744       
##    radius_se        texture_se      perimeter_se       area_se       
##  Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
##  1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
##  Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
##  Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
##  3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
##  Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
##  smoothness_se      compactness_se      concavity_se    
##  Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst
##  Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2822   Median :0.08004        
##  Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.6638   Max.   :0.20750
boxplot(breastcan, col = 'gold')
We see that the inputs have different ranges of values.

1.2. Randomly separate the data into two sets, the training set (67%) and the test set (33%).
We will divide our data into two different sets: a training dataset (67%) that will be used to build the model and a testing dataset (33%) that will be used to estimate the predictive accuracy of the model.
The function createDataPartition can be used to create random balanced splits of the data. Using this function the random sampling occurs within each class and preserves the overall class distribution of the data.
library(caret)
set.seed(123)
inTrain =createDataPartition(y = breastcan$diagnosis,p = 0.67,list = FALSE)
str(inTrain)
##  int [1:383, 1] 2 4 5 8 9 10 11 12 13 14 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "Resample1"
train =breastcan[inTrain,]
test =breastcan[-inTrain,]
We have to make sure that the samples have been split fairly between the training and testing datasets, that is, that the class proportions in both are similar to those of the complete dataset:
Proportions of the complete dataset:
## 
##     B     M 
## 0.627 0.373
Proportions of the training dataset:
## 
##     B     M 
## 0.627 0.373
Proportions of the testing dataset:
## 
##     B     M 
## 0.629 0.371
We see that the proportions are similar among the datasets.
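The code behind these proportions is not shown above. One way to obtain them is with table() and prop.table(); the sketch below wraps that in a helper called class_proportions (a hypothetical name) and, to stay runnable on its own, applies it to a factor rebuilt from the class counts reported by summary() (357 benign, 212 malignant).

```r
# Hypothetical helper: class proportions of a label vector, rounded to 3 decimals
class_proportions <- function(labels) {
  round(prop.table(table(labels)), 3)
}

# Stand-in for breastcan$diagnosis, rebuilt from the counts shown by summary()
diagnosis <- factor(rep(c("B", "M"), times = c(357, 212)))
p <- class_proportions(diagnosis)
p

# With the objects created in this post, the three checks would be:
# class_proportions(breastcan$diagnosis)
# class_proportions(train$diagnosis)
# class_proportions(test$diagnosis)
```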

1.3. Use a k-NN algorithm (k = 1:15) to predict if the cancer is benign or malignant.
#labels of train dataset
breast_train_labels =train[,1]
head(breast_train_labels)
## [1] M M M M M M
## Levels: B M
#labels of test dataset
breast_test_labels =test[,1]
head(breast_test_labels)
## [1] M M M M M M
## Levels: B M
#to run the algorithm:
#install.packages("class")
library(class)
breast_test_pred =knn(train = train[2:31], test = test[2:31], 
                        cl = breast_train_labels, k = 4)

#to compare the original results with the predicted results: 
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = breast_test_labels, y = breast_test_pred, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  186 
## 
##  
##                    | breast_test_pred 
## breast_test_labels |         B |         M | Row Total | 
## -------------------|-----------|-----------|-----------|
##                  B |       111 |         6 |       117 | 
##                    |     0.949 |     0.051 |     0.629 | 
##                    |     0.941 |     0.088 |           | 
##                    |     0.597 |     0.032 |           | 
## -------------------|-----------|-----------|-----------|
##                  M |         7 |        62 |        69 | 
##                    |     0.101 |     0.899 |     0.371 | 
##                    |     0.059 |     0.912 |           | 
##                    |     0.038 |     0.333 |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |       118 |        68 |       186 | 
##                    |     0.634 |     0.366 |           | 
## -------------------|-----------|-----------|-----------|
## 
## 
Table of results for different values of k, to see which value of k gives the fewest errors:
#to create a table with the results for k = 1 to 15:
result=matrix(0,15,3)
colnames(result)=c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:15)){
  breast_test_pred =knn(train = train[2:31], test = test[2:31], 
                        cl = breast_train_labels, k = i)
  tablepredict =CrossTable(x = breast_test_labels, y = breast_test_pred, 
                             prop.chisq = FALSE)
  wrong = tablepredict$t[1,2] + tablepredict$t[2,1]
  result[i,1]=i 
  result[i,2]=wrong
  #the error rate is relative to the test set (186 rows), not to the full data set (569 rows)
  result[i,3]=round(((wrong/nrow(test)) * 100),2)}
result
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                            12                           6.45
##  [2,]      2                            13                           6.99
##  [3,]      3                            13                           6.99
##  [4,]      4                            15                           8.06
##  [5,]      5                            13                           6.99
##  [6,]      6                            13                           6.99
##  [7,]      7                            12                           6.45
##  [8,]      8                            12                           6.45
##  [9,]      9                            12                           6.45
## [10,]     10                            11                           5.91
## [11,]     11                            11                           5.91
## [12,]     12                            11                           5.91
## [13,]     13                            13                           6.99
## [14,]     14                            13                           6.99
## [15,]     15                            13                           6.99
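The percentage of misclassified cases for a given k is the number of wrong predictions divided by the number of test rows. As a lighter alternative to building a CrossTable in every iteration, the sketch below computes that percentage directly. The helper name knn_error_by_k is hypothetical, and the demonstration runs on iris only because it ships with R and keeps the example self-contained.

```r
library(class)  # provides knn()

# Hypothetical helper: test-set error (in %) for each k in `ks`
knn_error_by_k <- function(train_x, test_x, train_y, test_y, ks = 1:15) {
  sapply(ks, function(k) {
    pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
    round(100 * mean(pred != test_y), 2)
  })
}

# Demonstration on iris (a stand-in, not the breast cancer data)
set.seed(123)
idx <- sample(nrow(iris), 100)
errors <- knn_error_by_k(iris[idx, 1:4], iris[-idx, 1:4],
                         iris$Species[idx], iris$Species[-idx])
best_k <- which.min(errors)  # smallest k among those with the lowest error
```

With the objects from this post the call would be knn_error_by_k(train[2:31], test[2:31], breast_train_labels, breast_test_labels). Because knn() breaks ties at random, the counts can vary slightly between runs unless a seed is set.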

1.4. Results
A confusion matrix will be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “B” (benign). Note that breast_test_pred holds the predictions from the last iteration of the loop above (k = 15).
#install.packages("caret")
library(caret)
confu =confusionMatrix(breast_test_pred, breast_test_labels, positive = "B")
confu
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 113   9
##          M   4  60
##                                           
##                Accuracy : 0.9301          
##                  95% CI : (0.8834, 0.9623)
##     No Information Rate : 0.629           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.848           
##  Mcnemar's Test P-Value : 0.2673          
##                                           
##             Sensitivity : 0.9658          
##             Specificity : 0.8696          
##          Pos Pred Value : 0.9262          
##          Neg Pred Value : 0.9375          
##              Prevalence : 0.6290          
##          Detection Rate : 0.6075          
##    Detection Prevalence : 0.6559          
##       Balanced Accuracy : 0.9177          
##                                           
##        'Positive' Class : B               
## 
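The main statistics reported by confusionMatrix can be recomputed by hand from the four counts in the table, which helps to see what each one measures. The sketch below does this for the matrix above, with “B” as the positive class; the variable names (tp, fn, fp, tn, metrics) are only illustrative.

```r
# Counts from the confusion matrix above (positive class = "B")
tp <- 113  # reference B, predicted B
fn <- 4    # reference B, predicted M
fp <- 9    # reference M, predicted B
tn <- 60   # reference M, predicted M
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n          # share of correct predictions
sensitivity <- tp / (tp + fn)         # benign tumors correctly identified
specificity <- tn / (tn + fp)         # malignant tumors correctly identified
ppv         <- tp / (tp + fp)         # predicted B that really are B
npv         <- tn / (tn + fn)         # predicted M that really are M
balanced    <- (sensitivity + specificity) / 2

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, ppv = ppv, npv = npv,
                   balanced = balanced), 4)
metrics
```

Since the positive class is “B”, the 9 malignant tumors predicted as benign count against specificity, which is arguably the more serious kind of error in this setting.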

2. MIN-MAX NORMALIZATION
2.1. Initial data analysis
This process transforms a feature such that all of its values fall in a range between 0 and 1.
normalize =function(x){
  return((x-min(x))/(max(x)-min(x)))}

breast_nor =as.data.frame(lapply(breastcan[,-1], normalize))
breast_nor$diagnosis = breastcan$diagnosis
head(breast_nor)
##   radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1   0.5210374    0.0226581      0.5459885 0.3637328       0.5937528
## 2   0.6431445    0.2725736      0.6157833 0.5015907       0.2898799
## 3   0.6014956    0.3902604      0.5957432 0.4494168       0.5143089
## 4   0.2100904    0.3608387      0.2335015 0.1029056       0.8113208
## 5   0.6298926    0.1565776      0.6309861 0.4892895       0.4303512
## 6   0.2588386    0.2025702      0.2679842 0.1415058       0.6786133
##   compactness_mean concavity_mean concave.points_mean symmetry_mean
## 1        0.7920373      0.7031396           0.7311133     0.6863636
## 2        0.1817680      0.2036082           0.3487575     0.3797980
## 3        0.4310165      0.4625117           0.6356859     0.5095960
## 4        0.8113613      0.5656045           0.5228628     0.7762626
## 5        0.3478928      0.4639175           0.5183897     0.3782828
## 6        0.4619962      0.3697282           0.4020378     0.5186869
##   fractal_dimension_mean  radius_se texture_se perimeter_se    area_se
## 1              0.6055181 0.35614702 0.12046941   0.36903360 0.27381126
## 2              0.1413227 0.15643672 0.08258929   0.12444047 0.12565979
## 3              0.2112468 0.22962158 0.09430251   0.18037035 0.16292179
## 4              1.0000000 0.13909107 0.17587518   0.12665504 0.03815479
## 5              0.1868155 0.23382220 0.09306489   0.22056260 0.16368757
## 6              0.5511794 0.08075321 0.11713225   0.06879329 0.03808008
##   smoothness_se compactness_se concavity_se concave.points_se symmetry_se
## 1     0.1592956     0.35139844   0.13568182         0.3006251  0.31164518
## 2     0.1193867     0.08132304   0.04696970         0.2538360  0.08453875
## 3     0.1508312     0.28395470   0.09676768         0.3898466  0.20569032
## 4     0.2514532     0.54321507   0.14295455         0.3536655  0.72814769
## 5     0.3323588     0.16791841   0.14363636         0.3570752  0.13617943
## 6     0.1970629     0.23431069   0.09272727         0.2153817  0.19372995
##   fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1            0.1830424    0.6207755     0.1415245       0.6683102
## 2            0.0911101    0.6069015     0.3035714       0.5398177
## 3            0.1270055    0.5563856     0.3600746       0.5084417
## 4            0.2872048    0.2483102     0.3859275       0.2413467
## 5            0.1457996    0.5197439     0.1239339       0.5069476
## 6            0.1446596    0.2682319     0.3126333       0.2639076
##   area_worst smoothness_worst compactness_worst concavity_worst
## 1 0.45069799        0.6011358         0.6192916       0.5686102
## 2 0.43521431        0.3475533         0.1545634       0.1929712
## 3 0.37450845        0.4835898         0.3853751       0.3597444
## 4 0.09400806        0.9154725         0.8140117       0.5486422
## 5 0.34157491        0.4373638         0.1724151       0.3194888
## 6 0.13674794        0.7127386         0.4827837       0.4277157
##   concave.points_worst symmetry_worst fractal_dimension_worst diagnosis
## 1            0.9120275      0.5984624               0.4188640         M
## 2            0.6391753      0.2335896               0.2228781         M
## 3            0.8350515      0.4037059               0.2134330         M
## 4            0.8848797      1.0000000               0.7737111         M
## 5            0.5584192      0.1575005               0.1425948         M
## 6            0.5982818      0.4770353               0.4549390         M
summary(breast_nor)
##   radius_mean      texture_mean    perimeter_mean     area_mean     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2233   1st Qu.:0.2185   1st Qu.:0.2168   1st Qu.:0.1174  
##  Median :0.3024   Median :0.3088   Median :0.2933   Median :0.1729  
##  Mean   :0.3382   Mean   :0.3240   Mean   :0.3329   Mean   :0.2169  
##  3rd Qu.:0.4164   3rd Qu.:0.4089   3rd Qu.:0.4168   3rd Qu.:0.2711  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  smoothness_mean  compactness_mean concavity_mean    concave.points_mean
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000     
##  1st Qu.:0.3046   1st Qu.:0.1397   1st Qu.:0.06926   1st Qu.:0.1009     
##  Median :0.3904   Median :0.2247   Median :0.14419   Median :0.1665     
##  Mean   :0.3948   Mean   :0.2606   Mean   :0.20806   Mean   :0.2431     
##  3rd Qu.:0.4755   3rd Qu.:0.3405   3rd Qu.:0.30623   3rd Qu.:0.3678     
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000     
##  symmetry_mean    fractal_dimension_mean   radius_se      
##  Min.   :0.0000   Min.   :0.0000         Min.   :0.00000  
##  1st Qu.:0.2823   1st Qu.:0.1630         1st Qu.:0.04378  
##  Median :0.3697   Median :0.2439         Median :0.07702  
##  Mean   :0.3796   Mean   :0.2704         Mean   :0.10635  
##  3rd Qu.:0.4530   3rd Qu.:0.3404         3rd Qu.:0.13304  
##  Max.   :1.0000   Max.   :1.0000         Max.   :1.00000  
##    texture_se      perimeter_se        area_se        smoothness_se   
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.1047   1st Qu.:0.04000   1st Qu.:0.02064   1st Qu.:0.1175  
##  Median :0.1653   Median :0.07209   Median :0.03311   Median :0.1586  
##  Mean   :0.1893   Mean   :0.09938   Mean   :0.06264   Mean   :0.1811  
##  3rd Qu.:0.2462   3rd Qu.:0.12251   3rd Qu.:0.07170   3rd Qu.:0.2187  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##  compactness_se     concavity_se     concave.points_se  symmetry_se    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:0.08132   1st Qu.:0.03811   1st Qu.:0.1447    1st Qu.:0.1024  
##  Median :0.13667   Median :0.06538   Median :0.2070    Median :0.1526  
##  Mean   :0.17444   Mean   :0.08054   Mean   :0.2235    Mean   :0.1781  
##  3rd Qu.:0.22680   3rd Qu.:0.10619   3rd Qu.:0.2787    3rd Qu.:0.2195  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000    Max.   :1.0000  
##  fractal_dimension_se  radius_worst    texture_worst    perimeter_worst 
##  Min.   :0.00000      Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.04675      1st Qu.:0.1807   1st Qu.:0.2415   1st Qu.:0.1678  
##  Median :0.07919      Median :0.2504   Median :0.3569   Median :0.2353  
##  Mean   :0.10019      Mean   :0.2967   Mean   :0.3640   Mean   :0.2831  
##  3rd Qu.:0.12656      3rd Qu.:0.3863   3rd Qu.:0.4717   3rd Qu.:0.3735  
##  Max.   :1.00000      Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    area_worst      smoothness_worst compactness_worst concavity_worst  
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000    Min.   :0.00000  
##  1st Qu.:0.08113   1st Qu.:0.3000   1st Qu.:0.1163    1st Qu.:0.09145  
##  Median :0.12321   Median :0.3971   Median :0.1791    Median :0.18107  
##  Mean   :0.17091   Mean   :0.4041   Mean   :0.2202    Mean   :0.21740  
##  3rd Qu.:0.22090   3rd Qu.:0.4942   3rd Qu.:0.3025    3rd Qu.:0.30583  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000    Max.   :1.00000  
##  concave.points_worst symmetry_worst   fractal_dimension_worst diagnosis
##  Min.   :0.0000       Min.   :0.0000   Min.   :0.0000          B:357    
##  1st Qu.:0.2231       1st Qu.:0.1851   1st Qu.:0.1077          M:212    
##  Median :0.3434       Median :0.2478   Median :0.1640                   
##  Mean   :0.3938       Mean   :0.2633   Mean   :0.1896                   
##  3rd Qu.:0.5546       3rd Qu.:0.3182   3rd Qu.:0.2429                   
##  Max.   :1.0000       Max.   :1.0000   Max.   :1.0000
boxplot(breast_nor[1:30], las=2, col="lightblue2", main="Normalize data") 

2.2. Randomly separate the data into two sets, the training set (67%) and the test set (33%).
train_nor =breast_nor[inTrain,]
test_nor =breast_nor[-inTrain,]

2.3. Use a k-NN algorithm (k = 1:15) to predict if the cancer is benign or malignant.
#labels of train dataset
breast_train_labels_nor = train_nor[,31]
head(breast_train_labels_nor)
## [1] M M M M M M
## Levels: B M
#labels of test dataset
breast_test_labels_nor = test_nor[,31]
head(breast_test_labels_nor)
## [1] M M M M M M
## Levels: B M
#to run the algorithm:
#install.packages("class")
library(class)
breast_test_pred_nor =knn(train = train_nor[1:30], test = test_nor[1:30], 
                        cl = breast_train_labels_nor, k = 4)

#to compare the original results with the predicted results: 
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = breast_test_labels_nor, y = breast_test_pred_nor, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  186 
## 
##  
##                        | breast_test_pred_nor 
## breast_test_labels_nor |         B |         M | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                      B |       115 |         2 |       117 | 
##                        |     0.983 |     0.017 |     0.629 | 
##                        |     0.943 |     0.031 |           | 
##                        |     0.618 |     0.011 |           | 
## -----------------------|-----------|-----------|-----------|
##                      M |         7 |        62 |        69 | 
##                        |     0.101 |     0.899 |     0.371 | 
##                        |     0.057 |     0.969 |           | 
##                        |     0.038 |     0.333 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |       122 |        64 |       186 | 
##                        |     0.656 |     0.344 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
#to create a table with the k results:
result_nor = matrix(0,15,3)
colnames(result_nor)= c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:15)){
  breast_test_pred_nor =knn(train = train_nor[1:30], test = test_nor[1:30], 
                        cl = breast_train_labels_nor, k = i)
  tablepredict =CrossTable(x = breast_test_labels_nor, y = breast_test_pred_nor, 
                             prop.chisq = FALSE)
  wrong = tablepredict$t[1,2] + tablepredict$t[2,1]
  result_nor[i,1]=i 
  result_nor[i,2]=wrong
  result_nor[i,3]=round(((wrong/569) * 100),2)}
result_nor
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                            10                           1.76
##  [2,]      2                             5                           0.88
##  [3,]      3                             8                           1.41
##  [4,]      4                             7                           1.23
##  [5,]      5                             8                           1.41
##  [6,]      6                             8                           1.41
##  [7,]      7                             8                           1.41
##  [8,]      8                            11                           1.93
##  [9,]      9                            10                           1.76
## [10,]     10                             9                           1.58
## [11,]     11                            11                           1.93
## [12,]     12                            11                           1.93
## [13,]     13                            11                           1.93
## [14,]     14                            10                           1.76
## [15,]     15                            10                           1.76
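
The percentages above are computed over the 569 rows of the full data set, while the predictions are made on the 186 test rows only. A minimal sketch of how the error rate could be expressed relative to the test set instead is shown below. The helper misclass_summary and the toy labels are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper (not in the original post): count the misclassified
# cases and express them as a percentage of the evaluated cases.
misclass_summary <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  wrong <- sum(as.character(actual) != as.character(predicted))
  c(wrong = wrong,
    percent = round(100 * wrong / length(actual), 2))
}

# Toy labels, invented for illustration only
actual    <- factor(c("B", "B", "M", "M", "B"), levels = c("B", "M"))
predicted <- factor(c("B", "M", "M", "B", "B"), levels = c("B", "M"))

toy <- misclass_summary(actual, predicted)
toy
```

Applied to this data set, 10 misclassified cases out of 186 test rows correspond to about 5.38%, whereas dividing by 569 gives the 1.76% shown in the table.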

2.4. Results
A confusion matrix is used to measure the performance. The “caret” package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class is “B”. Note that breast_test_pred_nor holds the predictions from the last iteration of the loop above (k = 15).
#install.packages("caret")
library(caret)
confu_nor =confusionMatrix(breast_test_pred_nor, breast_test_labels_nor, positive = "B")
confu_nor
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 117  10
##          M   0  59
##                                           
##                Accuracy : 0.9462          
##                  95% CI : (0.9034, 0.9739)
##     No Information Rate : 0.629           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8813          
##  Mcnemar's Test P-Value : 0.004427        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8551          
##          Pos Pred Value : 0.9213          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6290          
##          Detection Rate : 0.6290          
##    Detection Prevalence : 0.6828          
##       Balanced Accuracy : 0.9275          
##                                           
##        'Positive' Class : B               
## 
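
The main statistics reported by confusionMatrix can be recomputed by hand from the four cell counts. The sketch below uses the counts printed above for confu_nor; the variable names (cm, tp, fn, fp, tn) are chosen here for illustration only.

```r
# Counts copied from the confu_nor table above
# (rows = prediction, columns = reference); matrix() fills column by column
cm <- matrix(c(117, 0, 10, 59), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))

# "B" is the positive class
tp <- cm["B", "B"]  # predicted B, actually B
fn <- cm["M", "B"]  # predicted M, actually B
fp <- cm["B", "M"]  # predicted B, actually M
tn <- cm["M", "M"]  # predicted M, actually M

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

These values agree with the Accuracy (0.9462), Sensitivity (1.0000), Specificity (0.8551) and Balanced Accuracy (0.9275) lines in the output above.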

3.Z-SCORE STANDARDIZATION
3.1. Initial data analysis

This process rescales each feature's values in terms of how many standard deviations they fall above or below the mean. The resulting value is called a z-score. Z-scores fall in an unbounded range of negative and positive numbers; unlike the min-max normalized values, they have no predefined minimum and maximum.
breast_z = as.data.frame(scale(breastcan[,-1]))
breast_z$diagnosis = breastcan$diagnosis
summary(breast_z)
##   radius_mean       texture_mean     perimeter_mean      area_mean      
##  Min.   :-2.0279   Min.   :-2.2273   Min.   :-1.9828   Min.   :-1.4532  
##  1st Qu.:-0.6888   1st Qu.:-0.7253   1st Qu.:-0.6913   1st Qu.:-0.6666  
##  Median :-0.2149   Median :-0.1045   Median :-0.2358   Median :-0.2949  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4690   3rd Qu.: 0.5837   3rd Qu.: 0.4992   3rd Qu.: 0.3632  
##  Max.   : 3.9678   Max.   : 4.6478   Max.   : 3.9726   Max.   : 5.2459  
##  smoothness_mean    compactness_mean  concavity_mean   
##  Min.   :-3.10935   Min.   :-1.6087   Min.   :-1.1139  
##  1st Qu.:-0.71034   1st Qu.:-0.7464   1st Qu.:-0.7431  
##  Median :-0.03486   Median :-0.2217   Median :-0.3419  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.63564   3rd Qu.: 0.4934   3rd Qu.: 0.5256  
##  Max.   : 4.76672   Max.   : 4.5644   Max.   : 4.2399  
##  concave.points_mean symmetry_mean      fractal_dimension_mean
##  Min.   :-1.2607     Min.   :-2.74171   Min.   :-1.8183       
##  1st Qu.:-0.7373     1st Qu.:-0.70262   1st Qu.:-0.7220       
##  Median :-0.3974     Median :-0.07156   Median :-0.1781       
##  Mean   : 0.0000     Mean   : 0.00000   Mean   : 0.0000       
##  3rd Qu.: 0.6464     3rd Qu.: 0.53031   3rd Qu.: 0.4706       
##  Max.   : 3.9245     Max.   : 4.48081   Max.   : 4.9066       
##    radius_se         texture_se       perimeter_se        area_se       
##  Min.   :-1.0590   Min.   :-1.5529   Min.   :-1.0431   Min.   :-0.7372  
##  1st Qu.:-0.6230   1st Qu.:-0.6942   1st Qu.:-0.6232   1st Qu.:-0.4943  
##  Median :-0.2920   Median :-0.1973   Median :-0.2864   Median :-0.3475  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2659   3rd Qu.: 0.4661   3rd Qu.: 0.2428   3rd Qu.: 0.1067  
##  Max.   : 8.8991   Max.   : 6.6494   Max.   : 9.4537   Max.   :11.0321  
##  smoothness_se     compactness_se     concavity_se     concave.points_se
##  Min.   :-1.7745   Min.   :-1.2970   Min.   :-1.0566   Min.   :-1.9118  
##  1st Qu.:-0.6235   1st Qu.:-0.6923   1st Qu.:-0.5567   1st Qu.:-0.6739  
##  Median :-0.2201   Median :-0.2808   Median :-0.1989   Median :-0.1404  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3680   3rd Qu.: 0.3893   3rd Qu.: 0.3365   3rd Qu.: 0.4722  
##  Max.   : 8.0229   Max.   : 6.1381   Max.   :12.0621   Max.   : 6.6438  
##   symmetry_se      fractal_dimension_se  radius_worst    
##  Min.   :-1.5315   Min.   :-1.0960      Min.   :-1.7254  
##  1st Qu.:-0.6511   1st Qu.:-0.5846      1st Qu.:-0.6743  
##  Median :-0.2192   Median :-0.2297      Median :-0.2688  
##  Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.0000  
##  3rd Qu.: 0.3554   3rd Qu.: 0.2884      3rd Qu.: 0.5216  
##  Max.   : 7.0657   Max.   : 9.8429      Max.   : 4.0906  
##  texture_worst      perimeter_worst     area_worst      smoothness_worst 
##  Min.   :-2.22204   Min.   :-1.6919   Min.   :-1.2213   Min.   :-2.6803  
##  1st Qu.:-0.74797   1st Qu.:-0.6890   1st Qu.:-0.6416   1st Qu.:-0.6906  
##  Median :-0.04348   Median :-0.2857   Median :-0.3409   Median :-0.0468  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.65776   3rd Qu.: 0.5398   3rd Qu.: 0.3573   3rd Qu.: 0.5970  
##  Max.   : 3.88249   Max.   : 4.2836   Max.   : 5.9250   Max.   : 3.9519  
##  compactness_worst concavity_worst   concave.points_worst
##  Min.   :-1.4426   Min.   :-1.3047   Min.   :-1.7435     
##  1st Qu.:-0.6805   1st Qu.:-0.7558   1st Qu.:-0.7557     
##  Median :-0.2693   Median :-0.2180   Median :-0.2233     
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000     
##  3rd Qu.: 0.5392   3rd Qu.: 0.5307   3rd Qu.: 0.7119     
##  Max.   : 5.1084   Max.   : 4.6965   Max.   : 2.6835     
##  symmetry_worst    fractal_dimension_worst diagnosis
##  Min.   :-2.1591   Min.   :-1.6004         B:357    
##  1st Qu.:-0.6413   1st Qu.:-0.6913         M:212    
##  Median :-0.1273   Median :-0.2163                  
##  Mean   : 0.0000   Mean   : 0.0000                  
##  3rd Qu.: 0.4497   3rd Qu.: 0.4504                  
##  Max.   : 6.0407   Max.   : 6.8408
boxplot(breast_z[1:30], las=2, col="pink", main="Z-score standardization")
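
To make the z-score transformation concrete, the sketch below computes it by hand on a small invented vector and checks that scale() gives the same values. The vector x is toy data, not taken from the breast cancer data set.

```r
# Toy feature values, invented for illustration
x <- c(10, 12, 14, 16, 18)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() centers and scales each column the same way;
# as.numeric() drops the matrix attributes it returns
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
round(c(mean = mean(z_manual), sd = sd(z_manual)), 4)
```

After the transformation the feature has mean 0 and standard deviation 1, which is why every column in the summary above reports a mean of 0.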

3.2. Randomly separate the data into two sets, the training set (67%) and the test set (33%).
train_z =breast_z[inTrain,]
test_z =breast_z[-inTrain,]

3.3. Use a k-NN algorithm (k = 1:15) to predict if the cancer is benign or malignant.
#labels of train dataset
breast_train_labels_z = train_z[,31]
head(breast_train_labels_z)
## [1] M M M M M M
## Levels: B M
#labels of test dataset
breast_test_labels_z = test_z[,31]
head(breast_test_labels_z)
## [1] M M M M M M
## Levels: B M
#to run the algorithm:
#install.packages("class")
library(class)
breast_test_pred_z =knn(train = train_z[1:30], test = test_z[1:30], 
                        cl = breast_train_labels_z, k = 4)

#to compare the original results with the predicted results: 
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = breast_test_labels_z, y = breast_test_pred_z, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  186 
## 
##  
##                      | breast_test_pred_z 
## breast_test_labels_z |         B |         M | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                    B |       117 |         0 |       117 | 
##                      |     1.000 |     0.000 |     0.629 | 
##                      |     0.921 |     0.000 |           | 
##                      |     0.629 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|
##                    M |        10 |        59 |        69 | 
##                      |     0.145 |     0.855 |     0.371 | 
##                      |     0.079 |     1.000 |           | 
##                      |     0.054 |     0.317 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |       127 |        59 |       186 | 
##                      |     0.683 |     0.317 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 
#to create a table with the results for each value of k:
result_z = matrix(0,15,3)
colnames(result_z) = c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:15)){
  breast_test_pred_z =knn(train = train_z[1:30], test = test_z[1:30], 
                        cl = breast_train_labels_z, k = i)
  tablepredict =CrossTable(x = breast_test_labels_z, y = breast_test_pred_z, 
                             prop.chisq = FALSE)
  #misclassified cases are the off-diagonal cells of the table
  wrong = tablepredict$t[1,2] + tablepredict$t[2,1]
  result_z[i,1]=i 
  result_z[i,2]=wrong
  #note: the percentage is taken over all 569 rows of the data set;
  #the test set has 186 rows, so the test error rate would be wrong/186
  result_z[i,3]=round(((wrong/569) * 100),2)}
result_z
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                            12                           2.11
##  [2,]      2                            15                           2.64
##  [3,]      3                             9                           1.58
##  [4,]      4                            10                           1.76
##  [5,]      5                            11                           1.93
##  [6,]      6                            13                           2.28
##  [7,]      7                            10                           1.76
##  [8,]      8                            12                           2.11
##  [9,]      9                            14                           2.46
## [10,]     10                            13                           2.28
## [11,]     11                            13                           2.28
## [12,]     12                            13                           2.28
## [13,]     13                            14                           2.46
## [14,]     14                            14                           2.46
## [15,]     15                            12                           2.11

3.4. Results
A confusion matrix is used to measure the performance. The “caret” package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class is “B”. Note that breast_test_pred_z holds the predictions from the last iteration of the loop above (k = 15).
#install.packages("caret")
library(caret)
confu_z =confusionMatrix(breast_test_pred_z, breast_test_labels_z, positive = "B")
confu_z
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 116  11
##          M   1  58
##                                         
##                Accuracy : 0.9355        
##                  95% CI : (0.89, 0.9662)
##     No Information Rate : 0.629         
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.8575        
##  Mcnemar's Test P-Value : 0.009375      
##                                         
##             Sensitivity : 0.9915        
##             Specificity : 0.8406        
##          Pos Pred Value : 0.9134        
##          Neg Pred Value : 0.9831        
##              Prevalence : 0.6290        
##          Detection Rate : 0.6237        
##    Detection Prevalence : 0.6828        
##       Balanced Accuracy : 0.9160        
##                                         
##        'Positive' Class : B             
## 

Comparing results:
result 
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                            12                           2.11
##  [2,]      2                            13                           2.28
##  [3,]      3                            13                           2.28
##  [4,]      4                            15                           2.64
##  [5,]      5                            13                           2.28
##  [6,]      6                            13                           2.28
##  [7,]      7                            12                           2.11
##  [8,]      8                            12                           2.11
##  [9,]      9                            12                           2.11
## [10,]     10                            11                           1.93
## [11,]     11                            11                           1.93
## [12,]     12                            11                           1.93
## [13,]     13                            13                           2.28
## [14,]     14                            13                           2.28
## [15,]     15                            13                           2.28
result_nor
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                            10                           1.76
##  [2,]      2                             5                           0.88
##  [3,]      3                             8                           1.41
##  [4,]      4                             7                           1.23
##  [5,]      5                             8                           1.41
##  [6,]      6                             8                           1.41
##  [7,]      7                             8                           1.41
##  [8,]      8                            11                           1.93
##  [9,]      9                            10                           1.76
## [10,]     10                             9                           1.58
## [11,]     11                            11                           1.93
## [12,]     12                            11                           1.93
## [13,]     13                            11                           1.93
## [14,]     14                            10                           1.76
## [15,]     15                            10                           1.76
result_z
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                            12                           2.11
##  [2,]      2                            15                           2.64
##  [3,]      3                             9                           1.58
##  [4,]      4                            10                           1.76
##  [5,]      5                            11                           1.93
##  [6,]      6                            13                           2.28
##  [7,]      7                            10                           1.76
##  [8,]      8                            12                           2.11
##  [9,]      9                            14                           2.46
## [10,]     10                            13                           2.28
## [11,]     11                            13                           2.28
## [12,]     12                            13                           2.28
## [13,]     13                            14                           2.46
## [14,]     14                            14                           2.46
## [15,]     15                            12                           2.11
confu
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 113   9
##          M   4  60
##                                           
##                Accuracy : 0.9301          
##                  95% CI : (0.8834, 0.9623)
##     No Information Rate : 0.629           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.848           
##  Mcnemar's Test P-Value : 0.2673          
##                                           
##             Sensitivity : 0.9658          
##             Specificity : 0.8696          
##          Pos Pred Value : 0.9262          
##          Neg Pred Value : 0.9375          
##              Prevalence : 0.6290          
##          Detection Rate : 0.6075          
##    Detection Prevalence : 0.6559          
##       Balanced Accuracy : 0.9177          
##                                           
##        'Positive' Class : B               
## 
confu_nor
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 117  10
##          M   0  59
##                                           
##                Accuracy : 0.9462          
##                  95% CI : (0.9034, 0.9739)
##     No Information Rate : 0.629           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8813          
##  Mcnemar's Test P-Value : 0.004427        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8551          
##          Pos Pred Value : 0.9213          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6290          
##          Detection Rate : 0.6290          
##    Detection Prevalence : 0.6828          
##       Balanced Accuracy : 0.9275          
##                                           
##        'Positive' Class : B               
## 
confu_z
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 116  11
##          M   1  58
##                                         
##                Accuracy : 0.9355        
##                  95% CI : (0.89, 0.9662)
##     No Information Rate : 0.629         
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.8575        
##  Mcnemar's Test P-Value : 0.009375      
##                                         
##             Sensitivity : 0.9915        
##             Specificity : 0.8406        
##          Pos Pred Value : 0.9134        
##          Neg Pred Value : 0.9831        
##              Prevalence : 0.6290        
##          Detection Rate : 0.6237        
##    Detection Prevalence : 0.6828        
##       Balanced Accuracy : 0.9160        
##                                         
##        'Positive' Class : B             
## 
With these results we can see that the model improves when the features are rescaled before the distances are computed. This is because the variables in the breast cancer data set have different ranges, and hence rescaling them does produce an improvement of the algorithm. In this case, min-max normalization gives the best result (accuracy 0.9462, compared with 0.9355 for z-score standardization and 0.9301 without rescaling).
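
The reason rescaling matters can be seen on a very small example. The sketch below uses invented values for two features with very different ranges and shows that the nearest neighbour of a point can change once the features are min-max normalized. The data frame toy and the helper minmax are hypothetical and only illustrate the idea.

```r
# Toy data, invented for illustration: one wide-range feature (area)
# and one narrow-range feature (smoothness)
toy <- data.frame(area       = c(500, 520, 560, 900),
                  smoothness = c(0.05, 0.20, 0.06, 0.10))
rownames(toy) <- c("a", "b", "c", "d")

# Min-max normalization to the range [0, 1]
minmax <- function(v) (v - min(v)) / (max(v) - min(v))
toy_nor <- as.data.frame(lapply(toy, minmax))
rownames(toy_nor) <- rownames(toy)

# Euclidean distances before and after rescaling
d_raw <- as.matrix(dist(toy))
d_nor <- as.matrix(dist(toy_nor))

others <- c("b", "c", "d")
nearest_raw <- names(which.min(d_raw["a", others]))
nearest_nor <- names(which.min(d_nor["a", others]))
c(raw = nearest_raw, normalized = nearest_nor)
```

On the raw values the distance is dominated by area, so the nearest neighbour of "a" is "b". After normalization the large difference in smoothness between "a" and "b" counts as well, and "c" becomes the nearest neighbour.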
