
Distances for k-NN algorithm (iris dataset)

General information about the k-NN algorithm can be found at: http://dataworldblog.blogspot.com.es/2017/08/k-nn-algorithm.html
A distance function is used to measure the similarity between two instances. There are different ways to calculate distance, but traditionally the k-NN algorithm uses the Euclidean distance, which is the “ordinary” or “straight-line” distance between two points.
It has been demonstrated that the chosen distance function can affect the classification accuracy of the k-NN classifier.
The distance calculation for k-NN is heavily dependent on the measurement scale of the input features. Since different inputs cover different ranges of values, the inputs with a larger range of values will have a larger impact on the distance than those with a smaller range.
This could potentially cause problems for our classifier, so we have to apply normalization to rescale the features into a standard range of values.
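For instance, here is a minimal sketch (not part of the original code) of the Euclidean distance between a setosa and a versicolor flower, showing how the features with the widest ranges dominate the result:
#Euclidean distance between two flowers, computed from the four features
x = unlist(iris[1, 1:4])     #a setosa
y = unlist(iris[51, 1:4])    #a versicolor
sqrt(sum((x - y)^2))         #same value as dist(iris[c(1, 51), 1:4])
#the squared differences show which features dominate the distance:
round((x - y)^2, 2)          #Petal.Length contributes by far the most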
So here we are going to compare the results obtained without rescaling and with two common rescaling methods (min-max normalization and z-score standardization).

1. WITHOUT RESCALING THE FEATURES
1.1. Initial data analysis.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
Here we see that the inputs cover somewhat different ranges of values (for example, Petal.Length spans 1.0 to 6.9 while Petal.Width spans 0.1 to 2.5).

1.2. Randomly separate the data into two sets: the training set (67%) and the test set (33%).
We will divide our data into two different sets: a training data set (67%) that will be used to build the model and a testing data set (33%) that will be used to estimate the predictive accuracy of the model.
The function createDataPartition can be used to create random, balanced splits of the data. With this function, random sampling occurs within each class, which preserves the overall class distribution of the data.
library(caret)
set.seed(123)
inTrain =createDataPartition(y = iris$Species,p = 0.67,list = FALSE)
str(inTrain)
##  int [1:102, 1] 2 3 4 7 9 11 13 14 15 18 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "Resample1"
train =iris[inTrain,]
test =iris[-inTrain,]
We have to make sure the samples have been fairly split between the training and testing data sets, and that the class proportions in each are similar to those of the complete data set:
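The proportions below can be obtained with prop.table (a sketch; the code used in the original post is not shown):
round(prop.table(table(iris$Species)), 3)    #complete dataset
round(prop.table(table(train$Species)), 3)   #training dataset
round(prop.table(table(test$Species)), 3)    #testing dataset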
Proportions of the complete dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
Proportions of the training dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
Proportions of the testing dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
We see that the proportions are similar among the datasets.

1.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels =train[,5]
head(iris_train_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels =test[,5]
head(iris_test_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred =knn(train = train[1:4], test = test[1:4], 
                        cl = iris_train_labels, k = 4)

#to compare the original labels with the predicted results: 
#install.packages("gmodels")
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.1
tablepredict =CrossTable(x = iris_test_labels, y = iris_test_pred, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  48 
## 
##  
##                  | iris_test_pred 
## iris_test_labels |     setosa | versicolor |  virginica |  Row Total | 
## -----------------|------------|------------|------------|------------|
##           setosa |         16 |          0 |          0 |         16 | 
##                  |      1.000 |      0.000 |      0.000 |      0.333 | 
##                  |      1.000 |      0.000 |      0.000 |            | 
##                  |      0.333 |      0.000 |      0.000 |            | 
## -----------------|------------|------------|------------|------------|
##       versicolor |          0 |         12 |          4 |         16 | 
##                  |      0.000 |      0.750 |      0.250 |      0.333 | 
##                  |      0.000 |      0.923 |      0.211 |            | 
##                  |      0.000 |      0.250 |      0.083 |            | 
## -----------------|------------|------------|------------|------------|
##        virginica |          0 |          1 |         15 |         16 | 
##                  |      0.000 |      0.062 |      0.938 |      0.333 | 
##                  |      0.000 |      0.077 |      0.789 |            | 
##                  |      0.000 |      0.021 |      0.312 |            | 
## -----------------|------------|------------|------------|------------|
##     Column Total |         16 |         13 |         19 |         48 | 
##                  |      0.333 |      0.271 |      0.396 |            | 
## -----------------|------------|------------|------------|------------|
## 
## 
Now we build a table of results for k = 1 to 20, to see which value of k gives the best performance:
#to create a table with the results for k = 1 to 20:
result <- matrix(0, 20, 3)
colnames(result) <- c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:20)){
  iris_test_pred =knn(train = train[1:4], test = test[1:4], 
                        cl = iris_train_labels, k = i)
  tablepredict =CrossTable(x = iris_test_labels, y = iris_test_pred, 
                             prop.chisq = FALSE)
  #the sum of the off-diagonal cells is the number of misclassified test instances
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] + tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result[i,1]=i 
  result[i,2]=wrong
  #note: the percentage is computed over all 150 iris observations, not over the 48 test instances
  result[i,3]=round(((wrong/150) * 100),2)}
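The off-diagonal sum above can also be written more compactly (an equivalent sketch):
#number misclassified = total minus the correctly classified diagonal
wrong = sum(tablepredict$t) - sum(diag(tablepredict$t))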
result
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           2.00
##  [2,]      2                             4                           2.67
##  [3,]      3                             3                           2.00
##  [4,]      4                             5                           3.33
##  [5,]      5                             4                           2.67
##  [6,]      6                             4                           2.67
##  [7,]      7                             4                           2.67
##  [8,]      8                             3                           2.00
##  [9,]      9                             4                           2.67
## [10,]     10                             4                           2.67
## [11,]     11                             3                           2.00
## [12,]     12                             4                           2.67
## [13,]     13                             4                           2.67
## [14,]     14                             4                           2.67
## [15,]     15                             4                           2.67
## [16,]     16                             4                           2.67
## [17,]     17                             4                           2.67
## [18,]     18                             4                           2.67
## [19,]     19                             4                           2.67
## [20,]     20                             3                           2.00
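To see the pattern at a glance, the misclassification counts can be plotted against k (a sketch, not in the original post):
plot(result[, 1], result[, 2], type = "b", xlab = "k",
     ylab = "Number classified incorrectly",
     main = "Misclassifications by k (raw features)")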

1.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred here holds the prediction from the last loop iteration (k = 20)
confu =confusionMatrix(iris_test_pred, iris_test_labels, positive = "setosa")
confu
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         14         1
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.828, 0.9869)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9062         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8750           0.9375
## Specificity                 1.0000            0.9688           0.9375
## Pos Pred Value              1.0000            0.9333           0.8824
## Neg Pred Value              1.0000            0.9394           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2917           0.3125
## Detection Prevalence        0.3333            0.3125           0.3542
## Balanced Accuracy           1.0000            0.9219           0.9375
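For later comparison, the overall accuracy can also be extracted programmatically from the caret object (a small sketch):
confu$overall["Accuracy"]   #returns the 0.9375 shown above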

2. MIN-MAX NORMALIZING NUMERIC DATA
2.1. Initial data analysis.
This process transforms a feature such that all of its values fall in a range between 0 and 1.
normalize =function(x){
  return((x-min(x))/(max(x)-min(x)))}
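A quick sanity check of the function on a small vector (not in the original; the values should map onto the [0, 1] range):
normalize(c(1, 5, 10))   #returns 0.0000000 0.4444444 1.0000000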

iris_nor =as.data.frame(lapply(iris[,-5], normalize))
iris_nor$Species =iris$Species
head(iris_nor)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   0.22222222   0.6250000   0.06779661  0.04166667  setosa
## 2   0.16666667   0.4166667   0.06779661  0.04166667  setosa
## 3   0.11111111   0.5000000   0.05084746  0.04166667  setosa
## 4   0.08333333   0.4583333   0.08474576  0.04166667  setosa
## 5   0.19444444   0.6666667   0.06779661  0.04166667  setosa
## 6   0.30555556   0.7916667   0.11864407  0.12500000  setosa
summary(iris_nor)
##   Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
##  Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
##  Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
##  3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
boxplot(iris_nor[1:4], las=2, col="lightblue2", main="Normalized data") 

2.2. Randomly separate the data into two sets: the training set (67%) and the test set (33%).
train_nor =iris_nor[inTrain,]
test_nor =iris_nor[-inTrain,]

2.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels_nor = train_nor[,5]
head(iris_train_labels_nor)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels_nor = test_nor[,5]
head(iris_test_labels_nor)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred_nor =knn(train = train_nor[1:4], test = test_nor[1:4], 
                        cl = iris_train_labels_nor, k = 4)

#to compare the original labels with the predicted results: 
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = iris_test_labels_nor, y = iris_test_pred_nor, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  48 
## 
##  
##                      | iris_test_pred_nor 
## iris_test_labels_nor |     setosa | versicolor |  virginica |  Row Total | 
## ---------------------|------------|------------|------------|------------|
##               setosa |         16 |          0 |          0 |         16 | 
##                      |      1.000 |      0.000 |      0.000 |      0.333 | 
##                      |      1.000 |      0.000 |      0.000 |            | 
##                      |      0.333 |      0.000 |      0.000 |            | 
## ---------------------|------------|------------|------------|------------|
##           versicolor |          0 |         12 |          4 |         16 | 
##                      |      0.000 |      0.750 |      0.250 |      0.333 | 
##                      |      0.000 |      0.923 |      0.211 |            | 
##                      |      0.000 |      0.250 |      0.083 |            | 
## ---------------------|------------|------------|------------|------------|
##            virginica |          0 |          1 |         15 |         16 | 
##                      |      0.000 |      0.062 |      0.938 |      0.333 | 
##                      |      0.000 |      0.077 |      0.789 |            | 
##                      |      0.000 |      0.021 |      0.312 |            | 
## ---------------------|------------|------------|------------|------------|
##         Column Total |         16 |         13 |         19 |         48 | 
##                      |      0.333 |      0.271 |      0.396 |            | 
## ---------------------|------------|------------|------------|------------|
## 
## 
#to create a table with the k results:
result_nor <- matrix(0, 20, 3)
colnames(result_nor) <- c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:20)){
  iris_test_pred_nor =knn(train = train_nor[1:4], test = test_nor[1:4], 
                        cl = iris_train_labels_nor, k = i)
  tablepredict =CrossTable(x = iris_test_labels_nor, y = iris_test_pred_nor, 
                             prop.chisq = FALSE)
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] + tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result_nor[i,1]=i 
  result_nor[i,2]=wrong
  result_nor[i,3]=round(((wrong/150) * 100),2)}
result_nor
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           2.00
##  [2,]      2                             5                           3.33
##  [3,]      3                             4                           2.67
##  [4,]      4                             4                           2.67
##  [5,]      5                             4                           2.67
##  [6,]      6                             5                           3.33
##  [7,]      7                             5                           3.33
##  [8,]      8                             5                           3.33
##  [9,]      9                             4                           2.67
## [10,]     10                             4                           2.67
## [11,]     11                             3                           2.00
## [12,]     12                             3                           2.00
## [13,]     13                             3                           2.00
## [14,]     14                             3                           2.00
## [15,]     15                             3                           2.00
## [16,]     16                             3                           2.00
## [17,]     17                             4                           2.67
## [18,]     18                             3                           2.00
## [19,]     19                             3                           2.00
## [20,]     20                             3                           2.00

2.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred_nor holds the prediction from the last loop iteration (k = 20)
confu_nor =confusionMatrix(iris_test_pred_nor, iris_test_labels_nor, positive = "setosa")
confu_nor
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         14         1
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.828, 0.9869)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9062         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8750           0.9375
## Specificity                 1.0000            0.9688           0.9375
## Pos Pred Value              1.0000            0.9333           0.8824
## Neg Pred Value              1.0000            0.9394           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2917           0.3125
## Detection Prevalence        0.3333            0.3125           0.3542
## Balanced Accuracy           1.0000            0.9219           0.9375

3. Z-SCORE STANDARDIZATION
3.1. Initial data analysis.
This process rescales each of the feature’s values in terms of how many standard deviations they fall above or below the mean value. The resulting value is called the z-score. Z-scores fall in an unbounded range of negative and positive numbers; unlike min-max normalized values, they have no predefined minimum and maximum.
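scale() centers each column by its mean and divides it by its standard deviation; a quick manual check (a sketch, not in the original):
#z-score of Sepal.Length computed by hand; should match scale()'s output
z_manual = (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)
all.equal(z_manual, as.numeric(scale(iris$Sepal.Length)))   #TRUE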
iris_z =as.data.frame(scale(iris[,-5]))
iris_z$Species =iris$Species
summary(iris_z)
##   Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
##  Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
##  1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
##  Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
boxplot(iris_z[1:4], las=2, col="gold", main="Z-score standardization")

3.2. Randomly separate the data into two sets: the training set (67%) and the test set (33%).
train_z =iris_z[inTrain,]
test_z =iris_z[-inTrain,]

3.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels_z = train_z[,5]
head(iris_train_labels_z)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels_z = test_z[,5]
head(iris_test_labels_z)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred_z =knn(train = train_z[1:4], test = test_z[1:4], 
                        cl = iris_train_labels_z, k = 4)

#to compare the original labels with the predicted results: 
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = iris_test_labels_z, y = iris_test_pred_z, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  48 
## 
##  
##                    | iris_test_pred_z 
## iris_test_labels_z |     setosa | versicolor |  virginica |  Row Total | 
## -------------------|------------|------------|------------|------------|
##             setosa |         16 |          0 |          0 |         16 | 
##                    |      1.000 |      0.000 |      0.000 |      0.333 | 
##                    |      1.000 |      0.000 |      0.000 |            | 
##                    |      0.333 |      0.000 |      0.000 |            | 
## -------------------|------------|------------|------------|------------|
##         versicolor |          0 |         12 |          4 |         16 | 
##                    |      0.000 |      0.750 |      0.250 |      0.333 | 
##                    |      0.000 |      0.923 |      0.211 |            | 
##                    |      0.000 |      0.250 |      0.083 |            | 
## -------------------|------------|------------|------------|------------|
##          virginica |          0 |          1 |         15 |         16 | 
##                    |      0.000 |      0.062 |      0.938 |      0.333 | 
##                    |      0.000 |      0.077 |      0.789 |            | 
##                    |      0.000 |      0.021 |      0.312 |            | 
## -------------------|------------|------------|------------|------------|
##       Column Total |         16 |         13 |         19 |         48 | 
##                    |      0.333 |      0.271 |      0.396 |            | 
## -------------------|------------|------------|------------|------------|
## 
## 
#to create a table with the k results:
result_z <- matrix(0, 20, 3)
colnames(result_z) <- c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:20)){
  iris_test_pred_z =knn(train = train_z[1:4], test = test_z[1:4], 
                        cl = iris_train_labels_z, k = i)
  tablepredict =CrossTable(x = iris_test_labels_z, y = iris_test_pred_z, 
                             prop.chisq = FALSE)
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] + tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result_z[i,1]=i 
  result_z[i,2]=wrong
  result_z[i,3]=round(((wrong/150) * 100),2)}
result_z
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             4                           2.67
##  [2,]      2                             5                           3.33
##  [3,]      3                             4                           2.67
##  [4,]      4                             5                           3.33
##  [5,]      5                             4                           2.67
##  [6,]      6                             4                           2.67
##  [7,]      7                             3                           2.00
##  [8,]      8                             3                           2.00
##  [9,]      9                             3                           2.00
## [10,]     10                             3                           2.00
## [11,]     11                             4                           2.67
## [12,]     12                             5                           3.33
## [13,]     13                             4                           2.67
## [14,]     14                             5                           3.33
## [15,]     15                             4                           2.67
## [16,]     16                             4                           2.67
## [17,]     17                             4                           2.67
## [18,]     18                             3                           2.00
## [19,]     19                             4                           2.67
## [20,]     20                             4                           2.67

3.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred_z holds the prediction from the last loop iteration (k = 20)
confu_z =confusionMatrix(iris_test_pred_z, iris_test_labels_z, positive = "setosa")
confu_z
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         13         1
##   virginica       0          3        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8002, 0.9768)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.875           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8125           0.9375
## Specificity                 1.0000            0.9688           0.9062
## Pos Pred Value              1.0000            0.9286           0.8333
## Neg Pred Value              1.0000            0.9118           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2708           0.3125
## Detection Prevalence        0.3333            0.2917           0.3750
## Balanced Accuracy           1.0000            0.8906           0.9219

4. COMPARING RESULTS
result
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           2.00
##  [2,]      2                             4                           2.67
##  [3,]      3                             3                           2.00
##  [4,]      4                             5                           3.33
##  [5,]      5                             4                           2.67
##  [6,]      6                             4                           2.67
##  [7,]      7                             4                           2.67
##  [8,]      8                             3                           2.00
##  [9,]      9                             4                           2.67
## [10,]     10                             4                           2.67
## [11,]     11                             3                           2.00
## [12,]     12                             4                           2.67
## [13,]     13                             4                           2.67
## [14,]     14                             4                           2.67
## [15,]     15                             4                           2.67
## [16,]     16                             4                           2.67
## [17,]     17                             4                           2.67
## [18,]     18                             4                           2.67
## [19,]     19                             4                           2.67
## [20,]     20                             3                           2.00
result_nor
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           2.00
##  [2,]      2                             5                           3.33
##  [3,]      3                             4                           2.67
##  [4,]      4                             4                           2.67
##  [5,]      5                             4                           2.67
##  [6,]      6                             5                           3.33
##  [7,]      7                             5                           3.33
##  [8,]      8                             5                           3.33
##  [9,]      9                             4                           2.67
## [10,]     10                             4                           2.67
## [11,]     11                             3                           2.00
## [12,]     12                             3                           2.00
## [13,]     13                             3                           2.00
## [14,]     14                             3                           2.00
## [15,]     15                             3                           2.00
## [16,]     16                             3                           2.00
## [17,]     17                             4                           2.67
## [18,]     18                             3                           2.00
## [19,]     19                             3                           2.00
## [20,]     20                             3                           2.00
result_z
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             4                           2.67
##  [2,]      2                             5                           3.33
##  [3,]      3                             4                           2.67
##  [4,]      4                             5                           3.33
##  [5,]      5                             4                           2.67
##  [6,]      6                             4                           2.67
##  [7,]      7                             3                           2.00
##  [8,]      8                             3                           2.00
##  [9,]      9                             3                           2.00
## [10,]     10                             3                           2.00
## [11,]     11                             4                           2.67
## [12,]     12                             5                           3.33
## [13,]     13                             4                           2.67
## [14,]     14                             5                           3.33
## [15,]     15                             4                           2.67
## [16,]     16                             4                           2.67
## [17,]     17                             4                           2.67
## [18,]     18                             3                           2.00
## [19,]     19                             4                           2.67
## [20,]     20                             4                           2.67
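To compare the three approaches side by side, the misclassification columns can be combined into one table (a sketch, not in the original post; 'comparison' is a hypothetical helper):
#misclassification counts per k for each rescaling approach
comparison = data.frame(k = 1:20,
                        raw    = result[, 2],
                        minmax = result_nor[, 2],
                        zscore = result_z[, 2])
comparison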
confu
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         14         1
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.828, 0.9869)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9062         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8750           0.9375
## Specificity                 1.0000            0.9688           0.9375
## Pos Pred Value              1.0000            0.9333           0.8824
## Neg Pred Value              1.0000            0.9394           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2917           0.3125
## Detection Prevalence        0.3333            0.3125           0.3542
## Balanced Accuracy           1.0000            0.9219           0.9375
confu_nor
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         14         1
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.828, 0.9869)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9062         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8750           0.9375
## Specificity                 1.0000            0.9688           0.9375
## Pos Pred Value              1.0000            0.9333           0.8824
## Neg Pred Value              1.0000            0.9394           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2917           0.3125
## Detection Prevalence        0.3333            0.3125           0.3542
## Balanced Accuracy           1.0000            0.9219           0.9375
confu_z
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         13         1
##   virginica       0          3        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8002, 0.9768)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.875           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8125           0.9375
## Specificity                 1.0000            0.9688           0.9062
## Pos Pred Value              1.0000            0.9286           0.8333
## Neg Pred Value              1.0000            0.9118           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2708           0.3125
## Detection Prevalence        0.3333            0.2917           0.3750
## Balanced Accuracy           1.0000            0.8906           0.9219
With these results we can see that the differences between the rescaling approaches are minimal. This is because the variables in the iris data set already have similar ranges of values, and hence rescaling them does not produce a clear improvement of the algorithm.

