
Distances for k-NN algorithm (iris dataset)

General information about the k-NN algorithm can be found at: http://dataworldblog.blogspot.com.es/2017/08/k-nn-algorithm.html
A distance function is used to measure the similarity between two instances. There are different ways to calculate distance, but traditionally the k-NN algorithm uses the Euclidean distance, which is the “ordinary” or “straight-line” distance between two points.
It has been demonstrated that the chosen distance function can affect the classification accuracy of the k-NN classifier.
The distance calculation for k-NN is heavily dependent on the measurement scale of the input features. Since different inputs cover different ranges of values, the inputs with a larger range of values will have a larger impact on the distance than those with a smaller range.
This could potentially cause problems for our classifier, so we have to apply normalization to rescale the features into a standard range of values.
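For instance, here is a minimal sketch (not part of the original code) of the Euclidean distance between a setosa and a versicolor flower, showing how the features with the widest ranges dominate the result:
#Euclidean distance between two flowers, computed from the four features
x = unlist(iris[1, 1:4])     #a setosa
y = unlist(iris[51, 1:4])    #a versicolor
sqrt(sum((x - y)^2))         #same value as dist(iris[c(1, 51), 1:4])
#the squared differences show which features dominate the distance:
round((x - y)^2, 2)          #Petal.Length contributes by far the most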
So here we are going to compare the results obtained without rescaling and with two common rescaling methods (min-max normalization and z-score standardization).

1. WITHOUT RESCALING THE FEATURES
1.1. Initial data analysis.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
Here we see that the inputs cover somewhat different ranges of values (for example, Petal.Length spans 1.0 to 6.9 while Petal.Width spans 0.1 to 2.5).

1.2. Randomly separate the data into two sets: the training set (67%) and the test set (33%).
We will divide our data into two different sets: a training data set (67%) that will be used to build the model and a testing data set (33%) that will be used to estimate the predictive accuracy of the model.
The function createDataPartition can be used to create random, balanced splits of the data. With this function, random sampling occurs within each class, which preserves the overall class distribution of the data.
library(caret)
set.seed(123)
inTrain =createDataPartition(y = iris$Species,p = 0.67,list = FALSE)
str(inTrain)
##  int [1:102, 1] 2 3 4 7 9 11 13 14 15 18 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "Resample1"
train =iris[inTrain,]
test =iris[-inTrain,]
We have to make sure the samples have been fairly split between the training and testing data sets, and that the class proportions in each are similar to those of the complete data set:
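The proportions below can be obtained with prop.table (a sketch; the code used in the original post is not shown):
round(prop.table(table(iris$Species)), 3)    #complete dataset
round(prop.table(table(train$Species)), 3)   #training dataset
round(prop.table(table(test$Species)), 3)    #testing dataset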
Proportions of the complete dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
Proportions of the training dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
Proportions of the testing dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
We see that the proportions are similar among the datasets.

1.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels =train[,5]
head(iris_train_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels =test[,5]
head(iris_test_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred =knn(train = train[1:4], test = test[1:4], 
                        cl = iris_train_labels, k = 4)

#to compare the original labels with the predicted results: 
#install.packages("gmodels")
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.1
tablepredict =CrossTable(x = iris_test_labels, y = iris_test_pred, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  48 
## 
##  
##                  | iris_test_pred 
## iris_test_labels |     setosa | versicolor |  virginica |  Row Total | 
## -----------------|------------|------------|------------|------------|
##           setosa |         16 |          0 |          0 |         16 | 
##                  |      1.000 |      0.000 |      0.000 |      0.333 | 
##                  |      1.000 |      0.000 |      0.000 |            | 
##                  |      0.333 |      0.000 |      0.000 |            | 
## -----------------|------------|------------|------------|------------|
##       versicolor |          0 |         12 |          4 |         16 | 
##                  |      0.000 |      0.750 |      0.250 |      0.333 | 
##                  |      0.000 |      0.923 |      0.211 |            | 
##                  |      0.000 |      0.250 |      0.083 |            | 
## -----------------|------------|------------|------------|------------|
##        virginica |          0 |          1 |         15 |         16 | 
##                  |      0.000 |      0.062 |      0.938 |      0.333 | 
##                  |      0.000 |      0.077 |      0.789 |            | 
##                  |      0.000 |      0.021 |      0.312 |            | 
## -----------------|------------|------------|------------|------------|
##     Column Total |         16 |         13 |         19 |         48 | 
##                  |      0.333 |      0.271 |      0.396 |            | 
## -----------------|------------|------------|------------|------------|
## 
## 
Now we build a table of results for k = 1 to 20, to see which value of k gives the best performance:
#to create a table with the results for k = 1 to 20:
result <- matrix(0, 20, 3)
colnames(result) <- c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:20)){
  iris_test_pred =knn(train = train[1:4], test = test[1:4], 
                        cl = iris_train_labels, k = i)
  tablepredict =CrossTable(x = iris_test_labels, y = iris_test_pred, 
                             prop.chisq = FALSE)
  #the sum of the off-diagonal cells is the number of misclassified test instances
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] + tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result[i,1]=i 
  result[i,2]=wrong
  #note: the percentage is computed over all 150 iris observations, not over the 48 test instances
  result[i,3]=round(((wrong/150) * 100),2)}
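The off-diagonal sum above can also be written more compactly (an equivalent sketch):
#number misclassified = total minus the correctly classified diagonal
wrong = sum(tablepredict$t) - sum(diag(tablepredict$t))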
result
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           2.00
##  [2,]      2                             4                           2.67
##  [3,]      3                             3                           2.00
##  [4,]      4                             5                           3.33
##  [5,]      5                             4                           2.67
##  [6,]      6                             4                           2.67
##  [7,]      7                             4                           2.67
##  [8,]      8                             3                           2.00
##  [9,]      9                             4                           2.67
## [10,]     10                             4                           2.67
## [11,]     11                             3                           2.00
## [12,]     12                             4                           2.67
## [13,]     13                             4                           2.67
## [14,]     14                             4                           2.67
## [15,]     15                             4                           2.67
## [16,]     16                             4                           2.67
## [17,]     17                             4                           2.67
## [18,]     18                             4                           2.67
## [19,]     19                             4                           2.67
## [20,]     20                             3                           2.00
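To see the pattern at a glance, the misclassification counts can be plotted against k (a sketch, not in the original post):
plot(result[, 1], result[, 2], type = "b", xlab = "k",
     ylab = "Number classified incorrectly",
     main = "Misclassifications by k (raw features)")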

1.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred here holds the prediction from the last loop iteration (k = 20)
confu =confusionMatrix(iris_test_pred, iris_test_labels, positive = "setosa")
confu
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         14         1
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.828, 0.9869)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9062         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8750           0.9375
## Specificity                 1.0000            0.9688           0.9375
## Pos Pred Value              1.0000            0.9333           0.8824
## Neg Pred Value              1.0000            0.9394           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2917           0.3125
## Detection Prevalence        0.3333            0.3125           0.3542
## Balanced Accuracy           1.0000            0.9219           0.9375
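For later comparison, the overall accuracy can also be extracted programmatically from the caret object (a small sketch):
confu$overall["Accuracy"]   #returns the 0.9375 shown above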

2. MIN-MAX NORMALIZING NUMERIC DATA
2.1. Initial data analysis.
This process transforms a feature such that all of its values fall in a range between 0 and 1.
normalize =function(x){
  return((x-min(x))/(max(x)-min(x)))}
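A quick sanity check of the function on a small vector (not in the original; the values should map onto the [0, 1] range):
normalize(c(1, 5, 10))   #returns 0.0000000 0.4444444 1.0000000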

iris_nor =as.data.frame(lapply(iris[,-5], normalize))
iris_nor$Species =iris$Species
head(iris_nor)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   0.22222222   0.6250000   0.06779661  0.04166667  setosa
## 2   0.16666667   0.4166667   0.06779661  0.04166667  setosa
## 3   0.11111111   0.5000000   0.05084746  0.04166667  setosa
## 4   0.08333333   0.4583333   0.08474576  0.04166667  setosa
## 5   0.19444444   0.6666667   0.06779661  0.04166667  setosa
## 6   0.30555556   0.7916667   0.11864407  0.12500000  setosa
summary(iris_nor)
##   Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
##  Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
##  Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
##  3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
boxplot(iris_nor[1:4], las=2, col="lightblue2", main="Normalized data") 

2.2. Randomly separate the data into two sets: the training set (67%) and the test set (33%).
train_nor =iris_nor[inTrain,]
test_nor =iris_nor[-inTrain,]

2.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels_nor = train_nor[,5]
head(iris_train_labels_nor)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels_nor = test_nor[,5]
head(iris_test_labels_nor)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred_nor =knn(train = train_nor[1:4], test = test_nor[1:4], 
                        cl = iris_train_labels_nor, k = 4)

#to compare the original labels with the predicted results: 
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = iris_test_labels_nor, y = iris_test_pred_nor, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  48 
## 
##  
##                      | iris_test_pred_nor 
## iris_test_labels_nor |     setosa | versicolor |  virginica |  Row Total | 
## ---------------------|------------|------------|------------|------------|
##               setosa |         16 |          0 |          0 |         16 | 
##                      |      1.000 |      0.000 |      0.000 |      0.333 | 
##                      |      1.000 |      0.000 |      0.000 |            | 
##                      |      0.333 |      0.000 |      0.000 |            | 
## ---------------------|------------|------------|------------|------------|
##           versicolor |          0 |         12 |          4 |         16 | 
##                      |      0.000 |      0.750 |      0.250 |      0.333 | 
##                      |      0.000 |      0.923 |      0.211 |            | 
##                      |      0.000 |      0.250 |      0.083 |            | 
## ---------------------|------------|------------|------------|------------|
##            virginica |          0 |          1 |         15 |         16 | 
##                      |      0.000 |      0.062 |      0.938 |      0.333 | 
##                      |      0.000 |      0.077 |      0.789 |            | 
##                      |      0.000 |      0.021 |      0.312 |            | 
## ---------------------|------------|------------|------------|------------|
##         Column Total |         16 |         13 |         19 |         48 | 
##                      |      0.333 |      0.271 |      0.396 |            | 
## ---------------------|------------|------------|------------|------------|
## 
## 
#to create a table with the k results:
result_nor <- matrix(0, 20, 3)
colnames(result_nor) <- c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:20)){
  iris_test_pred_nor =knn(train = train_nor[1:4], test = test_nor[1:4], 
                        cl = iris_train_labels_nor, k = i)
  tablepredict =CrossTable(x = iris_test_labels_nor, y = iris_test_pred_nor, 
                             prop.chisq = FALSE)
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] + tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result_nor[i,1]=i 
  result_nor[i,2]=wrong
  result_nor[i,3]=round(((wrong/150) * 100),2)}
result_nor
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           2.00
##  [2,]      2                             5                           3.33
##  [3,]      3                             4                           2.67
##  [4,]      4                             4                           2.67
##  [5,]      5                             4                           2.67
##  [6,]      6                             5                           3.33
##  [7,]      7                             5                           3.33
##  [8,]      8                             5                           3.33
##  [9,]      9                             4                           2.67
## [10,]     10                             4                           2.67
## [11,]     11                             3                           2.00
## [12,]     12                             3                           2.00
## [13,]     13                             3                           2.00
## [14,]     14                             3                           2.00
## [15,]     15                             3                           2.00
## [16,]     16                             3                           2.00
## [17,]     17                             4                           2.67
## [18,]     18                             3                           2.00
## [19,]     19                             3                           2.00
## [20,]     20                             3                           2.00

2.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred_nor holds the prediction from the last loop iteration (k = 20)
confu_nor =confusionMatrix(iris_test_pred_nor, iris_test_labels_nor, positive = "setosa")
confu_nor
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         14         1
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.828, 0.9869)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9062         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8750           0.9375
## Specificity                 1.0000            0.9688           0.9375
## Pos Pred Value              1.0000            0.9333           0.8824
## Neg Pred Value              1.0000            0.9394           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2917           0.3125
## Detection Prevalence        0.3333            0.3125           0.3542
## Balanced Accuracy           1.0000            0.9219           0.9375

3. Z-SCORE STANDARDIZATION
3.1. Initial data analysis.
This process rescales each of the feature’s values in terms of how many standard deviations they fall above or below the mean value. The resulting value is called the z-score. Z-scores fall in an unbounded range of negative and positive numbers; unlike min-max normalized values, they have no predefined minimum and maximum.
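scale() centers each column by its mean and divides it by its standard deviation; a quick manual check (a sketch, not in the original):
#z-score of Sepal.Length computed by hand; should match scale()'s output
z_manual = (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)
all.equal(z_manual, as.numeric(scale(iris$Sepal.Length)))   #TRUE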
iris_z =as.data.frame(scale(iris[,-5]))
iris_z$Species =iris$Species
summary(iris_z)
##   Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
##  Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
##  1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
##  Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
boxplot(iris_z[1:4], las=2, col="gold", main="Z-score standardization")

3.2. Randomly separate the data into two sets: the training set (67%) and the test set (33%).
train_z =iris_z[inTrain,]
test_z =iris_z[-inTrain,]

3.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels_z = train_z[,5]
head(iris_train_labels_z)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels_z = test_z[,5]
head(iris_test_labels_z)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred_z =knn(train = train_z[1:4], test = test_z[1:4], 
                        cl = iris_train_labels_z, k = 4)

#to compare the original labels with the predicted results: 
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = iris_test_labels_z, y = iris_test_pred_z, 
                      prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  48 
## 
##  
##                    | iris_test_pred_z 
## iris_test_labels_z |     setosa | versicolor |  virginica |  Row Total | 
## -------------------|------------|------------|------------|------------|
##             setosa |         16 |          0 |          0 |         16 | 
##                    |      1.000 |      0.000 |      0.000 |      0.333 | 
##                    |      1.000 |      0.000 |      0.000 |            | 
##                    |      0.333 |      0.000 |      0.000 |            | 
## -------------------|------------|------------|------------|------------|
##         versicolor |          0 |         12 |          4 |         16 | 
##                    |      0.000 |      0.750 |      0.250 |      0.333 | 
##                    |      0.000 |      0.923 |      0.211 |            | 
##                    |      0.000 |      0.250 |      0.083 |            | 
## -------------------|------------|------------|------------|------------|
##          virginica |          0 |          1 |         15 |         16 | 
##                    |      0.000 |      0.062 |      0.938 |      0.333 | 
##                    |      0.000 |      0.077 |      0.789 |            | 
##                    |      0.000 |      0.021 |      0.312 |            | 
## -------------------|------------|------------|------------|------------|
##       Column Total |         16 |         13 |         19 |         48 | 
##                    |      0.333 |      0.271 |      0.396 |            | 
## -------------------|------------|------------|------------|------------|
## 
## 
#to create a table with the k results:
result_z <- matrix(0, 20, 3)
colnames(result_z) <- c("kvalue","Number classified incorrectly", 
                    "Percent classified incorrectly") 
for(i in c(1:20)){
  iris_test_pred_z =knn(train = train_z[1:4], test = test_z[1:4], 
                        cl = iris_train_labels_z, k = i)
  tablepredict =CrossTable(x = iris_test_labels_z, y = iris_test_pred_z, 
                             prop.chisq = FALSE)
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] + tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result_z[i,1]=i 
  result_z[i,2]=wrong
  result_z[i,3]=round(((wrong/150) * 100),2)}
result_z
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             4                           2.67
##  [2,]      2                             5                           3.33
##  [3,]      3                             4                           2.67
##  [4,]      4                             5                           3.33
##  [5,]      5                             4                           2.67
##  [6,]      6                             4                           2.67
##  [7,]      7                             3                           2.00
##  [8,]      8                             3                           2.00
##  [9,]      9                             3                           2.00
## [10,]     10                             3                           2.00
## [11,]     11                             4                           2.67
## [12,]     12                             5                           3.33
## [13,]     13                             4                           2.67
## [14,]     14                             5                           3.33
## [15,]     15                             4                           2.67
## [16,]     16                             4                           2.67
## [17,]     17                             4                           2.67
## [18,]     18                             3                           2.00
## [19,]     19                             4                           2.67
## [20,]     20                             4                           2.67

3.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred_z holds the prediction from the last loop iteration (k = 20)
confu_z =confusionMatrix(iris_test_pred_z, iris_test_labels_z, positive = "setosa")
confu_z
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         13         1
##   virginica       0          3        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8002, 0.9768)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.875           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8125           0.9375
## Specificity                 1.0000            0.9688           0.9062
## Pos Pred Value              1.0000            0.9286           0.8333
## Neg Pred Value              1.0000            0.9118           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2708           0.3125
## Detection Prevalence        0.3333            0.2917           0.3750
## Balanced Accuracy           1.0000            0.8906           0.9219

4. COMPARING RESULTS
result
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           2.00
##  [2,]      2                             4                           2.67
##  [3,]      3                             3                           2.00
##  [4,]      4                             5                           3.33
##  [5,]      5                             4                           2.67
##  [6,]      6                             4                           2.67
##  [7,]      7                             4                           2.67
##  [8,]      8                             3                           2.00
##  [9,]      9                             4                           2.67
## [10,]     10                             4                           2.67
## [11,]     11                             3                           2.00
## [12,]     12                             4                           2.67
## [13,]     13                             4                           2.67
## [14,]     14                             4                           2.67
## [15,]     15                             4                           2.67
## [16,]     16                             4                           2.67
## [17,]     17                             4                           2.67
## [18,]     18                             4                           2.67
## [19,]     19                             4                           2.67
## [20,]     20                             3                           2.00
result_nor
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           2.00
##  [2,]      2                             5                           3.33
##  [3,]      3                             4                           2.67
##  [4,]      4                             4                           2.67
##  [5,]      5                             4                           2.67
##  [6,]      6                             5                           3.33
##  [7,]      7                             5                           3.33
##  [8,]      8                             5                           3.33
##  [9,]      9                             4                           2.67
## [10,]     10                             4                           2.67
## [11,]     11                             3                           2.00
## [12,]     12                             3                           2.00
## [13,]     13                             3                           2.00
## [14,]     14                             3                           2.00
## [15,]     15                             3                           2.00
## [16,]     16                             3                           2.00
## [17,]     17                             4                           2.67
## [18,]     18                             3                           2.00
## [19,]     19                             3                           2.00
## [20,]     20                             3                           2.00
result_z
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             4                           2.67
##  [2,]      2                             5                           3.33
##  [3,]      3                             4                           2.67
##  [4,]      4                             5                           3.33
##  [5,]      5                             4                           2.67
##  [6,]      6                             4                           2.67
##  [7,]      7                             3                           2.00
##  [8,]      8                             3                           2.00
##  [9,]      9                             3                           2.00
## [10,]     10                             3                           2.00
## [11,]     11                             4                           2.67
## [12,]     12                             5                           3.33
## [13,]     13                             4                           2.67
## [14,]     14                             5                           3.33
## [15,]     15                             4                           2.67
## [16,]     16                             4                           2.67
## [17,]     17                             4                           2.67
## [18,]     18                             3                           2.00
## [19,]     19                             4                           2.67
## [20,]     20                             4                           2.67
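To compare the three approaches side by side, the misclassification columns can be combined into one table (a sketch, not in the original post; 'comparison' is a hypothetical helper):
#misclassification counts per k for each rescaling approach
comparison = data.frame(k = 1:20,
                        raw    = result[, 2],
                        minmax = result_nor[, 2],
                        zscore = result_z[, 2])
comparison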
confu
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         14         1
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.828, 0.9869)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9062         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8750           0.9375
## Specificity                 1.0000            0.9688           0.9375
## Pos Pred Value              1.0000            0.9333           0.8824
## Neg Pred Value              1.0000            0.9394           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2917           0.3125
## Detection Prevalence        0.3333            0.3125           0.3542
## Balanced Accuracy           1.0000            0.9219           0.9375
confu_nor
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         14         1
##   virginica       0          2        15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.828, 0.9869)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9062         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8750           0.9375
## Specificity                 1.0000            0.9688           0.9375
## Pos Pred Value              1.0000            0.9333           0.8824
## Neg Pred Value              1.0000            0.9394           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2917           0.3125
## Detection Prevalence        0.3333            0.3125           0.3542
## Balanced Accuracy           1.0000            0.9219           0.9375
confu_z
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         13         1
##   virginica       0          3        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8002, 0.9768)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.875           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8125           0.9375
## Specificity                 1.0000            0.9688           0.9062
## Pos Pred Value              1.0000            0.9286           0.8333
## Neg Pred Value              1.0000            0.9118           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2708           0.3125
## Detection Prevalence        0.3333            0.2917           0.3750
## Balanced Accuracy           1.0000            0.8906           0.9219
With these results we can see that the differences between the rescaling approaches are minimal. This is because the variables in the iris data set already have similar ranges of values, and hence rescaling them does not produce a clear improvement of the algorithm.

