General information about the k-NN algorithm can be found at: http://dataworldblog.blogspot.com.es/2017/08/k-nn-algorithm.html
A distance function is used to measure the similarity between two instances. There are different ways to calculate distance, but traditionally the k-NN algorithm uses Euclidean distance, which is the “ordinary” or “straight-line” distance between two points.
It has been demonstrated that the chosen distance function can affect the classification accuracy of the k-NN classifier.
The distance calculation for k-NN is heavily dependent on the measurement scale of the input features: since different inputs have different ranges of values, the inputs with a larger range of values will have a larger impact on the distance than those with a smaller range.
This could potentially cause problems for our classifier, so we have to apply normalization to rescale the features into a standard range of values.
So here we are going to compare the results obtained without rescaling the features and with two different rescaling approaches: min-max normalization and z-score standardization.
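As an illustration (this snippet is not part of the original analysis, and the two observations chosen are arbitrary), the Euclidean distance between two iris observations can be computed directly from the four numeric features:
#straight-line distance between observation 1 (setosa) and observation 51 (versicolor)
x1 = unlist(iris[1, 1:4])
x2 = unlist(iris[51, 1:4])
sqrt(sum((x1 - x2)^2))
#the same value using R's built-in dist() function
dist(iris[c(1, 51), 1:4], method = "euclidean")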
1. WITHOUT RESCALING THE FEATURES
1.1. Initial data analysis.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
Here we see that the inputs have slightly different ranges of values.
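A quick way to quantify those ranges (this check is not part of the original code):
#spread (max - min) of each numeric feature
sapply(iris[, 1:4], function(x) diff(range(x)))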
1.2. Randomly separate the data in two sets, the training set (67%) and the test set (33%).
We will divide our data into two different sets: a training data set (67%) that will be used to build the model and a testing data set (33%) that will be used to estimate the predictive accuracy of the model.
The function createDataPartition can be used to create random balanced splits of the data. Using this function, the random sampling occurs within each class, which preserves the overall class distribution of the data.
library(caret)
set.seed(123)
inTrain = createDataPartition(y = iris$Species, p = 0.67, list = FALSE)
str(inTrain)
## int [1:102, 1] 2 3 4 7 9 11 13 14 15 18 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Resample1"
train =iris[inTrain,]
test =iris[-inTrain,]
We have to make sure the samples have been fairly split between the training and testing data sets, and that the class proportions are similar to those in the complete data set:
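The proportions below can be obtained, for example, with prop.table() (one possible way; the exact code is not shown in the original):
#class proportions in the complete, training and testing data sets
round(prop.table(table(iris$Species)), 3)
round(prop.table(table(train$Species)), 3)
round(prop.table(table(test$Species)), 3)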
Proportions of the complete dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
Proportions of the training dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
Proportions of the testing dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
We see that the proportions are similar among the datasets.
1.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels =train[,5]
head(iris_train_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels =test[,5]
head(iris_test_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred =knn(train = train[1:4], test = test[1:4],
cl = iris_train_labels, k = 4)
#to compare the original results with the predicted results:
#install.packages("gmodels")
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.1
tablepredict =CrossTable(x = iris_test_labels, y = iris_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 48
##
##
## | iris_test_pred
## iris_test_labels | setosa | versicolor | virginica | Row Total |
## -----------------|------------|------------|------------|------------|
## setosa | 16 | 0 | 0 | 16 |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## -----------------|------------|------------|------------|------------|
## versicolor | 0 | 12 | 4 | 16 |
## | 0.000 | 0.750 | 0.250 | 0.333 |
## | 0.000 | 0.923 | 0.211 | |
## | 0.000 | 0.250 | 0.083 | |
## -----------------|------------|------------|------------|------------|
## virginica | 0 | 1 | 15 | 16 |
## | 0.000 | 0.062 | 0.938 | 0.333 |
## | 0.000 | 0.077 | 0.789 | |
## | 0.000 | 0.021 | 0.312 | |
## -----------------|------------|------------|------------|------------|
## Column Total | 16 | 13 | 19 | 48 |
## | 0.333 | 0.271 | 0.396 | |
## -----------------|------------|------------|------------|------------|
##
##
Table with the results for different values of k, to see which value of k gives the best results:
#to create a table of results for k = 1 to 20:
result <- matrix(0, 20, 3)
colnames(result) <- c("kvalue", "Number classified incorrectly",
                      "Percent classified incorrectly")
for(i in c(1:20)){
  iris_test_pred = knn(train = train[1:4], test = test[1:4],
                       cl = iris_train_labels, k = i)
  tablepredict = CrossTable(x = iris_test_labels, y = iris_test_pred,
                            prop.chisq = FALSE)
  #sum of the off-diagonal cells = number of misclassified test cases
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] +
          tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result[i,1] = i
  result[i,2] = wrong
  #note: the percentage is taken over the 150 rows of the full iris data set
  result[i,3] = round(((wrong/150) * 100), 2)}
result
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 3 2.00
## [2,] 2 4 2.67
## [3,] 3 3 2.00
## [4,] 4 5 3.33
## [5,] 5 4 2.67
## [6,] 6 4 2.67
## [7,] 7 4 2.67
## [8,] 8 3 2.00
## [9,] 9 4 2.67
## [10,] 10 4 2.67
## [11,] 11 3 2.00
## [12,] 12 4 2.67
## [13,] 13 4 2.67
## [14,] 14 4 2.67
## [15,] 15 4 2.67
## [16,] 16 4 2.67
## [17,] 17 4 2.67
## [18,] 18 4 2.67
## [19,] 19 4 2.67
## [20,] 20 3 2.00
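A simple plot of these results (an optional extra, not in the original analysis) makes the relationship between k and the error easier to see:
#number of misclassified test cases for each value of k
plot(result[, 1], result[, 2], type = "b",
     xlab = "k", ylab = "Number classified incorrectly",
     main = "Test errors by k (no rescaling)")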
1.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred holds the predictions from the last loop iteration (k = 20)
confu = confusionMatrix(iris_test_pred, iris_test_labels, positive = "setosa")
confu
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 14 1
## virginica 0 2 15
##
## Overall Statistics
##
## Accuracy : 0.9375
## 95% CI : (0.828, 0.9869)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9062
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8750 0.9375
## Specificity 1.0000 0.9688 0.9375
## Pos Pred Value 1.0000 0.9333 0.8824
## Neg Pred Value 1.0000 0.9394 0.9677
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2917 0.3125
## Detection Prevalence 0.3333 0.3125 0.3542
## Balanced Accuracy 1.0000 0.9219 0.9375
2. MIN-MAX NORMALIZING NUMERIC DATA
2.1 Initial Data Analysis
This process transforms a feature such that all of its values fall in a range between 0 and 1.
normalize =function(x){
return((x-min(x))/(max(x)-min(x)))}
iris_nor =as.data.frame(lapply(iris[,-5], normalize))
iris_nor$Species =iris$Species
head(iris_nor)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 0.22222222 0.6250000 0.06779661 0.04166667 setosa
## 2 0.16666667 0.4166667 0.06779661 0.04166667 setosa
## 3 0.11111111 0.5000000 0.05084746 0.04166667 setosa
## 4 0.08333333 0.4583333 0.08474576 0.04166667 setosa
## 5 0.19444444 0.6666667 0.06779661 0.04166667 setosa
## 6 0.30555556 0.7916667 0.11864407 0.12500000 setosa
summary(iris_nor)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017 1st Qu.:0.08333
## Median :0.4167 Median :0.4167 Median :0.5678 Median :0.50000
## Mean :0.4287 Mean :0.4406 Mean :0.4675 Mean :0.45806
## 3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949 3rd Qu.:0.70833
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
boxplot(iris_nor[1:4], las=2, col="lightblue2", main="Normalized data")
2.2. Randomly separate the data in two sets, the training set (67%) and the test set (33%).
train_nor =iris_nor[inTrain,]
test_nor =iris_nor[-inTrain,]
2.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels_nor = train_nor[,5]
head(iris_train_labels_nor)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels_nor = test_nor[,5]
head(iris_test_labels_nor)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred_nor =knn(train = train_nor[1:4], test = test_nor[1:4],
cl = iris_train_labels_nor, k = 4)
#to compare the original results with the predicted results:
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = iris_test_labels_nor, y = iris_test_pred_nor,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 48
##
##
## | iris_test_pred_nor
## iris_test_labels_nor | setosa | versicolor | virginica | Row Total |
## ---------------------|------------|------------|------------|------------|
## setosa | 16 | 0 | 0 | 16 |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## ---------------------|------------|------------|------------|------------|
## versicolor | 0 | 12 | 4 | 16 |
## | 0.000 | 0.750 | 0.250 | 0.333 |
## | 0.000 | 0.923 | 0.211 | |
## | 0.000 | 0.250 | 0.083 | |
## ---------------------|------------|------------|------------|------------|
## virginica | 0 | 1 | 15 | 16 |
## | 0.000 | 0.062 | 0.938 | 0.333 |
## | 0.000 | 0.077 | 0.789 | |
## | 0.000 | 0.021 | 0.312 | |
## ---------------------|------------|------------|------------|------------|
## Column Total | 16 | 13 | 19 | 48 |
## | 0.333 | 0.271 | 0.396 | |
## ---------------------|------------|------------|------------|------------|
##
##
#to create a table with the k results:
result_nor <- matrix(0, 20, 3)
colnames(result_nor) <- c("kvalue", "Number classified incorrectly",
                          "Percent classified incorrectly")
for(i in c(1:20)){
  iris_test_pred_nor = knn(train = train_nor[1:4], test = test_nor[1:4],
                           cl = iris_train_labels_nor, k = i)
  tablepredict = CrossTable(x = iris_test_labels_nor, y = iris_test_pred_nor,
                            prop.chisq = FALSE)
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] +
          tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result_nor[i,1] = i
  result_nor[i,2] = wrong
  result_nor[i,3] = round(((wrong/150) * 100), 2)}
result_nor
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 3 2.00
## [2,] 2 5 3.33
## [3,] 3 4 2.67
## [4,] 4 4 2.67
## [5,] 5 4 2.67
## [6,] 6 5 3.33
## [7,] 7 5 3.33
## [8,] 8 5 3.33
## [9,] 9 4 2.67
## [10,] 10 4 2.67
## [11,] 11 3 2.00
## [12,] 12 3 2.00
## [13,] 13 3 2.00
## [14,] 14 3 2.00
## [15,] 15 3 2.00
## [16,] 16 3 2.00
## [17,] 17 4 2.67
## [18,] 18 3 2.00
## [19,] 19 3 2.00
## [20,] 20 3 2.00
2.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred_nor holds the predictions from the last loop iteration (k = 20)
confu_nor = confusionMatrix(iris_test_pred_nor, iris_test_labels_nor, positive = "setosa")
confu_nor
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 14 1
## virginica 0 2 15
##
## Overall Statistics
##
## Accuracy : 0.9375
## 95% CI : (0.828, 0.9869)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9062
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8750 0.9375
## Specificity 1.0000 0.9688 0.9375
## Pos Pred Value 1.0000 0.9333 0.8824
## Neg Pred Value 1.0000 0.9394 0.9677
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2917 0.3125
## Detection Prevalence 0.3333 0.3125 0.3542
## Balanced Accuracy 1.0000 0.9219 0.9375
3. Z-SCORE STANDARDIZATION
3.1. Initial Data Analysis
This process rescales each of the feature’s values in terms of how many standard deviations they fall above or below the mean value. The resulting value is called a z-score. Z-scores fall in an unbounded range of negative and positive numbers and, unlike the normalized values, have no predefined minimum and maximum.
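For reference, here is a minimal hand-rolled version of the same transformation (not in the original code); with its default arguments, scale() centers each column by its mean and divides by its standard deviation:
#z-score written by hand, equivalent to scale() with default arguments
zscore = function(x){
return((x - mean(x)) / sd(x))}
head(as.data.frame(lapply(iris[,-5], zscore)))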
iris_z =as.data.frame(scale(iris[,-5]))
iris_z$Species =iris$Species
summary(iris_z)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :-1.86378 Min. :-2.4258 Min. :-1.5623 Min. :-1.4422
## 1st Qu.:-0.89767 1st Qu.:-0.5904 1st Qu.:-1.2225 1st Qu.:-1.1799
## Median :-0.05233 Median :-0.1315 Median : 0.3354 Median : 0.1321
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67225 3rd Qu.: 0.5567 3rd Qu.: 0.7602 3rd Qu.: 0.7880
## Max. : 2.48370 Max. : 3.0805 Max. : 1.7799 Max. : 1.7064
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
boxplot(iris_z[1:4], las=2, col="gold", main="Z-score standardization")
3.2. Randomly separate the data in two sets, the training set (67%) and the test set (33%).
train_z =iris_z[inTrain,]
test_z =iris_z[-inTrain,]
3.3. Use a k-NN algorithm (k = 1:20) to predict the species.
#labels of train dataset
iris_train_labels_z = train_z[,5]
head(iris_train_labels_z)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels_z = test_z[,5]
head(iris_test_labels_z)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred_z =knn(train = train_z[1:4], test = test_z[1:4],
cl = iris_train_labels_z, k = 4)
#to compare the original results with the predicted results:
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = iris_test_labels_z, y = iris_test_pred_z,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 48
##
##
## | iris_test_pred_z
## iris_test_labels_z | setosa | versicolor | virginica | Row Total |
## -------------------|------------|------------|------------|------------|
## setosa | 16 | 0 | 0 | 16 |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## -------------------|------------|------------|------------|------------|
## versicolor | 0 | 12 | 4 | 16 |
## | 0.000 | 0.750 | 0.250 | 0.333 |
## | 0.000 | 0.923 | 0.211 | |
## | 0.000 | 0.250 | 0.083 | |
## -------------------|------------|------------|------------|------------|
## virginica | 0 | 1 | 15 | 16 |
## | 0.000 | 0.062 | 0.938 | 0.333 |
## | 0.000 | 0.077 | 0.789 | |
## | 0.000 | 0.021 | 0.312 | |
## -------------------|------------|------------|------------|------------|
## Column Total | 16 | 13 | 19 | 48 |
## | 0.333 | 0.271 | 0.396 | |
## -------------------|------------|------------|------------|------------|
##
##
#to create a table with the k results:
result_z <- matrix(0, 20, 3)
colnames(result_z) <- c("kvalue", "Number classified incorrectly",
                        "Percent classified incorrectly")
for(i in c(1:20)){
  iris_test_pred_z = knn(train = train_z[1:4], test = test_z[1:4],
                         cl = iris_train_labels_z, k = i)
  tablepredict = CrossTable(x = iris_test_labels_z, y = iris_test_pred_z,
                            prop.chisq = FALSE)
  wrong = tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] +
          tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result_z[i,1] = i
  result_z[i,2] = wrong
  result_z[i,3] = round(((wrong/150) * 100), 2)}
result_z
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 4 2.67
## [2,] 2 5 3.33
## [3,] 3 4 2.67
## [4,] 4 5 3.33
## [5,] 5 4 2.67
## [6,] 6 4 2.67
## [7,] 7 3 2.00
## [8,] 8 3 2.00
## [9,] 9 3 2.00
## [10,] 10 3 2.00
## [11,] 11 4 2.67
## [12,] 12 5 3.33
## [13,] 13 4 2.67
## [14,] 14 5 3.33
## [15,] 15 4 2.67
## [16,] 16 4 2.67
## [17,] 17 4 2.67
## [18,] 18 3 2.00
## [19,] 19 4 2.67
## [20,] 20 4 2.67
3.4. Results
A confusion matrix is going to be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#note: iris_test_pred_z holds the predictions from the last loop iteration (k = 20)
confu_z = confusionMatrix(iris_test_pred_z, iris_test_labels_z, positive = "setosa")
confu_z
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 13 1
## virginica 0 3 15
##
## Overall Statistics
##
## Accuracy : 0.9167
## 95% CI : (0.8002, 0.9768)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.875
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8125 0.9375
## Specificity 1.0000 0.9688 0.9062
## Pos Pred Value 1.0000 0.9286 0.8333
## Neg Pred Value 1.0000 0.9118 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2708 0.3125
## Detection Prevalence 0.3333 0.2917 0.3750
## Balanced Accuracy 1.0000 0.8906 0.9219
4. COMPARING THE RESULTS
result
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 3 2.00
## [2,] 2 4 2.67
## [3,] 3 3 2.00
## [4,] 4 5 3.33
## [5,] 5 4 2.67
## [6,] 6 4 2.67
## [7,] 7 4 2.67
## [8,] 8 3 2.00
## [9,] 9 4 2.67
## [10,] 10 4 2.67
## [11,] 11 3 2.00
## [12,] 12 4 2.67
## [13,] 13 4 2.67
## [14,] 14 4 2.67
## [15,] 15 4 2.67
## [16,] 16 4 2.67
## [17,] 17 4 2.67
## [18,] 18 4 2.67
## [19,] 19 4 2.67
## [20,] 20 3 2.00
result_nor
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 3 2.00
## [2,] 2 5 3.33
## [3,] 3 4 2.67
## [4,] 4 4 2.67
## [5,] 5 4 2.67
## [6,] 6 5 3.33
## [7,] 7 5 3.33
## [8,] 8 5 3.33
## [9,] 9 4 2.67
## [10,] 10 4 2.67
## [11,] 11 3 2.00
## [12,] 12 3 2.00
## [13,] 13 3 2.00
## [14,] 14 3 2.00
## [15,] 15 3 2.00
## [16,] 16 3 2.00
## [17,] 17 4 2.67
## [18,] 18 3 2.00
## [19,] 19 3 2.00
## [20,] 20 3 2.00
result_z
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 4 2.67
## [2,] 2 5 3.33
## [3,] 3 4 2.67
## [4,] 4 5 3.33
## [5,] 5 4 2.67
## [6,] 6 4 2.67
## [7,] 7 3 2.00
## [8,] 8 3 2.00
## [9,] 9 3 2.00
## [10,] 10 3 2.00
## [11,] 11 4 2.67
## [12,] 12 5 3.33
## [13,] 13 4 2.67
## [14,] 14 5 3.33
## [15,] 15 4 2.67
## [16,] 16 4 2.67
## [17,] 17 4 2.67
## [18,] 18 3 2.00
## [19,] 19 4 2.67
## [20,] 20 4 2.67
confu
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 14 1
## virginica 0 2 15
##
## Overall Statistics
##
## Accuracy : 0.9375
## 95% CI : (0.828, 0.9869)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9062
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8750 0.9375
## Specificity 1.0000 0.9688 0.9375
## Pos Pred Value 1.0000 0.9333 0.8824
## Neg Pred Value 1.0000 0.9394 0.9677
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2917 0.3125
## Detection Prevalence 0.3333 0.3125 0.3542
## Balanced Accuracy 1.0000 0.9219 0.9375
confu_nor
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 14 1
## virginica 0 2 15
##
## Overall Statistics
##
## Accuracy : 0.9375
## 95% CI : (0.828, 0.9869)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9062
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8750 0.9375
## Specificity 1.0000 0.9688 0.9375
## Pos Pred Value 1.0000 0.9333 0.8824
## Neg Pred Value 1.0000 0.9394 0.9677
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2917 0.3125
## Detection Prevalence 0.3333 0.3125 0.3542
## Balanced Accuracy 1.0000 0.9219 0.9375
confu_z
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 13 1
## virginica 0 3 15
##
## Overall Statistics
##
## Accuracy : 0.9167
## 95% CI : (0.8002, 0.9768)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.875
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8125 0.9375
## Specificity 1.0000 0.9688 0.9062
## Pos Pred Value 1.0000 0.9286 0.8333
## Neg Pred Value 1.0000 0.9118 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2708 0.3125
## Detection Prevalence 0.3333 0.2917 0.3750
## Balanced Accuracy 1.0000 0.8906 0.9219
With these results we can see that the differences between the three approaches are minimal. This is because the variables in the iris data set already have similar ranges of values, and hence rescaling them does not produce a clear improvement in the algorithm's performance.
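As a final side-by-side check (a small sketch, not part of the original write-up), the overall accuracies can be extracted directly from the three confusionMatrix objects:
#overall accuracy of each approach, side by side
c(raw = confu$overall["Accuracy"],
  min_max = confu_nor$overall["Accuracy"],
  z_score = confu_z$overall["Accuracy"])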