The k-nearest neighbors (k-NN) machine learning algorithm classifies data according to its nearest neighbors.
Nearest neighbors classifiers assign an unlabeled example the class of the most similar labeled examples. Despite its simplicity, this approach is extremely powerful and has been used for computer vision applications, prediction tasks and even identifying patterns in genetic data.
The k-NN algorithm gets its name from the fact that it uses information about the k nearest neighbors to classify unlabeled examples. The letter k is a variable specifying how many nearest neighbors are used for the classification.
After choosing k, the algorithm requires a training dataset of examples that have previously been classified into categories.
Then, for each unlabeled record in the test dataset, k-NN identifies the k records in the training data that are most similar to it. The unlabeled record is then assigned the class of the majority of those k nearest neighbors; that is, the classification is decided by majority vote, with ties broken at random.
A distance function is used to measure the similarity between two instances. There are different ways to calculate distance, but traditionally the k-NN algorithm uses Euclidean distance, the “ordinary” or “straight-line” distance between two points.
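As a quick illustration, the Euclidean distance between two feature vectors (here, the measurements of the first two iris flowers) can be computed directly:

```r
# Euclidean ("straight-line") distance between two numeric vectors
euclidean <- function(p, q) sqrt(sum((p - q)^2))

# measurements of the first two flowers in the iris dataset
euclidean(c(5.1, 3.5, 1.4, 0.2), c(4.9, 3.0, 1.4, 0.2))
## [1] 0.5385165
```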
For a single test sample point, the basic steps of the k-NN algorithm are as follows:
- Locate the test point.
- Compute the distances between the test point and all points in the training set.
- Find k shortest distances and the corresponding training set points.
- Vote for the result.
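As a sketch only (the analysis below uses knn() from the class package), these four steps can be written for a single test point as:

```r
# minimal k-NN for one test point, following the steps above
# (illustrative sketch; ties are broken by the first maximum, not at random)
knn_one <- function(train_x, train_labels, test_point, k) {
  # step 2: Euclidean distance from the test point to every training point
  dists <- apply(train_x, 1, function(row) sqrt(sum((row - test_point)^2)))
  # step 3: labels of the k nearest training points
  nearest <- train_labels[order(dists)[1:k]]
  # step 4: majority vote
  names(which.max(table(nearest)))
}

knn_one(iris[, 1:4], iris$Species, c(5.0, 3.4, 1.5, 0.2), k = 3)
## [1] "setosa"
```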
Step 1. Initial data analysis.
We are going to work with the well-known iris dataset:
dim(iris)
## [1] 150 5
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
nrow(iris)
## [1] 150
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
Step 2. Randomly split the data into a training set (67%) and a test set (33%).
We will divide our data into two sets: a training dataset (67%) that will be used to build the model and a test dataset (33%) that will be used to estimate its predictive accuracy.
The function createDataPartition from the caret package can be used to create random, balanced splits of the data. With this function the random sampling occurs within each class, preserving the overall class distribution of the data.
library(caret)
set.seed(123)
inTrain <- createDataPartition(y = iris$Species, p = 0.67, list = FALSE)
str(inTrain)
## int [1:102, 1] 2 3 4 7 9 11 13 14 15 18 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Resample1"
train <- iris[inTrain, ]
test <- iris[-inTrain, ]
We have to make sure the samples have been fairly split between the training and test datasets, and that the class proportions are similar to those of the complete dataset:
Proportions of the complete dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
Proportions of the training dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
Proportions of the testing dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
We see that the proportions are similar among the datasets.
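These proportions can be computed with prop.table(); for example, for the complete dataset (and analogously with train$Species and test$Species from the split above):

```r
# class proportions of the complete iris dataset
round(prop.table(table(iris$Species)), 3)
##
##     setosa versicolor  virginica
##      0.333      0.333      0.333
```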
Step 3. Use a k-NN algorithm (k = 1:10) to predict the species.
#labels of train dataset
iris_train_labels <- train[, 5]
head(iris_train_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels <- test[, 5]
head(iris_test_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred <- knn(train = train[1:4], test = test[1:4],
                      cl = iris_train_labels, k = 4)
#to compare the original results with the predicted results:
#install.packages("gmodels")
library(gmodels)
tablepredict <- CrossTable(x = iris_test_labels, y = iris_test_pred,
                           prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 48
##
##
## | iris_test_pred
## iris_test_labels | setosa | versicolor | virginica | Row Total |
## -----------------|------------|------------|------------|------------|
## setosa | 16 | 0 | 0 | 16 |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## -----------------|------------|------------|------------|------------|
## versicolor | 0 | 12 | 4 | 16 |
## | 0.000 | 0.750 | 0.250 | 0.333 |
## | 0.000 | 0.923 | 0.211 | |
## | 0.000 | 0.250 | 0.083 | |
## -----------------|------------|------------|------------|------------|
## virginica | 0 | 1 | 15 | 16 |
## | 0.000 | 0.062 | 0.938 | 0.333 |
## | 0.000 | 0.077 | 0.789 | |
## | 0.000 | 0.021 | 0.312 | |
## -----------------|------------|------------|------------|------------|
## Column Total | 16 | 13 | 19 | 48 |
## | 0.333 | 0.271 | 0.396 | |
## -----------------|------------|------------|------------|------------|
##
##
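From the cross table, 16 + 12 + 15 = 43 of the 48 test flowers fall on the diagonal (correct predictions), so the test accuracy at k = 4 can be read off directly:

```r
# accuracy from the diagonal counts of the cross table above
(16 + 12 + 15) / 48
## [1] 0.8958333
```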
#to create a table with the results for each k:
result <- matrix(0, 10, 3)
colnames(result) <- c("kvalue", "Number classified incorrectly",
                      "Percent classified incorrectly")
for (i in 1:10) {
  iris_test_pred <- knn(train = train[1:4], test = test[1:4],
                        cl = iris_train_labels, k = i)
  tablepredict <- CrossTable(x = iris_test_labels, y = iris_test_pred,
                             prop.chisq = FALSE)
  wrong <- tablepredict$t[1, 2] + tablepredict$t[1, 3] + tablepredict$t[2, 1] +
    tablepredict$t[2, 3] + tablepredict$t[3, 1] + tablepredict$t[3, 2]
  result[i, 1] <- i
  result[i, 2] <- wrong
  # percent of the test cases (not of all 150 flowers) classified incorrectly
  result[i, 3] <- round((wrong / nrow(test)) * 100, 2)
}
result
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 3 6.25
## [2,] 2 4 8.33
## [3,] 3 3 6.25
## [4,] 4 5 10.42
## [5,] 5 4 8.33
## [6,] 6 4 8.33
## [7,] 7 4 8.33
## [8,] 8 3 6.25
## [9,] 9 4 8.33
## [10,] 10 4 8.33
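Copying the error counts from the second column of the result table above, the best-performing values of k can be picked out with which():

```r
# misclassification counts for k = 1..10, copied from the result table above
errors <- c(3, 4, 3, 5, 4, 4, 4, 3, 4, 4)
which(errors == min(errors))   # k values with the fewest errors
## [1] 1 3 8
```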
Step 4. Results
A confusion matrix will be used to measure performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
confu <- confusionMatrix(iris_test_pred, iris_test_labels, positive = "setosa")
confu
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 13 1
## virginica 0 3 15
##
## Overall Statistics
##
## Accuracy : 0.9167
## 95% CI : (0.8002, 0.9768)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.875
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8125 0.9375
## Specificity 1.0000 0.9688 0.9062
## Pos Pred Value 1.0000 0.9286 0.8333
## Neg Pred Value 1.0000 0.9118 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2708 0.3125
## Detection Prevalence 0.3333 0.2917 0.3750
## Balanced Accuracy 1.0000 0.8906 0.9219
The accuracy of the model is 91.67 %, with an error rate of 8.33 %. This high accuracy indicates that the model performs well. We should also check the kappa statistic, which adjusts accuracy by accounting for the possibility of a correct prediction by chance alone. The kappa statistic of the model is 0.875.
The sensitivity (true positive rate) of a model measures the proportion of positive examples that were correctly classified, whereas the specificity (true negative rate) measures the proportion of negative examples that were correctly classified.
Sensitivity - setosa: 1.00, versicolor: 0.81, virginica: 0.94
Specificity - setosa: 1.00, versicolor: 0.97, virginica: 0.91
Closely related to sensitivity and specificity are precision and recall. Precision, also called the positive predictive value, is the proportion of predicted positive examples that are truly positive. A precise model only predicts the positive class in cases that are very likely to be positive, so when it predicts a positive class it is usually correct. Recall is a measure of how complete the results are: a model with high recall captures a large proportion of the positive examples.
Precision - setosa: 1.00, versicolor: 0.93, virginica: 0.83
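These per-class statistics can be recomputed by hand from the confusion-matrix counts above (rows = predictions, columns = reference):

```r
# confusion-matrix counts copied from the caret output above
cm <- matrix(c(16,  0,  0,
                0, 13,  1,
                0,  3, 15),
             nrow = 3, byrow = TRUE,
             dimnames = list(prediction = c("setosa", "versicolor", "virginica"),
                             reference  = c("setosa", "versicolor", "virginica")))

round(diag(cm) / colSums(cm), 4)   # sensitivity (recall) per class
##     setosa versicolor  virginica
##     1.0000     0.8125     0.9375
round(diag(cm) / rowSums(cm), 4)   # precision (positive predictive value) per class
##     setosa versicolor  virginica
##     1.0000     0.9286     0.8333
```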
k-NN animation
The function knn.ani from the animation package demonstrates the process of k-nearest neighbors classification on the 2D plane in a dynamic way.
For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random.
For a single test sample point, the basic steps are:
- Locate the test point
- Compute the distances between the test point and all points in the training set
- Find k shortest distances and the corresponding training set points
- Vote for the result (find the maximum in the table for the true classifications)
The result is a vector with the decisions made by the algorithm and a dynamic plot shown in the RStudio window.
#since the animation is on a 2D plane, only two variables can be chosen to be represented
library(animation)
animation <- knn.ani(train = train[, c(1, 3)], test = test[, c(1, 3)],
                     cl = iris_train_labels, k = 7)