
K-NN algorithm

The k-Nearest Neighbors (k-NN) machine learning algorithm classifies data based on the classes of its nearest neighbors.
Nearest neighbor classifiers classify unlabeled examples by assigning them the class of the most similar labeled examples. Despite its simplicity, this method is extremely powerful and has been used in computer vision applications, in prediction tasks, and even in identifying patterns in genetic data.
The k-NN algorithm gets its name from the fact that it uses information about the k nearest neighbors to classify unlabeled examples. The letter k is a variable term stating how many nearest neighbors will be used for the classification.
After choosing k, the algorithm requires a training dataset made up of examples that have previously been classified into categories.
Then, for each unlabeled record in the test dataset, k-NN identifies the k records in the training data that are nearest in similarity. The unlabeled test record is then assigned the class of the majority of those k nearest neighbors. Hence, the classification is decided by majority vote, with ties broken at random.
A distance function is used to measure the similarity between two instances. There are different ways to calculate distance, but traditionally the k-NN algorithm uses Euclidean distance, which is the “ordinary” or “straight-line” distance between two points.
For a single test sample point, the basic steps of the k-NN algorithm are as follows (a small R sketch is given after the list):
  • Locate the test point.
  • Compute the distances between the test point and all points in the training set.
  • Find k shortest distances and the corresponding training set points.
  • Vote for the result.
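To make these steps concrete, here is a minimal base R sketch of classifying a single test point. It is illustrative only: the function name my_knn_point is made up, and the rest of this post uses the knn function from the class package instead.
#minimal sketch of the four steps for one test point (illustrative only)
my_knn_point <- function(test_point, train_X, train_labels, k) {
  #step 1: the test point's feature values
  tp <- as.numeric(unlist(test_point))
  #step 2: Euclidean distance from the test point to every training point
  dists <- sqrt(rowSums(sweep(as.matrix(train_X), 2, tp)^2))
  #step 3: indices of the k shortest distances
  nearest <- order(dists)[1:k]
  #step 4: majority vote among the k nearest labels, ties broken at random
  votes <- table(train_labels[nearest])
  winners <- names(votes)[votes == max(votes)]
  sample(winners, 1)
}
#example call, classifying the first iris flower using the remaining 149:
#my_knn_point(iris[1, 1:4], iris[-1, 1:4], iris$Species[-1], k = 4)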

Step 1. Initial data analysis.
We are going to work with the well-known iris dataset:
dim(iris)
## [1] 150   5
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
nrow(iris)
## [1] 150
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

Step 2. Randomly separate the data into two sets, the training set (67%) and the test set (33%).
We will divide our data into two sets: a training dataset (67%) that will be used to build the model and a testing dataset (33%) that will be used to estimate the predictive accuracy of the model.
The function createDataPartition can be used to create random, balanced splits of the data. With this function, the random sampling occurs within each class, preserving the overall class distribution of the data.
library(caret)
set.seed(123)
inTrain <- createDataPartition(y = iris$Species, p = 0.67, list = FALSE)
str(inTrain)
##  int [1:102, 1] 2 3 4 7 9 11 13 14 15 18 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "Resample1"
train <- iris[inTrain,]
test <- iris[-inTrain,]
We have to make sure the samples have been fairly split between the training and testing datasets, and that the class proportions are similar to those of the complete dataset:
Proportions of the complete dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
Proportions of the training dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
Proportions of the testing dataset:
## 
##     setosa versicolor  virginica 
##      0.333      0.333      0.333
We see that the proportions are similar among the datasets.
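One way to compute these proportions, using the train and test data frames created above, is with prop.table:
#class proportions of the complete, training and testing datasets
round(prop.table(table(iris$Species)), 3)
round(prop.table(table(train$Species)), 3)
round(prop.table(table(test$Species)), 3)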

Step 3. Use a k-NN algorithm (k = 1:10) to predict the species.
#labels of train dataset
iris_train_labels <- train[, 5]
head(iris_train_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels <- test[, 5]
head(iris_test_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred <- knn(train = train[, 1:4], test = test[, 1:4],
                      cl = iris_train_labels, k = 4)

#to compare the original labels with the predicted results:
#install.packages("gmodels")
library(gmodels)
tablepredict <- CrossTable(x = iris_test_labels, y = iris_test_pred,
                           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  48 
## 
##  
##                  | iris_test_pred 
## iris_test_labels |     setosa | versicolor |  virginica |  Row Total | 
## -----------------|------------|------------|------------|------------|
##           setosa |         16 |          0 |          0 |         16 | 
##                  |      1.000 |      0.000 |      0.000 |      0.333 | 
##                  |      1.000 |      0.000 |      0.000 |            | 
##                  |      0.333 |      0.000 |      0.000 |            | 
## -----------------|------------|------------|------------|------------|
##       versicolor |          0 |         12 |          4 |         16 | 
##                  |      0.000 |      0.750 |      0.250 |      0.333 | 
##                  |      0.000 |      0.923 |      0.211 |            | 
##                  |      0.000 |      0.250 |      0.083 |            | 
## -----------------|------------|------------|------------|------------|
##        virginica |          0 |          1 |         15 |         16 | 
##                  |      0.000 |      0.062 |      0.938 |      0.333 | 
##                  |      0.000 |      0.077 |      0.789 |            | 
##                  |      0.000 |      0.021 |      0.312 |            | 
## -----------------|------------|------------|------------|------------|
##     Column Total |         16 |         13 |         19 |         48 | 
##                  |      0.333 |      0.271 |      0.396 |            | 
## -----------------|------------|------------|------------|------------|
## 
## 
#to create a table with the results for each k:
result <- matrix(0, 10, 3)
colnames(result) <- c("kvalue", "Number classified incorrectly",
                      "Percent classified incorrectly")
for(i in c(1:10)){
  iris_test_pred <- knn(train = train[, 1:4], test = test[, 1:4],
                        cl = iris_train_labels, k = i)
  tablepredict <- CrossTable(x = iris_test_labels, y = iris_test_pred,
                             prop.chisq = FALSE)
  #number of misclassified test observations (sum of the off-diagonal cells)
  wrong <- tablepredict$t[1,2] + tablepredict$t[1,3] + tablepredict$t[2,1] +
           tablepredict$t[2,3] + tablepredict$t[3,1] + tablepredict$t[3,2]
  result[i,1] <- i
  result[i,2] <- wrong
  #percentage of the test set (48 observations) classified incorrectly
  result[i,3] <- round(((wrong/nrow(test)) * 100), 2)
}
result
##       kvalue Number classified incorrectly Percent classified incorrectly
##  [1,]      1                             3                           6.25
##  [2,]      2                             4                           8.33
##  [3,]      3                             3                           6.25
##  [4,]      4                             5                          10.42
##  [5,]      5                             4                           8.33
##  [6,]      6                             4                           8.33
##  [7,]      7                             4                           8.33
##  [8,]      8                             3                           6.25
##  [9,]      9                             4                           8.33
## [10,]     10                             4                           8.33
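To visualise how the misclassification rate changes with k, the result matrix can be plotted, for example:
#optional: plot the percentage of the test set misclassified for each k
plot(result[, "kvalue"], result[, "Percent classified incorrectly"],
     type = "b", xlab = "k", ylab = "% of test set classified incorrectly")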

Step 4. Results
A confusion matrix will be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
#iris_test_pred holds the predictions from the last iteration of the loop above (k = 10)
confu <- confusionMatrix(iris_test_pred, iris_test_labels, positive = "setosa")
confu
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         13         1
##   virginica       0          3        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8002, 0.9768)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.875           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8125           0.9375
## Specificity                 1.0000            0.9688           0.9062
## Pos Pred Value              1.0000            0.9286           0.8333
## Neg Pred Value              1.0000            0.9118           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2708           0.3125
## Detection Prevalence        0.3333            0.2917           0.3750
## Balanced Accuracy           1.0000            0.8906           0.9219
The accuracy of the model is 91.67 %, with an error rate of 8.33 %. The model has a high accuracy, which means it performs well. We also have to check the kappa statistic, which adjusts accuracy by accounting for the possibility of a correct prediction by chance alone. The kappa statistic of the model is 0.875.
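As a quick sanity check, the kappa statistic can be recomputed by hand from the confusion matrix stored in confu, using the usual formula: observed agreement minus expected chance agreement, divided by one minus the expected chance agreement. A small sketch:
#recompute kappa by hand from the confusion matrix
tab <- confu$table
observed <- sum(diag(tab)) / sum(tab)
expected <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
(observed - expected) / (1 - expected)   #reproduces the 0.875 reported above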
The sensitivity (true positive rate) of a model measures the proportion of positive examples that were correctly classified, whereas the specificity (true negative rate) measures the proportion of negative examples that were correctly classified. For example, of the 16 true versicolor flowers in the test set, 13 were predicted correctly, giving a sensitivity of 13/16 = 0.8125 for that class.
Sensitivity - setosa: 1.00, versicolor: 0.81, virginica: 0.94
Specificity - setosa: 1.00, versicolor: 0.97, virginica: 0.91
Closely related to sensitivity and specificity are precision and recall. Precision (the positive predictive value) is defined as the proportion of predicted positive examples that are truly positive. A precise model only predicts the positive class in cases that are very likely to be positive, so when it predicts a positive class it is correct most of the time. Recall is a measure of how complete the results are: a model with high recall captures a large proportion of the positive examples.
Precision - setosa: 1.00, versicolor: 0.93, virginica: 0.83
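These per-class measures can also be read directly from the confusionMatrix object, for example:
#per-class sensitivity, specificity and precision (positive predictive value)
confu$byClass[, c("Sensitivity", "Specificity", "Pos Pred Value")]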

k-NN animation
The function knn.ani demonstrates the process of k-Nearest Neighbors classification on the 2D plane in a dynamic way.
For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random.
For a single test sample point, the basic steps are:
  1. Locate the test point
  2. Compute the distances between the test point and all points in the training set
  3. Find k shortest distances and the corresponding training set points
  4. Vote for the result (find the maximum in the table for the true classifications)
The result is a vector with the decisions made by the algorithm, together with a dynamic plot shown in the RStudio plot window.
#since the animation is drawn in 2D, only two variables can be chosen to be
#represented (here Sepal.Length and Petal.Length, columns 1 and 3)
library(animation)
animation <- knn.ani(train = train[, c(1,3)], test = test[, c(1,3)],
                     cl = iris_train_labels, k = 7)
