The k-nearest neighbors (k-NN) machine learning algorithm classifies data according to its nearest neighbors.
Nearest neighbors classifiers assign an unlabeled example the class of the most similar labeled examples. Despite its simplicity, this approach is extremely powerful and has been used for computer vision applications, prediction tasks and even identifying patterns in genetic data.
The k-NN algorithm gets its name from the fact that it uses information about the k nearest neighbors to classify unlabeled examples. The letter k is a variable specifying how many nearest neighbors are used for the classification.
After choosing k, the algorithm requires a training dataset of examples that have previously been classified into categories.
Then, for each unlabeled record in the test dataset, k-NN identifies the k records in the training data that are most similar to it. The unlabeled record is then assigned the class of the majority of those k nearest neighbors; that is, the classification is decided by majority vote, with ties broken at random.
A distance function is used to measure the similarity between two instances. There are different ways to calculate distance, but traditionally the k-NN algorithm uses Euclidean distance, the “ordinary” or “straight-line” distance between two points.
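As a quick illustration, the Euclidean distance between two feature vectors (here, the measurements of the first two iris flowers) can be computed directly:

```r
# Euclidean ("straight-line") distance between two numeric vectors
euclidean <- function(p, q) sqrt(sum((p - q)^2))

# measurements of the first two flowers in the iris dataset
euclidean(c(5.1, 3.5, 1.4, 0.2), c(4.9, 3.0, 1.4, 0.2))
## [1] 0.5385165
```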
For a single test sample point, the basic steps of the k-NN algorithm are as follows:
- Locate the test point.
- Compute the distances between the test point and all points in the training set.
- Find k shortest distances and the corresponding training set points.
- Vote for the result.
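As a sketch only (the analysis below uses knn() from the class package), these four steps can be written for a single test point as:

```r
# minimal k-NN for one test point, following the steps above
# (illustrative sketch; ties are broken by the first maximum, not at random)
knn_one <- function(train_x, train_labels, test_point, k) {
  # step 2: Euclidean distance from the test point to every training point
  dists <- apply(train_x, 1, function(row) sqrt(sum((row - test_point)^2)))
  # step 3: labels of the k nearest training points
  nearest <- train_labels[order(dists)[1:k]]
  # step 4: majority vote
  names(which.max(table(nearest)))
}

knn_one(iris[, 1:4], iris$Species, c(5.0, 3.4, 1.5, 0.2), k = 3)
## [1] "setosa"
```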
Step 1. Initial data analysis.
We are going to work with the well-known iris dataset:
dim(iris)
## [1] 150 5
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
nrow(iris)
## [1] 150
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
Step 2. Randomly split the data into a training set (67%) and a test set (33%).
We will divide our data into two sets: a training dataset (67%) that will be used to build the model and a test dataset (33%) that will be used to estimate its predictive accuracy.
The function createDataPartition from the caret package can be used to create random, balanced splits of the data. With this function the random sampling occurs within each class, preserving the overall class distribution of the data.
library(caret)
set.seed(123)
inTrain <- createDataPartition(y = iris$Species, p = 0.67, list = FALSE)
str(inTrain)
## int [1:102, 1] 2 3 4 7 9 11 13 14 15 18 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Resample1"
train <- iris[inTrain, ]
test <- iris[-inTrain, ]
We have to make sure the samples have been fairly split between the training and test datasets, and that the class proportions are similar to those of the complete dataset:
Proportions of the complete dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
Proportions of the training dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
Proportions of the testing dataset:
##
## setosa versicolor virginica
## 0.333 0.333 0.333
We see that the proportions are similar among the datasets.
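These proportions can be computed with prop.table(); for example, for the complete dataset (and analogously with train$Species and test$Species from the split above):

```r
# class proportions of the complete iris dataset
round(prop.table(table(iris$Species)), 3)
##
##     setosa versicolor  virginica
##      0.333      0.333      0.333
```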
Step 3. Use a k-NN algorithm (k = 1:10) to predict the species.
#labels of train dataset
iris_train_labels <- train[, 5]
head(iris_train_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#labels of test dataset
iris_test_labels <- test[, 5]
head(iris_test_labels)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
#to run the algorithm:
#install.packages("class")
library(class)
iris_test_pred <- knn(train = train[1:4], test = test[1:4],
                      cl = iris_train_labels, k = 4)
#to compare the original results with the predicted results:
#install.packages("gmodels")
library(gmodels)
tablepredict <- CrossTable(x = iris_test_labels, y = iris_test_pred,
                           prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 48
##
##
## | iris_test_pred
## iris_test_labels | setosa | versicolor | virginica | Row Total |
## -----------------|------------|------------|------------|------------|
## setosa | 16 | 0 | 0 | 16 |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## -----------------|------------|------------|------------|------------|
## versicolor | 0 | 12 | 4 | 16 |
## | 0.000 | 0.750 | 0.250 | 0.333 |
## | 0.000 | 0.923 | 0.211 | |
## | 0.000 | 0.250 | 0.083 | |
## -----------------|------------|------------|------------|------------|
## virginica | 0 | 1 | 15 | 16 |
## | 0.000 | 0.062 | 0.938 | 0.333 |
## | 0.000 | 0.077 | 0.789 | |
## | 0.000 | 0.021 | 0.312 | |
## -----------------|------------|------------|------------|------------|
## Column Total | 16 | 13 | 19 | 48 |
## | 0.333 | 0.271 | 0.396 | |
## -----------------|------------|------------|------------|------------|
##
##
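From the cross table, 16 + 12 + 15 = 43 of the 48 test flowers fall on the diagonal (correct predictions), so the test accuracy at k = 4 can be read off directly:

```r
# accuracy from the diagonal counts of the cross table above
(16 + 12 + 15) / 48
## [1] 0.8958333
```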
#to create a table with the results for each k:
result <- matrix(0, 10, 3)
colnames(result) <- c("kvalue", "Number classified incorrectly",
                      "Percent classified incorrectly")
for (i in 1:10) {
  iris_test_pred <- knn(train = train[1:4], test = test[1:4],
                        cl = iris_train_labels, k = i)
  tablepredict <- CrossTable(x = iris_test_labels, y = iris_test_pred,
                             prop.chisq = FALSE)
  wrong <- tablepredict$t[1, 2] + tablepredict$t[1, 3] + tablepredict$t[2, 1] +
    tablepredict$t[2, 3] + tablepredict$t[3, 1] + tablepredict$t[3, 2]
  result[i, 1] <- i
  result[i, 2] <- wrong
  # percent of the test cases (not of all 150 flowers) classified incorrectly
  result[i, 3] <- round((wrong / nrow(test)) * 100, 2)
}
result
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 3 6.25
## [2,] 2 4 8.33
## [3,] 3 3 6.25
## [4,] 4 5 10.42
## [5,] 5 4 8.33
## [6,] 6 4 8.33
## [7,] 7 4 8.33
## [8,] 8 3 6.25
## [9,] 9 4 8.33
## [10,] 10 4 8.33
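Copying the error counts from the second column of the result table above, the best-performing values of k can be picked out with which():

```r
# misclassification counts for k = 1..10, copied from the result table above
errors <- c(3, 4, 3, 5, 4, 4, 4, 3, 4, 4)
which(errors == min(errors))   # k values with the fewest errors
## [1] 1 3 8
```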
Step 4. Results
A confusion matrix will be used to measure performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class will be “setosa”.
#install.packages("caret")
library(caret)
confu <- confusionMatrix(iris_test_pred, iris_test_labels, positive = "setosa")
confu
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 13 1
## virginica 0 3 15
##
## Overall Statistics
##
## Accuracy : 0.9167
## 95% CI : (0.8002, 0.9768)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.875
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8125 0.9375
## Specificity 1.0000 0.9688 0.9062
## Pos Pred Value 1.0000 0.9286 0.8333
## Neg Pred Value 1.0000 0.9118 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2708 0.3125
## Detection Prevalence 0.3333 0.2917 0.3750
## Balanced Accuracy 1.0000 0.8906 0.9219
The accuracy of the model is 91.67 %, with an error rate of 8.33 %. This high accuracy indicates that the model performs well. We should also check the kappa statistic, which adjusts accuracy by accounting for the possibility of a correct prediction by chance alone. The kappa statistic of the model is 0.875.
The sensitivity (true positive rate) of a model measures the proportion of positive examples that were correctly classified, whereas the specificity (true negative rate) measures the proportion of negative examples that were correctly classified.
Sensitivity - setosa: 1.00, versicolor: 0.81, virginica: 0.94
Specificity - setosa: 1.00, versicolor: 0.97, virginica: 0.91
Closely related to sensitivity and specificity are precision and recall. Precision, also called the positive predictive value, is the proportion of predicted positive examples that are truly positive. A precise model only predicts the positive class in cases that are very likely to be positive, so when it predicts a positive class it is usually correct. Recall is a measure of how complete the results are: a model with high recall captures a large proportion of the positive examples.
Precision - setosa: 1.00, versicolor: 0.93, virginica: 0.83
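These per-class statistics can be recomputed by hand from the confusion-matrix counts above (rows = predictions, columns = reference):

```r
# confusion-matrix counts copied from the caret output above
cm <- matrix(c(16,  0,  0,
                0, 13,  1,
                0,  3, 15),
             nrow = 3, byrow = TRUE,
             dimnames = list(prediction = c("setosa", "versicolor", "virginica"),
                             reference  = c("setosa", "versicolor", "virginica")))

round(diag(cm) / colSums(cm), 4)   # sensitivity (recall) per class
##     setosa versicolor  virginica
##     1.0000     0.8125     0.9375
round(diag(cm) / rowSums(cm), 4)   # precision (positive predictive value) per class
##     setosa versicolor  virginica
##     1.0000     0.9286     0.8333
```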
k-NN animation
The function knn.ani from the animation package demonstrates the process of k-nearest neighbors classification on the 2D plane in a dynamic way.
For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random.
For a single test sample point, the basic steps are:
- Locate the test point
- Compute the distances between the test point and all points in the training set
- Find k shortest distances and the corresponding training set points
- Vote for the result (find the maximum in the table for the true classifications)
The result is a vector with the decisions made by the algorithm and a dynamic plot shown in the RStudio window.
#since the animation is on a 2D plane, only two variables can be chosen to be represented
library(animation)
animation <- knn.ani(train = train[, c(1, 3)], test = test[, c(1, 3)],
                     cl = iris_train_labels, k = 7)