k-NN using breast cancer data from Wisconsin (https://data.world/health/breast-cancer-wisconsin/workspace/file?filename=breast-cancer-wisconsin-data%2Fdata.csv)
General information about the k-NN algorithm can be found at: http://dataworldblog.blogspot.com.es/2017/08/k-nn-algorithm.html
A distance function is used to measure the similarity between two instances. There are different ways to calculate distance, but traditionally the k-NN algorithm uses Euclidean distance, which is the “ordinary” or “straight-line” distance between two points.
It has been demonstrated that the chosen distance function can affect the classification accuracy of the k-NN classifier.
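As a small illustration of how the choice of distance function changes the numbers k-NN works with, the sketch below computes the Euclidean and Manhattan distances between two made-up points, first by hand and then with base R's dist(). The points p and q are hypothetical and are not taken from the data set.

```r
# Two hypothetical instances with two features each (made-up values)
p <- c(1, 2)
q <- c(4, 6)

# Euclidean ("straight-line") distance: square root of the summed squared differences
euclidean <- sqrt(sum((p - q)^2))

# Manhattan distance: sum of the absolute differences
manhattan <- sum(abs(p - q))

# The same values via base R's dist(), which works on the rows of a matrix
pts <- rbind(p, q)
euclidean_dist <- as.numeric(dist(pts, method = "euclidean"))
manhattan_dist <- as.numeric(dist(pts, method = "manhattan"))

print(c(euclidean = euclidean, manhattan = manhattan))
```

The two functions give different distances for the same pair of points, so they can rank neighbours differently.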
The distance calculation for k-NN is heavily dependent on the measurement scale of the input features. Since different inputs have different ranges of values, inputs with a larger range will have a larger impact than those with a smaller range.
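To make this concrete, here is a minimal sketch with three made-up instances described by two features on very different scales. The values only loosely resemble area_mean and smoothness_mean and are assumptions for illustration. On the raw values, the distance is decided almost entirely by the large-scale feature.

```r
# Hypothetical instances: 'area' is in the hundreds, 'smoothness' is around 0.1
toy <- data.frame(
  area       = c(500, 510, 700),
  smoothness = c(0.05, 0.16, 0.05)
)
rownames(toy) <- c("a", "b", "c")

# Euclidean distances on the raw values
raw_d <- as.matrix(dist(toy))
print(round(raw_d, 3))

# Share of the squared a-b distance contributed by each feature:
# the area difference (10) swamps the smoothness difference (0.11)
contrib <- (unlist(toy["a", ]) - unlist(toy["b", ]))^2
share <- contrib / sum(contrib)
print(round(share, 5))
```

Here a and b sit at opposite ends of the smoothness values, yet smoothness contributes almost nothing to their distance.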
This could potentially cause problems for our classifier, so we have to apply normalization to rescale the features into a standard range of values. We have seen that with the iris dataset there is no difference between using different distance functions, since the variables in the iris dataset have similar ranges: http://dataworldblog.blogspot.com/2017/08/distances-for-k-nn-algorithm.html
Here we will use a data set whose variables have different ranges, and compare the results of k-NN on the raw features with the results after rescaling them, since rescaling changes the distances the algorithm computes.
The breast cancer data set records, for each tumor, the diagnosis and 30 numeric features. We want to use the k-NN algorithm to predict whether the tumor is benign or malignant (diagnosis), taking into account the 30 features of the tumor (radius_mean, texture_mean, etc.).
1. WITHOUT RESCALING THE FEATURES
1.1. Initial data analysis
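#stringsAsFactors = TRUE keeps diagnosis as a factor (the default changed to FALSE in R 4.0)
breastcancer = read.csv("C:/Users/ester/Downloads/breast-cancer-wisconsin-data-data.csv", sep = ",", dec = ".", header = TRUE, stringsAsFactors = TRUE)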
dim(breastcancer)
## [1] 569 33
head(breastcancer)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## area_se smoothness_se compactness_se concavity_se concave.points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst
## 1 0.03003 0.006193 25.38 17.33
## 2 0.01389 0.003532 24.99 23.41
## 3 0.02250 0.004571 23.57 25.53
## 4 0.05963 0.009208 14.91 26.50
## 5 0.01756 0.005115 22.54 16.67
## 6 0.02165 0.005082 15.47 23.75
## perimeter_worst area_worst smoothness_worst compactness_worst
## 1 184.60 2019.0 0.1622 0.6656
## 2 158.80 1956.0 0.1238 0.1866
## 3 152.50 1709.0 0.1444 0.4245
## 4 98.87 567.7 0.2098 0.8663
## 5 152.20 1575.0 0.1374 0.2050
## 6 103.40 741.6 0.1791 0.5249
## concavity_worst concave.points_worst symmetry_worst
## 1 0.7119 0.2654 0.4601
## 2 0.2416 0.1860 0.2750
## 3 0.4504 0.2430 0.3613
## 4 0.6869 0.2575 0.6638
## 5 0.4000 0.1625 0.2364
## 6 0.5355 0.1741 0.3985
## fractal_dimension_worst X
## 1 0.11890 NA
## 2 0.08902 NA
## 3 0.08758 NA
## 4 0.17300 NA
## 5 0.07678 NA
## 6 0.12440 NA
summary(breastcancer)
## id diagnosis radius_mean texture_mean
## Min. : 8670 B:357 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:212 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619
## Median :0.06154 Median :0.03350 Median :0.1792
## Mean :0.08880 Mean :0.04892 Mean :0.1812
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957
## Max. :0.42680 Max. :0.20120 Max. :0.3040
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
## area_se smoothness_se compactness_se concavity_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589
## Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst X
## Min. :0.1565 Min. :0.05504 Mode:logical
## 1st Qu.:0.2504 1st Qu.:0.07146 NA's:569
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
breastcan = breastcancer[-c(1,33)] #we don't need the first and last columns
summary(breastcan)
## diagnosis radius_mean texture_mean perimeter_mean
## B:357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## M:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median :0.006380 Median :0.020450 Median :0.02589
## Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
boxplot(breastcan, col = 'gold')
We see that the inputs have different ranges of values.
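The boxplot shows this visually, and the same information can be read numerically. A minimal sketch, assuming a small stand-in data frame rather than the full breastcan object: toy_features is hypothetical, and its values are the minimum, median and maximum of three columns taken from the summary above.

```r
# Stand-in for three columns of breastcan, using min/median/max from the summary above
toy_features <- data.frame(
  area_mean       = c(143.5, 551.1, 2501.0),
  radius_mean     = c(6.981, 13.370, 28.110),
  smoothness_mean = c(0.05263, 0.09587, 0.16340)
)

# Width of the range (max - min) of every feature
feature_range <- sapply(toy_features, function(x) diff(range(x)))
print(feature_range)

# How many times wider the widest feature is than the narrowest one
range_ratio <- max(feature_range) / min(feature_range)
print(range_ratio)
```

On the real data the same call would be sapply(breastcan[-1], function(x) diff(range(x))), dropping the diagnosis column first.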
1.2. Randomly separate the data into two sets, the training set (67%) and the test set (33%).
We will divide our data into two different sets: a training dataset (67%) that will be used to build the model and a testing dataset (33%) that will be used to estimate the predictive accuracy of the model.
The function createDataPartition can be used to create random balanced splits of the data. With this function the random sampling occurs within each class, which preserves the overall class distribution of the data.
library(caret)
set.seed(123)
inTrain =createDataPartition(y = breastcan$diagnosis,p = 0.67,list = FALSE)
str(inTrain)
## int [1:383, 1] 2 4 5 8 9 10 11 12 13 14 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Resample1"
train =breastcan[inTrain,]
test =breastcan[-inTrain,]
We have to make sure the samples have been fairly split between the training and testing dataset, and that the proportion is similar in the complete dataset:
Proportions of the complete dataset:
##
## B M
## 0.627 0.373
Proportions of the training dataset:
##
## B M
## 0.627 0.373
Proportions of the testing dataset:
##
## B M
## 0.629 0.371
We see that the proportions are similar among the datasets.
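The proportions above are class frequencies, which can be computed with table() and prop.table(). The sketch below uses a stand-in label vector with the same class counts as the data set (357 B, 212 M) and a simple base-R stratified sample in place of createDataPartition. The helper name stratified_idx is hypothetical, and its behaviour is only an approximation of what caret does.

```r
set.seed(123)

# Stand-in for breastcan$diagnosis: same class counts as the real data
labels <- factor(c(rep("B", 357), rep("M", 212)))

# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_idx <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

in_train <- stratified_idx(labels, 0.67)

# Class proportions in the complete, training and testing sets
prop_all   <- round(prop.table(table(labels)), 3)
prop_train <- round(prop.table(table(labels[in_train])), 3)
prop_test  <- round(prop.table(table(labels[-in_train])), 3)

print(prop_all)
print(prop_train)
print(prop_test)
```

With the objects from this post, the equivalent calls would be round(prop.table(table(breastcan$diagnosis)), 3), and the same for train$diagnosis and test$diagnosis.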
1.3. Use a k-NN algorithm (k = 1:15) to predict if the cancer is benign or malignant.
#labels of train dataset
breast_train_labels =train[,1]
head(breast_train_labels)
## [1] M M M M M M
## Levels: B M
#labels of test dataset
breast_test_labels =test[,1]
head(breast_test_labels)
## [1] M M M M M M
## Levels: B M
#to run the algorithm:
#install.packages("class")
library(class)
breast_test_pred =knn(train = train[2:31], test = test[2:31],
cl = breast_train_labels, k = 4)
#to compare the original results with the predicted results:
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = breast_test_labels, y = breast_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 186
##
##
## | breast_test_pred
## breast_test_labels | B | M | Row Total |
## -------------------|-----------|-----------|-----------|
## B | 111 | 6 | 117 |
## | 0.949 | 0.051 | 0.629 |
## | 0.941 | 0.088 | |
## | 0.597 | 0.032 | |
## -------------------|-----------|-----------|-----------|
## M | 7 | 62 | 69 |
## | 0.101 | 0.899 | 0.371 |
## | 0.059 | 0.912 | |
## | 0.038 | 0.333 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 118 | 68 | 186 |
## | 0.634 | 0.366 | |
## -------------------|-----------|-----------|-----------|
##
##
Table of results for different values of k, to see which value of k gives the best results:
#to create a table with the results for k from 1 to 15:
result=matrix(0,15,3)
colnames(result)=c("kvalue","Number classified incorrectly",
"Percent classified incorrectly")
for(i in c(1:15)){
breast_test_pred =knn(train = train[2:31], test = test[2:31],
cl = breast_train_labels, k = i)
tablepredict =CrossTable(x = breast_test_labels, y = breast_test_pred,
prop.chisq = FALSE)
wrong = tablepredict$t[1,2] + tablepredict$t[2,1]
result[i,1]=i
result[i,2]=wrong
result[i,3]=round(((wrong/nrow(test)) * 100),2)} #percent of the 186 test observations
result
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 12 6.45
## [2,] 2 13 6.99
## [3,] 3 13 6.99
## [4,] 4 15 8.06
## [5,] 5 13 6.99
## [6,] 6 13 6.99
## [7,] 7 12 6.45
## [8,] 8 12 6.45
## [9,] 9 12 6.45
## [10,] 10 11 5.91
## [11,] 11 11 5.91
## [12,] 12 11 5.91
## [13,] 13 13 6.99
## [14,] 14 13 6.99
## [15,] 15 13 6.99
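To pick the best k from a table like this without reading it by eye, a minimal sketch follows. The error counts are copied from the table above, and best_k is a name introduced here.

```r
# Number of misclassified test cases for k = 1..15, copied from the table above
errors <- c(12, 13, 13, 15, 13, 13, 12, 12, 12, 11, 11, 11, 13, 13, 13)
k_values <- seq_along(errors)

# All values of k that reach the smallest number of errors
best_k <- k_values[errors == min(errors)]
print(best_k)

# The smallest number of errors itself
print(min(errors))
```

With the result matrix from the loop, the same idea is result[result[, 2] == min(result[, 2]), 1]. The differences between values of k are only a few test cases, and knn() breaks voting ties at random, so the ranking may change between runs.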
1.4. Results
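A confusion matrix will be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class is “B” (benign). Note that breast_test_pred now holds the predictions from the last iteration of the loop above, so the matrix below corresponds to k = 15.
#install.packages("caret")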
library(caret)
confu =confusionMatrix(breast_test_pred, breast_test_labels, positive = "B")
confu
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 113 9
## M 4 60
##
## Accuracy : 0.9301
## 95% CI : (0.8834, 0.9623)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.848
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.9658
## Specificity : 0.8696
## Pos Pred Value : 0.9262
## Neg Pred Value : 0.9375
## Prevalence : 0.6290
## Detection Rate : 0.6075
## Detection Prevalence : 0.6559
## Balanced Accuracy : 0.9177
##
## 'Positive' Class : B
##
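The statistics that caret reports can be recomputed by hand from the four cells of the confusion matrix, which is a useful check on how to read them. A minimal sketch, with the counts copied from the matrix above and “B” as the positive class:

```r
# Confusion matrix from above: rows = prediction, columns = reference
cm <- matrix(c(113, 4, 9, 60), nrow = 2,
             dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M")))

tp <- cm["B", "B"]  # benign predicted as benign
fn <- cm["M", "B"]  # benign predicted as malignant
fp <- cm["B", "M"]  # malignant predicted as benign
tn <- cm["M", "M"]  # malignant predicted as malignant

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of benign cases predicted as benign
specificity <- tn / (tn + fp)   # share of malignant cases predicted as malignant
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, ppv = ppv, npv = npv), 4)
print(metrics)
```

Since “B” is the positive class, the 9 malignant tumors predicted as benign count as false positives here, so specificity is the figure that reflects missed malignant cases.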
2. MIN-MAX NORMALIZATION
2.1. Initial data analysis
This process transforms a feature such that all of its values fall in a range between 0 and 1.
normalize =function(x){
return((x-min(x))/(max(x)-min(x)))}
breast_nor =as.data.frame(lapply(breastcan[,-1], normalize))
breast_nor$diagnosis = breastcan$diagnosis
head(breast_nor)
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1 0.5210374 0.0226581 0.5459885 0.3637328 0.5937528
## 2 0.6431445 0.2725736 0.6157833 0.5015907 0.2898799
## 3 0.6014956 0.3902604 0.5957432 0.4494168 0.5143089
## 4 0.2100904 0.3608387 0.2335015 0.1029056 0.8113208
## 5 0.6298926 0.1565776 0.6309861 0.4892895 0.4303512
## 6 0.2588386 0.2025702 0.2679842 0.1415058 0.6786133
## compactness_mean concavity_mean concave.points_mean symmetry_mean
## 1 0.7920373 0.7031396 0.7311133 0.6863636
## 2 0.1817680 0.2036082 0.3487575 0.3797980
## 3 0.4310165 0.4625117 0.6356859 0.5095960
## 4 0.8113613 0.5656045 0.5228628 0.7762626
## 5 0.3478928 0.4639175 0.5183897 0.3782828
## 6 0.4619962 0.3697282 0.4020378 0.5186869
## fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 1 0.6055181 0.35614702 0.12046941 0.36903360 0.27381126
## 2 0.1413227 0.15643672 0.08258929 0.12444047 0.12565979
## 3 0.2112468 0.22962158 0.09430251 0.18037035 0.16292179
## 4 1.0000000 0.13909107 0.17587518 0.12665504 0.03815479
## 5 0.1868155 0.23382220 0.09306489 0.22056260 0.16368757
## 6 0.5511794 0.08075321 0.11713225 0.06879329 0.03808008
## smoothness_se compactness_se concavity_se concave.points_se symmetry_se
## 1 0.1592956 0.35139844 0.13568182 0.3006251 0.31164518
## 2 0.1193867 0.08132304 0.04696970 0.2538360 0.08453875
## 3 0.1508312 0.28395470 0.09676768 0.3898466 0.20569032
## 4 0.2514532 0.54321507 0.14295455 0.3536655 0.72814769
## 5 0.3323588 0.16791841 0.14363636 0.3570752 0.13617943
## 6 0.1970629 0.23431069 0.09272727 0.2153817 0.19372995
## fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.1830424 0.6207755 0.1415245 0.6683102
## 2 0.0911101 0.6069015 0.3035714 0.5398177
## 3 0.1270055 0.5563856 0.3600746 0.5084417
## 4 0.2872048 0.2483102 0.3859275 0.2413467
## 5 0.1457996 0.5197439 0.1239339 0.5069476
## 6 0.1446596 0.2682319 0.3126333 0.2639076
## area_worst smoothness_worst compactness_worst concavity_worst
## 1 0.45069799 0.6011358 0.6192916 0.5686102
## 2 0.43521431 0.3475533 0.1545634 0.1929712
## 3 0.37450845 0.4835898 0.3853751 0.3597444
## 4 0.09400806 0.9154725 0.8140117 0.5486422
## 5 0.34157491 0.4373638 0.1724151 0.3194888
## 6 0.13674794 0.7127386 0.4827837 0.4277157
## concave.points_worst symmetry_worst fractal_dimension_worst diagnosis
## 1 0.9120275 0.5984624 0.4188640 M
## 2 0.6391753 0.2335896 0.2228781 M
## 3 0.8350515 0.4037059 0.2134330 M
## 4 0.8848797 1.0000000 0.7737111 M
## 5 0.5584192 0.1575005 0.1425948 M
## 6 0.5982818 0.4770353 0.4549390 M
summary(breast_nor)
## radius_mean texture_mean perimeter_mean area_mean
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2233 1st Qu.:0.2185 1st Qu.:0.2168 1st Qu.:0.1174
## Median :0.3024 Median :0.3088 Median :0.2933 Median :0.1729
## Mean :0.3382 Mean :0.3240 Mean :0.3329 Mean :0.2169
## 3rd Qu.:0.4164 3rd Qu.:0.4089 3rd Qu.:0.4168 3rd Qu.:0.2711
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.3046 1st Qu.:0.1397 1st Qu.:0.06926 1st Qu.:0.1009
## Median :0.3904 Median :0.2247 Median :0.14419 Median :0.1665
## Mean :0.3948 Mean :0.2606 Mean :0.20806 Mean :0.2431
## 3rd Qu.:0.4755 3rd Qu.:0.3405 3rd Qu.:0.30623 3rd Qu.:0.3678
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## symmetry_mean fractal_dimension_mean radius_se
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.2823 1st Qu.:0.1630 1st Qu.:0.04378
## Median :0.3697 Median :0.2439 Median :0.07702
## Mean :0.3796 Mean :0.2704 Mean :0.10635
## 3rd Qu.:0.4530 3rd Qu.:0.3404 3rd Qu.:0.13304
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## texture_se perimeter_se area_se smoothness_se
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1047 1st Qu.:0.04000 1st Qu.:0.02064 1st Qu.:0.1175
## Median :0.1653 Median :0.07209 Median :0.03311 Median :0.1586
## Mean :0.1893 Mean :0.09938 Mean :0.06264 Mean :0.1811
## 3rd Qu.:0.2462 3rd Qu.:0.12251 3rd Qu.:0.07170 3rd Qu.:0.2187
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## compactness_se concavity_se concave.points_se symmetry_se
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.08132 1st Qu.:0.03811 1st Qu.:0.1447 1st Qu.:0.1024
## Median :0.13667 Median :0.06538 Median :0.2070 Median :0.1526
## Mean :0.17444 Mean :0.08054 Mean :0.2235 Mean :0.1781
## 3rd Qu.:0.22680 3rd Qu.:0.10619 3rd Qu.:0.2787 3rd Qu.:0.2195
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## fractal_dimension_se radius_worst texture_worst perimeter_worst
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.04675 1st Qu.:0.1807 1st Qu.:0.2415 1st Qu.:0.1678
## Median :0.07919 Median :0.2504 Median :0.3569 Median :0.2353
## Mean :0.10019 Mean :0.2967 Mean :0.3640 Mean :0.2831
## 3rd Qu.:0.12656 3rd Qu.:0.3863 3rd Qu.:0.4717 3rd Qu.:0.3735
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## area_worst smoothness_worst compactness_worst concavity_worst
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.08113 1st Qu.:0.3000 1st Qu.:0.1163 1st Qu.:0.09145
## Median :0.12321 Median :0.3971 Median :0.1791 Median :0.18107
## Mean :0.17091 Mean :0.4041 Mean :0.2202 Mean :0.21740
## 3rd Qu.:0.22090 3rd Qu.:0.4942 3rd Qu.:0.3025 3rd Qu.:0.30583
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## concave.points_worst symmetry_worst fractal_dimension_worst diagnosis
## Min. :0.0000 Min. :0.0000 Min. :0.0000 B:357
## 1st Qu.:0.2231 1st Qu.:0.1851 1st Qu.:0.1077 M:212
## Median :0.3434 Median :0.2478 Median :0.1640
## Mean :0.3938 Mean :0.2633 Mean :0.1896
## 3rd Qu.:0.5546 3rd Qu.:0.3182 3rd Qu.:0.2429
## Max. :1.0000 Max. :1.0000 Max. :1.0000
boxplot(breast_nor[1:30], las=2, col="lightblue2", main="Normalized data")
2.2. Randomly separate the data into two sets, the training set (67%) and the test set (33%).
train_nor =breast_nor[inTrain,]
test_nor =breast_nor[-inTrain,]
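Note that the minimum and maximum above are computed on the complete data set before it is split. A common alternative, sketched here under the assumption that the test set should play no part in fitting the rescaling, is to take the minimum and maximum from the training rows only and apply them to both sets. The data and the helper names (minmax_fit, minmax_apply) are hypothetical.

```r
# Hypothetical toy data: one feature, first four rows used for training
x <- c(10, 20, 30, 40, 50, 5)
train_rows <- 1:4

# Learn the rescaling parameters from the training rows only
minmax_fit <- function(v) list(min = min(v), max = max(v))

# Apply previously learned parameters to any vector
minmax_apply <- function(v, fit) (v - fit$min) / (fit$max - fit$min)

fit <- minmax_fit(x[train_rows])
train_scaled <- minmax_apply(x[train_rows], fit)
test_scaled  <- minmax_apply(x[-train_rows], fit)

print(train_scaled)  # falls inside [0, 1] by construction
print(test_scaled)   # may fall outside [0, 1] for unseen values
```

With this approach, test values can land slightly below 0 or above 1, which is expected and does not prevent the distance calculation.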
2.3. Use a k-NN algorithm (k = 1:15) to predict if the cancer is benign or malignant.
#labels of train dataset
breast_train_labels_nor = train_nor[,31]
head(breast_train_labels_nor)
## [1] M M M M M M
## Levels: B M
#labels of test dataset
breast_test_labels_nor = test_nor[,31]
head(breast_test_labels_nor)
## [1] M M M M M M
## Levels: B M
#to run the algorithm:
#install.packages("class")
library(class)
breast_test_pred_nor =knn(train = train_nor[1:30], test = test_nor[1:30],
cl = breast_train_labels_nor, k = 4)
#to compare the original results with the predicted results:
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = breast_test_labels_nor, y = breast_test_pred_nor,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 186
##
##
## | breast_test_pred_nor
## breast_test_labels_nor | B | M | Row Total |
## -----------------------|-----------|-----------|-----------|
## B | 115 | 2 | 117 |
## | 0.983 | 0.017 | 0.629 |
## | 0.943 | 0.031 | |
## | 0.618 | 0.011 | |
## -----------------------|-----------|-----------|-----------|
## M | 7 | 62 | 69 |
## | 0.101 | 0.899 | 0.371 |
## | 0.057 | 0.969 | |
## | 0.038 | 0.333 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 122 | 64 | 186 |
## | 0.656 | 0.344 | |
## -----------------------|-----------|-----------|-----------|
##
##
#to create a table with the k results:
result_nor = matrix(0,15,3)
colnames(result_nor)= c("kvalue","Number classified incorrectly",
"Percent classified incorrectly")
for(i in c(1:15)){
breast_test_pred_nor =knn(train = train_nor[1:30], test = test_nor[1:30],
cl = breast_train_labels_nor, k = i)
tablepredict =CrossTable(x = breast_test_labels_nor, y = breast_test_pred_nor,
prop.chisq = FALSE)
wrong = tablepredict$t[1,2] + tablepredict$t[2,1]
result_nor[i,1]=i
result_nor[i,2]=wrong
result_nor[i,3]=round(((wrong/nrow(test_nor)) * 100),2)} #percent of the 186 test observations
result_nor
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 10 5.38
## [2,] 2 5 2.69
## [3,] 3 8 4.30
## [4,] 4 7 3.76
## [5,] 5 8 4.30
## [6,] 6 8 4.30
## [7,] 7 8 4.30
## [8,] 8 11 5.91
## [9,] 9 10 5.38
## [10,] 10 9 4.84
## [11,] 11 11 5.91
## [12,] 12 11 5.91
## [13,] 13 11 5.91
## [14,] 14 10 5.38
## [15,] 15 10 5.38
2.4. Results
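A confusion matrix will be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class is “B”. Note that breast_test_pred_nor now holds the predictions from the last iteration of the loop above, so the matrix below corresponds to k = 15.
#install.packages("caret")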
library(caret)
confu_nor =confusionMatrix(breast_test_pred_nor, breast_test_labels_nor, positive = "B")
confu_nor
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 117 10
## M 0 59
##
## Accuracy : 0.9462
## 95% CI : (0.9034, 0.9739)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8813
## Mcnemar's Test P-Value : 0.004427
##
## Sensitivity : 1.0000
## Specificity : 0.8551
## Pos Pred Value : 0.9213
## Neg Pred Value : 1.0000
## Prevalence : 0.6290
## Detection Rate : 0.6290
## Detection Prevalence : 0.6828
## Balanced Accuracy : 0.9275
##
## 'Positive' Class : B
##
3. Z-SCORE STANDARDIZATION
3.1. Initial data analysis
This process rescales each of the feature’s values in terms of how many standard deviations they fall above or below the mean value. The resulting value is called a z-score. Z-scores fall in an unbounded range of negative and positive numbers; unlike the normalized values, they have no predefined minimum and maximum.
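To make the definition concrete, the sketch below standardizes a made-up vector by hand and checks that the result matches base R's scale(). The vector x is hypothetical.

```r
# Hypothetical feature values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: distance from the mean in units of the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() centers each column and divides it by its sample standard deviation by default
z_scale <- as.numeric(scale(x))

print(round(z_manual, 4))
print(all.equal(z_manual, z_scale))

# After standardization the mean is 0 and the standard deviation is 1,
# but the minimum and maximum are not fixed in advance
print(range(z_manual))
```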
breast_z = as.data.frame(scale(breastcan[,-1]))
breast_z$diagnosis = breastcan$diagnosis
summary(breast_z)
## radius_mean texture_mean perimeter_mean area_mean
## Min. :-2.0279 Min. :-2.2273 Min. :-1.9828 Min. :-1.4532
## 1st Qu.:-0.6888 1st Qu.:-0.7253 1st Qu.:-0.6913 1st Qu.:-0.6666
## Median :-0.2149 Median :-0.1045 Median :-0.2358 Median :-0.2949
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4690 3rd Qu.: 0.5837 3rd Qu.: 0.4992 3rd Qu.: 0.3632
## Max. : 3.9678 Max. : 4.6478 Max. : 3.9726 Max. : 5.2459
## smoothness_mean compactness_mean concavity_mean
## Min. :-3.10935 Min. :-1.6087 Min. :-1.1139
## 1st Qu.:-0.71034 1st Qu.:-0.7464 1st Qu.:-0.7431
## Median :-0.03486 Median :-0.2217 Median :-0.3419
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.63564 3rd Qu.: 0.4934 3rd Qu.: 0.5256
## Max. : 4.76672 Max. : 4.5644 Max. : 4.2399
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :-1.2607 Min. :-2.74171 Min. :-1.8183
## 1st Qu.:-0.7373 1st Qu.:-0.70262 1st Qu.:-0.7220
## Median :-0.3974 Median :-0.07156 Median :-0.1781
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6464 3rd Qu.: 0.53031 3rd Qu.: 0.4706
## Max. : 3.9245 Max. : 4.48081 Max. : 4.9066
## radius_se texture_se perimeter_se area_se
## Min. :-1.0590 Min. :-1.5529 Min. :-1.0431 Min. :-0.7372
## 1st Qu.:-0.6230 1st Qu.:-0.6942 1st Qu.:-0.6232 1st Qu.:-0.4943
## Median :-0.2920 Median :-0.1973 Median :-0.2864 Median :-0.3475
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2659 3rd Qu.: 0.4661 3rd Qu.: 0.2428 3rd Qu.: 0.1067
## Max. : 8.8991 Max. : 6.6494 Max. : 9.4537 Max. :11.0321
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :-1.7745 Min. :-1.2970 Min. :-1.0566 Min. :-1.9118
## 1st Qu.:-0.6235 1st Qu.:-0.6923 1st Qu.:-0.5567 1st Qu.:-0.6739
## Median :-0.2201 Median :-0.2808 Median :-0.1989 Median :-0.1404
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3680 3rd Qu.: 0.3893 3rd Qu.: 0.3365 3rd Qu.: 0.4722
## Max. : 8.0229 Max. : 6.1381 Max. :12.0621 Max. : 6.6438
## symmetry_se fractal_dimension_se radius_worst
## Min. :-1.5315 Min. :-1.0960 Min. :-1.7254
## 1st Qu.:-0.6511 1st Qu.:-0.5846 1st Qu.:-0.6743
## Median :-0.2192 Median :-0.2297 Median :-0.2688
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3554 3rd Qu.: 0.2884 3rd Qu.: 0.5216
## Max. : 7.0657 Max. : 9.8429 Max. : 4.0906
## texture_worst perimeter_worst area_worst smoothness_worst
## Min. :-2.22204 Min. :-1.6919 Min. :-1.2213 Min. :-2.6803
## 1st Qu.:-0.74797 1st Qu.:-0.6890 1st Qu.:-0.6416 1st Qu.:-0.6906
## Median :-0.04348 Median :-0.2857 Median :-0.3409 Median :-0.0468
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.65776 3rd Qu.: 0.5398 3rd Qu.: 0.3573 3rd Qu.: 0.5970
## Max. : 3.88249 Max. : 4.2836 Max. : 5.9250 Max. : 3.9519
## compactness_worst concavity_worst concave.points_worst
## Min. :-1.4426 Min. :-1.3047 Min. :-1.7435
## 1st Qu.:-0.6805 1st Qu.:-0.7558 1st Qu.:-0.7557
## Median :-0.2693 Median :-0.2180 Median :-0.2233
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5392 3rd Qu.: 0.5307 3rd Qu.: 0.7119
## Max. : 5.1084 Max. : 4.6965 Max. : 2.6835
## symmetry_worst fractal_dimension_worst diagnosis
## Min. :-2.1591 Min. :-1.6004 B:357
## 1st Qu.:-0.6413 1st Qu.:-0.6913 M:212
## Median :-0.1273 Median :-0.2163
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4497 3rd Qu.: 0.4504
## Max. : 6.0407 Max. : 6.8408
boxplot(breast_z[1:30], las=2, col="pink", main="Z-score standardization")
3.2. Randomly separate the data into two sets, the training set (67%) and the test set (33%).
train_z =breast_z[inTrain,]
test_z =breast_z[-inTrain,]
3.3. Use a k-NN algorithm (k = 1:15) to predict if the cancer is benign or malignant.
#labels of train dataset
breast_train_labels_z = train_z[,31]
head(breast_train_labels_z)
## [1] M M M M M M
## Levels: B M
#labels of test dataset
breast_test_labels_z = test_z[,31]
head(breast_test_labels_z)
## [1] M M M M M M
## Levels: B M
#to run the algorithm:
#install.packages("class")
library(class)
breast_test_pred_z =knn(train = train_z[1:30], test = test_z[1:30],
cl = breast_train_labels_z, k = 4)
#to compare the original results with the predicted results:
#install.packages("gmodels")
library(gmodels)
tablepredict =CrossTable(x = breast_test_labels_z, y = breast_test_pred_z,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 186
##
##
## | breast_test_pred_z
## breast_test_labels_z | B | M | Row Total |
## ---------------------|-----------|-----------|-----------|
## B | 117 | 0 | 117 |
## | 1.000 | 0.000 | 0.629 |
## | 0.921 | 0.000 | |
## | 0.629 | 0.000 | |
## ---------------------|-----------|-----------|-----------|
## M | 10 | 59 | 69 |
## | 0.145 | 0.855 | 0.371 |
## | 0.079 | 1.000 | |
## | 0.054 | 0.317 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 127 | 59 | 186 |
## | 0.683 | 0.317 | |
## ---------------------|-----------|-----------|-----------|
##
##
#to create a table with the k results:
result_z = matrix(0,15,3)
colnames(result_z) = c("kvalue","Number classified incorrectly",
"Percent classified incorrectly")
for(i in c(1:15)){
breast_test_pred_z =knn(train = train_z[1:30], test = test_z[1:30],
cl = breast_train_labels_z, k = i)
tablepredict =CrossTable(x = breast_test_labels_z, y = breast_test_pred_z,
prop.chisq = FALSE)
wrong = tablepredict$t[1,2] + tablepredict$t[2,1]
result_z[i,1]=i
result_z[i,2]=wrong
result_z[i,3]=round(((wrong/nrow(test_z)) * 100),2)} #percent of the 186 test observations
result_z
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 12 6.45
## [2,] 2 15 8.06
## [3,] 3 9 4.84
## [4,] 4 10 5.38
## [5,] 5 11 5.91
## [6,] 6 13 6.99
## [7,] 7 10 5.38
## [8,] 8 12 6.45
## [9,] 9 14 7.53
## [10,] 10 13 6.99
## [11,] 11 13 6.99
## [12,] 12 13 6.99
## [13,] 13 14 7.53
## [14,] 14 14 7.53
## [15,] 15 12 6.45
3.4. Results
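A confusion matrix will be used to measure the performance. The caret package provides measures of model performance that consider the ability to classify the positive class; in this case the positive class is “B”. Note that breast_test_pred_z now holds the predictions from the last iteration of the loop above, so the matrix below corresponds to k = 15.
#install.packages("caret")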
library(caret)
confu_z =confusionMatrix(breast_test_pred_z, breast_test_labels_z, positive = "B")
confu_z
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 116 11
## M 1 58
##
## Accuracy : 0.9355
## 95% CI : (0.89, 0.9662)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8575
## Mcnemar's Test P-Value : 0.009375
##
## Sensitivity : 0.9915
## Specificity : 0.8406
## Pos Pred Value : 0.9134
## Neg Pred Value : 0.9831
## Prevalence : 0.6290
## Detection Rate : 0.6237
## Detection Prevalence : 0.6828
## Balanced Accuracy : 0.9160
##
## 'Positive' Class : B
##
Comparing results:
result
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 12 6.45
## [2,] 2 13 6.99
## [3,] 3 13 6.99
## [4,] 4 15 8.06
## [5,] 5 13 6.99
## [6,] 6 13 6.99
## [7,] 7 12 6.45
## [8,] 8 12 6.45
## [9,] 9 12 6.45
## [10,] 10 11 5.91
## [11,] 11 11 5.91
## [12,] 12 11 5.91
## [13,] 13 13 6.99
## [14,] 14 13 6.99
## [15,] 15 13 6.99
result_nor
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 10 5.38
## [2,] 2 5 2.69
## [3,] 3 8 4.30
## [4,] 4 7 3.76
## [5,] 5 8 4.30
## [6,] 6 8 4.30
## [7,] 7 8 4.30
## [8,] 8 11 5.91
## [9,] 9 10 5.38
## [10,] 10 9 4.84
## [11,] 11 11 5.91
## [12,] 12 11 5.91
## [13,] 13 11 5.91
## [14,] 14 10 5.38
## [15,] 15 10 5.38
result_z
## kvalue Number classified incorrectly Percent classified incorrectly
## [1,] 1 12 6.45
## [2,] 2 15 8.06
## [3,] 3 9 4.84
## [4,] 4 10 5.38
## [5,] 5 11 5.91
## [6,] 6 13 6.99
## [7,] 7 10 5.38
## [8,] 8 12 6.45
## [9,] 9 14 7.53
## [10,] 10 13 6.99
## [11,] 11 13 6.99
## [12,] 12 13 6.99
## [13,] 13 14 7.53
## [14,] 14 14 7.53
## [15,] 15 12 6.45
confu
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 113 9
## M 4 60
##
## Accuracy : 0.9301
## 95% CI : (0.8834, 0.9623)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.848
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.9658
## Specificity : 0.8696
## Pos Pred Value : 0.9262
## Neg Pred Value : 0.9375
## Prevalence : 0.6290
## Detection Rate : 0.6075
## Detection Prevalence : 0.6559
## Balanced Accuracy : 0.9177
##
## 'Positive' Class : B
##
confu_nor
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 117 10
## M 0 59
##
## Accuracy : 0.9462
## 95% CI : (0.9034, 0.9739)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8813
## Mcnemar's Test P-Value : 0.004427
##
## Sensitivity : 1.0000
## Specificity : 0.8551
## Pos Pred Value : 0.9213
## Neg Pred Value : 1.0000
## Prevalence : 0.6290
## Detection Rate : 0.6290
## Detection Prevalence : 0.6828
## Balanced Accuracy : 0.9275
##
## 'Positive' Class : B
##
confu_z
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 116 11
## M 1 58
##
## Accuracy : 0.9355
## 95% CI : (0.89, 0.9662)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8575
## Mcnemar's Test P-Value : 0.009375
##
## Sensitivity : 0.9915
## Specificity : 0.8406
## Pos Pred Value : 0.9134
## Neg Pred Value : 0.9831
## Prevalence : 0.6290
## Detection Rate : 0.6237
## Detection Prevalence : 0.6828
## Balanced Accuracy : 0.9160
##
## 'Positive' Class : B
##
These results show that the model improves when the features are rescaled. This is because the variables in the breast cancer data set have different ranges, so rescaling them changes which neighbours are nearest and improves the algorithm. In this case min-max normalization gives the best result: as few as 5 misclassified test cases (k = 2), compared with 11 at best without rescaling and 9 at best with z-score standardization.
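The comparison can be condensed into a few numbers. A minimal sketch that summarizes the three error-count columns shown above follows. The counts are copied from result, result_nor and result_z, and the object names errors and summary_tab are introduced here.

```r
# Misclassified test cases for k = 1..15, copied from the three tables above
errors <- data.frame(
  raw    = c(12, 13, 13, 15, 13, 13, 12, 12, 12, 11, 11, 11, 13, 13, 13),
  minmax = c(10,  5,  8,  7,  8,  8,  8, 11, 10,  9, 11, 11, 11, 10, 10),
  zscore = c(12, 15,  9, 10, 11, 13, 10, 12, 14, 13, 13, 13, 14, 14, 12)
)

# Fewest errors, the first k that reaches it, and the mean over all k
summary_tab <- data.frame(
  min_errors  = sapply(errors, min),
  best_k      = sapply(errors, which.min),
  mean_errors = round(sapply(errors, mean), 2)
)
print(summary_tab)
```

On this split, min-max normalization has the fewest errors on average, while z-score standardization is close to the unscaled data. The differences amount to a few test cases out of 186, so a single split may not be enough to rank the methods firmly.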