
Initial Data Analysis - handling missing data (airquality dataset)

Sometimes data are missing, or missing values have been coded in a way that we did not expect, so we have to be very careful in these situations.

In this example we work with the airquality dataset:

require(datasets)
data("airquality")
?airquality      #gives us important info about the dataset

A data frame with 153 observations on 6 variables.
[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1–12)
[,6] Day numeric Day of month (1–31)

head(airquality)  #shows the first rows in the dataset 
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

str(airquality) # internal structure of an R object
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

summary(airquality) #summary of the variables present in the dataset
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

First, the help page tells us that the data are daily readings of air quality values from May 1, 1973 to September 30, 1973.

Two columns correspond to the day and the month. We can use them to create a new identifier for each row:

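airquality$Month <- month.abb[airquality$Month]             # 5 -> "May", 6 -> "Jun", ...
airquality$Date <- paste(airquality$Day, airquality$Month)  # e.g. "1 May"
row.names(airquality) <- airquality$Date                    # use the date as row identifier
airquality1 <- airquality[1:4]                              # keep only the four measurement columns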
head(airquality1)
##       Ozone Solar.R Wind Temp
## 1 May    41     190  7.4   67
## 2 May    36     118  8.0   72
## 3 May    12     149 12.6   74
## 4 May    18     313 11.5   62
## 5 May    NA      NA 14.3   56
## 6 May    28      NA 14.9   66

Now we have a dataframe with 4 columns.
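
As an alternative to the text label, the day and month can also be combined into a real Date, which keeps the rows sortable. The following is only a sketch: it works on a fresh copy of the dataset (here called `aq`, a name chosen for this example) and assumes the year 1973 stated in the help page.

```r
# Fresh copy, so the objects created above are not touched
aq <- datasets::airquality

# Build ISO-formatted strings such as "1973-05-01" and convert them to Date
aq$Date <- as.Date(sprintf("1973-%02d-%02d", aq$Month, aq$Day))

# First and last day covered by the readings
print(range(aq$Date))
```

Unlike the character label, a Date column can be used directly on a time axis or to compute differences between days.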

The summary shows that some data are missing. To get an idea of how the NAs are distributed in the dataset, we can plot them as follows:

image(is.na(airquality1), axes= FALSE, col=gray(1:0))
title(main= 'NAs distribution in the dataset', col.main = "purple")
axis(2, at= 0:3/3,labels = colnames(airquality1))
axis(1, at= 0:152/152,labels = row.names(airquality1), las=2)




# closer look at the first 30 rows:
image(is.na(airquality1[1:30, ]), axes = FALSE, col = gray(1:0))
title(main = 'NAs distribution in the dataset (first 30 rows)', col.main = "purple")
axis(2, at = 0:3/3, labels = colnames(airquality1))
axis(1, at = 0:29/29, labels = row.names(airquality1)[1:30], las = 2)




The NA values only affect the first 2 columns: "Ozone" and "Solar.R".
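
What the image shows can also be checked numerically. This is a minimal sketch on a fresh copy of the four measurement columns (`aq`, `na_counts` and `na_share` are names chosen for this example), counting the NAs in each column:

```r
# Fresh copy of the four measurement columns
aq <- datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Number of missing values per column
na_counts <- colSums(is.na(aq))
print(na_counts)

# Share of missing values per column, in percent
na_share <- round(100 * colMeans(is.na(aq)), 1)
print(na_share)
```

The counts should agree with the NA's row of the summary() output above: 37 for Ozone, 7 for Solar.R and none for Wind and Temp.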
To handle these data we have different functions:

is.na(head(airquality1))           #is.na()  returns a TRUE for the data that is missing.
##       Ozone Solar.R  Wind  Temp
## 1 May FALSE   FALSE FALSE FALSE
## 2 May FALSE   FALSE FALSE FALSE
## 3 May FALSE   FALSE FALSE FALSE
## 4 May FALSE   FALSE FALSE FALSE
## 5 May  TRUE    TRUE FALSE FALSE
## 6 May FALSE    TRUE FALSE FALSE

complete.cases(head(airquality1))  # returns TRUE if the case is complete (no `NA` in any column of the case).
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

airquality2 <- na.omit(airquality1)  # removes all the cases with `NAs`.
is.na(head(airquality2))             # the first rows of the new dataset do not have `NAs`
##       Ozone Solar.R  Wind  Temp
## 1 May FALSE   FALSE FALSE FALSE
## 2 May FALSE   FALSE FALSE FALSE
## 3 May FALSE   FALSE FALSE FALSE
## 4 May FALSE   FALSE FALSE FALSE
## 7 May FALSE   FALSE FALSE FALSE
## 8 May FALSE   FALSE FALSE FALSE
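
Before continuing with the reduced dataset, it is worth checking how many rows na.omit() has removed. The sketch below uses a fresh copy of the data (the object names are chosen for this example) and relies on the "na.action" attribute that na.omit() attaches to its result:

```r
# Fresh copy of the four measurement columns
aq <- datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
aq_complete <- na.omit(aq)

n_before <- nrow(aq)                   # rows in the original data
n_after <- nrow(aq_complete)           # rows left after na.omit()
n_complete <- sum(complete.cases(aq))  # same count, obtained with complete.cases()

# na.omit() stores the positions of the dropped rows in the "na.action" attribute
dropped <- attr(aq_complete, "na.action")

cat("rows before:", n_before, "\n")
cat("rows after: ", n_after, "\n")
cat("rows dropped:", length(dropped), "\n")
```

A row is dropped as soon as one of its values is missing, so the complete values of the other columns in that row are lost as well.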

summary(airquality2)
##      Ozone          Solar.R           Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00

boxplot(airquality2, col = cm.colors(6)) #Quick picture of the variables in the dataset.

We can see that the range of some variables is wider than that of others.

Some arithmetic functions return NA when the data contain NA values:

mean(airquality$Ozone)
## [1] NA

mean(airquality$Ozone, na.rm = TRUE) #na.rm = TRUE, removes `NAs`.
## [1] 42.12931

sd(airquality$Ozone)
## [1] NA

sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788

But other functions, such as summary(), are able to work with data that contain NA values.

summary(airquality$Ozone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   18.00   31.50   42.13   63.25  168.00      37
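
The na.rm = TRUE argument used above can be passed on to every column at once with sapply(). A minimal sketch on a fresh copy of the data (`aq`, `col_means` and `col_sds` are names chosen for this example):

```r
# Fresh copy of the four measurement columns
aq <- datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Extra arguments of sapply() are handed to the function, here na.rm = TRUE
col_means <- sapply(aq, mean, na.rm = TRUE)
col_sds <- sapply(aq, sd, na.rm = TRUE)

# One table with a row of means and a row of standard deviations
print(round(rbind(mean = col_means, sd = col_sds), 2))
```

Note that each column is then summarised over its own non-missing values, so the statistics of different columns may be based on different numbers of observations.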

We also have to be careful, because in some datasets missing values are wrongly coded as a number such as 0 or 99. When this happens, these values have to be recoded as NA.
Here we add three rows with -99 as temperature, so that we can then recode them properly as missing values:

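# 3 new rows x 4 columns; matrix() fills column by column, so each group of three values is one variable
airquality2[c("1 Apr", "2 Apr", "3 Apr"), ] <- matrix(c(36, 38, 33, 199, 298, 198, 12, 11, 8, -99, -99, -99), nrow = 3)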

In this new dataset the Temp variable contains -99, although the details in the help page state that this variable is the maximum daily temperature (in degrees F), so such a value is not plausible.

summary(airquality2)
##      Ozone           Solar.R           Wind            Temp       
##  Min.   :  1.00   Min.   :  7.0   Min.   : 2.30   Min.   :-99.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.40   1st Qu.: 70.25  
##  Median : 32.00   Median :205.0   Median : 9.70   Median : 78.50  
##  Mean   : 41.93   Mean   :186.0   Mean   : 9.95   Mean   : 73.14  
##  3rd Qu.: 60.50   3rd Qu.:255.8   3rd Qu.:11.50   3rd Qu.: 84.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.70   Max.   : 97.00

boxplot(airquality2, col = cm.colors(6)) #Quick picture of the variables in the dataset

sort(airquality2$Temp)
##   [1] -99 -99 -99  57  58  59  59  61  61  61  62  62  63  64  64  65  65
##  [18]  66  66  67  67  67  68  68  68  68  69  69  70  71  71  71  72  72
##  [35]  72  73  73  73  73  74  74  75  75  76  76  76  76  76  76  77  77
##  [52]  77  77  78  78  78  78  79  79  79  80  80  80  81  81  81  81  81
##  [69]  81  81  81  81  81  82  82  82  82  82  82  82  83  83  83  84  84
##  [86]  84  85  85  85  86  86  86  86  86  87  87  87  88  88  89  89  90
## [103]  90  90  91  92  92  92  93  93  94  94  96  97

Here -99 has been used to code the missing values, so we have to recode it as NA:

airquality2$Temp[airquality2$Temp == -99 ] <- NA
summary(airquality2$Temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   57.00   71.00   79.00   77.79   84.50   97.00       3

Here we see that the -99 values have been recoded as NA: the minimum value of Temp is now 57 and there are 3 NA's.
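
When the same code is used in several columns, the recoding can be done for the whole data frame in one step, or already when the file is read. The sketch below uses a small made-up data frame (`toy`) and made-up CSV text, and it assumes that -99 is never a legitimate value in any of the columns:

```r
# Toy data (made-up values) where -99 marks a missing measurement
toy <- data.frame(Temp = c(67, -99, 74), Wind = c(7.4, 8.0, -99))

# Assumption: -99 is never a real measurement in any column
sentinel <- -99
toy[toy == sentinel] <- NA   # recode every matching cell in one step
print(toy)

# The same recoding at import time: na.strings lists the codes to read as NA
csv_text <- "Temp,Wind\n67,7.4\n-99,8.0\n74,-99"
from_csv <- read.csv(text = csv_text, na.strings = c("NA", "-99"))
print(from_csv)
```

Recoding the whole data frame is only safe when the code cannot occur as a real value; for a variable where -99 could be a genuine measurement, it is better to recode column by column, as done for Temp above.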
