Sometimes data are missing, or missing values have been coded in a way we did not expect; we have to be very careful in these situations.
In this example we work with the airquality dataset:
require(datasets)
data("airquality")
?airquality #gives us important info about the dataset
A data frame with 153 observations on 6 variables.
[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1–12)
[,6] Day numeric Day of month (1–31)
head(airquality) #shows the first rows in the dataset
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
str(airquality) # internal structure of an R object
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
summary(airquality) #summary of the variables present in the dataset
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
First, we see that the data are daily readings of air quality values from May 1, 1973 to September 30, 1973.
Two columns hold the day and the month; we can use them to create a new identifier for each row:
airquality$Month <- month.abb[airquality$Month]
airquality$Date <- paste (airquality$Day, airquality$Month)
row.names(airquality) <- airquality$Date
airquality1 <- airquality[c(1:4)]
head(airquality1)
## Ozone Solar.R Wind Temp
## 1 May 41 190 7.4 67
## 2 May 36 118 8.0 72
## 3 May 12 149 12.6 74
## 4 May 18 313 11.5 62
## 5 May NA NA 14.3 56
## 6 May 28 NA 14.9 66
Now we have a dataframe with 4 columns.
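The row identifiers built above are plain character strings. If real dates are needed (for sorting or time-series plots), a minimal sketch could look like the following. It assumes the year 1973 stated in the dataset description, and it works on a fresh copy named aq (a name introduced here for illustration) so that the objects created above are not modified:

```r
library(datasets)

# Fresh copy of the original data, where Month is still numeric.
aq <- datasets::airquality

# ISOdate() builds a date-time from numeric year, month and day;
# as.Date() then drops the time part. The year 1973 is an assumption
# taken from the dataset description.
aq$Date <- as.Date(ISOdate(1973, aq$Month, aq$Day))

# First and last day covered by the readings
range(aq$Date)
```

A Date column sorts chronologically, which the "1 May"-style labels do not.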
We see from the summary that some data are missing. To get an idea of which values are missing, we can plot how the NAs are distributed in the dataset:
image(is.na(airquality1), axes= FALSE, col=gray(1:0))
title(main= 'NAs distribution in the dataset', col.main = "purple")
axis(2, at= 0:3/3,labels = colnames(airquality1))
axis(1, at= 0:152/152,labels = row.names(airquality1), las=2)
# Closer look at the first 30 rows:
image(is.na(airquality1[1:30, ]), axes = FALSE, col = gray(1:0))
title(main = 'NAs distribution in the first 30 rows', col.main = "purple")
axis(2, at = 0:3/3, labels = colnames(airquality1))
axis(1, at = 0:29/29, labels = row.names(airquality1)[1:30], las = 2)
The NA values only affect the first two columns: "Ozone" and "Solar.R".
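The same conclusion can be checked numerically instead of visually. The sketch below counts the missing values per column on a fresh copy of the data (aq and na_per_column are names introduced here for illustration):

```r
library(datasets)

# Fresh copy of the original data
aq <- datasets::airquality

# is.na() returns a logical matrix; summing each column counts the NAs
na_per_column <- colSums(is.na(aq))
na_per_column

# Share of missing values per column
round(colMeans(is.na(aq)), 3)
```

The counts for Ozone and Solar.R should match the NA's row reported by summary(); every other column should be zero.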
R provides several functions for handling missing values:
is.na(head(airquality1)) #is.na() returns a TRUE for the data that is missing.
## Ozone Solar.R Wind Temp
## 1 May FALSE FALSE FALSE FALSE
## 2 May FALSE FALSE FALSE FALSE
## 3 May FALSE FALSE FALSE FALSE
## 4 May FALSE FALSE FALSE FALSE
## 5 May TRUE TRUE FALSE FALSE
## 6 May FALSE TRUE FALSE FALSE
complete.cases(head(airquality1)) # returns TRUE if the case is complete (no `NA` in any column of the case).
## [1] TRUE TRUE TRUE TRUE FALSE FALSE
airquality2 <- na.omit(airquality1) #removes all the cases with `NAs`.
is.na(head(airquality2)) # new dataset does not have `NAs`
## Ozone Solar.R Wind Temp
## 1 May FALSE FALSE FALSE FALSE
## 2 May FALSE FALSE FALSE FALSE
## 3 May FALSE FALSE FALSE FALSE
## 4 May FALSE FALSE FALSE FALSE
## 7 May FALSE FALSE FALSE FALSE
## 8 May FALSE FALSE FALSE FALSE
summary(airquality2)
## Ozone Solar.R Wind Temp
## Min. : 1.0 Min. : 7.0 Min. : 2.30 Min. :57.00
## 1st Qu.: 18.0 1st Qu.:113.5 1st Qu.: 7.40 1st Qu.:71.00
## Median : 31.0 Median :207.0 Median : 9.70 Median :79.00
## Mean : 42.1 Mean :184.8 Mean : 9.94 Mean :77.79
## 3rd Qu.: 62.0 3rd Qu.:255.5 3rd Qu.:11.50 3rd Qu.:84.50
## Max. :168.0 Max. :334.0 Max. :20.70 Max. :97.00
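Dropping incomplete cases can discard a large share of the data, so it is worth checking how many rows were removed. A minimal sketch on a fresh copy of the four measurement columns (the object names are introduced here for illustration):

```r
library(datasets)

# Fresh copy with the four measurement columns only
aq <- datasets::airquality[, 1:4]

# Two equivalent ways of keeping only the complete rows
complete <- aq[complete.cases(aq), ]
omitted  <- na.omit(aq)

# Number of rows lost by dropping incomplete cases
nrow(aq) - nrow(omitted)

# na.omit() records which rows it removed in the "na.action" attribute
dropped <- attr(omitted, "na.action")
head(names(dropped))
```

If many rows are lost, an analysis that uses only one variable may be better served by handling NAs per variable (for example with na.rm = TRUE) rather than removing whole rows.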
boxplot(airquality2, col = cm.colors(6)) #Quick picture of the variables in the dataset.
We can see that the range of some variables is wider than that of others.
Some arithmetic functions do not work with NA values:
mean(airquality$Ozone)
## [1] NA
mean(airquality$Ozone, na.rm = TRUE) #na.rm = TRUE, removes `NAs`.
## [1] 42.12931
sd(airquality$Ozone)
## [1] NA
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
But other functions, such as summary(), are able to work with data that contain NA values.
summary(airquality$Ozone)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 18.00 31.50 42.13 63.25 168.00 37
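The na.rm argument can also be passed through sapply() to compute a statistic for every column at once. A minimal sketch on a fresh copy of the measurement columns (the object names are introduced here for illustration):

```r
library(datasets)

# Fresh copy with the four measurement columns only
aq <- datasets::airquality[, 1:4]

# Extra arguments to sapply() are forwarded to the function,
# so na.rm = TRUE reaches mean() and sd() for each column.
col_means <- sapply(aq, mean, na.rm = TRUE)
col_sds   <- sapply(aq, sd, na.rm = TRUE)

round(rbind(mean = col_means, sd = col_sds), 2)
```

Each statistic is computed from the non-missing values of its own column, so the columns may be based on different numbers of observations.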
We also have to be careful because in some datasets missing values are wrongly coded as 0 or 99. When this happens, these values have to be recoded as NA.
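When the data come from a text file, such codes can sometimes be converted while reading the file. The sketch below uses the na.strings argument of read.csv() on a small made-up CSV (csv_text and readings are hypothetical and are not part of airquality):

```r
# Hypothetical CSV content in which -99 marks a missing reading
csv_text <- "Day,Temp,Wind
1,67,7.4
2,-99,8.0
3,74,-99"

# na.strings lists the strings that should be read as NA
readings <- read.csv(text = csv_text, na.strings = c("NA", "-99"))
readings
```

na.strings applies to every column, so this only suits files where the code never appears as a legitimate value; otherwise it is safer to recode specific columns after import.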
Here we add three rows with -99 as the temperature, so that we can then recode these values properly as missing:
airquality2[c("1 Apr", "2 Apr", "3 Apr"), ] <- matrix(c(36, 38, 33, 199, 298, 198, 12, 11, 8, -99, -99, -99), ncol = 4) # 3 new rows x 4 columns, filled column by column
With this new dataset, we see that the Temp variable has a minimum of -99, while the Details section of the help page states that this variable is the maximum daily temperature, so -99 is not a plausible reading.
summary(airquality2)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 2.30 Min. :-99.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.40 1st Qu.: 70.25
## Median : 32.00 Median :205.0 Median : 9.70 Median : 78.50
## Mean : 41.93 Mean :186.0 Mean : 9.95 Mean : 73.14
## 3rd Qu.: 60.50 3rd Qu.:255.8 3rd Qu.:11.50 3rd Qu.: 84.00
## Max. :168.00 Max. :334.0 Max. :20.70 Max. : 97.00
boxplot(airquality2, col = cm.colors(6)) #Quick picture of the variables in the dataset
sort(airquality2$Temp)
## [1] -99 -99 -99 57 58 59 59 61 61 61 62 62 63 64 64 65 65
## [18] 66 66 67 67 67 68 68 68 68 69 69 70 71 71 71 72 72
## [35] 72 73 73 73 73 74 74 75 75 76 76 76 76 76 76 77 77
## [52] 77 77 78 78 78 78 79 79 79 80 80 80 81 81 81 81 81
## [69] 81 81 81 81 81 82 82 82 82 82 82 82 83 83 83 84 84
## [86] 84 85 85 85 86 86 86 86 86 87 87 87 88 88 89 89 90
## [103] 90 90 91 92 92 92 93 93 94 94 96 97
Here -99 has been used to code NA values, so we have to recode it:
airquality2$Temp[airquality2$Temp == -99 ] <- NA
summary(airquality2$Temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 57.00 71.00 79.00 77.79 84.50 97.00 3
Here we see that the -99 values have been recoded as NAs: the minimum of Temp is now 57, and the summary reports 3 NA values.
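When several codes are in use, or when they appear in more than one column, the same recoding can be applied to a whole data frame. A minimal sketch on made-up data (readings and missing_codes are hypothetical names, and the set of codes is an assumption that has to be checked against each dataset's documentation):

```r
# Hypothetical readings where -99 and 99 both stand for "missing"
readings <- data.frame(
  Temp = c(67, -99, 74, NA),
  Wind = c(7.4, 8.0, 99, 11.5)
)

# Codes assumed to mean "missing" in this made-up example
missing_codes <- c(-99, 99)

# Replace the codes column by column; %in% never returns NA,
# so values that are already missing are left as they are.
readings[] <- lapply(readings, function(x) replace(x, x %in% missing_codes, NA))

colSums(is.na(readings))
```

Before recoding, it is worth confirming that a code such as 99 cannot be a real measurement in any of the affected columns.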