
Univariate graphs (part 3) (Boxplot)

Visualizing data before any analysis is a basic and important step. Here we cover the boxplot, which is a type of univariate plot. Univariate plots take a single variable into account; they include histograms, density plots, boxplots, etc.

BOXPLOT:

A boxplot is a standardized way of displaying the distribution of data based on five summary numbers.
It shows the distribution of the data and its main characteristics at a glance, and it also allows us to compare several sets of data side by side.
It is a powerful visual tool that can be used to illustrate data, to study symmetry and tails, to check assumptions about the distribution, and to compare different populations.

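The five numbers used by default in R are:
25th percentile: bottom of the box (Q1)
50th percentile: band near the middle of the box (Q2, median)
75th percentile: top of the box (Q3)
and the ends of the two whiskers, which can represent different alternative values; by default R uses:

upper whisker = min(max(x), Q3 + 1.5 * IQR)
lower whisker = max(min(x), Q1 - 1.5 * IQR)
where IQR (interquartile range) = Q3 - Q1 (box length).
Strictly, each whisker ends at the most extreme data point that lies within these limits, and R takes Q1 and Q3 to be the lower and upper hinges (see ?boxplot.stats), which can differ slightly from the percentiles returned by quantile().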
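The definitions above can be checked by hand. The following is a minimal sketch, assuming the `Wind` column of `airquality` (chosen here only because it has no missing values) and using `fivenum()` for the hinges; the variable names are illustrative and not part of the original post.

```r
# Wind has no missing values, so no NA handling is needed here.
data(airquality)
x <- airquality$Wind

five <- fivenum(x)                  # min, lower hinge, median, upper hinge, max
box_length <- five[4] - five[2]     # the "IQR" used for the whisker limits

upper_limit <- five[4] + 1.5 * box_length
lower_limit <- five[2] - 1.5 * box_length

# Each whisker ends at the most extreme data point inside its limit.
upper_whisker <- max(x[x <= upper_limit])
lower_whisker <- min(x[x >= lower_limit])

by_hand <- c(lower_whisker, five[2:4], upper_whisker)
from_r  <- boxplot.stats(x)$stats   # the five numbers boxplot() draws

all.equal(by_hand, from_r)
```

`boxplot.stats()` is the function `boxplot()` uses internally to obtain these numbers, so comparing against it is a convenient way to confirm which values end up on the plot.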

Here are some examples with the airquality dataset:

The airquality dataset has been modified in order to work with it; all the information about the initial data analysis of this dataset can be found at the following link: http://dataworldblog.blogspot.com/2017/06/initial-data-analysis-handling-missing.html

data(airquality)
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

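# Replace the month numbers with abbreviated month names
airquality$Month <- month.abb[airquality$Month]
# Build a "day month" label and use it as the row names
airquality$Date <- paste(airquality$Day, airquality$Month)
row.names(airquality) <- airquality$Date
# Keep only the four measured variables (Ozone, Solar.R, Wind, Temp)
airquality1 <- airquality[1:4]
# The modified `airquality` dataset is now called `airquality1`: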
head(airquality1)
##       Ozone Solar.R Wind Temp
## 1 May    41     190  7.4   67
## 2 May    36     118  8.0   72
## 3 May    12     149 12.6   74
## 4 May    18     313 11.5   62
## 5 May    NA      NA 14.3   56
## 6 May    28      NA 14.9   66

boxplot(airquality1)
# Label the elements of the second box (Solar.R)
text(2, 205, "Q2", adj = c(-3, 0))
text(2, 115, "Q1", adj = c(-3, 0))
text(2, 258, "Q3", adj = c(-3, 0))
text(2, 334, "min( max(x), Q3 + 1.5 * IQR) ", adj = c(-0.05, 1))
text(2, 7, "max( min(x), Q1 - 1.5 * IQR)", adj = c(-0.05, -1))

#Variables can be plotted horizontally using the parameter `horizontal = TRUE`:
boxplot(airquality1, col= c("red", "pink", "orange", "gold"), horizontal = TRUE)
#`las = 1` keeps the variable names horizontal; `las = 2` draws them vertically (perpendicular to the axis)
#the `at` parameter controls where each variable is plotted
par(mfrow = c(1,2))
boxplot(airquality1, xlab = "Air Quality Variables", las = 2, col = topo.colors(4, alpha = 0.5), at = c(1,2,4,5), cex.lab = 0.75, cex.axis = 0.75)
boxplot(airquality1, xlab = "Air Quality Variables", las = 1, col = topo.colors(4, alpha = 0.5), at = c(5,4,2,1), cex.lab = 0.75, cex.axis = 0.75)


Range parameter: determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.

# oma reserves an outer top margin so the shared title is visible
par(mfrow = c(1,3), oma = c(0, 0, 2, 0))
# range = 0: whiskers extend to the data extremes
boxplot(airquality1, xlab = "Air Quality Variables", las = 2, col = topo.colors(4, alpha = 0.5), range = 0, cex.lab = 0.75, cex.axis = 0.75)
# default (range = 1.5)
boxplot(airquality1, xlab = "Air Quality Variables", las = 2, col = topo.colors(4, alpha = 0.5), cex.lab = 0.75, cex.axis = 0.75)
# range = 1: whiskers reach at most 1 * IQR from the box
boxplot(airquality1, xlab = "Air Quality Variables", las = 2, col = topo.colors(4, alpha = 0.5), range = 1, cex.lab = 0.75, cex.axis = 0.75)
title("Comparing range parameter", outer = TRUE)

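The whiskers differ between the three plots: with range = 0 they reach the data extremes and no points are drawn as outliers, while range = 1 gives whiskers that are never longer than those of the default (range = 1.5).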
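The effect of the range parameter can also be inspected numerically. This is a small sketch, assuming the `Wind` column of `airquality` as an example variable; it relies on `boxplot.stats()`, whose `coef` argument plays the role of `range` in `boxplot()`. The object names are illustrative.

```r
data(airquality)
x <- airquality$Wind

settings <- c(0, 1, 1.5)   # range = 0, range = 1, and the default

# Whisker ends (rows) for each setting (columns).
whisker_ends <- sapply(settings, function(k) boxplot.stats(x, coef = k)$stats[c(1, 5)])
dimnames(whisker_ends) <- list(c("lower", "upper"), paste("range =", settings))
whisker_ends

# Number of points that would be drawn as outliers for each setting.
n_outliers <- sapply(settings, function(k) length(boxplot.stats(x, coef = k)$out))
names(n_outliers) <- paste("range =", settings)
n_outliers
```

With `coef = 0` the whisker ends coincide with `range(x)` and no outliers are reported; smaller positive values can only move the whiskers closer to the box.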

Also, we can use BOXPLOT FOR COMPARING DIFFERENT CONDITIONS:

Here, we will use the chickwts dataset.

notch parameter: a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is 'strong evidence' that the two medians differ (Chambers et al, 1983, p. 62)

varwidth parameter: the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.

data("chickwts")
head(chickwts)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean

par(mfrow = c(1,2))
boxplot(weight ~ feed, data = chickwts, col = topo.colors(6, alpha = 0.5), varwidth = FALSE, notch = FALSE, main = "Chickwt data 1", ylab = "Weight at six weeks (gm)", las = 2)
boxplot(weight ~ feed, data = chickwts, col = topo.colors(6, alpha = 0.5), varwidth = TRUE, notch = TRUE, main = "Chickwt data 2", ylab = "Weight at six weeks (gm)", las = 2)
## Warning in bxp(structure(list(stats = structure(c(216, 271.5, 342, 373.5, :
## some notches went outside hinges ('box'): maybe set notch=FALSE

Using the second graph we can compare the effect of the different types of feed: the notches of some groups overlap (linseed/meatmeal/soybean or casein/sunflower), while others do not (casein/horsebean or soybean/sunflower).
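The notch comparison can be reproduced without reading it off the plot. The sketch below assumes the notch limits returned in the `conf` element of `boxplot.stats()` (documented as median +/- 1.58 * IQR / sqrt(n)); the helper `notch_overlap()` is a hypothetical name introduced here for illustration only.

```r
data("chickwts")

# Lower and upper notch limits for each feed type.
notches <- sapply(split(chickwts$weight, chickwts$feed),
                  function(w) boxplot.stats(w)$conf)
rownames(notches) <- c("lower", "upper")
round(notches, 1)

# Hypothetical helper: TRUE when the notch intervals of two feeds overlap.
notch_overlap <- function(a, b) {
  notches["lower", a] <= notches["upper", b] &&
    notches["lower", b] <= notches["upper", a]
}

notch_overlap("casein", "horsebean")
notch_overlap("casein", "sunflower")
```

Non-overlapping notches are only an informal indication that two medians differ; they are not a substitute for a formal test.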

Also, in the second plot we used varwidth = TRUE, so the widths of the boxes are slightly different, while in the first graph (varwidth = FALSE) the widths are the same for every box.
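To see why the widths differ only slightly, the group sizes can be inspected directly. This is a minimal sketch that follows the square-root rule quoted for varwidth; scaling by the largest group is an assumption made here for readability, not necessarily the exact scaling `boxplot()` applies.

```r
data("chickwts")

# Number of chicks per feed type.
n_per_feed <- table(chickwts$feed)
n_per_feed

# Widths proportional to sqrt(n), expressed relative to the largest group.
relative_width <- sqrt(n_per_feed) / max(sqrt(n_per_feed))
round(relative_width, 2)
```

Because the groups have similar sizes, the square roots are close to each other and the boxes end up with similar widths.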
