
Initial Data Analysis (infert dataset)

Initial data analysis is an important step that should always be performed before any formal analysis. The data we receive is often messy and may contain mistakes that can lead us to wrong conclusions.

Here we will use the dataset infert, which is already included in R.

To get to know the data, it is important to understand the background of the study and the meaning of each variable in the dataset. Since infert is a built-in R dataset, we can get this information with the following code:

require(datasets)
?infert            # gives us important info about the dataset
inf <- infert      # make a working copy of the dataset named 'inf'

This gives us the following information:

Format
1. Education: 0 = 0-5 years, 1 = 6-11 years, 2 = 12+ years
2. Age: age in years of case
3. Parity: count
4. Number of prior induced abortions: 0 = 0, 1 = 1, 2 = 2 or more
5. Case status: 1 = case, 0 = control
6. Number of prior spontaneous abortions: 0 = 0, 1 = 1, 2 = 2 or more
7. Matched set number: 1-83
8. Stratum number: 1-63

This dataset studies secondary infertility in women (https://pdfs.semanticscholar.org/8087/6668e6e7487818f250688506eba558b0fa43.pdf):
“The role of induced (and spontaneous) abortions in the aetiology of secondary sterility was investigated. Obstetric and gynaecologic histories were obtained from 100 women with secondary infertility […] For every patient, an attempt was made to find two healthy control subjects from the same hospital with matching for age, parity, and level of education. Two control subjects each were found for 83 of the index patients.”

NUMERICAL SUMMARY
Dimension of the dataset:

dim(inf)
## [1] 248   8

The dataset has 248 rows and 8 columns.

A glance at the dataset:

head(inf)         #shows the first 6 lines of the dataset
##   education age parity induced case spontaneous stratum pooled.stratum
## 1    0-5yrs  26      6       1    1           2       1              3
## 2    0-5yrs  42      1       1    1           0       2              1
## 3    0-5yrs  39      6       2    1           0       3              4
## 4    0-5yrs  34      4       2    1           0       4              2
## 5   6-11yrs  35      3       1    1           1       5             32
## 6   6-11yrs  36      4       2    1           1       6             36

#head(inf, n=10)  # the number of rows to show can also be specified
tail(inf)         #shows the last 6 lines of the dataset
##     education age parity induced case spontaneous stratum pooled.stratum
## 243   12+ yrs  25      1       0    0           1      78             41
## 244   12+ yrs  31      1       0    0           1      79             45
## 245   12+ yrs  34      1       0    0           0      80             47
## 246   12+ yrs  35      2       2    0           0      81             54
## 247   12+ yrs  29      1       0    0           1      82             43
## 248   12+ yrs  23      1       0    0           1      83             40

colnames(inf)     #names of the columns
## [1] "education"      "age"            "parity"         "induced"       
## [5] "case"           "spontaneous"    "stratum"        "pooled.stratum"

str(inf)        #displays internal structure of an R object
## 'data.frame':    248 obs. of  8 variables:
##  $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
##  $ age           : num  26 42 39 34 35 36 23 32 21 28 ...
##  $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
##  $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
##  $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
##  $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...

summary(inf)    #summary of variables 
##    education        age            parity         induced      
##  0-5yrs : 12   Min.   :21.00   Min.   :1.000   Min.   :0.0000  
##  6-11yrs:120   1st Qu.:28.00   1st Qu.:1.000   1st Qu.:0.0000  
##  12+ yrs:116   Median :31.00   Median :2.000   Median :0.0000  
##                Mean   :31.50   Mean   :2.093   Mean   :0.5726  
##                3rd Qu.:35.25   3rd Qu.:3.000   3rd Qu.:1.0000  
##                Max.   :44.00   Max.   :6.000   Max.   :2.0000  
##       case         spontaneous        stratum      pooled.stratum 
##  Min.   :0.0000   Min.   :0.0000   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:21.00   1st Qu.:19.00  
##  Median :0.0000   Median :0.0000   Median :42.00   Median :36.00  
##  Mean   :0.3347   Mean   :0.5766   Mean   :41.87   Mean   :33.58  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:62.25   3rd Qu.:48.25  
##  Max.   :1.0000   Max.   :2.0000   Max.   :83.00   Max.   :63.00

Based on the information about the dataset, we see that some variables have to be modified: case, stratum and pooled.stratum need to be recoded as factors:

inf$case <- factor(inf$case)
levels(inf$case) <- c('Control', 'Infert')  #rename levels of variable
inf$stratum <- factor(inf$stratum)
inf$pooled.stratum <- factor(inf$pooled.stratum)
summary(inf)
##    education        age            parity         induced      
##  0-5yrs : 12   Min.   :21.00   Min.   :1.000   Min.   :0.0000  
##  6-11yrs:120   1st Qu.:28.00   1st Qu.:1.000   1st Qu.:0.0000  
##  12+ yrs:116   Median :31.00   Median :2.000   Median :0.0000  
##                Mean   :31.50   Mean   :2.093   Mean   :0.5726  
##                3rd Qu.:35.25   3rd Qu.:3.000   3rd Qu.:1.0000  
##                Max.   :44.00   Max.   :6.000   Max.   :2.0000  
##                                                                
##       case      spontaneous        stratum    pooled.stratum
##  Control:165   Min.   :0.0000   1      :  3   41     : 12   
##  Infert : 83   1st Qu.:0.0000   2      :  3   45     :  9   
##                Median :0.0000   3      :  3   49     :  9   
##                Mean   :0.5766   4      :  3   51     :  9   
##                3rd Qu.:1.0000   5      :  3   12     :  6   
##                Max.   :2.0000   6      :  3   18     :  6   
##                                 (Other):230   (Other):197

From the information we know that there are two controls for each case:

summary(inf$case)
## Control  Infert 
##     165      83

If there are two controls for each of the 83 cases, there should be 166 controls, not 165, so one control is missing:

summary(inf$stratum)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3 
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  2  3 
## 76 77 78 79 80 81 82 83 
##  3  3  3  3  3  3  3  3

which(summary(inf$stratum) != 3)
## 74 
## 74

Matched set (stratum) 74 has only one control, so we remove that whole set:

infer <- inf[(inf$stratum != 74),]
dim(infer)
## [1] 246   8

summary(infer$case)
## Control  Infert 
##     164      82

summary(infer$stratum)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3 
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  0  3 
## 76 77 78 79 80 81 82 83 
##  3  3  3  3  3  3  3  3


GRAPHICAL SUMMARY
Graphs are very useful for better understanding the data we are working with. Here we show two simple graphs to get an idea of what the data we are studying look like.
Comparing the number of abortions in the cases (Case) with the controls (Control1 and Control2):

replicate1 <- infer[1:82,]     # rows 1-82: the cases
replicate2 <- infer[83:164,]   # rows 83-164: the first controls
replicate3 <- infer[165:246,]  # rows 165-246: the second controls

Case <- c(sum(replicate1$induced), sum(replicate1$spontaneous))
Control1 <- c(sum(replicate2$induced), sum(replicate2$spontaneous))
Control2 <- c(sum(replicate3$induced), sum(replicate3$spontaneous))

repl <- data.frame(Case, Control1, Control2)
barplot(as.matrix(repl), main = "Replicated abortion cases", ylab = "Induced + Spontaneous",
        space = 0.3, cex.axis = 0.8, col = cm.colors(2))
legend("topright", c("Induced", "Spontaneous"), fill=cm.colors(2))



We see that women with secondary infertility (Case) have many more spontaneous abortions than those without secondary infertility (Control1 and Control2), while the numbers of induced abortions are similar across the three groups.
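A quick numerical check of this impression (a minimal sketch, assuming infer has been prepared as above; aggregate() averages the counts within each case group):

aggregate(cbind(induced, spontaneous) ~ case, data = infer, FUN = mean)   # mean number of abortions per group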

We can also plot another graph, for example parity against education:

plot(parity~education, inf, col= 'gold')



Here we can see that, in this dataset, women with fewer years of education have given birth more times.
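As a rough numerical check of this impression, we could compare, for example, the median parity per education level (a small sketch using tapply()):

tapply(inf$parity, inf$education, median)   # median parity for each education level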

In the article it is stated: 'For every patient, an attempt was made to find two healthy control subjects from the same hospital with matching for age, parity, and level of education.'
To check this we can use:

par(mfrow=c(1,3))  # divides the plotting layout into 3 panels, since we are going to plot 3 graphs
plot(parity~ age, replicate1, pch= as.character(case))
plot(parity~ age, replicate2, pch= as.character(case))
plot(parity~ age, replicate3, pch= as.character(case))




Here we can see that the same values are repeated in the three subsets (replicate1, 2 and 3): the first contains the cases and the other two the controls.
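One way to verify the matching numerically is to compare the matching variables across the three subsets; a minimal sketch, assuming the slicing above pairs each case with its two controls row by row (if the matching is exact, each comparison returns TRUE):

all(replicate1$age == replicate2$age & replicate2$age == replicate3$age)                           # same ages?
all(replicate1$parity == replicate2$parity & replicate2$parity == replicate3$parity)               # same parity?
all(replicate1$education == replicate2$education & replicate2$education == replicate3$education)   # same education?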

If we want to have a wider view of the variables, we can use:

plot(inf)



which plots every variable against all the others.
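If the full scatter-plot matrix is too crowded, we could restrict it to a few columns and colour the points by case status; a small sketch (the choice of columns is just an example):

pairs(inf[, c("age", "parity", "induced", "spontaneous")],
      col = ifelse(inf$case == "Infert", "red", "blue"))   # red = cases, blue = controls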
