Initial data analysis is an important step that should always be performed before analysing the data in depth. The data we receive are often messy and may contain mistakes that can lead us to wrong conclusions.
Here we will use the dataset infert, which is already available in R.
To get to know the data, it is very important to know the background and the meaning of each variable in the dataset. Since infert is a dataset that ships with R, we can get information about it using the following code:
require(datasets)
?infert #gives us important info about the dataset
inf <- infert #working copy of the dataset, named 'inf'
This gives us the following information:
Format
1. Education: 0 = 0-5 years, 1 = 6-11 years, 2 = 12+ years
2. Age: Age in years of case
3. Parity: Count
4. Number of prior induced abortions: 0 = 0, 1 = 1, 2 = 2 or more
5. Case status: 1 = case, 0 = control
6. Number of prior spontaneous abortions: 0 = 0, 1 = 1, 2 = 2 or more
7. Matched set number: 1-83
8. Stratum number: 1-63
This dataset comes from a study of secondary infertility in women (https://pdfs.semanticscholar.org/8087/6668e6e7487818f250688506eba558b0fa43.pdf):
“The role of induced (and spontaneous) abortions in the aetiology of secondary sterility was investigated. Obstetric and gynaecologic histories were obtained from 100 women with secondary infertility […] For every patient, an attempt was made to find two healthy control subjects from the same hospital with matching for age, parity, and level of education. Two control subjects each were found for 83 of the index patients.”
NUMERICAL SUMMARY
Dimension of the dataset:
dim(inf)
## [1] 248 8
The dataset has 248 rows and 8 columns.
A glance at the dataset:
head(inf) #shows the first 6 lines of the dataset
##   education age parity induced case spontaneous stratum pooled.stratum
## 1    0-5yrs  26      6       1    1           2       1              3
## 2    0-5yrs  42      1       1    1           0       2              1
## 3    0-5yrs  39      6       2    1           0       3              4
## 4    0-5yrs  34      4       2    1           0       4              2
## 5   6-11yrs  35      3       1    1           1       5             32
## 6   6-11yrs  36      4       2    1           1       6             36
#head(inf, n = 10) #the number of lines to show can also be specified
tail(inf) #shows the last 6 lines of the dataset
##     education age parity induced case spontaneous stratum pooled.stratum
## 243   12+ yrs  25      1       0    0           1      78             41
## 244   12+ yrs  31      1       0    0           1      79             45
## 245   12+ yrs  34      1       0    0           0      80             47
## 246   12+ yrs  35      2       2    0           0      81             54
## 247   12+ yrs  29      1       0    0           1      82             43
## 248   12+ yrs  23      1       0    0           1      83             40
colnames(inf) #names of the columns
## [1] "education" "age" "parity" "induced"
## [5] "case" "spontaneous" "stratum" "pooled.stratum"
str(inf) #displays internal structure of an R object
## 'data.frame': 248 obs. of 8 variables:
## $ education : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
## $ age : num 26 42 39 34 35 36 23 32 21 28 ...
## $ parity : num 6 1 6 4 3 4 1 2 1 2 ...
## $ induced : num 1 1 2 2 1 2 0 0 0 0 ...
## $ case : num 1 1 1 1 1 1 1 1 1 1 ...
## $ spontaneous : num 2 0 0 0 1 1 0 0 1 0 ...
## $ stratum : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19 ...
summary(inf) #summary of variables
## education age parity induced
## 0-5yrs : 12 Min. :21.00 Min. :1.000 Min. :0.0000
## 6-11yrs:120 1st Qu.:28.00 1st Qu.:1.000 1st Qu.:0.0000
## 12+ yrs:116 Median :31.00 Median :2.000 Median :0.0000
## Mean :31.50 Mean :2.093 Mean :0.5726
## 3rd Qu.:35.25 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :44.00 Max. :6.000 Max. :2.0000
## case spontaneous stratum pooled.stratum
## Min. :0.0000 Min. :0.0000 Min. : 1.00 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:21.00 1st Qu.:19.00
## Median :0.0000 Median :0.0000 Median :42.00 Median :36.00
## Mean :0.3347 Mean :0.5766 Mean :41.87 Mean :33.58
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:62.25 3rd Qu.:48.25
## Max. :1.0000 Max. :2.0000 Max. :83.00 Max. :63.00
Based on this information, we see that some variables have to be modified: case, stratum and pooled.stratum need to be recoded as factors:
inf$case <- factor(inf$case)
levels(inf$case) <- c('Control', 'Infert') #rename levels of variable
inf$stratum <- factor(inf$stratum)
inf$pooled.stratum <- factor(inf$pooled.stratum)
summary(inf)
## education age parity induced
## 0-5yrs : 12 Min. :21.00 Min. :1.000 Min. :0.0000
## 6-11yrs:120 1st Qu.:28.00 1st Qu.:1.000 1st Qu.:0.0000
## 12+ yrs:116 Median :31.00 Median :2.000 Median :0.0000
## Mean :31.50 Mean :2.093 Mean :0.5726
## 3rd Qu.:35.25 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :44.00 Max. :6.000 Max. :2.0000
##
## case spontaneous stratum pooled.stratum
## Control:165 Min. :0.0000 1 : 3 41 : 12
## Infert : 83 1st Qu.:0.0000 2 : 3 45 : 9
## Median :0.0000 3 : 3 49 : 9
## Mean :0.5766 4 : 3 51 : 9
## 3rd Qu.:1.0000 5 : 3 12 : 6
## Max. :2.0000 6 : 3 18 : 6
## (Other):230 (Other):197
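Renaming the levels after calling factor() relies on the codes being sorted as 0, 1. A minimal sketch of an alternative that ties each label to its code explicitly is shown here; the name inf2 is a hypothetical copy used only for illustration, so the objects used in the rest of the text are left untouched.

```r
library(datasets)

# 'inf2' is a hypothetical working copy, used only in this sketch.
inf2 <- infert

# Map each numeric code to its label explicitly: 0 -> Control, 1 -> Infert.
inf2$case <- factor(inf2$case, levels = c(0, 1), labels = c("Control", "Infert"))

# Count the observations per level.
print(table(inf2$case))
```

With this form, a reordering of the codes cannot silently swap the labels.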
From the study description we know that there should be two controls for each case:
summary(inf$case)
## Control Infert
## 165 83
If there are two controls for each case, there should be 166 controls, not 165, so something is missing:
summary(inf$stratum)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3
## 76 77 78 79 80 81 82 83
## 3 3 3 3 3 3 3 3
which(summary(inf$stratum) != 3)
## 74
## 74
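The same check can be made more explicit by cross-tabulating matched set against case status. The following is a minimal sketch on the original infert data (where case is still coded 0/1); the names counts and incomplete are introduced here for illustration.

```r
library(datasets)

# Rows: matched set number; columns: case status ("0" = control, "1" = case).
counts <- table(infert$stratum, infert$case)

# Matched sets that do not consist of exactly one case and two controls.
incomplete <- rownames(counts)[counts[, "1"] != 1 | counts[, "0"] != 2]
print(incomplete)
```

This also shows which kind of observation is missing from a set, not only that its size differs from 3.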
Matched set 74 has only one control, so we remove it:
infer <- inf[(inf$stratum != 74),]
dim(infer)
## [1] 246 8
summary(infer$case)
## Control Infert
## 164 82
summary(infer$stratum)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3
## 76 77 78 79 80 81 82 83
## 3 3 3 3 3 3 3 3
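The output above still lists level 74 with a count of 0, because subsetting a factor keeps its unused levels. If that empty level is not wanted, droplevels() removes it. A minimal sketch, using the hypothetical names inf2 and infer2 so that the objects in the text are not changed:

```r
library(datasets)

# Hypothetical copies used only in this sketch.
inf2 <- infert
inf2$stratum <- factor(inf2$stratum)

# Remove matched set 74, then drop factor levels that no longer occur.
infer2 <- droplevels(inf2[inf2$stratum != 74, ])

print(nlevels(infer2$stratum))
```

Dropping the empty level avoids zero-count groups appearing in later tables and plots.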
GRAPHICAL SUMMARY
Graphs help us understand the data we are working with. Here we show a few simple graphs to get an idea of what the data look like.
Comparing the number of abortions in the Cases with the Controls:
replicate1 <- infer[1:82,]
replicate2 <- infer[83:164,]
replicate3 <- infer[165:246,]
Case <- c(sum(replicate1$induced),sum(replicate1$spontaneous))
Control1 <- c(sum(replicate2$induced),sum(replicate2$spontaneous))
Control2 <- c(sum(replicate3$induced),sum(replicate3$spontaneous))
repl <- data.frame(Case, Control1, Control2)
barplot(as.matrix(repl), main= "Replicated abortion cases", ylab = "Induced + Spontaneous",space=0.3, cex.axis=0.8, col= cm.colors(2))
legend("topright", c("Induced", "Spontaneous"), fill=cm.colors(2))
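The row slices used here assume that the data are ordered as all cases followed by two blocks of controls. A minimal sketch that derives the three groups from the stratum and case columns instead is given below; the data frame d and the helper column member are hypothetical names introduced for illustration, and the sketch starts from the original infert data.

```r
library(datasets)

# Drop the incomplete matched set, as in the text.
d <- infert[infert$stratum != 74, ]

# Sort so that, within each matched set, the case comes first.
d <- d[order(d$stratum, -d$case), ]

# 'member' (hypothetical helper column): position within the matched set.
d$member <- factor(ave(seq_len(nrow(d)), d$stratum, FUN = seq_along),
                   labels = c("Case", "Control1", "Control2"))

# Sum the coded abortion values per group.
totals <- aggregate(cbind(induced, spontaneous) ~ member, data = d, FUN = sum)
print(totals)
```

The resulting table has one row per group and can be passed to barplot() after transposing the two numeric columns.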
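The bar plot shows that women with secondary infertility (Case) have many more spontaneous abortions than women without secondary infertility (Control1 and Control2), while induced abortions are about the same in the three groups. Note that the sums use the coded values, where 2 stands for 2 or more, so they understate the true totals.
We can also plot other graphs, for example parity against education: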
plot(parity~education, inf, col= 'gold')
Here we can see that, in this dataset, women with fewer years of education tend to have given birth more times.
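A numerical summary can back up what the box plot suggests. The following is a minimal sketch computing the median and mean parity within each education level; the names med_parity and mean_parity are introduced here for illustration.

```r
library(datasets)

# Median and mean number of births within each education level.
med_parity  <- tapply(infert$parity, infert$education, median)
mean_parity <- tapply(infert$parity, infert$education, mean)

print(med_parity)
print(round(mean_parity, 2))
```

Keep in mind that the 0-5yrs group contains only 12 women, so its summary is based on few observations.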
The article states: "For every patient, an attempt was made to find two healthy control subjects from the same hospital with matching for age, parity, and level of education."
To check this we can use:
par(mfrow=c(1,3)) #divides the plotting area into 3 panels, since we are going to plot 3 graphs
plot(parity~ age, replicate1, pch= as.character(case))
plot(parity~ age, replicate2, pch= as.character(case))
plot(parity~ age, replicate3, pch= as.character(case))
Here we can see that the same values are repeated in the three subsets (replicate1, replicate2 and replicate3): the first contains the cases and the other two contain the controls.
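The matching can also be checked numerically, one matched set at a time. A minimal sketch, assuming the incomplete set 74 is excluded as in the text; d, by_set and is_matched are hypothetical names introduced here.

```r
library(datasets)

# Exclude the incomplete matched set, as in the text.
d <- infert[infert$stratum != 74, ]

# Split the matching variables by matched set.
by_set <- split(d[, c("age", "parity", "education")], d$stratum)

# A set is perfectly matched if its rows share one combination of values.
is_matched <- sapply(by_set, function(g) nrow(unique(g)) == 1)

# Number of matched sets that pass and fail the check.
print(table(is_matched))
```

Any set flagged FALSE can then be inspected directly, for example with by_set[!is_matched].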
If we want to have a wider view of the variables, we can use:
plot(inf)
which plots every variable against all the others.