Simple linear models are models in which the behavior of a variable, Y, can be explained by a variable X: Y = f(X).
If we assume that the relation f(), which connects Y with X, is linear, then it can be written as: Y = β0 + β1X + ε.
Since relations of this type are seldom exact, but rather approximations in which many variables of secondary importance have been omitted, we must include a random perturbation term ε, which reflects all factors other than X that influence the endogenous variable, none of which is individually relevant.
The parameters β0 and β1 determine the intercept and the slope of the line, respectively. The intercept β0 is the predicted value of Y when X = 0. The slope β1 is the predicted change in Y resulting from a one-unit increase in X.
Because the relationship is linear and involves a single explanatory variable, it is called a simple linear relation: the term simple refers to the fact that there is only one explanatory variable.
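Before turning to real data, the model can be illustrated with a small simulation (a minimal sketch; the values β0 = 2, β1 = 0.5, the error standard deviation and the sample size are arbitrary choices for illustration):
# Simulate data from y = b0 + b1*x + e with arbitrary b0 = 2 and b1 = 0.5
set.seed(1)
x <- runif(100, 0, 10)
e <- rnorm(100, mean = 0, sd = 1)   # random perturbation term
y <- 2 + 0.5 * x + e
plot(x, y)
abline(a = 2, b = 0.5)              # the true line underlying the simulated data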
We are going to work with the data from Sen & Srivastava that can be found in the package ‘SenSrivastava’:
#install.packages('SenSrivastava')
library(SenSrivastava)
head(E1.1)
## DENSITY SPEED
## 1 20.4 38.8
## 2 27.4 31.5
## 3 106.2 10.6
## 4 80.4 16.1
## 5 141.3 7.7
## 6 130.9 8.3
str(E1.1)
## 'data.frame': 24 obs. of 2 variables:
## $ DENSITY: num 20.4 27.4 106.2 80.4 141.3 ...
## $ SPEED : num 38.8 31.5 10.6 16.1 7.7 8.3 8.5 11.1 8.6 11.1 ...
summary(E1.1)
## DENSITY SPEED
## Min. : 20.40 Min. : 7.70
## 1st Qu.: 34.48 1st Qu.:10.18
## Median :103.65 Median :11.30
## Mean : 87.61 Mean :16.89
## 3rd Qu.:112.25 3rd Qu.:28.82
## Max. :144.20 Max. :38.80
CORRELATION AND SIMPLE LINEAR MODEL:
Correlation and the simple linear model are not the same. Correlation quantifies the degree to which two variables are related; you do not have to think about cause and effect, you simply quantify how well two variables relate to each other. In a simple linear model, by contrast, you do have to think about cause and effect, since the regression line is determined as the best way to predict Y from X.
Correlation between the variables:
plot(E1.1$DENSITY, E1.1$SPEED)
cor(E1.1$DENSITY, E1.1$SPEED)
## [1] -0.9715016
The correlation is +1 if there is a perfect positive linear relationship, -1 if there is a perfect negative linear relationship, and values between -1 and +1 indicate the degree of linear dependence between the variables. The closer the coefficient is to either -1 or +1, the stronger the correlation between the variables. When the correlation coefficient is close to zero there is no evidence of a linear relationship. From the correlation value (-0.97) we can infer that there is a strong negative relationship between the two variables.
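To see the difference with regression in practice, note that the correlation is symmetric in its arguments, whereas the two possible regressions answer different prediction questions (a quick check on the same data):
# Correlation does not change when the variables are swapped
cor(E1.1$SPEED, E1.1$DENSITY)   # same value as above
# The two regressions are different models with different coefficients
coef(lm(SPEED ~ DENSITY, data = E1.1))
coef(lm(DENSITY ~ SPEED, data = E1.1))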
MODEL ESTIMATION:
lmod <- lm(SPEED ~ DENSITY, data = E1.1)
lmod
##
## Call:
## lm(formula = SPEED ~ DENSITY, data = E1.1)
##
## Coefficients:
## (Intercept) DENSITY
## 38.1295 -0.2425
Printing the model shows the formula we used and the estimated coefficients.
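If we only need the estimates themselves, they can be extracted directly (a small convenience sketch; the values are the same as those printed above, and confint() additionally returns 95% confidence intervals, which are not part of the printout):
# Intercept and slope as a named numeric vector
coef(lmod)
# 95% confidence intervals for the coefficients (optional extra check)
confint(lmod)
Much more detail is available from summary():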
summary(lmod)
##
## Call:
## lm(formula = SPEED ~ DENSITY, data = E1.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.083 -1.924 -0.425 1.761 5.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.12948 1.21768 31.31 < 2e-16 ***
## DENSITY -0.24247 0.01261 -19.22 3.04e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.507 on 22 degrees of freedom
## Multiple R-squared: 0.9438, Adjusted R-squared: 0.9413
## F-statistic: 369.6 on 1 and 22 DF, p-value: 3.041e-15
The summary function provides much more information. First, the p-value for the model as a whole can be found in the last line: p-value = 3.041e-15. As this p-value is smaller than 0.05, we conclude that the regression is significant (the slope of the line is not zero) and the model is useful.
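The same p-value can be recovered from the stored F statistic (a sketch of how the last line of the summary is computed):
# F statistic and its degrees of freedom as stored by summary()
fs <- summary(lmod)$fstatistic
# Upper-tail probability of the F distribution = the model p-value (3.041e-15)
pf(fs[1], fs[2], fs[3], lower.tail = FALSE)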
Also, we can see the estimation of the parameters:
(Intercept) 38.12948 (< 2e-16)
DENSITY -0.24247 (3.04e-15)
and both are significant.
How to interpret the coefficients?
β0 is the point of intersection of the regression line with the Y-axis. It is only interpreted if the range of the explanatory variable (the X) contains zero.
β1 is the slope of the line. It represents the change (in our case, a decrease) in speed for each additional unit of density; its units are speed per unit of density.
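A quick way to see the meaning of the slope is to predict at two density values one unit apart (50 and 51 are arbitrary values inside the observed range); the difference between the two predictions equals the estimated slope:
# Predicted speed at densities 50 and 51 (arbitrary values for illustration)
p <- predict(lmod, newdata = data.frame(DENSITY = c(50, 51)))
p
diff(p)   # equals the estimated slope, about -0.2425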
MEASURE HOW WELL THE MODEL FITS THE DATA
To measure how well the model fits the data we have different options. One choice is the coefficient of determination, or percentage of variance explained (R-squared): it ranges from 0 to 1, and values closer to 1 indicate a better fit. For simple linear regression, R^2 = cor^2.
- R-Squared and adjusted R-Squared:
Every time a predictor is added to a model the R-squared increases, even if only due to chance, so a model with more terms may appear to fit better simply because it has more terms. The adjusted R-squared corrects this value for the number of parameters. Here we see a good value of R-squared:
Multiple R-squared: 0.9438, Adjusted R-squared: 0.9413
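Both values can be extracted from the fitted model and checked against the correlation computed earlier (a quick verification; the adjusted value is recomputed by hand for one predictor):
summary(lmod)$r.squared                  # 0.9438
cor(E1.1$DENSITY, E1.1$SPEED)^2          # same value: R^2 = cor^2 in simple regression
summary(lmod)$adj.r.squared              # 0.9413
# Adjusted R-squared by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = 1 predictor
n <- nrow(E1.1)
1 - (1 - summary(lmod)$r.squared) * (n - 1) / (n - 2)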
- Residual standard error (sigma):
This value is the square root of the sum of squared residuals divided by the residual degrees of freedom, and it estimates the standard deviation of the model errors. It is measured in the units of the response and can be interpreted in the context of the particular data set. Since sigma is a standard deviation (and not a variance), it is on the scale of the original outcome data.
summary(lmod)$sigma
## [1] 2.50655
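The same value can be computed by hand from the residuals (a quick check of the definition given above):
# Square root of the residual sum of squares over the residual degrees of freedom
sqrt(sum(residuals(lmod)^2) / df.residual(lmod))   # 2.50655, as reported by summary()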
RESIDUAL ANALYSIS
par(mfrow = c(1,2))
plot(E1.1$DENSITY, E1.1$SPEED)
abline(lmod)
plot(residuals(lmod))
abline(h=0)
The purpose of the residual analysis is to check the assumptions of the model. Thus, this analysis aims to verify the following assumptions:
- The distribution of errors is approximately normal.
- The errors are not correlated.
- The variance of the errors is constant (homoscedasticity).
par(mfrow=c(2,2))
plot(lmod)
- Checking Error Assumptions:
- The distribution of errors is approximately normal.
par(mfrow=c(1,2))
qqnorm(residuals(lmod), ylab="Residuals")
qqline(residuals(lmod))
plot(density(residuals(lmod)), main="density plot")
We can verify it with a test:
shapiro.test(residuals(lmod))
##
## Shapiro-Wilk normality test
##
## data: residuals(lmod)
## W = 0.92462, p-value = 0.07396
H0: distribution of errors is approximately normal
HA: distribution of errors is not normal
Since the p-value obtained is larger than 0.05, we do not reject H0: the residuals are compatible with a normal distribution.
- The errors are not correlated.
Graphs:
par(mfrow = c(1,2))
n<-length(residuals(lmod))
plot(tail(residuals(lmod), n-1) ~ head(residuals(lmod), n-1),
     xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h=0,v=0)
plot(residuals(lmod), ylab="residuals")
Regression of the errors:
summary(lm(tail(residuals(lmod),n-1) ~ head(residuals(lmod),n-1) -1))
##
## Call:
## lm(formula = tail(residuals(lmod), n - 1) ~ head(residuals(lmod),
## n - 1) - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.985 -1.838 -1.082 0.908 4.393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## head(residuals(lmod), n - 1) 0.2217 0.1813 1.223 0.234
##
## Residual standard error: 2.131 on 22 degrees of freedom
## Multiple R-squared: 0.06363, Adjusted R-squared: 0.02107
## F-statistic: 1.495 on 1 and 22 DF, p-value: 0.2344
The p-value is larger than 0.05, suggesting that there is no correlation between the errors.
Durbin-Watson statistic: the DW statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis.
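The statistic itself can be computed directly from the residuals as the sum of squared successive differences divided by the sum of squared residuals (a sketch of the definition; the packaged test below also provides a p-value):
# DW = sum((e[i] - e[i-1])^2) / sum(e[i]^2)
e <- residuals(lmod)
sum(diff(e)^2) / sum(e^2)   # should match the DW value reported by dwtest() below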
require(lmtest)
dwtest(lmod)
##
## Durbin-Watson test
##
## data: lmod
## DW = 1.3281, p-value = 0.03127
## alternative hypothesis: true autocorrelation is greater than 0
The p-value is smaller than 0.05 but larger than 0.01, so the Durbin-Watson test gives only weak evidence of autocorrelation. Since none of the other checks suggests correlation, we accept that there is no correlation among the errors of the model.
- The variance of the errors is constant (homoscedasticity).
par(mfrow=c(1,2))
plot(fitted(lmod), residuals(lmod), xlab="Fitted", ylab="Residuals")
abline(h=0)
plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab="Fitted", ylab=expression(sqrt(abs(hat(epsilon)))))
summary(lm(sqrt(abs(residuals(lmod)))~fitted(lmod)))
##
## Call:
## lm(formula = sqrt(abs(residuals(lmod))) ~ fitted(lmod))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.04655 -0.25784 0.00797 0.31290 1.22159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.476438 0.218620 6.753 8.7e-07 ***
## fitted(lmod) -0.009886 0.011187 -0.884 0.386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.539 on 22 degrees of freedom
## Multiple R-squared: 0.03428, Adjusted R-squared: -0.009613
## F-statistic: 0.781 on 1 and 22 DF, p-value: 0.3864
The p-value is larger than 0.05, so we cannot reject H0: constant error variance.
Score Test For Non-Constant Error Variance: Computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors.
#install.packages('car')
library(car)
ncvTest(lmod)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.004526111 Df = 1 p = 0.9463617
The p-value is larger than 0.05, so we cannot reject H0: constant error variance.