Skip to main content

Simple Linear Model

Simple Linear Model are models in which the behavior of a variable, Y, can be explained by a variable X:  Y = f(X).

If we consider that the relation f(), which connects Y with X, is linear, then it can be written as: y = β0 + β1x + ε. 

Since relations of the above type are seldom exact, but rather are approximations in which many variables of secondary importance have been omitted, we must include a random perturbation term, which reflects all factors - other than X - that influence the endogenous variable, but none of them is individually relevant.

The parameters β0 and β1 determine the intercept and the slope of the line respectively. The intercept β0 represents the predicted value of y when x = 0. The slope β1 represents the predicted increase in Y resulting from a one unit increase in X.

The above expression reflects a linear relationship, and only one single explanatory variable, receiving the name of simple linear relation. Here the term simple is due to the fact that there is only one explanatory variable.

We are going to work with the data from Sen & Srivastava that can be found in the package ‘SenSrivastava’:
#install.packages('SenSrivastava')
library(SenSrivastava)
head(E1.1)
##   DENSITY SPEED
## 1    20.4  38.8
## 2    27.4  31.5
## 3   106.2  10.6
## 4    80.4  16.1
## 5   141.3   7.7
## 6   130.9   8.3
str(E1.1)
## 'data.frame':    24 obs. of  2 variables:
##  $ DENSITY: num  20.4 27.4 106.2 80.4 141.3 ...
##  $ SPEED  : num  38.8 31.5 10.6 16.1 7.7 8.3 8.5 11.1 8.6 11.1 ...
summary(E1.1)
##     DENSITY           SPEED      
##  Min.   : 20.40   Min.   : 7.70  
##  1st Qu.: 34.48   1st Qu.:10.18  
##  Median :103.65   Median :11.30  
##  Mean   : 87.61   Mean   :16.89  
##  3rd Qu.:112.25   3rd Qu.:28.82  
##  Max.   :144.20   Max.   :38.80

CORRELATION AND SIMPLE LINEAR MODEL:
Correlation and simple linear model are not the same. Correlation quantifies the degree to which two variables are related. With correlation you don’t have to think about cause and effect. You simply quantify how well two variables relate to each other. While in simple linear model, you have to think about cause and effect since the regression line is determined as the best way to predict Y from X.
Correlation between the variables:
plot(E1.1$DENSITY, E1.1$SPEED)
cor(E1.1$DENSITY, E1.1$SPEED)
## [1] -0.9715016
The correlation is +1 if there is a perfect positive linear relationship, -1 if there is a perfect negative linear relationship and values between +1 and -1 indicates the degree of linear dependence between the variables. Closer the coefficient to either -1 or +1, stronger the correlation between the variables. When the correlation coefficient is close to zero there is no evidence of any relationship. From the correlation value (-0.97) we can infer that there is a negative strong relationship between both variables.

MODEL ESTIMATION:
lmod <-lm(SPEED~DENSITY, data= E1.1)
lmod
## 
## Call:
## lm(formula = SPEED ~ DENSITY, data = E1.1)
## 
## Coefficients:
## (Intercept)      DENSITY  
##     38.1295      -0.2425
It gives us the formula we have used and the coefficients of the model.
summary(lmod)
## 
## Call:
## lm(formula = SPEED ~ DENSITY, data = E1.1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.083 -1.924 -0.425  1.761  5.617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.12948    1.21768   31.31  < 2e-16 ***
## DENSITY     -0.24247    0.01261  -19.22 3.04e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.507 on 22 degrees of freedom
## Multiple R-squared:  0.9438, Adjusted R-squared:  0.9413 
## F-statistic: 369.6 on 1 and 22 DF,  p-value: 3.041e-15
summary function provides much more information. First, we can check that the p-value is for the model can be found in the last line and is P-value: 3.041e-15. As the p-value is smaller than 0.05 we will consider the significant regression (the slope of the line is not zero) and the regression is useful.

Also, we can see the estimation of the parameters:
(Intercept) 38.12948 (< 2e-16)
DENSITY -0.24247 (3.04e-15)
and both are significant.

How to interpret the coefficients?
 is the point of intersection of the regression line with the Y-axis. It is only interpreted if the data of the regressive variable (the X) contains zero.
 is the slope of the line. It represents the decrease (in our case) for each unit of density. the units are speed/density.

MEASURE HOW WELL THE MODEL FITS THE DATA
To measure how well the model fits the data we have different options. One choice is the coefficient of determination or percentage of variance explained (R-Squared), it ranges from 0 to 1, and values closer to 1 indicates a better fit. For simple linear regression R^2 = cor^2.
  • R-Squared and adjusted R-Squared:
Every time a predictor is added to a model, the R-squared increases, even if due to chance alone. Consequently, a model with more terms may appear to have a better fit simply because it has more terms. In order to adjust this value for the number of parameters the Adjusted R-squared is used. Here we see there a good value of R-squared.
Multiple R-squared: 0.9438, Adjusted R-squared: 0.9413
This value represents the sum of squared residuals, over the number of degrees of freedom in the model, the value is proportionate to the standard deviation of the the model residuals. It is measured in the units of the response and can be interpreted in the context of the particular data set. Since sigma is a standard deviation (and not a variance), it is on the scale of the original outcome data.
summary(lmod)$sigma
## [1] 2.50655

RESIDUAL ANALYSIS
par(mfrow = c(1,2))
plot(E1.1$DENSITY, E1.1$SPEED)
abline(lmod)
plot(residuals(lmod))
abline(h=0)

The purpose of the residue analysis is to test the hypotheses of the model. Thus, this analysis aims to verify the following premises:
  1. The distribution of errors is approximately normal.
  2. The errors are not correlated.
  3. The variance of the errors is constant (homoscedasticity).
par(mfrow=c(2,2))
plot(lmod)

These graphs observed suggest that the model verifies Gauss-Markov conditions for errors.

  • Checking Error Assumptions:
  1. The distribution of errors is approximately normal.
par(mfrow=c(1,2)) 
qqnorm(residuals(lmod), ylab="Residuals") 
qqline(residuals(lmod)) 
plot(density(residuals(lmod)), main="denisty plot") 

Graphs show a normal distribution.
We can verify it with a test:
shapiro.test(residuals(lmod)) 
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmod)
## W = 0.92462, p-value = 0.07396
H0: distribution of errors is approximately normal
HA: distribution of errors is not normal
Since p-value obtained is bigger than 0.05 we accept H0.

  1. The errors are not correlated.
Graphs:
par(mfrow = c(1,2))
n<-length(residuals(lmod))
plot(tail(residuals(lmod),n-1) ~ head(residuals(lmod),n-1), xlab=
expression(hat(epsilon)[i]),ylab=expression(hat(epsilon)[i+1]))
abline(h=0,v=0)
plot(residuals(lmod),ylab="residues")

It does not seems that there is a correlation among the successive residues.
Regression of the errors:
summary(lm(tail(residuals(lmod),n-1) ~ head(residuals(lmod),n-1) -1))
## 
## Call:
## lm(formula = tail(residuals(lmod), n - 1) ~ head(residuals(lmod), 
##     n - 1) - 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.985 -1.838 -1.082  0.908  4.393 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## head(residuals(lmod), n - 1)   0.2217     0.1813   1.223    0.234
## 
## Residual standard error: 2.131 on 22 degrees of freedom
## Multiple R-squared:  0.06363,    Adjusted R-squared:  0.02107 
## F-statistic: 1.495 on 1 and 22 DF,  p-value: 0.2344
p-values is bigger than 0.05 suggesting that there is no correlation between the errors.
Durbin-Watson statistic: DW statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis.
require(lmtest)
dwtest(lmod)
## 
##  Durbin-Watson test
## 
## data:  lmod
## DW = 1.3281, p-value = 0.03127
## alternative hypothesis: true autocorrelation is greater than 0
p-valor is smaller than 0.05, and bigger than 0.01. Since all the other tests do not suggest correlation, we accept that there is not correlation among the errors of the model.

  1. The variance of the errors is constant (homoscedasticity).
par(mfrow=c(1,2)) 
plot(fitted(lmod), residuals(lmod), xlab="Fitted", ylab="Residuals") 
abline(h=0) 
plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab="Fitted", ylab=expression(sqrt(hat(epsilon)))) 

Test for homoscedasticity:
summary(lm(sqrt(abs(residuals(lmod)))~fitted(lmod)))
## 
## Call:
## lm(formula = sqrt(abs(residuals(lmod))) ~ fitted(lmod))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04655 -0.25784  0.00797  0.31290  1.22159 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.476438   0.218620   6.753  8.7e-07 ***
## fitted(lmod) -0.009886   0.011187  -0.884    0.386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.539 on 22 degrees of freedom
## Multiple R-squared:  0.03428,    Adjusted R-squared:  -0.009613 
## F-statistic: 0.781 on 1 and 22 DF,  p-value: 0.3864
p-value is bigger than 0.05, so we can’t reject H0: constant error variance
Score Test For Non-Constant Error Variance: Computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors.
#install.packages('car')
library(car)
ncvTest(lmod)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.004526111    Df = 1     p = 0.9463617
p-value is bigger than 0.05, so we can’t reject H0: constant error variance.

Popular posts from this blog

Support Vector Machines (SVM) in R (package 'kernlab')

Support Vector Machines (SVM) learning combines of both the instance-based nearest neighbor algorithm and the linear regression modeling. Support Vector Machines can be imagined as a surface that creates a boundary (hyperplane) between points of data plotted in multidimensional that represents examples and their feature values. Since it is likely that the line that leads to the greatest separation will generalize the best to the future data, SVM involves a search for the Maximum Margin Hyperplane (MMH) that creates the greatest separation between the 2 classes. If the data ara not linearly separable is used a slack variable, which creates a soft margin that allows some points to fall on the incorrect side of the margin. But, in many real-world applications, the relationship between variables are nonlinear. A key featureof the SVMs are their ability to map the problem to a higher dimension space using a process known as the Kernel trick, this involves a process of constructing ne...

Initial Data Analysis (infert dataset)

Initial analysis is a very important step that should always be performed prior to analysing the data we are working with. The data we receive most of the time is messy and may contain mistakes that can lead us to wrong conclusions. Here we will use the dataset infert , that is already present in R. To get to know the data is very important to know the background and the meaning of each variable present in the dataset. Since infert is a dataset in R we can get information about the data using the following code: require(datasets) ?infert #gives us important info about the dataset inf <- infert #renamed dataset as 'inf' This gives us the following information: Format 1.Education: 0 = 0-5 years, 1 = 6-11 years, 2 = 12+ years 2.Age: Age in years of case 3.Parity: Count 4.Number of prior induced abortions: 0 = 0, 1 = 1, 2 = 2 or more 5.Case status: 1 = case 0 = control 6.Number of prior spontaneous abortions: 0 = 0, 1 = 1, 2...

Ant Colony Optimization (part 2) : Graph optimization using ACO

The Travelling Salesman Problem (TSP) is one of the most famous problems in computer science for studying optimization, the objective is to find a complete route that connects all the nodes of a network, visiting them only once and returning to the starting point while minimizing the total distance of the route. The problem of the traveling agent has an important variation, and this depends on whether the distances between one node and another are symmetric or not, that is, that the distance between A and B is equal to the distance between B and A, since in practice is very unlikely to be so. The number of possible routes in a network is determined by the equation: (𝒏−𝟏)! This means that in a network of 5 nodes the number of probable routes is equal to (5-1)! = 24, and as the number of nodes increases, the number of possible routes grows factorially. In the case that the problem is symmetrical the number of possible routes is reduced to half: ( (𝒏−𝟏)! ) / 𝟐 The complexity o...