The regression model y = Xβ + ε divides the response into two components: one systematic, Xβ, and one random, the error ε. We want to choose β so that the systematic component explains as much of the response as possible; that is, we have to find the unknown parameters β that make Xβ as close to y as possible.
What is the analytical relationship that best fits our data? The least squares method is a general procedure that allows us to answer this question.
Ordinary least squares (OLS) is a method for estimating the unknown parameters β in a linear regression model. Its goal is to minimize the sum of the squares of the differences between the observed responses in the dataset and the responses predicted by the linear model.
It is common to adopt the hypothesis that the fit of a line should be judged by its residuals (the differences between the values in the dataset and the fitted line). Under this hypothesis, the accuracy of a line through the sample points is measured by the sum of the squares of the residuals, and the goal is to make this sum as small as possible.
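As a minimal sketch of this idea, we can write the sum of squared residuals as a function of the intercept and slope and minimize it numerically with optim(); the result should match the closed-form least squares estimates. The values x0 and y0 are made-up illustrative data, not the dataset used in the example below.
# illustrative data (hypothetical values chosen only for this sketch)
x0 <- c(1, 2, 3, 4, 5)
y0 <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# sum of squared residuals for a candidate intercept b[1] and slope b[2]
rss <- function(b) sum((y0 - (b[1] + b[2] * x0))^2)
# minimize the sum of squared residuals numerically
optim(c(0, 0), rss)$par
# closed-form least squares estimates for comparison
b1_sketch <- sum((x0 - mean(x0)) * (y0 - mean(y0))) / sum((x0 - mean(x0))^2)
b0_sketch <- mean(y0) - b1_sketch * mean(x0)
c(b0_sketch, b1_sketch)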
Other regression methods that can be used instead of ordinary least squares include least absolute deviations (minimizing the sum of the absolute values of the residuals) and the Theil-Sen estimator (which chooses the line whose slope is the median of the slopes determined by pairs of sample points), among others.
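As a rough sketch of the Theil-Sen idea (reusing the illustrative x0 and y0 from the block above), the slope can be obtained as the median of all pairwise slopes:
# all pairwise slopes (y0[j] - y0[i]) / (x0[j] - x0[i]) for i < j
pairs  <- combn(length(x0), 2)
slopes <- (y0[pairs[2, ]] - y0[pairs[1, ]]) / (x0[pairs[2, ]] - x0[pairs[1, ]])
# Theil-Sen slope and a matching intercept (median of y0 - slope * x0)
ts_slope     <- median(slopes)
ts_intercept <- median(y0 - ts_slope * x0)
c(ts_intercept, ts_slope)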
In 1822 Gauss stated that the least-squares approach to regression analysis is optimal in the sense that it gives the best linear unbiased estimator of the coefficients in a linear model when the following three conditions are met:
- the errors have mean zero,
- the errors are uncorrelated, and
- the errors have equal variances.
These conditions form a collection of assumptions about the regression model that should be thought of as a description of an ideal dataset. They need to be met in order for the least squares estimator to be the Best Linear Unbiased Estimator (BLUE).
In the real world, most datasets do not meet these conditions. However, the linear regression model under these conditions can be used as a benchmark case against which to compare and analyze the results obtained. If we know the ideal conditions and their violations, we can control for deviations from these conditions and still provide consistent results.
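A minimal simulation sketch of these ideal conditions (the true coefficients 1 and 2 are chosen only for illustration): we generate errors with mean zero, no correlation, and equal variances, and check that lm() recovers estimates close to the true values.
set.seed(1)                       # for reproducibility
n  <- 100
xs <- runif(n)                    # arbitrary predictor values
e  <- rnorm(n, mean = 0, sd = 1)  # errors: mean zero, uncorrelated, equal variance
ys <- 1 + 2 * xs + e              # true intercept 1, true slope 2 (illustrative)
coef(lm(ys ~ xs))                 # estimates should be close to 1 and 2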
Since the method of least squares is a standard approach in regression analysis and is the best linear unbiased estimator of the coefficients in a linear model when the conditions of the Gauss-Markov theorem are met, we are going to focus on this method.
EXAMPLE
Calculating the parameters using the least squares method:
The method of least squares calculates the line of best fit by minimizing the sum of the squares of the vertical distances of the points to the line.
x <- c(1,2,3,4,5,6,7)
y <- c(3,4,7,8,9,13,18)
data.frame(x,y)
## x y
## 1 1 3
## 2 2 4
## 3 3 7
## 4 4 8
## 5 5 9
## 6 6 13
## 7 7 18
mean(x)
## [1] 4
mean(y)
## [1] 8.857143
plot(x, y)
abline(v = mean(x))  # vertical line at the mean of x
abline(h = mean(y))  # horizontal line at the mean of y
To calculate the quantities we need, we compute:
data <- data.frame(x,y, x-mean(x), y-mean(y), (x-mean(x))^2, (x-mean(x))*(y - mean(y)))
colnames(data)<- c('x','y', 'x - mean(x)', 'y - mean(y)', '(x-mean(x))^2', '(x-mean(x))*(y - mean(y))')
data
## x y x - mean(x) y - mean(y) (x-mean(x))^2 (x-mean(x))*(y - mean(y))
## 1 1 3 -3 -5.8571429 9 17.5714286
## 2 2 4 -2 -4.8571429 4 9.7142857
## 3 3 7 -1 -1.8571429 1 1.8571429
## 4 4 8 0 -0.8571429 0 0.0000000
## 5 5 9 1 0.1428571 1 0.1428571
## 6 6 13 2 4.1428571 4 8.2857143
## 7 7 18 3 9.1428571 9 27.4285714
sum((x- mean(x))^2)
## [1] 28
sum((x- mean(x))*(y - mean(y)))
## [1] 65
# to compute b1 (the slope)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1
## [1] 2.321429
# to compute b0 (the intercept), using mean(y) = b0 + b1*mean(x)
b0 <- mean(y) - b1 * mean(x)
b0
## [1] -0.4285714
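As a quick sanity check (a small sketch, not part of the original workflow), we can verify two well-known properties of the least squares fit with an intercept: the line passes through the point (mean(x), mean(y)), and the residuals sum to zero.
fitted_y <- b0 + b1 * x       # fitted values from our estimates
res      <- y - fitted_y      # residuals of the fit
b0 + b1 * mean(x) - mean(y)   # should be (numerically) zero
sum(res)                      # should also be (numerically) zero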
R allows us to do the same very quickly using the function lm() (we will use this function in following posts). We compare our results with the results from the R function lm():
plot(x, y, ylim = c(-5, 20))
abline(lm(y ~ x), col = 'red')  # lm is the R function used to fit linear models
abline(v = mean(x))
abline(h = mean(y))
fit <- lm(y ~ x)
fit
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -0.4286 2.3214
We see that the values obtained are the same (Intercept = b0, x = b1).
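To compare the two sets of estimates programmatically (a small optional check), we can extract the coefficients from the fitted object with coef() and compare them with our manual results:
coef(fit)                                # named vector: (Intercept), x
all.equal(unname(coef(fit)), c(b0, b1))  # should be TRUE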