
Least squares method

The regression model y = Xβ + ε divides the response into two components: a systematic one, Xβ, and a random one, the error ε. We want to choose β so that the systematic component explains as much of the response as possible; in other words, we have to find the unknown parameters β that make Xβ as close to y as possible.
What is the analytical relationship that best fits our data? The least squares method is a general procedure that allows us to answer this question.
Ordinary least squares (OLS) is a method for estimating the unknown parameters β in a linear regression model by minimizing the sum of the squares of the differences between the observed responses in the dataset and the responses predicted by the linear model.
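In matrix notation this amounts to the following minimization, whose closed-form solution (assuming X has full column rank, so that X^T X is invertible) is given by the normal equations:

$$\hat{\beta} \;=\; \arg\min_{\beta}\, \lVert y - X\beta \rVert^{2} \;=\; (X^{\top}X)^{-1}X^{\top}y$$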
It is common to assume that the residuals (the differences between the values in the dataset and the fitted line) should be minimized with the ordinary least squares method. Under this hypothesis, the accuracy of a line through the sample points is measured by the sum of the squared residuals, and the goal is to make this sum as small as possible.
Other regression methods that can be used instead of ordinary least squares include least absolute deviations (minimizing the sum of the absolute values of the residuals) and the Theil-Sen estimator (which chooses the line whose slope is the median of the slopes determined by pairs of sample points), among others.
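To make the Theil-Sen idea concrete, here is a minimal sketch in R (the function name theil_sen and the intercept rule, the median of y - b1*x, are our own choices for illustration, not a reference implementation):
# Theil-Sen sketch: the slope is the median of all pairwise slopes
theil_sen <- function(x, y) {
  pairs  <- combn(length(x), 2)                      # all index pairs (i, j)
  slopes <- (y[pairs[2, ]] - y[pairs[1, ]]) /
            (x[pairs[2, ]] - x[pairs[1, ]])          # slope of each pair of points
  b1 <- median(slopes)                               # median pairwise slope
  b0 <- median(y - b1 * x)                           # one common choice of intercept
  c(intercept = b0, slope = b1)
}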
In 1822 Gauss showed that the least squares approach to regression analysis is optimal in the sense that it is the best linear unbiased estimator of the coefficients in a linear model when the following three conditions are met:
  1. the errors have mean zero,
  2. the errors are uncorrelated, and
  3. the errors have equal variances.
These conditions are a collection of assumptions about the regression model that should be thought of as a description of an ideal dataset; they need to be met for the least squares estimator to be the Best Linear Unbiased Estimator (BLUE).
In the real world, most datasets do not meet these conditions. However, the linear regression model under these conditions can be used as a benchmark case against which to compare and analyze the results obtained. If we know the ideal conditions and how they can be violated, we are able to control for deviations from these conditions and still obtain consistent results.
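As an aside (not part of the example below), R's built-in diagnostics make it easy to eyeball the zero-mean and equal-variance conditions; a minimal sketch on simulated data that satisfies them:
set.seed(1)
x_sim <- 1:50
y_sim <- 2 + 0.5 * x_sim + rnorm(50)  # errors with mean zero and equal variance
fit_sim <- lm(y_sim ~ x_sim)
plot(fit_sim, which = 1)              # residuals vs fitted: centered around 0, constant spread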
Since the method of least squares is the standard approach in regression analysis and is the best linear unbiased estimator of the coefficients in a linear model when the conditions of the Gauss-Markov theorem are met, we are going to focus on this method.

EXAMPLE 
Calculating the parameters using the least squares method:
The method of least squares calculates the line of best fit by minimizing the sum of the squares of the vertical distances of the points to the line.
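For a line y = b0 + b1*x, minimizing this sum of squared vertical distances gives the closed-form estimates that the steps below compute by hand:

$$b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$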
x <- c(1,2,3,4,5,6,7)   # predictor values
y <- c(3,4,7,8,9,13,18) # response values
data.frame(x,y)         # view the data as a table
##   x  y
## 1 1  3
## 2 2  4
## 3 3  7
## 4 4  8
## 5 5  9
## 6 6 13
## 7 7 18
mean(x)
## [1] 4
mean(y)
## [1] 8.857143
plot(x,y)           # scatter plot of the data
abline(v = mean(x)) # vertical line at mean(x)
abline(h = mean(y)) # horizontal line at mean(y)
To calculate the quantities needed in these formulas we compute:
data <- data.frame(x,y, x-mean(x), y-mean(y), (x-mean(x))^2, (x-mean(x))*(y - mean(y))) 
colnames(data)<- c('x','y', 'x - mean(x)', 'y - mean(y)', '(x-mean(x))^2', '(x-mean(x))*(y - mean(y))')
data
##   x  y x - mean(x) y - mean(y) (x-mean(x))^2 (x-mean(x))*(y - mean(y))
## 1 1  3          -3  -5.8571429             9                17.5714286
## 2 2  4          -2  -4.8571429             4                 9.7142857
## 3 3  7          -1  -1.8571429             1                 1.8571429
## 4 4  8           0  -0.8571429             0                 0.0000000
## 5 5  9           1   0.1428571             1                 0.1428571
## 6 6 13           2   4.1428571             4                 8.2857143
## 7 7 18           3   9.1428571             9                27.4285714
sum((x- mean(x))^2)
## [1] 28
sum((x- mean(x))*(y - mean(y)))
## [1] 65
# compute b1 (the slope)
b1 = sum((x- mean(x))*(y - mean(y))) / sum((x- mean(x))^2)
b1
## [1] 2.321429
# compute b0 (the intercept), from mean(y) = b0 + b1*mean(x)
b0 = mean(y) - b1*mean(x)
b0
## [1] -0.4285714
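As a sanity check (an extra step not in the original walkthrough), we can verify the "least squares" property directly: the residual sum of squares of the fitted line increases if we perturb either coefficient.
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2) # residual sum of squares for a given line
rss(b0, b1)        # RSS of the least squares line
rss(b0, b1 + 0.1)  # perturbing the slope gives a larger RSS
rss(b0 + 0.5, b1)  # perturbing the intercept also gives a larger RSS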

R allows us to do the same much faster using the function lm() (we will use this function in the following posts).
We compare our results with the results from the R function lm():
plot(x,y, ylim= c(-5,20))
abline(lm(y ~ x), col = 'red') # lm() is the R function used to fit linear models
abline(v = mean(x))
abline(h = mean(y))
fit <- lm(y ~ x)
fit
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     -0.4286       2.3214
We see that the values obtained are the same (Intercept = b0, x = b1).
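A few base R helpers can be used to inspect the fitted object further (a brief illustration; the printed output is omitted here):
coef(fit)         # the estimated b0 and b1 as a named vector
fitted(fit)       # the values predicted by the line at each x
sum(resid(fit)^2) # the residual sum of squares that OLS minimizes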
