
Linear Models

"Linear model describes a quantitative response in terms of a linear combination of predictors. You can use a linear model to make predictions or explain the relationship between the response and the predictors. Linear models are very flexible and widely used in applications in physical science, engineering, social science and business." as stated by Julian J. Faraway in his book Linear Models with R.

Linear models are used to explain the relationship between a single variable Y (the response or output) and one or more predictors or inputs X1, X2, ..., Xp (where p is the number of predictors).

A general form for the model is:
Y = f(X1, X2,..., Xp ) + ε
where f() represents an unknown function and  ε  the error of the model. 

Even if we assume that f() is a smooth, continuous function, that still leaves a wide range of possibilities, and the data we can collect is usually not enough to estimate f() directly. We therefore have to use a more restrictive form: linear.

For example, if we are working with 4 predictors:
Y = f(X1, X2, X3, X4)  + ε
we restrict the unknown function f() to a linear form, which gives the following linear model:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
Hence, with the linear model approach, the problem is reduced to estimating the parameters (β0, β1, β2, β3, β4) instead of the infinite-dimensional f().

A model is linear if the parameters enter linearly in the equation; the predictors themselves do not have to be linear. For instance, the following equations are linear:
Y = β0 + β1X1 + β2X2 + β3X3 + ε
Y = β0 + β1X1 + β2X2 + β3X3 + ε
Y = β0 + β1X1 + β2 log X2 + ε
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

While these equations are not:

Y = β0 + β1X1 + β2X2^β3 + ε
Y = β0(β1X1) + β2X2 + ε 
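
One way to see what "the parameters enter linearly" means is to check numerically whether adding two parameter vectors adds the model outputs. The sketch below is only an illustration: the function names, the parameter values and the evaluation point are all made up, and the error term ε is left out. It compares Y = β0 + β1X1 + β2 log X2 with Y = β0(β1X1) + β2X2.

```python
import math

def linear_in_params(beta, x1, x2):
    # Y = b0 + b1*X1 + b2*log(X2): each parameter multiplies a fixed quantity
    b0, b1, b2 = beta
    return b0 + b1 * x1 + b2 * math.log(x2)

def nonlinear_in_params(beta, x1, x2):
    # Y = b0*(b1*X1) + b2*X2: two parameters are multiplied together
    b0, b1, b2 = beta
    return b0 * (b1 * x1) + b2 * x2

def is_additive_in_params(model, beta_a, beta_b, x1, x2, tol=1e-9):
    """Check model(beta_a + beta_b) == model(beta_a) + model(beta_b) at one point."""
    summed = [a + b for a, b in zip(beta_a, beta_b)]
    lhs = model(summed, x1, x2)
    rhs = model(beta_a, x1, x2) + model(beta_b, x1, x2)
    return abs(lhs - rhs) < tol

# Arbitrary illustrative parameter vectors and predictor values
beta_a = [1.0, 2.0, 3.0]
beta_b = [0.5, 1.0, 2.0]

linear_ok = is_additive_in_params(linear_in_params, beta_a, beta_b, x1=2.0, x2=3.0)
nonlinear_ok = is_additive_in_params(nonlinear_in_params, beta_a, beta_b, x1=2.0, x2=3.0)
print(linear_ok, nonlinear_ok)
```

Note that the log of X2 does not break linearity, because the check is about the parameters and not about the predictors. A single failed check is enough to show that a model is not linear in its parameters, while a passed check at one point is only suggestive.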

Some equations can be transformed to linearity, so linear models are very flexible and are able to handle complex datasets.
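
As an illustration of such a transformation (this example is not from the text above, it is an assumed one), a power-law model Y = β0 · X^β1 · ε becomes linear after taking logarithms: log Y = log β0 + β1 log X + log ε. The sketch below uses made-up, noise-free data and a hypothetical helper `fit_line` that computes the usual least-squares line.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical noise-free data generated from Y = 2 * X**1.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x ** 1.5 for x in xs]

# Fit a straight line on the log-log scale, then map the intercept back
log_b0, b1 = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
b0 = math.exp(log_b0)
print(b0, b1)
```

Because the data here contain no error, the fit should recover values very close to the generating ones (2 and 1.5). With real data the error also gets transformed, so the fitted values would only be estimates.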

When there is only one predictor, the model is called a Simple Linear Model:
Y = β0 + β1X1 + ε
and when there is more than one predictor, it is called a Multiple Linear Model:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
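
To make "estimating the parameters" concrete, here is a minimal sketch that fits a multiple linear model by least squares. Least squares is an assumption on my part, since the text above does not say which estimation method is used. The helper names and the data are hypothetical. The sketch solves the normal equations (XᵀX)β = XᵀY with plain Gaussian elimination.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear_model(rows, ys):
    """Least-squares estimates (b0, b1, ..., bp) for Y = b0 + b1*X1 + ... + bp*Xp."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept b0
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated from Y = 1 + 2*X1 - 3*X2 with no error term
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

beta = fit_linear_model(rows, ys)
print(beta)
```

Passing rows with a single value each gives the Simple Linear Model as a special case. Solving the normal equations directly is fine for a small illustration like this one, but statistical software usually relies on more numerically stable methods.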


The error term ε has to be taken into account when working with linear models because it captures:
1. The effect of variables not considered in the model
2. Unforeseen events (catastrophes, fashions, etc.)
3. Errors from observations or measurements

Linear models can be used for:
1. Verifying the existence of a linear relationship
2. Predicting the variable Y as a function of X or Xs (X1, X2, ..., Xp)
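
Both uses can be sketched for the simple case with one predictor. The example below is only an illustration under my own assumptions: the data are made up, the Pearson correlation coefficient is one common (but not the only) way to judge how close a relationship is to linear, and the line is fitted by least squares.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_simple(xs, ys):
    """Least-squares estimates (b0, b1) for Y = b0 + b1*X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def predict(b0, b1, x):
    """Predicted Y for a new value of X."""
    return b0 + b1 * x

# Hypothetical observations that lie close to a straight line
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)            # values near +1 or -1 suggest a linear relationship
b0, b1 = fit_simple(xs, ys)
y_new = predict(b0, b1, 6.0)     # prediction for an unobserved X = 6
print(r, b0, b1, y_new)
```

A correlation close to 1 supports a linear relationship but does not prove it, so looking at a plot of the data is still advisable. Predictions far outside the observed range of X should also be treated with caution.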

