So far we have done some simple data modelling using linear and quadratic equations. In a perfect world, a linear relationship between one variable \(x\) and another \(y\) can be described as \(y = mx + b\), i.e. using the equation of a line. Instead of \(m\) and \(b\) we will use \(\alpha\) to denote the intercept and \(\beta\) to denote the slope, i.e. \(y = \alpha + \beta x\). Once we solve for \(\alpha\) and \(\beta\), we can estimate the value of \(y\) that corresponds to any given \(x\), and all the points will lie on this line. To represent the different data points, we use the subscript \(i\). Thus our equation becomes:
\[ y_i = \alpha + \beta x_i \]
where \((x_i, y_i)\) represents the \(i^{th}\) point. We can therefore write \((x_1,y_1), (x_2,y_2), \cdots, (x_n,y_n)\) to represent \(n\) data points. For example:
\[ \begin{array}{r|r|r|r|} i & x_i & y_i = 2x_i + 1&(x_i, y_i)\\ \hline 1&1.0&3.0&(1.0, 3.0)\\ 2&1.5&4.0&(1.5, 4.0)\\ 3&2.0&5.0&(2.0, 5.0)\\ 4&2.5&6.0&(2.5, 6.0)\\ 5&3.0&7.0&(3.0, 7.0)\\\end{array} \]
In a perfect world, we can use any two points above to find the slope \(\beta\) and the intercept \(\alpha\), and we will obtain the equation \(y = 2x + 1\).
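For instance, using the first and last points from the table:

\[ \beta = \frac{y_5 - y_1}{x_5 - x_1} = \frac{7.0 - 3.0}{3.0 - 1.0} = 2, \qquad \alpha = y_1 - \beta x_1 = 3.0 - 2(1.0) = 1, \]

which recovers \(y = 2x + 1\) exactly.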
But we are not in a perfect world. There are errors involved in measuring, reading and recording the \(x_i\) and \(y_i\), along with many other sources of error. We therefore end up with points that resemble the true values but are not themselves true, and because of these errors not all the points will fall on the line. The equation now becomes \(y_i = \alpha + \beta x_i + \epsilon_i\). We need to find a line that is as close as possible to the (unknown) true line. Using any two points would give a different result each time, so we need to use all the points rather than just two. The approximate equation is obtained by forcing the error terms to be zero or as close to zero as possible, i.e. the closer the \(\epsilon_i\) are to 0, the better our equation is.
How do we set up the problem? Starting from \(y_i = \alpha + \beta x_i + \epsilon_i\), we can rewrite the equation as \(\epsilon_i = y_i - (\alpha + \beta x_i)\). This becomes a root-finding problem, i.e. we find \(\alpha\) and \(\beta\) such that \(\epsilon_i = 0\). Previously we noted that a root-finding problem can be written as a minimization problem. In this case, we will use squared errors, i.e.
\[ L = \sum_i\epsilon_i^2 = \sum_i (y_i - \hat y_i)^2 \quad\text{where}\quad \hat y_i = \alpha + \beta x_i \]
The objective is to minimize the loss function \(L\). The values of \(\alpha\) and \(\beta\) that give the minimal loss define the best linear model.
Analytically, we can solve this minimization problem using calculus: find the derivatives, equate them to zero, and solve for \(\alpha\) and \(\beta\) simultaneously:
\[ \begin{aligned} \frac{\partial L}{\partial \alpha} : &-2\sum_{i}(y_i - \alpha - \beta x_i) = 0\\ \implies &\sum_iy_i - \beta\sum_ix_i - n\alpha = 0 \implies \alpha = \frac{\sum_iy_i - \beta\sum_ix_i}{n}\\ \therefore\quad &\boxed{\hat \alpha = \bar y - \hat \beta\bar x}\\ \end{aligned} \]
\[ \begin{aligned} \frac{\partial L}{\partial \beta} : &-2\sum_{i}(y_i - \hat y_i)x_i = 0 \implies \sum_{i}\Big[y_i - (\alpha+\beta x_i)\Big]x_i = 0\\ \implies&\sum_{i}\Big[y_i - (\bar y - \beta \bar x + \beta x_i)\Big]x_i = \sum_{i}\Big[(y_i - \bar y) - \beta(x_i - \bar x)\Big]x_i\\ 0&= \sum_{i}(y_i - \bar y)x_i - \beta\sum_i(x_i - \bar x)x_i\\ \implies &\boxed{\hat \beta = \frac{\sum_{i}(y_i - \bar y)x_i}{\sum_i(x_i - \bar x)x_i} = \frac{\sum_i(y_i - \bar y)(x_i - \bar x)}{\sum_i(x_i - \bar x)^2}} \end{aligned} \]
These two equations give us the closed-form solution for the simple linear model.
Example:
Suppose we have the following data; find the equation of the line that best describes it.
\[ \begin{array}{r|r} x & y\\ \hline 5.1 & 3.5\\ 4.9 & 3.0\\ 4.7 & 3.2\\ 4.6 & 3.1\\ 5.0 & 3.6\\ 5.4 & 3.9\\ \end{array} \]
::: {panel-tabset group='language'}

## R
```r
x <- c(5.1, 4.9, 4.7, 4.6, 5, 5.4)
y <- c(3.5, 3, 3.2, 3.1, 3.6, 3.9)
y_mean <- mean(y)
x_mean <- mean(x)
b <- sum((y - y_mean)*x) / sum((x - x_mean)*x)
a <- y_mean - b*x_mean
cat('a =', a, '\tb =', b)
```

```
a = -1.685944 	b = 1.024096
```
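As a quick sanity check (a sketch, not part of the derivation above), the same estimates can be recovered from R's built-in helpers, since \(\hat\beta = \operatorname{cov}(x,y)/\operatorname{var}(x)\):

```r
# the closed-form slope equals cov(x, y) / var(x)
b_check <- cov(x, y) / var(x)
a_check <- mean(y) - b_check * mean(x)
c(a = a_check, b = b_check)
coef(lm(y ~ x))   # R's built-in fit agrees
```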
## Python

```python
import numpy as np

x = np.array([5.1, 4.9, 4.7, 4.6, 5, 5.4])
y = np.array([3.5, 3, 3.2, 3.1, 3.6, 3.9])
y_mean = np.mean(y)
x_mean = np.mean(x)
b = np.sum((y - y_mean)*x) / np.sum((x - x_mean)*x)
a = y_mean - b*x_mean
print('a = ', a, '\tb = ', b)
```

```
a =  -1.6859437751003128 	b =  1.0240963855421508
```
:::
The formulation of \(\alpha\) and \(\beta\) above is limited to a single predictor variable \(x\). What if we had many predictors?
In the simple linear regression above, we expressed every quantity in scalar notation. We can rewrite the same model using matrix notation:
\[ \begin{aligned} \begin{array}{c} y_1 = \alpha + \beta x_1 + \epsilon_1\\ y_2 = \alpha + \beta x_2 + \epsilon_2\\ \vdots\\ y_n = \alpha + \beta x_n + \epsilon_n\\ \end{array}\implies& \begin{pmatrix}y_1\\y_2\\\vdots\\y_n\end{pmatrix} = \alpha\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix} + \beta\begin{pmatrix}x_1\\x_2\\\vdots\\x_n\end{pmatrix} +\begin{pmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{pmatrix}\\ \\ \\ \implies&\underbrace{\begin{pmatrix}y_1\\y_2\\\vdots\\y_n\end{pmatrix}}_{\mathbf{Y}} = \underbrace{\begin{pmatrix}1&x_1\\1&x_2\\\vdots&\vdots\\1&x_n\end{pmatrix}}_{\mathbf{X}}\underbrace{\begin{pmatrix}\alpha\\\beta\end{pmatrix}}_{\large\mathbf{\beta}} + \underbrace{\begin{pmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{pmatrix}}_{\large\mathbf{\epsilon}}\\ \implies& \boxed{\mathbf{Y} = \mathbf{X}\beta + \epsilon} \end{aligned} \]
Using this matrix notation, the loss/cost function that we hope to minimize is expressed as:
\[ \mathcal{L}(\beta;\mathbf{X}, \mathbf{y}) = {\Large \epsilon}^\top{\Large \epsilon} = \left(\mathbf{y} - \mathbf{X}\beta\right)^\top\left(\mathbf{y}-\mathbf{X}\beta\right) = \big|\big|\mathbf{y} - \mathbf{X}\beta\big|\big|^2 \]
We can solve for \(\beta\), the vector of parameters, by differentiating the loss function with respect to \(\beta\).
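Expanding the quadratic form first makes the differentiation transparent:

\[ \mathcal{L}(\beta;\mathbf{X}, \mathbf{y}) = \mathbf{y}^\top\mathbf{y} - 2\beta^\top\mathbf{X}^\top\mathbf{y} + \beta^\top\mathbf{X}^\top\mathbf{X}\beta \]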
\[ \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \begin{aligned} \frac{\partial}{\partial \beta}\mathcal{L}(\beta;\mathbf{X,y}) =& -2\mathbf{X}^\top(\mathbf{y-X\beta}) = 0\implies \mathbf{X}^\top\mathbf{y}=\X^\top\X\beta\\ \implies&\boxed{\hat\beta = (\X^\top\X)^{-1}\X^\top\y} \end{aligned} \]
Let's use this formula to solve the previous problem:
```r
x <- c(5.1, 4.9, 4.7, 4.6, 5, 5.4)
y <- c(3.5, 3, 3.2, 3.1, 3.6, 3.9)
X <- cbind(a = 1, b = x)
solve(t(X)%*%X, t(X)%*%y)
```

```
        [,1]
a -1.685944
b  1.024096
```

```r
solve(crossprod(X), crossprod(X, y))
```

```
        [,1]
a -1.685944
b  1.024096
```
```python
import numpy as np

x = np.array([5.1, 4.9, 4.7, 4.6, 5, 5.4])
y = np.array([3.5, 3, 3.2, 3.1, 3.6, 3.9])
X = np.c_[np.ones(x.size), x]
np.linalg.solve(X.T @ X, X.T @ y)
```

```
array([-1.68594378,  1.02409639])
```
Note that we still get the same results as before.
Writing the problem in matrix notation makes the extension to multiple variables straightforward. The formulation and computation remain as shown above:
\[ \begin{aligned} \mathbf{X} =& \begin{array}{c|c|c|c}\operatorname{intercept}&\operatorname{variable}_1&\dots& \operatorname{variable}_p\\\hline1&x_{11}&\dots&x_{1p}\\ 1&x_{21}&\dots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{n1}&\dots&x_{np}\\ \end{array}, \quad \mathbf{y} = \begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix}\\ \text{ The }i^{th}&\text{ observation } \mathbf{x}_i^\top = \begin{bmatrix} 1&x_{i1}&\dots&x_{ip}\end{bmatrix}\\ \text{The cost function that} &\text{ we hope to minimize: }\\ \mathcal{L}(\beta;\mathbf{X},\mathbf{y}) = &\sum_{i=1}^n (y_i - \mathbf{x}_i^\top\beta)^2 = (\mathbf{y-X\beta})^\top(\mathbf{y-X\beta}) = ||\mathbf{y-X\beta}||^2\\ \text{And the minimization}& \text{ occurs at}:\\ \hat\beta = & (\X^\top\X)^{-1}\X^\top\y\quad\text{as previously shown}. \end{aligned} \]
Example 2:

Fit a linear model to the iris data set, taking `Sepal.Length` to be the response variable and `Sepal.Width`, `Petal.Length` and `Petal.Width` to be the predictor variables. Include the intercept term.
```r
y <- iris$Sepal.Length                                       # y <- iris[,1]
x <- iris[c("Sepal.Width", "Petal.Length", "Petal.Width")]   # x <- iris[2:4]
X <- cbind(intercept = 1, as.matrix(x))                      # column bind 1 as intercept
solve(t(X)%*%X, t(X)%*%y)                                    # solve(crossprod(X), crossprod(X,y))
```

```
                   [,1]
intercept     1.8559975
Sepal.Width   0.6508372
Petal.Length  0.7091320
Petal.Width  -0.5564827
```
```python
import numpy as np
from sklearn.datasets import load_iris

x, Species = load_iris(return_X_y = True)
y = x[:,0]                           # the first column as y
X = np.c_[np.ones(y.size), x[:,1:]]  # column bind 1 as intercept
np.linalg.solve(X.T @ X, X.T @ y)
```

```
array([ 1.85599749,  0.65083716,  0.70913196, -0.55648266])
```
From the above results, we could write our equation as:
\[ \begin{array}{}\hat y_i = 1.856 + 0.651X_{1i} + 0.709X_{2i} -0.556X_{3i}\end{array}\quad \text{where } \quad\begin{array}{ccc}\hat y =\text{Predicted Sepal Length}\\X_1 = \text{Sepal Width}\\ X_2 = \text{Petal Length}\\X_{3} =\text{Petal Width} \end{array} \]
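To use the fitted equation for prediction, multiply a new observation by the coefficients. A small sketch with made-up measurements (not taken from the data):

```r
# hypothetical new flower: Sepal.Width = 3.0, Petal.Length = 4.5, Petal.Width = 1.5
beta_hat <- c(1.8559975, 0.6508372, 0.7091320, -0.5564827)
new_x    <- c(1, 3.0, 4.5, 1.5)    # leading 1 for the intercept
sum(beta_hat * new_x)              # predicted Sepal.Length
```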
For factors, we have to carry out one-hot encoding on each factor variable. Depending on the contrast chosen (mostly the treatment contrast), we convert each level into a variable. If an intercept is included in the model, we have to drop one level, i.e. the base level. This is because keeping all the levels together with the intercept would make the columns of \(\mathbf{X}\) linearly dependent, so the matrix \(\mathbf{X}^\top\mathbf{X}\) would be rank deficient.
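For reference, R's `model.matrix()` carries out this encoding automatically using the contrasts set for the factor; a minimal sketch on the iris `Species` column:

```r
# treatment contrasts: setosa (the first level) is absorbed into the intercept
head(model.matrix(~ Species, data = iris))
```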
Example: Include Species in the regression model of the iris dataset.
```r
y <- iris[,1]   # first column of iris

# convert column Species to one hot encoding, excluding the first (base) level
species <- with(iris, diag(nlevels(Species))[Species, -1])
colnames(species) <- levels(iris$Species)[-1]

# make X include the intercept
X <- cbind(Intercept = 1, as.matrix(iris[,2:4]), species)

solve(t(X) %*% X, t(X) %*% y)
```

```
                   [,1]
Intercept     2.1712663
Sepal.Width   0.4958889
Petal.Length  0.8292439
Petal.Width  -0.3151552
versicolor   -0.7235620
virginica    -1.0234978
```
Note that we do not need to load the data/packages again, as they were loaded previously.
```python
import numpy as np
from sklearn.datasets import load_iris

iris, Species = load_iris(return_X_y = True)

y = iris[:,0]   # first column of iris

# convert column Species to one hot encoding, excluding the first (base) level
species = np.identity(Species.max() + 1)[Species][:,1:]

# make X include the intercept
X = np.c_[np.ones(y.size), iris[:,1:], species]
np.linalg.solve(X.T @ X, X.T @ y)
```

```
array([ 2.17126629,  0.49588894,  0.82924391, -0.31515517, -0.72356196,
       -1.02349781])
```
There are three main R functions to use: `lm`, `lm.fit` and `lsfit`.

`lsfit` and `lm.fit`: since our `X` matrix above already includes the intercept column, when using these functions you can either drop that column or simply use `intercept = FALSE` (in `lsfit`), as shown below. Notice that `X` must be a matrix, while `y` can be a vector, or a matrix if there are many dependent variables.
```r
# we have already included an intercept in our matrix
model <- lsfit(X, y, intercept = FALSE)
model$coefficients
```

```
  Intercept  Sepal.Width Petal.Length  Petal.Width   versicolor    virginica
  2.1712663    0.4958889    0.8292439   -0.3151552   -0.7235620   -1.0234978
```
Other quantities are also computed that we have not covered here; you can look into how each component is calculated.
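For instance, assuming `model` from the `lsfit()` call above, a quick way to see what else is returned:

```r
names(model)   # the other components computed alongside the coefficients
```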
```r
lm.fit(X, y)$coefficients
```

```
  Intercept  Sepal.Width Petal.Length  Petal.Width   versicolor    virginica
  2.1712663    0.4958889    0.8292439   -0.3151552   -0.7235620   -1.0234978
```
Lastly, the most commonly used function is `lm`. Notice that the two functions above only take numeric values, so you would have to one-hot encode all factor/character variables before fitting your model. `lm` makes R do this for us automatically:
```r
formula <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
model <- lm(formula, iris)
summary(model)
```

```
Call:
lm(formula = formula, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79424 -0.21874  0.00899  0.20255  0.73103 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3068 on 144 degrees of freedom
Multiple R-squared:  0.8673,	Adjusted R-squared:  0.8627 
F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16
```
Take note of how we ran the model: we were able not only to use the data frame directly but also to specify a formula that captures the model being fitted. What if we had 100 variables? What if we had to fit interaction terms, quadratic terms, change contrasts, etc.? These are things worth learning. For example, there is a shortcut to run a linear model on all the variables at once, i.e. `lm(y ~ ., data)`, e.g. `lm(Sepal.Length ~ ., iris)`. A few common formula idioms are sketched below.
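A minimal sketch of some formula idioms (standard R syntax, shown here on the iris data):

```r
# the dot means "all other columns in the data"
lm(Sepal.Length ~ ., iris)
# interaction terms: ':' gives the interaction only, '*' also adds the main effects
lm(Sepal.Length ~ Sepal.Width * Petal.Length, iris)
# quadratic term: wrap arithmetic in I() so it is not parsed as formula syntax
lm(Sepal.Length ~ Petal.Length + I(Petal.Length^2), iris)
```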
In Python, we can solve the same problem in several ways, e.g. with `sklearn.linear_model.LinearRegression` or with `ols` from `statsmodels`.
```python
from sklearn.linear_model import LinearRegression as lm

reg = lm().fit(X[:,1:], y)
reg.intercept_
```

```
2.1712662921550714
```

```python
reg.coef_
```

```
array([ 0.49588894,  0.82924391, -0.31515517, -0.72356196, -1.02349781])
```
```python
import statsmodels.api as sm
from statsmodels.formula.api import ols as lm

iris = sm.datasets.get_rdataset('iris').data
iris.columns = iris.columns.str.replace('.', '_', regex = False)
formula = "Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width + Species"
model = lm(formula, iris).fit()
model.summary()
```
```
Dep. Variable:      Sepal_Length      R-squared:            0.867
Model:              OLS               Adj. R-squared:       0.863
Method:             Least Squares     F-statistic:          188.3
Date:               Sun, 29 Oct 2023  Prob (F-statistic):   2.67e-61
Time:               00:08:31          Log-Likelihood:      -32.558
No. Observations:   150               AIC:                  77.12
Df Residuals:       144               BIC:                  95.18
Df Model:           5
Covariance Type:    nonrobust

                          coef   std err        t    P>|t|    [0.025    0.975]
Intercept               2.1713     0.280    7.760    0.000     1.618     2.724
Species[T.versicolor]  -0.7236     0.240   -3.013    0.003    -1.198    -0.249
Species[T.virginica]   -1.0235     0.334   -3.067    0.003    -1.683    -0.364
Sepal_Width             0.4959     0.086    5.761    0.000     0.326     0.666
Petal_Length            0.8292     0.069   12.101    0.000     0.694     0.965
Petal_Width            -0.3152     0.151   -2.084    0.039    -0.614    -0.016

Omnibus:          0.418   Durbin-Watson:      1.966
Prob(Omnibus):    0.811   Jarque-Bera (JB):   0.572
Skew:            -0.060   Prob(JB):           0.751
Kurtosis:         2.722   Cond. No.            94.0
```
One could also use `sm.formula.ols(formula, data)`.
Another important point is that there are various other values associated with the linear regression output. All of these can be calculated directly using basic formulas, and they are mostly used for model evaluation and hypothesis testing.
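As a sketch of what this involves (reusing `X` and `y` from the Species example above), the main quantities reported by `summary()` come from a handful of formulas:

```r
beta_hat  <- solve(crossprod(X), crossprod(X, y))          # coefficients
res       <- y - X %*% beta_hat                            # residuals
df        <- nrow(X) - ncol(X)                             # residual degrees of freedom
sigma     <- sqrt(sum(res^2) / df)                         # residual standard error
se        <- sqrt(diag(sigma^2 * solve(crossprod(X))))     # std. errors of coefficients
t_value   <- beta_hat / se                                 # t statistics
r_squared <- 1 - sum(res^2) / sum((y - mean(y))^2)         # multiple R-squared
```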
The computation of the covariance matrix \(\mathbf{X}^\top\mathbf{X}\) (and its inverse) is very expensive, especially when the data set is very large. Of course, methods have been developed to address this, e.g. the singular value decomposition (SVD), which can be used to obtain the parameter estimates.
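A minimal sketch of the SVD route (assuming the same `X` and `y` as above): with \(\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top\), the solution is \(\hat\beta = \mathbf{V}\mathbf{D}^{-1}\mathbf{U}^\top\mathbf{y}\), which avoids forming \(\mathbf{X}^\top\mathbf{X}\) explicitly:

```r
# least squares via the singular value decomposition of X
s        <- svd(X)
beta_svd <- s$v %*% (crossprod(s$u, y) / s$d)   # V %*% diag(1/d) %*% t(U) %*% y
beta_svd
```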
Sometimes the data is so large that it cannot fit into memory, in which case we cannot solve the linear equations directly to obtain the parameters \(\beta\). A numerical method, i.e. gradient descent, is then used to obtain the solution. With this method, you can read a chunk of data into memory, update your results, and delete the chunk from memory before reading in the next one (a chunked variant is sketched after the example below).
A quick gradient descent for the iris dataset:
```r
x <- cbind(intercept = 1, as.matrix(iris[,2:4]))
y <- iris[,1]
b <- rep(1, 4)
n <- nrow(x)
for (i in 1:100000){
  b <- b - 0.02 * t(x) %*% -(y - x %*% b)/n
}
b
```

```
                   [,1]
intercept     1.8559975
Sepal.Width   0.6508372
Petal.Length  0.7091320
Petal.Width  -0.5564827
```
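To illustrate the chunking idea, here is a sketch only: the row blocks below stand in for chunks that would be read from disk one at a time, and the gradient is accumulated block by block before each update (reusing `x`, `y` and the step size from the R example above). The result is identical to the full-data loop.

```r
b      <- rep(1, 4)
n      <- nrow(x)
chunks <- split(seq_len(n), ceiling(seq_len(n) / 50))   # three 50-row blocks
for (i in 1:100000) {
  grad <- 0
  for (idx in chunks) {                                  # one pass over the "file"
    xc   <- x[idx, , drop = FALSE]
    yc   <- y[idx]
    grad <- grad - t(xc) %*% (yc - xc %*% b)             # accumulate, then discard the chunk
  }
  b <- b - 0.02 * grad / n                               # same update as before
}
b
```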
```python
n = iris.shape[0]
x = np.c_[np.ones(n), iris.iloc[:,1:-1].values]
y = iris.iloc[:,0].values
b = np.ones(x.shape[1])
for i in range(100000):
    b -= -0.02 * x.T @ (y - x @ b)/n
b
```

```
array([ 1.85599749,  0.65083716,  0.70913196, -0.55648266])
```