Multiple Linear Regression


Predicting one thing from several things

Still fitting a line to some data, just in multiple dimensions.

The line:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dotsb + \beta_p x_p$$

$$y = \beta_0 + \sum^p_{i=1} \beta_i x_i$$

  • Many predictors/inputs ($x_1, x_2, x_3, \dots, x_p$)
  • One response/output ($y$)
  • Need to find $\beta_0, \beta_1, \beta_2, \dots, \beta_p$

We aim to find the values of $\beta$ that minimize

$$\sum^{M}_{i = 1} (y_i - \hat{y}_i)^2 = \sum^{M}_{i = 1} \left( y_i - \left( \beta_0 + \sum^{P}_{j = 1} \beta_j x_{ij} \right) \right)^2$$

  • $y_i$ is the actual value we are trying to predict (ground truth).
  • $\hat{y}_i$ is the prediction made by the model.
  • $M$ is the number of examples.
  • $i$ stands for the i-th example.
  • $j$ stands for the j-th attribute/dimension of an example (e.g., wind speed, gender).
  • $P$ stands for the number of attributes/dimensions of an example (written $p$ in the model equation above).
  • $\beta_j$ is the weight for that attribute/dimension. It is also known as a coefficient.
  • $x_{ij}$ stands for the j-th attribute/dimension of the i-th example.
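As a minimal sketch of this minimization, the snippet below fits the coefficients with NumPy's least-squares solver. The synthetic data, the random seed, and the names `X_design`, `true_beta` and `beta_hat` are assumptions made for illustration; none of them come from the case study.

```python
import numpy as np

# Synthetic data (assumed for illustration): M examples, P attributes.
rng = np.random.default_rng(0)
M, P = 200, 3
X = rng.normal(size=(M, P))
true_beta = np.array([2.0, -1.0, 0.5, 3.0])  # beta_0 first, then beta_1..beta_P
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=M)

# Prepend a column of ones so that beta_0 is estimated with the other weights.
X_design = np.column_stack([np.ones(M), X])

# Least squares: the beta that minimizes the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# The quantity being minimized, evaluated at the fitted coefficients.
sse = np.sum((y - X_design @ beta_hat) ** 2)
print(beta_hat, sse)
```

The fitted `beta_hat` should land close to `true_beta`, and no other choice of coefficients gives a smaller `sse` on this data.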

Analysing a Linear Model's Performance

Several statistics can be used to assess the individual terms and the validity of the whole model.

Individual Terms (Coefficients)

The example data comes from the number of cyclists case study.

import statsmodels.formula.api as sm  # assumed import: the formula API is what provides sm.ols
model = sm.ols(formula="cnt ~ atemp + temp + hum + windspeed", data=data_train).fit()
                coef    std err        t    P>|t|     [0.025     0.975]
Intercept  1708.8451    296.956    5.755    0.000   1124.182   2293.509
atemp     -3132.5570   3164.040   -0.990    0.323  -9362.093   3096.979
temp       8823.6644   2823.617    3.125    0.002   3264.372   1.44e+04
hum       -1134.9410    302.778   -3.748    0.000  -1731.067   -538.815
windspeed -3052.9184    642.368   -4.753    0.000  -4317.647  -1788.190

Standard Error

Definition: A measure of how much the coefficient would vary if we resampled the data and recomputed the regression.


t-value

Definition: $t = \frac{\text{Coefficient}}{\text{Standard Error}}$. A t-value far from 0 indicates that the estimate is less likely to be the result of noise. Therefore, we want this to be a long way from 0.
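The t column of the coefficient table can be reproduced by dividing each coefficient by its standard error. The sketch below does this with the values copied from the table; the dictionary layout and the names `rows` and `t_values` are only illustrative.

```python
# Coefficient and standard error for each term, copied from the summary table.
rows = {
    "Intercept": (1708.8451, 296.956),
    "atemp": (-3132.5570, 3164.040),
    "temp": (8823.6644, 2823.617),
    "hum": (-1134.9410, 302.778),
    "windspeed": (-3052.9184, 642.368),
}

# t = coefficient / standard error
t_values = {name: coef / se for name, (coef, se) in rows.items()}

for name, t in t_values.items():
    print(f"{name:<10} t = {t:.3f}")
```

The term atemp has the t-value closest to 0, which agrees with it being the only term with a high p-value in the table.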


On an individual term, the p-value is the most useful statistic. The null hypothesis states that the coefficient is equal to 0.

  • A low p-value means that you can reject the null hypothesis: the term is significant and therefore likely to matter to the model.
  • A high p-value means that the term is not significant. There are two main reasons for this:
    1. The predictor term (input) and the response are not related. In this case, the term shouldn't be in the model.
    2. The predictor term is correlated with another predictor, so the relationship between that variable and the response is captured twice in the model. In the table above, atemp (p = 0.323) is a likely example, since it probably overlaps heavily with temp.


When an individual term's p-value is high, checking the correlation between predictors can help identify which of the two cases applies and how to improve the model (see correlation).
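A rough sketch of such a correlation check is shown below. It uses synthetic data, not the case study's: the arrays borrow the names `temp`, `atemp` and `hum` for readability, and `atemp` is deliberately built as a noisy copy of `temp` to mimic two predictors that carry the same information.

```python
import numpy as np

# Synthetic predictors (assumed for illustration, not the real cyclist data).
rng = np.random.default_rng(42)
temp = rng.uniform(0.1, 0.9, size=300)
atemp = temp + rng.normal(scale=0.02, size=300)  # nearly a copy of temp
hum = rng.uniform(0.2, 1.0, size=300)            # generated independently of temp

# Each row passed to corrcoef is one variable; the result is a 3x3 matrix.
corr = np.corrcoef([temp, atemp, hum])

print("temp vs atemp:", round(corr[0, 1], 3))
print("temp vs hum:  ", round(corr[0, 2], 3))
```

A correlation near 1 (or -1) between two predictors suggests that one of them could be dropped and the model refitted.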

Model Level

Dep. Variable:     cnt                R-squared:           0.748
Model:             OLS                Adj. R-squared:      0.744
Method:            Least Squares      F-statistic:         198.4
Date:              Tue, 08 Mar 2022   Prob (F-statistic):  8.06e-79
Time:              01:32:58           Log-Likelihood:      -2190.2
No. Observations:  273                AIC:                 4390.
Df Residuals:      268                BIC:                 4408.
Df Model:          4
Covariance Type:   nonrobust

$R^2$ and RMSE are useful values to consider.


Definition: $R^2$ indicates how much of the observed variance is explained by the model.

  • 0 - the model captures/explains nothing
  • 1 - the model captures/explains everything
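    • 1 is the ideal performance, but a value of 1 may also indicate overfitting.
    • The $R^2$ in the summary is calculated on the training data. Its reading as a fraction between 0 and 1 only holds there: on unseen data, SSE can exceed SSY, which makes $R^2$ negative.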

Defined as follows:

  • $R^2 = 1 - \frac{SSE}{SSY}$
  • $SSE = \sum^n_{i=1}(y_i - \hat{y}_i)^2$. SSE is the sum of the squared errors between the predicted and the actual values over the $n$ examples.
  • $SSY = \sum^n_{i=1}(y_i - \bar{y})^2$. SSY, also called SST (Total Sum of Squares), is the sum of the squared differences from the mean.
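The definition translates directly into a few lines of NumPy. This is a small sketch with made-up numbers; the function name `r_squared` and the example values are hypothetical.

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - SSE / SSY, following the definition above."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sse = np.sum((actual - predicted) ** 2)      # squared errors of the model
    ssy = np.sum((actual - actual.mean()) ** 2)  # squared differences from the mean
    return 1.0 - sse / ssy

actual = [1.0, 2.0, 3.0, 4.0]
good = r_squared(actual, [1.1, 1.9, 3.2, 3.8])      # close to 1
baseline = r_squared(actual, [2.5, 2.5, 2.5, 2.5])  # always predicting the mean gives 0
print(good, baseline)
```

Predicting the mean for every example gives SSE equal to SSY, which is why 0 corresponds to a model that explains nothing.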

Adjusted R-Squared

Definition: Considers the model's complexity (number of terms) as well as how much variance it explains, which means that complex models are penalized.
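One common form of the penalty is $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. The notes do not state which formula produced the summary value, so treat the sketch below as an assumption; the function name is hypothetical, and the inputs are taken from the model summary above.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Shrink R^2 according to how many predictors were used."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared = 0.748, No. Observations = 273, Df Model = 4 (from the summary).
adj = adjusted_r_squared(0.748, 273, 4)
print(round(adj, 3))  # 0.744, matching Adj. R-squared in the summary
```

Adding a predictor that barely raises $R^2$ increases $p$ and can therefore lower the adjusted value.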

RMSE (Root Mean Square Error)

Definition: Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). It measures how spread out these residuals are, so smaller RMSEs are better. The scale of RMSE depends on the data: it is expressed in the units of the response. Unlike the $R^2$ in the model summary, which describes the training data, it is commonly computed on training, validation, and testing sets.

$$RMSE = \sqrt{ \overline{ (prediction - actual)^2 } }$$
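import numpy

def compute_RMSE(predicted, actual):
    # Square root of the mean squared difference; both inputs are NumPy arrays of equal shape.
    return numpy.sqrt(numpy.mean((predicted - actual)**2))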
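To illustrate that the scale of RMSE depends on the data, the sketch below computes it twice on the same made-up predictions, once in the original units and once with everything multiplied by 1000. The function is restated under the hypothetical name `rmse` so that the block runs on its own; the numbers are invented.

```python
import numpy as np

def rmse(predicted, actual):
    """Root of the mean squared difference between predictions and truth."""
    return np.sqrt(np.mean((predicted - actual) ** 2))

predicted = np.array([2.0, 4.0, 6.0])
actual = np.array([1.0, 4.0, 8.0])

rmse_units = rmse(predicted, actual)
rmse_thousands = rmse(predicted * 1000, actual * 1000)  # same data, different units

print(rmse_units, rmse_thousands)
```

The second value is 1000 times the first even though the model is no worse, so RMSE values are only comparable between models that predict the same response in the same units.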

What is RMSE? [2-min video, blog]

Df Residuals

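Definition: df(Residual) is the total number of observations (rows) minus the number of parameters being estimated. In the summary above, 273 observations minus 5 parameters (4 coefficients plus the intercept) gives 268.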


Prob (F-statistic)

Definition: A p-value that indicates whether the model as a whole is significant, much as the p-values of the individual terms do for each coefficient.
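The F-statistic behind this p-value can be approximated from quantities already in the summary, using the standard relation $F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}$. The sketch below is an illustration under that assumption and does not show how the summary itself was computed; the function name is hypothetical.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of a linear model with an intercept, from its R^2."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r2 / df_model) / ((1.0 - r2) / df_resid)

# R-squared = 0.748, No. Observations = 273, Df Model = 4 (from the summary).
f_stat = f_statistic(0.748, 273, 4)
print(round(f_stat, 1))
```

The result is close to the reported 198.4; the small gap most likely comes from $R^2$ being rounded to three decimals in the summary. A value this large with 4 and 268 degrees of freedom corresponds to the tiny Prob (F-statistic) shown there.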
