Multiple Linear Regression and The Normal Equations

← return to practice.dsc40a.com


This page contains all problems about Multiple Linear Regression and The Normal Equations.


Problem 1

Source: Spring 2023 Midterm 1, Problem 5

Let X be a design matrix with 4 columns, such that the first column is a column of all 1s. Let \vec{y} be an observation vector. Let \vec{w}^* = (X^TX)^{-1}X^T\vec{y}. We’ll name the components of \vec{w}^* as follows:

\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}

In this problem, we’ll consider various modifications to the design matrix and see how they affect the solution to the normal equations.


Problem 1.1

Let X_a be the design matrix that comes from interchanging the first two columns of X. Let \vec{w_a}^* = (X_a^TX_a)^{-1}X_a^T\vec{y}. Express the components \vec{w_a}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).

\vec{w_a}^* = \begin{bmatrix} w_1^* \\ w_0^* \\ w_2^* \\ w_3^* \end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_a}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

By swapping the first two columns of our design matrix, this changes the prediction rule to be of the form: H_2(\vec{x}) = v_1 + v_0x_1 + v_2x_2+ v_3x_3.

Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= w_1^* \\ v_1^* &= w_0^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

Intuitively, when we interchange two columns of our design matrix, all that does is interchange the terms in the prediction rule, which interchanges those weights in the parameter vector.



Problem 1.2

Let X_b be the design matrix that comes from adding one to each entry of the first column of X. Let \vec{w_b}^* = (X_b^TX_b)^{-1}X_b^T\vec{y}. Express the components \vec{w_b}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).

\vec{w_b}^* = \begin{bmatrix} \dfrac{w_0^*}{2} \\ w_1^* \\ w_2^* \\ w_3^*\end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_b}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

By adding one to each entry of the first column of the design matrix, we are changing the column of 1s to be a column of 2s. This changes the prediction rule to be of the form: H_2(\vec{x}) = v_0\cdot 2+ v_1x_1 + v_2x_2+ v_3x_3.

In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= \dfrac{w_0^*}{2} \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

This is saying we just halve the intercept term. For example, imagine fitting a line to data in \mathbb{R}^2 and finding that the best-fitting line is y=12+3x. If we had to write this in the form y=v_0\cdot 2 + v_1x, we would find that the best choice for v_0 is 6 and the best choice for v_1 is 3.



Problem 1.3

Let X_c be the design matrix that comes from adding one to each entry of the third column of X. Let \vec{w_c}^* = (X_c^TX_c)^{-1}X_c^T\vec{y}. Express the components \vec{w_c}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^*, which were the components of \vec{w}^*.

\vec{w_c}^* = \begin{bmatrix} w_0^* - w_2^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_c}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

By adding one to each entry of the third column of the design matrix, this changes the prediction rule to be of the form: \begin{aligned} H_2(\vec{x}) &= v_0+ v_1x_1 + v_2(x_2+1)+ v_3x_3 \\ &= (v_0 + v_2) + v_1x_1 + v_2x_2+ v_3x_3 \end{aligned}

In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by \begin{aligned} v_0^* &= w_0^* - w_2^* \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

One way to think about this is that if we replace x_2 with x_2+1, then our predictions will increase by the coefficient of x_2. In order to keep our predictions the same, we would need to adjust our intercept term by subtracting this same amount.



Problem 2

Source: Spring 2023 Final Part 1, Problem 5

Suppose we want to predict how long it takes to run a Jupyter notebook on Datahub. For 100 different Jupyter notebooks, we collect the following 5 pieces of information:

Then we use multiple regression to fit a prediction rule of the form H(\text{cells, lines, max iterations, variables}) = w_0 + w_1 \cdot \text{cells} \cdot \text{lines} + w_2 \cdot (\text{max iterations})^{\text{variables} - 10}


Problem 2.1

What are the dimensions of the design matrix X?

\begin{bmatrix} & & & \\ & & & \\ & & & \\ \end{bmatrix}_{r \times c}

So, what should r and c be for: r rows \times c columns.

100 \text{ rows} \times 3 \text{ columns}

There should be 100 rows because there are 100 different Jupyter notebooks with different information within them. There should be 3 columns, one for each w_i. In this case we have w_0, which means X will have a column of ones, w_1, which means X will have a second column of \text{cells} \cdot \text{lines}, and w_2, which will be the last column in X containing \text{max iterations})^{\text{variables} - 10}.


Problem 2.2

In one sentence, what does the entry in row 3, column 2 of the design matrix X represent? (Count rows and columns starting at 1, not 0).

This entry represents the product of the number of cells and number of lines of code for the third Jupyter notebook in the training dataset.



Problem 3

Source: Spring 2023 Final Part 1, Problem 6

Now we solve the normal equations and find the solution to be \begin{aligned} \vec{w}^* &= \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix} \end{aligned} Define a new vector: \begin{aligned} \vec{w}^{\circ} &= \begin{bmatrix} w_0^{\circ} \\ w_1^{\circ} \\ w_2^{\circ} \end{bmatrix} = \begin{bmatrix} w_0^*+3 \\ w_1^*-4 \\ w_2^*-6 \end{bmatrix} \end{aligned}

and consider the two prediction rules

\begin{aligned} H^*(\text{cells, lines, max iterations, variables}) &= w_0^* + w_1^* \cdot \text{cells} \cdot \text{lines} + w_2^* \cdot (\text{max iterations})^{\text{variables} - 10}\\ H^{\circ}(\text{cells, lines, max iterations, variables}) &= w_0^{\circ} + w_1^{\circ} \cdot \text{cells} \cdot \text{lines} + w_2^{\circ} \cdot (\text{max iterations})^{\text{variables} - 10} \end{aligned}

Let \text{MSE} represent the mean squared error of a prediction rule, and let \text{MAE} represent the mean absolute error of a prediction rule. Select the symbol that should go in each blank.


Problem 3.1

\text{MSE}(H^*) ___ \text{MSE}(H^{\circ})

\leq

It is given that \vec{w}^* is the optimal parameter vector. Here’s one thing we know about the optimal parameter vector \vec{w}^*: it is optimal, which means that any changes made to it will, at best, keep our predictions of the exact same quality, and, at worst, reduce the quality of our predictions and increase our error. And since \vec{w} \degree is just the optimal parameter vector but with some small changes to the weights, it stands that \vec{w} \degree is liable to create equal or greater error!

In other words, \vec{w} \degree is a slightly worse version of \vec{w}^*, meaning that H \degree(x) is a slightly worse version of H^*(x). So, H \degree(x) will have equal or higher error than H^*(x).

Hence: \text{MSE}(H^*) \leq \text{MSE}(H^{\circ})



Problem 4

Source: Winter 2022 Midterm 1, Problem 4

Consider the dataset shown below.

x^{(1)} x^{(2)} x^{(3)} y
0 6 8 -5
3 4 5 7
5 -1 -3 4
0 2 1 2


Problem 4.1

We want to use multiple regression to fit a prediction rule of the form H(x^{(1)}, x^{(2)}, x^{(3)}) = w_0 + w_1 x^{(1)}x^{(3)} + w_2 (x^{(2)}-x^{(3)})^2. Write down the design matrix X and observation vector \vec{y} for this scenario. No justification needed.

The design matrix X and observation vector \vec{y} are given by:

\begin{align*} X &= \begin{bmatrix} 1 & 0 & 4\\ 1 & 15 & 1\\ 1 & -15 & 4\\ 1 & 0 & 1 \end{bmatrix} \\ \vec{y} &= \begin{bmatrix} -5\\ 7\\ 4\\ 2 \end{bmatrix} \end{align*}

We got \vec{y} by looking at our dataset and seeing the y column.

The matrix X was found by looking at the equation H(x). You can think of each row of X being: \begin{bmatrix}1 & x^{(1)}x^{(3)} & (x^{(2)}-x^{(3)})^2\end{bmatrix}. Recall our bias term here is not affected by x^{(i)}, but it still exists! So we will always have the first element in our row be 1. We can then easily calculate the other elements in the matrix.


Problem 4.2

For the X and \vec{y} that you have written down, let \vec{w} be the optimal parameter vector, which comes from solving the normal equations X^TX\vec{w}=X^T\vec{y}. Let \vec{e} = \vec{y} - X \vec{w} be the error vector, and let e_i be the ith component of this error vector. Show that 4e_1+e_2+4e_3+e_4=0.

The key to this problem is the fact that the error vector, \vec{e}, is orthogonal to the columns of the design matrix, X. As a refresher, if \vec{w^*} satisfies the normal equations, then:

We can rewrite the normal equation (X^TX\vec{w}=X^T\vec{y}) to allow substitution for \vec{e} = \vec{y} - X \vec{w}.

\begin{align*} X^TX\vec{w}&=X^T\vec{y} \\ 0 &= X^T\vec{y} - X^TX\vec{w} \\ 0 &= X^T(\vec{y}-X\vec{w}) \\ 0 &= X^T\vec{e} \end{align*}

The first step is to find X^T, which is easy because we found X above: \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix}

And now we can plug X^T and \vec e into our equation 0 = X^T\vec{e}. It might be easiest to find the right side first: \begin{align*} X^T\vec{e} &= \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix} \cdot \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4\end{bmatrix} \\ &= \begin{bmatrix} e_1 + e_2 + e_3 + e_4 \\ 15e_2 - 15e_3 \\ 4e_1 + e_2 + 4e_3 + e_4\end{bmatrix} \end{align*}

Finally, we set it equal to zero! \begin{align*} 0 &= e_1 + e_2 + e_3 + e_4 \\ 0 &= 15e_2 - 15e_3 \\ 0 &= 4e_1 + e_2 + 4e_3 + e_4 \end{align*}

With this we have shown that 4e_1+e_2+4e_3+e_4=0.



Problem 5

Source: Winter 2023 Final, Problem 4

Reggie and Essie are given a dataset of real features x_i \in \mathbb{R} and observations y_i. Essie proposes the following linear prediction rule: H_1(\alpha_0,\alpha_1) = \alpha_0 + \alpha_1 x_i. and Reggie proposes to use v_i=(x_i)^2 and the prediction rule H_2(\gamma_0,\gamma_1) = \gamma_0 + \gamma_1 v_i.


Problem 5.1

Give an example of a dataset \{(x_i,y_i)\}_{i=1}^n for which minimum MSE(H_2) < minimum MSE(H_1). Explain.

Example: If the datapoints follow a quadratic form y_i=x_i^2 for all i, then the H_2 prediction rule will achieve a zero error while H_1>0 since the data do not follow a linear form.


Problem 5.2

Give an example of a dataset \{(x_i,y_i)\}_{i=1}^n for which minimum MSE(H_2) = minimum MSE(H_1). Explain.

Example 1: If the response variables are constant y_i=c for all i, then for both prediciton rules by setting \alpha_0=\gamma_0=c and \alpha_1=\gamma_1=0, both predictors will achieve MSE=0.

Example 2: when every single value of the features x_i and x^2_ i coincide in the dataset (this occurs when x = 0 or x = 1), the parameters of both prediction rules will be the same, as will the MSE.


Problem 5.3

A new feature z has been added to the dataset.

Essie proposes a linear regression model using two predictor variables x,z as H_3(w_0,w_1,w_2) = w_0 + w_1 x_i +w_2 z_i.

Explain if the following statement is True or False (prove or provide counter-example).

Reggie claims that having more features will lead to a smaller error, therefore the following prediction rule will give a smaller MSE: H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + \alpha_1 x_i +\alpha_2 z_i + \alpha_3 (2x_i-z_i)

H_4 can be rewritten as H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + (\alpha_1+2\alpha_3) x_i +(\alpha_2 - \alpha_3)z_i By setting \tilde{\alpha}_1=\alpha_1+2\alpha_3 and \tilde{\alpha_2}= \alpha_2 - \alpha_3 then

H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = H_4(\alpha_0,\tilde{\alpha}_1,\tilde{\alpha}_2) = \alpha_0 + \tilde{\alpha_1} x_i +\tilde{\alpha}_2 z_i

Thus H_4 and H_3 have the same normal equations and therefore the same minimum MSE.



Problem 6

Source: Winter 2024 Midterm 1, Problem 5

Suppose the following information is given for linear regression:

X = \begin{bmatrix} 1 & 2\\ 1 & -1 \end{bmatrix}

\vec{y} = \begin{bmatrix} a\\ b\\ \end{bmatrix} \vec{w}^{*} = \begin{bmatrix} 1\\ 2\\ \end{bmatrix}

Where X is the design matrix, \vec{y} is the observation vector, and \vec{w}^{*} is the optimal parameter vector. Solve for parameters a and b using the normal equations, show your work.

\begin{cases} a = 5\\ b = -1\\ \end{cases}

Since \vec{w}^{*} is the optimal parameter vector, it must satisfy the normal equations:

\begin{align*} X^{T}X\vec{w} = X^{T}\vec{y} \end{align*}

The left hand side of the equation will read:

\begin{align*} X^{T}X\vec{w} &= \begin{bmatrix} 1 & 1\\ 2 & -1 \end{bmatrix} \begin{bmatrix} 1 & 2\\ 1 & -1 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} \\ &= \begin{bmatrix} 2 & 1\\ 1 & 5 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} \\ &= \begin{bmatrix} 4\\ 11 \end{bmatrix} \end{align*}

The right hand side of the equation is given by:

\begin{align*} X^{T}\vec{y} &= \begin{bmatrix} 1 & 1\\ 2 & -1 \end{bmatrix} \begin{bmatrix} a\\ b \end{bmatrix} \\ &= \begin{bmatrix} a+b\\ 2a-b \end{bmatrix} \end{align*}

By setting the left hand side and right hand side equal to each other, we will obtain the following system of equations:

\begin{align*} \begin{bmatrix} 4\\ 11 \end{bmatrix} = \begin{bmatrix} a+b\\ 2a-b \end{bmatrix} \end{align*}

\begin{cases} &4 = a + b\\ &11 = 2a-b \end{cases}

To solve this equation set, we can add them together: \begin{align*} 4+11 &= a + b + 2a -b\\ 3a &= 15\\ \\ \\ \end{align*}

\begin{cases} a = 5\\ b = -1\\ \end{cases}


Problem 7

Source: Winter 2024 Final Part 1, Problem 4

Albert collected 400 data points from a radiation detector. Each data point contains 3 features: feature A, feature B and feature C. The true particle energy E is also reported. Albert wants to design a linear regression algorithm to predict the energy E of each particle, given a combination of one or more of feature A, B, and C. As the first step, Albert calculated the correlation coefficients among A, B, C and E. He wrote it down in the following table, where each cell of the table represents the correlaton of two terms:

A    B    C    E   
A 1 -0.99 0.13 0.8
B -0.99 1 0.25 -0.95
C 0.13 0.25 1 0.72
E 0.8 -0.95 0.72 1


Problem 7.1

Albert wants to start with a simple model: fitting only a single feature to obtain the true energy (i.e. y = w_0+w_1 x). Which feature should he choose as x to get the lowest mean square error?

B

B is the correct answer, because it has the highest absolute correlation (0.95), the negative sign in front of B just means it is negatively correlated to energy, and it can be compensated by a negative sign in the weight.


Problem 7.2

Albert wants to add another feature to his linear regression in part (a) to further boost the model’s performance. (i.e. y = w_0 + w_1 x + w_2 x_2) Which feature should he choose as x_2 to make additional improvements?

C

C is the correct answer, because although A has a higher correlation with energy, it also has an extremely high correlation with B (-0.99), that means adding A into the fit will not be too useful, since it provides almost the same information as B.


Problem 7.3

Albert further refines his algorithm by fitting a prediction rule of the form: \begin{aligned} H(A,B,C) = w_0 + w_1 \cdot A\cdot C + w_2 \cdot B^{C-7} \end{aligned}

Given this prediction rule, what are the dimensions of the design matrix X?

\begin{bmatrix} & & & \\ & & & \\ & & & \\ \end{bmatrix}_{r \times c}

So, what are r and c in r \text{ rows} \times c \text{ columns}?

400 \text{ rows} \times 3 \text{ columns}

Recall there are 400 data points, which means there will be 400 rows. There will be 3 columns; one is the bias column of all 1s, one is for the feature A\cdot C, and one is for the feature B^{C-7}.



Problem 8

Source: Fall 2021 Final Exam, Problem 6

Billy’s aunt owns a jewellery store, and gives him data on 5000 of the diamonds in her store. For each diamond, we have:

The first 5 rows of the 5000-row dataset are shown below:

carat    length    width    price   
0.40 4.81 4.76 1323
1.04 6.58 6.53 5102
0.40 4.74 4.76 696
0.40 4.67 4.65 798
0.50 4.90 4.95 987


Billy has enlisted our help in predicting the price of a diamond given various other features.


Problem 8.1

Suppose we want to fit a linear prediction rule that uses two features, carat and length, to predict price. Specifically, our prediction rule will be of the form

\text{predicted price} = w_0 + w_1 \cdot \text{carat} + w_2 \cdot \text{length}


We will use least squares to find \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}.

Write out the first 5 rows of the design matrix, X. Your matrix should not have any variables in it.

X = \begin{bmatrix} 1 & 0.40 & 4.81 \\ 1 & 1.04 & 6.58 \\ 1 & 0.40 & 4.74 \\ 1 & 0.40 & 4.67 \\ 1 & 0.50 & 4.90 \end{bmatrix}


Problem 8.2

Suppose the optimal parameter vector \vec{w}^* is given by

\vec{w}^* = \begin{bmatrix} 2000 \\ 10000 \\ -1000 \end{bmatrix}

What is the predicted price of a diamond with 0.65 carats and a length of 4 centimeters? Show your work.

The predicted price is 4500 dollars.

2000 + 10000 \cdot 0.65 - 1000 \cdot 4 = 4500


Problem 8.3

Suppose \vec{e} = \begin{bmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{bmatrix} is the error/residual vector, defined as

\vec{e} = \vec{y} - X \vec{w}^*

where \vec{y} is the observation vector containing the prices for each diamond.

For each of the following quantities, state whether they are guaranteed to be equal to 0 the scalar, \vec{0} the vector of all 0s, or neither. No justification is necessary.

  • \sum_{i = 1}^n e_i: Yes, this is guaranteed to be 0. This was discussed in Homework 4; it is a consequence of the fact that X^T (y - X \vec{w}^*) = 0 and that we have an intercept term in our prediction rule (and hence a column of all 1s in our design matrix, X).
  • || \vec{y} - X \vec{w}^* ||^2: No, this is not guaranteed to be 0. This is the mean squared error of our prediction rule, multiplied by n. \vec{w}^* is found by minimizing mean squared error, but the minimum value of mean squared error isn’t necessarily 0 — in fact, this quantity is only 0 if we can write \vec{y} exactly as X \vec{w}^* with no prediction errors.
  • X^TX \vec{w}^*: No, this is not guaranteed to be 0, either.
  • 2X^TX \vec{w}^* - 2X^T\vec{y}: Yes, this is guaranteed to be 0. Recall, the optimal parameter vector \vec{w}^* satisfies the normal equations X^TX\vec{w}^* = X^T \vec{y}. Subtracting X^T \vec{y} from both sides of this equation and multiplying both sides by 2 yields the desired result. (You may also find 2X^TX \vec{w}^* - 2X^T\vec{y} to be familiar from lecture — it is the gradient of the mean squared error, multiplied by n, and we set the gradient to 0 to find \vec{w}^*.)


Problem 8.4

Suppose we introduce two more features:

Suppose we also decide to remove the intercept term of our prediction rule. With all of these changes, our prediction rule is now

\text{predicted price} = w_1 \cdot \text{carat} + w_2 \cdot \text{length} + w_3 \cdot \text{width} + w_4 \cdot (\text{length} \cdot \text{width})

  • X = \begin{bmatrix} 0.40 & 4.81 & 4.76 & 4.81 \cdot 4.76 \\ 1.04 & 6.58 & 6.53 & 6.58 \cdot 6.53 \end{bmatrix}
  • No, it’s not guaranteed that the \vec{w}_1^* for this new prediction rule is equal to the \vec{w}_1^* for the original prediction rule. The value of \vec{w}_1^* in the new prediction rule will be influenced by the fact that there’s no longer an intercept term and that there are two new features (width and area) that weren’t previously there.



👋 Feedback: Find an error? Still confused? Have a suggestion? Let us know here.