Multiple Linear Regression and The Normal Equations



This page contains all problems about Multiple Linear Regression and The Normal Equations.


Problem 1

Source: Summer Session 2 2024 Midterm, Problem 4a-d

Suppose we want to fit a hypothesis function of the form: H(x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + w_3 (x^{(1)})^2 + w_4 (x^{(1)} x^{(2)})^2

Our dataset looks like this:

x^{(1)} x^{(2)} y
1 6 7
-3 8 2
4 1 9
-2 7 5
0 4 6


Problem 1.1

Suppose we found w_2^* = 15.6 using multiple linear regression. Now, suppose we rescaled our data so feature vector \vec{x^{(2)}} became \left[60,\ 80,\ 10,\ 70,\ 40\right]^T, and performed multiple linear regression in the same setting. What would the new value of w_2^* (the weight on feature x^{(2)} in H(x)) be? You do not need to simplify your answer.

w_2^* = 1.56

Recall that in linear regression, when a feature is rescaled by a factor of c, the corresponding weight is scaled by \frac{1}{c}: the coefficient adjusts so that the feature’s overall contribution to the prediction stays the same. In this problem, x^{(2)} has been scaled by a factor of 10! It changed from \left[6,\ 8,\ 1,\ 7,\ 4\right]^T to \left[60,\ 80,\ 10,\ 70,\ 40\right]^T. This means w_2^* will be scaled by \frac{1}{10}, and we can easily calculate \frac{1}{10}w_2^* = \frac{1}{10}\cdot 15.6 = 1.56.
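As a quick numerical check, here is a short numpy sketch (using the dataset from this problem) that refits the model with x^{(2)} scaled by 10 and confirms the weight on x^{(2)} shrinks by a factor of 10. Note that rescaling the raw x^{(2)} also rescales the derived feature (x^{(1)} x^{(2)})^2, but the question only asks about w_2^*.

```python
import numpy as np

# Dataset from the problem.
x1 = np.array([1, -3, 4, -2, 0], dtype=float)
x2 = np.array([6, 8, 1, 7, 4], dtype=float)
y  = np.array([7, 2, 9, 5, 6], dtype=float)

def design(x1, x2):
    # Columns: 1, x1, x2, x1^2, (x1*x2)^2, matching H(x) above.
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, (x1 * x2)**2])

w_orig, *_ = np.linalg.lstsq(design(x1, x2), y, rcond=None)
w_new,  *_ = np.linalg.lstsq(design(x1, 10 * x2), y, rcond=None)  # x2 rescaled by 10

print(w_new[2] / w_orig[2])  # ≈ 0.1, i.e. the new w2* is (1/10) of the old one
```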


Difficulty: ⭐️⭐️

The average score on this problem was 81%.



Problem 1.2

Suppose we found w_4^* = 72 using multiple linear regression. Now, suppose we rescaled our data so feature vector \vec{x^{(1)}} became \left[ \frac{1}{2},\ -\frac{3}{2},\ 2,\ -1,\ 0 \right]^T and \vec{x^{(2)}} became \left[36, \ 48, \ 6, \ 42, \ 24\right]^T, and we performed multiple linear regression in the same setting. What would the new value of w_4^* (the weight on feature (x^{(1)} x^{(2)})^2 in H(x)) be? You do not need to simplify your answer.

w_4^* = \frac{72}{(\frac{6}{2})^2}

Similar to the first part of the problem we need to find how \vec{x^{(1)}} and \vec{x^{(2)}} changed. We can then inversely scale w_4^* by those values.

Let’s start with \vec{x^{(1)}}. Originally \vec{x^{(1)}} was \left[1,\ -3,\ 4,\ -2,\ 0\right]^T, but it becomes \left[ \frac{1}{2},\ -\frac{3}{2},\ 2,\ -1,\ 0 \right]^T. We can see the values were scaled by \frac{1}{2}.

We can now look at \vec{x^{(2)}}. Originally \vec{x^{(2)}} was \left[6,\ 8,\ 1,\ 7,\ 4\right]^T, but it becomes \left[36, \ 48, \ 6, \ 42, \ 24\right]^T. We can see the values were scaled by 6.

We know w_4^* is attached to the feature (x^{(1)} x^{(2)})^2. This means we need to multiply the two scale factors together and then square the result: (\frac{1}{2} \cdot 6)^2 = (3)^2 = 9.

From here we simply inversely scale w_4^*. Originally w_4^* = 72, so we multiply by \frac{1}{9} to get 72 \cdot \frac{1}{9} = \frac{72}{9} (which is equal to \frac{72}{(\frac{6}{2})^2}).


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 35%.


Problem 1.3

Suppose we found w_0^* = 4.47 using multiple linear regression. Now, suppose our observation vector \vec{y} became \left[12, \ 7, \ 14, \ 10, \ 11\right]^T, and we performed multiple linear regression in the same setting. What would the new value of w_0^* be? You do not need to simplify your answer.

w_0^* = 9.47

Recall w_0^* is the intercept term and represents the value of the prediction H(x) when all of the features are zero. If all of the other weights (w_1^* through w_4^*) remain unchanged, then the intercept adjusts to reflect the shift in the mean of the observed \vec y.

Our original \vec y was \left[7, \ 2, \ 9, \ 5, \ 6\right]^T, but became \left[12, \ 7, \ 14, \ 10, \ 11\right]^T. We need to calculate the original \vec y’s mean and the new \vec y’s mean.

Old \vec y’s mean: \frac{7 + 2 + 9 + 5 + 6}{5} = \frac{29}{5} = 5.8

New \vec y’s mean: \frac{12 + 7 + 14 + 10 + 11}{5} = \frac{54}{5} = 10.8

Our new w_0^* is found by taking the old one and adding the difference of 10.8 - 5.8. This means: w_0^* = 4.47 + (10.8 - 5.8) = 4.47 + 5 = 9.47.
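A short numpy sketch (using the same dataset) that confirms adding 5 to every y_i changes only the intercept, and by exactly the shift in the mean of \vec y:

```python
import numpy as np

x1 = np.array([1, -3, 4, -2, 0], dtype=float)
x2 = np.array([6, 8, 1, 7, 4], dtype=float)
y  = np.array([7, 2, 9, 5, 6], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, (x1 * x2)**2])

w_old, *_ = np.linalg.lstsq(X, y, rcond=None)
w_new, *_ = np.linalg.lstsq(X, y + 5, rcond=None)  # every y_i shifted up by 5

print(w_new - w_old)  # ≈ [5, 0, 0, 0, 0]: only the intercept moves
```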


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.


Problem 1.4

Suppose we found w_1^* = 0.428 using multiple linear regression. Now, suppose our observation vector \vec{y} became \left[12, \ 7, \ 14, \ 10, \ 11\right]^T, and we performed multiple linear regression in the same setting. What would the new value of w_1^* (the weight on feature x^{(1)} in H(x)) be? You do not need to simplify your answer.

w_1^* = 0.428

Our old \vec y was \left[7, \ 2, \ 9, \ 5, \ 6\right]^T and our new \vec y is \left[12, \ 7, \ 14, \ 10, \ 11\right]^T. We can see that our old \vec y has five added to each value to get our new \vec y.

Recall that the slope w_1^* does not change if every y_i shifts by the same constant amount! This means w_1^* = 0.428.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.



Problem 2

Source: Summer Session 2 2024 Midterm, Problem 5a-e

Suppose we want to fit a hypothesis function of the form: H(x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + w_3 (x^{(1)})^2 + w_4 (x^{(1)} x^{(2)})^2

Our dataset looks like this:

x^{(1)} x^{(2)} y
1 6 7
-3 8 2
4 1 9
-2 7 5
0 4 6

We know we need to find an optimal parameter vector \vec{w}^* = \left[w_0^* \ \ w_1^* \ \ w_2^* \ \ w_3^* \ \ w_4^* \right]^T that satisfies the normal equations. To do so, we build a design matrix, but our columns get all shuffled due to an error with our computer! Our resulting design matrix is

X_\text{shuffled} = \begin{bmatrix} 36 & 6 & 1 & 1 & 1 \\ 576 & 8 & -3& 1 & 9 \\ 16 & 1 & 4 & 1 & 16 \\ 196 & 7 & -2& 1 & 4 \\ 0 & 4 & 0 & 1 & 0 \end{bmatrix}

If we solved the normal equations using this shuffled design matrix X_\text{shuffled}, we would not get our parameter vector \vec{w}^* = \left[w_0^* \ \ w_1^* \ \ w_2^* \ \ w_3^* \ \ w_4^* \right]^T in the correct order. Let \vec{s} = \left[ s_0 \ \ s_1 \ \ s_2 \ \ s_3 \ \ s_4 \right]^T be the parameter vector we find instead. Let’s figure out which features correspond to each weight in \vec{s}, found using the shuffled design matrix X_\text{shuffled}. Fill in the bubbles below.


Problem 2.1

First weight s_0 after solving normal equations corresponds to the term in H(x):

(x^{(1)} x^{(2)})^2

The first column of X_\text{shuffled} is the feature that s_0 multiplies, so we want to figure out which feature produces these values. We can immediately eliminate the intercept, x^{(1)}, and x^{(2)}, because none of those values match. From here we can compute (x^{(1)})^2 and (x^{(1)} x^{(2)})^2 to determine which feature creates this column.

\begin{align*} (x^{(1)})^2 &= \begin{bmatrix} 1^2 = 1 \\ (-3)^2 = 9\\ 4^2 = 16\\ (-2)^2 = 4\\ 0^2 = 0 \end{bmatrix} \\ &\text{and}\\ (x^{(1)} x^{(2)})^2 &= \begin{bmatrix} (1 \times 6)^2 = 36 \\ (-3 \times 8)^2 = 576 \\ (4 \times 1)^2 = 16 \\ (-2 \times 7 )^2 = 196 \\ (0 \times 4)^2 = 0 \end{bmatrix} \end{align*}

From this we can see the answer is clearly (x^{(1)} x^{(2)})^2.
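If you want to check the column matching programmatically, here is a small numpy sketch that compares each column of X_\text{shuffled} against the candidate feature columns:

```python
import numpy as np

x1 = np.array([1, -3, 4, -2, 0])
x2 = np.array([6, 8, 1, 7, 4])

X_shuffled = np.array([
    [ 36, 6,  1, 1,  1],
    [576, 8, -3, 1,  9],
    [ 16, 1,  4, 1, 16],
    [196, 7, -2, 1,  4],
    [  0, 4,  0, 1,  0],
])

candidates = {
    "intercept":  np.ones(5, dtype=int),
    "x1":         x1,
    "x2":         x2,
    "x1^2":       x1 ** 2,
    "(x1*x2)^2":  (x1 * x2) ** 2,
}

for j in range(5):
    for name, col in candidates.items():
        if np.array_equal(X_shuffled[:, j], col):
            print(f"s_{j} is the weight on the {name} term")
```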


Difficulty: ⭐️⭐️

The average score on this problem was 87%.



Problem 2.2

Second weight s_1 after solving normal equations corresponds to the term in H(x):

x^{(2)}

The second column inside of our X_\text{shuffled} represents s_1, so we want to figure out how to create these values. We can see this is the same as x^{(2)}.


Difficulty: ⭐️

The average score on this problem was 93%.


Problem 2.3

Third weight s_2 after solving normal equations corresponds to the term in H(x):

x^{(1)}

The third column inside of our X_\text{shuffled} represents s_2, so we want to figure out how to create these values. We can see this is the same as x^{(1)}.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 2.4

Fourth weight s_3 after solving normal equations corresponds to the term in H(x):

intercept

The fourth column inside of our X_\text{shuffled} represents s_3, so we want to figure out how to create these values. We know the intercept is a vector of ones, which matches!


Difficulty: ⭐️

The average score on this problem was 93%.


Problem 2.5

Fifth weight s_4 after solving normal equations corresponds to the term in H(x):

(x^{(1)})^2

We can find the answer by process of elimination, or by recalling our calculation from Problem 2.1:

(x^{(1)})^2 = \begin{bmatrix} 1^2 = 1 \\ (-3)^2 = 9\\ 4^2 = 16\\ (-2)^2 = 4\\ 0^2 = 0 \end{bmatrix}


Difficulty: ⭐️⭐️

The average score on this problem was 87%.



Problem 3

Source: Summer Session 2 2024 Midterm, Problem 6a-c

Suppose we have already fit a multiple regression hypothesis function of the form: H(x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)}

Now, suppose we add the feature (x^{(1)} + x^{(2)}) when performing multiple regression. Below, answer “Yes/No” to the following questions and rigorously justify why certain behavior will or will not occur. Your answer must mention linear algebra concepts such as rank and linear independence in relation to the design matrix, weight vector \vec{w^*}, and hypothesis function H(x).


Problem 3.1

Which of the following are true about the new design matrix X_\text{new} with our added feature (x^{(1)} + x^{(2)})?

  • The columns of X_\text{new} are linearly dependent.
  • The columns of X_\text{new} have the same span as the original design matrix X.
  • X_\text{new}^TX_\text{new} is not a full-rank matrix.

Let’s go through each of the options and determine if they are true or false.

The columns of X_\text{new} are linearly independent.

This statement is false because (x^{(1)} + x^{(2)}) is a linear combination of the original features, making the columns linearly dependent. The added feature does not provide any new, independent information to the model.

The columns of X_\text{new} are linearly dependent.

This statement is true because (x^{(1)} + x^{(2)}) is a linear combination of the original features.

\vec{y} is orthogonal to all the columns of X_\text{new}.

This statement is false because there is no justification for orthogonality. It is usually not the case that \vec y is orthogonal to the columns of X_\text{new}, because the goal of regression is to find a linear relationship between the predictors and the response variable. Since we have some regression coefficients (w_0, w_1, w_2), this implies there exists a relationship between \vec y and X_\text{new}.

\vec{y} is orthogonal to all the columns of the original design matrix X.

This statement is false because there is no justification for orthogonality. It is usually not the case that \vec y is orthogonal to the columns of X, because the goal of regression is to find a linear relationship between the predictors and the response variable. Since we have some regression coefficients (w_0, w_1, w_2), this implies there exists a relationship between \vec y and X.

The columns of X_\text{new} have the same span as the original design matrix X.

This statement is true: because (x^{(1)} + x^{(2)}) is a linear combination of the original features, the span does not change.

X_\text{new}^TX_\text{new} is a full-rank matrix.

This statement is false because there is a linearly dependent column!

X_\text{new}^TX_\text{new} is not a full-rank matrix.

This statement is true because of linear dependence.
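A minimal numpy sketch, using a small made-up dataset (the numbers here are arbitrary and not from the exam), that illustrates the rank argument:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=10), rng.normal(size=10)

X     = np.column_stack([np.ones(10), x1, x2])           # original design matrix
X_new = np.column_stack([np.ones(10), x1, x2, x1 + x2])  # with the redundant feature added

print(np.linalg.matrix_rank(X_new))            # 3, even though X_new has 4 columns
print(np.linalg.matrix_rank(X_new.T @ X_new))  # also 3, so the 4x4 matrix X_new^T X_new is not full rank
```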


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.



Problem 3.2

Is there more than one optimal weight vector \vec{w^*} that produces the lowest mean squared error hypothesis function H(x) = w_0^* + w_1^* x^{(1)} + w_2^* x^{(2)} + w_4^*(x^{(1)} + x^{(2)})?

Yes


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 3.3

Explain your answer above.

There can be multiple optimal weight vectors \vec w^* that achieve the lowest mean squared error for the hypothesis function because of the linear dependence between the columns of the design matrix. The matrix is not full rank, so the normal equations have infinitely many solutions. This results in non-unique weight coefficients: many different combinations of weights produce the same optimal predictions.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.


Problem 3.4

Does the best possible mean squared error of the new hypothesis function differ from that of the previous hypothesis function?

No


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 37%.


Problem 3.5

Explain your answer above.

Adding the linear combination (x^{(1)} + x^{(2)}) does not enhance the model’s capability to fit the data in a way that would lower the best possible mean squared error. Both models capture the same underlying relationship between the predictors and the response variable, so the best possible mean squared error of the new hypothesis function does not differ from that of the previous hypothesis function.
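A minimal numpy sketch with made-up data (again, arbitrary numbers rather than exam data) showing that the redundant feature leaves the minimum MSE unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 3 + 2 * x1 - x2 + rng.normal(scale=0.5, size=50)

X     = np.column_stack([np.ones(50), x1, x2])
X_new = np.column_stack([np.ones(50), x1, x2, x1 + x2])

def min_mse(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit (handles the rank-deficient case too)
    return np.mean((y - X @ w) ** 2)

print(min_mse(X, y), min_mse(X_new, y))  # the two values agree up to floating-point error
```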


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 31%.



Problem 4

Source: Spring 2023 Midterm 1, Problem 5

Let X be a design matrix with 4 columns, such that the first column is a column of all 1s. Let \vec{y} be an observation vector. Let \vec{w}^* = (X^TX)^{-1}X^T\vec{y}. We’ll name the components of \vec{w}^* as follows:

\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}

In this problem, we’ll consider various modifications to the design matrix and see how they affect the solution to the normal equations.


Problem 4.1

Let X_a be the design matrix that comes from interchanging the first two columns of X. Let \vec{w_a}^* = (X_a^TX_a)^{-1}X_a^T\vec{y}. Express the components \vec{w_a}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).

\vec{w_a}^* = \begin{bmatrix} w_1^* \\ w_0^* \\ w_2^* \\ w_3^* \end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_a}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

Swapping the first two columns of our design matrix changes the prediction rule to be of the form: H_2(\vec{x}) = v_1 + v_0x_1 + v_2x_2+ v_3x_3.

Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= w_1^* \\ v_1^* &= w_0^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

Intuitively, when we interchange two columns of our design matrix, all that does is interchange the terms in the prediction rule, which interchanges those weights in the parameter vector.



Problem 4.2

Let X_b be the design matrix that comes from adding one to each entry of the first column of X. Let \vec{w_b}^* = (X_b^TX_b)^{-1}X_b^T\vec{y}. Express the components \vec{w_b}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).

\vec{w_b}^* = \begin{bmatrix} \dfrac{w_0^*}{2} \\ w_1^* \\ w_2^* \\ w_3^*\end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_b}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

By adding one to each entry of the first column of the design matrix, we are changing the column of 1s to be a column of 2s. This changes the prediction rule to be of the form: H_2(\vec{x}) = v_0\cdot 2+ v_1x_1 + v_2x_2+ v_3x_3.

In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= \dfrac{w_0^*}{2} \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

This is saying we just halve the intercept term. For example, imagine fitting a line to data in \mathbb{R}^2 and finding that the best-fitting line is y=12+3x. If we had to write this in the form y=v_0\cdot 2 + v_1x, we would find that the best choice for v_0 is 6 and the best choice for v_1 is 3.



Problem 4.3

Let X_c be the design matrix that comes from adding one to each entry of the third column of X. Let \vec{w_c}^* = (X_c^TX_c)^{-1}X_c^T\vec{y}. Express the components \vec{w_c}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^*, which were the components of \vec{w}^*.

\vec{w_c}^* = \begin{bmatrix} w_0^* - w_2^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_c}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

Adding one to each entry of the third column of the design matrix changes the prediction rule to be of the form: \begin{aligned} H_2(\vec{x}) &= v_0+ v_1x_1 + v_2(x_2+1)+ v_3x_3 \\ &= (v_0 + v_2) + v_1x_1 + v_2x_2+ v_3x_3 \end{aligned}

In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by \begin{aligned} v_0^* &= w_0^* - w_2^* \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

One way to think about this is that if we replace x_2 with x_2+1, then our predictions will increase by the coefficient of x_2. In order to keep our predictions the same, we would need to adjust our intercept term by subtracting this same amount.
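A short numpy sketch with a made-up three-feature dataset (purely illustrative numbers) that verifies this column-shift rule:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 - 3 * x2 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
X_c = X.copy()
X_c[:, 2] += 1  # add one to each entry of the third column

w,   *_ = np.linalg.lstsq(X,   y, rcond=None)
w_c, *_ = np.linalg.lstsq(X_c, y, rcond=None)

print(w_c - w)  # ≈ [-w2*, 0, 0, 0]: only the intercept changes, dropping by w2*
```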



Problem 5

Source: Spring 2023 Final Part 1, Problem 5

Suppose we want to predict how long it takes to run a Jupyter notebook on Datahub. For 100 different Jupyter notebooks, we collect the following 5 pieces of information: the number of cells, the number of lines of code, the max iterations, the number of variables, and how long the notebook takes to run.

Then we use multiple regression to fit a prediction rule of the form H(\text{cells, lines, max iterations, variables}) = w_0 + w_1 \cdot \text{cells} \cdot \text{lines} + w_2 \cdot (\text{max iterations})^{\text{variables} - 10}


Problem 5.1

What are the dimensions of the design matrix X?

\begin{bmatrix} & & & \\ & & & \\ & & & \\ \end{bmatrix}_{r \times c}

So, what should r and c be, where the dimensions are r \text{ rows} \times c \text{ columns}?

100 \text{ rows} \times 3 \text{ columns}

There should be 100 rows because there are 100 different Jupyter notebooks, each contributing one row. There should be 3 columns, one for each w_i: w_0 means X has a column of ones, w_1 means X has a second column containing \text{cells} \cdot \text{lines}, and w_2 means the last column of X contains (\text{max iterations})^{\text{variables} - 10}.
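A short numpy sketch showing how such a design matrix could be assembled. The feature arrays here (cells, lines, max_iter, variables) are hypothetical placeholders, not actual Datahub data:

```python
import numpy as np

rng = np.random.default_rng(3)
cells     = rng.integers(1, 50, size=100).astype(float)
lines     = rng.integers(10, 500, size=100).astype(float)
max_iter  = rng.integers(1, 20, size=100).astype(float)
variables = rng.integers(1, 30, size=100).astype(float)

X = np.column_stack([
    np.ones(100),                  # column for w0 (intercept)
    cells * lines,                 # column for w1
    max_iter ** (variables - 10),  # column for w2
])

print(X.shape)  # (100, 3)
```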


Problem 5.2

In one sentence, what does the entry in row 3, column 2 of the design matrix X represent? (Count rows and columns starting at 1, not 0).

This entry represents the product of the number of cells and number of lines of code for the third Jupyter notebook in the training dataset.



Problem 6

Source: Spring 2023 Final Part 1, Problem 6

Now we solve the normal equations and find the solution to be \begin{aligned} \vec{w}^* &= \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix} \end{aligned} Define a new vector: \begin{aligned} \vec{w}^{\circ} &= \begin{bmatrix} w_0^{\circ} \\ w_1^{\circ} \\ w_2^{\circ} \end{bmatrix} = \begin{bmatrix} w_0^*+3 \\ w_1^*-4 \\ w_2^*-6 \end{bmatrix} \end{aligned}

and consider the two prediction rules

\begin{aligned} H^*(\text{cells, lines, max iterations, variables}) &= w_0^* + w_1^* \cdot \text{cells} \cdot \text{lines} + w_2^* \cdot (\text{max iterations})^{\text{variables} - 10}\\ H^{\circ}(\text{cells, lines, max iterations, variables}) &= w_0^{\circ} + w_1^{\circ} \cdot \text{cells} \cdot \text{lines} + w_2^{\circ} \cdot (\text{max iterations})^{\text{variables} - 10} \end{aligned}

Let \text{MSE} represent the mean squared error of a prediction rule, and let \text{MAE} represent the mean absolute error of a prediction rule. Select the symbol that should go in each blank.


Problem 6.1

\text{MSE}(H^*) ___ \text{MSE}(H^{\circ})

\leq

It is given that \vec{w}^* is the optimal parameter vector. Because it is optimal, any change made to it will, at best, keep our predictions of the exact same quality, and, at worst, reduce the quality of our predictions and increase our error. Since \vec{w}^{\circ} is just the optimal parameter vector with some changes to the weights, \vec{w}^{\circ} must produce equal or greater error.

In other words, \vec{w}^{\circ} is at best as good as \vec{w}^*, meaning that H^{\circ}(x) is at best as good as H^*(x). So, H^{\circ}(x) will have equal or higher error than H^*(x).

Hence: \text{MSE}(H^*) \leq \text{MSE}(H^{\circ})
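A small numpy sketch with made-up data (arbitrary numbers) illustrating that perturbing the least-squares weights can never decrease the MSE:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(size=100)

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)  # optimal weights
w_circ = w_star + np.array([3.0, -4.0, -6.0])   # perturbed weights, as in the problem

mse = lambda w: np.mean((y - X @ w) ** 2)
print(mse(w_star) <= mse(w_circ))  # True: nothing beats the least-squares fit on the training data
```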



Problem 7

Source: Winter 2022 Midterm 1, Problem 4

Consider the dataset shown below.

x^{(1)} x^{(2)} x^{(3)} y
0 6 8 -5
3 4 5 7
5 -1 -3 4
0 2 1 2


Problem 7.1

We want to use multiple regression to fit a prediction rule of the form H(x^{(1)}, x^{(2)}, x^{(3)}) = w_0 + w_1 x^{(1)}x^{(3)} + w_2 (x^{(2)}-x^{(3)})^2. Write down the design matrix X and observation vector \vec{y} for this scenario. No justification needed.

The design matrix X and observation vector \vec{y} are given by:

\begin{align*} X &= \begin{bmatrix} 1 & 0 & 4\\ 1 & 15 & 1\\ 1 & -15 & 4\\ 1 & 0 & 1 \end{bmatrix} \\ \vec{y} &= \begin{bmatrix} -5\\ 7\\ 4\\ 2 \end{bmatrix} \end{align*}

We got \vec{y} by looking at our dataset and seeing the y column.

The matrix X was found by looking at the equation H(x). You can think of each row of X being: \begin{bmatrix}1 & x^{(1)}x^{(3)} & (x^{(2)}-x^{(3)})^2\end{bmatrix}. Recall our bias term here is not affected by x^{(i)}, but it still exists! So we will always have the first element in our row be 1. We can then easily calculate the other elements in the matrix.


Problem 7.2

For the X and \vec{y} that you have written down, let \vec{w} be the optimal parameter vector, which comes from solving the normal equations X^TX\vec{w}=X^T\vec{y}. Let \vec{e} = \vec{y} - X \vec{w} be the error vector, and let e_i be the ith component of this error vector. Show that 4e_1+e_2+4e_3+e_4=0.

The key to this problem is the fact that the error vector, \vec{e}, is orthogonal to the columns of the design matrix, X, whenever \vec{w} satisfies the normal equations. To see this, we can rewrite the normal equations (X^TX\vec{w}=X^T\vec{y}) and substitute \vec{e} = \vec{y} - X \vec{w}:

\begin{align*} X^TX\vec{w}&=X^T\vec{y} \\ 0 &= X^T\vec{y} - X^TX\vec{w} \\ 0 &= X^T(\vec{y}-X\vec{w}) \\ 0 &= X^T\vec{e} \end{align*}

The first step is to find X^T, which is easy because we found X above: \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix}

And now we can plug X^T and \vec e into our equation 0 = X^T\vec{e}. It might be easiest to find the right side first: \begin{align*} X^T\vec{e} &= \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix} \cdot \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4\end{bmatrix} \\ &= \begin{bmatrix} e_1 + e_2 + e_3 + e_4 \\ 15e_2 - 15e_3 \\ 4e_1 + e_2 + 4e_3 + e_4\end{bmatrix} \end{align*}

Finally, we set it equal to zero! \begin{align*} 0 &= e_1 + e_2 + e_3 + e_4 \\ 0 &= 15e_2 - 15e_3 \\ 0 &= 4e_1 + e_2 + 4e_3 + e_4 \end{align*}

With this we have shown that 4e_1+e_2+4e_3+e_4=0.
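A quick numpy verification of this identity, using the design matrix and observation vector from this problem:

```python
import numpy as np

# Design matrix for H(x) = w0 + w1 * x1 * x3 + w2 * (x2 - x3)^2, and the observations.
X = np.array([
    [1,   0, 4],
    [1,  15, 1],
    [1, -15, 4],
    [1,   0, 1],
], dtype=float)
y = np.array([-5, 7, 4, 2], dtype=float)

w = np.linalg.solve(X.T @ X, X.T @ y)  # solve the normal equations
e = y - X @ w                          # error vector

print(X.T @ e)                         # ≈ [0, 0, 0]: e is orthogonal to every column of X
print(4*e[0] + e[1] + 4*e[2] + e[3])   # ≈ 0, the identity we were asked to show
```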



Problem 8

Source: Winter 2023 Final, Problem 4

Reggie and Essie are given a dataset of real features x_i \in \mathbb{R} and observations y_i. Essie proposes the following linear prediction rule: H_1(\alpha_0,\alpha_1) = \alpha_0 + \alpha_1 x_i, and Reggie proposes to use v_i=(x_i)^2 and the prediction rule H_2(\gamma_0,\gamma_1) = \gamma_0 + \gamma_1 v_i.


Problem 8.1

Give an example of a dataset \{(x_i,y_i)\}_{i=1}^n for which minimum MSE(H_2) < minimum MSE(H_1). Explain.

Example: If the datapoints follow a quadratic form y_i=x_i^2 for all i, then the H_2 prediction rule will achieve zero error, while the minimum MSE of H_1 will be greater than 0 since the data do not follow a linear form.


Problem 8.2

Give an example of a dataset \{(x_i,y_i)\}_{i=1}^n for which minimum MSE(H_2) = minimum MSE(H_1). Explain.

Example 1: If the response variables are constant, y_i=c for all i, then by setting \alpha_0=\gamma_0=c and \alpha_1=\gamma_1=0, both prediction rules will achieve MSE=0.

Example 2: When every value of the features x_i and x_i^2 coincides in the dataset (this occurs when every x_i is 0 or 1), the parameters of both prediction rules will be the same, as will the MSE.


Problem 8.3

A new feature z has been added to the dataset.

Essie proposes a linear regression model using two predictor variables x,z as H_3(w_0,w_1,w_2) = w_0 + w_1 x_i +w_2 z_i.

Explain if the following statement is True or False (prove or provide counter-example).

Reggie claims that having more features will lead to a smaller error, therefore the following prediction rule will give a smaller MSE: H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + \alpha_1 x_i +\alpha_2 z_i + \alpha_3 (2x_i-z_i)

H_4 can be rewritten as H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + (\alpha_1+2\alpha_3) x_i +(\alpha_2 - \alpha_3)z_i. By setting \tilde{\alpha}_1=\alpha_1+2\alpha_3 and \tilde{\alpha}_2= \alpha_2 - \alpha_3, we get

H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + \tilde{\alpha}_1 x_i +\tilde{\alpha}_2 z_i,

which has exactly the same form as H_3. Thus H_4 can represent exactly the same set of prediction rules as H_3, so both achieve the same minimum MSE, and the statement is False.



Problem 9

Source: Winter 2024 Midterm 1, Problem 5

Suppose the following information is given for linear regression:

X = \begin{bmatrix} 1 & 2\\ 1 & -1 \end{bmatrix}

\vec{y} = \begin{bmatrix} a\\ b\\ \end{bmatrix} \qquad \vec{w}^{*} = \begin{bmatrix} 1\\ 2\\ \end{bmatrix}

Where X is the design matrix, \vec{y} is the observation vector, and \vec{w}^{*} is the optimal parameter vector. Solve for parameters a and b using the normal equations, show your work.

\begin{cases} a = 5\\ b = -1\\ \end{cases}

Since \vec{w}^{*} is the optimal parameter vector, it must satisfy the normal equations:

\begin{align*} X^{T}X\vec{w} = X^{T}\vec{y} \end{align*}

The left hand side of the equation will read:

\begin{align*} X^{T}X\vec{w} &= \begin{bmatrix} 1 & 1\\ 2 & -1 \end{bmatrix} \begin{bmatrix} 1 & 2\\ 1 & -1 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} \\ &= \begin{bmatrix} 2 & 1\\ 1 & 5 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} \\ &= \begin{bmatrix} 4\\ 11 \end{bmatrix} \end{align*}

The right hand side of the equation is given by:

\begin{align*} X^{T}\vec{y} &= \begin{bmatrix} 1 & 1\\ 2 & -1 \end{bmatrix} \begin{bmatrix} a\\ b \end{bmatrix} \\ &= \begin{bmatrix} a+b\\ 2a-b \end{bmatrix} \end{align*}

By setting the left hand side and right hand side equal to each other, we will obtain the following system of equations:

\begin{align*} \begin{bmatrix} 4\\ 11 \end{bmatrix} = \begin{bmatrix} a+b\\ 2a-b \end{bmatrix} \end{align*}

\begin{cases} &4 = a + b\\ &11 = 2a-b \end{cases}

To solve this system, we can add the two equations together: \begin{align*} 4+11 &= (a + b) + (2a-b)\\ 15 &= 3a\\ a &= 5 \end{align*} Substituting a = 5 into 4 = a + b gives b = -1. Therefore:

\begin{cases} a = 5\\ b = -1\\ \end{cases}
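A quick numpy check that a = 5 and b = -1 satisfy the normal equations with the given X and \vec{w}^*:

```python
import numpy as np

X = np.array([[1.0,  2.0],
              [1.0, -1.0]])
w_star = np.array([1.0, 2.0])
y = np.array([5.0, -1.0])  # a = 5, b = -1

# Both sides of the normal equations should match.
print(X.T @ X @ w_star)  # [ 4. 11.]
print(X.T @ y)           # [ 4. 11.]
```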


Problem 10

Source: Winter 2024 Final Part 1, Problem 4

Albert collected 400 data points from a radiation detector. Each data point contains 3 features: feature A, feature B and feature C. The true particle energy E is also reported. Albert wants to design a linear regression algorithm to predict the energy E of each particle, given a combination of one or more of feature A, B, and C. As the first step, Albert calculated the correlation coefficients among A, B, C and E. He wrote them down in the following table, where each cell of the table represents the correlation between two terms:

A    B    C    E   
A 1 -0.99 0.13 0.8
B -0.99 1 0.25 -0.95
C 0.13 0.25 1 0.72
E 0.8 -0.95 0.72 1


Problem 10.1

Albert wants to start with a simple model: fitting only a single feature to obtain the true energy (i.e. y = w_0+w_1 x). Which feature should he choose as x to get the lowest mean square error?

B

B is the correct answer because it has the highest absolute correlation with E (0.95). The negative sign just means B is negatively correlated with energy, which can be compensated for by a negative weight.


Problem 10.2

Albert wants to add another feature to his linear regression in part (a) to further boost the model’s performance. (i.e. y = w_0 + w_1 x + w_2 x_2) Which feature should he choose as x_2 to make additional improvements?

C

C is the correct answer. Although A has a higher correlation with energy, A also has an extremely high correlation with B (-0.99), which means adding A to the fit will not be very useful: it provides almost the same information as B.


Problem 10.3

Albert further refines his algorithm by fitting a prediction rule of the form: \begin{aligned} H(A,B,C) = w_0 + w_1 \cdot A\cdot C + w_2 \cdot B^{C-7} \end{aligned}

Given this prediction rule, what are the dimensions of the design matrix X?

\begin{bmatrix} & & & \\ & & & \\ & & & \\ \end{bmatrix}_{r \times c}

So, what are r and c in r \text{ rows} \times c \text{ columns}?

400 \text{ rows} \times 3 \text{ columns}

Recall there are 400 data points, which means there will be 400 rows. There will be 3 columns; one is the bias column of all 1s, one is for the feature A\cdot C, and one is for the feature B^{C-7}.



Problem 11

Source: Fall 2021 Final Exam, Problem 6

Billy’s aunt owns a jewellery store, and gives him data on 5000 of the diamonds in her store. For each diamond, we have its weight (in carats), its length, its width, and its price.

The first 5 rows of the 5000-row dataset are shown below:

carat    length    width    price   
0.40 4.81 4.76 1323
1.04 6.58 6.53 5102
0.40 4.74 4.76 696
0.40 4.67 4.65 798
0.50 4.90 4.95 987


Billy has enlisted our help in predicting the price of a diamond given various other features.


Problem 11.1

Suppose we want to fit a linear prediction rule that uses two features, carat and length, to predict price. Specifically, our prediction rule will be of the form

\text{predicted price} = w_0 + w_1 \cdot \text{carat} + w_2 \cdot \text{length}


We will use least squares to find \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}.

Write out the first 5 rows of the design matrix, X. Your matrix should not have any variables in it.

X = \begin{bmatrix} 1 & 0.40 & 4.81 \\ 1 & 1.04 & 6.58 \\ 1 & 0.40 & 4.74 \\ 1 & 0.40 & 4.67 \\ 1 & 0.50 & 4.90 \end{bmatrix}


Problem 11.2

Suppose the optimal parameter vector \vec{w}^* is given by

\vec{w}^* = \begin{bmatrix} 2000 \\ 10000 \\ -1000 \end{bmatrix}

What is the predicted price of a diamond with 0.65 carats and a length of 4 centimeters? Show your work.

The predicted price is 4500 dollars.

2000 + 10000 \cdot 0.65 - 1000 \cdot 4 = 4500


Problem 11.3

Suppose \vec{e} = \begin{bmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{bmatrix} is the error/residual vector, defined as

\vec{e} = \vec{y} - X \vec{w}^*

where \vec{y} is the observation vector containing the prices for each diamond.

For each of the following quantities, state whether they are guaranteed to be equal to the scalar 0, the vector \vec{0} of all 0s, or neither. No justification is necessary.

  • \sum_{i = 1}^n e_i: Yes, this is guaranteed to be 0. This was discussed in Homework 4; it is a consequence of the fact that X^T (\vec{y} - X \vec{w}^*) = \vec{0} and that we have an intercept term in our prediction rule (and hence a column of all 1s in our design matrix, X).
  • || \vec{y} - X \vec{w}^* ||^2: No, this is not guaranteed to be 0. This is the mean squared error of our prediction rule, multiplied by n. \vec{w}^* is found by minimizing mean squared error, but the minimum value of mean squared error isn’t necessarily 0 — in fact, this quantity is only 0 if we can write \vec{y} exactly as X \vec{w}^* with no prediction errors.
  • X^TX \vec{w}^*: No, this is not guaranteed to be 0, either.
  • 2X^TX \vec{w}^* - 2X^T\vec{y}: Yes, this is guaranteed to be 0. Recall, the optimal parameter vector \vec{w}^* satisfies the normal equations X^TX\vec{w}^* = X^T \vec{y}. Subtracting X^T \vec{y} from both sides of this equation and multiplying both sides by 2 yields the desired result. (You may also find 2X^TX \vec{w}^* - 2X^T\vec{y} to be familiar from lecture — it is the gradient of the mean squared error, multiplied by n, and we set the gradient to 0 to find \vec{w}^*.)


Problem 11.4

Suppose we introduce two more features: the width, and the area, defined as the product of length and width.

Suppose we also decide to remove the intercept term of our prediction rule. With all of these changes, our prediction rule is now

\text{predicted price} = w_1 \cdot \text{carat} + w_2 \cdot \text{length} + w_3 \cdot \text{width} + w_4 \cdot (\text{length} \cdot \text{width})

  • The first two rows of the new design matrix: X = \begin{bmatrix} 0.40 & 4.81 & 4.76 & 4.81 \cdot 4.76 \\ 1.04 & 6.58 & 6.53 & 6.58 \cdot 6.53 \end{bmatrix}
  • No, it’s not guaranteed that the \vec{w}_1^* for this new prediction rule is equal to the \vec{w}_1^* for the original prediction rule. The value of \vec{w}_1^* in the new prediction rule will be influenced by the fact that there’s no longer an intercept term and that there are two new features (width and area) that weren’t previously there.



Problem 12

Source: Spring 2024 Final, Problem 4

Consider the vectors \vec{u} and \vec{v}, defined below.

\vec{u} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}

We define X \in \mathbb{R}^{3 \times 2} to be the matrix whose first column is \vec u and whose second column is \vec v.


Problem 12.1

In this part only, let \vec{y} = \begin{bmatrix} -1 \\ k \\ 252 \end{bmatrix}.

Find a scalar k such that \vec{y} is in \text{span}(\vec u, \vec v). Give your answer as a constant with no variables.

252.

Vectors in \text{span}(\vec u, \vec v) must have equal second and third components; since the third component is 252, the second must be 252 as well.


Problem 12.2

Show that: (X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}

Hint: If A = \begin{bmatrix} a_1 & 0 \\ 0 & a_2 \end{bmatrix}, then A^{-1} = \begin{bmatrix} \frac{1}{a_1} & 0 \\ 0 & \frac{1}{a_2} \end{bmatrix}.

We can construct the following series of matrices to get (X^TX)^{-1}X^T.

  • X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}
  • X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}
  • X^TX = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}
  • (X^TX)^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{2} \end{bmatrix}
  • (X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}
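A quick numpy check of this computation:

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 1.0])
X = np.column_stack([u, v])

print(np.linalg.inv(X.T @ X) @ X.T)
# [[1.  0.  0. ]
#  [0.  0.5 0.5]]
```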


Problem 12.3

In parts (c) and (d) only, let \vec{y} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix}.

Find scalars a and b such that a \vec u + b \vec v is the vector in \text{span}(\vec u, \vec v) that is as close to \vec{y} as possible. Give your answers as constants with no variables.

a = 4, b = 5.

The result from part (b) implies that when using the normal equations to find coefficients for \vec u and \vec v – which we know from lecture produce an error vector whose length is minimized – the coefficient on \vec u must be y_1 and the coefficient on \vec v must be \frac{y_2 + y_3}{2}. This can be shown by taking the result from part (b), \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}, and multiplying it by the vector \vec y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}.

Here, y_1 = 4, so a = 4. We also know y_2 = 2 and y_3 = 8, so b = \frac{2+8}{2} = 5.


Problem 12.4

Let \vec{e} = \vec{y} - (a \vec u + b \vec v), where a and b are the values you found in part (c).

What is \lVert \vec{e} \rVert?

3 \sqrt{2}.

The correct value of a \vec u + b \vec v = \begin{bmatrix} 4 \\ 5 \\ 5\end{bmatrix}. Then, \vec{e} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix} - \begin{bmatrix} 4 \\ 5 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ -3 \\ 3 \end{bmatrix}, which has a length of \sqrt{0^2 + (-3)^2 + 3^2} = 3\sqrt{2}.
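A short numpy sketch that reproduces parts (c) and (d):

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 1.0])
X = np.column_stack([u, v])
y = np.array([4.0, 2.0, 8.0])

a, b = np.linalg.inv(X.T @ X) @ X.T @ y  # normal-equation solution: a = 4, b = 5
e = y - (a * u + b * v)                  # error vector [0, -3, 3]

print(a, b, np.linalg.norm(e))           # 4.0 5.0 4.2426..., i.e. 3 * sqrt(2)
```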


Problem 12.5

Is it true that, for any vector \vec{y} \in \mathbb{R}^3, we can find scalars c and d such that the sum of the entries in the vector \vec{y} - (c \vec u + d \vec v) is 0?

Yes, but for a reason that isn’t listed here.

Here’s the full reason:

1. We can use the normal equations to find c and d, no matter what \vec{y} is.
2. The error vector \vec e that results from using the normal equations is orthogonal to the span of the columns of X.
3. The columns of X are just \vec u and \vec v. So, \vec e is orthogonal to any linear combination of \vec u and \vec v.
4. One of the many linear combinations of \vec u and \vec v is \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.
5. This means that \vec e is orthogonal to \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, which means that \vec{1}^T \vec{e} = 0 \implies \sum_{i = 1}^3 e_i = 0.


Problem 12.6

Suppose that Q \in \mathbb{R}^{100 \times 12}, \vec{s} \in \mathbb{R}^{100}, and \vec{f} \in \mathbb{R}^{12}. What are the dimensions of the following product?

\vec{s}^T Q \vec{f}

Scalar.

  • \vec{s}^T: 1 \times 100
  • Q: 100 \times 12
  • \vec{f}: 12 \times 1

The inner dimensions of 100 and 12 cancel, so \vec{s}^T Q \vec{f} has shape 1 \times 1, i.e., it is a scalar.
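A tiny numpy sketch confirming the shape bookkeeping:

```python
import numpy as np

s = np.zeros(100)        # stands in for the 100-dimensional vector s
Q = np.zeros((100, 12))
f = np.zeros(12)

result = s @ Q @ f       # (1 x 100)(100 x 12)(12 x 1)
print(np.shape(result))  # (), a 0-dimensional result, i.e. a scalar
```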


