Multiple Linear Regression and The Normal Equations

This page contains all problems about Multiple Linear Regression and The Normal Equations.

Problem 1

You are given a dataset with the following data points and want to fit a variety of hypothesis functions to predict y from features u and v:

\begin{array}{|c|c|c|} \hline u & v & y \\ \hline 1 & 3 & 4 \\ 3 & 0 & 6 \\ 2 & 2 & 5 \\ 4 & -4 & 8 \\ 5 & -1 & 11 \\ \hline \end{array}

You are also provided with the following hypothesis functions, used to construct design matrices:

X_1 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 27 \\ 1 & 2 & 4 & 8 \\ 1 & 4 & 16 & 64 \\ 1 & 5 & 25 & 125 \end{bmatrix} \ \ \ X_2 = \begin{bmatrix} 1 & 1 & 1 & 3 \\ 1 & 3 & 27 & 0 \\ 1 & 2 & 8 & 4 \\ 1 & 4 & 64 & -16 \\ 1 & 5 & 125 & -5 \end{bmatrix} X_3 = \begin{bmatrix} 1 & 1 & 1 & 3 \\ 3 & 9 & 27 & 0 \\ 2 & 4 & 8 & 4 \\ 4 & 16 & 64 & -16 \\ 5 & 25 & 125 & -5 \end{bmatrix} \ \ \ X_4 = \begin{bmatrix} 1 & 1 & 3 & 9 \\ 1 & 3 & 0 & 0 \\ 1 & 2 & 2 & 4 \\ 1 & 4 & -4 & 16 \\ 1 & 5 & -1 & 1 \end{bmatrix}

Problem 1.1

X_1

We can easily create the design matrix from H_A(u, v) = w_0 + w_1 u + w_2 u^2 + w_3 u^3 by viewing the order of variables and how they are being manipulated. We can see there is a w_0 term, which means that the first column should be a vector of ones.

This means we know:

X_? = \begin{bmatrix} 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \end{bmatrix}

This immediately eliminates X_3. We now see that the second column should be \vec u. We are told the values inside of the vector at the top, which means we get:

X_? = \begin{bmatrix} 1 & 1 & ? & ? \\ 1 & 3 & ? & ? \\ 1 & 2 & ? & ? \\ 1 & 4 & ? & ? \\ 1 & 5 & ? & ? \end{bmatrix}

This does not eliminate any of the remaining design matrices, so we look at the next column, which should be \vec u^2. This means:

X_? = \begin{bmatrix} 1 & 1 & (1)^2 & ? \\ 1 & 3 & (3)^2 & ? \\ 1 & 2 & (2)^2 & ? \\ 1 & 4 & (4)^2 & ? \\ 1 & 5 & (5)^2 & ? \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & ? \\ 1 & 3 & 9 & ? \\ 1 & 2 & 4 & ? \\ 1 & 4 & 16 & ? \\ 1 & 5 & 25 & ? \end{bmatrix}

Here we can now eliminate X_2 and X_4, so we know the answer must be X_1!

Difficulty: ⭐️

The average score on this problem was 100%.

Problem 1.2

X_3

We can easily create the design matrix from H_B(u, v) = w_0 u + w_1 u^2 + w_2 u^3 + w_3 u v by viewing the order of variables and how they are being manipulated. We can see that the w_0 term is multiplied by u, which means that the first column should be \vec u itself (a vector of ones multiplied elementwise by \vec u) rather than a vector of ones.

This means we know:

X_? = \begin{bmatrix} 1 * 1 & ? & ? & ? \\ 1 * 3 & ? & ? & ? \\ 1 * 2 & ? & ? & ? \\ 1 *4 & ? & ? & ? \\ 1 * 5 & ? & ? & ? \end{bmatrix} = \begin{bmatrix} 1 & ? & ? & ? \\ 3 & ? & ? & ? \\ 2 & ? & ? & ? \\ 4 & ? & ? & ? \\ 5 & ? & ? & ? \end{bmatrix}

We can see the only design matrix with this first column is X_3.

Difficulty: ⭐️

The average score on this problem was 100%.

Problem 1.3

X_4

We can easily create the design matrix from H_C(u, v) = w_0 + w_1 u + w_2 v + w_3 v^2 by viewing the order of variables and how they are being manipulated. We can see there is a w_0 term, which means that the first column should be a vector of ones.

This means we know:

X_? = \begin{bmatrix} 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \end{bmatrix}

This immediately eliminates X_3. We now see that the second column should be \vec u. We are told the values inside of the vector at the top, which means we get:

X_? = \begin{bmatrix} 1 & 1 & ? & ? \\ 1 & 3 & ? & ? \\ 1 & 2 & ? & ? \\ 1 & 4 & ? & ? \\ 1 & 5 & ? & ? \end{bmatrix}

This does not eliminate any of the remaining design matrices, so we look at the next column, which should be \vec v. This means:

X_? = \begin{bmatrix} 1 & 1 & 3 & ? \\ 1 & 3 & 0 & ? \\ 1 & 2 & 2 & ? \\ 1 & 4 & -4 & ? \\ 1 & 5 & -1 & ? \end{bmatrix}

Here we can eliminate X_1 and X_2, which means our answer is X_4.

Difficulty: ⭐️

The average score on this problem was 100%.

Problem 1.4

X_2

We can easily create the design matrix from H_D(u, v) = w_0 + w_1 u + w_2 u^3 + w_3 u v by viewing the order of variables and how they are being manipulated. We can see there is a w_0 term, which means that the first column should be a vector of ones.

This means we know:

X_? = \begin{bmatrix} 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \\ 1 & ? & ? & ? \end{bmatrix}

This immediately eliminates X_3. We now see that the second column should be \vec u. We are told the values inside of the vector at the top, which means we get:

X_? = \begin{bmatrix} 1 & 1 & ? & ? \\ 1 & 3 & ? & ? \\ 1 & 2 & ? & ? \\ 1 & 4 & ? & ? \\ 1 & 5 & ? & ? \end{bmatrix}

We cannot eliminate any of the design matrices, so we move to the next column, which is \vec u^3. This means:

X_? = \begin{bmatrix} 1 & 1 & (1)^3 & ? \\ 1 & 3 & (3)^3 & ? \\ 1 & 2 & (2)^3 & ? \\ 1 & 4 & (4)^3 & ? \\ 1 & 5 & (5)^3 & ? \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & ? \\ 1 & 3 & 27 & ? \\ 1 & 2 & 8 & ? \\ 1 & 4 & 64 & ? \\ 1 & 5 & 125 & ? \end{bmatrix}

Now we can eliminate the design matrices X_1 and X_4, which means the answer is X_2.
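The column-by-column reasoning in Problems 1.1 through 1.4 can also be checked numerically. The following is a minimal sketch, assuming NumPy is available; the names X_A through X_D are ours and simply label the design matrix built from each hypothesis function.

```python
import numpy as np

# Feature columns copied from the dataset table in Problem 1.
u = np.array([1, 3, 2, 4, 5])
v = np.array([3, 0, 2, -4, -1])
ones = np.ones_like(u)

# One column per term of each hypothesis function, in the order of the weights.
X_A = np.column_stack([ones, u, u**2, u**3])   # H_A = w0 + w1 u + w2 u^2 + w3 u^3
X_B = np.column_stack([u, u**2, u**3, u * v])  # H_B = w0 u + w1 u^2 + w2 u^3 + w3 uv
X_C = np.column_stack([ones, u, v, v**2])      # H_C = w0 + w1 u + w2 v + w3 v^2
X_D = np.column_stack([ones, u, u**3, u * v])  # H_D = w0 + w1 u + w2 u^3 + w3 uv

for name, X in [("X_A", X_A), ("X_B", X_B), ("X_C", X_C), ("X_D", X_D)]:
    print(name)
    print(X)
```

Comparing the printed matrices with X_1 through X_4 should reproduce the answers above: X_A matches X_1, X_B matches X_3, X_C matches X_4, and X_D matches X_2.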

Difficulty: ⭐️

The average score on this problem was 100%.

The following hypothesis functions are repeated from the previous subparts, for your convenience, plus an additional hypothesis function H_E:

Problem 1.5

Which of the five hypothesis functions above has the lowest mean squared error on this dataset? Choose a hypothesis function H(\cdot) and briefly justify your answer in the space below.

H_E(u, v)

H_E(u, v) contains the most information. H_E uses both \vec u and \vec v, so it has information about both variables. It also contains every term present in the other functions (H_A, H_B, H_C, H_D), so any prediction those functions can make, H_E can make too. This makes H_E the most expressive, which will allow it to fit our data the best.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.

Problem 1.6

Suppose we use the lowest-MSE hypothesis function chosen above to make a prediction for a new point (u_\text{new}, v_\text{new}) = (10, 15). Is this prediction likely to be accurate? Justify your answer.

This prediction is not likely to be accurate due to overfitting or extrapolation issues. If there are too many terms there is a higher chance the function is overfitting the training data and will not generalize well to new points. You can think of a high-degree polynomial overfitting a linear trend.

Difficulty: ⭐️⭐️

The average score on this problem was 77%.

Problem 2

Suppose we want to fit a hypothesis function of the form: H(x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + w_3 (x^{(1)})^2 + w_4 (x^{(1)} x^{(2)})^2 Our dataset looks like this:

Problem 2.1

x^{(1)}	x^{(2)}	y
1	6	7
-3	8	2
4	1	9
-2	7	5
0	4	6

Suppose we found w_2^* = 15.6 using multiple linear regression. Now, suppose we rescaled our data so feature vector \vec{x^{(2)}} became \left[60,\ 80,\ 10,\ 70,\ 40\right]^T, and performed multiple linear regression in the same setting. What would the new value of w_2^* (the weight on feature x^{(2)} in H(x)) be? You do not need to simplify your answer.

w_2^* = 1.56

Recall that in linear regression, when a feature is rescaled by a factor of c, the corresponding weight is scaled by \frac{1}{c}. This is because the coefficient adjusts to keep the feature’s overall contribution to the prediction the same. In this problem we can see that x^{(2)} has been scaled by a factor of 10: it changed from \left[6,\ 8,\ 1,\ 7,\ 4\right]^T to \left[60,\ 80,\ 10,\ 70,\ 40\right]^T. This means w_2^* will be scaled by \frac{1}{10}, giving \frac{1}{10}w_2^* = \frac{1}{10}\cdot 15.6 = 1.56.
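The scaling rule can be checked numerically on the dataset from this problem. This is a minimal sketch, assuming NumPy; the helper design_matrix is our own name. Only the ratio of the new weight to the old weight is checked, since the weights fitted here need not equal the values the problem asks us to suppose.

```python
import numpy as np

def design_matrix(x1, x2):
    """Columns: 1, x1, x2, x1^2, (x1 * x2)^2, matching H(x) in Problem 2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, (x1 * x2) ** 2])

x1 = np.array([1.0, -3.0, 4.0, -2.0, 0.0])
x2 = np.array([6.0, 8.0, 1.0, 7.0, 4.0])
y = np.array([7.0, 2.0, 9.0, 5.0, 6.0])

# Fit once with the original x2 and once with x2 multiplied by 10.
w_old = np.linalg.lstsq(design_matrix(x1, x2), y, rcond=None)[0]
w_new = np.linalg.lstsq(design_matrix(x1, 10 * x2), y, rcond=None)[0]

ratio_w2 = w_new[2] / w_old[2]
print(ratio_w2)  # expected to be very close to 1/10
```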

Difficulty: ⭐️⭐️

The average score on this problem was 81%.

Problem 2.2

Suppose we found w_4^* = 72 using multiple linear regression. Now, suppose we rescaled our data so feature vector \vec{x^{(1)}} became \left[ \frac{1}{2},\ -\frac{3}{2},\ 2,\ -1,\ 0 \right]^T and \vec{x^{(2)}} now became \left[36, \ 48, \ 6, \ 42, \ 24\right]^T, and performed multiple linear regression in the same setting. What would the new value of w_4^* (the weight on feature (x^{(1)} x^{(2)})^2 in H(x)) be? You do not need to simplify your answer.

w_4^* = \frac{72}{(\frac{6}{2})^2}

Similar to the first part of the problem we need to find how \vec{x^{(1)}} and \vec{x^{(2)}} changed. We can then inversely scale w_4^* by those values.

Let’s start with \vec{x^{(1)}}. Originally \vec{x^{(1)}} was \left[1,\ -3,\ 4,\ -2,\ 0\right]^T, but it becomes \left[ \frac{1}{2},\ -\frac{3}{2},\ 2,\ -1,\ 0 \right]^T. We can see the values were scaled by \frac{1}{2}.

We can now look at \vec{x^{(2)}}. Originally \vec{x^{(2)}} was \left[6,\ 8,\ 1,\ 7,\ 4\right]^T, but it becomes \left[36, \ 48, \ 6, \ 42, \ 24\right]^T. We can see the values were scaled by 6.

We know w_4^* is attached to the feature (x^{(1)} x^{(2)})^2. This means we need to multiply the two scale factors we found and then square the result: (\frac{1}{2} \cdot 6)^2 = (3)^2 = 9.

From here we simply inversely scale w_4^*. Originally w_4^* = 72, so we multiply by \frac{1}{9} to get 72 \cdot \frac{1}{9} = \frac{72}{9} (which is equal to \frac{72}{(\frac{6}{2})^2}).
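As a numerical check of the factor of 9, here is a minimal sketch, assuming NumPy; design_matrix is our own helper name. It refits the model after scaling x^{(1)} by \frac{1}{2} and x^{(2)} by 6 and compares only the ratio of the new w_4 to the old one.

```python
import numpy as np

def design_matrix(x1, x2):
    """Columns: 1, x1, x2, x1^2, (x1 * x2)^2, matching H(x) in Problem 2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, (x1 * x2) ** 2])

x1 = np.array([1.0, -3.0, 4.0, -2.0, 0.0])
x2 = np.array([6.0, 8.0, 1.0, 7.0, 4.0])
y = np.array([7.0, 2.0, 9.0, 5.0, 6.0])

w_old = np.linalg.lstsq(design_matrix(x1, x2), y, rcond=None)[0]
w_new = np.linalg.lstsq(design_matrix(0.5 * x1, 6 * x2), y, rcond=None)[0]

# The (x1 * x2)^2 column is multiplied by (0.5 * 6)^2 = 9, so w_4 shrinks by 9.
ratio_w4 = w_new[4] / w_old[4]
print(ratio_w4)  # expected to be very close to 1/9
```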

Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 35%.

Problem 2.3

Suppose we found w_0^* = 4.47 using multiple linear regression. Now, suppose our observation vector \vec{y} now became \left[12, \ 7, \ 14, \ 10, \ 11\right]^T, and performed multiple linear regression in the same setting. What would the new value of w_0^* be? You do not need to simplify your answer.

w_0^* = 9.47

Recall w_0^* is the intercept term and represents the value of the prediction H(x) when all the features (\vec{x^{(1)}} and \vec{x^{(2)}}) are zero. If all other weights (w_1^* through w_4^*) remain unchanged, then the intercept adjusts to reflect the shift in the mean of the observed \vec y.

Our original \vec y was \left[7, \ 2, \ 9, \ 5, \ 6\right]^T, but became \left[12, \ 7, \ 14, \ 10, \ 11\right]^T. We need to calculate the original \vec y’s mean and the new \vec y’s mean.

Old \vec y’s mean: \frac{7 + 2 + 9 + 5 + 6}{5} = \frac{29}{5} = 5.8

New \vec y’s mean: \frac{12 + 7 + 14 + 10 + 11}{5} = \frac{54}{5} = 10.8

Our new w_0^* is found by taking the old one and adding the difference of 10.8 - 5.8. This means: w_0^* = 4.47 + (10.8 - 5.8) = 4.47 + 5 = 9.47.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.

Problem 2.4

Suppose we found w_1^* = 0.428 using multiple linear regression. Now, suppose our observation vector \vec{y} now became \left[12, \ 7, \ 14, \ 10, \ 11\right]^T, and performed multiple linear regression in the same setting. What would the new value of w_1^* (the weight on feature x^{(1)} in H(x)) be? You do not need to simplify your answer.

w_1^* = 0.428

Our old \vec y was \left[7, \ 2, \ 9, \ 5, \ 6\right]^T and our new \vec y is \left[12, \ 7, \ 14, \ 10, \ 11\right]^T. We can see that our old \vec y has five added to each value to get our new \vec y.

Recall the slope (w_1^*) does not change if our y_i shifts by a constant amount! This means w_1^* = 0.428.
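Both claims about shifting \vec y (the intercept moves by the shift in Problem 2.3, and the other weights stay put in Problem 2.4) can be checked together. This is a minimal sketch, assuming NumPy; as before, the fitted weights need not equal the values the problem asks us to suppose, so we look only at the difference between the two fits.

```python
import numpy as np

x1 = np.array([1.0, -3.0, 4.0, -2.0, 0.0])
x2 = np.array([6.0, 8.0, 1.0, 7.0, 4.0])
y = np.array([7.0, 2.0, 9.0, 5.0, 6.0])

# Columns: 1, x1, x2, x1^2, (x1 * x2)^2, matching H(x) in Problem 2.
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, (x1 * x2) ** 2])

w_old = np.linalg.lstsq(X, y, rcond=None)[0]
w_shifted = np.linalg.lstsq(X, y + 5, rcond=None)[0]  # every y_i increased by 5

difference = w_shifted - w_old
print(np.round(difference, 6))  # expected: intercept changes by 5, others by 0
```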

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 50%.

Problem 3

Suppose we want to fit a hypothesis function of the form: H(x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + w_3 (x^{(1)})^2 + w_4 (x^{(1)} x^{(2)})^2

x^{(1)}	x^{(2)}	y
1	6	7
-3	8	2
4	1	9
-2	7	5
0	4	6

We know we need to find an optimal parameter vector \vec{w}^* = \left[w_0^* \ \ w_1^* \ \ w_2^* \ \ w_3^* \ \ w_4^* \right]^T that satisfies the normal equations. To do so, we build a design matrix, but our columns get all shuffled due to an error with our computer! Our resulting design matrix is

X_\text{shuffled} = \begin{bmatrix} 36 & 6 & 1 & 1 & 1 \\ 576 & 8 & -3& 1 & 9 \\ 16 & 1 & 4 & 1 & 16 \\ 196 & 7 & -2& 1 & 4 \\ 0 & 4 & 0 & 1 & 0 \end{bmatrix}

If we solved the normal equations using this shuffled design matrix X_\text{shuffled}, we would not get our parameter vector \vec{w}^* = \left[w_0^* \ \ w_1^* \ \ w_2^* \ \ w_3^* \ \ w_4^* \right]^T in the correct order. Let \vec{s} = \left[ s_0 \ \ s_1 \ \ s_2 \ \ s_3 \ \ s_4 \right] be the parameter vector we find instead. Let’s figure out which features correspond to the weight vector \vec{s} that we found using the shuffled design matrix X_\text{shuffled}. Fill in the bubbles below.

Problem 3.1

First weight s_0 after solving normal equations corresponds to the term in H(x):

(x^{(1)} x^{(2)})^2

The first column inside of our X_\text{shuffled} represents s_0, so we want to figure out how to create these values. We can easily eliminate intercept, x^{(1)}, and x^{(2)} because none of these numbers match. From here we can calculate (x^{(1)})^2 and (x^{(1)} x^{(2)})^2 to determine which element creates s_0.

\begin{align*} (x^{(1)})^2 &= \begin{bmatrix} 1^2 = 1 \\ (-3)^2 = 9\\ 4^2 = 16\\ (-2)^2 = 4\\ 0^2 = 0 \end{bmatrix} \\ &\text{and}\\ (x^{(1)} x^{(2)})^2 &= \begin{bmatrix} (1 \times 6)^2 = 36 \\ (-3 \times 8)^2 = 576 \\ (4 \times 1)^2 = 16 \\ (-2 \times 7 )^2 = 196 \\ (0 \times 4)^2 = 0 \end{bmatrix} \end{align*}

From this we can see the answer is clearly (x^{(1)} x^{(2)})^2.

Difficulty: ⭐️⭐️

The average score on this problem was 87%.

Problem 3.2

Second weight s_1 after solving normal equations corresponds to the term in H(x):

x^{(2)}

The second column inside of our X_\text{shuffled} represents s_1, so we want to figure out how to create these values. We can see this is the same as x^{(2)}.

Difficulty: ⭐️

The average score on this problem was 93%.

Problem 3.3

Third weight s_2 after solving normal equations corresponds to the term in H(x):

x^{(1)}

The third column inside of our X_\text{shuffled} represents s_2, so we want to figure out how to create these values. We can see this is the same as x^{(1)}.

Difficulty: ⭐️⭐️

The average score on this problem was 87%.

Problem 3.4

Fourth weight s_3 after solving normal equations corresponds to the term in H(x):

intercept

The fourth column inside of our X_\text{shuffled} represents s_3, so we want to figure out how to create these values. We know the intercept is a vector of ones, which matches!

Difficulty: ⭐️

The average score on this problem was 93%.

Problem 3.5

Fifth weight s_4 after solving normal equations corresponds to the term in H(x):

(x^{(1)})^2

We can find the answer by process of elimination, or from our calculation in the first part.

(x^{(1)})^2 = \begin{bmatrix} 1^2 = 1 \\ (-3)^2 = 9\\ 4^2 = 16\\ (-2)^2 = 4\\ 0^2 = 0 \end{bmatrix}
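The matching done by hand in Problems 3.1 through 3.5 can be sketched in code: build each candidate feature column and look for the column of X_\text{shuffled} that equals it. This assumes NumPy; the dictionary keys are just labels we chose for the terms of H(x).

```python
import numpy as np

x1 = np.array([1, -3, 4, -2, 0])
x2 = np.array([6, 8, 1, 7, 4])

# Candidate columns, keyed by the term of H(x) they represent.
candidates = {
    "intercept": np.ones_like(x1),
    "x1": x1,
    "x2": x2,
    "x1^2": x1**2,
    "(x1*x2)^2": (x1 * x2) ** 2,
}

X_shuffled = np.array([
    [36, 6, 1, 1, 1],
    [576, 8, -3, 1, 9],
    [16, 1, 4, 1, 16],
    [196, 7, -2, 1, 4],
    [0, 4, 0, 1, 0],
])

# For each column of X_shuffled, find the candidate with identical values.
matches = [
    next(name for name, col in candidates.items()
         if np.array_equal(col, X_shuffled[:, j]))
    for j in range(X_shuffled.shape[1])
]
print(matches)
```

The printed list gives the term that each of s_0 through s_4 corresponds to, in order.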

Difficulty: ⭐️⭐️

The average score on this problem was 87%.

Problem 4

Suppose we have already fit a multiple regression hypothesis function of the form: H(x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)}

Now, suppose we add the feature (x^{(1)} + x^{(2)}) when performing multiple regression. Below, answer “Yes/No” to the following questions and rigorously justify why certain behavior will or will not occur. Your answer must mention linear algebra concepts such as rank and linear independence in relation to the design matrix, weight vector \vec{w^*}, and hypothesis function H(x).

Problem 4.1

Which of the following are true about the new design matrix X_\text{new} with our added feature (x^{(1)} + x^{(2)})?

The columns of X_\text{new} are linearly dependent. The columns of X_\text{new} have the same span as the original design matrix X. X_\text{new}^TX_\text{new} is not a full-rank matrix.

Let’s go through each of the options and determine if they are true or false.

The columns of X_\text{new} are linearly independent.

This statement is false because (x^{(1)} + x^{(2)}) is a linear combination of the original features (linearly dependent). This means the added feature does not provide any new, independent information to the model.

The columns of X_\text{new} are linearly dependent.

This statement is true because (x^{(1)} + x^{(2)}) is a linear combination of the original features.

\vec{y} is orthogonal to all the columns of X_\text{new}.

This statement is false because there is no justification for orthogonality. It is usually not the case that \vec y is orthogonal to the columns of X_\text{new} because the goal of regression is to find a linear relationship between the predictors and the response variable. Since we have some regression coefficients (w_0, w_1, w_2), this implies there exists a relationship between \vec y and X_\text{new}.

\vec{y} is orthogonal to all the columns of the original design matrix X.

This statement is false because there is no justification for orthogonality. It is usually not the case that \vec y is orthogonal to the columns of X because the goal of regression is to find a linear relationship between the predictors and the response variable. Since we have some regression coefficients (w_0, w_1, w_2), this implies there exists a relationship between \vec y and X.

The columns of X_\text{new} have the same span as the original design matrix X.

This statement is true: because (x^{(1)} + x^{(2)}) is a linear combination of the original features, adding it as a column does not change the span.

X_\text{new}^TX_\text{new} is a full-rank matrix.

This statement is false because there is a linearly dependent column!

X_\text{new}^TX_\text{new} is not a full-rank matrix.

This statement is true because of linear dependence.
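The rank claims can be illustrated on a small example. This is a minimal sketch, assuming NumPy; the feature values are hypothetical and chosen only so that the original three columns are linearly independent.

```python
import numpy as np

# Hypothetical feature values, for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
ones = np.ones_like(x1)

X = np.column_stack([ones, x1, x2])               # original design matrix
X_new = np.column_stack([ones, x1, x2, x1 + x2])  # with the added feature

rank_X = np.linalg.matrix_rank(X)
rank_X_new = np.linalg.matrix_rank(X_new)
# An explicit tolerance guards against round-off in the 4 x 4 product.
rank_gram = np.linalg.matrix_rank(X_new.T @ X_new, tol=1e-8)

print(rank_X, rank_X_new, rank_gram)
```

All three ranks come out as 3. X_new has four columns but rank 3, so its columns are linearly dependent, its span matches that of X, and the 4 x 4 matrix X_new^T X_new is not full rank.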

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.

Problem 4.2

Is there more than one optimal weight vector \vec{w^*} that produces the lowest mean squared error hypothesis function H(x) = w_0^* + w_1^* x^{(1)} + w_2^* x^{(2)} + w_3^*(x^{(1)} + x^{(2)})?

Yes

Difficulty: ⭐️⭐️

The average score on this problem was 81%.

Problem 4.3

There can be multiple optimal weight vectors \vec w^* that achieve the lowest mean squared error for the hypothesis function because of the linear dependence between the columns of the design matrix. This means X_\text{new}^TX_\text{new} is not full rank and therefore not invertible, so the normal equations have infinitely many solutions. The weights are therefore not unique: various combinations of weights produce the same optimal predictions.
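To make the non-uniqueness concrete, here is a minimal sketch, assuming NumPy and hypothetical feature values. Because the fourth column equals the sum of the second and third, moving along the direction (0, 1, 1, -1) in weight space leaves every prediction unchanged.

```python
import numpy as np

# Hypothetical feature values, for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
X_new = np.column_stack([np.ones_like(x1), x1, x2, x1 + x2])

w = np.array([1.0, 2.0, 3.0, 0.0])          # any weight vector (hypothetical)
null_dir = np.array([0.0, 1.0, 1.0, -1.0])  # X_new @ null_dir is the zero vector

predictions = X_new @ w
shifted_predictions = X_new @ (w + 2.5 * null_dir)  # 2.5 is an arbitrary multiple

print(np.allclose(predictions, shifted_predictions))
```

If w is optimal, then so is w plus any multiple of this direction, which is why there are infinitely many optimal weight vectors.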

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 62%.

Problem 4.4

Does the best possible mean squared error of the new hypothesis function differ from that of the previous hypothesis function?

No

Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 37%.

Problem 4.5

When we add a linear combination of existing features, such as (x^{(1)} + x^{(2)}), we are not enhancing the model’s capability to fit the data in a way that would lower the best possible mean squared error. Both models capture the same underlying relationship between the predictors and the response variable, so the mean squared error of the new hypothesis function does not differ from that of the previous hypothesis function.
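A quick numerical comparison illustrates this. The sketch below assumes NumPy and uses hypothetical data; min_mse is our own helper. It relies on np.linalg.lstsq, which returns a minimizer even when the design matrix is rank deficient.

```python
import numpy as np

def min_mse(X, y):
    """Lowest mean squared error achievable with design matrix X."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.mean((y - X @ w) ** 2)

# Hypothetical data, for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
ones = np.ones_like(x1)

mse_old = min_mse(np.column_stack([ones, x1, x2]), y)
mse_new = min_mse(np.column_stack([ones, x1, x2, x1 + x2]), y)

print(mse_old, mse_new)  # the two values should agree up to round-off
```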

Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 31%.

Problem 5

Let X be a design matrix with 4 columns, such that the first column is a column of all 1s. Let \vec{y} be an observation vector. Let \vec{w}^* = (X^TX)^{-1}X^T\vec{y}. We’ll name the components of \vec{w}^* as follows:

\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}

In this problem, we’ll consider various modifications to the design matrix and see how they affect the solution to the normal equations.

Problem 5.1

Let X_a be the design matrix that comes from interchanging the first two columns of X. Let \vec{w_a}^* = (X_a^TX_a)^{-1}X_a^T\vec{y}. Express the components \vec{w_a}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).

\vec{w_a}^* = \begin{bmatrix} w_1^* \\ w_0^* \\ w_2^* \\ w_3^* \end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_a}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

Swapping the first two columns of our design matrix changes the prediction rule to be of the form: H_2(\vec{x}) = v_1 + v_0x_1 + v_2x_2+ v_3x_3.

Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= w_1^* \\ v_1^* &= w_0^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

Intuitively, when we interchange two columns of our design matrix, all that does is interchange the terms in the prediction rule, which interchanges those weights in the parameter vector.

Problem 5.2

Let X_b be the design matrix that comes from adding one to each entry of the first column of X. Let \vec{w_b}^* = (X_b^TX_b)^{-1}X_b^T\vec{y}. Express the components \vec{w_b}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).

\vec{w_b}^* = \begin{bmatrix} \dfrac{w_0^*}{2} \\ w_1^* \\ w_2^* \\ w_3^*\end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_b}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

Adding one to each entry of the first column of the design matrix changes the column of 1s to a column of 2s. This changes the prediction rule to be of the form: H_2(\vec{x}) = v_0\cdot 2+ v_1x_1 + v_2x_2+ v_3x_3.

To keep the predictions the same, the new weights need to “offset” the change made to the design matrix. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= \dfrac{w_0^*}{2} \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

This is saying we just halve the intercept term. For example, imagine fitting a line to data in \mathbb{R}^2 and finding that the best-fitting line is y=12+3x. If we had to write this in the form y=v_0\cdot 2 + v_1x, we would find that the best choice for v_0 is 6 and the best choice for v_1 is 3.

Problem 5.3

Let X_c be the design matrix that comes from adding one to each entry of the third column of X. Let \vec{w_c}^* = (X_c^TX_c)^{-1}X_c^T\vec{y}. Express the components \vec{w_c}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^*, which were the components of \vec{w}^*.

\vec{w_c}^* = \begin{bmatrix} w_0^* - w_2^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}

Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.

Where: \vec{w_c}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}

Adding one to each entry of the third column of the design matrix changes the prediction rule to be of the form: \begin{aligned} H_2(\vec{x}) &= v_0+ v_1x_1 + v_2(x_2+1)+ v_3x_3 \\ &= (v_0 + v_2) + v_1x_1 + v_2x_2+ v_3x_3 \end{aligned}

To keep the predictions the same, the new weights need to “offset” the change made to the design matrix. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by \begin{aligned} v_0^* &= w_0^* - w_2^* \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

One way to think about this is that if we replace x_2 with x_2+1, then our predictions will increase by the coefficient of x_2. In order to keep our predictions the same, we would need to adjust our intercept term by subtracting this same amount.
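All three modifications in Problem 5 can be checked against randomly generated data. This is a minimal sketch, assuming NumPy; the data are hypothetical, and solve_normal_equations is our own helper that solves X^TX\vec{w} = X^T\vec{y} directly.

```python
import numpy as np

def solve_normal_equations(X, y):
    """Return w satisfying X^T X w = X^T y (assumes full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)  # hypothetical random data
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
w = solve_normal_equations(X, y)

X_a = X[:, [1, 0, 2, 3]]  # 5.1: interchange the first two columns
X_b = X.copy()
X_b[:, 0] += 1            # 5.2: the column of 1s becomes a column of 2s
X_c = X.copy()
X_c[:, 2] += 1            # 5.3: add one to each entry of the third column

w_a = solve_normal_equations(X_a, y)
w_b = solve_normal_equations(X_b, y)
w_c = solve_normal_equations(X_c, y)

print(np.allclose(w_a, [w[1], w[0], w[2], w[3]]))
print(np.allclose(w_b, [w[0] / 2, w[1], w[2], w[3]]))
print(np.allclose(w_c, [w[0] - w[2], w[1], w[2], w[3]]))
```

Each comparison should print True, matching the three answers above.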

Problem 6

Suppose we want to predict how long it takes to run a Jupyter notebook on Datahub. For 100 different Jupyter notebooks, we collect the following 5 pieces of information:

Then we use multiple regression to fit a prediction rule of the form H(\text{cells, lines, max iterations, variables}) = w_0 + w_1 \cdot \text{cells} \cdot \text{lines} + w_2 \cdot (\text{max iterations})^{\text{variables} - 10}

Problem 6.1

100 \text{ rows} \times 3 \text{ columns}

There should be 100 rows because there are 100 different Jupyter notebooks with different information within them. There should be 3 columns, one for each w_i. In this case we have w_0, which means X will have a column of ones, w_1, which means X will have a second column of \text{cells} \cdot \text{lines}, and w_2, which will be the last column in X containing (\text{max iterations})^{\text{variables} - 10}.

Problem 6.2

In one sentence, what does the entry in row 3, column 2 of the design matrix X represent? (Count rows and columns starting at 1, not 0).

This entry represents the product of the number of cells and number of lines of code for the third Jupyter notebook in the training dataset.

Problem 7

Let \text{MSE} represent the mean squared error of a prediction rule, and let \text{MAE} represent the mean absolute error of a prediction rule. Select the symbol that should go in each blank.

Problem 7.1

\leq

It is given that \vec{w}^* is the optimal parameter vector. Here’s one thing we know about the optimal parameter vector \vec{w}^*: it is optimal, which means that any changes made to it will, at best, keep our predictions of the exact same quality, and, at worst, reduce the quality of our predictions and increase our error. And since \vec{w}^{\circ} is just the optimal parameter vector but with some small changes to the weights, it follows that \vec{w}^{\circ} produces equal or greater error!

In other words, \vec{w}^{\circ} is a slightly worse version of \vec{w}^*, meaning that H^{\circ}(x) is a slightly worse version of H^*(x). So, H^{\circ}(x) will have equal or higher error than H^*(x).

Hence: \text{MSE}(H^*) \leq \text{MSE}(H^{\circ})
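The inequality can be illustrated numerically. This is a minimal sketch, assuming NumPy; the data and the particular perturbation applied to the optimal weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical random data
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.normal(size=30)

def mse(w):
    """Mean squared error of the prediction rule with weights w."""
    return np.mean((y - X @ w) ** 2)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # minimizes mean squared error
w_circ = w_star + np.array([0.1, -0.05, 0.2])  # small hypothetical changes

mse_star, mse_circ = mse(w_star), mse(w_circ)
print(mse_star <= mse_circ)
```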

Problem 8

Problem 8.1

x^{(1)}	x^{(2)}	x^{(3)}	y
0	6	8	-5
3	4	5	7
5	-1	-3	4
0	2	1	2

We want to use multiple regression to fit a prediction rule of the form H(x^{(1)}, x^{(2)}, x^{(3)}) = w_0 + w_1 x^{(1)}x^{(3)} + w_2 (x^{(2)}-x^{(3)})^2. Write down the design matrix X and observation vector \vec{y} for this scenario. No justification needed.

The design matrix X and observation vector \vec{y} are given by:

\begin{align*} X &= \begin{bmatrix} 1 & 0 & 4\\ 1 & 15 & 1\\ 1 & -15 & 4\\ 1 & 0 & 1 \end{bmatrix} \\ \vec{y} &= \begin{bmatrix} -5\\ 7\\ 4\\ 2 \end{bmatrix} \end{align*}

We got \vec{y} by looking at our dataset and seeing the y column.

The matrix X was found by looking at the equation H(x). You can think of each row of X being: \begin{bmatrix}1 & x^{(1)}x^{(3)} & (x^{(2)}-x^{(3)})^2\end{bmatrix}. Recall our bias term here is not affected by x^{(i)}, but it still exists! So we will always have the first element in our row be 1. We can then easily calculate the other elements in the matrix.
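The same construction can be sketched in code, assuming NumPy: each column of X is one term of the prediction rule evaluated on every row of the dataset.

```python
import numpy as np

# Columns of the dataset table in Problem 8.1.
x1 = np.array([0, 3, 5, 0])
x2 = np.array([6, 4, -1, 2])
x3 = np.array([8, 5, -3, 1])
y = np.array([-5, 7, 4, 2])

# Row i is [1, x1 * x3, (x2 - x3)^2] for the i-th data point.
X = np.column_stack([np.ones_like(x1), x1 * x3, (x2 - x3) ** 2])
print(X)
```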

Problem 8.2

For the X and \vec{y} that you have written down, let \vec{w} be the optimal parameter vector, which comes from solving the normal equations X^TX\vec{w}=X^T\vec{y}. Let \vec{e} = \vec{y} - X \vec{w} be the error vector, and let e_i be the ith component of this error vector. Show that 4e_1+e_2+4e_3+e_4=0.

The key to this problem is the fact that the error vector, \vec{e}, is orthogonal to the columns of the design matrix, X. As a refresher, if \vec{w} satisfies the normal equations, then X^T\vec{e} = 0.

We can rewrite the normal equation (X^TX\vec{w}=X^T\vec{y}) to allow substitution for \vec{e} = \vec{y} - X \vec{w}.

\begin{align*} X^TX\vec{w}&=X^T\vec{y} \\ 0 &= X^T\vec{y} - X^TX\vec{w} \\ 0 &= X^T(\vec{y}-X\vec{w}) \\ 0 &= X^T\vec{e} \end{align*}

The first step is to find X^T, which is easy because we found X above: \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix}

And now we can plug X^T and \vec e into our equation 0 = X^T\vec{e}. It might be easiest to find the right side first: \begin{align*} X^T\vec{e} &= \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix} \cdot \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4\end{bmatrix} \\ &= \begin{bmatrix} e_1 + e_2 + e_3 + e_4 \\ 15e_2 - 15e_3 \\ 4e_1 + e_2 + 4e_3 + e_4\end{bmatrix} \end{align*}

Finally, we set it equal to zero! \begin{align*} 0 &= e_1 + e_2 + e_3 + e_4 \\ 0 &= 15e_2 - 15e_3 \\ 0 &= 4e_1 + e_2 + 4e_3 + e_4 \end{align*}

With this we have shown that 4e_1+e_2+4e_3+e_4=0.
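The orthogonality argument can also be confirmed numerically. This is a minimal sketch, assuming NumPy: it fits the weights by least squares, forms the error vector, and evaluates both X^T\vec{e} and the requested combination.

```python
import numpy as np

x1 = np.array([0.0, 3.0, 5.0, 0.0])
x2 = np.array([6.0, 4.0, -1.0, 2.0])
x3 = np.array([8.0, 5.0, -3.0, 1.0])
y = np.array([-5.0, 7.0, 4.0, 2.0])

X = np.column_stack([np.ones_like(x1), x1 * x3, (x2 - x3) ** 2])
w = np.linalg.lstsq(X, y, rcond=None)[0]  # satisfies the normal equations
e = y - X @ w                             # error vector

orthogonality = X.T @ e                   # each entry should be about 0
combo = 4 * e[0] + e[1] + 4 * e[2] + e[3]  # third entry of X^T e

print(np.round(orthogonality, 10), round(combo, 10))
```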

Problem 9

Reggie and Essie are given a dataset of real features x_i \in \mathbb{R} and observations y_i. Essie proposes the following linear prediction rule: H_1(\alpha_0,\alpha_1) = \alpha_0 + \alpha_1 x_i, and Reggie proposes to use v_i=(x_i)^2 and the prediction rule H_2(\gamma_0,\gamma_1) = \gamma_0 + \gamma_1 v_i.

Problem 9.1

Give an example of a dataset \{(x_i,y_i)\}_{i=1}^n for which minimum MSE(H_2) < minimum MSE(H_1). Explain.

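Example: If the data points follow a quadratic form y_i=x_i^2 for all i (with at least three distinct values of x_i), then the H_2 prediction rule will achieve zero error by setting \gamma_0 = 0 and \gamma_1 = 1, while the minimum MSE of H_1 is greater than 0 since the data do not follow a linear form.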
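A concrete instance of this example can be sketched in code, assuming NumPy. The five x values are hypothetical, and min_mse is our own helper that fits an intercept plus one feature.

```python
import numpy as np

def min_mse(feature, y):
    """Lowest MSE of a rule of the form intercept + slope * feature."""
    X = np.column_stack([np.ones_like(feature), feature])
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.mean((y - X @ w) ** 2)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical feature values
y = x**2                                   # observations follow y = x^2 exactly

mse_h1 = min_mse(x, y)     # Essie's rule, linear in x
mse_h2 = min_mse(x**2, y)  # Reggie's rule, linear in v = x^2

print(mse_h1, mse_h2)
```

Here the best line is flat at the mean of y, so the minimum MSE of H_1 is about 2.8, while the minimum MSE of H_2 is zero up to round-off.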

Problem 9.2

Give an example of a dataset \{(x_i,y_i)\}_{i=1}^n for which minimum MSE(H_2) = minimum MSE(H_1). Explain.

Example 1: If the response variables are constant, y_i=c for all i, then setting \alpha_0=\gamma_0=c and \alpha_1=\gamma_1=0 lets both prediction rules achieve MSE=0.

Example 2: When the features x_i and x_i^2 coincide for every point in the dataset (this occurs when every x_i is 0 or 1), the parameters of both prediction rules will be the same, as will the MSE.

Problem 9.3

Essie proposes a linear regression model using two predictor variables x,z as H_3(w_0,w_1,w_2) = w_0 + w_1 x_i +w_2 z_i.

Explain if the following statement is True or False (prove or provide counter-example).

Reggie claims that having more features will lead to a smaller error, therefore the following prediction rule will give a smaller MSE: H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + \alpha_1 x_i +\alpha_2 z_i + \alpha_3 (2x_i-z_i)

H_4 can be rewritten as H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + (\alpha_1+2\alpha_3) x_i +(\alpha_2 - \alpha_3)z_i. By setting \tilde{\alpha}_1=\alpha_1+2\alpha_3 and \tilde{\alpha}_2= \alpha_2 - \alpha_3, we get

H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = H_4(\alpha_0,\tilde{\alpha}_1,\tilde{\alpha}_2) = \alpha_0 + \tilde{\alpha}_1 x_i +\tilde{\alpha}_2 z_i

Thus H_4 has exactly the same form as H_3, so both rules can produce the same set of predictions and therefore have the same minimum MSE. Reggie’s claim is False.

Problem 10

X = \begin{bmatrix} 1 & 2\\ 1 & -1\\ \end{bmatrix} \quad \vec{y} = \begin{bmatrix} a\\ b\\ \end{bmatrix} \quad \vec{w}^{*} = \begin{bmatrix} 1\\ 2\\ \end{bmatrix}

Where X is the design matrix, \vec{y} is the observation vector, and \vec{w}^{*} is the optimal parameter vector. Solve for parameters a and b using the normal equations, show your work.

\begin{cases} a = 5\\ b = -1\\ \end{cases}

Since \vec{w}^{*} is the optimal parameter vector, it must satisfy the normal equations:

\begin{align*} X^{T}X\vec{w} = X^{T}\vec{y} \end{align*}

The left hand side of the equation will read:

\begin{align*} X^{T}X\vec{w} &= \begin{bmatrix} 1 & 1\\ 2 & -1 \end{bmatrix} \begin{bmatrix} 1 & 2\\ 1 & -1 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} \\ &= \begin{bmatrix} 2 & 1\\ 1 & 5 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} \\ &= \begin{bmatrix} 4\\ 11 \end{bmatrix} \end{align*}

The right hand side of the equation is given by:

\begin{align*} X^{T}\vec{y} &= \begin{bmatrix} 1 & 1\\ 2 & -1 \end{bmatrix} \begin{bmatrix} a\\ b \end{bmatrix} \\ &= \begin{bmatrix} a+b\\ 2a-b \end{bmatrix} \end{align*}

By setting the left hand side and right hand side equal to each other, we will obtain the following system of equations:

\begin{align*} \begin{bmatrix} 4\\ 11 \end{bmatrix} = \begin{bmatrix} a+b\\ 2a-b \end{bmatrix} \end{align*}

\begin{cases} &4 = a + b\\ &11 = 2a-b \end{cases}

To solve this system of equations, we can add the two equations together: \begin{align*} 4+11 &= a + b + 2a -b\\ 3a &= 15\\ a &= 5\\ b &= 4 - a = -1 \end{align*}

\begin{cases} a = 5\\ b = -1\\ \end{cases}
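The arithmetic can be double-checked in code. This is a minimal sketch, assuming NumPy and taking X to be the matrix that appears in the worked solution above. It computes the left-hand side X^TX\vec{w}^* and then solves X^T\vec{y} = X^TX\vec{w}^* for \vec{y}.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, -1.0]])  # design matrix used in the solution above
w_star = np.array([1.0, 2.0])

lhs = X.T @ X @ w_star           # left-hand side of the normal equations
y = np.linalg.solve(X.T, lhs)    # solve X^T y = lhs for y = (a, b)

print(lhs, y)
```

The left-hand side comes out as (4, 11) and the solution as (a, b) = (5, -1), in agreement with the answer.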

Problem 11

Albert collected 400 data points from a radiation detector. Each data point contains 3 features: feature A, feature B and feature C. The true particle energy E is also reported. Albert wants to design a linear regression algorithm to predict the energy E of each particle, given a combination of one or more of feature A, B, and C. As the first step, Albert calculated the correlation coefficients among A, B, C and E. He wrote them down in the following table, where each cell of the table represents the correlation of two terms:

Problem 11.1

	A	B	C	E
A	1	-0.99	0.13	0.8
B	-0.99	1	0.25	-0.95
C	0.13	0.25	1	0.72
E	0.8	-0.95	0.72	1

Albert wants to start with a simple model: fitting only a single feature to obtain the true energy (i.e. y = w_0+w_1 x). Which feature should he choose as x to get the lowest mean square error?

B is the correct answer because it has the highest absolute correlation with E (0.95). The negative sign just means that B is negatively correlated with energy, which is compensated for by a negative weight w_1.
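One way to see why the absolute correlation decides the answer: for a least-squares fit of the form y = w_0 + w_1 x, the minimum mean squared error equals \text{Var}(y)(1 - r^2), where r is the correlation between x and y. The sketch below applies that identity to the three correlations with E from the table; the dictionary and variable names are illustrative.

```python
# Correlation of each feature with E, copied from the table above
corr_with_E = {"A": 0.8, "B": -0.95, "C": 0.72}

# For y = w0 + w1*x fit by least squares, the minimum MSE is Var(E) * (1 - r^2),
# so 1 - r^2 is the fraction of the variance in E left unexplained
unexplained = {name: 1 - r ** 2 for name, r in corr_with_E.items()}

# The feature with the smallest unexplained fraction gives the lowest MSE
best = min(unexplained, key=unexplained.get)

print(unexplained)
print(best)
```

The sign of r drops out when it is squared, which is why a strongly negative correlation is just as useful as a strongly positive one.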

Problem 11.2

Albert wants to add another feature to his linear regression from part (a) to further boost the model’s performance (i.e. y = w_0 + w_1 x + w_2 x_2). Which feature should he choose as x_2 to make additional improvements?

C is the correct answer. Although A has a higher correlation with energy, it also has an extremely high correlation with B (-0.99), so adding A to the fit will not be very useful: it provides almost the same information as B.
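The effect of a nearly redundant feature can be illustrated with a small simulation. This is only a sketch on synthetic data: the generated arrays, the coefficients used to build E, and the helper `fit_mse` are all assumptions made for illustration, and they do not reproduce Albert's measurements or the exact correlations in the table.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins (assumed, not Albert's measurements)
B = rng.normal(size=n)
A = -B + 0.1 * rng.normal(size=n)   # almost a negated copy of B
C = rng.normal(size=n)              # generated independently of B
E = -2.0 * B + 1.0 * C + 0.5 * rng.normal(size=n)


def fit_mse(*features):
    """Least-squares fit of E on an intercept plus the given features; returns the MSE."""
    X = np.column_stack([np.ones(n), *features])
    w, *_ = np.linalg.lstsq(X, E, rcond=None)
    return np.mean((E - X @ w) ** 2)


mse_B = fit_mse(B)       # B alone
mse_BA = fit_mse(B, A)   # B plus the nearly redundant feature A
mse_BC = fit_mse(B, C)   # B plus the feature C, which carries new information

print(mse_B, mse_BA, mse_BC)
```

In this setup, adding A barely changes the error relative to using B alone, while adding C lowers it substantially.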

Problem 11.3

Albert further refines his algorithm by fitting a prediction rule of the form: \begin{aligned} H(A,B,C) = w_0 + w_1 \cdot A\cdot C + w_2 \cdot B^{C-7} \end{aligned}

400 \text{ rows} \times 3 \text{ columns}

Recall there are 400 data points, which means there will be 400 rows. There will be 3 columns; one is the bias column of all 1s, one is for the feature A\cdot C, and one is for the feature B^{C-7}.
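The shape can be confirmed by building such a design matrix directly. The sketch below uses randomly generated placeholder values for A, B and C (an assumption, since the actual detector data is not given); only the resulting shape matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400  # number of data points

# Placeholder feature values (not Albert's data); B is kept positive so that
# B ** (C - 7) is well defined for non-integer exponents
A = rng.uniform(0.0, 1.0, size=n)
B = rng.uniform(1.0, 2.0, size=n)
C = rng.uniform(0.0, 10.0, size=n)

# One column per term in H: the bias, A*C, and B^(C-7)
X = np.column_stack([np.ones(n), A * C, B ** (C - 7)])

print(X.shape)
```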

Problem 12

Billy’s aunt owns a jewellery store, and gives him data on 5000 of the diamonds in her store. For each diamond, we have:

carat	length	width	price
0.40	4.81	4.76	1323
1.04	6.58	6.53	5102
0.40	4.74	4.76	696
0.40	4.67	4.65	798
0.50	4.90	4.95	987

Billy has enlisted our help in predicting the price of a diamond given various other features.

Problem 12.1

Suppose we want to fit a linear prediction rule that uses two features, carat and length, to predict price. Specifically, our prediction rule will be of the form

\text{predicted price} = w_0 + w_1 \cdot \text{carat} + w_2 \cdot \text{length}

We will use least squares to find \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}.

Write out the first 5 rows of the design matrix, X. Your matrix should not have any variables in it.

X = \begin{bmatrix} 1 & 0.40 & 4.81 \\ 1 & 1.04 & 6.58 \\ 1 & 0.40 & 4.74 \\ 1 & 0.40 & 4.67 \\ 1 & 0.50 & 4.90 \end{bmatrix}
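The same matrix can be assembled in NumPy by stacking a column of ones with the carat and length columns from the table. This is a minimal sketch; the array names are illustrative.

```python
import numpy as np

# First five rows of the table above
carat = np.array([0.40, 1.04, 0.40, 0.40, 0.50])
length = np.array([4.81, 6.58, 4.74, 4.67, 4.90])

# Columns: intercept (all 1s), carat, length
X = np.column_stack([np.ones_like(carat), carat, length])

print(X)
```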

Problem 12.2

What is the predicted price of a diamond with 0.65 carats and a length of 4 centimeters? Show your work.

The predicted price is 4500 dollars.

2000 + 10000 \cdot 0.65 - 1000 \cdot 4 = 4500
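The computation above is a dot product between the weight vector and the feature vector with a leading 1 for the intercept. In the sketch below, the weights 2000, 10000 and -1000 are taken from the arithmetic shown above; the variable names are illustrative.

```python
import numpy as np

# Weights as used in the computation above: intercept, carat, length
w_star = np.array([2000.0, 10000.0, -1000.0])

# Augmented feature vector: [1, carat, length]
x_aug = np.array([1.0, 0.65, 4.0])

predicted_price = x_aug @ w_star
print(predicted_price)
```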

Problem 12.3

Suppose \vec{e} = \begin{bmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{bmatrix} is the error/residual vector, defined as

\vec{e} = \vec{y} - X \vec{w}^*

For each of the following quantities, state whether it is guaranteed to be equal to the scalar 0, the vector \vec{0} of all 0s, or neither. No justification is necessary.

\sum_{i = 1}^n e_i: Yes, this is guaranteed to be 0. This was discussed in Homework 4; it is a consequence of the fact that X^T (\vec{y} - X \vec{w}^*) = \vec{0} and that we have an intercept term in our prediction rule (and hence a column of all 1s in our design matrix, X).
|| \vec{y} - X \vec{w}^* ||^2: No, this is not guaranteed to be 0. This is the mean squared error of our prediction rule, multiplied by n. \vec{w}^* is found by minimizing mean squared error, but the minimum value of mean squared error isn’t necessarily 0 — in fact, this quantity is only 0 if we can write \vec{y} exactly as X \vec{w}^* with no prediction errors.
X^TX \vec{w}^*: No, this is not guaranteed to be 0, either.
2X^TX \vec{w}^* - 2X^T\vec{y}: Yes, this is guaranteed to be \vec{0}. Recall, the optimal parameter vector \vec{w}^* satisfies the normal equations X^TX\vec{w}^* = X^T \vec{y}. Subtracting X^T \vec{y} from both sides of this equation and multiplying both sides by 2 yields the desired result. (You may also find 2X^TX \vec{w}^* - 2X^T\vec{y} to be familiar from lecture: it is the gradient of the mean squared error, multiplied by n, and we set the gradient to \vec{0} to find \vec{w}^*.)
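These properties can be checked numerically. The sketch below fits the carat-and-length rule to only the five rows shown in the table (an assumption made to keep the example small; the full dataset has 5000 diamonds) and evaluates three of the quantities.

```python
import numpy as np

# The five rows shown in the table, used here as a tiny stand-in dataset
carat = np.array([0.40, 1.04, 0.40, 0.40, 0.50])
length = np.array([4.81, 6.58, 4.74, 4.67, 4.90])
price = np.array([1323.0, 5102.0, 696.0, 798.0, 987.0])

X = np.column_stack([np.ones_like(carat), carat, length])

# Least-squares solution, which satisfies the normal equations
w_star, *_ = np.linalg.lstsq(X, price, rcond=None)
e = price - X @ w_star

sum_of_residuals = e.sum()                          # zero up to rounding
squared_error = e @ e                               # not zero in general
gradient = 2 * X.T @ X @ w_star - 2 * X.T @ price   # zero vector up to rounding

print(sum_of_residuals, squared_error, gradient)
```

The first and last quantities vanish up to floating-point rounding, while the squared error stays positive because these five points do not lie on a single plane.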

Problem 12.4

Suppose we also decide to remove the intercept term of our prediction rule. With all of these changes, our prediction rule is now

\text{predicted price} = w_1 \cdot \text{carat} + w_2 \cdot \text{length} + w_3 \cdot \text{width} + w_4 \cdot (\text{length} \cdot \text{width})

X = \begin{bmatrix} 0.40 & 4.81 & 4.76 & 4.81 \cdot 4.76 \\ 1.04 & 6.58 & 6.53 & 6.58 \cdot 6.53 \end{bmatrix}
No, it’s not guaranteed that the w_1^* for this new prediction rule is equal to the w_1^* for the original prediction rule. The value of w_1^* in the new prediction rule will be influenced by the fact that there’s no longer an intercept term and that there are two new features (width and area) that weren’t previously there.

Problem 13

\vec{u} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}

We define X \in \mathbb{R}^{3 \times 2} to be the matrix whose first column is \vec u and whose second column is \vec v.

Problem 13.1

Find a scalar k such that \vec{y} is in \text{span}(\vec u, \vec v). Give your answer as a constant with no variables.

252.

Vectors in \text{span}(\vec u, \vec v) must have an equal 2nd and 3rd component, and the third component is 252, so the second must be as well.

Problem 13.2

Show that: (X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}

Hint: If A = \begin{bmatrix} a_1 & 0 \\ 0 & a_2 \end{bmatrix}, then A^{-1} = \begin{bmatrix} \frac{1}{a_1} & 0 \\ 0 & \frac{1}{a_2} \end{bmatrix}.

We can construct the following series of matrices to get (X^TX)^{-1}X^T.

X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}
X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}
X^TX = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}
(X^TX)^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{2} \end{bmatrix}
(X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}
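The chain of matrices above can be reproduced in NumPy as a sanity check. This is a minimal sketch; the comparison against `np.linalg.pinv` relies on the fact that, for a matrix with linearly independent columns, the pseudoinverse equals (X^TX)^{-1}X^T.

```python
import numpy as np

# Columns of X are u and v
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

XtX = X.T @ X                    # diagonal matrix with entries 1 and 2
P = np.linalg.inv(XtX) @ X.T     # (X^T X)^{-1} X^T

print(XtX)
print(P)
print(np.allclose(P, np.linalg.pinv(X)))
```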

Problem 13.3

In parts (c) and (d) only, let \vec{y} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix}.

Find scalars a and b such that a \vec u + b \vec v is the vector in \text{span}(\vec u, \vec v) that is as close to \vec{y} as possible. Give your answers as constants with no variables.

a = 4, b = 5.

The result from part (b) implies that when using the normal equations to find coefficients for \vec u and \vec v (which we know from lecture produce an error vector whose length is minimized), the coefficient on \vec u must be y_1 and the coefficient on \vec v must be \frac{y_2 + y_3}{2}. This can be shown by taking the result from part (b), \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}, and multiplying it by the vector \vec y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}.

Here, y_1 = 4, so a = 4. We also know y_2 = 2 and y_3 = 8, so b = \frac{2+8}{2} = 5.
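The same coefficients fall out of a single matrix-vector product with the matrix from part (b). A minimal sketch, with illustrative variable names:

```python
import numpy as np

# (X^T X)^{-1} X^T from part (b)
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])
y = np.array([4.0, 2.0, 8.0])

# First entry is the coefficient on u, second is the coefficient on v
a, b = P @ y
print(a, b)
```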

Problem 13.4

Let \vec{e} = \vec{y} - (a \vec u + b \vec v), where a and b are the values you found in part (c).

3 \sqrt{2}.

With a = 4 and b = 5, we have a \vec u + b \vec v = \begin{bmatrix} 4 \\ 5 \\ 5\end{bmatrix}. Then, \vec{e} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix} - \begin{bmatrix} 4 \\ 5 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ -3 \\ 3 \end{bmatrix}, which has a length of \sqrt{0^2 + (-3)^2 + 3^2} = 3\sqrt{2}.
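The error vector and its length can be verified numerically. A minimal sketch using the values a = 4 and b = 5 from part (c):

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 1.0])
y = np.array([4.0, 2.0, 8.0])
a, b = 4.0, 5.0  # coefficients found in part (c)

e = y - (a * u + b * v)          # error vector
length = np.linalg.norm(e)       # Euclidean length of e

print(e, length)
```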

Problem 13.5

Is it true that, for any vector \vec{y} \in \mathbb{R}^3, we can find scalars c and d such that the sum of the entries in the vector \vec{y} - (c \vec u + d \vec v) is 0?

Yes, but for a reason that isn’t listed here.

Here’s the full reason:

1. We can use the normal equations to find c and d, no matter what \vec{y} is.
2. The error vector \vec e that results from using the normal equations is such that \vec e is orthogonal to the span of the columns of X.
3. The columns of X are just \vec u and \vec v. So, \vec e is orthogonal to any linear combination of \vec u and \vec v.
4. One of the many linear combinations of \vec u and \vec v is \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.
5. This means that the vector \vec e is orthogonal to \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, which means that \vec{1}^T \vec{e} = 0 \implies \sum_{i = 1}^3 e_i = 0.
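The argument can be spot-checked by drawing a few arbitrary vectors \vec{y}, solving the least-squares problem for each, and summing the entries of the error vector. This sketch uses randomly generated \vec{y} values (an assumption for illustration); a handful of random trials is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns of X are u and v
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

sums = []
for _ in range(5):
    y = rng.normal(size=3)                            # an arbitrary y in R^3
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)    # c and d from least squares
    e = y - X @ coeffs
    sums.append(e.sum())

print(sums)  # each entry is zero up to floating-point rounding
```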

Problem 13.6

Suppose that Q \in \mathbb{R}^{100 \times 12}, \vec{s} \in \mathbb{R}^{100}, and \vec{f} \in \mathbb{R}^{12}. What are the dimensions of the following product?

\vec{s}^T Q \vec{f}

Correct: Scalar.

\vec{s}^T: 1 x 100
Q: 100 x 12.
\vec{f}: 12 x 1.

The inner dimensions of 100 and 12 cancel, and so \vec{s}^T Q \vec{f} is of shape 1 x 1.
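The shape bookkeeping for \vec{s}^T Q \vec{f} can be confirmed with placeholder arrays of the stated sizes. In this sketch the vectors are stored as explicit column matrices so the 1 x 1 result is visible; the zero entries are arbitrary.

```python
import numpy as np

Q = np.zeros((100, 12))
s = np.zeros((100, 1))   # column vector in R^100
f = np.zeros((12, 1))    # column vector in R^12

product = s.T @ Q @ f    # (1 x 100)(100 x 12)(12 x 1)
print(product.shape)
```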

Problem 14

Consider a dataset \{(\vec{x}_i, y_i)\}_{i=1}^{n} where each \vec{x}_i \in \mathbb{R}^{d} and y_i \in \mathbb{R} for which you decide to fit a multiple linear regression model:

f_1(\vec{w}, b;\, \vec{x}) = \vec{w}^\top \vec{x} + b,\qquad\vec{w}\in\mathbb{R}^d,\;b\in\mathbb{R}.

After minimizing the MSE, the resulting model has an optimal empirical risk value denoted R_1.

Due to fairness constraints related to the nature of the input features, your boss informs you that the last two weights must be the same: \vec{w}^{(d-1)} =\vec{w}^{(d)}. Your colleague suggests a simple fix by removing the last two weights and features:

f_2(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-2)}\vec{x}^{(d-2)} + b.

After training, the resulting model has an optimal empirical risk value denoted R_2. On the other hand, you propose the approach of grouping the last two features and using the model formula

f_3(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-1)}\left(\vec{x}^{(d-1)} + \vec{x}^{(d)}\right) + b.

After training, the final model has an optimal empirical risk value denoted R_3.
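To make the three model formulas concrete, the sketch below builds a design matrix for each of f_1, f_2 and f_3 from a generic feature matrix. Everything here is an assumption for illustration: the sizes n and d, the random placeholder data, the names `Z1`, `Z2`, `Z3`, and the choice to place the column of ones first (the convention in Theorem 2.3.2 may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # hypothetical sizes
X = rng.normal(size=(n, d))      # placeholder data; row i holds the features of x_i
ones = np.ones((n, 1))

# f_1: all d features plus an intercept column
Z1 = np.hstack([ones, X])

# f_2: drop the last two features
Z2 = np.hstack([ones, X[:, :d - 2]])

# f_3: replace the last two features by their sum, so they share one weight
grouped = X[:, d - 2] + X[:, d - 1]
Z3 = np.hstack([ones, X[:, :d - 2], grouped[:, None]])

print(Z1.shape, Z2.shape, Z3.shape)
```

Each model is then an ordinary least-squares problem in its own design matrix, with d + 1, d - 1 and d parameters respectively.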

Problem 14.1

Carefully apply Theorem 2.3.2 (“Optimal Model Parameters for Multiple Linear Regression”) to find an expression for the optimal parameters b^\ast, \vec{w}^\ast which minimize the mean squared error for the model f_2 and the training data \{(\vec{x}_i, y_i)\}_{i=1}^{n}. Your answer may contain the design matrix \mathbf{Z}, or any suitably modified version, as needed.

Problem 14.2

Using the comparison operators \{ =, \leq, \geq, <, >\}, rank the optimal risk values R_1, R_2, R_3 from least to greatest. Justify your answer.

Returning to the original model f_1, suppose you were asked instead to eliminate the intercept term, leading to the model formula

f_4(\vec{w};\, \vec{x}) = \vec{w}^\top \vec{x},\qquad\vec{w}\in\mathbb{R}^d.

Once again, you train this model by minimizing the associated mean squared error and obtain an optimal MSE denoted R_4.

Problem 14.3

Problem 14.4

\sum_{i=1}^{n} \vec{x}_i^{(j)} = 0\text{ for each }1\leq j\leq d,\text{ and }\sum_{i=1}^n y_i = 0.

Problem 14.5

Use the setting of d=1 (a.k.a. simple linear regression) to draw a sketch which illustrates why the result in Part (d) makes sense geometrically.

Problem 15

An automotive research team wants to build a predictive model that simultaneously estimates two performance metrics of passenger cars:

To capture mechanical and aerodynamic factors, the engineers record the following four features for each vehicle (all measured on the current model year):

f(\mathbf W,\vec b;\,\vec x) = \mathbf W\,\vec x+\vec b,\qquad\mathbf{W}\in\mathbb{R}^{2\times 4},\; \vec{b}\in\mathbb{R}^2,

where \vec x\in\mathbb{R}^{4} denotes the feature vector for a given car. Data for eight different cars are listed below.

Problem 15.1

Feature	\vec{x}_1	\vec{x}_2	\vec{x}_3	\vec{x}_4	\vec{x}_5	\vec{x}_6	\vec{x}_7	\vec{x}_8
Engine disp. (L)	2.0	2.5	3.0	1.8	3.5	2.2	2.8	1.6
Mass (kg)	1300	1450	1600	1250	1700	1350	1500	1200
Horsepower	140	165	200	130	250	155	190	115
Drag coeff.	0.28	0.30	0.32	0.27	0.33	0.29	0.31	0.26
City L/100km	8.5	9.2	10.8	7.8	11.5	8.9	9.8	7.2
HWY L/100km	6.0	6.5	7.5	5.8	8.0	6.2	6.9	5.4

Write down the design matrix \mathbf Z and the target matrix \mathbf Y from the data in the table.

Problem 15.2

Compute the weight matrix \mathbf W^{\ast} and bias vector \vec b^{\ast} that minimize the MSE for the given dataset. You can use Python for the computations where needed. You do not need to submit your code, but you do need to write down all matrices and vectors relevant to your computations (round your answers to three decimal places).
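One possible way to set up the computation in Python is sketched below. It assumes the design matrix \mathbf Z has one row per car with a leading column of ones, and it solves the least-squares problem with `np.linalg.lstsq` rather than forming an explicit inverse; the names `Theta`, `W_star` and `b_star` are illustrative, and the layout convention in your course notes may differ.

```python
import numpy as np

# One row per car (x_1 ... x_8); columns: displacement, mass, horsepower, drag
X = np.array([
    [2.0, 1300, 140, 0.28],
    [2.5, 1450, 165, 0.30],
    [3.0, 1600, 200, 0.32],
    [1.8, 1250, 130, 0.27],
    [3.5, 1700, 250, 0.33],
    [2.2, 1350, 155, 0.29],
    [2.8, 1500, 190, 0.31],
    [1.6, 1200, 115, 0.26],
])

# Targets: city and highway consumption (L/100km), one row per car
Y = np.array([
    [8.5, 6.0], [9.2, 6.5], [10.8, 7.5], [7.8, 5.8],
    [11.5, 8.0], [8.9, 6.2], [9.8, 6.9], [7.2, 5.4],
])

# Assumed layout: a column of ones first, then the four features
Z = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution of Z @ Theta ~ Y; Theta has shape (5, 2)
Theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
b_star = Theta[0]        # bias vector, shape (2,)
W_star = Theta[1:].T     # weight matrix, shape (2, 4)

# Predictions f(x) = W x + b for every training car
predictions = X @ W_star.T + b_star

print(np.round(W_star, 3))
print(np.round(b_star, 3))
```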

Problem 15.3

With (\mathbf W^{\ast},\vec b^{\ast}), predict the two fuel consumption values for each of the eight cars and report the overall MSE between the predictions and the true targets.

Problem 15.4

Use your trained model to predict fuel consumption for these two cars. Compute the mean-squared error for the two testing examples and state whether you would recommend the model to the engineers.

Feature	\vec{x}_9	\vec{x}_{10}
Engine disp. (L)	2.4	1.5
Mass (kg)	1400	1150
Horsepower	170	110
Drag coeff.	0.29	0.25
City L/100km	9.0	7.0
HWY L/100km	6.3	5.2
