Instructor(s): Gal Mishne
This exam was administered in person. Students were allowed one two-sided 11in by 8.5in cheat sheet and had 50 minutes to take the exam.
Let \vec{y}_1, \vec{y}_2, \dotsc, \vec{y}_n \in\mathbb{R}^2 be a collection of two-dimensional vectors and let \overline{y} = \frac{1}{n}\sum_{i=1}^n \vec{y}_i. Suppose we wish to model the \vec{y}_i’s using a single vector of the form c\vec{z} where \vec{z} = \begin{bmatrix} -1 & 1 \end{bmatrix}^T and c\in\mathbb{R} is a real number (to be determined).
Suppose we use the loss function L_{\mathrm{sq}}(\vec{y}_i, c) = |\vec{y}_i^{(1)} + c|^2 + |\vec{y}_i^{(2)} - c|^2. Write down the empirical risk function R_{\mathrm{sq}}(\{\vec{y}_i\}_{i=1}^n, c).
R_{\mathrm{sq}}(\{\vec{y}_i\}_{i=1}^n, c) = \frac{1}{n}\sum_{i=1}^n (|\vec{y}_i^{(1)} + c|^2 + |\vec{y}_i^{(2)} - c|^2).
Recall that the empirical risk is the average loss over the dataset: R_{\mathrm{sq}}(\{\vec{y}_i\}_{i=1}^n, c) = \frac{1}{n}\sum_{i=1}^n L_{\mathrm{sq}}(\vec{y}_i, c). All we have to do is put the loss function inside a summation and divide by n. It is important to put parentheses around the loss function.
The average score on this problem was 71%.
Show that c^\ast = \frac{1}{2}\vec{z}^T \overline{y} is the only critical point for the empirical risk function R_{\mathrm{sq}}(\{\vec{y}_i\}_{i=1}^n, c) you found in part (a).
\begin{align*} \frac{\partial}{\partial c} R_{\mathrm{sq}}(\{\vec{y}_i\}_{i=1}^n, c) &= \frac{\partial}{\partial c} \frac{1}{n}\sum_{i=1}^n \left(|\vec{y}_i^{(1)} + c|^2 + |\vec{y}_i^{(2)} - c|^2\right)\\ &= \frac{1}{n}\sum_{i=1}^n \left( 2(\vec{y}_i^{(1)} + c) - 2(\vec{y}_i^{(2)} - c)\right)\\ &= 4c + \frac{2}{n}\sum_{i=1}^n\left(\vec{y}_i^{(1)} - \vec{y}_i^{(2)}\right)\\ &= 4c - \frac{2}{n}\sum_{i=1}^n\vec{z}^T \vec{y}_i\\ &= 4c - 2\vec{z}^T \overline{y}. \end{align*} Setting this derivative to zero, the only critical point occurs when c = \frac{1}{2}\vec{z}^T \overline{y}.
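As a quick sanity check (not part of the original solution), the critical point can be verified numerically. The sketch below assumes numpy and a made-up random dataset, and compares the grid minimizer of R_{\mathrm{sq}} to \frac{1}{2}\vec{z}^T\overline{y}.

```python
import numpy as np

# Sanity check on a made-up random dataset: the minimizer of R_sq
# should coincide with (1/2) * z^T y_bar.
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))               # rows are y_1, ..., y_n
z = np.array([-1.0, 1.0])
y_bar = Y.mean(axis=0)

def risk(c):
    # R_sq = (1/n) * sum of (y_i^(1) + c)^2 + (y_i^(2) - c)^2
    return np.mean((Y[:, 0] + c) ** 2 + (Y[:, 1] - c) ** 2)

c_star = 0.5 * z @ y_bar                  # claimed critical point
cs = np.linspace(c_star - 1, c_star + 1, 2001)
c_grid = cs[np.argmin([risk(c) for c in cs])]
print(np.isclose(c_grid, c_star, atol=1e-3))   # True
```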
The average score on this problem was 63%.
Given the dataset: \vec{y_1}=\begin{bmatrix} 1 \\ - 2 \end{bmatrix} \quad \vec{y_2}=\begin{bmatrix} -3 \\ 4 \end{bmatrix} \quad \vec{y_3}=\begin{bmatrix} 2 \\ - 2 \end{bmatrix} what is R_{\mathrm{sq}}(\{\vec{y}_1, \vec{y}_2, \vec{y}_3\}, c^\ast)?
\overline{y} = \frac{1}{3}\left(\vec{y}_1 + \vec{y}_2 + \vec{y}_3\right) = \begin{bmatrix} 0\\ 0 \end{bmatrix}
c^*=\frac{1}{2}\vec{z}^T \overline{y} = 0
Since c^\ast = 0, each term of the risk reduces to \|\vec{y}_i\|^2: \begin{align*} R_{\mathrm{sq}}(\{\vec{y}_i\}_{i=1}^3, c^\ast) & = \frac{1}{n}\sum_{i=1}^n \left(|\vec{y}_i^{(1)} + c^\ast|^2 + |\vec{y}_i^{(2)} - c^\ast|^2\right) \\ & = \frac{1}{3} \left( (1^2+2^2) + (3^2+4^2) + (2^2+2^2) \right) = \frac{38}{3} \end{align*}
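As an optional check (not part of the original solution), the same computation in numpy for the dataset above:

```python
import numpy as np

# Recomputing part (c): y_bar = [0, 0], so c* = 0 and the risk is 38/3.
Y = np.array([[1.0, -2.0], [-3.0, 4.0], [2.0, -2.0]])
z = np.array([-1.0, 1.0])
c_star = 0.5 * z @ Y.mean(axis=0)                                  # 0.0
risk = np.mean((Y[:, 0] + c_star) ** 2 + (Y[:, 1] - c_star) ** 2)  # 12.666...
print(c_star, risk, 38 / 3)
```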
The average score on this problem was 57%.
For each of the following statements, clearly bubble TRUE or FALSE.
The maximum number of points you can earn on this problem is 10, so you can get up to two statements incorrect and still earn full credit.
You do not need to show any work to earn full credit, but you can provide justification to possibly earn partial credit.
For all statements to follow, assume \vec{x}_1, \vec{x}_2, \dotsc, \vec{x}_n\in\mathbb{R}^{d} are d-dimensional vectors, and each has a scalar response value y_1,\dotsc, y_n. Consider the multiple linear regression problem where
\begin{align*} H(\vec{x}_i) = w_0 + w_1 \vec{x}_i^{(1)} + w_2 \vec{x}_i^{(2)} + \cdots + w_{d}\vec{x}_i^{(d)}. \end{align*}
If X^TX is invertible, the matrix P = X(X^TX)^{-1}X^T satisfies the equations P^2 = P and P^T = P.
TRUE
FALSE
TRUE
To see if this statement is true, we can plug in our definition for P and check if these equations hold using various properties of matrix multiplication.
First, we see P^2 = (X(X^TX)^{-1}X^T)(X(X^TX)^{-1}X^T) = X(X^TX)^{-1}(X^TX)(X^TX)^{-1}X^T, using the associative property. Since (X^TX)(X^TX)^{-1} = I, this simplifies to P^2 = X(X^TX)^{-1}X^T = P. So P^2 = P is true.
Next, we check P^T. A useful fact for this problem is that for any invertible matrix A, we have (A^{-1})^T = (A^T)^{-1}; also, X^TX is symmetric, so (X^TX)^T = X^TX. Then P^T = (X(X^TX)^{-1}X^T)^T = X\left((X^TX)^{-1}\right)^T X^T = X\left((X^TX)^T\right)^{-1} X^T = X(X^TX)^{-1}X^T = P. So P^T = P is true.
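A quick numerical illustration (not part of the original solution), assuming numpy and a random X with full column rank:

```python
import numpy as np

# Verify P^2 = P and P^T = P for a random X with full column rank.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))               # full column rank with probability 1
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(P @ P, P), np.allclose(P.T, P))   # True True
```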
The average score on this problem was 58%.
The residual vector \vec{r} = \vec{y} - X\vec{w}^\ast is orthogonal to the rows of X.
TRUE
FALSE
FALSE
The wording here is tricky! Recall that the residual vector \vec r is the same as the error vector \vec e = \vec y - X \vec w^*, and that the normal equations give X^T \vec e = \vec 0. Each entry of X^T \vec e is the dot product of a row of X^T, i.e. a column of X, with \vec e. So \vec e is orthogonal to the columns of X, not its rows. (In fact, the rows of X are vectors in \mathbb{R}^{d+1}, while \vec e \in \mathbb{R}^n, so the statement does not even make sense in general.)
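A small numerical illustration (not part of the original solution), assuming numpy and a made-up dataset:

```python
import numpy as np

# The least squares residual is orthogonal to the *columns* of X: X^T r = 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ w_star
print(np.allclose(X.T @ r, 0))            # True
# Note: dotting r (length 10) with a row of X (length 3) is not even defined here.
```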
The average score on this problem was 51%.
The equation \|\vec{x}_i + \vec{w}^\ast\|^2 = \|\vec{x}_i\|^2 + \|\vec{w}^\ast\|^2 holds for each 1\leq i\leq n.
TRUE
FALSE
FALSE
This is not always true. It would be true if \vec x_i and \vec w^* were orthogonal (by the Pythagorean theorem). Here is an example where they are not:
Let \vec{x}_i = \begin{bmatrix} 1 \\ 0 \end{bmatrix} and \vec{w}^\ast = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.
Looking at the left side:
\begin{align*} \|\vec{x}_i + \vec{w}^\ast\|^2 &= \left\|\begin{bmatrix} 1 \\ 0 \end{bmatrix} + \begin{bmatrix}1 \\ 1\end{bmatrix}\right\|^2\\ &= \left\|\begin{bmatrix} 2 \\ 1 \end{bmatrix}\right\|^2\\ &= 2^2 + 1^2\\ &= 4 + 1\\ &= 5 \end{align*}
Looking at the right side:
\begin{align*} \|\vec{x}_i\|^2 + \|\vec{w}^\ast\|^2 &= \|\begin{bmatrix} 1 \\ 0 \end{bmatrix}\|^2 + \|\begin{bmatrix} 1 \\ 1 \end{bmatrix}\|^2\\ &= 1^2 + (1^2 + 1^2)\\ &= 1 + 2\\ &= 3 \end{align*}
Since 5 \neq 3, we have \|\vec{x}_i + \vec{w}^\ast\|^2 \neq \|\vec{x}_i\|^2 + \|\vec{w}^\ast\|^2 for this choice of vectors, so the statement is not always true.
The average score on this problem was 77%.
The gradient of f(\vec{w}) = y_i - H(\vec{x}_i) with respect to \vec{w} is \vec{x}_i.
TRUE
FALSE
FALSE
Let H(\vec{x}_i) = \vec{w} \cdot \text{Aug}(\vec{x}_i) = w_0 + w_1 \vec{x}_i^{(1)} + \cdots + w_d \vec{x}_i^{(d)} be our hypothesis function. The gradient of H(\vec{x}_i) with respect to \vec{w} is \text{Aug}(\vec{x}_i). Since H(\vec{x}_i) is multiplied by -1 in f(\vec{w}), the gradient of f(\vec{w}) is -\text{Aug}(\vec{x}_i), not \vec{x}_i. Therefore, the statement is false.
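A finite-difference check (not part of the original solution), assuming numpy and made-up values of \vec{x}_i, y_i, and \vec{w}:

```python
import numpy as np

# Check numerically that the gradient of f(w) = y_i - w . Aug(x_i) is -Aug(x_i).
rng = np.random.default_rng(2)
x_i, y_i, w = rng.normal(size=3), 1.5, rng.normal(size=4)
aug = np.concatenate(([1.0], x_i))        # Aug(x_i) = [1, x_i^(1), ..., x_i^(d)]

f = lambda w: y_i - w @ aug
eps = 1e-6
grad = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.allclose(grad, -aug))            # True
```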
The average score on this problem was 65%.
\vec{w}^\ast is the orthogonal projection of \vec{y} onto the span of the columns of X.
TRUE
FALSE
FALSE
\vec w^* is the vector of coefficients that gives the best approximation of \vec y as a linear combination of the columns of X. The orthogonal projection of \vec y onto the span of the columns of X is the prediction vector X\vec w^*, not \vec w^* itself. (Note that \vec w^* \in \mathbb{R}^{d+1}, while \vec y and its projection live in \mathbb{R}^n.)
The average score on this problem was 49%.
The empirical risk function R_{\mathrm{sq}} can be written in the form R_{\mathrm{sq}}(\vec{w}) = \frac{1}{n}\| X\vec{w}\|^2.
TRUE
FALSE
FALSE
An important aspect of risk (and loss) is the difference between the actual values \vec y and our predictions X \vec w: we want to measure how far our predictions are from the actual values. The expression \frac{1}{n}\|X\vec w\|^2 only measures the length of X \vec w and ignores \vec y entirely, so it does not quantify how "wrong we are." The correct form is R_{\mathrm{sq}}(\vec{w}) = \frac{1}{n}\|\vec y - X\vec{w}\|^2.
The average score on this problem was 87%.
If n \geq d+1 then (X^TX)^{-1} exists.
TRUE
FALSE
FALSE
This is false because n \geq d+1 alone does not guarantee that (X^TX)^{-1} exists.
Recall that X is the n \times (d+1) design matrix (n samples, d features, plus a column of ones for the intercept). For (X^TX)^{-1} to exist, X must have d+1 linearly independent columns (full column rank), which makes X^TX full rank. Having n \geq d+1 rows is necessary for this, but not sufficient: if, say, two features are identical, the columns of X are linearly dependent no matter how large n is.
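A concrete counterexample in numpy (not part of the original solution), with n well above d+1 but a repeated feature column:

```python
import numpy as np

# n >= d+1 is not enough: with a duplicated column, X^T X is singular.
rng = np.random.default_rng(3)
col = rng.normal(size=(10, 1))
X = np.hstack([np.ones((10, 1)), col, col])   # n = 10 rows, but rank only 2
print(np.linalg.matrix_rank(X.T @ X))         # 2  -> (X^T X)^{-1} does not exist
```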
The average score on this problem was 70%.
Assume you have a dataset \vec{x}_1, \vec{x}_2, \dotsc, \vec{x}_n\in\mathbb{R}^{d} of d-dimensional vectors for n individuals, and that each has a scalar response value y_1,\dotsc, y_n. You then fit a multiple linear regression prediction rule H_1(\vec{x}) = \vec{w}^T\text{Aug}(\vec{x}), with optimal parameters \vec{w}^* whose MSE is c_1.
Due to privacy reasons it turns out you are unable to use the feature x^{(d)}, so you need to find a way to modify your model so that it does not incorporate this feature. Your friend Minnie suggests using the parameters of the prediction rule you learned already and simply setting w_d=0: H_2(\vec{x}) = \begin{bmatrix} w_0^* & w_1^* & \ldots & w_{d-1}^* & 0 \end{bmatrix}^T \text{Aug}(\vec{x}). Denote the MSE of the model H_2(\vec{x}) by c_2.
Meanwhile, your friend Maxine recommends fitting a new prediction rule H_3 to the data with only the first d-1 features, with MSE c_3.
Order the three errors c_1, c_2, c_3 from least to greatest. Explain your solution. (Use < if the error is strictly smaller, = if the errors are equal or \leq if the errors may be equal.)
c_1 \leq c_3 \leq c_2.
The prediction rule H_1 uses an additional feature compared to H_3. The MSE cannot increase by adding a feature (HW4), therefore c_1 \leq c_3.
The MSE c_3 is as small as it can possibly be when using only the first d-1 features: no other choice of coefficients for those features can achieve a smaller MSE. Setting w_d = 0 in H_2 gives a prediction rule that uses only the first d-1 features, with coefficients (w^*_0, w^*_1, \ldots, w^*_{d-1}). Since H_2 is one particular rule using d-1 features, its MSE c_2 cannot be smaller than c_3.
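A quick empirical check of the ordering (not part of the original solution), using numpy and a made-up dataset:

```python
import numpy as np

# Check c_1 <= c_3 <= c_2 on synthetic data.
rng = np.random.default_rng(4)
n, d = 50, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # augmented design matrix
y = rng.normal(size=n)

def fit_mse(A, y):
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ w) ** 2), w

c1, w_star = fit_mse(X, y)                # H_1: all d features
c3, _ = fit_mse(X[:, :-1], y)             # H_3: refit without feature x^(d)
w2 = np.append(w_star[:-1], 0.0)          # H_2: reuse w*, set w_d = 0
c2 = np.mean((y - X @ w2) ** 2)
print(c1 <= c3 <= c2)                     # True
```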
The average score on this problem was 65%.
You are able to add 5 additional individuals to your datasets, so that now you have n+5 individuals. You fit a new prediction rule to the data whose MSE is c_4. Can c_4< c_1? Explain.
Yes. If the 5 added individuals happen to lie exactly on the prediction rule H_1, their squared errors under H_1 are all zero, so the mean squared error on the larger dataset is smaller: refitting gives c_4 \leq \frac{n c_1}{n+5} < c_1 (assuming c_1 > 0).
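An illustrative simulation (not part of the original solution), using numpy with made-up data where the 5 new individuals are placed exactly on H_1:

```python
import numpy as np

# Add 5 individuals that lie exactly on H_1; the refit MSE c_4 drops below c_1.
rng = np.random.default_rng(5)
n, d = 30, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)
w1, *_ = np.linalg.lstsq(X, y, rcond=None)
c1 = np.mean((y - X @ w1) ** 2)

X_new = np.hstack([np.ones((5, 1)), rng.normal(size=(5, d))])
y_new = X_new @ w1                                   # zero error under H_1
X_all, y_all = np.vstack([X, X_new]), np.concatenate([y, y_new])
w4, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
c4 = np.mean((y_all - X_all @ w4) ** 2)
print(c4 <= n * c1 / (n + 5) + 1e-12, c4 < c1)       # True True
```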
The average score on this problem was 41%.