
**Instructor(s):** Suraj Rampure

This exam was administered in person. Students had **80
minutes** to take this exam.

*Source: Spring 2024
Midterm, Problem 1a*

Vectors get lonely, and so we will give each vector one friend to keep them company.

Specifically, if \vec{v} = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}, \vec{v_f} is the friend of \vec{v}, where \vec{v_f} = \begin{bmatrix} -v_2 \\ v_1 \end{bmatrix}.

Prove that \vec{v} and \vec{v_f} are orthogonal.

Recall two vectors are orthogonal if their dot product is equal to zero.

This means if \vec v \cdot \vec v_f = 0 then they are orthogonal to one another.

\begin{bmatrix} v_1 \\ v_2\end{bmatrix} \cdot \begin{bmatrix} -v_2 \\ v_1 \end{bmatrix} = (v_1)(-v_2) + (v_2)(v_1) = -v_1v_2 + v_1v_2 = 0
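As a quick numerical sanity check (not part of the required proof), we can verify the orthogonality with NumPy; `friend` is our own name for the transformation:

```python
import numpy as np

def friend(v):
    # The "friend" of [v1, v2] is [-v2, v1] (a 90-degree rotation).
    return np.array([-v[1], v[0]])

v = np.array([3.0, 4.0])  # any example vector works
print(np.dot(v, friend(v)))  # 0.0
```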

The average score on this problem was 89%.

*Source: Spring 2024
Midterm, Problem 1b-e*

Consider the vectors \vec{c} and \vec{d}, defined below.

\vec{c} = \begin{bmatrix} 1 \\ 7 \end{bmatrix} \qquad \vec{d} = \begin{bmatrix} -2 \\ 1 \end{bmatrix}

The next few parts ask you to write various vectors as scalar multiples of either \vec{c}, \vec{c_f}, \vec{d}, or \vec{d_f}, where \vec{c_f} and \vec{d_f} are the friends of \vec{c} and \vec{d}, respectively. In each part, write a scalar in the blank, and select the appropriate vector.

Here is an example:

A vector in span(\vec d) that is twice as long as \vec d.

2 \times \vec d

The projection of \vec{c} onto \text{span}(\vec{d}).

_____ \times

\vec c

\vec c_f

\vec d

\vec d_f

1 \times \vec d

Recall that \text{span}(\vec x) is the set of all scalar multiples of \vec x (i.e., vectors of the form w\vec x). The closest vector in \text{span}(\vec x) to \vec y \in \mathbb{R}^n is w^* \vec x, where w^* = \frac{\vec x \cdot \vec y}{\vec x \cdot \vec x}.

The problem is asking us to find w^*, put it inside the blank, and multiply it by \vec d.

We are trying to find the projection of \vec c onto \text{span}(\vec{d}), so \vec c will be our \vec y in this situation, which makes \vec d our \vec x. We can now calculate w^*.

I would recommend calculating the dot product first and then plugging it into our equation. \vec d \cdot \vec c = \begin{bmatrix} -2 \\ 1 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 7 \end{bmatrix} = (-2)*(1) + (1)*(7) = 5 \vec d \cdot \vec d = \begin{bmatrix} -2 \\ 1 \end{bmatrix} \cdot \begin{bmatrix} -2 \\ 1 \end{bmatrix} = (-2)*(-2) + (1)*(1) = 5

w^* = \frac{\vec d \cdot \vec c}{\vec d \cdot \vec d} = \frac{5}{5} = 1

Since we are projecting onto \text{span}(\vec d), our vector is \vec d, making the answer 1 \times \vec d.
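If you want to check this by machine, here is a small NumPy sketch of the projection computation (variable names are our own):

```python
import numpy as np

c = np.array([1.0, 7.0])
d = np.array([-2.0, 1.0])

# w* = (d . c) / (d . d): the coefficient of the projection of c onto span(d)
w_star = np.dot(d, c) / np.dot(d, d)
print(w_star)        # 1.0
print(w_star * d)    # the projection itself: [-2.  1.]
```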

The average score on this problem was 66%.

The error vector of the projection of \vec{c} onto \text{span}(\vec{d}).

_____ \times

\vec c

\vec c_f

\vec d

\vec d_f

-3 \times \vec d_f

Using what we learned in part 1 of this problem (w^*), we can utilize the error vector equation \vec e = \vec y - w^* \vec x. Recall we are finding the projection of \vec c onto \text{span}(\vec{d}), so \vec c is our \vec y, which makes \vec d our \vec x. We found w^* = 1 in the previous part, so our equation is \vec e = \vec c - 1 * \vec d.

\vec e = \begin{bmatrix} 1 \\ 7 \end{bmatrix} - \begin{bmatrix} -2 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 6 \end{bmatrix}

We can now try and find a scalar that when multiplied by one of our multiple choice vectors gives us \begin{bmatrix} 3 \\ 6 \end{bmatrix}.

Recall our choices are:

- \vec c = \begin{bmatrix} 1 \\ 7 \end{bmatrix}
- \vec d = \begin{bmatrix} -2 \\ 1 \end{bmatrix}
- \vec c_f = \begin{bmatrix} -7 \\ 1 \end{bmatrix}
- \vec d_f = \begin{bmatrix} -1 \\ -2 \end{bmatrix}

*Note that we are applying the same transformation as in Problem
1.*

We can easily see -3 * \begin{bmatrix} -1 \\ -2 \end{bmatrix} = \begin{bmatrix} 3 \\ 6 \end{bmatrix}!
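A short NumPy check of this part (building on the previous one; names are our own):

```python
import numpy as np

c = np.array([1.0, 7.0])
d = np.array([-2.0, 1.0])
d_f = np.array([-1.0, -2.0])   # friend of d: [-d2, d1]

w_star = np.dot(d, c) / np.dot(d, d)   # 1.0, from the previous part
e = c - w_star * d                     # error vector
print(e)                               # [3. 6.]
print(np.allclose(e, -3 * d_f))        # True
print(np.dot(e, d))                    # 0.0 -- the error is orthogonal to d
```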

The average score on this problem was 27%.

The projection of \vec{d} onto \text{span}(\vec{c}).

_____ \times

\vec c

\vec c_f

\vec d

\vec d_f

\frac{1}{10} \times \vec c

Once again the problem is asking us to find w^*, put it inside the blank, and multiply it by \vec c.

We are trying to find the projection of \vec d onto \text{span}(\vec{c}), so \vec d will be our \vec y in this situation, which makes \vec c our \vec x. We can now calculate w^*.

I would recommend calculating the dot products first and then plugging them into our equation. \vec c \cdot \vec d = \begin{bmatrix} 1 \\ 7 \end{bmatrix} \cdot \begin{bmatrix} -2 \\ 1 \end{bmatrix} = (1)*(-2) + (7)*(1) = 5 \vec c \cdot \vec c = \begin{bmatrix} 1 \\ 7 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 7 \end{bmatrix} = (1)*(1) + (7)*(7) = 50

w^* = \frac{\vec c \cdot \vec d}{\vec c \cdot \vec c} = \frac{5}{50} = \frac{1}{10}

Since we are projecting onto \text{span}(\vec c), our vector is \vec c, making the answer \frac{1}{10} \times \vec c.

The average score on this problem was 53%.

The error vector of the projection of \vec{d} onto \text{span}(\vec{c}).

_____ \times

\vec c

\vec c_f

\vec d

\vec d_f

\frac{3}{10} \times \vec c_f

Using what we learned in part 3 of this problem (w^*), we can utilize the error vector equation \vec e = \vec y - w^* \vec x. Recall we are finding the projection of \vec d onto \text{span}(\vec{c}), so \vec d is our \vec y, which makes \vec c our \vec x. We found w^* = \frac{1}{10} in the previous part, so our equation is \vec e = \vec d - \frac{1}{10} * \vec c.

\begin{align*} \vec e &= \begin{bmatrix} -2 \\ 1 \end{bmatrix} - \frac{1}{10} * \begin{bmatrix} 1 \\ 7 \end{bmatrix} \\ &= \begin{bmatrix} -2 \\ 1 \end{bmatrix} - \begin{bmatrix} \frac{1}{10} \\ \frac{7}{10} \end{bmatrix}\\ &= \begin{bmatrix} \frac{-20}{10} \\ \frac{10}{10} \end{bmatrix} - \begin{bmatrix} \frac{1}{10} \\ \frac{7}{10} \end{bmatrix}\\ &= \begin{bmatrix} \frac{-21}{10} \\ \frac{3}{10} \end{bmatrix} \end{align*}

We can now try and find a scalar, s, that when multiplied by one of our multiple choice vectors gives us \begin{bmatrix} \frac{-21}{10} \\ \frac{3}{10} \end{bmatrix}.

Recall our choices are:

- \vec c = \begin{bmatrix} 1 \\ 7 \end{bmatrix}
- \vec d = \begin{bmatrix} -2 \\ 1 \end{bmatrix}
- \vec c_f = \begin{bmatrix} -7 \\ 1 \end{bmatrix}
- \vec d_f = \begin{bmatrix} -1 \\ -2 \end{bmatrix}

*Note that we are applying the same transformation as in Problem
1.*

It might be a bit hard to see the answer here, but we want a negative first element and a positive second element, which means we want to look at one of the friend vectors.

From here we can see that \vec c_f multiplied by 3 gives us the numerators of our goal (\begin{bmatrix} \frac{-21}{10} \\ \frac{3}{10} \end{bmatrix}). The only thing left is to divide by 10, so the scalar is \frac{3}{10}.

The average score on this problem was 21%.

*Source: Spring 2024
Midterm, Problem 2*

Consider a dataset of n values, y_1, y_2, ..., y_n, all of which are non-negative. We’re interested in fitting a constant model, H(x) = h, to the data, using the new “Sun God” loss function:

L_\text{sungod}(y_i, h) = w_i \left( y_i^2 - h^2 \right)^2

Here, w_i corresponds to the “weight” assigned to the data point y_i, the idea being that different data points can be weighted differently when finding the optimal constant prediction, h^*.

For example, for the dataset y_1 = 1, y_2 = 5, y_3 = 2, we will end up with different values of h^* when we use the weights w_1 = w_2 = w_3 = 1 and when we use weights w_1 = 8, w_2 = 4, w_3 = 3.

Find \frac{\partial L_\text{sungod}}{\partial h}, the derivative of the Sun God loss function with respect to h. Show your work, and put a \boxed{\text{box}} around your final answer.

\frac{\partial L}{\partial h} = -4w_ih(y_i^2 -h^2)

To solve this problem we simply take the derivative of L_\text{sungod}(y_i, h) = w_i( y_i^2 - h^2 )^2.

We can use the chain rule to find the derivative. The chain rule is: \frac{\partial}{\partial h}[f(g(h))]=f'(g(h))g'(h).

Note that (y_i^2 -h^2)^2 is the part of L_\text{sungod}(y_i, h) = w_i( y_i^2 - h^2 )^2 we care about, because that is where h is. In this case f(h) = h^2 and g(h) = y_i^2 - h^2. We can then take the derivative of both to get f'(h) = 2h and g'(h) = -2h.

This tells us the derivative is: \frac{\partial L}{\partial h} = (w_i) * 2(y_i^2 -h^2) * (-2h), which can be simplified to \frac{\partial L}{\partial h} = -4w_ih(y_i^2 -h^2).

The average score on this problem was 88%.

Prove that the constant prediction that minimizes empirical risk for the Sun God loss function is:

h^* = \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}

The recipe for minimizing empirical risk is to find the derivative of the risk function, set it equal to zero, and solve for h^*.

We know that empirical risk follows the equation R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h). This means that R_\text{sungod}(h) = \frac{1}{n} \sum_{i = 1}^n w_i (y_i^2 - h^2)^2.

Recall we have already found the derivative of L_\text{sungod}(y_i, h) = w_i ( y_i^2 - h^2)^2, and that \frac{\partial R}{\partial h}(h) = \frac{1}{n} \sum_{i = 1}^n \frac{\partial L}{\partial h}(y_i, h). So we can write \frac{\partial R_\text{sungod}}{\partial h}(h) = \frac{1}{n} \sum_{i = 1}^n -4hw_i(y_i^2 -h^2).

We can now do the last two steps: \begin{align*} 0 &= \frac{1}{n} \sum_{i = 1}^n -4hw_i(y_i^2 -h^2)\\ 0&= \frac{-4h}{n} \sum_{i = 1}^n w_i(y_i^2 -h^2)\\ 0&= \sum_{i = 1}^n w_i(y_i^2 -h^2) \qquad \text{(dividing by $\frac{-4h}{n}$, assuming $h \neq 0$)}\\ 0&= \sum_{i = 1}^n (w_iy_i^2 -w_ih^2)\\ 0&= \sum_{i = 1}^n w_iy_i^2 - \sum_{i = 1}^n w_ih^2\\ \sum_{i = 1}^n w_ih^2 &= \sum_{i = 1}^n w_iy_i^2\\ h^2\sum_{i = 1}^n w_i &= \sum_{i = 1}^n w_iy_i^2\\ h^2 &= \frac{\sum_{i = 1}^n w_iy_i^2}{\sum_{i = 1}^n w_i}\\ h^* &= \sqrt{\frac{\sum_{i = 1}^n w_iy_i^2}{\sum_{i = 1}^n w_i}} \end{align*}
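As a sanity check on the derivation (not part of the proof), we can compare the closed form against a brute-force search over h, using the example dataset and weights from the problem statement:

```python
import numpy as np

y = np.array([1.0, 5.0, 2.0])
w = np.array([8.0, 4.0, 3.0])

# Closed-form minimizer derived above
h_star = np.sqrt(np.sum(w * y**2) / np.sum(w))

# Brute-force: evaluate the empirical risk on a fine grid of candidate h's
grid = np.linspace(0, 10, 100_001)
risk = np.array([np.mean(w * (y**2 - h**2)**2) for h in grid])
h_grid = grid[np.argmin(risk)]

print(h_star, h_grid)   # both near sqrt(8), about 2.828
```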

The average score on this problem was 77%.

For a dataset of non-negative values y_1, y_2, ..., y_n with weights w_1, 1, ..., 1, evaluate: \displaystyle \lim_{w_1 \rightarrow \infty} h^*

The maximum of y_1, y_2, ..., y_n

The mean of y_1, y_2, ..., y_{n-1}

The mean of y_2, y_3, ..., y_n

The mean of y_2, y_3, ..., y_n, multiplied by \frac{n}{n-1}

y_1

y_n

y_1

Recall from part b h^* = \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}.

The problem is asking us to evaluate \lim_{w_1 \rightarrow \infty} \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}.

Since w_2 = \cdots = w_n = 1, we can rewrite this as \lim_{w_1 \rightarrow \infty} \sqrt{\frac{w_1 y_1^2 + \sum_{i=2}^{n}y_i^2}{w_1 + (n-1)}}. As w_1 \rightarrow \infty, the constant terms \sum_{i=2}^{n}y_i^2 and (n-1) become insignificant compared to the terms involving w_1. We are left with \sqrt{\frac{w_1y_1^2}{w_1}}. We can cancel the w_1 to get \sqrt{y_1^2}, which is y_1 since the values are non-negative.
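We can also watch the limit happen numerically (a sketch, reusing the example values y_1 = 1, y_2 = 5, y_3 = 2 from the problem statement):

```python
import numpy as np

y = np.array([1.0, 5.0, 2.0])   # y_1 = 1

def h_star(w):
    # Closed-form minimizer from part (b)
    return np.sqrt(np.sum(w * y**2) / np.sum(w))

for w1 in [1.0, 100.0, 10_000.0, 1_000_000.0]:
    w = np.array([w1, 1.0, 1.0])
    print(w1, h_star(w))   # approaches y_1 = 1 as w1 grows
```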

The average score on this problem was 48%.

*Source: Spring 2024
Midterm, Problem 3*

Consider a dataset of n values, y_1, y_2, ..., y_n, where y_1 < y_2 < ... < y_n. Let R_\text{abs}(h) be the mean absolute error of a constant prediction h on this dataset of n values.

Suppose that we introduce a new value to the dataset, \alpha. Let S_\text{abs}(h) be the mean absolute error of a constant prediction h on this new dataset of n + 1 values.

We’re given that:

- n > 5
- \alpha is not equal to any of y_1, y_2, ..., y_n.
- All values of h between 7 and 9 minimize S_\text{abs}(h).
- The slope of S_\text{abs}(h) on the line segment immediately to the right of \alpha is \frac{5-n}{1 + n}.

In the problem statement, we were told that “all values between 7 and 9 minimize S_\text{abs}(h).” More specifically, what interval of values of h minimizes S_\text{abs}(h)?

7 < h < 9

7 \leq h < 9

7 < h \leq 9

7 \leq h \leq 9

7 \leq h \leq 9

Remember that when the minimizer of R_\text{abs}(h) is a range, that implies you have a dataset with an even number of elements (in this problem, that means n+1 is even, so n is odd). Now we know that any point between 7 and 9 minimizes S_\text{abs}(h). Algebraically, if x is a point between 7 and 9, then S_\text{abs}(x) looks like this

\begin{align*} S_\text{abs}(x) &= |y_1 - x| + ... + |7 - x| + |9 - x| + ... + |y_\text{n} -x| \\ \end{align*}

Now let’s say that \gamma = |7-x| , and \beta = 2-\gamma = |9-x|. Then we can see that

\begin{align*} S_\text{abs}(7) &= (|y_1 - x| - \gamma) + ... + (|7 - x| - \gamma) + (|9 - x| + \gamma) + ... + (|y_\text{n} - x| + \gamma)\\ S_\text{abs}(7) &= |y_1 - x| + ... + |7 - x| + |9 - x| + ... + |y_\text{n} - x| + \frac{n+1}{2}(\gamma) + \frac{n+1}{2}(-\gamma)\\ S_\text{abs}(7) &= S_\text{abs}(x) \end{align*}

Similarly

\begin{align*} S_\text{abs}(9) &= (|y_1 - x| + \beta) + ... + (|7 - x| + \beta) + (|9 - x| - \beta) + ... + (|y_\text{n} - x| - \beta)\\ S_\text{abs}(9) &= |y_1 - x| + ... + |7 - x| + |9 - x| + ... + |y_\text{n} - x| + \frac{n+1}{2}(-\beta) + \frac{n+1}{2}(\beta)\\ S_\text{abs}(9) &= S_\text{abs}(x) \end{align*}

Therefore both 7 and 9 minimize S_\text{abs}(h).

**Note:** Even though the equations end at y_n, remember that there’s an element \alpha somewhere on the dataset, making n+1 elements in total. In particular, there
must be \frac{n+1}{2} elements before
and including 7, and \frac{n+1}{2}
after and including 9.

The average score on this problem was 59%.

Which value(s) minimize R_\text{abs}(h)? Give your answer(s) as integer(s) with no variables. Show your work.

*Hint: Don’t start by trying to expand \frac{1}{n} \sum_{i = 1}^n |y_i - h| —
instead, think about what removing \alpha does.*

9

We know that the minimizer of R_\text{abs}(h) must be the median, which must be either 7 or 9, depending on where in the dataset \alpha was added. Assuming we added \alpha in the first half of the dataset, our dataset should look like this \begin{align*} y_1,...,\alpha,...,7,9,...,y_\text{n} \end{align*} Now if we remove \alpha, the dataset becomes \begin{align*} y_1,...,7,9,...,y_\text{n} \end{align*} which means that now there are only \frac{n+1}{2} - 1 elements from y_1 to 7, and \frac{n+1}{2} elements from 9 to y_\text{n}. Thus 9 is the new median, since there would be \frac{n+1}{2} - 1 elements before and after 9. Using similar reasoning, you can show that if \alpha is in the second half of the dataset, then 7 would be the new median and minimizer.

Now we only need to find the placement of \alpha in our data. Remember that the slope of S_\text{abs}(h) at a point p can be calculated as \begin{align*} \frac{\text{number of points before } p - \text{number of points after } p}{n+1} \\ \end{align*} This means that if the slope of S_\text{abs}(h) is negative near \alpha, then \alpha is in the first half; otherwise \alpha is in the second half.

Since we know that the slope of S_\text{abs}(h) on the line segment immediately to the right of \alpha is \frac{5-n}{1 + n} , and since n > 5 then the slope is negative, meaning that \alpha is on the first half, making 9 the minimizer.

The average score on this problem was 39%.

What is the slope of S_\text{abs}(h) on the line segment immediately to the left of \alpha? Give your answer in the form of an expression involving n. Show your work.

\frac{3-n}{1 + n}

Remember that the slope of S_\text{abs}(h) at a point p can be calculated as \begin{align*} \frac{\text{number of points before p} - \text{number of points after p}}{n+1} \\ \end{align*}

Whenever we transition from the segment to the right of \alpha to the segment to its left, \alpha becomes a new point after p, and we also lose a point that was before p. Therefore, looking at the formula above, we can get the slope of the segment to the left of \alpha by subtracting 2 from the numerator of the slope to the right of \alpha.

Since we know that the slope of S_\text{abs}(h) on the line segment immediately to the right of \alpha is \frac{5-n}{1 + n}, then the answer must be \frac{3-n}{1 + n}.
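The slope formula used above is easy to check numerically on a made-up dataset (our own example, with n points, so the denominator is n rather than n+1):

```python
import numpy as np

data = np.array([1.0, 3.0, 6.0, 8.0, 10.0])   # 5 sorted points
n = len(data)

def mae(h):
    # Mean absolute error of the constant prediction h
    return np.mean(np.abs(data - h))

# On the segment between 3 and 6 there are 2 points to the left, 3 to the
# right, so the predicted slope is (2 - 3) / n = -1/5.
h1, h2 = 4.0, 5.0
slope = (mae(h2) - mae(h1)) / (h2 - h1)
print(slope)   # approximately -0.2
```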

The average score on this problem was 32%.

*Source: Spring 2024
Midterm, Problem 4a-c*

Suppose we want to fit a hypothesis function of the form:

H(x) = w_0 + w_1 x^2

Note that this is not the simple linear regression hypothesis function, H(x) = w_0 + w_1x; here, the input is squared.

To do so, we will find the optimal parameter vector \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix} that satisfies the normal equations. The first 5 rows of our dataset are as follows, though note that our dataset has n rows in total.

x | y |
---|---|
2 | 4 |
-1 | 4 |
3 | 4 |
-7 | 4 |
3 | 4 |

Suppose that x_1, x_2, ..., x_n have a mean of \bar{x} = 2 and a variance of \sigma_x^2 = 10.

Write out the first 5 rows of the design matrix, X.

X = \begin{bmatrix} 1 & 4 \\ 1 & 1 \\ 1 & 9 \\ 1 & 49 \\ 1 & 9 \end{bmatrix}

Recall our hypothesis function is H(x) = w_0 + w_1x^2. Since there is a w_0 present our X matrix should contain a column of ones. This means that our first column will be ones. Our second column should be x^2. This means we take each datapoint x and square it inside of X.

The average score on this problem was 84%.

Suppose, just in part (b), that after solving the normal equations, we find \vec{w}^* = \begin{bmatrix} 2 \\ -5 \end{bmatrix}. What is the predicted y value for the augmented feature vector \text{Aug}(\vec{x}) = \begin{bmatrix} 1 \\ 4 \end{bmatrix}? Give your answer as an integer with no variables. Show your work, and put a \boxed{\text{box}} around your final answer.

(2)(1)+(-5)(4)=-18

To find the predicted y value all you need to do is \vec w^* \cdot \text{Aug}(\vec x).

\begin{align*} &\begin{bmatrix} 2 \\ -5 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 4 \end{bmatrix}\\ &(2)(1)+(-5)(4)\\ &2 - 20\\ &-18 \end{align*}

The average score on this problem was 78%.

Let X_\text{tri} = 3 X. Using the fact that \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2, determine the value of the bottom-left value in the matrix X_\text{tri}^T X_\text{tri}, i.e. the value in the second row and first column. Give your answer as an expression involving n. Show your work, and put a \boxed{\text{box}} around your final answer.

126n

To figure out a pattern it can be easier to use variables instead of numbers. Like so: X = \begin{bmatrix} 1 & x_1^2 \\ 1 & x_2^2 \\ \vdots & \vdots \\ 1 & x_n^2 \end{bmatrix}

We can now create X_{\text{tri}}: X_{\text{tri}} = \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}

We want to know what the bottom left value of X_\text{tri}^T X_\text{tri} is. We figure this out with matrix multiplication!

\begin{align*} X_\text{tri}^T X_\text{tri} &= \begin{bmatrix} 3 & 3 & ... & 3\\ 3x_1^2 & 3x_2^2 & ... & 3x_n^2 \end{bmatrix} \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 3(3) & \sum_{i = 1}^n 3(3x_i^2) \\ \sum_{i = 1}^n 3(3x_i^2) & \sum_{i = 1}^n (3x_i^2)(3x_i^2)\end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 9 & \sum_{i = 1}^n 9x_i^2 \\ \sum_{i = 1}^n 9x_i^2 & \sum_{i = 1}^n (3x_i^2)^2 \end{bmatrix} \end{align*}

We can see that the bottom left element should be \sum_{i = 1}^n 9x_i^2.

From here we can use the fact given to us in the directions: \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2.

\begin{align*} &\sum_{i = 1}^n 9x_i^2\\ &9\sum_{i = 1}^n x_i^2\\ &\text{Notice now we can replace $\sum_{i = 1}^n x_i^2$ with $n \sigma_x^2 + n \bar{x}^2$.}\\ &9(n \sigma_x^2 + n \bar{x}^2)\\ &\text{We know that $\sigma_x^2 = 10$ and $\bar x = 2$ from the directions before part a.}\\ &9(10n + 2^2n)\\ &9(10n + 4n)\\ &9(14n) = 126n \end{align*}
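To sanity-check the 126n result, here is a tiny dataset engineered (our own construction, not from the exam) to have mean 2 and population variance 10:

```python
import numpy as np

# Two points symmetric about 2, at distance sqrt(10): mean 2, variance 10
x = np.array([2 - np.sqrt(10), 2 + np.sqrt(10)])
n = len(x)

X = np.column_stack([np.ones(n), x**2])   # design matrix for H(x) = w0 + w1 x^2
X_tri = 3 * X

bottom_left = (X_tri.T @ X_tri)[1, 0]
print(bottom_left, 126 * n)   # both approximately 252 for n = 2
```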

The average score on this problem was 39%.

*Source: Spring 2024
Midterm, Problem 4d*

Consider the following four hypothesis functions:

- H_1(x) = H(x) = w_0 + w_1 x^2
- H_2(x) = w_0
- H_3(x) = w_0 + w_1 x
- H_4(x) = w_0 + w_1x + w_2x^2

Let H_1^*, H_2^*, H_3^*, and H_4^* be the versions of all four hypothesis functions that are using optimal parameters. In the subparts below, fill in the blanks.

The mean squared error of H_1^* is ____ the mean squared error of H_2^*.

Greater than

Greater than or equal to

Equal to

Less than

Less than or equal to

Impossible to tell

Less than or equal to

H_1^* is a more flexible version of H_2^* – anything H_2^* can do, H_1^* can also do, so H_1^* can’t have a higher MSE than H_2^*.

Another way to think about this: in H_1^* we have more information to make an educated guess than in H_2^*, because of the x^2 feature. If we have more information, assuming it is useful, the error will at worst equal that of H_2^*, and will usually decrease. This makes the MSE of H_1^* less than or equal to the MSE of H_2^*.

The average score on this problem was 56%.

The mean squared error of H_1^* is ____ the mean squared error of H_3^*.

Greater than

Greater than or equal to

Equal to

Less than

Less than or equal to

Impossible to tell

Impossible to tell

H_1^* has a different set of features than H_3^* and vice versa, which makes it impossible to tell – this is not a situation like in (a) or (c), where one hypothesis function has a superset of the features of the other.

The average score on this problem was 36%.

The mean squared error of H_1^* is ____ the mean squared error of H_4^*.

Greater than

Greater than or equal to

Equal to

Less than

Less than or equal to

Impossible to tell

Greater than or equal to.

H_1^* is a less flexible version of H_4^*, which has the same features as H_1^* plus an added linear term. So, H_4^* can fit all of the same patterns that H_1^* can, and more.

Again, think about adding more information: in H_4^* we can, at best, make a better prediction because of the extra feature. At worst, it will match H_1^*.

The average score on this problem was 45%.

In ____ of the hypothesis functions H_1^*,H_2^*, H_3^*, and H_4^*, the sum of the residuals of the function’s predictions is 0.

None

1

2

3

All 4

All 4

All 4 hypothesis functions have intercept terms, which, as we saw in lecture, creates a column of all 1s in their design matrices, which enables the sum of residuals to be 0.
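This fact is easy to verify: fit any hypothesis whose design matrix includes a column of 1s and sum the residuals (a sketch with synthetic data of our own):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 3 * x + rng.normal(size=20)

# Design matrix with an intercept column, like H_4(x) = w0 + w1 x + w2 x^2
X = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ w
print(residuals.sum())   # ~0, up to floating-point error
```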

The average score on this problem was 53%.

*Source: Spring 2024
Midterm, Problem 5*

Consider a dataset of n points, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) where:

- x_1 < x_2 < ... < x_n, and x_1, x_2, ..., x_n have a variance of \sigma_x^2 = 15 and a range of 20 (the range of a collection of values is the difference between the largest and smallest value).
- y_1 > y_2 > ... > y_n, and y_1, y_2, ..., y_n have a variance of \sigma_y^2 = 8 and a range of 6.

We fit two linear hypothesis functions using squared loss:

- One hypothesis function is fit with a “swapped” version of the dataset, where x_1 and x_n are swapped — that is, it uses the dataset (x_n, y_1), (x_2, y_2), ..., (x_{n-1}, y_{n-1}), (x_1, y_n). Note that only two of the points in this dataset are different than in the original dataset. We’ll call the optimal slope and intercept of this hypothesis function w_1^\text{swap} and w_0^\text{swap}, respectively.
- Another hypothesis function is fit with the original dataset,

(x_1, y_1), (x_2, y_2), ..., (x_{n-1}, y_{n-1}), (x_n, y_n). We’ll call the optimal slope and intercept of this hypothesis function w_1^\text{orig} and w_0^\text{orig}, respectively.

On the next page, in the space provided, prove that:

\displaystyle | w_1^\text{swap} - w_1^\text{orig} | = \frac{8}{n}

We use one of the formulas for the optimal slope, w_1 = \frac{\sum_{i=1}^n (x_i - x_\text{mean})(y_i - y_\text{mean})}{\sum_{i=1}^n (x_i - x_\text{mean})^2}. Note that swapping x_1 and x_n does not change the collection of x values, so x_\text{mean} and \sigma_x^2 are unchanged, and the denominator \sum_{i=1}^n (x_i - x_\text{mean})^2 = n\sigma_x^2 = 15n is the same for both datasets.

\begin{align*} \displaystyle | w_1^\text{swap} - w_1^\text{orig} | = | \frac{[\sum_{i=2}^{n-1} (x_i - x_\text{mean})(y_i - y_\text{mean})] + (x_n - x_\text{mean})(y_1 - y_\text{mean}) + (x_1 - x_\text{mean})(y_n - y_\text{mean})}{15n} \\ - \frac{([\sum_{i=2}^{n-1} (x_i - x_\text{mean})(y_i - y_\text{mean})] + (x_1 - x_\text{mean})(y_1 - y_\text{mean}) + (x_n - x_\text{mean})(y_n - y_\text{mean}))}{15n} | \\ \end{align*}

Then, cancelling the summations, we get \begin{align*} \displaystyle | w_1^\text{swap} - w_1^\text{orig} | = | \frac{ (x_n - x_\text{mean})(y_1 - y_\text{mean}) + (x_1 - x_\text{mean})(y_n - y_\text{mean})}{15n} \\ - \frac{((x_1 - x_\text{mean})(y_1 - y_\text{mean}) + (x_n - x_\text{mean})(y_n - y_\text{mean}))}{15n} | \\ = |\frac{(x_n - x_\text{mean})(y_1 - y_n) + (x_1 - x_\text{mean})(y_n - y_1)}{15n}|\\ = |\frac{(x_n - x_\text{mean})(6) + (x_1 - x_\text{mean})(-6)}{15n}| \\ = |\frac{6(x_n - x_\text{mean} - x_1 + x_\text{mean})}{15n}|\\ = |\frac{6(20)}{15n}| = \frac{8}{n} \end{align*}
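The same cancellation can be checked numerically on any dataset with increasing x's and decreasing y's (a sketch of our own; the general form of the identity is \text{range}_x \cdot \text{range}_y / (n\sigma_x^2)):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
x = np.sort(rng.normal(size=n))     # increasing x's
y = -np.sort(rng.normal(size=n))    # decreasing y's

def slope(xs, ys):
    # Optimal simple-linear-regression slope under squared loss
    return np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean())**2)

x_swap = x.copy()
x_swap[0], x_swap[-1] = x[-1], x[0]   # swap x_1 and x_n

diff = abs(slope(x_swap, y) - slope(x, y))
predicted = (x[-1] - x[0]) * (y[0] - y[-1]) / (n * x.var())
print(diff, predicted)   # the two agree
```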

The average score on this problem was 39%.