Fall 2025 Midterm



Instructor(s): Gal Mishne

This exam was administered in person. Students were allowed a two-sided 11in by 8.5in cheat sheet and had 50 minutes to take the exam.


Problem 1

You are working in a biology laboratory to quantify how an antibiotic drug influences the population growth of E. coli bacteria. You inoculate a Petri dish with the bacteria and the drug, then monitor the culture over time. At measurement times x_1, x_2, \dotsc, x_n (hours since inoculation), you estimate the total bacterial count y_1, y_2, \dotsc, y_n. To describe the growth pattern, you propose the exponential model H(x) = h\,e^{-x}, \qquad h \in \mathbb{R}.

Note: in version B the given function was H(x)=he^x.


Problem 1.1

Using the squared loss function \ell_{sq}(h, (x_i, y_i)) = (he^{-x_i} - y_i)^2, clearly write down the associated empirical risk as a function of h.

Version A: R(h)\;=\;\frac{1}{n}\sum_{i=1}^n\big(he^{-x_i}-y_i\big)^2.

Version B: R(h)\;=\;\frac{1}{n}\sum_{i=1}^n\big(he^{x_i}-y_i\big)^2.


Problem 1.2

Compute \frac{d}{dh} R(h).

Version A:

R(h)=\frac{1}{n}\sum_{i=1}^n\big(he^{-x_i}-y_i\big)^2

Differentiate term–by–term (chain rule): \frac{d}{dh}R(h)=\frac{1}{n}\sum_{i=1}^n 2\big(he^{-x_i}-y_i\big)\cdot e^{-x_i} =\frac{2}{n}\left(h\sum_{i=1}^n e^{-2x_i}-\sum_{i=1}^n y_i e^{-x_i}\right).

Equivalently, expanding the square, R(h)=\frac{1}{n}\sum_{i=1}^n\!\left(h^2e^{-2x_i}-2hy_ie^{-x_i}+y_i^2\right) \;\Rightarrow\; R'(h)=\frac{2}{n}\left(h\sum e^{-2x_i}-\sum y_ie^{-x_i}\right).

Version B:

R(h)=\frac{1}{n}\sum_{i=1}^n\big(he^{x_i}-y_i\big)^2

Differentiate term–by–term (chain rule): \frac{d}{dh}R(h)=\frac{1}{n}\sum_{i=1}^n 2\big(he^{x_i}-y_i\big)\cdot e^{x_i} =\frac{2}{n}\left(h\sum_{i=1}^n e^{2x_i}-\sum_{i=1}^n y_i e^{x_i}\right).
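As a sanity check on this algebra, here is a small numerical sketch (Version A, with made-up x_i and y_i values) comparing the analytic derivative against a centered finite-difference approximation; the Version B check is identical with e^{x_i} in place of e^{-x_i}.

```python
import numpy as np

# Hypothetical data, only to check the Version A derivative formula numerically.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=20)
y = 3.0 * np.exp(-x) + rng.normal(0, 0.1, size=20)

def R(h):
    return np.mean((h * np.exp(-x) - y) ** 2)

def dR(h):
    # (2/n) * (h * sum(e^{-2 x_i}) - sum(y_i e^{-x_i}))
    return 2 / len(x) * (h * np.sum(np.exp(-2 * x)) - np.sum(y * np.exp(-x)))

h, eps = 1.7, 1e-6
finite_diff = (R(h + eps) - R(h - eps)) / (2 * eps)
print(dR(h), finite_diff)  # the two values should agree to several decimal places
```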


Problem 1.3

Prove that the global minimizer h^\ast for R(h) is given by the formula h^\ast \;=\; \frac{\sum_{i=1}^n y_i e^{-x_i}}{\sum_{i=1}^n e^{-2x_i}}.

Version A:

Set the derivative to zero: 0=\frac{2}{n}\left(h^\ast\sum_{i=1}^n e^{-2x_i}-\sum_{i=1}^n y_i e^{-x_i}\right) \;\Longrightarrow\; h^\ast\sum_{i=1}^n e^{-2x_i}=\sum_{i=1}^n y_i e^{-x_i} \;\Longrightarrow\; h^\ast=\frac{\sum_{i=1}^n y_i e^{-x_i}}{\sum_{i=1}^n e^{-2x_i}}.

Uniqueness: R(h) is a quadratic with R''(h)=\frac{2}{n}\sum_{i=1}^n e^{-2x_i}>0

Since e^{-2x_i}>0 for all i, R'' > 0 and the critical point is the unique minimizer.

Version B:

Set the derivative to zero: 0=\frac{2}{n}\left(h^\ast\sum_{i=1}^n e^{2x_i}-\sum_{i=1}^n y_i e^{x_i}\right) \;\Longrightarrow\; h^\ast=\frac{\sum_{i=1}^n y_i e^{x_i}}{\sum_{i=1}^n e^{2x_i}}.

Uniqueness: R(h) is a quadratic with R''(h)=\frac{2}{n}\sum_{i=1}^n e^{2x_i}>0

Since e^{2x_i}>0 for all i, R'' > 0 and the critical point is the unique minimizer.
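A minimal numeric check of the Version A closed form, again with made-up data: the plug-in value of h^\ast should agree with a brute-force grid search over R(h).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=20)
y = 3.0 * np.exp(-x) + rng.normal(0, 0.1, size=20)

# Closed-form minimizer from Problem 1.3 (Version A).
h_star = np.sum(y * np.exp(-x)) / np.sum(np.exp(-2 * x))

# Brute-force check: evaluate R(h) on a fine grid and take the argmin.
hs = np.linspace(h_star - 2, h_star + 2, 100001)
R = np.mean((hs[:, None] * np.exp(-x) - y) ** 2, axis=1)
print(h_star, hs[np.argmin(R)])  # should agree to within the grid spacing
```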


Problem 1.4

After talking to a colleague, you decide to use the following hypothesis function instead: \widetilde{H}(x) = h_0 + h_1e^{-x}.

Using a suitable data transformation (i.e., change of variables) and formulas we have derived in class, find a formula for the optimal parameters h_0^\ast, h_1^\ast which minimize the mean squared error for the hypothesis function \widetilde{H}(x) given your observations from before.

For Version A:

Let z_i:=e^{-x_i}. Then \widetilde{H}(x)=h_0+h_1e^{-x}=h_0+h_1 z is ordinary simple linear regression of y on z with MSE.

For Version B:

Let z_i:=e^{x_i}. Then \widetilde{H}(x)=h_0+h_1e^{x}=h_0+h_1 z is ordinary simple linear regression of y on z with MSE.

The rest of the solution is the same for both versions:

With \bar z=\frac{1}{n}\sum z_i and \bar y=\frac{1}{n}\sum y_i, h_1^\ast=\frac{\sum_{i=1}^n (z_i-\bar z)(y_i-\bar y)}{\sum_{i=1}^n (z_i-\bar z)^2}, \qquad h_0^\ast=\bar y-h_1^\ast\,\bar z.

Equivalently, by the normal equations \begin{bmatrix} n & \sum z_i\\[2pt] \sum z_i & \sum z_i^2 \end{bmatrix} \begin{bmatrix} h_0^\ast\\ h_1^\ast \end{bmatrix} = \begin{bmatrix} \sum y_i\\ \sum z_i y_i \end{bmatrix}.

Alternative solution:

Let the design matrix be \mathbf{X} = \begin{bmatrix} 1 & z_1 \\ 1 & z_2 \\ \vdots & \vdots \\ 1 & z_n \\ \end{bmatrix}

Then the normal equations \mathbf{X}^T\mathbf{X}\,\vec{h} =\mathbf{X}^T\vec{y} determine the optimal \vec{h}=(h_0^\ast, h_1^\ast); solving this 2\times 2 system (for invertible \mathbf{X}^T\mathbf{X}, \vec{h}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\vec{y}) recovers the same formulas as above.
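As an illustration (not part of the exam solution), here is a short Python sketch with simulated data that computes h_0^\ast, h_1^\ast both from the slope/intercept formulas and from the normal equations; the two approaches should agree.

```python
import numpy as np

# Simulated data roughly following h0 + h1 * e^{-x} (hypothetical values).
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=30)
y = 1.0 + 2.5 * np.exp(-x) + rng.normal(0, 0.05, size=30)

z = np.exp(-x)  # Version A transformation; use np.exp(x) for Version B

# Slope/intercept formulas for simple linear regression of y on z.
z_bar, y_bar = z.mean(), y.mean()
h1 = np.sum((z - z_bar) * (y - y_bar)) / np.sum((z - z_bar) ** 2)
h0 = y_bar - h1 * z_bar

# Equivalent solution via the normal equations with design matrix [1, z].
X = np.column_stack([np.ones_like(z), z])
h0_ne, h1_ne = np.linalg.solve(X.T @ X, X.T @ y)
print((h0, h1), (h0_ne, h1_ne))  # the two solutions should match
```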



Problem 2

For each of the following statements, clearly fill in TRUE or FALSE. You do not need to show any work to earn full credit, but you can provide justification to possibly earn partial credit.

In parts (a)-(d), let \{(\vec{x}_i, y_i)\}_{i=1}^n be a fixed dataset, where each \vec{x}_i \in \mathbb{R}^d and y_i \in \mathbb{R}. Let X \in \mathbb{R}^{n \times (d+1)} be the corresponding design matrix defined in the usual way. (For simple linear regression, d = 1 and each \vec{x}_i is a scalar.)


Problem 2.1

For a simple linear regression model: \min_{\vec{w} = (w_0, w_1)} \frac{1}{n}\sum_{i=1}^{n} |y_i -(w_0+w_1x_i)| = \min_{\vec{w} = (w_0, w_1)} \frac{1}{n} \| \vec{y} - X\vec{w}\|.

TRUE or FALSE?

FALSE. The LHS is the MAE (an L₁ objective); the RHS is \tfrac{1}{n} times the Euclidean (L₂) norm of the residual vector. These are different objective functions, so their minimum values need not be equal.


Problem 2.2

\nabla \|X\vec{w} - \vec{y}\|^2 = X^\top X \vec{w} - X^\top \vec{y}.

TRUE or FALSE?

FALSE. \nabla\|X\vec w-\vec y\|^2=2X^\top(X\vec w-\vec y)=2X^\top X\vec w-2X^\top \vec y (missing factor 2).
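A quick numerical sketch (random, hypothetical data) confirming the factor of 2 with a finite-difference check on one coordinate of the gradient:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)

grad = 2 * X.T @ (X @ w - y)  # correct gradient, including the factor of 2

# Centered finite-difference approximation of the first partial derivative.
eps = 1e-6
e0 = np.zeros(4); e0[0] = eps
f = lambda v: np.linalg.norm(X @ v - y) ** 2
print(grad[0], (f(w + e0) - f(w - e0)) / (2 * eps))  # should match closely
```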


Problem 2.3

d+1 = \mathrm{rank}(X^\top X) + \textrm{dim}(\mathrm{null}(X) ).

TRUE or FALSE?

TRUE. By the rank-nullity theorem, \mathrm{rank}(X)+\dim(\mathrm{null}(X))=d+1 (the number of columns of X), and \mathrm{rank}(X^\top X)=\mathrm{rank}(X).
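A small sketch (with a deliberately redundant column, so the design matrix is rank-deficient) that checks the identity numerically:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(50, 3))
X = np.column_stack([A, A[:, 0] + A[:, 1]])  # 4 columns, but only rank 3

d_plus_1 = X.shape[1]                           # number of columns of X
rank_XtX = np.linalg.matrix_rank(X.T @ X)       # equals rank(X) = 3
nullity = np.sum(np.linalg.svd(X, compute_uv=False) < 1e-10)  # dim(null(X)) = 1
print(d_plus_1 == rank_XtX + nullity)           # True
```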


Problem 2.4

If the columns of X are orthonormal (so X^\top X = I), then the optimal parameter w_i^\ast is given by w_{i}^\ast = \vec{c}_i\cdot \vec{y}, where \vec{c}_i is the i-th column of X.

TRUE or FALSE?

TRUE. The OLS estimator minimizes \| \vec y - X\vec w\|^2. Setting the gradient to zero gives the normal equations 2X^\top (X\vec w - \vec y)=0 \;\;\Longrightarrow\;\; X^\top X\,\vec w = X^\top \vec y.

If the columns of X are orthonormal, then X^\top X=I, hence {\vec w}^*=(X^\top X)^{-1}X^\top \vec y = I^{-1}X^\top \vec y = X^\top \vec y.

Since X^\top X=I is invertible, the solution is unique. Geometrically, the coefficients are the inner products with the orthonormal regressors.
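A short sketch illustrating the claim with a generic matrix of orthonormal columns (obtained via QR on hypothetical data): the least-squares coefficients reduce to inner products of the columns with \vec y.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(50, 3))
Q, _ = np.linalg.qr(A)        # Q has orthonormal columns, so Q.T @ Q = I
y = rng.normal(size=50)

w_lstsq, *_ = np.linalg.lstsq(Q, y, rcond=None)  # OLS solution
w_inner = Q.T @ y                                # column-by-column inner products
print(np.allclose(w_lstsq, w_inner))             # True
```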


For parts (e)-(g), consider the following scenario.

Glinda’s dataset has two features, x^{(1)} and x^{(2)}. She first fits a simple linear regression model H_1(\vec{\alpha}) using only the feature x^{(1)}. By minimizing R_{1}, the mean squared error (MSE) associated with this model, she obtains the optimal parameter vector \vec{\alpha}^{*} = (\alpha_0^{*}, \alpha_1^{*}). She then fits a second simple linear regression model H_2(\vec{\beta}) using only the feature x^{(2)} and, after minimizing R_{2}, the MSE for this model, obtains \vec{\beta}^{*} = (\beta_0^{*}, \beta_1^{*}). Finally, she fits a multiple linear regression model using both features, H_3(\vec{w},\vec{x}) = \begin{bmatrix} 1 & x^{(1)} & x^{(2)} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}, where x^{(1)} and x^{(2)} are the feature values for a single observation. Minimizing R_{3}, the MSE for this model, yields the optimal parameter vector \vec{w}^{*} = (w_0^{*}, w_1^{*}, w_2^{*}).


Problem 2.5

w^*_1=\alpha^*_1 and w^*_2=\beta_1^*.

TRUE or FALSE?

FALSE. Adding a second feature generally changes both coefficients; equality holds only in special cases (for example, when the centered features are orthogonal).


Problem 2.6

R_3(\vec{w}^\ast) = R_1(\vec{\alpha}^\ast) - R_2(\vec{\beta}^\ast)

TRUE or FALSE?

FALSE. There is no such subtraction relation between the optimal MSEs; the right-hand side can even be negative, while the left-hand side is a mean of squares and hence nonnegative. In fact, R_3(\vec{w}^\ast)\le\min\{R_1(\vec{\alpha}^\ast),R_2(\vec{\beta}^\ast)\}, since the two-feature model nests both single-feature models.


Elphaba collects an additional feature x^{(3)} and adds it to Glinda’s dataset. She calculates a multiple linear regression model H_4(\vec{v}) using all 3 features, and minimizes the associated MSE R_4 to obtain an optimal parameter vector \vec{v}^\ast.


Problem 2.7

R_3(\vec{w}^\ast) \geq R_4(\vec{v}^\ast).

TRUE or FALSE?

TRUE. Adding a feature cannot increase the minimized in-sample MSE: setting the new coefficient to zero reproduces every hypothesis available to H_3, so R_4(\vec v^\ast)\le R_3(\vec w^\ast).
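A brief numeric illustration (random, hypothetical data) that the minimized in-sample MSE cannot increase when a feature column is added:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X3 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
y = rng.normal(size=n)

def min_mse(X, y):
    # Minimized in-sample MSE for the linear model with design matrix X.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ w - y) ** 2)

X4 = np.column_stack([X3, rng.normal(size=n)])  # add a third feature column
print(min_mse(X3, y) >= min_mse(X4, y))         # True: the MSE cannot go up
```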



Problem 3

Hypothesis functions:
- H(x) = w_0 + w_1x
- H(x) = w_0
- H(x) = w_0\sin(w_1\,x)
- H(x) = w_0+w_1e^{-x}
- H(x) = w_0+w_1x+w_2x^2

Risk functions:
- R(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}(H(x_i) - y_i)^2
- R(\vec{w}) = \frac{1}{n}\sum_{i=1}^{10}\ln(H(x_{i}) - y_{i})
- R(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n} |H(x_i) - y_i|
- R(\vec{w}) = \max_{1\leq i\leq n} |H(x_i) - y_i|
- R(\vec{w}) = (H(x_{10}) - y_{10})^2


Problem 3.1

[Plot a): Constant model with extreme outlier and maximum loss]

What are H(x) and R(\vec{w})?

H(x)=w_0 (constant).

R(w_0)=\max_i |y_i-w_0| (max loss).

Reason: the fitted line is the midrange level \frac{\min y+\max y}{2}, which minimizes the worst-case deviation.
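A quick numeric illustration, with made-up y values, that the constant minimizing the worst-case deviation is the midrange:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 50.0])  # hypothetical data with one extreme outlier

grid = np.linspace(y.min(), y.max(), 200001)
max_loss = np.max(np.abs(y[:, None] - grid), axis=0)          # worst-case deviation
print(grid[np.argmin(max_loss)], (y.min() + y.max()) / 2)     # both are about 25.5
```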


Problem 3.2

[Plot b): Linear model with outliers and square loss]

What are H(x) and R(\vec{w})?

H(x)=w_0 + w_1x (linear).

R(w_0, w_1)=\frac{1}{n}\sum_{i=1}^n\big(y_i-(w_0 + w_1x_i)\big)^2 (squared loss/OLS).

Reason: line is pulled toward multiple high outliers, characteristic of L₂ sensitivity. Also, if LAD were used, the line would intersect two or more points, which isn’t shown here.


Problem 3.3

[Plot c): Constant model with two outliers and square loss]

What are H(x) and R(\vec{w})?

H(x)=w_0 (constant).

R(w_0)=\frac{1}{n}\sum_{i=1}^n (y_i-w_0)^2.

Reason: horizontal fit at the sample mean (two outliers shift the mean above the median, below the midrange).


Problem 3.4

[Plot d): Quadratic model with outliers]

What are H(x) and R(\vec{w})?

H(x) = w_0 + w_1x + w_2x^2

R(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n} (H(x_i) - y_i)^2 or R(\vec{w}) = \max_{1\le i\le n} |H(x_i) - y_i|.

Reason: The curve is influenced by the three outlier points and does not pass through three of the data points, so the MAE can be ruled out (by Homework 3, Problem 7). That leaves R(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n} (H(x_i) - y_i)^2 or R(\vec{w}) = \max_{1\le i\le n} |H(x_i) - y_i|; either answer is acceptable with a suitable justification.


Problem 3.5

[Plot e): Linear model with outliers and absolute loss]

What are H(x) and R(\vec{w})?

H(x)=w_0 + w_1x (linear).

R(w_0, w_1)=\frac{1}{n}\sum_{i=1}^n |y_i-(w_0 + w_1x_i)| (LAD / L₁ loss).

Reason: line tracks the trend with reduced influence from large outliers above. It also intersects two or more points, which is another clue.


Problem 3.6

[Plot f): Constant model with five outliers and absolute loss]

What are H(x) and R(\vec{w})?

H(x) = w_0

R(w_0) = \frac{1}{n}\sum_{i=1}^{n} |H(x_i) - y_i|

The line is flat, so it must be a constant model, and it passes through the median rather than the mean or the midrange.
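A short sketch (hypothetical values) showing that the constant minimizing mean absolute error sits at the median rather than the mean:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 40.0, 45.0])  # made-up data with outliers

grid = np.linspace(y.min(), y.max(), 200001)
mae = np.mean(np.abs(y[:, None] - grid), axis=0)
print(grid[np.argmin(mae)], np.median(y), np.mean(y))  # argmin ≈ median, not the mean
```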


