Summer Session 1 2025 Midterm Exam



Instructor(s): Sawyer Robinson

This exam was take-home.


Problem 1

Source: Summer Session 1 2025 Midterm, Problem 1

Let \{(x_i,y_i)\}_{i=1}^n be a dataset of scalar input-output pairs, and consider the simple linear regression model

f(a, b;\, x) = ax + b,\qquad a, b\in\mathbb{R}.

Let \gamma > 0 be a fixed constant which is understood to be separate from the training data and the weights. Define the \gamma-risk according to the formula

R_{\gamma}(a, b) = \gamma a^2 + \frac{1}{n}\sum_{i=1}^{n} (y_i - (ax_i + b))^2.

Find closed-form expressions for the global minimizers a^\ast, b^\ast of the \gamma-risk for the training data \{(x_i,y_i)\}_{i=1}^n. In your solution, you should clearly label and explain each step.

Solution

Step 1: Compute Partial Derivatives

We compute the partial derivatives of R_\gamma(a, b) with respect to both parameters.

Partial derivative with respect to a:

\begin{align*} \frac{\partial R_\gamma}{\partial a} &= 2\gamma a + \frac{1}{n} \sum_{i=1}^{n} 2(y_i - ax_i - b)(-x_i)\\ &= 2\gamma a - \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - ax_i - b) \end{align*}

Partial derivative with respect to b:

\begin{align*} \frac{\partial R_\gamma}{\partial b} &= \frac{1}{n} \sum_{i=1}^{n} 2(y_i - ax_i - b)(-1)\\ &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - ax_i - b) \end{align*}

Step 2: Set Partial Derivatives to Zero (Normal Equations)

From \frac{\partial R_\gamma}{\partial b} = 0:

\begin{align*} -\frac{2}{n} \sum_{i=1}^{n} (y_i - ax_i - b) &= 0\\ \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i - nb &= 0\\ nb &= \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i\\ b &= \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{a}{n}\sum_{i=1}^{n} x_i\\ b &= \bar{y} - a\bar{x} \end{align*}

where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.

From \frac{\partial R_\gamma}{\partial a} = 0:

\begin{align*} 2\gamma a - \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - ax_i - b) &= 0\\ \gamma a - \frac{1}{n} \sum_{i=1}^{n} x_i y_i + \frac{a}{n}\sum_{i=1}^{n} x_i^2 + \frac{b}{n}\sum_{i=1}^{n} x_i &= 0\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i \end{align*}

Step 3: Solve for a^* by Substituting b = \bar{y} - a\bar{x}

Substituting b = \bar{y} - a\bar{x} into the equation above:

\begin{align*} n\gamma a + a\sum_{i=1}^{n} x_i^2 + (\bar{y} - a\bar{x})\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 + \bar{y} \cdot n\bar{x} - a\bar{x} \cdot n\bar{x} &= \sum_{i=1}^{n} x_i y_i\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 - an\bar{x}^2 &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\\ a\left(n\gamma + \sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \end{align*}

Note that \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 and \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

Therefore:

a^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\gamma}

And:

b^* = \bar{y} - a^*\bar{x}

Step 4: Verify that (a^*, b^*) is a Minimizer (Second Derivative Test)

To confirm that the critical point is a minimizer, we compute the Hessian matrix and verify that it is positive definite.

What is the Hessian matrix?

The Hessian matrix H is a square matrix containing all second-order partial derivatives of a function. For a function R_\gamma(a, b) with two variables, the Hessian is:

H = \begin{pmatrix} \frac{\partial^2 R_\gamma}{\partial a^2} & \frac{\partial^2 R_\gamma}{\partial a \partial b}\\ \frac{\partial^2 R_\gamma}{\partial b \partial a} & \frac{\partial^2 R_\gamma}{\partial b^2} \end{pmatrix}

How does the Hessian determine convexity and minima?

  • If H is positive definite everywhere, then R_\gamma is strictly convex, which means any critical point is a global minimum.
  • For a 2 \times 2 matrix, H is positive definite if and only if:
    1. H_{11} > 0 (the top-left entry is positive), and
    2. \det(H) > 0 (the determinant is positive)

Second partial derivatives:

\begin{align*} \frac{\partial^2 R_\gamma}{\partial a^2} &= 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2\\ \frac{\partial^2 R_\gamma}{\partial b^2} &= \frac{2}{n}\sum_{i=1}^{n} 1 = 2\\ \frac{\partial^2 R_\gamma}{\partial a \partial b} &= \frac{2}{n}\sum_{i=1}^{n} x_i = 2\bar{x} \end{align*}

The Hessian matrix is:

H = \begin{pmatrix} 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2 & 2\bar{x}\\ 2\bar{x} & 2 \end{pmatrix}

For the Hessian to be positive definite, we need:

  1. H_{11} = 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2 > 0. This is true since \gamma > 0 and all terms are non-negative.

  2. \det(H) > 0:

\begin{align*} \det(H) &= 2\left(2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2\right) - 4\bar{x}^2\\ &= 4\gamma + \frac{4}{n}\sum_{i=1}^{n} x_i^2 - 4\bar{x}^2\\ &= 4\gamma + \frac{4}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\\ &> 0 \end{align*}

since \gamma > 0.

Therefore, the Hessian is positive definite, confirming that (a^*, b^*) is indeed a global minimizer of R_\gamma(a, b).
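
As a quick numerical sanity check, here is a minimal sketch (using made-up values of x_1, \dotsc, x_n and \gamma, not taken from any dataset in this exam) that builds the Hessian above in NumPy and confirms its eigenvalues are strictly positive.

```python
import numpy as np

# Hypothetical sample inputs and regularization constant (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gamma = 0.5
n = len(x)

# Hessian of the gamma-risk, exactly as derived above.
H = np.array([
    [2 * gamma + (2 / n) * np.sum(x ** 2), 2 * x.mean()],
    [2 * x.mean(), 2.0],
])

# Positive definite <=> all eigenvalues are strictly positive.
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)
assert np.all(eigenvalues > 0)
```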

Final Answer

\boxed{a^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\gamma}}

\boxed{b^* = \bar{y} - a^*\bar{x}}

where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.
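
To double-check the boxed formulas, one option is to compare them against a direct numerical minimization of R_\gamma. The sketch below uses hypothetical synthetic data and an arbitrary \gamma; SciPy's general-purpose minimizer stands in for the calculus above.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical training data and gamma (for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 3 * x + 1 + rng.normal(scale=0.5, size=20)
gamma, n = 0.1, len(x)

# Closed-form minimizers derived above.
a_star = np.sum((x - x.mean()) * (y - y.mean())) / (np.sum((x - x.mean()) ** 2) + n * gamma)
b_star = y.mean() - a_star * x.mean()

# Numerical minimization of the gamma-risk for comparison.
def gamma_risk(params):
    a, b = params
    return gamma * a ** 2 + np.mean((y - (a * x + b)) ** 2)

result = minimize(gamma_risk, x0=[0.0, 0.0])
print(a_star, b_star)   # closed form
print(result.x)         # should agree to several decimal places
```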


Problem 2

Source: Summer Session 1 2025 Midterm, Problem 2a-e

Consider a dataset \{(\vec{x}_i, y_i)\}_{i=1}^{n} where each \vec{x}_i \in \mathbb{R}^{d} and y_i \in \mathbb{R} for which you decide to fit a multiple linear regression model:

f_1(\vec{w}, b;\, \vec{x}) = \vec{w}^\top \vec{x} + b,\qquad\vec{w}\in\mathbb{R}^d,\;b\in\mathbb{R}.

After minimizing the MSE, the resulting model has an optimal empirical risk value denoted R_1.

Due to fairness constraints related to the nature of the input features, your boss informs you that the last two weights must be the same: \vec{w}^{(d-1)} =\vec{w}^{(d)}. Your colleague suggests a simple fix by removing the last two weights and features:

f_2(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-2)}\vec{x}^{(d-2)} + b.

After training, the resulting model has an optimal empirical risk value denoted R_2. On the other hand, you propose the approach of grouping the last two features and using the model formula

f_3(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-1)}\left(\vec{x}^{(d-1)} + \vec{x}^{(d)}\right) + b.

After training, the final model has an optimal empirical risk value denoted R_3.


Problem 2.1

Carefully apply Theorem 2.3.2 (“Optimal Model Parameters for Multiple Linear Regression”) to find an expression for the optimal parameters b^\ast, \vec{w}^\ast which minimize the mean squared error for the model f_2 and the training data \{(\vec{x}_i, y_i)\}_{i=1}^{n}. Your answer may contain the design matrix \mathbf{Z}, or any suitably modified version, as needed.

Solution

We begin by rewriting the model f_2 more explicitly. The model f_2 has d-2 weight parameters (excluding the intercept b): f_2(\vec{w}, b; \vec{x}) = \sum_{j=1}^{d-2} \vec{w}^{(j)} x^{(j)} + b, \quad \vec{w} \in \mathbb{R}^{d-2}, \, b \in \mathbb{R}.

To apply Theorem 2.3.2, we need to construct the appropriate design matrix and parameter vector.

Step 1: Define the modified design matrix

Let \mathbf{Z}_2 \in \mathbb{R}^{n \times (d-1)} be the modified design matrix where each row corresponds to a training example with only the first d-2 features plus a column of ones for the intercept: \mathbf{Z}_2 = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(d-2)} \\ 1 & x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(d-2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(d-2)} \end{bmatrix} \in \mathbb{R}^{n \times (d-1)}.

Step 2: Define the parameter vector and target vector

Let \vec{\theta} \in \mathbb{R}^{d-1} be the combined parameter vector: \vec{\theta} = \begin{bmatrix} b \\ \vec{w}^{(1)} \\ \vec{w}^{(2)} \\ \vdots \\ \vec{w}^{(d-2)} \end{bmatrix}.

Let \mathbf{Y} \in \mathbb{R}^n be the target vector: \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

Step 3: Express the MSE

The mean squared error can be written as: \text{MSE}(\vec{\theta}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \mathbf{Z}_{2,i}^{\top} \vec{\theta} \right)^2 = \frac{1}{n} \|\mathbf{Y} - \mathbf{Z}_2 \vec{\theta}\|_2^2.

Step 4: Apply Theorem 2.3.2

To minimize the MSE, we take the gradient with respect to \vec{\theta} and set it equal to zero: \frac{\partial \text{MSE}}{\partial \vec{\theta}} = -\frac{2}{n} \mathbf{Z}_2^{\top} (\mathbf{Y} - \mathbf{Z}_2 \vec{\theta}) = 0.

This simplifies to the normal equation: \mathbf{Z}_2^{\top} \mathbf{Z}_2 \vec{\theta} = \mathbf{Z}_2^{\top} \mathbf{Y}.

Assuming \mathbf{Z}_2^{\top} \mathbf{Z}_2 is invertible (which holds when \mathbf{Z}_2 has full column rank), the unique minimizer is: \vec{\theta}^* = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}.

Step 5: Extract optimal parameters

The optimal parameters are obtained by decomposing \vec{\theta}^*: \begin{bmatrix} b^* \\ \vec{w}^* \end{bmatrix} = \vec{\theta}^* = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}, where b^* is the first component and \vec{w}^* \in \mathbb{R}^{d-2} consists of the remaining components.

Final Answer: \boxed{\begin{bmatrix} b^* \\ \vec{w}^* \end{bmatrix} = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}} where \mathbf{Z}_2 \in \mathbb{R}^{n \times (d-1)} is the design matrix containing a column of ones followed by the first d-2 features of each training example.
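
The boxed expression translates directly into NumPy. The sketch below uses hypothetical random data in place of the real training set and solves the normal equation with a least-squares solver, which is equivalent to the inverse formula whenever \mathbf{Z}_2 has full column rank but is numerically better behaved.

```python
import numpy as np

# Hypothetical data: n examples with d features (placeholders for the real dataset).
rng = np.random.default_rng(1)
n, d = 50, 6
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# Design matrix for f_2: a column of ones followed by the first d-2 features.
Z2 = np.column_stack([np.ones(n), X[:, :d - 2]])      # shape (n, d-1)

# theta* = (Z2^T Z2)^{-1} Z2^T Y, computed via a least-squares solver.
theta_star, *_ = np.linalg.lstsq(Z2, Y, rcond=None)

b_star = theta_star[0]    # intercept
w_star = theta_star[1:]   # weights for the first d-2 features
```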


Problem 2.2

Using the comparison operators \{ =, \leq, \geq, <, >\}, rank the optimal risk values R_1, R_2, R_3 from least to greatest. Justify your answer.

Solution

Answer: R_1 \leq R_3 \leq R_2

Justification:

To compare these three models, we need to analyze their flexibility and representational capacity.

Comparing R_1 and R_3:

Model f_1 is the most general model with d independent weight parameters plus an intercept, giving it (d+1) total parameters.

Model f_3 can be rewritten as: f_3(\vec{w}, b; \vec{x}) = \vec{w}^{(1)}x^{(1)} + \ldots + \vec{w}^{(d-2)}x^{(d-2)} + \vec{w}^{(d-1)}x^{(d-1)} + \vec{w}^{(d-1)}x^{(d)} + b

This is equivalent to f_1 with the constraint that w^{(d-1)} = w^{(d)}. In other words, f_3 is a constrained version of f_1.

Since f_1 includes all possible models that f_3 can represent (by setting w^{(d-1)} = w^{(d)} in f_1), the minimum achievable MSE for f_1 must be at least as good as (or better than) that of f_3. Therefore:

R_1 \leq R_3

Comparing R_3 and R_2:

Model f_2 completely removes the last two features from the model, using only features x^{(1)}, \ldots, x^{(d-2)}.

Model f_3 uses all d features but groups the last two with a shared coefficient \vec{w}^{(d-1)}.

We can show that f_3 is at least as flexible as f_2 by noting that f_2 is a special case of f_3. Specifically, if we set \vec{w}^{(d-1)} = 0 in model f_3, we get:

f_3(\vec{w}, b; \vec{x}) = \vec{w}^{(1)}x^{(1)} + \ldots + \vec{w}^{(d-2)}x^{(d-2)} + 0 \cdot (x^{(d-1)} + x^{(d)}) + b = f_2(\vec{w}, b; \vec{x})

Since f_3 can represent any model that f_2 can represent (plus additional models where \vec{w}^{(d-1)} \neq 0), the minimum achievable MSE for f_3 must be at least as good as that of f_2. Therefore:

R_3 \leq R_2

Final Ranking:

Combining these results, we have:

R_1 \leq R_3 \leq R_2

This ranking makes intuitive sense: f_1 is the most flexible model with the most parameters, allowing it to fit the training data best (lowest MSE). Model f_3 is moderately flexible, incorporating information from all features but with a constraint on the last two weights. Model f_2 is the least flexible, as it discards potentially useful information by completely removing the last two features.
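
As an informal check of this ranking, the sketch below fits all three models on hypothetical synthetic data and compares their training MSEs. The particular numbers are meaningless, but the ordering R_1 \leq R_3 \leq R_2 should hold for any dataset.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

def fit_mse(Z, y):
    """Least-squares fit of y on Z (Z already includes an intercept column); return training MSE."""
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.mean((y - Z @ theta) ** 2)

ones = np.ones((n, 1))
Z1 = np.hstack([ones, X])                                     # f_1: all d features
Z2 = np.hstack([ones, X[:, :d - 2]])                          # f_2: drop the last two features
Z3 = np.hstack([ones, X[:, :d - 2],
                (X[:, d - 2] + X[:, d - 1]).reshape(-1, 1)])  # f_3: last two features share a weight

R1, R2, R3 = fit_mse(Z1, y), fit_mse(Z2, y), fit_mse(Z3, y)
print(R1, R3, R2)
assert R1 <= R3 + 1e-12 and R3 <= R2 + 1e-12
```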


Returning to the original model f_1, suppose you were asked instead to eliminate the intercept term, leading to the model formula

f_4(\vec{w};\, \vec{x}) = \vec{w}^\top \vec{x}.

Once again, you train this model by minimizing the associated mean squared error and obtain an optimal MSE denoted R_4.


Problem 2.3

Explain why R_1 \leq R_4.

Solution

Model f_1(\vec{w}, b; \vec{x}) = \vec{w}^\top \vec{x} + b includes an intercept term b, while model f_4(\vec{w}; \vec{x}) = \vec{w}^\top \vec{x} does not have an intercept.

This means f_1 is a more flexible model with one additional parameter compared to f_4. The intercept/bias term allows the model to shift all predictions up or down, enabling it to better match the target values.

Importantly, f_1 can always replicate the behavior of f_4 by simply setting b = 0. Therefore, f_1 can do at least as well as f_4, and possibly better if a non-zero intercept improves the fit.

Since R_1 represents the optimal (minimized) mean squared error for model f_1 and R_4 represents the optimal mean squared error for model f_4, we have:

R_1 \leq R_4

That is, the optimal MSE achieved by f_1 can be no larger than the optimal MSE achieved by f_4.


Problem 2.4

Assume the following centering conditions hold:

\sum_{i=1}^{n} \vec{x}_i^{(j)} = 0\text{ for each }1\leq j\leq d,\text{ and }\sum_{i=1}^n y_i = 0.

Prove R_1 = R_4.

Solution

We need to show that under the centering conditions, the optimal risk for model f_1 (with intercept) equals the optimal risk for model f_4 (without intercept).

Step 1: Express the Mean Squared Error for Model f_1

For model f_1, the mean squared error is: \text{MSE}_1(\vec{w}, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2

Step 2: Find the Optimal Intercept b^*

To minimize the MSE with respect to b, we take the partial derivative: \frac{\partial \text{MSE}_1}{\partial b} = \frac{\partial}{\partial b} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2 \right)

= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)

Setting this equal to zero: -\frac{2}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b) = 0

\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \vec{w}^{\top} \vec{x}_i - nb = 0

Step 3: Apply the Centering Conditions

By the target centering condition, \sum_{i=1}^{n} y_i = 0. For the middle term, linearity gives \sum_{i=1}^{n} \vec{w}^{\top} \vec{x}_i = \vec{w}^{\top} \sum_{i=1}^{n} \vec{x}_i, and since \sum_{i=1}^{n} \vec{x}_i^{(j)} = 0 for each feature j, we have \sum_{i=1}^{n} \vec{x}_i = \vec{0}. Hence \vec{w}^{\top} \sum_{i=1}^{n} \vec{x}_i = 0 for every choice of \vec{w}.

The equation from Step 2 therefore reduces to 0 - 0 - nb = 0, which gives

b^* = 0

Step 4: Conclude R_1 = R_4

Since the optimal intercept is b^* = 0 for every choice of \vec{w} under the centering conditions, minimizing over (\vec{w}, b) is the same as minimizing over \vec{w} with b fixed at 0, which is exactly the optimization problem for model f_4. The two problems therefore attain the same optimal value:

R_1 = \min_{\vec{w}, b} \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2 = \min_{\vec{w}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i)^2 = R_4

Therefore, R_1 = R_4. ∎
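
The proof can also be checked numerically. The sketch below uses hypothetical data, centered so that the conditions of the problem hold, fits f_1 and f_4 by least squares, and confirms b^* \approx 0 and R_1 \approx R_4 up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 80, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

# Enforce the centering conditions from the problem statement.
Xc = X - X.mean(axis=0)   # each feature column sums to zero
yc = y - y.mean()         # targets sum to zero

# f_1: with intercept.
Z1 = np.hstack([np.ones((n, 1)), Xc])
theta1, *_ = np.linalg.lstsq(Z1, yc, rcond=None)
R1 = np.mean((yc - Z1 @ theta1) ** 2)

# f_4: no intercept.
w4, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
R4 = np.mean((yc - Xc @ w4) ** 2)

print(theta1[0])   # optimal intercept, approximately 0
print(R1, R4)      # approximately equal
```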


Problem 2.5

Use the setting of d=1 (a.k.a. simple linear regression) to draw a sketch which illustrates why the result in Problem 2.4 makes sense geometrically.

Solution

Geometric Interpretation

When d = 1, we have simple linear regression with a single feature:

  • Model f_1: y = ax + b (line with intercept)
  • Model f_4: y = ax (line through the origin)

The centering conditions become:

  • \sum_{i=1}^n x_i = 0 (features are centered)
  • \sum_{i=1}^n y_i = 0 (targets are centered)

This means the data has mean (\bar{x}, \bar{y}) = (0, 0), so the data cloud is centered at the origin.

Key Insight

When fitting a line y = ax + b to data centered at the origin using least squares, the optimal intercept is:

b^* = \bar{y} - a^* \bar{x} = 0 - a^* \cdot 0 = 0

This means the best-fit line for f_1 automatically passes through the origin, making it identical to the best-fit line for f_4.

Sketch

            y
            |    /
            | • /
            |  / •
            |•/
            |/
    --------/------------ x
           /|  •
          / |
      •  /  |
    •   /   |
            |

Explanation of sketch:

  • The data points (•) are scattered around the origin (0, 0) because they are centered.
  • Both f_1 and f_4 fit the same diagonal line through the origin.
  • For f_1: the optimal intercept is b^* = 0 because the centroid of the data is at (0, 0), and the least squares line always passes through the centroid.
  • For f_4: the model is constrained to pass through the origin.
  • Since both models produce the same fitted line, they achieve the same minimum MSE: R_1 = R_4.

Why This Makes Sense

The centering conditions ensure that the “natural” best-fit line for f_1 passes through the origin, eliminating the advantage that f_1 normally has over f_4 due to the flexibility of choosing the intercept. Therefore, both models perform equally well on centered data.
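
If you would rather produce the sketch programmatically, a minimal matplotlib sketch is below, using hypothetical centered data. It overlays the least-squares line with an intercept (f_1) and the least-squares line through the origin (f_4); the two coincide, and the fitted intercept prints as essentially zero.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.normal(size=15)
y = 2 * x + 0.4 * rng.normal(size=15)
x, y = x - x.mean(), y - y.mean()   # center the data

# Best fit with intercept (f_1) and through the origin (f_4).
a1, b1 = np.polyfit(x, y, 1)
a4 = np.sum(x * y) / np.sum(x ** 2)

grid = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y, label="centered data")
plt.plot(grid, a1 * grid + b1, label=f"f1: intercept = {b1:.2e}")
plt.plot(grid, a4 * grid, "--", label="f4: through origin")
plt.axhline(0, color="gray", lw=0.5)
plt.axvline(0, color="gray", lw=0.5)
plt.legend()
plt.show()
```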



Problem 3

Source: Summer Session 1 2025 Midterm, Problem 3a-c

Let \{x_i\}_{i=1}^n be a training dataset of scalar values, and suppose we wish to use the constant model

f(c;\, x) = c,\qquad c\in\mathbb{R}.

In various situations it can be useful to emphasize some training examples over others (e.g., due to data quality). For this purpose, suppose \alpha_1, \alpha_2, \dotsc, \alpha_n > 0 are fixed positive weights which are understood as separate from the training data and model parameters.


Problem 3.1

Find a formula for the minimizer c_1^\ast of the risk function

R_{1}(c) = \frac{1}{n} \sum_{i=1}^n \alpha_i(c - x_i)^2.


Problem 3.2

Find a formula for the minimizer c_2^\ast of the risk function

R_{2}(c) = \frac{1}{n} \sum_{i=1}^n \alpha_i |c - x_i|.


Problem 3.3

Which risk function is more sensitive to outliers?



Problem 4

Source: Summer Session 1 2025 Midterm, Problem 4a-d

An automotive research team wants to build a predictive model that simultaneously estimates two performance metrics of passenger cars:

  1. City fuel consumption (in \text{L}/100\text{ km}),
  2. Highway fuel consumption (in \text{L}/100\text{ km}).

To capture mechanical and aerodynamic factors, the engineers record the following four features for each vehicle (all measured on the current model year):

  1. Engine displacement (L)
  2. Vehicle mass (kg)
  3. Peak horsepower
  4. Drag coefficient

They propose the general linear model

f(\mathbf W,\vec b;\,\vec x) = \mathbf W\,\vec x+\vec b,\qquad\mathbf{W}\in\mathbb{R}^{2\times 4},\; \vec{b}\in\mathbb{R}^2,

where \vec x\in\mathbb{R}^{4} denotes the feature vector for a given car. Data for eight different cars are listed below.

Feature           \vec{x}_1  \vec{x}_2  \vec{x}_3  \vec{x}_4  \vec{x}_5  \vec{x}_6  \vec{x}_7  \vec{x}_8
Engine disp. (L)     2.0        2.5        3.0        1.8        3.5        2.2        2.8        1.6
Mass (kg)           1300       1450       1600       1250       1700       1350       1500       1200
Horsepower           140        165        200        130        250        155        190        115
Drag coeff.          0.28       0.30       0.32       0.27       0.33       0.29       0.31       0.26
City L/100km         8.5        9.2       10.8        7.8       11.5        8.9        9.8        7.2
HWY L/100km          6.0        6.5        7.5        5.8        8.0        6.2        6.9        5.4


Problem 4.1

Write down the design matrix \mathbf Z and the target matrix \mathbf Y from the data in the table.


Problem 4.2

Compute the weight matrix \mathbf W^{\ast} and bias vector \vec b^{\ast} that minimize the MSE for the given dataset. You can use Python for the computations where needed. You do not need to submit your code, but you do need to write down all matrices and vectors relevant to your computations (round your answers to three decimal places).


Problem 4.3

With (\mathbf W^{\ast},\vec b^{\ast}), predict the two fuel consumption values for each of the eight cars and report the overall MSE between the predictions and the true targets.


Problem 4.4

The team now measures two additional vehicles:

Feature           \vec{x}_9  \vec{x}_{10}
Engine disp. (L)     2.4        1.5
Mass (kg)           1400       1150
Horsepower           170        110
Drag coeff.          0.29       0.25
City L/100km         9.0        7.0
HWY L/100km          6.3        5.2

Use your trained model to predict fuel consumption for these two cars. Compute the mean-squared error for the two testing examples and state whether you would recommend the model to the engineers.
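
One possible way to set up these computations in Python is sketched below; it is not an official solution, just one reasonable arrangement of the given data. It builds \mathbf{Z} and \mathbf{Y} from the table, solves the least-squares problem for the stacked parameters (intercept row plus \mathbf{W}^\top, both targets at once), and reports training and test MSEs, here taken as the mean over all prediction errors for both targets.

```python
import numpy as np

# Training data from the table (one row per car x1..x8).
X_train = np.array([
    [2.0, 2.5, 3.0, 1.8, 3.5, 2.2, 2.8, 1.6],           # engine displacement (L)
    [1300, 1450, 1600, 1250, 1700, 1350, 1500, 1200],    # mass (kg)
    [140, 165, 200, 130, 250, 155, 190, 115],            # horsepower
    [0.28, 0.30, 0.32, 0.27, 0.33, 0.29, 0.31, 0.26],    # drag coefficient
]).T                                                      # shape (8, 4)
Y_train = np.array([
    [8.5, 9.2, 10.8, 7.8, 11.5, 8.9, 9.8, 7.2],           # city L/100km
    [6.0, 6.5, 7.5, 5.8, 8.0, 6.2, 6.9, 5.4],             # highway L/100km
]).T                                                      # shape (8, 2)

# Design matrix with an intercept column; solve for [b; W^T] for both targets jointly.
Z = np.hstack([np.ones((8, 1)), X_train])                 # shape (8, 5)
Theta, *_ = np.linalg.lstsq(Z, Y_train, rcond=None)       # shape (5, 2): first row is b*, rest is W*^T
b_star, W_star = Theta[0], Theta[1:].T

# Training predictions and overall MSE (Problem 4.3).
train_pred = Z @ Theta
train_mse = np.mean((Y_train - train_pred) ** 2)

# Test cars x9 and x10 (Problem 4.4).
X_test = np.array([[2.4, 1400, 170, 0.29],
                   [1.5, 1150, 110, 0.25]])
Y_test = np.array([[9.0, 6.3],
                   [7.0, 5.2]])
test_pred = np.hstack([np.ones((2, 1)), X_test]) @ Theta
test_mse = np.mean((Y_test - test_pred) ** 2)

print(np.round(W_star, 3), np.round(b_star, 3))
print(round(train_mse, 3), round(test_mse, 3))
```

Whether the MSE should be averaged over both targets jointly (as here) or reported separately per target is a convention choice; adjust the final averaging step accordingly.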



Problem 5

Source: Summer Session 1 2025 Midterm, Problem 5a-c

Let \{(x_i,y_i)\}_{i=1}^n be a dataset of scalar input-output pairs.


Problem 5.1

Suppose we model y using a simple linear regression model of the form

f(\vec{\theta};\, x) = \vec{\theta}^{(0)} + \vec{\theta}^{(1)}x, \qquad\vec{\theta}\in\mathbb{R}^2.

Prove that the line of best fit (with respect to MSE) passes through the point (\overline{x}, \overline{y}).


Problem 5.2

Suppose we model y using a simple polynomial regression model of the form

f(\vec{\theta};\, x) = \vec{\theta}^{(0)} + \vec{\theta}^{(1)}x+ \vec{\theta}^{(2)}x^2, \qquad\vec{\theta}\in\mathbb{R}^3.

Prove that the curve of best fit (with respect to MSE) passes through the point (\overline{x}, \overline{y} + \vec\theta^{\ast(2)}((\overline{x})^2 - \overline{x^2})), where

\overline{x^2} = \frac{1}{n}\sum_{i=1}^n x_i^2.


Problem 5.3

Using the same model as in Problem 5.2, suppose we minimize MSE and find optimal parameters \vec{\theta}^\ast. Further suppose we apply a shifting and scaling operation to the training targets, defining

\widetilde{y_i} = \alpha(y_i - \beta),\qquad\alpha,\beta\in\mathbb{R}.

Find formulas for the new optimal parameters, denoted \vec{\widetilde{\theta}}^\ast, in terms of the old parameters and \alpha, \beta.


