Summer Session 2025 Midterm Exam



Instructor(s): Sawyer Robinson

This exam was take-home.


Problem 1

Source: Summer Session 1 2025 Midterm, Problem 1

Let \{(x_i,y_i)\}_{i=1}^n be a dataset of scalar input-output pairs, and consider the simple linear regression model

f(a, b;\, x) = ax + b,\qquad a, b\in\mathbb{R}.

Let \gamma > 0 be a fixed constant which is understood to be separate from the training data and the weights. Define the \gamma-risk according to the formula

R_{\gamma}(a, b) = \gamma a^2 + \frac{1}{n}\sum_{i=1}^{n} (y_i - (ax_i + b))^2.

Find closed-form expressions for the global minimizers a^\ast, b^\ast of the \gamma-risk for the training data \{(x_i,y_i)\}_{i=1}^n. In your solution, you should clearly label and explain each step.

Solution

Step 1: Compute Partial Derivatives

We compute the partial derivatives of R_\gamma(a, b) with respect to both parameters.

Partial derivative with respect to a:

\begin{align*} \frac{\partial R_\gamma}{\partial a} &= 2\gamma a + \frac{1}{n} \sum_{i=1}^{n} 2(y_i - ax_i - b)(-x_i)\\ &= 2\gamma a - \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - ax_i - b) \end{align*}

Partial derivative with respect to b:

\begin{align*} \frac{\partial R_\gamma}{\partial b} &= \frac{1}{n} \sum_{i=1}^{n} 2(y_i - ax_i - b)(-1)\\ &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - ax_i - b) \end{align*}

Step 2: Set Partial Derivatives to Zero (Normal Equations)

From \frac{\partial R_\gamma}{\partial b} = 0:

\begin{align*} -\frac{2}{n} \sum_{i=1}^{n} (y_i - ax_i - b) &= 0\\ \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i - nb &= 0\\ nb &= \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i\\ b &= \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{a}{n}\sum_{i=1}^{n} x_i\\ b &= \bar{y} - a\bar{x} \end{align*}

where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.

From \frac{\partial R_\gamma}{\partial a} = 0:

\begin{align*} 2\gamma a - \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - ax_i - b) &= 0\\ \gamma a - \frac{1}{n} \sum_{i=1}^{n} x_i y_i + \frac{a}{n}\sum_{i=1}^{n} x_i^2 + \frac{b}{n}\sum_{i=1}^{n} x_i &= 0\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i \end{align*}

Step 3: Solve for a^* by Substituting b = \bar{y} - a\bar{x}

Substituting b = \bar{y} - a\bar{x} into the equation above:

\begin{align*} n\gamma a + a\sum_{i=1}^{n} x_i^2 + (\bar{y} - a\bar{x})\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 + \bar{y} \cdot n\bar{x} - a\bar{x} \cdot n\bar{x} &= \sum_{i=1}^{n} x_i y_i\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 - an\bar{x}^2 &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\\ a\left(n\gamma + \sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \end{align*}

Note that \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 and \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

Therefore:

a^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\gamma}

And:

b^* = \bar{y} - a^*\bar{x}

Step 4: Verify that (a^*, b^*) is a Minimizer (Second Derivative Test)

To confirm that the critical point is a minimizer, we compute the Hessian matrix and verify that it is positive definite.

What is the Hessian matrix?

The Hessian matrix H is a square matrix containing all second-order partial derivatives of a function. For a function R_\gamma(a, b) with two variables, the Hessian is:

H = \begin{pmatrix} \frac{\partial^2 R_\gamma}{\partial a^2} & \frac{\partial^2 R_\gamma}{\partial a \partial b}\\ \frac{\partial^2 R_\gamma}{\partial b \partial a} & \frac{\partial^2 R_\gamma}{\partial b^2} \end{pmatrix}

How does the Hessian determine convexity and minima?

  • If H is positive definite everywhere, then R_\gamma is strictly convex, which means any critical point is a global minimum.
  • For a 2 \times 2 matrix, H is positive definite if and only if:
    1. H_{11} > 0 (the top-left entry is positive), and
    2. \det(H) > 0 (the determinant is positive)

Second partial derivatives:

\begin{align*} \frac{\partial^2 R_\gamma}{\partial a^2} &= 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2\\ \frac{\partial^2 R_\gamma}{\partial b^2} &= \frac{2}{n}\sum_{i=1}^{n} 1 = 2\\ \frac{\partial^2 R_\gamma}{\partial a \partial b} &= \frac{2}{n}\sum_{i=1}^{n} x_i = 2\bar{x} \end{align*}

The Hessian matrix is:

H = \begin{pmatrix} 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2 & 2\bar{x}\\ 2\bar{x} & 2 \end{pmatrix}

For the Hessian to be positive definite, we need:

  1. H_{11} = 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2 > 0. This is true since \gamma > 0 and all terms are non-negative.

  2. \det(H) > 0:

\begin{align*} \det(H) &= 2\left(2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2\right) - 4\bar{x}^2\\ &= 4\gamma + \frac{4}{n}\sum_{i=1}^{n} x_i^2 - 4\bar{x}^2\\ &= 4\gamma + \frac{4}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\\ &> 0 \end{align*}

since \gamma > 0.

Therefore, the Hessian is positive definite, confirming that (a^*, b^*) is indeed a global minimizer of R_\gamma(a, b).

Final Answer

\boxed{a^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\gamma}}

\boxed{b^* = \bar{y} - a^*\bar{x}}

where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.
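
As a quick numerical sanity check (a sketch with made-up data, assuming numpy is available; not part of the required derivation), the closed-form minimizers can be compared against a direct numerical solve of the two first-order conditions:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=20)
    y = 3.0 * x + 1.5 + rng.normal(scale=0.1, size=20)
    gamma, n = 0.5, len(x)

    # Closed-form minimizers derived above.
    a_star = np.sum((x - x.mean()) * (y - y.mean())) / (np.sum((x - x.mean()) ** 2) + n * gamma)
    b_star = y.mean() - a_star * x.mean()

    # Direct solve of the first-order conditions:
    #   (gamma + mean(x^2)) a + mean(x) b = mean(x y)
    #   mean(x) a + b = mean(y)
    A = np.array([[gamma + np.mean(x ** 2), x.mean()],
                  [x.mean(), 1.0]])
    a_num, b_num = np.linalg.solve(A, np.array([np.mean(x * y), y.mean()]))

    print(np.allclose([a_star, b_star], [a_num, b_num]))  # expect True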


Problem 2

Source: Summer Session 1 2025 Midterm, Problem 2a-e

Consider a dataset \{(\vec{x}_i, y_i)\}_{i=1}^{n} where each \vec{x}_i \in \mathbb{R}^{d} and y_i \in \mathbb{R} for which you decide to fit a multiple linear regression model:

f_1(\vec{w}, b;\, \vec{x}) = \vec{w}^\top \vec{x} + b,\qquad\vec{w}\in\mathbb{R}^d,\;b\in\mathbb{R}.

After minimizing the MSE, the resulting model has an optimal empirical risk value denoted R_1.

Due to fairness constraints related to the nature of the input features, your boss informs you that the last two weights must be the same: \vec{w}^{(d-1)} =\vec{w}^{(d)}. Your colleague suggests a simple fix by removing the last two weights and features:

f_2(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-2)}\vec{x}^{(d-2)} + b.

After training, the resulting model has an optimal empirical risk value denoted R_2. On the other hand, you propose the approach of grouping the last two features and using the model formula

f_3(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-1)}\left(\vec{x}^{(d-1)} + \vec{x}^{(d)}\right) + b.

After training, the final model has an optimal empirical risk value denoted R_3.


Problem 2.1

Carefully apply Theorem 2.3.2 (“Optimal Model Parameters for Multiple Linear Regression”) to find an expression for the optimal parameters b^\ast, \vec{w}^\ast which minimize the mean squared error for the model f_2 and the training data \{(\vec{x}_i, y_i)\}_{i=1}^{n}. Your answer may contain the design matrix \mathbf{Z}, or any suitably modified version, as needed.

Solution

We begin by rewriting the model f_2 more explicitly. The model f_2 has d-2 weight parameters (excluding the intercept b): f_2(\vec{w}, b; \vec{x}) = \sum_{j=1}^{d-2} \vec{w}^{(j)} x^{(j)} + b, \quad \vec{w} \in \mathbb{R}^{d-2}, \, b \in \mathbb{R}.

To apply Theorem 2.3.2, we need to construct the appropriate design matrix and parameter vector.

Step 1: Define the modified design matrix

Let \mathbf{Z}_2 \in \mathbb{R}^{n \times (d-1)} be the modified design matrix where each row corresponds to a training example with only the first d-2 features plus a column of ones for the intercept: \mathbf{Z}_2 = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(d-2)} \\ 1 & x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(d-2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(d-2)} \end{bmatrix} \in \mathbb{R}^{n \times (d-1)}.

Step 2: Define the parameter vector and target vector

Let \vec{\theta} \in \mathbb{R}^{d-1} be the combined parameter vector: \vec{\theta} = \begin{bmatrix} b \\ \vec{w}^{(1)} \\ \vec{w}^{(2)} \\ \vdots \\ \vec{w}^{(d-2)} \end{bmatrix}.

Let \mathbf{Y} \in \mathbb{R}^n be the target vector: \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

Step 3: Express the MSE

The mean squared error can be written as: \text{MSE}(\vec{\theta}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \mathbf{Z}_{2,i}^{\top} \vec{\theta} \right)^2 = \frac{1}{n} \|\mathbf{Y} - \mathbf{Z}_2 \vec{\theta}\|_2^2.

Step 4: Apply Theorem 2.3.2

To minimize the MSE, we take the gradient with respect to \vec{\theta} and set it equal to zero: \frac{\partial \text{MSE}}{\partial \vec{\theta}} = -\frac{2}{n} \mathbf{Z}_2^{\top} (\mathbf{Y} - \mathbf{Z}_2 \vec{\theta}) = 0.

This simplifies to the normal equation: \mathbf{Z}_2^{\top} \mathbf{Z}_2 \vec{\theta} = \mathbf{Z}_2^{\top} \mathbf{Y}.

Assuming \mathbf{Z}_2^{\top} \mathbf{Z}_2 is invertible (which holds when \mathbf{Z}_2 has full column rank), the unique minimizer is: \vec{\theta}^* = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}.

Step 5: Extract optimal parameters

The optimal parameters are obtained by decomposing \vec{\theta}^*: \begin{bmatrix} b^* \\ \vec{w}^* \end{bmatrix} = \vec{\theta}^* = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}, where b^* is the first component and \vec{w}^* \in \mathbb{R}^{d-2} consists of the remaining components.

Final Answer: \boxed{\begin{bmatrix} b^* \\ \vec{w}^* \end{bmatrix} = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}} where \mathbf{Z}_2 \in \mathbb{R}^{n \times (d-1)} is the design matrix containing a column of ones followed by the first d-2 features of each training example.
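
A minimal sketch of this computation in Python (assuming numpy; the full feature matrix X of shape (n, d) and the names below are illustrative, not from the course materials):

    import numpy as np

    def fit_f2(X, y):
        """Drop the last two features, prepend an intercept column, and solve
        the normal equations Z2^T Z2 theta = Z2^T y for model f_2."""
        n = X.shape[0]
        Z2 = np.hstack([np.ones((n, 1)), X[:, :-2]])   # shape (n, d-1)
        theta = np.linalg.solve(Z2.T @ Z2, Z2.T @ y)   # assumes Z2 has full column rank
        return theta[0], theta[1:]                     # b_star, w_star

    # Example usage with made-up data.
    rng = np.random.default_rng(1)
    X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
    b_star, w_star = fit_f2(X, y)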


Problem 2.2

Using the comparison operators \{ =, \leq, \geq, <, >\}, rank the optimal risk values R_1, R_2, R_3 from least to greatest. Justify your answer.

Solution

Answer: R_1 \leq R_3 \leq R_2

Justification:

To compare these three models, we need to analyze their flexibility and representational capacity.

Comparing R_1 and R_3:

Model f_1 is the most general model with d independent weight parameters plus an intercept, giving it (d+1) total parameters.

Model f_3 can be rewritten as: f_3(\vec{w}, b; \vec{x}) = \vec{w}^{(1)}x^{(1)} + \ldots + \vec{w}^{(d-2)}x^{(d-2)} + \vec{w}^{(d-1)}x^{(d-1)} + \vec{w}^{(d-1)}x^{(d)} + b

This is equivalent to f_1 with the constraint that w^{(d-1)} = w^{(d)}. In other words, f_3 is a constrained version of f_1.

Since f_1 includes all possible models that f_3 can represent (by setting w^{(d-1)} = w^{(d)} in f_1), the minimum achievable MSE for f_1 must be at least as good as (or better than) that of f_3. Therefore:

R_1 \leq R_3

Comparing R_3 and R_2:

Model f_2 completely removes the last two features from the model, using only features x^{(1)}, \ldots, x^{(d-2)}.

Model f_3 uses all d features but groups the last two with a shared coefficient \vec{w}^{(d-1)}.

We can show that f_3 is more flexible than f_2 by noting that f_2 is a special case of f_3. Specifically, if we set \vec{w}^{(d-1)} = 0 in model f_3, we get:

f_3(\vec{w}, b; \vec{x}) = \vec{w}^{(1)}x^{(1)} + \ldots + \vec{w}^{(d-2)}x^{(d-2)} + 0 \cdot (x^{(d-1)} + x^{(d)}) + b = f_2(\vec{w}, b; \vec{x})

Since f_3 can represent any model that f_2 can represent (plus additional models where \vec{w}^{(d-1)} \neq 0), the minimum achievable MSE for f_3 must be at least as good as that of f_2. Therefore:

R_3 \leq R_2

Final Ranking:

Combining these results, we have:

R_1 \leq R_3 \leq R_2

This ranking makes intuitive sense: f_1 is the most flexible model with the most parameters, allowing it to fit the training data best (lowest MSE). Model f_3 is moderately flexible, incorporating information from all features but with a constraint on the last two weights. Model f_2 is the least flexible, as it discards potentially useful information by completely removing the last two features.
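
This nesting argument can also be checked empirically. The sketch below (made-up data; numpy assumed) fits all three models by least squares and verifies the ordering of their training MSEs:

    import numpy as np

    def optimal_mse(Z, y):
        """Optimal training MSE of a least-squares fit on design matrix Z."""
        theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return np.mean((y - Z @ theta) ** 2)

    rng = np.random.default_rng(2)
    n, d = 100, 6
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)
    ones = np.ones((n, 1))

    R1 = optimal_mse(np.hstack([ones, X]), y)                    # f_1: all d features
    R2 = optimal_mse(np.hstack([ones, X[:, :-2]]), y)            # f_2: last two features dropped
    grouped = (X[:, -2] + X[:, -1]).reshape(-1, 1)
    R3 = optimal_mse(np.hstack([ones, X[:, :-2], grouped]), y)   # f_3: last two features grouped
    print(R1 <= R3 <= R2)   # expect True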


Returning to the original model f_1, suppose you were asked instead to eliminate the intercept term, leading to the model formula

f_4(\vec{w};\, \vec{x}) = \vec{w}^\top \vec{x}.

Once again, you train this model by minimizing the associated mean squared error and obtain an optimal MSE denoted R_4.


Problem 2.3

Explain why R_1 \leq R_4.

Solution

Model f_1(\vec{w}, b; \vec{x}) = \vec{w}^\top \vec{x} + b includes an intercept term b, while model f_4(\vec{w}; \vec{x}) = \vec{w}^\top \vec{x} does not have an intercept.

This means f_1 is a more flexible model with one additional parameter compared to f_4. The intercept/bias term allows the model to shift all predictions up or down, enabling it to better match the target values.

Importantly, f_1 can always replicate the behavior of f_4 by simply setting b = 0. Therefore, f_1 can do at least as well as f_4, and possibly better if a non-zero intercept improves the fit.

Since R_1 represents the optimal (minimized) mean squared error for model f_1 and R_4 represents the optimal mean squared error for model f_4, we have:

R_1 \leq R_4

The MSE of f_1 must be less than or equal to the MSE of f_4.


Problem 2.4

Assume the following centering conditions hold:

\sum_{i=1}^{n} \vec{x}_i^{(j)} = 0\text{ for each }1\leq j\leq d,\text{ and }\sum_{i=1}^n y_i = 0.

Prove R_1 = R_4.

Solution

We need to show that under the centering conditions, the optimal risk for model f_1 (with intercept) equals the optimal risk for model f_4 (without intercept).

Step 1: Express the Mean Squared Error for Model f_1

For model f_1, the mean squared error is: \text{MSE}_1(\vec{w}, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2

Step 2: Find the Optimal Intercept b^*

To minimize the MSE with respect to b, we take the partial derivative: \frac{\partial \text{MSE}_1}{\partial b} = \frac{\partial}{\partial b} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2 \right)

= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)

Setting this equal to zero: -\frac{2}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b) = 0

\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \vec{w}^{\top} \vec{x}_i - nb = 0

Step 3: Apply the Centering Conditions

By the centering condition on the targets, \sum_{i=1}^{n} y_i = 0. For the middle term, note that \sum_{i=1}^{n} \vec{w}^{\top} \vec{x}_i = \vec{w}^{\top} \sum_{i=1}^{n} \vec{x}_i.

Since \sum_{i=1}^{n} \vec{x}_i^{(j)} = 0 for each feature j, we have \sum_{i=1}^{n} \vec{x}_i = \vec{0}, so \vec{w}^{\top} \sum_{i=1}^{n} \vec{x}_i = \vec{w}^{\top} \vec{0} = 0.

Therefore the equation from Step 2 reduces to 0 - 0 - nb = 0, which gives b^* = 0 (for any choice of \vec{w}).

Step 4: Conclude R_1 = R_4

Since the optimal intercept is b^* = 0 for every \vec{w} under the centering conditions, the optimal model for f_1 takes the form f_1(\vec{w}^*, b^*; \vec{x}) = \vec{w}^{* \top} \vec{x} + 0 = \vec{w}^{* \top} \vec{x}.

Minimizing over (\vec{w}, b) therefore gives the same value as minimizing over \vec{w} with b fixed at 0, which is exactly the training problem for f_4: R_1 = \min_{\vec{w}, b} \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2 = \min_{\vec{w}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i)^2 = R_4

Therefore, R_1 = R_4. ∎
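
A short numerical illustration of this result (a sketch with made-up data; numpy assumed): after centering the features and the targets, the fitted intercept is numerically zero and the two optimal risks coincide.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 80, 4
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + rng.normal(scale=0.3, size=n)

    # Center features and targets so the stated conditions hold.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    Z1 = np.hstack([np.ones((n, 1)), Xc])              # f_1: with intercept column
    theta1, *_ = np.linalg.lstsq(Z1, yc, rcond=None)
    theta4, *_ = np.linalg.lstsq(Xc, yc, rcond=None)   # f_4: no intercept

    R1 = np.mean((yc - Z1 @ theta1) ** 2)
    R4 = np.mean((yc - Xc @ theta4) ** 2)
    print(abs(theta1[0]) < 1e-10, np.isclose(R1, R4))  # intercept ~ 0, risks equal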


Problem 2.5

Use the setting of d=1 (a.k.a. simple linear regression) to draw a sketch which illustrates why the result in Part (d) makes sense geometrically.

Solution

Geometric Interpretation

When d = 1, we have simple linear regression with a single feature:

  • Model f_1: y = ax + b (line with intercept)
  • Model f_4: y = ax (line through the origin)

The centering conditions become:

  • \sum_{i=1}^n x_i = 0 (features are centered)
  • \sum_{i=1}^n y_i = 0 (targets are centered)

This means the data has mean (\bar{x}, \bar{y}) = (0, 0), so the data cloud is centered at the origin.

Key Insight

When fitting a line y = ax + b to data centered at the origin using least squares, the optimal intercept is:

b^* = \bar{y} - a^* \bar{x} = 0 - a^* \cdot 0 = 0

This means the best-fit line for f_1 automatically passes through the origin, making it identical to the best-fit line for f_4.

Sketch

        y
        |
        |    •
        |  •
        |•     •
    ----•-------•---- x
      • |   •
    •   |
        |

Explanation of sketch:

  • The data points (•) are scattered around the origin (0, 0) because they are centered.
  • Both f_1 and f_4 fit the same best-fit line through the origin, following the upward trend of the points.
  • For f_1: the optimal intercept is b^* = 0 because the centroid of the data is at (0, 0), and the least squares line always passes through the centroid.
  • For f_4: the model is constrained to pass through the origin.
  • Since both models produce the same fitted line, they achieve the same minimum MSE: R_1 = R_4.

Why This Makes Sense

The centering conditions ensure that the “natural” best-fit line for f_1 passes through the origin, eliminating the advantage that f_1 normally has over f_4 due to the flexibility of choosing the intercept. Therefore, both models perform equally well on centered data.



Problem 3

Source: Summer Session 1 2025 Midterm, Problem 3a-c

Let \{x_i\}_{i=1}^n be a training dataset of scalar values, and suppose we wish to use the constant model

f(c;\, x) = c,\qquad c\in\mathbb{R}.

In various situations it can be useful to emphasize some training examples over others (e.g., due to data quality). For this purpose, suppose \alpha_1, \alpha_2, \dotsc, \alpha_n > 0 are fixed positive weights which are understood as separate from the training data and model parameters.


Problem 3.1

Find a formula for the minimizer c_1^\ast of the risk function

R_{1}(c) = \frac{1}{n} \sum_{i=1}^n \alpha_i(c - x_i)^2.

Solution

To find the minimizer c_1^*, we need to find the critical points by taking the derivative of R_1(c) with respect to c and setting it equal to zero.

Step 1: Compute the derivative

\frac{dR_1}{dc} = \frac{d}{dc}\left[\frac{1}{n}\sum_{i=1}^n \alpha_i(c - x_i)^2\right]

= \frac{1}{n}\sum_{i=1}^n \alpha_i \cdot 2(c - x_i)

= \frac{2}{n}\sum_{i=1}^n \alpha_i(c - x_i)

Step 2: Set the derivative equal to zero

\frac{2}{n}\sum_{i=1}^n \alpha_i(c - x_i) = 0

\sum_{i=1}^n \alpha_i(c - x_i) = 0

\sum_{i=1}^n \alpha_i c - \sum_{i=1}^n \alpha_i x_i = 0

c\sum_{i=1}^n \alpha_i = \sum_{i=1}^n \alpha_i x_i

Step 3: Solve for c

c_1^* = \frac{\sum_{i=1}^n \alpha_i x_i}{\sum_{i=1}^n \alpha_i}

Step 4: Verify this is a minimum using the second derivative test

\frac{d^2R_1}{dc^2} = \frac{d}{dc}\left[\frac{2}{n}\sum_{i=1}^n \alpha_i(c - x_i)\right]

= \frac{2}{n}\sum_{i=1}^n \alpha_i

Since \alpha_i > 0 for all i, we have \frac{d^2R_1}{dc^2} > 0, which confirms that c_1^* is indeed a minimizer (the function is convex).

Final Answer:

c_1^* = \frac{\sum_{i=1}^n \alpha_i x_i}{\sum_{i=1}^n \alpha_i}

This is the weighted mean of the data points, where each point x_i is weighted by \alpha_i.
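
In code, this is exactly a weighted average; a one-line sketch (numpy assumed, data made up):

    import numpy as np

    x = np.array([2.0, 5.0, 7.0, 11.0])
    alpha = np.array([1.0, 3.0, 2.0, 0.5])

    c1_star = np.sum(alpha * x) / np.sum(alpha)                 # formula derived above
    print(np.isclose(c1_star, np.average(x, weights=alpha)))    # matches numpy's weighted mean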


Problem 3.2

Find a formula for the minimizer c_2^\ast of the risk function

R_{2}(c) = \frac{1}{n} \sum_{i=1}^n \alpha_i |c - x_i|.

Solution

To find the minimizer c_2^*, we need to analyze the risk function R_2(c) = \frac{1}{n}\sum_{i=1}^n \alpha_i|c - x_i|.

Key observation: The absolute value function |c - x_i| is convex but not differentiable at c = x_i. The sum of weighted absolute values is minimized at the weighted median of the data points.

Finding the derivative (where it exists):

For c \neq x_i for all i, we can write:

\frac{dR_2}{dc} = \frac{1}{n}\sum_{i=1}^n \alpha_i \cdot \frac{d}{dc}|c - x_i|

The derivative of |c - x_i| is:

\frac{d}{dc}|c - x_i| = \begin{cases} +1 & \text{if } c > x_i \\ -1 & \text{if } c < x_i \end{cases}

Therefore:

\frac{dR_2}{dc} = \frac{1}{n}\left(\sum_{i: x_i < c} \alpha_i - \sum_{i: x_i > c} \alpha_i\right)

Setting the derivative to zero:

For a minimum, we want:

\sum_{i: x_i < c} \alpha_i = \sum_{i: x_i > c} \alpha_i

This means the sum of weights for points below c equals the sum of weights for points above c.

Weighted Median Formula:

Without loss of generality, assume the data points are sorted: x_1 \leq x_2 \leq \cdots \leq x_n.

The minimizer c_2^* is the weighted median, defined as the value x_k such that:

\sum_{i=1}^{k-1} \alpha_i \leq \frac{1}{2}\sum_{i=1}^n \alpha_i \quad \text{and} \quad \sum_{i=k+1}^n \alpha_i \leq \frac{1}{2}\sum_{i=1}^n \alpha_i

Equivalently, c_2^* is the smallest value x_k in the sorted dataset for which:

\sum_{i: x_i \leq x_k} \alpha_i \geq \frac{1}{2}\sum_{i=1}^n \alpha_i

Final Answer:

c_2^* = \text{weighted median of } \{x_1, \ldots, x_n\} \text{ with weights } \{\alpha_1, \ldots, \alpha_n\}

More precisely, after sorting the data points, c_2^* = x_k where k is the smallest index satisfying:

\sum_{i=1}^k \alpha_i \geq \frac{1}{2}\sum_{j=1}^n \alpha_j
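
A minimal sketch of this "smallest index whose cumulative weight reaches half the total" rule (numpy assumed; the data are made up):

    import numpy as np

    def weighted_median(x, alpha):
        """Smallest sorted data point whose cumulative weight reaches
        half of the total weight."""
        order = np.argsort(x)
        x_sorted, cum = x[order], np.cumsum(alpha[order])
        k = np.searchsorted(cum, 0.5 * alpha.sum())
        return x_sorted[k]

    x = np.array([1.0, 3.0, 8.0, 10.0])
    alpha = np.array([5.0, 1.0, 1.0, 1.0])
    print(weighted_median(x, alpha))   # the heavy weight on 1.0 makes it the weighted median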


Problem 3.3

Which risk function is more sensitive to outliers?

Solution

R_1 is more sensitive to outliers.

Explanation:

The risk function R_1 uses squared error (c - x_i)^2, while R_2 uses absolute error |c - x_i|.

When an outlier x_i is far from the model prediction c:

  • In R_1, the contribution is proportional to (c - x_i)^2, which grows quadratically with the distance.
  • In R_2, the contribution is proportional to |c - x_i|, which grows linearly with the distance.

For example, if an outlier is at distance d from c:

  • R_1 contributes \alpha_i d^2
  • R_2 contributes \alpha_i d

Since d^2 > d for d > 1, large deviations (outliers) have a disproportionately larger effect on R_1 than on R_2. This quadratic penalty makes R_1 (mean squared error) much more sensitive to outliers compared to R_2 (mean absolute error).

Therefore, R_1 is more sensitive to outliers due to the squaring of errors.
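
A tiny experiment makes this concrete (a sketch with made-up data; numpy assumed): moving one point far away drags the weighted mean (the minimizer of R_1) much more than the weighted median (the minimizer of R_2).

    import numpy as np

    def weighted_mean(x, a):
        return np.sum(a * x) / np.sum(a)

    def weighted_median(x, a):
        order = np.argsort(x)
        cum = np.cumsum(a[order])
        return x[order][np.searchsorted(cum, 0.5 * a.sum())]

    x = np.array([9.0, 10.0, 10.2, 10.5, 11.0])
    alpha = np.ones_like(x)             # equal weights, for simplicity
    x_out = x.copy()
    x_out[-1] = 200.0                   # turn one point into an extreme outlier

    print(weighted_mean(x, alpha), weighted_mean(x_out, alpha))      # shifts a lot
    print(weighted_median(x, alpha), weighted_median(x_out, alpha))  # barely moves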



Problem 4

Source: Summer Session 1 2025 Midterm, Problem 4a-d

An automotive research team wants to build a predictive model that simultaneously estimates two performance metrics of passenger cars:

  1. City fuel consumption (in \text{L}/100\text{ km}),
  2. Highway fuel consumption (in \text{L}/100\text{ km}).

To capture mechanical and aerodynamic factors, the engineers record the following four features for each vehicle (all measured on the current model year):

  1. Engine displacement (L)
  2. Vehicle mass (kg)
  3. Peak horsepower
  4. Drag coefficient

They propose the general linear model

f(\mathbf W,\vec b;\,\vec x) = \mathbf W\,\vec x+\vec b,\qquad\mathbf{W}\in\mathbb{R}^{2\times 4},\; \vec{b}\in\mathbb{R}^2,

where \vec x\in\mathbb{R}^{4} denotes the feature vector for a given car. Data for eight different cars are listed below.

Feature          | \vec{x}_1 | \vec{x}_2 | \vec{x}_3 | \vec{x}_4 | \vec{x}_5 | \vec{x}_6 | \vec{x}_7 | \vec{x}_8
Engine disp. (L) | 2.0       | 2.5       | 3.0       | 1.8       | 3.5       | 2.2       | 2.8       | 1.6
Mass (kg)        | 1300      | 1450      | 1600      | 1250      | 1700      | 1350      | 1500      | 1200
Horsepower       | 140       | 165       | 200       | 130       | 250       | 155       | 190       | 115
Drag coeff.      | 0.28      | 0.30      | 0.32      | 0.27      | 0.33      | 0.29      | 0.31      | 0.26
City L/100km     | 8.5       | 9.2       | 10.8      | 7.8       | 11.5      | 8.9       | 9.8       | 7.2
HWY L/100km      | 6.0       | 6.5       | 7.5       | 5.8       | 8.0       | 6.2       | 6.9       | 5.4


Problem 4.1

Write down the design matrix \mathbf Z and the target matrix \mathbf Y from the data in the table.

Solution

The design matrix Z contains the feature data, where each row corresponds to one vehicle and each column corresponds to one feature.

\mathbf{Z} = \begin{bmatrix} 2.0 & 1300 & 140 & 0.28 \\ 2.5 & 1450 & 165 & 0.30 \\ 3.0 & 1600 & 200 & 0.32 \\ 1.8 & 1250 & 130 & 0.27 \\ 3.5 & 1700 & 250 & 0.33 \\ 2.2 & 1350 & 155 & 0.29 \\ 2.8 & 1500 & 190 & 0.31 \\ 1.6 & 1200 & 115 & 0.26 \end{bmatrix} \in \mathbb{R}^{8 \times 4}

The target matrix Y contains the two fuel consumption measurements, where each row corresponds to one vehicle and each column corresponds to one target variable.

\mathbf{Y} = \begin{bmatrix} 8.5 & 6.0 \\ 9.2 & 6.5 \\ 10.8 & 7.5 \\ 7.8 & 5.8 \\ 11.5 & 8.0 \\ 8.9 & 6.2 \\ 9.8 & 6.9 \\ 7.2 & 5.4 \end{bmatrix} \in \mathbb{R}^{8 \times 2}

where the first column contains City L/100km values and the second column contains HWY L/100km values.


Problem 4.2

Compute the weight matrix \mathbf W^{\ast} and bias vector \vec b^{\ast} that minimize the MSE for the given dataset. You can use Python for the computations where needed. You do not need to submit your code, but you do need to write down all matrices and vectors relevant to your computations (round your answers to three decimal places).

Solution

To find the optimal parameters \mathbf{W}^* and \vec{b}^* for the multiple linear regression model, we use the formula:

\begin{bmatrix} \vec{b}^* \\ (\mathbf{W}^*)^T \end{bmatrix} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}

where \mathbf{Z} is the design matrix and \mathbf{Y} is the target matrix.

Step 1: Construct the design matrix \mathbf{Z}

The design matrix has n = 8 rows (one per vehicle) and d + 1 = 5 columns (one for the bias term, four for features):

\mathbf{Z} = \begin{bmatrix} 1 & 2.0 & 1300 & 140 & 0.28 \\ 1 & 2.5 & 1450 & 165 & 0.30 \\ 1 & 3.0 & 1600 & 200 & 0.32 \\ 1 & 1.8 & 1250 & 130 & 0.27 \\ 1 & 3.5 & 1700 & 250 & 0.33 \\ 1 & 2.2 & 1350 & 155 & 0.29 \\ 1 & 2.8 & 1500 & 190 & 0.31 \\ 1 & 1.6 & 1200 & 115 & 0.26 \end{bmatrix}

Step 2: Construct the target matrix \mathbf{Y}

The target matrix has n = 8 rows (one per vehicle) and k = 2 columns (city and highway fuel consumption):

\mathbf{Y} = \begin{bmatrix} 8.5 & 6.0 \\ 9.2 & 6.5 \\ 10.8 & 7.5 \\ 7.8 & 5.8 \\ 11.5 & 8.0 \\ 8.9 & 6.2 \\ 9.8 & 6.9 \\ 7.2 & 5.4 \end{bmatrix}

Step 3: Apply the formula

Using the formula \begin{bmatrix} \vec{b}^* \\ (\mathbf{W}^*)^T \end{bmatrix} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}, we compute:

\begin{bmatrix} \vec{b}^* \\ (\mathbf{W}^*)^T \end{bmatrix} = \begin{bmatrix} -20.167 & -6.051 \\ -6.320 & -2.627 \\ 0.010 & 0.007 \\ 0.041 & 0.020 \\ 78.583 & 17.019 \end{bmatrix}

Step 4: Extract the parameters

From the computed result:

\vec{b}^* = \begin{bmatrix} -20.167 \\ -6.051 \end{bmatrix}

\mathbf{W}^* = \begin{bmatrix} -6.320 & 0.010 & 0.041 & 78.583 \\ -2.627 & 0.007 & 0.020 & 17.019 \end{bmatrix}

Note: The first row of \mathbf{W}^* corresponds to city fuel consumption, and the second row corresponds to highway fuel consumption.
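
A sketch of the Python computation referenced in the problem statement (numpy assumed). It builds Z and Y exactly as above and solves the least-squares problem for the stacked parameter matrix; the printed values can be checked against the rounded entries reported above.

    import numpy as np

    # Features: engine displacement (L), mass (kg), horsepower, drag coefficient.
    X = np.array([[2.0, 1300, 140, 0.28],
                  [2.5, 1450, 165, 0.30],
                  [3.0, 1600, 200, 0.32],
                  [1.8, 1250, 130, 0.27],
                  [3.5, 1700, 250, 0.33],
                  [2.2, 1350, 155, 0.29],
                  [2.8, 1500, 190, 0.31],
                  [1.6, 1200, 115, 0.26]])
    # Targets: city and highway fuel consumption (L/100 km).
    Y = np.array([[8.5, 6.0], [9.2, 6.5], [10.8, 7.5], [7.8, 5.8],
                  [11.5, 8.0], [8.9, 6.2], [9.8, 6.9], [7.2, 5.4]])

    Z = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend intercept column
    params, *_ = np.linalg.lstsq(Z, Y, rcond=None)    # shape (5, 2): stacked [b*; (W*)^T]

    b_star = params[0]        # length-2 bias vector
    W_star = params[1:].T     # 2 x 4 weight matrix
    print(np.round(params, 3))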


Problem 4.3

With (\mathbf W^{\ast},\vec b^{\ast}), predict the two fuel consumption values for each of the eight cars and report the overall MSE between the predictions and the true targets.

Solution

Using the optimal parameters W^* and \vec{b}^* computed in part (b), we predict the fuel consumption values using the model:

f(W^*, \vec{b}^*; \vec{x}) = W^* \vec{x} + \vec{b}^*

where W^* \in \mathbb{R}^{2 \times 4} and \vec{b}^* \in \mathbb{R}^2.

Predictions for All Eight Cars

For each car i, we compute the predicted values:

\hat{y}_i = W^* \vec{x}_i + \vec{b}^*

This gives us two predictions per car: city fuel consumption and highway fuel consumption.

Car 1: \vec{x}_1 = [2.0, 1300, 140, 0.28]^T
  • Predicted city: \hat{y}_1^{(1)} = -6.320(2.0) + 0.010(1300) + 0.041(140) + 78.583(0.28) - 20.167 \approx 8.5 L/100km
  • Predicted highway: \hat{y}_1^{(2)} = -2.627(2.0) + 0.007(1300) + 0.020(140) + 17.019(0.28) - 6.051 \approx 6.0 L/100km

Car 2: \vec{x}_2 = [2.5, 1450, 165, 0.30]^T. Predicted city: \hat{y}_2^{(1)} = 9.2 L/100km; predicted highway: \hat{y}_2^{(2)} = 6.5 L/100km.

Car 3: \vec{x}_3 = [3.0, 1600, 200, 0.32]^T. Predicted city: \hat{y}_3^{(1)} = 10.8 L/100km; predicted highway: \hat{y}_3^{(2)} = 7.5 L/100km.

Car 4: \vec{x}_4 = [1.8, 1250, 130, 0.27]^T. Predicted city: \hat{y}_4^{(1)} = 7.8 L/100km; predicted highway: \hat{y}_4^{(2)} = 5.8 L/100km.

Car 5: \vec{x}_5 = [3.5, 1700, 250, 0.33]^T. Predicted city: \hat{y}_5^{(1)} = 11.5 L/100km; predicted highway: \hat{y}_5^{(2)} = 8.0 L/100km.

Car 6: \vec{x}_6 = [2.2, 1350, 155, 0.29]^T. Predicted city: \hat{y}_6^{(1)} = 8.9 L/100km; predicted highway: \hat{y}_6^{(2)} = 6.2 L/100km.

Car 7: \vec{x}_7 = [2.8, 1500, 190, 0.31]^T. Predicted city: \hat{y}_7^{(1)} = 9.8 L/100km; predicted highway: \hat{y}_7^{(2)} = 6.9 L/100km.

Car 8: \vec{x}_8 = [1.6, 1200, 115, 0.26]^T. Predicted city: \hat{y}_8^{(1)} = 7.2 L/100km; predicted highway: \hat{y}_8^{(2)} = 5.4 L/100km.

Mean Squared Error Calculation

The overall MSE is computed as:

\text{MSE} = \frac{1}{nk} \sum_{i=1}^{n} \sum_{s=1}^{k} (y_{is} - \hat{y}_{is})^2

where n = 8 cars, k = 2 targets (city and highway), and nk = 16 total predictions.

Computing the squared errors for each prediction and summing:

\text{MSE} = \frac{1}{16} \sum_{i=1}^{8} \left[ (y_i^{(1)} - \hat{y}_i^{(1)})^2 + (y_i^{(2)} - \hat{y}_i^{(2)})^2 \right]

After computing all individual squared errors and summing them:

\text{MSE} = 0.0057

The overall mean squared error between the predictions and the true targets is approximately 0.006.
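
Continuing the sketch from Problem 4.2 (this assumes the arrays Z, Y, and params defined there are still in scope), the training predictions and the overall MSE over all 16 predicted values are:

    # Predictions for all eight cars and both targets at once: shape (8, 2).
    Y_hat = Z @ params

    # Overall MSE, averaged over all n * k = 16 predicted values.
    train_mse = np.mean((Y - Y_hat) ** 2)
    print(np.round(Y_hat, 1))
    print(round(float(train_mse), 4))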


Problem 4.4

The team now measures two additional vehicles:

Feature          | \vec{x}_9 | \vec{x}_{10}
Engine disp. (L) | 2.4       | 1.5
Mass (kg)        | 1400      | 1150
Horsepower       | 170       | 110
Drag coeff.      | 0.29      | 0.25
City L/100km     | 9.0       | 7.0
HWY L/100km      | 6.3       | 5.2

Use your trained model to predict fuel consumption for these two cars. Compute the mean-squared error for the two testing examples and state whether you would recommend the model to the engineers.

Solution

Predictions using the trained model:

From Problem 4(b), we have the optimal weight matrix \mathbf{W}^* and bias vector \vec{b}^*:

\mathbf{W}^* = \begin{bmatrix} -6.320 & 0.010 & 0.041 & 78.583 \\ -2.627 & 0.007 & 0.020 & 17.019 \end{bmatrix}

\vec{b}^* = \begin{bmatrix} -20.167 \\ -6.051 \end{bmatrix}

For car 9 with feature vector \vec{x}_9 = [2.4, 1400, 170, 0.29]^T:

f(\mathbf{W}^*, \vec{b}^*, \vec{x}_9) = \mathbf{W}^* \vec{x}_9 + \vec{b}^*

= \begin{bmatrix} -6.320(2.4) + 0.010(1400) + 0.041(170) + 78.583(0.29) - 20.167 \\ -2.627(2.4) + 0.007(1400) + 0.020(170) + 17.019(0.29) - 6.051 \end{bmatrix}

= \begin{bmatrix} 8.95 \\ 6.24 \end{bmatrix}

For car 10 with feature vector \vec{x}_{10} = [1.5, 1150, 110, 0.25]^T:

f(\mathbf{W}^*, \vec{b}^*, \vec{x}_{10}) = \mathbf{W}^* \vec{x}_{10} + \vec{b}^*

= \begin{bmatrix} -6.320(1.5) + 0.010(1150) + 0.041(110) + 78.583(0.25) - 20.167 \\ -2.627(1.5) + 0.007(1150) + 0.020(110) + 17.019(0.25) - 6.051 \end{bmatrix}

= \begin{bmatrix} 7.27 \\ 5.44 \end{bmatrix}

Mean-squared error for the two test examples:

The true targets are:

  • Car 9: \vec{y}_9 = [9.0, 6.3]^T (City, HWY)
  • Car 10: \vec{y}_{10} = [7.0, 5.2]^T (City, HWY)

\text{MSE} = \frac{1}{2 \cdot 2} \sum_{s=9}^{10} \|\vec{y}_s - f(\mathbf{W}^*, \vec{b}^*, \vec{x}_s)\|_2^2

= \frac{1}{4} \left[ (9.0 - 8.95)^2 + (6.3 - 6.24)^2 + (7.0 - 7.27)^2 + (5.2 - 5.44)^2 \right]

= \frac{1}{4} \left[ 0.0025 + 0.0036 + 0.0729 + 0.0576 \right]

= \frac{1}{4}(0.1366) = 0.034
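
Assuming the params array from the Problem 4.2 sketch is still in scope, the test predictions and test MSE can be computed the same way (a sketch; the test data come from the table above):

    import numpy as np

    # Cars 9 and 10: features and true targets from the table above.
    X_test = np.array([[2.4, 1400, 170, 0.29],
                       [1.5, 1150, 110, 0.25]])
    Y_test = np.array([[9.0, 6.3],
                       [7.0, 5.2]])

    Z_test = np.hstack([np.ones((2, 1)), X_test])
    Y_test_hat = Z_test @ params                       # params fitted in the Problem 4.2 sketch
    test_mse = np.mean((Y_test - Y_test_hat) ** 2)     # averaged over 2 * 2 = 4 values
    print(np.round(Y_test_hat, 2), round(float(test_mse), 3))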

Recommendation:

The MSE of 0.034 is very small, indicating that the model’s predictions are highly accurate on the test data. The average prediction error is approximately \sqrt{0.034} \approx 0.18 L/100km, which is quite small relative to typical fuel consumption values (5-10 L/100km).

Yes, I would recommend this model to the engineers. The model demonstrates strong predictive performance on both the training data (from part 4b) and these new test examples, suggesting it generalizes well and provides reliable fuel consumption estimates.



Problem 5

Source: Summer Session 1 2025 Midterm, Problem 5a-c

Let \{(x_i,y_i)\}_{i=1}^n be a dataset of scalar input-output pairs.


Problem 5.1

Suppose we model y using a simple linear regression model of the form

f(\vec{\theta};\, x) = \vec{\theta}^{(0)} + \vec{\theta}^{(1)}x, \qquad\vec{\theta}\in\mathbb{R}^2.

Prove that the line of best fit (with respect to MSE) passes through the point (\overline{x}, \overline{y}).

Solution

We need to show that the optimal parameters \vec{\theta}^* that minimize MSE satisfy f(\vec{\theta}^*; \overline{x}) = \overline{y}, where \overline{x} = \frac{1}{n}\sum_{i=1}^n x_i and \overline{y} = \frac{1}{n}\sum_{i=1}^n y_i.

Step 1: Set up the MSE

The mean squared error for the model f(\vec{\theta}; x) = \theta^{(0)} + \theta^{(1)}x is:

\text{MSE}(\vec{\theta}; (x_i, y_i)) = \frac{1}{n}\sum_{i=1}^n (y_i - (\theta^{(0)} + \theta^{(1)}x_i))^2

Step 2: Find the critical points

To minimize MSE, we take partial derivatives with respect to both parameters and set them equal to zero.

Taking the partial derivative with respect to \theta^{(0)}:

\frac{\partial \text{MSE}}{\partial \theta^{(0)}} = -\frac{2}{n}\sum_{i=1}^n (y_i - \theta^{(0)} - \theta^{(1)}x_i)

Setting this equal to zero:

-\frac{2}{n}\sum_{i=1}^n (y_i - \theta^{(0)} - \theta^{(1)}x_i) = 0

\sum_{i=1}^n y_i - n\theta^{(0)} - \theta^{(1)}\sum_{i=1}^n x_i = 0

Taking the partial derivative with respect to \theta^{(1)}:

\frac{\partial \text{MSE}}{\partial \theta^{(1)}} = -\frac{2}{n}\sum_{i=1}^n (y_i - \theta^{(0)} - \theta^{(1)}x_i)x_i

Step 3: Solve for the optimal parameters

From the first normal equation:

\sum_{i=1}^n y_i = n\theta^{(0)} + \theta^{(1)}\sum_{i=1}^n x_i

Dividing both sides by n:

\frac{1}{n}\sum_{i=1}^n y_i = \theta^{(0)} + \theta^{(1)} \cdot \frac{1}{n}\sum_{i=1}^n x_i

This simplifies to:

\overline{y} = \theta^{(0)} + \theta^{(1)}\overline{x}

Step 4: Conclusion

The equation \overline{y} = \theta^{(0)} + \theta^{(1)}\overline{x} is exactly the statement that f(\vec{\theta}^*; \overline{x}) = \overline{y}, which means the line of best fit passes through the point (\overline{x}, \overline{y}).

Therefore, we have proven that the line of best fit with respect to MSE passes through the point (\overline{x}, \overline{y}).
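
A quick numerical check of this fact (a sketch with made-up data; numpy assumed): fit a degree-1 least-squares polynomial and evaluate it at \bar{x}.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(size=30)
    y = 2.0 * x - 1.0 + rng.normal(scale=0.4, size=30)

    slope, intercept = np.polyfit(x, y, deg=1)                  # least-squares line of best fit
    print(np.isclose(intercept + slope * x.mean(), y.mean()))   # expect True: passes through (x-bar, y-bar)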


Problem 5.2

Suppose we model y using a simple polynomial regression model of the form

f(\vec{\theta};\, x) = \vec{\theta}^{(0)} + \vec{\theta}^{(1)}x+ \vec{\theta}^{(2)}x^2, \qquad\vec{\theta}\in\mathbb{R}^3.

Prove that the curve of best fit (with respect to MSE) passes through the point (\overline{x}, \overline{y} + \vec\theta^{\ast(2)}((\overline{x})^2 - \overline{x^2})), where

\overline{x^2} = \frac{1}{n}\sum_{i=1}^n x_i^2.

Solution

We need to show that the curve of best fit passes through the point \left(\bar{x},\; \bar{y} + \theta^{*(2)}\left((\overline{x})^2 - \overline{x^2}\right)\right).

The MSE for the polynomial regression model is:

\text{MSE}(\vec{\theta}) = \frac{1}{n}\sum_{i=1}^n (y_i - \theta^{(0)} - \theta^{(1)}x_i - \theta^{(2)}x_i^2)^2

To find the optimal parameters, we take partial derivatives and set them equal to zero.

Setting \frac{\partial \text{MSE}}{\partial \theta^{(0)}} = 0:

\frac{\partial \text{MSE}}{\partial \theta^{(0)}} = -\frac{2}{n}\sum_{i=1}^n (y_i - \theta^{(0)} - \theta^{(1)}x_i - \theta^{(2)}x_i^2) = 0

This gives us:

\sum_{i=1}^n (y_i - \theta^{*(0)} - \theta^{*(1)}x_i - \theta^{*(2)}x_i^2) = 0

Expanding:

\sum_{i=1}^n y_i - n\theta^{*(0)} - \theta^{*(1)}\sum_{i=1}^n x_i - \theta^{*(2)}\sum_{i=1}^n x_i^2 = 0

Dividing by n:

\bar{y} - \theta^{*(0)} - \theta^{*(1)}\bar{x} - \theta^{*(2)}\overline{x^2} = 0

Therefore:

\theta^{*(0)} = \bar{y} - \theta^{*(1)}\bar{x} - \theta^{*(2)}\overline{x^2}

Now we evaluate the optimal curve at x = \bar{x}:

f(\vec{\theta}^*; \bar{x}) = \theta^{*(0)} + \theta^{*(1)}\bar{x} + \theta^{*(2)}(\bar{x})^2

Substituting the expression for \theta^{*(0)}:

f(\vec{\theta}^*; \bar{x}) = \left(\bar{y} - \theta^{*(1)}\bar{x} - \theta^{*(2)}\overline{x^2}\right) + \theta^{*(1)}\bar{x} + \theta^{*(2)}(\bar{x})^2

Simplifying, the \theta^{*(1)}\bar{x} terms cancel:

f(\vec{\theta}^*; \bar{x}) = \bar{y} - \theta^{*(2)}\overline{x^2} + \theta^{*(2)}(\bar{x})^2 = \bar{y} + \theta^{*(2)}\left((\overline{x})^2 - \overline{x^2}\right)

Note that (\overline{x})^2 - \overline{x^2} = -\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 \leq 0, so at x = \bar{x} the fitted curve sits below (\bar{x}, \bar{y}) when \theta^{*(2)} > 0 and above it when \theta^{*(2)} < 0.

Therefore, the curve of best fit passes through the point \left(\bar{x},\; \bar{y} + \theta^{*(2)}\left((\overline{x})^2 - \overline{x^2}\right)\right), as claimed. ∎


Problem 5.3

Using the same model as (b), suppose we minimize MSE and find optimal parameters \vec{\theta}^\ast. Further suppose we apply a shifting and scaling operation to the training targets, defining

\widetilde{y_i} = \alpha(y_i - \beta),\qquad\alpha,\beta\in\mathbb{R}.

Find formulas for the new optimal parameters, denoted \vec{\widetilde{\theta}}^\ast, in terms of the old parameters and \alpha, \beta.

Solution

We start by setting up the problem. For the polynomial regression model f(\vec{\theta}; x) = \theta^{(0)} + \theta^{(1)}x + \theta^{(2)}x^2, the design matrix is:

\mathbf{Z} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \in \mathbb{R}^{n \times 3}

and the target vector is:

\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n

Original optimal parameters:

Assuming (\mathbf{Z}^T\mathbf{Z})^{-1} exists, the optimal parameters for the original problem are:

\vec{\theta}^* = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}

Transformed target vector:

When we apply the transformation \tilde{y}_i = \alpha(y_i - \beta), the new target vector becomes:

\tilde{\mathbf{Y}} = \begin{bmatrix} \tilde{y}_1 \\ \tilde{y}_2 \\ \vdots \\ \tilde{y}_n \end{bmatrix} = \alpha(\mathbf{Y} - \beta\mathbf{1})

where \mathbf{1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^n.

New optimal parameters:

The optimal parameters for the transformed problem are:

\tilde{\vec{\theta}}^* = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\tilde{\mathbf{Y}}

= (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T[\alpha(\mathbf{Y} - \beta\mathbf{1})]

= \alpha(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y} - \alpha\beta(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1}

= \alpha\vec{\theta}^* - \alpha\beta(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1}

Computing \mathbf{Z}^T\mathbf{1}:

\mathbf{Z}^T\mathbf{1} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \\ x_1^2 & x_2^2 & \cdots & x_n^2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} n \\ \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i^2 \end{bmatrix}

Final formula:

Therefore, the new optimal parameters are:

\boxed{\tilde{\vec{\theta}}^* = \alpha\vec{\theta}^* - \alpha\beta(\mathbf{Z}^T\mathbf{Z})^{-1}\begin{bmatrix} n \\ \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i^2 \end{bmatrix}}

Alternatively, this can be written more compactly as:

\boxed{\tilde{\vec{\theta}}^* = \alpha\vec{\theta}^* - \alpha\beta(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1}}

Interpretation: The transformation \tilde{y}_i = \alpha(y_i - \beta) scales the targets by \alpha and shifts them by -\alpha\beta. The optimal parameters scale by \alpha (first term) and receive an additional correction term (second term) that depends on both the shift \beta and the structure of the design matrix through (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1}.
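
One further simplification is worth noting (it follows directly from the boxed formula, since the first column of \mathbf{Z} is the all-ones vector): writing \vec{e}_1 = (1, 0, 0)^\top, we have \mathbf{1} = \mathbf{Z}\vec{e}_1, so

(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Z}\,\vec{e}_1 = \vec{e}_1.

Substituting this into the boxed formula gives the component-wise description

\tilde{\theta}^{*(0)} = \alpha\left(\theta^{*(0)} - \beta\right),\qquad \tilde{\theta}^{*(1)} = \alpha\,\theta^{*(1)},\qquad \tilde{\theta}^{*(2)} = \alpha\,\theta^{*(2)},

i.e., only the intercept absorbs the shift \beta, while every parameter is scaled by \alpha.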


