Instructor(s): Sawyer Robinson
This exam was take-home.
Source: Summer Session 1 2025 Midterm, Problem 1
Let \{(x_i,y_i)\}_{i=1}^n be a dataset of scalar input-output pairs, and consider the simple linear regression model
f(a, b;\, x) = ax + b,\qquad a, b\in\mathbb{R}.
Let \gamma > 0 be a fixed constant which is understood to be separate from the training data and the weights. Define the \gamma-risk according to the formula
R_{\gamma}(a, b) = \gamma a^2 + \frac{1}{n}\sum_{i=1}^{n} (y_i - (ax_i + b))^2.
Find closed-form expressions for the global minimizers a^\ast, b^\ast of the \gamma-risk for the training data \{(x_i,y_i)\}_{i=1}^n. In your solution, you should clearly label and explain each step.
We compute the partial derivatives of R_\gamma(a, b) with respect to both parameters.
Partial derivative with respect to a:
\begin{align*} \frac{\partial R_\gamma}{\partial a} &= 2\gamma a + \frac{1}{n} \sum_{i=1}^{n} 2(y_i - ax_i - b)(-x_i)\\ &= 2\gamma a - \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - ax_i - b) \end{align*}
Partial derivative with respect to b:
\begin{align*} \frac{\partial R_\gamma}{\partial b} &= \frac{1}{n} \sum_{i=1}^{n} 2(y_i - ax_i - b)(-1)\\ &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - ax_i - b) \end{align*}
From \frac{\partial R_\gamma}{\partial b} = 0:
\begin{align*} -\frac{2}{n} \sum_{i=1}^{n} (y_i - ax_i - b) &= 0\\ \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i - nb &= 0\\ nb &= \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i\\ b &= \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{a}{n}\sum_{i=1}^{n} x_i\\ b &= \bar{y} - a\bar{x} \end{align*}
where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.
From \frac{\partial R_\gamma}{\partial a} = 0:
\begin{align*} 2\gamma a - \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - ax_i - b) &= 0\\ \gamma a - \frac{1}{n} \sum_{i=1}^{n} x_i y_i + \frac{a}{n}\sum_{i=1}^{n} x_i^2 + \frac{b}{n}\sum_{i=1}^{n} x_i &= 0\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i \end{align*}
Substituting b = \bar{y} - a\bar{x} into the equation above:
\begin{align*} n\gamma a + a\sum_{i=1}^{n} x_i^2 + (\bar{y} - a\bar{x})\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 + \bar{y} \cdot n\bar{x} - a\bar{x} \cdot n\bar{x} &= \sum_{i=1}^{n} x_i y_i\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 - an\bar{x}^2 &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\\ a\left(n\gamma + \sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \end{align*}
Note that \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 and \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).
Therefore:
a^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\gamma}
And:
b^* = \bar{y} - a^*\bar{x}
To confirm that the critical point is a minimizer, we compute the Hessian matrix and verify that it is positive definite.
What is the Hessian matrix?
The Hessian matrix H is a square matrix containing all second-order partial derivatives of a function. For a function R_\gamma(a, b) with two variables, the Hessian is:
H = \begin{pmatrix} \frac{\partial^2 R_\gamma}{\partial a^2} & \frac{\partial^2 R_\gamma}{\partial a \partial b}\\ \frac{\partial^2 R_\gamma}{\partial b \partial a} & \frac{\partial^2 R_\gamma}{\partial b^2} \end{pmatrix}
How does the Hessian determine convexity and minima? If the Hessian is positive semidefinite at every point, the function is convex; if it is positive definite at every point, the function is strictly convex, so any critical point is its unique global minimizer. Since the Hessian of R_\gamma turns out to be constant in (a, b), it suffices to check positive definiteness once.
Second partial derivatives:
\begin{align*} \frac{\partial^2 R_\gamma}{\partial a^2} &= 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2\\ \frac{\partial^2 R_\gamma}{\partial b^2} &= \frac{2}{n}\sum_{i=1}^{n} 1 = 2\\ \frac{\partial^2 R_\gamma}{\partial a \partial b} &= \frac{2}{n}\sum_{i=1}^{n} x_i = 2\bar{x} \end{align*}
The Hessian matrix is:
H = \begin{pmatrix} 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2 & 2\bar{x}\\ 2\bar{x} & 2 \end{pmatrix}
For the Hessian to be positive definite, we need:
H_{11} = 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2 > 0. This is true since \gamma > 0 and all terms are non-negative.
\det(H) > 0:
\begin{align*} \det(H) &= 2\left(2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2\right) - 4\bar{x}^2\\ &= 4\gamma + \frac{4}{n}\sum_{i=1}^{n} x_i^2 - 4\bar{x}^2\\ &= 4\gamma + \frac{4}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\\ &> 0 \end{align*}
since \gamma > 0.
Therefore, the Hessian is positive definite, confirming that (a^*, b^*) is indeed a global minimizer of R_\gamma(a, b).
\boxed{a^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\gamma}}
\boxed{b^* = \bar{y} - a^*\bar{x}}
where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.
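As a quick numerical check (not part of the original solution), the boxed formulas can be compared against a direct solve of the two first-order conditions. This is a minimal sketch assuming numpy and an arbitrary synthetic dataset:

```python
import numpy as np

rng = np.random.default_rng(0)                  # toy data; any scalar dataset works
x = rng.normal(size=20)
y = 3.0 * x + 1.5 + rng.normal(scale=0.3, size=20)
gamma, n = 0.1, len(x)

# Closed-form minimizers derived above
xbar, ybar = x.mean(), y.mean()
a_star = np.sum((x - xbar) * (y - ybar)) / (np.sum((x - xbar) ** 2) + n * gamma)
b_star = ybar - a_star * xbar

# Direct solve of the first-order conditions:
#   (gamma + mean(x^2)) a + xbar * b = mean(x * y)
#   xbar * a + b = ybar
A = np.array([[gamma + np.mean(x ** 2), xbar],
              [xbar, 1.0]])
rhs = np.array([np.mean(x * y), ybar])
a_chk, b_chk = np.linalg.solve(A, rhs)

print(np.allclose([a_star, b_star], [a_chk, b_chk]))   # True
```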
Source: Summer Session 1 2025 Midterm, Problem 2a-e
Consider a dataset \{(\vec{x}_i, y_i)\}_{i=1}^{n} where each \vec{x}_i \in \mathbb{R}^{d} and y_i \in \mathbb{R} for which you decide to fit a multiple linear regression model:
f_1(\vec{w}, b;\, \vec{x}) = \vec{w}^\top \vec{x} + b,\qquad\vec{w}\in\mathbb{R}^d,\;b\in\mathbb{R}.
After minimizing the MSE, the resulting model has an optimal empirical risk value denoted R_1.
Due to fairness constraints related to the nature of the input features, your boss informs you that the last two weights must be the same: \vec{w}^{(d-1)} =\vec{w}^{(d)}. Your colleague suggests a simple fix by removing the last two weights and features:
f_2(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-2)}\vec{x}^{(d-2)} + b.
After training, the resulting model has an optimal empirical risk value denoted R_2. On the other hand, you propose the approach of grouping the last two features and using the model formula
f_3(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-1)}\left(\vec{x}^{(d-1)} + \vec{x}^{(d)}\right) + b.
After training, the final model has an optimal empirical risk value denoted R_3.
Carefully apply Theorem 2.3.2 (“Optimal Model Parameters for Multiple Linear Regression”) to find an expression for the optimal parameters b^\ast, \vec{w}^\ast which minimize the mean squared error for the model f_2 and the training data \{(\vec{x}_i, y_i)\}_{i=1}^{n}. Your answer may contain the design matrix \mathbf{Z}, or any suitably modified version, as needed.
Solution
We begin by rewriting the model f_2 more explicitly. The model f_2 has d-2 weight parameters (excluding the intercept b): f_2(\vec{w}, b; \vec{x}) = \sum_{j=1}^{d-2} \vec{w}^{(j)} x^{(j)} + b, \quad \vec{w} \in \mathbb{R}^{d-2}, \, b \in \mathbb{R}.
To apply Theorem 2.3.2, we need to construct the appropriate design matrix and parameter vector.
Step 1: Define the modified design matrix
Let \mathbf{Z}_2 \in \mathbb{R}^{n \times (d-1)} be the modified design matrix where each row corresponds to a training example with only the first d-2 features plus a column of ones for the intercept: \mathbf{Z}_2 = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(d-2)} \\ 1 & x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(d-2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(d-2)} \end{bmatrix} \in \mathbb{R}^{n \times (d-1)}.
Step 2: Define the parameter vector and target vector
Let \vec{\theta} \in \mathbb{R}^{d-1} be the combined parameter vector: \vec{\theta} = \begin{bmatrix} b \\ \vec{w}^{(1)} \\ \vec{w}^{(2)} \\ \vdots \\ \vec{w}^{(d-2)} \end{bmatrix}.
Let \mathbf{Y} \in \mathbb{R}^n be the target vector: \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
Step 3: Express the MSE
The mean squared error can be written as: \text{MSE}(\vec{\theta}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \mathbf{Z}_{2,i}^{\top} \vec{\theta} \right)^2 = \frac{1}{n} \|\mathbf{Y} - \mathbf{Z}_2 \vec{\theta}\|_2^2.
Step 4: Apply Theorem 2.3.2
To minimize the MSE, we take the gradient with respect to \vec{\theta} and set it equal to zero: \frac{\partial \text{MSE}}{\partial \vec{\theta}} = -\frac{2}{n} \mathbf{Z}_2^{\top} (\mathbf{Y} - \mathbf{Z}_2 \vec{\theta}) = 0.
This simplifies to the normal equation: \mathbf{Z}_2^{\top} \mathbf{Z}_2 \vec{\theta} = \mathbf{Z}_2^{\top} \mathbf{Y}.
Assuming \mathbf{Z}_2^{\top} \mathbf{Z}_2 is invertible (which holds when \mathbf{Z}_2 has full column rank), the unique minimizer is: \vec{\theta}^* = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}.
Step 5: Extract optimal parameters
The optimal parameters are obtained by decomposing \vec{\theta}^*: \begin{bmatrix} b^* \\ \vec{w}^* \end{bmatrix} = \vec{\theta}^* = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}, where b^* is the first component and \vec{w}^* \in \mathbb{R}^{d-2} consists of the remaining components.
Final Answer: \boxed{\begin{bmatrix} b^* \\ \vec{w}^* \end{bmatrix} = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}} where \mathbf{Z}_2 \in \mathbb{R}^{n \times (d-1)} is the design matrix containing a column of ones followed by the first d-2 features of each training example.
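As a side note, the boxed expression is straightforward to evaluate numerically. Below is a minimal numpy sketch; the helper name fit_f2 and the synthetic data are illustrative assumptions, not part of the exam:

```python
import numpy as np

def fit_f2(X, y):
    """Least-squares fit of the reduced model f_2: drop the last two features, keep an intercept.

    X: (n, d) raw feature matrix, y: (n,) target vector.  Returns (b_star, w_star).
    """
    Z2 = np.hstack([np.ones((X.shape[0], 1)), X[:, :-2]])   # n x (d-1) modified design matrix
    # lstsq solves the normal equations Z2^T Z2 theta = Z2^T y (stably, assuming full column rank)
    theta, *_ = np.linalg.lstsq(Z2, y, rcond=None)
    return theta[0], theta[1:]

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
y = X @ rng.normal(size=6) + 1.0 + rng.normal(scale=0.1, size=30)
b_star, w_star = fit_f2(X, y)
```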
Using the comparison operators \{ =, \leq, \geq, <, >\}, rank the optimal risk values R_1, R_2, R_3 from least to greatest. Justify your answer.
Answer: R_1 \leq R_3 \leq R_2
Justification:
To compare these three models, we need to analyze their flexibility and representational capacity.
Comparing R_1 and R_3:
Model f_1 is the most general model with d independent weight parameters plus an intercept, giving it (d+1) total parameters.
Model f_3 can be rewritten as: f_3(\vec{w}, b; \vec{x}) = \vec{w}^{(1)}x^{(1)} + \ldots + \vec{w}^{(d-2)}x^{(d-2)} + \vec{w}^{(d-1)}x^{(d-1)} + \vec{w}^{(d-1)}x^{(d)} + b
This is equivalent to f_1 with the constraint that w^{(d-1)} = w^{(d)}. In other words, f_3 is a constrained version of f_1.
Since f_1 includes all possible models that f_3 can represent (by setting w^{(d-1)} = w^{(d)} in f_1), the minimum achievable MSE for f_1 is at most that of f_3. Therefore:
R_1 \leq R_3
Comparing R_3 and R_2:
Model f_2 completely removes the last two features from the model, using only features x^{(1)}, \ldots, x^{(d-2)}.
Model f_3 uses all d features but groups the last two with a shared coefficient \vec{w}^{(d-1)}.
We can show that f_3 is more flexible than f_2 by noting that f_2 is a special case of f_3. Specifically, if we set \vec{w}^{(d-1)} = 0 in model f_3, we get:
f_3(\vec{w}, b; \vec{x}) = \vec{w}^{(1)}x^{(1)} + \ldots + \vec{w}^{(d-2)}x^{(d-2)} + 0 \cdot (x^{(d-1)} + x^{(d)}) + b = f_2(\vec{w}, b; \vec{x})
Since f_3 can represent any model that f_2 can represent (plus additional models where \vec{w}^{(d-1)} \neq 0), the minimum achievable MSE for f_3 must be at least as good as that of f_2. Therefore:
R_3 \leq R_2
Final Ranking:
Combining these results, we have:
R_1 \leq R_3 \leq R_2
This ranking makes intuitive sense: f_1 is the most flexible model with the most parameters, allowing it to fit the training data best (lowest MSE). Model f_3 is moderately flexible, incorporating information from all features but with a constraint on the last two weights. Model f_2 is the least flexible, as it discards potentially useful information by completely removing the last two features.
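The ranking can also be checked empirically. The sketch below, on an arbitrary synthetic dataset (an illustrative assumption; the helper opt_mse is ad hoc), fits all three models by least squares and verifies R_1 \leq R_3 \leq R_2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 + rng.normal(scale=0.2, size=n)

def opt_mse(Z, y):
    """Optimal (minimized) MSE of a linear model with design matrix Z."""
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.mean((y - Z @ theta) ** 2)

ones = np.ones((n, 1))
grouped = (X[:, -2] + X[:, -1])[:, None]                # the summed feature used by f_3
R1 = opt_mse(np.hstack([ones, X]), y)                   # f_1: all d features
R2 = opt_mse(np.hstack([ones, X[:, :-2]]), y)           # f_2: last two features dropped
R3 = opt_mse(np.hstack([ones, X[:, :-2], grouped]), y)  # f_3: last two features grouped
print(R1 <= R3 <= R2)                                   # True
```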
Returning to the original model f_1, suppose you were asked instead to eliminate the intercept term, leading to the model formula
f_4(\vec{w};\, \vec{x}) = \vec{w}^\top \vec{x}.
Once again, you train this model by minimizing the associated mean squared error and obtain an optimal MSE denoted R_4.
Explain why R_1 \leq R_4.
Model f_1(\vec{w}, b; \vec{x}) = \vec{w}^\top \vec{x} + b includes an intercept term b, while model f_4(\vec{w}; \vec{x}) = \vec{w}^\top \vec{x} does not have an intercept.
This means f_1 is a more flexible model with one additional parameter compared to f_4. The intercept/bias term allows the model to shift all predictions up or down, enabling it to better match the target values.
Importantly, f_1 can always replicate the behavior of f_4 by simply setting b = 0. Therefore, f_1 can do at least as well as f_4, and possibly better if a non-zero intercept improves the fit.
Since R_1 represents the optimal (minimized) mean squared error for model f_1 and R_4 represents the optimal mean squared error for model f_4, we have:
R_1 \leq R_4
That is, the best achievable MSE with an intercept is never worse than the best achievable MSE without one.
Assume the following centering conditions hold:
\sum_{i=1}^{n} \vec{x}_i^{(j)} = 0\text{ for each }1\leq j\leq d,\text{ and }\sum_{i=1}^n y_i = 0.
Prove R_1 = R_4.
We need to show that under the centering conditions, the optimal risk for model f_1 (with intercept) equals the optimal risk for model f_4 (without intercept).
For model f_1, the mean squared error is: \text{MSE}_1(\vec{w}, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2
To minimize the MSE with respect to b, we take the partial derivative: \frac{\partial \text{MSE}_1}{\partial b} = \frac{\partial}{\partial b} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2 \right)
= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)
Setting this equal to zero: -\frac{2}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b) = 0
\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \vec{w}^{\top} \vec{x}_i - nb = 0
The centering condition \sum_{i=1}^{n} y_i = 0 makes the first term vanish. For the second term, linearity gives: \sum_{i=1}^{n} \vec{w}^{\top} \vec{x}_i = \vec{w}^{\top} \sum_{i=1}^{n} \vec{x}_i
Since \sum_{i=1}^{n} \vec{x}_i^{(j)} = 0 for each feature j, we have \sum_{i=1}^{n} \vec{x}_i = \vec{0}, so: \vec{w}^{\top} \sum_{i=1}^{n} \vec{x}_i = \vec{w}^{\top} \vec{0} = 0
Therefore 0 - 0 - nb = 0, which forces b = 0. Crucially, this holds for every choice of \vec{w}: under the centering conditions, the optimal intercept is b^* = 0 regardless of the weights.
Consequently, minimizing \text{MSE}_1 over (\vec{w}, b) reduces to minimizing it over \vec{w} with b fixed at 0, which is exactly the optimization problem for f_4. Both minimizations therefore range over the same family of predictions (linear functions through the origin) and achieve the same optimal risk: R_1 = \min_{\vec{w}, b} \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2 = \min_{\vec{w}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i)^2 = R_4
Therefore, R_1 = R_4. ∎
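A small numerical check of this result, assuming numpy and synthetic data that we center by hand (an illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 3
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                                 # center every feature column
y = rng.normal(size=n)
y = y - y.mean()                                       # center the targets

Z1 = np.hstack([np.ones((n, 1)), X])                   # f_1: with intercept column
t1, *_ = np.linalg.lstsq(Z1, y, rcond=None)
t4, *_ = np.linalg.lstsq(X, y, rcond=None)             # f_4: no intercept

R1 = np.mean((y - Z1 @ t1) ** 2)
R4 = np.mean((y - X @ t4) ** 2)
print(np.isclose(R1, R4), np.isclose(t1[0], 0.0))      # True True
```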
Use the setting of d=1 (a.k.a. simple linear regression) to draw a sketch which illustrates why the result in Part (d) makes sense geometrically.
When d = 1, we have simple linear regression with a single feature:
- Model f_1: y = ax + b (line with intercept)
- Model f_4: y = ax (line through the origin)

The centering conditions become:
- \sum_{i=1}^n x_i = 0 (features are centered)
- \sum_{i=1}^n y_i = 0 (targets are centered)
This means the data has mean (\bar{x}, \bar{y}) = (0, 0), so the data cloud is centered at the origin.
When fitting a line y = ax + b to data centered at the origin using least squares, the optimal intercept is:
b^* = \bar{y} - a^* \bar{x} = 0 - a^* \cdot 0 = 0
This means the best-fit line for f_1 automatically passes through the origin, making it identical to the best-fit line for f_4.
               y
               |    /  •
               | • /
               |  / •
               | /
               |/
    -----------/-------------- x
              /|
          •  / |
            /• |
        •  /   |
          /    |
               |
Explanation of sketch:
- The data points (•) are scattered around the origin (0, 0) because they are centered
- Both f_1 and f_4 will fit the same line through the origin (represented by the diagonal line)
- For f_1: The optimal intercept b^* = 0 because the centroid of the data is at (0, 0), and the least squares line always passes through the centroid
- For f_4: The model is constrained to pass through the origin
- Since both models produce the same fitted line, they achieve the same minimum MSE: R_1 = R_4
The centering conditions ensure that the “natural” best-fit line for f_1 passes through the origin, eliminating the advantage that f_1 normally has over f_4 due to the flexibility of choosing the intercept. Therefore, both models perform equally well on centered data.
Source: Summer Session 1 2025 Midterm, Problem 3a-c
Let \{x_i\}_{i=1}^n be a training dataset of scalar values, and suppose we wish to use the constant model
f(c;\, x) = c,\qquad c\in\mathbb{R}.
In various situations it can be useful to emphasize some training examples over others (e.g., due to data quality). For this purpose, suppose \alpha_1, \alpha_2, \dotsc, \alpha_n > 0 are fixed positive weights which are understood as separate from the training data and model parameters.
Find a formula for the minimizer c_1^\ast of the risk function
R_{1}(c) = \frac{1}{n} \sum_{i=1}^n \alpha_i(c - x_i)^2.
To find the minimizer c_1^*, we need to find the critical points by taking the derivative of R_1(c) with respect to c and setting it equal to zero.
Step 1: Compute the derivative
\frac{dR_1}{dc} = \frac{d}{dc}\left[\frac{1}{n}\sum_{i=1}^n \alpha_i(c - x_i)^2\right]
= \frac{1}{n}\sum_{i=1}^n \alpha_i \cdot 2(c - x_i)
= \frac{2}{n}\sum_{i=1}^n \alpha_i(c - x_i)
Step 2: Set the derivative equal to zero
\frac{2}{n}\sum_{i=1}^n \alpha_i(c - x_i) = 0
\sum_{i=1}^n \alpha_i(c - x_i) = 0
\sum_{i=1}^n \alpha_i c - \sum_{i=1}^n \alpha_i x_i = 0
c\sum_{i=1}^n \alpha_i = \sum_{i=1}^n \alpha_i x_i
Step 3: Solve for c
c_1^* = \frac{\sum_{i=1}^n \alpha_i x_i}{\sum_{i=1}^n \alpha_i}
Step 4: Verify this is a minimum using the second derivative test
\frac{d^2R_1}{dc^2} = \frac{d}{dc}\left[\frac{2}{n}\sum_{i=1}^n \alpha_i(c - x_i)\right]
= \frac{2}{n}\sum_{i=1}^n \alpha_i
Since \alpha_i > 0 for all i, we have \frac{d^2R_1}{dc^2} > 0, which confirms that c_1^* is indeed a minimizer (the function is convex).
Final Answer:
c_1^* = \frac{\sum_{i=1}^n \alpha_i x_i}{\sum_{i=1}^n \alpha_i}
This is the weighted mean of the data points, where each point x_i is weighted by \alpha_i.
Find a formula for the minimizer c_2^\ast of the risk function
R_{2}(c) = \frac{1}{n} \sum_{i=1}^n \alpha_i |c - x_i|.
To find the minimizer c_2^*, we need to analyze the risk function R_2(c) = \frac{1}{n}\sum_{i=1}^n \alpha_i|c - x_i|.
Key observation: The absolute value function |c - x_i| is convex but not differentiable at c = x_i. The sum of weighted absolute values is minimized at the weighted median of the data points.
Finding the derivative (where it exists):
For c \neq x_i for all i, we can write:
\frac{dR_2}{dc} = \frac{1}{n}\sum_{i=1}^n \alpha_i \cdot \frac{d}{dc}|c - x_i|
The derivative of |c - x_i| is:
\frac{d}{dc}|c - x_i| = \begin{cases} +1 & \text{if } c > x_i \\ -1 & \text{if } c < x_i \end{cases}
Therefore:
\frac{dR_2}{dc} = \frac{1}{n}\left(\sum_{i: x_i < c} \alpha_i - \sum_{i: x_i > c} \alpha_i\right)
Locating the minimizer:

Since R_2 is convex, c minimizes R_2 exactly when the derivative is \leq 0 just to the left of c and \geq 0 just to the right. Where the derivative exists, this balance condition reads

\sum_{i: x_i < c} \alpha_i = \sum_{i: x_i > c} \alpha_i

This means the total weight of the points below c equals the total weight of the points above c. When no value of c achieves exact balance, the minimizer is the data point at which the derivative changes sign from negative to positive.
Weighted Median Formula:
Without loss of generality, assume the data points are sorted: x_1 \leq x_2 \leq \cdots \leq x_n.
The minimizer c_2^* is the weighted median, defined as the value x_k such that:
\sum_{i=1}^{k-1} \alpha_i \leq \frac{1}{2}\sum_{i=1}^n \alpha_i \quad \text{and} \quad \sum_{i=k+1}^n \alpha_i \leq \frac{1}{2}\sum_{i=1}^n \alpha_i
Equivalently, c_2^* is the smallest value x_k in the sorted dataset for which:
\sum_{i: x_i \leq x_k} \alpha_i \geq \frac{1}{2}\sum_{i=1}^n \alpha_i
Final Answer:
c_2^* = \text{weighted median of } \{x_1, \ldots, x_n\} \text{ with weights } \{\alpha_1, \ldots, \alpha_n\}
More precisely, after sorting the data points, c_2^* = x_k where k is the smallest index satisfying:
\sum_{i=1}^k \alpha_i \geq \frac{1}{2}\sum_{j=1}^n \alpha_j
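Both minimizers are easy to compute in code. The sketch below is illustrative (the function names weighted_mean and weighted_median are ours, not from the course); the tiny example also previews part (c): the single outlier drags the weighted mean far more than the weighted median.

```python
import numpy as np

def weighted_mean(x, alpha):
    """Minimizer c_1* of R_1: the alpha-weighted average of the data."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    return np.sum(alpha * x) / np.sum(alpha)

def weighted_median(x, alpha):
    """Minimizer c_2* of R_2: smallest x_k whose cumulative weight reaches half the total."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    order = np.argsort(x)
    x, alpha = x[order], alpha[order]
    cum = np.cumsum(alpha)
    k = np.searchsorted(cum, 0.5 * cum[-1])   # first index with cumulative weight >= half the total
    return x[k]

x = [1.0, 2.0, 3.0, 100.0]        # one extreme point
alpha = [1.0, 1.0, 1.0, 1.0]
print(weighted_mean(x, alpha))    # 26.5  -- dragged toward the outlier
print(weighted_median(x, alpha))  # 2.0   -- barely affected
```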
Which risk function is more sensitive to outliers?
R_1 is more sensitive to outliers.
Explanation:
The risk function R_1 uses squared error (c - x_i)^2, while R_2 uses absolute error |c - x_i|.
When an outlier x_i is far from the model prediction c:
- In R_1, the contribution is proportional to (c - x_i)^2, which grows quadratically with the distance
- In R_2, the contribution is proportional to |c - x_i|, which grows linearly with the distance

For example, if an outlier is at distance d from c:
- R_1 contributes \alpha_i d^2
- R_2 contributes \alpha_i d
Since d^2 > d for d > 1, large deviations (outliers) have a disproportionately larger effect on R_1 than on R_2. This quadratic penalty makes R_1 (mean squared error) much more sensitive to outliers compared to R_2 (mean absolute error).
Therefore, R_1 is more sensitive to outliers due to the squaring of errors.
Source: Summer Session 1 2025 Midterm, Problem 4a-d
An automotive research team wants to build a predictive model that simultaneously estimates two performance metrics of passenger cars:
- city fuel consumption (L/100 km), and
- highway fuel consumption (L/100 km).

To capture mechanical and aerodynamic factors, the engineers record the following four features for each vehicle (all measured on the current model year):
- engine displacement (L),
- mass (kg),
- horsepower, and
- drag coefficient.
They propose the general linear model
f(\mathbf W,\vec b;\,\vec x) = \mathbf W\,\vec x+\vec b,\qquad\mathbf{W}\in\mathbb{R}^{2\times 4},\; \vec{b}\in\mathbb{R}^2,
where \vec x\in\mathbb{R}^{4} denotes the feature vector for a given car. Data for eight different cars are listed below.
| Feature | \vec{x}_1 | \vec{x}_2 | \vec{x}_3 | \vec{x}_4 | \vec{x}_5 | \vec{x}_6 | \vec{x}_7 | \vec{x}_8 |
|---|---|---|---|---|---|---|---|---|
| Engine disp. (L) | 2.0 | 2.5 | 3.0 | 1.8 | 3.5 | 2.2 | 2.8 | 1.6 |
| Mass (kg) | 1300 | 1450 | 1600 | 1250 | 1700 | 1350 | 1500 | 1200 |
| Horsepower | 140 | 165 | 200 | 130 | 250 | 155 | 190 | 115 |
| Drag coeff. | 0.28 | 0.30 | 0.32 | 0.27 | 0.33 | 0.29 | 0.31 | 0.26 |
| City L/100km | 8.5 | 9.2 | 10.8 | 7.8 | 11.5 | 8.9 | 9.8 | 7.2 |
| HWY L/100km | 6.0 | 6.5 | 7.5 | 5.8 | 8.0 | 6.2 | 6.9 | 5.4 |
Write down the design matrix \mathbf Z and the target matrix \mathbf Y from the data in the table.
The design matrix \mathbf{Z} contains the feature data, where each row corresponds to one vehicle and each column corresponds to one feature. (The column of ones used for the intercept is appended when the normal equations are applied in part (b).)
\mathbf{Z} = \begin{bmatrix} 2.0 & 1300 & 140 & 0.28 \\ 2.5 & 1450 & 165 & 0.30 \\ 3.0 & 1600 & 200 & 0.32 \\ 1.8 & 1250 & 130 & 0.27 \\ 3.5 & 1700 & 250 & 0.33 \\ 2.2 & 1350 & 155 & 0.29 \\ 2.8 & 1500 & 190 & 0.31 \\ 1.6 & 1200 & 115 & 0.26 \end{bmatrix} \in \mathbb{R}^{8 \times 4}
The target matrix Y contains the two fuel consumption measurements, where each row corresponds to one vehicle and each column corresponds to one target variable.
\mathbf{Y} = \begin{bmatrix} 8.5 & 6.0 \\ 9.2 & 6.5 \\ 10.8 & 7.5 \\ 7.8 & 5.8 \\ 11.5 & 8.0 \\ 8.9 & 6.2 \\ 9.8 & 6.9 \\ 7.2 & 5.4 \end{bmatrix} \in \mathbb{R}^{8 \times 2}
where the first column contains City L/100km values and the second column contains HWY L/100km values.
Compute the weight matrix \mathbf W^{\ast} and bias vector \vec b^{\ast} that minimize the MSE for the given dataset. You can use Python for the computations where needed. You do not need to submit your code, but you do need to write down all matrices and vectors relevant to your computations (round your answers to three decimal places).
To find the optimal parameters \mathbf{W}^* and \vec{b}^* for the multiple linear regression model, we use the formula:
\begin{bmatrix} (\vec{b}^*)^T \\ (\mathbf{W}^*)^T \end{bmatrix} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}
where \mathbf{Z} is the design matrix and \mathbf{Y} is the target matrix.
Step 1: Construct the design matrix \mathbf{Z}
The design matrix has n = 8 rows (one per vehicle) and d + 1 = 5 columns (one for the bias term, four for features):
\mathbf{Z} = \begin{bmatrix} 1 & 2.0 & 1300 & 140 & 0.28 \\ 1 & 2.5 & 1450 & 165 & 0.30 \\ 1 & 3.0 & 1600 & 200 & 0.32 \\ 1 & 1.8 & 1250 & 130 & 0.27 \\ 1 & 3.5 & 1700 & 250 & 0.33 \\ 1 & 2.2 & 1350 & 155 & 0.29 \\ 1 & 2.8 & 1500 & 190 & 0.31 \\ 1 & 1.6 & 1200 & 115 & 0.26 \end{bmatrix}
Step 2: Construct the target matrix \mathbf{Y}
The target matrix has n = 8 rows (one per vehicle) and k = 2 columns (city and highway fuel consumption):
\mathbf{Y} = \begin{bmatrix} 8.5 & 6.0 \\ 9.2 & 6.5 \\ 10.8 & 7.5 \\ 7.8 & 5.8 \\ 11.5 & 8.0 \\ 8.9 & 6.2 \\ 9.8 & 6.9 \\ 7.2 & 5.4 \end{bmatrix}
Step 3: Apply the formula
Using the formula \begin{bmatrix} (\vec{b}^*)^T \\ (\mathbf{W}^*)^T \end{bmatrix} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}, we compute (rows ordered as intercept, then the four features: displacement, mass, horsepower, drag coefficient; columns correspond to the city and highway targets):
\begin{bmatrix} (\vec{b}^*)^T \\ (\mathbf{W}^*)^T \end{bmatrix} = \begin{bmatrix} -20.167 & -6.051 \\ -6.320 & -2.627 \\ 0.010 & 0.007 \\ 0.041 & 0.020 \\ 78.583 & 17.019 \end{bmatrix}
Step 4: Extract the parameters
From the computed result:
\vec{b}^* = \begin{bmatrix} -20.167 \\ -6.051 \end{bmatrix}
\mathbf{W}^* = \begin{bmatrix} -6.320 & 0.010 & 0.041 & 78.583 \\ -2.627 & 0.007 & 0.020 & 17.019 \end{bmatrix}
with columns ordered as (displacement, mass, horsepower, drag coefficient).
Note: The first row of \mathbf{W}^* corresponds to city fuel consumption, and the second row corresponds to highway fuel consumption.
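The problem statement allows Python for the numerics; a minimal numpy sketch of how this computation can be set up is shown below (the arrays are transcribed from the table in part (a), and np.linalg.lstsq is used as a numerically stable stand-in for the explicit inverse in the normal equations):

```python
import numpy as np

# Feature columns: engine displacement (L), mass (kg), horsepower, drag coefficient
X = np.array([[2.0, 1300, 140, 0.28],
              [2.5, 1450, 165, 0.30],
              [3.0, 1600, 200, 0.32],
              [1.8, 1250, 130, 0.27],
              [3.5, 1700, 250, 0.33],
              [2.2, 1350, 155, 0.29],
              [2.8, 1500, 190, 0.31],
              [1.6, 1200, 115, 0.26]])
# Target columns: city L/100 km, highway L/100 km
Y = np.array([[ 8.5, 6.0], [ 9.2, 6.5], [10.8, 7.5], [ 7.8, 5.8],
              [11.5, 8.0], [ 8.9, 6.2], [ 9.8, 6.9], [ 7.2, 5.4]])

Z = np.hstack([np.ones((X.shape[0], 1)), X])   # 8 x 5 augmented design matrix
theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # 5 x 2: first row (b*)^T, remaining rows (W*)^T

b_star = theta[0]        # shape (2,)
W_star = theta[1:].T     # shape (2, 4)
print(np.round(b_star, 3))
print(np.round(W_star, 3))
```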
With (\mathbf W^{\ast},\vec b^{\ast}), predict the two fuel consumption values for each of the eight cars and report the overall MSE between the predictions and the true targets.
Using the optimal parameters W^* and \vec{b}^* computed in part (b), we predict the fuel consumption values using the model:
f(W^*, \vec{b}^*; \vec{x}) = W^* \vec{x} + \vec{b}^*
where W^* \in \mathbb{R}^{2 \times 4} and \vec{b}^* \in \mathbb{R}^2.
For each car i, we compute the predicted values:
\hat{y}_i = W^* \vec{x}_i + \vec{b}^*
This gives us two predictions per car: city fuel consumption and highway fuel consumption.
Car 1: \vec{x}_1 = [2.0, 1300, 140, 0.28]^T
- Predicted city: \hat{y}_1^{(1)} = -6.32(2.0) + 0.010(1300) + 0.041(140) + 78.858(0.28) - 20.167 \approx 8.5 L/100km
- Predicted highway: \hat{y}_1^{(2)} = -2.627(2.0) + 0.007(1300) + 0.020(140) + 17.019(0.28) - 6.051 \approx 6.0 L/100km

(Predictions are reported to one decimal place; because the training MSE computed below is only about 0.006, every prediction coincides with its true target at this precision.)
The remaining predictions, rounded to one decimal place, are:

| Car | Features [disp (L), mass (kg), hp, drag] | Predicted City (L/100 km) | Predicted HWY (L/100 km) |
|---|---|---|---|
| 2 | [2.5, 1450, 165, 0.30] | 9.2 | 6.5 |
| 3 | [3.0, 1600, 200, 0.32] | 10.8 | 7.5 |
| 4 | [1.8, 1250, 130, 0.27] | 7.8 | 5.8 |
| 5 | [3.5, 1700, 250, 0.33] | 11.5 | 8.0 |
| 6 | [2.2, 1350, 155, 0.29] | 8.9 | 6.2 |
| 7 | [2.8, 1500, 190, 0.31] | 9.8 | 6.9 |
| 8 | [1.6, 1200, 115, 0.26] | 7.2 | 5.4 |
The overall MSE is computed as:
\text{MSE} = \frac{1}{nk} \sum_{i=1}^{n} \sum_{s=1}^{k} (y_{is} - \hat{y}_{is})^2
where n = 8 cars, k = 2 targets (city and highway), and nk = 16 total predictions.
Computing the squared errors for each prediction and summing:
\text{MSE} = \frac{1}{16} \sum_{i=1}^{8} \left[ (y_i^{(1)} - \hat{y}_i^{(1)})^2 + (y_i^{(2)} - \hat{y}_i^{(2)})^2 \right]
After computing all individual squared errors and summing them:
\text{MSE} = 0.0057
The overall mean squared error between the predictions and the true targets is approximately 0.006.
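The predictions and the training MSE can be obtained by extending the sketch from part (b); the short continuation below assumes the arrays Z, Y, and theta defined there:

```python
# Continues the part (b) sketch: Z (8 x 5), Y (8 x 2), theta (5 x 2) assumed already defined.
Y_hat = Z @ theta                        # 8 x 2 matrix of predicted (city, hwy) values
train_mse = np.mean((Y - Y_hat) ** 2)    # averages the squared error over all 16 entries
print(np.round(Y_hat, 1))
print(round(float(train_mse), 4))
```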
The team now measures two additional vehicles:
| Feature | \vec{x}_9 | \vec{x}_{10} |
|---|---|---|
| Engine disp. (L) | 2.4 | 1.5 |
| Mass (kg) | 1400 | 1150 |
| Horsepower | 170 | 110 |
| Drag coeff. | 0.29 | 0.25 |
| City L/100km | 9.0 | 7.0 |
| HWY L/100km | 6.3 | 5.2 |
Use your trained model to predict fuel consumption for these two cars. Compute the mean-squared error for the two testing examples and state whether you would recommend the model to the engineers.
Predictions using the trained model:
From Problem 4(b), we have the optimal weight matrix \mathbf{W}^* and bias vector \vec{b}^*:
\mathbf{W}^* = \begin{bmatrix} -6.32 & 0.010 & 0.040 & 78.458 \\ -2.627 & 0.007 & 0.020 & 17.019 \end{bmatrix}
\vec{b}^* = \begin{bmatrix} -20.167 \\ -6.051 \end{bmatrix}
For car 9 with feature vector \vec{x}_9 = [2.4, 1400, 170, 0.29]^T:
f(\mathbf{W}^*, \vec{b}^*, \vec{x}_9) = \mathbf{W}^* \vec{x}_9 + \vec{b}^*
= \begin{bmatrix} -6.32(2.4) + 0.010(1400) + 0.040(170) + 78.458(0.29) - 20.167 \\ -2.627(2.4) + 0.007(1400) + 0.020(170) + 17.019(0.29) - 6.051 \end{bmatrix}
= \begin{bmatrix} 8.95 \\ 6.24 \end{bmatrix}
For car 10 with feature vector \vec{x}_{10} = [1.5, 1150, 110, 0.25]^T:
f(\mathbf{W}^*, \vec{b}^*, \vec{x}_{10}) = \mathbf{W}^* \vec{x}_{10} + \vec{b}^*
= \begin{bmatrix} -6.32(1.5) + 0.010(1150) + 0.040(110) + 78.458(0.25) - 20.167 \\ -2.627(1.5) + 0.007(1150) + 0.020(110) + 17.019(0.25) - 6.051 \end{bmatrix}
= \begin{bmatrix} 7.27 \\ 5.44 \end{bmatrix}
Mean-squared error for the two test examples:
The true targets are:
- Car 9: \vec{y}_9 = [9.0, 6.3]^T (City, HWY)
- Car 10: \vec{y}_{10} = [7.0, 5.2]^T (City, HWY)
\text{MSE} = \frac{1}{2 \cdot 2} \sum_{s=9}^{10} \|\vec{y}_s - f(\mathbf{W}^*, \vec{b}^*, \vec{x}_s)\|_2^2
= \frac{1}{4} \left[ (9.0 - 8.95)^2 + (6.3 - 6.24)^2 + (7.0 - 7.27)^2 + (5.2 - 5.44)^2 \right]
= \frac{1}{4} \left[ 0.0025 + 0.0036 + 0.0729 + 0.0576 \right]
= \frac{1}{4}(0.1366) = 0.034
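A short continuation of the same sketch (again assuming theta from the part (b) code) evaluates the model on the two held-out cars and computes the test MSE:

```python
import numpy as np

# Continues the part (b) sketch: theta (5 x 2) assumed already defined.
X_test = np.array([[2.4, 1400, 170, 0.29],     # car 9:  disp, mass, hp, drag
                   [1.5, 1150, 110, 0.25]])    # car 10
Y_test = np.array([[9.0, 6.3],                 # true (city, hwy) targets
                   [7.0, 5.2]])

Z_test = np.hstack([np.ones((2, 1)), X_test])  # prepend the intercept column
Y_pred = Z_test @ theta
test_mse = np.mean((Y_test - Y_pred) ** 2)     # averages over the 4 test predictions
print(np.round(Y_pred, 2))
print(round(float(test_mse), 3))
```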
Recommendation:
The MSE of 0.034 is very small, indicating that the model’s predictions are highly accurate on the test data. The average prediction error is approximately \sqrt{0.034} \approx 0.18 L/100km, which is quite small relative to typical fuel consumption values (5-10 L/100km).
Yes, I would recommend this model to the engineers. The model demonstrates strong predictive performance on both the training data (from part 4b) and these new test examples, suggesting it generalizes well and provides reliable fuel consumption estimates.
Source: Summer Session 1 2025 Midterm, Problem 5a-c
Let \{(x_i,y_i)\}_{i=1}^n be a dataset of scalar input-output pairs.
Suppose we model y using a simple linear regression model of the form
f(\vec{\theta};\, x) = \vec{\theta}^{(0)} + \vec{\theta}^{(1)}x, \qquad\vec{\theta}\in\mathbb{R}^2.
Prove that the line of best fit (with respect to MSE) passes through the point (\overline{x}, \overline{y}).
We need to show that the optimal parameters \vec{\theta}^* that minimize MSE satisfy f(\vec{\theta}^*; \overline{x}) = \overline{y}, where \overline{x} = \frac{1}{n}\sum_{i=1}^n x_i and \overline{y} = \frac{1}{n}\sum_{i=1}^n y_i.
Step 1: Set up the MSE
The mean squared error for the model f(\vec{\theta}; x) = \theta^{(0)} + \theta^{(1)}x is:
\text{MSE}(\vec{\theta}; (x_i, y_i)) = \frac{1}{n}\sum_{i=1}^n (y_i - (\theta^{(0)} + \theta^{(1)}x_i))^2
Step 2: Find the critical points
To minimize MSE, we take partial derivatives with respect to both parameters and set them equal to zero.
Taking the partial derivative with respect to \theta^{(0)}:
\frac{\partial \text{MSE}}{\partial \theta^{(0)}} = -\frac{2}{n}\sum_{i=1}^n (y_i - \theta^{(0)} - \theta^{(1)}x_i)
Setting this equal to zero:
-\frac{2}{n}\sum_{i=1}^n (y_i - \theta^{(0)} - \theta^{(1)}x_i) = 0
\sum_{i=1}^n y_i - n\theta^{(0)} - \theta^{(1)}\sum_{i=1}^n x_i = 0
Taking the partial derivative with respect to \theta^{(1)}:
\frac{\partial \text{MSE}}{\partial \theta^{(1)}} = -\frac{2}{n}\sum_{i=1}^n (y_i - \theta^{(0)} - \theta^{(1)}x_i)x_i
Step 3: Solve for the optimal parameters
From the first normal equation:
\sum_{i=1}^n y_i = n\theta^{(0)} + \theta^{(1)}\sum_{i=1}^n x_i
Dividing both sides by n:
\frac{1}{n}\sum_{i=1}^n y_i = \theta^{(0)} + \theta^{(1)} \cdot \frac{1}{n}\sum_{i=1}^n x_i
This simplifies to:
\overline{y} = \theta^{(0)} + \theta^{(1)}\overline{x}
Step 4: Conclusion
The equation \overline{y} = \theta^{(0)} + \theta^{(1)}\overline{x} is exactly the statement that f(\vec{\theta}^*; \overline{x}) = \overline{y}, which means the line of best fit passes through the point (\overline{x}, \overline{y}).
Therefore, we have proven that the line of best fit with respect to MSE passes through the point (\overline{x}, \overline{y}).
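A quick numerical illustration of this fact, on an arbitrary synthetic dataset (an assumption for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 2.0 * x - 1.0 + rng.normal(scale=0.5, size=30)

Z = np.column_stack([np.ones_like(x), x])                  # design matrix with intercept column
theta0, theta1 = np.linalg.lstsq(Z, y, rcond=None)[0]
print(np.isclose(theta0 + theta1 * x.mean(), y.mean()))    # True: the line passes through (x̄, ȳ)
```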
Suppose we model y using a simple polynomial regression model of the form
f(\vec{\theta};\, x) = \vec{\theta}^{(0)} + \vec{\theta}^{(1)}x+ \vec{\theta}^{(2)}x^2, \qquad\vec{\theta}\in\mathbb{R}^3.
Prove that the curve of best fit (with respect to MSE) passes through the point (\overline{x}, \overline{y} + \vec\theta^{\ast(2)}((\overline{x})^2 - \overline{x^2})), where
\overline{x^2} = \frac{1}{n}\sum_{i=1}^n x_i^2.
We need to show that the optimal curve passes through the claimed point, i.e., that f(\vec{\theta}^*; \bar{x}) = \bar{y} + \tilde{\theta}^{*(2)}\left((\overline{x})^2 - \overline{x^2}\right).
The MSE for the polynomial regression model is:
\text{MSE}(\vec{\theta}) = \frac{1}{n}\sum_{i=1}^n (y_i - \tilde{\theta}^{(0)} - \tilde{\theta}^{(1)}x_i - \tilde{\theta}^{(2)}x_i^2)^2
To find the optimal parameters, we take partial derivatives and set them equal to zero.
Setting \frac{\partial \text{MSE}}{\partial \tilde{\theta}^{(0)}} = 0:
\frac{\partial \text{MSE}}{\partial \tilde{\theta}^{(0)}} = -\frac{2}{n}\sum_{i=1}^n (y_i - \tilde{\theta}^{(0)} - \tilde{\theta}^{(1)}x_i - \tilde{\theta}^{(2)}x_i^2) = 0
This gives us:
\sum_{i=1}^n (y_i - \tilde{\theta}^{(0)} - \tilde{\theta}^{(1)}x_i - \tilde{\theta}^{(2)}x_i^2) = 0
Expanding:
\sum_{i=1}^n y_i - n\tilde{\theta}^{*(0)} - \tilde{\theta}^{*(1)}\sum_{i=1}^n x_i - \tilde{\theta}^{*(2)}\sum_{i=1}^n x_i^2 = 0
Dividing by n:
\bar{y} - \tilde{\theta}^{*(0)} - \tilde{\theta}^{*(1)}\bar{x} - \tilde{\theta}^{*(2)}\overline{x^2} = 0
Therefore:
\tilde{\theta}^{*(0)} = \bar{y} - \tilde{\theta}^{*(1)}\bar{x} - \tilde{\theta}^{*(2)}\overline{x^2}
Now we evaluate the optimal curve at x = \bar{x}:
f(\vec{\theta}^*; \bar{x}) = \tilde{\theta}^{*(0)} + \tilde{\theta}^{*(1)}\bar{x} + \tilde{\theta}^{*(2)}\bar{x}^2
Substituting the expression for \tilde{\theta}^{*(0)}:
f(\vec{\theta}^*; \bar{x}) = (\bar{y} - \tilde{\theta}^{*(1)}\bar{x} - \tilde{\theta}^{*(2)}\overline{x^2}) + \tilde{\theta}^{*(1)}\bar{x} + \tilde{\theta}^{*(2)}\bar{x}^2
Simplifying:
f(\vec{\theta}^*; \bar{x}) = \bar{y} - \tilde{\theta}^{*(2)}\overline{x^2} + \tilde{\theta}^{*(2)}\bar{x}^2
f(\vec{\theta}^*; \bar{x}) = \bar{y} + \tilde{\theta}^{*(2)}(\bar{x}^2 - \overline{x^2})
f(\vec{\theta}^*; \bar{x}) = \bar{y} + \tilde{\theta}^{*(2)}((\overline{x})^2 - \overline{x^2})
Since (\overline{x})^2 and \bar{x}^2 denote the same quantity (the square of the sample mean), this is exactly the expression claimed in the problem statement.

Therefore, the curve of best fit passes through the point \left(\bar{x},\; \bar{y} + \tilde{\theta}^{*(2)}\left((\overline{x})^2 - \overline{x^2}\right)\right).
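The same kind of numerical check works here; the synthetic data below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 0.5 * x ** 2 + 2.0 * x - 1.0 + rng.normal(scale=0.5, size=30)

Zq = np.column_stack([np.ones_like(x), x, x ** 2])       # quadratic design matrix
q0, q1, q2 = np.linalg.lstsq(Zq, y, rcond=None)[0]       # optimal (theta0, theta1, theta2)

lhs = q0 + q1 * x.mean() + q2 * x.mean() ** 2            # f(theta*; x̄)
rhs = y.mean() + q2 * (x.mean() ** 2 - np.mean(x ** 2))  # ȳ + theta*(2) ((x̄)^2 - mean(x^2))
print(np.isclose(lhs, rhs))                              # True
```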
Using the same model as (b), suppose we minimize MSE and find optimal parameters \vec{\theta}^\ast. Further suppose we apply a shifting and scaling operation to the training targets, defining
\widetilde{y_i} = \alpha(y_i - \beta),\qquad\alpha,\beta\in\mathbb{R}.
Find formulas for the new optimal parameters, denoted \vec{\widetilde{\theta}}^\ast, in terms of the old parameters and \alpha, \beta.
We start by setting up the problem. For the polynomial regression model f(\vec{\theta}; x) = \theta^{(0)} + \theta^{(1)}x + \theta^{(2)}x^2, the design matrix is:
\mathbf{Z} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \in \mathbb{R}^{n \times 3}
and the target vector is:
\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n
Original optimal parameters:
Assuming (\mathbf{Z}^T\mathbf{Z})^{-1} exists, the optimal parameters for the original problem are:
\vec{\theta}^* = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}
Transformed target vector:
When we apply the transformation \tilde{y}_i = \alpha(y_i - \beta), the new target vector becomes:
\tilde{\mathbf{Y}} = \begin{bmatrix} \tilde{y}_1 \\ \tilde{y}_2 \\ \vdots \\ \tilde{y}_n \end{bmatrix} = \alpha(\mathbf{Y} - \beta\mathbf{1})
where \mathbf{1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^n.
New optimal parameters:
The optimal parameters for the transformed problem are:
\tilde{\vec{\theta}}^* = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\tilde{\mathbf{Y}}
= (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T[\alpha(\mathbf{Y} - \beta\mathbf{1})]
= \alpha(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y} - \alpha\beta(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1}
= \alpha\vec{\theta}^* - \alpha\beta(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1}
Computing \mathbf{Z}^T\mathbf{1}:
\mathbf{Z}^T\mathbf{1} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \\ x_1^2 & x_2^2 & \cdots & x_n^2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} n \\ \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i^2 \end{bmatrix}
Final formula:
Therefore, the new optimal parameters are:
\boxed{\tilde{\vec{\theta}}^* = \alpha\vec{\theta}^* - \alpha\beta(\mathbf{Z}^T\mathbf{Z})^{-1}\begin{bmatrix} n \\ \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i^2 \end{bmatrix}}
Alternatively, this can be written more compactly as:
\boxed{\tilde{\vec{\theta}}^* = \alpha\vec{\theta}^* - \alpha\beta(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1}}
Interpretation: The transformation \tilde{y}_i = \alpha(y_i - \beta) scales the targets by \alpha and shifts them by -\alpha\beta. The optimal parameters scale by \alpha (first term) and receive an additional correction term (second term) that depends on both the shift \beta and the structure of the design matrix through (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{1}.
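As a sanity check of the boxed formula (on synthetic data, which is an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=30)
y = 0.5 * x ** 2 + 2.0 * x - 1.0 + rng.normal(scale=0.5, size=30)
Z = np.column_stack([np.ones_like(x), x, x ** 2])

alpha, beta = 2.5, 0.7
theta = np.linalg.lstsq(Z, y, rcond=None)[0]                          # original optimal parameters
theta_tilde = np.linalg.lstsq(Z, alpha * (y - beta), rcond=None)[0]   # fit on transformed targets
ones_term = np.linalg.lstsq(Z, np.ones_like(y), rcond=None)[0]        # (Z^T Z)^{-1} Z^T 1
print(np.allclose(theta_tilde, alpha * theta - alpha * beta * ones_term))   # True
```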