Instructor(s): Sawyer Robinson
This exam was take-home.
Source: Summer Session 1 2025 Midterm, Problem 1
Let \{(x_i,y_i)\}_{i=1}^n be a dataset of scalar input-output pairs, and consider the simple linear regression model
f(a, b;\, x) = ax + b,\qquad a, b\in\mathbb{R}.
Let \gamma > 0 be a fixed constant which is understood to be separate from the training data and the weights. Define the \gamma-risk according to the formula
R_{\gamma}(a, b) = \gamma a^2 + \frac{1}{n}\sum_{i=1}^{n} (y_i - (ax_i + b))^2.
Find closed-form expressions for the global minimizers a^\ast, b^\ast of the \gamma-risk for the training data \{(x_i,y_i)\}_{i=1}^n. In your solution, you should clearly label and explain each step.
We compute the partial derivatives of R_\gamma(a, b) with respect to both parameters.
Partial derivative with respect to a:
\begin{align*} \frac{\partial R_\gamma}{\partial a} &= 2\gamma a + \frac{1}{n} \sum_{i=1}^{n} 2(y_i - ax_i - b)(-x_i)\\ &= 2\gamma a - \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - ax_i - b) \end{align*}
Partial derivative with respect to b:
\begin{align*} \frac{\partial R_\gamma}{\partial b} &= \frac{1}{n} \sum_{i=1}^{n} 2(y_i - ax_i - b)(-1)\\ &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - ax_i - b) \end{align*}
From \frac{\partial R_\gamma}{\partial b} = 0:
\begin{align*} -\frac{2}{n} \sum_{i=1}^{n} (y_i - ax_i - b) &= 0\\ \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i - nb &= 0\\ nb &= \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i\\ b &= \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{a}{n}\sum_{i=1}^{n} x_i\\ b &= \bar{y} - a\bar{x} \end{align*}
where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.
From \frac{\partial R_\gamma}{\partial a} = 0:
\begin{align*} 2\gamma a - \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - ax_i - b) &= 0\\ \gamma a - \frac{1}{n} \sum_{i=1}^{n} x_i y_i + \frac{a}{n}\sum_{i=1}^{n} x_i^2 + \frac{b}{n}\sum_{i=1}^{n} x_i &= 0\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i \end{align*}
Substituting b = \bar{y} - a\bar{x} into the equation above:
\begin{align*} n\gamma a + a\sum_{i=1}^{n} x_i^2 + (\bar{y} - a\bar{x})\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 + \bar{y} \cdot n\bar{x} - a\bar{x} \cdot n\bar{x} &= \sum_{i=1}^{n} x_i y_i\\ n\gamma a + a\sum_{i=1}^{n} x_i^2 - an\bar{x}^2 &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\\ a\left(n\gamma + \sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \end{align*}
Note that \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 and \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).
Therefore:
a^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\gamma}
And:
b^* = \bar{y} - a^*\bar{x}
To confirm that the critical point is a minimizer, we compute the Hessian matrix and verify that it is positive definite.
What is the Hessian matrix?
The Hessian matrix H is a square matrix containing all second-order partial derivatives of a function. For a function R_\gamma(a, b) with two variables, the Hessian is:
H = \begin{pmatrix} \frac{\partial^2 R_\gamma}{\partial a^2} & \frac{\partial^2 R_\gamma}{\partial a \partial b}\\ \frac{\partial^2 R_\gamma}{\partial b \partial a} & \frac{\partial^2 R_\gamma}{\partial b^2} \end{pmatrix}
How does the Hessian determine convexity and minima? If the Hessian is positive semidefinite at every point, the function is convex; if it is positive definite at every point, the function is strictly convex, and any critical point is its unique global minimizer.
Second partial derivatives:
\begin{align*} \frac{\partial^2 R_\gamma}{\partial a^2} &= 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2\\ \frac{\partial^2 R_\gamma}{\partial b^2} &= \frac{2}{n}\sum_{i=1}^{n} 1 = 2\\ \frac{\partial^2 R_\gamma}{\partial a \partial b} &= \frac{2}{n}\sum_{i=1}^{n} x_i = 2\bar{x} \end{align*}
The Hessian matrix is:
H = \begin{pmatrix} 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2 & 2\bar{x}\\ 2\bar{x} & 2 \end{pmatrix}
For the Hessian to be positive definite, we need:
H_{11} = 2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2 > 0. This holds since \gamma > 0 and \frac{2}{n}\sum_{i=1}^{n} x_i^2 \geq 0.
\det(H) > 0:
\begin{align*} \det(H) &= 2\left(2\gamma + \frac{2}{n}\sum_{i=1}^{n} x_i^2\right) - 4\bar{x}^2\\ &= 4\gamma + \frac{4}{n}\sum_{i=1}^{n} x_i^2 - 4\bar{x}^2\\ &= 4\gamma + \frac{4}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\\ &> 0 \end{align*}
since \gamma > 0.
Since H does not depend on (a, b) and is positive definite, R_\gamma is strictly convex, so the critical point (a^*, b^*) is the unique global minimizer of R_\gamma(a, b).
\boxed{a^* = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + n\gamma}}
\boxed{b^* = \bar{y} - a^*\bar{x}}
where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.
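As an optional sanity check, the closed form can be verified numerically: generate synthetic data, evaluate the formulas for a^\ast and b^\ast, and compare against a direct numerical minimizer of R_\gamma. The NumPy/SciPy sketch below uses made-up data and illustrative variable names.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, gamma = 50, 0.3
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=n)

# Closed-form minimizers derived above
x_bar, y_bar = x.mean(), y.mean()
a_star = np.sum((x - x_bar) * (y - y_bar)) / (np.sum((x - x_bar) ** 2) + n * gamma)
b_star = y_bar - a_star * x_bar

# Direct numerical minimization of the gamma-risk, for comparison
def risk(params):
    a, b = params
    return gamma * a**2 + np.mean((y - (a * x + b)) ** 2)

res = minimize(risk, x0=[0.0, 0.0])
print(a_star, b_star)   # closed form
print(res.x)            # numerical minimizer (should agree closely)
```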
Source: Summer Session 1 2025 Midterm, Problem 2a-e
Consider a dataset \{(\vec{x}_i, y_i)\}_{i=1}^{n} where each \vec{x}_i \in \mathbb{R}^{d} and y_i \in \mathbb{R} for which you decide to fit a multiple linear regression model:
f_1(\vec{w}, b;\, \vec{x}) = \vec{w}^\top \vec{x} + b,\qquad\vec{w}\in\mathbb{R}^d,\;b\in\mathbb{R}.
After minimizing the MSE, the resulting model has an optimal empirical risk value denoted R_1.
Due to fairness constraints related to the nature of the input features, your boss informs you that the last two weights must be the same: \vec{w}^{(d-1)} =\vec{w}^{(d)}. Your colleague suggests a simple fix by removing the last two weights and features:
f_2(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-2)}\vec{x}^{(d-2)} + b.
After training, the resulting model has an optimal empirical risk value denoted R_2. On the other hand, you propose the approach of grouping the last two features and using the model formula
f_3(\vec{w}, b;\, \vec{x}) = \vec{w}^{(1)}\vec{x}^{(1)} + \vec{w}^{(2)}\vec{x}^{(2)} + \dotsc + \vec{w}^{(d-1)}\left(\vec{x}^{(d-1)} + \vec{x}^{(d)}\right) + b.
After training, the final model has an optimal empirical risk value denoted R_3.
Carefully apply Theorem 2.3.2 (“Optimal Model Parameters for Multiple Linear Regression”) to find an expression for the optimal parameters b^\ast, \vec{w}^\ast which minimize the mean squared error for the model f_2 and the training data \{(\vec{x}_i, y_i)\}_{i=1}^{n}. Your answer may contain the design matrix \mathbf{Z}, or any suitably modified version, as needed.
Solution
We begin by rewriting the model f_2 more explicitly. The model f_2 has d-2 weight parameters (excluding the intercept b): f_2(\vec{w}, b; \vec{x}) = \sum_{j=1}^{d-2} \vec{w}^{(j)} x^{(j)} + b, \quad \vec{w} \in \mathbb{R}^{d-2}, \, b \in \mathbb{R}.
To apply Theorem 2.3.2, we need to construct the appropriate design matrix and parameter vector.
Step 1: Define the modified design matrix
Let \mathbf{Z}_2 \in \mathbb{R}^{n \times (d-1)} be the modified design matrix where each row corresponds to a training example with only the first d-2 features plus a column of ones for the intercept: \mathbf{Z}_2 = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(d-2)} \\ 1 & x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(d-2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(d-2)} \end{bmatrix} \in \mathbb{R}^{n \times (d-1)}.
Step 2: Define the parameter vector and target vector
Let \vec{\theta} \in \mathbb{R}^{d-1} be the combined parameter vector: \vec{\theta} = \begin{bmatrix} b \\ \vec{w}^{(1)} \\ \vec{w}^{(2)} \\ \vdots \\ \vec{w}^{(d-2)} \end{bmatrix}.
Let \mathbf{Y} \in \mathbb{R}^n be the target vector: \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
Step 3: Express the MSE
The mean squared error can be written as: \text{MSE}(\vec{\theta}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \mathbf{Z}_{2,i}^{\top} \vec{\theta} \right)^2 = \frac{1}{n} \|\mathbf{Y} - \mathbf{Z}_2 \vec{\theta}\|_2^2.
Step 4: Apply Theorem 2.3.2
To minimize the MSE, we take the gradient with respect to \vec{\theta} and set it equal to zero: \frac{\partial \text{MSE}}{\partial \vec{\theta}} = -\frac{2}{n} \mathbf{Z}_2^{\top} (\mathbf{Y} - \mathbf{Z}_2 \vec{\theta}) = 0.
This simplifies to the normal equation: \mathbf{Z}_2^{\top} \mathbf{Z}_2 \vec{\theta} = \mathbf{Z}_2^{\top} \mathbf{Y}.
Assuming \mathbf{Z}_2^{\top} \mathbf{Z}_2 is invertible (which holds when \mathbf{Z}_2 has full column rank), the unique minimizer is: \vec{\theta}^* = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}.
Step 5: Extract optimal parameters
The optimal parameters are obtained by decomposing \vec{\theta}^*: \begin{bmatrix} b^* \\ \vec{w}^* \end{bmatrix} = \vec{\theta}^* = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}, where b^* is the first component and \vec{w}^* \in \mathbb{R}^{d-2} consists of the remaining components.
Final Answer: \boxed{\begin{bmatrix} b^* \\ \vec{w}^* \end{bmatrix} = \left( \mathbf{Z}_2^{\top} \mathbf{Z}_2 \right)^{-1} \mathbf{Z}_2^{\top} \mathbf{Y}} where \mathbf{Z}_2 \in \mathbb{R}^{n \times (d-1)} is the design matrix containing a column of ones followed by the first d-2 features of each training example.
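As a concrete (not required) illustration, one way to carry out this computation in NumPy is to drop the last two feature columns, prepend a column of ones, and solve the resulting least-squares problem; np.linalg.lstsq is used here instead of forming the inverse explicitly. The function name fit_f2 and the array layout are illustrative assumptions, not part of the exam.

```python
import numpy as np

def fit_f2(X, y):
    """Fit f_2: drop the last two features, prepend an intercept column,
    and solve the least-squares problem Z_2 theta = y."""
    n = X.shape[0]
    Z2 = np.hstack([np.ones((n, 1)), X[:, :-2]])      # shape (n, d-1)
    theta, *_ = np.linalg.lstsq(Z2, y, rcond=None)    # solves the normal equation
    b_star, w_star = theta[0], theta[1:]              # intercept, then the d-2 weights
    return b_star, w_star
```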
Using the comparison operators \{ =, \leq, \geq, <, >\}, rank the optimal risk values R_1, R_2, R_3 from least to greatest. Justify your answer.
Answer: R_1 \leq R_3 \leq R_2
Justification:
To compare these three models, we need to analyze their flexibility and representational capacity.
Comparing R_1 and R_3:
Model f_1 is the most general model with d independent weight parameters plus an intercept, giving it (d+1) total parameters.
Model f_3 can be rewritten as: f_3(\vec{w}, b; \vec{x}) = \vec{w}^{(1)}x^{(1)} + \ldots + \vec{w}^{(d-2)}x^{(d-2)} + \vec{w}^{(d-1)}x^{(d-1)} + \vec{w}^{(d-1)}x^{(d)} + b
This is equivalent to f_1 with the constraint that w^{(d-1)} = w^{(d)}. In other words, f_3 is a constrained version of f_1.
Since the hypothesis class of f_1 contains every model that f_3 can represent (impose w^{(d-1)} = w^{(d)} in f_1), the minimum achievable MSE for f_1 can be no larger than that of f_3. Therefore:
R_1 \leq R_3
Comparing R_3 and R_2:
Model f_2 completely removes the last two features from the model, using only features x^{(1)}, \ldots, x^{(d-2)}.
Model f_3 uses all d features but groups the last two with a shared coefficient \vec{w}^{(d-1)}.
We can show that f_3 is more flexible than f_2 by noting that f_2 is a special case of f_3. Specifically, if we set \vec{w}^{(d-1)} = 0 in model f_3, we get:
f_3(\vec{w}, b; \vec{x}) = \vec{w}^{(1)}x^{(1)} + \ldots + \vec{w}^{(d-2)}x^{(d-2)} + 0 \cdot (x^{(d-1)} + x^{(d)}) + b = f_2(\vec{w}, b; \vec{x})
Since f_3 can represent any model that f_2 can represent (plus additional models where \vec{w}^{(d-1)} \neq 0), the minimum achievable MSE for f_3 must be at least as good as that of f_2. Therefore:
R_3 \leq R_2
Final Ranking:
Combining these results, we have:
R_1 \leq R_3 \leq R_2
This ranking makes intuitive sense: f_1 is the most flexible model with the most parameters, allowing it to fit the training data best (lowest MSE). Model f_3 is moderately flexible, incorporating information from all features but with a constraint on the last two weights. Model f_2 is the least flexible, as it discards potentially useful information by completely removing the last two features.
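The nesting argument can also be checked empirically. The sketch below, using arbitrary synthetic data (nothing from the exam), fits all three models by least squares and prints whether the training MSEs satisfy R_1 \leq R_3 \leq R_2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 + rng.normal(scale=0.3, size=n)

def train_mse(Z, y):
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.mean((y - Z @ theta) ** 2)

ones = np.ones((n, 1))
Z1 = np.hstack([ones, X])                                           # f_1: all d features
Z2 = np.hstack([ones, X[:, :-2]])                                   # f_2: last two features dropped
Z3 = np.hstack([ones, X[:, :-2], (X[:, -2] + X[:, -1])[:, None]])   # f_3: shared weight on the sum

R1, R2, R3 = train_mse(Z1, y), train_mse(Z2, y), train_mse(Z3, y)
print(R1 <= R3 <= R2)   # expected: True (up to floating-point tolerance)
```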
Returning to the original model f_1, suppose you were asked instead to eliminate the intercept term, leading to the model formula
f_4(\vec{w};\, \vec{x}) = \vec{w}^\top \vec{x}.
Once again, you train this model by minimizing the associated mean squared error and obtain an optimal MSE denoted R_4.
Explain why R_1 \leq R_4.
Model f_1(\vec{w}, b; \vec{x}) = \vec{w}^\top \vec{x} + b includes an intercept term b, while model f_4(\vec{w}; \vec{x}) = \vec{w}^\top \vec{x} does not have an intercept.
This means f_1 is a more flexible model with one additional parameter compared to f_4. The intercept/bias term allows the model to shift all predictions up or down, enabling it to better match the target values.
Importantly, f_1 can always replicate the behavior of f_4 by simply setting b = 0. Therefore, f_1 can do at least as well as f_4, and possibly better if a non-zero intercept improves the fit.
Since R_1 represents the optimal (minimized) mean squared error for model f_1 and R_4 represents the optimal mean squared error for model f_4, we have:
R_1 \leq R_4
That is, the optimal MSE of f_1 is at most the optimal MSE of f_4.
Assume the following centering conditions hold:
\sum_{i=1}^{n} \vec{x}_i^{(j)} = 0\text{ for each }1\leq j\leq d,\text{ and }\sum_{i=1}^n y_i = 0.
Prove R_1 = R_4.
We need to show that under the centering conditions, the optimal risk for model f_1 (with intercept) equals the optimal risk for model f_4 (without intercept).
For model f_1, the mean squared error is: \text{MSE}_1(\vec{w}, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2
To minimize the MSE with respect to b, we take the partial derivative: \frac{\partial \text{MSE}_1}{\partial b} = \frac{\partial}{\partial b} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2 \right)
= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)
Setting this equal to zero: -\frac{2}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b) = 0
\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \vec{w}^{\top} \vec{x}_i - nb = 0
By the centering condition \sum_{i=1}^{n} y_i = 0, the first term vanishes. The second term also vanishes: since \sum_{i=1}^{n} \vec{x}_i^{(j)} = 0 for each feature j, we have \sum_{i=1}^{n} \vec{x}_i = \vec{0}, so \sum_{i=1}^{n} \vec{w}^{\top} \vec{x}_i = \vec{w}^{\top} \sum_{i=1}^{n} \vec{x}_i = \vec{w}^{\top} \vec{0} = 0.
Therefore 0 - 0 - nb = 0, so the optimal intercept is b^* = 0, regardless of the choice of \vec{w}.
Since the optimal intercept b^* = 0 for model f_1 under the centering conditions, the optimal model becomes: f_1(\vec{w}^*, b^*; \vec{x}) = \vec{w}^{* \top} \vec{x} + 0 = \vec{w}^{* \top} \vec{x}
This is exactly the form of model f_4. Moreover, because the optimal intercept is zero for every choice of \vec{w}, minimizing the MSE over (\vec{w}, b) reduces to minimizing it over \vec{w} with b = 0, which is precisely the optimization problem for f_4. Thus both models achieve the same optimal risk: R_1 = \min_{\vec{w}, b} \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i - b)^2 = \min_{\vec{w}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \vec{w}^{\top} \vec{x}_i)^2 = R_4
Therefore, R_1 = R_4. ∎
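A brief numerical illustration of parts (c) and (d), using arbitrary synthetic data: on raw (uncentered) data the intercept model is at least as good, so R_1 \leq R_4, while after centering the features and targets the two optimal risks coincide. The sketch below is illustrative, not part of the required proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X = rng.normal(loc=1.0, size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.2, size=n)

def mse_with_intercept(X, y):
    Z = np.hstack([np.ones((len(y), 1)), X])
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.mean((y - Z @ theta) ** 2)

def mse_no_intercept(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ w) ** 2)

# Raw data: the intercept can only help, so R_1 <= R_4 (usually strictly)
print(mse_with_intercept(X, y), mse_no_intercept(X, y))

# Centered data: the two optimal risks agree (up to floating-point error)
Xc, yc = X - X.mean(axis=0), y - y.mean()
print(mse_with_intercept(Xc, yc), mse_no_intercept(Xc, yc))
```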
Use the setting of d=1 (a.k.a. simple linear regression) to draw a sketch which illustrates why the result in Part (d) makes sense geometrically.
When d = 1, we have simple linear regression with a single feature:
- Model f_1: y = ax + b (line with intercept)
- Model f_4: y = ax (line through the origin)
The centering conditions become:
- \sum_{i=1}^n x_i = 0 (features are centered)
- \sum_{i=1}^n y_i = 0 (targets are centered)
This means the data has mean (\bar{x}, \bar{y}) = (0, 0), so the data cloud is centered at the origin.
When fitting a line y = ax + b to data centered at the origin using least squares, the optimal intercept is:
b^* = \bar{y} - a^* \bar{x} = 0 - a^* \cdot 0 = 0
This means the best-fit line for f_1 automatically passes through the origin, making it identical to the best-fit line for f_4.
              y
              |    / •
              |   /•
              | •/
              | / •
              |/
  ------------+------------ x
             /|
          • / |
           /• |
        • /   |
              |
Explanation of sketch:
- The data points (•) are scattered around the origin (0, 0) because they are centered.
- Both f_1 and f_4 fit the same line through the origin (the diagonal line).
- For f_1: the optimal intercept is b^* = 0 because the centroid of the data is at (0, 0), and the least squares line always passes through the centroid.
- For f_4: the model is constrained to pass through the origin.
- Since both models produce the same fitted line, they achieve the same minimum MSE: R_1 = R_4.
The centering conditions ensure that the “natural” best-fit line for f_1 passes through the origin, eliminating the advantage that f_1 normally has over f_4 due to the flexibility of choosing the intercept. Therefore, both models perform equally well on centered data.
Source: Summer Session 1 2025 Midterm, Problem 3a-c
Let \{x_i\}_{i=1}^n be a training dataset of scalar values, and suppose we wish to use the constant model
f(c;\, x) = c,\qquad c\in\mathbb{R}.
In various situations it can be useful to emphasize some training examples over others (e.g., due to data quality). For this purpose, suppose \alpha_1, \alpha_2, \dotsc, \alpha_n > 0 are fixed positive weights which are understood as separate from the training data and model parameters.
Find a formula for the minimizer c_1^\ast of the risk function
R_{1}(c) = \frac{1}{n} \sum_{i=1}^n \alpha_i(c - x_i)^2.
Find a formula for the minimizer c_2^\ast of the risk function
R_{2}(c) = \frac{1}{n} \sum_{i=1}^n \alpha_i |c - x_i|.
Which risk function is more sensitive to outliers?
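Although the derivations themselves are the exercise, a small numerical experiment can build intuition for part (c): minimize both weighted risks over c by a simple grid search, add a large outlier, and observe which minimizer moves more. The dataset, weights, and grid resolution below are arbitrary choices for illustration only.

```python
import numpy as np

def minimize_risk(x, alpha, loss):
    """Grid-search the constant c minimizing (1/n) * sum(alpha_i * loss(c - x_i))."""
    grid = np.linspace(x.min() - 1, x.max() + 1, 10_001)
    risks = [np.mean(alpha * loss(c - x)) for c in grid]
    return grid[int(np.argmin(risks))]

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5])

for data in (x, np.append(x, 50.0)):            # second pass adds a large outlier
    alpha = np.ones_like(data)                  # equal weights for this illustration
    c1 = minimize_risk(data, alpha, np.square)  # minimizer of the squared-error risk R_1
    c2 = minimize_risk(data, alpha, np.abs)     # minimizer of the absolute-error risk R_2
    print(c1, c2)                               # c1 shifts far more than c2 once the outlier appears
```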
Source: Summer Session 1 2025 Midterm, Problem 4a-d
An automotive research team wants to build a predictive model that simultaneously estimates two performance metrics of passenger cars:
- city fuel consumption (L/100 km)
- highway fuel consumption (L/100 km)
To capture mechanical and aerodynamic factors, the engineers record the following four features for each vehicle (all measured on the current model year):
- engine displacement (L)
- vehicle mass (kg)
- horsepower
- drag coefficient
They propose the general linear model
f(\mathbf W,\vec b;\,\vec x) = \mathbf W\,\vec x+\vec b,\qquad\mathbf{W}\in\mathbb{R}^{2\times 4},\; \vec{b}\in\mathbb{R}^2,
where \vec x\in\mathbb{R}^{4} denotes the feature vector for a given car. Data for eight different cars are listed below.
| Variable | \vec{x}_1 | \vec{x}_2 | \vec{x}_3 | \vec{x}_4 | \vec{x}_5 | \vec{x}_6 | \vec{x}_7 | \vec{x}_8 |
|---|---|---|---|---|---|---|---|---|
| Engine disp. (L) | 2.0 | 2.5 | 3.0 | 1.8 | 3.5 | 2.2 | 2.8 | 1.6 |
| Mass (kg) | 1300 | 1450 | 1600 | 1250 | 1700 | 1350 | 1500 | 1200 |
| Horsepower | 140 | 165 | 200 | 130 | 250 | 155 | 190 | 115 |
| Drag coeff. | 0.28 | 0.30 | 0.32 | 0.27 | 0.33 | 0.29 | 0.31 | 0.26 |
| City L/100km | 8.5 | 9.2 | 10.8 | 7.8 | 11.5 | 8.9 | 9.8 | 7.2 |
| HWY L/100km | 6.0 | 6.5 | 7.5 | 5.8 | 8.0 | 6.2 | 6.9 | 5.4 |
Write down the design matrix \mathbf Z and the target matrix \mathbf Y from the data in the table.
Compute the weight matrix \mathbf W^{\ast} and bias vector \vec b^{\ast} that minimize the MSE for the given dataset. You can use Python for the computations where needed. You do not need to submit your code, but you do need to write down all matrices and vectors relevant to your computations (round your answers to three decimal places).
With (\mathbf W^{\ast},\vec b^{\ast}), predict the two fuel consumption values for each of the eight cars and report the overall MSE between the predictions and the true targets.
The team now measures two additional vehicles:
| Variable | \vec{x}_9 | \vec{x}_{10} |
|---|---|---|
| Engine disp. (L) | 2.4 | 1.5 |
| Mass (kg) | 1400 | 1150 |
| Horsepower | 170 | 110 |
| Drag coeff. | 0.29 | 0.25 |
| City L/100km | 9.0 | 7.0 |
| HWY L/100km | 6.3 | 5.2 |
Use your trained model to predict fuel consumption for these two cars. Compute the mean-squared error for the two testing examples and state whether you would recommend the model to the engineers.
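One reasonable way to set up these computations in Python is sketched below: build the design and target matrices from the tables, solve the least-squares problem for both outputs at once, and evaluate the fitted model on the two held-out cars. The array layout (rows as cars, \mathbf{W}^\ast recovered as the transpose of the non-intercept rows) is an implementation choice, not the only valid one, and the numeric answers themselves are left to the reader.

```python
import numpy as np

# Features: engine displacement (L), mass (kg), horsepower, drag coefficient
X_train = np.array([
    [2.0, 1300, 140, 0.28], [2.5, 1450, 165, 0.30], [3.0, 1600, 200, 0.32],
    [1.8, 1250, 130, 0.27], [3.5, 1700, 250, 0.33], [2.2, 1350, 155, 0.29],
    [2.8, 1500, 190, 0.31], [1.6, 1200, 115, 0.26],
])
# Targets: city L/100km, highway L/100km
Y_train = np.array([
    [8.5, 6.0], [9.2, 6.5], [10.8, 7.5], [7.8, 5.8],
    [11.5, 8.0], [8.9, 6.2], [9.8, 6.9], [7.2, 5.4],
])
X_test = np.array([[2.4, 1400, 170, 0.29], [1.5, 1150, 110, 0.25]])
Y_test = np.array([[9.0, 6.3], [7.0, 5.2]])

# Design matrix with an intercept column; lstsq handles both outputs at once
Z = np.hstack([np.ones((len(X_train), 1)), X_train])
Theta, *_ = np.linalg.lstsq(Z, Y_train, rcond=None)   # shape (5, 2): row 0 is b*, the rest is W*^T
b_star, W_star = Theta[0], Theta[1:].T

def predict(X):
    return X @ W_star.T + b_star

# MSE averaged over all cars and both outputs (one reasonable convention)
train_mse = np.mean((Y_train - predict(X_train)) ** 2)
test_mse = np.mean((Y_test - predict(X_test)) ** 2)
print(np.round(W_star, 3), np.round(b_star, 3), train_mse, test_mse)
```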
Source: Summer Session 1 2025 Midterm, Problem 5a-c
Let \{(x_i,y_i)\}_{i=1}^n be a dataset of scalar input-output pairs.
Suppose we model y using a simple linear regression model of the form
f(\vec{\theta};\, x) = \vec{\theta}^{(0)} + \vec{\theta}^{(1)}x, \qquad\vec{\theta}\in\mathbb{R}^2.
Prove that the line of best fit (with respect to MSE) passes through the point (\overline{x}, \overline{y}).
Suppose we model y using a simple polynomial regression model of the form
f(\vec{\theta};\, x) = \vec{\theta}^{(0)} + \vec{\theta}^{(1)}x+ \vec{\theta}^{(2)}x^2, \qquad\vec{\theta}\in\mathbb{R}^3.
Prove that the curve of best fit (with respect to MSE) passes through the point (\overline{x}, \overline{y} + \vec\theta^{\ast(2)}((\overline{x})^2 - \overline{x^2})), where
\overline{x^2} = \frac{1}{n}\sum_{i=1}^n x_i^2.
Using the same model as (b), suppose we minimize MSE and find optimal parameters \vec{\theta}^\ast. Further suppose we apply a shifting and scaling operation to the training targets, defining
\widetilde{y_i} = \alpha(y_i - \beta),\qquad\alpha,\beta\in\mathbb{R}.
Find formulas for the new optimal parameters, denoted \vec{\widetilde{\theta}}^\ast, in terms of the old parameters and \alpha, \beta.
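For part (c), a short numerical experiment can suggest the pattern before attempting the algebra: fit the quadratic model to the original targets and to the transformed targets \widetilde{y}_i = \alpha(y_i - \beta), then compare the two parameter vectors componentwise. The data and the values of \alpha and \beta below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
y = 1.5 - 0.7 * x + 0.4 * x**2 + rng.normal(scale=0.2, size=n)

Z = np.column_stack([np.ones(n), x, x**2])   # design matrix for the quadratic model

def fit(targets):
    theta, *_ = np.linalg.lstsq(Z, targets, rcond=None)
    return theta

alpha, beta = 2.0, 0.5
theta_old = fit(y)
theta_new = fit(alpha * (y - beta))           # refit on the transformed targets

print(theta_old)
print(theta_new)   # compare componentwise to conjecture the formulas asked for in (c)
```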