Summer Session 2025 Final Exam



Instructor(s): Sawyer Roberson

This exam was take-home.


Problem 1

Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 1

Consider the following dataset of scalar-valued training examples:

\begin{array}{c|c|c|c} x_1 & x_2 & x_3 & x_4\\\hline 1 & 2 & 2 & 4 \end{array}

We model the collection of x_i’s using the constant model f(c;\, x) = c.


Problem 1.1

  1. Recall the Huber loss function

L_{\rm Huber}(c; x_i) = \begin{cases} \frac{1}{2}(c - x_i)^2, & \text{if } |c - x_i| \leq \frac{1}{2}, \\ \frac{1}{2}|c - x_i| - \frac{1}{8}, & \text{if } |c - x_i| > \frac{1}{2}. \end{cases}

Prove that the empirical risk function R_{\rm Huber}(c) associated with the Huber loss is convex. Is it strictly convex?

The empirical risk function is given by: R_{\rm Huber}(c) = \frac{1}{n}\sum_{i=1}^n L_{\rm Huber}(c; x_i) = \frac{1}{4}\sum_{i=1}^4 L_{\rm Huber}(c; x_i)

where the Huber loss is: L_{\rm Huber}(c; x_i) = \begin{cases}\frac{1}{2}(c - x_i)^2, & \text{if } |c - x_i| \leq \frac{1}{2}, \\ \frac{1}{2}|c - x_i| - \frac{1}{8}, & \text{if } |c - x_i| > \frac{1}{2}. \end{cases}

Proving Convexity

To prove that R_{\rm Huber}(c) is convex, we will show that each individual loss function L_{\rm Huber}(c; x_i) is convex, then use the fact that the sum of convex functions is convex.

Step 1: Show each piece of the Huber loss is convex

For the first piece, when |c - x_i| \leq \frac{1}{2}:

  • L_1(c) = \frac{1}{2}(c - x_i)^2
  • The second derivative is \frac{d^2L_1}{dc^2} = 1 > 0.
  • Therefore, this piece is strictly convex.

For the second piece, when |c - x_i| > \frac{1}{2}:

  • L_2(c) = \frac{1}{2}|c - x_i| - \frac{1}{8}
  • This is an affine transformation of the absolute value function |c - x_i|, which is known to be convex.
  • The absolute value function is convex because for any \lambda \in [0,1] and any c_1, c_2: |\lambda c_1 + (1-\lambda)c_2 - x_i| \leq \lambda|c_1 - x_i| + (1-\lambda)|c_2 - x_i| (triangle inequality)
  • Therefore, this piece is convex.

Step 2: Verify the function is continuous and properly defined at the boundary

At the boundary where |c - x_i| = \frac{1}{2}:

  • From piece 1: L_1(c) = \frac{1}{2}\left(\frac{1}{2}\right)^2 = \frac{1}{8}
  • From piece 2: L_2(c) = \frac{1}{2}\left(\frac{1}{2}\right) - \frac{1}{8} = \frac{1}{4} - \frac{1}{8} = \frac{1}{8}

The values match, so L_{\rm Huber}(c; x_i) is continuous.

Step 3: Apply convexity properties

Since:

  1. Each L_{\rm Huber}(c; x_i) is convex (as shown above),
  2. A non-negative weighted sum of convex functions is convex, and
  3. R_{\rm Huber}(c) = \frac{1}{4}\sum_{i=1}^4 L_{\rm Huber}(c; x_i) is a non-negative weighted sum,

Therefore, R_{\rm Huber}(c) is convex.

Is it Strictly Convex?

No, the empirical risk function is not strictly convex.

The Huber loss contains a linear piece (the absolute value term \frac{1}{2}|c - x_i| - \frac{1}{8} when |c - x_i| > \frac{1}{2}).

A strictly convex function must satisfy: f(\lambda c_1 + (1-\lambda)c_2) < \lambda f(c_1) + (1-\lambda)f(c_2) for all c_1 \neq c_2 and \lambda \in (0,1).

However, in the linear region of the Huber loss, we have equality rather than strict inequality, which means the function is not strictly convex.

Specifically, if both c_1 and c_2 are far enough from some x_i (i.e., both fall in the linear region |c - x_i| > \frac{1}{2}), then the Huber loss for that data point will be linear between these two points, violating strict convexity.
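
As a numerical sanity check on both claims (a minimal sketch, not part of the exam solution), the snippet below tests the midpoint-convexity inequality for R_{\rm Huber} on the dataset \{1, 2, 2, 4\} at random pairs of points, then exhibits equality at a pair of points lying entirely in the linear region, confirming convexity but not strict convexity.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 4.0])

def huber_risk(c):
    # Empirical Huber risk with threshold 1/2, averaged over the four data points.
    r = np.abs(c - data)
    losses = np.where(r <= 0.5, 0.5 * (c - data) ** 2, 0.5 * r - 0.125)
    return losses.mean()

rng = np.random.default_rng(0)   # arbitrary seed
lam = 0.3                        # any value in (0, 1) works

# Convexity: R(lam*c1 + (1-lam)*c2) <= lam*R(c1) + (1-lam)*R(c2).
for _ in range(1000):
    c1, c2 = rng.uniform(-5.0, 10.0, size=2)
    assert huber_risk(lam * c1 + (1 - lam) * c2) <= lam * huber_risk(c1) + (1 - lam) * huber_risk(c2) + 1e-12

# Not strictly convex: for c1, c2 both far below every data point, each per-point
# loss is in its linear piece, so the inequality holds with equality.
c1, c2 = -3.0, -2.0
lhs = huber_risk(lam * c1 + (1 - lam) * c2)
rhs = lam * huber_risk(c1) + (1 - lam) * huber_risk(c2)
print(np.isclose(lhs, rhs))  # True
```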


Problem 1.2

  1. Implement four iterations (not including initial guess) of gradient descent with step size \eta = 1 and initial guess c_0 = \frac{3}{2} to find approximate minimizer c^\ast_{\rm Huber}. Include a table of (c_t, R_{\rm Huber}(c_t), R_{\rm Huber}'(c_t)) for t=0,1,\dots,4.

Setup

We have the dataset: x_1 = 1, x_2 = 2, x_3 = 2, x_4 = 4

The Huber loss function is: L_{\rm Huber}(c; x_i) = \begin{cases} \frac{1}{2}(c - x_i)^2, & \text{if } |c - x_i| \leq \frac{1}{2} \\ \frac{1}{2}|c - x_i| - \frac{1}{8}, & \text{if } |c - x_i| > \frac{1}{2} \end{cases}

The empirical risk is: R_{\rm Huber}(c) = \frac{1}{4}\sum_{i=1}^{4} L_{\rm Huber}(c; x_i)

Derivative of Huber Loss

The derivative of the Huber loss with respect to c is: \frac{\partial L_{\rm Huber}(c; x_i)}{\partial c} = \begin{cases} c - x_i, & \text{if } |c - x_i| \leq \frac{1}{2} \\ \frac{1}{2}\text{sign}(c - x_i), & \text{if } |c - x_i| > \frac{1}{2} \end{cases}

Therefore: R'_{\rm Huber}(c) = \frac{1}{4}\sum_{i=1}^{4} \frac{\partial L_{\rm Huber}(c; x_i)}{\partial c}

Gradient Descent Update Rule

With step size \eta = 1: c_{t+1} = c_t - \eta \cdot R'_{\rm Huber}(c_t) = c_t - R'_{\rm Huber}(c_t)


Iteration t=0: Initial Guess

c_0 = \frac{3}{2} = 1.5

Computing individual loss derivatives:

  • x_1 = 1: |c_0 - x_1| = |1.5 - 1| = 0.5 \leq 0.5 → Quadratic case
    • L_{\rm Huber}(c_0; x_1) = \frac{1}{2}(0.5)^2 = 0.125
    • \frac{\partial L}{\partial c} = 0.5

  • x_2 = 2: |c_0 - x_2| = |1.5 - 2| = 0.5 \leq 0.5 → Quadratic case
    • L_{\rm Huber}(c_0; x_2) = \frac{1}{2}(-0.5)^2 = 0.125
    • \frac{\partial L}{\partial c} = -0.5
  • x_3 = 2: |c_0 - x_3| = |1.5 - 2| = 0.5 \leq 0.5 → Quadratic case
    • L_{\rm Huber}(c_0; x_3) = \frac{1}{2}(-0.5)^2 = 0.125
    • \frac{\partial L}{\partial c} = -0.5
  • x_4 = 4: |c_0 - x_4| = |1.5 - 4| = 2.5 > 0.5 → Linear case
    • L_{\rm Huber}(c_0; x_4) = \frac{1}{2}(2.5) - \frac{1}{8} = 1.125
    • \frac{\partial L}{\partial c} = \frac{1}{2}\text{sign}(-2.5) = -0.5

Computing empirical risk and derivative: R_{\rm Huber}(c_0) = \frac{1}{4}(0.125 + 0.125 + 0.125 + 1.125) = \frac{1.5}{4} = 0.375

R'_{\rm Huber}(c_0) = \frac{1}{4}(0.5 - 0.5 - 0.5 - 0.5) = -0.25

Update: c_1 = 1.5 - (-0.25) = 1.75


Iteration t=1

c_1 = 1.75

Computing individual loss derivatives:

  • x_1 = 1: |c_1 - x_1| = 0.75 > 0.5 → Linear case
    • L = \frac{1}{2}(0.75) - \frac{1}{8} = 0.25
    • \frac{\partial L}{\partial c} = 0.5

  • x_2 = 2: |c_1 - x_2| = 0.25 \leq 0.5 → Quadratic case
    • L = \frac{1}{2}(0.25)^2 = 0.03125
    • \frac{\partial L}{\partial c} = -0.25
  • x_3 = 2: |c_1 - x_3| = 0.25 \leq 0.5 → Quadratic case
    • L = \frac{1}{2}(0.25)^2 = 0.03125
    • \frac{\partial L}{\partial c} = -0.25
  • x_4 = 4: |c_1 - x_4| = 2.25 > 0.5 → Linear case
    • L = \frac{1}{2}(2.25) - \frac{1}{8} = 1
    • \frac{\partial L}{\partial c} = -0.5

Computing empirical risk and derivative: R_{\rm Huber}(c_1) = \frac{1}{4}(0.25 + 0.03125 + 0.03125 + 1) = 0.328125

R'_{\rm Huber}(c_1) = \frac{1}{4}(0.5 - 0.25 - 0.25 - 0.5) = -0.125

Update: c_2 = 1.75 - (-0.125) = 1.875


Iteration t=2

c_2 = 1.875

Computing individual loss derivatives:

  • x_1 = 1: |c_2 - x_1| = 0.875 > 0.5 → Linear case
    • L = \frac{1}{2}(0.875) - \frac{1}{8} = 0.3125
    • \frac{\partial L}{\partial c} = 0.5

  • x_2 = 2: |c_2 - x_2| = 0.125 \leq 0.5 → Quadratic case
    • L = \frac{1}{2}(0.125)^2 = 0.0078125
    • \frac{\partial L}{\partial c} = -0.125
  • x_3 = 2: |c_2 - x_3| = 0.125 \leq 0.5 → Quadratic case
    • L = \frac{1}{2}(0.125)^2 = 0.0078125
    • \frac{\partial L}{\partial c} = -0.125
  • x_4 = 4: |c_2 - x_4| = 2.125 > 0.5 → Linear case
    • L = \frac{1}{2}(2.125) - \frac{1}{8} = 0.9375
    • \frac{\partial L}{\partial c} = -0.5

Computing empirical risk and derivative: R_{\rm Huber}(c_2) = \frac{1}{4}(0.3125 + 0.0078125 + 0.0078125 + 0.9375) = 0.31640625

R'_{\rm Huber}(c_2) = \frac{1}{4}(0.5 - 0.125 - 0.125 - 0.5) = -0.0625

Update: c_3 = 1.875 - (-0.0625) = 1.9375


Iteration t=3

c_3 = 1.9375

Computing individual loss derivatives:

  • x_1 = 1: |c_3 - x_1| = 0.9375 > 0.5 → Linear case
    • L = \frac{1}{2}(0.9375) - \frac{1}{8} = 0.34375
    • \frac{\partial L}{\partial c} = 0.5

  • x_2 = 2: |c_3 - x_2| = 0.0625 \leq 0.5 → Quadratic case
    • L = \frac{1}{2}(0.0625)^2 = 0.001953125
    • \frac{\partial L}{\partial c} = -0.0625
  • x_3 = 2: |c_3 - x_3| = 0.0625 \leq 0.5 → Quadratic case
    • L = \frac{1}{2}(0.0625)^2 = 0.001953125
    • \frac{\partial L}{\partial c} = -0.0625
  • x_4 = 4: |c_3 - x_4| = 2.0625 > 0.5 → Linear case
    • L = \frac{1}{2}(2.0625) - \frac{1}{8} = 0.90625
    • \frac{\partial L}{\partial c} = -0.5

Computing empirical risk and derivative: R_{\rm Huber}(c_3) = \frac{1}{4}(0.34375 + 0.001953125 + 0.001953125 + 0.90625) = 0.3134765625

R'_{\rm Huber}(c_3) = \frac{1}{4}(0.5 - 0.0625 - 0.0625 - 0.5) = -0.03125

Update: c_4 = 1.9375 - (-0.03125) = 1.96875


Iteration t=4

c_4 = 1.96875

Computing individual loss derivatives:

  • x_1 = 1: |c_4 - x_1| = 0.96875 > 0.5 → Linear case
    • L = \frac{1}{2}(0.96875) - \frac{1}{8} = 0.359375
    • \frac{\partial L}{\partial c} = 0.5

  • x_2 = 2: |c_4 - x_2| = 0.03125 \leq 0.5 → Quadratic case
    • L = \frac{1}{2}(0.03125)^2 = 0.00048828125
    • \frac{\partial L}{\partial c} = -0.03125
  • x_3 = 2: |c_4 - x_3| = 0.03125 \leq 0.5 → Quadratic case
    • L = \frac{1}{2}(0.03125)^2 = 0.00048828125
    • \frac{\partial L}{\partial c} = -0.03125
  • x_4 = 4: |c_4 - x_4| = 2.03125 > 0.5 → Linear case
    • L = \frac{1}{2}(2.03125) - \frac{1}{8} = 0.890625
    • \frac{\partial L}{\partial c} = -0.5

Computing empirical risk and derivative: R_{\rm Huber}(c_4) = \frac{1}{4}(0.359375 + 0.00048828125 + 0.00048828125 + 0.890625) \approx 0.3127

R'_{\rm Huber}(c_4) = \frac{1}{4}(0.5 - 0.03125 - 0.03125 - 0.5) = -0.015625


Summary Table

\begin{array}{c|c|c|c} t & c_t & R_{\rm Huber}(c_t) & R'_{\rm Huber}(c_t)\\\hline 0 & 1.5 & 0.375 & -0.25 \\ 1 & 1.75 & 0.328125 & -0.125 \\ 2 & 1.875 & 0.31640625 & -0.0625 \\ 3 & 1.9375 & 0.3134765625 & -0.03125 \\ 4 & 1.96875 & 0.3127 & -0.015625 \end{array}

Final Answer

After four iterations of gradient descent with \eta = 1 and c_0 = \frac{3}{2}, we obtain:

c^\ast_{\rm Huber} \approx 1.96875
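
The table above can be reproduced programmatically; the following is a minimal sketch (not required by the exam) that runs the same four gradient-descent steps with \eta = 1 and c_0 = \frac{3}{2} and prints each (c_t, R_{\rm Huber}(c_t), R'_{\rm Huber}(c_t)) triple.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 4.0])

def risk(c):
    # Empirical Huber risk (threshold 1/2), averaged over the dataset.
    r = np.abs(c - data)
    return np.where(r <= 0.5, 0.5 * (c - data) ** 2, 0.5 * r - 0.125).mean()

def risk_grad(c):
    # Per-point derivative: (c - x_i) in the quadratic region, 0.5*sign(c - x_i) otherwise.
    r = np.abs(c - data)
    return np.where(r <= 0.5, c - data, 0.5 * np.sign(c - data)).mean()

eta, c = 1.0, 1.5
for t in range(5):                     # t = 0, 1, 2, 3, 4
    print(t, c, risk(c), risk_grad(c))
    c = c - eta * risk_grad(c)         # gradient descent update
```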


Problem 1.3

  1. Compute the minimizers c^\ast_{\rm abs} and c^\ast_{\rm sq} for the mean absolute error and mean squared error, respectively, and rank c^\ast_{\rm Huber}, c^\ast_{\rm abs}, c^\ast_{\rm sq} from least to greatest.

Minimizer for mean absolute error

The mean absolute error is minimized at a median of the data. Sorting the dataset gives (1, 2, 2, 4); both middle values equal 2, so c^\ast_{\rm abs} = 2.

Minimizer for mean squared error

The mean squared error is minimized at the mean of the data: c^\ast_{\rm sq} = \frac{1 + 2 + 2 + 4}{4} = \frac{9}{4} = 2.25.

Ranking

From part (b), gradient descent gives c^\ast_{\rm Huber} \approx 1.97, with the iterates increasing toward c = 2. Indeed, for c \in (1.5, 2.5) the risk derivative is R'_{\rm Huber}(c) = \frac{1}{4}\left(0.5 + (c - 2) + (c - 2) - 0.5\right) = \frac{c - 2}{2}, which vanishes at c = 2, so the exact Huber minimizer is 2. Ranking from least to greatest:

c^\ast_{\rm Huber} \leq c^\ast_{\rm abs} < c^\ast_{\rm sq}

Using the gradient-descent approximation from part (b): c^\ast_{\rm Huber} \approx 1.97 < c^\ast_{\rm abs} = 2 < c^\ast_{\rm sq} = 2.25.
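
A quick numerical cross-check of these three values (a sketch, not required by the exam): the median and mean are computed directly, and the Huber minimizer is approximated by a fine grid search over the risk.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 4.0])

def huber_risk(c):
    r = np.abs(c - data)
    return np.where(r <= 0.5, 0.5 * (c - data) ** 2, 0.5 * r - 0.125).mean()

c_abs = np.median(data)   # minimizer of mean absolute error
c_sq = np.mean(data)      # minimizer of mean squared error

grid = np.linspace(0.0, 5.0, 50001)                        # step size 1e-4
c_huber = grid[np.argmin([huber_risk(c) for c in grid])]   # grid-search approximation

print(c_huber, c_abs, c_sq)   # approximately 2.0, 2.0, 2.25
```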



Problem 2

Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 2

Consider training data \{(\vec{x}_i, y_i)\}_{i=1}^n, with \vec{x}_i\in \mathbb{R}^d and y_i \in \mathbb{R}. Let \mathbf{Z} be the design matrix, \vec{y} = [y_1 \dots y_n]^T, and model f(\vec{w},b;\vec{x}) = \vec{w}^\top \vec{x} + b.


Problem 2.1

  1. Suppose \vec{x}_i^{(1)} = \lambda \vec{x}_i^{(2)} for some \lambda \neq 0. Prove that a minimizer \theta^\ast = [b^\ast\;\vec{w}^\ast]^T for the mean squared error is not unique.

To prove that the MSE minimizer is not unique, we will show that if \theta^\ast = [b^\ast\; w_1^\ast\; w_2^\ast\; \cdots\; w_d^\ast]^\top is a minimizer, then we can construct a different parameter vector that achieves the same MSE.

Suppose \theta^\ast = [b^\ast\; w_1^\ast\; w_2^\ast\; w_3^\ast\; \cdots\; w_d^\ast]^\top is a minimizer for the MSE. The predictions for each training example are:

f(\theta^\ast; \vec{x}_i) = b^\ast + w_1^\ast \vec{x}_i^{(1)} + w_2^\ast \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}

Since \vec{x}_i^{(1)} = \lambda \vec{x}_i^{(2)} for all i, we can substitute:

f(\theta^\ast; \vec{x}_i) = b^\ast + w_1^\ast (\lambda \vec{x}_i^{(2)}) + w_2^\ast \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}

Factoring out \vec{x}_i^{(2)}:

f(\theta^\ast; \vec{x}_i) = b^\ast + (\lambda w_1^\ast + w_2^\ast) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}

Now, for any \alpha \in \mathbb{R}, define a new parameter vector:

\widetilde{\theta} = [b^\ast\; (w_1^\ast + \alpha)\; (w_2^\ast - \lambda\alpha)\; w_3^\ast\; \cdots\; w_d^\ast]^\top

Let’s verify that this new parameter vector produces the same predictions:

f(\widetilde{\theta}; \vec{x}_i) = b^\ast + (w_1^\ast + \alpha) \vec{x}_i^{(1)} + (w_2^\ast - \lambda\alpha) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}

Substituting \vec{x}_i^{(1)} = \lambda \vec{x}_i^{(2)}:

f(\widetilde{\theta}; \vec{x}_i) = b^\ast + (w_1^\ast + \alpha)(\lambda \vec{x}_i^{(2)}) + (w_2^\ast - \lambda\alpha) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}

= b^\ast + (\lambda w_1^\ast + \lambda\alpha) \vec{x}_i^{(2)} + (w_2^\ast - \lambda\alpha) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}

= b^\ast + (\lambda w_1^\ast + w_2^\ast) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}

This is exactly the same as f(\theta^\ast; \vec{x}_i).

Since \widetilde{\theta} produces identical predictions for all training examples, it achieves the same MSE as \theta^\ast:

R(\widetilde{\theta}) = \frac{1}{n}\sum_{i=1}^n (y_i - f(\widetilde{\theta}; \vec{x}_i))^2 = \frac{1}{n}\sum_{i=1}^n (y_i - f(\theta^\ast; \vec{x}_i))^2 = R(\theta^\ast)

For any nonzero \alpha, we have \widetilde{\theta} \neq \theta^\ast, yet both minimize the MSE. Therefore, the minimizer is not unique.
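
The construction can be checked numerically. The sketch below uses synthetic data (purely illustrative: the sizes, seed, \lambda = 3, and \alpha = 0.7 are arbitrary choices), fits one least-squares solution, applies the (w_1 + \alpha,\ w_2 - \lambda\alpha) perturbation, and confirms that both parameter vectors give identical predictions and identical MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 4, 3.0

X = rng.normal(size=(n, d))
X[:, 0] = lam * X[:, 1]            # feature 1 is an exact multiple of feature 2
y = rng.normal(size=n)

Z = np.column_stack([np.ones(n), X])            # design matrix with bias column
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # one MSE minimizer (lstsq tolerates the rank deficiency)

alpha = 0.7
theta_tilde = theta.copy()
theta_tilde[1] += alpha            # w1 -> w1 + alpha
theta_tilde[2] -= lam * alpha      # w2 -> w2 - lam * alpha

mse = lambda t: np.mean((y - Z @ t) ** 2)
print(np.allclose(Z @ theta, Z @ theta_tilde))   # True: identical predictions
print(np.isclose(mse(theta), mse(theta_tilde)))  # True: two distinct minimizers of the MSE
```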


Problem 2.2

  1. Let R_1 be optimal MSE with all features. Delete the first feature to get new MSE R_2. Is R_2 > R_1? Explain.

Answer: No, R_2 = R_1 (not R_2 > R_1).

Explanation:

From part (a), we know that the first feature is an exact scalar multiple of the second feature: \vec{x}_i^{(1)} = \lambda \vec{x}_i^{(2)} for all training examples.

When we have the full model with both features, the prediction for example i is: \hat{y}_i = b + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)} + \cdots + w_d x_i^{(d)}

Since x_i^{(1)} = \lambda x_i^{(2)}, we can rewrite this as: \hat{y}_i = b + w_1 (\lambda x_i^{(2)}) + w_2 x_i^{(2)} + w_3 x_i^{(3)} + \cdots + w_d x_i^{(d)} = b + (w_1 \lambda + w_2) x_i^{(2)} + w_3 x_i^{(3)} + \cdots + w_d x_i^{(d)}

This shows that any prediction achievable with both features 1 and 2 can be achieved using only feature 2 by setting the coefficient on feature 2 to be (w_1 \lambda + w_2).

When we delete feature 1 and retrain, the new model is: \hat{y}_i = b' + w_2' x_i^{(2)} + w_3' x_i^{(3)} + \cdots + w_d' x_i^{(d)}

Since the second feature can “absorb” the contribution of the first feature (by adjusting its weight), the model without feature 1 has the same expressive power as the model with both features. Therefore, it can achieve the same minimum MSE.

Specifically, if (b^\ast, w_1^\ast, w_2^\ast, \ldots, w_d^\ast) minimizes R_1, then (b^\ast, w_2^\ast + \lambda w_1^\ast, w_3^\ast, \ldots, w_d^\ast) minimizes R_2 and produces identical predictions, so:

R_2 = R_1

The first feature is redundant given the second feature, so removing it does not reduce the model’s ability to fit the training data.


Problem 2.3

  1. Modify design matrix by adding interaction terms \vec{x}^{(3)}\vec{x}^{(4)}, \vec{x}^{(1)}\vec{x}^{(2)}, polynomial terms (\vec{x}^{(1)})^2, (\vec{x}^{(3)})^2, and removing \vec{x}^{(1)}, \vec{x}^{(5)}. Is \widetilde{\mathbf{Z}}^\top \widetilde{\mathbf{Z}} invertible?

Answer: No, \mathbf{\widetilde{Z}}^\top \mathbf{\widetilde{Z}} is not invertible.

Explanation

Recall that \mathbf{\widetilde{Z}}^\top \mathbf{\widetilde{Z}} is invertible if and only if the columns of \mathbf{\widetilde{Z}} are linearly independent. We will show that the columns of \mathbf{\widetilde{Z}} are linearly dependent, which means \mathbf{\widetilde{Z}}^\top \mathbf{\widetilde{Z}} cannot be invertible.

Features in \mathbf{\widetilde{Z}}

After performing the three modifications, the design matrix \mathbf{\widetilde{Z}} contains the following feature columns:

  • The bias column (all 1s)
  • Original features: \vec{x}^{(2)}, \vec{x}^{(3)}, \vec{x}^{(4)}, \vec{x}^{(6)}, \ldots, \vec{x}^{(d)} (note that \vec{x}^{(1)} and \vec{x}^{(5)} were removed)
  • New interaction terms: \vec{x}^{(3)}\vec{x}^{(4)} and \vec{x}^{(1)}\vec{x}^{(2)}
  • New polynomial terms: (\vec{x}^{(1)})^2 and (\vec{x}^{(3)})^2

Key Observation

From part (a), we know that \vec{x}^{(1)} = \lambda \vec{x}^{(2)} where \lambda \neq 0.

Using this relationship, we can express the new polynomial and interaction features in terms of \vec{x}^{(2)}:

\vec{x}^{(1)}\vec{x}^{(2)} = (\lambda \vec{x}^{(2)}) \cdot \vec{x}^{(2)} = \lambda (\vec{x}^{(2)})^2

(\vec{x}^{(1)})^2 = (\lambda \vec{x}^{(2)})^2 = \lambda^2 (\vec{x}^{(2)})^2

Linear Dependence

From the equations above, we can write:

(\vec{x}^{(1)})^2 = \lambda^2 (\vec{x}^{(2)})^2 = \lambda \cdot \left[\lambda (\vec{x}^{(2)})^2\right] = \lambda \cdot \vec{x}^{(1)}\vec{x}^{(2)}

This shows that the feature (\vec{x}^{(1)})^2 is a scalar multiple of the feature \vec{x}^{(1)}\vec{x}^{(2)}.

Since one column of \mathbf{\widetilde{Z}} (the (\vec{x}^{(1)})^2 column) can be written as a scalar multiple of another column (the \vec{x}^{(1)}\vec{x}^{(2)} column), the columns of \mathbf{\widetilde{Z}} are linearly dependent.

Conclusion

Because the columns of \mathbf{\widetilde{Z}} are linearly dependent, the matrix \mathbf{\widetilde{Z}}^\top \mathbf{\widetilde{Z}} is not invertible.
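
A small numeric illustration (synthetic data; the only structural assumption carried over is \vec{x}^{(1)} = \lambda \vec{x}^{(2)} from part (a), with a reduced set of other features for brevity) confirms the rank deficiency:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 30, 2.0
x2 = rng.normal(size=n)
x1 = lam * x2                          # the collinearity from part (a)
x3, x4, x6 = rng.normal(size=(3, n))   # stand-ins for the other (unrelated) features

Z_tilde = np.column_stack([
    np.ones(n),         # bias column
    x2, x3, x4, x6,     # remaining original features (x1 and x5 removed)
    x3 * x4, x1 * x2,   # interaction terms
    x1 ** 2, x3 ** 2,   # polynomial terms
])

G = Z_tilde.T @ Z_tilde
print(np.linalg.matrix_rank(Z_tilde), Z_tilde.shape[1])   # 8 9: one linear dependency among the columns
print(np.linalg.matrix_rank(G), G.shape[0])               # Gram matrix is rank-deficient, hence not invertible
```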



Problem 3

Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 3

Let \mathbf{Z} be the design matrix and \vec{y} the label vector. Assume \mathbf{Z}^\top \mathbf{Z} invertible and \theta^\ast unique for MSE. Let residuals \vec{r} = \vec{y} - \mathbf{Z}\vec{\theta^\ast}.


Problem 3.1

  1. Prove that \vec{r} is orthogonal to every column of \mathbf{Z}.

Solution:

Recall that \theta^* = [b^*\;\vec{w}^*]^\top is the unique minimizer of the mean squared error (MSE):

R(\theta) = \frac{1}{n}\|\vec{y} - \mathbf{Z}\theta\|^2

At a minimum, the gradient of R(\theta) with respect to \theta must equal zero. Let’s compute this gradient.

First, we can expand: R(\theta) = \frac{1}{n}(\vec{y} - \mathbf{Z}\theta)^\top(\vec{y} - \mathbf{Z}\theta)

Taking the gradient with respect to \theta: \nabla_\theta R(\theta) = \frac{1}{n} \cdot 2\mathbf{Z}^\top(\mathbf{Z}\theta - \vec{y}) = \frac{2}{n}\mathbf{Z}^\top(\mathbf{Z}\theta - \vec{y})

At the minimizer \theta^*, we have: \nabla_\theta R(\theta^*) = \vec{0}

Therefore: \frac{2}{n}\mathbf{Z}^\top(\mathbf{Z}\theta^* - \vec{y}) = \vec{0}

Simplifying: \mathbf{Z}^\top(\mathbf{Z}\theta^* - \vec{y}) = \vec{0}

Multiplying both sides by -1: \mathbf{Z}^\top(\vec{y} - \mathbf{Z}\theta^*) = \vec{0}

By definition, \vec{r} = \vec{y} - \mathbf{Z}\theta^*, so: \mathbf{Z}^\top\vec{r} = \vec{0}

Now, \mathbf{Z}^\top\vec{r} is a (d+1) \times 1 vector whose j-th entry is the dot product of the j-th column of \mathbf{Z} with the residual vector \vec{r}:

(\mathbf{Z}^\top\vec{r})_j = [\vec{z}_{1}^{j}\; \vec{z}_{2}^{j}\; \dotsc\; \vec{z}_{n}^{j}]\vec{r}

Since \mathbf{Z}^\top\vec{r} = \vec{0}, each component equals zero: [\vec{z}_{1}^{j}\; \vec{z}_{2}^{j}\; \dotsc\; \vec{z}_{n}^{j}]\vec{r} = 0 \quad \text{for each } 1 \leq j \leq d+1

This proves that \vec{r} is orthogonal to every column of \mathbf{Z}.
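
The orthogonality condition \mathbf{Z}^\top\vec{r} = \vec{0} is easy to verify numerically; the sketch below (random synthetic data) solves the normal equations and checks it, along with the residual-sum property used in part (b).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

Z = np.column_stack([np.ones(n), X])              # bias column plus features
theta_star = np.linalg.solve(Z.T @ Z, Z.T @ y)    # unique least-squares minimizer
r = y - Z @ theta_star                            # residual vector

print(np.allclose(Z.T @ r, 0.0))   # True: r is orthogonal to every column of Z
print(np.isclose(r.sum(), 0.0))    # True: residuals sum to zero (the all-ones column)
```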


Problem 3.2

  1. Explain why this implies \sum_{i=1}^n r_i = 0.

From part (a), we proved that the residual vector \vec{r} is orthogonal to every column of the design matrix \mathbf{Z}. That is, for each column j where 1 \leq j \leq d+1:

[\vec{z}_{1}^{j}\; \vec{z}_{2}^{j}\; \dotsc\; \vec{z}_{n}^{j}] \cdot \vec{r} = 0

Now, recall that in multiple linear regression, the design matrix \mathbf{Z} has the form:

\mathbf{Z} = \begin{bmatrix} 1 & \vec{x}_1^\top \\ 1 & \vec{x}_2^\top \\ \vdots & \vdots \\ 1 & \vec{x}_n^\top \end{bmatrix}

The first column of \mathbf{Z} corresponds to the intercept/bias term and is simply a column of all ones:

\text{Column 1 of } \mathbf{Z} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}

Since \vec{r} is orthogonal to every column of \mathbf{Z}, it must be orthogonal to this first column. Therefore:

\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \cdot \vec{r} = 0

Computing this dot product explicitly:

\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \cdot \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix} = 1 \cdot r_1 + 1 \cdot r_2 + \cdots + 1 \cdot r_n = \sum_{i=1}^n r_i

Since this dot product equals zero, we have:

\sum_{i=1}^n r_i = 0

This shows that the residuals sum to zero, which is a fundamental property of least squares regression.


Problem 3.3

  1. Define \mathbf{P} = \mathbf{Z} (\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top. Prove \mathbf{P} is symmetric.

To prove that \mathbf{P} is symmetric, we need to show that \mathbf{P}^\top = \mathbf{P}.

Starting with the definition of \mathbf{P}:

\mathbf{P} = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top

Take the transpose of both sides:

\mathbf{P}^\top = \left[\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\right]^\top

Using the property that (ABC)^\top = C^\top B^\top A^\top, we have:

\mathbf{P}^\top = (\mathbf{Z}^\top)^\top \left[(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]^\top \mathbf{Z}^\top

Simplifying (\mathbf{Z}^\top)^\top = \mathbf{Z}:

\mathbf{P}^\top = \mathbf{Z} \left[(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]^\top \mathbf{Z}^\top

Now we need to simplify \left[(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]^\top.

First, note that \mathbf{Z}^\top\mathbf{Z} is symmetric. To see this:

(\mathbf{Z}^\top\mathbf{Z})^\top = \mathbf{Z}^\top(\mathbf{Z}^\top)^\top = \mathbf{Z}^\top\mathbf{Z}

Since \mathbf{Z}^\top\mathbf{Z} is symmetric, we have (\mathbf{Z}^\top\mathbf{Z})^\top = \mathbf{Z}^\top\mathbf{Z}.

Using the property that (A^{-1})^\top = (A^\top)^{-1}:

\left[(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]^\top = \left[(\mathbf{Z}^\top\mathbf{Z})^\top\right]^{-1} = (\mathbf{Z}^\top\mathbf{Z})^{-1}

Substituting this back:

\mathbf{P}^\top = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top = \mathbf{P}

Therefore, \mathbf{P} is symmetric. \square


Problem 3.4

  1. Prove that \mathbf{P} is idempotent.

We need to show that \mathbf{P}^2 = \mathbf{P} where \mathbf{P} = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top.

Let’s compute \mathbf{P}^2:

\mathbf{P}^2 = \mathbf{P} \cdot \mathbf{P} = \left[\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\right] \left[\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\right]

Using the associativity of matrix multiplication, we can regroup:

\mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top

The key is to recognize that the middle part \mathbf{Z}^\top\mathbf{Z} multiplied by its inverse (\mathbf{Z}^\top\mathbf{Z})^{-1} gives the identity:

\mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\left[\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]\mathbf{Z}^\top

Since \mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1} = \mathbf{I}_{d+1}, where \mathbf{I}_{d+1} is the (d+1) \times (d+1) identity matrix:

\mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{I}_{d+1}\mathbf{Z}^\top

Since multiplying by the identity matrix doesn’t change the result:

\mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top

\mathbf{P}^2 = \mathbf{P}

Therefore, \mathbf{P} is idempotent. \square
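
Both properties of \mathbf{P} can be confirmed numerically for any full-column-rank design matrix; a minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 40, 4
Z = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # full-column-rank design matrix

P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # the projection ("hat") matrix

print(np.allclose(P, P.T))     # True: symmetric
print(np.allclose(P @ P, P))   # True: idempotent
```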


Common Mistake

A common error when proving idempotency is to incorrectly simplify after recognizing that (\mathbf{Z}^\top\mathbf{Z})^{-1}(\mathbf{Z}^\top\mathbf{Z}) = \mathbf{I}_{d+1}.

Incorrect approach:

After correctly writing: \mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top

Some students might write: \mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{I}_{d+1}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top

This is wrong because it treats \mathbf{Z}^\top\mathbf{Z} as if it were the identity: only the middle \mathbf{Z}^\top\mathbf{Z} has been replaced by \mathbf{I}_{d+1}, while both inverse factors are left in the expression.

What went wrong: In general \mathbf{Z}^\top\mathbf{Z} \neq \mathbf{I}_{d+1}, so it cannot be replaced by the identity on its own. The identity only arises from the product of \mathbf{Z}^\top\mathbf{Z} with its inverse (\mathbf{Z}^\top\mathbf{Z})^{-1}.

Correct grouping: Pair the middle \mathbf{Z}^\top\mathbf{Z} with one of its adjacent inverse factors (either \mathbf{Z}^\top\mathbf{Z}\,(\mathbf{Z}^\top\mathbf{Z})^{-1} or (\mathbf{Z}^\top\mathbf{Z})^{-1}\,\mathbf{Z}^\top\mathbf{Z}) to obtain the identity matrix, which then simplifies the entire expression correctly.



Problem 4

Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 4

Four independent rolls R_1,R_2,R_3,R_4 of a fair six-sided die. Events:

A = \{\text{at least one } R_i = 3\},\quad B = \{\sum R_i = 6\},\quad C = \{R_i \le 4 \text{ for each } i\}.


Problem 4.1

  1. Clearly describe the sample space \Omega.

The sample space \Omega consists of all possible outcomes when rolling a fair six-sided die four times independently. Since each roll R_i can result in any of the values \{1, 2, 3, 4, 5, 6\}, the sample space is the set of all ordered 4-tuples:

\Omega = \{(r_1, r_2, r_3, r_4) : r_i \in \{1, 2, 3, 4, 5, 6\} \text{ for each } i = 1, 2, 3, 4\}

Alternatively, we can express this as the Cartesian product:

\Omega = \{1, 2, 3, 4, 5, 6\}^4

The total number of outcomes in the sample space is:

|\Omega| = 6^4 = 1296

Each outcome (r_1, r_2, r_3, r_4) represents a specific sequence of four die rolls, where r_i is the result of the i-th roll. For example, (1, 3, 5, 2) represents rolling a 1 first, then a 3, then a 5, and finally a 2.


Problem 4.2

  1. Compute \Pr(A), \Pr(B), \Pr(C).

Computing \Pr(A)

We use the complement rule since it’s easier to compute the probability that no rolls equal 3.

\Pr(A) = 1 - \Pr(A^c) = 1 - \Pr(\text{no rolls equal 3})

For each roll, the probability of not getting a 3 is \frac{5}{6}. Since the four rolls are independent:

\Pr(\text{no rolls equal 3}) = \left(\frac{5}{6}\right)^4 = \frac{625}{1296}

Therefore:

\Pr(A) = 1 - \frac{625}{1296} = \frac{671}{1296} \approx \boxed{0.518}


Computing \Pr(B)

We need to count the number of ordered outcomes (R_1, R_2, R_3, R_4) where each R_i \in \{1, 2, 3, 4, 5, 6\} and R_1 + R_2 + R_3 + R_4 = 6.

Using the substitution s_i = R_i - 1 (so 0 \leq s_i \leq 5), we need:

s_1 + s_2 + s_3 + s_4 = 2

The number of non-negative integer solutions is given by stars and bars:

\binom{2 + 4 - 1}{4 - 1} = \binom{5}{3} = 10

Since the sum is only 2, no variable s_i can exceed 2, so all solutions automatically satisfy s_i \leq 5. Therefore, there are exactly 10 favorable outcomes.

The total number of possible outcomes is 6^4 = 1296.

\Pr(B) = \frac{10}{1296} = \frac{5}{648} \approx \boxed{0.008}


Computing \Pr(C)

For each roll to be at most 4, we have 4 favorable outcomes: \{1, 2, 3, 4\} out of 6 possible outcomes.

Since the rolls are independent:

\Pr(C) = \left(\frac{4}{6}\right)^4 = \left(\frac{2}{3}\right)^4 = \frac{16}{81} \approx \boxed{0.198}


Summary

  • \Pr(A) = \frac{671}{1296} \approx 0.518
  • \Pr(B) = \frac{10}{1296} \approx 0.008
  • \Pr(C) = \frac{16}{81} \approx 0.198
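
These three probabilities are small enough to confirm by brute-force enumeration of all 6^4 = 1296 equally likely outcomes; a minimal sketch:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=4))   # all 1296 ordered rolls
total = len(outcomes)

A = [o for o in outcomes if 3 in o]                   # at least one roll equals 3
B = [o for o in outcomes if sum(o) == 6]              # the rolls sum to 6
C = [o for o in outcomes if all(r <= 4 for r in o)]   # every roll is at most 4

print(Fraction(len(A), total))   # 671/1296
print(Fraction(len(B), total))   # 5/648  (= 10/1296)
print(Fraction(len(C), total))   # 16/81  (= 256/1296)
```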


Problem 4.3

  1. Compute \Pr(A\cap B) and determine independence.

Computing \Pr{A \cap B}

We need to find the probability that at least one die shows 3 AND the sum of all four dice equals 6.

To find \Pr{A \cap B}, we count outcomes where both conditions hold simultaneously.

Finding all ways to get a sum of 6:

We need to partition 6 into 4 parts, where each part is between 1 and 6. The valid partitions are:

  • \{3, 1, 1, 1\} — contains a 3 ✓
  • \{2, 2, 1, 1\} — does not contain a 3 ✗

For event A \cap B to occur, we need at least one 3 AND sum equal to 6. This means only the partition \{3, 1, 1, 1\} satisfies both conditions.

Counting ordered arrangements:

The number of permutations of \{3, 1, 1, 1\} is: \frac{4!}{1! \cdot 3!} = \frac{24}{6} = 4

These four outcomes are: (3,1,1,1), (1,3,1,1), (1,1,3,1), (1,1,1,3).

Computing the probability:

The total number of outcomes is 6^4 = 1296.

\Pr{A \cap B} = \frac{4}{1296} = \frac{1}{324} \approx \boxed{0.003}

Testing Independence

Events A and B are independent if and only if \Pr{A \cap B} = \Pr{A} \cdot \Pr{B}.

From part (b), we have:

  • \Pr{A} = 1 - \left(\frac{5}{6}\right)^4 = 1 - \frac{625}{1296} = \frac{671}{1296} \approx 0.518
  • \Pr{B} = \frac{10}{1296} (there are 4 arrangements of \{3,1,1,1\} plus 6 arrangements of \{2,2,1,1\}) \approx 0.008

Therefore: \Pr{A} \cdot \Pr{B} = \frac{671}{1296} \cdot \frac{10}{1296} = \frac{6710}{1{,}679{,}616} \approx 0.004

Conclusion:

Since \Pr{A \cap B} = \frac{4}{1296} \approx 0.003 and \Pr{A} \cdot \Pr{B} \approx 0.004, we have:

\Pr{A \cap B} \neq \Pr{A} \cdot \Pr{B}

Therefore, events A and B are NOT independent.


Problem 4.4

  1. Compute \Pr(A\cap C) and determine independence.

Recall:

  • A = \{\text{at least one } R_i = 3\}
  • C = \{R_i \leq 4 \text{ for each } 1 \leq i \leq 4\}


Computing \Pr{A \cap C}

The event A \cap C means: at least one die shows 3, AND all four dice show values \leq 4.

We’ll condition on C (and use the complement rule within that conditioning). First, note that:

\Pr{A \cap C} = \Pr{C} \cdot \Pr{A \mid C}

Step 1: Find \Pr{C}

For event C, each die must show a value in \{1, 2, 3, 4\}.

  • Probability that one die shows \leq 4: \frac{4}{6} = \frac{2}{3}
  • Probability that all four dice show \leq 4:

\Pr{C} = \left(\frac{2}{3}\right)^4 = \frac{16}{81}

Step 2: Find \Pr{A \mid C}

Given that all dice show \leq 4, we want the probability that at least one shows 3.

Using the complement: \Pr{A \mid C} = 1 - \Pr{\text{no die shows 3} \mid C}

Given a die shows \leq 4, the probability it doesn’t show 3 is \frac{3}{4} (values in \{1, 2, 4\}).

Thus: \Pr{\text{no die shows 3} \mid C} = \left(\frac{3}{4}\right)^4 = \frac{81}{256}

Therefore: \Pr{A \mid C} = 1 - \frac{81}{256} = \frac{175}{256}

Step 3: Compute \Pr{A \cap C}

\Pr{A \cap C} = \Pr{C} \cdot \Pr{A \mid C} = \frac{16}{81} \cdot \frac{175}{256} = \frac{2800}{20736} = \frac{175}{1296}

In decimal form: \Pr{A \cap C} = \frac{175}{1296} \approx \boxed{0.135}


Testing Independence

Events A and C are independent if and only if \Pr{A \cap C} = \Pr{A} \cdot \Pr{C}.

From part (b):

  • \Pr{A} = 1 - \left(\frac{5}{6}\right)^4 = 1 - \frac{625}{1296} = \frac{671}{1296} \approx 0.518
  • \Pr{C} = \frac{16}{81} \approx 0.198

Compute \Pr{A} \cdot \Pr{C}:

\Pr{A} \cdot \Pr{C} = \frac{671}{1296} \cdot \frac{16}{81} = \frac{671 \cdot 16}{1296 \cdot 81} = \frac{10736}{104976} = \frac{671}{6561} \approx 0.102

Comparison:

  • \Pr{A \cap C} \approx 0.135
  • \Pr{A} \cdot \Pr{C} \approx 0.102

Since 0.135 \neq 0.102, we have \Pr{A \cap C} \neq \Pr{A} \cdot \Pr{C}.

Conclusion: Events A and C are NOT independent.

Intuition: Knowing that all dice show \leq 4 (event C) actually increases the probability of seeing at least one 3 (event A), because we’ve eliminated the possibility of 5’s and 6’s, making 3’s more likely relative to the reduced sample space.


Problem 4.5

  1. Compute \Pr(B\cap C) and determine independence.

Finding \Pr(B \cap C)

Event B \cap C consists of outcomes where the sum of the four dice equals 6 and all four dice show values \leq 4.

To find these outcomes, we need to count all ways to partition 6 into exactly 4 positive integers, each between 1 and 4.

The possible partitions are:

  • \{1, 1, 1, 3\}: sum = 1 + 1 + 1 + 3 = 6 ✓
  • \{1, 1, 2, 2\}: sum = 1 + 1 + 2 + 2 = 6 ✓

These are the only partitions that work. Any other combination either has a sum \neq 6 or requires at least one die to show a value > 4.

Now we count the number of ordered outcomes for each partition:

For \{1, 1, 1, 3\}: We need to count arrangements of three 1’s and one 3. \text{Number of arrangements} = \frac{4!}{3! \cdot 1!} = 4

For \{1, 1, 2, 2\}: We need to count arrangements of two 1’s and two 2’s. \text{Number of arrangements} = \frac{4!}{2! \cdot 2!} = 6

Total outcomes in B \cap C: 4 + 6 = 10

Since the sample space has 6^4 = 1296 total outcomes: \Pr(B \cap C) = \frac{10}{1296} = \frac{5}{648} \approx 0.008

Testing Independence

To determine if B and C are independent, we check whether \Pr(B \cap C) = \Pr(B) \times \Pr(C).

Finding \Pr(B):

From part (b), we computed \Pr(B) by finding all ways to get a sum of 6 with four dice (each showing 1-6). Notice that the partitions we found above are actually all the partitions of 6 into 4 positive integers \leq 6. Therefore: \Pr(B) = \frac{10}{1296} \approx 0.008

Finding \Pr(C):

Event C requires all four dice to show values \leq 4. Each die has 4 favorable outcomes out of 6 possible: \Pr(C) = \left(\frac{4}{6}\right)^4 = \left(\frac{2}{3}\right)^4 = \frac{256}{1296} = \frac{16}{81} \approx 0.198

Computing \Pr(B) \times \Pr(C): \Pr(B) \times \Pr(C) = \frac{10}{1296} \times \frac{256}{1296} = \frac{2560}{1{,}679{,}616} \approx 0.002

Conclusion

Since \Pr(B \cap C) \approx 0.008 and \Pr(B) \times \Pr(C) \approx 0.002, we have: \Pr(B \cap C) \neq \Pr(B) \times \Pr(C)

Therefore, events B and C are NOT independent.

Intuition: Notice that B \subseteq C (every outcome in B is also in C, since getting a sum of 6 with four dice automatically means all dice must show \leq 4). This subset relationship means that knowing C occurred significantly increases the probability of B, which is the hallmark of dependence.
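
All three intersection probabilities from parts (c)-(e), and all three failed independence checks, can be confirmed with the same brute-force enumeration; a minimal self-contained sketch:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=4))
total = len(outcomes)                                  # 1296

in_A = lambda o: 3 in o
in_B = lambda o: sum(o) == 6
in_C = lambda o: all(r <= 4 for r in o)

def prob(pred):
    # Exact probability of an event under the uniform distribution on outcomes.
    return Fraction(sum(1 for o in outcomes if pred(o)), total)

pA, pB, pC = prob(in_A), prob(in_B), prob(in_C)
pAB = prob(lambda o: in_A(o) and in_B(o))   # 1/324
pAC = prob(lambda o: in_A(o) and in_C(o))   # 175/1296
pBC = prob(lambda o: in_B(o) and in_C(o))   # 5/648

# Each intersection probability differs from the corresponding product,
# so no pair of events is independent.
print(pAB == pA * pB, pAC == pA * pC, pBC == pB * pC)   # False False False
```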



Problem 5

Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 5

Ridge regularized risk problem with \vec{x}_i \in \mathbb{R}^d, orthogonal unit-length columns in \mathbf{Z}:

R_\mu(\vec\theta) = \frac{1}{n}\|\vec{y}-\mathbf{Z}\vec\theta\|^2 + \mu \|\widetilde{\mathbf{I}_{d+1}}\vec\theta\|^2


Problem 5.1

  1. Prove that \mathbf{Z}^\top\mathbf{Z} = n \mathbf{I}_{d+1}.

We need to find the entries of the matrix \mathbf{Z}^{\top}\mathbf{Z}.

Recall that when we compute the matrix product \mathbf{Z}^{\top}\mathbf{Z}, the (i,j)-th entry is the dot product of the i-th row of \mathbf{Z}^{\top} with the j-th column of \mathbf{Z}.

Since the i-th row of \mathbf{Z}^{\top} equals the i-th column of \mathbf{Z}, we can write:

[\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = (\text{column } i \text{ of } \mathbf{Z})^{\top} (\text{column } j \text{ of } \mathbf{Z})

Let’s denote the columns of \mathbf{Z} as \vec{z}^{(1)}, \vec{z}^{(2)}, \ldots, \vec{z}^{(d+1)} \in \mathbb{R}^n. Then:

[\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = \vec{z}^{(i)} \cdot \vec{z}^{(j)}

Now we consider two cases:

Case 1: When i = j (diagonal entries)

For diagonal entries, we have: [\mathbf{Z}^{\top}\mathbf{Z}]_{ii} = \vec{z}^{(i)} \cdot \vec{z}^{(i)} = \|\vec{z}^{(i)}\|^2

We are given that each column has “vector length one.” In the context of design matrices and the result we need to prove, this means that each column has been normalized so that:

\frac{1}{n}\|\vec{z}^{(i)}\|^2 = 1

which is equivalent to:

\|\vec{z}^{(i)}\|^2 = n

Therefore: [\mathbf{Z}^{\top}\mathbf{Z}]_{ii} = n

Case 2: When i \neq j (off-diagonal entries)

For off-diagonal entries, we have: [\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = \vec{z}^{(i)} \cdot \vec{z}^{(j)}

We are given that the columns are pairwise orthogonal, which means: \vec{z}^{(i)} \cdot \vec{z}^{(j)} = 0 \quad \text{for all } i \neq j

Therefore: [\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = 0

Conclusion:

Combining both cases, we have: [\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = \begin{cases} n & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}

This is precisely the definition of the matrix n\mathbf{I}_{d+1}, where \mathbf{I}_{d+1} is the (d+1) \times (d+1) identity matrix.

Therefore, \mathbf{Z}^{\top}\mathbf{Z} = n\mathbf{I}_{d+1}. \quad \blacksquare


Problem 5.2

  1. Show that for slope index j\ge1:

\theta^\ast_\mu{}^{(j)} = \frac{1}{1+\mu} \theta^\ast_0{}^{(j)}

To find the minimizer of the ridge regularized risk, we take the gradient and set it equal to zero.

Step 1: Compute the gradient of R_{\mu}(\vec\theta)

The ridge risk is: R_{\mu}(\vec\theta) = \frac{1}{n}\|\,\vec{y}-\mathbf{Z}\vec\theta\|^{2} + \mu\,\|\widetilde{\mathbf{I}_{d+1}}\vec\theta\|^{2}

Taking the gradient: \nabla R_{\mu}(\vec\theta) = \frac{2}{n}\mathbf{Z}^{\top}(\mathbf{Z}\vec\theta - \vec{y}) + 2\mu\widetilde{\mathbf{I}_{d+1}}\vec\theta

Step 2: Set the gradient equal to zero

\frac{2}{n}\mathbf{Z}^{\top}\mathbf{Z}\vec\theta - \frac{2}{n}\mathbf{Z}^{\top}\vec{y} + 2\mu\widetilde{\mathbf{I}_{d+1}}\vec\theta = \vec{0}

Simplifying: \mathbf{Z}^{\top}\mathbf{Z}\vec\theta + n\mu\widetilde{\mathbf{I}_{d+1}}\vec\theta = \mathbf{Z}^{\top}\vec{y}

Step 3: Use the result from part (a)

From part (a), we know that \mathbf{Z}^{\top}\mathbf{Z} = n\mathbf{I}_{d+1}. Substituting:

n\mathbf{I}_{d+1}\vec\theta + n\mu\widetilde{\mathbf{I}_{d+1}}\vec\theta = \mathbf{Z}^{\top}\vec{y}

Factor out n: n(\mathbf{I}_{d+1} + \mu\widetilde{\mathbf{I}_{d+1}})\vec\theta = \mathbf{Z}^{\top}\vec{y}

Step 4: Analyze the matrix \mathbf{I}_{d+1} + \mu\widetilde{\mathbf{I}_{d+1}}

Recall that \widetilde{\mathbf{I}_{d+1}} is the identity matrix with the first diagonal entry (corresponding to the intercept) set to 0. Therefore, \mathbf{I}_{d+1} + \mu\widetilde{\mathbf{I}_{d+1}} is a diagonal matrix with:

  • First diagonal entry: 1 + 0 = 1 (intercept term)
  • Remaining diagonal entries: 1 + \mu (slope terms, j \geq 1)

Step 5: Solve for \vec\theta_{\mu}^{\ast} component-wise

For the slope terms (j \geq 1), the equation becomes: n(1 + \mu) \cdot \vec\theta_{\mu}^{\ast}{}^{(j)} = [\mathbf{Z}^{\top}\vec{y}]^{(j)}

Solving for \vec\theta_{\mu}^{\ast}{}^{(j)}: \vec\theta_{\mu}^{\ast}{}^{(j)} = \frac{1}{n(1+\mu)}[\mathbf{Z}^{\top}\vec{y}]^{(j)}

Step 6: Relate to the ridge-free minimizer

For the ridge-free case (\mu = 0), the normal equation gives: n\mathbf{I}_{d+1}\vec\theta_{0}^{\ast} = \mathbf{Z}^{\top}\vec{y}

So for each component: \vec\theta_{0}^{\ast}{}^{(j)} = \frac{1}{n}[\mathbf{Z}^{\top}\vec{y}]^{(j)}

Step 7: Substitute to obtain the final result

From Step 5: \vec\theta_{\mu}^{\ast}{}^{(j)} = \frac{1}{1+\mu} \cdot \frac{1}{n}[\mathbf{Z}^{\top}\vec{y}]^{(j)} = \frac{1}{1+\mu}\vec\theta_{0}^{\ast}{}^{(j)}

Therefore, for every slope index j \geq 1: \boxed{\vec\theta^{\ast}_{\mu}{}^{(j)} = \frac{1}{1+\mu}\;\vec\theta^{\ast}_{0}{}^{(j)}}

as required. \square
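
A quick numeric check of the shrinkage formula (a sketch under the orthogonality assumption of part (a); the data, seed, and \mu = 2.5 are arbitrary): build \mathbf{Z} with pairwise-orthogonal columns of squared length n, solve the stationarity condition at \mu = 2.5 and at \mu = 0, and compare the slopes.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 3
Q, _ = np.linalg.qr(rng.normal(size=(n, d + 1)))   # orthonormal columns
Z = np.sqrt(n) * Q                                  # now Z^T Z = n * I_{d+1}
y = rng.normal(size=n)

I_tilde = np.eye(d + 1)
I_tilde[0, 0] = 0.0                                 # the intercept is not penalized

def ridge_theta(mu):
    # Stationarity condition of R_mu: (Z^T Z + n*mu*I_tilde) theta = Z^T y
    return np.linalg.solve(Z.T @ Z + n * mu * I_tilde, Z.T @ y)

mu = 2.5
theta_0 = ridge_theta(0.0)
theta_mu = ridge_theta(mu)

print(np.allclose(theta_mu[1:], theta_0[1:] / (1 + mu)))   # True: slopes shrink by 1/(1 + mu)
print(np.isclose(theta_mu[0], theta_0[0]))                 # True: intercept is unchanged
```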


Problem 5.3

  1. Prove \mu \mapsto |\theta^\ast_\mu{}^{(j)}| strictly decreasing, \lim_{\mu\to\infty} \theta^\ast_\mu{}^{(j)} = 0.

To prove: For each j \geq 1, the map \mu \mapsto |\vec{\theta}^{\ast}_{\mu}{}^{(j)}| is strictly decreasing and satisfies \lim_{\mu \to \infty} \vec{\theta}^{\ast}_{\mu}{}^{(j)} = 0.

Part 1: Proving strict decrease of |\vec{\theta}^{\ast}_{\mu}{}^{(j)}|

From part (b), we have for each slope index j \geq 1:

\vec{\theta}^{\ast}_{\mu}{}^{(j)} = \frac{1}{1+\mu} \vec{\theta}^{\ast}_{0}{}^{(j)}

Taking the absolute value of both sides:

|\vec{\theta}^{\ast}_{\mu}{}^{(j)}| = \left|\frac{1}{1+\mu}\right| \cdot |\vec{\theta}^{\ast}_{0}{}^{(j)}| = \frac{1}{1+\mu} \cdot |\vec{\theta}^{\ast}_{0}{}^{(j)}|

where the simplification uses the fact that \mu > 0, so 1 + \mu > 1 > 0.

Now consider the function g(\mu) = \frac{1}{1+\mu} \cdot |\vec{\theta}^{\ast}_{0}{}^{(j)}| for \mu > 0.

Taking the derivative with respect to \mu:

g'(\mu) = |\vec{\theta}^{\ast}_{0}{}^{(j)}| \cdot \frac{d}{d\mu}\left[\frac{1}{1+\mu}\right] = |\vec{\theta}^{\ast}_{0}{}^{(j)}| \cdot \left(-\frac{1}{(1+\mu)^2}\right) = -\frac{|\vec{\theta}^{\ast}_{0}{}^{(j)}|}{(1+\mu)^2}

Key observation: If \vec{\theta}^{\ast}_{0}{}^{(j)} \neq 0, then g'(\mu) < 0 for all \mu > 0.

This shows that |\vec{\theta}^{\ast}_{\mu}{}^{(j)}| is strictly decreasing in \mu whenever \vec{\theta}^{\ast}_{0}{}^{(j)} \neq 0.

(In the trivial case where \vec{\theta}^{\ast}_{0}{}^{(j)} = 0, we have \vec{\theta}^{\ast}_{\mu}{}^{(j)} = 0 for all \mu, which is constant but can be considered weakly decreasing.)

Part 2: Proving the limit

We need to show that:

\lim_{\mu \to \infty} \vec{\theta}^{\ast}_{\mu}{}^{(j)} = 0

Using the relationship from part (b):

\lim_{\mu \to \infty} \vec{\theta}^{\ast}_{\mu}{}^{(j)} = \lim_{\mu \to \infty} \frac{1}{1+\mu} \vec{\theta}^{\ast}_{0}{}^{(j)} = \vec{\theta}^{\ast}_{0}{}^{(j)} \cdot \lim_{\mu \to \infty} \frac{1}{1+\mu}

Since \lim_{\mu \to \infty} \frac{1}{1+\mu} = 0, we have:

\lim_{\mu \to \infty} \vec{\theta}^{\ast}_{\mu}{}^{(j)} = \vec{\theta}^{\ast}_{0}{}^{(j)} \cdot 0 = 0

Therefore, as the ridge regularization parameter \mu grows to infinity, all slope coefficients shrink to zero.


Summary: The absolute value of each slope coefficient |\vec{\theta}^{\ast}_{\mu}{}^{(j)}| decreases strictly as \mu increases (by virtue of having a negative derivative), and approaches 0 as \mu \to \infty. This demonstrates the shrinkage effect of ridge regularization on the model parameters.


Problem 5.4

  1. Interpret parts (b) and (c) in words: geometric effect of ridge regularization.

Interpretation of Ridge Regularization with Orthonormal Features

Parts (b) and (c) reveal the following geometric effects of ridge regularization when the feature columns are orthonormal:

Uniform Shrinkage: Part (b) shows that ridge regularization applies a uniform multiplicative scaling factor of \frac{1}{1+\mu} to all slope coefficients. This means that each slope parameter \vec\theta^{\ast}_{\mu}{}^{(j)} is exactly \frac{1}{1+\mu} times its ridge-free counterpart \vec\theta^{\ast}_{0}{}^{(j)}. In other words, ridge regularization shrinks all slope coefficients toward zero by the same proportion.

Monotonic Shrinkage: Part (c) demonstrates that as the regularization parameter \mu increases:

  • The magnitude of each slope coefficient strictly decreases
  • The shrinkage becomes more aggressive for larger values of \mu
  • In the limit as \mu \to \infty, all slope coefficients are driven to exactly zero

Geometric Interpretation: When the feature columns are orthonormal, ridge regularization has a particularly clean geometric effect—it uniformly scales down the entire slope coefficient vector \vec{w}^\ast by the factor \frac{1}{1+\mu}. This is in contrast to more complex scenarios where non-orthogonal features can lead to different coefficients being shrunk by different amounts. The orthonormality ensures that the regularization penalty is applied equally and proportionally to all directions in the parameter space.

Practical Meaning: As we increase the regularization strength \mu, we are trading off model complexity (magnitude of coefficients) for better generalization. With very large \mu, the model approaches the trivial solution where all slopes are zero (effectively predicting only the intercept/mean).



Problem 6

Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 6

Sawyer arranges 12 distinct hot sauce bottles: 5 Hot (H_1,\dots,H_5) and 7 Mild (M_1,\dots,M_7).


Problem 6.1

  1. How many linear arrangements with no two Hot bottles adjacent?

To ensure no two Hot bottles are adjacent, we can use the following strategy: first place the Mild bottles, then insert the Hot bottles into the gaps created.

Step 1: Arrange the Mild bottles

We have 7 distinct Mild bottles to arrange in a row. The number of ways to do this is:

7!

Step 2: Identify positions for Hot bottles

Once the 7 Mild bottles are placed in a row, they create potential positions (gaps) where we can place the Hot bottles:

\_ \; M \; \_ \; M \; \_ \; M \; \_ \; M \; \_ \; M \; \_ \; M \; \_ \; M \; \_

There are 8 gaps total: one before the first Mild bottle, one after each of the 7 Mild bottles.

Step 3: Choose which gaps to use

We need to place 5 Hot bottles into these 8 available gaps. Since we want no two Hot bottles adjacent, we can place at most one Hot bottle in each gap. The number of ways to choose 5 gaps from 8 is:

\binom{8}{5}

Step 4: Arrange the Hot bottles in the chosen gaps

Once we’ve chosen which 5 gaps to use, we need to arrange the 5 distinct Hot bottles in those positions. The number of ways to do this is:

5!

Step 5: Calculate the total

By the multiplication principle, the total number of valid arrangements is:

7! \times \binom{8}{5} \times 5!

Let’s compute this:

  • 7! = 5040
  • \binom{8}{5} = \binom{8}{3} = \frac{8 \cdot 7 \cdot 6}{3 \cdot 2 \cdot 1} = \frac{336}{6} = 56
  • 5! = 120

Therefore:

7! \times \binom{8}{5} \times 5! = 5040 \times 56 \times 120 = 33{,}868{,}800

Answer: There are \boxed{7! \times \binom{8}{5} \times 5! = 33{,}868{,}800} possible arrangements.
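
The count can be cross-checked by enumerating Hot-bottle slot patterns instead of full arrangements (a sketch, not part of the exam): choose which 5 of the 12 slots are Hot, keep the choices with no two adjacent, and multiply by 5!\cdot 7! for the orderings of the distinct bottles.

```python
from itertools import combinations
from math import factorial

no_adjacent = [
    pos for pos in combinations(range(12), 5)
    if all(b - a > 1 for a, b in zip(pos, pos[1:]))   # no two Hot slots are neighbors
]

print(len(no_adjacent))                                  # 56 = C(8, 5)
print(len(no_adjacent) * factorial(5) * factorial(7))    # 33,868,800
```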


Problem 6.2

  1. How many arrangements with all 5 Hot bottles consecutive?

We need to count the number of ways to arrange 12 distinct bottles (5 Hot: H_1, \ldots, H_5 and 7 Mild: M_1, \ldots, M_7) such that all 5 Hot bottles are consecutive.

Step 1: Treat the Hot bottles as a single block

Since all 5 Hot bottles must be consecutive, we can treat them as a single “super-block.” This reduces our problem to arranging:

  • 1 super-block (containing all 5 Hot bottles)
  • 7 individual Mild bottles

This gives us 1 + 7 = 8 objects to arrange.

Step 2: Arrange the 8 objects

These 8 objects can be arranged in a row in: 8! = 40320 \text{ ways}

Step 3: Arrange the Hot bottles within the block

Within the super-block, the 5 distinct Hot bottles can be arranged among themselves in: 5! = 120 \text{ ways}

Step 4: Apply the multiplication principle

By the multiplication principle, the total number of arrangements is: 5! \times 8! = 120 \times 40320 = 4{,}838{,}400

Answer

\boxed{5! \times 8! = 4{,}838{,}400}


Problem 6.3

  1. How many arrangements with exactly one group of four adjacent Hot bottles, the remaining Hot separate?

We need to count arrangements where exactly 4 Hot bottles are consecutive, and the 5th Hot bottle is not adjacent to this group of 4.

Step 1: Choose which 4 Hot bottles form the consecutive block

We have 5 distinct Hot bottles (H_1, H_2, H_3, H_4, H_5) and need to choose 4 of them to form the consecutive group:

\binom{5}{4} = 5 \text{ ways}

Step 2: Arrange the 4 chosen Hot bottles within the block

The 4 bottles within the consecutive block can be arranged in:

4! = 24 \text{ ways}

Step 3: Arrange all units with the non-adjacency constraint

Now we have 9 “units” to arrange in a row:

  • 1 block of 4 Hot bottles (treat as a single unit, denoted B)
  • 1 remaining Hot bottle (denoted H)
  • 7 distinct Mild bottles (M_1, \ldots, M_7)

We need to count arrangements where H is not adjacent to B.

Without constraint: The 9 distinct units can be arranged in 9! ways.

With H adjacent to B:

  • Treat H and B as a single super-unit
  • This super-unit can be arranged internally in 2 ways: HB or BH
  • We now have 8 units total: the super-unit plus 7 Mild bottles
  • These 8 units can be arranged in 8! ways
  • Total: 2 \times 8!

Arrangements where H is NOT adjacent to B:

9! - 2 \times 8! = 8!(9 - 2) = 7 \times 8! = 7 \times 40,320 = 282,240

Step 4: Combine all steps

Total number of valid arrangements:

\binom{5}{4} \times 4! \times (9! - 2 \times 8!)

= 5 \times 24 \times 282,240

= 120 \times 282,240

= 33,868,800

Final Answer

\boxed{33,868,800 = 5 \times 4! \times (7 \times 8!)}

Alternatively, this can be expressed as:

\boxed{\binom{5}{4} \times 4! \times (9! - 2 \times 8!)}
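
This count can also be cross-checked by enumerating Hot-slot patterns (a sketch, not part of the exam): keep the 5-slot subsets whose maximal runs of consecutive slots have lengths exactly 4 and 1, then multiply by 5!\cdot 7! for the orderings of the distinct bottles.

```python
from itertools import combinations
from math import factorial

def run_lengths(pos):
    # Lengths of the maximal runs of consecutive slot indices in a sorted tuple.
    lengths, current = [], 1
    for a, b in zip(pos, pos[1:]):
        if b == a + 1:
            current += 1
        else:
            lengths.append(current)
            current = 1
    lengths.append(current)
    return sorted(lengths)

valid = [
    pos for pos in combinations(range(12), 5)
    if run_lengths(pos) == [1, 4]   # exactly one block of four plus one isolated Hot slot
]

print(len(valid))                                 # 56
print(len(valid) * factorial(5) * factorial(7))   # 33,868,800
```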


Problem 6.4

  1. If random arrangement, probability of satisfying (a) condition?

To find this probability, we need to divide the number of favorable outcomes (arrangements with no two Hot bottles adjacent) by the total number of possible arrangements.

Total Number of Arrangements

Since all 12 bottles are distinct (5 Hot bottles H_1, \ldots, H_5 and 7 Mild bottles M_1, \ldots, M_7), the total number of ways to arrange them in a row is:

\text{Total arrangements} = 12!

Favorable Arrangements (from part a)

From part (a), we found that the number of arrangements where no two Hot bottles are adjacent is:

\text{Favorable arrangements} = 7! \cdot \binom{8}{5} \cdot 5!

Explanation:

  • First, arrange the 7 Mild bottles in a row: 7! ways
  • This creates 8 possible positions (gaps) for the Hot bottles: one before the first Mild bottle, one between each pair of adjacent Mild bottles (6 positions), and one after the last Mild bottle
  • Choose 5 of these 8 gaps to place the Hot bottles: \binom{8}{5} ways
  • Arrange the 5 distinct Hot bottles in the chosen positions: 5! ways

Calculating the Probability

The probability is:

P = \frac{7! \cdot \binom{8}{5} \cdot 5!}{12!}

Let’s compute each component:

  • 7! = 5040
  • \binom{8}{5} = \frac{8!}{5! \cdot 3!} = \frac{8 \cdot 7 \cdot 6}{3 \cdot 2 \cdot 1} = 56
  • 5! = 120
  • 12! = 479{,}001{,}600

Therefore:

P = \frac{5040 \cdot 56 \cdot 120}{479{,}001{,}600} = \frac{33{,}868{,}800}{479{,}001{,}600}

Final Answer

P = \frac{7! \cdot \binom{8}{5} \cdot 5!}{12!} = \frac{33{,}868{,}800}{479{,}001{,}600} \approx \boxed{0.071}

The probability is approximately 0.071 or 7.1%.



Problem 7

Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 7

Classifying video games into Adventure (y=1) or Strategy (y=0) based on 5 binary features. Training data for 14 games:

Game multiplayer puzzle customization procedural magic y
1 1 1 0 0 0 Adventure


Problem 7.1

  1. Compute class probabilities \mathbb{P}(Y=1) and \mathbb{P}(Y=0).

To compute the class probabilities, we need to count how many games belong to each class in the training data.

From the training data table:

  • Adventure games (Y=1): games 1, 2, 3, 4, 5, 6 → 6 games total
  • Strategy games (Y=0): games 7, 8, 9, 10, 11, 12, 13, 14 → 8 games total
  • Total games: 14

The class probabilities are computed as the fraction of training examples in each class:

\mathbb{P}(Y=1) = \frac{\text{number of Adventure games}}{\text{total number of games}} = \frac{6}{14} = \frac{3}{7}

\mathbb{P}(Y=0) = \frac{\text{number of Strategy games}}{\text{total number of games}} = \frac{8}{14} = \frac{4}{7}

Answer:

  • \mathbb{P}(Y=1) = \frac{3}{7} \approx 0.429
  • \mathbb{P}(Y=0) = \frac{4}{7} \approx 0.571


Problem 7.2

  1. Map a new game’s advertisement to feature vector \vec{x}.

We need to determine which of the five binary features are present (1) or absent (0) based on the advertisement:

  1. multiplayer: The ad explicitly states “(Now featuring multiplayer mode)” → 1

  2. puzzle: The ad mentions “mazes” which typically involve puzzle-solving mechanics to navigate through them → 1

  3. customization: No mention of character customization in the advertisement → 0

  4. procedural: The ad explicitly states “procedurally generated mazes” → 1

  5. magic: The ad mentions “the magician Ganfald,” indicating a magic system → 1

Therefore, the feature vector is:

\vec{x} = \begin{bmatrix}1 \\ 1 \\ 0 \\ 1 \\ 1\end{bmatrix}

Or in row notation: \vec{x} = [1, 1, 0, 1, 1]


Problem 7.3

  1. Using Naïve Bayes (with Laplace smoothing if needed), compute \mathbb{P}(Y=0 \mid \vec X = \vec x) and \mathbb{P}(Y=1 \mid \vec X = \vec x).

From part (b), the feature vector for Nebulon Warrior II is: \vec{x} = [1, 1, 0, 1, 1]

corresponding to: multiplayer=1, puzzle=1, customization=0, procedural=1, magic=1.

Step 1: Count feature occurrences by class

From the training data:

  • Adventure games (Y=1): 6 games total (games 1-6)
  • Strategy games (Y=0): 8 games total (games 7-14)

For Adventure games (Y=1):

  • multiplayer=1: 5 times (games 1,2,4,5,6)
  • puzzle=1: 3 times (games 1,3,4)
  • customization=0: 6 times (all games)
  • procedural=1: 3 times (games 2,3,4)
  • magic=1: 1 time (game 6)

For Strategy games (Y=0):

  • multiplayer=1: 3 times (games 9,11,13)
  • puzzle=1: 3 times (games 8,10,13)
  • customization=0: 1 time (game 14)
  • procedural=1: 4 times (games 8,11,12,14)
  • magic=1: 3 times (games 12,13,14)

Step 2: Apply Laplace smoothing

Since each feature is binary (two possible values), Laplace smoothing adds 1 to each count and 2 to each denominator.

Conditional probabilities for Y=1 (Adventure):

\mathbb{P}(\text{multiplayer}=1 \mid Y=1) = \frac{5+1}{6+2} = \frac{6}{8} = \frac{3}{4}
\mathbb{P}(\text{puzzle}=1 \mid Y=1) = \frac{3+1}{6+2} = \frac{4}{8} = \frac{1}{2}
\mathbb{P}(\text{customization}=0 \mid Y=1) = \frac{6+1}{6+2} = \frac{7}{8}
\mathbb{P}(\text{procedural}=1 \mid Y=1) = \frac{3+1}{6+2} = \frac{4}{8} = \frac{1}{2}
\mathbb{P}(\text{magic}=1 \mid Y=1) = \frac{1+1}{6+2} = \frac{2}{8} = \frac{1}{4}

Conditional probabilities for Y=0 (Strategy):

\mathbb{P}(\text{multiplayer}=1 \mid Y=0) = \frac{3+1}{8+2} = \frac{4}{10} = \frac{2}{5}
\mathbb{P}(\text{puzzle}=1 \mid Y=0) = \frac{3+1}{8+2} = \frac{4}{10} = \frac{2}{5}
\mathbb{P}(\text{customization}=0 \mid Y=0) = \frac{1+1}{8+2} = \frac{2}{10} = \frac{1}{5}
\mathbb{P}(\text{procedural}=1 \mid Y=0) = \frac{4+1}{8+2} = \frac{5}{10} = \frac{1}{2}
\mathbb{P}(\text{magic}=1 \mid Y=0) = \frac{3+1}{8+2} = \frac{4}{10} = \frac{2}{5}

Step 3: Calculate unnormalized posterior probabilities

Using the Naive Bayes assumption and class priors: \mathbb{P}(Y=1) = \frac{6}{14} = \frac{3}{7}, \quad \mathbb{P}(Y=0) = \frac{8}{14} = \frac{4}{7}

For Y=1: \mathbb{P}(Y=1 \mid \vec{X}=\vec{x}) \propto \mathbb{P}(Y=1) \prod_{j=1}^{5} \mathbb{P}(X^{(j)}=x^{(j)} \mid Y=1) = \frac{3}{7} \cdot \frac{3}{4} \cdot \frac{1}{2} \cdot \frac{7}{8} \cdot \frac{1}{2} \cdot \frac{1}{4} = \frac{3 \cdot 3 \cdot 1 \cdot 7 \cdot 1 \cdot 1}{7 \cdot 4 \cdot 2 \cdot 8 \cdot 2 \cdot 4} = \frac{63}{3584} = \frac{9}{512}

For Y=0: \mathbb{P}(Y=0 \mid \vec{X}=\vec{x}) \propto \mathbb{P}(Y=0) \prod_{j=1}^{5} \mathbb{P}(X^{(j)}=x^{(j)} \mid Y=0) = \frac{4}{7} \cdot \frac{2}{5} \cdot \frac{2}{5} \cdot \frac{1}{5} \cdot \frac{1}{2} \cdot \frac{2}{5} = \frac{4 \cdot 2 \cdot 2 \cdot 1 \cdot 1 \cdot 2}{7 \cdot 5 \cdot 5 \cdot 5 \cdot 2 \cdot 5} = \frac{32}{8750} = \frac{16}{4375}

Step 4: Normalize to get posterior probabilities

The normalization constant is: Z = \frac{9}{512} + \frac{16}{4375}

Converting to a common denominator (LCD = 2,240,000): \frac{9}{512} = \frac{39,375}{2,240,000}, \quad \frac{16}{4375} = \frac{8,192}{2,240,000} Z = \frac{47,567}{2,240,000}

Therefore: \mathbb{P}(Y=1 \mid \vec{X}=\vec{x}) = \frac{39,375/2,240,000}{47,567/2,240,000} = \frac{39,375}{47,567} \approx 0.828

\mathbb{P}(Y=0 \mid \vec{X}=\vec{x}) = \frac{8,192/2,240,000}{47,567/2,240,000} = \frac{8,192}{47,567} \approx 0.172

Final Answer: \boxed{\mathbb{P}(Y=0 \mid \vec{X}=\vec{x}) \approx 0.172} \boxed{\mathbb{P}(Y=1 \mid \vec{X}=\vec{x}) \approx 0.828}
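
The arithmetic above can be reproduced in a few lines, using the class priors and the Laplace-smoothed conditional probabilities tabulated in Steps 1-2 (a sketch of this specific calculation, not a general Naive Bayes implementation):

```python
from fractions import Fraction as F

# Priors and smoothed conditionals for x = [1, 1, 0, 1, 1], taken from Steps 1-2 above.
prior = {1: F(6, 14), 0: F(8, 14)}
cond = {
    1: [F(6, 8), F(4, 8), F(7, 8), F(4, 8), F(2, 8)],       # P(observed feature value | Adventure)
    0: [F(4, 10), F(4, 10), F(2, 10), F(5, 10), F(4, 10)],  # P(observed feature value | Strategy)
}

unnorm = {}
for label in (0, 1):
    p = prior[label]
    for q in cond[label]:
        p *= q                 # Naive Bayes: multiply the per-feature conditionals
    unnorm[label] = p          # proportional to P(Y = label | X = x)

total = unnorm[0] + unnorm[1]
print(unnorm[1], unnorm[0])                                   # 9/512 and 16/4375
print(float(unnorm[1] / total), float(unnorm[0] / total))     # ≈ 0.828 and ≈ 0.172
```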


Problem 7.4

  1. Based on (c), predict the genre for this new game.

From part (c), we computed the posterior probabilities for the game Nebulon Warrior II with feature vector \vec{x} = [1, 1, 0, 1, 1]:

\mathbb{P}(Y=1 \mid \vec{X} = \vec{x}) = \frac{39,375}{47,567} \approx 0.828

\mathbb{P}(Y=0 \mid \vec{X} = \vec{x}) = \frac{8,192}{47,567} \approx 0.172

The Naive Bayes classifier predicts the class with the highest posterior probability. Since:

\mathbb{P}(Y=1 \mid \vec{X} = \vec{x}) > \mathbb{P}(Y=0 \mid \vec{X} = \vec{x})

the classifier would predict Adventure (i.e., \hat{y} = 1) for this new game.

This prediction makes intuitive sense: the game features multiplayer mode, puzzle-solving mechanics (mazes), procedurally generated content, and a magic system, which are characteristics more commonly associated with Adventure games in our training data. Specifically:

  • 5 out of 6 Adventure games have multiplayer, while only 3 out of 8 Strategy games do
  • 3 out of 6 Adventure games have puzzle mechanics, while 3 out of 8 Strategy games do
  • Only 1 out of 6 Adventure games has magic, while 3 out of 8 Strategy games do
  • The absence of customization also supports this prediction, as 7 out of 8 Strategy games have customization, but no Adventure games in the training set do

Despite the low occurrence of magic in Adventure games, the combination of all features together strongly suggests an Adventure classification.


