Instructor(s): Sawyer Roberson
This exam was take-home.
Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 1
Consider the following dataset of scalar-valued training examples:
\begin{array}{c|c|c|c} x_1 & x_2 & x_3 & x_4\\\hline 1 & 2 & 2 & 4 \end{array}
We model the collection of x_i’s using the constant model f(c;\, x) = c, trained with the Huber loss:
L_{\rm Huber}(c; x_i) = \begin{cases} \frac{1}{2}(c - x_i)^2, & \text{if } |c - x_i| \leq \frac{1}{2}, \\ \frac{1}{2}|c - x_i| - \frac{1}{8}, & \text{if } |c - x_i| > \frac{1}{2}. \end{cases}
Prove that the empirical risk function R_{\rm Huber}(c) associated with the Huber loss is convex. Is it strictly convex?
The empirical risk function is given by: R_{\rm Huber}(c) = \frac{1}{n}\sum_{i=1}^n L_{\rm Huber}(c; x_i) = \frac{1}{4}\sum_{i=1}^4 L_{\rm Huber}(c; x_i)
where the Huber loss is: L_{\rm Huber}(c; x_i) = \begin{cases}\frac{1}{2}(c - x_i)^2, & \text{if } |c - x_i| \leq \frac{1}{2}, \\ \frac{1}{2}|c - x_i| - \frac{1}{8}, & \text{if } |c - x_i| > \frac{1}{2}. \end{cases}
To prove that R_{\rm Huber}(c) is convex, we will show that each individual loss function L_{\rm Huber}(c; x_i) is convex, then use the fact that the sum of convex functions is convex.
Step 1: Show each piece of the Huber loss is convex
For the first piece, when |c - x_i| \leq \frac{1}{2}:
- L_1(c) = \frac{1}{2}(c - x_i)^2
- The second derivative is \frac{d^2L_1}{dc^2} = 1 > 0.
- Therefore, this piece is strictly convex.
For the second piece, when |c - x_i| > \frac{1}{2}:
- L_2(c) = \frac{1}{2}|c - x_i| - \frac{1}{8}
- This is an affine transformation of the absolute value function |c - x_i|, which is known to be convex.
- The absolute value function is convex because for any \lambda \in [0,1] and any c_1, c_2: |\lambda c_1 + (1-\lambda)c_2 - x_i| \leq \lambda|c_1 - x_i| + (1-\lambda)|c_2 - x_i| (triangle inequality).
- Therefore, this piece is convex.
Step 2: Verify the function is continuous and properly defined at the boundary
At the boundary where |c - x_i| = \frac{1}{2}:
- From piece 1: L_1(c) = \frac{1}{2}\left(\frac{1}{2}\right)^2 = \frac{1}{8}
- From piece 2: L_2(c) = \frac{1}{2}\left(\frac{1}{2}\right) - \frac{1}{8} = \frac{1}{4} - \frac{1}{8} = \frac{1}{8}
The values match, so L_{\rm Huber}(c; x_i) is continuous. The one-sided derivatives also agree at the boundary (c - x_i = \pm\frac{1}{2} from the quadratic piece and \frac{1}{2}\operatorname{sign}(c - x_i) = \pm\frac{1}{2} from the linear piece), so the derivative of L_{\rm Huber} is continuous and non-decreasing, which lets us conclude convexity of the whole piecewise function rather than just of each piece separately.
Step 3: Apply convexity properties
Since:
1. Each L_{\rm Huber}(c; x_i) is convex (as shown above),
2. A non-negative weighted sum of convex functions is convex, and
3. R_{\rm Huber}(c) = \frac{1}{4}\sum_{i=1}^4 L_{\rm Huber}(c; x_i) is a non-negative weighted sum,
Therefore, R_{\rm Huber}(c) is convex.
No, the empirical risk function is not strictly convex.
The Huber loss contains a linear piece (the absolute value term \frac{1}{2}|c - x_i| - \frac{1}{8} when |c - x_i| > \frac{1}{2}).
A strictly convex function must satisfy: f(\lambda c_1 + (1-\lambda)c_2) < \lambda f(c_1) + (1-\lambda)f(c_2) for all c_1 \neq c_2 and \lambda \in (0,1).
However, in the linear region of the Huber loss, we have equality rather than strict inequality, which means the function is not strictly convex.
Specifically, if c_1 and c_2 both lie in the linear region of every data point on the same side (for example, c_1 = 5 and c_2 = 6, which are more than \frac{1}{2} above every x_i), then each term of the sum, and hence R_{\rm Huber} itself, is linear on the segment between them, so equality holds in the inequality above and strict convexity fails.
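As a sanity check (not a proof), the following short Python sketch evaluates R_{\rm Huber} on the dataset above, verifies midpoint convexity on a grid, and exhibits a segment on which the risk is exactly linear; the function and variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 4.0])  # training data

def huber_risk(c, delta=0.5):
    """Empirical risk R_Huber(c) for the constant model, threshold delta = 1/2."""
    r = c - x
    quad = 0.5 * r**2
    lin = delta * np.abs(r) - 0.5 * delta**2   # = (1/2)|r| - 1/8 when delta = 1/2
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))

# Midpoint convexity check on a grid: R((a+b)/2) <= (R(a)+R(b))/2 for all pairs.
grid = np.linspace(-2.0, 7.0, 181)
ok = all(huber_risk((a + b) / 2) <= (huber_risk(a) + huber_risk(b)) / 2 + 1e-12
         for a in grid for b in grid)
print(ok)  # True

# Strict convexity fails: on [5, 6] every loss term is linear, so R is linear there.
a, b = 5.0, 6.0
print(np.isclose(huber_risk((a + b) / 2), (huber_risk(a) + huber_risk(b)) / 2))  # True
```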
We have the dataset: x_1 = 1, x_2 = 2, x_3 = 2, x_4 = 4
The Huber loss function is: L_{\rm Huber}(c; x_i) = \begin{cases} \frac{1}{2}(c - x_i)^2, & \text{if } |c - x_i| \leq \frac{1}{2} \\ \frac{1}{2}|c - x_i| - \frac{1}{8}, & \text{if } |c - x_i| > \frac{1}{2} \end{cases}
The empirical risk is: R_{\rm Huber}(c) = \frac{1}{4}\sum_{i=1}^{4} L_{\rm Huber}(c; x_i)
The derivative of the Huber loss with respect to c is: \frac{\partial L_{\rm Huber}(c; x_i)}{\partial c} = \begin{cases} c - x_i, & \text{if } |c - x_i| \leq \frac{1}{2} \\ \frac{1}{2}\text{sign}(c - x_i), & \text{if } |c - x_i| > \frac{1}{2} \end{cases}
Therefore: R'_{\rm Huber}(c) = \frac{1}{4}\sum_{i=1}^{4} \frac{\partial L_{\rm Huber}(c; x_i)}{\partial c}
With step size \eta = 1: c_{t+1} = c_t - \eta \cdot R'_{\rm Huber}(c_t) = c_t - R'_{\rm Huber}(c_t)
c_0 = \frac{3}{2} = 1.5
Computing individual losses and derivatives:
- x_1 = 1: |c_0 - x_1| = |1.5 - 1| = 0.5 \leq 0.5 → Quadratic case; L_{\rm Huber}(c_0; x_1) = \frac{1}{2}(0.5)^2 = 0.125; \frac{\partial L}{\partial c} = 0.5
- x_2 = 2: |c_0 - x_2| = 0.5 \leq 0.5 → Quadratic case; L = \frac{1}{2}(0.5)^2 = 0.125; \frac{\partial L}{\partial c} = -0.5
- x_3 = 2: |c_0 - x_3| = 0.5 \leq 0.5 → Quadratic case; L = 0.125; \frac{\partial L}{\partial c} = -0.5
- x_4 = 4: |c_0 - x_4| = 2.5 > 0.5 → Linear case; L = \frac{1}{2}(2.5) - \frac{1}{8} = 1.125; \frac{\partial L}{\partial c} = -0.5
Computing empirical risk and derivative: R_{\rm Huber}(c_0) = \frac{1}{4}(0.125 + 0.125 + 0.125 + 1.125) = \frac{1.5}{4} = 0.375
R'_{\rm Huber}(c_0) = \frac{1}{4}(0.5 - 0.5 - 0.5 - 0.5) = -0.25
Update: c_1 = 1.5 - (-0.25) = 1.75
c_1 = 1.75
Computing individual losses and derivatives:
- x_1 = 1: |c_1 - x_1| = 0.75 > 0.5 → Linear case; L = \frac{1}{2}(0.75) - \frac{1}{8} = 0.25; \frac{\partial L}{\partial c} = 0.5
- x_2 = 2: |c_1 - x_2| = 0.25 \leq 0.5 → Quadratic case; L = \frac{1}{2}(0.25)^2 = 0.03125; \frac{\partial L}{\partial c} = -0.25
- x_3 = 2: |c_1 - x_3| = 0.25 \leq 0.5 → Quadratic case; L = 0.03125; \frac{\partial L}{\partial c} = -0.25
- x_4 = 4: |c_1 - x_4| = 2.25 > 0.5 → Linear case; L = \frac{1}{2}(2.25) - \frac{1}{8} = 1; \frac{\partial L}{\partial c} = -0.5
Computing empirical risk and derivative: R_{\rm Huber}(c_1) = \frac{1}{4}(0.25 + 0.03125 + 0.03125 + 1) = 0.328125
R'_{\rm Huber}(c_1) = \frac{1}{4}(0.5 - 0.25 - 0.25 - 0.5) = -0.125
Update: c_2 = 1.75 - (-0.125) = 1.875
c_2 = 1.875
Computing individual losses and derivatives:
- x_1 = 1: |c_2 - x_1| = 0.875 > 0.5 → Linear case; L = \frac{1}{2}(0.875) - \frac{1}{8} = 0.3125; \frac{\partial L}{\partial c} = 0.5
- x_2 = 2: |c_2 - x_2| = 0.125 \leq 0.5 → Quadratic case; L = \frac{1}{2}(0.125)^2 = 0.0078125; \frac{\partial L}{\partial c} = -0.125
- x_3 = 2: |c_2 - x_3| = 0.125 \leq 0.5 → Quadratic case; L = 0.0078125; \frac{\partial L}{\partial c} = -0.125
- x_4 = 4: |c_2 - x_4| = 2.125 > 0.5 → Linear case; L = \frac{1}{2}(2.125) - \frac{1}{8} = 0.9375; \frac{\partial L}{\partial c} = -0.5
Computing empirical risk and derivative: R_{\rm Huber}(c_2) = \frac{1}{4}(0.3125 + 0.0078125 + 0.0078125 + 0.9375) = 0.31640625
R'_{\rm Huber}(c_2) = \frac{1}{4}(0.5 - 0.125 - 0.125 - 0.5) = -0.0625
Update: c_3 = 1.875 - (-0.0625) = 1.9375
c_3 = 1.9375
Computing individual losses and derivatives:
- x_1 = 1: |c_3 - x_1| = 0.9375 > 0.5 → Linear case; L = \frac{1}{2}(0.9375) - \frac{1}{8} = 0.34375; \frac{\partial L}{\partial c} = 0.5
- x_2 = 2: |c_3 - x_2| = 0.0625 \leq 0.5 → Quadratic case; L = \frac{1}{2}(0.0625)^2 = 0.001953125; \frac{\partial L}{\partial c} = -0.0625
- x_3 = 2: |c_3 - x_3| = 0.0625 \leq 0.5 → Quadratic case; L = 0.001953125; \frac{\partial L}{\partial c} = -0.0625
- x_4 = 4: |c_3 - x_4| = 2.0625 > 0.5 → Linear case; L = \frac{1}{2}(2.0625) - \frac{1}{8} = 0.90625; \frac{\partial L}{\partial c} = -0.5
Computing empirical risk and derivative: R_{\rm Huber}(c_3) = \frac{1}{4}(0.34375 + 0.001953125 + 0.001953125 + 0.90625) = 0.3134765625
R'_{\rm Huber}(c_3) = \frac{1}{4}(0.5 - 0.0625 - 0.0625 - 0.5) = -0.03125
Update: c_4 = 1.9375 - (-0.03125) = 1.96875
c_4 = 1.96875
Computing individual losses and derivatives:
- x_1 = 1: |c_4 - x_1| = 0.96875 > 0.5 → Linear case; L = \frac{1}{2}(0.96875) - \frac{1}{8} = 0.359375; \frac{\partial L}{\partial c} = 0.5
- x_2 = 2: |c_4 - x_2| = 0.03125 \leq 0.5 → Quadratic case; L = \frac{1}{2}(0.03125)^2 = 0.00048828125; \frac{\partial L}{\partial c} = -0.03125
- x_3 = 2: |c_4 - x_3| = 0.03125 \leq 0.5 → Quadratic case; L = 0.00048828125; \frac{\partial L}{\partial c} = -0.03125
- x_4 = 4: |c_4 - x_4| = 2.03125 > 0.5 → Linear case; L = \frac{1}{2}(2.03125) - \frac{1}{8} = 0.890625; \frac{\partial L}{\partial c} = -0.5
Computing empirical risk and derivative: R_{\rm Huber}(c_4) = \frac{1}{4}(0.359375 + 0.00048828125 + 0.00048828125 + 0.890625) \approx 0.3127
R'_{\rm Huber}(c_4) = \frac{1}{4}(0.5 - 0.03125 - 0.03125 - 0.5) = -0.015625
| t | c_t | R_{\rm Huber}(c_t) | R'_{\rm Huber}(c_t) |
|---|---|---|---|
| 0 | 1.5 | 0.375 | -0.25 |
| 1 | 1.75 | 0.328125 | -0.125 |
| 2 | 1.875 | 0.31640625 | -0.0625 |
| 3 | 1.9375 | 0.3134765625 | -0.03125 |
| 4 | 1.96875 | 0.3127 | -0.015625 |
After four iterations of gradient descent with \eta = 1 and c_0 = \frac{3}{2}, we obtain:
c^\ast_{\rm Huber} \approx 1.96875
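The iterations above can be reproduced with a short Python sketch of gradient descent on the Huber empirical risk; the helper names below are our own, not part of the exam.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 4.0])
delta = 0.5                      # Huber threshold from the problem

def risk(c):
    r = c - x
    return np.mean(np.where(np.abs(r) <= delta,
                            0.5 * r**2,
                            delta * np.abs(r) - 0.5 * delta**2))

def risk_grad(c):
    r = c - x
    return np.mean(np.where(np.abs(r) <= delta, r, delta * np.sign(r)))

c, eta = 1.5, 1.0                # c_0 = 3/2, step size eta = 1
for t in range(5):
    print(t, c, risk(c), risk_grad(c))
    c = c - eta * risk_grad(c)   # gradient descent update
# Reproduces the table: c_t = 1.5, 1.75, 1.875, 1.9375, 1.96875
```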
The derivative of the empirical risk for absolute loss is given by
R'(h) = \frac{1}{n}\sum_{i=1}^{n} \text{sign}(h - y_i).
See lectures 3-4 for a derivation. Thus
R'(2.5) = \frac{1}{6}(1 - 1 - 1 + 1 + 1 + 1) = \frac{1}{3} > 0.
So the correct answer is positive.
Alternative solution. Order the dataset: (-1, 1, 1, 2, 3, 4). The absolute-loss risk is minimized on the interval [1, 2] (between the two middle values). The point h = 2.5 lies to the right of this interval, so the risk is increasing there and the derivative must be positive.
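A quick numerical check of the sign formula, using the ordered dataset from the alternative solution:

```python
import numpy as np

y = np.array([-1, 1, 1, 2, 3, 4])   # dataset from the alternative solution
h = 2.5
print(np.mean(np.sign(h - y)))      # 0.333... = 1/3 > 0, so the derivative at h = 2.5 is positive
```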
Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 2
Consider training data \{(\vec{x}_i, y_i)\}_{i=1}^n, with \vec{x}_i\in \mathbb{R}^d and y_i \in \mathbb{R}. Let \mathbf{Z} be the design matrix, \vec{y} = [y_1 \dots y_n]^T, and model f(\vec{w},b;\vec{x}) = \vec{w}^\top \vec{x} + b.
To prove that the MSE minimizer is not unique, we will show that if \theta^\ast = [b^\ast\; w_1^\ast\; w_2^\ast\; \cdots\; w_d^\ast]^\top is a minimizer, then we can construct a different parameter vector that achieves the same MSE.
Suppose \theta^\ast = [b^\ast\; w_1^\ast\; w_2^\ast\; w_3^\ast\; \cdots\; w_d^\ast]^\top is a minimizer for the MSE. The predictions for each training example are:
f(\theta^\ast; \vec{x}_i) = b^\ast + w_1^\ast \vec{x}_i^{(1)} + w_2^\ast \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}
Since \vec{x}_i^{(1)} = \lambda \vec{x}_i^{(2)} for all i, we can substitute:
f(\theta^\ast; \vec{x}_i) = b^\ast + w_1^\ast (\lambda \vec{x}_i^{(2)}) + w_2^\ast \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}
Factoring out \vec{x}_i^{(2)}:
f(\theta^\ast; \vec{x}_i) = b^\ast + (\lambda w_1^\ast + w_2^\ast) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}
Now, for any \alpha \in \mathbb{R}, define a new parameter vector:
\widetilde{\theta} = [b^\ast\; (w_1^\ast + \alpha)\; (w_2^\ast - \lambda\alpha)\; w_3^\ast\; \cdots\; w_d^\ast]^\top
Let’s verify that this new parameter vector produces the same predictions:
f(\widetilde{\theta}; \vec{x}_i) = b^\ast + (w_1^\ast + \alpha) \vec{x}_i^{(1)} + (w_2^\ast - \lambda\alpha) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}
Substituting \vec{x}_i^{(1)} = \lambda \vec{x}_i^{(2)}:
f(\widetilde{\theta}; \vec{x}_i) = b^\ast + (w_1^\ast + \alpha)(\lambda \vec{x}_i^{(2)}) + (w_2^\ast - \lambda\alpha) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}
= b^\ast + (\lambda w_1^\ast + \lambda\alpha) \vec{x}_i^{(2)} + (w_2^\ast - \lambda\alpha) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}
= b^\ast + (\lambda w_1^\ast + w_2^\ast) \vec{x}_i^{(2)} + w_3^\ast \vec{x}_i^{(3)} + \cdots + w_d^\ast \vec{x}_i^{(d)}
This is exactly the same as f(\theta^\ast; \vec{x}_i).
Since \widetilde{\theta} produces identical predictions for all training examples, it achieves the same MSE as \theta^\ast:
R(\widetilde{\theta}) = \frac{1}{n}\sum_{i=1}^n (y_i - f(\widetilde{\theta}; \vec{x}_i))^2 = \frac{1}{n}\sum_{i=1}^n (y_i - f(\theta^\ast; \vec{x}_i))^2 = R(\theta^\ast)
For any nonzero \alpha, we have \widetilde{\theta} \neq \theta^\ast, yet both minimize the MSE. Therefore, the minimizer is not unique.
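The construction can be checked numerically. The sketch below builds a small synthetic dataset in which the first feature is an exact multiple of the second (all names, sizes, and the value of \lambda are illustrative), fits least squares, and confirms that the perturbed parameter vector attains the same MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 3.0
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = lam * x2                                    # feature 1 is an exact multiple of feature 2
Z = np.column_stack([np.ones(n), x1, x2, x3])    # design matrix with intercept column
y = 2.0 + 1.0 * x2 - 0.5 * x3 + rng.normal(scale=0.1, size=n)

def mse(theta):
    return np.mean((y - Z @ theta) ** 2)

theta_star, *_ = np.linalg.lstsq(Z, y, rcond=None)   # one minimizer of the MSE

alpha = 5.0                                      # any nonzero alpha works
theta_tilde = theta_star + np.array([0.0, alpha, -lam * alpha, 0.0])

print(np.isclose(mse(theta_star), mse(theta_tilde)))   # True: same MSE, different parameters
```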
Answer: No, R_2 = R_1 (not R_2 > R_1).
Explanation:
From part (a), we know that the first feature is an exact scalar multiple of the second feature: \vec{x}_i^{(1)} = \lambda \vec{x}_i^{(2)} for all training examples.
When we have the full model with both features, the prediction for example i is: \hat{y}_i = b + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)} + \cdots + w_d x_i^{(d)}
Since x_i^{(1)} = \lambda x_i^{(2)}, we can rewrite this as: \hat{y}_i = b + w_1 (\lambda x_i^{(2)}) + w_2 x_i^{(2)} + w_3 x_i^{(3)} + \cdots + w_d x_i^{(d)} = b + (w_1 \lambda + w_2) x_i^{(2)} + w_3 x_i^{(3)} + \cdots + w_d x_i^{(d)}
This shows that any prediction achievable with both features 1 and 2 can be achieved using only feature 2 by setting the coefficient on feature 2 to be (w_1 \lambda + w_2).
When we delete feature 1 and retrain, the new model is: \hat{y}_i = b' + w_2' x_i^{(2)} + w_3' x_i^{(3)} + \cdots + w_d' x_i^{(d)}
Since the second feature can “absorb” the contribution of the first feature (by adjusting its weight), the model without feature 1 has the same expressive power as the model with both features. Therefore, it can achieve the same minimum MSE.
Specifically, if (b^\ast, w_1^\ast, w_2^\ast, \ldots, w_d^\ast) minimizes R_1, then the reduced parameters (b^\ast, w_2^\ast + \lambda w_1^\ast, w_3^\ast, \ldots, w_d^\ast) produce identical predictions, so the reduced model achieves risk R_1 and therefore R_2 \leq R_1. Conversely, any model without feature 1 is a special case of the full model (take w_1 = 0), so R_1 \leq R_2. Combining the two inequalities:
R_2 = R_1
The first feature is redundant given the second feature, so removing it does not reduce the model’s ability to fit the training data.
Answer: No, \mathbf{\widetilde{Z}}^\top \mathbf{\widetilde{Z}} is not invertible.
Recall that \mathbf{\widetilde{Z}}^\top \mathbf{\widetilde{Z}} is invertible if and only if the columns of \mathbf{\widetilde{Z}} are linearly independent. We will show that the columns of \mathbf{\widetilde{Z}} are linearly dependent, which means \mathbf{\widetilde{Z}}^\top \mathbf{\widetilde{Z}} cannot be invertible.
After performing the three modifications, the feature columns of the design matrix \mathbf{\widetilde{Z}} include, in particular, the interaction feature \vec{x}^{(1)}\vec{x}^{(2)} and the squared feature (\vec{x}^{(1)})^2.
From part (a), we know that \vec{x}^{(1)} = \lambda \vec{x}^{(2)} where \lambda \neq 0.
Using this relationship, we can express the new polynomial and interaction features in terms of \vec{x}^{(2)}:
\vec{x}^{(1)}\vec{x}^{(2)} = (\lambda \vec{x}^{(2)}) \cdot \vec{x}^{(2)} = \lambda (\vec{x}^{(2)})^2
(\vec{x}^{(1)})^2 = (\lambda \vec{x}^{(2)})^2 = \lambda^2 (\vec{x}^{(2)})^2
From the equations above, we can write:
(\vec{x}^{(1)})^2 = \lambda^2 (\vec{x}^{(2)})^2 = \lambda \cdot \left[\lambda (\vec{x}^{(2)})^2\right] = \lambda \cdot \vec{x}^{(1)}\vec{x}^{(2)}
This shows that the feature (\vec{x}^{(1)})^2 is a scalar multiple of the feature \vec{x}^{(1)}\vec{x}^{(2)}.
Since one column of \mathbf{\widetilde{Z}} (the (\vec{x}^{(1)})^2 column) can be written as a scalar multiple of another column (the \vec{x}^{(1)}\vec{x}^{(2)} column), the columns of \mathbf{\widetilde{Z}} are linearly dependent.
Because the columns of \mathbf{\widetilde{Z}} are linearly dependent, the matrix \mathbf{\widetilde{Z}}^\top \mathbf{\widetilde{Z}} is not invertible.
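A small numerical illustration of the same point, assuming a synthetic dataset with \vec{x}^{(1)} = \lambda\vec{x}^{(2)} and the interaction and squared columns described above (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 20, 2.0
x2 = rng.normal(size=n)
x1 = lam * x2
# Columns include the interaction x1*x2 and the square x1**2, which satisfy
# x1**2 = lam * (x1 * x2), so the columns are linearly dependent.
Z_tilde = np.column_stack([np.ones(n), x1, x2, x1 * x2, x1 ** 2])
gram = Z_tilde.T @ Z_tilde
print(np.linalg.matrix_rank(gram), gram.shape[0])   # rank < 5, so Z~^T Z~ is not invertible
```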
Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 3
Let \mathbf{Z} be the design matrix and \vec{y} the label vector. Assume \mathbf{Z}^\top \mathbf{Z} is invertible, so that the MSE minimizer \theta^\ast is unique, and define the residual vector \vec{r} = \vec{y} - \mathbf{Z}\theta^\ast.
Solution:
Recall that \theta^* = [b^*\;\vec{w}^*]^\top is the unique minimizer of the mean squared error (MSE):
R(\theta) = \frac{1}{n}\|\vec{y} - \mathbf{Z}\theta\|^2
At a minimum, the gradient of R(\theta) with respect to \theta must equal zero. Let’s compute this gradient.
First, we can expand: R(\theta) = \frac{1}{n}(\vec{y} - \mathbf{Z}\theta)^\top(\vec{y} - \mathbf{Z}\theta)
Taking the gradient with respect to \theta: \nabla_\theta R(\theta) = \frac{1}{n} \cdot 2\mathbf{Z}^\top(\mathbf{Z}\theta - \vec{y}) = \frac{2}{n}\mathbf{Z}^\top(\mathbf{Z}\theta - \vec{y})
At the minimizer \theta^*, we have: \nabla_\theta R(\theta^*) = \vec{0}
Therefore: \frac{2}{n}\mathbf{Z}^\top(\mathbf{Z}\theta^* - \vec{y}) = \vec{0}
Simplifying: \mathbf{Z}^\top(\mathbf{Z}\theta^* - \vec{y}) = \vec{0}
Multiplying both sides by -1: \mathbf{Z}^\top(\vec{y} - \mathbf{Z}\theta^*) = \vec{0}
By definition, \vec{r} = \vec{y} - \mathbf{Z}\theta^*, so: \mathbf{Z}^\top\vec{r} = \vec{0}
Now, \mathbf{Z}^\top\vec{r} is a (d+1) \times 1 vector whose j-th entry is the dot product of the j-th column of \mathbf{Z} with the residual vector \vec{r}:
(\mathbf{Z}^\top\vec{r})_j = [Z_{1j}\; Z_{2j}\; \dotsc\; Z_{nj}]\,\vec{r} = \sum_{i=1}^n Z_{ij}\, r_i
Since \mathbf{Z}^\top\vec{r} = \vec{0}, each component equals zero: \sum_{i=1}^n Z_{ij}\, r_i = 0 \quad \text{for each } 1 \leq j \leq d+1
This proves that \vec{r} is orthogonal to every column of \mathbf{Z}.
From part (a), we proved that the residual vector \vec{r} is orthogonal to every column of the design matrix \mathbf{Z}. That is, for each column j where 1 \leq j \leq d+1:
\sum_{i=1}^n Z_{ij}\, r_i = 0
Now, recall that in multiple linear regression, the design matrix \mathbf{Z} has the form:
\mathbf{Z} = \begin{bmatrix} 1 & \vec{x}_1^\top \\ 1 & \vec{x}_2^\top \\ \vdots & \vdots \\ 1 & \vec{x}_n^\top \end{bmatrix}
The first column of \mathbf{Z} corresponds to the intercept/bias term and is simply a column of all ones:
\text{Column 1 of } \mathbf{Z} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
Since \vec{r} is orthogonal to every column of \mathbf{Z}, it must be orthogonal to this first column. Therefore:
\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \cdot \vec{r} = 0
Computing this dot product explicitly:
\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \cdot \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix} = 1 \cdot r_1 + 1 \cdot r_2 + \cdots + 1 \cdot r_n = \sum_{i=1}^n r_i
Since this dot product equals zero, we have:
\sum_{i=1}^n r_i = 0
This shows that the residuals sum to zero, which is a fundamental property of least squares regression.
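Both facts are easy to confirm numerically on synthetic data; the sketch below solves the normal equations and checks \mathbf{Z}^\top\vec{r} = \vec{0} and \sum_i r_i = 0 (all names and sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 3
X = rng.normal(size=(n, d))
Z = np.column_stack([np.ones(n), X])            # first column of ones for the intercept
y = rng.normal(size=n)

theta_star = np.linalg.solve(Z.T @ Z, Z.T @ y)  # normal equations (Z^T Z invertible here)
r = y - Z @ theta_star                          # residual vector

print(np.allclose(Z.T @ r, 0.0))   # True: r is orthogonal to every column of Z
print(np.isclose(r.sum(), 0.0))    # True: residuals sum to zero (ones column)
```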
To prove that \mathbf{P} is symmetric, we need to show that \mathbf{P}^\top = \mathbf{P}.
Starting with the definition of \mathbf{P}:
\mathbf{P} = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top
Take the transpose of both sides:
\mathbf{P}^\top = \left[\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\right]^\top
Using the property that (ABC)^\top = C^\top B^\top A^\top, we have:
\mathbf{P}^\top = (\mathbf{Z}^\top)^\top \left[(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]^\top \mathbf{Z}^\top
Simplifying (\mathbf{Z}^\top)^\top = \mathbf{Z}:
\mathbf{P}^\top = \mathbf{Z} \left[(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]^\top \mathbf{Z}^\top
Now we need to simplify \left[(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]^\top.
First, note that \mathbf{Z}^\top\mathbf{Z} is symmetric. To see this:
(\mathbf{Z}^\top\mathbf{Z})^\top = \mathbf{Z}^\top(\mathbf{Z}^\top)^\top = \mathbf{Z}^\top\mathbf{Z}
Since \mathbf{Z}^\top\mathbf{Z} is symmetric, we have (\mathbf{Z}^\top\mathbf{Z})^\top = \mathbf{Z}^\top\mathbf{Z}.
Using the property that (A^{-1})^\top = (A^\top)^{-1}:
\left[(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]^\top = \left[(\mathbf{Z}^\top\mathbf{Z})^\top\right]^{-1} = (\mathbf{Z}^\top\mathbf{Z})^{-1}
Substituting this back:
\mathbf{P}^\top = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top = \mathbf{P}
Therefore, \mathbf{P} is symmetric. \square
We need to show that \mathbf{P}^2 = \mathbf{P} where \mathbf{P} = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top.
Let’s compute \mathbf{P}^2:
\mathbf{P}^2 = \mathbf{P} \cdot \mathbf{P} = \left[\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\right] \left[\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\right]
Using the associativity of matrix multiplication, we can regroup:
\mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top
The key is to recognize that the middle part \mathbf{Z}^\top\mathbf{Z} multiplied by its inverse (\mathbf{Z}^\top\mathbf{Z})^{-1} gives the identity:
\mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\left[\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\right]\mathbf{Z}^\top
Since (\mathbf{Z}^\top\mathbf{Z})^{-1}(\mathbf{Z}^\top\mathbf{Z}) = \mathbf{I}_{d+1} where \mathbf{I}_{d+1} is the (d+1) \times (d+1) identity matrix:
\mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{I}_{d+1}\mathbf{Z}^\top
Since multiplying by the identity matrix doesn’t change the result:
\mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top
\mathbf{P}^2 = \mathbf{P}
Therefore, \mathbf{P} is idempotent. \square
A common error when proving idempotency is to incorrectly simplify after recognizing that (\mathbf{Z}^\top\mathbf{Z})^{-1}(\mathbf{Z}^\top\mathbf{Z}) = \mathbf{I}_{d+1}.
Incorrect approach:
After correctly writing: \mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top
Some students might write: \mathbf{P}^2 = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{I}_{d+1}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top
This is wrong because it replaces the lone factor \mathbf{Z}^\top\mathbf{Z} with the identity (as if \mathbf{Z}^\top\mathbf{Z} = \mathbf{I}_{d+1}) while leaving both inverse factors in the expression.
What went wrong: The identity comes from cancelling \mathbf{Z}^\top\mathbf{Z} against an adjacent factor of (\mathbf{Z}^\top\mathbf{Z})^{-1}, so substituting \mathbf{I}_{d+1} must replace that whole adjacent pair, not just the middle \mathbf{Z}^\top\mathbf{Z} on its own.
Correct grouping: Group the middle \mathbf{Z}^\top\mathbf{Z} with either neighboring inverse (for example \mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1} = \mathbf{I}_{d+1}, as in the proof above); either choice collapses the expression back to \mathbf{P}.
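For a quick numerical confirmation of both properties on a random design matrix (the sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 4
Z = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T

print(np.allclose(P, P.T))      # True: P is symmetric
print(np.allclose(P @ P, P))    # True: P is idempotent
```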
Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 4
Four independent rolls R_1,R_2,R_3,R_4 of a fair six-sided die. Events:
A = \{\text{at least one } R_i = 3\},\quad B = \{\sum R_i = 6\},\quad C = \{R_i \le 4 \text{ for each } i\}.
The sample space \Omega consists of all possible outcomes when rolling a fair six-sided die four times independently. Since each roll R_i can result in any of the values \{1, 2, 3, 4, 5, 6\}, the sample space is the set of all ordered 4-tuples:
\Omega = \{(r_1, r_2, r_3, r_4) : r_i \in \{1, 2, 3, 4, 5, 6\} \text{ for each } i = 1, 2, 3, 4\}
Alternatively, we can express this as the Cartesian product:
\Omega = \{1, 2, 3, 4, 5, 6\}^4
The total number of outcomes in the sample space is:
|\Omega| = 6^4 = 1296
Each outcome (r_1, r_2, r_3, r_4) represents a specific sequence of four die rolls, where r_i is the result of the i-th roll. For example, (1, 3, 5, 2) represents rolling a 1 first, then a 3, then a 5, and finally a 2.
We use the complement rule since it’s easier to compute the probability that no rolls equal 3.
\Pr(A) = 1 - \Pr(A^c) = 1 - \Pr(\text{no rolls equal 3})
For each roll, the probability of not getting a 3 is \frac{5}{6}. Since the four rolls are independent:
\Pr(\text{no rolls equal 3}) = \left(\frac{5}{6}\right)^4 = \frac{625}{1296}
Therefore:
\Pr(A) = 1 - \frac{625}{1296} = \frac{671}{1296} \approx \boxed{0.518}
We need to count the number of ordered outcomes (R_1, R_2, R_3, R_4) where each R_i \in \{1, 2, 3, 4, 5, 6\} and R_1 + R_2 + R_3 + R_4 = 6.
Using the substitution s_i = R_i - 1 (so 0 \leq s_i \leq 5), we need:
s_1 + s_2 + s_3 + s_4 = 2
The number of non-negative integer solutions is given by stars and bars:
\binom{2 + 4 - 1}{4 - 1} = \binom{5}{3} = 10
Since the sum is only 2, no variable s_i can exceed 2, so all solutions automatically satisfy s_i \leq 5. Therefore, there are exactly 10 favorable outcomes.
The total number of possible outcomes is 6^4 = 1296.
\Pr(B) = \frac{10}{1296} = \frac{5}{648} \approx \boxed{0.008}
For each roll to be at most 4, we have 4 favorable outcomes: \{1, 2, 3, 4\} out of 6 possible outcomes.
Since the rolls are independent:
\Pr(C) = \left(\frac{4}{6}\right)^4 = \left(\frac{2}{3}\right)^4 = \frac{16}{81} \approx \boxed{0.198}
We need to find the probability that at least one die shows 3 AND the sum of all four dice equals 6.
To find \Pr{A \cap B}, we count outcomes where both conditions hold simultaneously.
Finding all ways to get a sum of 6:
We need to partition 6 into 4 parts, where each part is between 1 and 6. The valid partitions are:
- \{3, 1, 1, 1\} — contains a 3 ✓
- \{2, 2, 1, 1\} — does not contain a 3 ✗
For event A \cap B to occur, we need at least one 3 AND sum equal to 6. This means only the partition \{3, 1, 1, 1\} satisfies both conditions.
Counting ordered arrangements:
The number of permutations of \{3, 1, 1, 1\} is: \frac{4!}{1! \cdot 3!} = \frac{24}{6} = 4
These four outcomes are: (3,1,1,1), (1,3,1,1), (1,1,3,1), (1,1,1,3).
Computing the probability:
The total number of outcomes is 6^4 = 1296.
\Pr{A \cap B} = \frac{4}{1296} = \frac{1}{324} \approx \boxed{0.003}
Events A and B are independent if and only if \Pr{A \cap B} = \Pr{A} \cdot \Pr{B}.
From part (b), we have:
- \Pr{A} = 1 - \left(\frac{5}{6}\right)^4 = 1 - \frac{625}{1296} = \frac{671}{1296} \approx 0.518
- \Pr{B} = \frac{10}{1296} = \frac{5}{648} \approx 0.008
Therefore: \Pr{A} \cdot \Pr{B} = \frac{671}{1296} \cdot \frac{10}{1296} = \frac{6710}{1{,}679{,}616} \approx 0.004
Conclusion:
Since \Pr{A \cap B} = \frac{4}{1296} \approx 0.003 and \Pr{A} \cdot \Pr{B} \approx 0.004, we have:
\Pr{A \cap B} \neq \Pr{A} \cdot \Pr{B}
Therefore, events A and B are NOT independent.
Recall: - A = \{\text{at least one } R_i = 3\} - C = \{R_i \leq 4 \text{ for each } 1 \leq i \leq 4\}
The event A \cap C means: at least one die shows 3, AND all four dice show values \leq 4.
We’ll compute this with the multiplication rule (a complement argument appears in Step 2). First, note that:
\Pr{A \cap C} = \Pr{C} \cdot \Pr{A \mid C}
Step 1: Find \Pr{C}
For event C, each die must show a value in \{1, 2, 3, 4\}.
- Probability that one die shows \leq 4: \frac{4}{6} = \frac{2}{3}
- Probability that all four dice show \leq 4:
\Pr{C} = \left(\frac{2}{3}\right)^4 = \frac{16}{81}
Step 2: Find \Pr{A \mid C}
Given that all dice show \leq 4, we want the probability that at least one shows 3.
Using the complement: \Pr{A \mid C} = 1 - \Pr{\text{no die shows 3} \mid C}
Given a die shows \leq 4, the probability it doesn’t show 3 is \frac{3}{4} (values in \{1, 2, 4\}).
Thus: \Pr{\text{no die shows 3} \mid C} = \left(\frac{3}{4}\right)^4 = \frac{81}{256}
Therefore: \Pr{A \mid C} = 1 - \frac{81}{256} = \frac{175}{256}
Step 3: Compute \Pr{A \cap C}
\Pr{A \cap C} = \Pr{C} \cdot \Pr{A \mid C} = \frac{16}{81} \cdot \frac{175}{256} = \frac{2800}{20736} = \frac{175}{1296}
In decimal form: \Pr{A \cap C} = \frac{175}{1296} \approx \boxed{0.135}
Events A and C are independent if and only if \Pr{A \cap C} = \Pr{A} \cdot \Pr{C}.
From part (b):
- \Pr{A} = 1 - \left(\frac{5}{6}\right)^4 = 1 - \frac{625}{1296} = \frac{671}{1296} \approx 0.518
- \Pr{C} = \frac{16}{81} \approx 0.198
Compute \Pr{A} \cdot \Pr{C}:
\Pr{A} \cdot \Pr{C} = \frac{671}{1296} \cdot \frac{16}{81} = \frac{671 \cdot 16}{1296 \cdot 81} = \frac{10736}{104976} = \frac{671}{6561} \approx 0.102
Comparison:
- \Pr{A \cap C} \approx 0.135
- \Pr{A} \cdot \Pr{C} \approx 0.102
Since 0.135 \neq 0.102, we have \Pr{A \cap C} \neq \Pr{A} \cdot \Pr{C}.
Conclusion: Events A and C are NOT independent.
Intuition: Knowing that all dice show \leq 4 (event C) actually increases the probability of seeing at least one 3 (event A), because we’ve eliminated the possibility of 5’s and 6’s, making 3’s more likely relative to the reduced sample space.
Event B \cap C consists of outcomes where the sum of the four dice equals 6 and all four dice show values \leq 4.
To find these outcomes, we need to count all ways to partition 6 into exactly 4 positive integers, each between 1 and 4.
The possible partitions are:
- \{1, 1, 1, 3\}: sum = 1 + 1 + 1 + 3 = 6 ✓
- \{1, 1, 2, 2\}: sum = 1 + 1 + 2 + 2 = 6 ✓
These are the only partitions that work. Any other combination either has a sum \neq 6 or requires at least one die to show a value > 4.
Now we count the number of ordered outcomes for each partition:
For \{1, 1, 1, 3\}: We need to count arrangements of three 1’s and one 3. \text{Number of arrangements} = \frac{4!}{3! \cdot 1!} = 4
For \{1, 1, 2, 2\}: We need to count arrangements of two 1’s and two 2’s. \text{Number of arrangements} = \frac{4!}{2! \cdot 2!} = 6
Total outcomes in B \cap C: 4 + 6 = 10
Since the sample space has 6^4 = 1296 total outcomes: \Pr(B \cap C) = \frac{10}{1296} = \frac{5}{648} \approx 0.008
To determine if B and C are independent, we check whether \Pr(B \cap C) = \Pr(B) \times \Pr(C).
Finding \Pr(B):
From part (b), we computed \Pr(B) by finding all ways to get a sum of 6 with four dice (each showing 1-6). Notice that the partitions we found above are actually all the partitions of 6 into 4 positive integers \leq 6. Therefore: \Pr(B) = \frac{10}{1296} \approx 0.008
Finding \Pr(C):
Event C requires all four dice to show values \leq 4. Each die has 4 favorable outcomes out of 6 possible: \Pr(C) = \left(\frac{4}{6}\right)^4 = \left(\frac{2}{3}\right)^4 = \frac{256}{1296} = \frac{16}{81} \approx 0.198
Computing \Pr(B) \times \Pr(C): \Pr(B) \times \Pr(C) = \frac{10}{1296} \times \frac{256}{1296} = \frac{2560}{1{,}679{,}616} \approx 0.002
Since \Pr(B \cap C) \approx 0.008 and \Pr(B) \times \Pr(C) \approx 0.002, we have: \Pr(B \cap C) \neq \Pr(B) \times \Pr(C)
Therefore, events B and C are NOT independent.
Intuition: Notice that B \subseteq C (every outcome in B is also in C, since getting a sum of 6 with four dice automatically means all dice must show \leq 4). This subset relationship means that knowing C occurred significantly increases the probability of B, which is the hallmark of dependence.
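Since the sample space has only 6^4 = 1296 outcomes, every probability and independence check above can be verified by exhaustive enumeration; a short Python sketch (the event names mirror the problem):

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=4))     # all 6^4 = 1296 equally likely rolls
N = len(outcomes)

A = [o for o in outcomes if 3 in o]                 # at least one roll equals 3
B = [o for o in outcomes if sum(o) == 6]            # the rolls sum to 6
C = [o for o in outcomes if max(o) <= 4]            # every roll is at most 4

def pr(event):
    return Fraction(len(event), N)

print(pr(A), pr(B), pr(C))                          # 671/1296, 5/648, 16/81
print(pr([o for o in A if o in B]), pr(A) * pr(B))  # 1/324 vs 3355/839808 -> not independent
print(pr([o for o in A if o in C]), pr(A) * pr(C))  # 175/1296 vs 671/6561 -> not independent
print(pr([o for o in B if o in C]), pr(B) * pr(C))  # 5/648 vs 10/6561    -> not independent
```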
Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 5
Ridge regularized risk problem with \vec{x}_i \in \mathbb{R}^d, orthogonal unit-length columns in \mathbf{Z}:
R_\mu(\vec\theta) = \frac{1}{n}\|\vec{y}-\mathbf{Z}\vec\theta\|^2 + \mu \|\widetilde{\mathbf{I}_{d+1}}\vec\theta\|^2
We need to find the entries of the matrix \mathbf{Z}^{\top}\mathbf{Z}.
Recall that when we compute the matrix product \mathbf{Z}^{\top}\mathbf{Z}, the (i,j)-th entry is the dot product of the i-th row of \mathbf{Z}^{\top} with the j-th column of \mathbf{Z}.
Since the i-th row of \mathbf{Z}^{\top} equals the i-th column of \mathbf{Z}, we can write:
[\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = (\text{column } i \text{ of } \mathbf{Z})^{\top} (\text{column } j \text{ of } \mathbf{Z})
Let’s denote the columns of \mathbf{Z} as \vec{z}^{(1)}, \vec{z}^{(2)}, \ldots, \vec{z}^{(d+1)} \in \mathbb{R}^n. Then:
[\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = \vec{z}^{(i)} \cdot \vec{z}^{(j)}
Now we consider two cases:
Case 1: When i = j (diagonal entries)
For diagonal entries, we have: [\mathbf{Z}^{\top}\mathbf{Z}]_{ii} = \vec{z}^{(i)} \cdot \vec{z}^{(i)} = \|\vec{z}^{(i)}\|^2
We are given that each column has “vector length one.” In the context of design matrices and the result we need to prove, this means that each column has been normalized so that:
\frac{1}{n}\|\vec{z}^{(i)}\|^2 = 1
which is equivalent to:
\|\vec{z}^{(i)}\|^2 = n
Therefore: [\mathbf{Z}^{\top}\mathbf{Z}]_{ii} = n
Case 2: When i \neq j (off-diagonal entries)
For off-diagonal entries, we have: [\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = \vec{z}^{(i)} \cdot \vec{z}^{(j)}
We are given that the columns are pairwise orthogonal, which means: \vec{z}^{(i)} \cdot \vec{z}^{(j)} = 0 \quad \text{for all } i \neq j
Therefore: [\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = 0
Conclusion:
Combining both cases, we have: [\mathbf{Z}^{\top}\mathbf{Z}]_{ij} = \begin{cases} n & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
This is precisely the definition of the matrix n\mathbf{I}_{d+1}, where \mathbf{I}_{d+1} is the (d+1) \times (d+1) identity matrix.
Therefore, \mathbf{Z}^{\top}\mathbf{Z} = n\mathbf{I}_{d+1}. \quad \blacksquare
Claim: for each slope index j \geq 1, \quad \theta^\ast_\mu{}^{(j)} = \frac{1}{1+\mu}\, \theta^\ast_0{}^{(j)}.
To find the minimizer of the ridge regularized risk, we take the gradient and set it equal to zero.
Step 1: Compute the gradient of R_{\mu}(\vec\theta)
The ridge risk is: R_{\mu}(\vec\theta) = \frac{1}{n}\|\,\vec{y}-\mathbf{Z}\vec\theta\|^{2} + \mu\,\|\widetilde{\mathbf{I}_{d+1}}\vec\theta\|^{2}
Taking the gradient: \nabla R_{\mu}(\vec\theta) = \frac{2}{n}\mathbf{Z}^{\top}(\mathbf{Z}\vec\theta - \vec{y}) + 2\mu\widetilde{\mathbf{I}_{d+1}}\vec\theta
Step 2: Set the gradient equal to zero
\frac{2}{n}\mathbf{Z}^{\top}\mathbf{Z}\vec\theta - \frac{2}{n}\mathbf{Z}^{\top}\vec{y} + 2\mu\widetilde{\mathbf{I}_{d+1}}\vec\theta = \vec{0}
Simplifying: \mathbf{Z}^{\top}\mathbf{Z}\vec\theta + n\mu\widetilde{\mathbf{I}_{d+1}}\vec\theta = \mathbf{Z}^{\top}\vec{y}
Step 3: Use the result from part (a)
From part (a), we know that \mathbf{Z}^{\top}\mathbf{Z} = n\mathbf{I}_{d+1}. Substituting:
n\mathbf{I}_{d+1}\vec\theta + n\mu\widetilde{\mathbf{I}_{d+1}}\vec\theta = \mathbf{Z}^{\top}\vec{y}
Factor out n: n(\mathbf{I}_{d+1} + \mu\widetilde{\mathbf{I}_{d+1}})\vec\theta = \mathbf{Z}^{\top}\vec{y}
Step 4: Analyze the matrix \mathbf{I}_{d+1} + \mu\widetilde{\mathbf{I}_{d+1}}
Recall that \widetilde{\mathbf{I}_{d+1}} is the identity matrix with the first diagonal entry (corresponding to the intercept) set to 0. Therefore, \mathbf{I}_{d+1} + \mu\widetilde{\mathbf{I}_{d+1}} is a diagonal matrix with:
- First diagonal entry: 1 + 0 = 1 (intercept term)
- Remaining diagonal entries: 1 + \mu (slope terms, j \geq 1)
Step 5: Solve for \vec\theta_{\mu}^{\ast} component-wise
For the slope terms (j \geq 1), the equation becomes: n(1 + \mu) \cdot \vec\theta_{\mu}^{\ast}{}^{(j)} = [\mathbf{Z}^{\top}\vec{y}]^{(j)}
Solving for \vec\theta_{\mu}^{\ast}{}^{(j)}: \vec\theta_{\mu}^{\ast}{}^{(j)} = \frac{1}{n(1+\mu)}[\mathbf{Z}^{\top}\vec{y}]^{(j)}
Step 6: Relate to the ridge-free minimizer
For the ridge-free case (\mu = 0), the normal equation gives: n\mathbf{I}_{d+1}\vec\theta_{0}^{\ast} = \mathbf{Z}^{\top}\vec{y}
So for each component: \vec\theta_{0}^{\ast}{}^{(j)} = \frac{1}{n}[\mathbf{Z}^{\top}\vec{y}]^{(j)}
Step 7: Substitute to obtain the final result
From Step 5: \vec\theta_{\mu}^{\ast}{}^{(j)} = \frac{1}{1+\mu} \cdot \frac{1}{n}[\mathbf{Z}^{\top}\vec{y}]^{(j)} = \frac{1}{1+\mu}\vec\theta_{0}^{\ast}{}^{(j)}
Therefore, for every slope index j \geq 1: \boxed{\vec\theta^{\ast}_{\mu}{}^{(j)} = \frac{1}{1+\mu}\;\vec\theta^{\ast}_{0}{}^{(j)}}
as required. \square
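The closed form can also be verified numerically. The sketch below constructs a design matrix satisfying \mathbf{Z}^{\top}\mathbf{Z} = n\mathbf{I}_{d+1} (a ones column plus centered, orthonormalized feature columns rescaled to squared length n), solves the ridge normal equations, and checks the \frac{1}{1+\mu} shrinkage; the construction and all names are illustrative, not part of the exam.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, mu = 100, 3, 0.7

# Build Z with Z^T Z = n I: ones column plus centered, orthonormalized
# feature columns rescaled to squared length n.
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                 # centered, so orthogonal to the ones column
Q, _ = np.linalg.qr(X)                 # orthonormal columns
Z = np.column_stack([np.ones(n), np.sqrt(n) * Q])
assert np.allclose(Z.T @ Z, n * np.eye(d + 1))

y = rng.normal(size=n)
I_tilde = np.eye(d + 1)
I_tilde[0, 0] = 0.0                    # do not penalize the intercept

# Minimizers satisfy (Z^T Z + n*mu*I_tilde) theta = Z^T y (see Step 2 above)
theta_0 = np.linalg.solve(Z.T @ Z, Z.T @ y)
theta_mu = np.linalg.solve(Z.T @ Z + n * mu * I_tilde, Z.T @ y)

print(np.allclose(theta_mu[1:], theta_0[1:] / (1 + mu)))   # True: uniform shrinkage of slopes
print(np.isclose(theta_mu[0], theta_0[0]))                 # True: intercept unchanged
```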
To prove: For each j \geq 1, the map \mu \mapsto |\vec{\theta}^{\ast}_{\mu}{}^{(j)}| is strictly decreasing and satisfies \lim_{\mu \to \infty} \vec{\theta}^{\ast}_{\mu}{}^{(j)} = 0.
From part (b), we have for each slope index j \geq 1:
\vec{\theta}^{\ast}_{\mu}{}^{(j)} = \frac{1}{1+\mu} \vec{\theta}^{\ast}_{0}{}^{(j)}
Taking the absolute value of both sides:
|\vec{\theta}^{\ast}_{\mu}{}^{(j)}| = \left|\frac{1}{1+\mu}\right| \cdot |\vec{\theta}^{\ast}_{0}{}^{(j)}| = \frac{1}{1+\mu} \cdot |\vec{\theta}^{\ast}_{0}{}^{(j)}|
where the simplification uses the fact that \mu > 0, so 1 + \mu > 1 > 0.
Now consider the function g(\mu) = \frac{1}{1+\mu} \cdot |\vec{\theta}^{\ast}_{0}{}^{(j)}| for \mu > 0.
Taking the derivative with respect to \mu:
g'(\mu) = |\vec{\theta}^{\ast}_{0}{}^{(j)}| \cdot \frac{d}{d\mu}\left[\frac{1}{1+\mu}\right] = |\vec{\theta}^{\ast}_{0}{}^{(j)}| \cdot \left(-\frac{1}{(1+\mu)^2}\right) = -\frac{|\vec{\theta}^{\ast}_{0}{}^{(j)}|}{(1+\mu)^2}
Key observation: If \vec{\theta}^{\ast}_{0}{}^{(j)} \neq 0, then g'(\mu) < 0 for all \mu > 0.
This shows that |\vec{\theta}^{\ast}_{\mu}{}^{(j)}| is strictly decreasing in \mu whenever \vec{\theta}^{\ast}_{0}{}^{(j)} \neq 0.
(In the trivial case where \vec{\theta}^{\ast}_{0}{}^{(j)} = 0, we have \vec{\theta}^{\ast}_{\mu}{}^{(j)} = 0 for all \mu, which is constant but can be considered weakly decreasing.)
We need to show that:
\lim_{\mu \to \infty} \vec{\theta}^{\ast}_{\mu}{}^{(j)} = 0
Using the relationship from part (b):
\lim_{\mu \to \infty} \vec{\theta}^{\ast}_{\mu}{}^{(j)} = \lim_{\mu \to \infty} \frac{1}{1+\mu} \vec{\theta}^{\ast}_{0}{}^{(j)} = \vec{\theta}^{\ast}_{0}{}^{(j)} \cdot \lim_{\mu \to \infty} \frac{1}{1+\mu}
Since \lim_{\mu \to \infty} \frac{1}{1+\mu} = 0, we have:
\lim_{\mu \to \infty} \vec{\theta}^{\ast}_{\mu}{}^{(j)} = \vec{\theta}^{\ast}_{0}{}^{(j)} \cdot 0 = 0
Therefore, as the ridge regularization parameter \mu grows to infinity, all slope coefficients shrink to zero.
Summary: The absolute value of each slope coefficient |\vec{\theta}^{\ast}_{\mu}{}^{(j)}| decreases strictly as \mu increases (by virtue of having a negative derivative), and approaches 0 as \mu \to \infty. This demonstrates the shrinkage effect of ridge regularization on the model parameters.
Interpretation of Ridge Regularization with Orthonormal Features
Parts (b) and (c) reveal the following geometric effects of ridge regularization when the feature columns are orthonormal:
Uniform Shrinkage: Part (b) shows that ridge regularization applies a uniform multiplicative scaling factor of \frac{1}{1+\mu} to all slope coefficients. This means that each slope parameter \vec\theta^{\ast}_{\mu}{}^{(j)} is exactly \frac{1}{1+\mu} times its ridge-free counterpart \vec\theta^{\ast}_{0}{}^{(j)}. In other words, ridge regularization shrinks all slope coefficients toward zero by the same proportion.
Monotonic Shrinkage: Part (c) demonstrates that as the regularization parameter \mu increases:
- The magnitude of each slope coefficient strictly decreases
- The shrinkage becomes more aggressive for larger values of \mu
- In the limit as \mu \to \infty, all slope coefficients are driven to exactly zero
Geometric Interpretation: When the feature columns are orthonormal, ridge regularization has a particularly clean geometric effect—it uniformly scales down the entire slope coefficient vector \vec{w}^\ast by the factor \frac{1}{1+\mu}. This is in contrast to more complex scenarios where non-orthogonal features can lead to different coefficients being shrunk by different amounts. The orthonormality ensures that the regularization penalty is applied equally and proportionally to all directions in the parameter space.
Practical Meaning: As we increase the regularization strength \mu, we are trading off model complexity (magnitude of coefficients) for better generalization. With very large \mu, the model approaches the trivial solution where all slopes are zero (effectively predicting only the intercept/mean).
Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 6
Sawyer arranges 12 distinct hot sauce bottles: 5 Hot (H_1,\dots,H_5) and 7 Mild (M_1,\dots,M_7).
To ensure no two Hot bottles are adjacent, we can use the following strategy: first place the Mild bottles, then insert the Hot bottles into the gaps created.
Step 1: Arrange the Mild bottles
We have 7 distinct Mild bottles to arrange in a row. The number of ways to do this is:
7!
Step 2: Identify positions for Hot bottles
Once the 7 Mild bottles are placed in a row, they create potential positions (gaps) where we can place the Hot bottles:
\_ \; M \; \_ \; M \; \_ \; M \; \_ \; M \; \_ \; M \; \_ \; M \; \_ \; M \; \_
There are 8 gaps total: one before the first Mild bottle, one after each of the 7 Mild bottles.
Step 3: Choose which gaps to use
We need to place 5 Hot bottles into these 8 available gaps. Since we want no two Hot bottles adjacent, we can place at most one Hot bottle in each gap. The number of ways to choose 5 gaps from 8 is:
\binom{8}{5}
Step 4: Arrange the Hot bottles in the chosen gaps
Once we’ve chosen which 5 gaps to use, we need to arrange the 5 distinct Hot bottles in those positions. The number of ways to do this is:
5!
Step 5: Calculate the total
By the multiplication principle, the total number of valid arrangements is:
7! \times \binom{8}{5} \times 5!
Let’s compute this:
- 7! = 5040
- \binom{8}{5} = \binom{8}{3} = \frac{8 \cdot 7 \cdot 6}{3 \cdot 2 \cdot 1} = \frac{336}{6} = 56
- 5! = 120
Therefore:
7! \times \binom{8}{5} \times 5! = 5040 \times 56 \times 120 = 33{,}868{,}800
Answer: There are \boxed{7! \times \binom{8}{5} \times 5! = 33{,}868{,}800} possible arrangements.
We need to count the number of ways to arrange 12 distinct bottles (5 Hot: H_1, \ldots, H_5 and 7 Mild: M_1, \ldots, M_7) such that all 5 Hot bottles are consecutive.
Step 1: Treat the Hot bottles as a single block
Since all 5 Hot bottles must be consecutive, we can treat them as a single “super-block.” This reduces our problem to arranging:
- 1 super-block (containing all 5 Hot bottles)
- 7 individual Mild bottles
This gives us 1 + 7 = 8 objects to arrange.
Step 2: Arrange the 8 objects
These 8 objects can be arranged in a row in: 8! = 40320 \text{ ways}
Step 3: Arrange the Hot bottles within the block
Within the super-block, the 5 distinct Hot bottles can be arranged among themselves in: 5! = 120 \text{ ways}
Step 4: Apply the multiplication principle
By the multiplication principle, the total number of arrangements is: 5! \times 8! = 120 \times 40320 = 4{,}838{,}400
\boxed{5! \times 8! = 4{,}838{,}400}
We need to count arrangements where exactly 4 Hot bottles are consecutive, and the 5th Hot bottle is not adjacent to this group of 4.
Step 1: Choose which 4 Hot bottles form the consecutive block
We have 5 distinct Hot bottles (H_1, H_2, H_3, H_4, H_5) and need to choose 4 of them to form the consecutive group:
\binom{5}{4} = 5 \text{ ways}
Step 2: Arrange the 4 chosen Hot bottles within the block
The 4 bottles within the consecutive block can be arranged in:
4! = 24 \text{ ways}
Step 3: Arrange all units with the non-adjacency constraint
Now we have 9 “units” to arrange in a row:
- 1 block of 4 Hot bottles (treated as a single unit, denoted B)
- 1 remaining Hot bottle (denoted H)
- 7 distinct Mild bottles (M_1, \ldots, M_7)
We need to count arrangements where H is not adjacent to B.
Without constraint: The 9 distinct units can be arranged in 9! ways.
With H adjacent to B:
- Treat H and B as a single super-unit
- This super-unit can be arranged internally in 2 ways: HB or BH
- We now have 8 units total: the super-unit plus 7 Mild bottles
- These 8 units can be arranged in 8! ways
- Total: 2 \times 8!
Arrangements where H is NOT adjacent to B:
9! - 2 \times 8! = 8!(9 - 2) = 7 \times 8! = 7 \times 40,320 = 282,240
Step 4: Combine all steps
Total number of valid arrangements:
\binom{5}{4} \times 4! \times (9! - 2 \times 8!)
= 5 \times 24 \times 282,240
= 120 \times 282,240
= 33,868,800
\boxed{33,868,800 = 5 \times 4! \times (7 \times 8!)}
Alternatively, this can be expressed as:
\boxed{\binom{5}{4} \times 4! \times (9! - 2 \times 8!)}
To find this probability, we need to divide the number of favorable outcomes (arrangements with no two Hot bottles adjacent) by the total number of possible arrangements.
Since all 12 bottles are distinct (5 Hot bottles H_1, \ldots, H_5 and 7 Mild bottles M_1, \ldots, M_7), the total number of ways to arrange them in a row is:
\text{Total arrangements} = 12!
From part (a), we found that the number of arrangements where no two Hot bottles are adjacent is:
\text{Favorable arrangements} = 7! \cdot \binom{8}{5} \cdot 5!
Explanation:
- First, arrange the 7 Mild bottles in a row: 7! ways
- This creates 8 possible positions (gaps) for the Hot bottles: one before the first Mild bottle, one between each adjacent pair of Mild bottles (6 positions), and one after the last Mild bottle
- Choose 5 of these 8 gaps to place the Hot bottles: \binom{8}{5} ways
- Arrange the 5 distinct Hot bottles in the chosen positions: 5! ways
The probability is:
P = \frac{7! \cdot \binom{8}{5} \cdot 5!}{12!}
Let’s compute each component:
- 7! = 5040
- \binom{8}{5} = \frac{8!}{5! \cdot 3!} = \frac{8 \cdot 7 \cdot 6}{3 \cdot 2 \cdot 1} = 56
- 5! = 120
- 12! = 479{,}001{,}600
Therefore:
P = \frac{5040 \cdot 56 \cdot 120}{479{,}001{,}600} = \frac{33{,}868{,}800}{479{,}001{,}600}
P = \frac{7! \cdot \binom{8}{5} \cdot 5!}{12!} = \frac{33{,}868{,}800}{479{,}001{,}600} \approx \boxed{0.071}
The probability is approximately 0.071 or 7.1%.
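A quick Monte Carlo estimate agrees with this probability; since only the Hot/Mild pattern matters for adjacency, it suffices to shuffle the 12 labels (the simulation parameters below are arbitrary).

```python
import random

random.seed(0)
labels = ["H"] * 5 + ["M"] * 7          # only the Hot/Mild pattern matters for adjacency
trials = 200_000
hits = 0
for _ in range(trials):
    random.shuffle(labels)
    if all(not (a == "H" and b == "H") for a, b in zip(labels, labels[1:])):
        hits += 1
print(hits / trials)                    # close to 33,868,800 / 12! ≈ 0.0707
```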
Source: DSC 40A Midterm Take-Home Exam, August 1, 2025, Exercise 7
Classifying video games into Adventure (y=1) or Strategy (y=0) based on 5 binary features. Training data for 14 games:
| Game | multiplayer | puzzle | customization | procedural | magic | y |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 0 | Adventure |
| … | … | … | … | … | … | … |
To compute the class probabilities, we need to count how many games belong to each class in the training data.
From the training data table:
- Adventure games (Y=1): games 1, 2, 3, 4, 5, 6 → 6 games total
- Strategy games (Y=0): games 7, 8, 9, 10, 11, 12, 13, 14 → 8 games total
- Total games: 14
The class probabilities are computed as the fraction of training examples in each class:
\mathbb{P}(Y=1) = \frac{\text{number of Adventure games}}{\text{total number of games}} = \frac{6}{14} = \frac{3}{7}
\mathbb{P}(Y=0) = \frac{\text{number of Strategy games}}{\text{total number of games}} = \frac{8}{14} = \frac{4}{7}
Answer:
- \mathbb{P}(Y=1) = \frac{3}{7} \approx 0.429
- \mathbb{P}(Y=0) = \frac{4}{7} \approx 0.571
We need to determine which of the five binary features are present (1) or absent (0) based on the advertisement:
multiplayer: The ad explicitly states “(Now featuring multiplayer mode)” → 1
puzzle: The ad mentions “mazes” which typically involve puzzle-solving mechanics to navigate through them → 1
customization: No mention of character customization in the advertisement → 0
procedural: The ad explicitly states “procedurally generated mazes” → 1
magic: The ad mentions “the magician Ganfald,” indicating a magic system → 1
Therefore, the feature vector is:
\vec{x} = \begin{bmatrix}1 \\ 1 \\ 0 \\ 1 \\ 1\end{bmatrix}
Or in row notation: \vec{x} = [1, 1, 0, 1, 1]
From part (b), the feature vector for Nebulon Warrior II is: \vec{x} = [1, 1, 0, 1, 1]
corresponding to: multiplayer=1, puzzle=1, customization=0, procedural=1, magic=1.
From the training data:
- Adventure games (Y=1): 6 games total (games 1-6)
- Strategy games (Y=0): 8 games total (games 7-14)
For Adventure games (Y=1):
- multiplayer=1: 5 times (games 1, 2, 4, 5, 6)
- puzzle=1: 3 times (games 1, 3, 4)
- customization=0: 6 times (all games)
- procedural=1: 3 times (games 2, 3, 4)
- magic=1: 1 time (game 6)
For Strategy games (Y=0):
- multiplayer=1: 3 times (games 9, 11, 13)
- puzzle=1: 3 times (games 8, 10, 13)
- customization=0: 1 time (game 14)
- procedural=1: 4 times (games 8, 11, 12, 14)
- magic=1: 3 times (games 12, 13, 14)
Since the features are binary, we use Laplace smoothing: add 1 to each count and add 2 (the number of possible feature values) to each denominator.
Conditional probabilities for Y=1 (Adventure):
\mathbb{P}(\text{multiplayer}=1 \mid Y=1) = \frac{5+1}{6+2} = \frac{6}{8} = \frac{3}{4}
\mathbb{P}(\text{puzzle}=1 \mid Y=1) = \frac{3+1}{6+2} = \frac{4}{8} = \frac{1}{2}
\mathbb{P}(\text{customization}=0 \mid Y=1) = \frac{6+1}{6+2} = \frac{7}{8}
\mathbb{P}(\text{procedural}=1 \mid Y=1) = \frac{3+1}{6+2} = \frac{4}{8} = \frac{1}{2}
\mathbb{P}(\text{magic}=1 \mid Y=1) = \frac{1+1}{6+2} = \frac{2}{8} = \frac{1}{4}
Conditional probabilities for Y=0 (Strategy):
\mathbb{P}(\text{multiplayer}=1 \mid Y=0) = \frac{3+1}{8+2} = \frac{4}{10} = \frac{2}{5}
\mathbb{P}(\text{puzzle}=1 \mid Y=0) = \frac{3+1}{8+2} = \frac{4}{10} = \frac{2}{5}
\mathbb{P}(\text{customization}=0 \mid Y=0) = \frac{1+1}{8+2} = \frac{2}{10} = \frac{1}{5}
\mathbb{P}(\text{procedural}=1 \mid Y=0) = \frac{4+1}{8+2} = \frac{5}{10} = \frac{1}{2}
\mathbb{P}(\text{magic}=1 \mid Y=0) = \frac{3+1}{8+2} = \frac{4}{10} = \frac{2}{5}
Using the Naive Bayes assumption and class priors: \mathbb{P}(Y=1) = \frac{6}{14} = \frac{3}{7}, \quad \mathbb{P}(Y=0) = \frac{8}{14} = \frac{4}{7}
For Y=1: \mathbb{P}(Y=1 \mid \vec{X}=\vec{x}) \propto \mathbb{P}(Y=1) \prod_{j=1}^{5} \mathbb{P}(X^{(j)}=x^{(j)} \mid Y=1) = \frac{3}{7} \cdot \frac{3}{4} \cdot \frac{1}{2} \cdot \frac{7}{8} \cdot \frac{1}{2} \cdot \frac{1}{4} = \frac{3 \cdot 3 \cdot 1 \cdot 7 \cdot 1 \cdot 1}{7 \cdot 4 \cdot 2 \cdot 8 \cdot 2 \cdot 4} = \frac{63}{3584} = \frac{9}{512}
For Y=0: \mathbb{P}(Y=0 \mid \vec{X}=\vec{x}) \propto \mathbb{P}(Y=0) \prod_{j=1}^{5} \mathbb{P}(X^{(j)}=x^{(j)} \mid Y=0) = \frac{4}{7} \cdot \frac{2}{5} \cdot \frac{2}{5} \cdot \frac{1}{5} \cdot \frac{1}{2} \cdot \frac{2}{5} = \frac{4 \cdot 2 \cdot 2 \cdot 1 \cdot 1 \cdot 2}{7 \cdot 5 \cdot 5 \cdot 5 \cdot 2 \cdot 5} = \frac{32}{8750} = \frac{16}{4375}
The normalization constant is: Z = \frac{9}{512} + \frac{16}{4375}
Converting to a common denominator (LCD = 2,240,000): \frac{9}{512} = \frac{39,375}{2,240,000}, \quad \frac{16}{4375} = \frac{8,192}{2,240,000} \qquad Z = \frac{47,567}{2,240,000}
Therefore: \mathbb{P}(Y=1 \mid \vec{X}=\vec{x}) = \frac{39,375/2,240,000}{47,567/2,240,000} = \frac{39,375}{47,567} \approx 0.828
\mathbb{P}(Y=0 \mid \vec{X}=\vec{x}) = \frac{8,192/2,240,000}{47,567/2,240,000} = \frac{8,192}{47,567} \approx 0.172
Final Answer: \boxed{\mathbb{P}(Y=0 \mid \vec{X}=\vec{x}) \approx 0.172} \quad \boxed{\mathbb{P}(Y=1 \mid \vec{X}=\vec{x}) \approx 0.828}
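The computation can be replicated exactly with a few lines of Python using fractions; the counts below are the ones tabulated above, and the helper names are our own.

```python
from fractions import Fraction as F

# Counts from the training data (see above): 6 Adventure games, 8 Strategy games.
n_adv, n_str = 6, 8
# Counts of the observed feature values x = [1, 1, 0, 1, 1] within each class:
# multiplayer=1, puzzle=1, customization=0, procedural=1, magic=1.
counts_adv = [5, 3, 6, 3, 1]
counts_str = [3, 3, 1, 4, 3]

def smoothed(count, n_class):
    return F(count + 1, n_class + 2)        # add 1 to the count, 2 to the denominator

prior_adv, prior_str = F(6, 14), F(8, 14)
num_adv, num_str = prior_adv, prior_str
for c_a, c_s in zip(counts_adv, counts_str):
    num_adv *= smoothed(c_a, n_adv)
    num_str *= smoothed(c_s, n_str)

Z = num_adv + num_str
print(num_adv, num_str)                        # 9/512 and 16/4375
print(float(num_adv / Z), float(num_str / Z))  # ≈ 0.828 and ≈ 0.172
```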
From part (c), we computed the posterior probabilities for the game Nebulon Warrior II with feature vector \vec{x} = [1, 1, 0, 1, 1]:
\mathbb{P}(Y=1 \mid \vec{X} = \vec{x}) = \frac{39,375}{47,567} \approx 0.828
\mathbb{P}(Y=0 \mid \vec{X} = \vec{x}) = \frac{8,192}{47,567} \approx 0.172
The Naive Bayes classifier predicts the class with the highest posterior probability. Since:
\mathbb{P}(Y=1 \mid \vec{X} = \vec{x}) > \mathbb{P}(Y=0 \mid \vec{X} = \vec{x})
the classifier would predict Adventure (i.e., \hat{y} = 1) for this new game.
This prediction makes intuitive sense: the game features multiplayer mode, puzzle-solving mechanics (mazes), procedurally generated content, and a magic system, which are characteristics more commonly associated with Adventure games in our training data. Specifically:
- 5 out of 6 Adventure games have multiplayer, while only 3 out of 8 Strategy games do
- 3 out of 6 Adventure games have puzzle mechanics, while 3 out of 8 Strategy games do
- Only 1 out of 6 Adventure games has magic, while 3 out of 8 Strategy games do
- The absence of customization also supports this prediction, as 7 out of 8 Strategy games have customization, but no Adventure games in the training set do
Despite the low occurrence of magic in Adventure games, the combination of all features together still favors an Adventure classification (posterior probability of roughly 0.83).