Instructor(s): Truong Son Hy, Mahdi Soleymani
This exam was administered in person. Students had 50 minutes to take this exam.
Source: Fall 2022 Midterm, Problem 1
Suppose that we are given f(x) = x^3 + x^2 and learning rate \alpha = 1/4.
Write down the updating rule for gradient descent in general, then write down the updating rule for gradient descent for the function f(x).
In general, the updating rule for gradient descent is: x_{i + 1} = x_i - \alpha \nabla f(x_i), where \alpha \in \mathbb{R}_+ is the learning rate, or step size. Since f here is a single-variable function, the updating rule becomes: x_{i + 1} = x_i - \alpha \frac{df}{dx}(x_i) = x_i - \alpha f'(x_i). We also have \frac{df}{dx} = f'(x) = 3x^2 + 2x, so with \alpha = \frac{1}{4} the updating rule can be written as: x_{i + 1} = x_i - \alpha(3x_i^2 + 2x_i) = -\frac{3}{4} x_i^2 + \frac{1}{2}x_i.
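For concreteness, here is a minimal Python sketch of this update rule; the helper names f, f_prime, and gd_step are our own, not part of the problem.

```python
# Minimal sketch of the gradient descent update for f(x) = x^3 + x^2 with alpha = 1/4.
def f(x):
    return x**3 + x**2

def f_prime(x):
    # f'(x) = 3x^2 + 2x
    return 3 * x**2 + 2 * x

def gd_step(x, alpha=0.25):
    # One update: x_{i+1} = x_i - alpha * f'(x_i)
    return x - alpha * f_prime(x)
```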
If we start at x_0 = -1, should we go left or right? Can you verify this mathematically? What is x_1? Can gradient descent converge? If so, where might it converge to, given an appropriate step size?
We have f'(x_0) = f'(-1) = 3(-1)^2 + 2(-1) = 1 > 0, so we go left, and x_1 = x_0 - \alpha f'(x_0) = -1 - \frac{1}{4} = -\frac{5}{4}. Intuitively, gradient descent cannot converge in this case because \lim_{x \rightarrow -\infty} f(x) = -\infty.
To verify this, we need to find all local minima and local maxima of f. First, we solve the equation f'(x) = 0 to find all critical points.
We have: f'(x) = 0 \Leftrightarrow 3x^2 + 2x = 0 \Leftrightarrow x = -\frac{2}{3} \ \ \text{or} \ \ x = 0.
Now, we consider the second-order derivative: f''(x) = \frac{d^2f}{dx^2} = 6x + 2.
We have f''(x) = 0 only when x = -1/3. Thus, for x < -1/3, f''(x) is negative, so the slope f'(x) is decreasing; for x > -1/3, f''(x) is positive, so the slope f'(x) is increasing. Keep in mind that -1 < -2/3 < -1/3 < 0 < 1.
Therefore, f has a local maximum at x = -2/3 and a local minimum at x = 0. If gradient descent starts at x_0 = -1, it always moves left, so it never reaches the local minimum at x = 0; instead, it keeps moving left forever. We say that gradient descent does not converge, i.e., it diverges.
If we start at x_0 = 1, should we go left or right? Can you verify this mathematically? What is x_1? Can gradient descent converge? If so, where might it converge to, given an appropriate step size?
We have f'(x_0) = f'(1) = 3 \cdot 1^2 + 2 \cdot 1 = 5 > 0, so we go left, and x_1 = x_0 - \alpha f'(x_0) = 1 - \frac{1}{4} \cdot 5 = -\frac{1}{4}.
From the previous part, f has a local minimum at x = 0, so gradient descent can converge to this local minimum (given an appropriate step size).
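As an illustration, the sketch below (reusing the gd_step function from earlier) runs a few iterations from each starting point: the iterates from x_0 = -1 keep decreasing without bound, while those from x_0 = 1 approach the local minimum at x = 0.

```python
# Illustrative runs from both starting points with alpha = 1/4.
for x0 in (-1.0, 1.0):
    x = x0
    iterates = [x]
    for _ in range(10):
        x = gd_step(x, alpha=0.25)
        iterates.append(x)
    print(x0, [round(t, 4) for t in iterates])
# From x0 = -1 the iterates keep decreasing (divergence);
# from x0 = 1 they approach the local minimum at x = 0.
```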
Write down one condition to terminate the gradient descent algorithm (in general).
There are several ways to terminate the gradient descent algorithm:
If the change in the optimization objective is too small, i.e., |f(x_i) - f(x_{i + 1})| < \epsilon where \epsilon is a small constant;
If the gradient is close to zero, i.e., the norm of the gradient is very small: \|\nabla f(x_i)\| < \lambda where \lambda is a small constant.
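For concreteness, here is a sketch of how these stopping rules might appear inside a gradient descent loop; the tolerances eps and lam and the iteration cap max_iters are illustrative choices, not part of the problem.

```python
# Gradient descent loop with both termination conditions (illustrative tolerances).
def gradient_descent(f, f_prime, x0, alpha=0.25, eps=1e-8, lam=1e-8, max_iters=10_000):
    x = x0
    for _ in range(max_iters):
        x_new = x - alpha * f_prime(x)
        if abs(f(x) - f(x_new)) < eps:   # the objective barely changes
            return x_new
        if abs(f_prime(x_new)) < lam:    # the gradient is (near) zero
            return x_new
        x = x_new
    return x
```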
Source: Fall 2022 Midterm, Problem 2
Calculate w_1^* and w_0^* of the linear regression line for the following dataset, \mathcal{D}. You may use the fill-in-the-blanks and the table below to organize your work.
\mathcal{D} = \{(0, 0), (4, 2), (5, 1)\}.
\bar x = \phantom{\hspace{.5in}} \bar y = \phantom{\hspace{.5in}}
x_i | y_i | (x_i - \bar x) | (y_i - \bar y) | (x_i - \bar x)(y_i - \bar y) | (x_i - \bar x)^2 |
---|---|---|---|---|---|
0 | 0 | ||||
4 | 2 | ||||
5 | 1 |
w_1^* =
w_0^* =
Finally, what can we say about the correlation r and the slope for this dataset (mathematically)?
\bar{x} = \frac{1}{3}(0 + 4 + 5) = 3 \quad \quad \bar{y} = \frac{1}{3}(0 + 2 + 1) = 1
x_i | y_i | (x_i - \bar x) | (y_i - \bar y) | (x_i - \bar x)(y_i - \bar y) | (x_i - \bar x)^2 |
---|---|---|---|---|---|
0 | 0 | -3 | -1 | 3 | 9 |
4 | 2 | 1 | 1 | 1 | 1 |
5 | 1 | 2 | 0 | 0 | 4 |
w_1^* = \displaystyle \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{3 + 1 + 0}{9 + 1 + 4} = \frac{4}{14} = \frac{2}{7}
w_0^* = \bar{y} - w_1^* \bar{x} = 1 - \frac{2}{7} \cdot 3 = 1 - \frac{6}{7} = \frac{1}{7}
Because w_1^* = r \frac{\sigma_y}{\sigma_x}, where the standard deviations \sigma_y and \sigma_x are positive, r has the same sign as the slope w_1^*. Since w_1^* = 2/7 > 0, both the correlation r and the slope are positive.
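As an optional numerical check of these computations, here is a short NumPy sketch:

```python
# Verify the least squares slope and intercept for D = {(0, 0), (4, 2), (5, 1)}.
import numpy as np

x = np.array([0.0, 4.0, 5.0])
y = np.array([0.0, 2.0, 1.0])

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(w1, w0)  # 0.2857... (= 2/7) and 0.1428... (= 1/7)
```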
Source: Fall 2022 Midterm, Problem 3
Mahdi runs a local pastry shop near UCSD and sells a traditional dessert called Baklava. He bakes Baklavas every morning to keep his promise of providing fresh Baklavas to his customers daily. Here is the amount of Baklava he sold each day during the last week, in pounds (lb): y_1=100, y_2=110, y_3=75, y_4=90, y_5=105, y_6=90, y_7=25.
Mahdi needs your help as a data scientist to suggest the best constant prediction (h^*) of daily sales that minimizes the empirical risk using L(h,y) as the loss function. Answer the following questions and give a brief justification for each part. This problem has many parts; if you get stuck, move on and come back later!
Let L(h,y)=|y-h|. What is h^*? (We’ll later refer to this prediction as h_1^*).
As we have seen in lectures, the median minimizes the absolute loss risk function. h^*_1=\text{Median}(y_1, \cdots, y_7) = 90.
Let L(h,y)=(y-h)^2. What is h^*? (We’ll later refer to this prediction as h_2^*).
As we have seen in lectures, the mean minimizes the square loss risk function. h^*_2=\text{Mean}(y_1, \cdots, y_7) = 85.
True or False: Removing y_1 and y_3 from the dataset does not change h_2^*.
True
False
False. It changes the mean from 85 to 84. (However, the median is not changed.)
Mahdi thinks that y_7 is an outlier. Hence, he asks you to remove y_7 and update your predictions in parts (a) and (b) accordingly. Without calculating the new predictions, can you justify which prediction changes more? h^*_1 or h_2^*?
Removing y_7 affects h_2^* more than h_1^*. This is because the mean, which minimizes squared loss, is more sensitive to outliers than the median, which minimizes absolute loss; removing the outlier y_7 therefore changes the mean more than the median.
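A quick numerical illustration of this sensitivity (a NumPy sketch, not required for the answer):

```python
import numpy as np

y = np.array([100, 110, 75, 90, 105, 90, 25])
y_without_y7 = y[:-1]

print(np.median(y), np.mean(y))                        # 90.0 85.0
print(np.median(y_without_y7), np.mean(y_without_y7))  # 95.0 95.0
# The mean (h2*) shifts by 10 lb, while the median (h1*) shifts by only 5 lb.
```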
True or False: Let L(y,h)=|y-h|^3. You can use the Gradient descent algorithm to find h^*.
True
False
False. The function |y-h|^3 is not differentiable everywhere, so we cannot use gradient descent to find the minimum.
True or False: Let L(y,h)=\sin(y-h). The Gradient descent algorithm is guaranteed to converge, provided that a proper learning rate is given.
True
False
False. The function is not convex, so the gradient descent algorithm is not guaranteed to converge.
Mahdi has noticed that daily Baklava sales are associated with the weather temperature, so he asks you to incorporate this feature to get a better prediction. Suppose last week's daily temperatures were x_1, x_2, \cdots, x_7 in Fahrenheit (F). We know that \bar x=65, \sigma_x=8.5, and the best linear prediction that minimizes the mean squared error is H^*(x)=-3x+w_0^*.
What is the correlation coefficient (r) between x and y? What does that mean?
r=-0.95. This means that weather temperature and Baklava sales are highly negatively correlated: as the temperature increases, sales tend to decrease.
We know w_1^* = \frac{\sigma_y}{\sigma_x}r. We know that \sigma_x=8.5 and w_1^*=-3. We can find \sigma_y as follows:
\begin{aligned} \sigma_y^2 =& \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2\\ =& \frac{1}{7}[(100-85)^2+(110-85)^2+(75-85)^2+(90-85)^2+(105-85)^2+(90-85)^2+(25-85)^2]\\ =&\frac{1}{7}[15^2+25^2+10^2+5^2+20^2+5^2+60^2]=714.28 \end{aligned}
Then \sigma_y \approx 26.7, so r = w_1^* \frac{\sigma_x}{\sigma_y} = \frac{-3 \cdot 8.5}{26.7} \approx -0.95.
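A quick numerical check of \sigma_y and r (a NumPy sketch):

```python
import numpy as np

y = np.array([100, 110, 75, 90, 105, 90, 25])
sigma_y = np.std(y)            # population standard deviation, about 26.7
sigma_x, w1 = 8.5, -3
r = w1 * sigma_x / sigma_y     # rearranging w1 = (sigma_y / sigma_x) * r
print(sigma_y, r)              # about 26.7 and -0.95
```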
True or False: The unit of r is \frac{lb}{F} (Pound per Fahrenheit).
True
False
False. The correlation coefficient has no units; it is always a unitless number in the range [-1, 1].
Find w^*_0. (Hint: You’ll need to find \bar y for the given dataset)
w_0^*=280
Note that the regression line passes through the point (\bar x, \bar y), i.e., H^*(\bar x)=\bar y, and \bar y = 85 (the mean of the sales computed earlier). Therefore, \begin{aligned} H^*(65)=-3\cdot 65 +w_0^*=85 \Rightarrow w_0^*=280. \end{aligned}
What would the best linear prediction H^*(x) be if we multiply all x_i’s by 2?
H^*(x) = -1.5x + 280
The standard deviation scales by a factor of 2, i.e., \sigma_x'=2\cdot \sigma_x. The same is true for the mean, i.e., \bar{x}'=2 \cdot \bar{x}. The correlation r, the standard deviation of the y-values \sigma_y, and the mean of the y-values \bar y do not change. (You can verify these claims by plugging 2x in for x in their respective formulas and seeing what happens, but it's faster to visually reason about why this happens.)
Therefore, w_1'^*=\frac{\sigma_y'}{\sigma_x'}r' = \frac{(\sigma_y)}{(2\cdot\sigma_x)}(r) = \frac{w_1^*}{2} = -1.5.
We can find w_0'^* as follows:
\begin{align*} \bar{y}'&=H(\bar{x}')\\&=\frac{w_1^*}{2}(2\bar{x})+w_0'^*\\&=w_1^*\bar{x}+w_0'^* \\ &\downarrow \\ (85) &= -3(65) + w_0'^* \\ w_0'^*&=280 \end{align*}
So, H^*(x) would be -1.5x + 280.
What would the best linear prediction H^*(x) be if we add 20 to all x_i’s?
H^*(x) = -3x + 340
All parameters remain unchanged except \bar{x}'=\bar{x}+20. Since r, \sigma_x and \sigma_y are not changed, w_1^* does not change. Then, one can find w_0^* as follows:
\begin{align*} \bar{y}'&=H(\bar{x}') \\ &\downarrow \\ (85) &=-3(65+20)+w_0^* \\ w_0^*&=340 \end{align*}
So, H^*(x) would be -3x + 340.
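As an illustration of both transformation rules, here is a sketch with a synthetic dataset; the individual x_i's are not given in the problem, so the data below is made up to roughly match \bar x = 65, \sigma_x = 8.5, and H^*(x) = -3x + 280.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(65, 8.5, size=1000)               # synthetic temperatures
y = -3 * x + 280 + rng.normal(0, 5, size=1000)   # synthetic sales near the fitted line

def fit(x, y):
    w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return w1, y.mean() - w1 * x.mean()

print(fit(x, y))       # roughly (-3, 280)
print(fit(2 * x, y))   # slope roughly -1.5, intercept still roughly 280
print(fit(x + 20, y))  # slope still roughly -3, intercept roughly 340
```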
Source: Fall 2022 Midterm, Problem 4
Consider a dataset that consists of y_1, \cdots, y_n. In class, we used calculus to minimize mean squared error, R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2. In this problem, we want you to apply the same approach to a slightly different loss function defined below: L_{\text{midterm}}(y,h)=(\alpha y - h)^2+\lambda h
Write down the empirical risk R_{\text{midterm}}(h) by using the above loss function.
R_{\text{midterm}}(h)=\frac{1}{n}\sum_{i=1}^{n}[(\alpha y_i - h)^2+\lambda h]=[\frac{1}{n}\sum_{i=1}^{n}(\alpha y_i - h)^2] +\lambda h
The mean of dataset is \bar{y}, i.e. \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i. Find h^* that minimizes R_{\text{midterm}}(h) using calculus. Your result should be in terms of \bar{y}, \alpha and \lambda.
h^*=\alpha \bar{y} - \frac{\lambda}{2}
\begin{align*} \frac{d}{dh}R_{\text{midterm}}(h)&= [\frac{2}{n}\sum_{i=1}^{n}(h- \alpha y_i )] +\lambda \\ &=2 h-2\alpha \bar{y} + \lambda. \end{align*}
By setting \frac{d}{dh}R_{\text{midterm}}(h)=0 we get 2 h^*-2\alpha \bar{y} + \lambda=0 \Rightarrow h^*=\alpha \bar{y} - \frac{\lambda}{2}.
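As a sanity check of this closed form, here is a sketch that compares it to a brute-force grid search; the data and the values of \alpha and \lambda below are illustrative, not from the problem.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative data
alpha, lam = 0.5, 2.0                # illustrative constants

# Evaluate R_midterm(h) on a dense grid of h values and take the minimizer.
hs = np.linspace(-10, 10, 200001)
risks = np.mean((alpha * y[:, None] - hs) ** 2, axis=0) + lam * hs
print(hs[np.argmin(risks)])          # approximately 0.25
print(alpha * y.mean() - lam / 2)    # closed form: alpha * y_bar - lambda / 2 = 0.25
```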