
**Instructor(s):** Truong Son Hy, Mahdi Soleymani

This exam was administered in person. Students had **50
minutes** to take this exam.

*Source: Fall 2022
Midterm, Problem 1*

Suppose that we are given f(x) = x^3 + x^2 and learning rate \alpha = 1/4.

Write down the updating rule for gradient descent in general, then write down the updating rule for gradient descent for the function f(x).

If we start at x_0 = -1, should we go left or right? Can you verify this mathematically? What is x_1? Can gradient descent converge? If so, where might it converge, given an appropriate step size?

We have f'(x_0) = f'(-1) = 3(-1)^2 + 2(-1) = 1 > 0, so we go **left**, and x_1 = x_0 - \alpha f'(x_0) = -1 - \frac{1}{4} \cdot 1 = -\frac{5}{4}. Intuitively, gradient descent **cannot converge in this case** because \lim_{x \rightarrow -\infty} f(x) = -\infty.

We need to find all local minimums and local maximums. First, we solve the equation f'(x) = 0 to find all critical points.

We have: f'(x) = 0 \Leftrightarrow 3x^2 + 2x = 0 \Leftrightarrow x = -\frac{2}{3} \ \ \text{and} \ \ x = 0.

Now, we consider the second-order derivative: f''(x) = \frac{d^2f}{dx^2} = 6x + 2.

We have f''(x) = 0 only when x = -1/3. Thus, for x < -1/3, f''(x) is negative, so the slope f'(x) is decreasing; and for x > -1/3, f''(x) is positive, so the slope f'(x) is increasing. Keep in mind that -1 < -2/3 < -1/3 < 0 < 1.

Therefore, f has a local maximum at x = -2/3 and a local minimum at x = 0. If gradient descent starts at x_0 = -1, it always moves left, so it never reaches the local minimum at x = 0; instead, it goes left indefinitely. We say that gradient descent does not converge, or diverges.
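The divergence can be illustrated with a short simulation (my own sketch, not part of the exam): iterating the update x_{t+1} = x_t - \alpha f'(x_t) from x_0 = -1 sends the iterates left faster and faster.

```python
# Gradient descent on f(x) = x^3 + x^2 from x_0 = -1 with alpha = 1/4.
# The iterates run off toward -infinity: gradient descent diverges here.

def f_prime(x):
    return 3 * x**2 + 2 * x

alpha = 1 / 4
x = -1.0
for t in range(5):
    x = x - alpha * f_prime(x)
    print(f"x_{t + 1} = {x:.4f}")
# The iterates are roughly -1.25, -1.80, -3.32, -9.93, -78.87.
```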

If we start at x_0 = 1, should we go left or right? Can you verify this mathematically? What is x_1? Can gradient descent converge? If so, where might it converge, given an appropriate step size?

We have f'(x_0) = f'(1) = 3 \cdot 1^2 + 2 \cdot 1 = 5 > 0, so we go **left**, and x_1 = x_0 - \alpha f'(x_0) = 1 - \frac{1}{4} \cdot 5 = -\frac{1}{4}.

From the previous part, the function f has a local minimum at x = 0, so gradient descent **can converge** (given an appropriate step size) to this local minimum.
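A similar simulation (again my own sketch) shows the iterates shrinking toward the local minimum at x = 0 when we start from x_0 = 1:

```python
# Gradient descent on f(x) = x^3 + x^2 from x_0 = 1 with alpha = 1/4.
# The iterates approach the local minimum at x = 0.

def f_prime(x):
    return 3 * x**2 + 2 * x

alpha = 1 / 4
x = 1.0
for _ in range(100):
    x = x - alpha * f_prime(x)
print(abs(x) < 1e-9)  # True: the iterates have (nearly) converged to 0
```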

Write down 1 condition to terminate the gradient descent algorithm (in general).

There are several ways to terminate the gradient descent algorithm; for example:

- If the change in the optimization objective is too small, i.e., |f(x_i) - f(x_{i + 1})| < \epsilon, where \epsilon is a small constant;
- If the gradient is close to zero, i.e., the norm of the gradient is very small: \|\nabla f(x_i)\| < \lambda, where \lambda is a small constant.
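Both stopping rules can be combined in a generic loop. The sketch below is hypothetical (the function and parameter names are my own, not from the course code):

```python
# Generic gradient descent with two stopping criteria: a small change in
# the objective value, or a small gradient magnitude.

def gradient_descent(f, f_prime, x0, alpha, eps=1e-8, lam=1e-8, max_iter=10_000):
    x = x0
    for _ in range(max_iter):
        x_new = x - alpha * f_prime(x)
        # Stop if the objective barely changed or the gradient is near zero.
        if abs(f(x) - f(x_new)) < eps or abs(f_prime(x_new)) < lam:
            return x_new
        x = x_new
    return x

# Minimizing f(x) = x^3 + x^2 from x_0 = 1 lands near the local minimum x = 0.
x_star = gradient_descent(lambda x: x**3 + x**2,
                          lambda x: 3 * x**2 + 2 * x,
                          x0=1.0, alpha=0.25)
```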

*Source: Fall 2022
Midterm, Problem 2*

Calculate w_1^* and w_0^* of the linear regression line for the following dataset, \mathcal{D}. You may use the fill-in-the-blanks and the table below to organize your work.

\mathcal{D} = \{(0, 0), (4, 2), (5, 1)\}.

\bar x = \phantom{\hspace{.5in}} \bar y = \phantom{\hspace{.5in}}

x_i | y_i | (x_i - \bar x) | (y_i - \bar y) | (x_i - \bar x)(y_i - \bar y) | (x_i - \bar x)^2
---|---|---|---|---|---
0 | 0 | | | |
4 | 2 | | | |
5 | 1 | | | |

w_1^* =

w_0^* =

Finally, what can we say about the correlation r and the slope for this dataset (mathematically)?

\bar{x} = \frac{1}{3}(0 + 4 + 5) = 3
\quad \quad\bar{y} = \frac{1}{3}(0 + 2 + 1) =
1

x_i | y_i | (x_i - \bar x) | (y_i - \bar y) | (x_i - \bar x)(y_i - \bar y) | (x_i - \bar x)^2
---|---|---|---|---|---
0 | 0 | -3 | -1 | 3 | 9
4 | 2 | 1 | 1 | 1 | 1
5 | 1 | 2 | 0 | 0 | 4

w_1^* = \displaystyle
\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i -
\bar{x})^2}
= \frac{3 + 1 + 0}{9 + 1 + 4} = \frac{4}{14} = \frac{2}{7}

w_0^* = \bar{y} - w_1^* \bar{x} = 1 - \frac{2}{7} \cdot 3 = 1 - \frac{6}{7} = \frac{1}{7}

Because w_1^* = r
\frac{\sigma_y}{\sigma_x} where the standard deviations \sigma_y and \sigma_x are non-negative, and w_1^* = 2/7 > 0, **the correlation
r is positive and the slope is
positive**.
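As a quick sanity check (my own sketch, not part of the exam), the least-squares formulas can be evaluated numerically on \mathcal{D}:

```python
# Verifying w_1^* = 2/7 and w_0^* = 1/7 on D = {(0, 0), (4, 2), (5, 1)}.
xs = [0, 4, 5]
ys = [0, 2, 1]
n = len(xs)

x_bar = sum(xs) / n   # 3
y_bar = sum(ys) / n   # 1

w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)   # 4/14 = 2/7
w0 = y_bar - w1 * x_bar                    # 1 - 6/7 = 1/7
```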

*Source: Fall 2022
Midterm, Problem 3*

Mahdi runs a local pastry shop near UCSD and sells a traditional dessert called Baklava. He bakes Baklavas every morning to keep his promise of providing fresh Baklavas to his customers daily. Here is the amount of Baklava he sold each day during the last week, in pounds (lb): y_1=100, y_2=110, y_3=75, y_4=90, y_5=105, y_6=90, y_7=25

Mahdi needs your help as a data scientist to suggest __the best
constant prediction (h^*) of daily
sales that minimizes the empirical risk using L(h,y) as the loss function__. Answer the
following questions and give a **brief justification** for
each part. **This problem has many parts, if you get stuck, move
on and come back later!**

Let L(h,y)=|y-h|. What is h^*? (We’ll later refer to this prediction as h_1^*).

The constant prediction that minimizes mean absolute error is the median. Sorted, the sales are 25, 75, 90, 90, 100, 105, 110, so h_1^* = 90.

Let L(h,y)=(y-h)^2. What is h^*? (We’ll later refer to this prediction as h_2^*).

The constant prediction that minimizes mean squared error is the mean: h_2^* = \frac{1}{7}(100+110+75+90+105+90+25) = \frac{595}{7} = 85.

**True** or **False**: Removing y_1 and y_3
from the dataset does not change h_2^*.

True

False

False. It changes the mean from 85 to 84. (However, the median is not changed.)
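This can be checked quickly (a sketch using Python's `statistics` module):

```python
# Removing y_1 = 100 and y_3 = 75 changes the mean but not the median.
from statistics import mean, median

sales = [100, 110, 75, 90, 105, 90, 25]   # y_1, ..., y_7
reduced = [110, 90, 105, 90, 25]          # y_1 and y_3 removed

print(mean(sales), mean(reduced))      # 85 vs 84 -> h_2^* changes
print(median(sales), median(reduced))  # 90 vs 90 -> h_1^* does not
```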

Mahdi thinks that y_7 is an outlier. Hence, he asks you to remove y_7 and update your predictions in parts (a) and (b) accordingly. Without calculating the new predictions, can you justify which prediction changes *more*? h^*_1 or h_2^*?

h_2^* changes more. The mean is sensitive to outliers and is pulled toward y_7 = 25, while the median is robust to outliers, so removing y_7 moves h_2^* more than h_1^*.

**True** or **False**: Let L(y,h)=|y-h|^3. You can use the Gradient
descent algorithm to find h^*.

True

False

True. The loss |y-h|^3 is differentiable with respect to h everywhere, including at h = y (where the derivative is 0), so gradient descent can be applied to minimize the empirical risk.

**True** or **False**: Let L(y,h)=\sin(y-h). The Gradient descent
algorithm is guaranteed to converge, provided that a proper learning
rate is given.

True

False

False. The function is not convex, so the gradient descent algorithm is not guaranteed to converge.

Mahdi has noticed that Baklava daily sale is associated with weather
temperature. So he asks you to incorporate this feature to get a better
prediction. Suppose the last week’s daily temperatures are x_1, x_2, \cdots, x_7 in Fahrenheit (F). We
know that \bar x=65, \sigma_x=8.5 and the best linear prediction
that __minimizes the mean squared error__ is H^*(x)=-3x+w_0^*.

What is the correlation coefficient (r) between x and y? What does that mean?

r=-0.95. This means the weather temperature and Baklava sales are strongly negatively correlated: higher temperatures are associated with lower sales.

We know w_1^* = \frac{\sigma_y}{\sigma_x}r. We know that \sigma_x=8.5 and w_1^*=-3. We can find \sigma_y as follows:

\begin{aligned} \sigma_y^2 =& \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2\\ =& \frac{1}{7}[(100-85)^2+(110-85)^2+(75-85)^2+(90-85)^2+(105-85)^2+(90-85)^2+(25-85)^2]\\ =&\frac{1}{7}[15^2+25^2+10^2+5^2+20^2+5^2+60^2]=714.28 \end{aligned}

Then \sigma_y=26.7, which gives r = \frac{w_1^* \sigma_x}{\sigma_y} = \frac{-3 \cdot 8.5}{26.7} \approx -0.95.
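The computation of \sigma_y and r can be reproduced numerically (my own sketch; \sigma_y uses the population convention with a 1/n factor, matching the formula above):

```python
# Recomputing sigma_y and r = w_1^* * sigma_x / sigma_y from the weekly sales.
sales = [100, 110, 75, 90, 105, 90, 25]
n = len(sales)

y_bar = sum(sales) / n                              # 85.0
var_y = sum((y - y_bar) ** 2 for y in sales) / n    # 5000/7, about 714.29
sigma_y = var_y ** 0.5                              # about 26.7

sigma_x, w1 = 8.5, -3
r = w1 * sigma_x / sigma_y                          # about -0.95
```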

**True** or **False**: The unit of r is \frac{lb}{F} (Pound per Fahrenheit).

True

False

False. The correlation coefficient has no unit. (It is always a unitless number in [-1,1] range.)

Find w^*_0. *(Hint: You’ll need
to find \bar y for the given
dataset)*

w_0^*=280

Note that the regression line passes through the point (\bar x, \bar y), i.e., H^*(\bar x)=\bar y. Therefore, H^*(65)=-3\cdot 65 +w_0^*=85 \implies w_0^*=280.
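As a one-line check (a sketch): since the least-squares line passes through (\bar x, \bar y), solving 85 = -3 \cdot 65 + w_0^* gives the intercept directly.

```python
# The least-squares line passes through (x_bar, y_bar), so
# w_0 = y_bar - w_1 * x_bar.
x_bar, y_bar, w1 = 65, 85, -3
w0 = y_bar - w1 * x_bar
print(w0)  # 280
```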

What would the best linear prediction H^*(x) be if we multiply all x_i’s by 2?

H^*(x) = -1.5x + 280

The standard deviation scales by a factor of 2, i.e., \sigma_x'=2\cdot \sigma_x.

The same
is true for the mean, i.e., \bar{x}'=2
\cdot \bar{x}.

The correlation r, standard deviation of the y-values \sigma_y, and the mean of the y-values \bar y do not change.

(You can verify
these claims by plugging 2x in for
x in their respective formulas and
seeing what happens, but it’s faster to *visually* reason why
this happens.)

Therefore, w_1'^*=\frac{\sigma_y'}{\sigma_x'}r' = \frac{(\sigma_y)}{(2\cdot\sigma_x)}(r) = \frac{w_1^*}{2} = -1.5.

We can find w_0'^* as follows:

\begin{align*} \bar{y}'&=H(\bar{x}')\\&=\frac{w_1^*}{2}(2\bar{x})+w_0'^*\\&=w_1^*\bar{x}+w_0'^* \\ &\downarrow \\ (85) &= -3(65) + w_0'^* \\ w_0'^*&=280 \end{align*}

So, H^*(x) would be -1.5x + 280.
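The scaling argument can be checked using only the summary statistics (my own sketch; no individual x_i values are needed):

```python
# After x -> 2x: sigma_x and x_bar double; r, sigma_y, and y_bar are unchanged.
sigma_x, x_bar, y_bar = 8.5, 65, 85
sigma_y = (5000 / 7) ** 0.5          # from the earlier problem part
r = -3 * sigma_x / sigma_y           # since w_1^* = r * sigma_y / sigma_x = -3

w1_new = r * sigma_y / (2 * sigma_x)   # -1.5
w0_new = y_bar - w1_new * (2 * x_bar)  # 85 + 1.5 * 130 = 280
```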

What would the best linear prediction H^*(x) be if we add 20 to all x_i’s?

H^*(x) = -3x + 340

All parameters remain unchanged except \bar{x}'=\bar{x}+20. Since r, \sigma_x, and \sigma_y are not changed, w_1^* does not change. Then one can find w_0'^* as follows:

\begin{align*} \bar{y}'&=H(\bar{x}') \\ &\downarrow \\ (85) &=-3(65+20)+w_0'^* \\ w_0'^*&=340 \end{align*}

So, H^*(x) would be -3x + 340.

*Source: Fall 2022
Midterm, Problem 4*

Consider a dataset that consists of y_1, \cdots, y_n. In class, we used calculus to minimize mean squared error, R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2. In this problem, we want you to apply the same approach to a slightly different loss function defined below: L_{\text{midterm}}(y,h)=(\alpha y - h)^2+\lambda h

Write down the empirical risk R_{\text{midterm}}(h) by using the above loss function.

R_{\text{midterm}}(h) = \frac{1}{n} \sum_{i = 1}^n \left[ (\alpha y_i - h)^2 + \lambda h \right] = \frac{1}{n} \sum_{i = 1}^n (\alpha y_i - h)^2 + \lambda h

The mean of dataset is \bar{y}, i.e. \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i. Find h^* that minimizes R_{\text{midterm}}(h) using calculus. Your result should be in terms of \bar{y}, \alpha and \lambda.

h^*=\alpha \bar{y} - \frac{\lambda}{2}

\begin{align*} \frac{d}{dh}R_{\text{midterm}}(h)&= [\frac{2}{n}\sum_{i=1}^{n}(h- \alpha y_i )] +\lambda \\ &=2 h-2\alpha \bar{y} + \lambda. \end{align*}

By setting \frac{d}{dh}R_{\text{midterm}}(h)=0 we get 2 h^*-2\alpha \bar{y} + \lambda=0 \Rightarrow h^*=\alpha \bar{y} - \frac{\lambda}{2}.
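The closed form can be sanity-checked numerically (my own sketch; the values of \alpha, \lambda, and the sample dataset below are arbitrary test choices, not from the exam):

```python
# For arbitrary test values, h^* = alpha * y_bar - lam / 2 should minimize
# R_midterm(h) = (1/n) * sum_i [(alpha * y_i - h)^2 + lam * h].
ys = [3, 7, 8, 12]       # arbitrary sample data
alpha, lam = 0.8, 3.0    # arbitrary test parameters
n = len(ys)

def risk(h):
    return sum((alpha * y - h) ** 2 + lam * h for y in ys) / n

y_bar = sum(ys) / n                 # 7.5
h_star = alpha * y_bar - lam / 2    # 0.8 * 7.5 - 1.5 = 4.5

# The closed-form minimizer beats nearby candidates:
print(risk(h_star) <= min(risk(h_star - 0.1), risk(h_star + 0.1)))  # True
```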