← return to practice.dsc40a.com

**Instructor(s):** Aobo Li

This exam was administered in person. Students had **50
minutes** to take this exam.

*Source: Winter 2024 Final
Part 1, Problem 1*

Suppose there is a dataset containing 10000 integers:

- 2500 of them are 3s
- 2500 of them are 5s
- 4500 of them are 7s
- 500 of them are 9s.

Calculate the median of this dataset.

6

We know there is an even number of integers in this dataset because 10000 \% 2 = 0. We can find the middle of the dataset as follows: \frac{10000}{2} = 5000. This means the element in the 5000th position and 5001st position can give us our median. The element at the 5000th position is a 5 because 2500 + 2500 = 5000. The element at the 5001st position is a 7 because the next number after 5 is 7. We can then plug 5 and 7 into the equation: \frac{x_{5000} + x_{5001}}{2} = \frac{5 + 7}{2} = 6

How does the mean of this dataset compared to its median?

The mean is larger than the median

The mean is smaller than the median

The mean and the median are equal

The mean is smaller than the median.

We can calculate the mean as follows: \frac{2500 \cdot 3 + 2500 \cdot 5 + 4500 \cdot 7 + 500 \cdot 9}{10000} = 5.6 Using part (a) we know that 5.6 < 6, which means the mean is smaller than the median.

*Source: Winter 2024 Final
Part 1, Problem 2*

You and a friend independently perform gradient descent on the same
function, but after 200 iterations, you
have different results. Which of the following is sufficient **on
its own** to explain the difference in your results?
**Note:** When we say “same function” we assume the
learning rate and initial predictions are the same too until said
otherwise.

**Select all that apply.**

The function is nonconvex.

The function is not differentiable.

You and your friend chose different learning rates.

You and your friend chose the same initial predictions.

None of the above.

Bubbles 3: “You and your friend chose different learning rates.”

If the function is nonconvex and you and your friend have the same initial prediction and learning rate you should end up in the same location local or global minimum.

If the function is not differentiable then you cannot perform gradient descent, so this cannot be an answer.

If you and your friend chose different learning rates it is possible to have different results because if you have a really large learning rate you might be hopping over the global minimum without properly converging. Your friend could choose a smaller learning rate, which will allow you to converge to the global minimum.

If you and your friend chose the same initial predictions you are guaranteed to end up in the same spot.

Because two of the option choices are possible the answer cannot be “None of the above.”

*Source: Winter 2024 Final
Part 1, Problem 3*

Suppose there is a dataset contains four values: -2, -1, 2, 4. You would like to use gradient descent to minimize the mean square error over this dataset.

Write down the expression of mean square error and its derivative given this dataset.

R_{sq}(h) = \dfrac{1}{4}\sum_{i=1}^{4}(y_i-h)^2

Recall the equation for R_{sq}(h) = \frac{1}{n}\sum_{i=1}^{n}(y_i-h)^2, so we simply need to replace n with 4 because there are 4 elements in our dataset.

\frac{dR_{sq}(h)}{dh} = \frac{1}{2}\sum_{i=1}^{4}(h-y_i)

Since we have the equation for R_{sq}(h) we can calculate the derivative:

\begin{align*} \frac{dR_{sq}(h)}{dh} &= \frac{dR_{sq}(h)}{dh}(\frac{1}{4}\sum_{i=1}^{4}(y_i-h)^2) \\ &= \frac{1}{4}\sum_{i=1}^{4}\frac{dR_{sq}(h)}{dh}((y_i-h)^2) \\ \end{align*} We can use the chain rule to find the derivative of (y_i-h)^2. Recall the chain rule is: \frac{df(x)}{dx}[(f(x))^n] = n(f(x))^{n-1} \cdot f'(x).

\begin{align*} &= \frac{1}{4}\sum_{i=1}^{4}2(y_i-h) \cdot -1 \\ &= \frac{1}{4}\sum_{i=1}^{4} 2(h - y_i) \\ &= \frac{1}{2}\sum_{i=1}^{4} (h-y_i) \end{align*}

Suppose you choose the initial position to be h_0 and the learning rate to be \frac{1}{4}. After two gradient descent steps, h_2=\frac{1}{4}. What is the value of h_0?

h_{0} = -\frac{5}{4}

The gradient descent equation is given by: \begin{aligned} h_i = h_{i-1} - \alpha \frac{dR_{sq}(h_{i-1})}{dh} \end{aligned} Recall the learning rate is equal to \alpha. We can plug in \alpha = \frac{1}{4}, h_2 = \frac{1}{4}, and i=2, in that case we obtain: \begin{aligned} \frac{1}{4} &= h_{1} - \alpha \frac{dR_{sq}(h_{1})}{dh}\\ &= h_{1} - \alpha \frac{1}{2}\sum_{i=1}^{4}(h_1-y_i)\\ &= h_{1} - \frac{1}{8}(h_{1} + 2 + h_{1} + 1 + h_{1} - 2 + h_{1} - 4)\\ &= h_{1} - \frac{1}{8}(4h_{1} -3)\\ \end{aligned} Solving this equation, we obtain that h_{1} = -\frac{1}{4}. We can then repeat this step once more to obtain h_0: \begin{aligned} -\frac{1}{4} &= h_{0} - \alpha \frac{dR_{sq}(h_{0})}{dh}\\ &= h_{0} - \frac{1}{4}(h_{0} + 2 + h_{0} + 1 + h_{0} - 2 + h_{0} - 4)\\ &= h_{0} - \frac{1}{8}(4h_{0} -3)\\ \end{aligned} Solving this equation, we obtain that h_{0} = -\frac{5}{4}.

Given that we set the tolerance of gradient descent to be 0.1. How many **additional
steps** beyond h_2 do we need to
take to reach convergence?

0

1

2

3

4

It will never converge

2 or 3 additional steps (both are correct).

The gradient descent equation is given by: \begin{aligned} h_i = h_{i-1} - \alpha \frac{dR_{sq}(h_{i-1})}{dh} \end{aligned} Now we start from \alpha = \frac{1}{4}, h_2 = \frac{1}{4}, and i=2, in that case we obtain: \begin{aligned} h_{3} &= h_{2} - \alpha \frac{dR_{sq}(h_{2})}{dh}\\ &= h_{2} - \frac{1}{8}(4h_{2} -3)\\ &= \frac{1}{4} - \frac{1}{8}(1 -3)\\ &= \frac{1}{2}\\ \end{aligned} Iteratively, we have: \begin{aligned} h_{4} &= h_{3} - \alpha \frac{dR_{sq}(h_{3})}{dh}\\ &= h_{3} - \frac{1}{8}(4h_{3} -3)\\ &= \frac{1}{2} - \frac{1}{8}(2 -3)\\ &= \frac{5}{8}\\ \end{aligned} \begin{aligned} h_{5} &= h_{4} - \alpha \frac{dR_{sq}(h_{4})}{dh}\\ &= h_{3} - \frac{1}{8}(4h_{4} -3)\\ &= \frac{5}{8} - \frac{1}{8}(\frac{5}{2}-3)\\ &= \frac{11}{16}\\ \end{aligned} from h_4 to h_5, the change is smaller than the tolerance. That means we need additional 2 or 3 steps to reach convergence (depending on if you actually perform h_4 to h_5, so both 2 and 3 are considered correct answer).

*Source: Winter 2024 Final
Part 1, Problem 4*

Albert collected 400 data points from a radiation detector. Each data point contains 3 features: feature A, feature B and feature C. The true particle energy E is also reported. Albert wants to design a linear regression algorithm to predict the energy E of each particle, given a combination of one or more of feature A, B, and C. As the first step, Albert calculated the correlation coefficients among A, B, C and E. He wrote it down in the following table, where each cell of the table represents the correlaton of two terms:

A | B | C | E | |
---|---|---|---|---|

A | 1 | -0.99 | 0.13 | 0.8 |

B | -0.99 | 1 | 0.25 | -0.95 |

C | 0.13 | 0.25 | 1 | 0.72 |

E | 0.8 | -0.95 | 0.72 | 1 |

Albert wants to start with a simple model: fitting only a single feature to obtain the true energy (i.e. y = w_0+w_1 x). Which feature should he choose as x to get the lowest mean square error?

A

B

C

B

B is the correct answer, because it has the highest absolute correlation (0.95), the negative sign in front of B just means it is negatively correlated to energy, and it can be compensated by a negative sign in the weight.

Albert wants to add another feature to his linear regression in part (a) to further boost the model’s performance. (i.e. y = w_0 + w_1 x + w_2 x_2) Which feature should he choose as x_2 to make additional improvements?

A

B

C

C

C is the correct answer, because although A has a higher correlation with energy, it also has an extremely high correlation with B (-0.99), that means adding A into the fit will not be too useful, since it provides almost the same information as B.

Albert further refines his algorithm by fitting a prediction rule of the form: \begin{aligned} H(A,B,C) = w_0 + w_1 \cdot A\cdot C + w_2 \cdot B^{C-7} \end{aligned}

Given this prediction rule, what are the dimensions of the design matrix X?

\begin{bmatrix} & & & \\ & & & \\ & & & \\ \end{bmatrix}_{r \times c}

So, what are r and c in r \text{ rows} \times c \text{ columns}?

400 \text{ rows} \times 3 \text{ columns}

Recall there are 400 data points, which means there will be 400 rows. There will be 3 columns; one is the bias column of all 1s, one is for the feature A\cdot C, and one is for the feature B^{C-7}.