
This page contains all problems about Summary Statistics and the Constant Model.

*Source: Fall 2021
Midterm, Problem 1*

King Triton just made an Instagram account and has been keeping track of the number of likes his posts have received so far.

His first 7 posts have received a mean of 16 likes; the specific like counts in sorted order are

8, 12, 12, 15, 18, 20, 27

King Triton wants to predict the number of likes his next post will
receive, using a constant prediction rule h. For each loss function L(h, y), determine the constant prediction
h^* that minimizes empirical risk. If
you believe there are multiple minimizers, specify them all. If you
believe you need more information to answer the question or that there
is no minimizer, state that clearly. **Give a brief justification
for each answer.**

L(h, y) = |y - h|

L(h, y) = (y - h)^2

L(h, y) = 4(y - h)^2

This is squared loss, multiplied by a constant. Note that when we go to minimize empirical risk for this loss function, we will take the derivative of empirical risk and set it equal to 0; at that point the constant factor of 4 can be divided from both sides, so this problem boils down to minimizing ordinary mean squared error. The only difference is that the graph of mean squared error will be stretched vertically by a factor of 4; the minimizing value will be in the same place.

For more justification, here we consider any general re-scaling \alpha (y-h)^2:

\begin{aligned} R_{sq}(h) &= \frac{1}{n} \sum_{i = 1}^n \alpha (y_i - h)^2 \\ &= \alpha \cdot \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2 \\ \frac{d}{dh} R_{sq}(h) &= \alpha \cdot \frac{1}{n} \sum_{i = 1}^n 2(y_i - h)(-1) = 0\\ &\implies -\frac{2\alpha}{n}\sum_{i = 1}^n (y_i - h) = 0 \\ &\implies \sum_{i = 1}^n (y_i - h) = 0 \\ &\implies h^* = \frac{1}{n} \sum_{i = 1}^n y_i \end{aligned}
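As a numerical sanity check of the derivation above, the sketch below evaluates the empirical risk for \alpha (y - h)^2 on King Triton's like counts over a grid of candidate predictions. The helper name `scaled_sq_risk` and the grid spacing are illustrative choices, not part of the original solution.

```python
likes = [8, 12, 12, 15, 18, 20, 27]

def scaled_sq_risk(h, ys, alpha):
    """Empirical risk for the loss alpha * (y - h)^2."""
    return sum(alpha * (y - h) ** 2 for y in ys) / len(ys)

# Candidate predictions 0.0, 0.1, ..., 40.0 (an arbitrary illustrative grid).
grid = [i / 10 for i in range(401)]

best_plain = min(grid, key=lambda h: scaled_sq_risk(h, likes, 1))
best_scaled = min(grid, key=lambda h: scaled_sq_risk(h, likes, 4))
mean_likes = sum(likes) / len(likes)

print(best_plain, best_scaled, mean_likes)
```

Both grid searches land on the mean of the data, which is what the calculus argument predicts for any positive \alpha.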

L(h, y) = \begin{cases} 0 & h = y \\ 100 & h \neq y \end{cases}

L(h, y) = (3y - 4h)^2

L(h, y) = (y - h)^3

*Hint: Do not spend too long on this subpart.*

No minimizer.

Note that unlike |y - h|, (y - h)^2, and all of the other loss functions we’ve seen, (y - h)^3 is not bounded below by 0: it tends towards -\infty as h \to \infty. This means that there is no h that minimizes \frac{1}{n} \sum_{i = 1}^n (y_i - h)^3; the larger we make h, the more negative (and hence “smaller”) this empirical risk becomes.
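To see this behavior concretely, the sketch below evaluates the cubic empirical risk at a few increasingly large predictions. The function name `cubic_risk` and the chosen values of h are illustrative assumptions.

```python
likes = [8, 12, 12, 15, 18, 20, 27]

def cubic_risk(h, ys):
    """Empirical risk for the loss (y - h)^3."""
    return sum((y - h) ** 3 for y in ys) / len(ys)

# The risk keeps dropping as h grows, so no finite h is a minimizer.
candidates = (0, 100, 1000, 10000)
risks = [cubic_risk(h, likes) for h in candidates]

for h, risk in zip(candidates, risks):
    print(h, risk)
```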

*Source: Fall 2021
Midterm, Problem 2*

Consider a set of 23 data points y_1, y_2, y_3, ..., y_{23} such that y_1 < y_2 < ... < y_{23}. Let’s call this Dataset A.

We create a new dataset, Dataset B, by repeating each point in Dataset A once. That is, Dataset B is the set of 46 points y_1, y_1, y_2, y_2, ..., y_{23}, y_{23}.

Answer the following questions regarding the relationship between
Dataset A and Dataset B. **Justify your answers.**

Suppose the minimizer of mean absolute error R_{abs}(h) for Dataset A is 5. What is the minimizer of mean absolute error for Dataset B? If you believe there are multiple minimizers, specify them all. If you believe you need more information to answer the question, state that clearly.

The minimizer of MAE for Dataset B is still 5.

Note that when we repeat each data point, we go from having an odd number of data points (23) to an even number (46). With an even number of points, every value between the middle two values minimizes MAE. But the middle two values of Dataset B are both y_{12}, so there are no numbers “in between” them; only y_{12} = 5 minimizes MAE.

Suppose the *mean absolute deviation from the median* for
Dataset A is 17. What is the *mean absolute deviation from the
median* for Dataset B? If you believe you need more information to
answer the question, state that clearly.

The mean absolute deviation from the median for Dataset B is still 17.

As we saw in the previous part, the median itself does not change. When adding together the deviations from the median, each point now appears twice, so the sum of all deviations from the median is doubled. However, there are twice as many data points in Dataset B as there are in Dataset A, so we divide by 2n (46) instead of n (23) in our average.

In short, for Dataset B both the numerator and denominator in the calculation of mean absolute deviation from the median \frac{\sum_{i = 1}^n |y_i - \text{Median}(y)|}{n} are double what they were for Dataset A, so the end result is the same as for Dataset A.
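The two claims so far can be checked on a concrete example. The sketch below builds a hypothetical Dataset A of 23 increasing integers whose median is 5 (the specific values are an assumption chosen only for illustration, so the deviation is not 17 here), duplicates every point to form Dataset B, and compares the medians and the mean absolute deviations from the median using exact fractions.

```python
from fractions import Fraction
from statistics import median

# Hypothetical Dataset A: 23 strictly increasing values with median 5.
dataset_a = [3 * i - 31 for i in range(1, 24)]
# Dataset B repeats each point of Dataset A once.
dataset_b = sorted(dataset_a * 2)

def mad_from_median(ys):
    """Mean absolute deviation from the median, as an exact fraction."""
    m = median(ys)
    return Fraction(sum(abs(y - m) for y in ys)) / len(ys)

print(median(dataset_a), median(dataset_b))
print(mad_from_median(dataset_a), mad_from_median(dataset_b))
```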

Suppose the function R_A(h) represents mean absolute error for Dataset A and R_B(h) represents mean absolute error for Dataset B.

Is it true that R_A(h) = R_B(h) for any real number h? (In other words, are the graphs of R_A(h) and R_B(h) identical?) Explain your reasoning.

Yes, R_{A}(h) = R_{B}(h).

Recall that the definition of R_{abs}(h) is:

R_{abs}(h) = \frac{1}{n} \sum_{i = 1}^n |y_i - h|

For the first dataset, we have:

R_{A}(h) = \frac{1}{23} \sum_{i = 1}^{23} |y_i - h|

and for the second dataset, we have:

\begin{aligned} R_{B}(h) &= \frac{1}{46} \left( |y_1 - h| + |y_1 - h| + |y_2 - h| + |y_2 - h| + ... + |y_{23} - h| + |y_{23} - h| \right) \\ &= \frac{1}{46} \left( 2 \sum_{i = 1}^{23} |y_i - h| \right) \\ &= \frac{1}{23} \sum_{i = 1}^{23} |y_i - h| \\ &= R_{A}(h) \end{aligned}
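The identity can also be checked numerically. The sketch below uses a hypothetical Dataset A (the values are an assumption for illustration) and compares the two risk functions exactly, using fractions, over a range of integer predictions.

```python
from fractions import Fraction

# Hypothetical Dataset A (23 increasing values) and its duplicated version.
dataset_a = [3 * i - 31 for i in range(1, 24)]
dataset_b = sorted(dataset_a * 2)

def mean_abs_error(h, ys):
    """Mean absolute error of the constant prediction h, as an exact fraction."""
    return Fraction(sum(abs(y - h) for y in ys), len(ys))

# Compare R_A(h) and R_B(h) at every integer h from -40 to 50.
identical = all(
    mean_abs_error(h, dataset_a) == mean_abs_error(h, dataset_b)
    for h in range(-40, 51)
)
print(identical)
```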

*Source: Fall 2021 Final
Exam, Problem 1*

The mean of 12 non-negative numbers is 45. Suppose we remove 2 of these numbers. What is the largest possible value of the mean of the remaining 10 numbers? Show your work.

54.

To maximize the mean of the remaining 10 numbers, we want to minimize the numbers that are removed. The smallest possible non-negative number is 0, so to maximize the mean of the remaining 10, we should remove two 0s from the set of numbers. Recall that the sum of the 12 number set is 12 \cdot 45; then, the maximum possible mean of the remaining 10 is

\frac{12 \cdot 45 - 2 \cdot 0}{10} = \frac{6}{5} \cdot 45 = 54

*Source: Fall 2021 Final
Exam, Problem 2*

Consider a set of 8 data points, y_1, y_2, ..., y_8 that are in sorted order, i.e. y_1 < y_2 < ... < y_8. Suppose that y_4 = 10, y_5 = 14, and y_6 = 22. Recall that mean absolute error, R_{abs}(h), is defined as R_{abs}(h) = \frac{1}{n} \sum_{i = 1}^n |y_i - h|. Suppose that R_{abs}(11) = 9.

What is R_{abs}(22)? Show your work.

*Hint: Use the formula for the slope of R at h.*

R_{abs}(22) = 11

We can write the points given to us as:

y_1, y_2, y_3, 10, 14, 22, y_7, y_8

Since there are an even number of data points, all values of h between the middle two points minimize R_{abs}(h). In this case, all values of h in the interval [10, 14] minimize R_{abs}(h) and, as a result, have the same value of R_{abs}(h). Thus, R_{abs}(14) = R_{abs}(11) = 9.

From lecture, we know that:

\text{slope of $R$ at $h$} = \frac{1}{n} \left( \left(\text{\# points } < h\right) - \left(\text{\# points } > h\right)\right)

We can use this formula to determine what to add to R_{abs}(14) to get R_{abs}(22). For any h in the interval (14, 22), 5 points are less than h and 3 points are greater than h, so the slope of R at h is \frac{1}{8} (5 - 3) = \frac{1}{4}. This means that for every 1 unit we move to the right from h = 14 until we get to h = 22, R_{abs}(h) increases by \frac{1}{4}. So,

R_{abs}(22) = R_{abs}(14) + (22 - 14) \cdot \frac{1}{4} = 9 + 2 = 11
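One way to double-check this reasoning is to build a concrete dataset that satisfies every given condition. In the sketch below, the values of y_1, y_2, y_3, y_7, and y_8 are hypothetical; they were picked only so that R_{abs}(11) = 9 holds.

```python
# y_4 = 10, y_5 = 14, y_6 = 22 come from the problem; the rest are assumed.
data = [-9, 1, 9, 10, 14, 22, 23, 24]

def mean_abs_error(h, ys):
    """Mean absolute error of the constant prediction h."""
    return sum(abs(y - h) for y in ys) / len(ys)

r_11 = mean_abs_error(11, data)
r_14 = mean_abs_error(14, data)
r_22 = mean_abs_error(22, data)
print(r_11, r_14, r_22)
```

For this dataset the risk is flat between 10 and 14 and then rises with slope \frac{1}{4}, matching the argument above.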

You can visualize the solution to this problem with this interactive graph.

*Source: Fall 2021 Final
Exam, Problem 4*

You may find the following properties of logarithms helpful in this question. Assume that all logarithms in this question are natural logarithms, i.e. of base e.

- e^{\log(x)} = x
- \log(a) + \log(b) = \log(a \cdot b)
- \log(a) - \log(b) = \log \left( \frac{a}{b} \right)
- \log(a^c) = c \log (a)
- \frac{d}{dx} \log x = \frac{1}{x}

Billy, the avocado-farmer-turned-waiter-turned-Instagram-influencer that you’re all-too-familiar with, is trying his hand at coming up with loss functions. He comes up with the Billy loss, L_B(h, y), defined as follows:

L_B(h, y) = \left[ \log \left( \frac{y}{h} \right) \right]^2

Throughout this problem, assume that all ys are positive.

Show that: \frac{d}{dh} L_B(h, y) = - \frac{2}{h} \log \left( \frac{y}{h} \right)

Show that the constant prediction h^* that minimizes empirical risk for Billy loss is:

h^* = \left(y_1 \cdot y_2 \cdot ... \cdot y_n \right)^{\frac{1}{n}}

You do not need to perform a second derivative test, but otherwise you must show your work.

*Hint: To confirm that you’re interpreting the result correctly,
h^* for the dataset 3, 5, 16 is (3 \cdot 5 \cdot 16)^{\frac{1}{3}} =
240^{\frac{1}{3}} \approx 6.214.*
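The hint can be verified numerically. The sketch below computes the geometric mean of the dataset 3, 5, 16 and checks that nudging the prediction in either direction increases the empirical risk under Billy loss. The function name `billy_risk` and the nudge size are illustrative assumptions; this is a spot check, not a proof.

```python
import math

def billy_risk(h, ys):
    """Empirical risk for Billy loss, [log(y / h)]^2."""
    return sum(math.log(y / h) ** 2 for y in ys) / len(ys)

ys = [3, 5, 16]
# Geometric mean: the n-th root of the product of the data.
h_star = math.prod(ys) ** (1 / len(ys))

print(round(h_star, 3))
print(billy_risk(h_star, ys) < billy_risk(h_star - 0.01, ys))
print(billy_risk(h_star, ys) < billy_risk(h_star + 0.01, ys))
```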

*Source: Fall 2022
Midterm, Problem 3*

Mahdi runs a local pastry shop near UCSD and sells a traditional dessert called Baklava. He bakes Baklavas every morning to keep his promise of providing fresh Baklavas to his customers daily. Here is the amount of Baklava he sold each day during the last week, in pounds (lb): y_1=100, y_2=110, y_3=75, y_4=90, y_5=105, y_6=90, y_7=25

Mahdi needs your help as a data scientist to suggest __the best
constant prediction (h^*) of daily
sales that minimizes the empirical risk using L(h,y) as the loss function__. Answer the
following questions and give a **brief justification** for
each part. **This problem has many parts, if you get stuck, move
on and come back later!**

Let L(h,y)=|y-h|. What is h^*? (We’ll later refer to this prediction as h_1^*).

Let L(h,y)=(y-h)^2. What is h^*? (We’ll later refer to this prediction as h_2^*).

**True** or **False**: Removing y_1 and y_3
from the dataset does not change h_2^*.

True

False

False. It changes the mean from 85 to 84. (However, the median is not changed.)
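The effect of removing y_1 and y_3 can be confirmed directly. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

sales = [100, 110, 75, 90, 105, 90, 25]
# Drop y_1 and y_3 (1-based positions 1 and 3).
reduced = [y for i, y in enumerate(sales, start=1) if i not in (1, 3)]

print(mean(sales), median(sales))
print(mean(reduced), median(reduced))
```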

Mahdi thinks that y_7 is an outlier.
Hence, he asks you to remove y_7 and
update your predictions in parts (a) and (b) accordingly. Without
calculating the new predictions, can you justify which prediction
changes *more*? h^*_1 or h_2^*?

**True** or **False**: Let L(y,h)=|y-h|^3. You can use the Gradient
descent algorithm to find h^*.

True

False

**True** or **False**: Let L(y,h)=\sin(y-h). The Gradient descent
algorithm is guaranteed to converge, provided that a proper learning
rate is given.

True

False

False. The function is not convex, so the gradient descent algorithm is not guaranteed to converge.

Mahdi has noticed that daily Baklava sales are associated with the
temperature. So he asks you to incorporate this feature to get a better
prediction. Suppose last week’s daily temperatures are x_1, x_2, \cdots, x_7 in Fahrenheit (F). We
know that \bar x=65, \sigma_x=8.5 and the best linear prediction
that __minimizes the mean squared error__ is H^*(x)=-3x+w_0^*.

What is the correlation coefficient (r) between x and y? What does that mean?

r=-0.95. This means the weather temperature inversely affects Baklava sales, i.e., they are highly negatively correlated.

We know w_1^* = \frac{\sigma_y}{\sigma_x}r. We know that \sigma_x=8.5 and w_1^*=-3. We can find \sigma_y as follows:

\begin{aligned} \sigma_y^2 =& \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2\\ =& \frac{1}{7}[(100-85)^2+(110-85)^2+(75-85)^2+(90-85)^2+(105-85)^2+(90-85)^2+(25-85)^2]\\ =&\frac{1}{7}[15^2+25^2+10^2+5^2+20^2+5^2+60^2] \approx 714.29 \end{aligned}

Then, \sigma_y \approx 26.7, which results in r = \frac{w_1^* \sigma_x}{\sigma_y} \approx \frac{-3 \cdot 8.5}{26.7} \approx -0.95.
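The arithmetic for \sigma_y and r can be reproduced in a few lines. This sketch uses the population standard deviation (dividing by n), as in the formula above; the variable names are illustrative.

```python
import math

sales = [100, 110, 75, 90, 105, 90, 25]
sigma_x = 8.5   # given in the problem
w1 = -3         # slope of the best linear prediction, given in the problem

y_bar = sum(sales) / len(sales)
var_y = sum((y - y_bar) ** 2 for y in sales) / len(sales)
sigma_y = math.sqrt(var_y)

# Rearranging w1 = r * sigma_y / sigma_x to solve for r.
r = w1 * sigma_x / sigma_y
print(round(var_y, 2), round(sigma_y, 1), round(r, 2))
```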

**True** or **False**: The unit of r is \frac{lb}{F} (Pound per Fahrenheit).

True

False

False. The correlation coefficient has no unit. (It is always a unitless number in [-1,1] range.)

Find w^*_0. *(Hint: You’ll need
to find \bar y for the given
dataset)*

w_0^*=280

Note that H(\bar x)=\bar y. Therefore, \begin{aligned} H(65)=-3\cdot 65 +w_0^*=85 \xrightarrow[]{}w_0^*=280. \end{aligned}

What would the best linear prediction H^*(x) be if we multiply all x_i’s by 2?

H^*(x) = -1.5x + 280

The standard deviation scales by a factor of 2, i.e., \sigma_x'=2\cdot \sigma_x.

The same
is true for the mean, i.e., \bar{x}'=2
\cdot \bar{x}.

The correlation r, standard deviation of the y-values \sigma_y, and the mean of the y-values \bar y do not change.

(You can verify
these claims by plugging 2x in for
x in their respective formulas and
seeing what happens, but it’s faster to *visually* reason why
this happens.)

Therefore, w_1'^*=\frac{\sigma_y'}{\sigma_x'}r' = \frac{(\sigma_y)}{(2\cdot\sigma_x)}(r) = \frac{w_1^*}{2} = -1.5.

We can find w_0'^* as follows:

\begin{align*} \bar{y}'&=H(\bar{x}')\\&=\frac{w_1^*}{2}(2\bar{x})+w_0'^*\\&=w_1^*\bar{x}+w_0'^* \\ &\downarrow \\ (85) &= -3(65) + w_0'^* \\ w_0'^*&=280 \end{align*}

So, H^*(x) would be -1.5x + 280.

What would the best linear prediction H^*(x) be if we add 20 to all x_i’s?

H^*(x) = -3x + 340

All parameters remain unchanged except \bar{x}'=\bar{x}+20. Since r, \sigma_x and \sigma_y are not changed, w_1^* does not change. Then, one can find w_0^* as follows:

\begin{align*} \bar{y}'&=H(\bar{x}') \\ &\downarrow \\ (85) &=-3(65+20)+w_0^* \\ w_0^*&=340 \end{align*}

So, H^*(x) would be -3x + 340.
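Since the individual temperatures are not given, the sketch below works only from summary statistics, using w_1^* = r \frac{\sigma_y}{\sigma_x} and w_0^* = \bar y - w_1^* \bar x. The helper `best_line` is a hypothetical name introduced for illustration; it recomputes the slope and intercept after scaling or shifting the x values.

```python
import math

def best_line(x_bar, sigma_x, y_bar, sigma_y, r):
    """Slope and intercept of the least-squares line from summary statistics."""
    w1 = r * sigma_y / sigma_x
    w0 = y_bar - w1 * x_bar
    return w1, w0

y_bar, sigma_y = 85, math.sqrt(5000 / 7)
x_bar, sigma_x = 65, 8.5
r = -3 * sigma_x / sigma_y  # implied by the given slope of -3

original = best_line(x_bar, sigma_x, y_bar, sigma_y, r)
scaled = best_line(2 * x_bar, 2 * sigma_x, y_bar, sigma_y, r)   # x multiplied by 2
shifted = best_line(x_bar + 20, sigma_x, y_bar, sigma_y, r)     # 20 added to x
print(original, scaled, shifted)
```

Up to floating-point rounding, the three results are (-3, 280), (-1.5, 280), and (-3, 340).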

*Source: Fall 2022
Midterm, Problem 4*

Consider a dataset that consists of y_1, \cdots, y_n. In class, we used calculus to minimize mean squared error, R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2. In this problem, we want you to apply the same approach to a slightly different loss function defined below: L_{\text{midterm}}(y,h)=(\alpha y - h)^2+\lambda h

Write down the empirical risk R_{\text{midterm}}(h) by using the above loss function.

The mean of the dataset is \bar{y}, i.e. \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i. Find h^* that minimizes R_{\text{midterm}}(h) using calculus. Your result should be in terms of \bar{y}, \alpha and \lambda.

h^*=\alpha \bar{y} - \frac{\lambda}{2}

\begin{align*} \frac{d}{dh}R_{\text{midterm}}(h)&= [\frac{2}{n}\sum_{i=1}^{n}(h- \alpha y_i )] +\lambda \\ &=2 h-2\alpha \bar{y} + \lambda. \end{align*}

By setting \frac{d}{dh}R_{\text{midterm}}(h)=0 we get 2 h^*-2\alpha \bar{y} + \lambda=0 \Rightarrow h^*=\alpha \bar{y} - \frac{\lambda}{2}.
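The closed form can be spot-checked on made-up numbers. In the sketch below, the dataset and the values of \alpha and \lambda are hypothetical, and `midterm_risk` is an illustrative name; the check confirms that the formula's output has lower risk than nearby predictions.

```python
def midterm_risk(h, ys, alpha, lam):
    """Empirical risk for the loss (alpha * y - h)^2 + lam * h."""
    return sum((alpha * y - h) ** 2 + lam * h for y in ys) / len(ys)

# Hypothetical example values, chosen only for illustration.
ys, alpha, lam = [1, 2, 3, 6], 2, 4

y_bar = sum(ys) / len(ys)
h_star = alpha * y_bar - lam / 2

print(h_star)
print(midterm_risk(h_star, ys, alpha, lam) < midterm_risk(h_star - 0.1, ys, alpha, lam))
print(midterm_risk(h_star, ys, alpha, lam) < midterm_risk(h_star + 0.1, ys, alpha, lam))
```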

*Source: Spring 2023 Midterm
1, Problem 1*

Consider a dataset such that 60 \leq y_1 \leq y_2 \leq \dots \leq y_n. Let R_{abs}(h) represent the mean absolute error of a constant prediction h on this dataset. Suppose we know that R_{abs}(27) = 94.

Find \bar y, or the mean of \{y_1, y_2, \dots, y_n\}.

\bar{y} = 121

There are several ways to complete this problem. One way is to
interpret mean absolute error of a constant prediction h as the average distance of each data point
to h. So we are told that the average
distance of each data point to 27 is
94. That is, the data points are 94 units away from 27 on average. Since all the data is at least
60, they must be 94 units *more* than 27, or 121,
on average.

You can also arrive at the same answer algebraically using the definition of R_{abs}(h). We have \begin{aligned} R_{abs}(h) &= \frac1n \sum_{i=1}^{n}|y_i - h| \\ R_{abs}(27) &= \frac1n \sum_{i=1}^{n}|y_i - 27| \\ &= \frac1n \sum_{i=1}^{n}(y_i - 27) \qquad \text{because each~} y_i\geq 60 \\ &= \frac1n\left( \sum_{i=1}^{n}y_i - \sum_{i=1}^{n}27\right) \\ &= \frac1n\left( n\cdot\bar y - n\cdot27\right) \\ &= \bar y - 27 \end{aligned}

Since we are told that R_{abs}(27) = 94, we can set \bar y - 27 = 94, to find that \bar y = 121.

Find R_{abs}(58).

R_{abs}(58) = 63

Again, we can complete this problem in multiple ways. Interpreting R_{abs}(h) as the average distance of each data point to h, we can see that since 58 is 31 units closer to each data point than 27, R_{abs}(58) = R_{abs}(27) - 31 = 94 - 31 = 63.

Another way to do this problem uses the answer from part (a). Since the data points average to 121, their distance to 58 is 121-58 = 63, on average.

Which of the following *could* be the mean absolute deviation
from the median for this dataset? There is only one correct answer.

18

62

94

102

Our answer is 18.

One easy way to do this problem is to recognize that none of these values are too low, but they may be too high. For example, the mean absolute deviation from the median can be as low as 0, when all the data points are the same. Since we are told there is only one correct answer and we know that no answer choice is too low, that means the correct answer must be the lowest option, 18.

We can also rule out all the other answer choices to show that they are too high. To do that, we’ll show that 62 is too high, and therefore, both 94 and 102 are too high as well. We are told that all data points are at least 60. By similar logic as we used in part (b), we can see that R_{abs}(60) = 61. Since the minimum value of R_{abs}(h) occurs at the median h^*\geq 60 and we already know R_{abs}(60) = 61, it must be the case that R_{abs}(h^*) \leq 61. So the mean absolute deviation from the median cannot be 62 or anything larger.
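All three parts can be illustrated on one concrete dataset. The values below are hypothetical; they were chosen only so that every point is at least 60 and R_{abs}(27) = 94.

```python
from statistics import mean, median

# Hypothetical dataset: all values >= 60, with R_abs(27) = 94.
ys = [60, 121, 182]

def mean_abs_error(h, data):
    """Mean absolute error of the constant prediction h."""
    return sum(abs(y - h) for y in data) / len(data)

r_27 = mean_abs_error(27, ys)
r_58 = mean_abs_error(58, ys)
r_60 = mean_abs_error(60, ys)
mad_from_median = mean_abs_error(median(ys), ys)

print(mean(ys), r_27, r_58, r_60, mad_from_median)
```

For this particular dataset the mean absolute deviation from the median is about 40.7, comfortably below the bound of 61.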

*Source: Spring 2023 Midterm
1, Problem 2*

Let R_{sq}(h) represent the mean squared error of a constant prediction h for a given dataset. Find a dataset \{y_1, y_2\} such that the graph of R_{sq}(h) has its minimum at the point (7,16).

The dataset is {3, 11}.

We’ve already learned that R_{sq}(h) is minimized at the mean of the data, and the minimum value of R_{sq}(h) is the variance of the data. So we need to provide a dataset of two points with a mean of 7 and a variance of 16. Recall that the variance is the average squared distance of each data point to the mean. Since we want a variance of 16, we can make each point 4 units away from the mean. Therefore, our data set can be y_1 = 3, y_2 = 11. In fact, this is the only solution.

A more computational approach uses the formulas for mean and variance and solves a system of two equations:

\begin{aligned} \frac{y_1+y_2}{2} &= 7 \\ \frac12 \left((y_1 - 7)^2 + (y_2 - 7)^2 \right) &= 16 \end{aligned}
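For a two-point dataset, both points sit the same distance d from the mean, so the variance is d^2. The sketch below uses that observation to build the dataset from the target minimum (7, 16) and then checks it with the standard `statistics` module. The variable names are illustrative.

```python
import math
from statistics import mean, pvariance

target_mean, target_variance = 7, 16

# Two points symmetric about the mean, each sqrt(variance) away from it.
distance = math.sqrt(target_variance)
dataset = [target_mean - distance, target_mean + distance]

print(dataset, mean(dataset), pvariance(dataset))
```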

*Source: Spring 2023 Final
Part 1, Problem 1*

For a given dataset \{y_1, y_2, \dots,
y_n\}, let M_{abs}(h) represent
the **median** absolute error of the constant prediction
h on that dataset (as opposed to the
mean absolute error R_{abs}(h)).

For the dataset \{4, 9, 10, 14, 15\}, what is M_{abs}(9)?

5

The first step is to calculate the absolute errors (|y_i - h|).

\begin{align*} \text{Absolute Errors} &= \{|4-9|, |9-9|, |10-9|, |14-9|, |15-9|\} \\ \text{Absolute Errors} &= \{|-5|, |0|, |1|, |5|, |6|\} \\ \text{Absolute Errors} &= \{5, 0, 1, 5, 6\} \end{align*}

Now we sort the absolute errors: \{0, 1, 5, 5, 6\}. The median of these values is 5, so M_{abs}(9) = 5.

For the same dataset \{4, 9, 10, 14, 15\}, find another integer h such that M_{abs}(9) = M_{abs}(h).

5 or 15

Our goal is to find another number that will give us the same median of absolute errors as in part (a).

One way to do this is to guess and check. Another way is to notice that we can choose h so that the absolute error of 10 (the middle element) is 5; because of the absolute value, h can lie on either side of 10.

Solving |10-x| = 5 gives x = 15 \text{ or } x = 5.

We can then test this by following the same steps as we did in part (a).

**For x = 15:**
\begin{align*}
\text{Absolute Errors} &= \{|4-15|, |9-15|, |10-15|, |14-15|,
|15-15|\} \\
\text{Absolute Errors} &= \{|-11|, |-6|, |-5|, |-1|, |0|\} \\
\text{Absolute Errors} &= \{11, 6, 5, 1, 0\}
\end{align*}

Then we order the elements to get the absolute errors: \{0, 1, 5, 6, 11\}. We can see the median is 5, so M_{abs}(15) =5.

**For x = 5:**
\begin{align*}
\text{Absolute Errors} &= \{|4-5|, |9-5|, |10-5|, |14-5|, |15-5|\}
\\
\text{Absolute Errors} &= \{|-1|, |4|, |5|, |9|, |10|\} \\
\text{Absolute Errors} &= \{1, 4, 5, 9, 10\}
\end{align*}

We do not have to re-order the elements because they are in order already. We can see the median is 5, so M_{abs}(5) =5.
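A brute-force search confirms these answers. The sketch below evaluates the median absolute error at every integer from 0 to 20 (an arbitrary search range chosen for illustration) and lists the predictions that give a value of 5.

```python
from statistics import median

data = [4, 9, 10, 14, 15]

def median_abs_error(h, ys):
    """Median of the absolute errors |y - h|."""
    return median(abs(y - h) for y in ys)

matches = [h for h in range(0, 21) if median_abs_error(h, data) == 5]
print(matches)  # [5, 9, 15]
```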

Based on your answers to parts (a) and (b), discuss in **at
most two sentences** what is problematic about using the median
absolute error to make predictions.

*Source: Spring 2023 Final
Part 1, Problem 2*

Match each dataset with the graph of its mean absolute error, R_{abs}(h).

\{4, 7, 9, 10, 11\}

Graph 1

Graph 2

Graph 3

Graph 4

Graph 2

The important thing to note here is the y axis is equal to R_{abs}(h) and the x axis is equal to our h. The easiest way to figure out which graph belongs to which dataset is to choose some numbers for h and see if a line matches up with the chosen points.

h | R_{abs}(h) |
---|---|
0 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-0| = \frac{1}{5} \cdot 41 \approx 8 |
8 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-8| = \frac{1}{5} \cdot 11 \approx 2 |
18 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-18| = \frac{1}{5} \cdot 49 \approx 10 |

When looking at these three places we can see that this dataset matches Graph 2.

\{-3, 1, 9, 11, 20\}

Graph 1

Graph 2

Graph 3

Graph 4

Graph 1

We will use the same approach we used in part (a).

h | R_{abs}(h) |
---|---|
0 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-0| = \frac{1}{5} \cdot 44 \approx 9 |
8 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-8| = \frac{1}{5} \cdot 34 \approx 7 |
18 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-18| = \frac{1}{5} \cdot 56 \approx 11 |

When looking at these three places we can see that this dataset matches Graph 1.

\{8, 9, 12, 13, 15\}

Graph 1

Graph 2

Graph 3

Graph 4

Graph 3

We will use the same approach we used in the previous parts.

h | R_{abs}(h) |
---|---|
0 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-0| = \frac{1}{5} \cdot 57 \approx 11 |
8 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-8| = \frac{1}{5} \cdot 17 \approx 3 |
18 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-18| = \frac{1}{5} \cdot 33 \approx 7 |

When looking at these three places we can see that this dataset matches Graph 3.

\{11, 12, 12, 13, 14\}

Graph 1

Graph 2

Graph 3

Graph 4

Graph 4

We will use the same approach we used in the previous parts.

h | R_{abs}(h) |
---|---|
0 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-0| = \frac{1}{5} \cdot 62 \approx 12 |
8 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-8| = \frac{1}{5} \cdot 22 \approx 4 |
18 | \frac{1}{5}\sum_{i = 1}^{5}|y_i-18| = \frac{1}{5} \cdot 28 \approx 6 |

When looking at these three places we can see that this dataset
matches Graph 4. Another way would be choosing the only option not
chosen in the other parts yet!
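The spot values used in this problem are easy to recompute. The sketch below tabulates, for each of the four datasets, the sum of absolute errors and the resulting R_{abs}(h) at h = 0, 8, and 18. The function and variable names are illustrative.

```python
datasets = [
    [4, 7, 9, 10, 11],
    [-3, 1, 9, 11, 20],
    [8, 9, 12, 13, 15],
    [11, 12, 12, 13, 14],
]
probe_points = (0, 8, 18)

def abs_error_sum(h, ys):
    """Sum of absolute errors |y - h| over the dataset."""
    return sum(abs(y - h) for y in ys)

# One row per dataset, one column per probe point.
abs_error_sums = [[abs_error_sum(h, ys) for h in probe_points] for ys in datasets]

for ys, sums in zip(datasets, abs_error_sums):
    print(ys, sums, [s / len(ys) for s in sums])
```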

*Source: Winter 2022 Midterm
1, Problem 1*

Define the extreme mean (EM) of a dataset to be the average of its largest and smallest values. Let f(x)=-3x+4. Show that for any dataset x_1\leq x_2 \leq \dots \leq x_n, EM(f(x_1), f(x_2), \dots, f(x_n)) = f(EM(x_1, x_2, \dots, x_n)).

This linear transformation reverses the order of the data because if a<b, then -3a>-3b and so adding four to both sides gives f(a)>f(b). Since x_1\leq x_2 \leq \dots \leq x_n, this means that the smallest of f(x_1), f(x_2), \dots, f(x_n) is f(x_n) and the largest is f(x_1). Therefore,

\begin{aligned} EM(f(x_1), f(x_2), \dots, f(x_n)) &= \dfrac{f(x_n) + f(x_1)}{2} \\ &= \dfrac{-3x_n+4-3x_1+4}{2} \\ &= \dfrac{-3x_n-3x_1}{2} + 4\\ &= -3\left(\dfrac{x_1+x_n}{2}\right) + 4 \\ &= -3EM(x_1, x_2, \dots, x_n)+ 4\\ &= f(EM(x_1, x_2, \dots, x_n)). \end{aligned}
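The identity can be tried out on a small example. The dataset below is hypothetical, and `extreme_mean` is an illustrative helper name; a single example does not replace the proof above, but it shows the order reversal in action.

```python
def extreme_mean(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

def f(x):
    return -3 * x + 4

xs = [1, 4, 10]  # hypothetical sorted dataset

transformed = [f(x) for x in xs]   # order is reversed: f(1) is now the largest
lhs = extreme_mean(transformed)    # EM(f(x_1), ..., f(x_n))
rhs = f(extreme_mean(xs))          # f(EM(x_1, ..., x_n))
print(transformed, lhs, rhs)
```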

*Source: Winter 2022 Midterm
1, Problem 2*

Consider a new loss function, L(h, y) = e^{(h-y)^2}. Given a dataset y_1, y_2, \dots, y_n, let R(h) represent the empirical risk for the dataset using this loss function.

For the dataset \{1, 3, 4\}, calculate R(2). Simplify your answer as much as possible without a calculator.

R(2) = \frac13 (2e+e^4)

We need to calculate the loss for each data point then average the losses. That is, we need to calculate R(2) = \dfrac{1}{3} \sum_{i=1}^{3} e^{(2-y_i)^2}. The table below records the necessary information:

y_i | 1 | 3 | 4 |
---|---|---|---|
2-y_i | 1 | -1 | -2 |
(2-y_i)^2 | 1 | 1 | 4 |
e^{(2-y_i)^2} | e | e | e^4 |

This means: \begin{aligned} R(2) &= \dfrac{1}{3} \sum_{i=1}^{3} e^{(2-y_i)^2} \\ &= \frac13 (e+e+e^4) \\ &= \frac13 (2e+e^4) \end{aligned}

For the same dataset \{1, 3, 4\}, perform one iteration of gradient descent on R(h), starting at an initial prediction of h_0=2 with a step size of \alpha=\frac{1}{2}. Show your work and simplify your answer.

h_1 = 2 + \frac{2e^4}{3}

First, we calculate the derivative of R(h). Using the chain rule, we have \begin{align*} R(h) &= \dfrac1n \sum_{i=1}^n e^{(h-y_i)^2} \\ R'(h) &= \dfrac1n \sum_{i=1}^n e^{(h-y_i)^2}\cdot 2(h-y_i) \\ \end{align*} To apply the gradient descent update rule, we next have to calculate R'(h_0) or R'(2). Plugging in h=2 to the derivative we calculated above gives: \begin{align*} R'(2) &= \dfrac1n \sum_{i=1}^n e^{(2-y_i)^2}\cdot 2(2-y_i) \end{align*}

The table below records the necessary information (note that we’ve done most of the work already).

y_i | 1 | 3 | 4 |
---|---|---|---|
2-y_i | 1 | -1 | -2 |
(2-y_i)^2 | 1 | 1 | 4 |
e^{(2-y_i)^2} | e | e | e^4 |
e^{(2-y_i)^2}\cdot 2(2-y_i) | 2e | -2e | -4e^4 |

Therefore: \begin{aligned} R'(2) &= \dfrac{1}{3} \sum_{i=1}^{3} e^{(2-y_i)^2}\cdot 2(2-y_i) \\ &= \frac13 (2e - 2e -4e^4) \\ &= \frac{-4e^4}{3}. \end{aligned} Applying the gradient descent update rule gives: \begin{aligned} h_1 &= h_0 - \alpha\cdot R'(h_0) \\ &= 2 - \frac{1}{2}\cdot \frac{-4e^4}{3} \\ &= 2 + \frac{2e^4}{3} \end{aligned}
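The same computation can be carried out numerically. This is a minimal sketch of a single gradient descent step for this loss; the function names are illustrative, and the results are compared against the closed-form answers from the solution.

```python
import math

ys = [1, 3, 4]

def risk(h):
    """Empirical risk for the loss e^{(h - y)^2}."""
    return sum(math.exp((h - y) ** 2) for y in ys) / len(ys)

def risk_derivative(h):
    """Derivative of the empirical risk, from the chain rule."""
    return sum(math.exp((h - y) ** 2) * 2 * (h - y) for y in ys) / len(ys)

h0, step_size = 2, 0.5
h1 = h0 - step_size * risk_derivative(h0)  # one gradient descent update

print(risk(h0), (2 * math.e + math.e ** 4) / 3)
print(h1, 2 + 2 * math.e ** 4 / 3)
```

Note how large the step is: the steep exponential loss pushes h_1 far past the data, which is a reminder that the step size matters a great deal for this loss.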

*Source: Winter 2023 Final,
Problem 1*

For each of the loss functions below, **find the constant
prediction h^*** which minimizes
the corresponding empirical risk with respect to the data y_1 = -3, y_2 = 2, y_3 = 2, y_4 = -2, y_5 = -6 .

The **\alpha-absolute**
loss is defined as follows: L_{\alpha-\text{abs} }(h, y) = |h - (y-\alpha)
|.

Use \alpha=3.

h^*=-2-3=-5

This is equivalent to the absolute loss on the same dataset shifted by \alpha. Therefore the optimal solution that minimizes this loss is the median (-2) shifted by -\alpha (-3).
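The shift argument can be checked with a small search. The sketch below computes the median of the shifted data and also minimizes the empirical risk over a range of integer predictions; the search range and the function name are illustrative assumptions.

```python
from statistics import median

ys = [-3, 2, 2, -2, -6]
alpha = 3

def alpha_abs_risk(h):
    """Empirical risk for the loss |h - (y - alpha)|."""
    return sum(abs(h - (y - alpha)) for y in ys) / len(ys)

shifted_median = median(y - alpha for y in ys)
best_on_grid = min(range(-15, 11), key=alpha_abs_risk)
print(shifted_median, best_on_grid)
```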

Use \beta=2. Hint: plot the empirical risk function for y\in[-6, 3].

h^*=2 \cdot 2=4.

This is equivalent to the absolute loss on the same dataset scaled by \beta. Therefore the optimal solution that minimizes this loss is the mode scaled by \beta.

*Source: Winter 2024 Final
Part 1, Problem 1*

Suppose there is a dataset containing 10000 integers:

- 2500 of them are 3s
- 2500 of them are 5s
- 4500 of them are 7s
- 500 of them are 9s.

Calculate the median of this dataset.

6

We know there is an even number of integers in this dataset because 10000 \% 2 = 0. We can find the middle of the dataset as follows: \frac{10000}{2} = 5000. This means the element in the 5000th position and 5001st position can give us our median. The element at the 5000th position is a 5 because 2500 + 2500 = 5000. The element at the 5001st position is a 7 because the next number after 5 is 7. We can then plug 5 and 7 into the equation: \frac{x_{5000} + x_{5001}}{2} = \frac{5 + 7}{2} = 6

How does the mean of this dataset compare to its median?

The mean is larger than the median

The mean is smaller than the median

The mean and the median are equal

The mean is smaller than the median.

We can calculate the mean as follows: \frac{2500 \cdot 3 + 2500 \cdot 5 + 4500 \cdot 7 + 500 \cdot 9}{10000} = 5.6. Using part (a), we know that 5.6 < 6, which means the mean is smaller than the median.
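Both answers can be confirmed by building the dataset explicitly. This is a minimal sketch using the standard `statistics` module.

```python
from statistics import mean, median

# 2500 threes, 2500 fives, 4500 sevens, and 500 nines.
data = [3] * 2500 + [5] * 2500 + [7] * 4500 + [9] * 500

data_median = median(data)  # average of the 5000th and 5001st sorted values
data_mean = mean(data)
print(len(data), data_median, data_mean)
```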

*Source: Winter 2024 Midterm
1, Problem 1*

Consider a dataset D with 5 data points \{7,5,1,2,a\}, where a is a positive real number. Note that a is not necessarily an integer.

Express the mean of D as a function of a, simplify the expression as much as possible.

\text{Mean($D$)} = \frac{a}{5} + 3

Depending on the range of a, the median of D could assume one of three possible values. Write out all possible medians of D along with the corresponding range of a for each case. Express the ranges using double inequalities, e.g. 3<a\leq8:

Determine the range of a that satisfies: \text{Mean}(D) < \text{Median}(D) Make sure to show your work.

\dfrac{15}{4}<a<10

Since there are 3 possible median
values, we will have to discuss each situation separately.

In case 1, when 0<a\leq2, \text{Median}(D) = 2. So, we have:

\begin{align*} \text{Mean}(D) &< \text{Median}(D)\\ 3 + \frac{a}{5} &< 2\\ a&<-5 \end{align*}

But a<-5 conflicts with the condition 0<a\leq2, so there is no solution in this case: \text{Mean}(D) < \text{Median}(D) is impossible when \text{Median}(D) = 2.

In case 2, when 2<a<5, \text{Median}(D) = a. So, we have:

\begin{align*} \text{Mean}(D) &< \text{Median}(D)\\ 3 + \frac{a}{5} &< a\\ 3 &< \frac{4}{5} a\\ a &> \frac{15}{4}\\ \end{align*}

So a has to be larger than \frac{15}{4}. But remember from the prerequisite condition that 2<a<5.

To satisfy both conditions, we must have \frac{15}{4}<a<5.

In case 3, when a\geq5, \text{Median}(D) = 5. So, we have: \begin{align*} \text{Mean}(D) &< \text{Median}(D)\\ 3 + \frac{a}{5} &< 5\\ a&<10 \end{align*}

Combining this with the prerequisite condition, we have 5\leq a<10.

Combining the range of all three cases, we have \dfrac{15}{4}<a<10 as our final answer.
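The final range can be spot-checked by testing a few values of a on either side of each boundary. The test values below are arbitrary illustrative choices, and `mean_below_median` is a hypothetical helper name.

```python
from statistics import median

def mean_below_median(a):
    """True when Mean(D) < Median(D) for D = {7, 5, 1, 2, a}."""
    data = [7, 5, 1, 2, a]
    return sum(data) / len(data) < median(data)

# Values below, at, inside, and above the range 15/4 < a < 10.
results = {a: mean_below_median(a) for a in (1, 3.75, 4, 7, 9.9, 10, 12)}
print(results)
```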

*Source: Winter 2024 Midterm
1, Problem 2*

Let R_{sq}(h) represent the mean squared error of a constant prediction h for a given dataset. For the dataset \{3, y_{1}\}, the graph of R_{sq}(h) has its minimum at the point (5,r_{1}). Find the values of y_{1} and r_{1}.

y_1 = 7, r_1 = 4

Indexing the data points from 0 (so that y_0 = 3), the mean squared error is written as: \begin{align*} R_{sq}(h) = \frac{1}{n}\sum_{i=0}^{n-1}(y_{i}-h)^2 \end{align*}

Since we only have two data points (n=2), the equation simplifies to:

\begin{align*} R_{sq}(h) = \frac{1}{2}((y_{0}-h)^2+ (y_{1}-h)^2) \end{align*}

Taking the derivative with respect to h, we have: \begin{align*} \frac{dR_{sq}(h)}{dh} = -(y_{0}-h)- (y_{1}-h) \end{align*}

We know that the derivative has to be 0 at the minimum, therefore at h=5, we have:

\begin{align*} \frac{dR_{sq}(h)}{dh} = -(3-5)- (y_{1}-5) &= 0\\ 7 - y_1 &= 0\\ y_1 &= 7 \end{align*}

So we know that the dataset is \{3,7\}. Given this information, we can calculate r_1 with:

\begin{align*} R_{sq}(5) &= \frac{1}{2}((y_{0}-5)^2+ (y_{1}-5)^2)\\ &=\frac{1}{2}((3-5)^2+ (7-5)^2)\\ &=\frac{1}{2}(4+4)=4 \end{align*}
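A shorter route to the same answer uses the fact that R_{sq}(h) is minimized at the mean, so the mean of \{3, y_1\} must be 5. The sketch below follows that route and then evaluates the mean squared error at h = 5; the variable names are illustrative.

```python
known_point, h_min = 3, 5

# The mean of {3, y1} must equal 5, so (3 + y1) / 2 = 5.
y1 = 2 * h_min - known_point
dataset = [known_point, y1]

# r1 is the mean squared error at the minimizer, i.e. the variance.
r1 = sum((y - h_min) ** 2 for y in dataset) / len(dataset)
print(y1, r1)
```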