Summary Statistics and the Constant Model

← return to practice.dsc40a.com


This page contains all problems about Summary Statistics and the Constant Model.


Problem 1

Source: Fall 2021 Midterm, Problem 1

King Triton just made an Instagram account and has been keeping track of the number of likes his posts have received so far.

His first 7 posts have received a mean of 16 likes; the specific like counts in sorted order are

8, 12, 12, 15, 18, 20, 27

King Triton wants to predict the number of likes his next post will receive, using a constant prediction rule h. For each loss function L(h, y), determine the constant prediction h^* that minimizes empirical risk. If you believe there are multiple minimizers, specify them all. If you believe you need more information to answer the question or that there is no minimizer, state that clearly. Give a brief justification for each answer.


Problem 1.1

L(h, y) = |y - h|

This is absolute loss, and hence we’re looking for the minimizer of mean absolute error, which is the median, 15.


Problem 1.2

L(h, y) = (y - h)^2

This is squared loss, and hence we’re looking for the minimizer of mean squared error, which is the mean, 16.


Problem 1.3

L(h, y) = 4(y - h)^2

This is squared loss, multiplied by a constant. Note that when we go to minimize empirical risk for this loss function, we will take the derivative of empirical risk and set it equal to 0; at that point the constant factor of 4 can be divided from both sides, so this problem boils down to minimizing ordinary mean squared error. The only difference is that the graph of mean squared error will be stretched vertically by a factor of 4; the minimizing value will be in the same place.

For more justification, here we consider any general re-scaling \alpha (y-h)^2:

\begin{aligned} R_{sq}(h) &= \frac{1}{n} \sum_{i = 1}^n \alpha (y_i - h)^2 \\ &= \alpha \cdot \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2 \\ \frac{d}{dh} R_{sq}(h) &= \alpha \cdot \frac{1}{n} \sum_{i = 1}^n 2(y_i - h)(-1) = 0\\ &\implies -\frac{2\alpha}{n}\sum_{i = 1}^n (y_i - h) = 0 \\ &\implies \sum_{i = 1}^n (y_i - h) = 0 \\ &\implies h^* = \frac{1}{n} \sum_{i = 1}^n y_i \end{aligned}


Problem 1.4

L(h, y) = \begin{cases} 0 & h = y \\ 100 & h \neq y \end{cases}

This is a scaled version of 0-1 loss. We know that empirical risk for 0-1 loss is minimized at the mode, so that also applies here. The mode, i.e. the most common value, is 12.


Problem 1.5

L(h, y) = (3y - 4h)^2

Note that we can write (3y - 4h)^2 as \left( 3 \left( y - \frac{4}{3}h \right) \right)^2 = 9 \left( y - \frac{4}{3}h \right)^2. As we’ve seen, the constant factor out front has no impact on the minimizing value. Using the same principle as in the last part, we can say that \frac{4}{3} h^* = \bar{x} \implies h^* = \frac{3}{4} \bar{x} = \frac{3}{4} \cdot 16 = 12


Problem 1.6

L(h, y) = (y - h)^3

Hint: Do not spend too long on this subpart.

No minimizer.

Note that unlike |y - h|, (y - h)^2, and all of the other loss functions we’ve seen, (y - h)^3 tends towards -\infty, rather than having a minimum output of 0. This means that there is no h that minimizes \frac{1}{n} \sum_{i = 1}^n (y_i - h)^3; the larger we make h, the more negative (and hence “smaller") this empirical risk becomes.



Problem 2

Source: Fall 2021 Midterm, Problem 2

Consider a set of 23 data points y_1, y_2, y_3, ..., y_{23} such that y_1 < y_2 < ... < y_{23}. Let’s call this Dataset A.

We create a new dataset, Dataset B, by repeating each point in Dataset A once. That is, Dataset B is the set of 46 points y_1, y_1, y_2, y_2, ..., y_{23}, y_{23}.

Answer the following questions regarding the relationship between Dataset A and Dataset B. Justify your answers.


Problem 2.1

Suppose the minimizer of mean absolute error R_{abs}(h) for Dataset A is 5. What is the minimizer of mean absolute error for Dataset B? If you believe there are multiple minimizers, specify them all. If you believe you need more information to answer the question, state that clearly.

The minimizer of MAE for Dataset B is still 5.

Note that when we repeat each data point, we go from having an odd number of data points (23) to an even number (46). This means the minimizer is the set of all values between the middle two values. But the middle two values will now both be y_{12}, and so there are no numbers “in between" them – only y_{12} minimizes MAE.


Problem 2.2

Suppose the mean absolute deviation from the median for Dataset A is 17. What is the mean absolute deviation from the median for Dataset B? If you believe you need more information to answer the question, state that clearly.

The mean absolute deviation from the median for Dataset B is still 17.

As we saw in previous part, the median itself does not change. When adding together the deviations from the median, each point is repeated twice, so the sum of all deviations from the median is doubled. However, there are twice as many data points in Dataset B than there are in Dataset A, so we divide by 2n (46) instead of n (23) in our average.

In short, for Dataset B both the numerator and denominator in the calculation of mean absolute deviation from the median \frac{\sum_{i = 1}^n |y_i - \text{Median}(y)|}{n} are double what they were for Dataset A, so the end result is the same as for Dataset A.


Problem 2.3

Suppose the function R_A(h) represents mean absolute error for Dataset A and R_B(h) represents mean absolute error for Dataset B.

Is it true that R_A(h) = R_B(h) for any real number h? (In other words, are the graphs of R_A(h) and R_B(h) identical?) Explain your reasoning.

Yes, R_{A}(h) = R_{B}(h).

Recall that the definition of R_{abs}(h) is:

R_{abs}(h) = \frac{1}{n} \sum_{i = 1}^n |y_i - h|

For the first dataset, we have:

R_{A}(h) = \frac{1}{23} \sum_{i = 1}^{23} |y_i - h|

and for the second dataset, we have:

\begin{aligned} R_{B}(h) &= \frac{1}{46} \sum_{i = 1}^{46} \left( |y_1 - h| + |y_1 - h| + |y_2 - h| + |y_2 - h| + ... + |y_{23} - h| + |y_{23} - h| \right) \\ &= \frac{1}{46} \left( 2 \sum_{i = 1}^{23} |y_i - h| \right) \\ &= \frac{1}{23} \sum_{i = 1}^{23} |y_i - h| \\ &= R_{A}(h) \end{aligned}



Problem 3

Source: Fall 2021 Final Exam, Problem 1

The mean of 12 non-negative numbers is 45. Suppose we remove 2 of these numbers. What is the largest possible value of the mean of the remaining 10 numbers? Show your work.

54.

To maximize the mean of the remaining 10 numbers, we want to minimize the numbers that are removed. The smallest possible non-negative number is 0, so to maximize the mean of the remaining 10, we should remove two 0s from the set of numbers. Recall that the sum of the 12 number set is 12 \cdot 45; then, the maximum possible mean of the remaining 10 is

\frac{12 \cdot 45 - 2 \cdot 0}{10} = \frac{6}{5} \cdot 45 = 54


Problem 4

Source: Fall 2021 Final Exam, Problem 2

Consider a set of 8 data points, y_1, y_2, ..., y_8 that are in sorted order, i.e. y_1 < y_2 < ... < y_8. Suppose that y_4 = 10, y_5 = 14, and y_6 = 22. Recall that mean absolute error, R_{abs}(h), is defined as R_{abs}(h) = \frac{1}{n} \sum_{i = 1}^n |y_i - h|. Suppose that R_{abs}(11) = 9.

What is R_{abs}(22)? Show your work.

Hint: Use the formula for the slope of R at h.

R_{abs}(22) = 11

We can write the points given to us as:

y_1, y_2, y_3, 10, 14, 22, y_7, y_8

Since there are an even number of data points, all values of h between the middle two points minimize R_{abs}(h). In this case, all values of h in the interval [10, 14] minimize R_{abs}(h) and, as a result, have the same value of R_{abs}(h). Thus, R_{abs}(14) = R_{abs}(11) = 9.

From lecture, we know that:

\text{slope of $R$ at $h$} = \frac{1}{n} \left( \left(\text{\# points } < h\right) - \left(\text{\# points } > h\right)\right)

We can use this formula to determine what to add to R_{abs}(14) to get R_{abs}(22). Note that for values of h in the interval (14, 22), the slope of R at h is \frac{1}{8} (5 - 3) = \frac{1}{4}. This means that for every 1 unit we move to the right from h = 14 until we get to h = 22, R_{abs}(h) increases by \frac{1}{4}. So,

R_{abs}(22) = R_{abs}(14) + (22 - 14) \cdot \frac{1}{4} = 9 + 2 = 11


Problem 5

Source: Fall 2021 Final Exam, Problem 4

You may find the following properties of logarithms helpful in this question. Assume that all logarithms in this question are natural logarithms, i.e. of base e.

Billy, the avocado-farmer-turned-waiter-turned-Instagram-influencer that you’re all-too-familiar with, is trying his hand at coming up with loss functions. He comes up with the Billy loss, L_B(h, y), defined as follows:

L_B(h, y) = \left[ \log \left( \frac{y}{h} \right) \right]^2

Throughout this problem, assume that all ys are positive.


Problem 5.1

Show that: \frac{d}{dh} L_B(h, y) = - \frac{2}{h} \log \left( \frac{y}{h} \right)

\begin{align*} \frac{d}{dh} L_B(h, y) &= \frac{d}{dh} \left[ \log \left( \frac{y}{h} \right) \right]^2 \\ &= 2 \cdot \log \left( \frac{y}{h} \right) \cdot \frac{d}{dh} \log \left( \frac{y}{h} \right) \\ &= 2 \cdot \log \left( \frac{y}{h} \right) \cdot \frac{d}{dh} \left( \log(y) - \log(h) \right) \\ &= 2 \cdot \log \left( \frac{y}{h} \right) \cdot \left( - \frac{1}{h} \right) \\ &= -\frac{2}{h} \log \left( \frac{y}{h} \right) \end{align*}


Problem 5.2

Show that the constant prediction h^* that minimizes for Billy loss is:

h^* = \left(y_1 \cdot y_2 \cdot ... \cdot y_n \right)^{\frac{1}{n}}

You do not need to perform a second derivative test, but otherwise you must show your work.

Hint: To confirm that you’re interpreting the result correctly, h^* for the dataset 3, 5, 16 is (3 \cdot 5 \cdot 16)^{\frac{1}{3}} = 240^{\frac{1}{3}} \approx 6.214.

\begin{align*} R_B(h) &= \frac{1}{n} \sum_{i = 1}^n \left[ \log \left( \frac{y_i}{h} \right) \right]^2 \\ \frac{d}{dh} R_B(h) &= \frac{1}{n} \sum_{i = 1}^n \frac{d}{dh} \left[ \log \left( \frac{y_i}{h} \right) \right]^2 \\ &= \frac{1}{n} \sum_{i = 1}^n -\frac{2}{h} \log \left( \frac{y_i}{h} \right) \\ &= -\frac{2}{nh} \sum_{i = 1}^n \log \left( \frac{y_i}{h} \right) = 0 \\ 0 &= \sum_{i = 1}^n \log \left( \frac{y_i}{h} \right) = \sum_{i = 1}^n \left( \log(y_i) - \log(h)\right) \\ 0 &= \sum_{i = 1}^n \log(y_i) - \log(h) \sum_{i = 1}^n 1 \\ 0 &= \left( \log(y_1) + \log(y_2) + ... + \log(y_n) \right) - n \log(h) \\ \log(h^n) &= \log(y_1 \cdot y_2 \cdot ... \cdot y_n) \\ h^n &= y_1 \cdot y_2 \cdot ... \cdot y_n \\ h^* &= (y_1 \cdot y_2 \cdot ... \cdot y_n)^{\frac{1}{n}} \end{align*}



Problem 6

Source: Fall 2022 Midterm, Problem 3

Mahdi runs a local pastry shop near UCSD and sells traditional desert called Baklava. He bakes Baklavas every morning to keep his promise of providing fresh Baklavas to his customers daily. Here is the amount of Baklava he sold each day during last week in pounds(lb): y_1=100, y_2=110, y_3=75, y_4=90, y_5=105, y_6=90, y_7=25

Mahdi needs your help as a data scientist to suggest the best constant prediction (h^*) of daily sales that minimizes the empirical risk using L(h,y) as the loss function. Answer the following questions and give a brief justification for each part. This problem has many parts, if you get stuck, move on and come back later!


Problem 6.1

Let L(h,y)=|y-h|. What is h^*? (We’ll later refer to this prediction as h_1^*).

As we have seen in lectures, the median minimizes the absolute loss risk function. h^*_1=\text{Median}(y_1, \cdots, y_7).


Problem 6.2

Let L(h,y)=(y-h)^2. What is h^*? (We’ll later refer to this prediction as h_2^*).

As we have seen in lectures, the mean minimizes the square loss risk function. h^*_2=\text{Mean}(y_1, \cdots, y_7).


Problem 6.3

True or False: Removing y_1 and y_3 from the dataset does not change h_2^*.

False. It changes the mean from 85 to 84. (However, the median is not changed.)


Problem 6.4

Mahdi thinks that y_7 is an outlier. Hence, he asks you to remove y_7 and update your predictions in parts (a) and (b) accordingly. Without calculating the new predictions, can you justify which prediction changes more? h^*_1 or h_2^*?

Removing y_7 affects h_2^* more than h_1^*. This is because the mean squared loss is more sensitive to outliers than absolute loss, and removing data changes the mean more.


Problem 6.5

True or False: Let L(y,h)=|y-h|^3. You can use the Gradient descent algorithm to find h^*.

False. The function |y-h|^3 is not differentiable everywhere so we can not use the gradient descent to find the minimum.


Problem 6.6

True or False: Let L(y,h)=\sin(y-h). The Gradient descent algorithm is guaranteed to converge, provided that a proper learning rate is given.

False. The function is not convex, so the gradient descent algorithm is not guaranteed to converge.


Problem 6.7

Mahdi has noticed that Baklava daily sale is associated with weather temperature. So he asks you to incorporate this feature to get a better prediction. Suppose the last week’s daily temperatures are x_1, x_2, \cdots, x_7 in Fahrenheit (F). We know that \bar x=65, \sigma_x=8.5 and the best linear prediction that minimizes the mean squared error is H^*(x)=-3x+w_0^*.

What is the correlation coefficient (r) between x and y? What does that mean?

r=-0.95. This means the weather temperature inversely affects Baklava sales, i.e., they are highly negatively correlated.

We know w_1^* = \frac{\sigma_y}{\sigma_x}r. We know that \sigma_x=8.5 and w_1^*=-3. We can find \sigma_y as follows:

\begin{aligned} \sigma_y^2 =& \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2\\ =& \frac{1}{7}[(100-85)^2+(110-85)^2+(75-85)^2+(90-85)^2+(105-85)^2+(90-85)^2+(25-85)^2]\\ =&\frac{1}{7}[15^2+25^2+10^2+5^2+20^2+5^2+60^2]=714.28 \end{aligned}

Then, \sigma_y=26.7 which results in r=-0.95.


Problem 6.8

True or False: The unit of r is \frac{lb}{F} (Pound per Fahrenheit).

False. The correlation coefficient has no unit. (It is always a unitless number in [-1,1] range.)


Problem 6.9

Find w^*_0. (Hint: You’ll need to find \bar y for the given dataset)

w_0^*=280

Note that H(\bar x)=\bar y. Therefore, \begin{aligned} H(65)=-3\cdot 65 +w_0^*=85 \xrightarrow[]{}w_0^*=280. \end{aligned}


Problem 6.10

What would the best linear prediction H^*(x) be if we multiply all x_i’s by 2?

H^*(x) = -1.5x + 280

The standard deviation scales by a factor of 2, i.e., \sigma_x'=2\cdot \sigma_x.
The same is true for the mean, i.e., \bar{x}'=2 \cdot \bar{x}.
The correlation r, standard deviation of the y-values \sigma_y, and the mean of the y-values \bar y do not change.
(You can verify these claims by plugging 2x in for x in their respective formulas and seeing what happens, but it’s faster to visually reason why this happens.)

Therefore, w_1'^*=\frac{\sigma_y'}{\sigma_x'}r' = \frac{(\sigma_y)}{(2\cdot\sigma_x)}(r) = \frac{w_1^*}{2} = -1.5.

We can find w_0'^* as follows:

\begin{align*} \bar{y}'&=H(\bar{x}')\\&=\frac{w_1^*}{2}(2\bar{x})+w_0'^*\\&=w_1^*\bar{x}+w_0'^* \\ &\downarrow \\ (85) &= -3(65) + w_0'^* \\ w_0'^*&=280 \end{align*}

So, H^*(x) would be -1.5x + 280.


Problem 6.11

What would the best linear prediction H^*(x) be if we add 20 to all x_i’s?

H^*(x) = -3x + 340

All parameters remain unchanged except \bar{x}'=\bar{x}+20. Since r, \sigma_x and \sigma_y are not changed, w_1^* does not change. Then, one can find w_0^* as follows:

\begin{align*} \bar{y}'&=H(\bar{x}') \\ &\downarrow \\ (85) &=-3(65+20)+w_0^* \\ w_0^*&=340 \end{align*}

So, H^*(x) would be -3x + 340.



Problem 7

Source: Fall 2022 Midterm, Problem 4

Consider a dataset that consists of y_1, \cdots, y_n. In class, we used calculus to minimize mean squared error, R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2. In this problem, we want you to apply the same approach to a slightly different loss function defined below: L_{\text{midterm}}(y,h)=(\alpha y - h)^2+\lambda h


Problem 7.1

Write down the empiricial risk R_{\text{midterm}}(h) by using the above loss function.

R_{\text{midterm}}(h)=\frac{1}{n}\sum_{i=1}^{n}[(\alpha y_i - h)^2+\lambda h]=[\frac{1}{n}\sum_{i=1}^{n}(\alpha y_i - h)^2] +\lambda h


Problem 7.2

The mean of dataset is \bar{y}, i.e. \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i. Find h^* that minimizes R_{\text{midterm}}(h) using calculus. Your result should be in terms of \bar{y}, \alpha and \lambda.

h^*=\alpha \bar{y} - \frac{\lambda}{2}



\begin{align*} \frac{d}{dh}R_{\text{midterm}}(h)&= [\frac{2}{n}\sum_{i=1}^{n}(h- \alpha y_i )] +\lambda \\ &=2 h-2\alpha \bar{y} + \lambda. \end{align*}

By setting \frac{d}{dh}R_{\text{midterm}}(h)=0 we get 2 h^*-2\alpha \bar{y} + \lambda=0 \Rightarrow h^*=\alpha \bar{y} - \frac{\lambda}{2}.



Problem 8

Source: Spring 2023 Midterm 1, Problem 1

Consider a dataset such that 60 \leq y_1 \leq y_2 \leq \dots \leq y_n. Let R_{abs}(h) represent the mean absolute error of a constant prediction h on this dataset. Suppose we know that R_{abs}(27) = 94.


Problem 8.1

Find \bar y, or the mean of \{y_1, y_2, \dots, y_n\}.

\bar{y} = 121

There are several ways to complete this problem. One way is to interpret mean absolute error of a constant prediction h as the average distance of each data point to h. So we are told that the average distance of each data point to 27 is 94. That is, the data points are 94 units away from 27 on average. Since all the data is at least 60, they must be 94 units more than 27, or 121, on average.

You can also arrive at the same answer algebraically using the definition of R_{abs}(h). We have \begin{aligned} R_{abs}(h) &= \frac1n \sum_{i=1}^{n}|y_i - h| \\ R_{abs}(27) &= \frac1n \sum_{i=1}^{n}|y_i - 27| \\ &= \frac1n \sum_{i=1}^{n}(y_i - 27) \qquad \text{because each~} y_i\geq 60 \\ &= \frac1n\left( \sum_{i=1}^{n}y_i - \sum_{i=1}^{n}27\right) \\ &= \frac1n\left( n\cdot\bar y - n\cdot27\right) \\ &= \bar y - 27 \end{aligned}

Since we are told that R(27) = 94, we can set \bar y - 27 = 94, to find that \bar y = 121.



Problem 8.2

Find R_{abs}(58).

R_{abs}(58) = 63

Again, we can complete this problem in multiple ways. Interpreting R_{abs}(h) as the average distance of each data point to h, we can see that since 58 is 31 units closer to each data point than 27, R_{abs}(58) = R_{abs}(27) - 31 = 94 - 31 = 63.

Another way to do this problem uses the answer from part (a). Since the data points average to 121, their distance to 58 is 121-58 = 63, on average.



Problem 8.3

Which of the following could be the mean absolute deviation from the median for this dataset? There is only one correct answer.

Our answer is 18.

One easy way to do this problem is to recognize that none of these values are too low, but they may be too high. For example, the mean absolute deviation from the median can be as low as 0, when all the data points are the same. Since we are told there is only one correct answer and we know that no answer choice is too low, that means the correct answer must be the lowest option, 18.

We can also rule out all the other answer choices to show that they are too high. To do that, we’ll show that 62 is too high, and therefore, both 94 and 102 are too high as well. We are told that all data points are at least 60. By similar logic as we used in part (b), we can see that R_{abs}(60) = 61. Since the minimum value of R_{abs}(h) occurs at the median h^*\geq 60 and we already know R_{abs}(60) = 61, it must be the case that R(h^*) \leq 61.



Problem 9

Source: Spring 2023 Midterm 1, Problem 2

Let R_{sq}(h) represent the mean squared error of a constant prediction h for a given dataset. Find a dataset \{y_1, y_2\} such that the graph of R_{sq}(h) has its minimum at the point (7,16).

The dataset is {3, 11}.

We’ve already learned that R_{sq}(h) is minimized at the mean of the data, and the minimum value of R_sq(h) is the variance of the data. So we need to provide a dataset of two points with a mean of 7 and a variance of 16. Recall that the variance is the average squared distance of each data point to the mean. Since we want a variance of 16, we can make each point 4 units away from the mean. Therefore, our data set can be y_1 = 3, y_2 = 11. In fact, this is the only solution.

A more calculative approach uses the formulas for mean and variance and solves a system of two equations:

\begin{aligned} \frac{y_1+y_2}{2} &= 7 \\ \frac12 \left((y_1 - 7)^2 + (y_2 - 7)^2 \right) &= 16 \end{aligned}


Problem 10

Source: Spring 2023 Final Part 1, Problem 1

For a given dataset \{y_1, y_2, \dots, y_n\}, let M_{abs}(h) represent the median absolute error of the constant prediction h on that dataset (as opposed to the mean absolute error R_{abs}(h)).


Problem 10.1

For the dataset \{4, 9, 10, 14, 15\}, what is M_{abs}(9)?

5

The first step is to calculate the absolute errors (|y_i - h|).

\begin{align*} \text{Absolute Errors} &= \{|4-9|, |9-9|, |10-9|, |14-9|, |15-9|\} \\ \text{Absolute Errors} &= \{|-5|, |0|, |1|, |5|, |6|\} \\ \text{Absolute Errors} &= \{5, 0, 1, 5, 6\} \end{align*}

Now we have to order the values inside of the absolute errors: \{0, 1, 5, 5, 6\}. We can see the median is 5, so M_{abs}(9) =5.


Problem 10.2

For the same dataset \{4, 9, 10, 14, 15\}, find another integer h such that M_{abs}(9) = M_{abs}(h).

5 or 15

Our goal is to find another number that will give us the same median of absolute errors as in part (a).

One way to do this is to plug in a number and guess. Another way requires noticing you can modify 10 (the middle element) to become 5 in either direction (negative or positive) because of the absolute value.

We can solve this equation to get |10-x| = 5 \rightarrow x = 15 \text{ and } x = 5.

We can then test this by following the same steps as we did in part (a).

For x = 15: \begin{align*} \text{Absolute Errors} &= \{|4-15|, |9-15|, |10-15|, |14-15|, |15-15|\} \\ \text{Absolute Errors} &= \{|-11|, |-6|, |-5|, |-1|, |0|\} \\ \text{Absolute Errors} &= \{11, 6, 5, 1, 0\} \end{align*}

Then we order the elements to get the absolute errors: \{0, 1, 5, 6, 11\}. We can see the median is 5, so M_{abs}(15) =5.

For x = 5: \begin{align*} \text{Absolute Errors} &= \{|4-5|, |9-5|, |10-5|, |14-5|, |15-5|\} \\ \text{Absolute Errors} &= \{|-1|, |4|, |5|, |9|, |10|\} \\ \text{Absolute Errors} &= \{1, 4, 5, 9, 10\} \end{align*}

We do not have to re-order the elements because they are in order already. We can see the median is 5, so M_{abs}(5) =5.


Problem 10.3

Based on your answers to parts (a) and (b), discuss in at most two sentences what is problematic about using the median absolute error to make predictions.

The numbers 5 and 15 are clearly bad predictions (close to the extreme values in the dataset), yet they are considered just as good a prediction by this metric as the number 9, which is roughly in the center of the dataset. Intuitively, 9 is a much better prediction, but this way of measuring the quality of a prediction does not recognize that.



Problem 11

Source: Spring 2023 Final Part 1, Problem 2

Match each dataset with the graph of its mean absolute error, R_{abs}(h).


Problem 11.1


\{4, 7, 9, 10, 11\}

Graph 2

The important thing to note here is the y axis is equal to R_{abs}(h) and the x axis is equal to our h. The easiest way to figure out which graph belongs to which dataset is to choose some numbers for h and see if a line matches up with the chosen points.

h^* R_{abs}(h)
0 \frac{1}{5}\sum_{i = 1}^{5}|y_i-0| = \frac{1}{5} \cdot 41 \approx 8
8 \frac{1}{5}\sum_{i = 1}^{5}|y_i-8| = \frac{1}{5} \cdot 11 = 2
18 \frac{1}{5}\sum_{i = 1}^{5}|y_i-18| = \frac{1}{5} \cdot 49 \approx 10


When looking at these three places we can see that this dataset matches Graph 2.


Problem 11.2


\{-3, 1, 9, 11, 20\}

Graph 1

We will use the same approach we used in part (a).

h^* R_{abs}(h)
0 \frac{1}{5}\sum_{i = 1}^{5}|y_i-0| = \frac{1}{5} \cdot 44 \approx 9
8 \frac{1}{5}\sum_{i = 1}^{5}|y_i-8| = \frac{1}{5} \cdot 34 \approx 7
18 \frac{1}{5}\sum_{i = 1}^{5}|y_i-18| = \frac{1}{5} \cdot 56 \approx 11


When looking at these three places we can see that this dataset matches Graph 1.


Problem 11.3


\{8, 9, 12, 13, 15\}

Graph 3

We will use the same approach we used in the previous parts.

h^* R_{abs}(h)
0 \frac{1}{5}\sum_{i = 1}^{5}|y_i-0| = \frac{1}{5} \cdot 44 \approx 11
8 \frac{1}{5}\sum_{i = 1}^{5}|y_i-8| = \frac{1}{5} \cdot 17 \approx 3
18 \frac{1}{5}\sum_{i = 1}^{5}|y_i-18| = \frac{1}{5} \cdot 33 \approx 7


When looking at these three places we can see that this dataset matches Graph 3.


Problem 11.4


\{11, 12, 12, 13, 14\}

Graph 4

We will use the same approach we used in the previous parts.

h^* R_{abs}(h)
0 \frac{1}{5}\sum_{i = 1}^{5}|y_i-0| = \frac{1}{5} \cdot 62 \approx 12
8 \frac{1}{5}\sum_{i = 1}^{5}|y_i-8| = \frac{1}{5} \cdot 22 \approx 4
18 \frac{1}{5}\sum_{i = 1}^{5}|y_i-18| = \frac{1}{5} \cdot 28 \approx 6


When looking at these three places we can see that this dataset matches Graph 4. Another way would be choosing the only option not chosen in the other parts yet!



Problem 12

Source: Winter 2022 Midterm 1, Problem 1

Define the extreme mean (EM) of a dataset to be the average of its largest and smallest values. Let f(x)=-3x+4. Show that for any dataset x_1\leq x_2 \leq \dots \leq x_n, EM(f(x_1), f(x_2), \dots, f(x_n)) = f(EM(x_1, x_2, \dots, x_n)).

This linear transformation reverses the order of the data because if a<b, then -3a>-3b and so adding four to both sides gives f(a)>f(b). Since x_1\leq x_2 \leq \dots \leq x_n, this means that the smallest of f(x_1), f(x_2), \dots, f(x_n) is f(x_n) and the largest is f(x_1). Therefore,

\begin{aligned} EM(f(x_1), f(x_2), \dots, f(x_n)) &= \dfrac{f(x_n) + f(x_1)}{2} \\ &= \dfrac{-3x_n+4-3x_1+4}{2} \\ &= \dfrac{-3x_n-3x_1}{2} + 4\\ &= -3\left(\dfrac{x_1+x_n}{2}\right) + 4 \\ &= -3EM(x_1, x_2, \dots, x_n)+ 4\\ &= f(EM(x_1, x_2, \dots, x_n)). \end{aligned}


Problem 13

Source: Winter 2022 Midterm 1, Problem 2

Consider a new loss function, L(h, y) = e^{(h-y)^2}. Given a dataset y_1, y_2, \dots, y_n, let R(h) represent the empirical risk for the dataset using this loss function.


Problem 13.1

For the dataset \{1, 3, 4\}, calculate R(2). Simplify your answer as much as possible without a calculator.

R(2) = \frac13 (2e+e^4)

We need to calculate the loss for each data point then average the losses. That is, we need to calculate R(2) = \dfrac{1}{3} \sum_{i=1}^{3} e^{(2-y_i)^2}. The table below records the necessary information:

y_i 1 3 4
2-y_i 1 -1 -2
(2-y_i)^2 1 1 4
e^{(2-y_i)^2} e e e^4


This means: \begin{aligned} R(2) &= \dfrac{1}{3} \sum_{i=1}^{3} e^{(2-y_i)^2} \\ &= \frac13 (e+e+e^4) \\ &= \frac13 (2e+e^4) \end{aligned}


Problem 13.2

For the same dataset \{1, 3, 4\}, perform one iteration of gradient descent on R(h), starting at an initial prediction of h_0=2 with a step size of \alpha=\frac{1}{2}. Show your work and simplify your answer.

h_1 = 2 + \frac{2e^4}{3}

First, we calculate the derivative of R(h). Using the chain rule, we have \begin{align*} R(h) &= \dfrac1n \sum_{i=1}^n e^{(h-y_i)^2} \\ R'(h) &= \dfrac1n \sum_{i=1}^n e^{(h-y_i)^2}\cdot 2(h-y_i) \\ \end{align*} To apply the gradient descent update rule, we next have to calculate R'(h_0) or R'(2). Plugging in h=2 to the derivative we calculated above gives: \begin{align*} R'(2) &= \dfrac1n \sum_{i=1}^n e^{(2-y_i)^2}\cdot 2(2-y_i) \end{align*}

The table below records the necessary information (note that we’ve done most of the work already).

y_i              1 3 4
2-y_i 1 -1 -2
(2-y_i)^2 1 1 4
e^{(2-y_i)^2} e e e^4
e^{(2-y_i)^2}\cdot 2(2-y_i) 2e -2e -4e^4


Therefore: \begin{aligned} R'(2) &= \dfrac{1}{3} \sum_{i=1}^{3} e^{(2-y_i)^2\cdot 2(2-y_i)} \\ &= \frac13 (2e - 2e -4e^4) \\ &= \frac{-4e^4}{3}. \end{aligned} Applying the gradient descent update rule gives: \begin{aligned} h_1 &= h_0 - \alpha\cdot R'(h_0) \\ &= 2 - \frac{1}{2}\cdot \frac{-4e^4}{3} \\ &= 2 + \frac{2e^4}{3} \end{aligned}



Problem 14

Source: Winter 2024 Final Part 1, Problem 1

Suppose there is a dataset containing 10000 integers:

Problem 14.1

Calculate the median of this dataset.

6

We know there is an even number of integers in this dataset because 10000 \% 2 = 0. We can find the middle of the dataset as follows: \frac{10000}{2} = 5000. This means the element in the 5000th position and 5001st position can give us our median. The element at the 5000th position is a 5 because 2500 + 2500 = 5000. The element at the 5001st position is a 7 because the next number after 5 is 7. We can then plug 5 and 7 into the equation: \frac{x_{5000} + x_{5001}}{2} = \frac{5 + 7}{2} = 6


Problem 14.2

How does the mean of this dataset compared to its median?

The mean is smaller than the median.

We can calculate the mean as follows: \frac{2500 \cdot 3 + 2500 \cdot 5 + 4500 \cdot 7 + 500 \cdot 9}{10000} = 5.6 Using part (a) we know that 5.6 < 6, which means the mean is smaller than the median.



Problem 15

Source: Winter 2024 Final Part 1, Problem 3

Suppose there is a dataset contains four values: -2, -1, 2, 4. You would like to use gradient descent to minimize the mean square error over this dataset.


Problem 15.1

Write down the expression of mean square error and its derivative given this dataset.

R_{sq}(h) = \dfrac{1}{4}\sum_{i=1}^{4}(y_i-h)^2

Recall the equation for R_{sq}(h) = \frac{1}{n}\sum_{i=1}^{n}(y_i-h)^2, so we simply need to replace n with 4 because there are 4 elements in our dataset.

\frac{dR_{sq}(h)}{dh} = \frac{1}{2}\sum_{i=1}^{4}(h-y_i)

Since we have the equation for R_{sq}(h) we can calculate the derivative:

\begin{align*} \frac{dR_{sq}(h)}{dh} &= \frac{dR_{sq}(h)}{dh}(\frac{1}{4}\sum_{i=1}^{4}(y_i-h)^2) \\ &= \frac{1}{4}\sum_{i=1}^{4}\frac{dR_{sq}(h)}{dh}((y_i-h)^2) \\ \end{align*} We can use the chain rule to find the derivative of (y_i-h)^2. Recall the chain rule is: \frac{df(x)}{dx}[(f(x))^n] = n(f(x))^{n-1} \cdot f'(x).

\begin{align*} &= \frac{1}{4}\sum_{i=1}^{4}2(y_i-h) \cdot -1 \\ &= \frac{1}{4}\sum_{i=1}^{4} 2(h - y_i) \\ &= \frac{1}{2}\sum_{i=1}^{4} (h-y_i) \end{align*}


Problem 15.2

Suppose you choose the initial position to be h_0 and the learning rate to be \frac{1}{4}. After two gradient descent steps, h_2=\frac{1}{4}. What is the value of h_0?

h_{0} = -\frac{5}{4}

The gradient descent equation is given by: \begin{aligned} h_i = h_{i-1} - \alpha \frac{dR_{sq}(h_{i-1})}{dh} \end{aligned} Recall the learning rate is equal to \alpha. We can plug in \alpha = \frac{1}{4}, h_2 = \frac{1}{4}, and i=2, in that case we obtain: \begin{aligned} \frac{1}{4} &= h_{1} - \alpha \frac{dR_{sq}(h_{1})}{dh}\\ &= h_{1} - \alpha \frac{1}{2}\sum_{i=1}^{4}(h_1-y_i)\\ &= h_{1} - \frac{1}{8}(h_{1} + 2 + h_{1} + 1 + h_{1} - 2 + h_{1} - 4)\\ &= h_{1} - \frac{1}{8}(4h_{1} -3)\\ \end{aligned} Solving this equation, we obtain that h_{1} = -\frac{1}{4}. We can then repeat this step once more to obtain h_0: \begin{aligned} -\frac{1}{4} &= h_{0} - \alpha \frac{dR_{sq}(h_{0})}{dh}\\ &= h_{0} - \frac{1}{4}(h_{0} + 2 + h_{0} + 1 + h_{0} - 2 + h_{0} - 4)\\ &= h_{0} - \frac{1}{8}(4h_{0} -3)\\ \end{aligned} Solving this equation, we obtain that h_{0} = -\frac{5}{4}.


Problem 15.3

Given that we set the tolerance of gradient descent to be 0.1. How many additional steps beyond h_2 do we need to take to reach convergence?

2 or 3 additional steps (both are correct).

The gradient descent equation is given by: \begin{aligned} h_i = h_{i-1} - \alpha \frac{dR_{sq}(h_{i-1})}{dh} \end{aligned} Now we start from \alpha = \frac{1}{4}, h_2 = \frac{1}{4}, and i=2, in that case we obtain: \begin{aligned} h_{3} &= h_{2} - \alpha \frac{dR_{sq}(h_{2})}{dh}\\ &= h_{2} - \frac{1}{8}(4h_{2} -3)\\ &= \frac{1}{4} - \frac{1}{8}(1 -3)\\ &= \frac{1}{2}\\ \end{aligned} Iteratively, we have: \begin{aligned} h_{4} &= h_{3} - \alpha \frac{dR_{sq}(h_{3})}{dh}\\ &= h_{3} - \frac{1}{8}(4h_{3} -3)\\ &= \frac{1}{2} - \frac{1}{8}(2 -3)\\ &= \frac{5}{8}\\ \end{aligned} \begin{aligned} h_{5} &= h_{4} - \alpha \frac{dR_{sq}(h_{4})}{dh}\\ &= h_{3} - \frac{1}{8}(4h_{4} -3)\\ &= \frac{5}{8} - \frac{1}{8}(\frac{5}{2}-3)\\ &= \frac{11}{16}\\ \end{aligned} from h_4 to h_5, the change is smaller than the tolerance. That means we need additional 2 or 3 steps to reach convergence (depending on if you actually perform h_4 to h_5, so both 2 and 3 are considered correct answer).



Problem 16

Source: Winter 2024 Midterm 1, Problem 1

Consider a dataset D with 5 data points \{7,5,1,2,a\}, where a is a positive real number. Note that a is not necessarily an integer.


Problem 16.1

Express the mean of D as a function of a, simplify the expression as much as possible.

\text{Mean($D$)} = \frac{a}{5} + 3



Problem 16.2

Depending on the range of a, the median of D could assume one of three possible values. Write out all possible median of D along with the corresponding range of a for each case. Express the ranges using double inequalities, e.g., i.e. 3<a\leq8:

\begin{cases} \text{Median($D$)} = 2 & \text{if a is in the range of } 0<a\leq2 \\ \text{Median($D$)} = a & \text{if a is in the range of } 2<a\leq5 \\ \text{Median($D$)} = 5 & \text{if a is in the range of } 5<a\leq\infty \\ \end{cases}


Problem 16.3

Determine the range of a that satisfies: \text{Mean}(D) < \text{Median}(D) Make sure to show your work.

\dfrac{15}{4}<a<10

Since there are 3 possible median values, we will have to discuss each situation separately.

In case 1, when 0<a\leq2, \text{Median}(D) = 2. So, we have:

\begin{align*} \text{Mean}(D) &< \text{Median}(D)\\ 3 + \frac{a}{5} &< 2\\ a&<-5 \end{align*}

But a<-5 is in conflict with the condition 0<a\leq2, therefore there is no solution in this situation, and Median(D) = 2 is impossible.

In case 2, when 2<a<5, \text{Median}(D) = a. So, we have:

\begin{align*} \text{Mean}(D) &< \text{Median}(D)\\ 3 + \frac{a}{5} &< a\\ 3 &< \frac{4}{5} a\\ a &> \frac{15}{4}\\ \end{align*}

So a has to be larger than \frac{15}{4}. But remember from the prerequisite condition that 2<a<5.

To satisfy both conditions, we must have \frac{15}{4}<a<5.

In case 3, when a\geq5, \text{Median}(D) = 5. So, we have: \begin{align*} \text{Mean}(D) &< \text{Median}(D)\\ 3 + \frac{a}{5} &< 5\\ a&<10 \end{align*}

combining with the prerequisite condition, we have 5\leq a<10

Combining the range of all three cases, we have \dfrac{15}{4}<a<10 as our final answer.



Problem 17

Source: Winter 2024 Midterm 1, Problem 2

Let R_{sq}(h) represent the mean squared error of a constant prediction h for a given dataset. For the dataset \{3, y_{1}\}, the graph of R_{sq}(h) has its minimum at the point (5,r_{1}). Find out the value of y_{1} and r_{1}

y_1 = 7, r_1 = 4

The mean squared error is written as: \begin{align*} R_{sq}(h) = \frac{1}{n}\sum_{i=0}^{n}(y_{i}-h)^2 \end{align*}

Since we only have two data points (n=2), the equation simplifies to:

\begin{align*} R_{sq}(h) = \frac{1}{2}((y_{0}-h)^2+ (y_{1}-h)^2) \end{align*}

Taking the derivative with respect to h, we have: \begin{align*} \frac{dR_{sq}(h)}{dh} = -(y_{0}-h)- (y_{1}-h) \end{align*}

We know that the derivative has to be 0 at the local minima, therefore at h=5, we have:

\begin{align*} \frac{dR_{sq}(h)}{dh} = -(3-5)- (y_{1}-5) &= 0\\ % -2+y_1-5 &=0\\ y_1 &= 7 \end{align*}

So we know that the dataset is \{3,7\}. Given all these information, we can calculate r_1 with:

\begin{align*} R_{sq}(5) &= \frac{1}{2}((y_{0}-5)^2+ (y_{1}-5)^2)\\ &=\frac{1}{2}((3-5)^2+ (7-5)^2)\\ &=\frac{1}{2}(4+4)=4 \end{align*}


👋 Feedback: Find an error? Still confused? Have a suggestion? Let us know here.