← return to practice.dsc40a.com

**Instructor(s):** Suraj Rampure

This exam was administered online. Students had **180
minutes** to take this exam.

*Source: Fall 2021 Final
Exam, Problem 1*

The mean of 12 non-negative numbers is 45. Suppose we remove 2 of these numbers. What is the largest possible value of the mean of the remaining 10 numbers? Show your work.

54.

To maximize the mean of the remaining 10 numbers, we want to minimize the numbers that are removed. The smallest possible non-negative number is 0, so to maximize the mean of the remaining 10, we should remove two 0s from the set of numbers. Recall that the sum of the 12 number set is 12 \cdot 45; then, the maximum possible mean of the remaining 10 is

\frac{12 \cdot 45 - 2 \cdot 0}{10} = \frac{6}{5} \cdot 45 = 54

*Source: Fall 2021 Final
Exam, Problem 2*

Consider a set of 8 data points, y_1, y_2, ..., y_8 that are in sorted order, i.e. y_1 < y_2 < ... < y_8. Suppose that y_4 = 10, y_5 = 14, and y_6 = 22. Recall that mean absolute error, R_{abs}(h), is defined as R_{abs}(h) = \frac{1}{n} \sum_{i = 1}^n |y_i - h|. Suppose that R_{abs}(11) = 9.

What is R_{abs}(22)? Show your work.

*Hint: Use the formula for the slope of R at h.*

R_{abs}(22) = 11

We can write the points given to us as:

y_1, y_2, y_3, 10, 14, 22, y_7, y_8

Since there are an even number of data points, all values of h between the middle two points minimize R_{abs}(h). In this case, all values of h in the interval [10, 14] minimize R_{abs}(h) and, as a result, have the same value of R_{abs}(h). Thus, R_{abs}(14) = R_{abs}(11) = 9.

From lecture, we know that:

\text{slope of $R$ at $h$} = \frac{1}{n} \left( \left(\text{\# points } < h\right) - \left(\text{\# points } > h\right)\right)

We can use this formula to determine what to add to R_{abs}(14) to get R_{abs}(22). For any h in the interval (14, 22), the slope of R at h (given by plugging any h \in (14, 22) into the slope equation)is \frac{1}{8} (5 - 3) = \frac{1}{4}. This means that for every 1 unit we move to the right from h = 14 until we get to h = 22, R_{abs}(h) increases by \frac{1}{4}. So,

R_{abs}(22) = R_{abs}(14) + (22 - 14) \cdot \frac{1}{4} = 9 + 2 = 11

You can visualize the solution to this problem with this interactive graph.

*Source: Fall 2021 Final
Exam, Problem 3*

Consider the function R(h) = (h - 2)^3

Suppose we want to try and minimize R(h) using gradient descent. We choose an initial guess of h_0 = 1 and a step size of \alpha = 1.

Run the gradient descent algorithm twice, i.e. determine h_1 and h_2. Show your work. Also, state whether or not gradient descent will eventually find the minimizing value of h, given the specified h_0 and \alpha.

h_1 = -2, h_2 = -50

No, gradient descent will not converge; h_i will keep getting larger and more negative (i.e. it tends to negative infinity).

\frac{dR}{dh} R(h) = 3 (h - 2)^2

\begin{align*} h_1 &= h_0 - \alpha \cdot \frac{dR}{dh} R(h_0) \\ &= 1 - 3(1 - 2)^2 \\ &= -2 \end{align*}

\begin{align*} h_2 &= h_1 - \alpha \cdot \frac{dR}{dh} R(h_1) \\ &= -2 - 3(-2 - 2)^2 \\ &= -2 - 48 \\ &= -50 \end{align*}

*Source: Fall 2021 Final
Exam, Problem 4*

You may find the following properties of logarithms helpful in this question. Assume that all logarithms in this question are natural logarithms, i.e. of base e.

- e^{\log(x)} = x
- \log(a) + \log(b) = \log(a \cdot b)
- \log(a) - \log(b) = \log \left( \frac{a}{b} \right)
- \log(a^c) = c \log (a)
- \frac{d}{dx} \log x = \frac{1}{x}

Billy, the avocado-farmer-turned-waiter-turned-Instagram-influencer that you’re all-too-familiar with, is trying his hand at coming up with loss functions. He comes up with the Billy loss, L_B(h, y), defined as follows:

L_B(h, y) = \left[ \log \left( \frac{y}{h} \right) \right]^2

Throughout this problem, assume that all ys are positive.

Show that: \frac{d}{dh} L_B(h, y) = - \frac{2}{h} \log \left( \frac{y}{h} \right)

Show that the constant prediction h^* that minimizes for Billy loss is:

h^* = \left(y_1 \cdot y_2 \cdot ... \cdot y_n \right)^{\frac{1}{n}}

You do not need to perform a second derivative test, but otherwise you must show your work.

*Hint: To confirm that you’re interpreting the result correctly,
h^* for the dataset 3, 5, 16 is (3 \cdot 5 \cdot 16)^{\frac{1}{3}} =
240^{\frac{1}{3}} \approx 6.214.*

*Source: Fall 2021 Final
Exam, Problem 5*

Suppose we have a dataset of n houses that were recently sold in the San Diego area. For each house, we have its square footage and most recent sale price. The correlation between square footage and price is r.

First, we minimize mean squared error to fit a linear prediction rule that uses square footage to predict price. The resulting prediction rule has an intercept of w_0^* and slope of w_1^*. In other words,

\text{predicted price} = w_0^* + w_1^* \cdot \text{square footage}

We’re now interested in minimizing mean squared error to fit a linear
prediction rule **that uses price to predict square
footage**. Suppose this new regression line has an intercept of
\beta_0^* and slope of \beta_1^*.

What is \beta_1^*? Give your answer in terms of one or more of n, r, w_0^*, and w_1^*. Show your work.

\beta_1^* = \frac{r^2}{w_1^*}

Throughout this solution, let x represent square footage and y represent price.

We know that w_1^* = r \frac{\sigma_y}{\sigma_x}. But what about \beta_1^*?

When we take a rule that predicts price from square footage and transform it into a rule that predicts square footage from price, the roles of x and y have swapped; suddenly, square footage is no longer our independent variable, but our dependent variable, and vice versa for price. This means that the altered dataset we work with when using our new prediction rule has \sigma_x standard deviation for its dependent variable (square footage), and \sigma_y for its independent variable (price). So, we can write the formula for \beta_1^* as follows: \beta_1^* = r \frac{\sigma_x}{\sigma_y}

In essence, swapping the independent and dependent variables of a dataset changes the slope of the regression line from r \frac{\sigma_y}{\sigma_x} to r \frac{\sigma_x}{\sigma_y}.

From here, we can use a little algebra to get our \beta_1^* in terms of one or more n, r, w_0^*, and w_1^*:

\begin{align*} \beta_1^* &= r \frac{\sigma_x}{\sigma_y} \\ w_1^* \cdot \beta_1^* &= w_1^* \cdot r \frac{\sigma_x}{\sigma_y} \\ w_1^* \cdot \beta_1^* &= ( r \frac{\sigma_y}{\sigma_x}) \cdot r \frac{\sigma_x}{\sigma_y} \end{align*}

The fractions \frac{\sigma_y}{\sigma_x} and \frac{\sigma_x}{\sigma_y} cancel out and we get:

\begin{align*} w_1^* \cdot \beta_1^* &= r^2 \\ \beta_1^* &= \frac{r^2}{w_1^*} \end{align*}

**For this part only**, assume that the following
quantities hold:

- n = 100
- r = 0.6
- w_0^* = 1000
- w_1^* = 250
- The average square footage of homes in the dataset is 2000

Given this information, what is \beta_0^*? Give your answer as a constant, rounded to two decimal places. Show your work.

\beta_0^* = 1278.56

We start with the formula for the intercept of the regression line. Note that x and y are opposite what they’d normally be since we’re using price to predict square footage.

\beta_0^* = \bar{x} - \beta_1^* \bar{y}

We’re told that the average square footage of homes in the dataset is 2000, so \bar{x} = 2000. We also know from part (a) that \beta_1^* = \frac{r^2}{w_1^*}, and from the information given in this part this is \beta_1^* = \frac{r^2}{w_1^*} = \frac{0.6^2}{250}.

Finally, we need the average price of all homes in the dataset, \bar{y}. We aren’t given this information directly, but we can use the fact that (\bar{x}, \bar{y}) are on the regression line that uses square footage to predict price to find \bar{y}. Specifically, we have that \bar{y} = w_0^* + w_1^* \bar{x}; we know that w_0^* = 1000, \bar{x} = 2000, and w_1^* = 250, so \bar{y} = 1000 + 2000 \cdot 250 = 501000.

Putting these pieces together, we have

\begin{align*} \beta_0^* &= \bar{x} - \beta_1^* \bar{y} \\ &= 2000 - \frac{0.6^2}{250} \cdot 501000 \\ &= 2000 - 0.6^2 \cdot 2004 \\ &= 1278.56 \end{align*}

*Source: Fall 2021 Final
Exam, Problem 6*

Billy’s aunt owns a jewellery store, and gives him data on 5000 of the diamonds in her store. For each diamond, we have:

**carat**: the weight of the diamond, in carats**length**: the length of the diamond, in centimeters**width**: the width of the diamond, in centimeters**price**: the value of the diamond, in dollars

The first 5 rows of the 5000-row dataset are shown below:

carat | length | width | price |
---|---|---|---|

0.40 | 4.81 | 4.76 | 1323 |

1.04 | 6.58 | 6.53 | 5102 |

0.40 | 4.74 | 4.76 | 696 |

0.40 | 4.67 | 4.65 | 798 |

0.50 | 4.90 | 4.95 | 987 |

Billy has enlisted our help in predicting the price of a diamond
given various other features.

Suppose we want to fit a linear prediction rule that uses two features, carat and length, to predict price. Specifically, our prediction rule will be of the form

\text{predicted price} = w_0 + w_1 \cdot \text{carat} + w_2 \cdot \text{length}

We will use least squares to find \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}.

Write out the first 5 rows of the design matrix, X. Your matrix should not have any variables in it.

Suppose the optimal parameter vector \vec{w}^* is given by

\vec{w}^* = \begin{bmatrix} 2000 \\ 10000 \\ -1000 \end{bmatrix}

What is the predicted price of a diamond with 0.65 carats and a length of 4 centimeters? Show your work.

The predicted price is 4500 dollars.

2000 + 10000 \cdot 0.65 - 1000 \cdot 4 = 4500

Suppose \vec{e} = \begin{bmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{bmatrix} is the error/residual vector, defined as

\vec{e} = \vec{y} - X \vec{w}^*

where \vec{y} is the observation vector containing the prices for each diamond.

For each of the following quantities, state whether they are guaranteed to be equal to 0 the scalar, \vec{0} the vector of all 0s, or neither. No justification is necessary.

- \sum_{i = 1}^n e_i
- || \vec{y} - X \vec{w}^* ||^2
- X^TX \vec{w}^*
- 2X^TX \vec{w}^* - 2X^T\vec{y}

- \sum_{i = 1}^n e_i:
**Yes**, this is guaranteed to be 0. This was discussed in Homework 4; it is a consequence of the fact that X^T (y - X \vec{w}^*) = 0 and that we have an intercept term in our prediction rule (and hence a column of all 1s in our design matrix, X). - || \vec{y} - X \vec{w}^* ||^2:
**No**, this is not guaranteed to be 0. This is the mean squared error of our prediction rule, multiplied by n. \vec{w}^* is found by minimizing mean squared error, but the minimum value of mean squared error isn’t necessarily 0 — in fact, this quantity is only 0 if we can write \vec{y} exactly as X \vec{w}^* with no prediction errors. - X^TX \vec{w}^*:
**No**, this is not guaranteed to be 0, either. - 2X^TX \vec{w}^* - 2X^T\vec{y}:
**Yes**, this is guaranteed to be 0. Recall, the optimal parameter vector \vec{w}^* satisfies the normal equations X^TX\vec{w}^* = X^T \vec{y}. Subtracting X^T \vec{y} from both sides of this equation and multiplying both sides by 2 yields the desired result. (You may also find 2X^TX \vec{w}^* - 2X^T\vec{y} to be familiar from lecture — it is the gradient of the mean squared error, multiplied by n, and we set the gradient to 0 to find \vec{w}^*.)

Suppose we introduce two more features:

- width alone, and
- area, which is defined as length times width

Suppose we also decide to remove the intercept term of our prediction rule. With all of these changes, our prediction rule is now

\text{predicted price} = w_1 \cdot \text{carat} + w_2 \cdot \text{length} + w_3 \cdot \text{width} + w_4 \cdot (\text{length} \cdot \text{width})

- Write out just the first 2 rows of the design matrix X for this new prediction rule. You do
**not**need to simplify the numbers in your matrix, it is fine if they involve the multiplication symbol. - Is the optimal coefficient for carat, w_1^*, for this new prediction rule guaranteed to be equal to 10000, the optimal coefficient for carat in our original prediction rule? No justification is necessary.

- X = \begin{bmatrix} 0.40 & 4.81 & 4.76 & 4.81 \cdot 4.76 \\ 1.04 & 6.58 & 6.53 & 6.58 \cdot 6.53 \end{bmatrix}
- No, it’s not guaranteed that the \vec{w}_1^* for this new prediction rule is equal to the \vec{w}_1^* for the original prediction rule. The value of \vec{w}_1^* in the new prediction rule will be influenced by the fact that there’s no longer an intercept term and that there are two new features (width and area) that weren’t previously there.

*Source: Fall 2021 Final
Exam, Problem 7*

Consider the following dataset of 7 points.

Suppose we want to identify k=2 clusters in this dataset using k-Means Clustering.

Determine the centroids \vec{\mu}_1 and \vec{\mu}_2 that minimize inertia. (Let \vec{\mu}_1 be the centroid with a smaller x_1 coordinate.) Justify your answers.

*Note: You don’t have to run the k-Means Clustering algorithm to
answer this question.*

\vec{\mu}_1 = \begin{bmatrix} 3 \\ 9 \end{bmatrix}, \vec{\mu}_2 = \begin{bmatrix} 8 \\ 2 \end{bmatrix}

It’s clear that there are two clusters — one in the top left, and one in the bottom right.

To find \vec{\mu}_1, the centroid for the top-left cluster, we take the mean of four points assigned to cluster 1, giving

\vec{\mu}_1 = \frac{1}{4} \left( \begin{bmatrix} 2 \\ 8 \end{bmatrix} + \begin{bmatrix} 4 \\ 8 \end{bmatrix} + \begin{bmatrix} 2 \\ 10 \end{bmatrix} + \begin{bmatrix} 4 \\ 10 \end{bmatrix} \right) = \begin{bmatrix} 3 \\ 9 \end{bmatrix}

You can also arrive at this result without any algebra by taking the middle of the ‘square’.

To find \vec{\mu}_2, the centroid for the bottom-right cluster, we take the mean of three points assigned to cluster 2, giving

\vec{\mu}_2 = \frac{1}{3} \left( \begin{bmatrix} 7 \\ 1 \end{bmatrix} + \begin{bmatrix} 8 \\ 4 \end{bmatrix} + \begin{bmatrix} 9 \\ 1 \end{bmatrix} \right) = \begin{bmatrix} 8 \\ 2 \end{bmatrix}

Thus,

\vec{\mu}_1 = \begin{bmatrix} 3 \\ 9 \end{bmatrix}, \vec{\mu}_2 = \begin{bmatrix} 8 \\ 2 \end{bmatrix}

What is the total inertia for the centroids you chose in part (a)? Show your work.

16.

We’ll proceed by determining the inertia of each cluster individually and adding the results together.

First, let’s consider the top-left cluster, whose centroid is at \begin{bmatrix} 3 \\ 9 \end{bmatrix}. The squared distance between the centroid and each of the four points in the cluster individually is 1^2 + 1^2 = 2. It’s easiest to see this by drawing a picture, but we can calculate all squared distances algebraically as well:

\begin{bmatrix} 2 \\ 8 \end{bmatrix} - \begin{bmatrix} 3 \\ 9 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix} \implies \text{squared distance} = (-1)^2 + (-1)^2 = 2

\begin{bmatrix} 4 \\ 8 \end{bmatrix} - \begin{bmatrix} 3 \\ 9 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \implies \text{squared distance} = (1)^2 + (-1)^2 = 2

\begin{bmatrix} 2 \\ 10 \end{bmatrix} - \begin{bmatrix} 3 \\ 9 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \implies \text{squared distance} = (-1)^2 + (1)^2 = 2

\begin{bmatrix} 4 \\ 10 \end{bmatrix} - \begin{bmatrix} 3 \\ 9 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \implies \text{squared distance} = (1)^2 + (1)^2 = 2

Thus, the inertia for cluster 1 is 2 + 2 + 2 + 2 = 8.

We follow a similar procedure for cluster 2:

\begin{bmatrix} 7 \\ 1 \end{bmatrix} - \begin{bmatrix} 8 \\ 2 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix} \implies \text{squared distance} = (-1)^2 + (-1)^2 = 2

\begin{bmatrix} 8 \\ 4 \end{bmatrix} - \begin{bmatrix} 8 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \end{bmatrix} \implies \text{squared distance} = (0)^2 + (2)^2 = 4

\begin{bmatrix} 9 \\ 1 \end{bmatrix} - \begin{bmatrix} 8 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \implies \text{squared distance} = (-1)^2 + (1)^2 = 2

Thus, the inertia for cluster 2 is 2 + 4 + 2 = 8.

This means that the total inertia for the whole dataset is 8 + 8 = 16.

*Source: Fall 2021 Final
Exam, Problem 8*

Billy brings you back to Dirty Birds, the restaurant where he is a waiter. He tells you that Dirty Birds has 30 different flavors of chicken wings, 18 of which are ‘wet’ (e.g. honey garlic) and 12 of which are ‘dry’ (e.g. lemon pepper).

Each time you place an order at Dirty Birds, you get to pick 4 different flavors. The order in which you pick your flavors does not matter.

**Note:** In this problem, you may leave your answers in
terms of {n \choose k}, P(n, k), and/or factorials. However, you must
provide a one-sentence explanation for your answer in each subpart.

How many ways can we select 4 flavors in total?

{30 \choose 4}

There are 30 total flavors and we want to select 4 such that order does not matter. In this subpart we don’t care about whether our flavors are dry or wet.

How many ways can we select 4 flavors such that we select an equal number of wet and dry flavors?

{18 \choose 2}{12 \choose 2}

To select an equal number of wet and dry flavors, we need to select 2 wet flavors and 2 dry flavors. There are 18 wet flavors, so there are {18 \choose 2} ways to select 2 of them. Similarly, there are {12 \choose 2} ways to select 2 of the 12 dry flavors. We need to select 18 wet flavors AND 12 dry flavors, so we multiply these numbers.

Billy tells you he’ll surprise you with 4 different flavors, randomly selected from the 30 flavors available. What’s the probability that he brings you at least one wet flavor and and least one dry flavor?

1 - \frac{{18 \choose 4}}{{30 \choose 4}} - \frac{{12 \choose 4}}{{30 \choose 4}}

Note that the event “Billy brings you at least one wet flavor and at least one dry flavor” is the complement of the event “Billy brings you only wet flavors or only dry flavors”. Thus, we can find the probability that Billy brings you only wet flavors or only dry flavors and subtract it from 1, using the complement rule.

To find the probability that Billy brings you only wet flavors, we first count the number of ways to select 4 wet flavors from 18; this is {18 \choose 4}. The total number of ways of selecting 4 flavors is {30 \choose 4}, hence the probability that he brings you only wet flavors is \frac{{18 \choose 4}}{{30 \choose 4}}. Similarly, the probability he brings you only dry flavors is \frac{{18 \choose 4}}{{30 \choose 4}}. Since the events “Billy brings you only wet flavors” and “Billy brings you only dry flavors” are mutually exclusive, we can add these probabilities to find the probability of the event “Billy brings you only wet flavors or only dry flavors”, yielding \frac{{18 \choose 4}}{{30 \choose 4}} + \frac{{12 \choose 4}}{{30 \choose 4}}. We subtract this number from 1 to get the desired result.

Suppose you go to Dirty Birds once a day for 7 straight days. Each
time you go there, Billy brings you 4 different flavors, randomly
selected from the 30 flavors available. What’s the probability that on
at least one of the 7 days, he brings you all wet flavors or all dry
flavors? (*Note: All 4 flavors for a particular day must be
different, but it is possible to get the same flavor on multiple
days.*)

1 - \left[1 - \frac{{18 \choose 4}}{{30 \choose 4}} - \frac{{12 \choose 4}}{{30 \choose 4}} \right]^7

Note that the event “at least once, Billy brings you all wet or all dry wings” is the complement of the event “Billy never brings you all wet or all dry wings.” This new event is equivalent to “Billy always brings you at least one wet flavor and at least one dry flavor.”

In part (c), we found that for one particular day, the probability that Billy brings you at least one wet flavor and at least one dry flavor is 1 - \frac{{18 \choose 4}}{{30 \choose 4}} - \frac{{12 \choose 4}}{{30 \choose 4}}. Each day the flavors “reset” (we are told that flavors can be chosen on multiple days), and so this probability is the same for each of the 7 days. This makes the probability that he brings us at least one wet flavor and at least one dry flavor on all 7 days \left[1 - \frac{{18 \choose 4}}{{30 \choose 4}} - \frac{{12 \choose 4}}{{30 \choose 4}} \right]^7, and so the desired probability — the probability that he brings us all wet flavors or all dry flavors on at least one of the 7 days — is

1 - \left[1 - \frac{{18 \choose 4}}{{30 \choose 4}} - \frac{{12 \choose 4}}{{30 \choose 4}} \right]^7

*Source: Fall 2021 Final
Exam, Problem 9*

In this question, we’ll consider the phone number 6789998212 (mentioned in Soulja Boy’s 2008 classic, “Kiss Me thru the Phone”).

**Note:** In this problem, you may leave your answers in
terms of {n \choose k}, P(n, k), and/or factorials. However, you must
provide a one-sentence explanation for your answer in each subpart.

How many permutations of 6789998212 are there?

There are {10 \choose 1}{9 \choose 1}{8 \choose 2}{6 \choose 3}{3 \choose 2}{1 \choose 1} permutations.

If all digits in 6789998212 were *unique*, there would be
10! permutations. However, some of the
digits are repeated; for each repeated digit, we need to divide by the
number of ways to re-arrange the repeated digits amongst themselves.

Let’s count the number of occurrences of each digit.

- 6: 1 occurrence
- 7: 1 occurrence
- 8: 2 occurrences
- 9: 3 occurrences
- 2: 2 occurrences
- 1: 1 occurrence

Thus, the number of permutations of 6789998212 is

\frac{10!}{2! 3! 2!}

Another way to arrive at this result is by saying first, we need to choose 1 of the 10 positions to contain a 1. Then, we need to choose 1 of the remaining 9 positions to contain a 7. Then, we need to choose 2 of the remaining 8 positions to contain a 8, and so on and so forth. This method yields

{10 \choose 1}{9 \choose 1}{8 \choose 2}{6 \choose 3}{3 \choose 2}{1 \choose 1}

If you expand out this result, you’ll see that it’s equal to the first result. You may also notice that this is equivalent to {10 \choose 3}{7 \choose 2}{5 \choose 2}{3 \choose 1}{2 \choose 1}{1 \choose 1}, which follows the same approach but chooses locations for the repeated characters first.

How many permutations of 6789998212 have all three 9s next to each other?

There are \displaystyle \frac{8!}{2!2!} permutations.

The key insight here is that we can treat all three 9s as being just a single character, X, and determine the number of permutations of the 8-character string 678X8212. To do so, we again count the number of occurrences of each digit:

- 6: 1 occurrence
- 7: 1 occurrence
- 8: 2 occurrences
- X: 1 occurrence
- 2: 2 occurrences
- 1: 1 occurrence

Thus, the number of permutations of 6789998212 with all 3 9s appearing together is

\frac{8!}{2!2!}

How many permutations of 6789998212 end with a 1 and start with a 6?

There are \displaystyle \frac{8!}{2! 3! 2!} permutations.

Similarly to the previous part, we can “fix” the 1 at the start and 6 at the end. All we really need to determine is the number of permutations of 78999822, which is

\frac{8!}{2! 3! 2!}

How many different 3 digit numbers with unique digits can we create by selecting digits from 6789998212?

There are \displaystyle \frac{8!}{2! 3! 2!} = 120 permutations.

The key here is that the 3 digit numbers that we’re creating must have unique digits. There are 6 unique digits in 6789998212: 6, 7, 8, 9, 2, 1.

There are 6 options for the first digit of our 3 digit number, 5 options for the second digit, and 4 options for the third digit. Thus, the number of 3 digit numbers we can create in this way is 6 \cdot 5 \cdot 4 = 120. This is also {6 \choose 3} \cdot {3!} (Choosing three elements from six, multiplying by 3! because order matters).

*Source: Fall 2021 Final
Exam, Problem 10*

Suppose you’re given the following probabilities:

- P(A | B) = \frac{2}{5}
- P(B | A) = \frac{1}{4}
- P(A | C) = \frac{2}{3}

If A and C are independent, what is P(B)? Show your work.

P(B) = \frac{5}{12}

If A and C are independent, then P(A) = P(A | C) = \frac{2}{3}. Then, from Bayes’ rule, we have

\begin{align} P(A | B) &= \frac{P(A) P(B|A)}{P(B)} \\ \implies P(B) &= \frac{P(A) P(B|A)}{P(A|B)}\\ &= \frac{\frac{2}{3} \cdot \frac{1}{4}}{\frac{2}{5}} \\ &= \frac{5}{12} \end{align}

Suppose A and C are not independent, and now suppose that P(A | \overline{C}) = \frac{1}{5}. Given that A and B are independent, what is P(C)? Show your work.

P(C) = \frac{3}{7}

Since we’re given that A and B are independent, P(A) = P(A | B) = \frac{2}{5}. By the Law of Total Probability, we have

P(A) = P(A \cap C) + P(A \cap \overline{C}) = P(C) P(A | C) + P(\overline{C}) P(A | \overline{C})

For simplicity, let p = P(C). Then, substituting in our known values P(A), P(A | C), and P(A | \overline{C}) and solving for p yields

\begin{align*} P(A) = P(A \cap C) + P(A \cap \overline{C}) &= P(C) P(A | C) + P(\overline{C}) P(A | \overline{C}) \\ \frac{2}{5} &= \frac{2}{3} p + \frac{1}{5} (1 - p) \\ \frac{2}{5} &= \left(\frac{10}{15} - \frac{3}{15} \right)p + \frac{1}{5} \\ \frac{1}{5} &= \frac{7}{15} p \\ p &= \frac{15}{7} \cdot \frac{1}{5} = \frac{3}{7} \end{align*}

Thus, P(C) = \frac{3}{7}.

*Source: Fall 2021 Final
Exam, Problem 11*

Below, we have a (real) sample of 12 passengers on the RMS Titanic, a British cruise ship that tragically sank in 1912. For each passenger in our sample, we have the following information:

**survived**: 1 if they survived the sinking of the Titanic, 0 if they did not**sex**: either Male or Female**class**: the cabin class they were in, either First, Second, or Third (with First being the most luxurious)**age**: either child (for those under 18) or adult (for those 18 and older)

survived | sex | class | age |
---|---|---|---|

0 | female | Third | adult |

0 | male | First | adult |

0 | male | Third | adult |

1 | female | First | adult |

1 | female | First | adult |

1 | female | First | child |

1 | female | Second | adult |

1 | female | Second | adult |

1 | female | Third | adult |

1 | female | Third | child |

1 | male | First | child |

1 | male | Second | adult |

Selma Augusta Emilia was a 38 year old female in Third class. Using the Naive Bayes classification algorithm with no smoothing, predict whether Selma Augusta Emilia survived the sinking of the Titanic. You must show all intermediate steps, including the probabilities you used to make your classification, and you must clearly state what your prediction is.

We predict that Selma Augusta Emilia **survived** the
Titanic.

We want to determine which is larger, P(\text{1 | female, Third, adult}) or P(\text{0 | female, Third, adult}), where 1 means the passenger survived and 0 means the passenger did not survive. The Naive Bayes algorithm works by re-writing these two probabilities using Bayes’ theorem, recognizing that we can determine which is larger by comparing their numerators, and using an assumption of conditional independence on their numerators.

Let’s first consider P(1 | \text{female, Third, adult}). By Bayes’ rule, the numerator of this probability is P(1) P(\text{female, Third, adult} | 1) and by the assumption of conditional independence given a class, this is equal to

P(1) P(\text{female} | 1) P(\text{Third} | 1) P(\text{adult} | 1)

We can estimate these four probabilities using our training set.

- P(1) = \frac{9}{12} = \frac{3}{4}, because 9 of the 12 people in our training dataset are survivors (survived = 1).
- P(\text{female} | 1) = \frac{7}{9}, because 7 of the 9 survivors in our dataset are female.
- P(\text{Third} | 1) = \frac{2}{9}, because 2 of the 9 survivors in our dataset are in Third class.
- P(\text{adult} | 1) = \frac{6}{9} = \frac{2}{3}, because 6 of the 9 survivors in our dataset are adults.

So, this numerator becomes

P(1) P(\text{female} | 1) P(\text{Third} | 1) P(\text{adult} | 1) = \frac{3}{4} \cdot \frac{7}{9} \cdot \frac{2}{9} \cdot \frac{2}{3} = \frac{7}{81}

Similarly, we must estimate P(0) P(\text{female} | 0) P(\text{Third} | 0) P(\text{adult} | 0):

- P(0) = \frac{3}{12} = \frac{1}{4}, because 3 of the 12 people in our training dataset are not survivors (survived = 0).
- P(\text{female} | 0) = \frac{1}{3}, because 1 of the 3 non-survivors in our dataset are female.
- P(\text{Third} | 0) = \frac{2}{3}, because 2 of the 3 non-survivors in our dataset are in Third class.
- P(\text{adult} | 0) = \frac{3}{3} = 1, because all 3 of the non-survivors in our dataset are adults.

Then, we have

P(0) P(\text{female} | 0) P(\text{Third} | 0) P(\text{adult} | 0) = \frac{1}{4} \cdot \frac{1}{3} \cdot \frac{2}{3} \cdot 1 = \frac{1}{18}

Note that \frac{1}{18} = \frac{4.5}{81}. Since this is smaller than \frac{7}{81}, the numerator for class 1, we predict that Selma Augusta Emilia survived the Titanic.

If you try to use Naive Bayes to classify whether Selma Augusta Emilia’s 8 year old daughter (who also sat with her in Third class) survived the Titanic, one of the two probabilities that you’d compare would turn out to be 0.

Using smoothing, compute P(\text{child } | \text{ survived}) and P(\text{child } | \text{ didn't survive}). Show your work. Note that all you need to do is find these two probabilities; you don’t need to make any predictions.

P(\text{child } | \text{ survived}) = \displaystyle\frac{4}{11}, P(\text{child } | \text{ didn't survive}) = \displaystyle\frac{1}{5}

Note that this subpart uses the notation P(\text{child } | \text{ survived}) and P(\text{child } | \text{ didn't survive}); these are equivalent to P(\text{child} | 1) and P(\text{child} | 0) using the notation in the previous subpart’s solution.

P(\text{child } | \text{ survived}) = \frac{(\text{\# children that survived} + 1)}{(\text{\# children that survived} + 1) + \text{(\# adults that survived} + 1)} = \frac{3 + 1}{(3 + 1) + (6 + 1)} = \frac{4}{11}

P(\text{child } | \text{ didn't survive}) = \frac{(\text{\# children that didn't survive} + 1)}{(\text{\# children that didn't survive} + 1) + \text{(\# adults that didn't survive} + 1)} = \frac{0 + 1}{(0 + 1) + (3 + 1)} = \frac{1}{5}