Gradient Descent and Convexity



This page contains all problems about Gradient Descent and Convexity.


Problem 1

Source: Spring 2023 Final Part 1, Problem 3

You and a friend independently perform gradient descent on the same function, but after 100 iterations, you have different results. Which of the following is sufficient on its own to explain the difference in your results? Note: When we say “same function” we assume the learning rate and initial predictions are the same too until said otherwise.

Select all that apply.

Bubbles 1 and 3 are true: “The function is nonconvex” and “You and your friend chose different learning rates.”

If the function is nonconvex, it is possible for you and your friend to end up in different places if you start in different places. For example, if you have a polynomial with a local minimum and a global minimum, then you could find the local minimum while your friend finds the global minimum, which would mean you have different results.

If the function is not differentiable then you cannot perform gradient descent, so this cannot be an answer.

If you and your friend chose different learning rates, it is possible to get different results: with a very large learning rate, you might keep hopping over the global minimum without ever converging, while your friend, using a smaller learning rate, converges to the global minimum.

If you and your friend chose the same initial predictions you are guaranteed to end up in the same spot.

Because two of the options are possible the answer cannot be “None of the above.”
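The nonconvex case can be sketched in a few lines of Python. The quartic below, the step size, and the two starting points are hypothetical choices made only for illustration; they are not part of the original exam question.

```python
def f(x):
    # A hypothetical nonconvex quartic with two local minima.
    return x**4 - 3 * x**2 + x

def f_prime(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, alpha=0.01, iterations=100):
    # Plain gradient descent for a fixed number of iterations.
    x = x0
    for _ in range(iterations):
        x = x - alpha * f_prime(x)
    return x

you = gradient_descent(x0=2.0)      # ends near the local minimum on the right
friend = gradient_descent(x0=-2.0)  # ends near the global minimum on the left
print(you, friend)
print(f(you), f(friend))
```

Both runs settle where the derivative is nearly zero, but at different points with different function values.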


Problem 2

Source: Spring 2021 Midterm 1, Problem 1

Consider the function R(h) = \sqrt{(h - 3)^2 + 1} = ((h - 3)^2 + 1)^{\frac{1}{2}}, which is a convex and differentiable function with only one local minimum.


Problem 2.1

Perform by hand two iterations of the gradient descent algorithm on this function, using an initial prediction of h_0 = 2 and a learning rate of \alpha = 2\sqrt{2}. Show your work and your final answers, h_1 and h_2.

h_1 = 4, h_2 = 2

The updating rule for gradient descent in the one-dimensional case is: h_{i+1} = h_{i} - \alpha \cdot \frac{dR}{dh}(h_i)

We can find \frac{dR}{dh} by taking the derivative of R(h): \frac{d}{dh}R(h) = \frac{d}{dh}(\sqrt{(h - 3)^2 + 1}) = \dfrac{h-3}{\sqrt{\left(h-3\right)^2+1}}

Now we can use \alpha = 2\sqrt{2} and h_0 = 2 to begin updating:

\begin{align*} h_{1} &= h_{0} - \alpha \cdot \frac{dR}{dh}(h_0) \\ h_{1} &= 2 - 2\sqrt{2} \cdot \left(\dfrac{2-3}{\sqrt{\left(2-3\right)^2+1}}\right) \\ h_{1} &= 2 - 2\sqrt{2} \cdot (\dfrac{-1}{\sqrt{2}}) \\ h_{1} &= 4 \end{align*}
\begin{align*} h_{2} &= h_{1} - \alpha \cdot \frac{dR}{dh}(h_1) \\ h_{2} &= 4 - 2\sqrt{2} \cdot \left(\dfrac{4-3}{\sqrt{\left(4-3\right)^2+1}}\right) \\ h_{2} &= 4 - 2\sqrt{2} \cdot (\dfrac{1}{\sqrt{2}}) \\ h_{2} &= 2 \end{align*}
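As a quick check of the hand computation, the two updates can be reproduced numerically. This is a minimal sketch; the function name dR_dh is just an illustrative label for the derivative found above.

```python
import math

def dR_dh(h):
    # Derivative of R(h) = sqrt((h - 3)^2 + 1).
    return (h - 3) / math.sqrt((h - 3) ** 2 + 1)

alpha = 2 * math.sqrt(2)
h0 = 2.0
h1 = h0 - alpha * dR_dh(h0)
h2 = h1 - alpha * dR_dh(h1)
print(h1, h2)  # approximately 4 and 2, up to floating-point rounding
```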


Problem 2.2

With more iterations, will we eventually converge to the minimizer? Explain.

No, this algorithm will not converge to the minimizer because if we do more iterations, we’ll keep oscillating back and forth between predictions of 2 and 4. We showed the first two iterations of the algorithm in part 1, but the next two would be exactly the same, and the two after that, and so on. This happens because the learning rate is too big, resulting in steps that are too big, and we keep jumping over the true minimizer at h = 3.
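The oscillation can be seen by running more iterations. The sketch below also tries a smaller learning rate of 0.5, which is a hypothetical value chosen for comparison and not part of the original problem.

```python
import math

def dR_dh(h):
    # Derivative of R(h) = sqrt((h - 3)^2 + 1).
    return (h - 3) / math.sqrt((h - 3) ** 2 + 1)

def run(alpha, h0=2.0, iterations=100):
    # Return the full list of predictions h_0, h_1, ..., h_iterations.
    h = h0
    history = [h]
    for _ in range(iterations):
        h = h - alpha * dR_dh(h)
        history.append(h)
    return history

big = run(alpha=2 * math.sqrt(2))  # learning rate from the problem
small = run(alpha=0.5)             # hypothetical smaller learning rate
print([round(h, 6) for h in big[:6]])  # keeps alternating between about 2 and about 4
print(round(small[-1], 6))             # settles close to the minimizer h = 3
```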



Problem 3

Source: Spring 2023 Midterm 1, Problem 3

In general, the logarithm of a convex function is not convex. Give an example of a function f(x) such that f(x) is convex, but \log_{10}(f(x)) is not convex.

There are many correct answers to this question. Some simple answers are f(x) = x and f(x) = x^2. The logarithms of these functions are \log_{10}(x) and 2\log_{10}(x) (for x > 0), both of which are nonconvex because there are pairs of points such that the line segment connecting them goes below the graph of the function.
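One such pair of points can be checked numerically. In this sketch, the points a = 1 and b = 100 and the midpoint weight t = 0.5 are arbitrary illustrative choices; convexity would require the curve to lie on or below the chord.

```python
import math

def g(x):
    # log10 of the convex function f(x) = x^2, restricted to x > 0.
    return math.log10(x ** 2)

a, b, t = 1.0, 100.0, 0.5
point = (1 - t) * a + t * b                # a point between a and b
on_curve = g(point)                        # height of the function there
on_chord = (1 - t) * g(a) + t * g(b)       # height of the connecting line there
print(on_curve, on_chord)
print(on_curve > on_chord)  # True: the line goes below the function
```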


Problem 4

Source: Fall 2021 Midterm, Problem 3

Remember to show your work and justify your answers.

Suppose we want to minimize the function

R(h) = e^{(h + 1)^2}


Problem 4.1

Without using gradient descent or calculus, what is the value h^* that minimizes R(h)?

h^* = -1

The minimum possible value of the exponent is 0, since anything squared is non-negative. The exponent is 0 when (h+1)^2 = 0, i.e. when h = -1. Since e^{(h+1)^2} gets larger as (h+1)^2 gets larger, the minimizing input h^* is -1.


Problem 4.2

Now, suppose we want to use gradient descent to minimize R(h). Assume we use an initial guess of h_0 = 0. What is h_1? Give your answer in terms of a generic step size, \alpha, and other constants. (e is a constant.)

h_1 = -\alpha \cdot 2e

First, we find \frac{dR}{dh}(h):

\frac{dR}{dh}(h) = 2(h+1) e^{(h+1)^2}

Then, we know that

h_1 = h_0 - \alpha \frac{dR}{dh}(h_0) = 0 - \alpha \frac{dR}{dh}(0)

In our case, \frac{dR}{dh}(0) = 2(0 + 1) e^{(0+1)^2} = 2e, so

h_1 = -\alpha \cdot 2e


Problem 4.3

Using your answers from the previous two parts, what should we set the value of \alpha to be if we want to ensure that gradient descent finds h^* after just one iteration?

\alpha = \frac{1}{2e}

We know from part (b) that h_1 = -\alpha \cdot 2e, and we know from part (a) that h^* = -1. If gradient descent converges in one iteration, that means that h_1 = h^*; solving this yields

-\alpha \cdot 2e = -1 \implies \alpha = \frac{1}{2e}
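A short numerical check of this step size, as a sketch (dR_dh is an illustrative name for the derivative from part (b)):

```python
import math

def dR_dh(h):
    # Derivative of R(h) = e^{(h + 1)^2}.
    return 2 * (h + 1) * math.exp((h + 1) ** 2)

alpha = 1 / (2 * math.e)
h0 = 0.0
h1 = h0 - alpha * dR_dh(h0)
print(h1)  # approximately -1, the minimizer h^*
```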


Problem 4.4

Below is a graph of R(h) with no axis labels.

True or False: Given an appropriate choice of step size, \alpha, gradient descent is guaranteed to find the minimizer of R(h).

True.

R(h) is convex, since the graph is bowl shaped. (It can also be proved that R(h) is convex using the second derivative test.) It is also differentiable, as we saw in part (b). As a result, since it’s both convex and differentiable, gradient descent is guaranteed to be able to minimize it given an appropriate choice of step size.
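For the second derivative test mentioned above, differentiating \frac{dR}{dh}(h) = 2(h+1)e^{(h+1)^2} once more gives \frac{d^2R}{dh^2}(h) = (2 + 4(h+1)^2) e^{(h+1)^2}, which is positive for every h. The sketch below compares that closed form against a finite-difference estimate on a small grid; the grid and the step d are arbitrary illustrative choices.

```python
import math

def R(h):
    return math.exp((h + 1) ** 2)

def R_second(h):
    # Closed form obtained by differentiating R'(h) = 2(h+1)e^{(h+1)^2}.
    return (2 + 4 * (h + 1) ** 2) * math.exp((h + 1) ** 2)

def R_second_numeric(h, d=1e-4):
    # Central finite-difference approximation of the second derivative.
    return (R(h + d) - 2 * R(h) + R(h - d)) / d ** 2

grid = [-3 + 0.5 * k for k in range(9)]  # h = -3, -2.5, ..., 1
for h in grid:
    print(h, R_second(h), R_second_numeric(h))
```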



Problem 5

Source: Fall 2022 Midterm, Problem 1

Suppose that we are given f(x) = x^3 + x^2 and learning rate \alpha = 1/4.


Problem 5.1

Write down the updating rule for gradient descent in general, then write down the updating rule for gradient descent for the function f(x).

In general, the updating rule for gradient descent is: x_{i + 1} = x_i - \alpha \nabla f(x_i) = x_i - \alpha \frac{\partial f}{\partial x}(x_i), where \alpha \in \mathbb{R}_+ is the learning rate or step size. For this function, since f is a single-variable function, we can write down the updating rule as: x_{i + 1} = x_i - \alpha \frac{df}{dx}(x_i) = x_i - \alpha f'(x_i). We also have: \frac{df}{dx} = f'(x) = 3x^2 + 2x, thus the updating rule can be written down as: x_{i + 1} = x_i - \alpha(3x_i^2 + 2x_i) = -\frac{3}{4} x_i^2 + \frac{1}{2}x_i.


Problem 5.2

If we start at x_0 = -1, should we go left or right? Can you verify this mathematically? What is x_1? Can gradient descent converge? If so, where might it converge to, given an appropriate step size?

We have f'(x_0) = f'(-1) = 3(-1)^2 + 2(-1) = 1 > 0, so we go left, and x_1 = x_0 - \alpha f'(x_0) = -1 - \frac{1}{4} = -\frac{5}{4}. Intuitively, gradient descent cannot converge in this case because \lim_{x \rightarrow -\infty} f(x) = -\infty.

To verify this, we need to find all local minima and local maxima. First, we solve the equation f'(x) = 0 to find all critical points.

We have: f'(x) = 0 \Leftrightarrow 3x^2 + 2x = 0 \Leftrightarrow x = -\frac{2}{3} \ \ \text{or} \ \ x = 0.

Now, we consider the second-order derivative: f''(x) = \frac{d^2f}{dx^2} = 6x + 2.

We have f''(x) = 0 only when x = -1/3. For x < -1/3, f''(x) is negative, so the slope f'(x) is decreasing; for x > -1/3, f''(x) is positive, so the slope f'(x) is increasing. Keep in mind that -1 < -2/3 < -1/3 < 0 < 1.

Therefore, f has a local maximum at x = -2/3 (where f''(-2/3) = -2 < 0) and a local minimum at x = 0 (where f''(0) = 2 > 0). Since f'(x) = x(3x + 2) > 0 for every x < -2/3, gradient descent started at x_0 = -1 moves left at every step, so it never reaches the local minimum at x = 0 and keeps going left forever. We say that gradient descent does not converge, or that it diverges.
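The leftward drift is easy to observe numerically. This sketch runs only six iterations, an arbitrary cap, because the iterates grow in magnitude very quickly.

```python
def f_prime(x):
    # Derivative of f(x) = x^3 + x^2.
    return 3 * x * x + 2 * x

alpha = 1 / 4
x = -1.0
iterates = [x]
for _ in range(6):
    x = x - alpha * f_prime(x)
    iterates.append(x)
print(iterates)  # every value is smaller (further left) than the one before
```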


Problem 5.3

If we start at x_0 = 1, should we go left or right? Can you verify this mathematically? What is x_1? Can gradient descent converge? If so, where might it converge to, given an appropriate step size?

We have f'(x_0) = f'(1) = 3 \cdot 1^2 + 2 \cdot 1 = 5 > 0, so we go left, and x_1 = x_0 - \alpha f'(x_0) = 1 - \frac{1}{4} \cdot 5 = -\frac{1}{4}.

From the previous part, f has a local minimum at x = 0, so gradient descent can converge (given an appropriate step size) to this local minimum.
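With the given learning rate \alpha = 1/4, a short simulation suggests that the iterates do approach x = 0 from this starting point. The iteration count of 100 is an arbitrary choice for this sketch.

```python
def f_prime(x):
    # Derivative of f(x) = x^3 + x^2.
    return 3 * x * x + 2 * x

alpha = 1 / 4
x = 1.0
iterates = [x]
for _ in range(100):
    x = x - alpha * f_prime(x)
    iterates.append(x)
print(iterates[1])   # -0.25, matching x_1 above
print(iterates[-1])  # very close to 0, the local minimum
```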


Problem 5.4

Write down 1 condition to terminate the gradient descent algorithm (in general).

There are several ways to terminate the gradient descent algorithm:

  • If the change in the optimization objective is too small, i.e. |f(x_i) - f(x_{i + 1})| < \epsilon where \epsilon is a small constant,

  • If the gradient is close to zero or the norm of the gradient is very small, i.e. \|\nabla f(x_i)\| < \lambda where \lambda is a small constant.
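Both stopping rules can be combined in one loop. The sketch below is one possible arrangement, not a prescribed implementation: the function name, the default tolerances, the test function (x - 3)^2, and the extra iteration cap are all assumptions added for illustration.

```python
def gradient_descent(f, f_prime, x0, alpha, eps=1e-12, lam=1e-8, max_iter=10_000):
    """Run 1-D gradient descent until one of the stopping rules fires."""
    x = x0
    for i in range(max_iter):
        grad = f_prime(x)
        if abs(grad) < lam:                  # gradient is close to zero
            return x, i, "small gradient"
        x_next = x - alpha * grad
        if abs(f(x) - f(x_next)) < eps:      # objective barely changed
            return x_next, i + 1, "small change in f"
        x = x_next
    # Safety net so the loop always ends (an added assumption, not from the solution).
    return x, max_iter, "iteration cap reached"

# Hypothetical test function with its minimizer at x = 3.
x_star, steps, reason = gradient_descent(
    f=lambda x: (x - 3) ** 2,
    f_prime=lambda x: 2 * (x - 3),
    x0=0.0,
    alpha=0.1,
)
print(x_star, steps, reason)
```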



Problem 6

Source: Winter 2024 Final Part 1, Problem 2

You and a friend independently perform gradient descent on the same function, but after 200 iterations, you have different results. Which of the following is sufficient on its own to explain the difference in your results? Note: When we say “same function” we assume the learning rate and initial predictions are the same too until said otherwise.

Select all that apply.

Bubbles 1 and 3: “The function is nonconvex” and “You and your friend chose different learning rates.”

If the function is nonconvex, it is possible for you and your friend to end up in different places if you start in different places. For example, if you have a polynomial with a local minimum and a global minimum, then you could find the local minimum while your friend finds the global minimum, which would mean you have different results.

If the function is not differentiable then you cannot perform gradient descent, so this cannot be an answer.

If you and your friend chose different learning rates, it is possible to get different results: with a very large learning rate, you might keep hopping over the global minimum without ever converging, while your friend, using a smaller learning rate, converges to the global minimum.

If you and your friend chose the same initial predictions you are guaranteed to end up in the same spot.

Because two of the option choices are possible the answer cannot be “None of the above.”
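Different learning rates can give different results after a fixed number of iterations even when neither run overshoots. The sketch below uses the convex function f(x) = x^4 and two small learning rates; the function and both rates are hypothetical choices, not part of the original question.

```python
def f_prime(x):
    # Derivative of the hypothetical convex function f(x) = x^4.
    return 4 * x ** 3

def gradient_descent(alpha, x0=1.0, iterations=200):
    # Same function, same starting point; only the learning rate varies.
    x = x0
    for _ in range(iterations):
        x = x - alpha * f_prime(x)
    return x

you = gradient_descent(alpha=0.01)
friend = gradient_descent(alpha=0.001)
print(you, friend)  # both are moving toward 0, but they are at different points
```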


Problem 7

Source: Winter 2024 Midterm 1, Problem 3

The hyperbolic cosine function is defined as \cosh(x) = \frac{1}{2}(e^{x} + e^{-x}). In this problem, we aim to prove the convexity of this function using its power series expansion.


Problem 7.1

Prove that f(x) = x^{n} is convex if n is an even integer.

Take the second derivative of f:

\begin{align*} f'(x) &= nx^{n-1}\\ f''(x) &= n(n-1)x^{n-2} \end{align*}

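If n is even, then n-2 is also even, so x^{n-2} \geq 0 for all x. For any even integer n \geq 2 we also have n(n-1) > 0, so f''(x) = n(n-1)x^{n-2} \geq 0 everywhere. (For n = 0, f is the constant function 1, which is trivially convex.) Since the second derivative of f(x) is never negative, f passes the second derivative test and is convex.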


Problem 7.2

Power series expansion is a powerful tool to analyze complicated functions. In power series expansion, a function can be written as an infinite sum of polynomial functions with certain coefficients. For example, the exponential function can be written as: \begin{align*} e^{x} = \sum_{n=0}^{\infty}\frac{x^{n}}{n!} = 1 + x + \frac{x^{2}}{2} + \frac{x^{3}}{6} + \frac{x^{4}}{24} + ... \end{align*}

where n! denotes the factorial of n, defined as the product of all positive integers up to n, i.e. n! = 1\cdot 2\cdot 3\cdot ... \cdot (n-1)\cdot n. Given the power series expansion of e^{x} above, write the power series expansion of e^{-x} and explicitly specify the first 5 terms, i.e., similar to the format of the equation above.

By plugging -x in for each x, we get:

e^{-x} = \displaystyle\sum_{n=0}^{\infty}\frac{(-x)^{n}}{n!}=1-x+\frac{x^{2}}{2} - \frac{x^{3}}{6}+\frac{x^{4}}{24}+ ...
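The expansion can be sanity-checked by comparing partial sums with Python's built-in exponential. The helper name exp_neg_series and the test point x = 0.5 are illustrative choices for this sketch.

```python
import math

def exp_neg_series(x, terms):
    # Partial sum of (-x)^n / n! for n = 0, 1, ..., terms - 1.
    return sum((-x) ** n / math.factorial(n) for n in range(terms))

x = 0.5
print(exp_neg_series(x, 5))   # first five terms: 1 - x + x^2/2 - x^3/6 + x^4/24
print(exp_neg_series(x, 20))  # many more terms
print(math.exp(-x))           # reference value
```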



Problem 7.3

Using the conclusions you reached in part (a) and part (b), prove that \cosh(x) = \frac{1}{2}(e^{x} + e^{-x}) is convex.

Given that:

\begin{align*} e^{x} &= \sum_{n=0}^{\infty}\frac{x^{n}}{n!} = 1 + x + \frac{x^{2}}{2} + \frac{x^{3}}{6} + \frac{x^{4}}{24} + ....\\ e^{-x} &= \sum_{n=0}^{\infty}\frac{(-x)^{n}}{n!} = 1 - x + \frac{x^{2}}{2} - \frac{x^{3}}{6} + \frac{x^{4}}{24} + .... \end{align*}

We can add their power series expansion together, and we will obtain:

\begin{align*} e^{x} + e^{-x} &= \sum_{n=0}^{\infty}\frac{x^{n}}{n!} + \sum_{n=0}^{\infty}\frac{(-x)^{n}}{n!}\\ &=\sum_{n=0}^{\infty}\frac{(x)^{n} + (-x)^{n}}{n!} \end{align*}

Within this infinite sum, if n is even, then the negative sign in (-x)^{n} disappears, so (-x)^{n} = x^{n}; if n is odd, then the negative sign is kept and can be pulled out of the parentheses, so (-x)^{n} = -x^{n}. Therefore we have:

\begin{align*} e^{x} + e^{-x} &= \sum_{n \text{ even}}\frac{x^{n}+x^{n}}{n!} + \sum_{n \text{ odd}}\frac{x^{n}-x^{n}}{n!}\\ &=\sum_{n \text{ even}}\frac{2x^{n}}{n!} \end{align*}

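Therefore, \cosh(x)=\displaystyle\frac{e^{x}+e^{-x}}{2} is a sum of terms x^{n} with n even, each multiplied by the positive coefficient \frac{1}{n!}. We have already proved in part (a) that x^{n} is convex for even n, and multiplying a convex function by a positive constant keeps it convex, so \cosh(x) is an infinite sum of convex functions and is therefore also convex.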
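The even-power series can be compared against the library value of cosh at a few points. The helper name cosh_series, the number of terms, and the sample points are assumptions made for this sketch.

```python
import math

def cosh_series(x, terms=15):
    # Partial sum of x^(2k) / (2k)! for k = 0, 1, ..., terms - 1 (even powers only).
    return sum(x ** (2 * k) / math.factorial(2 * k) for k in range(terms))

for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    print(x, cosh_series(x), math.cosh(x))  # the two columns agree closely
```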


