#assignment #sdsc6015
Problem link: SDSC6015 - Question of Assignment 2

Problem 1 [10 marks]

Prove that if the function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}$ has a subgradient at every point in its domain, then $f$ is convex.


Solution:
Let $x, y \in \mathbb{R}^d$, $\lambda \in [0,1]$, and define $z = \lambda x + (1-\lambda)y$.
Since a subgradient exists at every point, for any $g_z \in \partial f(z)$, we have:

$$f(x) \geq f(z) + \langle g_z, x - z \rangle, \quad f(y) \geq f(z) + \langle g_z, y - z \rangle$$

Multiplying the first inequality by $\lambda$ and the second by $1-\lambda$, then adding:

$$\lambda f(x) + (1-\lambda) f(y) \geq f(z) + \langle g_z, \lambda(x - z) + (1-\lambda)(y - z) \rangle$$

Substitute $z = \lambda x + (1-\lambda)y$ and simplify the inner-product term:

$$\lambda(x - z) + (1-\lambda)(y - z) = \lambda(1-\lambda)(x - y) - \lambda(1-\lambda)(x - y) = 0$$

Thus:

$$\lambda f(x) + (1-\lambda) f(y) \geq f(z) = f(\lambda x + (1-\lambda)y)$$

Therefore, $f$ is convex.
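
As a quick numerical illustration (not part of the proof), the sketch below checks the two subgradient inequalities and the resulting convexity inequality for $f(x) = |x|$, whose subgradient at every point is $\operatorname{sign}(x)$:

```python
import numpy as np

# Numerical illustration (not part of the proof): for f(x) = |x|, whose
# subgradient at every point is sign(x), check the two subgradient
# inequalities at z = lam*x + (1-lam)*y and the convexity inequality
# they imply.
f, g = np.abs, np.sign              # sign(.) is a valid subgradient of |.|

rng = np.random.default_rng(0)
for _ in range(10_000):
    x, y = rng.normal(size=2)
    lam = rng.uniform()
    z = lam * x + (1 - lam) * y
    assert f(x) >= f(z) + g(z) * (x - z) - 1e-12   # subgradient inequality at z vs x
    assert f(y) >= f(z) + g(z) * (y - z) - 1e-12   # subgradient inequality at z vs y
    assert lam * f(x) + (1 - lam) * f(y) >= f(z) - 1e-12  # convexity inequality
```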


Problem 2 [20 marks]

Assume the function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}$ has the sum structure:

$$f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$$

where each $f_{i}: \mathbb{R}^{d} \rightarrow \mathbb{R}$ is $L$-smooth. Consider the over-parameterized setting, meaning there exists $x^{*}$ such that $\nabla f_{i}(x^{*}) = 0$ for all $i \in \{1, 2, \ldots, n\}$.
We run standard SGD by uniformly sampling $i$ and updating with step size $\eta > 0$:

$$x_{t+1} = x_{t} - \eta \nabla f_{i}(x_{t})$$

(i) Given the over-parameterization of $f$, show that:

$$\mathbb{E}\left[\left\|\nabla f_{i}(x_{t})\right\|^{2} \mid x_{t}\right] \leqslant 2L\left(f(x_{t}) - f(x^{*})\right)$$

(Hint: use the $L$-Lipschitz continuity of $\nabla f_{i}$ and $\nabla f_{i}(x^{*}) = 0$.)


Solution:

Since each $f_i$ is $L$-smooth and $\nabla f_i(x^*) = 0$, the co-coercivity property of convex $L$-smooth functions gives, for any $x$ and $y$:

$$\|\nabla f_i(x) - \nabla f_i(y)\|^2 \leq 2L \left( f_i(x) - f_i(y) - \nabla f_i(y)^\top (x-y) \right)$$

Set $y = x^*$ and use $\nabla f_i(x^*) = 0$ to obtain:

$$\|\nabla f_i(x_t)\|^2 \leq 2L \left( f_i(x_t) - f_i(x^*) \right)$$

This holds for each $i$. Taking the expectation over the uniformly sampled index $i$, and noting that $\mathbb{E}_i[f_i(x_t)] = f(x_t)$ and $\mathbb{E}_i[f_i(x^*)] = f(x^*)$:

$$\mathbb{E}_i \left[ \|\nabla f_i(x_t)\|^2 \right] \leq 2L\, \mathbb{E}_i \left[ f_i(x_t) - f_i(x^*) \right] = 2L \left( f(x_t) - f(x^*) \right)$$

Thus:

$$\mathbb{E}\left[\|\nabla f_i(x_t)\|^2 \mid x_t\right] \leq 2L \left(f(x_t) - f(x^*)\right)$$
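
For a concrete sanity check, the sketch below builds an assumed over-parameterized least-squares instance ($f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$ with $b = A x^*$, so $\nabla f_i(x^*) = 0$ for every $i$) and verifies the inequality with $L = \max_i \|a_i\|^2$; the setup and names are illustrative, not from the course materials:

```python
import numpy as np

# Illustrative check of (i): an over-parameterized least-squares instance
# with f_i(x) = 0.5 * (a_i^T x - b_i)^2 and b = A x_star, so that
# grad f_i(x_star) = 0 for every i. Each f_i is ||a_i||^2-smooth, so we
# take L = max_i ||a_i||^2.
rng = np.random.default_rng(1)
n, d = 20, 50                       # d > n: over-parameterized
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star                      # interpolation: every f_i vanishes at x_star

L = np.max(np.sum(A * A, axis=1))   # L = max_i ||a_i||^2

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

x = rng.normal(size=d)
residual = A @ x - b
lhs = np.mean(residual**2 * np.sum(A * A, axis=1))  # E_i ||grad f_i(x)||^2
rhs = 2 * L * (f(x) - 0.0)                          # f(x_star) = 0
print(lhs, "<=", rhs, lhs <= rhs + 1e-9)
```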


(ii) Using the result from (i), prove that:

$$\mathbb{E}\left[f(x_{t+1}) \mid x_{t}\right] \leqslant f(x_{t}) - \eta\left\|\nabla f(x_{t})\right\|^{2} + \eta^{2} L^{2}\left(f(x_{t}) - f(x^{*})\right)$$

(Hint: substitute the SGD update into the smoothness bound of $f$.)


Solution:
By the $L$-smoothness of $f$ (an average of $L$-smooth functions is itself $L$-smooth):

$$f(x_{t+1}) \leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \|x_{t+1} - x_t\|^2$$

Substitute the SGD update $x_{t+1} - x_t = -\eta \nabla f_i(x_t)$:

$$f(x_{t+1}) \leq f(x_t) - \eta\, \nabla f(x_t)^\top \nabla f_i(x_t) + \frac{L \eta^2}{2} \|\nabla f_i(x_t)\|^2$$

Take the conditional expectation over $i$ and use the unbiasedness $\mathbb{E}_i[\nabla f_i(x_t) \mid x_t] = \nabla f(x_t)$:

$$\mathbb{E}_i \left[ f(x_{t+1}) \mid x_t \right] \leq f(x_t) - \eta \|\nabla f(x_t)\|^2 + \frac{L \eta^2}{2}\, \mathbb{E}_i \left[ \|\nabla f_i(x_t)\|^2 \mid x_t \right]$$

Apply the result from (i), $\mathbb{E}_i \left[ \|\nabla f_i(x_t)\|^2 \mid x_t \right] \leq 2L \left(f(x_t) - f(x^*)\right)$:

$$\mathbb{E}\left[f(x_{t+1}) \mid x_t\right] \leqslant f(x_t) - \eta \|\nabla f(x_t)\|^2 + \eta^2 L^2 \left(f(x_t) - f(x^*)\right)$$
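
The conditional expectation in this bound can be checked exactly on the same kind of toy instance by enumerating all $n$ sampled indices; again, the least-squares setup below is an assumed example, not the lecture's:

```python
import numpy as np

# Exact one-step check of (ii) on an assumed over-parameterized
# least-squares instance (same construction as the previous sketch):
# the conditional expectation over the uniformly sampled index i is
# computed exactly by enumerating all n choices.
rng = np.random.default_rng(2)
n, d = 20, 50                       # d > n: over-parameterized, b = A x_star
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star
L = np.max(np.sum(A * A, axis=1))   # L = max_i ||a_i||^2

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad_full(x):
    return A.T @ (A @ x - b) / n

eta = 1.0 / (2 * L)                 # the inequality holds for any eta > 0
x_t = rng.normal(size=d)

# E[f(x_{t+1}) | x_t], enumerating grad f_i(x_t) = a_i (a_i^T x_t - b_i):
lhs = np.mean([f(x_t - eta * A[i] * (A[i] @ x_t - b[i])) for i in range(n)])
rhs = f(x_t) - eta * np.sum(grad_full(x_t) ** 2) + eta**2 * L**2 * (f(x_t) - 0.0)
print(lhs, "<=", rhs, lhs <= rhs + 1e-12)
```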


Problem 3 [10 marks]

Let $f: \mathbb{R}^{d} \rightarrow \mathbb{R}$ be convex, $L$-smooth, and differentiable, with $x^{*}$ the unique global minimum of $f$. Given an initial point $x_{1} \in \mathbb{R}^{d}$ and $T > 0$, consider scalar AdaGrad with $G_{t} = \sum_{j=1}^{t}\|\nabla f(x_{j})\|^{2}$ and update:

$$x_{t+1} = x_{t} - \frac{R}{\sqrt{G_{t}}}\nabla f\left( x_{t}\right), \quad t = 1, 2, \ldots, T$$

Prove that:

$$f\left(\frac{1}{T}\sum_{t=1}^{T} x_{t}\right) - f\left(x^{*}\right) \leqslant \frac{2 R^{2} L}{T}$$

where $R = \max_{1 \leq t \leq T}\left\|x_{t} - x^{*}\right\|$.
Hint: use convexity and smoothness to show:

$$f\left(\frac{1}{T}\sum_{t=1}^{T}x_{t}\right) - f\left(x^{*}\right) \leqslant \frac{1}{T}\left(\sum_{t=1}^{T}\left\langle\nabla f\left(x_{t}\right), x_{t} - x^{*}\right\rangle - \frac{1}{2L}\sum_{t=1}^{T}\|\nabla f\left(x_{t}\right)\|^{2}\right)$$

where $\sum_{t=1}^{T}\left\langle\nabla f\left(x_{t}\right), x_{t} - x^{*}\right\rangle$ can be bounded as in the proof of Theorem 3 from Lecture 6.


Solution:
From the hint (convexity combined with $L$-smoothness):

$$f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right) - f(x^*) \leqslant \frac{1}{T} \left( \sum_{t=1}^{T} \langle \nabla f(x_t), x_t - x^* \rangle - \frac{1}{2L} \sum_{t=1}^{T} \|\nabla f(x_t)\|^2 \right)$$

For the AdaGrad update $x_{t+1} = x_t - \frac{R}{\sqrt{G_t}} \nabla f(x_t)$ with $G_t = \sum_{j=1}^{t} \|\nabla f(x_j)\|^2$, expand the squared distance to $x^*$ as in the proof of Theorem 3 in Lecture 6:

$$\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - \frac{2R}{\sqrt{G_t}} \langle \nabla f(x_t), x_t - x^* \rangle + \frac{R^2}{G_t} \|\nabla f(x_t)\|^2$$

Rearranging:

$$\langle \nabla f(x_t), x_t - x^* \rangle = \frac{\sqrt{G_t}}{2R}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{R}{2\sqrt{G_t}} \|\nabla f(x_t)\|^2$$

Now sum over $t$. Since $\sqrt{G_t}$ is nondecreasing and $\|x_t - x^*\| \leq R$, Abel summation bounds the telescoping part by $\frac{1}{2R}\, R^2 \sqrt{G_T} = \frac{R}{2}\sqrt{G_T}$, while the standard lemma $\sum_{t=1}^{T} \frac{\|\nabla f(x_t)\|^2}{\sqrt{G_t}} \leq 2\sqrt{G_T}$ bounds the remaining part by $R\sqrt{G_T}$. Hence:

$$\sum_{t=1}^{T} \langle \nabla f(x_t), x_t - x^* \rangle \leq \frac{R}{2}\sqrt{G_T} + R\sqrt{G_T} \leq 2R\sqrt{G_T}$$

Combining with the bound from the hint and writing $u = \sqrt{G_T}$, so that $\sum_{t=1}^{T} \|\nabla f(x_t)\|^2 = G_T = u^2$:

$$f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right) - f(x^*) \leqslant \frac{1}{T}\left(2Ru - \frac{u^2}{2L}\right) \leqslant \frac{1}{T}\max_{u \geq 0}\left(2Ru - \frac{u^2}{2L}\right) = \frac{2R^{2} L}{T}$$

where the maximum is attained at $u = 2RL$. This completes the proof.
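
A minimal scalar AdaGrad sketch on an assumed least-squares objective is given below. Note the theorem's $R$ depends on the trajectory, so the script measures $R = \max_t \|x_t - x^*\|$ after the run and compares the averaged-iterate gap against $2R^2L/T$ as an empirical illustration only:

```python
import numpy as np

# Minimal scalar AdaGrad sketch on an assumed least-squares objective
# f(x) = 0.5 * ||A x - b||^2 with b = A x_star, so f(x*) = 0 and the
# minimum is unique when A has full column rank.
rng = np.random.default_rng(3)
n, d, T = 100, 10, 2000
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star
L = np.linalg.eigvalsh(A.T @ A).max()     # smoothness constant of f

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b)

x = np.zeros(d)
R0 = np.linalg.norm(x - x_star)           # step-size parameter playing the role of R
G, x_sum, R = 0.0, np.zeros(d), R0
for t in range(T):
    x_sum += x                            # average the query points x_1, ..., x_T
    g = grad(x)
    G += g @ g                            # G_t = sum_{j<=t} ||grad f(x_j)||^2
    x = x - (R0 / np.sqrt(G)) * g
    R = max(R, np.linalg.norm(x - x_star))  # measured max distance to x*

gap = f(x_sum / T) - 0.0                  # f(x*) = 0 by construction
print(f"gap = {gap:.3e}  vs  2 R^2 L / T = {2 * R**2 * L / T:.3e}")
```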


Problem 4 [60 marks]: Practical Implementation of Stochastic Gradient Descent

  • For implementation, it is recommended to use Google Colab.

  • Please open the file “HW2_Lab_SGD.ipynb” and insert your code following the provided instructions. All necessary datasets and helper functions can be found in the HW2 Lab folder.

  • Please submit your completed “HW2_Lab_SGD.ipynb” file along with a PDF containing the output results. A generic SGD sketch follows for reference.
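
For orientation, here is a minimal standalone SGD loop in the spirit of Problem 4; it is a generic sketch on a synthetic least-squares problem, not the scaffold or helper functions from “HW2_Lab_SGD.ipynb”:

```python
import numpy as np

# Generic SGD loop on a synthetic least-squares problem:
# f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2. This is a standalone
# sketch; the dataset, loss, and helper names in the actual notebook
# may differ.
rng = np.random.default_rng(4)
n, d, T, eta = 200, 20, 5000, 0.01
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)   # noisy linear data

x = np.zeros(d)
for t in range(T):
    i = rng.integers(n)                  # uniform index sampling
    g = A[i] * (A[i] @ x - b[i])         # grad f_i(x)
    x -= eta * g                         # x_{t+1} = x_t - eta * grad f_i(x_t)

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```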