#assignment #sdsc6015
Problem link: SDSC6015 - Question of Assignment 2
Problem 1 [10 marks]
Prove that if the function $f: \mathbb{R}^{d}\rightarrow \mathbb{R}$ has a subgradient at every point in its domain, then $f$ is convex.
Solution:
Let $x, y \in \mathbb{R}^d$, $\lambda \in [0,1]$, and define $z = \lambda x + (1-\lambda)y$.
Since a subgradient exists at every point, for any $g_z \in \partial f(z)$ we have:
$$f(x) \geq f(z) + \langle g_z, x - z \rangle, \qquad f(y) \geq f(z) + \langle g_z, y - z \rangle$$
Taking a weighted sum of the two inequalities:
$$\lambda f(x) + (1-\lambda) f(y) \geq f(z) + \langle g_z, \lambda(x - z) + (1-\lambda)(y - z) \rangle$$
Substitute $z = \lambda x + (1-\lambda)y$ and simplify the inner product term: since $x - z = (1-\lambda)(x - y)$ and $y - z = -\lambda(x - y)$,
$$\lambda(x - z) + (1-\lambda)(y - z) = \lambda(1-\lambda)(x - y) - \lambda(1-\lambda)(x - y) = 0$$
Thus:
$$\lambda f(x) + (1-\lambda) f(y) \geq f(z) = f(\lambda x + (1-\lambda)y)$$
Therefore, $f$ is convex.
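As a quick numerical sanity check (not part of the proof), the two subgradient inequalities and the resulting convexity combination can be verified on a nonsmooth example such as $f(x) = |x|$, where $\operatorname{sign}(x)$ is a valid subgradient; the sketch below assumes that example.

```python
import numpy as np

# Sanity check (illustrative): f(x) = |x| with subgradient g(x) = sign(x).
# The two subgradient inequalities at z = lam*x + (1-lam)*y force
# lam*f(x) + (1-lam)*f(y) >= f(z), i.e. the convexity combination.
f, g = np.abs, np.sign  # sign(.) is a valid subgradient of |.| (0 works at 0)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y, lam = rng.normal(), rng.normal(), rng.uniform()
    z = lam * x + (1 - lam) * y
    gz = g(z)
    assert f(x) >= f(z) + gz * (x - z) - 1e-12  # subgradient inequality at z
    assert f(y) >= f(z) + gz * (y - z) - 1e-12
    assert lam * f(x) + (1 - lam) * f(y) >= f(z) - 1e-12  # convexity at z
print("all checks passed")
```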
Problem 2 [20 marks]
Assume the function $f: \mathbb{R}^{d}\rightarrow \mathbb{R}$ has the sum structure:
$$f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$$
where each $f_{i}: \mathbb{R}^{d}\rightarrow \mathbb{R}$ is $L$-smooth. Consider the over-parameterized setting, meaning there exists $x^{*}$ such that $\nabla f_{i}(x^{*})=0$ for all $i\in\{1,2,\ldots, n\}$.
We run standard SGD by uniformly sampling $i$ and updating with step-size $\eta>0$:
$$x_{t+1}=x_{t}-\eta\nabla f_{i}(x_{t})$$
(i) Given the over-parameterization of $f$, show that:
$$\mathbb{E}\left[\left\|\nabla f_{i}(x_{t})\right\|^{2} \mid x_{t}\right]\leqslant 2 L\left(f(x_{t})-f(x^{*})\right)$$
(Hint: use the $L$-Lipschitz continuity of $\nabla f_{i}$ and $\nabla f_{i}(x^{*})=0$.)
Solution:
Since each $f_i$ is $L$-smooth and $\nabla f_i(x^*) = 0$, the co-coercivity property of $L$-smooth convex functions gives, for any $x$ and $y$:
$$\|\nabla f_i(x) - \nabla f_i(y)\|^2 \leq 2L \left( f_i(x) - f_i(y) - \nabla f_i(y)^\top (x-y) \right)$$
Set $y = x^*$ and use $\nabla f_i(x^*) = 0$ to obtain:
$$\|\nabla f_i(x_t)\|^2 \leq 2L \left( f_i(x_t) - f_i(x^*) \right)$$
This holds for each $i$. Taking the expectation over $i$ (uniform sampling), and noting $\mathbb{E}_i [f_i(x_t)] = f(x_t)$ and $\mathbb{E}_i [f_i(x^*)] = f(x^*)$:
$$\mathbb{E}_i \left[ \|\nabla f_i(x_t)\|^2 \right] \leq 2L\, \mathbb{E}_i \left[ f_i(x_t) - f_i(x^*) \right] = 2L \left( f(x_t) - f(x^*) \right)$$
Thus:
$$\mathbb{E}\left[\|\nabla f_i(x_t)\|^2 \mid x_t\right] \leq 2L \left(f(x_t) - f(x^*)\right)$$
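As an illustration (not required by the problem), the bound can be checked numerically on an assumed over-parameterized least-squares model $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$ with $b = A x^*$, so that $\nabla f_i(x^*) = 0$ for every $i$ and each $f_i$ is $\|a_i\|^2$-smooth:

```python
import numpy as np

# Illustrative check of E_i[||grad f_i(x)||^2] <= 2L (f(x) - f(x*)) on an
# assumed over-parameterized least-squares model f_i(x) = 0.5*(a_i @ x - b_i)^2,
# with b = A @ x_star so that grad f_i(x_star) = 0 for every i.
rng = np.random.default_rng(1)
n, d = 20, 50                        # d > n: over-parameterized
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star                       # interpolation: a_i @ x_star = b_i

L = max(np.linalg.norm(A[i]) ** 2 for i in range(n))  # each f_i is L-smooth

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)            # f = (1/n) sum_i f_i

x = rng.normal(size=d)
lhs = np.mean([np.linalg.norm((A[i] @ x - b[i]) * A[i]) ** 2 for i in range(n)])
rhs = 2 * L * (f(x) - f(x_star))     # f(x_star) = 0 in this setup
print(lhs <= rhs + 1e-9, lhs, rhs)
```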
(ii) Using the result from (i), prove that:
$$\mathbb{E}\left[f(x_{t+1})\mid x_{t}\right]\leqslant f(x_{t})-\eta\left\|\nabla f(x_{t})\right\|^{2}+\eta^{2} L^{2}\left(f(x_{t})-f(x^{*})\right)$$
(Hint: substitute the SGD update into the smoothness bound of $f$.)
Solution:
By the $L$-smoothness of $f$:
$$f(x_{t+1}) \leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \|x_{t+1} - x_t\|^2$$
Substitute the SGD update $x_{t+1} - x_t = -\eta \nabla f_i(x_t)$:
$$f(x_{t+1}) \leq f(x_t) - \eta \nabla f(x_t)^\top \nabla f_i(x_t) + \frac{L \eta^2}{2} \|\nabla f_i(x_t)\|^2$$
Take the conditional expectation over $i$, using the unbiasedness $\mathbb{E}_i[\nabla f_i(x_t)] = \nabla f(x_t)$:
$$\mathbb{E}_i \left[ f(x_{t+1}) \mid x_t \right] \leq f(x_t) - \eta \|\nabla f(x_t)\|^2 + \frac{L \eta^2}{2}\, \mathbb{E}_i \left[ \|\nabla f_i(x_t)\|^2 \mid x_t \right]$$
Apply the result from (i), $\mathbb{E}_i \left[ \|\nabla f_i(x_t)\|^2 \mid x_t \right] \leq 2L \left(f(x_t) - f(x^*)\right)$:
$$\mathbb{E}\left[f(x_{t+1}) \mid x_t\right] \leqslant f(x_t) - \eta \|\nabla f(x_t)\|^2 + \eta^2 L^2 \left(f(x_t) - f(x^*)\right)$$
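Continuing the same assumed least-squares example, the one-step bound can be checked by computing the conditional expectation over $i$ exactly, i.e. by averaging $f(x_t - \eta \nabla f_i(x_t))$ over all $n$ components; this is a sketch, not part of the required proof:

```python
import numpy as np

# Illustrative check of the one-step bound
#   E[f(x_{t+1}) | x_t] <= f(x_t) - eta*||grad f(x_t)||^2
#                           + eta^2 * L^2 * (f(x_t) - f(x*))
# on the same assumed least-squares example; the conditional expectation over
# the sampled index i is computed exactly by averaging all n possible updates.
rng = np.random.default_rng(2)
n, d = 20, 50
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star
L = max(np.linalg.norm(A[i]) ** 2 for i in range(n))

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b) / n
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]

eta = 1.0 / (2 * L)
x_t = rng.normal(size=d)
exp_next = np.mean([f(x_t - eta * grad_i(x_t, i)) for i in range(n)])
bound = (f(x_t) - eta * np.linalg.norm(grad_f(x_t)) ** 2
         + eta ** 2 * L ** 2 * (f(x_t) - f(x_star)))
print(exp_next <= bound + 1e-9, exp_next, bound)
```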
Problem 3 [10 marks]
Let $f: \mathbb{R}^{d}\rightarrow \mathbb{R}$ be convex, $L$-smooth, and differentiable, with $x^{*}$ the unique global minimum of $f$. Given an initial point $x_{1}\in \mathbb{R}^{d}$ and $T>0$, consider Scalar AdaGrad with $G_{t}=\sum_{j=1}^{t}\|\nabla f(x_{j})\|^{2}$ and update:
$$x_{t+1}=x_{t}-\frac{R}{\sqrt{G_{t}}}\nabla f(x_{t}),\quad t=1,2,\ldots,T$$
Prove that:
$$f\left(\frac{1}{T}\sum_{t=1}^{T} x_{t}\right)-f(x^{*})\leqslant\frac{2 R^{2} L}{T}$$
where $R=\max_{t=1,\ldots,T}\|x_{t}-x^{*}\|$.
Hint: use convexity and smoothness to show:
$$f\left(\frac{1}{T}\sum_{t=1}^{T}x_{t}\right)-f(x^{*})\leqslant\frac{1}{T}\left(\sum_{t=1}^{T}\left\langle\nabla f(x_{t}),x_{t}-x^{*}\right\rangle-\frac{1}{2L}\sum_{t=1}^{T}\|\nabla f(x_{t})\|^{2}\right)$$
where $\sum_{t=1}^{T}\left\langle\nabla f(x_{t}), x_{t}-x^{*}\right\rangle$ can be bounded as in the proof of Theorem 3 from Lecture 6.
Solution:
From the hint (convexity + L-smoothness):
$$f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right) - f(x^*) \leqslant \frac{1}{T} \left( \sum_{t=1}^{T} \langle \nabla f(x_t), x_t - x^* \rangle - \frac{1}{2L} \sum_{t=1}^{T} \|\nabla f(x_t)\|^2 \right)$$
For the AdaGrad update $x_{t+1} = x_t - \frac{R}{\sqrt{G_t}} \nabla f(x_t)$ with $G_t = \sum_{j=1}^{t} \|\nabla f(x_j)\|^2$, expand the squared distance to $x^*$ as in the proof of Theorem 3 in Lecture 6:
$$\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - \frac{2R}{\sqrt{G_t}} \langle \nabla f(x_t), x_t - x^* \rangle + \frac{R^2}{G_t} \|\nabla f(x_t)\|^2$$
Rearranging for the inner product and summing over $t$:
$$\sum_{t=1}^{T} \langle \nabla f(x_t), x_t - x^* \rangle = \frac{1}{2R} \sum_{t=1}^{T} \sqrt{G_t}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{R}{2} \sum_{t=1}^{T} \frac{\|\nabla f(x_t)\|^2}{\sqrt{G_t}}$$
Since $\sqrt{G_t}$ is nondecreasing and $\|x_t - x^*\| \leq R$, Abel summation (telescoping) bounds the first sum by $R^2 \sqrt{G_T}$, and the standard lemma $\sum_{t=1}^{T} a_t / \sqrt{\sum_{j \leq t} a_j} \leq 2\sqrt{\sum_{t=1}^{T} a_t}$ with $a_t = \|\nabla f(x_t)\|^2$ bounds the second sum by $2\sqrt{G_T}$. Hence:
$$\sum_{t=1}^{T} \langle \nabla f(x_t), x_t - x^* \rangle \leq \frac{R}{2}\sqrt{G_T} + R\sqrt{G_T} = \frac{3R}{2}\sqrt{G_T}$$
Combining with the first display and noting $\sum_{t=1}^{T}\|\nabla f(x_t)\|^2 = G_T$, write $u = \sqrt{G_T}$:
$$f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right) - f(x^*) \leqslant \frac{1}{T}\left(\frac{3R}{2}\,u - \frac{u^2}{2L}\right) \leqslant \frac{1}{T}\,\max_{u \geq 0}\left(\frac{3R}{2}\,u - \frac{u^2}{2L}\right) = \frac{9R^{2}L}{8T} \leqslant \frac{2R^{2} L}{T}$$
since the concave quadratic in $u$ is maximized at $u = \frac{3RL}{2}$.
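For intuition, a minimal Scalar AdaGrad run on an assumed convex quadratic $f(x) = \frac{1}{2}\|A(x - x^*)\|^2$ can be used to observe the $O(1/T)$ behaviour. Note the theorem's $R$ is defined from the iterates themselves, so this sketch seeds the step constant with $\|x_1 - x^*\|$ and measures $R$ post hoc; it is an informal check, not the proof:

```python
import numpy as np

# Informal check of the Scalar AdaGrad bound on an assumed convex quadratic
# f(x) = 0.5*||A (x - x_star)||^2, whose smoothness constant is ||A^T A||_2.
# The theorem's R = max_t ||x_t - x_star|| depends on the iterates, so this
# sketch seeds the step constant with ||x_1 - x_star|| and measures R post hoc.
rng = np.random.default_rng(3)
d, T = 10, 2000
A = rng.normal(size=(d, d))
L = np.linalg.norm(A.T @ A, 2)
x_star = rng.normal(size=d)

f = lambda x: 0.5 * np.linalg.norm(A @ (x - x_star)) ** 2
grad = lambda x: A.T @ (A @ (x - x_star))

x = rng.normal(size=d)
R0 = np.linalg.norm(x - x_star)      # stand-in for the (unknown a priori) R
G, iterates = 0.0, []
for t in range(T):
    iterates.append(x)
    g = grad(x)
    G += np.linalg.norm(g) ** 2      # G_t = sum_{j<=t} ||grad f(x_j)||^2
    x = x - (R0 / np.sqrt(G)) * g    # Scalar AdaGrad step

R = max(np.linalg.norm(xt - x_star) for xt in iterates)
gap = f(np.mean(iterates, axis=0)) - f(x_star)
print(gap, 2 * R ** 2 * L / T)       # observe gap <= 2 R^2 L / T (informally)
```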
Problem 4: Practical Implementation of Stochastic Gradient Descent [60 marks]
For implementation, it is recommended to use Google Colab.
Please open the file “HW2_Lab_SGD.ipynb” and insert your code following the provided instructions. All necessary datasets and helper functions can be found in the HW2 Lab folder.
For submission, please submit your completed “HW2_Lab_SGD.ipynb” file along with a PDF containing the output results.
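For orientation, a generic uniform-sampling SGD loop of the kind the notebook asks for is sketched below; the dataset, loss, and helper names are illustrative placeholders, not the ones defined in HW2_Lab_SGD.ipynb:

```python
import numpy as np

# For orientation only: a generic uniform-sampling SGD loop of the kind the
# notebook asks for. The dataset, loss, and helper names below are
# illustrative placeholders, not the ones defined in HW2_Lab_SGD.ipynb.
def sgd(grad_i, x0, n, eta=0.01, epochs=10, seed=0):
    """Run SGD: sample i uniformly, then set x <- x - eta * grad_i(x, i)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    history = [x.copy()]
    for _ in range(epochs * n):
        i = rng.integers(n)          # uniform component sampling
        x -= eta * grad_i(x, i)      # stochastic gradient step
        history.append(x.copy())
    return x, history

# Example on least squares with f_i(x) = 0.5*(a_i @ x - b_i)^2:
rng = np.random.default_rng(0)
n, d = 100, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
x_hat, _ = sgd(grad_i, np.zeros(d), n, eta=0.01, epochs=50)
print(np.linalg.norm(A.T @ (A @ x_hat - b)) / n)  # full-gradient norm at x_hat
```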