#sdsc6015


Review


Convex Optimization Problems

The general form of a convex optimization problem is:

\min_{x \in \mathbb{R}^d} f(x)

where f is a convex function, \mathbb{R}^d is a convex set, and x^* is its minimizer:

x^* = \arg\min_{x \in \mathbb{R}^d} f(x)

The update rule for Gradient Descent (GD) is:

x_{k+1} = x_k - \eta_{k+1} \nabla f(x_k)

  • x_k: current parameter point

  • \eta_k > 0: step size (learning rate)

  • x_{k+1}: updated parameter point
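As a concrete illustration, the update rule can be implemented in a few lines (a minimal sketch; the quadratic test function, step size, and iteration count are illustrative choices, not from the course):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, num_iters):
    """Run fixed-step gradient descent: x_{k+1} = x_k - eta * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(num_iters):
        x = x - eta * grad_f(x)
        history.append(x.copy())
    return history

# Toy example: f(x) = 0.5 * ||x||^2, so grad f(x) = x and x* = 0.
iters = gradient_descent(lambda x: x, x0=[4.0, -2.0], eta=0.1, num_iters=100)
print(np.linalg.norm(iters[-1]))  # distance to x*; shrinks by factor 0.9 per step
```
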


Smooth Functions

Definition:

A differentiable function f: \text{dom}(f) \to \mathbb{R} is called L-smooth over X \subseteq \text{dom}(f) if there exists L > 0 such that for all x, y \in X:

f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|x - y\|^2

💡 Smoothness means the gradient of the function does not change too quickly; the rate of change is controlled by L.


Subgradient

Definition:

For a convex function f: \mathbb{R}^d \to \mathbb{R}, a subgradient g at point x satisfies:

f(y) \geq f(x) + g^\top (y - x), \quad \forall y

The set of all subgradients is called the subdifferential:

\partial f(x) = \{ g \in \mathbb{R}^d \mid g \text{ is a subgradient of } f \text{ at } x \}

🔍 For a differentiable convex function, the subdifferential is the singleton \{\nabla f(x)\}; at points where f is not differentiable (e.g., f(x) = |x| at x = 0), the subgradient is not unique.

Subgradient Method:

The update rule is:

x_{k+1} = x_k - \eta_{k+1} g_k, \quad g_k \in \partial f(x_k)

  • Note: The subgradient method is not necessarily a descent method (e.g., for f(x) = |x| the iterates may oscillate around the minimizer).
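The oscillation mentioned in the note can be observed numerically (a small sketch with f(x) = |x| and a 1/t step size, both chosen for illustration):

```python
import numpy as np

def subgrad_abs(x):
    """A subgradient of f(x) = |x|: sign(x) away from 0; g = 0 is valid at x = 0."""
    return float(np.sign(x))

x = 0.75
fvals = []
for t in range(1, 51):
    eta = 1.0 / t                 # decreasing step size
    x = x - eta * subgrad_abs(x)
    fvals.append(abs(x))

# f(x_t) is NOT monotonically decreasing: the iterates jump across 0,
# so one tracks the best value seen so far.
print(min(fvals))
```
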


Convergence Performance Comparison Table

| Function Properties | Algorithm | Convergence Bound | Iterations (to achieve error \varepsilon) |
| --- | --- | --- | --- |
| Convex, L-Lipschitz | GD | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{RL}{\sqrt{T}} | \mathcal{O}\left( \frac{R^2 L^2}{\varepsilon^2} \right) |
| Convex, L-smooth | GD | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{R^2 L}{2T} | \mathcal{O}\left( \frac{R^2 L}{2\varepsilon} \right) |
| \mu-strongly convex, L-smooth | GD | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^T R^2 | \mathcal{O}\left( \frac{L}{\mu} \ln \frac{R^2 L}{2\varepsilon} \right) |
| Convex, L-Lipschitz | Subgradient | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{LR}{\sqrt{T}} | \mathcal{O}\left( \frac{R^2 L^2}{\varepsilon^2} \right) |
| \mu-strongly convex, \Vert g \Vert \leq B | Subgradient | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{2B^2}{\mu(T+1)} | \mathcal{O}\left( \frac{2B^2}{\mu\varepsilon} \right) |

Where:

  • R = \|x_0 - x^*\|

  • x_{\text{best}}^{(T)} = \arg\min_{i=0,\dots,T} f(x_i)


Strongly Convex Functions

Definition

A function f: \text{dom}(f) \to \mathbb{R} (with \text{dom}(f) a convex set) is called \mu-strongly convex if there exists a parameter \mu > 0 such that for all x, y \in \text{dom}(f):

f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2


Here, \nabla f(x) is the gradient of f at point x (if f is differentiable).

For non-differentiable functions, the subgradient version of the definition is: for all g \in \partial f(x) (the subdifferential), we have:

f(y) \geq f(x) + g^\top (y - x) + \frac{\mu}{2} \|y - x\|^2

💡 Intuitive explanation: at any point x, a strongly convex function lies above a "strengthened" linear approximation, i.e., the tangent plane plus the quadratic term \frac{\mu}{2} \|y - x\|^2. This extra curvature is what yields faster convergence in optimization.

Key Properties

(1) Strong Convexity Implies Strict Convexity

If f is \mu-strongly convex, then it is also strictly convex. This means for all x \neq y and \lambda \in (0,1):

f(\lambda x + (1-\lambda) y) < \lambda f(x) + (1-\lambda) f(y)

Proof outline (from supplementary notes p15):
Assume x \neq y and let z = \lambda x + (1-\lambda) y. By strong convexity (the quadratic terms are strictly positive since x \neq z and y \neq z):

\begin{aligned} f(x) &> f(z) + \nabla f(z)^\top (x - z) \\ f(y) &> f(z) + \nabla f(z)^\top (y - z) \end{aligned}

Taking the weighted average of these two inequalities with weights \lambda and 1-\lambda, the gradient terms cancel (since \lambda(x - z) + (1-\lambda)(y - z) = 0), yielding:

\lambda f(x) + (1-\lambda) f(y) > f(z)

Thus proving strict convexity.

(2) Existence of a Unique Global Minimum

A strongly convex function has exactly one global minimum point x^*.
Proof outline (from supplementary notes p15):
Assume x^* is a minimum point; then \nabla f(x^*) = 0 (differentiable case) or 0 \in \partial f(x^*) (non-differentiable case). By strong convexity:

f(y) \geq f(x^*) + \frac{\mu}{2} \|y - x^*\|^2

For y \neq x^*, \frac{\mu}{2} \|y - x^*\|^2 > 0, so f(y) > f(x^*); hence x^* is the unique minimizer.

Examples

A typical example of a strongly convex function is f(x) = e^{|x|}; per the course notes, it satisfies the strong convexity definition with \mu = 1.


Applications in Optimization

Strong convexity significantly improves the convergence rate of optimization algorithms. Discussed in two cases:

(1) Gradient Descent for Smooth and Strongly Convex Functions

If f is L-smooth and \mu-strongly convex (differentiable), then choosing step size \eta = \frac{1}{L}, the iterates of gradient descent satisfy:

\|x_{t+1} - x^*\|^2 \leq \left(1 - \frac{\mu}{L} \right) \|x_t - x^*\|^2

so the error decays geometrically:

f(x_T) - f(x^*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L} \right)^T \|x_0 - x^*\|^2

Proof:

From the gradient descent update x_{t+1} = x_t - \eta \nabla f(x_t), expand the squared distance to x^*:

\|x_{t+1} - x^*\|^2 = \|x_t - \eta \nabla f(x_t) - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2

By strong convexity (taking y = x^* in the definition):

\nabla f(x_t)^\top (x_t - x^*) \geq f(x_t) - f(x^*) + \frac{\mu}{2} \|x_t - x^*\|^2

Substitute:

\|x_{t+1} - x^*\|^2 \leq \|x_t - x^*\|^2 - 2\eta \left( f(x_t) - f(x^*) + \frac{\mu}{2} \|x_t - x^*\|^2 \right) + \eta^2 \|\nabla f(x_t)\|^2

Rearrange:

\|x_{t+1} - x^*\|^2 \leq (1 - \mu \eta) \|x_t - x^*\|^2 - 2\eta (f(x_t) - f(x^*)) + \eta^2 \|\nabla f(x_t)\|^2

Now set \eta = 1/L. To bound the gradient term, use smoothness: by the sufficient decrease lemma,

f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2

and since f(x^*) \leq f(x_{t+1}), this gives:

f(x^*) \leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2

so \|\nabla f(x_t)\|^2 \leq 2L (f(x_t) - f(x^*)). Therefore:

\eta^2 \|\nabla f(x_t)\|^2 = \frac{1}{L^2} \|\nabla f(x_t)\|^2 \leq \frac{2}{L} (f(x_t) - f(x^*))

Then:

\|x_{t+1} - x^*\|^2 \leq (1 - \mu \eta) \|x_t - x^*\|^2 - 2\eta (f(x_t) - f(x^*)) + \frac{2}{L} (f(x_t) - f(x^*))

Since \eta = 1/L, the last two terms cancel (-2\eta + \frac{2}{L} = 0), so:

\|x_{t+1} - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right) \|x_t - x^*\|^2

This proves the first part.

For the second part, smoothness (with \nabla f(x^*) = 0) gives:

f(x_T) - f(x^*) \leq \frac{L}{2} \|x_T - x^*\|^2 \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^T \|x_0 - x^*\|^2

Q.E.D.

Iteration requirement: To achieve error \varepsilon, the number of iterations needed is:

T \geq \frac{L}{\mu} \ln \left( \frac{R^2 L}{2\varepsilon} \right), \quad \text{where } R = \|x_0 - x^*\|

This is much faster than the non-strongly convex case (e.g., the \mathcal{O}(1/\varepsilon) iteration count for smooth convex functions), since the error decays geometrically: (1 - \mu/L)^T \leq e^{-\mu T / L}.
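The geometric contraction can be checked numerically on a simple quadratic (a sketch; the diagonal matrix below is an illustrative stand-in with eigenvalues exactly \mu and L):

```python
import numpy as np

mu, L = 1.0, 10.0
A = np.diag([mu, L])            # f(x) = 0.5 x^T A x is mu-SC and L-smooth; x* = 0
grad = lambda x: A @ x
eta = 1.0 / L

x = np.array([1.0, 1.0])
dists = [float(x @ x)]          # squared distance ||x_t - x*||^2
for _ in range(50):
    x = x - eta * grad(x)
    dists.append(float(x @ x))

# Every step should contract the squared distance by at least (1 - mu/L).
ratios = [d1 / d0 for d0, d1 in zip(dists, dists[1:]) if d0 > 0]
print(max(ratios) <= 1 - mu / L + 1e-12)  # → True
```
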

(2) Subgradient Method for Strongly Convex Functions

If f is \mu-strongly convex (possibly non-differentiable), and the subgradient norms are bounded (i.e., \|g_t\| \leq B for all g_t \in \partial f(x_t)), then use the decreasing step size:

\eta_t = \frac{2}{\mu(t + 1)}

and output the weighted average point:

\bar{x}_T = \frac{2}{T(T+1)} \sum_{t=1}^T t \cdot x_t

The convergence rate is:

f(\bar{x}_T) - f(x^*) \leq \frac{2B^2}{\mu(T + 1)}
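A sketch of this scheme on a toy strongly convex nonsmooth function (the objective f(x) = |x| + (\mu/2)x^2, the horizon T, and the starting point are illustrative choices; B is the subgradient bound over the region the iterates visit):

```python
import numpy as np

mu = 2.0
f = lambda x: abs(x) + 0.5 * mu * x**2          # mu-strongly convex, kink at x* = 0
subgrad = lambda x: float(np.sign(x)) + mu * x  # one valid subgradient

T = 2000
x = 3.0
weighted_sum, weight_total = 0.0, 0.0
for t in range(1, T + 1):
    eta = 2.0 / (mu * (t + 1))                  # eta_t = 2 / (mu (t + 1))
    x = x - eta * subgrad(x)
    weighted_sum += t * x                       # iterate t gets weight t
    weight_total += t                           # totals T(T+1)/2

x_bar = weighted_sum / weight_total             # = (2 / (T(T+1))) * sum_t t x_t
print(f(x_bar))                                 # optimality gap, since f(x*) = 0
```
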

Proof:

From the subgradient update x_{t+1} = x_t - \eta_t g_t with g_t \in \partial f(x_t), expand the squared distance to x^*:

\|x_{t+1} - x^*\|^2 = \|x_t - \eta_t g_t - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta_t g_t^\top (x_t - x^*) + \eta_t^2 \|g_t\|^2

By strong convexity:

g_t^\top (x_t - x^*) \geq f(x_t) - f(x^*) + \frac{\mu}{2} \|x_t - x^*\|^2

Substitute:

\|x_{t+1} - x^*\|^2 \leq \|x_t - x^*\|^2 - 2\eta_t \left( f(x_t) - f(x^*) + \frac{\mu}{2} \|x_t - x^*\|^2 \right) + \eta_t^2 \|g_t\|^2

Rearrange and use \|g_t\| \leq B:

\|x_{t+1} - x^*\|^2 \leq (1 - \mu \eta_t) \|x_t - x^*\|^2 - 2\eta_t (f(x_t) - f(x^*)) + \eta_t^2 B^2

Move terms:

2\eta_t (f(x_t) - f(x^*)) \leq (1 - \mu \eta_t) \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 + \eta_t^2 B^2

With the step size \eta_t = \frac{2}{\mu(t+1)}, note that 1 - \mu \eta_t = 1 - \frac{2}{t+1} = \frac{t-1}{t+1}.

Dividing both sides by 2\eta_t and rearranging gives:

f(x_t) - f(x^*) \leq \frac{B^2 \eta_t}{2} + \frac{\eta_t^{-1} - \mu}{2} \|x_t - x^*\|^2 - \frac{\eta_t^{-1}}{2} \|x_{t+1} - x^*\|^2

Substituting \eta_t = \frac{2}{\mu(t+1)} (so \eta_t^{-1} = \frac{\mu(t+1)}{2}):

f(x_t) - f(x^*) \leq \frac{B^2}{2} \cdot \frac{2}{\mu(t+1)} + \frac{1}{2} \left( \frac{\mu(t+1)}{2} - \mu \right) \|x_t - x^*\|^2 - \frac{1}{2} \cdot \frac{\mu(t+1)}{2} \|x_{t+1} - x^*\|^2

Simplify:

f(x_t) - f(x^*) \leq \frac{B^2}{\mu(t+1)} + \frac{\mu}{4} \left( (t+1) - 2 \right) \|x_t - x^*\|^2 - \frac{\mu(t+1)}{4} \|x_{t+1} - x^*\|^2

i.e.:

f(x_t) - f(x^*) \leq \frac{B^2}{\mu(t+1)} + \frac{\mu}{4} (t-1) \|x_t - x^*\|^2 - \frac{\mu(t+1)}{4} \|x_{t+1} - x^*\|^2

Multiply both sides by t:

t (f(x_t) - f(x^*)) \leq \frac{t B^2}{\mu(t+1)} + \frac{\mu}{4} \left( t(t-1) \|x_t - x^*\|^2 - t(t+1) \|x_{t+1} - x^*\|^2 \right)

Sum from t = 1 to T:

\sum_{t=1}^T t (f(x_t) - f(x^*)) \leq \sum_{t=1}^T \frac{t B^2}{\mu(t+1)} + \frac{\mu}{4} \sum_{t=1}^T \left[ t(t-1) \|x_t - x^*\|^2 - t(t+1) \|x_{t+1} - x^*\|^2 \right]

The second sum on the right telescopes:

\sum_{t=1}^T \left[ t(t-1) \|x_t - x^*\|^2 - t(t+1) \|x_{t+1} - x^*\|^2 \right] = - T(T+1) \|x_{T+1} - x^*\|^2 \leq 0

because the t = 1 term contributes 1 \cdot 0 \cdot \|x_1 - x^*\|^2 = 0 and consecutive terms cancel.

So, using \frac{t}{t+1} \leq 1:

\sum_{t=1}^T t (f(x_t) - f(x^*)) \leq \sum_{t=1}^T \frac{t B^2}{\mu(t+1)} \leq \frac{T B^2}{\mu}

By convexity (Jensen's inequality; the weights \frac{2t}{T(T+1)} sum to 1), the weighted average satisfies:

f\left( \frac{2}{T(T+1)} \sum_{t=1}^T t x_t \right) \leq \frac{2}{T(T+1)} \sum_{t=1}^T t f(x_t)

So:

f(\bar{x}_T) - f(x^*) \leq \frac{2}{T(T+1)} \sum_{t=1}^T t (f(x_t) - f(x^*)) \leq \frac{2}{T(T+1)} \cdot \frac{T B^2}{\mu} = \frac{2 B^2}{\mu (T+1)}

Q.E.D.

Iteration requirement: To achieve error \varepsilon, the number of iterations needed is \mathcal{O}\left( \frac{B^2}{\mu \varepsilon} \right).

🔍 Note: Weighted averaging stabilizes convergence and damps subgradient oscillation. Since the subgradient method is not a descent method, averaging (or tracking the best iterate) is needed for the guarantee.


Projected Gradient Descent

Constrained Optimization Problem

\min_{x \in X} f(x)

where X \subseteq \mathbb{R}^d is a closed convex set.


Projection Operator

The projection onto X is defined as:

\Pi_X(y) = \arg\min_{x \in X} \|x - y\|


Update Rule

\begin{aligned} y_{t+1} &= x_t - \eta_t \nabla f(x_t) \\ x_{t+1} &= \Pi_X(y_{t+1}) \end{aligned}
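As a sketch, the two-step update with X taken to be the unit ball (the objective, radius, and step size below are illustrative choices):

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

# Minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball.
c = np.array([2.0, 2.0])
grad = lambda x: 2.0 * (x - c)

x = np.zeros(2)
for _ in range(200):
    y = x - 0.1 * grad(x)       # gradient step
    x = project_ball(y)         # projection step

# The constrained minimizer is the boundary point c / ||c||.
print(np.round(x, 3))           # → [0.707 0.707]
```
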

Projection Properties

  1. (x - \Pi_X(y))^\top (y - \Pi_X(y)) \leq 0 for all x \in X

  2. \|x - \Pi_X(y)\|^2 + \|y - \Pi_X(y)\|^2 \leq \|x - y\|^2 for all x \in X

Proof:
  1. Since \Pi_X(y) minimizes \|x - y\|^2 over X, the first-order optimality condition gives, for any x \in X:

    (\Pi_X(y) - y)^\top (x - \Pi_X(y)) \geq 0

    i.e., (x - \Pi_X(y))^\top (y - \Pi_X(y)) \leq 0.

  2. Expanding the square:

    \|x - y\|^2 = \|x - \Pi_X(y) + \Pi_X(y) - y\|^2 = \|x - \Pi_X(y)\|^2 + \|\Pi_X(y) - y\|^2 + 2 (x - \Pi_X(y))^\top (\Pi_X(y) - y)

    By property 1, the cross term (x - \Pi_X(y))^\top (\Pi_X(y) - y) \geq 0, so:

    \|x - y\|^2 \geq \|x - \Pi_X(y)\|^2 + \|\Pi_X(y) - y\|^2

    which is property 2.

Convergence Rate

The convergence rate of projected gradient descent is the same as the unconstrained case, depending on function properties:

  • If f is convex and L-Lipschitz over X: error O(1/\sqrt{T}), iterations O(1/\varepsilon^2).

  • If f is convex and L-smooth over X: error O(1/T), iterations O(1/\varepsilon).

  • If f is \mu-strongly convex and L-smooth over X: error O(e^{-cT}), iterations O(\log(1/\varepsilon)).

Proof sketch: Similar to the unconstrained proofs, with the projection properties used to bound \|x_{t+1} - x^*\|^2. For example, from the update y_{t+1} = x_t - \eta \nabla f(x_t), projection property 2 (with x = x^* \in X) gives:

\|x_{t+1} - x^*\|^2 \leq \|y_{t+1} - x^*\|^2 - \|y_{t+1} - x_{t+1}\|^2 \leq \|y_{t+1} - x^*\|^2

and the rest of the analysis proceeds as in the unconstrained case.

Summary Table

| Function Properties | Algorithm | Convergence Rate | Iterations |
| --- | --- | --- | --- |
| Convex, L-Lipschitz | Gradient Descent | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{RL}{\sqrt{T}} | O(1/\varepsilon^2) |
| Convex, L-smooth | Gradient Descent | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{R^2 L}{2T} | O(1/\varepsilon) |
| Convex, L-Lipschitz | Subgradient Method | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{LR}{\sqrt{T}} | O(1/\varepsilon^2) |
| \mu-strongly convex, L-smooth | Gradient Descent | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{R^2 L}{2}\left(1 - \frac{\mu}{L}\right)^T | O(\log(1/\varepsilon)) |
| \mu-strongly convex, \Vert g \Vert \leq B | Subgradient Method | f(x_{\text{best}}^{(T)}) - f(x^*) \leq \frac{2B^2}{\mu(T+1)} | O(1/\varepsilon) |

Where R = \|x_0 - x^*\|.

Note: In practice, step size selection is crucial for convergence. For unknown parameters, adaptive step size strategies may be needed.


Additional Notes

  1. Conflict between Strong Convexity and Lipschitz Continuity:
    A function cannot be both Lipschitz and strongly convex on an unbounded domain: strong convexity forces at least quadratic growth, while Lipschitz continuity allows at most linear growth (e.g., f(x) = x^2 is strongly convex but its gradient 2x is unbounded).

  2. Relationship between Subgradient Norm and Lipschitz Continuity:

    • Lipschitz continuity \Rightarrow bounded subgradients
    • Conversely, for convex functions on an open domain, bounded subgradients also imply Lipschitz continuity (with the same constant)
  3. Optimality:
    Within the black-box first-order (gradient/subgradient) model, the rates above for the nonsmooth Lipschitz classes match known worst-case lower bounds and cannot be improved beyond constant factors.