#sdsc6015


Review - Convex Functions and Convex Optimization


Definition of Convex Functions


A function f: \mathbb{R}^d \to \mathbb{R} is convex if and only if:

  1. Its domain \text{dom}(f) is a convex set;

  2. For all \mathbf{x}, \mathbf{y} \in \text{dom}(f) and \lambda \in [0,1], it satisfies:

    f(\lambda \mathbf{x} + (1-\lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y})

Geometric Interpretation: The line segment between any two points on the function's graph lies on or above the graph.


First-Order Convexity Criterion


If f is differentiable, convexity is equivalent to:

f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}), \quad \forall \mathbf{x}, \mathbf{y} \in \text{dom}(f)

Geometric Interpretation: The function graph always lies on or above its tangent hyperplanes.


Differentiable Functions

  • Definition: f is differentiable at \mathbf{x}_0 if there exists a gradient \nabla f(\mathbf{x}_0) such that:

    f(\mathbf{x}_0 + \mathbf{h}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top \mathbf{h}

    where \mathbf{h} is an infinitesimal perturbation.

  • Globally Differentiable: If f is differentiable at every point of its domain, it is simply called differentiable, and its graph has a non-vertical tangent hyperplane at every point.


Convex Optimization Problems


Formulation:

\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})

where f is a convex differentiable function and \mathbb{R}^d is a convex set. A global minimizer is denoted \mathbf{x}^* = \arg\min f(\mathbf{x}) (it may not be unique).


Gradient Descent Method

Core Idea

Update parameters along the negative gradient direction (the gradient \nabla f(\mathbf{x}) points in the direction of fastest increase):

\mathbf{x}_{k+1} = \mathbf{x}_k - \eta_k \nabla f(\mathbf{x}_k)

where:

  • \mathbf{x}_k: current parameter vector at iteration k

  • \eta_k: step size (learning rate), controlling the update magnitude (\eta_k > 0)

  • \nabla f(\mathbf{x}_k): gradient of f at \mathbf{x}_k

  • \mathbf{x}_{k+1}: updated parameter vector

Key Notes:

  1. Gradient Direction: The negative gradient -\nabla f(\mathbf{x}_k) points in the direction of fastest function decrease;
  2. Step Size Selection:
    • A fixed step size (e.g., \eta_k = 0.1) must satisfy convergence conditions;
    • An adaptive step size (e.g., \eta_k = \frac{1}{\sqrt{k}}) can improve convergence;
  3. Geometric Meaning: Each iteration moves along the negative gradient direction by distance \eta_k \|\nabla f(\mathbf{x}_k)\|.
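The update rule above can be sketched in a few lines; the two-dimensional quadratic objective below is a hypothetical example chosen for illustration, not one from the lecture:

```python
import numpy as np

def gradient_descent(grad, x0, eta, steps):
    """Iterate x_{k+1} = x_k - eta * grad(x_k) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Example objective f(x, y) = x^2 + 2*y^2 with gradient (2x, 4y); minimizer (0, 0).
x_final = gradient_descent(lambda v: np.array([2 * v[0], 4 * v[1]]),
                           x0=[4.0, -2.0], eta=0.1, steps=200)
```

With this step size both coordinates contract geometrically, so `x_final` ends up very close to the minimizer.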

Gradient Descent Example: Quadratic Function Optimization

Consider the convex function f(x) = \frac{1}{2} x^2:

  • Minimum Point: x^* = 0 (minimum value f(0) = 0)

  • Gradient Calculation:

    \nabla f(x) = \frac{d}{dx} \left( \frac{1}{2} x^2 \right) = x

Gradient Descent Update Rule

Iteration formula with fixed step size \eta:

x_{t+1} = x_t - \eta \nabla f(x_t) = x_t - \eta x_t = x_t (1 - \eta)

  • x_t: parameter value at iteration t

  • \eta: step size (learning rate)

Closed-Form Iteration Sequence

After k iterations:

x_k = x_0 (1 - \eta)^k

Derivation:
Recursive expansion:

x_k = x_{k-1} (1 - \eta) = x_{k-2} (1 - \eta)^2 = \cdots = x_0 (1 - \eta)^k

Convergence Analysis

When 0 < \eta < 1:

\lim_{k \to \infty} x_k = \lim_{k \to \infty} x_0 (1 - \eta)^k = 0

Explanation:
Since |1 - \eta| < 1, the sequence (1 - \eta)^k decays exponentially to 0, so the iterates converge to x^* = 0.


Convergence Behavior with Specific Parameters

  • Step Size: \eta = 0.1

  • Initial Point: x_0 = 5

  • Iteration Sequence:

    x_k = 5 \times (0.9)^k

  • Convergence Process:

    • k=0: x_0 = 5
    • k=1: x_1 = 5 \times 0.9 = 4.5
    • k=10: x_{10} = 5 \times (0.9)^{10} \approx 1.74
    • k=50: x_{50} = 5 \times (0.9)^{50} \approx 0.026 (close to the minimum)


Conclusion:
This example illustrates the global convergence of gradient descent on convex functions. The step size \eta = 0.1 satisfies 0 < \eta < 1, ensuring monotonic convergence to x^* = 0.
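A minimal sketch of this iteration, checking the closed form x_k = 5 \cdot 0.9^k numerically:

```python
eta, x = 0.1, 5.0
iterates = [x]
for _ in range(50):
    x *= (1 - eta)          # x_{k+1} = x_k - eta * x_k  for f(x) = x^2 / 2
    iterates.append(x)

# Every iterate matches the closed form x_k = 5 * 0.9**k.
assert all(abs(iterates[k] - 5 * 0.9**k) < 1e-12 for k in range(51))
```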


Vanilla Analysis of Gradient Descent

Goal: Bound the error f(\mathbf{x}_t) - f(\mathbf{x}^*)

Define the gradient \mathbf{g}_t := \nabla f(\mathbf{x}_t). From the update rule \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \mathbf{g}_t:

\mathbf{g}_t = \frac{\mathbf{x}_t - \mathbf{x}_{t+1}}{\eta}

Key Equality Derivation

  1. Construct Inner Product Term:

    \mathbf{g}_t^\top (\mathbf{x}_t - \mathbf{x}^*) = \frac{1}{\eta} (\mathbf{x}_t - \mathbf{x}_{t+1})^\top (\mathbf{x}_t - \mathbf{x}^*)

  2. Apply Vector Identity (2\mathbf{v}^\top\mathbf{w} = \|\mathbf{v}\|^2 + \|\mathbf{w}\|^2 - \|\mathbf{v} - \mathbf{w}\|^2), then use \|\mathbf{x}_t - \mathbf{x}_{t+1}\|^2 = \eta^2 \|\mathbf{g}_t\|^2:

    \begin{aligned} \mathbf{g}_t^\top (\mathbf{x}_t - \mathbf{x}^*) &= \frac{1}{2\eta} \left( \|\mathbf{x}_t - \mathbf{x}_{t+1}\|^2 + \|\mathbf{x}_t - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \right) \\ &= \frac{\eta}{2} \|\mathbf{g}_t\|^2 + \frac{1}{2\eta} \left( \|\mathbf{x}_t - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \right) \end{aligned}

    Notes:

    • The term \frac{\eta}{2} \|\mathbf{g}_t\|^2 accounts for the gradient magnitude;
    • The term \frac{1}{2\eta} (\|\mathbf{x}_t - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2) is the change in squared distance to \mathbf{x}^*, and it telescopes over iterations.
  3. Sum over T Steps (the distance terms telescope):

    \sum_{t=0}^{T-1} \mathbf{g}_t^\top (\mathbf{x}_t - \mathbf{x}^*) = \frac{\eta}{2} \sum_{t=0}^{T-1} \|\mathbf{g}_t\|^2 + \frac{1}{2\eta} \left( \|\mathbf{x}_0 - \mathbf{x}^*\|^2 - \|\mathbf{x}_T - \mathbf{x}^*\|^2 \right)

Combine with First-Order Convexity Condition

From convexity: f(\mathbf{x}^*) \geq f(\mathbf{x}_t) + \mathbf{g}_t^\top (\mathbf{x}^* - \mathbf{x}_t), yielding the per-step error bound:

f(\mathbf{x}_t) - f(\mathbf{x}^*) \leq \mathbf{g}_t^\top (\mathbf{x}_t - \mathbf{x}^*)

Substituting into the summation gives the cumulative error bound:

\sum_{t=0}^{T-1} \left[ f(\mathbf{x}_t) - f(\mathbf{x}^*) \right] \leq \frac{\eta}{2} \sum_{t=0}^{T-1} \|\mathbf{g}_t\|^2 + \frac{1}{2\eta} \left( \|\mathbf{x}_0 - \mathbf{x}^*\|^2 - \|\mathbf{x}_T - \mathbf{x}^*\|^2 \right)

Key Conclusions:

  1. The cumulative error is controlled by the gradient norm sum \sum \|\mathbf{g}_t\|^2 and the initial distance \|\mathbf{x}_0 - \mathbf{x}^*\|^2;
  2. Step Size \eta Trade-off:
    • Too large: the gradient term \frac{\eta}{2} \sum \|\mathbf{g}_t\|^2 dominates and the method may diverge;
    • Too small: the distance term \frac{1}{2\eta} \|\mathbf{x}_0 - \mathbf{x}^*\|^2 dominates and convergence is slow.
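The cumulative bound can be checked numerically; the quadratic f(\mathbf{x}) = \|\mathbf{x}\|^2/2 with \mathbf{x}^* = 0 below is an illustrative choice, not from the lecture:

```python
import numpy as np

eta, T = 0.1, 50
x = np.array([3.0, -4.0])             # x* = 0 for f(x) = ||x||^2 / 2
d0 = np.sum(x ** 2)                   # ||x_0 - x*||^2
lhs = grad_sq = 0.0
for _ in range(T):
    g = x                             # gradient of f at x
    lhs += 0.5 * np.sum(x ** 2)       # accumulate f(x_t) - f(x*)
    grad_sq += np.sum(g ** 2)
    x = x - eta * g
rhs = eta / 2 * grad_sq + (d0 - np.sum(x ** 2)) / (2 * eta)
assert lhs <= rhs + 1e-9              # cumulative error bound holds
```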

Convergence Analysis for Lipschitz Convex Functions

Assume all gradients of f have bounded norm.

  • This is equivalent to f being Lipschitz continuous (see below)

  • It excludes many interesting functions (e.g., f(x) = x^2, whose gradient is unbounded)

Equivalence Proof

Let f: \mathbb{R}^d \to \mathbb{R} be convex. The following are equivalent:

  1. Bounded Gradient: \exists M > 0 such that \|\nabla f(x)\| \leq M, \ \forall x.

  2. Lipschitz Continuity: \exists L > 0 such that |f(y) - f(x)| \leq L \|y - x\|, \ \forall x, y.


Proof Outline

(⇒) Bounded Gradient ⇒ Lipschitz Continuity

Assume \|\nabla f(x)\| \leq M. By the fundamental theorem of calculus:

f(y) - f(x) = \int_0^1 \nabla f(x + t(y-x))^\top (y-x) \, dt

Take absolute values and apply Cauchy–Schwarz:

|f(y) - f(x)| \leq \int_0^1 \|\nabla f(x + t(y-x))\| \cdot \|y - x\| \, dt

Substitute the bounded gradient condition:

|f(y) - f(x)| \leq \int_0^1 M \|y - x\| \, dt = M \|y - x\|

Thus f is M-Lipschitz.

Explanation: A bounded gradient norm controls the rate of change of function values.


(⇐) Lipschitz Continuity ⇒ Bounded Gradient

Assume |f(y) - f(x)| \leq L \|y - x\| and f convex. We need \|\nabla f(x)\| \leq L.

  • If \nabla f(x) = 0: Trivial.

  • Otherwise: Set y = x + \nabla f(x).
    By the first-order convexity condition:

    f(y) \geq f(x) + \nabla f(x)^\top (y - x)

    Substitute y - x = \nabla f(x):

    f(y) - f(x) \geq \nabla f(x)^\top \nabla f(x) = \|\nabla f(x)\|^2 \quad (*)

    By Lipschitz continuity:

    f(y) - f(x) \leq L \|y - x\| = L \|\nabla f(x)\| \quad (**)

    Combine (*) and (**):

    \|\nabla f(x)\|^2 \leq L \|\nabla f(x)\|

    Dividing by \|\nabla f(x)\| > 0 gives \|\nabla f(x)\| \leq L.

Explanation: Lipschitz continuity restricts function values to linear growth, forcing a bounded gradient norm.


Key Formula Notes

  1. Fundamental Theorem Application:

    f(y) - f(x) = \int_0^1 \nabla f(x + t(y-x))^\top (y-x) \, dt

    • Meaning: Expresses the function difference as a path integral of the gradient.
  2. Cauchy–Schwarz Inequality:

    \left| \nabla f(z)^\top (y-x) \right| \leq \|\nabla f(z)\| \cdot \|y - x\|

    • Role: Converts an inner product into a product of norms for bounding the integral.
  3. First-Order Convexity Condition:

    f(y) \geq f(x) + \nabla f(x)^\top (y - x)

    • Role: Establishes a one-sided relationship between the gradient and the change in function value.

Appendix: P47 Example
For f(x) = |x|:

  • At x > 0, g_k = 1, and gradient descent may overshoot x^* = 0.
  • At x = 0, the subgradient g_k \in [-1, 1] can cause oscillation.
    Non-smooth functions require specialized methods (e.g., the subgradient method).

Theorem Setup

Let f: \mathbb{R}^d \to \mathbb{R} be convex and differentiable with:

  1. \|\mathbf{x}_0 - \mathbf{x}^*\| \leq R (bounded initial distance)

  2. \|\nabla f(\mathbf{x})\| \leq B,\ \forall \mathbf{x} (bounded gradient norm, equivalent to f being B-Lipschitz)

Convergence Proof

  1. Base Inequality (from the Vanilla Analysis, dropping the negative term -\|\mathbf{x}_T - \mathbf{x}^*\|^2):

    \sum_{t=0}^{T-1} [f(\mathbf{x}_t) - f(\mathbf{x}^*)] \leq \frac{\eta}{2} \sum_{t=0}^{T-1} \|\mathbf{g}_t\|^2 + \frac{1}{2\eta} \|\mathbf{x}_0 - \mathbf{x}^*\|^2

  2. Substitute the Boundedness Conditions:

    \sum_{t=0}^{T-1} [f(\mathbf{x}_t) - f(\mathbf{x}^*)] \leq \frac{\eta}{2} B^2 T + \frac{R^2}{2\eta}

  3. Optimize the Step Size \eta:
    Define the auxiliary function:

    h(\eta) = \frac{\eta B^2 T}{2} + \frac{R^2}{2\eta}

    • Optimal step size via the derivative:

      h'(\eta) = \frac{B^2 T}{2} - \frac{R^2}{2\eta^2} = 0 \ \Rightarrow \ \eta^* = \frac{R}{B\sqrt{T}}

    • Substitute for the minimal bound:

      h(\eta^*) = \frac{R}{B\sqrt{T}} \cdot \frac{B^2 T}{2} + \frac{R^2}{2 \cdot \frac{R}{B\sqrt{T}}} = \frac{RB\sqrt{T}}{2} + \frac{RB\sqrt{T}}{2} = RB\sqrt{T}

  4. Average Error Bound (divide by T):

    \frac{1}{T} \sum_{t=0}^{T-1} [f(\mathbf{x}_t) - f(\mathbf{x}^*)] \leq \frac{RB}{\sqrt{T}}

Convergence Rate and Iteration Count

  • Iterations for \varepsilon-accuracy:

    T \geq \frac{R^2 B^2}{\varepsilon^2} \quad \Rightarrow \quad \frac{RB}{\sqrt{T}} \leq \varepsilon

  • Advantages:

    1. Dimension Independence: The rate is independent of the dimension d
    2. Robustness: The bound holds for both the average iterate \frac{1}{T}\sum f(\mathbf{x}_t) and the best iterate \min_t f(\mathbf{x}_t)

Note: The Lipschitz condition excludes functions with unbounded gradients (e.g., f(x) = x^2), which instead require a smoothness analysis.
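A quick numerical check of the RB/\sqrt{T} bound on the 1-Lipschitz function f(x) = |x|, using \text{sign}(x) as a (sub)gradient since |x| is non-differentiable only at 0 (an illustrative choice, not the lecture's example):

```python
import math

R, B, T = 1.0, 1.0, 100
eta = R / (B * math.sqrt(T))      # optimized step size eta* = R / (B sqrt(T))
x, total_err = R, 0.0
for _ in range(T):
    total_err += abs(x)           # f(x_t) - f* with f(x) = |x|, f* = 0
    g = (x > 0) - (x < 0)         # sign(x), a valid (sub)gradient
    x -= eta * g
avg_err = total_err / T
assert avg_err <= R * B / math.sqrt(T) + 1e-9   # average error <= RB / sqrt(T)
```

Here the bound is essentially tight: the iterates walk down to 0 and then oscillate at the scale of the step size.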

Practical Advice (Unknown RR, BB)

  1. Fixed Small Step Size: Start with \eta = 0.01

  2. Dynamic Adjustment:

    • Oscillation/divergence → decrease \eta
    • Or use a decaying step size \eta_t = \frac{\eta_0}{\sqrt{t+1}}
  3. Adaptive Methods: Use adaptive optimizers such as Adam
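A sketch of the decaying schedule \eta_t = \eta_0/\sqrt{t+1} on the non-smooth function f(x) = |x| (an illustrative example; \eta_0 = 0.5 and the iteration count are arbitrary choices):

```python
import math

eta0, x = 0.5, 3.0
best = abs(x)                             # track the best function value seen
for t in range(10000):
    g = (x > 0) - (x < 0)                 # (sub)gradient sign(x) of |x|
    x -= eta0 / math.sqrt(t + 1) * g      # eta_t = eta0 / sqrt(t+1)
    best = min(best, abs(x))
# The decaying steps let the best iterate approach the minimum value 0.
assert best < 0.05
```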


Smooth Functions

Definition

A differentiable function f : \text{dom}(f) \to \mathbb{R} is smooth with parameter L > 0 on a subset X \subseteq \text{dom}(f) if for all x, y \in X:

f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|x - y\|^2

Intuition:

  • The function "does not bend too much": its growth is bounded by a quadratic function (like a paraboloid).
  • L quantifies the maximum rate of gradient change (faster gradient changes → larger L).

Geometric Interpretation

Smoothness implies: at any point x, f is upper-bounded by a quadratic function (a paraboloid-shaped graph touching it at x).


Example:
Quadratic functions (e.g., f(x) = x^2) are smooth.


Operations Preserving Smoothness

The following operations preserve smoothness under specified conditions:

Lemma 1
(i) Let f_1, \dots, f_m be smooth with parameters L_1, \dots, L_m, and let \lambda_1, \dots, \lambda_m \in \mathbb{R}^+ be weights. Then f := \sum_{i=1}^m \lambda_i f_i is smooth with parameter \sum_{i=1}^m \lambda_i L_i.
(ii) Let f be smooth with parameter L, and let g: \mathbb{R}^m \to \mathbb{R}^d be affine (i.e., g(x) = Ax + b, A \in \mathbb{R}^{d \times m}, b \in \mathbb{R}^d). Then f \circ g (x \mapsto f(Ax + b)) is smooth with parameter L \|A\|_2^2.

Here \|A\|_2 is the spectral norm of A (its largest singular value).
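Part (ii) can be sanity-checked numerically: for f(x) = \|x\|^2/2 (1-smooth), the composition h(x) = f(Ax + b) has Hessian A^\top A, whose largest eigenvalue is exactly \|A\|_2^2 (the random matrix here is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))
spec = np.linalg.norm(A, 2)                        # spectral norm ||A||_2
h_smoothness = np.linalg.eigvalsh(A.T @ A).max()   # smoothness of x -> ||Ax+b||^2/2
assert abs(h_smoothness - spec ** 2) < 1e-10       # equals ||A||_2^2, not ||A||_2
```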


Convex vs. Smooth Functions: Properties and Optimization

Basic Concept Comparison

| Property | Mathematical Definition | Geometric Meaning | Optimization Implication |
|---|---|---|---|
| Convexity | f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y) | Graph below its chords | Guarantees global optimality |
| Lipschitz Continuity | \|\nabla f(x)\| \leq B | Bounded gradient, controlled rate of change | Controls gradient descent stability |
| Smoothness | \|\nabla f(x) - \nabla f(y)\| \leq L\|x-y\| | Gently varying gradient, bounded curvature | Enables faster convergence rates |

Key Equivalence

Lemma 2: Equivalent Characterizations of Smoothness

For convex differentiable f: \mathbb{R}^d \to \mathbb{R}, the following are equivalent:

  1. f is L-smooth: f(y) \leq f(x) + \nabla f(x)^\top(y-x) + \frac{L}{2}\|x-y\|^2

  2. The gradient is L-Lipschitz: \|\nabla f(x) - \nabla f(y)\| \leq L\|x-y\|

Significance: Establishes the equivalence between smoothness of the function and Lipschitz continuity of its gradient.

Gradient Descent Analysis for Smooth Functions

Lemma 3: Sufficient Decrease

For L-smooth functions, gradient descent with \eta = \frac{1}{L} satisfies:

f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2

Proof: Apply the smoothness inequality with y = x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t):

\begin{aligned} f(x_{t+1}) &\leq f(x_t) + \nabla f(x_t)^\top(x_{t+1}-x_t) + \frac{L}{2}\|x_{t+1}-x_t\|^2 \\ &= f(x_t) - \frac{1}{L} \|\nabla f(x_t)\|^2 + \frac{1}{2L} \|\nabla f(x_t)\|^2 \\ &= f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2 \end{aligned}

Meaning: Each iteration guarantees a decrease in function value, controlled by the gradient norm.
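The sufficient-decrease inequality can be verified on a quadratic f(x) = \frac{1}{2}x^\top A x, which is L-smooth with L = \lambda_{\max}(A) (the matrix and starting point are illustrative choices):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
L = np.linalg.eigvalsh(A).max()            # smoothness constant lambda_max(A)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([2.0, -1.0])
for _ in range(20):
    g = grad(x)
    x_next = x - g / L                     # step size eta = 1/L
    # sufficient decrease: f(x_{t+1}) <= f(x_t) - ||g||^2 / (2L)
    assert f(x_next) <= f(x) - g @ g / (2 * L) + 1e-12
    x = x_next
```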

Theorem 2: Convergence Rate

For L-smooth convex f, gradient descent with \eta = \frac{1}{L} satisfies:

f(x_T) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2T}

Proof Sketch:

  1. Start from the vanilla analysis inequality

  2. Apply the sufficient decrease lemma to bound the gradient term

  3. Telescope the sum; since f(x_t) is non-increasing (by sufficient decrease), the average bound transfers to the last iterate x_T

Practical Applications and Comparisons

Convergence Rate Comparison

| Function Type | Convergence Rate | Iteration Complexity (\varepsilon-accuracy) |
|---|---|---|
| Lipschitz Convex | O(1/\sqrt{T}) | O(1/\varepsilon^2) |
| Smooth Convex | O(1/T) | O(1/\varepsilon) |

Advantage: Smoothness improves the rate from O(1/\sqrt{T}) to O(1/T), significantly boosting efficiency.

Practical Advice: Unknown L

  1. Initial Guess: Set L = \frac{2\varepsilon}{R^2} (if correct, one iteration suffices)

  2. Verification: Check f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2

  3. Doubling Strategy: If the check fails, double L and retry

  4. Total Complexity: At most O\left(\frac{4R^2L}{\varepsilon}\right) iterations for \varepsilon-accuracy

Practical Impact: Adaptive adjustment enables efficient optimization even without knowing LL.
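The doubling strategy can be sketched as follows (a minimal illustration; the quadratic objective, initial guess L0, and step count are arbitrary choices):

```python
import numpy as np

def gd_unknown_L(f, grad, x0, L0=1e-3, steps=200):
    """Gradient descent that doubles the smoothness guess L whenever the
    sufficient-decrease check f(x+) <= f(x) - ||g||^2/(2L) fails."""
    x, L = np.asarray(x0, dtype=float), L0
    for _ in range(steps):
        g = grad(x)
        while True:
            x_next = x - g / L
            if f(x_next) <= f(x) - g @ g / (2 * L):
                break
            L *= 2.0                     # guess too small: double and retry
        x = x_next
    return x, L

A = np.diag([1.0, 10.0])                 # true smoothness constant is 10
x_final, L_est = gd_unknown_L(lambda x: 0.5 * x @ A @ x,
                              lambda x: A @ x, [5.0, 5.0])
```

The estimate stops growing once it reaches the first doubled value at or above the true constant, so it overshoots by at most a factor of 2.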


Subgradient Method

Subgradient Definition

For convex f: \mathbb{R}^d \to \mathbb{R}, a subgradient at \mathbf{x} is any vector \mathbf{g} \in \mathbb{R}^d satisfying:

f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{g}^\top (\mathbf{y} - \mathbf{x}), \quad \forall \mathbf{y}

Key Properties:

  • A subgradient always exists at every point of the domain
  • If f is differentiable at \mathbf{x}, the subgradient is unique and equals \nabla f(\mathbf{x})

Subgradient Calculation Examples

Example 1: Absolute Value f(x) = |x|

  • For x \neq 0: unique subgradient g = \text{sign}(x)

  • For x = 0: any g \in [-1, 1]

Example 2: L2 Norm f(\mathbf{x}) = \|\mathbf{x}\|_2

  • For \mathbf{x} \neq \mathbf{0}: unique subgradient \mathbf{g} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}

  • For \mathbf{x} = \mathbf{0}: any \mathbf{g} in the closed unit ball \{\mathbf{z} : \|\mathbf{z}\|_2 \leq 1\}

Example 3: L1 Norm f(\mathbf{x}) = \|\mathbf{x}\|_1 = \sum_{i=1}^d |x_i|

  • For x_i \neq 0: component g_i = \text{sign}(x_i)

  • For x_i = 0: component g_i \in [-1, 1]
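These rules can be checked against the defining inequality; a sketch for the L1 norm, choosing 0 as the component subgradient wherever x_i = 0:

```python
import numpy as np

def subgrad_l1(x):
    """One valid subgradient of ||x||_1: sign(x_i), choosing 0 (a point
    in [-1, 1]) for components with x_i == 0."""
    return np.sign(x)

x = np.array([1.5, 0.0, -2.0])
g = subgrad_l1(x)
f = lambda v: np.abs(v).sum()

# Verify the subgradient inequality f(y) >= f(x) + g^T (y - x) at random y.
rng = np.random.default_rng(0)
for _ in range(100):
    y = rng.normal(size=3)
    assert f(y) >= f(x) + g @ (y - x) - 1e-12
```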

Example 4: Set Indicator Function

For a convex set X \subset \mathbb{R}^d:

1_X(\mathbf{x}) = \begin{cases} 0 & \text{if } \mathbf{x} \in X \\ +\infty & \text{if } \mathbf{x} \notin X \end{cases}

The subgradients of 1_X at a point \mathbf{x} \in X form the normal cone of X at \mathbf{x}, defined next.

Normal Cone

Definition: For a convex set X \subseteq \mathbb{R}^d and a point \mathbf{x} \in X, the normal cone is:

\mathcal{N}_X(\mathbf{x}) = \{ \mathbf{g} \in \mathbb{R}^d \mid \mathbf{g}^\top \mathbf{x} \geq \mathbf{g}^\top \mathbf{y},\ \forall \mathbf{y} \in X \}

Geometric Interpretation: It contains the outward-pointing vectors "supporting" X at \mathbf{x}; these vectors form right or obtuse angles with every feasible direction at \mathbf{x}.

Subdifferential

Definition: The subdifferential of convex f: \mathbb{R}^d \to \mathbb{R} at \mathbf{x} is the set of all subgradients:

\partial f(\mathbf{x}) = \{ \mathbf{g} \in \mathbb{R}^d \mid f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{g}^\top (\mathbf{y} - \mathbf{x}),\ \forall \mathbf{y} \in \mathbb{R}^d \}

Properties:

  • \partial f(\mathbf{x}) is closed and convex

  • If f is differentiable at \mathbf{x}, then \partial f(\mathbf{x}) = \{ \nabla f(\mathbf{x}) \}

  • Conversely, if \partial f(\mathbf{x}) = \{\mathbf{g}\} is a singleton, then f is differentiable at \mathbf{x} with \nabla f(\mathbf{x}) = \mathbf{g}

Optimality Conditions

Unconstrained Optimization

For convex f: \mathbb{R}^d \to \mathbb{R}:

\mathbf{x}^* \text{ minimizes } f \iff \mathbf{0} \in \partial f(\mathbf{x}^*)

Interpretation: \mathbf{x}^* is a minimizer iff the zero vector is a subgradient at \mathbf{x}^*.

Constrained Optimization

Consider:

\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{x} \in X

Reformulation: Introduce the indicator function 1_X(\mathbf{x}):

\min_{\mathbf{x}} \left\{ f(\mathbf{x}) + 1_X(\mathbf{x}) \right\}

Optimality Condition: For differentiable f, \mathbf{x}^* \in X is optimal iff:

\mathbf{0} \in \partial \left( f + 1_X \right)(\mathbf{x}^*) = \nabla f(\mathbf{x}^*) + \mathcal{N}_X(\mathbf{x}^*)

Equivalently:

- \nabla f(\mathbf{x}^*) \in \mathcal{N}_X(\mathbf{x}^*)

By the normal cone definition, this means:

\nabla f(\mathbf{x}^*)^\top (\mathbf{y} - \mathbf{x}^*) \geq 0, \quad \forall \mathbf{y} \in X

Geometric Interpretation: At the optimum \mathbf{x}^*, no feasible direction is a descent direction.


Subgradient Method

Basic Concepts

For a convex function f : \mathbb{R}^d \to \mathbb{R} (not necessarily differentiable), the subgradient method mimics gradient descent but uses a subgradient in place of the gradient:

x_{k+1} = x_k - \eta_k g_k

  • x_k: current point

  • g_k \in \partial f(x_k): any subgradient of f at x_k

  • \eta_k > 0: step size

  • x_{k+1}: updated next point

⚠️ Note: The subgradient method is not necessarily a descent method. For example, when f(x) = |x|, the non-smoothness can cause oscillation around the minimum.
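The oscillation can be seen directly on f(x) = |x| with a fixed step (\eta = 0.3 and x_0 = 1 are arbitrary choices); the residual error is consistent with the L^2\eta/2 term in the convergence theorem that follows:

```python
def subgrad_abs(x):
    """A subgradient of |x|: sign(x), choosing 0 at x = 0."""
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

eta, x = 0.3, 1.0
trace = [x]
for _ in range(10):
    x -= eta * subgrad_abs(x)
    trace.append(x)
# The iterates descend to 0.1, then bounce between ~0.1 and ~-0.2 forever:
# the function value stops decreasing, so this is not a descent method.
```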

Convergence Theorem

Theorem 3: Assume f is convex and L-Lipschitz (i.e., |f(x) - f(y)| \leq L \|x - y\|).

  1. Fixed step size \eta_k = \eta (for all k):

    \lim_{k \to \infty} f(x_{\text{best}}^{(k)}) \leq f^* + \frac{L^2 \eta}{2}

  2. Diminishing step size (satisfying \sum_{k=1}^\infty \eta_k^2 < \infty and \sum_{k=1}^\infty \eta_k = \infty):

    \lim_{k \to \infty} f(x_{\text{best}}^{(k)}) = f^*

Where:

  • f(x_{\text{best}}^{(k)}) = \min_{i=0,\dots,k} f(x_i) is the best value up to step k
  • f^* = f(x^*) is the true minimum value

Key Steps of the Proof

From the definition of the subgradient, we can derive the basic recursion:

\|x_k - x^*\|^2 \leq \|x_{k-1} - x^*\|^2 - 2\eta_k (f(x_{k-1}) - f^*) + \eta_k^2 \|g_{k-1}\|^2

Iterating this inequality and rearranging (with R = \|x_0 - x^*\| and \|g_i\| \leq L by the Lipschitz assumption), we get:

0 \leq R^2 - 2 \sum_{i=1}^k \eta_i (f(x_{i-1}) - f^*) + L^2 \sum_{i=1}^k \eta_i^2

After introducing f(x_{\text{best}}^{(k)}), we obtain the basic inequality:

f(x_{\text{best}}^{(k)}) - f^* \leq \frac{R^2 + L^2 \sum_{i=1}^k \eta_i^2}{2 \sum_{i=1}^k \eta_i}

The convergence results for the different step size strategies follow from this.

Convergence Rate with Fixed Step Size

If the step size is fixed at \eta, then:

f(x_{\text{best}}^{(k)}) - f^* \leq \frac{R^2}{2k\eta} + \frac{L^2 \eta}{2}

To achieve accuracy f(x_{\text{best}}^{(k)}) - f^* \leq \varepsilon, make each term \leq \varepsilon/2 and solve:

  • Optimal step size: \eta = \frac{\varepsilon}{L^2}

  • Required number of iterations: k = \frac{R^2 L^2}{\varepsilon^2}

Therefore, the iteration complexity of the subgradient method is O\left(\frac{1}{\varepsilon^2}\right).

📉 Comparison: Gradient descent on smooth convex functions needs only O\left(\frac{1}{\varepsilon}\right) iterations, so the subgradient method converges more slowly.

Summary of Algorithm Convergence

| Function Properties | Algorithm | Convergence Bound | Number of Iterations |
|---|---|---|---|
| Convex, L-Lipschitz | Subgradient | f(x_{\text{best}}^{(T)}) - f^* \leq \frac{LR}{\sqrt{T}} | O\left(\frac{R^2 L^2}{\varepsilon^2}\right) |
| Convex, L-Smooth | Gradient Descent | f(x_T) - f^* \leq \frac{R^2 L}{2T} | O\left(\frac{R^2 L}{\varepsilon}\right) |
| Convex, L-Lipschitz | Gradient Descent | f(x_{\text{best}}^{(T)}) - f^* \leq \frac{RL}{\sqrt{T}} | O\left(\frac{R^2 L^2}{\varepsilon^2}\right) |

Where:

  • T: time horizon / number of iterations
  • R = \|x_0 - x^*\|: distance from the initial point to the optimum
  • x_{\text{best}}^{(T)} = \arg\min_{i=0,1,\dots,T} f(x_i): best point among the first T iterates

Summary


Convex Function Definition

f: \mathbb{R}^d \to \mathbb{R} is convex iff:

  1. \text{dom}(f) is convex;

  2. \forall \mathbf{x}, \mathbf{y} \in \text{dom}(f), \lambda \in [0,1]:

    f(\lambda \mathbf{x} + (1-\lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y})

Geometry: The line segment between graph points lies on or above the graph.


First-Order Convexity

If f is differentiable, convexity is equivalent to:

f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x})

Geometry: Graph above its tangent hyperplanes.


Differentiable Functions

  • Definition: f is differentiable at \mathbf{x}_0 if \exists \nabla f(\mathbf{x}_0):

    f(\mathbf{x}_0 + \mathbf{h}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top \mathbf{h}

  • Global Differentiability: Differentiable everywhere → non-vertical tangents everywhere.


Convex Optimization

Formulation:

\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})

f convex and differentiable, \mathbb{R}^d convex. Minimizer \mathbf{x}^* = \arg\min f(\mathbf{x}) (may not be unique).


Gradient Descent

Core Idea

Update with the negative gradient:

\mathbf{x}_{k+1} = \mathbf{x}_k - \eta_k \nabla f(\mathbf{x}_k)

  • \eta_k: step size

  • \mathbf{x}_k: current parameters

Convergence Analysis

Lipschitz Convex Functions (Bounded Gradient)

Assumptions: \|\nabla f(\mathbf{x})\| \leq B, \|\mathbf{x}_0 - \mathbf{x}^*\| \leq R
Step Size: \eta = \frac{R}{B\sqrt{T}}
Convergence Rate:

\frac{1}{T} \sum_{t=0}^{T-1} [f(\mathbf{x}_t) - f(\mathbf{x}^*)] \leq \frac{RB}{\sqrt{T}}

Advantage: Dimension-independent.
Practice: Unknown B, R → try a small \eta (e.g., 0.01) or adaptive methods.

Smooth Convex Functions (Lipschitz Gradient)

Definition: f is L-smooth if:

f(\mathbf{y}) \leq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y}-\mathbf{x}) + \frac{L}{2} \|\mathbf{x}-\mathbf{y}\|^2

Equivalently, \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L \|\mathbf{x}-\mathbf{y}\|.
Step Size: \eta = \frac{1}{L}
Convergence Rate:

f(\mathbf{x}_T) - f(\mathbf{x}^*) \leq \frac{L \|\mathbf{x}_0 - \mathbf{x}^*\|^2}{2T}

Comparison: O(1/\varepsilon) iterations for smooth vs. O(1/\varepsilon^2) for Lipschitz.
Practice: Unknown L → doubling strategy with verification of f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_t) - \frac{1}{2L} \|\nabla f(\mathbf{x}_t)\|^2.


Subgradient Method

Subgradient Definition

For convex f, \mathbf{g} \in \partial f(\mathbf{x}) satisfies:

f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{g}^\top (\mathbf{y} - \mathbf{x}), \quad \forall \mathbf{y}

Key Properties:

  • \partial f(\mathbf{x}) = \{\nabla f(\mathbf{x})\} if f is differentiable at \mathbf{x};
  • Optimality: \mathbf{x}^* is a minimizer iff \mathbf{0} \in \partial f(\mathbf{x}^*).

Update Rule

\mathbf{x}_{k+1} = \mathbf{x}_k - \eta_k \mathbf{g}_k, \quad \mathbf{g}_k \in \partial f(\mathbf{x}_k)

Note: Not a descent method (e.g., it oscillates for f(x) = |x|).

Convergence

Assumptions: f convex and L-Lipschitz

  • Fixed \eta: limiting error \leq f^* + L^2\eta/2

  • Diminishing \eta (\sum \eta_k = \infty, \sum \eta_k^2 < \infty): asymptotic convergence to f^*
    Convergence Rate:

f(\mathbf{x}_{\text{best}}^{(k)}) - f^* \leq \frac{R^2 + L^2 \sum_{i=1}^k \eta_i^2}{2 \sum_{i=1}^k \eta_i}

where R = \|\mathbf{x}_0 - \mathbf{x}^*\| and f(\mathbf{x}_{\text{best}}^{(k)}) = \min_{i=0,\dots,k} f(\mathbf{x}_i).