#sdsc6015


Course Introduction and Preliminary Stochastic Optimization


Main Problem

Given labeled training data $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathcal{Y}$,
find weights $\theta$ to minimize:

$$\min_{\theta} f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, (x_i, y_i)), \quad n \text{ extremely large}$$

Objective: Quantify model prediction error through the loss function $\ell$ and optimize the parameters $\theta$.

Supplementary Notes

Mathematical Notation Explanation:

  • $\theta \in \mathbb{R}^m$: Parameter vector to optimize (e.g., weights in linear models, neural network parameters)

  • $\ell(\theta, (x_i, y_i))$: Loss function, quantifying the discrepancy between the model prediction $\hat{y}_i = g_\theta(x_i)$ and the true label $y_i$

  • $\frac{1}{n} \sum_{i=1}^n$: Empirical risk, the average loss over the training set

Practical Significance:

This optimization problem is called Empirical Risk Minimization (ERM):

  • Core Idea: Generalize the model to new data by minimizing average loss on the training set
  • Challenge: High computational cost for gradient calculation when $n$ is extremely large
  • Solution: Stochastic optimization algorithms (e.g., SGD) approximate gradients using subsets of data per iteration

Soft Margin Support Vector Machine (SVM)

[Figure: soft-margin SVM illustration]

$$\min_{w,b} \left\{ \frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i (w^\top x_i - b)\right) + \lambda \|w\|^2 \right\}$$

  • $x_i$: Feature vector

  • $y_i \in \{-1, 1\}$: Class label

  • $\lambda > 0$: Regularization coefficient

  • $w, b$: Parameters to learn

Hinge Loss: $\max(0, 1 - y_i (w^\top x_i - b))$ penalizes misclassification and margin violations.
Regularization: $\lambda \|w\|^2$ controls model complexity to prevent overfitting.
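A minimal NumPy sketch of this objective (not from the course materials; the helper name `svm_objective` and the toy data are illustrative assumptions):

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Soft-margin SVM objective: average hinge loss plus lam * ||w||^2."""
    margins = 1.0 - y * (X @ w - b)           # 1 - y_i (w^T x_i - b)
    hinge = np.maximum(0.0, margins)          # hinge loss per sample
    return hinge.mean() + lam * np.dot(w, w)

# Toy usage: two 2-D points, one per class
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```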

Supplementary Notes

Mathematical Derivation:

  1. Origin of Hinge Loss:

    • Ideal constraint: $y_i(w^\top x_i - b) \geq 1$ (all samples correctly classified with margin ≥ 1)
    • Introduce slack variables $\xi_i \geq 0$ to allow constraint violation: $y_i(w^\top x_i - b) \geq 1 - \xi_i$
    • Minimize the violation degree $\sum_i \xi_i$ → $\max(0, 1 - y_i(w^\top x_i - b))$

    Geometric Interpretation: The distance from a sample to the decision boundary is $d = \frac{|w^\top x_i - b|}{\|w\|}$; the sample is penalized when $d < \frac{1}{\|w\|}$

  2. Derivation of Regularization Term:

    • Margin size: $M = \frac{2}{\|w\|}$
    • Maximizing the margin is equivalent to minimizing $\|w\|^2$ (since $\arg\max M = \arg\min \|w\|$)

    Practical Meaning: $\lambda$ balances classification error and model complexity

    • $\lambda \to 0$: Emphasizes the hinge loss → strict classification (small margin)
    • $\lambda \to \infty$: Emphasizes the regularizer → allows more misclassification (large margin)

Extended Notation Explanation:

  • $w^\top x - b = 0$: Decision hyperplane (boundary separating the classes)

  • $\frac{w}{\|w\|}$: Unit normal vector (determines the orientation of the hyperplane)

  • $b$: Intercept term (adjusts the hyperplane position)

Example:
In 2D space:

  • The decision boundary is the line $w_1x_1 + w_2x_2 - b = 0$

  • The positive class ($y_i=1$) lies on one side, the negative class ($y_i=-1$) on the other

  • The hinge loss activates only when samples fall inside the “margin zone”


Example: Image Denoising

Total Variation Denoising Model:

$$\min_{X} \sum_{(i,j) \in P} |X_{ij} - O_{ij}|^2 + \lambda \cdot TV(X)$$

  • $X$: Image matrix to recover

  • $O$: Noisy observed image

  • $P$: Set of observed pixel indices

  • $TV(X)$: Total variation regularization term:

    $$TV(X) = \sum_{i,j} |X_{i+1,j} - X_{ij}| + |X_{i,j+1} - X_{ij}|$$

Data Fidelity Term: $\sum_{(i,j) \in P} |X_{ij} - O_{ij}|^2$ ensures the recovered image resembles the observations.
Smoothness Constraint: $TV(X)$ promotes piecewise smoothness by penalizing adjacent pixel differences.
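A minimal NumPy sketch of the objective above, assuming every pixel is observed (the helper names `total_variation` and `tv_objective` are illustrative, not part of the notes):

```python
import numpy as np

def total_variation(X):
    """Anisotropic TV: sum of |vertical differences| + |horizontal differences|."""
    return np.abs(np.diff(X, axis=0)).sum() + np.abs(np.diff(X, axis=1)).sum()

def tv_objective(X, O, lam):
    """Data-fidelity term + lam * TV(X), assuming all pixels are observed."""
    return ((X - O) ** 2).sum() + lam * total_variation(X)
```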

Supplementary Notes

Mathematical Derivation:

  1. Physical Meaning of $TV(X)$:

    • $|X_{i+1,j} - X_{ij}|$ is the vertical gradient (intensity difference with the lower neighbor)
    • $|X_{i,j+1} - X_{ij}|$ is the horizontal gradient (intensity difference with the right neighbor)

    Core Idea: Natural images exhibit “piecewise smooth” properties (similar adjacent pixels), while noise disrupts continuity

  2. Optimization Objective Decomposition:

    • Data Fidelity Term: Minimize $\|X - O\|_F^2$ (squared Frobenius norm) → maintain similarity to the observed image
    • Regularization Term: Minimize the image gradient $\|\nabla X\|_1$ (L1 norm of the gradient) → promote smoothness

    $\lambda$ controls the smoothness strength: larger $\lambda$ → smoother result, but details may blur

Extended Notation Explanation:

  • $X_{ij} \in [0,255]$: Pixel intensity (grayscale value) at position $(i,j)$

  • $(\nabla X)_{ij} = \begin{bmatrix} X_{i+1,j}-X_{ij} \\ X_{i,j+1}-X_{ij} \end{bmatrix}$: Discrete gradient operator

  • L1 norm $\|\nabla X\|_1$: Preserves edges better than an L2 penalty on the gradient (which over-smooths)

Practical Significance:

  • Medical Imaging: Remove Gaussian noise in CT scans while preserving organ boundaries

  • Satellite Imagery: Eliminate cloud interference and restore surface textures

  • Comparison with Gaussian Filtering: TV denoising preserves edges (e.g., building contours), while Gaussian filtering blurs edges

Example:
Assume the 2×2 noisy image $O = \begin{bmatrix} 50 & 100 \\ 30 & 80 \end{bmatrix}$:

  • Compute $TV(O) = |50-30| + |50-100| + |30-80| + |100-80| = 20+50+50+20 = 140$

  • The denoised image $X$ has a significantly reduced $TV(X)$ (e.g., 40) while $\|X-O\|^2$ remains small
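A quick self-contained check of this hand computation (assumed toy code, not from the notes):

```python
import numpy as np

O = np.array([[50.0, 100.0], [30.0, 80.0]])
tv = np.abs(np.diff(O, axis=0)).sum() + np.abs(np.diff(O, axis=1)).sum()
print(tv)   # 140.0, matching the hand computation above
```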


Example: Neural Networks

Convolutional Neural Network (CNN):

[Figure: convolutional neural network architecture]

$$\min_{w,b} f(w,b) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, b, (x_i, y_i))$$

where

$$\ell(w, b, (x_i, y_i)) = \left\| w_3^\top \cdot Q(w_2^\top \cdot Q(w_1^\top x_i - b_1) - b_2) - y_i \right\|^2$$

  • $Q(w)_i = \max(0, w_i)$: ReLU activation function

  • $w_1, w_2, w_3$: Weight matrices

  • $b_1, b_2$: Bias vectors

  • $x_i$: Input data, $y_i$: Target output

ReLU Activation: $Q(\cdot)$ outputs 0 for negative inputs, introducing nonlinearity.
Hierarchical Structure: Nested transformations $w_k^\top \cdot Q(\cdot) - b_k$ model complex feature interactions.
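A minimal NumPy sketch of this loss for one sample (the function names and toy shapes are illustrative assumptions; a real CNN would use convolutions rather than dense matrices):

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: Q(z)_i = max(0, z_i)."""
    return np.maximum(0.0, z)

def loss(w1, b1, w2, b2, w3, x, y):
    """Squared error of the two-hidden-layer network in the formula above."""
    a1 = relu(w1.T @ x - b1)        # hidden layer 1
    a2 = relu(w2.T @ a1 - b2)       # hidden layer 2
    y_hat = w3.T @ a2               # output layer
    return float(np.sum((y_hat - y) ** 2))

# Toy shapes: d0 = 4 inputs, d1 = d2 = 3 hidden units, d3 = 2 outputs
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
w2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
w3 = rng.normal(size=(3, 2))
print(loss(w1, b1, w2, b2, w3, rng.normal(size=4), np.array([1.0, 0.0])))
```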

Supplementary Notes

Mathematical Derivation:

  1. Forward Propagation Process:

    • Input Layer: $x_i \in \mathbb{R}^{d_0}$ (raw input, e.g., image pixels)
    • Hidden Layer 1: $z^{(1)} = w_1^\top x_i - b_1$ (linear transformation)
      $a^{(1)} = Q(z^{(1)})$ (ReLU activation, $\max(0, z^{(1)})$)
    • Hidden Layer 2: $z^{(2)} = w_2^\top a^{(1)} - b_2$
      $a^{(2)} = Q(z^{(2)})$
    • Output Layer: $\hat{y}_i = w_3^\top a^{(2)}$ (predicted value)

    $\ell = \|\hat{y}_i - y_i\|^2$ is the squared error between the prediction and the true value

  2. Mathematical Properties of ReLU:

    $$Q(z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$

    Advantages:

    • Mitigates the vanishing gradient problem (gradient = 1 in the positive region)
    • Sparse activation (roughly 50% of neurons active for zero-centered inputs)

Extended Notation Explanation:

  • $w_k \in \mathbb{R}^{d_{k-1} \times d_k}$: Weight matrix ($d_k$ = number of neurons in layer $k$), so that $w_k^\top$ maps $\mathbb{R}^{d_{k-1}}$ to $\mathbb{R}^{d_k}$

  • $b_k \in \mathbb{R}^{d_k}$: Bias vector (shifts the decision boundary)

  • $Q(\cdot)$: Element-wise operation (applies ReLU independently to each element)

Practical Significance:

  • Feature Extraction:

    • Shallow layers learn edges/textures ($w_1$)
    • Middle layers learn parts/shapes ($w_2$)
    • Deep layers learn semantic concepts ($w_3$)
  • Optimization Challenges:

    • Non-convex objective (multiple local optima)
    • Gradient computation requires backpropagation (chain rule)

Example (Digit Recognition):
Input $x_i$ is a 28×28 handwritten digit image ($d_0=784$):

  1. $w_1$: Convolutional kernels extract edges (output $d_1=32$ feature maps)

  2. $Q$: Retains positive activations (enhances feature contrast)

  3. $w_2$: Combines edges into digit components (e.g., the horizontal stroke of “7”)

  4. $w_3$: Combines components into digits 0-9 (output $d_3=10$)


Simple Numerical Example: Least Squares Regression

Problem Setup

$$\min_{x} f(x) := \frac{1}{2} \|Ax - y\|^2$$

  • $A \in \mathbb{R}^{n \times d}$: Data matrix

  • $y \in \mathbb{R}^{n}$: Observation vector

  • $x \in \mathbb{R}^{d}$: Variable to optimize

Gradient Descent (GD) vs. Stochastic Gradient Descent (SGD)

| Method | Update Rule | Gradient Calculation | Per-Step Cost |
|---|---|---|---|
| GD | $x_{t+1} = x_t - \eta \nabla f(x_t)$ | $\nabla f(x) = A^\top (Ax - y)$ | $O(nd)$ |
| SGD | $x_{t+1} = x_t - \eta \nabla f_i(x_t)$ | $\nabla f_i(x) = a_i^\top (a_i x - y_i)$, where $a_i$ is the $i$-th row of $A$ and $i$ is sampled uniformly at each step | $O(d)$ |

Core Idea:

  • GD uses all $n$ data points per step → high computation cost but stable convergence.
  • SGD uses one randomly sampled data point $(a_i, y_i)$ per step → computationally efficient for large-scale data, but noisy convergence.

Simulation Parameters

  • Data: $A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 1 \end{bmatrix}$, $y = \begin{bmatrix} 1.5 \\ 3.5 \\ 2.0 \end{bmatrix}$ ($n=3$, $d=2$)

  • Step Size: $\eta = 0.1$

  • Initial Value: $x_0 = (0, 0)$

[Figures: simulation results comparing GD and SGD on this problem]
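A minimal NumPy sketch reproducing this comparison with the parameters above (illustrative code, not the course's implementation):

```python
import numpy as np

# Data and parameters from the simulation above
A = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]])
y = np.array([1.5, 3.5, 2.0])
eta = 0.1
n, d = A.shape

rng = np.random.default_rng(0)
x_gd = np.zeros(d)
x_sgd = np.zeros(d)
for t in range(100):
    # GD: full gradient A^T (A x - y), cost O(nd)
    x_gd = x_gd - eta * A.T @ (A @ x_gd - y)
    # SGD: gradient of one randomly sampled component, cost O(d)
    i = rng.integers(n)
    x_sgd = x_sgd - eta * A[i] * (A[i] @ x_sgd - y[i])

print("GD iterate: ", x_gd)
print("SGD iterate:", x_sgd)
print("Least-squares solution:", np.linalg.lstsq(A, y, rcond=None)[0])
```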

Key Takeaways

  • GD: Smooth convergence but high per-iteration cost.

  • SGD: Low per-step cost; often converges faster in practice for large-scale problems (despite its noisy path).

Preliminaries of Stochastic Optimization


Cauchy-Schwarz Inequality

For any vectors $u, v \in \mathbb{R}^d$:

$$|u^\top v| \leq \|u\| \|v\|$$

Notation Explanation:

  • $u = \begin{pmatrix} u_1 \\ \vdots \\ u_d \end{pmatrix}$, $v = \begin{pmatrix} v_1 \\ \vdots \\ v_d \end{pmatrix}$: $d$-dimensional real column vectors

  • $u^\top v = \sum_{i=1}^d u_i v_i$: Scalar product (inner product)

  • $\|u\| = \sqrt{u^\top u} = \sqrt{\sum_{i=1}^d u_i^2}$: Euclidean norm

Geometric Interpretation:

  • For unit vectors ($\|u\|=\|v\|=1$), equality holds iff $u=v$ or $u=-v$
  • Defines the angle $\alpha$ between vectors: $\cos \alpha = \frac{u^\top v}{\|u\|\|v\|}$, satisfying $-1 \leq \cos \alpha \leq 1$
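A quick numerical check with assumed toy vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, 1.0])
lhs = abs(u @ v)                              # |u^T v| = 4
rhs = np.linalg.norm(u) * np.linalg.norm(v)   # 3 * sqrt(5) ≈ 6.71
print(lhs <= rhs)                             # True
print((u @ v) / rhs)                          # cos(alpha) ≈ 0.60, inside [-1, 1]
```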

[Figure: geometric interpretation of the inner product and vector angle]

Examples for unit vectors ($\|u\| = \|v\| = 1$)

[Figure: examples for unit vectors]

Equality in Cauchy-Schwarz holds if and only if $u = v$ or $u = -v$.

Hölder’s Inequality

For $p,q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$, and any vectors $x,y$:

$$\sum_{i=1}^n |x_i y_i| \leq \left( \sum_{i=1}^n |x_i|^p \right)^{1/p} \left( \sum_{i=1}^n |y_i|^q \right)^{1/q}$$

Related Concepts:

  • $\ell_p$-norm: $\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$

  • Larger $p$ gives higher weight to large components

  • Cauchy-Schwarz is the special case $p=q=2$

Core Idea: Quantify upper bounds on vector interactions via the $\ell_p$ and $\ell_q$ norms.
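A quick numerical check with assumed toy vectors and the conjugate pair $p=3$, $q=3/2$:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.5, 1.0, -1.0])
p, q = 3.0, 1.5                                    # conjugate exponents: 1/p + 1/q = 1
lhs = np.abs(x * y).sum()                          # sum of |x_i y_i| = 5.5
rhs = (np.abs(x) ** p).sum() ** (1 / p) * (np.abs(y) ** q).sum() ** (1 / q)
print(lhs, rhs, lhs <= rhs)                        # ..., ..., True
```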


Convex Sets

A set $C$ is convex iff the line segment between any two points lies within $C$:

$$\forall x,y \in C, \ \forall \lambda \in [0,1]: \quad \lambda x + (1-\lambda)y \in C$$

[Figure: examples of convex and non-convex sets]

Left: Convex set
Middle: Non-convex set (segment not fully contained)
Right: Non-convex set (missing boundary points)

Properties of Convex Sets

  1. Closed under intersection: Intersection of convex sets is convex

    $$\bigcap_{i \in I} C_i \text{ convex}, \quad \forall \text{ convex sets } C_i$$

[Figure: intersection of convex sets]

  2. Unique projection: For a non-empty closed convex set $C$, the projection operator $P_C(x) = \arg\min\limits_{y \in C} \|y - x\|$ has a unique solution

[Figure: projection onto a convex set]

  3. Non-expansiveness:

    $$\|P_C(x) - P_C(y)\| \leq \|x - y\|, \quad \forall x,y$$

[Figure: non-expansiveness of the projection]
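A minimal sketch illustrating properties 2 and 3 for the Euclidean unit ball, where the projection has a closed form (the helper `project_ball` is an illustrative assumption):

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the closed ball {y : ||y|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

x = np.array([3.0, 0.0])
y = np.array([0.0, 4.0])
px, py = project_ball(x), project_ball(y)
# Non-expansiveness: ||P_C(x) - P_C(y)|| <= ||x - y||
print(np.linalg.norm(px - py), "<=", np.linalg.norm(x - y))
```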


Lipschitz Continuity

A function $f: \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz continuous if:

$$|f(x) - f(y)| \leq L\|x - y\|, \quad \forall x,y \in \mathbb{R}^d$$

Meaning of $L$: Upper bound on the rate of change of the function
Examples: $f(x) = \sqrt{x^2 + 5}$, $f(x) = \|x\|$


Differentiable Functions

A function $f: \mathbb{R}^d \to \mathbb{R}$ is differentiable at $x_0$ if there exists a gradient $\nabla f(x_0)$ such that:

$$f(x_0 + h) \approx f(x_0) + \nabla f(x_0)^\top h$$

[Figure: tangent hyperplane of a differentiable function]

where $h$ is a small perturbation. If $f$ is differentiable everywhere, its graph has a tangent hyperplane at every point.

Non-Differentiable Function Example

$$f(x) = \|x\| \quad \text{(Euclidean norm)}$$

[Figure: graph of the Euclidean norm]

Characteristic: Non-differentiable at origin (“ice cream cone” shape)



L-Smoothness

A function $f: \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if:

  1. $f$ is continuously differentiable (the gradient $\nabla f$ exists and is continuous)

  2. The gradient satisfies the $L$-Lipschitz condition:

    $$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \quad \forall x,y$$

Gradient Definition:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_d}(x) \right)$$

Example: $f(x) = \frac{1}{2}\|x\|^2$ is $1$-smooth

Supplementary Notes

Mathematical Derivation:

  1. Computing Lipschitz Constant:

    $$L = \sup_{x \neq y} \frac{|f(x) - f(y)|}{\|x - y\|}$$

    Maximum secant slope between any two points

  2. Lipschitz Condition for Differentiable Functions:
    If $f$ is differentiable, then $L = \sup_{x} \|\nabla f(x)\|$

    Proof: By the mean value theorem, $|f(x)-f(y)| = |\nabla f(\xi)^\top (x-y)| \leq \|\nabla f(\xi)\|\|x-y\|$ for some $\xi$ on the segment between $x$ and $y$ (the last step uses Cauchy-Schwarz)

Extended Notation Explanation:

  • $\|x - y\|$: Euclidean distance between points in $\mathbb{R}^d$

  • $L$: Lipschitz constant (upper bound on the rate of change)

  • $|f(x)-f(y)|$: Absolute change in the function value

Practical Significance:

  • Optimization Algorithm Stability:

    • In gradient descent, $L$ determines the maximum stable step size ($\eta < 2/L$)
    • Ensures convergence (prevents oscillation or divergence)
  • Neural Network Training:

    • Lipschitz constraints on weight matrices control model sensitivity
    • Enhance adversarial robustness (resist input perturbations)

Example Verification:

  1. $f(x) = \|x\|$:

    $$| \|x\| - \|y\| | \leq \|x - y\| \quad (\text{triangle inequality})$$

    $L=1$

  2. $f(x) = \sqrt{x^2 + 5}$:

    $$\left| \frac{d}{dx}\sqrt{x^2+5} \right| = \frac{|x|}{\sqrt{x^2+5}} \leq 1$$

    $L=1$

  3. Non-Lipschitz Function: $f(x) = x^2$

    $$\frac{|x^2 - y^2|}{|x-y|} = |x+y| \xrightarrow{|x|,|y|\to\infty} \infty$$

    → No global Lipschitz constant
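A quick numerical check of examples 2 and 3 on randomly sampled points (illustrative code; the sampling scale is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(scale=50.0, size=10000)
ys = rng.normal(scale=50.0, size=10000)

f = lambda t: np.sqrt(t ** 2 + 5)   # example 2: L = 1
g = lambda t: t ** 2                # example 3: no global Lipschitz constant

print(np.max(np.abs(f(xs) - f(ys)) / np.abs(xs - ys)))  # stays below 1
print(np.max(np.abs(g(xs) - g(ys)) / np.abs(xs - ys)))  # ~ max |x + y|, grows with the range
```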

Geometric Interpretation:

  • The graph of an $L$-Lipschitz function is confined within a “slope cone” of slope $L$ at every point

  • The vertical distance between any two points on the graph is at most $L \times$ the horizontal distance


Convex Functions


Definition and Geometric Interpretation

A function $f: \mathbb{R}^d \to \mathbb{R}$ is convex if:

  1. Its domain $\text{dom}(f)$ is convex

  2. It satisfies the convexity inequality:

    $$\forall x,y \in \text{dom}(f), \ \forall \lambda \in [0,1]: \quad f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$$

Geometric Interpretation: The line segment (chord) between any two points on the graph lies above the graph.
[Figure: chord above the graph of a convex function]

Supplementary Notes

Mathematical Notation Explanation:

  • $\lambda \in [0,1]$: Interpolation parameter, controlling the linear combination ratio

  • $\lambda x + (1-\lambda)y$: Convex combination, a point on the segment connecting $x$ and $y$

  • $f(\lambda x + (1-\lambda)y)$: Function value at the convex combination point

  • $\lambda f(x) + (1-\lambda)f(y)$: Linear interpolation between $f(x)$ and $f(y)$

Key Property Derivation:

The convexity inequality is equivalent to:

$$\frac{f(y) - f(x)}{\|y - x\|} \geq \frac{f(z) - f(x)}{\|z - x\|}, \quad \forall z \in [x,y], \ z \neq x$$

i.e., the average growth rate from $x$ is non-decreasing along any segment $[x,y]$

Practical Significance:

  • Optimization Advantage: Local minima of convex functions are global minima

  • Economic Decisions: Model diminishing marginal returns (e.g., production cost functions)

  • Physical Systems: The spring potential $f(x) = \frac{1}{2}kx^2$ is convex ($k>0$)

Geometric Properties:

  1. Chord Above Graph:

    • For any $x,y$, the segment connecting $(x,f(x))$ and $(y,f(y))$ always lies on or above the function graph, i.e., above $f(\lambda x + (1-\lambda)y)$ for every $\lambda \in [0,1]$
  2. No Local Dips:

    • The curve has no concave “bumps” (e.g., $f(x) = -x^2$ is not convex)
    • For twice differentiable functions, this is equivalent to $\nabla^2 f(x) \succeq 0$ (Hessian positive semi-definite)

Example Verification:

  1. Convex Function: $f(x) = x^2$

    $f(\lambda x + (1-\lambda)y) = [\lambda x + (1-\lambda)y]^2$

    $\lambda f(x) + (1-\lambda)f(y) = \lambda x^2 + (1-\lambda)y^2$

    Difference: $\lambda f(x) + (1-\lambda)f(y) - f(\lambda x + (1-\lambda)y) = \lambda(1-\lambda)(x-y)^2 \geq 0$ → satisfies the convexity inequality

  2. Non-Convex Function: $f(x) = \sin(x)$ (on $[0,2\pi]$)

    Take $x=0$, $y=\pi$, $\lambda=0.5$:
    $f(0.5 \times 0 + 0.5 \times \pi) = \sin(\pi/2) = 1$
    $0.5\sin(0) + 0.5\sin(\pi) = 0$
    Since $1 > 0$, the convexity inequality is violated

Metaphorical Understanding:

Imagine convex functions as “valleys”:

  • Cable car line (chord) between any two points stays above valley floor (function graph)
  • Water droplets sliding down converge to the lowest point (global optimum)

Common Convex Function Examples

  1. Linear Function: $f(x) = a^\top x$

  2. Affine Function: $f(x) = a^\top x + b$

  3. Exponential Function: $f(x) = e^{\alpha x}$

  4. Norms: Any norm on $\mathbb{R}^d$ (e.g., the Euclidean norm $\|x\|$)

    Proof of Norm Convexity:
    By the triangle inequality $\|x+y\| \leq \|x\| + \|y\|$ and homogeneity $\|\alpha x\| = |\alpha|\|x\|$:

    $$\| \lambda x + (1-\lambda)y \| \leq \lambda \|x\| + (1-\lambda)\|y\|$$


Relationship Between Convex Functions and Sets: Epigraph

Graph of a Function

The graph of a function is defined as:

$$\{(x, f(x)) \mid x \in \text{dom}(f)\}$$

where $\text{dom}(f)$ denotes the domain of $f$.

Epigraph

The epigraph of a function $f: \mathbb{R}^d \to \mathbb{R}$ is:

$$\text{epi}(f) := \{(x, \alpha) \in \mathbb{R}^{d+1} \mid x \in \text{dom}(f), \ \alpha \geq f(x)\}$$

Note: The epigraph is the set of all points above the function graph, visually the region “above” the function. For convex $f$, the epigraph forms a convex “bowl-shaped” set.

  • Key Property:

    $f$ is convex $\iff$ $\text{epi}(f)$ is convex

    Note: This links function convexity to set convexity, simplifying analysis by leveraging convex set properties (e.g., segments lie within the set).

[Figure: epigraph of a convex function]

Proof: $f$ convex $\iff$ $\text{epi}(f)$ convex

⇒ (Sufficiency): Assume $f$ is convex. Take any $(x,t), (y,s) \in \text{epi}(f)$. We show that the segment point $\lambda(x,t) + (1-\lambda)(y,s)$ lies in $\text{epi}(f)$, i.e.:

$$f(\lambda x + (1-\lambda)y) \leq \lambda t + (1-\lambda)s$$

By convexity of $f$ ($f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$) and since $(x,t), (y,s) \in \text{epi}(f)$ imply $t \geq f(x)$ and $s \geq f(y)$, the inequality holds. Thus $\text{epi}(f)$ is convex.

⇐ (Necessity): Assume $\text{epi}(f)$ is convex. Take the points $(x,f(x)), (y,f(y)) \in \text{epi}(f)$. Since $\text{epi}(f)$ is convex, the segment point $\lambda(x,f(x)) + (1-\lambda)(y,f(y))$ lies in $\text{epi}(f)$, i.e.:

$$(\lambda x + (1-\lambda)y, \ \lambda f(x) + (1-\lambda)f(y)) \in \text{epi}(f)$$

By the definition of the epigraph, $\lambda f(x) + (1-\lambda)f(y) \geq f(\lambda x + (1-\lambda)y)$, which is exactly $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$. Thus $f$ is convex.

Note: The proof uses the epigraph inequality $\alpha \geq f(x)$ to convert between set convexity and function convexity. The first part derives set convexity from function convexity; the second part reverses the direction.

Jensen’s Inequality

Lemma 1 (Jensen’s Inequality)
Let $f$ be convex, $x_1, \dots, x_m \in \text{dom}(f)$, and $\lambda_1, \dots, \lambda_m \in \mathbb{R}_+$ with $\sum_{i=1}^m \lambda_i = 1$. Then:

$$f\left( \sum_{i=1}^m \lambda_i x_i \right) \leq \sum_{i=1}^m \lambda_i f(x_i)$$

Note:

  • For $m=2$, this reduces to the definition of convexity.
  • The general case is proven by induction (exercise). Core idea: decompose the convex combination recursively.
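A quick numerical check of Lemma 1 with assumed toy data ($f(t)=t^2$, random points, and random Dirichlet weights):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda t: t ** 2                      # a convex function
x = rng.normal(size=5)                    # points x_1, ..., x_m
lam = rng.dirichlet(np.ones(5))           # lambda_i >= 0 with sum = 1
print(f(lam @ x) <= lam @ f(x))           # True by Jensen's inequality
```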

Geometric Properties of Differentiable Functions

For a differentiable $f$, the affine function

$$f(x) + \nabla f(x)^\top (y - x)$$

describes the tangent hyperplane to $f$ at $(x, f(x))$.

[Figure: tangent hyperplane at $(x, f(x))$]

Geometric Interpretation:

  • $\nabla f(x)$ is the gradient (multidimensional derivative) at $x$.
  • The tangent hyperplane touches the graph of $f$ at $x$ and locally approximates $f$.
  • If $f$ is convex, the tangent hyperplane lies below the entire function graph.

Convexity Criteria

First-Order Convexity Condition

Lemma 2
Assume $\text{dom}(f)$ is open and $f$ is differentiable (the gradient $\nabla f(x) := \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_d}(x) \right)$ exists everywhere). Then $f$ is convex iff:

  1. $\text{dom}(f)$ is convex;

  2. For all $x, y \in \text{dom}(f)$:

$$f(y) \geq f(x) + \nabla f(x)^\top (y - x)$$

Geometric Interpretation:

  • The inequality means the graph of a convex function lies above its tangent hyperplane at any point (“bowl-shaped”).
  • The gradient $\nabla f(x)$ is the direction of steepest ascent, $-\nabla f(x)$ the direction of steepest descent.

Second-Order Convexity Condition

Assume $\text{dom}(f)$ is open and $f$ is twice differentiable (the Hessian $\nabla^2 f(x)$ exists everywhere and is symmetric). Then $f$ is convex iff:

  1. $\text{dom}(f)$ is convex;

  2. For all $x \in \text{dom}(f)$, the Hessian is positive semi-definite ($\nabla^2 f(x) \succeq 0$), i.e.:

$$\forall z \in \mathbb{R}^d: \quad z^\top \nabla^2 f(x) z \geq 0$$

Key Concepts:

  • Hessian Matrix: Symmetric matrix of second partial derivatives, describing local curvature:

$$\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{pmatrix}$$

  • Positive Semi-Definiteness ($\succeq 0$): All eigenvalues are non-negative, indicating “upward curvature” in every direction (e.g., the Hessian of $x^2$ is the scalar 2).

Example

The Hessian of $f(x_1, x_2) = x_1^2 + x_2^2$ is:

$$\nabla^2 f(x) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \succeq 0$$

Note: This matrix is positive definite (both eigenvalues equal 2 > 0), so $f$ is strictly convex (a paraboloid of revolution).
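A quick check of positive semi-definiteness via the eigenvalues (illustrative code):

```python
import numpy as np

# Hessian of f(x1, x2) = x1^2 + x2^2
H = np.array([[2.0, 0.0], [0.0, 2.0]])
eigvals = np.linalg.eigvalsh(H)
print(eigvals, np.all(eigvals >= 0))   # [2. 2.] True -> positive semi-definite
```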


Properties of Convex Functions

Convexity-Preserving Operations

Lemma 4
(i) Let $f_1, \dots, f_m$ be convex and $\lambda_1, \dots, \lambda_m \in \mathbb{R}_+$. Then $f := \sum_{i=1}^m \lambda_i f_i$ is convex on $\text{dom}(f) := \bigcap_{i=1}^m \text{dom}(f_i)$.

Note: Non-negative linear combinations of convex functions remain convex (the domain is the intersection).

(ii) Let $f$ be convex ($\text{dom}(f) \subseteq \mathbb{R}^d$) and $g(x) = Ax + b$ affine ($A \in \mathbb{R}^{d \times m}$, $b \in \mathbb{R}^d$). Then $f \circ g$ (i.e., $x \mapsto f(Ax + b)$) is convex on $\text{dom}(f \circ g) := \{x \in \mathbb{R}^m : Ax + b \in \text{dom}(f)\}$.

Note: Composition with an affine map preserves convexity (e.g., $f(2x+1)$ is convex whenever $f$ is).


Local Minimum is Global Minimum

Definition: $x$ is a local minimum of $f$ if $\exists \varepsilon > 0$ such that $f(x) \leq f(y)$ for all $y \in \text{dom}(f)$ with $\|y - x\| \leq \varepsilon$.

Lemma 5: A local minimum $x^*$ of a convex $f$ is a global minimum (i.e., $f(x^*) \leq f(y)$ for all $y \in \text{dom}(f)$).
Proof:
Assume there exists $y$ with $f(y) < f(x^*)$. Construct $y' = \lambda x^* + (1-\lambda)y$ with $\lambda \in (0,1)$. By convexity:

$$f(y') \leq \lambda f(x^*) + (1-\lambda)f(y) < f(x^*).$$

As $\lambda \to 1$, $\|y' - x^*\| \to 0$, so every neighborhood of $x^*$ contains a point with a strictly smaller function value, contradicting $x^*$ being a local minimum.


Critical Point is Global Minimum

Lemma 6: If $f$ is differentiable and convex on an open convex $\text{dom}(f)$, and $\nabla f(x) = 0$ (critical point), then $x$ is a global minimum.
Proof:
By the first-order condition (Lemma 2), for any $y$:

$$f(y) \geq f(x) + \underbrace{\nabla f(x)^\top (y - x)}_{=0} = f(x).$$

Geometric Interpretation: A zero gradient implies a horizontal tangent hyperplane and a global minimum.


Strictly Convex Functions

Definition: $f$ is strictly convex if:

  1. $\text{dom}(f)$ is convex;

  2. For all $x \neq y$ and $\lambda \in (0,1)$:

$$f(\lambda x + (1-\lambda)y) < \lambda f(x) + (1-\lambda)f(y).$$

Examples:

  • $x^2$ is strictly convex;
  • Linear functions are convex but not strictly convex.

Lemma 7: Strictly convex functions have at most one global minimum.


Constrained Minimization

Definition: Let $f$ be convex and $X \subseteq \text{dom}(f)$ convex. A point $x \in X$ minimizes $f$ over $X$ if $f(x) \leq f(y)$ for all $y \in X$.

Lemma 8: If $f$ is differentiable on an open convex $\text{dom}(f)$ and $X \subseteq \text{dom}(f)$ is convex, then $x^* \in X$ is a minimizer iff:

$$\nabla f(x^*)^\top (x - x^*) \geq 0, \quad \forall x \in X.$$

Geometric Interpretation: At $x^*$, the gradient forms an angle of at most $90^\circ$ with every feasible direction (there is no feasible descent direction).


Existence of Minimum

$\alpha$-Sublevel Set: $f_{\leq \alpha} := \{x \in \mathbb{R}^d : f(x) \leq \alpha\}$.

Note: Even if $f$ is bounded below (e.g., $f(x)=e^x$), a minimum may not exist; existence requires a non-empty, compact (closed and bounded) sublevel set.


Convex Optimization Problems

Form:

$$\min_{x \in D} f(x),$$

with $f$ convex and $D \subseteq \text{dom}(f)$ convex (e.g., $D = \mathbb{R}^d$).
Key Properties:

  • Local minimum is global minimum.

  • Algorithms (coordinate descent, gradient descent, SGD, etc.) provably converge to global optimum.

Convergence Rate Example: For $L$-smooth convex $f$ with a suitable step size, gradient descent converges at an $O(1/t)$ rate:

$$f(x_t) - f(x^*) \leq \frac{c}{t},$$

where $x^*$ is an optimum and $c$ is a constant depending on the initialization and function properties.
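A minimal sketch checking this behavior on the least-squares example from earlier (illustrative code; the printed gap shrinks much faster than $c/t$ here because this particular problem is also strongly convex):

```python
import numpy as np

# Least-squares example from earlier: f(x) = 0.5 * ||Ax - y||^2
A = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]])
y = np.array([1.5, 3.5, 2.0])
f = lambda x: 0.5 * np.linalg.norm(A @ x - y) ** 2
x_star = np.linalg.lstsq(A, y, rcond=None)[0]   # global minimizer

x, eta = np.zeros(2), 0.1
for t in range(1, 51):
    x = x - eta * A.T @ (A @ x - y)             # gradient descent step
    if t in (1, 10, 50):
        print(t, f(x) - f(x_star))              # suboptimality gap shrinks as t grows
```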