#sdsc6015


Mirror Descent


Motivation

Consider the simplex-constrained optimization problem:

$$\min_{x \in \triangle_d} f(x)$$

where the simplex $\triangle_d := \{x \in \mathbb{R}^d : \sum_{i=1}^d x_i = 1,\ x_i \geq 0\ \forall i\}$. Assume the gradient’s infinity norm is bounded: $\|\nabla f(x)\|_\infty = \max_{i=1,\ldots,d} |[\nabla f(x)]_i| \leq 1$.

  • Notation: $x$ is the optimization variable, $d$ is the dimension, $\triangle_d$ is the probability simplex.

  • Geometric Interpretation: The simplex is the space of probability distributions; constrained optimization aims to minimize the loss under probability constraints.

  • Convergence Rate Comparison:

    • GD: $\mathcal{O}(\sqrt{d/T})$
    • Mirror Descent: $\mathcal{O}(\sqrt{\log d / T})$, superior in high-dimensional problems.

Preliminaries

  • Dual Norm: $\|g\|_* = \sup_{\|x\| \leq 1} g^\top x$.

  • Function Properties:

    • $L$-Lipschitz: $\|g\|_* \leq L$ for every subgradient $g \in \partial f(x)$.
    • $\beta$-smooth: the gradient is $\beta$-Lipschitz, i.e., $\|\nabla f(x) - \nabla f(y)\|_* \leq \beta \|x - y\|$.
    • $\mu$-strongly convex: $f(y) \geq f(x) + g^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$ for $g \in \partial f(x)$, i.e., the function is lower-bounded by a quadratic.

Mirror Descent Algorithm

Iteration format:

$$y_{t+1} = \nabla \Phi^*\big( \nabla \Phi(x_t) - \eta g_t \big), \qquad x_{t+1} = \Pi_X^\Phi(y_{t+1})$$

where $g_t \in \partial f(x_t)$, $\Phi$ is the mirror potential (mirror map), and $\Pi_X^\Phi$ is the Bregman projection onto $X$.

  • Notation: $\Phi$ is strictly convex and continuously differentiable; $D_\Phi(x,y) = \Phi(x) - \Phi(y) - \nabla \Phi(y)^\top (x - y)$ is the Bregman divergence.

  • Geometric Interpretation: Transforms the problem into the dual space via the potential function, making the update direction better align with the geometric structure.

Convergence Theorem

Theorem 1 (Mirror Descent Convergence):
Let $\Phi$ be a $\rho$-strongly convex mirror map, $R^2 = \sup_{x \in X} \Phi(x) - \Phi(x_1)$, and $f$ convex and $L$-Lipschitz. Choose $\eta = \frac{2R}{L} \sqrt{\frac{\rho}{T}}$; then:

$$\min_{t=1,\ldots,T} \mathbb{E}[f(x_t)] - f(x^*) \leq 2RL \sqrt{\frac{1}{\rho T}}$$

Proof Outline: 1. Use the $\rho$-strong convexity of the mirror map (via the Bregman divergence) to bound the per-step error. 2. Bound the remaining term using the Lipschitz constant of $f$ (i.e., the bound on subgradient norms). 3. Sum over iterations and optimize the step size. For details, see Document 1, Page 9.
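
As a hedged sketch of these steps (following the standard mirror descent analysis; the exact constants may differ slightly from the course's convention), the per-iteration inequality that gets telescoped is roughly

$$f(x_t) - f(x^*) \;\leq\; g_t^\top (x_t - x^*) \;\leq\; \frac{D_\Phi(x^*, x_t) - D_\Phi(x^*, x_{t+1})}{\eta} + \frac{\eta L^2}{2\rho},$$

where the first inequality is convexity and the second combines the update rule, the $\rho$-strong convexity of $\Phi$, and the Bregman projection. Summing over $t$ telescopes the Bregman terms, bounding $D_\Phi(x^*, x_1) \leq R^2$ and choosing $\eta$ then gives a bound of the stated $\mathcal{O}(RL/\sqrt{\rho T})$ form.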

Standard Settings

  • Ball Setting: $\Phi(x) = \frac{1}{2} \|x\|_2^2$; mirror descent reduces to projected gradient descent.

  • Simplex Setting: $\Phi(x) = \sum_{i=1}^d x_i \log x_i$ (negative entropy); the Bregman divergence is the KL divergence, and the Bregman projection onto the simplex is achieved by renormalization (see the sketch after this list).

  • Example: When optimizing probability distributions on the simplex, Mirror Descent naturally handles constraints, preventing Euclidean projection bias.
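
A minimal sketch of the simplex setting in code, assuming a generic (sub)gradient oracle `grad_f`; the function name, step size, and toy objective are illustrative, not taken from the course materials:

```python
import numpy as np

def entropic_mirror_descent(grad_f, d, eta, T):
    """Mirror descent on the probability simplex with the negative-entropy
    mirror map: multiplicative update followed by renormalization."""
    x = np.full(d, 1.0 / d)              # start from the uniform distribution
    for _ in range(T):
        g = grad_f(x)                    # (sub)gradient at the current iterate
        w = x * np.exp(-eta * g)         # dual-space step: x_t * exp(-eta * g_t)
        x = w / w.sum()                  # Bregman (KL) projection = renormalization
    return x

# Toy usage: minimize a linear function c^T x over the simplex;
# the mass should concentrate on the smallest coordinate of c.
c = np.array([0.3, 0.7, 0.1, 0.9])
print(entropic_mirror_descent(lambda x: c, d=4, eta=0.5, T=200))
```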


Stochastic Gradient Descent (SGD)


Problem Formulation

The main problem is defined as minimizing a function $f(\theta)$, where $\theta \in \mathbb{R}^d$ is the parameter vector. The specific form is:

$$\underset{\theta \in \mathbb{R}^d}{\text{minimize}} \; f(\theta)$$

Here, $f(\theta)$ is typically an expected loss function, expressed as:

$$f(\theta) = \int \ell(\theta, Z) \, dP(Z)$$

  • $Z \in \mathbb{R}^p$ is a random variable representing a data point (e.g., features and labels).

  • $P(Z)$ is the unknown distribution, reflecting that in real-world problems the data distribution is often unknown and must be estimated from samples.

  • $\ell(\theta, Z)$ is the loss function, parameterized by $\theta$, used to measure the model’s performance on data point $Z$.

Core Idea: Learn the parameters $\theta$ by minimizing the expected loss; since $P(Z)$ is unknown, empirical risk minimization is used in practice, approximating the integral with the sample mean.

Given a set of labeled training data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathcal{Y}$, the goal is to find the weight parameters $\theta$ to classify the data. This connects the optimization problem with supervised learning tasks such as classification or regression.

In simple terms, the main problem is to optimize the parameters $\theta$ to minimize the model’s average loss over all possible data; because the distribution is unknown, we rely on finite samples.


Motivation

When the dataset is huge (e.g., $n \gg 1{,}000{,}000$), computing the full gradient becomes impractical. Specific analysis:

  • Define $f_i(\theta) = \ell(\theta, (x_i, y_i))$ as the cost function for the $i$-th observation, drawn from a training set of $n$ observations.

  • The full gradient $\nabla f(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(\theta)$ requires iterating over all data points; the computational cost grows linearly with $n$, creating a bottleneck when $n$ is very large.

  • Therefore, cheaper updates can be used: compute the gradient of a single component $\nabla f_i(\theta)$ or a mini-batch gradient. This leads to Stochastic Gradient Descent (SGD) and its variants, which approximate the gradient by randomly sampling a subset of data points, significantly reducing the computational cost per iteration.


SGD Algorithm Description

The basic steps of stochastic gradient descent are as follows:

  • Initialization: Set the initial point $x_0 \in \mathbb{R}^d$.

  • Iteration Process: For each time step $t = 0, 1, \ldots$:

    • Randomly and uniformly sample an index $i_t \in \{1, 2, \ldots, n\}$.
    • Update the parameters: $x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t)$, where $\eta_t$ is the learning rate.

Here, $g_t := \nabla f_{i_t}(x_t)$ is called the stochastic gradient; it uses the gradient of only a single function component $f_{i_t}$, not the full gradient $\nabla f(x_t) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(x_t)$.
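
A minimal runnable sketch of this loop on a toy least-squares problem; the problem instance, step size, and helper names are illustrative assumptions, not part of the notes:

```python
import numpy as np

def sgd(grad_i, n, x0, eta, T, rng):
    """Basic SGD: sample a uniform index i_t, then step along -grad f_{i_t}(x_t)."""
    x = x0.copy()
    for _ in range(T):
        i_t = rng.integers(n)            # uniform random index in {0, ..., n-1}
        x = x - eta * grad_i(x, i_t)     # stochastic gradient step
    return x

# Toy usage: f(x) = (1/n) * sum_i (a_i^T x - b_i)^2 (least squares).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
grad_i = lambda x, i: 2.0 * (A[i] @ x - b[i]) * A[i]   # gradient of one component
x_hat = sgd(grad_i, n=100, x0=np.zeros(5), eta=0.01, T=5000, rng=rng)
```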


Advantage: Computational Efficiency

  • Each iteration computes the gradient of only one function component, so the per-iteration cost is $n$ times lower than Full Gradient Descent.
  • This makes SGD more scalable for large datasets.

Convergence Considerations

Although SGD is computationally efficient, a step along the gradient of a single component is not guaranteed to decrease $f$, so convergence is not automatic. The document provides an example to illustrate this issue:

Example: Consider the optimization problem $\min_x f(x) = f_1(x) + f_2(x)$, where $f_1(x) = x^2$ and $f_2(x) = -x^2$. Starting from $x_k > 0$, if the random sample is $i_k = 2$, the update $x_{k+1} = x_k - \eta_t \nabla f_2(x_k)$ may cause the function value to increase (since $\nabla f_2(x) = -2x$, the update direction can be wrong).

However, SGD can produce descent in expectation, leading to the concept of unbiasedness.

Unbiasedness

The stochastic gradient $g_t$ is an unbiased estimator of the true gradient $\nabla f(x_t)$, i.e.:

$$\mathbb{E}[g_t \mid x_t = x] = \nabla f(x_t)$$

where the expectation is taken over the random index $i_t$.

Significance: Unbiasedness ensures that, on average, the update direction of SGD is correct. Although individual steps may deviate randomly, in the long run the algorithm makes progress toward the optimal solution. This allows convergence to be analyzed in expectation via conditional-expectation (tower-property) arguments, as illustrated by the check below.
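
A small numerical check of the unbiasedness property, assuming uniform sampling and a toy least-squares loss (the data here are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A, b = rng.normal(size=(50, 3)), rng.normal(size=50)
x = rng.normal(size=3)

# Component gradients of f_i(x) = (a_i^T x - b_i)^2 ...
component_grads = np.array([2.0 * (A[i] @ x - b[i]) * A[i] for i in range(50)])
# ... and the full gradient of f(x) = (1/n) * sum_i f_i(x).
full_grad = 2.0 * A.T @ (A @ x - b) / 50

# Uniform sampling means E[g_t | x_t = x] is exactly the average component gradient.
print(np.allclose(component_grads.mean(axis=0), full_grad))  # True
```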


Conditional Expectation

Detailed Explanation of Conditional Expectation:

Definition

Conditional expectation is a core concept in probability theory and statistics, used to approximate a random variable given partial information (e.g., the value of another random variable).

  • Definition: Let $X$ and $Y$ be random variables. The conditional expectation of $X$ given $Y$, denoted $E[X|Y]$, is the best approximation of $X$ using only the information in $Y$.

    • This differs from the unconditional expectation $E[X]$: $E[X]$ gives the overall average of $X$, while $E[X|Y]$ gives the average of $X$ given the value of $Y$.
    • Intuitively, conditional expectation averages out everything not determined by $Y$, providing a more precise local estimate.

Explanation: Conditional expectation is commonly used in machine learning to handle partially observed data, such as optimizing model parameters in stochastic gradient descent.

Properties

  • Linearity: Conditional expectation is linear, i.e., for random variables $X$ and $Y$ and constants $a$ and $b$:

    $$E[aX + bY \mid Z] = aE[X|Z] + bE[Y|Z]$$

    The document alludes to related rules for sums $X + Y$ or products $XY$ inside a conditional expectation without stating them; the standard “taking out what is known” rule is $E[XY \mid Y] = Y\,E[X|Y]$.

  • Partition Theorem: For a discrete random variable $Y$, this theorem provides a simplified way to compute expectations.

    • Theorem statement: If $Y$ is a discrete random variable, then the unconditional expectation can be computed via the expectation of the conditional expectation, i.e.:

      $$E[X] = E\big[E[X|Y]\big]$$

      That is, first compute the conditional expectation of $X$ given $Y$, then take the expectation over $Y$.
    • A footnote in the document indicates that formal definitions and properties can be found in standard textbooks; this theorem is a special case of the Law of Total Expectation.

Example: Suppose $Y$ is discrete (e.g., a categorical variable); then $E[X]$ can be obtained by computing $E[X|Y=y]$ for each possible value $y$ and taking the weighted average $E[X] = \sum_y P(Y=y)\, E[X|Y=y]$.
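
For instance, with an invented two-valued $Y$ (the numbers are chosen only for illustration): taking $P(Y=1) = 0.3$, $E[X|Y=0] = 2$, and $E[X|Y=1] = 5$,

$$E[X] = P(Y=0)\,E[X|Y=0] + P(Y=1)\,E[X|Y=1] = 0.7 \cdot 2 + 0.3 \cdot 5 = 2.9.$$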

Convexity of Expectation

  • For any fixed $x$, the linearity of conditional expectation simplifies the calculations:

$$E[f(X) \mid Y] \text{ is a weighted (linear) combination determined by } Y,$$

where $f$ is a convex function.

  • The document mentions that the event $\{x_t = x\}$ occurs only for $x$ in a finite set $X$ (i.e., $x_t$ is determined by the index selections of the previous iterations). This implies that in iterative algorithms (such as stochastic gradient descent), the parameter updates are limited to a finite number of possible values.

  • Application of the Partition Theorem: By decomposing the expectation into conditional expectations, the convexity analysis can be simplified. For example:

$$E[L(x_t)] = E\big[E[L(x_t) \mid Y]\big]$$

where $L$ is the loss function; this aids convergence proofs.

Explanation: In stochastic optimization, this convexity analysis ensures algorithm stability and convergence even with noisy data.

Summary

The document focuses on the definition and properties of conditional expectation (such as linearity and the Partition Theorem) and its application in convexity analysis. These concepts are fundamental to stochastic optimization and machine learning, helping to handle uncertainty and partial information. Conditional expectation, as a “best approximation” tool, plays a key role in model training and convergence proofs.



Convergence Theorems

Theorem 2 (SGD on Convex Functions):
Let $f$ be convex and differentiable, $x^*$ a global optimum, $\|x_0 - x^*\| \leq R$, and $\mathbb{E}[\|g_t\|^2] \leq B^2$. Choose the constant step size $\eta = \frac{R}{B \sqrt{T}}$; then:

$$\frac{1}{T} \sum_{t=0}^{T-1}\mathbb{E}\left[ f(x_t) \right] - f(x^*) \leq \frac{RB}{\sqrt{T}}$$

Proof Details (Supplemented from Document 2, Page 23)

Proof Outline:

  1. Basic Inequality: Start from the iteration rule:

    $$\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta\, g_t^\top (x_t - x^*) + \eta^2 \|g_t\|^2$$

  2. Take Expectation: Use the unbiasedness $\mathbb{E}[g_t \mid x_t] = \nabla f(x_t)$ and convexity:

    $$\mathbb{E}[g_t^\top (x_t - x^*)] = \mathbb{E}[\nabla f(x_t)^\top (x_t - x^*)] \geq \mathbb{E}[f(x_t) - f(x^*)]$$

  3. Summation and Optimization: Sum from $t = 0$ to $T-1$ and substitute the step size $\eta = \frac{R}{B \sqrt{T}}$:

    $$\sum_{t=0}^{T-1} \mathbb{E}[f(x_t) - f(x^*)] \leq \frac{R^2}{2\eta} + \frac{\eta B^2 T}{2} = RB \sqrt{T}$$

  4. Jensen’s Inequality: Apply $\mathbb{E}[f(\bar{x}_T)] \leq \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[f(x_t)]$, where $\bar{x}_T = \frac{1}{T}\sum_{t=0}^{T-1} x_t$ is the averaged iterate, to obtain the final bound.

Geometric Interpretation: SGD moves in a descent direction in expectation, but variance causes oscillations; the step size balances convergence speed and stability.

  • Notation: $R$ is the initial error bound, $B$ is the stochastic gradient norm bound, $T$ is the number of iterations.

  • Comparison with GD: The gradient norm bound $B_{\text{GD}}^2$ for full gradient descent is no larger than SGD’s $B_{\text{SGD}}^2$ (by Jensen’s inequality), but SGD’s cost per round is $n$ times lower, which often results in higher overall efficiency.

Theorem 3 (SGD on Strongly Convex Functions):
Let $f$ be $\mu$-strongly convex, $x^*$ the unique optimum, and choose the step size $\eta_t = \frac{1}{\mu t}$; then:

$$\mathbb{E}[\|x_T - x^*\|^2] \leq \frac{B^2}{\mu^2 T}$$

Proof Details (Supplemented from Document 2, Page 26)

Proof Steps:

  1. Strong Convexity Inequality:

    $$f(x^*) \geq f(x_t) + \nabla f(x_t)^\top (x^* - x_t) + \frac{\mu}{2} \|x_t - x^*\|^2$$

  2. Iteration Analysis: Substitute the SGD update rule and take expectations:

    $$\mathbb{E}[\|x_{t+1} - x^*\|^2] \leq (1 - \mu \eta_t)\, \mathbb{E}[\|x_t - x^*\|^2] + \eta_t^2 B^2$$

  3. Recursive Solution: Set $\eta_t = 1/(\mu t)$, then unroll the recurrence and sum to obtain the convergence bound.

Geometric Interpretation: Strong convexity ensures the function has a unique minimum; with the decaying step size, SGD converges at an $\mathcal{O}(1/T)$ rate in squared distance, and the variance bound $B^2$ limits the attainable precision.

Mini-batch SGD

Use the batch update $\tilde{g}_t = \frac{1}{|S_t|} \sum_{i \in S_t} \nabla f_i(x_t)$, where $S_t \subset \{1, \ldots, n\}$ is a random subset.

  • Notation: $|S_t| = m$ is the batch size; $m = 1$ recovers SGD, and $m = n$ recovers full GD.

  • Variance Calculation (Supplemented from Document 2, Page 31):

    $$\mathbb{E}[\|\tilde{g}_t - \nabla f(x_t)\|^2] = \frac{1}{m}\, \mathbb{E}[\|\nabla f_1(x_t) - \nabla f(x_t)\|^2] \leq \frac{B^2}{m}$$

    As $m \to \infty$, the variance tends to zero (see the sketch after this list).

  • Geometric Interpretation: Batch averaging reduces variance, making the gradient estimate more stable, but increases computational cost per round.
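
A small sketch checking this variance behaviour numerically; the toy least-squares data, sampling with replacement, and number of trials are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
A, b = rng.normal(size=(1000, 4)), rng.normal(size=1000)
x = rng.normal(size=4)
full_grad = 2.0 * A.T @ (A @ x - b) / 1000        # gradient of the averaged loss

def minibatch_grad(m):
    idx = rng.choice(1000, size=m, replace=True)          # random batch S_t
    return 2.0 * A[idx].T @ (A[idx] @ x - b[idx]) / m     # averaged batch gradient

for m in (1, 10, 100):
    err = np.mean([np.sum((minibatch_grad(m) - full_grad) ** 2) for _ in range(2000)])
    print(m, err)   # the mean squared error shrinks roughly like 1/m
```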

Stochastic Subgradient Descent and Constrained Optimization

For non-smooth problems, use a subgradient $g_t \in \partial f_{i_t}(x_t)$ and update $x_{t+1} = x_t - \eta_t g_t$. For constrained problems, add a projection step (Projected SGD); the convergence rate remains $\mathcal{O}(1/\varepsilon^2)$. A brief sketch follows.
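
The sketch below takes an $\ell_2$-ball as an illustrative choice of constraint set; the projection, loss, and names are assumptions, not the course's specific setup:

```python
import numpy as np

def projected_sgd(subgrad_i, n, x0, radius, eta, T, rng):
    """Stochastic subgradient step on one component, then Euclidean
    projection back onto the ball {x : ||x||_2 <= radius}."""
    x = x0.copy()
    for _ in range(T):
        y = x - eta * subgrad_i(x, rng.integers(n))        # unconstrained step
        norm = np.linalg.norm(y)
        x = y if norm <= radius else radius * y / norm     # projection onto the ball
    return x

# Toy usage with absolute-value losses f_i(x) = |a_i^T x - b_i| (non-smooth).
rng = np.random.default_rng(3)
A, b = rng.normal(size=(200, 3)), rng.normal(size=200)
subgrad_i = lambda x, i: np.sign(A[i] @ x - b[i]) * A[i]
x_hat = projected_sgd(subgrad_i, n=200, x0=np.zeros(3), radius=5.0, eta=0.05, T=3000, rng=rng)
```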


Momentum Methods

Motivation

GD oscillates severely on long, narrow valley-shaped functions. Momentum methods introduce historical step information to accelerate convergence.

  • Geometric Interpretation: Momentum acts like an inertial ball, accelerating in flat directions and damping in steep directions, reducing oscillations.

Heavy Ball Method (Polyak Momentum)

Update rule:

$$x_{t+1} = x_t - \eta_t \nabla f(x_t) + \beta_t (x_t - x_{t-1})$$

Or equivalently:

$$v_{t+1} = \beta_t v_t - \eta_t \nabla f(x_t), \qquad x_{t+1} = x_t + v_{t+1}$$

where $\beta_t \in [0,1]$ is the momentum coefficient.

  • Notation: $v_t$ is the momentum term, $\beta_t$ controls the weight given to the history.

  • Geometric Interpretation: The momentum term retains the previous direction, helping to traverse locally flat regions.
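
A minimal sketch of the heavy ball update above on a toy ill-conditioned quadratic; the step size and momentum values here are illustrative, not the tuned constants of Theorem 4 below:

```python
import numpy as np

def heavy_ball(grad_f, x0, eta, beta, T):
    """Polyak momentum: v_{t+1} = beta*v_t - eta*grad f(x_t); x_{t+1} = x_t + v_{t+1}."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(T):
        v = beta * v - eta * grad_f(x)
        x = x + v
    return x

# Toy usage: f(x) = 0.5 * x^T Q x with an ill-conditioned Q (minimizer at the origin).
Q = np.diag([1.0, 100.0])
x_T = heavy_ball(lambda x: Q @ x, x0=np.array([1.0, 1.0]), eta=0.03, beta=0.8, T=500)
print(x_T)   # close to [0, 0]
```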

Convergence

Theorem 4 (Convergence on Quadratic Functions):
Consider the strongly convex quadratic function $f(x) = \frac{1}{2} x^\top Q x - b^\top x$, where $Q$ is positive definite with condition number $\kappa = L/\mu$. Choose $\eta_t = \frac{4}{(\sqrt{L} + \sqrt{\mu})^2}$ and $\beta_t = \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^2$; then the heavy ball method converges at a linear rate.

Proof Outline

Proof Idea:

  • Through eigenvalue analysis, the momentum term improves the convergence constant of GD.

  • Detailed proof can be found in Document 1, Page 39, and references (e.g., Wright & Recht, 2022).

Geometric Interpretation: For quadratic functions, momentum adjusts the step size to match the eigenvalue distribution of the Hessian matrix, accelerating convergence.

  • Counterexample (Document 1, Pages 41-44): For a piecewise quadratic function $f(x)$, although it is 1-strongly convex and 25-smooth, the heavy ball method may cycle without converging (e.g., with $\eta_t = 1/9$, $\beta_t = 4/9$, $x_0 = 3.3$, the iterates may cycle among three points).

    • Geometric Interpretation: Non-smooth points cause incorrect momentum directions, leading to oscillations.

Nesterov’s Accelerated Gradient Descent

Similar to the heavy ball method but uses a lookahead point to compute the gradient, with more robust theoretical guarantees. Update rule difference:

  • Heavy Ball: Directly combines current gradient and momentum.

  • Nesterov’s Method: First moves along the momentum direction, then computes a gradient correction at the look-ahead point (see the sketch below).

  • Geometric Interpretation: Nesterov’s method reduces oscillation through “lookahead” steps, making it more suitable for general convex functions.
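
A minimal sketch of one common constant-momentum form of Nesterov's method (there are several equivalent formulations; the parameters below are illustrative, not from the notes):

```python
import numpy as np

def nesterov_agd(grad_f, x0, eta, beta, T):
    """Look-ahead step y = x_t + beta*(x_t - x_{t-1}), then a gradient
    step taken from y rather than from x_t."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(T):
        y = x + beta * (x - x_prev)           # move along the momentum direction first
        x_prev, x = x, y - eta * grad_f(y)    # gradient correction at the look-ahead point
    return x

# Toy usage on the same ill-conditioned quadratic as the heavy ball example.
Q = np.diag([1.0, 100.0])
x_T = nesterov_agd(lambda x: Q @ x, x0=np.array([1.0, 1.0]), eta=0.01, beta=0.9, T=500)
print(x_T)   # close to [0, 0]
```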