#sdsc6015


Projected Gradient Descent

Projected Gradient Descent is an algorithm for constrained optimization problems that ensures constraint satisfaction by projecting gradient steps back onto the feasible set.


Constrained Optimization Problem Definition

The constrained optimization problem is formally defined as:

\begin{aligned} &\min f(x) \\ &\text{subject to}\quad x \in X \end{aligned}

where:

  • f: \mathbb{R}^d \rightarrow \mathbb{R} is the objective function

  • X \subseteq \mathbb{R}^d is a closed convex set

  • x \in \mathbb{R}^d is the optimization variable

Geometric Meaning: Find the point that minimizes f(x) while satisfying the constraint x \in X.


Algorithm Description

Projected Gradient Descent iteration steps:

\begin{aligned} &\text{For } t = 0, 1, 2, \ldots \text{ with stepsize } \eta_t > 0: \\ &\quad y_{t+1} = x_t - \eta_t \nabla f(x_t) \\ &\quad x_{t+1} = \Pi_X(y_{t+1}) \end{aligned}

where the projection operator is defined as:

\Pi_X(y) := \operatorname*{arg\,min}_{x \in X} \|x - y\|

Mathematical Notation Meaning:

  • \eta_t: Step size parameter, controls the update magnitude

  • \nabla f(x_t): Gradient of the function at x_t

  • \Pi_X: Projection operator, maps a point to the closest point in X
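
To make the iteration concrete, here is a minimal Python sketch of projected gradient descent, assuming a Euclidean-ball feasible set (the names `grad_f`, `project_ball`, and all parameter values are illustrative, not from the original notes):

```python
import numpy as np

def project_ball(y, c, R):
    """Euclidean projection onto the ball {x : ||x - c||_2 <= R}."""
    d = y - c
    norm = np.linalg.norm(d)
    return y if norm <= R else c + (R / norm) * d

def projected_gradient_descent(grad_f, project, x0, eta, T):
    """Iterate y_{t+1} = x_t - eta * grad f(x_t); x_{t+1} = Pi_X(y_{t+1})."""
    x = x0
    for _ in range(T):
        y = x - eta * grad_f(x)  # unconstrained gradient step
        x = project(y)           # pull the step back onto the feasible set X
    return x

# Example: minimize ||x - b||^2 over the unit ball centered at the origin.
b = np.array([3.0, 4.0])
x_hat = projected_gradient_descent(
    grad_f=lambda x: 2 * (x - b),
    project=lambda y: project_ball(y, c=np.zeros(2), R=1.0),
    x0=np.zeros(2), eta=0.25, T=100)
# x_hat is approximately b / ||b|| = [0.6, 0.8]
```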


Convergence Rate Analysis

The convergence rates are the same as for unconstrained gradient descent, but each step additionally requires a projection operation:

  • Lipschitz convex function (parameter G): \mathcal{O}(1/\varepsilon^2) steps

  • Smooth convex function (parameter L): \mathcal{O}(1/\varepsilon) steps

  • Smooth strongly convex function (parameters \mu, L): \mathcal{O}(\log(1/\varepsilon)) steps

Computational efficiency depends on the difficulty of the projection operation.


Convergence on Smooth Functions

Smoothness Definition: A function f is called smooth (with parameter L) on a closed convex set X \subseteq \mathbb{R}^d if:

\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \quad \forall x, y \in X

Lemma 1 (Sufficient Decrease):

Let f: \mathbb{R}^d \rightarrow \mathbb{R} be differentiable and smooth (parameter L) on a closed convex set X \subseteq \mathbb{R}^d. Choose step size \eta = 1/L; then projected gradient descent satisfies:

f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2

where y_{t+1} = x_t - \eta \nabla f(x_t).

Lemma 1 Proof:

Mathematical Notation Meaning in Projection

  • y_{t+1} = x_t - \eta \nabla f(x_t): The unprojected gradient step, the point reached by moving one step along the negative gradient direction; it may not lie in the feasible set X.

  • x_{t+1} = \Pi_X(y_{t+1}): The projected point, obtained by mapping y_{t+1} to the closest point in X via the projection operator \Pi_X, ensuring the iterates always satisfy the constraints.

  • \|y_{t+1} - x_{t+1}\|^2: The projection error, the squared Euclidean distance between the unprojected and projected points. This term acts as a “penalty” for the deviation caused by projection, as projection may shift the point away from the gradient direction.

  • \Pi_X(y) := \operatorname*{arg\,min}_{x \in X} \|x - y\|: The projection operator, which projects a point y onto the closed convex set X by finding the point in X closest to y.


Proof

The proof of Lemma 1 relies on three key elements: the smoothness of the function, the optimality condition of projection, and a geometric property of the projection operator (Fact (ii)). The detailed proof steps are as follows:

  1. Utilize Function Smoothness:
    Since f is L-smooth, the smoothness definition gives, for any points x and y:

    f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2

    Apply this property with x = x_t and y = y_{t+1}, substituting y_{t+1} - x_t = -\frac{1}{L} \nabla f(x_t):

    f(y_{t+1}) \leq f(x_t) - \frac{1}{L} \|\nabla f(x_t)\|^2 + \frac{L}{2} \cdot \frac{1}{L^2} \|\nabla f(x_t)\|^2 = f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2

    This step gives an upper bound on the function value at the unprojected point y_{t+1}.

  2. Optimality Condition of Projection:
    Since x_{t+1} = \Pi_X(y_{t+1}) is the projection point, for a closed convex set X the projection operator satisfies the following optimality condition: for any x \in X, we have:

    (x - x_{t+1})^\top (y_{t+1} - x_{t+1}) \leq 0

    In particular, taking x = x_t (valid since x_t \in X), we have:

    (x_t - x_{t+1})^\top (y_{t+1} - x_{t+1}) \leq 0

    Substituting y_{t+1} = x_t - \eta \nabla f(x_t) and rearranging gives:

    \nabla f(x_t)^\top (x_{t+1} - x_t) \leq -\frac{1}{\eta} \|x_t - x_{t+1}\|^2

    This inequality relates the gradient inner product to the distance moved after projection.

  3. Apply Smoothness to x_{t+1}:
    Use the smoothness definition again, this time with x = x_t and y = x_{t+1}:

    f(x_{t+1}) \leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \|x_{t+1} - x_t\|^2

    Substituting the inequality from step 2 gives:

    f(x_{t+1}) \leq f(x_t) - \frac{1}{\eta} \|x_t - x_{t+1}\|^2 + \frac{L}{2} \|x_{t+1} - x_t\|^2

    With \eta = 1/L, this simplifies to:

    f(x_{t+1}) \leq f(x_t) - \frac{L}{2} \|x_t - x_{t+1}\|^2

  4. Utilize Projection Property (Fact (ii)):
    Fact (ii) is a geometric property of the projection operator: for any x \in X and y \in \mathbb{R}^d, we have:

    \|x - \Pi_X(y)\|^2 + \|y - \Pi_X(y)\|^2 \leq \|x - y\|^2

    Taking x = x_t and y = y_{t+1}, and using \Pi_X(y_{t+1}) = x_{t+1}, we get:

    \|x_t - x_{t+1}\|^2 + \|y_{t+1} - x_{t+1}\|^2 \leq \|x_t - y_{t+1}\|^2 = \frac{1}{L^2} \|\nabla f(x_t)\|^2

    where the last equality uses x_t - y_{t+1} = \frac{1}{L} \nabla f(x_t). Combining this relation with steps 1-3 and rearranging yields the sufficient-decrease inequality of Lemma 1, completing the proof.


Theorem 1:

Let f: \mathbb{R}^d \rightarrow \mathbb{R} be a convex and differentiable function. Let X \subseteq \mathbb{R}^d be a closed convex set, and assume there exists a minimizer x^* \in X such that f(x^*) = \min_{x \in X} f(x). Further, assume f is smooth on X (i.e., the gradient is L-Lipschitz continuous) with parameter L. Choose step size \eta = \frac{1}{L}; then the iterates generated by the projected gradient descent algorithm satisfy the following convergence bound:

f(x_T) - f(x^*) \leq \frac{L}{2T} \|x_0 - x^*\|^2

where x_T is the point after the T-th iteration and x_0 is the initial point.
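
As a quick consequence (a short derivation added for completeness): requiring the right-hand side to be at most \varepsilon shows that \mathcal{O}(1/\varepsilon) iterations suffice, matching the rate table above:

\frac{L}{2T} \|x_0 - x^*\|^2 \leq \varepsilon \quad\Longleftrightarrow\quad T \geq \frac{L \|x_0 - x^*\|^2}{2\varepsilon}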

Theorem 1 Proof:

Notation Explanation

  • f: Objective function, mapping from \mathbb{R}^d to \mathbb{R}, convex and differentiable.

  • X: Feasible region, a closed convex set (e.g., the region defined by the constraints).

  • x^*: Global minimizer of f on X, i.e., x^* = \arg\min_{x \in X} f(x).

  • L: Smoothness constant (Lipschitz constant of the gradient), an upper bound on the gradient variation: \|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| holds for all x, y \in X.

  • \eta: Step size, set to \eta = \frac{1}{L} in the theorem to ensure convergence.

  • g_t: Gradient at point x_t, i.e., g_t = \nabla f(x_t).

  • x_t: Point after the t-th iteration (after projection).

  • y_{t+1}: Unprojected gradient step, computed as y_{t+1} = x_t - \eta g_t.

  • x_{t+1}: Projected point, i.e., x_{t+1} = \Pi_X(y_{t+1}), where \Pi_X is the projection operator onto the set X.

The projection operator \Pi_X(y) returns the point in X closest to y, i.e., \Pi_X(y) = \arg\min_{x \in X} \|x - y\|. Since X is closed and convex, the projection exists and is unique.

Proof Sketch

The core of the proof lies in utilizing the function’s smoothness and convexity, as well as the properties of the projection operator, to derive the error bound per iteration. The key steps are:

  1. Sufficient Decrease Condition
    Since f is L-smooth, projected gradient descent satisfies the following sufficient decrease inequality:

    f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|g_t\|^2 + \frac{L}{2} \|y_{t+1} - x_{t+1}\|^2

    This inequality shows that the decrease in function value per iteration is governed by the gradient magnitude and the projection error. The extra term \frac{L}{2} \|y_{t+1} - x_{t+1}\|^2 is introduced by the projection operation.

  2. Property of the Projection Operator
    Use a key fact (Fact (ii)) of the projection operator: for any x \in X and y \in \mathbb{R}^d, we have

    \|x - \Pi_X(y)\|^2 + \|y - \Pi_X(y)\|^2 \leq \|x - y\|^2

    Let x = x^* (the optimal solution) and y = y_{t+1}; since \Pi_X(y_{t+1}) = x_{t+1}, we obtain:

    \|x^* - x_{t+1}\|^2 + \|y_{t+1} - x_{t+1}\|^2 \leq \|x^* - y_{t+1}\|^2

    This inequality reflects the “orthogonality” of projection, i.e., the distance relationship between the projection point and the original point.

  3. Combine with Convexity
    By the convexity of f, we have:

    f(x_t) - f(x^*) \leq g_t^\top (x_t - x^*)

    This means the function value difference is controlled by the inner product of the gradient and the point difference.

  4. Algebraic Manipulation and Cancellation
    Combine the above elements. First, start from the analysis framework of ordinary gradient descent, but replace x_{t+1} with y_{t+1} (since y_{t+1} is the unprojected step). Then, using the projection property inequality and the sufficient decrease condition, algebraic manipulation shows that the extra term \|y_{t+1} - x_{t+1}\|^2 cancels out when summed. Specifically, with step size \eta = \frac{1}{L}, we have:

    \sum_{t=0}^{T-1} (f(x_t) - f(x^*)) \leq \frac{L}{2} \|x_0 - x^*\|^2

    Since f(x_t) is non-increasing (guaranteed by sufficient decrease), we finally obtain f(x_T) - f(x^*) \leq \frac{L}{2T} \|x_0 - x^*\|^2.

The key insight in the proof is: although the projection operation introduces error terms, through the properties of the projection operator, these errors cancel each other out in the overall analysis, thus maintaining the same convergence rate as unconstrained gradient descent.

Geometric Interpretation

  • Role of Projection: Projected gradient descent ensures that the iterates always satisfy the constraints by projecting the gradient step y_{t+1} back onto the feasible set X. The projection operation can be seen as “pulling back” the point to the nearest point in the feasible region.

  • Convergence Behavior: Since the function is smooth and convex, the gradient points in a descent direction, but projection may alter the path. The theorem shows that even with projection, the convergence rate remains \mathcal{O}(1/T), the same as the unconstrained case. This is because the projection error averages out over the iterations and does not affect the overall trend.

  • Intuitive Understanding: Imagine a ball rolling on a smooth convex surface, moving along the gradient direction each time, but constrained within a feasible region. Projection ensures the ball does not leave the region, while smoothness ensures the movement does not oscillate excessively, eventually stabilizing at the lowest point.

Example

A typical application is constrained linear regression: the objective function f(x) = \|Ax - b\|^2 is smooth and convex, and the constraint set X might be the non-negative orthant (x \geq 0). Projected gradient descent iteratively updates x and projects it onto X, ensuring the solution satisfies non-negativity while converging at a rate of \mathcal{O}(1/T); a sketch follows below.
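
A minimal Python sketch of this non-negative least-squares example, using step size 1/L with L = 2\|A\|_2^2 (the function name and iteration count are illustrative assumptions):

```python
import numpy as np

def pgd_nonnegative_ls(A, b, T=500):
    """Projected GD for min ||Ax - b||^2 subject to x >= 0 (a sketch)."""
    L = 2 * np.linalg.norm(A, 2) ** 2      # smoothness constant of ||Ax - b||^2
    x = np.zeros(A.shape[1])
    for _ in range(T):
        grad = 2 * A.T @ (A @ x - b)       # gradient of the least-squares loss
        x = np.maximum(x - grad / L, 0.0)  # projection onto the nonnegative orthant
    return x
```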

Additional Notes

  • The theorem assumes function smoothness and convexity; these conditions need to be verified in practice.

  • The step size \eta = 1/L is theoretically optimal, but adaptive step sizes may be used in practice.

  • The convergence rate \mathcal{O}(1/T) is sublinear, which may be slow for large-scale problems, but this is a standard result in convex optimization.


Convergence on Strongly Convex Smooth Functions

Strongly Convex Function Definition: A function f is strongly convex on a set X (with parameter \mu > 0) if for all x, y \in X it satisfies:

f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2

This means the function is bounded below by a quadratic at every point, so its curvature has a positive lower bound, which accelerates convergence.
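
As a standard illustration (an example added here, not from the original notes): the quadratic f(x) = \frac{1}{2} x^\top Q x with a symmetric positive definite matrix Q is strongly convex with \mu = \lambda_{\min}(Q) and smooth with L = \lambda_{\max}(Q), so the condition number L/\mu appearing below is exactly the eigenvalue ratio of Q.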

Properties of Strongly Convex Functions

  • Lemma 2: A strongly convex function on X has a unique minimizer x^*.

    This is because strong convexity rules out multiple minimizers, guaranteeing a unique global solution.

Lemma 2 Proof:

Mathematical Notation and Meaning

  • f: Function defined on a set X \subset \mathbb{R}^d

  • \mu > 0: Strong convexity parameter

  • x^*: Minimizer of the function f on X

Theorem Statement

If a function f is \mu-strongly convex on a set X, then f has a unique minimizer x^* on X.

Proof

  1. Existence: Strong convexity implies convexity and that f grows at least quadratically (coercivity), so on the closed convex set X a minimum is attained.

  2. Uniqueness (by contradiction):

    • Assume there are two distinct minimizers x_1^* and x_2^*, with f(x_1^*) = f(x_2^*) = f^*
    • Consider the midpoint x_m = \frac{x_1^* + x_2^*}{2} (which lies in X by convexity)
    • By the definition of strong convexity:

      f(x_m) \leq \frac{1}{2}f(x_1^*) + \frac{1}{2}f(x_2^*) - \frac{\mu}{8}\|x_1^* - x_2^*\|^2

      f(x_m) \leq f^* - \frac{\mu}{8}\|x_1^* - x_2^*\|^2 < f^*

    • This contradicts the assumption that f^* is the minimum value
    • Therefore, the minimizer must be unique

Geometric Meaning

Strongly convex functions have a “steep bowl-shaped” structure, ensuring the function has only one global minimum point, without any flat regions or multiple local minima. The larger the value of \mu, the steeper the “bowl” shape of the function, making it easier for the optimization process to find the unique solution.

Convergence of Projected Gradient Descent (Theorem 2)

  • Theorem Statement: Let f: \mathbb{R}^d \to \mathbb{R} be a convex differentiable function and X \subset \mathbb{R}^d a non-empty closed convex set. Assume f is smooth on X (parameter L) and strongly convex (parameter \mu > 0). Choose step size \eta = 1/L; then for any initial point x_0, projected gradient descent satisfies:

    • (i) Geometric decrease in the squared distance to x^*:

      \|x_{t+1} - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right) \|x_t - x^*\|^2

      This indicates that after each iteration, the distance between the solution and the optimal solution contracts at a linear rate.

    • (ii) The absolute error decreases exponentially with the iterations: after T iterations, the error satisfies:

      f(x_T) - f(x^*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^T \|x_0 - x^*\|^2

      The error decreases exponentially, with the convergence rate depending on the condition number L/\mu.

Theorem Proof:

Mathematical Notation and Meaning

  • L: Smoothness parameter (Lipschitz constant of the gradient)

  • \eta = 1/L: Optimal step size selection

  • x_t: Solution after the t-th iteration

  • y_{t+1} = x_t - \eta \nabla f(x_t): Unprojected gradient step

  • x_{t+1} = \Pi_X(y_{t+1}): Projected solution

Theorem Statement

For a \mu-strongly convex and L-smooth function f, projected gradient descent with step size \eta = 1/L satisfies:

(i) Geometric decrease in squared distance:

\|x_{t+1} - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right) \|x_t - x^*\|^2

(ii) Exponential convergence in function value error:

f(x_T) - f(x^*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^T \|x_0 - x^*\|^2

Proof Outline

Key Steps:

  1. Gradient Inequality:

    \nabla f(x_t)^\top (x_t - x^*) \geq f(x_t) - f(x^*) + \frac{\mu}{2} \|x_t - x^*\|^2

  2. Sufficient Decrease (Lemma 1; Attachment 2 P8):

    f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2 + \frac{L}{2} \|y_{t+1} - x_{t+1}\|^2

  3. Distance Recurrence Relation:

    • Combine with the non-expansiveness of the projection operator: \|\Pi_X(y) - \Pi_X(z)\| \leq \|y - z\|
    • Utilize smoothness: f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2
  4. Final Derivation:

    \begin{aligned} \|x_{t+1} - x^*\|^2 &\leq \|y_{t+1} - x^*\|^2 - \|y_{t+1} - x_{t+1}\|^2 \\ &= \|x_t - \eta \nabla f(x_t) - x^*\|^2 - \|y_{t+1} - x_{t+1}\|^2 \\ &\leq \left(1 - \frac{\mu}{L}\right) \|x_t - x^*\|^2 \end{aligned}

Geometric Meaning and Interpretation

Condition number \kappa = L/\mu: This theorem reveals that the convergence rate depends on the ratio of the function's smoothness (L) to its strong convexity (\mu). The smaller the condition number (i.e., the closer \mu is to L), the faster the convergence.

Geometric Interpretation:

  • Strong convexity ensures the function has a clear “descent direction”
  • Smoothness ensures the gradient does not change too drastically, allowing for larger step sizes
  • The projection operation ensures the solution always stays within the feasible region X
  • Each iteration pulls the current solution exponentially closer to the unique optimal solution

Practical Significance: This theorem guarantees that under reasonable conditions, projected gradient descent converges quickly to the optimal solution, with error decreasing exponentially in the number of iterations. The step size \eta = 1/L is the optimal choice, balancing convergence speed and stability.
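
To recover the \mathcal{O}(\log(1/\varepsilon)) iteration count quoted earlier, one can apply 1 - \mu/L \leq e^{-\mu/L} to bound (ii) (a short derivation added for completeness):

f(x_T) - f(x^*) \leq \frac{L}{2} e^{-T\mu/L} \|x_0 - x^*\|^2 \leq \varepsilon \quad\text{whenever}\quad T \geq \frac{L}{\mu} \log\frac{L \|x_0 - x^*\|^2}{2\varepsilon}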


Examples of Projection Step Computations

  • The projection operation \Pi_X(y) = \arg\min_{x \in X} \|x - y\| is itself an optimization problem, but it can be computed efficiently in certain cases:

    • Projection onto affine subspaces: Achieved by solving linear equations (similar to least squares).

      For example, if X is a hyperplane, the projection has an analytical solution.

    • Projection onto Euclidean balls (centered at c): Achieved by scaling the vector y - c.

      \Pi_X(y) = c + \frac{R}{\|y - c\|} (y - c) \quad \text{if} \quad \|y - c\| > R

      where R is the ball radius; otherwise, the point is already inside the ball and \Pi_X(y) = y.

    • Projection onto \ell_1-balls (used in Lasso problems): Can be reduced to projection onto the simplex.

      • Let B_1(R) = \{x : \|x\|_1 \leq R\}; the projection problem can be transformed into:

        \min_{x \in \triangle_d} \|x - z\|^2 \quad \text{where} \quad \triangle_d = \{x : \sum_i x_i = 1,\ x_i \geq 0\}

      Projection onto the simplex can be computed in \mathcal{O}(d \log d) or \mathcal{O}(d) time (DSSSC08 algorithm); a sketch of the sort-based variant follows below.
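
A minimal Python sketch of the sort-based \mathcal{O}(d \log d) simplex projection, in the spirit of the DSSSC08 algorithm (the exact indexing convention here is my own):

```python
import numpy as np

def project_simplex(z, R=1.0):
    """Euclidean projection of z onto {x : x >= 0, sum_i x_i = R}."""
    u = np.sort(z)[::-1]                       # components in decreasing order
    css = np.cumsum(u)                         # running sums of the sorted vector
    idx = np.arange(1, len(z) + 1)
    rho = np.nonzero(u * idx > css - R)[0][-1] # last index kept positive
    theta = (css[rho] - R) / (rho + 1.0)       # uniform shift applied to all entries
    return np.maximum(z - theta, 0.0)

# Example: a point outside the simplex is mapped onto it.
x = project_simplex(np.array([0.8, 0.6, -0.2]))
assert np.isclose(x.sum(), 1.0) and np.all(x >= 0)  # x = [0.6, 0.4, 0.0]
```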


Proximal Gradient Descent

Problem Setting

Consider composite optimization problems where the objective function consists of two parts:

f(x) := g(x) + h(x)

Mathematical Notation Meaning:

  • g(x): “Well-behaved” function (usually convex and L-smooth)

  • h(x): “Simple” but possibly non-smooth additional term (e.g., the \ell_1 norm or an indicator function)

  • f(x): Composite objective function to be optimized

Proximal gradient descent is suitable for solving non-smooth, constrained, or specially structured optimization problems.

Algorithm Idea and Update Formula

Core Idea: Extend the gradient descent idea to composite functions. For the smooth function g(x), the gradient descent step is equivalent to minimizing a local quadratic approximation:

x_{t+1} = \arg\min_z \left\{ g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\eta} \|z - x_t\|^2 \right\}

For f = g + h, keep the quadratic approximation for g and directly add h:

x_{t+1} = \arg\min_z \left\{ g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\eta} \|z - x_t\|^2 + h(z) \right\}

Algorithm Iteration Formula:

x_{t+1} = \text{prox}_{h, \eta}(x_t - \eta \nabla g(x_t))

Proximal Mapping Definition:

\text{prox}_{h, \eta}(z) = \arg\min_x \left\{ \frac{1}{2\eta} \|x - z\|^2 + h(x) \right\}

Equivalent Form:

x_{t+1} = x_t - \eta G_{h,\eta}(x_t)

where G_{h,\eta}(x) = \frac{1}{\eta}\left(x - \text{prox}_{h, \eta}(x - \eta \nabla g(x))\right) is the generalized gradient.
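
The update fits in a few lines of Python; here `prox_h(z, eta)` is assumed to compute \text{prox}_{h,\eta}(z) for the chosen h (a sketch, not a library API):

```python
def proximal_gradient_descent(grad_g, prox_h, x0, eta, T):
    """Iterate x_{t+1} = prox_{h,eta}(x_t - eta * grad g(x_t))."""
    x = x0
    for _ in range(T):
        x = prox_h(x - eta * grad_g(x), eta)  # gradient step on g, then prox on h
    return x
```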

Special Cases and Geometric Meaning

h = 0: Reduces to standard gradient descent

\text{prox}_{0, \eta}(z) = z

h = 1_X (indicator function): Reduces to projected gradient descent

  • Indicator function: 1_X(x) = \begin{cases} 0, & x \in X \\ \infty, & x \notin X \end{cases}

  • The proximal mapping becomes the projection: \text{prox}_{1_X, \eta}(z) = \Pi_X(z) = \arg\min_{x \in X} \|x - z\|

Geometric Meaning: The proximal mapping finds points near z that make h(x) small; the parameter \eta balances the goals of “staying close to z” and “making h small”.

Convergence Analysis

Theorem 3 (Proximal Gradient Descent Convergence)

Let g: \mathbb{R}^d \rightarrow \mathbb{R} be convex and L-smooth, let h be convex, and assume the proximal mapping is computable. Choose step size \eta = 1/L; then:

f(x_T) - f(x^*) \leq \frac{L}{2T} \|x^* - x_0\|^2

The convergence rate is \mathcal{O}(1/T), the same as gradient descent on smooth convex functions.

Theorem 3 Proof Idea:

Mathematical Notation Meaning

  • \psi(z) = g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\eta} \|z - x_t\|^2 + h(z): Local approximation function

  • x_{t+1} = \arg\min_z \psi(z): Proximal gradient step

  • \|y - x_{t+1}\|^2: Approximation error term, measuring the squared distance between a comparison point y and the new iterate x_{t+1}

Proof Steps

  1. Define local approximation function:

    \psi(z) = g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\eta} \|z - x_t\|^2 + h(z)

    Due to the quadratic term, \psi(z) is strongly convex (with parameter 1/\eta).

  2. Utilize strong convexity:
    By the definition of x_{t+1} as the minimizer of \psi and the strong convexity of \psi:

    \psi(x_{t+1}) \leq \psi(y) - \frac{1}{2\eta} \|y - x_{t+1}\|^2, \quad \forall y

  3. Transform into a decreasing property of f:
    Using the L-smoothness and convexity of g, transform the inequality into:

    f(x_{t+1}) \leq f(y) + \frac{1}{2\eta} \|y - x_t\|^2 - \frac{1}{2\eta} \|y - x_{t+1}\|^2, \quad \forall y

  4. Summation and monotonicity:

    • Set y = x^* and sum from t = 0 to T-1
    • Utilize the monotonic decrease of function values: f(x_{t+1}) \leq f(x_t)
    • Obtain the final convergence bound

Geometric Intuition: The proximal method only “sees” the smooth part g; the non-smooth part h is handled separately by the proximal mapping and does not affect the convergence speed.

Example: Iterative Soft Thresholding Algorithm (ISTA)

Lasso Regression Problem:

\min_{\beta} \frac{1}{2} \|y - A\beta\|_2^2 + \lambda \|\beta\|_1

Mathematical Notation Meaning:

  • A \in \mathbb{R}^{n \times d}: Feature matrix

  • y \in \mathbb{R}^n: Response variable

  • \lambda > 0: Sparsity regularization parameter

  • \beta \in \mathbb{R}^d: Coefficient vector

Decomposition:

  • g(\beta) = \frac{1}{2} \|y - A\beta\|_2^2, \quad \nabla g(\beta) = -A^\top (y - A\beta)

  • h(\beta) = \lambda \|\beta\|_1

Soft Thresholding Operator:

[\text{prox}_{h,\eta}(z)]_i = S_{\lambda\eta}(z_i) = \begin{cases} z_i - \eta\lambda & \text{if } z_i > \eta\lambda \\ 0 & \text{if } |z_i| \leq \eta\lambda \\ z_i + \eta\lambda & \text{if } z_i < -\eta\lambda \end{cases}

ISTA Iteration Formula:

\beta_{t+1} = S_{\lambda\eta}(\beta_t + \eta A^\top (y - A\beta_t))

Derivation Explanation: By analyzing the first-order optimality conditions of the proximal mapping, a piecewise analytical solution is obtained, corresponding to the soft thresholding operation.
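
Putting the pieces together, here is a minimal Python sketch of ISTA with step size \eta = 1/L, where L = \|A\|_2^2 is the largest eigenvalue of A^\top A (function names and the iteration count are illustrative):

```python
import numpy as np

def soft_threshold(z, tau):
    """Componentwise soft thresholding: the prox of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, T=500):
    """ISTA for the Lasso: min 0.5*||y - A b||^2 + lam*||b||_1 (a sketch)."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L with L = ||A||_2^2
    beta = np.zeros(A.shape[1])
    for _ in range(T):
        # gradient step on g, then soft thresholding with level lam * eta
        beta = soft_threshold(beta + eta * A.T @ (y - A @ beta), lam * eta)
    return beta
```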


Mirror Descent


Motivation

Consider the simplex-constrained problem:

\min_{x \in \triangle_d} f(x)

where \triangle_d := \{x \in \mathbb{R}^d : \sum_{i=1}^d x_i = 1,\ x_i \geq 0\}. Assume the gradient satisfies \|\nabla f(x)\|_\infty \leq 1, meaning each component of the gradient has bounded absolute value.

  • Convergence Rate Comparison:

    • Gradient Descent (GD) on convex L-Lipschitz functions: \mathcal{O}\left(\sqrt{\frac{d}{T}}\right)
    • Mirror Descent can achieve \mathcal{O}\left(\sqrt{\frac{\log d}{T}}\right), which is superior for high-dimensional problems (large d)
  • Why Superior: GD's convergence rate is affected by the dimension d, while Mirror Descent reduces the dependence to \log d through the mirror map, better matching the geometry of the simplex


Mathematical Notation Meaning
  • \|\cdot\|_\infty: Infinity norm, the maximum absolute value of the components of a vector, i.e., \max_i |x_i|

  • \|\cdot\|_2: Euclidean norm, the standard length of a vector, \|x\|_2 = \sqrt{\sum_i x_i^2}

  • L: Lipschitz constant, here derived from the gradient norm bound \|\nabla f(x)\|_2 \leq \sqrt{d} = L

  • d: Problem dimension

  • T: Number of iterations

  • f(x_{\text{best}}) - f(x^*): Optimality gap, the difference in function value between the current best solution and the optimal solution

  • \rho: Strong convexity parameter, describing the convexity strength of the mirror potential \Phi

  • \eta: Step size, set to \eta = \frac{2R}{L} \sqrt{\frac{\rho}{T}} in Mirror Descent

  • R: Domain radius, R^2 = \sup_{x \in X} D_\Phi(x, x_1), where x_1 is the initial point

Preliminaries

  • Norms and Dual Norms: For a fixed norm \|\cdot\|, the dual norm is defined as \|g\|_* = \sup_{\|x\| \leq 1} g^\top x

  • Function Properties:

    • L-Lipschitz: for all x and all g \in \partial f(x), \|g\|_* \leq L
    • \beta-smooth: the gradient is \beta-Lipschitz with respect to the chosen norm
    • \mu-strongly convex: f(y) \geq f(x) + g^\top (y-x) + \frac{\mu}{2} \|y-x\|^2 for g \in \partial f(x)

Mirror Descent Iteration Process

Mirror Descent iteration consists of two steps: dual update and projection. This process ensures that at each iteration, the point remains in the feasible region X.

  • Dual Update Step:

    y_{t+1} = (\nabla\Phi)^{-1}(\nabla\Phi(x_t) - \eta_t g_t)

    where:

    • x_t is the current iteration point (in the primal space)
    • \nabla\Phi is the gradient of the Mirror Potential, used to map to the dual space
    • \eta_t is the step size (learning rate)
    • g_t \in \partial f(x_t) is a subgradient of the objective function f at x_t (the gradient, if f is differentiable)
      This step performs gradient descent in the dual space
  • Projection Step:

    x_{t+1} = \underset{x \in X}{\arg\min}\, D_{\Phi}(x, y_{t+1})

    where D_{\Phi} is the Bregman Divergence associated with \Phi. This step projects the intermediate point y_{t+1} back onto the feasible region X by minimizing the Bregman Divergence

Geometric Interpretation
The geometric meaning of Mirror Descent can be understood from the perspectives of dual space and primal space:

  • Dual Space: The gradient step \nabla\Phi(x_t) - \eta_t g_t is performed in the dual space, where the update is linear
  • Primal Space: The dual point is mapped back to the primal space via the inverse mapping (\nabla\Phi)^{-1}, and feasibility is then ensured through Bregman projection
  • Schematic diagrams (e.g., from Bubeck 2015) illustrate this process: the “mirror” mapping from dual space to primal space, including gradient steps and projection, making the algorithm efficient under complex geometries

Key Elements

Mirror Potential

Mirror Potential is the core function of Mirror Descent, denoted \Phi. It defines the mapping from the primal space to the dual space.

  • Definition: \Phi: \mathbb{R}^d \rightarrow \mathbb{R} is a function satisfying the following properties:

    • Strict Convexity: Ensures the optimization problem has a unique solution
    • Continuous Differentiability: The gradient \nabla\Phi exists and is continuous, facilitating computation
    • Limit Condition: \|\nabla\Phi(x)\| \rightarrow \infty as \|x\|_2 \rightarrow \infty. This prevents algorithm divergence and ensures stability

Bregman Divergence

Bregman Divergence is an asymmetric distance measure used to replace the Euclidean distance, induced by a convex function (usually \Phi or a related function f). It measures the “difference” between points and is used in the projection step to ensure feasibility.

  • Definition: For a function f, the Bregman Divergence is defined as:

    D_f(x, y) = f(x) - f(y) - \nabla f(y)^\top (x - y)

    This is the gap at x between f and its first-order approximation around y

  • Properties: Non-negativity (D_f(x, y) \geq 0, with equality when x = y) and convexity in x
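
A small Python check of the definition, using \Phi(x) = \frac{1}{2}\|x\|_2^2 to recover half the squared Euclidean distance (this is the Ball Setup discussed below; the helper name is illustrative):

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

# With f(x) = 0.5*||x||^2, the divergence equals 0.5*||x - y||^2.
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
d = bregman(lambda v: 0.5 * v @ v, lambda v: v, x, y)
assert np.isclose(d, 0.5 * np.sum((x - y) ** 2))
```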

Projection Operator

Projection is a key part of Mirror Descent, ensuring algorithm feasibility on the constraint set X.

  • Projection Definition: The projection operation is denoted \Pi^{\Phi}_{X} and uses the Bregman Divergence associated with \Phi:

    \Pi^{\Phi}_{X}(y) = \arg\min_{x \in X} D_{\Phi}(x, y)

  • Existence and Uniqueness: Due to the strict convexity and continuous differentiability of \Phi, the projection \Pi^{\Phi}_{X} exists and is unique. This means a unique projection point can be found for any point, avoiding ambiguity in the algorithm

Convergence of Mirror Descent

Mirror Descent Update Rule

Mirror Descent is an optimization algorithm suitable for convex optimization problems in non-Euclidean spaces. Its update rule is as follows:

  • Initialization: Choose an initial point x_1 \in \operatorname*{arg\,min}_{x \in X} \Phi(x), where \Phi is a mirror map and X is the feasible region

  • For each time step t \geq 1:

    • Compute a subgradient g_t \in \partial f(x_t), where f is the objective function
    • Then define y_{t+1} \in \mathbb{R}^d satisfying:

      \nabla\Phi(y_{t+1}) = \nabla\Phi(x_t) - \eta g_t

      where \eta is the step size parameter

Core Idea: Perform gradient descent in the mirror space (defined by \Phi), then project back to the primal space via the mirror mapping. \nabla\Phi is the gradient of the mapping, which carries points from the primal space to the dual space. A minimal sketch of this loop follows.
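
A minimal Python sketch of the update rule, assuming callables `grad_phi`, `grad_phi_inv`, and a Bregman projection `bregman_proj` are available for the chosen mirror map (all names are illustrative assumptions):

```python
def mirror_descent(grad_f, grad_phi, grad_phi_inv, bregman_proj, x1, eta, T):
    """Dual-space gradient step followed by Bregman projection onto X."""
    x, avg = x1, x1 * 0.0
    for _ in range(T):
        y = grad_phi_inv(grad_phi(x) - eta * grad_f(x))  # grad Phi(y) = grad Phi(x) - eta*g
        x = bregman_proj(y)                              # argmin_{x in X} D_Phi(x, y)
        avg = avg + x / T
    return avg  # Theorem 4 below bounds f at the averaged iterate
```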

Theorem 4: Convergence Guarantee

Assumptions:

  • \Phi is a mirror map and is \rho-strongly convex on X with respect to the norm \|\cdot\|

  • f is a convex function and L-Lipschitz continuous with respect to \|\cdot\|.
    Then the Mirror Descent algorithm with step size \eta = \frac{2R}{L} \sqrt{\frac{\rho}{T}} satisfies:

f\left(\frac{1}{T} \sum_{t=1}^T x_t\right) - f(x^*) \leq \frac{2RL}{\sqrt{\rho T}}

where R is an upper bound on the “distance” between the initial point and the optimal solution x^*, and T is the number of iterations

Notation Meaning:

  • \rho: Strong convexity parameter, measuring the convexity strength of \Phi
  • L: Lipschitz constant, an upper bound on the rate of change of the function f
  • R: Usually defined via R^2 = \max_{x \in X} D_\Phi(x, x_1), reflecting the problem scale
  • \eta: Step size, controlling the update magnitude, chosen from the problem parameters to ensure convergence

Standard Setup Examples

Ball Setup

  • Mirror potential: \Phi(x) = \frac{1}{2} \|x\|_2^2

  • Bregman divergence: D_\Phi(x, y) = \frac{1}{2} \|x - y\|_2^2

  • In this case, Mirror Descent is equivalent to projected subgradient descent, since the projection is performed in Euclidean space

Simplex Setup

  • Mirror potential: \Phi(x) = \sum_{i=1}^d x_i \log x_i (negative entropy), defined on the simplex \triangle_d = \{x \in \mathbb{R}^d : x_i \geq 0, \sum_i x_i = 1\}

  • Gradient update: \nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta \nabla f(x_t) can be written in component form:

    y_{t+1,i} = x_{t,i} \exp(-\eta [\nabla f(x_t)]_i)

  • Bregman divergence: D_\Phi(x, y) = \sum_{i=1}^d x_i \log \frac{x_i}{y_i} (the Kullback-Leibler divergence)

  • Projection: Projection onto the simplex is equivalent to renormalization, y \rightarrow y / \|y\|_1

  • Example: When X = \triangle_d and the initial point is x_1 = (1/d, \ldots, 1/d), then R^2 = \log d, suitable for high-dimensional probability optimization

Example Explanation: The simplex setup is commonly used in probability model optimization in machine learning, such as online learning or problems constrained to the probability simplex. Using the KL divergence adapts the algorithm to the geometry of probability distributions; a concrete sketch of this exponentiated-gradient update is given below.
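
A minimal Python sketch of the simplex setup (the exponentiated-gradient form of Mirror Descent), assuming a differentiable f (names and defaults are illustrative):

```python
import numpy as np

def exponentiated_gradient(grad_f, d, eta, T):
    """Mirror descent with the negative-entropy potential on the simplex."""
    x = np.full(d, 1.0 / d)                # x_1 = (1/d, ..., 1/d) minimizes Phi
    avg = np.zeros(d)
    for _ in range(T):
        y = x * np.exp(-eta * grad_f(x))   # componentwise dual-space update
        x = y / y.sum()                    # KL (Bregman) projection = renormalization
        avg += x / T
    return avg                             # averaged iterate, as in Theorem 4
```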

Proof Details:
  1. Basic Inequality: Using properties of the Bregman divergence, we have:

    f(x_t) - f(x) \leq \frac{1}{\eta} \left( D_\Phi(x, x_t) - D_\Phi(x, x_{t+1}) \right) + \frac{\eta}{2\rho} \|g_t\|_*^2

    • Source: Attachment 2 P47, derived through the first-order optimality condition and strong convexity
  2. Summation and Telescoping: Summing from t = 1 to T, the D_\Phi terms form a telescoping sum:

    \sum_{t=1}^T [D_\Phi(x, x_t) - D_\Phi(x, x_{t+1})] = D_\Phi(x, x_1) - D_\Phi(x, x_{T+1}) \leq R^2

    • Because x_1 = \arg\min_{x \in X} \Phi(x) and R^2 = \sup_{x \in X} D_\Phi(x, x_1), we have D_\Phi(x, x_1) \leq R^2
  3. Bounding the Gradient Term: Since f is L-Lipschitz, \|g_t\|_* \leq L, therefore:

    \sum_{t=1}^T \frac{\eta}{2\rho} \|g_t\|_*^2 \leq \frac{\eta T L^2}{2\rho}

  4. Combine Results:

    \sum_{t=1}^T (f(x_t) - f(x)) \leq \frac{R^2}{\eta} + \frac{\eta T L^2}{2\rho}

    • Choose \eta = \frac{2R}{L} \sqrt{\frac{\rho}{T}} to balance the two terms on the right-hand side, obtaining the stated convergence rate
  5. Jensen's Inequality: Finally, apply Jensen's inequality to the averaged iterate \frac{1}{T} \sum_{t=1}^T x_t to bound f at the average point