#sdsc6015


Projected Gradient Descent

Projected Gradient Descent is an algorithm for constrained optimization problems that ensures constraint satisfaction by projecting gradient steps back onto the feasible set.


Constrained Optimization Problem Definition

The constrained optimization problem is formally defined as:

\begin{aligned} &\min f(x) \\ &\text{subject to}\quad x \in X \end{aligned}

where:

  • f: \mathbb{R}^d \rightarrow \mathbb{R} is the objective function

  • X \subseteq \mathbb{R}^d is a closed convex set

  • x \in \mathbb{R}^d is the optimization variable

Geometric Meaning: Find the point that minimizes f(x) while satisfying the constraint x \in X.


Algorithm Description

Projected Gradient Descent iteration steps:

\begin{aligned} &\text{For } t = 0, 1, 2, \ldots \text{ with stepsize } \eta_t > 0: \\ &\quad y_{t+1} = x_t - \eta_t \nabla f(x_t) \\ &\quad x_{t+1} = \Pi_X(y_{t+1}) \end{aligned}

where the projection operator is defined as:

\Pi_X(y) := \operatorname*{arg\,min}_{x \in X} \|x - y\|

Mathematical Notation Meaning:

  • \eta_t: Step size parameter, controls the update magnitude

  • \nabla f(x_t): Gradient of the function at x_t

  • \Pi_X: Projection operator, maps a point to the closest point in X
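
To make the iteration concrete, here is a minimal Python sketch of projected gradient descent, assuming a Euclidean-ball feasible set (the names `grad_f`, `project_ball`, and all parameter values are illustrative, not from the original notes):

```python
import numpy as np

def project_ball(y, c, R):
    """Euclidean projection onto the ball {x : ||x - c||_2 <= R}."""
    d = y - c
    norm = np.linalg.norm(d)
    return y if norm <= R else c + (R / norm) * d

def projected_gradient_descent(grad_f, project, x0, eta, T):
    """Iterate y_{t+1} = x_t - eta * grad f(x_t); x_{t+1} = Pi_X(y_{t+1})."""
    x = x0
    for _ in range(T):
        y = x - eta * grad_f(x)  # unconstrained gradient step
        x = project(y)           # pull the step back onto the feasible set X
    return x

# Example: minimize ||x - b||^2 over the unit ball centered at the origin.
b = np.array([3.0, 4.0])
x_hat = projected_gradient_descent(
    grad_f=lambda x: 2 * (x - b),
    project=lambda y: project_ball(y, c=np.zeros(2), R=1.0),
    x0=np.zeros(2), eta=0.25, T=100)
# x_hat is approximately b / ||b|| = [0.6, 0.8]
```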


Convergence Rate Analysis

The convergence rates are the same as for unconstrained gradient descent, but each step additionally requires a projection operation:

  • Lipschitz convex function (parameter G): \mathcal{O}(1/\varepsilon^2) steps

  • Smooth convex function (parameter L): \mathcal{O}(1/\varepsilon) steps

  • Smooth strongly convex function (parameters \mu, L): \mathcal{O}(\log(1/\varepsilon)) steps

Computational efficiency depends on the difficulty of the projection operation.


Convergence on Smooth Functions

Smoothness Definition: A function f is called smooth (with parameter L) on a closed convex set X \subseteq \mathbb{R}^d if:

\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \quad \forall x, y \in X

Lemma 1 (Sufficient Decrease):

Let f: \mathbb{R}^d \rightarrow \mathbb{R} be differentiable and smooth (parameter L) on a closed convex set X \subseteq \mathbb{R}^d. Choose step size \eta = 1/L; then projected gradient descent satisfies:

f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2

where y_{t+1} = x_t - \eta \nabla f(x_t).

Lemma 1 Proof:

Mathematical Notation Meaning in Projection

  • y_{t+1} = x_t - \eta \nabla f(x_t): The unprojected gradient step, the point reached by moving one step along the negative gradient direction; it may not lie in the feasible set X.

  • x_{t+1} = \Pi_X(y_{t+1}): The projected point, obtained by mapping y_{t+1} to the closest point in X via the projection operator \Pi_X, ensuring the iterates always satisfy the constraints.

  • \|y_{t+1} - x_{t+1}\|^2: The projection error, the squared Euclidean distance between the unprojected and projected points. This term acts as a “penalty” for the deviation caused by projection, as projection may shift the point away from the gradient direction.

  • \Pi_X(y) := \operatorname*{arg\,min}_{x \in X} \|x - y\|: The projection operator, which projects a point y onto the closed convex set X by finding the point in X closest to y.


Proof

The proof of Lemma 1 relies on three key elements: the smoothness of the function, the optimality condition of projection, and a geometric property of the projection operator (Fact (ii)). The detailed proof steps are as follows:

  1. Utilize Function Smoothness:
    Since f is L-smooth, the smoothness definition gives, for any points x and y:

    f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2

    Apply this property with x = x_t and y = y_{t+1}, substituting y_{t+1} - x_t = -\frac{1}{L} \nabla f(x_t):

    f(y_{t+1}) \leq f(x_t) - \frac{1}{L} \|\nabla f(x_t)\|^2 + \frac{L}{2} \cdot \frac{1}{L^2} \|\nabla f(x_t)\|^2 = f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2

    This step gives an upper bound on the function value at the unprojected point y_{t+1}.

  2. Optimality Condition of Projection:
    Since x_{t+1} = \Pi_X(y_{t+1}) is the projection point, for a closed convex set X the projection operator satisfies the following optimality condition: for any x \in X, we have:

    (x - x_{t+1})^\top (y_{t+1} - x_{t+1}) \leq 0

    In particular, taking x = x_t (valid since x_t \in X), we have:

    (x_t - x_{t+1})^\top (y_{t+1} - x_{t+1}) \leq 0

    Substituting y_{t+1} = x_t - \eta \nabla f(x_t) and rearranging gives:

    \nabla f(x_t)^\top (x_{t+1} - x_t) \leq -\frac{1}{\eta} \|x_t - x_{t+1}\|^2

    This inequality relates the gradient inner product to the distance moved after projection.

  3. Apply Smoothness to x_{t+1}:
    Use the smoothness definition again, this time with x = x_t and y = x_{t+1}:

    f(x_{t+1}) \leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \|x_{t+1} - x_t\|^2

    Substituting the inequality from step 2 gives:

    f(x_{t+1}) \leq f(x_t) - \frac{1}{\eta} \|x_t - x_{t+1}\|^2 + \frac{L}{2} \|x_{t+1} - x_t\|^2

    With \eta = 1/L, this simplifies to:

    f(x_{t+1}) \leq f(x_t) - \frac{L}{2} \|x_t - x_{t+1}\|^2

  4. Utilize Projection Property (Fact (ii)):
    Fact (ii) is a geometric property of the projection operator: for any x \in X and y \in \mathbb{R}^d, we have:

    \|x - \Pi_X(y)\|^2 + \|y - \Pi_X(y)\|^2 \leq \|x - y\|^2

    Taking x = x_t and y = y_{t+1}, and using \Pi_X(y_{t+1}) = x_{t+1}, we get:

    \|x_t - x_{t+1}\|^2 + \|y_{t+1} - x_{t+1}\|^2 \leq \|x_t - y_{t+1}\|^2 = \frac{1}{L^2} \|\nabla f(x_t)\|^2

    where the last equality uses x_t - y_{t+1} = \frac{1}{L} \nabla f(x_t). Combining this relation with steps 1-3 and rearranging yields the sufficient-decrease inequality of Lemma 1, completing the proof.


Theorem 1:

Let f: \mathbb{R}^d \rightarrow \mathbb{R} be a convex and differentiable function. Let X \subseteq \mathbb{R}^d be a closed convex set, and assume there exists a minimizer x^* \in X such that f(x^*) = \min_{x \in X} f(x). Further, assume f is smooth on X (i.e., the gradient is L-Lipschitz continuous) with parameter L. Choose step size \eta = \frac{1}{L}; then the iterates generated by the projected gradient descent algorithm satisfy the following convergence bound:

f(x_T) - f(x^*) \leq \frac{L}{2T} \|x_0 - x^*\|^2

where x_T is the point after the T-th iteration and x_0 is the initial point.
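
As a quick consequence (a short derivation added for completeness): requiring the right-hand side to be at most \varepsilon shows that \mathcal{O}(1/\varepsilon) iterations suffice, matching the rate table above:

\frac{L}{2T} \|x_0 - x^*\|^2 \leq \varepsilon \quad\Longleftrightarrow\quad T \geq \frac{L \|x_0 - x^*\|^2}{2\varepsilon}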

Theorem 1 Proof:

Notation Explanation

  • f: Objective function, mapping from \mathbb{R}^d to \mathbb{R}, convex and differentiable.

  • X: Feasible region, a closed convex set (e.g., the region defined by the constraints).

  • x^*: Global minimizer of f on X, i.e., x^* = \arg\min_{x \in X} f(x).

  • L: Smoothness constant (Lipschitz constant of the gradient), an upper bound on the gradient variation: \|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| holds for all x, y \in X.

  • \eta: Step size, set to \eta = \frac{1}{L} in the theorem to ensure convergence.

  • g_t: Gradient at point x_t, i.e., g_t = \nabla f(x_t).

  • x_t: Point after the t-th iteration (after projection).

  • y_{t+1}: Unprojected gradient step, computed as y_{t+1} = x_t - \eta g_t.

  • x_{t+1}: Projected point, i.e., x_{t+1} = \Pi_X(y_{t+1}), where \Pi_X is the projection operator onto the set X.

The projection operator \Pi_X(y) returns the point in X closest to y, i.e., \Pi_X(y) = \arg\min_{x \in X} \|x - y\|. Since X is closed and convex, the projection exists and is unique.

Proof Sketch

The core of the proof lies in utilizing the function’s smoothness and convexity, as well as the properties of the projection operator, to derive the error bound per iteration. The key steps are:

  1. Sufficient Decrease Condition
    Since f is L-smooth, projected gradient descent satisfies the following sufficient decrease inequality:

    f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|g_t\|^2 + \frac{L}{2} \|y_{t+1} - x_{t+1}\|^2

    This inequality shows that the decrease in function value per iteration is governed by the gradient magnitude and the projection error. The extra term \frac{L}{2} \|y_{t+1} - x_{t+1}\|^2 is introduced by the projection operation.

  2. Property of the Projection Operator
    Use a key fact (Fact (ii)) of the projection operator: for any x \in X and y \in \mathbb{R}^d, we have

    \|x - \Pi_X(y)\|^2 + \|y - \Pi_X(y)\|^2 \leq \|x - y\|^2

    Let x = x^* (the optimal solution) and y = y_{t+1}; since \Pi_X(y_{t+1}) = x_{t+1}, we obtain:

    \|x^* - x_{t+1}\|^2 + \|y_{t+1} - x_{t+1}\|^2 \leq \|x^* - y_{t+1}\|^2

    This inequality reflects the “orthogonality” of projection, i.e., the distance relationship between the projection point and the original point.

  3. Combine with Convexity
    By the convexity of f, we have:

    f(x_t) - f(x^*) \leq g_t^\top (x_t - x^*)

    This means the function value difference is controlled by the inner product of the gradient and the point difference.

  4. Algebraic Manipulation and Cancellation
    Combine the above elements. First, start from the analysis framework of ordinary gradient descent, but replace x_{t+1} with y_{t+1} (since y_{t+1} is the unprojected step). Then, using the projection property inequality and the sufficient decrease condition, algebraic manipulation shows that the extra term \|y_{t+1} - x_{t+1}\|^2 cancels out when summed. Specifically, with step size \eta = \frac{1}{L}, we have:

    \sum_{t=0}^{T-1} (f(x_t) - f(x^*)) \leq \frac{L}{2} \|x_0 - x^*\|^2

    Since f(x_t) is non-increasing (guaranteed by sufficient decrease), we finally obtain f(x_T) - f(x^*) \leq \frac{L}{2T} \|x_0 - x^*\|^2.

The key insight in the proof is: although the projection operation introduces error terms, through the properties of the projection operator, these errors cancel each other out in the overall analysis, thus maintaining the same convergence rate as unconstrained gradient descent.

Geometric Interpretation

  • Role of Projection: Projected gradient descent ensures that the iterates always satisfy the constraints by projecting the gradient step y_{t+1} back onto the feasible set X. The projection operation can be seen as “pulling back” the point to the nearest point in the feasible region.

  • Convergence Behavior: Since the function is smooth and convex, the gradient points in a descent direction, but projection may alter the path. The theorem shows that even with projection, the convergence rate remains \mathcal{O}(1/T), the same as the unconstrained case. This is because the projection error averages out over the iterations and does not affect the overall trend.

  • Intuitive Understanding: Imagine a ball rolling on a smooth convex surface, moving along the gradient direction each time, but constrained within a feasible region. Projection ensures the ball does not leave the region, while smoothness ensures the movement does not oscillate excessively, eventually stabilizing at the lowest point.

Example

A typical application is constrained linear regression: the objective function f(x) = \|Ax - b\|^2 is smooth and convex, and the constraint set X might be the non-negative orthant (x \geq 0). Projected gradient descent iteratively updates x and projects it onto X, ensuring the solution satisfies non-negativity while converging at a rate of \mathcal{O}(1/T); a sketch follows below.
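
A minimal Python sketch of this non-negative least-squares example, using step size 1/L with L = 2\|A\|_2^2 (the function name and iteration count are illustrative assumptions):

```python
import numpy as np

def pgd_nonnegative_ls(A, b, T=500):
    """Projected GD for min ||Ax - b||^2 subject to x >= 0 (a sketch)."""
    L = 2 * np.linalg.norm(A, 2) ** 2      # smoothness constant of ||Ax - b||^2
    x = np.zeros(A.shape[1])
    for _ in range(T):
        grad = 2 * A.T @ (A @ x - b)       # gradient of the least-squares loss
        x = np.maximum(x - grad / L, 0.0)  # projection onto the nonnegative orthant
    return x
```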

Additional Notes

  • The theorem assumes function smoothness and convexity; these conditions need to be verified in practice.

  • The step size \eta = 1/L is theoretically optimal, but adaptive step sizes may be used in practice.

  • The convergence rate \mathcal{O}(1/T) is sublinear, which may be slow for large-scale problems, but this is a standard result in convex optimization.


Convergence on Strongly Convex Smooth Functions

Strongly Convex Function Definition: A function f is strongly convex on a set X (with parameter \mu > 0) if for all x, y \in X it satisfies:

f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2

This means the function is bounded below by a quadratic at every point, so its curvature has a positive lower bound, which accelerates convergence.
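
As a standard illustration (an example added here, not from the original notes): the quadratic f(x) = \frac{1}{2} x^\top Q x with a symmetric positive definite matrix Q is strongly convex with \mu = \lambda_{\min}(Q) and smooth with L = \lambda_{\max}(Q), so the condition number L/\mu appearing below is exactly the eigenvalue ratio of Q.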

Properties of Strongly Convex Functions

  • Lemma 2: A strongly convex function on X has a unique minimizer x^*.

    This is because strong convexity rules out multiple minimizers, guaranteeing a unique global solution.

Lemma 2 Proof:

Mathematical Notation and Meaning

  • f: Function defined on a set X \subset \mathbb{R}^d

  • \mu > 0: Strong convexity parameter

  • x^*: Minimizer of the function f on X

Theorem Statement

If a function f is \mu-strongly convex on a set X, then f has a unique minimizer x^* on X.

Proof

  1. Existence: Strong convexity implies convexity and that f grows at least quadratically (coercivity), so on the closed convex set X a minimum is attained.

  2. Uniqueness (by contradiction):

    • Assume there are two distinct minimizers x_1^* and x_2^*, with f(x_1^*) = f(x_2^*) = f^*
    • Consider the midpoint x_m = \frac{x_1^* + x_2^*}{2} (which lies in X by convexity)
    • By the definition of strong convexity:

      f(x_m) \leq \frac{1}{2}f(x_1^*) + \frac{1}{2}f(x_2^*) - \frac{\mu}{8}\|x_1^* - x_2^*\|^2

      f(x_m) \leq f^* - \frac{\mu}{8}\|x_1^* - x_2^*\|^2 < f^*

    • This contradicts the assumption that f^* is the minimum value
    • Therefore, the minimizer must be unique

Geometric Meaning

Strongly convex functions have a “steep bowl-shaped” structure, ensuring the function has only one global minimum point, without any flat regions or multiple local minima. The larger the value of \mu, the steeper the “bowl” shape of the function, making it easier for the optimization process to find the unique solution.

Convergence of Projected Gradient Descent (Theorem 2)

  • Theorem Statement: Let f: \mathbb{R}^d \to \mathbb{R} be a convex differentiable function and X \subset \mathbb{R}^d a non-empty closed convex set. Assume f is smooth on X (parameter L) and strongly convex (parameter \mu > 0). Choose step size \eta = 1/L; then for any initial point x_0, projected gradient descent satisfies:

    • (i) Geometric decrease in the squared distance to x^*:

      \|x_{t+1} - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right) \|x_t - x^*\|^2

      This indicates that after each iteration, the distance between the solution and the optimal solution contracts at a linear rate.

    • (ii) The absolute error decreases exponentially with the iterations: after T iterations, the error satisfies:

      f(x_T) - f(x^*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^T \|x_0 - x^*\|^2

      The error decreases exponentially, with the convergence rate depending on the condition number L/\mu.

Theorem Proof:

Mathematical Notation and Meaning

  • L: Smoothness parameter (Lipschitz constant of the gradient)

  • \eta = 1/L: Optimal step size selection

  • x_t: Solution after the t-th iteration

  • y_{t+1} = x_t - \eta \nabla f(x_t): Unprojected gradient step

  • x_{t+1} = \Pi_X(y_{t+1}): Projected solution

Theorem Statement

For a \mu-strongly convex and L-smooth function f, projected gradient descent with step size \eta = 1/L satisfies:

(i) Geometric decrease in squared distance:

\|x_{t+1} - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right) \|x_t - x^*\|^2

(ii) Exponential convergence in function value error:

f(x_T) - f(x^*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^T \|x_0 - x^*\|^2

Proof Outline

Key Steps:

  1. Gradient Inequality:

    \nabla f(x_t)^\top (x_t - x^*) \geq f(x_t) - f(x^*) + \frac{\mu}{2} \|x_t - x^*\|^2

  2. Sufficient Decrease (Lemma 1; Attachment 2 P8):

    f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2 + \frac{L}{2} \|y_{t+1} - x_{t+1}\|^2

  3. Distance Recurrence Relation:

    • Combine with the non-expansiveness of the projection operator: \|\Pi_X(y) - \Pi_X(z)\| \leq \|y - z\|
    • Utilize smoothness: f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2
  4. Final Derivation:

    \begin{aligned} \|x_{t+1} - x^*\|^2 &\leq \|y_{t+1} - x^*\|^2 - \|y_{t+1} - x_{t+1}\|^2 \\ &= \|x_t - \eta \nabla f(x_t) - x^*\|^2 - \|y_{t+1} - x_{t+1}\|^2 \\ &\leq \left(1 - \frac{\mu}{L}\right) \|x_t - x^*\|^2 \end{aligned}

Geometric Meaning and Interpretation

Condition number \kappa = L/\mu: This theorem reveals that the convergence rate depends on the ratio of the function's smoothness (L) to its strong convexity (\mu). The smaller the condition number (i.e., the closer \mu is to L), the faster the convergence.

Geometric Interpretation:

  • Strong convexity ensures the function has a clear “descent direction”
  • Smoothness ensures the gradient does not change too drastically, allowing for larger step sizes
  • The projection operation ensures the solution always stays within the feasible region X
  • Each iteration pulls the current solution exponentially closer to the unique optimal solution

Practical Significance: This theorem guarantees that under reasonable conditions, projected gradient descent converges quickly to the optimal solution, with error decreasing exponentially in the number of iterations. The step size \eta = 1/L is the optimal choice, balancing convergence speed and stability.
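
To recover the \mathcal{O}(\log(1/\varepsilon)) iteration count quoted earlier, one can apply 1 - \mu/L \leq e^{-\mu/L} to bound (ii) (a short derivation added for completeness):

f(x_T) - f(x^*) \leq \frac{L}{2} e^{-T\mu/L} \|x_0 - x^*\|^2 \leq \varepsilon \quad\text{whenever}\quad T \geq \frac{L}{\mu} \log\frac{L \|x_0 - x^*\|^2}{2\varepsilon}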


Examples of Projection Step Computations

  • The projection operation \Pi_X(y) = \arg\min_{x \in X} \|x - y\| is itself an optimization problem, but it can be computed efficiently in certain cases:

    • Projection onto affine subspaces: Achieved by solving linear equations (similar to least squares).

      For example, if X is a hyperplane, the projection has an analytical solution.

    • Projection onto Euclidean balls (centered at c): Achieved by scaling the vector y - c.

      \Pi_X(y) = c + \frac{R}{\|y - c\|} (y - c) \quad \text{if} \quad \|y - c\| > R

      where R is the ball radius; otherwise, the point is already inside the ball and \Pi_X(y) = y.

    • Projection onto \ell_1-balls (used in Lasso problems): Can be reduced to projection onto the simplex.

      • Let B_1(R) = \{x : \|x\|_1 \leq R\}; the projection problem can be transformed into:

        \min_{x \in \triangle_d} \|x - z\|^2 \quad \text{where} \quad \triangle_d = \{x : \sum_i x_i = 1,\ x_i \geq 0\}

      Projection onto the simplex can be computed in \mathcal{O}(d \log d) or \mathcal{O}(d) time (DSSSC08 algorithm); a sketch of the sort-based variant follows below.
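
A minimal Python sketch of the sort-based \mathcal{O}(d \log d) simplex projection, in the spirit of the DSSSC08 algorithm (the exact indexing convention here is my own):

```python
import numpy as np

def project_simplex(z, R=1.0):
    """Euclidean projection of z onto {x : x >= 0, sum_i x_i = R}."""
    u = np.sort(z)[::-1]                       # components in decreasing order
    css = np.cumsum(u)                         # running sums of the sorted vector
    idx = np.arange(1, len(z) + 1)
    rho = np.nonzero(u * idx > css - R)[0][-1] # last index kept positive
    theta = (css[rho] - R) / (rho + 1.0)       # uniform shift applied to all entries
    return np.maximum(z - theta, 0.0)

# Example: a point outside the simplex is mapped onto it.
x = project_simplex(np.array([0.8, 0.6, -0.2]))
assert np.isclose(x.sum(), 1.0) and np.all(x >= 0)  # x = [0.6, 0.4, 0.0]
```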


Proximal Gradient Descent

Problem Setting

Consider composite optimization problems where the objective function consists of two parts:

f(x) := g(x) + h(x)

Mathematical Notation Meaning:

  • g(x): “Well-behaved” function (usually convex and L-smooth)

  • h(x): “Simple” but possibly non-smooth additional term (e.g., the \ell_1 norm or an indicator function)

  • f(x): Composite objective function to be optimized

Proximal gradient descent is suitable for solving non-smooth, constrained, or specially structured optimization problems.

Algorithm Idea and Update Formula

Core Idea: Extend the gradient descent idea to composite functions. For the smooth function g(x), the gradient descent step is equivalent to minimizing a local quadratic approximation:

x_{t+1} = \arg\min_z \left\{ g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\eta} \|z - x_t\|^2 \right\}

For f = g + h, keep the quadratic approximation for g and directly add h:

x_{t+1} = \arg\min_z \left\{ g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\eta} \|z - x_t\|^2 + h(z) \right\}

Algorithm Iteration Formula:

x_{t+1} = \text{prox}_{h, \eta}(x_t - \eta \nabla g(x_t))

Proximal Mapping Definition:

\text{prox}_{h, \eta}(z) = \arg\min_x \left\{ \frac{1}{2\eta} \|x - z\|^2 + h(x) \right\}

Equivalent Form:

x_{t+1} = x_t - \eta G_{h,\eta}(x_t)

where G_{h,\eta}(x) = \frac{1}{\eta}\left(x - \text{prox}_{h, \eta}(x - \eta \nabla g(x))\right) is the generalized gradient.
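
The update fits in a few lines of Python; here `prox_h(z, eta)` is assumed to compute \text{prox}_{h,\eta}(z) for the chosen h (a sketch, not a library API):

```python
def proximal_gradient_descent(grad_g, prox_h, x0, eta, T):
    """Iterate x_{t+1} = prox_{h,eta}(x_t - eta * grad g(x_t))."""
    x = x0
    for _ in range(T):
        x = prox_h(x - eta * grad_g(x), eta)  # gradient step on g, then prox on h
    return x
```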

Special Cases and Geometric Meaning

h = 0: Reduces to standard gradient descent

\text{prox}_{0, \eta}(z) = z

h = 1_X (indicator function): Reduces to projected gradient descent

  • Indicator function: 1_X(x) = \begin{cases} 0, & x \in X \\ \infty, & x \notin X \end{cases}

  • The proximal mapping becomes the projection: \text{prox}_{1_X, \eta}(z) = \Pi_X(z) = \arg\min_{x \in X} \|x - z\|

Geometric Meaning: The proximal mapping finds points near z that make h(x) small; the parameter \eta balances the goals of “staying close to z” and “making h small”.

Convergence Analysis

Theorem 3 (Proximal Gradient Descent Convergence)

Let g: \mathbb{R}^d \rightarrow \mathbb{R} be convex and L-smooth, let h be convex, and assume the proximal mapping is computable. Choose step size \eta = 1/L; then:

f(x_T) - f(x^*) \leq \frac{L}{2T} \|x^* - x_0\|^2

The convergence rate is \mathcal{O}(1/T), the same as gradient descent on smooth convex functions.

Theorem 3 Proof Idea:

Mathematical Notation Meaning

  • \psi(z) = g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\eta} \|z - x_t\|^2 + h(z): Local approximation function

  • x_{t+1} = \arg\min_z \psi(z): Proximal gradient step

  • \|y - x_{t+1}\|^2: Approximation error term, measuring the squared distance between a comparison point y and the new iterate x_{t+1}

Proof Steps

  1. Define local approximation function:

    \psi(z) = g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\eta} \|z - x_t\|^2 + h(z)

    Due to the quadratic term, \psi(z) is strongly convex (with parameter 1/\eta).

  2. Utilize strong convexity:
    By the definition of x_{t+1} as the minimizer of \psi and the strong convexity of \psi:

    \psi(x_{t+1}) \leq \psi(y) - \frac{1}{2\eta} \|y - x_{t+1}\|^2, \quad \forall y

  3. Transform into a decreasing property of f:
    Using the L-smoothness and convexity of g, transform the inequality into:

    f(x_{t+1}) \leq f(y) + \frac{1}{2\eta} \|y - x_t\|^2 - \frac{1}{2\eta} \|y - x_{t+1}\|^2, \quad \forall y

  4. Summation and monotonicity:

    • Set y = x^* and sum from t = 0 to T-1
    • Utilize the monotonic decrease of function values: f(x_{t+1}) \leq f(x_t)
    • Obtain the final convergence bound

Geometric Intuition: The proximal method only “sees” the smooth part g; the non-smooth part h is handled separately by the proximal mapping and does not affect the convergence speed.

Example: Iterative Soft Thresholding Algorithm (ISTA)

Lasso Regression Problem:

\min_{\beta} \frac{1}{2} \|y - A\beta\|_2^2 + \lambda \|\beta\|_1

Mathematical Notation Meaning:

  • A \in \mathbb{R}^{n \times d}: Feature matrix

  • y \in \mathbb{R}^n: Response variable

  • \lambda > 0: Sparsity regularization parameter

  • \beta \in \mathbb{R}^d: Coefficient vector

Decomposition:

  • g(\beta) = \frac{1}{2} \|y - A\beta\|_2^2, \quad \nabla g(\beta) = -A^\top (y - A\beta)

  • h(\beta) = \lambda \|\beta\|_1

Soft Thresholding Operator:

[\text{prox}_{h,\eta}(z)]_i = S_{\lambda\eta}(z_i) = \begin{cases} z_i - \eta\lambda & \text{if } z_i > \eta\lambda \\ 0 & \text{if } |z_i| \leq \eta\lambda \\ z_i + \eta\lambda & \text{if } z_i < -\eta\lambda \end{cases}

ISTA Iteration Formula:

\beta_{t+1} = S_{\lambda\eta}(\beta_t + \eta A^\top (y - A\beta_t))

Derivation Explanation: By analyzing the first-order optimality conditions of the proximal mapping, a piecewise analytical solution is obtained, corresponding to the soft thresholding operation.
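
Putting the pieces together, here is a minimal Python sketch of ISTA with step size \eta = 1/L, where L = \|A\|_2^2 is the largest eigenvalue of A^\top A (function names and the iteration count are illustrative):

```python
import numpy as np

def soft_threshold(z, tau):
    """Componentwise soft thresholding: the prox of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, T=500):
    """ISTA for the Lasso: min 0.5*||y - A b||^2 + lam*||b||_1 (a sketch)."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L with L = ||A||_2^2
    beta = np.zeros(A.shape[1])
    for _ in range(T):
        # gradient step on g, then soft thresholding with level lam * eta
        beta = soft_threshold(beta + eta * A.T @ (y - A @ beta), lam * eta)
    return beta
```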


Mirror Descent


Motivation

Consider the simplex-constrained problem:

\min_{x \in \triangle_d} f(x)

where \triangle_d := \{x \in \mathbb{R}^d : \sum_{i=1}^d x_i = 1,\ x_i \geq 0\}. Assume the gradient satisfies \|\nabla f(x)\|_\infty \leq 1, meaning each component of the gradient has bounded absolute value.

  • Convergence Rate Comparison:

    • Gradient Descent (GD) on convex L-Lipschitz functions: \mathcal{O}\left(\sqrt{\frac{d}{T}}\right)
    • Mirror Descent can achieve \mathcal{O}\left(\sqrt{\frac{\log d}{T}}\right), which is superior for high-dimensional problems (large d)
  • Why Superior: GD's convergence rate is affected by the dimension d, while Mirror Descent reduces the dependence to \log d through the mirror map, better matching the geometry of the simplex


Mathematical Notation Meaning
  • \|\cdot\|_\infty: Infinity norm, the maximum absolute value of the components of a vector, i.e., \max_i |x_i|

  • \|\cdot\|_2: Euclidean norm, the standard length of a vector, \|x\|_2 = \sqrt{\sum_i x_i^2}

  • L: Lipschitz constant, here derived from the gradient norm bound \|\nabla f(x)\|_2 \leq \sqrt{d} = L

  • d: Problem dimension

  • T: Number of iterations

  • f(x_{\text{best}}) - f(x^*): Optimality gap, the difference in function value between the current best solution and the optimal solution

  • \rho: Strong convexity parameter, describing the convexity strength of the mirror potential \Phi

  • \eta: Step size, set to \eta = \frac{2R}{L} \sqrt{\frac{\rho}{T}} in Mirror Descent

  • R: Domain radius, R^2 = \sup_{x \in X} D_\Phi(x, x_1), where x_1 is the initial point

Preliminaries

  • Norms and Dual Norms: For a fixed norm \|\cdot\|, the dual norm is defined as \|g\|_* = \sup_{\|x\| \leq 1} g^\top x

  • Function Properties:

    • L-Lipschitz: for all x and all g \in \partial f(x), \|g\|_* \leq L
    • \beta-smooth: the gradient is \beta-Lipschitz with respect to the chosen norm
    • \mu-strongly convex: f(y) \geq f(x) + g^\top (y-x) + \frac{\mu}{2} \|y-x\|^2 for g \in \partial f(x)

Mirror Descent Iteration Process

Mirror Descent iteration consists of two steps: dual update and projection. This process ensures that at each iteration, the point remains in the feasible region X.

  • Dual Update Step:

    y_{t+1} = (\nabla\Phi)^{-1}(\nabla\Phi(x_t) - \eta_t g_t)

    where:

    • x_t is the current iteration point (in the primal space)
    • \nabla\Phi is the gradient of the Mirror Potential, used to map to the dual space
    • \eta_t is the step size (learning rate)
    • g_t \in \partial f(x_t) is a subgradient of the objective function f at x_t (the gradient, if f is differentiable)
      This step performs gradient descent in the dual space
  • Projection Step:

    x_{t+1} = \underset{x \in X}{\arg\min}\, D_{\Phi}(x, y_{t+1})

    where D_{\Phi} is the Bregman Divergence associated with \Phi. This step projects the intermediate point y_{t+1} back onto the feasible region X by minimizing the Bregman Divergence

Geometric Interpretation
The geometric meaning of Mirror Descent can be understood from the perspectives of dual space and primal space:

  • Dual Space: The gradient step \nabla\Phi(x_t) - \eta_t g_t is performed in the dual space, where the update is linear
  • Primal Space: The dual point is mapped back to the primal space via the inverse mapping (\nabla\Phi)^{-1}, and feasibility is then ensured through Bregman projection
  • Schematic diagrams (e.g., from Bubeck 2015) illustrate this process: the “mirror” mapping from dual space to primal space, including gradient steps and projection, making the algorithm efficient under complex geometries

Key Elements

Mirror Potential

Mirror Potential is the core function of Mirror Descent, denoted \Phi. It defines the mapping from the primal space to the dual space.

  • Definition: \Phi: \mathbb{R}^d \rightarrow \mathbb{R} is a function satisfying the following properties:

    • Strict Convexity: Ensures the optimization problem has a unique solution
    • Continuous Differentiability: The gradient \nabla\Phi exists and is continuous, facilitating computation
    • Limit Condition: \|\nabla\Phi(x)\| \rightarrow \infty as \|x\|_2 \rightarrow \infty. This prevents algorithm divergence and ensures stability

Bregman Divergence

Bregman Divergence is an asymmetric distance measure used to replace the Euclidean distance, induced by a convex function (usually \Phi or a related function f). It measures the “difference” between points and is used in the projection step to ensure feasibility.

  • Definition: For a function f, the Bregman Divergence is defined as:

    D_f(x, y) = f(x) - f(y) - \nabla f(y)^\top (x - y)

    This is the gap at x between f and its first-order approximation around y

  • Properties: Non-negativity (D_f(x, y) \geq 0, with equality when x = y) and convexity in x
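
A small Python check of the definition, using \Phi(x) = \frac{1}{2}\|x\|_2^2 to recover half the squared Euclidean distance (this is the Ball Setup discussed below; the helper name is illustrative):

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

# With f(x) = 0.5*||x||^2, the divergence equals 0.5*||x - y||^2.
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
d = bregman(lambda v: 0.5 * v @ v, lambda v: v, x, y)
assert np.isclose(d, 0.5 * np.sum((x - y) ** 2))
```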

Projection Operator

Projection is a key part of Mirror Descent, ensuring algorithm feasibility on the constraint set X.

  • Projection Definition: The projection operation is denoted \Pi^{\Phi}_{X} and uses the Bregman Divergence associated with \Phi:

    \Pi^{\Phi}_{X}(y) = \arg\min_{x \in X} D_{\Phi}(x, y)

  • Existence and Uniqueness: Due to the strict convexity and continuous differentiability of \Phi, the projection \Pi^{\Phi}_{X} exists and is unique. This means a unique projection point can be found for any point, avoiding ambiguity in the algorithm

Convergence of Mirror Descent

Mirror Descent Update Rule

Mirror Descent is an optimization algorithm suitable for convex optimization problems in non-Euclidean spaces. Its update rule is as follows:

  • Initialization: Choose an initial point x_1 \in \operatorname*{arg\,min}_{x \in X} \Phi(x), where \Phi is a mirror map and X is the feasible region

  • For each time step t \geq 1:

    • Compute a subgradient g_t \in \partial f(x_t), where f is the objective function
    • Then define y_{t+1} \in \mathbb{R}^d satisfying:

      \nabla\Phi(y_{t+1}) = \nabla\Phi(x_t) - \eta g_t

      where \eta is the step size parameter

Core Idea: Perform gradient descent in the mirror space (defined by \Phi), then project back to the primal space via the mirror mapping. \nabla\Phi is the gradient of the mapping, which carries points from the primal space to the dual space. A minimal sketch of this loop follows.
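
A minimal Python sketch of the update rule, assuming callables `grad_phi`, `grad_phi_inv`, and a Bregman projection `bregman_proj` are available for the chosen mirror map (all names are illustrative assumptions):

```python
def mirror_descent(grad_f, grad_phi, grad_phi_inv, bregman_proj, x1, eta, T):
    """Dual-space gradient step followed by Bregman projection onto X."""
    x, avg = x1, x1 * 0.0
    for _ in range(T):
        y = grad_phi_inv(grad_phi(x) - eta * grad_f(x))  # grad Phi(y) = grad Phi(x) - eta*g
        x = bregman_proj(y)                              # argmin_{x in X} D_Phi(x, y)
        avg = avg + x / T
    return avg  # Theorem 4 below bounds f at the averaged iterate
```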

Theorem 4: Convergence Guarantee

Assumptions:

  • \Phi is a mirror map and is \rho-strongly convex on X with respect to the norm \|\cdot\|

  • f is a convex function and L-Lipschitz continuous with respect to \|\cdot\|.
    Then the Mirror Descent algorithm with step size \eta = \frac{2R}{L} \sqrt{\frac{\rho}{T}} satisfies:

f\left(\frac{1}{T} \sum_{t=1}^T x_t\right) - f(x^*) \leq \frac{2RL}{\sqrt{\rho T}}

where R is an upper bound on the “distance” between the initial point and the optimal solution x^*, and T is the number of iterations

Notation Meaning:

  • \rho: Strong convexity parameter, measuring the convexity strength of \Phi
  • L: Lipschitz constant, an upper bound on the rate of change of the function f
  • R: Usually defined via R^2 = \max_{x \in X} D_\Phi(x, x_1), reflecting the problem scale
  • \eta: Step size, controlling the update magnitude, chosen from the problem parameters to ensure convergence

Standard Setup Examples

Ball Setup

  • Mirror potential: \Phi(x) = \frac{1}{2} \|x\|_2^2

  • Bregman divergence: D_\Phi(x, y) = \frac{1}{2} \|x - y\|_2^2

  • In this case, Mirror Descent is equivalent to projected subgradient descent, since the projection is performed in Euclidean space

Simplex Setup

  • Mirror potential: \Phi(x) = \sum_{i=1}^d x_i \log x_i (negative entropy), defined on the simplex \triangle_d = \{x \in \mathbb{R}^d : x_i \geq 0, \sum_i x_i = 1\}

  • Gradient update: \nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta \nabla f(x_t) can be written in component form:

    y_{t+1,i} = x_{t,i} \exp(-\eta [\nabla f(x_t)]_i)

  • Bregman divergence: D_\Phi(x, y) = \sum_{i=1}^d x_i \log \frac{x_i}{y_i} (the Kullback-Leibler divergence)

  • Projection: Projection onto the simplex is equivalent to renormalization, y \rightarrow y / \|y\|_1

  • Example: When X = \triangle_d and the initial point is x_1 = (1/d, \ldots, 1/d), then R^2 = \log d, suitable for high-dimensional probability optimization

Example Explanation: The simplex setup is commonly used in probability model optimization in machine learning, such as online learning or problems constrained to the probability simplex. Using the KL divergence adapts the algorithm to the geometry of probability distributions; a concrete sketch of this exponentiated-gradient update is given below.
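
A minimal Python sketch of the simplex setup (the exponentiated-gradient form of Mirror Descent), assuming a differentiable f (names and defaults are illustrative):

```python
import numpy as np

def exponentiated_gradient(grad_f, d, eta, T):
    """Mirror descent with the negative-entropy potential on the simplex."""
    x = np.full(d, 1.0 / d)                # x_1 = (1/d, ..., 1/d) minimizes Phi
    avg = np.zeros(d)
    for _ in range(T):
        y = x * np.exp(-eta * grad_f(x))   # componentwise dual-space update
        x = y / y.sum()                    # KL (Bregman) projection = renormalization
        avg += x / T
    return avg                             # averaged iterate, as in Theorem 4
```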

Proof Details:
  1. Basic Inequality: Using properties of the Bregman divergence, we have:

    f(x_t) - f(x) \leq \frac{1}{\eta} \left( D_\Phi(x, x_t) - D_\Phi(x, x_{t+1}) \right) + \frac{\eta}{2\rho} \|g_t\|_*^2

    • Source: Attachment 2 P47, derived through the first-order optimality condition and strong convexity
  2. Summation and Telescoping: Summing from t = 1 to T, the D_\Phi terms form a telescoping sum:

    \sum_{t=1}^T [D_\Phi(x, x_t) - D_\Phi(x, x_{t+1})] = D_\Phi(x, x_1) - D_\Phi(x, x_{T+1}) \leq R^2

    • Because x_1 = \arg\min_{x \in X} \Phi(x) and R^2 = \sup_{x \in X} D_\Phi(x, x_1), we have D_\Phi(x, x_1) \leq R^2
  3. Bounding the Gradient Term: Since f is L-Lipschitz, \|g_t\|_* \leq L, therefore:

    \sum_{t=1}^T \frac{\eta}{2\rho} \|g_t\|_*^2 \leq \frac{\eta T L^2}{2\rho}

  4. Combine Results:

    \sum_{t=1}^T (f(x_t) - f(x)) \leq \frac{R^2}{\eta} + \frac{\eta T L^2}{2\rho}

    • Choose \eta = \frac{2R}{L} \sqrt{\frac{\rho}{T}} to balance the two terms on the right-hand side, obtaining the stated convergence rate
  5. Jensen's Inequality: Finally, apply Jensen's inequality to the averaged iterate \frac{1}{T} \sum_{t=1}^T x_t to bound f at the average point