SDSC6015 Course 4: Projected Gradient Descent, Proximal Gradient Descent and Mirror Descent
#sdsc6015
Projected Gradient Descent
Projected Gradient Descent is an algorithm for constrained optimization problems that ensures constraint satisfaction by projecting gradient steps back onto the feasible set.
Constrained Optimization Problem Definition
The constrained optimization problem is formally defined as:

$$\min_{x \in X} f(x)$$

where:

- $f: \mathbb{R}^d \to \mathbb{R}$ is the objective function
- $X \subseteq \mathbb{R}^d$ is a closed convex set
- $x$ is the optimization variable

Geometric Meaning: Find the point $x$ that minimizes $f$ while satisfying the constraint $x \in X$.
Algorithm Description
Projected Gradient Descent iteration steps:

$$y_{t+1} := x_t - \gamma \nabla f(x_t), \qquad x_{t+1} := \Pi_X(y_{t+1})$$

where the projection operator is defined as:

$$\Pi_X(y) := \arg\min_{x \in X} \|x - y\|^2$$

Mathematical Notation Meaning:

- $\gamma$: step size parameter, controls the update magnitude
- $\nabla f(x_t)$: gradient of the function $f$ at $x_t$
- $\Pi_X$: projection operator, maps a point to the closest point in $X$
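To make the two-step iteration concrete, here is a minimal Python sketch (not from the lecture; `grad_f` and `project` are placeholders for a problem-specific gradient and projection oracle):

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, gamma, n_steps):
    """Minimal PGD sketch: gradient step, then projection back onto X."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        y = x - gamma * grad_f(x)   # unprojected gradient step y_{t+1}
        x = project(y)              # x_{t+1} = Pi_X(y_{t+1})
    return x
```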
Convergence Rate Analysis
The convergence rates are the same as for unconstrained gradient descent, but each step additionally requires a projection:

- Lipschitz convex functions (parameter $B$): $O(1/\varepsilon^2)$ steps
- Smooth convex functions (parameter $L$): $O(1/\varepsilon)$ steps
- Smooth and strongly convex functions (parameters $L, \mu$): $O(\log(1/\varepsilon))$ steps

Overall computational efficiency therefore depends on how difficult the projection operation is.
Convergence on Smooth Functions
Smoothness Definition: A function $f$ is called smooth (with parameter $L$) on a closed convex set $X$ if for all $x, y \in X$:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$$

Lemma 1 (Sufficient Decrease):
Let $f$ be differentiable and smooth (parameter $L$) on a closed convex set $X$. Choose step size $\gamma = \frac{1}{L}$. Then projected gradient descent satisfies:

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2$$

where $y_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$ and $x_{t+1} = \Pi_X(y_{t+1})$.
Mathematical Notation Meaning in Projection
- $y_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$: the unprojected gradient step, i.e., the point reached after moving one step along the negative gradient direction; it may not lie in the feasible set $X$.
- $x_{t+1} = \Pi_X(y_{t+1})$: the projected point, obtained by mapping $y_{t+1}$ to the closest point in $X$ via the projection operator $\Pi_X$, ensuring the iterates always satisfy the constraints.
- $\frac{L}{2}\|y_{t+1} - x_{t+1}\|^2$: the projection error, the squared Euclidean distance between the unprojected point and the projected point. This term acts as a "penalty" for the deviation caused by projection, as projection may shift the point away from the gradient direction.
- $\Pi_X(y) = \arg\min_{x \in X}\|x - y\|^2$: the projection operator, which projects the point $y$ onto the closed convex set $X$ by finding the point in $X$ closest to $y$.
Proof
The proof of Lemma 1 relies on three key elements: function smoothness, the optimality condition of projection, and a geometric property of the projection operator (Fact (ii)). The detailed proof steps are as follows:

1. Utilize Function Smoothness:
   Since $f$ is $L$-smooth, by the smoothness definition, for any points $x$ and $y$ we have:
   $$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$$
   Apply this property with $x = x_t$ and $y = y_{t+1}$, and substitute $y_{t+1} - x_t = -\frac{1}{L}\nabla f(x_t)$, obtaining:
   $$f(y_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2$$
   This step gives an upper bound on the function value at the unprojected point $y_{t+1}$.

2. Optimality Condition of Projection:
   Since $x_{t+1} = \Pi_X(y_{t+1})$ is the projection point onto the closed convex set $X$, the projection operator satisfies the following optimality condition: for any $z \in X$,
   $$(y_{t+1} - x_{t+1})^\top (z - x_{t+1}) \le 0$$
   In particular, take $z = x_t$ (valid since $x_t \in X$):
   $$(y_{t+1} - x_{t+1})^\top (x_t - x_{t+1}) \le 0$$
   Substituting $y_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$ and rearranging gives:
   $$\nabla f(x_t)^\top (x_{t+1} - x_t) \le -L\|x_t - x_{t+1}\|^2$$
   This inequality relates the gradient inner product to the movement distance after projection.

3. Apply Smoothness to $x_t$ and $x_{t+1}$:
   Again use the smoothness definition, this time applied to $x = x_t$ and $y = x_{t+1}$:
   $$f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2}\|x_{t+1} - x_t\|^2$$
   Since $\nabla f(x_t) = L(x_t - y_{t+1})$, the cosine identity $2v^\top w = \|v\|^2 + \|w\|^2 - \|v - w\|^2$ gives
   $$\nabla f(x_t)^\top (x_{t+1} - x_t) = \frac{L}{2}\left(\|y_{t+1} - x_{t+1}\|^2 - \|x_t - y_{t+1}\|^2 - \|x_t - x_{t+1}\|^2\right)$$
   Substituting this and $\|x_t - y_{t+1}\|^2 = \frac{1}{L^2}\|\nabla f(x_t)\|^2$, simplifying gives the claimed bound:
   $$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2$$

4. Projection Property (Fact (ii)):
   Fact (ii) mentioned in the attachment is a geometric property of the projection operator, a direct consequence of the optimality condition in step 2: for any $y \in \mathbb{R}^d$ and $x \in X$,
   $$\|x - \Pi_X(y)\|^2 + \|y - \Pi_X(y)\|^2 \le \|x - y\|^2$$
   Taking $x = x_t$ and $y = y_{t+1}$, and since $\Pi_X(y_{t+1}) = x_{t+1}$, we get:
   $$\|x_t - x_{t+1}\|^2 + \|y_{t+1} - x_{t+1}\|^2 \le \|x_t - y_{t+1}\|^2 = \frac{1}{L^2}\|\nabla f(x_t)\|^2$$
   This fact is used again in the proof of Theorem 1 below, where it is applied with $x = x^*$.
Theorem 1:
Let $f: \mathbb{R}^d \to \mathbb{R}$ be a convex and differentiable function. Let $X \subseteq \mathbb{R}^d$ be a closed convex set, and assume there exists a minimizer $x^* = \arg\min_{x \in X} f(x)$. Further, assume $f$ is smooth on $X$ (i.e., the gradient is $L$-Lipschitz continuous) with parameter $L$. Choose step size $\gamma = \frac{1}{L}$. Then the iterates generated by the projected gradient descent algorithm satisfy the following convergence bound:

$$f(x_T) - f(x^*) \le \frac{L}{2T}\|x_0 - x^*\|^2$$

where $x_T$ is the point after the $T$-th iteration and $x_0$ is the initial point.
Notation Explanation
- $f: \mathbb{R}^d \to \mathbb{R}$: objective function, convex and differentiable.
- $X$: feasible region, a closed convex set (e.g., a region defined by constraints).
- $x^*$: global minimizer of $f$ on $X$, i.e., $x^* = \arg\min_{x \in X} f(x)$.
- $L$: smoothness constant (Lipschitz constant of the gradient), an upper bound on the gradient variation: $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ holds for all $x, y$.
- $\gamma$: step size, set to $\gamma = \frac{1}{L}$ in the theorem to ensure convergence.
- $\nabla f(x_t)$: gradient at the point $x_t$.
- $x_t$: point after the $t$-th iteration (after projection).
- $y_{t+1}$: unprojected gradient step, computed as $y_{t+1} = x_t - \gamma\nabla f(x_t)$.
- $x_{t+1}$: projected point, i.e., $x_{t+1} = \Pi_X(y_{t+1})$, where $\Pi_X$ is the projection operator onto the set $X$.

The projection operator $\Pi_X(y)$ returns the point in $X$ closest to $y$, i.e., $\Pi_X(y) = \arg\min_{x \in X}\|x - y\|^2$. Since $X$ is closed and convex, the projection exists and is unique.
Proof Sketch
The core of the proof lies in utilizing the function's smoothness and convexity, as well as the properties of the projection operator, to derive the error bound per iteration. The key steps are:

1. Sufficient Decrease Condition
   Since $f$ is $L$-smooth, Lemma 1 gives the following sufficient decrease inequality for projected gradient descent:
   $$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 + \frac{L}{2}\|y_{t+1} - x_{t+1}\|^2$$
   This inequality shows that the decrease in function value after each iteration is governed by the gradient magnitude and the projection error. The extra term $\frac{L}{2}\|y_{t+1} - x_{t+1}\|^2$ is introduced by the projection operation.

2. Property of the Projection Operator
   Use a key fact (Fact (ii)) of the projection operator: for any $y \in \mathbb{R}^d$ and $x \in X$, we have
   $$\|x - \Pi_X(y)\|^2 + \|y - \Pi_X(y)\|^2 \le \|x - y\|^2$$
   Let $x = x^*$ (the optimal solution) and $y = y_{t+1}$; since $\Pi_X(y_{t+1}) = x_{t+1}$, we obtain:
   $$\|x^* - x_{t+1}\|^2 + \|y_{t+1} - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$$
   This inequality reflects the "orthogonality" of projection, i.e., the distance relationship between the projection point and the original point.

3. Combine with Convexity
   By the convexity of $f$, we have:
   $$f(x_t) - f(x^*) \le \nabla f(x_t)^\top (x_t - x^*)$$
   This means the function value gap is controlled by the inner product of the gradient and the point difference.

4. Algebraic Manipulation and Cancellation
   Combine the above elements. First, start from the analysis framework of ordinary gradient descent, but with $y_{t+1}$ in place of $x_{t+1}$ (since $y_{t+1}$ is the unprojected step); expanding $\|y_{t+1} - x^*\|^2$ gives
   $$\nabla f(x_t)^\top (x_t - x^*) = \frac{1}{2L}\|\nabla f(x_t)\|^2 + \frac{L}{2}\left(\|x_t - x^*\|^2 - \|y_{t+1} - x^*\|^2\right)$$
   Then, using the projection property from step 2 and the sufficient decrease condition, the extra projection-error terms $\frac{L}{2}\|y_{t+1} - x_{t+1}\|^2$ cancel out when summed. Specifically, with step size $\gamma = \frac{1}{L}$, the squared distances telescope and
   $$\sum_{t=0}^{T-1}\left(f(x_{t+1}) - f(x^*)\right) \le \frac{L}{2}\|x_0 - x^*\|^2$$
   Since $f(x_t)$ is decreasing (guaranteed by sufficient decrease), we finally obtain $f(x_T) - f(x^*) \le \frac{L}{2T}\|x_0 - x^*\|^2$.

The key insight in the proof is: although the projection operation introduces error terms, through the properties of the projection operator these errors cancel each other out in the overall analysis, thus maintaining the same convergence rate as unconstrained gradient descent.
Geometric Interpretation
- Role of Projection: Projected gradient descent ensures that the iterates always satisfy the constraints by projecting the gradient step back onto the feasible set $X$. The projection operation can be seen as "pulling back" the point to the nearest point in the feasible region.
- Convergence Behavior: Since the function is smooth and convex, the gradient points in a descent direction, but projection may alter the path. The theorem shows that even with projection, the convergence rate remains $O(1/T)$, the same as in the unconstrained case. This is because the projection error terms cancel in the summed analysis and do not affect the overall trend.
- Intuitive Understanding: Imagine a ball rolling on a smooth convex surface, moving along the gradient direction each time, but constrained within a feasible region. Projection ensures the ball does not leave the region, while smoothness ensures the movement does not oscillate excessively, so the ball eventually settles at the lowest feasible point.
Example
Although the slides do not provide a worked example, a typical application is constrained linear regression: the objective function $f(x) = \|Ax - b\|^2$ is smooth and convex, and the constraint set might be the non-negative orthant $X = \{x : x \ge 0\}$. Projected gradient descent iteratively updates $x$ and projects it onto $X$, ensuring the solution satisfies non-negativity while converging at a rate of $O(1/T)$.
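A hedged sketch of this example with synthetic data (the matrix `A`, vector `b`, and iteration count are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

# f(x) = ||Ax - b||^2 has gradient 2 A^T (Ax - b), so L = 2 * lambda_max(A^T A)
L = 2 * np.linalg.norm(A.T @ A, 2)
gamma = 1.0 / L

x = np.zeros(10)
for _ in range(500):
    y = x - gamma * 2 * A.T @ (A @ x - b)  # unprojected gradient step
    x = np.maximum(y, 0.0)                 # projection onto the non-negative orthant
```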
Additional Notes
- The theorem assumes smoothness and convexity of the function; these conditions need to be verified in practice.
- The step size $\gamma = \frac{1}{L}$ is theoretically optimal, but adaptive step sizes may be used in practice.
- The $O(1/T)$ convergence rate is sublinear, which may be slow for large-scale problems, but this is a standard result in convex optimization.
Convergence on Strongly Convex Smooth Functions
Strongly Convex Function Definition: A function $f$ is strongly convex on a set $X$ (with parameter $\mu > 0$) if for all $x, y \in X$ it satisfies:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$$

This means the function is bounded below by a quadratic, ensuring its curvature has a lower bound and thus accelerating optimization convergence.
Properties of Strongly Convex Functions
- Lemma 2: A strongly convex function on $X$ has a unique minimizer $x^*$.
  This is because strong convexity rules out multiple minimizers, guaranteeing a unique global solution.

Mathematical Notation and Meaning

- $f$: function defined on the set $X$
- $\mu > 0$: strong convexity parameter
- $x^*$: minimizer of the function $f$ on $X$
Theorem Statement
If a function $f$ is $\mu$-strongly convex on a set $X$, then $f$ has a unique minimizer $x^*$ on $X$.
Proof
1. Existence: Since $f$ is strongly convex, it is in particular convex (and coercive), and $X$ is a closed convex set, so a minimizer exists.
2. Uniqueness (by contradiction):
   - Assume there are two distinct minimizers $x^*$ and $y^*$, with $f(x^*) = f(y^*) = f^*$ and $x^* \ne y^*$.
   - Consider the midpoint $z = \frac{1}{2}(x^* + y^*) \in X$.
   - By the definition of strong convexity (applied with coefficient $\frac{1}{2}$):
     $$f(z) \le \frac{1}{2}f(x^*) + \frac{1}{2}f(y^*) - \frac{\mu}{8}\|x^* - y^*\|^2 = f^* - \frac{\mu}{8}\|x^* - y^*\|^2 < f^*$$
   - This contradicts the assumption that $f^*$ is the minimum value.
   - Therefore, the minimizer must be unique.
Geometric Meaning
Strongly convex functions have a "steep bowl-shaped" structure, ensuring the function has only one global minimum, without flat regions or multiple local minima. The larger the value of $\mu$, the steeper the "bowl", making it easier for the optimization process to find the unique solution.
Convergence of Projected Gradient Descent (Theorem 2)
Theorem Statement: Let $f: \mathbb{R}^d \to \mathbb{R}$ be a convex differentiable function and $X \subseteq \mathbb{R}^d$ a non-empty closed convex set. Assume $f$ is smooth on $X$ (parameter $L$) and strongly convex on $X$ (parameter $\mu > 0$). Choose step size $\gamma = \frac{1}{L}$. Then for any initial point $x_0 \in X$, projected gradient descent satisfies:

- (i) Geometric decrease in squared distance to $x^*$:
  $$\|x_{t+1} - x^*\|^2 \le \left(1 - \frac{\mu}{L}\right)\|x_t - x^*\|^2$$
  This indicates that after each iteration, the distance between the solution and the optimal solution contracts at a linear rate.
- (ii) Absolute error decreases exponentially with iterations: after $T$ iterations, $\|x_T - x^*\|^2 \le \left(1 - \frac{\mu}{L}\right)^T\|x_0 - x^*\|^2$, and by smoothness the function-value error satisfies
  $$f(x_T) - f(x^*) \le \|\nabla f(x^*)\|\,\|x_T - x^*\| + \frac{L}{2}\|x_T - x^*\|^2$$
  so the error decreases exponentially, with the convergence rate depending on the condition number $\kappa = L/\mu$.
Mathematical Notation and Meaning
- $L$: smoothness parameter (Lipschitz constant of the gradient)
- $\gamma = \frac{1}{L}$: optimal step size selection
- $x_t$: solution after the $t$-th iteration
- $y_{t+1} = x_t - \gamma\nabla f(x_t)$: unprojected gradient step
- $x_{t+1} = \Pi_X(y_{t+1})$: projected solution
Proof Outline
Key Steps:

1. Gradient Inequality (from strong convexity):
   $$f(x_t) - f(x^*) \le \nabla f(x_t)^\top (x_t - x^*) - \frac{\mu}{2}\|x_t - x^*\|^2$$

2. Projection Property (Attachment 2 P8, Fact (ii)):
   $$\|x^* - x_{t+1}\|^2 + \|y_{t+1} - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$$

3. Distance Recurrence Relation:
   - Combine with the non-expansiveness of the projection operator, $\|x_{t+1} - x^*\| \le \|y_{t+1} - x^*\|$, and expand the unprojected step:
     $$\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - \frac{2}{L}\nabla f(x_t)^\top (x_t - x^*) + \frac{1}{L^2}\|\nabla f(x_t)\|^2$$
   - Utilize smoothness (via the sufficient decrease lemma) to absorb the $\|\nabla f(x_t)\|^2$ and projection-error terms.

4. Final Derivation:
   $$\|x_{t+1} - x^*\|^2 \le \left(1 - \frac{\mu}{L}\right)\|x_t - x^*\|^2$$
Geometric Meaning and Interpretation
Condition number $\kappa = L/\mu \ge 1$: this theorem reveals that the convergence rate depends on the ratio of the function's smoothness ($L$) to its strong convexity ($\mu$). The smaller the condition number (i.e., the more strongly convex or the smoother the function), the faster the convergence.
Geometric Interpretation:
- Strong convexity ensures the function has a clear “descent direction”
- Smoothness ensures the gradient does not change too drastically, allowing for larger step sizes
- The projection operation ensures the solution always stays within the feasible region
- Each iteration pulls the current solution exponentially closer to the unique optimal solution
Practical Significance: This theorem guarantees that under reasonable conditions, projected gradient descent converges quickly to the optimal solution, with error decreasing exponentially in the number of iterations. The step size $\gamma = \frac{1}{L}$ is the optimal choice, balancing convergence speed and stability.
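As a quick numerical sanity check of the geometric decrease (a sketch using an assumed diagonal quadratic and box constraint, not an example from the lecture):

```python
import numpy as np

# f(x) = 0.5 * (x1^2 + 10 * x2^2): mu = 1, L = 10, so kappa = 10
d = np.array([1.0, 10.0])
x_star = np.array([0.5, 0.5])   # minimizer of f over the box X = [0.5, 2]^2

x = np.array([2.0, 2.0])
gamma = 1.0 / d.max()           # step size 1/L
for t in range(10):
    y = x - gamma * d * x       # gradient step (grad f(x) = d * x)
    x = np.clip(y, 0.5, 2.0)    # projection onto the box
    print(t, np.linalg.norm(x - x_star) ** 2)  # contracts at least by (1 - mu/L) per step
```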
Examples of Projection Step Computations
The projection operation is itself an optimization problem, but it can be computed efficiently in several cases:

- Projection onto affine subspaces: achieved by solving linear equations (similar to least squares). For example, if $X$ is a hyperplane $\{x : a^\top x = b\}$, the projection has the analytical solution $\Pi_X(y) = y - \frac{a^\top y - b}{\|a\|^2}\,a$.
- Projection onto Euclidean balls (centered at $c$ with radius $R$): achieved by scaling the vector: $\Pi_X(y) = c + R\,\frac{y - c}{\|y - c\|}$ whenever $\|y - c\| > R$; otherwise the point is already inside the ball.
- Projection onto $\ell_1$-balls (used in Lasso problems): can be reduced to projection onto the simplex. Let $X = \{x \in \mathbb{R}^d : \|x\|_1 \le 1\}$; by working with the absolute values of the coordinates (and restoring signs afterwards), the projection problem can be transformed into a projection onto the probability simplex $\Delta_d = \{x \ge 0 : \sum_i x_i = 1\}$. Projection onto the simplex can be computed in $O(d\log d)$ or even $O(d)$ time (DSSSC08 algorithm). Both projections are sketched in code after this list.
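The ball and simplex/$\ell_1$ projections referenced above can be sketched as follows (the simplex routine follows the sorting-based $O(d\log d)$ approach in the spirit of DSSSC08; function names are illustrative):

```python
import numpy as np

def project_ball(y, c, R):
    """Project y onto the Euclidean ball {x : ||x - c|| <= R}."""
    d = y - c
    norm = np.linalg.norm(d)
    return y if norm <= R else c + R * d / norm

def project_simplex(v):
    """Project v onto {x >= 0, sum(x) = 1} by sorting (O(d log d))."""
    u = np.sort(v)[::-1]                 # coordinates in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css)[0][-1]
    theta = css[rho] / (rho + 1.0)       # threshold enforcing sum(x) = 1
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, radius=1.0):
    """Project v onto {x : ||x||_1 <= radius} via the simplex projection."""
    if np.abs(v).sum() <= radius:
        return v
    w = project_simplex(np.abs(v) / radius) * radius
    return np.sign(v) * w
```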
Proximal Gradient Descent
Problem Setting
Consider composite optimization problems where the objective function consists of two parts:

$$\min_x F(x) := g(x) + h(x)$$

Mathematical Notation Meaning:

- $g$: "well-behaved" function (usually convex and $L$-smooth)
- $h$: "simple" but possibly non-smooth additional term (e.g., the $\ell_1$ norm, an indicator function, etc.)
- $F = g + h$: composite objective function to be optimized

Proximal gradient descent is suitable for solving non-smooth, constrained, or specially structured optimization problems.
Algorithm Idea and Update Formula
Core Idea: Extend the gradient descent idea to composite functions. For the smooth function $g$, the gradient descent step is equivalent to minimizing a local quadratic approximation:

$$x_{t+1} = \arg\min_z \left\{ g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\gamma}\|z - x_t\|^2 \right\} = x_t - \gamma\nabla g(x_t)$$

For $F = g + h$, keep the quadratic approximation for $g$ and directly add $h$:

$$x_{t+1} = \arg\min_z \left\{ g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{1}{2\gamma}\|z - x_t\|^2 + h(z) \right\}$$

Algorithm Iteration Formula:

$$x_{t+1} = \mathrm{prox}_{\gamma h}\!\left(x_t - \gamma\nabla g(x_t)\right)$$

Proximal Mapping Definition:

$$\mathrm{prox}_{\gamma h}(y) := \arg\min_z \left\{ \frac{1}{2\gamma}\|z - y\|^2 + h(z) \right\}$$

Equivalent Form:

$$x_{t+1} = x_t - \gamma\, G_\gamma(x_t)$$

where $G_\gamma(x) = \frac{1}{\gamma}\left(x - \mathrm{prox}_{\gamma h}(x - \gamma\nabla g(x))\right)$ is the generalized gradient.
Special Cases and Geometric Meaning
- $h \equiv 0$: reduces to standard gradient descent
- $h = \iota_X$ (indicator function of a convex set $X$): reduces to projected gradient descent
  - Indicator function: $\iota_X(x) = 0$ if $x \in X$, $+\infty$ otherwise
  - The proximal mapping becomes the projection: $\mathrm{prox}_{\gamma\iota_X}(y) = \Pi_X(y)$

Geometric Meaning: The proximal mapping finds points near $y$ that make $h$ small; the parameter $\gamma$ balances the goals of "staying close to $y$" and "making $h$ small".
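A minimal sketch of the proximal gradient iteration with the prox as a pluggable function (placeholder names; it assumes $\mathrm{prox}_{\gamma h}$ is available in closed form):

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, gamma, n_steps):
    """Proximal gradient sketch: x <- prox_{gamma h}(x - gamma * grad g(x))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = prox_h(x - gamma * grad_g(x), gamma)
    return x

# Special cases, mirroring the list above:
#   prox_h = lambda v, g: v          recovers plain gradient descent (h = 0)
#   prox_h = lambda v, g: project(v) recovers projected gradient descent (h = indicator of X)
```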
Convergence Analysis
Theorem 3 (Proximal Gradient Descent Convergence)
Let $g$ be convex and $L$-smooth, $h$ convex, and assume the proximal mapping $\mathrm{prox}_{\gamma h}$ is computable. Choose step size $\gamma = \frac{1}{L}$. Then:

$$F(x_T) - F(x^*) \le \frac{L\|x_0 - x^*\|^2}{2T}$$

The convergence rate is $O(1/T)$, the same as gradient descent on smooth convex functions.
Mathematical Notation Meaning
- $\tilde F_t(z) := g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{L}{2}\|z - x_t\|^2 + h(z)$: local approximation function minimized at each step
- $x_{t+1} = \arg\min_z \tilde F_t(z)$: proximal gradient step
- $\|x_t - x^*\|^2$: approximation error term, measuring the distance between the current point and the optimal point
Proof Steps
1. Define the local approximation function:
   $$\tilde F_t(z) := g(x_t) + \nabla g(x_t)^\top (z - x_t) + \frac{L}{2}\|z - x_t\|^2 + h(z)$$
   Due to the quadratic term, $\tilde F_t$ is $L$-strongly convex, and $x_{t+1} = \arg\min_z \tilde F_t(z)$ is its unique minimizer.

2. Utilize strong convexity:
   According to the definition of $x_{t+1}$ and the strong convexity of $\tilde F_t$, for every $z$:
   $$\tilde F_t(z) \ge \tilde F_t(x_{t+1}) + \frac{L}{2}\|z - x_{t+1}\|^2$$

3. Transform into a decreasing property of $F$:
   Using the $L$-smoothness and convexity of $g$, evaluating the inequality at $z = x^*$ transforms it into:
   $$F(x_{t+1}) \le F(x^*) + \frac{L}{2}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right)$$

4. Summation and monotonicity:
   - Sum from $t = 0$ to $T - 1$; the squared distances telescope, leaving $\frac{L}{2}\|x_0 - x^*\|^2$.
   - Utilize the monotonic decrease of the function values: $F(x_T) \le \frac{1}{T}\sum_{t=1}^{T} F(x_t)$.
   - Obtain the final convergence bound $F(x_T) - F(x^*) \le \frac{L\|x_0 - x^*\|^2}{2T}$.
Geometric Intuition: The proximal method only "sees" the smooth part $g$; the non-smooth part $h$ is handled separately by the proximal mapping and does not affect the convergence rate.
Example: Iterative Soft Thresholding Algorithm (ISTA)
Lasso Regression Problem:

$$\min_{\beta} \; \frac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1$$

Mathematical Notation Meaning:

- $X \in \mathbb{R}^{n \times d}$: feature matrix
- $y \in \mathbb{R}^n$: response variable
- $\lambda > 0$: sparsity regularization parameter
- $\beta \in \mathbb{R}^d$: coefficient vector

Decomposition:

- $g(\beta) = \frac{1}{2}\|y - X\beta\|^2$ (smooth), $h(\beta) = \lambda\|\beta\|_1$ (non-smooth)

Soft Thresholding Operator:

$$\left[S_{\tau}(v)\right]_i = \mathrm{sign}(v_i)\,\max\left(|v_i| - \tau,\, 0\right)$$

ISTA Iteration Formula:

$$\beta_{t+1} = S_{\gamma\lambda}\!\left(\beta_t + \gamma X^\top (y - X\beta_t)\right)$$

Derivation Explanation: Analyzing the first-order optimality conditions of the proximal mapping yields a piecewise analytical solution, which is exactly the soft thresholding operation: $\mathrm{prox}_{\gamma\lambda\|\cdot\|_1} = S_{\gamma\lambda}$.
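A compact ISTA sketch for the Lasso objective above (illustrative, with the step size taken as $1/L$ for $L = \lambda_{\max}(X^\top X)$; `lam` stands for the regularization parameter $\lambda$):

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft thresholding S_tau: the prox of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, n_steps=500):
    """ISTA for min_beta 0.5 * ||y - X beta||^2 + lam * ||beta||_1."""
    gamma = 1.0 / np.linalg.norm(X.T @ X, 2)   # step size 1/L, L = lambda_max(X^T X)
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ beta - y)            # gradient of the smooth part g
        beta = soft_threshold(beta - gamma * grad, gamma * lam)
    return beta
```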
Mirror Descent
Motivation
Consider the simplex-constrained problem:

$$\min_{x \in \Delta_n} f(x), \qquad \Delta_n = \left\{x \in \mathbb{R}^n : x_i \ge 0,\; \textstyle\sum_{i=1}^n x_i = 1\right\}$$

Assume the gradient satisfies $\|\nabla f(x)\|_\infty \le 1$, meaning each element of the gradient has bounded absolute value.

- Convergence Rate Comparison:
  - Gradient Descent (GD) on convex $L$-Lipschitz functions converges at rate $O(L/\sqrt{t})$; here $\|\nabla f(x)\|_2 \le \sqrt{n}\,\|\nabla f(x)\|_\infty = \sqrt{n}$, giving $O(\sqrt{n/t})$
  - Mirror Descent can achieve $O(\sqrt{\log n / t})$, superior in high-dimensional problems (large $n$)
- Why Superior: GD's convergence rate is affected by the dimension $n$, while Mirror Descent reduces the dependence to $\log n$ through the mirror map, better handling the geometry of the simplex

Notation:

- $\|\cdot\|_\infty$: infinity norm, the maximum absolute value of the elements of a vector, i.e., $\|v\|_\infty = \max_i |v_i|$
- $\|\cdot\|_2$: Euclidean norm, the standard length of a vector, $\|v\|_2 = \sqrt{\sum_i v_i^2}$
- $L$: Lipschitz constant, here derived from the gradient norm bound
- $n$: problem dimension
- $t$: number of iterations
- $f(x_t) - f(x^*)$: optimality gap, the difference in function value between the current best solution and the optimal solution
- $\rho$: strong convexity coefficient, describing the convexity strength of the mirror potential
- $\eta$: step size, set to $\eta = \frac{R}{L}\sqrt{\frac{2\rho}{t}}$ in Mirror Descent
- $R$: domain radius, $R^2 = \sup_{x \in X}\Phi(x) - \Phi(x_1)$, where $x_1$ is the initial point
Preliminaries
- Norms and Dual Norms: Fix a norm $\|\cdot\|$; the dual norm is defined as $\|g\|_* = \sup_{\|x\| \le 1} g^\top x$
- Function Properties:
  - $L$-Lipschitz (w.r.t. $\|\cdot\|$): $|f(x) - f(y)| \le L\|x - y\|$; equivalently, all subgradients satisfy $\|g\|_* \le L$
  - $L$-smooth: the gradient is Lipschitz w.r.t. the norm pair, $\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$
  - $\rho$-strongly convex (w.r.t. $\|\cdot\|$): $f(y) \ge f(x) + g^\top (y - x) + \frac{\rho}{2}\|y - x\|^2$ for every subgradient $g \in \partial f(x)$
Mirror Descent Iteration Process
The Mirror Descent iteration consists of two steps: a dual update and a projection. This ensures that at each iteration the point remains in the feasible region $X$.

1. Dual Update Step:
   $$\nabla\Phi(y_{t+1}) = \nabla\Phi(x_t) - \eta\, g_t$$
   where:
   - $x_t$ is the current iterate (primal space)
   - $\nabla\Phi$ is the gradient of the Mirror Potential, used to map $x_t$ to the dual space
   - $\eta$ is the step size (learning rate)
   - $g_t \in \partial f(x_t)$ is a subgradient of the objective function $f$ at $x_t$ (if $f$ is differentiable, it is the gradient)

   This step performs the gradient step in the dual space.

2. Projection Step:
   $$x_{t+1} = \Pi_X^{\Phi}(y_{t+1}) = \arg\min_{x \in X \cap \mathcal{D}} D_\Phi(x, y_{t+1})$$
   where $D_\Phi$ is the Bregman Divergence associated with $\Phi$. This step projects the intermediate point $y_{t+1}$ back onto the feasible region $X$, minimizing the Bregman Divergence.
Geometric Interpretation
The geometric meaning of Mirror Descent can be understood from the perspectives of the dual space and the primal space:

- Dual Space: the gradient step is performed in the dual space, where it is a linear update
- Primal Space: the dual point is mapped back to the primal space via the inverse mapping $(\nabla\Phi)^{-1}$, and feasibility is then ensured through the Bregman projection
- Schematic diagrams (e.g., from Bubeck 2015) illustrate this process: the "mirror" mapping between primal and dual spaces, including gradient steps and projection, makes the algorithm efficient under complex geometries
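A minimal sketch of one iteration, parameterized by the mirror map (the functions `grad_phi`, `grad_phi_inv`, and `bregman_project` are assumed problem-specific inputs; closed forms for the simplex setup appear later):

```python
def mirror_descent_step(x, g, eta, grad_phi, grad_phi_inv, bregman_project):
    """One Mirror Descent iteration: dual gradient step, then Bregman projection."""
    theta = grad_phi(x) - eta * g   # linear gradient step in the dual space
    y = grad_phi_inv(theta)         # map the dual point back to the primal space
    return bregman_project(y)       # project onto X w.r.t. the Bregman divergence
```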
Key Elements
Mirror Potential
The Mirror Potential is the core function of Mirror Descent, denoted $\Phi: \mathcal{D} \to \mathbb{R}$ (with $\mathcal{D} \subseteq \mathbb{R}^n$ open). It defines the mapping from the primal space to the dual space.

- Definition: $\Phi$ is a function satisfying the following properties:
  - Strict Convexity: ensures the projection problem has a unique solution
  - Continuous Differentiability: the gradient $\nabla\Phi$ exists and is continuous, facilitating computation
  - Limit Condition: $\|\nabla\Phi(x)\| \to \infty$ as $x$ approaches the boundary of $\mathcal{D}$. This keeps the iterates inside the domain and ensures stability
Bregman Divergence
The Bregman Divergence is an asymmetric distance measure used in place of the Euclidean distance, built from a convex function (usually the mirror potential $\Phi$). It measures the "difference" between points and is used in the projection step to ensure feasibility.

- Definition: For a function $\Phi$, the Bregman Divergence is defined as:
  $$D_\Phi(x, y) = \Phi(x) - \Phi(y) - \nabla\Phi(y)^\top (x - y)$$
  This measures the gap between $\Phi(x)$ and its first-order approximation around $y$.
- Properties: non-negativity ($D_\Phi(x, y) \ge 0$, with equality when $x = y$), and convexity in $x$
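A small sketch computing the Bregman divergence for the two standard potentials used later (squared Euclidean norm and negative entropy; names are illustrative):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Phi = 0.5 * ||x||^2 gives D(x, y) = 0.5 * ||x - y||^2 (ball setup):
phi_sq = lambda x: 0.5 * x @ x
grad_phi_sq = lambda x: x

# Phi = sum x_i log x_i gives the KL divergence on the simplex (simplex setup):
neg_entropy = lambda x: np.sum(x * np.log(x))
grad_neg_entropy = lambda x: 1.0 + np.log(x)
```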
Projection Operator
Projection is a key part of Mirror Descent, ensuring feasibility of the algorithm on the constraint set $X$.

- Projection Definition: the projection operation is denoted $\Pi_X^{\Phi}$, using the Bregman Divergence associated with $\Phi$:
  $$\Pi_X^{\Phi}(y) = \arg\min_{x \in X \cap \mathcal{D}} D_\Phi(x, y)$$
- Existence and Uniqueness: due to the strict convexity and continuous differentiability of $\Phi$, the projection exists and is unique. This means a unique projection point can be found for any point, avoiding algorithmic ambiguity
Convergence of Mirror Descent
Mirror Descent Update Rule
Mirror Descent is an optimization algorithm suitable for convex optimization problems in non-Euclidean spaces. Its update rule is as follows:
- Initialization: choose the initial point $x_1 = \arg\min_{x \in X \cap \mathcal{D}} \Phi(x)$, where $\Phi$ is a mirror map and $X$ is the feasible region
- For each time step $t \ge 1$:
  - Compute a subgradient $g_t \in \partial f(x_t)$, where $f$ is the objective function
  - Then define $y_{t+1}$ satisfying:
    $$\nabla\Phi(y_{t+1}) = \nabla\Phi(x_t) - \eta\, g_t$$
    and set $x_{t+1} = \Pi_X^{\Phi}(y_{t+1})$, where $\eta$ is the step size parameter

Core Idea: Perform gradient descent in the mirror (dual) space defined by $\nabla\Phi$, then map back to the primal space and project. $\nabla\Phi$ is the gradient of the mirror map, which maps points from the primal space to the dual space.
Theorem 4: Convergence Guarantee
Assumptions:
- $\Phi$ is a mirror map that is $\rho$-strongly convex on $X \cap \mathcal{D}$ with respect to a norm $\|\cdot\|$
- $f$ is a convex function and $L$-Lipschitz continuous with respect to $\|\cdot\|$

Then the Mirror Descent algorithm with step size $\eta = \frac{R}{L}\sqrt{\frac{2\rho}{t}}$ satisfies:

$$f\!\left(\frac{1}{t}\sum_{s=1}^{t} x_s\right) - f(x^*) \le R L \sqrt{\frac{2}{\rho t}}$$

where $R^2 = \sup_{x \in X \cap \mathcal{D}}\Phi(x) - \Phi(x_1)$ is an upper bound on the "distance" between the initial point $x_1$ and the optimal solution $x^*$, and $t$ is the number of iterations.

Notation Meaning:

- $\rho$: strong convexity parameter, measuring the convexity strength of $\Phi$
- $L$: Lipschitz constant, the upper bound on the rate of change of the function $f$
- $R^2$: usually defined as $\sup_{x \in X \cap \mathcal{D}}\Phi(x) - \Phi(x_1)$, reflecting the problem scale
- $\eta$: step size, controlling the update magnitude and chosen from the problem parameters to ensure convergence
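As a worked instance connecting Theorem 4 back to the Motivation section (using the standard facts that negative entropy is $1$-strongly convex on the simplex w.r.t. $\|\cdot\|_1$, and that $\|g\|_\infty \le 1$ gives $L = 1$ w.r.t. $\|\cdot\|_1$): for the simplex setup below, $\rho = 1$, $L = 1$, and $R^2 = \log n$ for $x_1 = (1/n, \dots, 1/n)$, so

$$f\!\left(\frac{1}{t}\sum_{s=1}^{t}x_s\right) - f(x^*) \le RL\sqrt{\frac{2}{\rho t}} = \sqrt{\frac{2\log n}{t}}$$

which is the $O(\sqrt{\log n / t})$ rate claimed in the Motivation.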
Standard Setup Examples
Ball Setup
- Mirror potential: $\Phi(x) = \frac{1}{2}\|x\|_2^2$ on $\mathcal{D} = \mathbb{R}^n$
- Bregman divergence: $D_\Phi(x, y) = \frac{1}{2}\|x - y\|_2^2$
- In this case, Mirror Descent is equivalent to projected subgradient descent, as the projection is performed in Euclidean space
Simplex Setup
- Mirror potential: $\Phi(x) = \sum_{i=1}^n x_i\log x_i$ (negative entropy), defined on the simplex $\Delta_n$
- Gradient update: $\nabla\Phi(y_{t+1}) = \nabla\Phi(x_t) - \eta g_t$ can be written in component form (using $(\nabla\Phi(x))_i = 1 + \log x_i$):
  $$y_{t+1,i} = x_{t,i}\, e^{-\eta g_{t,i}}$$
- Bregman divergence: $D_\Phi(x, y) = \sum_i x_i\log\frac{x_i}{y_i}$ (the Kullback-Leibler divergence, for $x, y \in \Delta_n$)
- Projection: projection onto the simplex is equivalent to renormalization, $x_{t+1} = y_{t+1}/\|y_{t+1}\|_1$
- Example: with initial point $x_1 = (1/n, \dots, 1/n)$ (the minimizer of $\Phi$ over the simplex), $R^2 = \log n$, suitable for high-dimensional probability optimization

Example Explanation: The simplex setup is commonly used for probability-model optimization in machine learning, such as online learning or problems constrained to the probability simplex. The use of the KL divergence makes the algorithm better adapted to the geometry of probability distributions. A code sketch follows.
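A sketch of the simplex setup as code (an exponentiated-gradient style update; `grad_f` and the fixed step size `eta` are illustrative assumptions):

```python
import numpy as np

def exponentiated_gradient(grad_f, n, eta, n_steps):
    """Mirror Descent on the simplex with the negative-entropy potential."""
    x = np.full(n, 1.0 / n)          # x_1 = uniform distribution, the minimizer of Phi
    for _ in range(n_steps):
        g = grad_f(x)
        y = x * np.exp(-eta * g)     # dual step in components: y_i = x_i * exp(-eta * g_i)
        x = y / y.sum()              # Bregman (KL) projection = renormalization
    return x
```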
Proof Sketch (Theorem 4)

1. Basic Inequality: using the update rule and properties of the Bregman divergence (Attachment 2 P47; derived through the first-order optimality condition of the projection and strong convexity):
   $$f(x_s) - f(x^*) \le g_s^\top (x_s - x^*) \le \frac{1}{\eta}\left(D_\Phi(x^*, x_s) - D_\Phi(x^*, x_{s+1})\right) + \frac{\eta\|g_s\|_*^2}{2\rho}$$

2. Summation and Telescoping: summing from $s = 1$ to $t$, the divergence terms form a telescoping sum:
   $$\sum_{s=1}^{t}\left(D_\Phi(x^*, x_s) - D_\Phi(x^*, x_{s+1})\right) = D_\Phi(x^*, x_1) - D_\Phi(x^*, x_{t+1}) \le R^2$$
   - Because $x_1 = \arg\min_{x \in X \cap \mathcal{D}}\Phi(x)$, we have $D_\Phi(x^*, x_1) \le \Phi(x^*) - \Phi(x_1) \le R^2$

3. Bounding the Gradient Term: since $f$ is $L$-Lipschitz, $\|g_s\|_* \le L$, therefore:
   $$\sum_{s=1}^{t}\frac{\eta\|g_s\|_*^2}{2\rho} \le \frac{\eta t L^2}{2\rho}$$

4. Combine Results:
   $$\frac{1}{t}\sum_{s=1}^{t}\left(f(x_s) - f(x^*)\right) \le \frac{R^2}{\eta t} + \frac{\eta L^2}{2\rho}$$
   - Choose $\eta = \frac{R}{L}\sqrt{\frac{2\rho}{t}}$ to minimize the right-hand side, obtaining the convergence rate $RL\sqrt{\frac{2}{\rho t}}$

5. Jensen's Inequality: finally, apply Jensen's inequality to the average point $\bar{x}_t = \frac{1}{t}\sum_{s=1}^{t}x_s$, which gives $f(\bar{x}_t) \le \frac{1}{t}\sum_{s=1}^{t}f(x_s)$ and completes the proof.
