Proximal Algorithms
Proximal operator
- proximal operator of f (closed proper convex) is
  prox_{λf}(v) = argmin_x ( f(x) + (1/(2λ)) ||x - v||_2^2 )
  with parameter λ > 0
- f may be nonsmooth, have embedded constraints, ...
- evaluating prox_{λf}(v) involves solving a convex optimization problem
- can evaluate via standard methods like BFGS, but very often prox_{λf} has an analytical solution or a simple specialized linear-time algorithm
Generalized projection
- proximal operator of the indicator function of a convex set C is projection:
  prox_{I_C}(v) = argmin_{x ∈ C} ||x - v||_2^2 = Π_C(v)
  many properties of projections carry over
- if f is separable across two variables, f(x, y) = φ(x) + ψ(y), then prox_f(v, w) = ( prox_φ(v), prox_ψ(w) )
- if f is block separable, so f(x) = Σ_i f_i(x_i), then ( prox_{λf}(v) )_i = prox_{λf_i}(v_i)
- example: if f(x) = ||x||_1, then ( prox_{λf}(v) )_i = sign(v_i) ( |v_i| - λ )_+,
  a simple elementwise operation called soft thresholding (see the sketch below)
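A minimal NumPy sketch of the soft thresholding operator (the function name is mine):

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft thresholding: the proximal operator of kappa * ||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

print(soft_threshold(np.array([1.5, -0.2, 0.7, -3.0]), 1.0))   # 0.5, 0, 0, -2
```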
Fixed points
- the point x* minimizes f if and only if x* is a fixed point: x* = prox_f(x*)
- provides a link between proximal operators and fixed point theory
- many proximal algorithms can be viewed as methods for finding fixed points of appropriate operators
Moreau-Yosida regularization
- the infimal convolution of f and g, denoted f □ g, is defined as
  (f □ g)(v) = inf_x ( f(x) + g(v - x) )
- the Moreau envelope or Moreau-Yosida regularization of f is M_{λf} = f □ (1/(2λ))||·||_2^2, i.e.,
  M_{λf}(v) = inf_x ( f(x) + (1/(2λ)) ||x - v||_2^2 )
- a smoothed or regularized form of f:
- always has full domain
- always continuously differentiable
- has the same minimizers as f
- can minimize M_{λf} instead of f, though M_{λf} itself could be hard to evaluate
- motivation: can show that M_{λf} = ( f* + (λ/2)||·||_2^2 )^*
- the Moreau envelope thus obtains a smooth approximation of f via:
- taking the conjugate
- regularizing to get a strongly convex function
- taking the conjugate again
- example: the Moreau envelope of f(x) = |x| is the Huber function:
  M_{λf}(v) = v^2/(2λ) for |v| ≤ λ, and |v| - λ/2 for |v| > λ
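A quick numerical check of this example (a sketch; variable names are mine):

```python
import numpy as np

lam = 1.0
v = np.linspace(-3, 3, 7)

# Moreau envelope of f = |.|: evaluate the objective at the prox point
x_star = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)          # prox_{lam*f}(v)
envelope = np.abs(x_star) + (1.0 / (2 * lam)) * (x_star - v) ** 2

# Huber function with parameter lam
huber = np.where(np.abs(v) <= lam, v**2 / (2 * lam), np.abs(v) - lam / 2)

print(np.allclose(envelope, huber))   # True
```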
Modified gradient step
- many relationships between proximal operators and gradient steps
- the proximal operator is a gradient step for the Moreau envelope: prox_{λf}(x) = x - λ∇M_{λf}(x)
- for small λ, prox_{λf}(x) converges to a gradient step in f: prox_{λf}(x) ≈ x - λ∇f(x) (for differentiable f)
- the parameter λ can be interpreted as a step size, though proximal methods will generally work even for large step sizes, unlike the gradient method
Resolvent of subdifferential operator
- if z = prox_{λf}(x), then x - z ∈ λ∂f(z), i.e., x ∈ (I + λ∂f)(z)
- can rewrite as prox_{λf}(x) = (I + λ∂f)^{-1}(x)
- must be careful to interpret ∂f and expressions using it as relations
- the mapping (I + λ∂f)^{-1} is known as the resolvent of the operator ∂f
Moreau decomposition
- the following relation always holds: v = prox_f(v) + prox_{f*}(v)
- main link between proximal operators and duality
- a generalization of the orthogonal decomposition induced by a subspace L:
  v = Π_L(v) + Π_{L^⊥}(v),
  which follows from Moreau decomposition and (I_L)* = I_{L^⊥}
Norms and norm balls
- in general: if f = ||·|| is a norm and B is the unit ball of the dual norm, then
  prox_{λf}(v) = v - λ Π_B(v/λ)
  by Moreau decomposition
- if f = ||·||_2 and B is the Euclidean unit ball, then prox_{λf}(v) = (1 - λ/||v||_2)_+ v,
  sometimes called the 'block soft thresholding' operator
- if f = ||·||_1 and B is the l_∞ unit ball, then ( prox_{λf}(v) )_i = sign(v_i)( |v_i| - λ )_+,
  which recovers (elementwise) soft thresholding
- if f = ||·||_∞ and B is the l_1 unit ball, simple algorithms are available (e.g., linear-time projection onto the l_1 ball)
Proximal minimization
- the proximal minimization or proximal point algorithm is x^{k+1} := prox_{λf}(x^k)
- the simplest proximal method, can be interpreted as
- gradient method applied to M_{λf} rather than f
- simple iteration for finding a fixed point of prox_{λf}
- if f(x) = (1/2) x^T P x - q^T x with P ⪰ 0, reduces to iterative refinement for solving Px = q (see the sketch below)
- works for any fixed λ > 0, or for varying λ^k > 0 as long as Σ_k λ^k = ∞ (non-summable)
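A minimal sketch of proximal minimization for the quadratic case mentioned above, assuming f(x) = (1/2) x^T P x - q^T x with P positive definite (problem data are made up):

```python
import numpy as np

np.random.seed(0)
n = 5
P = np.random.randn(n, n); P = P.T @ P + np.eye(n)   # P positive definite
q = np.random.randn(n)
lam = 0.1

x = np.zeros(n)
for _ in range(500):
    # prox_{lam f}(x) for f(x) = (1/2) x^T P x - q^T x: solve (I + lam P) z = x + lam q
    x = np.linalg.solve(np.eye(n) + lam * P, x + lam * q)

print(np.allclose(x, np.linalg.solve(P, q)))   # True: iterates solve P x = q
```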
Operator splitting
- the most useful proximal methods use the idea of operator splitting
- these algorithms minimize f + g using only prox_f and prox_g (or, e.g., ∇f)
- useful when f and g each have useful structure separately
- very simple historical example: alternating projections Π_C Π_D to find a point in the intersection C ∩ D
Proximal gradient method
minimize f(x) + g(x), where f is smooth and g is closed proper convex (possibly nonsmooth)
- method: x^{k+1} := prox_{λ^k g}( x^k - λ^k ∇f(x^k) )
- converges with rate O(1/k) when ∇f is Lipschitz continuous with constant L and the step sizes are λ^k = λ ∈ (0, 1/L]
- special case: projected gradient method (take g = I_C)
- if L is not known (usually the case), can use the following line search
given x^k, λ^{k-1}, and parameter β ∈ (0, 1). Let λ := λ^{k-1}. repeat:
  1. Let z := prox_{λg}( x^k - λ∇f(x^k) )
  2. break if f(z) ≤ f(x^k) + ∇f(x^k)^T (z - x^k) + (1/(2λ)) ||z - x^k||_2^2
  3. Update λ := βλ
return λ^k := λ, x^{k+1} := z
typical value of β is 1/2, and the accepted λ is reused as the starting step size in the next iteration
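A Python sketch of the proximal gradient method with this line search (function and argument names are mine); the example takes g as the indicator of the nonnegative orthant, so its prox is projection:

```python
import numpy as np

def prox_gradient(f, grad_f, prox_g, x0, lam=1.0, beta=0.5, iters=100):
    """Proximal gradient method with the backtracking line search above."""
    x = x0
    for _ in range(iters):
        g = grad_f(x)
        while True:
            z = prox_g(x - lam * g, lam)
            # sufficient-decrease condition from the line search
            if f(z) <= f(x) + g @ (z - x) + np.sum((z - x) ** 2) / (2 * lam):
                break
            lam *= beta
        x = z
    return x

# example: projected gradient (g = indicator of the nonnegative orthant)
c = np.array([1.0, -2.0, 3.0])
x = prox_gradient(f=lambda x: 0.5 * np.sum((x - c) ** 2),
                  grad_f=lambda x: x - c,
                  prox_g=lambda v, lam: np.maximum(v, 0.0),
                  x0=np.zeros(3))
print(x)   # approximately [1, 0, 3]
```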
Interpretations
- x^{k+1} is the solution to
  minimize_x ( λ g(x) + (1/2) ||x - (x^k - λ∇f(x^k))||_2^2 ),
  a trade-off between minimizing g and being close to the gradient step for f
- majorization-minimization method for f + g:
- keep minimizing a convex upper bound to the objective that is tight at the previous iterate
- here, use the upper bound g(x) + f(x^k) + ∇f(x^k)^T (x - x^k) + (1/(2λ)) ||x - x^k||_2^2 (valid when λ ∈ (0, 1/L])
- x* minimizes f + g if and only if x* = prox_{λg}( x* - λ∇f(x*) ),
  i.e., x* is a fixed point of the forward-backward operator prox_{λg} ∘ (I - λ∇f)
Accelerated proximal gradient method
minimize f(x) + g(x), where f is smooth and g is closed proper convex
- method:
  y^{k+1} := x^k + ω^k (x^k - x^{k-1})
  x^{k+1} := prox_{λ^k g}( y^{k+1} - λ^k ∇f(y^{k+1}) )
  with extrapolation parameter ω^k ∈ [0, 1), e.g., ω^k = k/(k+3)
- works for λ^k = λ ∈ (0, 1/L] and with a similar line search as before
- faster O(1/k^2) convergence rate; originated with Nesterov (1983)
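A sketch of the accelerated iteration with a fixed step λ and the common extrapolation weight (k-1)/(k+2) (names are mine); the tiny check applies it to a problem whose prox is soft thresholding:

```python
import numpy as np

def accel_prox_gradient(grad_f, prox_g, x0, lam, iters=200):
    """Accelerated proximal gradient sketch with extrapolation weight (k-1)/(k+2)."""
    x_prev = x = x0
    for k in range(1, iters + 1):
        y = x + (k - 1) / (k + 2) * (x - x_prev)          # extrapolation (momentum) step
        x_prev, x = x, prox_g(y - lam * grad_f(y), lam)   # prox-gradient step at y
    return x

# tiny check: minimize (1/2)||x - c||_2^2 + ||x||_1 (prox of g is soft thresholding)
c = np.array([3.0, 0.5, -2.0])
soft = lambda v, lam: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
print(accel_prox_gradient(lambda x: x - c, soft, np.zeros(3), lam=1.0))  # ~[2, 0, -1]
```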
Applications
Lasso
minimize (1/2)||Ax - b||_2^2 + γ||x||_1, with x ∈ R^n, A ∈ R^{m×n}, and γ > 0
- proximal gradient method (similar for accelerated):
  x^{k+1} := S_{λ^k γ}( x^k - λ^k A^T (Ax^k - b) ),
  where S_κ is soft thresholding with threshold κ
- faster implementations:
- parallel matrix-vector multiplication
- if m ≥ n, precompute A^T A and A^T b, then solve an equivalent smaller lasso problem with n × n data; effort is then mostly in computing A^T A
- compute Ax^k and A^T (Ax^k - b) by blocks, in parallel, with one block of A in memory at a time
- compute the entire regularization path with warm starting
- can easily generalize to group lasso, elastic net, GLMs, ...
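A minimal sketch of the proximal gradient (ISTA-style) iteration for the lasso problem above, with made-up problem data and a heuristic choice of γ:

```python
import numpy as np

np.random.seed(0)
m, n = 200, 500
A = np.random.randn(m, n)
b = np.random.randn(m)
gamma = 0.1 * np.linalg.norm(A.T @ b, np.inf)   # heuristic scale for gamma

L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of grad f
lam = 1.0 / L
x = np.zeros(n)
for _ in range(500):
    v = x - lam * (A.T @ (A @ x - b))                            # gradient step on f
    x = np.sign(v) * np.maximum(np.abs(v) - lam * gamma, 0.0)    # soft thresholding

print("nonzeros:", np.count_nonzero(x))
```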
Monotone Operators
Relations
- a relation F on R^n is a subset of R^n × R^n
- dom F = { x | ∃y, (x, y) ∈ F }
- overload F(x) to mean the set { y | (x, y) ∈ F }
- can think of F as a 'set-valued mapping', i.e., from dom F into subsets of R^n
- when F(x) is always empty or a singleton, we say F is a function
- any function (or operator) F : R^n → R^n is a relation; F(x) is then ambiguous: it can mean the point F(x) or the set { F(x) }
- Examples
- empty relation: F = ∅
- full relation: F = R^n × R^n
- identity: I = { (x, x) | x ∈ R^n }
- zero: 0 = { (x, 0) | x ∈ R^n }
- subdifferential relation: ∂f = { (x, g) | f(y) ≥ f(x) + g^T (y - x) for all y }
Operations on relations
- inverse (relation): F^{-1} = { (y, x) | (x, y) ∈ F }
- inverse exists for any relation
- coincides with inverse function, when inverse function exists
- composition: FG = { (x, z) | ∃y, (x, y) ∈ G, (y, z) ∈ F }
- scalar multiplication: λF = { (x, λy) | (x, y) ∈ F }
- addition: F + G = { (x, y + z) | (x, y) ∈ F, (x, z) ∈ G }
- Example: resolvent of an operator
- for a relation F and λ ∈ R, the resolvent (much more on this later) is the relation R = (I + λF)^{-1}
- for (x, y): y ∈ R(x) if and only if x ∈ y + λF(y)
Generalized equations
- goal: solve the generalized equation 0 ∈ F(x)
- i.e., find x with (x, 0) ∈ F
- the solution set or zero set is F^{-1}(0) = { x | (x, 0) ∈ F }
- if f is CCP and F = ∂f, then 0 ∈ F(x) means x minimizes f
Monotone operators
- a relation F on R^n is monotone if (u - v)^T (x - y) ≥ 0 for all (x, u), (y, v) ∈ F
- is maximal monotone if there is no monotone operator that properly contains it
- on R (n = 1), F is maximal monotone iff its graph is a connected curve with no endpoints, with nonnegative (or infinite) slope
- suppose F and G are monotone; then:
- sum: F + G is monotone
- nonnegative scaling: if λ ≥ 0, then λF is monotone
- inverse: F^{-1} is monotone
- congruence: A^T F A = { (x, A^T y) | (Ax, y) ∈ F } for A ∈ R^{n×m} is monotone (on R^m)
- zero set: F^{-1}(0) = { x | (x, 0) ∈ F } is convex if F is maximal monotone
- an affine function F(x) = Ax + b is monotone iff A + A^T ⪰ 0
Subdifferential
∂f is monotone
- suppose u ∈ ∂f(x) and v ∈ ∂f(y)
- then f(y) ≥ f(x) + u^T (y - x) and f(x) ≥ f(y) + v^T (x - y)
- add these and cancel f(x) + f(y) to get (u - v)^T (x - y) ≥ 0
if f is convex, closed, and proper (CCP), then ∂f is maximal monotone
Nonexpansive and contractive operators
- F has Lipschitz constant L if ||F(x) - F(y)||_2 ≤ L ||x - y||_2 for all x, y
- for L = 1, we say F is nonexpansive
- for L < 1, we say F is a contraction (with contraction factor L)
Properties
- if F and G have Lipschitz constant L, so does the weighted average θF + (1 - θ)G for θ ∈ [0, 1]
- composition of nonexpansive operators is nonexpansive
- composition of nonexpansive operator and contraction is a contraction
- the fixed point set of a nonexpansive operator F (with dom F = R^n),
  X = { x | F(x) = x }, is convex (but can be empty)
- a contraction has a single fixed point
Resolvent and Cayley operator
- for λ > 0, the resolvent of a relation F is R = (I + λF)^{-1}
- when λ > 0 and F is monotone, R is nonexpansive (and thus a function)
- when λ > 0 and F is maximal monotone, dom R = R^n
- the Cayley operator of F is C = 2R - I
- when λ > 0 and F is monotone, C is nonexpansive
- we write R_λ and C_λ to explicitly show the dependence on λ
Proof that C is nonexpansive
assume λ > 0 and F monotone
- suppose (x, u) ∈ R and (y, v) ∈ R, i.e., x ∈ u + λF(u), y ∈ v + λF(v)
- subtract to get (x - u) - (y - v) ∈ λ( F(u) - F(v) ) (interpreted as relations)
- multiply by (u - v)^T and use monotonicity of F to get (u - v)^T (x - y) ≥ ||u - v||_2^2
- so when x = y, we must have u = v (i.e., R is a function)
- now let's show C is nonexpansive:
  ||C(x) - C(y)||_2^2 = ||2(u - v) - (x - y)||_2^2
  = 4||u - v||_2^2 - 4(u - v)^T (x - y) + ||x - y||_2^2 ≤ ||x - y||_2^2,
  using the inequality above
- R is nonexpansive since it is the average of I and C: R = (1/2)(I + C)
Example: Linear operators
- for the linear mapping F(x) = Ax:
- monotone iff A + A^T ⪰ 0
- nonexpansive iff ||A||_2 ≤ 1
- R = (I + λA)^{-1} and C = 2(I + λA)^{-1} - I
- I + λA is nonsingular when A is monotone and λ > 0
- for the matrix case, we have an alternative formula for the Cayley operator:
  C = (I + λA)^{-1}(I - λA)
- cf. the bilinear function (1 - λa)/(1 + λa), which maps { a | Re a ≥ 0 } into the unit disk
Resolvent of subdifferential: Proximal mapping
- suppose z = (I + λ∂f)^{-1}(x), with f convex
- this means x ∈ z + λ∂f(z)
- rewrite as 0 ∈ ∂_z ( f(z) + (1/(2λ)) ||z - x||_2^2 )
  which is the same as z = argmin_u ( f(u) + (1/(2λ)) ||u - x||_2^2 )
- the RHS is called the proximal mapping of f, denoted prox_{λf}(x)
Example: Indicator function
- take f = I_C, the indicator function of a convex set C
- ∂f = N_C is the normal cone operator, N_C(x) = { g | g^T (y - x) ≤ 0 for all y ∈ C } for x ∈ C (and ∅ for x ∉ C)
- the proximal operator of f (i.e., the resolvent of N_C) is
  (I + λN_C)^{-1} = Π_C,
  where Π_C is (Euclidean) projection onto C
Resolvent of multiplier to residual map
- take F to be the multiplier-to-residual mapping for the convex problem
  minimize f(x) subject to Ax = b
- where F(y) = { b - Ax | x minimizes L(x, y) = f(x) + y^T (Ax - b) }
- z = (I + λF)^{-1}(y) means y ∈ z + λF(z), i.e., y = z + λ(b - Ax)
- for some x minimizing L(x, z)
- write as 0 ∈ ∂f(x) + A^T z, with z = y + λ(Ax - b)
- write the second term as ∇_x ( y^T (Ax - b) + (λ/2) ||Ax - b||_2^2 ), so
  0 ∈ ∂_x ( f(x) + y^T (Ax - b) + (λ/2) ||Ax - b||_2^2 )
- the function on the right side is the augmented Lagrangian L_λ(x, y) for the problem
- so R(y) = (I + λF)^{-1}(y) can be found as y + λ(Ax - b), where x minimizes the augmented Lagrangian L_λ(x, y)
Fixed points of Cayley and resolvent operators
- assume F is maximal monotone, λ > 0
- solutions of 0 ∈ F(x) are fixed points of the resolvent R = (I + λF)^{-1}: 0 ∈ F(x) iff x = R(x)
- solutions of 0 ∈ F(x) are fixed points of the Cayley operator C = 2R - I: 0 ∈ F(x) iff x = C(x)
- key result: we can solve 0 ∈ F(x) by finding fixed points of R or C
- next: how to actually find these fixed points
Contraction mapping theorem
- also known as Banach fixed point theorem
- assume F is a contraction, with Lipschitz constant L < 1
- the iteration x^{k+1} := F(x^k)
  converges to the unique fixed point of F
- proof (sketch):
- the sequence is Cauchy: ||x^{k+l} - x^k||_2 ≤ ( L^k / (1 - L) ) ||x^1 - x^0||_2
- hence x^k converges to a point x*
- x* must be the (unique) fixed point of F
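A small numerical illustration of the theorem, using the affine contraction F(x) = Ax + b with ||A||_2 < 1 (data are made up):

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(3, 3)
A *= 0.5 / np.linalg.norm(A, 2)       # scale so F(x) = A x + b has Lipschitz constant 0.5
b = np.random.randn(3)

x = np.zeros(3)
for _ in range(60):
    x = A @ x + b                      # simple iteration x^{k+1} := F(x^k)

print(np.allclose(x, np.linalg.solve(np.eye(3) - A, b)))   # True: the unique fixed point
```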
Example: Gradient method
- assume f is convex, with mI ⪯ ∇^2 f(x) ⪯ LI (i.e., f strongly convex, ∇f Lipschitz)
- the gradient method is x^{k+1} := x^k - λ∇f(x^k) = (I - λ∇f)(x^k)
  (fixed points are exactly the solutions of ∇f(x) = 0)
- I - λ∇f is Lipschitz with parameter max{ |1 - λm|, |1 - λL| }
- it is a contraction when 0 < λ < 2/L
- so the gradient method converges (geometrically) when 0 < λ < 2/L
Damped iteration of a nonexpansive operator
- suppose N is nonexpansive, dom N = R^n, with fixed point set X = { x | N(x) = x }
- can have X = ∅ (e.g., translation)
- simple iteration x^{k+1} := N(x^k) need not converge, even when X ≠ ∅ (e.g., rotation)
- damped iteration (see the sketch below): x^{k+1} := (1 - θ^k) x^k + θ^k N(x^k), with θ^k ∈ (0, 1)
- important special case: θ^k = θ, constant
- another special case: θ^k = 1/(k+1), which gives simple averaging: x^{k+1} = (1/(k+1)) Σ_{i=0}^{k} N(x^i)
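A small illustration of the damped iteration: a rotation is nonexpansive and its plain iteration just circles around the fixed point, while the damped iteration with θ^k = 1/2 converges (example data are mine):

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x_star = np.array([1.0, 2.0])
F = lambda x: R @ (x - x_star) + x_star       # nonexpansive: a rotation about x_star

x = np.array([5.0, -3.0])
for _ in range(200):
    x = 0.5 * (x + F(x))                      # damped iteration, theta^k = 1/2

print(np.round(x, 6))                          # converges to the fixed point [1, 2]
```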
Convergence results
- assume N is nonexpansive, X ≠ ∅, and Σ_k θ^k (1 - θ^k) = ∞
  (which holds for the special cases above)
- then we have dist(x^k, X) → 0 along a subsequence,
  i.e., (some) iterates get arbitrarily close to the fixed point set, and ||x^k - N(x^k)||_2 → 0 along a subsequence,
  i.e., (some) iterates yield arbitrarily good 'almost fixed points'
- N(x^k) is no farther from X than x^k is (by nonexpansivity)
- so x^{k+1}, a convex combination of x^k and N(x^k), is no farther from X than x^k is
Proof
- start with the identity
  ||(1 - θ)a + θb - y||_2^2 = (1 - θ)||a - y||_2^2 + θ||b - y||_2^2 - θ(1 - θ)||a - b||_2^2
- apply it to x^{k+1} = (1 - θ^k) x^k + θ^k N(x^k) with y = x* ∈ X:
  ||x^{k+1} - x*||_2^2 ≤ ||x^k - x*||_2^2 - θ^k (1 - θ^k) ||x^k - N(x^k)||_2^2,
  using ||N(x^k) - x*||_2 ≤ ||x^k - x*||_2
- iterate the inequality to get Σ_{i=0}^{k} θ^i (1 - θ^i) ||x^i - N(x^i)||_2^2 ≤ ||x^0 - x*||_2^2
- if ||x^i - N(x^i)||_2 ≥ ε for i = 0, ..., k, then ε^2 ≤ ||x^0 - x*||_2^2 / Σ_{i=0}^{k} θ^i (1 - θ^i)
- the RHS goes to zero as k → ∞, so ||x^i - N(x^i)||_2 eventually drops below any ε > 0
Damped Cayley iteration
- want to solve 0 ∈ F(x) with F maximal monotone
- damped Cayley iteration
  x^{k+1} := (1 - θ^k) x^k + θ^k C(x^k), with C = 2(I + λF)^{-1} - I
  with λ > 0 and θ^k ∈ (0, 1), Σ_k θ^k (1 - θ^k) = ∞
- converges (assuming 0 ∈ F(x) has a solution) in the sense given above
- important: requires the ability to evaluate the resolvent map (I + λF)^{-1} of F
Proximal point algorithm
- take θ^k = 1/2 in the damped Cayley iteration
- gives the resolvent iteration or proximal point algorithm: x^{k+1} := (I + λF)^{-1}(x^k)
- if F = ∂f with f convex, yields the proximal minimization algorithm x^{k+1} := prox_{λf}(x^k)
can interpret as quadratic regularization that goes away in limit
- many classical algorithms are just proximal point methods applied to the appropriate maximal monotone operator
Method of multipliers
- like the dual method, but with the augmented Lagrangian and a specific step size λ
- converges under far more general conditions than dual subgradient
- f need not be strictly convex, or differentiable
- f can take on the value +∞
- but not amenable to decomposition
Monotone Operator Splitting Methods
Operator splitting
- want to solve 0 ∈ F(x) with F maximal monotone
- main idea: write F as F = A + B, with A and B maximal monotone
- called operator splitting
- solve using methods that require evaluation of the resolvents R_A = (I + λA)^{-1} and R_B = (I + λB)^{-1}
  (or the Cayley operators C_A = 2R_A - I and C_B = 2R_B - I)
- useful when R_A and R_B can be evaluated more easily than (I + λF)^{-1}
Main result
- A and B maximal monotone, so the Cayley operators C_A and C_B are nonexpansive
- hence the composition C_A C_B is nonexpansive
- key result: 0 ∈ A(x) + B(x) iff C_A C_B(z) = z with x = R_B(z)
- so solutions of 0 ∈ A(x) + B(x) can be found from fixed points of the nonexpansive operator C_A C_B
Proof of main result
- write x = R_B(z) and C_A C_B(z) = z as the four equations
  x = R_B(z),   w = 2x - z,   x' = R_A(w),   z = 2x' - w
- subtract the 2nd & 4th equations to conclude x' = x
- the 4th equation is then z = 2x - w; the 1st and 3rd mean z - x ∈ λB(x) and w - x ∈ λA(x)
- now add z - x ∈ λB(x) and w - x ∈ λA(x) to get z + w - 2x ∈ λ( A(x) + B(x) )
- hence 0 ∈ A(x) + B(x)
- the argument goes the other way too (but we don't need it)
Peaceman-Rachford and Douglas-Rachford splitting
- Peaceman-Rachford splitting is the undamped iteration z^{k+1} := C_A C_B(z^k);
  it doesn't converge in the general case; need C_A or C_B to be a contraction
- Douglas-Rachford splitting is the damped iteration z^{k+1} := (1/2)( z^k + C_A C_B(z^k) );
  it always converges when 0 ∈ A(x) + B(x) has a solution
Douglas-Rachford splitting
write the D-R iteration as
  x^{k+1/2} := R_B(z^k)
  z^{k+1/2} := 2 x^{k+1/2} - z^k
  x^{k+1} := R_A(z^{k+1/2})
  z^{k+1} := z^k + x^{k+1} - x^{k+1/2}
  the last update follows from z^{k+1} = (1/2)( z^k + C_A C_B(z^k) ) = (1/2)( z^k + 2x^{k+1} - z^{k+1/2} )
- can consider x^{k+1} - x^{k+1/2} as a residual
- z^k is then (up to z^0) the running sum of the residuals (see the sketch below)
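A sketch of this D-R iteration applied to a lasso-type problem f + g, taking g as the quadratic (its resolvent is a linear solve) and f as the l1 term (its resolvent is soft thresholding); problem data and the choice λ = 1 are arbitrary:

```python
import numpy as np

np.random.seed(0)
m, n = 100, 30
A = np.random.randn(m, n)
b = np.random.randn(m)
gamma, lam = 1.0, 1.0

# resolvents: prox of g(x) = (1/2)||Ax - b||^2 and of f(x) = gamma ||x||_1
M = np.linalg.inv(np.eye(n) + lam * A.T @ A)            # factor once, reuse each iteration
prox_g = lambda v: M @ (v + lam * A.T @ b)
prox_f = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam * gamma, 0.0)

z = np.zeros(n)
for _ in range(300):
    x_half = prox_g(z)                 # x^{k+1/2} := R_B(z^k)
    x = prox_f(2 * x_half - z)         # x^{k+1}   := R_A(2 x^{k+1/2} - z^k)
    z = z + x - x_half                 # z^{k+1}   := z^k + x^{k+1} - x^{k+1/2}

print("objective:", 0.5 * np.sum((A @ x - b) ** 2) + gamma * np.sum(np.abs(x)))
```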
Douglas-Rachford algorithm
- many ways to rewrite/rearrange D-R algorithm
- equivalent to many other algorithms; often not obvious
- need very little: A, B maximal monotone; a solution exists
- A and B are handled separately (via R_A and R_B); they are 'uncoupled'
Alternating direction method of multipliers
to minimize f(x) + g(x), we solve 0 ∈ ∂f(x) + ∂g(x)
with A = ∂f, B = ∂g, D-R is
  x^{k+1/2} := prox_{λg}(z^k)
  x^{k+1} := prox_{λf}( 2 x^{k+1/2} - z^k )
  z^{k+1} := z^k + x^{k+1} - x^{k+1/2}
a special case of the alternating direction method of multipliers (ADMM)
Constrained optimization
- constrained convex problem: minimize f(x) subject to x ∈ C
- take A = ∂f and B = N_C (i.e., g = I_C)
- so R_A = prox_{λf} and R_B = Π_C
- D-R is
  x^{k+1/2} := Π_C(z^k)
  x^{k+1} := prox_{λf}( 2 x^{k+1/2} - z^k )
  z^{k+1} := z^k + x^{k+1} - x^{k+1/2}
Dykstra's alternating projections
- find a point in the intersection of convex sets C ∩ D
- D-R (with A = N_C, B = N_D) gives the algorithm
  x^{k+1/2} := Π_D(z^k)
  x^{k+1} := Π_C( 2 x^{k+1/2} - z^k )
  z^{k+1} := z^k + x^{k+1} - x^{k+1/2}
- this is Dykstraβs alternating projections algorithm
- much faster than classical alternating projections (e.g., for C, D polyhedral, converges in a finite number of steps); see the sketch below
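A sketch of the projection form of this D-R iteration for two simple sets, the nonnegative orthant and the hyperplane { x | 1^T x = 1 } (the particular sets and data are mine; the order of the two projections is immaterial):

```python
import numpy as np

np.random.seed(0)
n = 5
P1 = lambda v: np.maximum(v, 0.0)              # projection onto the nonnegative orthant
P2 = lambda v: v + (1.0 - v.sum()) / n         # projection onto the hyperplane {x : sum(x) = 1}

z = np.random.randn(n)
for _ in range(1000):
    x_half = P1(z)                             # first projection
    x = P2(2 * x_half - z)                     # second projection, applied to 2 x^{k+1/2} - z^k
    z = z + x - x_half                         # z^{k+1} := z^k + x^{k+1} - x^{k+1/2}

print(np.round(x, 6), x.sum())                 # a point (approximately) in both sets
```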
Consensus optimization
- want to minimize f(x) = Σ_{i=1}^{N} f_i(x)
- rewrite as the consensus problem
  minimize Σ_i f_i(x_i) subject to x_1 = x_2 = ... = x_N
- D-R consensus optimization: apply D-R with B = ∂( Σ_i f_i(x_i) ) (separable across the x_i) and A = N_C, where C = { (x_1, ..., x_N) | x_1 = ... = x_N } is the consensus set
Douglas-Rachford consensus
- the x^{k+1/2}-update splits into N separate (parallel) problems: x_i^{k+1/2} := prox_{λ f_i}(z_i^k)
- the x^{k+1}-update (projection onto the consensus set) is averaging: x_i^{k+1} := 2 x̄^{k+1/2} - z̄^k for all i (bars denote averages over i)
- the z-update becomes z_i^{k+1} := z_i^k + 2 x̄^{k+1/2} - z̄^k - x_i^{k+1/2}
- taking the average of the last equation, we get z̄^{k+1} = x̄^{k+1/2}
- renaming z̄^k - z_i^k as u_i^k (and x_i^{k+1/2} as x_i^{k+1}), D-R consensus becomes
  x_i^{k+1} := prox_{λ f_i}( x̄^k - u_i^k )
  u_i^{k+1} := u_i^k + x_i^{k+1} - x̄^{k+1}
- subsystem (local) state: x_i^k, u_i^k
- gather the x_i^{k+1}'s to compute x̄^{k+1}, which is then scattered back to the subsystems
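A sketch of the final D-R consensus iteration above, with local objectives f_i(x) = (1/2)||A_i x - b_i||_2^2 so each local prox is a small linear solve (problem data are made up):

```python
import numpy as np

np.random.seed(0)
N, n, lam = 3, 4, 1.0
A = [np.random.randn(10, n) for _ in range(N)]
b = [np.random.randn(10) for _ in range(N)]
# local objectives f_i(x) = (1/2)||A_i x - b_i||_2^2

x = np.zeros((N, n))        # local variables x_i
u = np.zeros((N, n))        # local variables u_i (running residuals)
xbar = np.zeros(n)
for _ in range(1000):
    for i in range(N):      # parallel: x_i := prox_{lam f_i}(xbar - u_i)
        x[i] = np.linalg.solve(A[i].T @ A[i] + np.eye(n) / lam,
                               A[i].T @ b[i] + (xbar - u[i]) / lam)
    xbar = x.mean(axis=0)   # gather and average
    u += x - xbar           # scatter: u_i := u_i + x_i - xbar

# the consensus iterates should agree with the centralized least squares solution
x_star = np.linalg.lstsq(np.vstack(A), np.hstack(b), rcond=None)[0]
print(np.linalg.norm(xbar - x_star))   # small
```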