Proximal Algorithms
Proximal operator
- proximal operator of f (closed proper convex) is
  prox_{λf}(v) = argmin_x ( f(x) + (1/(2λ)) ||x - v||_2^2 )
  with parameter λ > 0
- f may be nonsmooth, have embedded constraints, ...
- evaluating prox_{λf}(v) involves solving a convex optimization problem
- can evaluate via standard methods like BFGS, but very often prox_{λf} has an analytical solution or a simple specialized linear-time algorithm
Generalized projection
- proximal operator of the indicator function of a convex set C is projection:
  prox_{I_C}(v) = argmin_{x ∈ C} ||x - v||_2^2 = Π_C(v)
  many properties of projections carry over
- if f is separable across two variables, f(x, y) = φ(x) + ψ(y), then prox_f(v, w) = ( prox_φ(v), prox_ψ(w) )
- if f is block separable, so f(x) = Σ_i f_i(x_i), then ( prox_{λf}(v) )_i = prox_{λf_i}(v_i)
- example: if f(x) = ||x||_1, then ( prox_{λf}(v) )_i = sign(v_i) ( |v_i| - λ )_+,
  a simple elementwise operation called soft thresholding (see the sketch below)
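A minimal NumPy sketch of the soft thresholding operator (the function name is mine):

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft thresholding: the proximal operator of kappa * ||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

print(soft_threshold(np.array([1.5, -0.2, 0.7, -3.0]), 1.0))   # 0.5, 0, 0, -2
```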
Fixed points
- the point x* minimizes f if and only if x* is a fixed point: x* = prox_f(x*)
- provides a link between proximal operators and fixed point theory
- many proximal algorithms can be viewed as methods for finding fixed points of appropriate operators
Moreau-Yosida regularization
- the infimal convolution of f and g, denoted f □ g, is defined as
  (f □ g)(v) = inf_x ( f(x) + g(v - x) )
- the Moreau envelope or Moreau-Yosida regularization of f is M_{λf} = f □ (1/(2λ))||·||_2^2, i.e.,
  M_{λf}(v) = inf_x ( f(x) + (1/(2λ)) ||x - v||_2^2 )
- a smoothed or regularized form of f:
- always has full domain
- always continuously differentiable
- has the same minimizers as f
- can minimize M_{λf} instead of f, though M_{λf} itself could be hard to evaluate
- motivation: can show that M_{λf} = ( f* + (λ/2)||·||_2^2 )^*
- the Moreau envelope thus obtains a smooth approximation of f via:
- taking the conjugate
- regularizing to get a strongly convex function
- taking the conjugate again
- example: the Moreau envelope of f(x) = |x| is the Huber function:
  M_{λf}(v) = v^2/(2λ) for |v| ≤ λ, and |v| - λ/2 for |v| > λ
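A quick numerical check of this example (a sketch; variable names are mine):

```python
import numpy as np

lam = 1.0
v = np.linspace(-3, 3, 7)

# Moreau envelope of f = |.|: evaluate the objective at the prox point
x_star = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)          # prox_{lam*f}(v)
envelope = np.abs(x_star) + (1.0 / (2 * lam)) * (x_star - v) ** 2

# Huber function with parameter lam
huber = np.where(np.abs(v) <= lam, v**2 / (2 * lam), np.abs(v) - lam / 2)

print(np.allclose(envelope, huber))   # True
```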
Modified gradient step
- many relationships between proximal operators and gradient steps
- the proximal operator is a gradient step for the Moreau envelope: prox_{λf}(x) = x - λ∇M_{λf}(x)
- for small λ, prox_{λf}(x) converges to a gradient step in f: prox_{λf}(x) ≈ x - λ∇f(x) (for differentiable f)
- the parameter λ can be interpreted as a step size, though proximal methods will generally work even for large step sizes, unlike the gradient method
Resolvent of subdifferential operator
- if z = prox_{λf}(x), then x - z ∈ λ∂f(z), i.e., x ∈ (I + λ∂f)(z)
- can rewrite as prox_{λf}(x) = (I + λ∂f)^{-1}(x)
- must be careful to interpret ∂f and expressions using it as relations
- the mapping (I + λ∂f)^{-1} is known as the resolvent of the operator ∂f
Moreau decomposition
- the following relation always holds: v = prox_f(v) + prox_{f*}(v)
- main link between proximal operators and duality
- a generalization of the orthogonal decomposition induced by a subspace L:
  v = Π_L(v) + Π_{L^⊥}(v),
  which follows from Moreau decomposition and (I_L)* = I_{L^⊥}
Norms and norm balls
- in general: if f = ||·|| is a norm and B is the unit ball of the dual norm, then
  prox_{λf}(v) = v - λ Π_B(v/λ)
  by Moreau decomposition
- if f = ||·||_2 and B is the Euclidean unit ball, then prox_{λf}(v) = (1 - λ/||v||_2)_+ v,
  sometimes called the 'block soft thresholding' operator
- if f = ||·||_1 and B is the l_∞ unit ball, then ( prox_{λf}(v) )_i = sign(v_i)( |v_i| - λ )_+,
  which recovers (elementwise) soft thresholding
- if f = ||·||_∞ and B is the l_1 unit ball, simple algorithms are available (e.g., linear-time projection onto the l_1 ball)
Proximal minimization
- the proximal minimization or proximal point algorithm is x^{k+1} := prox_{λf}(x^k)
- the simplest proximal method, can be interpreted as
- gradient method applied to M_{λf} rather than f
- simple iteration for finding a fixed point of prox_{λf}
- if f(x) = (1/2) x^T P x - q^T x with P ⪰ 0, reduces to iterative refinement for solving Px = q (see the sketch below)
- works for any fixed λ > 0, or for varying λ^k > 0 as long as Σ_k λ^k = ∞ (non-summable)
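A minimal sketch of proximal minimization for the quadratic case mentioned above, assuming f(x) = (1/2) x^T P x - q^T x with P positive definite (problem data are made up):

```python
import numpy as np

np.random.seed(0)
n = 5
P = np.random.randn(n, n); P = P.T @ P + np.eye(n)   # P positive definite
q = np.random.randn(n)
lam = 0.1

x = np.zeros(n)
for _ in range(500):
    # prox_{lam f}(x) for f(x) = (1/2) x^T P x - q^T x: solve (I + lam P) z = x + lam q
    x = np.linalg.solve(np.eye(n) + lam * P, x + lam * q)

print(np.allclose(x, np.linalg.solve(P, q)))   # True: iterates solve P x = q
```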
Operator splitting
- the most useful proximal methods use the idea of operator splitting
- these algorithms minimize f + g using only prox_f and prox_g (or, e.g., ∇f)
- useful when f and g each have useful structure separately
- very simple historical example: alternating projections Π_C Π_D to find a point in the intersection C ∩ D
Proximal gradient method
minimize f(x) + g(x), where f is smooth and g is closed proper convex (possibly nonsmooth)
- method: x^{k+1} := prox_{λ^k g}( x^k - λ^k ∇f(x^k) )
- converges with rate O(1/k) when ∇f is Lipschitz continuous with constant L and the step sizes are λ^k = λ ∈ (0, 1/L]
- special case: projected gradient method (take g = I_C)
- if L is not known (usually the case), can use the following line search
given x^k, λ^{k-1}, and parameter β ∈ (0, 1). Let λ := λ^{k-1}. repeat:
  1. Let z := prox_{λg}( x^k - λ∇f(x^k) )
  2. break if f(z) ≤ f(x^k) + ∇f(x^k)^T (z - x^k) + (1/(2λ)) ||z - x^k||_2^2
  3. Update λ := βλ
return λ^k := λ, x^{k+1} := z
typical value of β is 1/2, and the accepted λ is reused as the starting step size in the next iteration
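A Python sketch of the proximal gradient method with this line search (function and argument names are mine); the example takes g as the indicator of the nonnegative orthant, so its prox is projection:

```python
import numpy as np

def prox_gradient(f, grad_f, prox_g, x0, lam=1.0, beta=0.5, iters=100):
    """Proximal gradient method with the backtracking line search above."""
    x = x0
    for _ in range(iters):
        g = grad_f(x)
        while True:
            z = prox_g(x - lam * g, lam)
            # sufficient-decrease condition from the line search
            if f(z) <= f(x) + g @ (z - x) + np.sum((z - x) ** 2) / (2 * lam):
                break
            lam *= beta
        x = z
    return x

# example: projected gradient (g = indicator of the nonnegative orthant)
c = np.array([1.0, -2.0, 3.0])
x = prox_gradient(f=lambda x: 0.5 * np.sum((x - c) ** 2),
                  grad_f=lambda x: x - c,
                  prox_g=lambda v, lam: np.maximum(v, 0.0),
                  x0=np.zeros(3))
print(x)   # approximately [1, 0, 3]
```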
Interpretations
- x^{k+1} is the solution to
  minimize_x ( λ g(x) + (1/2) ||x - (x^k - λ∇f(x^k))||_2^2 ),
  a trade-off between minimizing g and being close to the gradient step for f
- majorization-minimization method for f + g:
- keep minimizing a convex upper bound to the objective that is tight at the previous iterate
- here, use the upper bound g(x) + f(x^k) + ∇f(x^k)^T (x - x^k) + (1/(2λ)) ||x - x^k||_2^2 (valid when λ ∈ (0, 1/L])
- x* minimizes f + g if and only if x* = prox_{λg}( x* - λ∇f(x*) ),
  i.e., x* is a fixed point of the forward-backward operator prox_{λg} ∘ (I - λ∇f)
Accelerated proximal gradient method
minimize f(x) + g(x), where f is smooth and g is closed proper convex
- method:
  y^{k+1} := x^k + ω^k (x^k - x^{k-1})
  x^{k+1} := prox_{λ^k g}( y^{k+1} - λ^k ∇f(y^{k+1}) )
  with extrapolation parameter ω^k ∈ [0, 1), e.g., ω^k = k/(k+3)
- works for λ^k = λ ∈ (0, 1/L] and with a similar line search as before
- faster O(1/k^2) convergence rate; originated with Nesterov (1983)
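A sketch of the accelerated iteration with a fixed step λ and the common extrapolation weight (k-1)/(k+2) (names are mine); the tiny check applies it to a problem whose prox is soft thresholding:

```python
import numpy as np

def accel_prox_gradient(grad_f, prox_g, x0, lam, iters=200):
    """Accelerated proximal gradient sketch with extrapolation weight (k-1)/(k+2)."""
    x_prev = x = x0
    for k in range(1, iters + 1):
        y = x + (k - 1) / (k + 2) * (x - x_prev)          # extrapolation (momentum) step
        x_prev, x = x, prox_g(y - lam * grad_f(y), lam)   # prox-gradient step at y
    return x

# tiny check: minimize (1/2)||x - c||_2^2 + ||x||_1 (prox of g is soft thresholding)
c = np.array([3.0, 0.5, -2.0])
soft = lambda v, lam: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
print(accel_prox_gradient(lambda x: x - c, soft, np.zeros(3), lam=1.0))  # ~[2, 0, -1]
```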
Applications
Lasso
minimize (1/2)||Ax - b||_2^2 + γ||x||_1, with x ∈ R^n, A ∈ R^{m×n}, and γ > 0
- proximal gradient method (similar for accelerated):
  x^{k+1} := S_{λ^k γ}( x^k - λ^k A^T (Ax^k - b) ),
  where S_κ is soft thresholding with threshold κ
- faster implementations:
- parallel matrix-vector multiplication
- if m ≥ n, precompute A^T A and A^T b, then solve an equivalent smaller lasso problem with n × n data; effort is then mostly in computing A^T A
- compute Ax^k and A^T (Ax^k - b) by blocks, in parallel, with one block of A in memory at a time
- compute the entire regularization path with warm starting
- can easily generalize to group lasso, elastic net, GLMs, ...
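A minimal sketch of the proximal gradient (ISTA-style) iteration for the lasso problem above, with made-up problem data and a heuristic choice of γ:

```python
import numpy as np

np.random.seed(0)
m, n = 200, 500
A = np.random.randn(m, n)
b = np.random.randn(m)
gamma = 0.1 * np.linalg.norm(A.T @ b, np.inf)   # heuristic scale for gamma

L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of grad f
lam = 1.0 / L
x = np.zeros(n)
for _ in range(500):
    v = x - lam * (A.T @ (A @ x - b))                            # gradient step on f
    x = np.sign(v) * np.maximum(np.abs(v) - lam * gamma, 0.0)    # soft thresholding

print("nonzeros:", np.count_nonzero(x))
```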
Monotone Operators
Relations
- a relation F on R^n is a subset of R^n × R^n
- dom F = { x | ∃y, (x, y) ∈ F }
- overload F(x) to mean the set { y | (x, y) ∈ F }
- can think of F as a 'set-valued mapping', i.e., from dom F into subsets of R^n
- when F(x) is always empty or a singleton, we say F is a function
- any function (or operator) F : R^n → R^n is a relation; F(x) is then ambiguous: it can mean the point F(x) or the set { F(x) }
- Examples
- empty relation: F = ∅
- full relation: F = R^n × R^n
- identity: I = { (x, x) | x ∈ R^n }
- zero: 0 = { (x, 0) | x ∈ R^n }
- subdifferential relation: ∂f = { (x, g) | f(y) ≥ f(x) + g^T (y - x) for all y }
Operations on relations
- inverse (relation): F^{-1} = { (y, x) | (x, y) ∈ F }
- inverse exists for any relation
- coincides with inverse function, when inverse function exists
- composition: FG = { (x, z) | ∃y, (x, y) ∈ G, (y, z) ∈ F }
- scalar multiplication: λF = { (x, λy) | (x, y) ∈ F }
- addition: F + G = { (x, y + z) | (x, y) ∈ F, (x, z) ∈ G }
- Example: resolvent of an operator
- for a relation F and λ ∈ R, the resolvent (much more on this later) is the relation R = (I + λF)^{-1}
- for (x, y): y ∈ R(x) if and only if x ∈ y + λF(y)
Generalized equations
- goal: solve the generalized equation 0 ∈ F(x)
- i.e., find x with (x, 0) ∈ F
- the solution set or zero set is F^{-1}(0) = { x | (x, 0) ∈ F }
- if f is CCP and F = ∂f, then 0 ∈ F(x) means x minimizes f
Monotone operators
- a relation F on R^n is monotone if (u - v)^T (x - y) ≥ 0 for all (x, u), (y, v) ∈ F
- is maximal monotone if there is no monotone operator that properly contains it
- on R (n = 1), F is maximal monotone iff its graph is a connected curve with no endpoints, with nonnegative (or infinite) slope
- suppose F and G are monotone; then:
- sum: F + G is monotone
- nonnegative scaling: if λ ≥ 0, then λF is monotone
- inverse: F^{-1} is monotone
- congruence: A^T F A = { (x, A^T y) | (Ax, y) ∈ F } for A ∈ R^{n×m} is monotone (on R^m)
- zero set: F^{-1}(0) = { x | (x, 0) ∈ F } is convex if F is maximal monotone
- an affine function F(x) = Ax + b is monotone iff A + A^T ⪰ 0
Subdifferential
∂f is monotone
- suppose u ∈ ∂f(x) and v ∈ ∂f(y)
- then f(y) ≥ f(x) + u^T (y - x) and f(x) ≥ f(y) + v^T (x - y)
- add these and cancel f(x) + f(y) to get (u - v)^T (x - y) ≥ 0
if f is convex, closed, and proper (CCP), then ∂f is maximal monotone
Nonexpansive and contractive operators
- F has Lipschitz constant L if ||F(x) - F(y)||_2 ≤ L ||x - y||_2 for all x, y
- for L = 1, we say F is nonexpansive
- for L < 1, we say F is a contraction (with contraction factor L)
Properties
- if F and G have Lipschitz constant L, so does the weighted average θF + (1 - θ)G for θ ∈ [0, 1]
- composition of nonexpansive operators is nonexpansive
- composition of nonexpansive operator and contraction is a contraction
- the fixed point set of a nonexpansive operator F (with dom F = R^n),
  X = { x | F(x) = x }, is convex (but can be empty)
- a contraction has a single fixed point
Resolvent and Cayley operator
- for λ > 0, the resolvent of a relation F is R = (I + λF)^{-1}
- when λ > 0 and F is monotone, R is nonexpansive (and thus a function)
- when λ > 0 and F is maximal monotone, dom R = R^n
- the Cayley operator of F is C = 2R - I
- when λ > 0 and F is monotone, C is nonexpansive
- we write R_λ and C_λ to explicitly show the dependence on λ
Proof that C is nonexpansive
assume λ > 0 and F monotone
- suppose (x, u) ∈ R and (y, v) ∈ R, i.e., x ∈ u + λF(u), y ∈ v + λF(v)
- subtract to get (x - u) - (y - v) ∈ λ( F(u) - F(v) ) (interpreted as relations)
- multiply by (u - v)^T and use monotonicity of F to get (u - v)^T (x - y) ≥ ||u - v||_2^2
- so when x = y, we must have u = v (i.e., R is a function)
- now let's show C is nonexpansive:
  ||C(x) - C(y)||_2^2 = ||2(u - v) - (x - y)||_2^2
  = 4||u - v||_2^2 - 4(u - v)^T (x - y) + ||x - y||_2^2 ≤ ||x - y||_2^2,
  using the inequality above
- R is nonexpansive since it is the average of I and C: R = (1/2)(I + C)
Example: Linear operators
- for the linear mapping F(x) = Ax:
- monotone iff A + A^T ⪰ 0
- nonexpansive iff ||A||_2 ≤ 1
- R = (I + λA)^{-1} and C = 2(I + λA)^{-1} - I
- I + λA is nonsingular when A is monotone and λ > 0
- for the matrix case, we have an alternative formula for the Cayley operator:
  C = (I + λA)^{-1}(I - λA)
- cf. the bilinear function (1 - λa)/(1 + λa), which maps { a | Re a ≥ 0 } into the unit disk
Resolvent of subdifferential: Proximal mapping
- suppose z = (I + λ∂f)^{-1}(x), with f convex
- this means x ∈ z + λ∂f(z)
- rewrite as 0 ∈ ∂_z ( f(z) + (1/(2λ)) ||z - x||_2^2 )
  which is the same as z = argmin_u ( f(u) + (1/(2λ)) ||u - x||_2^2 )
- the RHS is called the proximal mapping of f, denoted prox_{λf}(x)
Example: Indicator function
- take f = I_C, the indicator function of a convex set C
- ∂f = N_C is the normal cone operator, N_C(x) = { g | g^T (y - x) ≤ 0 for all y ∈ C } for x ∈ C (and ∅ for x ∉ C)
- the proximal operator of f (i.e., the resolvent of N_C) is
  (I + λN_C)^{-1} = Π_C,
  where Π_C is (Euclidean) projection onto C
Resolvent of multiplier to residual map
- take F to be the multiplier-to-residual mapping for the convex problem
  minimize f(x) subject to Ax = b
- where F(y) = { b - Ax | x minimizes L(x, y) = f(x) + y^T (Ax - b) }
- z = (I + λF)^{-1}(y) means y ∈ z + λF(z), i.e., y = z + λ(b - Ax)
- for some x minimizing L(x, z)
- write as 0 ∈ ∂f(x) + A^T z, with z = y + λ(Ax - b)
- write the second term as ∇_x ( y^T (Ax - b) + (λ/2) ||Ax - b||_2^2 ), so
  0 ∈ ∂_x ( f(x) + y^T (Ax - b) + (λ/2) ||Ax - b||_2^2 )
- the function on the right side is the augmented Lagrangian L_λ(x, y) for the problem
- so R(y) = (I + λF)^{-1}(y) can be found as y + λ(Ax - b), where x minimizes the augmented Lagrangian L_λ(x, y)
Fixed points of Cayley and resolvent operators
- assume F is maximal monotone, λ > 0
- solutions of 0 ∈ F(x) are fixed points of the resolvent R = (I + λF)^{-1}: 0 ∈ F(x) iff x = R(x)
- solutions of 0 ∈ F(x) are fixed points of the Cayley operator C = 2R - I: 0 ∈ F(x) iff x = C(x)
- key result: we can solve 0 ∈ F(x) by finding fixed points of R or C
- next: how to actually find these fixed points
Contraction mapping theorem
- also known as Banach fixed point theorem
- assume F is a contraction, with Lipschitz constant L < 1
- the iteration x^{k+1} := F(x^k)
  converges to the unique fixed point of F
- proof (sketch):
- the sequence is Cauchy: ||x^{k+l} - x^k||_2 ≤ ( L^k / (1 - L) ) ||x^1 - x^0||_2
- hence x^k converges to a point x*
- x* must be the (unique) fixed point of F
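A small numerical illustration of the theorem, using the affine contraction F(x) = Ax + b with ||A||_2 < 1 (data are made up):

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(3, 3)
A *= 0.5 / np.linalg.norm(A, 2)       # scale so F(x) = A x + b has Lipschitz constant 0.5
b = np.random.randn(3)

x = np.zeros(3)
for _ in range(60):
    x = A @ x + b                      # simple iteration x^{k+1} := F(x^k)

print(np.allclose(x, np.linalg.solve(np.eye(3) - A, b)))   # True: the unique fixed point
```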
Example: Gradient method
- assume f is convex, with mI ⪯ ∇^2 f(x) ⪯ LI (i.e., f strongly convex, ∇f Lipschitz)
- the gradient method is x^{k+1} := x^k - λ∇f(x^k) = (I - λ∇f)(x^k)
  (fixed points are exactly the solutions of ∇f(x) = 0)
- I - λ∇f is Lipschitz with parameter max{ |1 - λm|, |1 - λL| }
- it is a contraction when 0 < λ < 2/L
- so the gradient method converges (geometrically) when 0 < λ < 2/L
Damped iteration of a nonexpansive operator
- suppose N is nonexpansive, dom N = R^n, with fixed point set X = { x | N(x) = x }
- can have X = ∅ (e.g., translation)
- simple iteration x^{k+1} := N(x^k) need not converge, even when X ≠ ∅ (e.g., rotation)
- damped iteration (see the sketch below): x^{k+1} := (1 - θ^k) x^k + θ^k N(x^k), with θ^k ∈ (0, 1)
- important special case: θ^k = θ, constant
- another special case: θ^k = 1/(k+1), which gives simple averaging: x^{k+1} = (1/(k+1)) Σ_{i=0}^{k} N(x^i)
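A small illustration of the damped iteration: a rotation is nonexpansive and its plain iteration just circles around the fixed point, while the damped iteration with θ^k = 1/2 converges (example data are mine):

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x_star = np.array([1.0, 2.0])
F = lambda x: R @ (x - x_star) + x_star       # nonexpansive: a rotation about x_star

x = np.array([5.0, -3.0])
for _ in range(200):
    x = 0.5 * (x + F(x))                      # damped iteration, theta^k = 1/2

print(np.round(x, 6))                          # converges to the fixed point [1, 2]
```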
Convergence results
- assume N is nonexpansive, X ≠ ∅, and Σ_k θ^k (1 - θ^k) = ∞
  (which holds for the special cases above)
- then we have dist(x^k, X) → 0 along a subsequence,
  i.e., (some) iterates get arbitrarily close to the fixed point set, and ||x^k - N(x^k)||_2 → 0 along a subsequence,
  i.e., (some) iterates yield arbitrarily good 'almost fixed points'
- N(x^k) is no farther from X than x^k is (by nonexpansivity)
- so x^{k+1}, a convex combination of x^k and N(x^k), is no farther from X than x^k is
Proof
- start with the identity
  ||(1 - θ)a + θb - y||_2^2 = (1 - θ)||a - y||_2^2 + θ||b - y||_2^2 - θ(1 - θ)||a - b||_2^2
- apply it to x^{k+1} = (1 - θ^k) x^k + θ^k N(x^k) with y = x* ∈ X:
  ||x^{k+1} - x*||_2^2 ≤ ||x^k - x*||_2^2 - θ^k (1 - θ^k) ||x^k - N(x^k)||_2^2,
  using ||N(x^k) - x*||_2 ≤ ||x^k - x*||_2
- iterate the inequality to get Σ_{i=0}^{k} θ^i (1 - θ^i) ||x^i - N(x^i)||_2^2 ≤ ||x^0 - x*||_2^2
- if ||x^i - N(x^i)||_2 ≥ ε for i = 0, ..., k, then ε^2 ≤ ||x^0 - x*||_2^2 / Σ_{i=0}^{k} θ^i (1 - θ^i)
- the RHS goes to zero as k → ∞, so ||x^i - N(x^i)||_2 eventually drops below any ε > 0
Damped Cayley iteration
- want to solve 0 ∈ F(x) with F maximal monotone
- damped Cayley iteration
  x^{k+1} := (1 - θ^k) x^k + θ^k C(x^k), with C = 2(I + λF)^{-1} - I
  with λ > 0 and θ^k ∈ (0, 1), Σ_k θ^k (1 - θ^k) = ∞
- converges (assuming 0 ∈ F(x) has a solution) in the sense given above
- important: requires the ability to evaluate the resolvent map (I + λF)^{-1} of F
Proximal point algorithm
- take θ^k = 1/2 in the damped Cayley iteration
- gives the resolvent iteration or proximal point algorithm: x^{k+1} := (I + λF)^{-1}(x^k)
- if F = ∂f with f convex, yields the proximal minimization algorithm x^{k+1} := prox_{λf}(x^k)
can interpret as quadratic regularization that goes away in limit
- many classical algorithms are just proximal point methods applied to the appropriate maximal monotone operator
Method of multipliers
- like the dual method, but with the augmented Lagrangian and a specific step size λ
- converges under far more general conditions than dual subgradient
- f need not be strictly convex, or differentiable
- f can take on the value +∞
- but not amenable to decomposition
Monotone Operator Splitting Methods
Operator splitting
- want to solve 0 ∈ F(x) with F maximal monotone
- main idea: write F as F = A + B, with A and B maximal monotone
- called operator splitting
- solve using methods that require evaluation of the resolvents R_A = (I + λA)^{-1} and R_B = (I + λB)^{-1}
  (or the Cayley operators C_A = 2R_A - I and C_B = 2R_B - I)
- useful when R_A and R_B can be evaluated more easily than (I + λF)^{-1}
Main result
- A and B maximal monotone, so the Cayley operators C_A and C_B are nonexpansive
- hence the composition C_A C_B is nonexpansive
- key result: 0 ∈ A(x) + B(x) iff C_A C_B(z) = z with x = R_B(z)
- so solutions of 0 ∈ A(x) + B(x) can be found from fixed points of the nonexpansive operator C_A C_B
Proof of main result
- write x = R_B(z) and C_A C_B(z) = z as the four equations
  x = R_B(z),   w = 2x - z,   x' = R_A(w),   z = 2x' - w
- subtract the 2nd & 4th equations to conclude x' = x
- the 4th equation is then z = 2x - w; the 1st and 3rd mean z - x ∈ λB(x) and w - x ∈ λA(x)
- now add z - x ∈ λB(x) and w - x ∈ λA(x) to get z + w - 2x ∈ λ( A(x) + B(x) )
- hence 0 ∈ A(x) + B(x)
- the argument goes the other way too (but we don't need it)
Peaceman-Rachford and Douglas-Rachford splitting
- Peaceman-Rachford splitting is the undamped iteration z^{k+1} := C_A C_B(z^k);
  it doesn't converge in the general case; need C_A or C_B to be a contraction
- Douglas-Rachford splitting is the damped iteration z^{k+1} := (1/2)( z^k + C_A C_B(z^k) );
  it always converges when 0 ∈ A(x) + B(x) has a solution
Douglas-Rachford splitting
write the D-R iteration as
  x^{k+1/2} := R_B(z^k)
  z^{k+1/2} := 2 x^{k+1/2} - z^k
  x^{k+1} := R_A(z^{k+1/2})
  z^{k+1} := z^k + x^{k+1} - x^{k+1/2}
  the last update follows from z^{k+1} = (1/2)( z^k + C_A C_B(z^k) ) = (1/2)( z^k + 2x^{k+1} - z^{k+1/2} )
- can consider x^{k+1} - x^{k+1/2} as a residual
- z^k is then (up to z^0) the running sum of the residuals (see the sketch below)
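A sketch of this D-R iteration applied to a lasso-type problem f + g, taking g as the quadratic (its resolvent is a linear solve) and f as the l1 term (its resolvent is soft thresholding); problem data and the choice λ = 1 are arbitrary:

```python
import numpy as np

np.random.seed(0)
m, n = 100, 30
A = np.random.randn(m, n)
b = np.random.randn(m)
gamma, lam = 1.0, 1.0

# resolvents: prox of g(x) = (1/2)||Ax - b||^2 and of f(x) = gamma ||x||_1
M = np.linalg.inv(np.eye(n) + lam * A.T @ A)            # factor once, reuse each iteration
prox_g = lambda v: M @ (v + lam * A.T @ b)
prox_f = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam * gamma, 0.0)

z = np.zeros(n)
for _ in range(300):
    x_half = prox_g(z)                 # x^{k+1/2} := R_B(z^k)
    x = prox_f(2 * x_half - z)         # x^{k+1}   := R_A(2 x^{k+1/2} - z^k)
    z = z + x - x_half                 # z^{k+1}   := z^k + x^{k+1} - x^{k+1/2}

print("objective:", 0.5 * np.sum((A @ x - b) ** 2) + gamma * np.sum(np.abs(x)))
```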
Douglas-Rachford algorithm
- many ways to rewrite/rearrange D-R algorithm
- equivalent to many other algorithms; often not obvious
- need very little: A, B maximal monotone; a solution exists
- A and B are handled separately (via R_A and R_B); they are 'uncoupled'
Alternating direction method of multipliers
to minimize f(x) + g(x), we solve 0 ∈ ∂f(x) + ∂g(x)
with A = ∂f, B = ∂g, D-R is
  x^{k+1/2} := prox_{λg}(z^k)
  x^{k+1} := prox_{λf}( 2 x^{k+1/2} - z^k )
  z^{k+1} := z^k + x^{k+1} - x^{k+1/2}
a special case of the alternating direction method of multipliers (ADMM)
Constrained optimization
- constrained convex problem: minimize f(x) subject to x ∈ C
- take A = ∂f and B = N_C (i.e., g = I_C)
- so R_A = prox_{λf} and R_B = Π_C
- D-R is
  x^{k+1/2} := Π_C(z^k)
  x^{k+1} := prox_{λf}( 2 x^{k+1/2} - z^k )
  z^{k+1} := z^k + x^{k+1} - x^{k+1/2}
Dykstra's alternating projections
- find a point in the intersection of convex sets C ∩ D
- D-R (with A = N_C, B = N_D) gives the algorithm
  x^{k+1/2} := Π_D(z^k)
  x^{k+1} := Π_C( 2 x^{k+1/2} - z^k )
  z^{k+1} := z^k + x^{k+1} - x^{k+1/2}
- this is Dykstraβs alternating projections algorithm
- much faster than classical alternating projections (e.g., for C, D polyhedral, converges in a finite number of steps); see the sketch below
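A sketch of the projection form of this D-R iteration for two simple sets, the nonnegative orthant and the hyperplane { x | 1^T x = 1 } (the particular sets and data are mine; the order of the two projections is immaterial):

```python
import numpy as np

np.random.seed(0)
n = 5
P1 = lambda v: np.maximum(v, 0.0)              # projection onto the nonnegative orthant
P2 = lambda v: v + (1.0 - v.sum()) / n         # projection onto the hyperplane {x : sum(x) = 1}

z = np.random.randn(n)
for _ in range(1000):
    x_half = P1(z)                             # first projection
    x = P2(2 * x_half - z)                     # second projection, applied to 2 x^{k+1/2} - z^k
    z = z + x - x_half                         # z^{k+1} := z^k + x^{k+1} - x^{k+1/2}

print(np.round(x, 6), x.sum())                 # a point (approximately) in both sets
```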
Consensus optimization
- want to minimize f(x) = Σ_{i=1}^{N} f_i(x)
- rewrite as the consensus problem
  minimize Σ_i f_i(x_i) subject to x_1 = x_2 = ... = x_N
- D-R consensus optimization: apply D-R with B = ∂( Σ_i f_i(x_i) ) (separable across the x_i) and A = N_C, where C = { (x_1, ..., x_N) | x_1 = ... = x_N } is the consensus set
Douglas-Rachford consensus
- the x^{k+1/2}-update splits into N separate (parallel) problems: x_i^{k+1/2} := prox_{λ f_i}(z_i^k)
- the x^{k+1}-update (projection onto the consensus set) is averaging: x_i^{k+1} := 2 x̄^{k+1/2} - z̄^k for all i (bars denote averages over i)
- the z-update becomes z_i^{k+1} := z_i^k + 2 x̄^{k+1/2} - z̄^k - x_i^{k+1/2}
- taking the average of the last equation, we get z̄^{k+1} = x̄^{k+1/2}
- renaming z̄^k - z_i^k as u_i^k (and x_i^{k+1/2} as x_i^{k+1}), D-R consensus becomes
  x_i^{k+1} := prox_{λ f_i}( x̄^k - u_i^k )
  u_i^{k+1} := u_i^k + x_i^{k+1} - x̄^{k+1}
- subsystem (local) state: x_i^k, u_i^k
- gather the x_i^{k+1}'s to compute x̄^{k+1}, which is then scattered back to the subsystems
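A sketch of the final D-R consensus iteration above, with local objectives f_i(x) = (1/2)||A_i x - b_i||_2^2 so each local prox is a small linear solve (problem data are made up):

```python
import numpy as np

np.random.seed(0)
N, n, lam = 3, 4, 1.0
A = [np.random.randn(10, n) for _ in range(N)]
b = [np.random.randn(10) for _ in range(N)]
# local objectives f_i(x) = (1/2)||A_i x - b_i||_2^2

x = np.zeros((N, n))        # local variables x_i
u = np.zeros((N, n))        # local variables u_i (running residuals)
xbar = np.zeros(n)
for _ in range(1000):
    for i in range(N):      # parallel: x_i := prox_{lam f_i}(xbar - u_i)
        x[i] = np.linalg.solve(A[i].T @ A[i] + np.eye(n) / lam,
                               A[i].T @ b[i] + (xbar - u[i]) / lam)
    xbar = x.mean(axis=0)   # gather and average
    u += x - xbar           # scatter: u_i := u_i + x_i - xbar

# the consensus iterates should agree with the centralized least squares solution
x_star = np.linalg.lstsq(np.vstack(A), np.hstack(b), rcond=None)[0]
print(np.linalg.norm(xbar - x_star))   # small
```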