A simple analysis of multiplicative weights
I give a clean and self-contained presentation of the multiplicative weights framework that is somewhat more direct than the most popular analysis (see the comparison below). The framework provides a simple algorithm for (approximately) solving many important problems (e.g., maximum flow, densest subgraph, linear classification, zero-sum games and even general linear programs) and is a beautiful demonstration of leveraging continuous analysis to solve discrete problems. Warning: this is a technical post.
All of the above problems can be written in the following simple form:
\[\begin{align} \text{minimize }\ & \max(Ax) \\ \text{such that }\ & x \in K \end{align}\]Symbol $A \in \mathbb{R}^{n \times m}$ represents a matrix, $x \in \mathbb{R}^m$ is a vector we are trying to optimize over, $\max : \mathbb{R}^n \to \mathbb{R}$ returns the maximum entry of a vector (e.g., $\max([1, 3, -4]^T) = 3$), and $K$ is some convex set (often called the easy constraints).
Example: Casting maximum flow in the above form.
Suppose we are given an uncapacitated directed graph $G = (V, E)$ and we want to compute the maxflow between some $s, t \in V$. Suppose that the optimal value of this problem is $\OPT$ (i.e., there are $\OPT$ edge-disjoint paths between $s$ and $t$).
We cast the problem in the above form. On a high level, the easy constraints $K$ will correspond to unit flows from $s$ to $t$. We first fix the representation of $K$: a flow $x \in K$ is a vector of dimension $\vert E \vert$, one scalar for each directed edge $e \in E$. Precisely, $x_e$ represents the amount of flow pushed through $e$. This determines the representation. Now, define $K$ to be the convex hull of (indicator vectors of) all simple paths from $s$ to $t$. As a side note, this is equivalent to saying the flow $x \in K$ is conserving in $V \setminus \{s, t\}$, pushes 1 unit from $s$, and contains no circulations; the equivalence follows from the integrality of the flow polytope. While it might seem strange to call this polytope with an exponential number of vertices “easy constraints”, we only interact with $K$ by optimizing linear functions over it. And optimizing a linear function $f(x) = \inner{c, x} = \sum_{e \in E} c_e \cdot x_e$ over $x \in K$ is exactly the problem of finding a shortest path between $s$ and $t$ with edge costs $c \in \mathbb{R}^{\vert E \vert}$.
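To spell out this last claim (a small worked step added here; the path-indicator notation $x^P$ is not from the original text): a linear function attains its minimum over the polytope $K$ at a vertex, and the vertices of $K$ are exactly the indicator vectors $x^P$ of simple $s$-$t$ paths $P$, so
\[\min_{x \in K} \inner{c, x} = \min_{P:\, s \to t} \inner{c, x^P} = \min_{P:\, s \to t} \sum_{e \in P} c_e,\]
i.e., the oracle only has to solve a shortest-path problem (the costs $c$ produced by the algorithm below are positive, so Dijkstra applies).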
We want the objective $\max(Ax)$ of our form to correspond to the maximum amount of flow going through any edge of the graph (call this the congestion). Using the above representation for members of $K$, this means we set $A$ to be the identity matrix, $A = I$. Minimizing congestion might seem counterintuitive (or at least unrelated) to maxflow, which wants to maximize the flow from $s$ to $t$. However, we claim that minimizing the congestion over $K$ is equivalent to maxflow. For one direction of the equivalence, if there are $\OPT$ edge-disjoint paths, we can route $1/\OPT$ units of flow along each of them; since the paths share no edges, this yields a unit flow with congestion $1/\OPT$. Moreover, one can easily show the converse (a unit flow with smaller congestion yields more edge-disjoint paths), and hence solving the above problem solves maximum flow. Of course, the optimal values of the disjoint-path formulation and the congestion-minimizing formulation are reciprocals of each other, but the problems are otherwise equivalent.
Digression: High-level comparison with other analyses:
This post is inspired by the struggles I had a few years back while trying to learn the multiplicative weights framework. Most popular analyses motivate the approach via the expert prediction problem [AHK]. While that approach is intuitive on its own, my intuition completely disappeared when using it to solve problems such as maximum flow. This is because the experts from [AHK] essentially correspond to dual variables, which are largely disconnected from the original (primal) problem. This analysis keeps the entire discussion in the primal. I have not seen this analysis written down anywhere, but I am sure researchers in the area are well aware of it.
The analysis is essentially equivalent to the Frank-Wolfe method of optimization applied to the log-sum-exp function (denoted below as $\smax$) over $x \in K$. One can say that multiplicative weights is an instance of the Frank-Wolfe method.
On a high level, multiplicative weights smoothen the problem by replacing the (regular) max with a smooth version of max. The smoothened problem is solved iteratively: the current solution $x_{t-1}$ is updated by linearizing the smooth max around $x_{t-1}$, finding the minimizer $h_t$ of the linearization via any black-box method, and updating $x_{t} \gets x_{t-1} + h_t$. The following fact introduces the smooth max and a few of its simple-to-prove properties.
Fact (smooth max): Define $\smax : \mathbb{R}^n \to \mathbb{R}$ as
\[\mathrm{smax}_{\beta}(x) = \frac{1}{\beta} \ln\left( \sum_{i=1}^n \exp(\beta \cdot x_i) \right)\]where $\beta$ is some “accuracy” parameter (increasing $\beta$ increases accuracy but decreases smoothness). The following properties hold:
- $\smax$ approximates max: \(\max(x) \le \smax_{\beta}(x) \le \max(x) + \frac{\ln n}{\beta}\) for all $x \in \mathbb{R}^n$.
- The gradient is some probability distribution over $[n]$: \(\left( \nabla \smax_{\beta}(x) \right)_i = \frac{\exp(\beta x_i)}{\sum_{j=1}^n \exp(\beta x_j)}\), which is entrywise nonnegative and sums to $1$.
- $\smax_{\beta}$ is $\beta$-smooth with respect to $\dnorm{\cdot}_\infty$: \(\smax_{\beta}(x + h) \le \smax_{\beta}(x) + \inner{\nabla \smax_{\beta}(x), h} + \beta \dnorm{h}^2_\infty\) for all $x, h \in \mathbb{R}^n$.
- \(\smax_{\beta}( \pmb{0} ) = \frac{\ln n}{\beta}\), where $\pmb{0} = [0,0,\ldots,0]$.
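As a quick numerical sanity check of properties 1, 2, and 4, here is a small Python snippet added for illustration; the helper names `smax` and `smax_grad` are mine, not from the original post.

```python
import numpy as np

def smax(x, beta):
    """(1/beta) * ln(sum_i exp(beta * x_i)), computed in a numerically stable way."""
    m = x.max()
    return m + np.log(np.sum(np.exp(beta * (x - m)))) / beta

def smax_grad(x, beta):
    """Gradient of smax_beta: the softmax distribution proportional to exp(beta * x)."""
    w = np.exp(beta * (x - x.max()))
    return w / w.sum()

x = np.array([1.0, 3.0, -4.0])
for beta in (0.5, 2.0, 20.0):
    # Property 1: max(x) <= smax_beta(x) <= max(x) + ln(n)/beta.
    assert x.max() <= smax(x, beta) <= x.max() + np.log(len(x)) / beta
    # Property 2: the gradient is a probability distribution over the n entries.
    assert np.isclose(smax_grad(x, beta).sum(), 1.0)
# Property 4: smax_beta(0) = ln(n)/beta.
assert np.isclose(smax(np.zeros(5), 3.0), np.log(5) / 3.0)
```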
We introduce and compare the original problem, the smoothened version, and the linearized version around $x_{t-1}$:
\[\begin{align} \text{(original)} \quad & \min_{x \in K}\ \max(Ax) \\ \text{(smoothened)} \quad & \min_{x \in K}\ \smax_{\beta}(Ax) \\ \text{(linearized around } x_{t-1}) \quad & \min_{h \in K}\ \inner{ \nabla \smax_{\beta}(A x_{t-1}), A h } \end{align}\]
In the maxflow problem, the parameter $\beta$ is chosen to be approximately $\eps$. Let’s elaborate on the objective of the last problem. Intuitively, we relax our original goal of minimizing $\max(Ax)$ to minimizing $\inner{p, Ax}$ for some probability distribution $p$. Clearly, the latter problem is connected to the middle one (see below), but it is significantly easier: it is linear, and it relaxes the original problem since $\inner{p, Ax} \le \max(Ax)$ for every probability distribution $p$, so any solution to the original is also a solution of the latter with an objective value that is at least as good.
Formally, let $\Phi(x) := \smax_{\beta}(Ax)$ be the objective of the smoothened (middle) problem. Then $\inner{ \nabla \smax_{\beta}(A x_{t-1}), A h_t }$ is exactly the linearization of $\Phi(x)$ around $x_{t-1}$. This is because
\[\begin{align} \Phi(x_{t-1} + h_t) - \Phi(x_{t-1}) & \approx \inner{\nabla \Phi(x_{t-1}), h_t } \\ & = \inner{ A^T \nabla \smax_{\beta}(A x_{t-1}), h_t} \\ & = \inner{ \nabla \smax_{\beta}(A x_{t-1}), A h_t}. \end{align}\]We can now present the full multiplicative weights algorithm. We repeat the linearize-solve-update loop for a total of $T$ rounds (to be specified later). Solving the linearized problem is done via some black-box method (called the oracle) that needs to be supplied to the algorithm upfront. Precisely, the oracle is supplied a probability distribution $p \in \mathbb{R}^n$ and needs to return $\arg\min_{x \in K} \inner{ p, A x }$. Luckily, the linearized problem is often much simpler than the original one and can be easily optimized.
How to choose the number of rounds $T$? It will mostly depend on two important parameters: the accuracy $\eps$ we require from the final solution, and a new parameter $\rho$ that we call the width of the oracle. The only requirements on the width are that $\rho \ge 1$ and that $\rho \ge \dnorm{Ax}_\infty$ for any $x$ that can be returned by the oracle (e.g., $\rho = \max_{x \in K} \dnorm{Ax}_\infty$ suffices, but sometimes one can implement an oracle that achieves a better bound).
Algorithm (Multiplicative Weights):
- Input: matrix $A \in \mathbb{R}^{n \times m}$, oracle for the linearized problem, width $\rho$, accuracy $\eps$.
- Define \(\beta := \frac{\eps}{2 \rho^2}\), $x_0 := 0$, $\Phi(x) := \smax_{\beta}(Ax)$, \(T := \frac{2 \ln n}{\beta \cdot \eps} = O(\eps^{-2} \rho^2 \ln n)\).
- For $t = 1 \ldots T$
- \(h_t := \arg \min_{h \in K} \inner{\nabla \smax_{\beta}(A x_{t-1}), A h }\) via the oracle
- $x_t := x_{t-1} + h_t \qquad$ (equivalent: $x_t = \sum_{i=1}^t h_i$)
- Output $(1/T) \cdot x_T$
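For concreteness, here is a direct numpy transcription of the above pseudocode. This is a minimal sketch, not from the original post: the names `mw_minimize`, `smax_grad`, and `oracle` are mine, and the oracle is assumed to take the probability vector $p = \nabla \smax_{\beta}(A x_{t-1})$ and return $\arg\min_{x \in K} \inner{p, Ax}$.

```python
import numpy as np

def smax_grad(v, beta):
    """Gradient of smax_beta at v: the softmax distribution proportional to exp(beta * v)."""
    w = np.exp(beta * (v - v.max()))  # subtract the max entry for numerical stability
    return w / w.sum()

def mw_minimize(A, oracle, rho, eps):
    """Approximately minimize max(Ax) over x in K via multiplicative weights.

    `oracle(p)` must return argmin_{x in K} <p, A x> for a probability vector p,
    and every point it returns must satisfy ||A x||_inf <= rho.
    """
    n, m = A.shape
    beta = eps / (2.0 * rho ** 2)
    T = int(np.ceil(2.0 * np.log(n) / (beta * eps)))
    x = np.zeros(m)
    for _ in range(T):
        p = smax_grad(A @ x, beta)  # gradient of smax_beta(A x) at the current point
        x = x + oracle(p)           # h_t := oracle's answer; update x_t = x_{t-1} + h_t
    return x / T                    # average of T points of K, hence a point of K
```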
Example: Multiplicative weights for maximum flow.
Suppose we want to solve maxflow between $s$ and $t$ with $\eps$ additive error. We assume for simplicity that the graph is directed and uncapacitated, which allows us to set $\rho = 1$. Set $\beta$ and $T$ accordingly. Let $\OPT$ be the optimal value of the problem when cast in the aforementioned standard form (e.g., $\OPT = 1/F$ if $F$ is the number of edge-disjoint paths between $s$ and $t$; note that we need to set $\eps < 1/F$ to get a meaningful approximation).
Initialize a “congestion” vector $x := [0, 0, \ldots, 0] \in \mathbb{R}^{\vert E \vert}$ that remembers, for each edge, how many times it has been used. We repeat the following for $T$ rounds: for each directed edge $e$ we compute a cost $c_e := \exp(\beta x_e) > 0$. Normalize this vector of costs $c$ by dividing all entries by \(Z := \sum_{e \in E} \exp(\beta x_e)\), which makes their sum equal to $1$ (note: this is unnecessary for the shortest-path computation, but we do it to stay true to the algorithm). This vector $c$ is then exactly equal to $\nabla \smax_{\beta}(Ax)$. Find the shortest path $P$ between $s$ and $t$ with respect to the edge costs $c$ (e.g., with Dijkstra’s algorithm). Update $x$ by incrementing $x_e$ for each edge $e$ used in the shortest path $P$.
After the above loop terminates, the collection of all shortest paths found throughout the algorithm provides us with $T$ paths whose total congestion on any edge is at most $T \cdot (\OPT + \eps)$. Scaling by $1/T$, we obtain a unit flow with congestion at most $\OPT + \eps$, and we are done.
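Here is a minimal self-contained Python sketch of the loop just described; I use `networkx` purely for Dijkstra, and the function name and toy graph at the end are my own illustration, not part of the original post.

```python
import math
import networkx as nx

def approx_min_congestion_flow(edge_list, s, t, eps):
    """Multiplicative weights for uncapacitated maxflow: returns a fractional unit
    s-t flow whose maximum edge congestion is within eps of optimal (width rho = 1)."""
    edges = list(edge_list)
    G = nx.DiGraph()
    G.add_edges_from(edges)
    beta = eps / 2.0                                   # beta = eps / (2 rho^2) with rho = 1
    T = int(math.ceil(2.0 * math.log(len(edges)) / (beta * eps)))
    x = {e: 0.0 for e in edges}                        # congestion counter per edge
    for _ in range(T):
        # Costs c = gradient of smax_beta at x: proportional to exp(beta * x_e).
        z = sum(math.exp(beta * x[e]) for e in edges)
        for (u, v) in edges:
            G[u][v]["weight"] = math.exp(beta * x[(u, v)]) / z
        # Oracle: shortest s-t path under these costs (a vertex of K).
        path = nx.dijkstra_path(G, s, t, weight="weight")
        for e in zip(path[:-1], path[1:]):
            x[e] += 1.0                                # x_t = x_{t-1} + h_t
    return {e: x[e] / T for e in edges}                # average: a unit flow in K

# Toy usage: two edge-disjoint s-t paths, so the optimal congestion is 1/2.
flow = approx_min_congestion_flow(
    [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")], "s", "t", eps=0.1)
```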
Digression: Low-level comparison with other analyses.
Many other authors take a dual approach to the one stated above, in which one maintains a set of “multiplicative weights” or “dual variables” $y \in \mathbb{R}^n$ (one for each row of $A$, i.e., each coordinate of $Ax$). The dual variable roughly tells us how “important” a constraint is (e.g., if $y_3 = 100$ and $y_7 = 1$, it would mean that constraint 3 gets violated much more often and more egregiously than constraint 7). Initially, $y \gets [1, 1, \ldots, 1]$, and then for each oracle answer $h$ we multiplicatively update $y$ by a factor depending on the violation $Ah$ via the formula \(y_i \gets y_i \cdot \exp\left(\beta (A h)_{i} \right)\) for $i \in [n]$. This multiplicative reweighting of the constraint importances is the reason behind the method’s name. The “linearize+solve” step now corresponds to solving $\min_{h \in K} \inner{y, A h} = \min_{h \in K} \inner{A^T y, h}$. Iteratively performing this procedure yields the same dynamics as the pseudocode above.
Pattern-matching $\inner{A^T y, h}$ with $\inner{ A^T \nabla \smax_{\beta}(A x_{t-1}), h }$, one can correctly presume that the connection between the methods is that $y = \nabla \smax_{\beta}(A x_{t-1})$. This is indeed the case (up to a normalization factor that can be ignored): by property 2 of the smooth-max Fact, $\nabla \smax_{\beta}(A x_{t-1})$ is a probability distribution proportional to $\exp(\beta A(h_1 + \ldots + h_{t-1}))$, and $y$ can easily be seen to have the same value.
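Spelled out, starting from $y = [1, \ldots, 1]$ and applying the multiplicative update for the answers $h_1, \ldots, h_{t-1}$ gives
\[ y_i = \prod_{s=1}^{t-1} \exp\left(\beta (A h_s)_i\right) = \exp\left(\beta (A x_{t-1})_i\right) \propto \left(\nabla \smax_{\beta}(A x_{t-1})\right)_i, \]
so normalizing $y$ to sum to $1$ recovers exactly the probability distribution that the primal view feeds to the oracle.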
We now prove that the above algorithm works.
Theorem: Let $\OPT$ be the value of the optimal solution and $\tilde{x} := x_T/T$ be the output of the multiplicative weights algorithm. Then $\tilde{x} \in K$ and $\max(A \tilde{x}) \le OPT + \eps$.
Proof: It is sufficient to show $\max(A x_T) \le T \cdot (OPT + \eps)$ since it is equivalent to $\max(A \tilde{x}) = \max(A x_T/T) \le OPT + \eps$, by scaling. Furthermore, note that solving the smoothened version of the problem $\Phi(x) = \smax_{\beta}(Ax)$ is “harder” than solving the original one since $\Phi(x) \ge \max(A x)$ by property 1 of Fact (smooth max). Therefore, it is sufficient to show $\Phi(x_T) \le T \cdot (OPT + \eps)$. Also, note that $x_T/T = \sum_{t=1}^T \frac{1}{T} h_t \in K$ since $K$ is convex and $h_t \in K$.
We track the evolution of $\Phi(x_t) = \Phi(h_1 + h_2 + \ldots + h_t)$ by arguing about the increase $\Phi(x_{t-1} + h_t) - \Phi(x_{t-1})$ in each round. Up to first-order terms, the increase is exactly $\inner{\nabla \smax_{\beta}(A x_{t-1}), A h_t} = \inner{\nabla \Phi(x_{t-1}), h_t}$, the objective of the linearized problem we are optimizing. We argue that this (approximate) increase is small, namely $\inner{\nabla \Phi(x_{t-1}), h_t} \le \OPT$. This is immediate because the linearized problem is a “relaxation” of the original problem. Let $x_{\ast} \in K$ with $\max(A x_{\ast}) = \OPT$ be the optimal solution. It is easy to see that \(\inner{p, A x_{\ast}} \le \OPT\) for any probability distribution $p$ (we use $p = \nabla \smax_{\beta}(A x_{t-1})$), hence the oracle, which minimizes over $K$, always returns an $h_t$ whose linearized objective is at most $\OPT$.
We argued that the first-order increase in $\Phi$ is $\inner{\nabla \Phi(x_{t-1}), h_t} \le \OPT$. In reality, we have to account for higher-order errors via property 3 of Fact (smooth max).
\[\Phi(x_{t-1}+h_t) - \Phi(x_{t-1}) \le \inner{\nabla \Phi(x_{t-1}), h_t} + \beta \dnorm{A h_t}^2_\infty \le \OPT + \frac{\eps}{2}\]This allows us to effectively bound $\Phi$:
\[\Phi(x_T) \le \Phi(\pmb{0}) + \sum_{t=1}^T\left[ \Phi(x_{t-1} + h_t) - \Phi(x_{t-1}) \right] \le \frac{\ln n}{\beta} + \sum_{t=1}^T \left[ \OPT + \frac{\eps}{2} \right]\]Straightforward algebra with $T = \frac{2 \ln n}{\beta \cdot \eps}$ (spelled out below) shows that the right-hand side equals $T \cdot (\OPT + \eps)$, and we are done.
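Concretely, the choice of $T$ gives $\frac{\ln n}{\beta} = \frac{T \eps}{2}$, so
\[\frac{\ln n}{\beta} + \sum_{t=1}^T \left[ \OPT + \frac{\eps}{2} \right] = \frac{T \eps}{2} + T \cdot \OPT + \frac{T \eps}{2} = T \cdot (\OPT + \eps).\]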
A few final points:
- For a full proof please see the appendix of [Zuz].
- One can often get a better dependence on $\rho$ for special problems. E.g., one can get a linear instead of quadratic dependence on $\rho$ for positive packing and positive covering problems such as maxflow. See [AHK] for details.
- Thanks to Thatchaphol Saranurak for pointing out that one can also rederive the online experts algorithm using this analysis.
- It would be interesting to rederive matrix multiplicative weights using an analogue of the above analysis.
References:
- [AHK] Arora, Hazan, Kale: The Multiplicative Weights Update Method: A Meta-Algorithm and Applications
- [Zuz] Zuzic: A Simple Boosting Framework for Transshipment