On the calculus of Kullback-Leibler Divergence
Author: Tran Thu Le
Date: 16/02/2023
Abstract. This post investigates the calculus of the Kullback-Leibler divergence in the context of convex optimization.
For $b, v, \varepsilon \in \mathbb{R}_+^m$, let
$$\mathrm{KL}(b,v) \triangleq \langle b, \log(b/(v+\varepsilon))\rangle - \langle 1, b\rangle + \langle 1, v+\varepsilon\rangle. \tag{KL}$$
Here we define $\log(0) \triangleq -\infty$.
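As a quick numerical companion to (KL), here is a minimal NumPy sketch. The function name `kl_div` and the test vectors are mine, chosen only for illustration, and I assume $b > 0$ entrywise so the convention on $\log(0)$ is never triggered.

```python
import numpy as np

def kl_div(b, v, eps):
    # KL(b, v) as in (KL): <b, log(b/(v+eps))> - <1, b> + <1, v+eps>.
    # Assumes b > 0 entrywise so that the log stays finite.
    b, v, eps = map(np.asarray, (b, v, eps))
    return np.dot(b, np.log(b / (v + eps))) - b.sum() + (v + eps).sum()

# Tiny example with m = 3 (illustrative values only).
b   = np.array([1.0, 2.0, 0.5])
v   = np.array([0.8, 2.5, 0.4])
eps = np.full(3, 1e-3)
print(kl_div(b, v, eps))        # a nonnegative value
print(kl_div(b, b - eps, eps))  # ~0: the divergence vanishes when v + eps = b
```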
Fenchel conjugate
The Fenchel conjugate of $\varphi$ is defined as
$$\varphi^*(u) \triangleq \sup_{v \in \mathbb{R}^m} \langle u, v\rangle - \varphi(v).$$
For $u \in (-\infty, 1]^m$ and $f(\cdot) = \mathrm{KL}(b, \cdot)$ for some given $b \in \mathbb{R}_+^m$, we have
$$f^*(u) = -\langle b, \log(1-u)\rangle - \langle \varepsilon, u\rangle. \tag{KL*}$$
Indeed, we have
$$\begin{aligned}
\phi(u,v) &\triangleq \langle u, v\rangle - \langle b, \log(b/(v+\varepsilon))\rangle + \langle 1, b\rangle - \langle 1, v+\varepsilon\rangle \\
&= \langle u, v+\varepsilon\rangle + \langle b, \log(v+\varepsilon)\rangle - \langle 1, v+\varepsilon\rangle - \langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle \\
&= \big[\langle u-1, v+\varepsilon\rangle + \langle b, \log(v+\varepsilon)\rangle\big] + \big[-\langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle\big].
\end{aligned}$$
The supremum over $v$ is attained when the gradient of $\phi(u,\cdot)$ vanishes, i.e., at the $v$ satisfying
$$v + \varepsilon = \frac{b}{1-u}.$$
In this case, notice that $\langle u-1, v+\varepsilon\rangle = -\langle 1, b\rangle$, so
$$\begin{aligned}
f^*(u) &= \sup_{v} \phi(u,v) \\
&= \big[-\langle 1, b\rangle + \langle b, \log(b/(1-u))\rangle\big] + \big[-\langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle\big] \\
&= -\langle b, \log(1-u)\rangle - \langle \varepsilon, u\rangle.
\end{aligned}$$
This completes the proof. □
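As a sanity check on (KL*), the closed form can be compared against a brute-force numerical evaluation of $\sup_v \langle u, v\rangle - \mathrm{KL}(b, v)$. The sketch below is illustrative only: the helper names are mine, and I reparameterize $v + \varepsilon = e^w$ so that the unconstrained SciPy solver never leaves the domain of the log.

```python
import numpy as np
from scipy.optimize import minimize

def kl_div(b, v, eps):
    return np.dot(b, np.log(b / (v + eps))) - b.sum() + (v + eps).sum()

def f_conj_closed(u, b, eps):
    # Closed form (KL*): f*(u) = -<b, log(1-u)> - <eps, u>, valid for u < 1.
    return -np.dot(b, np.log(1.0 - u)) - np.dot(eps, u)

def f_conj_numeric(u, b, eps):
    # f*(u) = sup_v <u, v> - KL(b, v), computed with v + eps = exp(w) > 0.
    obj = lambda w: -(np.dot(u, np.exp(w) - eps) - kl_div(b, np.exp(w) - eps, eps))
    res = minimize(obj, np.zeros_like(b), method="BFGS")
    return -res.fun

b   = np.array([1.0, 2.0, 0.5])
eps = np.full(3, 1e-2)
u   = np.array([0.3, -0.5, 0.7])       # entries strictly below 1

print(f_conj_closed(u, b, eps))        # closed form (KL*)
print(f_conj_numeric(u, b, eps))       # numerical supremum, should agree
```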
Note. Here notice that $f^*$ is a proper convex function. Indeed, since $\log$ is concave, $\log(1-\cdot)$ is also concave and thus $-\log(1-\cdot)$ is convex. Although $f^*(u) = +\infty$ when $\max_i u_i = 1$, there exists $u$ at which the value is finite: for example, $u = 0_m$ gives $f^*(0_m) = 0 < +\infty$. Thus $f^*$ is a proper function.
Local strong convexity of the conjugate
In this section, define
$$h(u) \triangleq f^*(-u) = -\langle b, \log(1+u)\rangle + \langle \varepsilon, u\rangle$$
with the modified constraint set
$$u \in [-1, +\infty)^m.$$
Here we have
$$\nabla^2 h(u) = \mathrm{diag}\!\left(\frac{b}{(1+u)^2}\right).$$
For fixed $u$ such that $u_i > -1$ for all $i = 1, \dots, m$ (and $b_i > 0$ for all $i$), the Hessian of $h$ is positive definite, which means $h$ is strictly convex. However, when $\min_i u_i \to +\infty$, the smallest diagonal entry of $\nabla^2 h$ converges to $0$, which shows that $h$ is not globally strongly convex. Nevertheless, if we assume that $u$ is bounded from above by some $M > -1$, then $h$ is strongly convex on such a domain. The following theorem clarifies this.
Theorem. Assume that $\min_i b_i > 0$. Then $h$ is strongly convex on the domain $[-1, M]^m$.
Proof. For $u \in [-1, M]^m$ we have
$$0 \le 1 + u \le 1 + M.$$
Thus
$$\nabla^2 h(u) = \mathrm{diag}\!\left(\frac{b}{(1+u)^2}\right) \succeq \frac{\min_i b_i}{(1+M)^2}\, I_m.$$
Here notice that $1/(1+M)$ is well-defined since we assume that $M > -1$. □
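A quick numerical illustration of the proof: on random points of $[-1, M]^m$ the smallest diagonal entry of $\nabla^2 h(u)$ should never drop below $\min_i b_i/(1+M)^2$. The values of $b$ and $M$ below are arbitrary test choices.

```python
import numpy as np

def hess_diag_h(u, b):
    # Diagonal of the Hessian of h: b / (1 + u)^2 (the linear <eps, u> term drops out).
    return b / (1.0 + u) ** 2

rng = np.random.default_rng(0)
b, M = np.array([1.0, 2.0, 0.5]), 3.0     # illustrative b and upper bound M
mu = b.min() / (1.0 + M) ** 2             # claimed strong convexity constant on [-1, M]^m

# Sample points of (-1, M]^m; at u_i = -1 the corresponding entry is +inf,
# which only strengthens the bound.
U = rng.uniform(-1.0 + 1e-9, M, size=(10_000, b.size))
print(mu, min(hess_diag_h(u, b).min() for u in U))   # sampled minimum stays >= mu
```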
On the boundedness of the polyhedral set $A^T u \le \lambda$
Lemma. Let $U = \{u \in \mathbb{R}^m : A^T u \le \lambda\}$ for some $A \in \mathbb{R}^{m \times n}$. Assume that $A$ has a right inverse, denoted by $A^\dagger$, i.e.,
$$A A^\dagger = I_m.$$
Then $U \subset (-\infty, M]^m$ for $M$ defined by $M = \max(\|A\|_1, \lambda) \cdot \max_i \|a_i^\dagger\|$, where $\|A\|_1$ is the maximum absolute column sum of the matrix $A$ and $a_i^\dagger$ is the $i$-th column of $A^\dagger$.
Proof. The detailed calculations (and refinements) can be found in [1, Proof of Theorem 2].
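For concreteness, the quantities entering $M$ can be computed on a small example. This is only a sketch: the instance is random, $A$ is taken nonnegative with full row rank so that the Moore-Penrose pseudo-inverse serves as a right inverse, and I read $\|a_i^\dagger\|$ as the $\ell_1$ norm of the $i$-th column of $A^\dagger$, since the norm is not specified above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 3, 6, 0.8
A = rng.uniform(0.1, 1.0, size=(m, n))     # full row rank, nonnegative (illustrative)
A_dag = np.linalg.pinv(A)                  # n x m Moore-Penrose pseudo-inverse
assert np.allclose(A @ A_dag, np.eye(m))   # right-inverse property A A_dag = I_m

norm_A_1 = np.abs(A).sum(axis=0).max()     # ||A||_1: maximum absolute column sum
max_col  = np.abs(A_dag).sum(axis=0).max() # max_i ||a_i_dag||_1 over columns of A_dag
M = max(norm_A_1, lam) * max_col
print(M)
```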
Dual problem of sparse regression with KL divergence
Consider the parametric convex optimization problem
$$\min_{x \in \mathbb{R}^n} \mathrm{KL}(b, Ax) + \lambda \|x\|_1.$$
Its Fenchel–Rockafellar dual problem is
$$\max_{u \in \mathbb{R}^m} -h(u)$$
with the parameterized constraint
$$u \in \{u \in \mathbb{R}^m : u \ge -1,\ A^T u \le \lambda\} \subset [-1, M]^m,$$
where $M$ is defined in the previous section. So we obtain the following.
Theorem. Over the above feasible region of $u$, $h$ is strongly convex.
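As a sketch (same illustrative instance and $\ell_1$ reading of the norm as above), the inclusion of the feasible set in $[-1, M]^m$ can be checked by maximizing each coordinate $u_i$ with a linear program over the constraints $u \ge -1$, $A^T u \le \lambda$:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n, lam = 3, 6, 0.8
A = rng.uniform(0.1, 1.0, size=(m, n))
A_dag = np.linalg.pinv(A)
M = max(np.abs(A).sum(axis=0).max(), lam) * np.abs(A_dag).sum(axis=0).max()

# Maximize u_i over {u : u >= -1, A^T u <= lam} by minimizing -u_i.
for i in range(m):
    c = np.zeros(m); c[i] = -1.0
    res = linprog(c, A_ub=A.T, b_ub=lam * np.ones(n),
                  bounds=[(-1.0, None)] * m, method="highs")
    print(f"max u_{i} = {-res.fun:.4f} <= M = {M:.4f}")
```

Each linear program is bounded precisely because of the bound discussed above, so the printed maxima stay below $M$ and the strong convexity constant of the previous section applies over the whole feasible region.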
References
[1]: Dantas, Cassio F., Emmanuel Soubies, and Cédric Févotte. “Safe screening for sparse regression with the Kullback-Leibler divergence.” ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.