Sunday, February 19, 2023

On the calculus of Kullback-Leibler Divergence


Author: Tran Thu Le
Date: 16/02/2023

Abstract. This post investigates the calculus of Kullback-Leibler divergence in the context of convex optimization.

For $b, v, \varepsilon \in \mathbb{R}^m_+$, let

$$KL(b, v) \triangleq \langle b, \log(b/(v+\varepsilon))\rangle - \langle 1, b\rangle + \langle 1, v+\varepsilon \rangle. \tag{KL}$$

Here we define $\log(0) \triangleq -\infty$.
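
As a quick numerical companion (not part of the original derivation), here is a minimal NumPy sketch of this generalized KL divergence; the function name `kl_div` and the sample values are my own illustrative choices, and I use the usual convention $0\log 0 = 0$ for zero entries of $b$.

```python
import numpy as np

def kl_div(b, v, eps):
    """Generalized KL divergence: <b, log(b/(v+eps))> - <1, b> + <1, v+eps>."""
    b = np.asarray(b, dtype=float)
    w = np.asarray(v, dtype=float) + eps
    mask = b > 0                      # convention: 0 * log(0 / w_i) = 0
    log_term = np.sum(b[mask] * np.log(b[mask] / w[mask]))
    return log_term - b.sum() + w.sum()

# Example: KL(b, v) = 0 when v + eps = b, since log(b/(v+eps)) vanishes.
b = np.array([1.0, 2.0, 3.0])
eps = 1e-3
print(kl_div(b, b - eps, eps))        # ~0.0
```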

Fenchel conjugate

The Fenchel conjugate of $\varphi$ is defined as
$$\varphi^*(u) \triangleq \sup_{v\in \mathbb{R}^m} \langle u, v\rangle - \varphi(v).$$

For $\textcolor{blue}{u\in (-\infty, 1]^m}$ and $f(\cdot) = KL(b, \cdot)$ for some given $b\in \mathbb{R}^m_+$, we have
$$\textcolor{blue}{f^*(u) = - \langle b, \log(1-u)\rangle - \langle \varepsilon, u\rangle.} \tag{KL*}$$

Indeed, we have
$$
\begin{aligned}
\phi(u, v) \triangleq{}& \langle u, v\rangle - \langle b, \log(b/(v+\varepsilon))\rangle + \langle 1, b\rangle - \langle 1, v+\varepsilon \rangle\\
={}& \langle u, v+\varepsilon\rangle + \langle b, \log(v+\varepsilon)\rangle - \langle 1, v+\varepsilon \rangle - \langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle\\
={}& \left[\langle u-1, v+\varepsilon\rangle + \langle b, \log(v+\varepsilon)\rangle\right] + \left[- \langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle\right].
\end{aligned}
$$

The supremum over $v$ is attained at the point where the gradient in $v$ vanishes, i.e. $u - 1 + b/(v+\varepsilon) = 0$, so that
$$v + \varepsilon = \frac{b}{1-u}.$$

In this case, notice that $\langle u-1, v+\varepsilon\rangle = -\langle 1, b\rangle$, then
$$
\begin{aligned}
f^*(u) ={}& \sup_{v}\phi(u, v)\\
={}& \left[-\langle 1, b\rangle + \langle b, \log(b/(1-u))\rangle\right] + \left[- \langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle\right]\\
={}& -\langle b, \log(1-u)\rangle - \langle \varepsilon, u\rangle.
\end{aligned}
$$
The proof is done. $\square$

Note. Notice that $f^*$ is a proper convex function. Indeed, since $\log$ is concave, $\log(1-\cdot)$ is also concave and thus $-\log(1-\cdot)$ is convex. Although $f^*(u) = +\infty$ when $\max_i u_i = 1$, there exists $u$ at which the value is finite: for example, taking $u = 0_m$ gives $f^*(0_m) = 0 < +\infty$. Thus $f^*$ is a proper function.
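
A quick one-dimensional sanity check of (KL*), comparing the closed form against a brute-force supremum over a grid of $v$; the values of $b$, $\varepsilon$, $u$ and the grid range are arbitrary choices of mine.

```python
import numpy as np

# Compare f*(u) = -b*log(1-u) - eps*u with a brute-force maximum of
# <u, v> - KL(b, v) over a fine grid of v (dimension m = 1 here).
b, eps, u = 2.0, 0.1, 0.5

v = np.linspace(1e-8, 50.0, 200_001)
phi = u * v - (b * np.log(b / (v + eps)) - b + (v + eps))

print(phi.max())                          # ~1.3363 (maximizer near v = b/(1-u) - eps = 3.9)
print(-b * np.log(1.0 - u) - eps * u)     # ~1.3363
```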

Local strong convexity of the conjugate

In this section, define
$$h(u) \triangleq f^*(-u) = - \langle b, \log(1+u)\rangle + \langle \varepsilon, u\rangle,$$
with the modified constraint set
$$u \in [-1, +\infty)^m.$$

Here we have
$$\nabla^2 h(u) = \mathrm{diag}\left( \frac{b}{(1+u)^2}\right).$$

For fixed $u$ such that $u_i > -1$ for all $i = 1, \dots, m$ (and assuming $b_i > 0$ for all $i$), the Hessian of $h$ is positive definite, which means $h$ is strictly convex. However, when $\min_i u_i \rightarrow +\infty$, the smallest diagonal entry of $\nabla^2 h$ converges to $0$, which shows that $h$ is not globally strongly convex. On the other hand, if we assume that $u$ is bounded from above by some $\textcolor{blue}{M > -1}$, then $h$ is strongly convex on the resulting bounded domain. The following theorem clarifies this.
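
To see the loss of global strong convexity numerically, the short sketch below (with an arbitrary $b$ of my choosing) prints the smallest diagonal entry of $\nabla^2 h(u)$ as the entries of $u$ grow.

```python
import numpy as np

# The Hessian of h is diagonal with entries b_i / (1 + u_i)^2, so its smallest
# eigenvalue tends to 0 as min_i u_i -> +infinity.
b = np.array([0.5, 1.0, 2.0])
for t in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    u = np.full_like(b, t)
    print(t, (b / (1.0 + u) ** 2).min())   # smallest eigenvalue of the Hessian
```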

Theorem. Assume that $\min_i b_i > 0$. Then $h$ is strongly convex on the domain $[-1, M]^m$.

Proof. For $u \in [-1, M]^m$ we have
$$0 \leq 1+u \leq 1+M.$$
Thus
$$\nabla^2 h(u) = \mathrm{diag}\left( \frac{b}{(1+u)^2}\right) \geq \frac{\min_i b_i}{(1+M)^2}\, I_m.$$

Here notice that $1/(1+M)$ is well-defined since we assume that $M > -1$. $\square$
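
As a small numerical check of the bound used in the proof, the sketch below samples random points of $[-1, M]^m$ (with hand-picked $b$ and $M$) and verifies that the diagonal entries of the Hessian stay above $\min_i b_i/(1+M)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([0.5, 1.0, 2.0])
M = 3.0
mu = b.min() / (1.0 + M) ** 2              # claimed strong convexity modulus

u = rng.uniform(-1.0, M, size=(10_000, b.size))
hess_diag = b / (1.0 + u) ** 2             # eigenvalues of the diagonal Hessian
print(bool((hess_diag >= mu - 1e-12).all()))   # True
```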

On the boundedness of the polyhedral set $A^Tu \leq \lambda$

Lemma. Let $U = \{u \in \mathbb{R}^m : A^Tu \leq \lambda\}$ for some $A \in \mathbb{R}^{m\times n}$. Assume that $A$ has a right-inverse, denoted by $A^\dag$, i.e.
$$A A^\dag = I_m.$$
Then $U \subset (-\infty, M]^m$
for $M$ defined by $M = \max\left( \|A\|_1, \lambda \right) \cdot \max_i \|a_i^\dag\|$, where $\|A\|_1$ is the maximum absolute column sum of the matrix $A$ and $a_i^\dag$ is the $i$-th column of $A^\dag$.

Proof. The detailed calculations (and refinements) can be found in [1, Proof of Theorem 2].
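
For concreteness, here is a direct transcription of the bound $M$ into NumPy; for a full-row-rank $A$, `np.linalg.pinv` provides one valid right-inverse, I take $\|a_i^\dag\|$ as the Euclidean norm, and the random data below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 3, 5, 0.7
A = rng.standard_normal((m, n))            # full row rank with probability 1

A_dag = np.linalg.pinv(A)                  # right-inverse: A @ A_dag = I_m
assert np.allclose(A @ A_dag, np.eye(m))

norm_1 = np.abs(A).sum(axis=0).max()       # ||A||_1, maximum absolute column sum
M = max(norm_1, lam) * np.linalg.norm(A_dag, axis=0).max()
print(M)
```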

Dual problem of sparse regression with KL divergence

Consider the parametric convex optimization problem

$$\min_{x\in \mathbb{R}^n} KL(b, Ax) + \lambda \|x\|_1.$$

Its Fenchel-Rockafellar dual problem is

$$\max_{u\in \mathbb{R}^m} -h(u)$$
with the parameterized constraint
$$u \in \{u \in \mathbb{R}^m : u \geq -1,\ A^Tu \leq \lambda\} \subset [-1, M]^m,$$
where $M$ is defined in the previous section. So we see that:

Theorem. Over the above feasible region of $u$ (and assuming $\min_i b_i > 0$), $h$ is strongly convex.
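
Putting the two previous results together, the strong convexity modulus of $h$ on the dual feasible region can be computed directly from the data; a minimal sketch with arbitrary $A$, $b$, and $\lambda$ of my choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 3, 5, 0.7
A = rng.standard_normal((m, n))
b = rng.uniform(0.5, 2.0, size=m)          # ensures min_i b_i > 0

A_dag = np.linalg.pinv(A)                  # right-inverse of A (full row rank assumed)
M = max(np.abs(A).sum(axis=0).max(), lam) * np.linalg.norm(A_dag, axis=0).max()
mu = b.min() / (1.0 + M) ** 2              # strong convexity modulus of h on [-1, M]^m
print(M, mu)
```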

References

[1]: Dantas, Cassio F., Emmanuel Soubies, and Cédric Févotte. “Safe screening for sparse regression with the Kullback-Leibler divergence.” ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
