On the calculus of Kullback-Leibler Divergence
Author: Tran Thu Le
Date: 16/02/2023
Abstract. This post investigates the calculus of the Kullback-Leibler divergence in the context of convex optimization.
For $b, v, \varepsilon \in \mathbb{R}_+^m$, let
$$\mathrm{KL}(b,v) \triangleq \langle b, \log(b/(v+\varepsilon))\rangle - \langle 1, b\rangle + \langle 1, v+\varepsilon\rangle. \tag{KL}$$
Here we define $\log(0) \triangleq -\infty$.
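As a quick numerical companion to (KL), here is a minimal NumPy sketch. The function name `kl_div` and the test vectors are mine, chosen only for illustration, and I assume $b > 0$ entrywise so the convention on $\log(0)$ is never triggered.

```python
import numpy as np

def kl_div(b, v, eps):
    # KL(b, v) as in (KL): <b, log(b/(v+eps))> - <1, b> + <1, v+eps>.
    # Assumes b > 0 entrywise so that the log stays finite.
    b, v, eps = map(np.asarray, (b, v, eps))
    return np.dot(b, np.log(b / (v + eps))) - b.sum() + (v + eps).sum()

# Tiny example with m = 3 (illustrative values only).
b   = np.array([1.0, 2.0, 0.5])
v   = np.array([0.8, 2.5, 0.4])
eps = np.full(3, 1e-3)
print(kl_div(b, v, eps))        # a nonnegative value
print(kl_div(b, b - eps, eps))  # ~0: the divergence vanishes when v + eps = b
```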
Fenchel conjugate
The Fenchel conjugate of $\varphi$ is defined as
$$\varphi^*(u) \triangleq \sup_{v \in \mathbb{R}^m} \langle u, v\rangle - \varphi(v).$$
For $u \in (-\infty, 1]^m$ and $f(\cdot) = \mathrm{KL}(b, \cdot)$ for some given $b \in \mathbb{R}_+^m$, we have
$$f^*(u) = -\langle b, \log(1-u)\rangle - \langle \varepsilon, u\rangle. \tag{KL*}$$
Indeed, we have
$$\begin{aligned}
\phi(u,v) &\triangleq \langle u, v\rangle - \langle b, \log(b/(v+\varepsilon))\rangle + \langle 1, b\rangle - \langle 1, v+\varepsilon\rangle \\
&= \langle u, v+\varepsilon\rangle + \langle b, \log(v+\varepsilon)\rangle - \langle 1, v+\varepsilon\rangle - \langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle \\
&= \big[\langle u-1, v+\varepsilon\rangle + \langle b, \log(v+\varepsilon)\rangle\big] + \big[-\langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle\big].
\end{aligned}$$
The supremum over $v$ is attained when the gradient of $\phi(u,\cdot)$ vanishes, i.e., at the $v$ satisfying
$$v + \varepsilon = \frac{b}{1-u}.$$
In this case, notice that $\langle u-1, v+\varepsilon\rangle = -\langle 1, b\rangle$, so
$$\begin{aligned}
f^*(u) &= \sup_{v} \phi(u,v) \\
&= \big[-\langle 1, b\rangle + \langle b, \log(b/(1-u))\rangle\big] + \big[-\langle b, \log(b)\rangle + \langle 1, b\rangle - \langle \varepsilon, u\rangle\big] \\
&= -\langle b, \log(1-u)\rangle - \langle \varepsilon, u\rangle.
\end{aligned}$$
This completes the proof. □
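As a sanity check on (KL*), the closed form can be compared against a brute-force numerical evaluation of $\sup_v \langle u, v\rangle - \mathrm{KL}(b, v)$. The sketch below is illustrative only: the helper names are mine, and I reparameterize $v + \varepsilon = e^w$ so that the unconstrained SciPy solver never leaves the domain of the log.

```python
import numpy as np
from scipy.optimize import minimize

def kl_div(b, v, eps):
    return np.dot(b, np.log(b / (v + eps))) - b.sum() + (v + eps).sum()

def f_conj_closed(u, b, eps):
    # Closed form (KL*): f*(u) = -<b, log(1-u)> - <eps, u>, valid for u < 1.
    return -np.dot(b, np.log(1.0 - u)) - np.dot(eps, u)

def f_conj_numeric(u, b, eps):
    # f*(u) = sup_v <u, v> - KL(b, v), computed with v + eps = exp(w) > 0.
    obj = lambda w: -(np.dot(u, np.exp(w) - eps) - kl_div(b, np.exp(w) - eps, eps))
    res = minimize(obj, np.zeros_like(b), method="BFGS")
    return -res.fun

b   = np.array([1.0, 2.0, 0.5])
eps = np.full(3, 1e-2)
u   = np.array([0.3, -0.5, 0.7])       # entries strictly below 1

print(f_conj_closed(u, b, eps))        # closed form (KL*)
print(f_conj_numeric(u, b, eps))       # numerical supremum, should agree
```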
Note. Here notice that $f^*$ is a proper convex function. Indeed, since $\log$ is concave, $\log(1-\cdot)$ is also concave and thus $-\log(1-\cdot)$ is convex. Although $f^*(u) = +\infty$ when $\max_i u_i = 1$, there exists $u$ at which the value is finite: for example, $u = 0_m$ gives $f^*(0_m) = 0 < +\infty$. Thus $f^*$ is a proper function.
Local strong convexity of the conjugate
In this section, define
$$h(u) \triangleq f^*(-u) = -\langle b, \log(1+u)\rangle + \langle \varepsilon, u\rangle$$
with the modified constraint set
$$u \in [-1, +\infty)^m.$$
Here we have
$$\nabla^2 h(u) = \mathrm{diag}\!\left(\frac{b}{(1+u)^2}\right).$$
For fixed $u$ such that $u_i > -1$ for all $i = 1, \dots, m$ (and $b_i > 0$ for all $i$), the Hessian of $h$ is positive definite, which means $h$ is strictly convex. However, when $\min_i u_i \to +\infty$, the smallest diagonal entry of $\nabla^2 h$ converges to $0$, which shows that $h$ is not globally strongly convex. Nevertheless, if we assume that $u$ is bounded from above by some $M > -1$, then $h$ is strongly convex on such a domain. The following theorem clarifies this.
Theorem. Assume that $\min_i b_i > 0$. Then $h$ is strongly convex on the domain $[-1, M]^m$.
Proof. For $u \in [-1, M]^m$ we have
$$0 \le 1 + u \le 1 + M.$$
Thus
$$\nabla^2 h(u) = \mathrm{diag}\!\left(\frac{b}{(1+u)^2}\right) \succeq \frac{\min_i b_i}{(1+M)^2}\, I_m.$$
Here notice that $1/(1+M)$ is well-defined since we assume that $M > -1$. □
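A quick numerical illustration of the proof: on random points of $[-1, M]^m$ the smallest diagonal entry of $\nabla^2 h(u)$ should never drop below $\min_i b_i/(1+M)^2$. The values of $b$ and $M$ below are arbitrary test choices.

```python
import numpy as np

def hess_diag_h(u, b):
    # Diagonal of the Hessian of h: b / (1 + u)^2 (the linear <eps, u> term drops out).
    return b / (1.0 + u) ** 2

rng = np.random.default_rng(0)
b, M = np.array([1.0, 2.0, 0.5]), 3.0     # illustrative b and upper bound M
mu = b.min() / (1.0 + M) ** 2             # claimed strong convexity constant on [-1, M]^m

# Sample points of (-1, M]^m; at u_i = -1 the corresponding entry is +inf,
# which only strengthens the bound.
U = rng.uniform(-1.0 + 1e-9, M, size=(10_000, b.size))
print(mu, min(hess_diag_h(u, b).min() for u in U))   # sampled minimum stays >= mu
```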
On the boundedness of the polyhedral set $A^T u \le \lambda$
Lemma. Let $U = \{u \in \mathbb{R}^m : A^T u \le \lambda\}$ for some $A \in \mathbb{R}^{m \times n}$. Assume that $A$ has a right inverse, denoted by $A^\dagger$, i.e.,
$$A A^\dagger = I_m.$$
Then $U \subset (-\infty, M]^m$ for $M$ defined by $M = \max(\|A\|_1, \lambda) \cdot \max_i \|a_i^\dagger\|$, where $\|A\|_1$ is the maximum absolute column sum of the matrix $A$ and $a_i^\dagger$ is the $i$-th column of $A^\dagger$.
Proof. The detailed calculations (and refinements) can be found in [1, Proof of Theorem 2].
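For concreteness, the quantities entering $M$ can be computed on a small example. This is only a sketch: the instance is random, $A$ is taken nonnegative with full row rank so that the Moore-Penrose pseudo-inverse serves as a right inverse, and I read $\|a_i^\dagger\|$ as the $\ell_1$ norm of the $i$-th column of $A^\dagger$, since the norm is not specified above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 3, 6, 0.8
A = rng.uniform(0.1, 1.0, size=(m, n))     # full row rank, nonnegative (illustrative)
A_dag = np.linalg.pinv(A)                  # n x m Moore-Penrose pseudo-inverse
assert np.allclose(A @ A_dag, np.eye(m))   # right-inverse property A A_dag = I_m

norm_A_1 = np.abs(A).sum(axis=0).max()     # ||A||_1: maximum absolute column sum
max_col  = np.abs(A_dag).sum(axis=0).max() # max_i ||a_i_dag||_1 over columns of A_dag
M = max(norm_A_1, lam) * max_col
print(M)
```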
Dual problem of sparse regression with KL divergence
Consider the parametric convex optimization problem
$$\min_{x \in \mathbb{R}^n} \mathrm{KL}(b, Ax) + \lambda \|x\|_1.$$
Its Fenchel–Rockafellar dual problem is
$$\max_{u \in \mathbb{R}^m} -h(u)$$
with the parameterized constraint
$$u \in \{u \in \mathbb{R}^m : u \ge -1,\ A^T u \le \lambda\} \subset [-1, M]^m,$$
where $M$ is defined in the previous section. So we obtain the following.
Theorem. Over the above feasible region of $u$, $h$ is strongly convex.
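As a sketch (same illustrative instance and $\ell_1$ reading of the norm as above), the inclusion of the feasible set in $[-1, M]^m$ can be checked by maximizing each coordinate $u_i$ with a linear program over the constraints $u \ge -1$, $A^T u \le \lambda$:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n, lam = 3, 6, 0.8
A = rng.uniform(0.1, 1.0, size=(m, n))
A_dag = np.linalg.pinv(A)
M = max(np.abs(A).sum(axis=0).max(), lam) * np.abs(A_dag).sum(axis=0).max()

# Maximize u_i over {u : u >= -1, A^T u <= lam} by minimizing -u_i.
for i in range(m):
    c = np.zeros(m); c[i] = -1.0
    res = linprog(c, A_ub=A.T, b_ub=lam * np.ones(n),
                  bounds=[(-1.0, None)] * m, method="highs")
    print(f"max u_{i} = {-res.fun:.4f} <= M = {M:.4f}")
```

Each linear program is bounded precisely because of the bound discussed above, so the printed maxima stay below $M$ and the strong convexity constant of the previous section applies over the whole feasible region.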
References
[1]: Dantas, Cassio F., Emmanuel Soubies, and Cédric Févotte. “Safe screening for sparse regression with the Kullback-Leibler divergence.” ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.