Tuesday, March 7, 2023

Solving L1-Penalization of Kullback-Leibler Divergence using SMART Algorithm

Author: Tran Thu Le
Date: March 7, 2023
Abstract: In this note, we show how to modify the SMART algorithm to solve sparse regression problems with the Kullback-Leibler divergence and an L1 penalty.

Preliminaries

For vectors $x, y$ in a finite-dimensional vector space, we denote by $xy$, $x/y$, $\log x$ (for $x\geq 0$) and $\exp x$ the vectors with the operations applied elementwise. Here we set $\log 0=-\infty$.

For $u\in \mathbb R^m$ with $u\geq 0$, the negative entropy is the convex function defined by

$$H(u) = \langle u, \log u\rangle = \sum_{u_i>0} u_i \log(u_i).$$

Notice that $H(0)=0$.
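As a quick illustration, here is a minimal NumPy sketch of $H$ (the helper name neg_entropy is ours), using the convention that entries with $u_i=0$ contribute nothing:

import numpy as np

def neg_entropy(u):
    # H(u) = <u, log u>, summing only over entries with u_i > 0 (0*log 0 = 0)
    u = np.asarray(u, dtype=float)
    mask = u > 0
    return np.sum(u[mask] * np.log(u[mask]))

print(neg_entropy([0.0, 1.0, 2.0]))  # 2*log(2) ≈ 1.386; the zero entry contributes 0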

The Bregman divergence of a convex function $h$ is

$$Breg_h(v, p) = h(v) - h(p) - \langle \nabla h(p), v-p\rangle.$$

The Kullback-Leibler divergence (a.k.a. relative entropy) is the Bregman divergence of the negative entropy

$$KL(v, p) = Breg_H(v, p) = \langle v, \log \frac{v}{p}\rangle - \langle 1, v\rangle + \langle 1, p\rangle,$$

where $v$ and $p$ have the same signs, i.e. $v/p\geq 0$, and $p_i\neq 0$ for all $i=1,..., m$.
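In the same spirit, here is a small sketch of this formula (the name kl_div is ours), together with a numerical check that it agrees with $Breg_H$ on strictly positive vectors:

import numpy as np

def kl_div(v, p):
    # KL(v, p) = <v, log(v/p)> - <1, v> + <1, p>; entries with v_i = 0 contribute p_i
    v, p = np.asarray(v, float), np.asarray(p, float)
    mask = v > 0
    return np.sum(v[mask] * np.log(v[mask] / p[mask])) - v.sum() + p.sum()

# check against Breg_H(v, p) = H(v) - H(p) - <grad H(p), v - p>, with grad H(p) = 1 + log(p)
v = np.array([0.2, 1.0, 3.0])
p = np.array([0.5, 0.5, 2.0])
H = lambda u: np.sum(u * np.log(u))
breg = H(v) - H(p) - np.dot(1.0 + np.log(p), v - p)
print(np.isclose(kl_div(v, p), breg))  # True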

Two remarks are in order. First, the KL divergence has gradient

$$\nabla_v KL(v, p) = \nabla H(v) - \nabla H(p) = \log \frac{v}{p}.$$

This holds for any vectors $v, p$ with the same signs.
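This formula is easy to verify numerically with a central finite difference (a quick sanity check on strictly positive vectors, not a proof):

import numpy as np

# finite-difference check of grad_v KL(v, p) = log(v / p) for strictly positive v, p
kl = lambda v, p: np.sum(v * np.log(v / p)) - v.sum() + p.sum()
v = np.array([0.3, 1.2, 2.0])
p = np.array([0.5, 0.8, 1.5])
eps = 1e-6
num_grad = np.array([(kl(v + eps*e, p) - kl(v - eps*e, p)) / (2*eps) for e in np.eye(3)])
print(np.allclose(num_grad, np.log(v / p)))  # True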

Second, the Bregman divergence satisfies the following cosine law

$$Breg_h(v', b) = Breg_h(v', v) + Breg_h(v, b) - \langle v'-v, \nabla h(b) - \nabla h(v)\rangle.$$

If we let f(v):=Bregh(v,b)f(v):=Breg_h(v, b) for some given bR+mb\in \mathbb R^m_+, we have

Bregf(v,v)=f(v)f(v)f(v),vv=Bregh(v,v)Breg_f(v', v) = f(v') -f(v) - \langle \nabla f(v), v'-v\rangle=Breg_h(v', v)

We call this the Bidual Bregman div. theorem.
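Indeed, since $\nabla f(v) = \nabla h(v) - \nabla h(b)$, the terms involving $b$ cancel:

$$\begin{aligned} f(v') - f(v) - \langle \nabla f(v), v'-v\rangle &= \big(h(v') - h(v) - \langle \nabla h(b), v'-v\rangle\big) - \langle \nabla h(v) - \nabla h(b), v'-v\rangle\\ &= h(v') - h(v) - \langle \nabla h(v), v'-v\rangle = Breg_h(v', v). \end{aligned}$$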

Sparse-SMART algorithm

The Simultaneous Multiplicative Algebraic Reconstruction Technique (SMART) is an iterative method for solving Kullback-Leibler divergence (KL div.) problems, see [1]. We now show how to use a modification of SMART, called Sparse-SMART, to solve the following sparse convex problem

$$P(x) = KL(Ax, b) + I(Ax\geq 0) + \lambda ||x||_1 + I(x\geq 0),$$

where $A=[a_1,...,a_n]\in \mathbb R^{m\times n}$ with $a_i\geq 0$, $b\geq 0$ and $\lambda >0$. Note that $x\geq 0 \Rightarrow Ax\geq 0$; without the condition $x\geq 0$, the constraint $Ax\geq 0$ would have to be imposed explicitly.
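For reference, here is a small sketch of how one might evaluate this objective (the name objective_P is ours); it returns infinity when a constraint is violated:

import numpy as np

def objective_P(x, A, b, lbd):
    # P(x) = KL(Ax, b) + lambda*||x||_1, subject to x >= 0 (and hence Ax >= 0 for A >= 0)
    if np.any(x < 0):
        return np.inf                # indicator I(x >= 0)
    v = A.dot(x)
    if np.any((v > 0) & (b == 0)):
        return np.inf                # KL(v, b) is +infinity if v_i > 0 while b_i = 0
    mask = v > 0
    kl = np.sum(v[mask] * np.log(v[mask] / b[mask])) - v.sum() + b.sum()
    return kl + lbd * np.sum(x)      # for x >= 0, ||x||_1 = <1, x>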

This problem aims to solve the linear inverse problem $Ax=b$, using the KL divergence as a distance between $Ax$ and $b$; the optimal solution tends to be sparse due to the use of the L1 norm. Note that, by the non-negativity constraint $x\geq 0$, we have $||x||_1=\langle 1, x\rangle$. Here, w.l.o.g., we assume that

$$||a_i||_1 =1, \quad \forall i=1,..., n.$$

Under this assumption we have (see [1, Prop. 2])

$$KL(Ax', Ax)\leq KL(x', x).$$
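This contraction property is easy to check numerically on random nonnegative data with $\ell_1$-normalized columns (a quick sanity check, not a proof):

import numpy as np

np.random.seed(0)
m, n = 30, 50
A = np.random.rand(m, n)
A /= A.sum(axis=0)                             # l1-normalize each (nonnegative) column
kl = lambda v, p: np.sum(v * np.log(v / p)) - v.sum() + p.sum()
x, xp = np.random.rand(n), np.random.rand(n)   # strictly positive x and x'
print(kl(A.dot(xp), A.dot(x)) <= kl(xp, x))    # True, consistent with [1, Prop. 2]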

Let $f(v) = KL(v, b)$. We then have

$$Breg_f(Ax', Ax) = Breg_H(Ax', Ax) = KL(Ax', Ax) \leq KL(x', x).$$

So, for $x', x \geq 0$, we have

$$P(x') \leq f(Ax) + \langle A^T \nabla f(Ax), x'-x\rangle + \lambda \langle 1, x'\rangle + KL(x', x) =: \varphi_x(x').$$
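In more detail, for $x'\geq 0$ the indicators vanish and $\lambda ||x'||_1 = \lambda\langle 1, x'\rangle$, so by the definition of $Breg_f$, the Bidual Bregman theorem and the contraction above,

$$P(x') = f(Ax') + \lambda\langle 1, x'\rangle = f(Ax) + \langle \nabla f(Ax), Ax'-Ax\rangle + Breg_f(Ax', Ax) + \lambda\langle 1, x'\rangle \leq \varphi_x(x').$$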

Thus $\varphi_x$ attains its minimum at the point $x'$ solving

$$A^T \nabla f(Ax) + \lambda + \log \frac{x'}{x}=0.$$

Solving for $x'$ gives

$$x' = x \exp(-A^T\nabla f(Ax) - \lambda).$$

Letting $r_x=-\nabla f(Ax)$ denote the residual of $x$, the formula for $x'$ can be rewritten as

$$x' = x \exp(A^Tr_x - \lambda).$$

Explicitly, the Sparse-SMART update rule for our problem is as follows:

$$x^{(k+1)} = x^{(k)} \exp\left(A^T\log \frac{b}{Ax^{(k)}} - \lambda\right). \tag{Sparse-SMART}$$

Note. 1) If $\lambda >\lambda_{\max}:= \max A^T b$, then the optimal solution is $x=0_n$; so in practice we only consider $\lambda <\lambda_{\max}$. 2) The formula $(\text{Sparse-SMART})$ remains valid if there are some indices $i$ with $x_i<0$; in such cases, the condition $Ax\geq 0$ should be taken into account. 3) The classical SMART update is

$$x^{(k+1)} = x^{(k)} \exp\left(A^T\log \frac{b}{Ax^{(k)}}\right).$$
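Translated literally into NumPy, one iteration of the update reads as follows (a minimal sketch; the function name sparse_smart_step is ours, and setting lbd = 0 recovers the classical SMART step):

import numpy as np

def sparse_smart_step(x, A, b, lbd):
    # x_{k+1} = x_k * exp(A^T log(b / (A x_k)) - lambda)
    r = np.log(b / A.dot(x))          # residual r_x = -grad f(Ax) = log(b / Ax)
    return x * np.exp(A.T.dot(r) - lbd)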

Numerical Experiment

The following Python code implements and tests the Sparse-SMART algorithm for the KL divergence with L1 penalization.
import numpy as np 
import matplotlib.pyplot as plt 

def dic(x, t, w):
    # Gaussian-like atoms evaluated at locations x, centered at t, with width w;
    # each column is normalized to have unit L1 norm
    diff = x.reshape(-1, 1) - t.reshape(1, -1)
    gausses = 2**(-diff**2/w**2)
    norms = np.linalg.norm(gausses, axis=0, ord=1)
    assert gausses.shape[1] == len(norms)
    return gausses/norms


def sparse_SMART(b, A, lbd, x_init, iter_stop=50):
    # Sparse-SMART iteration: x <- x * exp(A^T log(b / Ax) - lambda)
    x = x_init
    x_list = [x]
    for i in range(iter_stop):
        Ax = A.dot(x)
        r = -np.log(Ax/b)                       # residual r_x = log(b / Ax)
        x = x * np.exp((A.T).dot(r) - lbd)
        x_list += [x]
    return x_list


def run():
    m, n = 60, 100                     # number of observations / atoms
    k = 3                              # number of active atoms in the ground truth
    locs = np.linspace(-0.1, 1.1, m)   # observation locations
    t = np.linspace(0., 1., n)         # atom centers
    w = 0.05                           # atom width
    A = dic(locs, t, w)
    np.random.seed(1234)
    rand_indices = np.random.choice(range(n), k)
    x_true = np.zeros_like(t)
    x_true[rand_indices] = np.random.rand(k)
    b = A.dot(x_true)                  # noiseless observations

    lbd_max = np.max((A.T).dot(b))
    lbd = 0.5*lbd_max
    x_init = np.random.rand(n)/20.     # small, strictly positive initialization
    x_list = sparse_SMART(b, A, lbd, x_init, iter_stop=1000)

    fig, axs = plt.subplots(1, 2, figsize=(16, 2.5))
    ax = axs[0]
    ax.plot(locs, b, color="blue", label="b")
    ax.plot(locs, A.dot(x_list[-1]), color="red", label="Ax_est")
    ax.legend()

    ax = axs[1]
    ax.plot(t, x_true, color="blue", label="x_true")
    ax.plot(t, x_list[-1], color="red", label="x_est")
    ax.legend()

    fig.suptitle("Sparse-SMART alg. for L1-penalization of Kullback-Leibler divergence")
    plt.show()


run()

Note. Two points are important in this code. First, the dictionary should be normalized in the $\ell_1$ norm instead of the usual $\ell_2$ norm. Second, the initial vector should be small but not zero; since the update is multiplicative, a zero initialization makes the whole sequence stay at zero.

References

[1] Petra, Stefania, et al. "B-SMART: Bregman-based first-order algorithms for non-negative compressed sensing problems." Scale Space and Variational Methods in Computer Vision: 4th International Conference, SSVM 2013, Schloss Seggau, Leibnitz, Austria, June 2-6, 2013. Springer Berlin Heidelberg, 2013.
