Friday, November 17, 2023

A deep understanding of PCA: From statistics to optimization explanation

Abstract. Principal Component Analysis (PCA) is a widely used technique for condensing high-dimensional data into a low-dimensional space while retaining the “maximum amount of information”. This post develops the mathematical model behind PCA through the lens of optimization, rather than the more common statistical explanation.

[Figures: PCA from 2D to 1D and from 3D to 2D]

1. Introduction

Principal Component Analysis (PCA) is a widely used technique for transforming high-dimensional datasets into a much lower-dimensional space while retaining the “maximum amount of information”; see the figures above, which depict transformations from 2D to 1D [1] and from 3D to 2D [2].

The post is structured as follows:

  • The next section provides a brief explanation of PCA from a statistical perspective.
  • Following that, we delve into the optimization approach to PCA, which is the central theme of this post.
  • The post concludes with comments that connect the two approaches.

2. Statistics approach

In statistical terms, the PCA method transforms the original variables into a new set of uncorrelated variables known as principal components. To maximize the information retained in the low-dimensional space, these components are linear combinations of the original variables and are arranged in a specific order: the first principal component (the eigenvector associated with the largest eigenvalue of the covariance matrix) explains the maximum variance in the data, the second component (the eigenvector associated with the second largest eigenvalue) explains the maximum remaining variance orthogonal to the first, and so forth.
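
To make this statistical recipe concrete, here is a minimal NumPy sketch (the function name `pca_via_covariance` and the toy interface are illustrative assumptions, not code from this post): it centers the data, eigendecomposes the covariance matrix, and keeps the eigenvectors with the largest eigenvalues as the principal components.

```python
import numpy as np

def pca_via_covariance(X, k):
    """Project an n x d data matrix X onto its first k principal components."""
    X_centered = X - X.mean(axis=0)          # center so covariance is taken about the mean
    C = np.cov(X_centered, rowvar=False)     # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    components = eigvecs[:, order[:k]]       # d x k principal directions
    scores = X_centered @ components         # n x k new (uncorrelated) variables
    return components, scores, eigvals[order[:k]]
```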

While this statistical explanation of PCA is simple, it masks the intricate relationships among eigenvectors, eigenvalues, and the covariance matrix, and it leaves unclear why this construction preserves the maximum amount of information.

To unravel these complexities, in the following section, we will employ an optimization approach to describe how the PCA method works and thus provide a profound understanding of PCA.

3. Optimization approach

Given $n$ data points $x_i$ in a high-dimensional space $\mathbb{R}^d$ (large $d$), the goal of PCA is to represent these points by $n$ points $z_i$ in a low-dimensional space $\mathbb{R}^k$ ($k \ll d$) spanned by $\{a_1, \dots, a_k\} \subset \mathbb{R}^d$, i.e.
$$x_i \approx \mu + A z_i, \quad \forall i = 1, \dots, n,$$
where $\mu \in \mathbb{R}^d$ and $A = [a_1, \dots, a_k] \in \mathbb{R}^{d \times k}$. Here we assume that $\{a_1, \dots, a_k\}$ forms an orthonormal set, i.e. $A^T A = I_k$.

Here, the approximation $\approx$ needs to be defined in a manner that ensures the preservation of the “maximum amount of information.” Intuitively, this preservation can be achieved through a “regression” model, wherein we seek the hyperplane $\mu + Az$ that best fits the data points $x_i$. In this context, $z_i$ represents the projection of $x_i$ onto the hyperplane (refer to the figures above).

Let’s delve into the mathematical details. PCA is essentially equivalent to solving the following regression-type optimization problem [3]:

$$\min_{\mu, A, z_i} \quad \sum_{i=1}^n \|x_i - \mu - A z_i\|_2^2 \tag{1}$$

The hard part of this problem is finding $A$, since it is relatively easy to find $\mu$ and $z_i$ according to the following formulas (see more details in [3]):
$$\mu = \frac{1}{n} \sum_{i=1}^n x_i$$
and
$$z_i = A^T (x_i - \mu), \quad \forall i = 1, \dots, n.$$
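
For a fixed orthonormal $A$, these closed-form choices of $\mu$ and $z_i$ can be checked numerically. Below is a small sketch on assumed toy data (the names `pca_objective`, `X`, `A` are illustrative): it evaluates objective (1) at $\mu = \frac{1}{n}\sum_i x_i$ and $z_i = A^T(x_i - \mu)$.

```python
import numpy as np

def pca_objective(X, mu, A, Z):
    """Objective (1): sum_i ||x_i - mu - A z_i||_2^2 (rows of X and Z are x_i and z_i)."""
    residual = X - mu - Z @ A.T
    return np.sum(residual ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # toy data: n = 100 points in R^5
A = np.linalg.qr(rng.normal(size=(5, 2)))[0]      # some orthonormal d x k matrix (A^T A = I_k)

mu = X.mean(axis=0)                               # closed-form mu: the sample mean
Z = (X - mu) @ A                                  # closed-form z_i = A^T (x_i - mu), stacked as rows
print(pca_objective(X, mu, A, Z))                 # reconstruction error for this choice of A
```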

Note that w.l.o.g. we may assume that $\mu = 0$ (otherwise, we let $x_i := x_i - \mu$). By the formula for $z_i$, problem (1) can now be rewritten as follows:
$$\min_{A} \quad \sum_{i=1}^n \|x_i - A A^T x_i\|_2^2 \tag{2}$$

Expanding (2) and using $A^T A = I_k$, we have $\|x_i - A A^T x_i\|_2^2 = \|x_i\|_2^2 - x_i^T A A^T x_i$. Since $\|x_i\|_2^2$ does not depend on $A$, minimizing (2) is equivalent to maximizing

$$\sum_{i=1}^n x_i^T A A^T x_i = \sum_{i=1}^n \|A^T x_i\|_2^2 = \sum_{i=1}^n \mathrm{trace}(A^T x_i x_i^T A) = \mathrm{trace}(A^T S A),$$

where $\|v\|_2^2 = \mathrm{trace}(v v^T)$ and $S = \sum_{i=1}^n x_i x_i^T$ is a (scaled) covariance matrix.
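
The chain of equalities above can be verified numerically; the following sketch (again on assumed toy data) checks that $\sum_i \|A^T x_i\|_2^2 = \mathrm{trace}(A^T S A)$ for an orthonormal $A$ and $S = \sum_i x_i x_i^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                     # rows are the (centered) x_i
A = np.linalg.qr(rng.normal(size=(5, 2)))[0]      # orthonormal d x k matrix

S = X.T @ X                                       # scaled covariance: sum_i x_i x_i^T
lhs = np.sum((X @ A) ** 2)                        # sum_i ||A^T x_i||_2^2
rhs = np.trace(A.T @ S @ A)                       # trace(A^T S A)
print(np.isclose(lhs, rhs))                       # True
```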

The problem of finding $A$ is now equivalent to
$$\max_{A} \quad \mathrm{trace}(A^T S A) \quad \text{ s.t. } \quad A^T A = I_k. \tag{3}$$

Let us now focus on the particular case of (3) when $k = 1$, i.e. $A \equiv a \in \mathbb{R}^d$:
$$\max_{a} \quad a^T S a \quad \text{ s.t. } \quad \|a\|_2 = 1.$$

The maximum value of this problem is the so-called “operator norm” of $S$, which is exactly the largest eigenvalue of $S$, see e.g. [4], while the maximizer is the corresponding eigenvector.
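
As a quick numerical illustration (with an assumed random $S$), the largest eigenvalue of $S$ is attained by its top eigenvector, and no other unit vector does better:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
S = X.T @ X                                       # symmetric positive semidefinite

eigvals, eigvecs = np.linalg.eigh(S)              # eigenvalues in ascending order
lam_max, u_max = eigvals[-1], eigvecs[:, -1]

print(np.isclose(u_max @ S @ u_max, lam_max))     # the top eigenvector attains the maximum
a = rng.normal(size=5); a /= np.linalg.norm(a)    # a random unit vector
print(a @ S @ a <= lam_max + 1e-9)                # ... and it does not exceed it
```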

In general, let $S = U \Lambda U^T$, where $U = [u_1, \dots, u_d]$ is an orthogonal matrix of eigenvectors, i.e. $U^T U = I_d$, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ is the diagonal matrix of eigenvalues of $S$ with $\lambda_1 \geq \dots \geq \lambda_d \geq 0$. Then an optimal solution $A$ of (3) is

$$A = [u_1, \dots, u_k],$$
i.e. the matrix whose columns are the eigenvectors associated with the $k$ largest eigenvalues $\lambda_1, \dots, \lambda_k$ of $S$.
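
To tie the pieces together, here is an end-to-end sketch (toy data, illustrative names): it builds $A$ from the top-$k$ eigenvectors of $S$ and checks that it captures more variance, and yields a smaller reconstruction error in (2), than some other orthonormal basis.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
X = X - X.mean(axis=0)                            # w.l.o.g. mu = 0
S = X.T @ X
k = 2

eigvals, eigvecs = np.linalg.eigh(S)              # ascending eigenvalues
A_opt = eigvecs[:, ::-1][:, :k]                   # A = [u_1, ..., u_k]: top-k eigenvectors
A_rand = np.linalg.qr(rng.normal(size=(6, k)))[0] # some other orthonormal d x k matrix

def captured(A):  return np.trace(A.T @ S @ A)            # objective of (3)
def recon_err(A): return np.sum((X - X @ A @ A.T) ** 2)   # objective of (2)

print(captured(A_opt) >= captured(A_rand))                     # True: maximal captured variance
print(recon_err(A_opt) <= recon_err(A_rand))                   # True: minimal reconstruction error
print(np.isclose(captured(A_opt), eigvals[-1] + eigvals[-2]))  # equals lambda_1 + lambda_2
```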

4. Conclusion

Through the lens of the optimization approach, it becomes evident that PCA can be viewed as a regression model in which the high-dimensional data points $x_i \in \mathbb{R}^d$ are represented by lower-dimensional points $z_i \in \mathbb{R}^k$ through the linear relation $x_i \approx \mu + A z_i$, where $k \ll d$.

In other words, the aim of preserving the “maximum amount of information” of data points can be achieved by solving a regression model.

In this model, the optimal matrix $A$ is constructed from the eigenvectors corresponding to the $k$ largest eigenvalues of the (scaled) covariance matrix of the $x_i$. This connection reveals itself as a generalization of the well-known fact that the operator norm of a (symmetric positive semidefinite) matrix is its largest eigenvalue.

5. References

[1] step-step-explanation-principal-component-analysis
[2] explanation-of-principal-component-analysis
[3] PCA and LASSO
[4] operator-norm-is-equal-to-max-eigenvalue
