Saturday, September 30, 2023

Mathematical Definition of Image Classification Problem

Image classification is the task of assigning each image to its correct label.

Images and their labels

This objective is typically addressed using neural networks. In this blog post, we have a dual purpose. Firstly, we will delve into the mathematical modeling of this problem from an optimization perspective. Secondly, we will demonstrate how to implement a custom one-hidden-layer linear neural network using PyTorch. We’ll also provide visualizations of the classification results and showcase the evolution of weights before and after training.

1. Image Classification Problem

Original Problem

In the $k$-class image classification problem (ICP), we are given a training dataset comprising pairs of images and labels, denoted as $(x_i, \ell_i) \in \mathbb R^{d} \times \{1, \ldots, k\}$, for $i=1, \ldots, n$. Here, each image $x_i$ is represented as a 2D array of numbers of size $d = d_1 \times d_2$, and the label $\ell_i$ is the class of image $x_i$. Notice that by flattening $x_i$, one can consider $x_i$ as a usual column vector in $\mathbb R^d$. The main objective is to determine a label function $F$ that associates each image $x_i$ with its corresponding label $\ell_i$, i.e.

$$\text{Find function } F: \quad F(x_i) \approx \ell_i, \quad \forall i=1, \ldots, n. \tag{ICP1}$$

In the following sections, we will explore how to model the function $F$ and how to handle the notion of approximation in equation (ICP1).

Reformulation

In the real world, we may not always know how to definitively classify an image $x$ into a specific class. Instead, we might express that $x$ belongs to each class with different probabilities. We will now demonstrate how to reformulate (ICP1) using this probabilistic approach.

Let $E=\{e_1, \ldots, e_k\}$ represent the unit basis vectors in $\mathbb R^k$. Define $S=\{u\in \mathbb R^k: \sum_{j=1}^k u_j=1,\ u_j\geq 0,\ \forall j=1, \ldots, k\}$ as the simplex in $\mathbb R^k$. In this context, $S$ represents the space of probability distributions over the labels $\ell_j$, $j=1, \ldots, k$.

For a given image $x$, let $P(x) \in S$ denote the vector of probabilities associated with each label, where the entry $P(x)_\ell$ stands for the probability that image $x$ belongs to label $\ell$. In this way, $P(x) = e_\ell$ if and only if $\ell$ is the label of $x$ with probability $1$.

Consequently, we may encode the label $\ell_i$ by the vector $e_{\ell_i}\in E$:
$$y_i := e_{\ell_i}.$$
Thus, in the following, we will consider the dataset $(x_i, y_i)$ instead of $(x_i, \ell_i)$.
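
As a quick illustration, this one-hot encoding can be computed in PyTorch; the sketch below uses made-up labels and $k=3$ classes (note that PyTorch indexes classes from $0$ rather than $1$):

import torch
import torch.nn.functional as F

# Made-up labels for k = 3 classes (indexed 0, 1, 2)
labels = torch.tensor([0, 2, 1])

# One-hot encoding: each label l_i is mapped to the basis vector e_{l_i}
y = F.one_hot(labels, num_classes=3)
print(y)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])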

On the other hand, we notice that any function $f: \mathbb R^d \rightarrow \mathbb R^k$ can be normalized to obtain a probability function $P$ as follows:
$$P(x) = \sigma_{SM}(f(x)),$$
where the softmax function $\sigma_{SM}: \mathbb R^k \rightarrow S$ is defined by:
$$\sigma_{SM}(u) = \frac{1}{\sum_{j=1}^k e^{u_j}}\,(e^{u_1}, \ldots, e^{u_k}) \in S.$$

Naturally, we can define a label function $F(x)$ as the index with the highest probability in $P(x)$, i.e.:
$$F(x) = \arg\max_{j=1, \ldots, k}\ \sigma_{SM}(f(x))_j.$$
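
In code, the chain from scores $f(x)$ to probabilities $P(x)$ to a predicted label $F(x)$ might look like the following sketch, where the score vector is made up for illustration:

import torch

# A made-up score vector f(x) for k = 4 classes
scores = torch.tensor([1.0, 3.0, 0.5, 2.0])

# Softmax maps the scores to a probability vector P(x) in the simplex S
probs = torch.softmax(scores, dim=0)
print(probs.sum())            # the entries sum to 1

# The predicted label F(x) is the index of the largest probability (0-based)
predicted_label = torch.argmax(probs)
print(predicted_label)        # tensor(1)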

The problem (ICP1) can then be reformulated as follows: given the dataset $(x_i, y_i)\in \mathbb R^d\times E$, for $i=1, \ldots, n$, our objective is to:
$$\text{Find the function } f: \quad \sigma_{SM}(f(x_i)) \approx y_i, \quad \forall i=1, \ldots, n.\tag{ICP2}$$

Optimization Approach

As finding a function $f$ across the space of general functions appears to be infeasible, we constrain $f$ to a set of parametric functions, denoted as $f_\Theta$, where $\Theta$ represents a set of parameters. Achieving a function that satisfies the approximation in (ICP2) then amounts to minimizing a loss function. Essentially, the objective here is to find a Neural Network (NN), which is a specific parametric function $f_\Theta$, such that $\Theta$ solves the following optimization problem:

$$\min_{\Theta} \quad \frac{1}{n}\sum_{i=1}^n L_{CE}\big(\, \sigma_{SM}(f_\Theta(x_i)) \, , \, y_i \,\big)\tag{ICP3}$$

Recall that $(x_i, y_i) \in \mathbb R^{d} \times E$, for $i=1, \ldots, n$. A commonly used loss function in this classification approach is the Cross Entropy Loss, defined as:

$$L_{CE}(u, v) = - \frac{1}{k}\sum_{i=1}^k v_i \log u_i.$$

Note that (ICP3) is a highly non-convex optimization problem, and solving it using iterative methods such as gradient descent is known as training the NN.
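
In practice, the softmax and the cross-entropy loss in (ICP3) are typically fused for numerical stability. For instance, PyTorch's nn.CrossEntropyLoss takes the raw scores $f_\Theta(x_i)$ (logits) together with the class indices $\ell_i$ and applies the log-softmax internally; it also omits the $1/k$ factor above, which only rescales the objective and does not change the minimizer. A minimal sketch with made-up data:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Made-up raw scores f_Theta(x_i) for a batch of 2 images and k = 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
# Corresponding class indices l_i (0-based)
targets = torch.tensor([0, 1])

loss = criterion(logits, targets)   # averages the per-sample cross entropy over the batch
print(loss)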

2. One-hidden-layer Linear Neural Network

In this example, we will illustrate how to solve this problem using PyTorch by implementing the simplest form of a neural network: a one-hidden-layer linear neural network (without activation functions).

The function $f_\Theta$ is defined as:
$$f_\Theta(x) := \Theta_2 \Theta_1 x.$$

Here, $x \in \mathbb R^d$, $\Theta_1 \in \mathbb R^{m \times d}$, $\Theta_2 \in \mathbb R^{k \times m}$, and $\Theta = (\Theta_1, \Theta_2)$.
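
Concretely, evaluating $f_\Theta(x)$ is just two matrix-vector products; a tiny sketch with made-up dimensions:

import torch

d, m, k = 6, 4, 3             # made-up dimensions for illustration
Theta1 = torch.randn(m, d)
Theta2 = torch.randn(k, m)
x = torch.randn(d)

fx = Theta2 @ (Theta1 @ x)    # f_Theta(x), a vector of k raw scores
print(fx.shape)               # torch.Size([3])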

We will use MNIST (Modified National Institute of Standards and Technology database), a widely used collection of handwritten digit data. MNIST comprises 60,000 training images and 10,000 testing images, each being a grayscale image with dimensions of 28x28 pixels. The main classification task involves assigning each image to one of $k=10$ classes, representing the 10 digits with labels $\{\ell_1, \ldots, \ell_{10}\} = \{0, \ldots, 9\}$.

In this case, we have $d=784$, $k=10$, and $m$ can be arbitrarily chosen; it is referred to as the size of the hidden layer.

The custom implementation of the NN is provided below. The full code can be found on my GitHub.

import torch
import torch.nn as nn


class LinearLayer(nn.Module):
    # A bias-free linear layer with weights initialized uniformly in [-1, 1]
    def __init__(self, input_size, output_size):
        super(LinearLayer, self).__init__()
        torch.manual_seed(102)
        rand_weights = 2*(torch.rand(output_size, input_size)-0.5)
        self.weights = nn.Parameter(rand_weights)

    def forward(self, x):
        # x.shape = (batch_size, input_size)
        # W.shape = (output_size, input_size)
        W = self.weights
        result = x @ W.T
        return result


# Fully connected neural network with one hidden layer (no activation functions)
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.input_size = input_size
        self.l1 = LinearLayer(input_size, hidden_size)   # Theta_1
        self.l2 = LinearLayer(hidden_size, num_classes)  # Theta_2

    def forward(self, x):
        out = self.l1(x)
        out = self.l2(out)
        # no activation and no softmax at the end: the model returns raw scores (logits)
        return out
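
To actually train this network, i.e. to (approximately) solve (ICP3), one can use a standard PyTorch training loop. The sketch below is a simplified outline rather than the exact script from the repository: it reuses the NeuralNet class defined above, loads MNIST via torchvision, and the hidden size, batch size, learning rate, and number of epochs are arbitrary choices for illustration.

import torch
import torch.nn as nn
from torchvision import datasets, transforms

# Load MNIST and wrap it in a DataLoader (images come as 1x28x28 tensors)
train_dataset = datasets.MNIST(root='./data', train=True,
                               transform=transforms.ToTensor(), download=True)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=100, shuffle=True)

# d = 784, m = 100 (arbitrary hidden size), k = 10
model = NeuralNet(input_size=784, hidden_size=100, num_classes=10)

criterion = nn.CrossEntropyLoss()                        # cross entropy composed with softmax
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent steps

for epoch in range(5):
    for images, labels in train_loader:
        images = images.reshape(-1, 784)      # flatten each 28x28 image into a vector in R^784

        outputs = model(images)               # raw scores f_Theta(x_i)
        loss = criterion(outputs, labels)     # objective in (ICP3)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()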

Some weights before training

Some weights after training

Some predictions after training
