Yuge Shi

An incomplete and slightly outdated literature review on augmentation-based self-supervised learning

2021-12-14T00:00:00-08:00

What’s with this title?

This is the equivalent to the “I invented this dish because I love my family” part in recipes — feel free to skip.

It was four months ago when I first drafted this blog post, and I felt like I was at the cutting edge of science. I mean, self-supervised learning methods without negative examples? That is WILD. Just knowing about these works made me feel like an excellent researcher, a studious PhD student, standing on the shoulders of the most recent giants.

Now that I am coming back to polish it four months later, it feels like a century has passed, and models like masked autoencoders / BEiT have taken over as the new crowd favourites. I look at my blog post and realise that this is no longer a complete literature review of recent advances in self-supervised learning, but more of a weirdly specific “period piece” that covers some of the more famous works between 2020 and early 2021.

I have convinced myself that it is still useful to put this out there, since a lot of the work done during this time period shares very similar intuitions. Looking at these works as a whole also provides useful insights into how our view of what boosts performance or prevents latent collapse in SSL has changed over the years (months?). I am currently working on another blog post on masked image models such as MAE, BEiT and iBOT – so stay tuned!

Notations

Consistent notations:

$x$: Original image;
$t^A$, $t^B$: Augmentations applied to images;
$x^A$, $x^B$: Two augmented views of the same image $x$;
$h^A$, $h^B$: representations extracted from $x^A$ and $x^B$, used for downstream tasks;

(I tried my best but) less consistent notations:

$z^A$, $z^B$: representations extracted from $x^A$ and $x^B$, used for objective evaluations (apart from one special case in SCAN);
$x^{(1)}$, …, $x^{(n)}$: $n$ augmented views of the same image $x$ / $n$ different images (apologies for the abuse of notation, but it should be clear from the context which one it is)

Before we start, an overview

Almost all self-supervised learning (SSL) models share a similar goal — learning useful representations without labels. All the methods we are about to cover translate this goal as the following requirement:

Images that are semantically similar should have representations that are close to each other in feature space.

In practice, “semantically similar” images are generated by image augmentations. Let’s say we have a set of available augmentations $T$. Then an image $x$ can be augmented into two different views, $x^A$ and $x^B$, through the following procedure:

\[\begin{align*} & t^A \sim T, t^B \sim T \qquad &\text{Sample augmentations}\\ & x^A := t^A(x), x^B := t^B(x) &\text{Apply to image} \end{align*}\]

Let’s say we want to learn an encoder $f_\theta$, which extracts features from images. We can acquire the representations for the two image views by

\[\begin{align*} & z^A := f_\theta(x^A), z^B := f_\theta(x^B) \end{align*}\]

The goal is then to minimise the distance between the two features, i.e.

\[\begin{align*} \min_\theta \mathbb{E}[dist(z^A, z^B)] \end{align*}\]

where $dist(\cdot)$ denotes the distance between vectors.

What’s the catch?

This idea seems easy enough, so how come there are so many papers covering this same topic? (insert bad meme about the overly competitive nature of DL community)

Turns out, if you naively minimise the distance between representations, the model will simply map all the representations to a constant. This trivially minimises the distance between any pair of representations, but does not give us useful representations at all. We refer to this phenomenon as latent collapse.
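To make the collapse problem concrete, here is a minimal sketch in plain Python. Everything in it is a toy assumption of mine: the "augmentation" is just additive Gaussian noise, the "encoders" are hand-written functions, and `dist` is taken to be the squared Euclidean distance. It only illustrates that an encoder ignoring its input reaches the minimum of the naive objective.

```python
import random

def augment(x, rng):
    """Toy 'augmentation': add small Gaussian noise to every coordinate."""
    return [xi + rng.gauss(0.0, 0.1) for xi in x]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_pair_distance(encoder, images, rng):
    """Monte Carlo estimate of E[dist(z^A, z^B)] over two random views per image."""
    total = 0.0
    for x in images:
        z_a, z_b = encoder(augment(x, rng)), encoder(augment(x, rng))
        total += sq_dist(z_a, z_b)
    return total / len(images)

def identity_encoder(x):
    """Keeps all the information in the input."""
    return list(x)

def collapsed_encoder(x):
    """Ignores its input entirely and always returns the same constant."""
    return [0.0, 0.0]

rng = random.Random(0)
images = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(32)]

loss_identity = mean_pair_distance(identity_encoder, images, rng)
loss_collapsed = mean_pair_distance(collapsed_encoder, images, rng)
print(loss_identity, loss_collapsed)
```

The collapsed encoder gets a loss of exactly zero while being useless for any downstream task, which is why every method below needs some extra ingredient on top of this objective.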

For a lot of the work you are about to see, how latent collapse is avoided is the most interesting part of the paper (but many of them have other interesting contributions too, so don’t stop there!). For those of you who like tables, here is a quick summary (you can also jump to different sections of this blog post by following the links):

Method	How latent collapse is prevented
SimCLR	Contrasting against negative examples, from minibatch
MoCo	Contrasting against negative examples, from dictionary
SwAV	Contrasting against negative examples, from minibatch
BYOL	~~magic~~ Iterative online update + asymmetry of two encoders
W-MSE	Whitening
Barlow Twins	Matching cross correlation matrices
SimSiam	Stop gradient operation to encoder of one view
VICReg	Regularise the standard deviation of representation

That’s it! For the rest of the blog post I will be introducing these methods in rough chronological order, going through the model, objective as well as key findings and insights, which will hopefully shed some light on how these models evolved over time. Enjoy!

SimCLR (2020 Feb)

TL;DR: Building on prior contrastive learning approaches, the authors propose a simple contrastive framework and study the different empirical aspects that make it “work”.

Model

Two data augmentations are applied to the same example $x$, producing $x^A$ and $x^B$. This is considered a positive pair. (engineering detail: the combination of random crop and color distortion is crucial for good performance)
Base encoder $f(\cdot)$ extracts the representation $h$ for each view, which is used for downstream tasks at test time;
Projection head $g(\cdot)$ takes $h$ and maps it to $z$.

Objective

The model $\theta$ is learned through minimising the following InfoNCE objective:

\[\begin{align*} \min_{\theta} \left( -\log \frac{\exp \text{sim}(z_i^A, z_i^B)}{\sum_{j\neq i}\exp \text{sim}(z_i^A, z_j^B)} \right) \end{align*}\]

Where sim denotes the cosine similarity between two vectors, i.e. sim$(u, v) = \frac{u^Tv}{||u||\cdot||v||}$.

Note that the objective is computed using $z$, the output of the projection head $g(\cdot)$, only. By minimising the above objective, we maximise the similarity between the representations of two views of the same image, $z^A_i$ and $z^B_i$ (also called a positive pair), and minimise the similarity between those of different images, i.e. $z^A_i$ and $z^B_j$. For SimCLR, the negative examples are all but the current example in the same minibatch. (We will be looking at some other methods, such as MoCo, which decouple the number of negative examples from the batch size)
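As a rough illustration, the objective as written above can be sketched in plain Python as follows. This is only a sketch of the simplified formula in this post (no temperature, negatives taken from the other view only), not the exact loss of the SimCLR codebase, and the toy vectors are made up.

```python
import math

def cosine_sim(u, v):
    """sim(u, v) = u^T v / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(z_a, z_b):
    """Mean over i of -log(exp sim(zA_i, zB_i) / sum_{j != i} exp sim(zA_i, zB_j))."""
    n = len(z_a)
    total = 0.0
    for i in range(n):
        positive = math.exp(cosine_sim(z_a[i], z_b[i]))
        negatives = sum(math.exp(cosine_sim(z_a[i], z_b[j]))
                        for j in range(n) if j != i)
        total += -math.log(positive / negatives)
    return total / n

# Three toy "images": view B is a slightly perturbed copy of view A.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Same vectors, but paired with the wrong image.
view_b_shuffled = [view_b[1], view_b[2], view_b[0]]

loss_aligned = info_nce(view_a, view_b)
loss_shuffled = info_nce(view_a, view_b_shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is lower when each $z^A_i$ is paired with the view of its own image than when the pairs are shuffled, which is the behaviour the objective is meant to reward.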

Key findings

Composition of data augmentations is important – random crop + color distortion is crucial for good performance;
Proposes to add a projection layer between representations for downstream task ($h$) and the representation used to compute contrastive loss ($z$), which seems to improve performance;
Larger batch size is better.

SCAN (2020 May)

TL;DR: Use the power of nearest neighbours to improve the representations learned from some pretext task (for instance a model trained using SimCLR).

Model

First, the representations $z$ are learned through some pretext task – in the original paper for most of the experiments they used SimCLR;
Then, a clustering function $g_\phi$ takes $z$ as input and predicts the probability of the datapoint belonging to each cluster $\mathcal{C}=\{1,\cdots, C\}$. We denote this probability as $h$, where $h\in[0,1]^C$;
The datapoint is then assigned to the cluster with the highest probability in $h$, which we denote as $c$.
We then select the $K$ nearest neighbours to the representation $z$ of the original image, denoting them as $\{z^{(1)}, z^{(2)},\cdots,z^{(K)}\}$. We perform the above clustering forward pass on all $K$ neighbours and acquire the probabilities that each neighbour belongs to different clusters, $\mathcal{H} = \{h^{(1)},h^{(2)},\cdots,h^{(K)}\}$.

Objective

We can then learn the clustering function $g_\phi$ by minimising the following objective:

\[\begin{align*} \mathcal{L}= \underbrace{-\sum_{h^{(k)} \in \mathcal{H}} \log \langle h, h^{(k)} \rangle}_{(1)} + \lambda \ \underbrace{\sum_{c\in\mathcal{C}} h_c\log(h_c)}_{(2)} \end{align*}\]

Let’s dissect the objectives a bit:

Term (1): Consistent, confident neighbours; The first term of the objective imposes consistent predictions for $z$ and its neighbouring samples. The inner product $\langle h, h^{(k)} \rangle$ is maximised (and term (1) therefore minimised) when the predictions are one-hot (confident) and assigned to the same cluster (consistent).

Term (2): diversity in clusters; The second term is the negative entropy of the cluster assignment probabilities (in the original paper, $h_c$ here is averaged over the dataset rather than taken from a single sample). It is introduced to prevent $g_\phi$ from assigning all samples to a single cluster – minimising it maximises the entropy, which spreads predictions uniformly across the clusters $\mathcal{C}$.
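A small numerical sketch of how the two terms behave, in plain Python. The probability vectors are made-up examples, and the functions follow the formula as written in this post rather than any official implementation.

```python
import math

def consistency_term(h, neighbours):
    """Term (1): -sum_k log <h, h^(k)> over the K nearest neighbours."""
    return -sum(math.log(sum(a * b for a, b in zip(h, h_k)))
                for h_k in neighbours)

def neg_entropy_term(h):
    """Term (2): sum_c h_c log h_c, i.e. the negative entropy of h."""
    return sum(p * math.log(p) for p in h if p > 0.0)

confident = [0.98, 0.01, 0.01]   # nearly one-hot cluster prediction
uncertain = [0.4, 0.3, 0.3]      # spread-out cluster prediction
uniform = [1 / 3, 1 / 3, 1 / 3]  # e.g. cluster usage averaged over many samples

# Term (1) is small when a sample and its neighbours agree confidently.
term1_confident = consistency_term(confident, [confident, confident])
term1_uncertain = consistency_term(uncertain, [uncertain, uncertain])

# Term (2) is smallest when the probabilities are spread uniformly.
term2_uniform = neg_entropy_term(uniform)
term2_confident = neg_entropy_term(confident)
print(term1_confident, term1_uncertain, term2_uniform, term2_confident)
```

The two terms pull in different directions on purpose: the first rewards confident agreement between neighbours, the second penalises putting all the probability mass on one cluster.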

An extra “trick”: the model minimises mis-labelling during cluster assignment by picking out samples with highly confident predictions ($h_\max \approx 1$), assigning the sample to its predicted cluster, and updating $g_\phi$ based on the pseudo labels obtained. See original paper for more details.

MoCo (2019) / MoCo V2+ (2020 March)

TL;DR: MoCo decouples the number of negative examples from the batch size. It proposes to store representations from previous $K$ minibatches in a “keys” dictionary, which can be used for computing the contrastive loss of the current “query” minibatch.

Model

MoCo uses an InfoNCE objective similar to SimCLR’s, and the key difference between the two approaches is how they acquire negative examples.

SimCLR: all the other datapoints in the minibatch are used as negative examples to the current datapoint, and therefore the number of negative examples is limited by the size of the minibatch;
MoCo: the representation of each minibatch is stored in a fixed-sized dictionary. The negative examples used for any datapoint are drawn from this dictionary. By doing so, the number of negative examples is no longer determined by the size of the minibatch.

With this in mind, MoCo’s pipeline consists of the following two parts:

Generating positive examples: Similar to SimCLR, a forward pass is performed on two views of the same image $x^A$ and $x^B$ with a base encoder $f_\theta(\cdot)$ and projection head $g_\theta(\cdot)$. We denote the representations acquired from this step $z^A_\theta$ and $z^B_\theta$.
Generating negative examples: We mentioned that the representations for each minibatch get stored in a dictionary and are reused for subsequent batches as negative examples. However, naively storing the representations $z^A_\theta$ and $z^B_\theta$ can lead to poor results due to the rapidly changing $\theta$. Therefore, the authors propose to store the representations generated through the momentum encoder $\phi$, where

\[\begin{align*} \phi \leftarrow m \phi + (1-m) \theta, \end{align*}\]

with $m\in[0,1)$ being the momentum coefficient. Updating $\phi$ using the above assignment rule ensures that $\phi$ evolves more smoothly than $\theta$. Therefore for each minibatch, we simply add the representations acquired from $g_\phi(f_\phi(x))$, denoted as $z^A_\phi$ and $z^B_\phi$, to the dictionary, which is then made available for future minibatches as negative examples.
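The two moving parts described above, the momentum update and the fixed-size dictionary, can be sketched in a few lines of plain Python. The "parameters" here are just short lists of floats, the gradient step is faked, and the dictionary size of 4 and $m=0.9$ are arbitrary values picked for illustration.

```python
from collections import deque

def momentum_update(phi, theta, m=0.99):
    """phi <- m * phi + (1 - m) * theta, applied parameter-wise."""
    return [m * p + (1.0 - m) * t for p, t in zip(phi, theta)]

# Fixed-size dictionary of keys: appending beyond maxlen drops the oldest entry.
dictionary = deque(maxlen=4)

theta = [0.0, 0.0]  # online ("query") parameters, updated by the optimiser
phi = [0.0, 0.0]    # momentum ("key") parameters, never touched by gradients

for step in range(6):
    theta = [t + 1.0 for t in theta]         # stand-in for a gradient step on theta
    phi = momentum_update(phi, theta, m=0.9)
    key = tuple(phi)                         # stand-in for g_phi(f_phi(x)) of this minibatch
    dictionary.append(key)

print(theta, phi, len(dictionary))
```

After six steps the dictionary holds only the four most recent keys, and $\phi$ lags behind $\theta$, which is the smoothing effect the momentum encoder is there for.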

Objective

The objective is very similar to SimCLR:

\[\begin{align*} \min_{\theta} \left( -\log \frac{\exp \text{sim}(z_\theta^A, z_\theta^B)}{\sum_{z_\phi \sim \texttt{dict}}\exp \text{sim}(z_\theta^A, z_\phi)} \right) \end{align*}\]

Similarly, $\text{sim}$ denotes the cosine similarity between two vectors, i.e. sim$(u, v) = \frac{u^Tv}{||u||\cdot||v||}$, and $\texttt{dict}$ denotes the dictionary.

From MoCo to MoCo V2+

The model that we described here is actually MoCo V2+. The original MoCo was proposed before SimCLR. Following the key findings from SimCLR, the authors updated their model to MoCo V2+, adopting the following designs to achieve better results:

use an MLP projection head $g(\cdot)$ and
use more data augmentations.

SwAV (2020 June)

TL;DR: instead of matching the representations of two views (augmentations of the same image) directly, use one representation to predict the other.

Model

The model itself looks similar to SimCLR, but the way it works is quite different. Here $g_\theta$ is not parametrised by a learnable neural network, but instead by a set of $K$ trainable prototype vectors $G=\{g_1, g_2, \dots, g_K\}$ that map $h$ into a code $z$ (this code can be discrete, however the authors find that leaving it continuous during training results in better performance). When computing the loss, instead of directly enforcing $z^A$ and $z^B$ to be similar, the model tries to associate the code of a view $x^A$ with the representation of another view $x^B$.

Objective

SwAV minimises the following objective

\[\begin{align*} \min_{\theta} \left(\ell(h^B, z^A) + \ell(h^A, z^B)\right), \end{align*}\]

where

\[\begin{align*} \ell(h^B, z^A) = -\sum_k z^{A(k)} \log \frac{\exp(\langle h^B, g_k\rangle )}{\sum_{k'}\exp(\langle h^B, g_{k'}\rangle)}. \end{align*}\]

Despite the conceptual differences, this is loosely still the inner product of the projected representation $z^A$ and $z^B$.
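Here is a minimal sketch of one half of the swapped loss, $\ell(h^B, z^A)$, in plain Python. The prototypes, the representation and the codes are made-up toy values; in particular, the code $z^A$ is simply given here, whereas how SwAV actually computes codes is not covered by this sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def swapped_loss(h, code, prototypes):
    """l(h, z) = -sum_k z^(k) log softmax_k(<h, g_k>)."""
    scores = [sum(a * b for a, b in zip(h, g)) for g in prototypes]
    probs = softmax(scores)
    return -sum(z_k * math.log(p_k) for z_k, p_k in zip(code, probs))

prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # hypothetical g_1, g_2, g_3
h_b = [2.0, 0.1]                  # representation of view B, closest to g_1
code_match = [1.0, 0.0, 0.0]      # code of view A that also points at g_1
code_mismatch = [0.0, 0.0, 1.0]   # code of view A that points at g_3

loss_match = swapped_loss(h_b, code_match, prototypes)
loss_mismatch = swapped_loss(h_b, code_mismatch, prototypes)
print(loss_match, loss_mismatch)
```

The full objective adds the symmetric term $\ell(h^A, z^B)$; the loss is small only when the representation of one view predicts the code of the other.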

BYOL (2020 June)

TL;DR: BYOL avoids having to use negative examples for contrastive loss by performing an iterative online update. This paper was groundbreaking at the time, as negative examples are very computationally costly.

Model

Let’s unpack. Similar to MoCo, the model uses two sets of network parameters $\theta$ and $\phi$.

The optimisation goal of $\theta$ is to learn a projection $y_\theta$ that closely matches the representation learned from $\phi$, i.e. $z_\phi$. Implementation wise, this is done by adding yet another projection head $q_\theta(\cdot)$ that predicts $y_\theta$ from $z_\theta$. We then optimise $\theta$ using the following loss that minimises the mean squared error between $y_\theta$ and $z_\phi$:

\[\begin{align*} \mathcal{L}_\theta = \|y_\theta^A - z^B_\phi\|^2_2 = 2 - 2\cdot \frac{\langle y^A_\theta, z^B_\phi \rangle}{\|y^A_\theta \|\cdot \|z^B_\phi\|} \end{align*}\]

The second equality holds because, in the paper, $y_\theta^A$ and $z_\phi^B$ are l2-normalised before computing this loss. Further, they symmetrise the loss by also feeding $x^B$ to the online network $\theta$ and $x^A$ to the target network $\phi$, and adding the resulting term $\|y_\theta^B - z^A_\phi\|^2_2$ to $\mathcal{L}_\theta$. (I’m not sure how important that is since the transforms are stochastically generated anyways, but it seems to improve empirical results)
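The equality between the squared error and the cosine form in the loss above relies on the normalisation step. A quick numerical check in plain Python, with made-up vectors standing in for $y^A_\theta$ and $z^B_\phi$:

```python
import math

def l2_normalise(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def byol_loss(y, z):
    """Squared distance between the l2-normalised prediction y and target z."""
    y_hat, z_hat = l2_normalise(y), l2_normalise(z)
    return sum((a - b) ** 2 for a, b in zip(y_hat, z_hat))

def cosine_form(y, z):
    """The equivalent form 2 - 2 * <y, z> / (||y|| ||z||)."""
    dot = sum(a * b for a, b in zip(y, z))
    norms = math.sqrt(sum(a * a for a in y)) * math.sqrt(sum(b * b for b in z))
    return 2.0 - 2.0 * dot / norms

y_a = [0.3, -1.2, 2.0]  # toy online prediction for view A
z_b = [0.5, -0.9, 1.5]  # toy target projection for view B

print(byol_loss(y_a, z_b), cosine_form(y_a, z_b))
```

Both functions return the same value up to floating point error, and the loss is zero when the prediction and the target point in the same direction.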

$\phi$ on the other hand is not optimised via gradient descent. Similar to the momentum encoder of MoCo, it is updated after every training step using the following rule:

\[\begin{align*} \phi \leftarrow \tau \phi + (1-\tau) \theta \end{align*}\]

where $\tau\in[0,1)$ is the coefficient that controls the smoothness of the update.

Why the hell does it work?

From the above, it is not hard to notice that BYOL is very similar to both SimCLR and MoCo. However, directly removing the negative examples from either model will lead to latent collapse.

So what makes BYOL effective without negative examples? The paper builds an intuition for this by deriving the gradient of the $\theta$ update, showing that (for an optimal predictor) it is the same as the gradient of the expected conditional variance, i.e.

\[\begin{align*} \nabla_\theta \mathbb{E}\left[\|y_\theta-z_\phi\|_2^2 \right] = \nabla_\theta \mathbb{E}\left[\sum_i \text{Var}(z_{\phi,i} \mid z_\theta) \right] \end{align*}\]

This finding is important for explaining why BYOL doesn’t collapse, as it provides the following three insights:

It is always worth it for the model to utilise stochasticities in training dynamics: Since for any random variables $X$, $Y$ and $Z$ we have $\text{Var}(X|Y,Z)\leq \text{Var}(X|Y)$, let us consider the following:
- $X$: the target projection $z_\phi$
- $Y$: the online projection $z_\theta$
- $Z$: any additional changes introduced by stochasticities in training dynamics. We see that the model cannot reduce variance by discarding $Z$.
Latent collapse avoided: following similar intuition to the above, BYOL avoids constant features in $z$, since for any constant $c$ and random variables $z_\phi$ and $z_\theta$, Var$(z_\phi|z_\theta)\leq$Var$(z_\phi|c)$.
Why we can’t optimise $\phi$ with the same objective as $\theta$: if we were to minimise the variance Var$(z_\phi|z_\theta)$ directly by optimising $\phi$, $z_\phi$ could simply reduce to a constant. Therefore BYOL instead moves $\phi$ gradually closer to $\theta$.

Note: It’s probably better to say that the above explains why BYOL does not fail completely, than to say that it explains why it works. In fact, the reason why latent collapse does not happen in BYOL (or any SSL algorithm for that matter) remains an open problem. See the resources listed below for further discussions on this topic:

This blog on BYOL attributes avoiding degenerative solutions to the batch-norm layers in the projection heads;
This paper then rebuts the above and shows that BYOL works even without batch statistics;
Multiview contrastive coding shows that using multiple views, not just two, contributes to non-collapsing solutions;
Works such as SimSiam and W-MSE also offer interesting perspectives on the topic of avoiding latent collapse.

W-MSE (2020 July)

TL;DR: The paper has similar motivation to BYOL – it aims to develop an SSL method that requires no negative examples. Instead, it uses “whitening” to prevent latent collapse.

Prevent latent collapse by whitening

Before we dive in, it’s helpful to first look at how the authors characterise the learning problem in this paper. Specifically, the authors propose to formulate the problem of SSL as follows:

\[\begin{align*} &\min_\theta \mathbb{E}[dist(z_i, z_j)], &(1) \\ s.t.\ & cov(z_i, z_i) = cov(z_j, z_j) = I &(2) \end{align*}\]

Let’s unpack. In the above equations, (1) specifies that representations from positive image pairs that share similar semantics $(z_i, z_j)$ should be clustered close together, and (2) that the image representations must form a non-degenerate distribution, i.e. the latents do not collapse to a single point.

More specifically, in (2), $I$ is the identity matrix. The constraint specifies that different components (dimensions) of the representation $z$ should be uncorrelated (with unit variance), and by doing so, encourages different axes of $z$ to represent different semantic content. Importantly, by optimising this condition, the model does not need any negative examples to prevent latent collapse!

Now that we know the optimisation goal of the model, the pipeline and objective of this model should make much more sense.

Model

One of the most notable differences of this model is that it is not constrained to using only 2 positive examples – in the above schematic, $d$ views are generated for each image.
The paper again uses a similar pipeline to SimCLR, applying first the base encoder $f(\cdot)$ and then the projection head $g(\cdot)$. This gives us the feature $v$, which is then passed to the whitening layer.
The whitening procedure is done using the following:

\[\begin{align*} z = W_V(v-\mu_V), \end{align*}\]

where $\mu_V$ is the mean of the $K$ features $V = \{v_1, \cdots, v_K\}$ in the batch:

\[\begin{align*} \mu_V = \frac{1}{K} \sum_k v_k, \end{align*}\]

while the matrix $W_V$ is such that $W_V^TW_V = \Sigma_V^{-1}$, with $\Sigma_V$ being the covariance matrix of $V$:

\[\begin{align*} \Sigma_V = \frac{1}{K-1} \sum_k (v_k-\mu_V)(v_k-\mu_V)^T. \end{align*}\]
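To see that this procedure really produces features with identity covariance, here is a self-contained sketch in plain Python. It takes $W_V = L^{-1}$, where $L$ is the Cholesky factor of $\Sigma_V$ (so that $W_V^TW_V = \Sigma_V^{-1}$); this is one valid choice of $W_V$ that I picked for the illustration, and the toy 2-D features are made up.

```python
import random

def mean_vec(V):
    """mu_V: coordinate-wise mean of the feature vectors."""
    d = len(V[0])
    return [sum(v[i] for v in V) / len(V) for i in range(d)]

def covariance(V):
    """Sigma_V = 1/(K-1) * sum_k (v_k - mu)(v_k - mu)^T."""
    mu, K, d = mean_vec(V), len(V), len(V[0])
    return [[sum((v[i] - mu[i]) * (v[j] - mu[j]) for v in V) / (K - 1)
             for j in range(d)] for i in range(d)]

def cholesky(S):
    """Lower-triangular L with L L^T = S (S must be positive definite)."""
    d = len(S)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s ** 0.5 if i == j else s / L[j][j]
    return L

def whiten(V):
    """z = W (v - mu) with W = L^{-1}, computed by forward substitution."""
    mu, L = mean_vec(V), cholesky(covariance(V))
    Z = []
    for v in V:
        c = [a - b for a, b in zip(v, mu)]
        z = []
        for i in range(len(c)):  # solve L z = c row by row
            z.append((c[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
        Z.append(z)
    return Z

rng = random.Random(0)
V = []
for _ in range(200):
    a = rng.gauss(0, 1)
    # Strongly correlated 2-D features with a non-zero mean.
    V.append([3.0 * a + 1.0, a + 0.5 * rng.gauss(0, 1)])

cov_z = covariance(whiten(V))
print(cov_z)
```

The covariance of the whitened features is the identity matrix up to floating point error, regardless of how correlated the input features were.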

Objective

The loss is then computed for pairwise $z$s in $\{z^{(1)}, \cdots, z^{(d)}\}$ as follows:

\[\begin{align*} \mathcal{L} = \frac{2}{Nd(d-1)} \sum dist(z^{(i)}, z^{(j)}), \end{align*}\]

where $N$ denotes the batch size and $d$ the number of augmentations for each image.

Some extra notes

The whitening “layer” maps all the representations to a unit sphere to avoid latent collapse, therefore avoiding the need for negative examples. Note that this whitening transform was first proposed by Siarohin et al., 2019 (also seen in Huang et al., 2018), and uses the efficient and stable Cholesky decomposition.

In parallel to whitening, the authors also apply batch slicing to the representation $v$, where they further divide each batch into multiple sub-batches to compute the whitening matrix $W_V$. This is to provide more stability during training. Please refer to Page 5 and Figure 3 of the original paper for more details.

Barlow Twins (2021 March)

TL;DR: Avoid latent collapse by matching the cross-correlation matrix between the representations of images of two different views to an identity matrix. Does not need negative examples as a result.

Model:

Again, the model uses a similar pipeline to SimCLR. After the representations of the two views $z^A$ and $z^B$ are generated, we compute the cross correlation matrix $\mathcal{C}$, where each element $\mathcal{C}_{ij}$ is computed as follows:

\[\begin{align*} \mathcal{C}_{ij} = \frac{\sum_b z^A_{b,i}z^B_{b,j}}{\sqrt{\sum_b (z^A_{b,i})^2}\sqrt{\sum_b (z^B_{b,j})^2}}, \end{align*}\]

where $b$ indexes batch samples and $i,j$ index the vector dimension of $z$. The value of $\mathcal{C}_{i,j}$ is between $-1$ (perfect anti-correlation) and $1$ (perfect correlation).

The training objective is based on this cross correlation matrix, and consists of 2 terms:

\[\begin{align*} \mathcal{L} = \underbrace{\sum_i (1-\mathcal{C}_{ii})^2}_{\text{invariance term}} + \lambda \underbrace{\sum_i \sum_{j\neq i}\mathcal{C}_{ij}^2}_{\text{redundancy reduction term}} \end{align*}\]

Invariance term: tries to equate the diagonal elements of the cross-correlation matrix to $1$, making the representations invariant to the augmentations applied to the original image;
Redundancy reduction term: tries to decorrelate the different vector components of the embedding by equating the off-diagonal elements of $\mathcal{C}$ to 0.
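The cross correlation matrix and the two terms of the loss can be sketched in plain Python as follows. The toy batches are made up and already mean-centred along the batch dimension, and $\lambda = 0.005$ is just an illustrative value for this sketch.

```python
import math

def cross_correlation(z_a, z_b):
    """C_ij = sum_b zA[b][i] * zB[b][j], normalised by the two column norms."""
    d = len(z_a[0])
    norm_a = [math.sqrt(sum(row[i] ** 2 for row in z_a)) for i in range(d)]
    norm_b = [math.sqrt(sum(row[j] ** 2 for row in z_b)) for j in range(d)]
    return [[sum(ra[i] * rb[j] for ra, rb in zip(z_a, z_b)) / (norm_a[i] * norm_b[j])
             for j in range(d)] for i in range(d)]

def barlow_twins_loss(z_a, z_b, lam=0.005):
    """Invariance term plus lam times the redundancy reduction term."""
    C = cross_correlation(z_a, z_b)
    d = len(C)
    invariance = sum((1.0 - C[i][i]) ** 2 for i in range(d))
    redundancy = sum(C[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return invariance + lam * redundancy

# Batch of 4 samples with 2 decorrelated dimensions; both views are identical.
z_decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_decorrelated = barlow_twins_loss(z_decorrelated, z_decorrelated)

# Both dimensions carry exactly the same information (fully redundant).
z_redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
loss_redundant = barlow_twins_loss(z_redundant, z_redundant)
print(loss_decorrelated, loss_redundant)
```

Identical views with decorrelated dimensions give a loss of zero, while identical views whose dimensions duplicate each other are penalised only by the redundancy reduction term.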

The paper also mentions that Barlow Twins’ objective function can be understood as an instantiation of the information bottleneck (IB) objective, which specifies that a representation should conserve as much information about the sample as possible while being as uninformative as possible about the specific distortions applied to the sample.

SimSiam (2020 Nov)

TL;DR: Proposes that simple siamese networks can learn meaningful representations without negative samples (most contrastive methods) / large batches (SimCLR) / momentum encoders (BYOL). It turns out, “stop gradient is all you need”.

Model

The proposed model is quite simple. As the authors aptly put it, SimSiam can be thought of as “BYOL without the momentum encoder”, “SimCLR without negative pairs” and “SwAV without online clustering”. It is the simplest augmentation-based SSL method that I have read in this literature review.

As we can see from the architecture, SimSiam shares weights between two networks. The projection head $g_\theta$ on the augmentation B stream is removed, and gradients from the loss are not back-propagated through this stream. The loss is computed without negative pairs as follows:

\[\begin{align*} \mathcal{L}(z^A, h^B) = -\frac{z^A \cdot h^B}{\|z^A\| \ \|h^B\|} \end{align*}\]

Note that this loss is the same as the numerator part of the SimCLR loss. Following practices of BYOL, they also symmetrise the loss by swapping $t_A$ and $t_B$ to compute $\mathcal{L}(z^B, h^A)$ and take the average between the two losses.

Empirical findings: what prevents collapse in SimSiam?

Apart from proposing this amazingly simple method, authors also performed some helpful empirical evaluations on different elements of SimSiam:

Stop gradient $\leftarrow$ prevent collapse. Without the stop gradient operation, the model collapses and reaches the minimum possible loss. The authors quantify model collapse by computing the standard deviation of the l2-normalised output $z/\|z\|_2$ – the std should be 0 when the model collapses (all images are encoded into a constant value), and is $1/\sqrt{d}$ if $z$ has a zero-mean isotropic Gaussian distribution, where $d$ is the dimension of $z$. The authors show that with stop gradient the std is indeed close to $1/\sqrt{d}$, and without it the std is 0.

While this empirical evaluation is interesting and does show the importance of stop gradient in their architecture, quite unsatisfyingly (but also understandably), no guarantees were made about whether applying stop gradient will guarantee a non-collapsing solution. As carefully put by the authors,

Our experiments show that there exist collapsing solutions…, their existence implies that it is insufficient for our method to prevent collapsing solely by architecture designs (e.g. predictor, BN, l2-norm). In our comparison, all these architecture designs are kept unchanged, but they do not prevent collapsing if stop gradient is removed.

(Note that this finding that no stop gradient $\rightarrow$ latent collapse is limited to the architecture used in SimSiam and is not a general statement for all models.)
Predictor $g_\theta$ $\leftarrow$ prevent collapse

Config 1. Removing predictor when using symmetrised loss: model collapses! When the predictor $g_\theta$ is removed, the symmetrised loss is $\frac{1}{2}\mathcal{L}(z^B, \texttt{stopgrad}(z^A)) +$ $\frac{1}{2}\mathcal{L}(z^A, \texttt{stopgrad}(z^B))$, which has the same gradient direction as $\mathcal{L}(z^A, z^B)$ – so it is as if the stop gradient operation has been removed! Collapse is observed.

Config 2. Removing predictor when using asymmetric loss: model collapses! There’s not as much explanation for this one – collapse is observed in experiments when using this configuration.

Config 3. Fix predictor at random initialisation: training does not converge. If $g_\theta$ is fixed at random initialisation, the training does not converge as the loss remains high (which is not the same as collapse, where the loss is minimised).

Large batch size $\leftarrow$ not important. Compared to SimCLR and SwAV, which require a large batch size (4096) to work optimally, the optimal batch size of SimSiam is 256. Further increasing the batch size does not improve its performance. In addition, using smaller batch sizes such as 64 and 128 results in only a small accuracy drop (2.0 and 0.8 respectively).
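The collapse diagnostic used in these experiments (the std of the l2-normalised output) is easy to reproduce on toy data. The sketch below, in plain Python, uses made-up outputs rather than a trained network: one set drawn from a zero-mean isotropic Gaussian and one fully collapsed set, with $d=16$ chosen arbitrarily.

```python
import math
import random

def std_of_normalised(outputs):
    """Std over samples of each coordinate of z / ||z||_2, averaged over coordinates."""
    normed = []
    for z in outputs:
        norm = math.sqrt(sum(a * a for a in z))
        normed.append([a / norm for a in z])
    n, d = len(normed), len(normed[0])
    stds = []
    for j in range(d):
        mean = sum(row[j] for row in normed) / n
        stds.append(math.sqrt(sum((row[j] - mean) ** 2 for row in normed) / n))
    return sum(stds) / d

rng = random.Random(0)
d = 16
gaussian_outputs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(2000)]
collapsed_outputs = [[0.5] * d for _ in range(2000)]  # every image maps to one point

std_gaussian = std_of_normalised(gaussian_outputs)
std_collapsed = std_of_normalised(collapsed_outputs)
print(std_gaussian, 1 / math.sqrt(d), std_collapsed)
```

The Gaussian outputs give a value close to $1/\sqrt{d}$, while the collapsed outputs give 0, matching the two regimes described above.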

In addition, the following three factors are helpful for training, but do not prevent collapse:

Batch normalisation: Similar to supervised learning scenarios, batch normalisation is helpful for optimisation when used appropriately, but it does not help prevent collapse.
Similarity function: Swapping the cosine similarity for a cross-entropy similarity, where $\mathcal{L}(z^A, h^B) = -\texttt{softmax}(h^B) \cdot \log \texttt{softmax}(z^A)$, results in a $5\%$ performance drop on ImageNet, but the model does not collapse.
Symmetrisation: Asymmetric loss achieves accuracy that is $4\%$ lower than symmetric loss, but does not result in collapse.

Note: I find the empirical evaluation of different elements of the model to be very helpful, as it really pin-points what exactly prevents model collapse in SimSiam. Regrettably, these empirical findings do not necessarily extend beyond SimSiam’s experimental protocol, and what exactly prevents model collapse in this kind of siamese network is still unclear.

VICReg (2021 May)

TL;DR: The model uses a similar pipeline to SimCLR. Instead of using negative examples to avoid latent collapse, the authors explicitly regularise the standard deviation of the embedding (making sure it is not zero).

Objective

The architecture and model pipeline are identical to SimCLR, with the objective being the only difference, which is what we will focus on. Here the loss enforces constraints on three aspects of the representation, namely variance, invariance and covariance (hence the name VIC-Reg):

Invariance term $s$: this term is similar to all the SSL objectives above, which minimise the distance of representations from two views of the same image. Instead of using cosine similarity, authors use the L2 distance:

\[\begin{align*} s(z^A, z^B) = \|z^A - z^B\|^2_2 \end{align*}\]

Variance term $v$: this term makes sure that the standard deviation of the projections in each dimension of $z$ is non-zero, by pushing it up towards a pre-defined target value $\gamma$. We denote dimension $j$ of the representation $z$ as $z_j$, where $j\in\{1,\cdots,d\}$. The variance term can then be written as a hinge loss

\[\begin{align*} v(z) = \frac{1}{d}\sum^d_{j=1} \max (0, \gamma - \sqrt{\text{Var}(z_j)+\epsilon}), \end{align*}\]

where $\gamma$ is the target value of standard deviation and $\text{Var}(z_j)$ the unbiased variance estimator.

Note: Some might notice that it is slightly weird that this is the “variance” term, when in fact standard deviation is used – turns out if we directly use the variance in the hinge loss, the gradient becomes very close to 0 when the input vector is close to its mean vector, which prevents the loss from being very effective when we need it the most. Using standard deviation alleviates this.

Covariance term $c$: This term is similar to the one used in Barlow Twins, and decorrelates different dimensions of $z$ by forcing the off-diagonal coefficients of the covariance matrix $C(z)$ to be 0:

\[\begin{align*} c(z) = \frac{1}{d}\sum_{i\neq j} C(z)^2_{i,j} \end{align*}\]

The final objective looks like this:

\[\begin{align*} \mathcal{L} = \lambda s(z^A, z^B) + \mu\left(v(z^A)+v(z^B)\right) + \nu\left(c(z^A) + c(z^B)\right), \end{align*}\]

where $\lambda, \mu$ and $\nu$ scale the importance of each term.
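Putting the three terms together, a rough plain-Python sketch of the objective could look like this. The batches are made-up toy data; the weights (25, 25, 1), $\gamma=1$, $\epsilon=10^{-4}$, the averaging of the invariance term over the batch and the averaging of the hinge over the $d$ dimensions are all assumptions made for this sketch.

```python
import math

def column(Z, j):
    """All values of dimension j across the batch."""
    return [row[j] for row in Z]

def variance(xs):
    """Unbiased variance estimator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def invariance_term(Z_a, Z_b):
    """Squared L2 distance between paired rows, averaged over the batch."""
    return sum(sum((a - b) ** 2 for a, b in zip(ra, rb))
               for ra, rb in zip(Z_a, Z_b)) / len(Z_a)

def variance_term(Z, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension std, averaged over the d dimensions."""
    d = len(Z[0])
    return sum(max(0.0, gamma - math.sqrt(variance(column(Z, j)) + eps))
               for j in range(d)) / d

def covariance_term(Z):
    """Sum of squared off-diagonal covariance entries, divided by d."""
    n, d = len(Z), len(Z[0])
    means = [sum(column(Z, j)) / n for j in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((r[i] - means[i]) * (r[j] - means[j]) for r in Z) / (n - 1)
                total += cov ** 2
    return total / d

def vicreg_loss(Z_a, Z_b, lam=25.0, mu=25.0, nu=1.0):
    """lam * invariance + mu * (variance terms) + nu * (covariance terms)."""
    return (lam * invariance_term(Z_a, Z_b)
            + mu * (variance_term(Z_a) + variance_term(Z_b))
            + nu * (covariance_term(Z_a) + covariance_term(Z_b)))

collapsed = [[0.3, 0.3]] * 4  # every sample maps to the same point
spread = [[2.0, 2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, -2.0]]

loss_collapsed = vicreg_loss(collapsed, collapsed)
loss_spread = vicreg_loss(spread, spread)
print(loss_collapsed, loss_spread)
```

The collapsed batch has a perfect invariance term but is heavily penalised by the variance term, while the spread-out, decorrelated batch reaches zero loss. This is exactly how the method rules out the trivial solution without negative examples.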

Some Afterthoughts

There are a bunch of other excellent methods for self-supervised learning that I regrettably cannot cover here due to time constraints, including but not limited to:

PAWS (July, 2021): a bit different since it is for semi-supervised learning, however the model is able to provide a theoretical guarantee of non-degenerate solutions by performing sharpening on features;
ReSSL (July, 2021): instead of using cosine similarity between different augmentations, they propose to use a relation metric to capture the similarities among different instances;
NNCLR (April, 2021): instead of using only the augmented view of the same image as positive instance, employ nearest neighbors from the dataset as well.

When looking at all these works together, the common theme of augmentation-based SSL methods is clear: draw representations extracted from semantically similar images closer in feature space, while doing “something” to prevent degenerate solutions. This blog post summarises the different “something” used by different approaches, and attempts to discuss what kind of guarantee on avoiding latent collapse they provide. Sadly, a lot of these discussions, with few exceptions, are limited to empirical findings. With the popularisation of augmentation-based SSL approaches, it would be really interesting to see more works examining different collapse modes or sharing insights on why any particular strategy (batch norm, stop gradient, momentum encoder) avoids them.

If you are interested in doing a bit more hands-on stuff with the SSL methods introduced above, I would highly recommend checking out the solo-learn library, which implements a large variety of SSL approaches in PyTorch with benchmarked results on different datasets.

Thanks!

If you liked my blog post, please share it on social media (or with your employer, I am looking for a job :p). Thanks for reading!

How I learned to stop worrying and write ELBO (and its gradients) in a billion ways

2020-06-19T00:00:00-07:00


Overview

I had a really hard time learning about VAE at the beginning of my PhD. I felt very betrayed spending time deriving and memorising ELBO (the evidence lower bound objective), then seeing yet another paper that writes it in a different way. However, as I mature, my attitude towards this changed — now I have learned to embrace the power of the seemingly infinitely many forms of ELBO.

Thinking back, this transformation really took place when I was introduced by my supervisor Sid to this great series of literature that covers the evolution of ELBO over the last 5, 6 years. Organising all of them and describing them in non-gibberish took some time, but I hope that this will serve as a frustration-free note-to-self for future revisits to the topic, and also that it can be helpful to people out there who are feeling as bamboozled as I was a year ago.

I will discuss the following papers (click on links for PDF), one in each section — and trust me they each serve a purpose and tell a whole story:

ELBO surgery (warm up) $\Rightarrow$ A more intuitive (visualisable) way to write ELBO
IWAE $\Rightarrow$ “K steps away” from basic VAE ELBO
Sticking the landing $\Rightarrow$ What’s wrong with ELBO and IWAE?
Tighter isn’t better $\Rightarrow$ What is wrong with IWAE, in particular?
DReG $\Rightarrow$ How to fix IWAE?

0. Standard ELBO

Before we dive in, let’s look at the most basic form of ELBO first, here it is in all of its glory:

\begin{align*} \mathcal{L}(\theta, \phi) = \mathbb{E}_{z\sim q_\phi(z\mid x)} \displaystyle \left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)} \right] \leq \log p(x),\notag \end{align*}

where $\theta$ and $\phi$ denote the parameters of the generative and inference models respectively, $x$ the observation and $z$ a sample from the latent space. The objective is a lower bound on the marginal log likelihood of the observation, $\log p(x)$, and the VAE is trained by maximising this bound.
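To make the bound a bit more tangible, here is a minimal sketch on a toy model where $\log p(x)$ is known in closed form. The model ($p(z)=\mathcal{N}(0,1)$, $p(x\mid z)=\mathcal{N}(z,1)$) and the helper names `log_normal` and `elbo_estimate` are assumptions made purely for illustration; nothing here comes from a real VAE.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo_estimate(x, mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z|x)] with q(z|x) = N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.standard_normal(n_samples)              # z ~ q(z|x)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x|z)
    log_q = log_normal(z, mu, sigma)
    return np.mean(log_joint - log_q)

x = 1.5
# For this toy model p(x) = N(0, 2) and p(z|x) = N(x/2, 1/2).
log_px = log_normal(x, 0.0, np.sqrt(2.0))
elbo_poor = elbo_estimate(x, mu=0.0, sigma=1.0)                  # q far from the posterior
elbo_exact = elbo_estimate(x, mu=x / 2, sigma=np.sqrt(0.5))      # q equal to the posterior

print(f"log p(x) = {log_px:.4f}, poor q = {elbo_poor:.4f}, exact q = {elbo_exact:.4f}")
```

With a mismatched $q$ the estimate sits below $\log p(x)$ by the KL between $q$ and the true posterior; when $q$ equals the true posterior the bound is tight.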

If you have this memorised or tattooed on your arm, we are ready to go!

1. A more intuitive (visualisable) guide to ELBO

Paper discussed: ELBO surgery: yet another way to carve up the variational evidence lower bound, work by Matthew Hoffman and Matthew Johnson.

This work provides a very intuitive perspective of the VAE objective by decomposing and rewriting ELBO. For a batch of N observations $X=\{x_n\}_{n=1}^N$ and their corresponding latent codes $Z=\{z_n\}_{n=1}^N$, ELBO can be rewritten as:

\begin{align*} \mathcal{L}(\theta, \phi) &= \underbrace{\left[ \frac{1}{N} \sum^N_{n=1} \mathbb{E}_{q(z_n\mid x_n)} [\log p(x_n \mid z_n)] \right]}_{\color{#4059AD}{\text{(1) Average reconstruction}}} - \underbrace{(\log N - \mathbb{E}_{q(z)}[\mathbb{H}[q(x_n\mid z)]])}_{\color{#EE6C4D}{\text{(2) Index-code mutual info}}} \notag \\ & - \underbrace{\text{KL}(q(z)\,\|\, p(z))}_{\color{#86CD82}{\text{(3) KL between q and p}}} \notag \end{align*}

where $q(z)$ is the marginal, i.e. $q(z)=\sum^{N}_{n=1}q(z,x_n)$. With each of the $N$ observations weighted equally, this is the average aggregated posterior $q^{\text{avg}}(z)=\frac{1}{N}\sum^N_{n=1}q(z \mid x_n)$.

So what is the point of all this? Well, what’s interesting with this decomposition is that (1) average reconstruction and (2) index-code mutual information have opposing effects on the latent space:

Term (1) encourages accurate reconstruction of observations, which typically forces separated encoding for each $x_n$;
Term (2) maximises the entropy of $q(x_n\mid z)$, thereby promoting overlapping encodings $q(z\mid x_n)$ for distinct observations.

We visualise these effects in the graph below for two observations $x_1,x_2$ and their corresponding latents $z_1,z_2$. Plain and simple, (1) encourages separate encodings by “squishing” each latent code, and (2) “stretches” them, resulting in more overlap between $z_1$ and $z_2$.

Fig. Visualisation of effect of term (1) and (2). Dotted lines represent inference model $\phi$ and solid lines generative model $\theta$.

This now leaves us with term (3), which is the only term that involves the prior. It regularises the aggregated posterior towards the prior by minimising the KL divergence between $q^{\text{avg}}(z)$ and $p(z)$. Theoretically speaking, $q^{\text{avg}}(z)$ can be arbitrarily close to $p(z)$ without losing expressivity of the posterior; in practice, however, a large value of (3) usually indicates an unwanted regularisation effect from the prior.
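If you would like to convince yourself that terms (2) and (3) really add up to the familiar per-datapoint KL term, here is a small numerical check. It is only a sketch under simplifying assumptions: the latent is discrete with a handful of states, the posteriors and prior are random probability vectors, and all variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Z = 4, 6                                        # N data points, Z discrete latent states

q_z_given_n = rng.dirichlet(np.ones(Z), size=N)    # row n plays the role of q(z | x_n)
p_z = rng.dirichlet(np.ones(Z))                    # prior p(z)

def kl(a, b):
    """KL divergence between two discrete distributions."""
    return np.sum(a * (np.log(a) - np.log(b)))

# KL term of the standard ELBO, averaged over the data points.
avg_kl = np.mean([kl(q_z_given_n[n], p_z) for n in range(N)])

# ELBO-surgery split of that same quantity.
q_z = q_z_given_n.mean(axis=0)                     # aggregated posterior q(z)
q_n_given_z = q_z_given_n / (N * q_z)              # Bayes rule with q(n) = 1/N
entropy_n_given_z = -np.sum(q_n_given_z * np.log(q_n_given_z), axis=0)
index_code_mi = np.log(N) - np.sum(q_z * entropy_n_given_z)   # term (2)
marginal_kl = kl(q_z, p_z)                                     # term (3)

print(avg_kl, index_code_mi + marginal_kl)
```

The two printed numbers agree up to floating point error, which is the whole point of the decomposition: nothing is added or removed, the KL is just carved up differently.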

The paper Disentangling Disentanglement in Variational Autoencoders also does a great job of analysing and utilising the effect of these three terms for disentanglement in VAEs, and I strongly recommend that you go and have a look.

2. “K steps away” from basic ELBO: IWAE

Paper discussed: Importance Weighted Autoencoders, work by Yuri Burda, Roger Grosse & Ruslan Salakhutdinov

Hopefully the previous section served as a good warm-up for this blog, and now you have a better intuition on how ELBO affects the graphical model. Now, we will move just a tad away from the original ELBO, to a more advanced K-sample lower bound estimator: IWAE.

Importance Weighted Autoencoders (IWAE) is probably my favourite machine learning trick (and I know about 4). It is a simple yet powerful way to improve the performance of VAEs, and you’re really missing out if you went through the trouble of implementing ELBO but stopped there. Here, I will talk about the formulation of IWAE and its 3 benefits: a tighter lower bound estimate, importance-weighted gradients and a complex implicit distribution.

Formulation

IWAE proposes a tighter lower bound on $\log p(x)$. As a reference, here’s the original ELBO again:

\begin{align*} \mathcal{L}(\theta, \phi) = \mathbb{E}_{z\sim q(z\mid x)} \left[\log \frac{p(x,z)}{q(z\mid x)}\right] \leq \log p(x) \notag \end{align*}

A common practice to acquire a better estimate to $\log p(x)$ with ELBO is to use its multisample variations, by taking $K$ samples from $q(z\mid x)$:

\begin{align*} \mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{z_1, z_2 \cdots z_K \sim q(z\mid x)} \left[\frac{1}{K} \sum_{k=1}^K \log \frac{p(x,z_k)}{q(z_k\mid x)}\right]\leq \log p(x) \notag \end{align*}

IWAE simply swaps the order of the average over $K$ and the $\log$ in the above, giving us:

\begin{align*} \mathcal{L}_{\text{IWAE}}(\theta, \phi) = \mathbb{E}_{z_1, z_2 \cdots z_K \sim q(z\mid x)} \left[\log \frac{1}{K}\sum_{k=1}^K \frac{p(x,z_k)}{q(z_k\mid x)}\right]\leq \log p(x) \notag \end{align*}

Benefit 1: Tighter lower bound estimate

It is easy to see that by Jensen’s inequality, $\mathcal{L}_{\text{VAE}}(\theta, \phi)\leq\mathcal{L}_{\text{IWAE}}(\theta, \phi)$. This means that IWAE is a tighter lower bound to the marginal log likelihood.
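The ordering of the two bounds is easy to check numerically. The sketch below reuses an assumed toy model ($p(z)=\mathcal{N}(0,1)$, $p(x\mid z)=\mathcal{N}(z,1)$, so $p(x)=\mathcal{N}(0,2)$) with a deliberately poor $q(z\mid x)=\mathcal{N}(0,1)$; the function names are illustrative only.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def vae_and_iwae_bounds(x, mu, sigma, K, n_outer=20_000, seed=0):
    """Estimate the K-sample VAE bound and the IWAE bound from the same samples."""
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.standard_normal((n_outer, K))           # K samples per row
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, mu, sigma)
    vae = np.mean(np.mean(log_w, axis=1))                        # average of logs
    # log of the average weight, computed stably by subtracting the row maximum
    row_max = log_w.max(axis=1, keepdims=True)
    log_mean_w = row_max[:, 0] + np.log(np.mean(np.exp(log_w - row_max), axis=1))
    iwae = np.mean(log_mean_w)
    return vae, iwae

x = 1.5
log_px = log_normal(x, 0.0, np.sqrt(2.0))
vae_bound, iwae_bound = vae_and_iwae_bounds(x, mu=0.0, sigma=1.0, K=10)
print(f"VAE: {vae_bound:.4f}  IWAE: {iwae_bound:.4f}  log p(x): {log_px:.4f}")
```

Because Jensen's inequality holds for every row of samples, the IWAE estimate is never below the VAE estimate here, and both stay below the exact $\log p(x)$.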

Benefit 2: Importance-weighted gradients

Things become even more interesting if we look at the gradient of IWAE compared to the original ELBO:

\begin{align*} \nabla_\Theta \mathcal{L}_{\text{VAE}}(\theta,\phi)&=\mathbb{E}_{z_1, z_2 \cdots z_K \sim q(z\mid x)} \left[\sum_{k=1}^K \frac{1}{K} \nabla_\Theta \log \frac{p(x,z_k)}{q(z_k\mid x)}\right]\\ \nabla_\Theta \mathcal{L}_{\text{IWAE}}(\theta,\phi)&=\mathbb{E}_{z_1, z_2 \cdots z_K \sim q(z\mid x)} \left[\sum_{k=1}^K w_k \nabla_\Theta \log \frac{p(x,z_k)}{q(z_k\mid x)}\right], \end{align*}

where

\begin{align*} w_k = \frac{\frac{p(x,z_k)}{q(z_k\mid x)}}{\sum^K_{i=1}\frac{p(x,z_i)}{q(z_i\mid x)}} \end{align*}

So we can see that in $\mathcal{L}_{\text{VAE}}$ the gradients of all samples are equally weighted by $1/K$, whereas the gradient of $\mathcal{L}_{\text{IWAE}}$ weights them by their relative importance $w_k$.
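In practice the weights $w_k$ are computed from log densities, and exponentiating those directly tends to underflow. A minimal sketch of the usual workaround (subtract the maximum before exponentiating) is below; the function name and the example numbers are made up for illustration.

```python
import numpy as np

def normalised_importance_weights(log_p_joint, log_q):
    """Return w_k = w~_k / sum_i w~_i, where log w~_k = log p(x, z_k) - log q(z_k | x)."""
    log_w = np.asarray(log_p_joint) - np.asarray(log_q)
    log_w = log_w - log_w.max()          # shifting by a constant leaves the ratio unchanged
    w = np.exp(log_w)
    return w / w.sum()

# Log densities of this magnitude would underflow to 0 if exponentiated naively.
log_p_joint = np.array([-1050.0, -1052.0, -1049.0])
log_q = np.array([-3.0, -2.5, -3.5])
w = normalised_importance_weights(log_p_joint, log_q)
print(w)
```

The weights sum to one, and the sample with the largest $\log p(x,z_k) - \log q(z_k\mid x)$ gets the largest say in the gradient.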

Benefit 3: Complex implicit distribution

However, this is not all of it: the authors of the original paper also showed that IWAE can be interpreted as standard ELBO, but with a more complex (implicit) posterior distribution $q_{IW}$, thanks to importance sampling. This is probably the most important take-away of IWAE, and I always like to go back to this plot from Reinterpreting IWAE as an intuitive demonstration of its power. Here, $K$ is the number of importance-weighted samples taken, and the left-most plot is the true distribution that we are trying to approximate with the 3 different $q_{IW}$. We can see that when $K=1$, the IWAE objective reduces to the original VAE ELBO, and the approximation to the true distribution is poor; as $K$ grows, the approximation becomes more and more accurate.

Side note: the paper Reinterpreting IWAE helped me a lot in understanding the IWAE objective, highly recommended. In addition, this blog post by Adam Kosiorek is also a very comprehensive interpretation of the topic.

3. Big! Gradient! Estimator! Variance!

Paper discussed: Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference by Geoffrey Roeder, Yuhuai Wu & David Duvenaud.

So far we have discussed two variational lower bounds in detail, ELBO and IWAE. Now it is high time to take them off their pedestals and talk about what’s wrong with them. As you can guess from the title of this section, this has something to do with gradient variance.

Recap: 2 types of gradient estimators

Despite my best effort to sound very excited about all this, I had definitely struggled to care about things like “gradient variance” in the past, largely because there seems to be so many different Monte Carlo gradient estimators out there. But not too long ago, I realised that there are only two very common ones that you need to care about: REINFORCE estimator and reparametrisation trick. I’m leaving some details about each of them here as a note-to-self, but here’s the key thing you need to remember if you want to skip this part and get to the good stuff:

REINFORCE estimator (score function): very general purpose, large variance;
Reparametrisation (path derivative): less general purpose, much smaller variance.

Portal to next section.

REINFORCE estimator (score function)

This is commonly used in Reinforcement Learning. It is named score function because it utilises this “cool little logarithm trick”:

\begin{align*} \nabla_{\theta} \log p(x;\theta) = \frac{\nabla_\theta p(x;\theta)}{p(x;\theta)} \end{align*}

So now, when we try to estimate the gradient of the expectation of some function $f(x)$ under the distribution $p(x;\theta)$, we can do the following:

\begin{align*} \nabla_{\theta}\mathbb{E}_{x\sim p(x;\theta)}[f(x)] = \mathbb{E}_{x\sim p(x;\theta)}[f(x)\nabla_{\theta} \log p(x;\theta)] \end{align*}

and now we can easily estimate the gradient by performing MC sampling, taking $N$ samples $\hat{x}^{(n)}\sim p(x;\theta)$:

\begin{align*} \nabla_{\theta}\mathbb{E}_{x\sim p(x;\theta)}[f(x)] \approx \frac{1}{N}\sum^N_{n=1}f(\hat{x}^{(n)})\nabla_\theta \log p(\hat{x}^{(n)};\theta). \end{align*}

Keep in mind that this score function estimator, despite being unbiased, has very large variance from multiple sources (see here in section 4.3.1 for details). It is however very flexible and places almost no requirements on $p(x;\theta)$ or $f(x)$, hence its popularity.
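To show off that flexibility, here is a minimal sketch of the score function estimator on a discrete distribution, where no reparametrisation is available. The setup is an assumption for illustration: $x\sim\text{Bernoulli}(\theta)$ and $f(x)=(x-0.3)^2$, for which the true gradient is simply $f(1)-f(0)$.

```python
import numpy as np

def f(x):
    return (x - 0.3) ** 2

def score_function_samples(theta, n_samples, rng):
    """Per-sample REINFORCE terms f(x) * d/dtheta log p(x; theta) for x ~ Bernoulli(theta)."""
    x = (rng.random(n_samples) < theta).astype(float)
    score = x / theta - (1.0 - x) / (1.0 - theta)    # d/dtheta log p(x; theta)
    return f(x) * score

rng = np.random.default_rng(0)
theta = 0.6
samples = score_function_samples(theta, 200_000, rng)
grad_estimate = samples.mean()
true_grad = f(1.0) - f(0.0)       # d/dtheta [theta f(1) + (1 - theta) f(0)]

print(f"estimate = {grad_estimate:.4f}, truth = {true_grad:.4f}, "
      f"per-sample std = {samples.std():.3f}")
```

The estimate lands on the true gradient, but note how large the per-sample standard deviation is relative to the gradient itself.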

Reparametrisation trick (path derivative)

I assume you are familiar with the reparametrisation trick if you got all the way here, but I am a completionist, so here’s a quick recap:

The reparametrisation trick utilises the property that for continuous distribution $p(x;\theta)$, the following sampling processes are equivalent:

\begin{align*} \hat{x} \sim p(x;\theta) \quad \equiv \quad \hat{x}=g(\hat{\epsilon},\theta) , \hat{\epsilon} \sim p(\epsilon) \end{align*}

The most common usage of this is seen in VAEs, where instead of directly sampling from the posterior, we typically take a random sample from a standard Normal distribution $\hat{\epsilon} \sim \mathcal{N}(0,1)$, scale it by the standard deviation and shift it by the mean of the posterior computed from our inference model. Here’s that familiar illustration again as a reminder (image from Kingma’s NeurIPS2015 workshop slides):

This method is much less general-purpose than the score function estimator, since it requires $p(x;\theta)$ to be a continuous distribution and also requires access to its underlying sampling path. However, by trading off generality, we get an estimator with much lower variance.
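The variance gap is easy to see on a toy problem where both estimators apply. The sketch below assumes $x\sim\mathcal{N}(\theta,1)$ and $f(x)=x^2$, so the true gradient of $\mathbb{E}[f(x)]=\theta^2+1$ is $2\theta$; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.0, 200_000
eps = rng.standard_normal(n)
x = theta + eps                               # x = g(eps, theta), so dx/dtheta = 1

# Path derivative: df/dx * dx/dtheta with f(x) = x^2
reparam_samples = 2.0 * x * 1.0
# Score function: f(x) * d/dtheta log p(x; theta), which is f(x) * (x - theta) here
reinforce_samples = x ** 2 * (x - theta)

reparam_grad, reparam_var = reparam_samples.mean(), reparam_samples.var()
reinforce_grad, reinforce_var = reinforce_samples.mean(), reinforce_samples.var()

print(f"reparam:   grad = {reparam_grad:.3f}, var = {reparam_var:.2f}")
print(f"REINFORCE: grad = {reinforce_grad:.3f}, var = {reinforce_var:.2f}")
```

Both estimators agree on the gradient, but on this example the per-sample variance of the score function version is several times larger.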

Side note: For readers who’re not afraid of gradients, here is a great survey paper on MC gradient estimators.

The lurking score function in reparametrisation trick

At this point we should all be familiar with reparametrisation trick used in VAEs for gradient estimation, but here we need to formalise it a bit more for the derivation in this section:

The reparametrisation trick expresses a sample $z$ from a parametric distribution $q_\phi(z)$ as a deterministic function of the parameters $\phi$ and a random variable $\hat{\epsilon}$ with some fixed distribution, i.e. $z=t(\hat{\epsilon}, \phi)$. For example, if $q_\phi$ is a diagonal Gaussian, then for $\hat{\epsilon} \sim \mathcal{N}(0, 1),\ z=\mu+\sigma\hat{\epsilon}$.

We already know that reparametrisation trick (path derivative) has the benefit of lower variance for gradient estimation compared to score function. The kicker here is — the gradient of ELBO actually contains a score function term, causing the estimator to have large variance!

To see this, we can first rewrite ELBO as the following:

\begin{align*} \mathcal{L}(\theta,\phi) = \mathbb{E}_{z\sim q_\phi(z\mid x)}\left[ \log p_\theta(x\mid z) + \log p(z) - \log q_\phi(z\mid x) \right] \end{align*}

We can then take the total derivative of the term within expectation w.r.t. $\phi$:

\begin{align*} \nabla_{\phi} (\hat{\epsilon},\phi) &= \nabla_{\phi} \left[ \log p_\theta(x\mid z) + \log p(z) - \log q_\phi(z\mid x) \right]\\ &= \nabla_{\phi} \left[ \log p_\theta(z\mid x) + \log p(x) - \log q_\phi(z\mid x) \right]\\ &= \underbrace{\nabla_{z} \left[ \log p_\theta(z\mid x) - \log q_\phi(z\mid x) \right] \nabla_{\phi}t(\hat{\epsilon},\phi)}_{\color{#55828B}{\text{path derivative}}} - \underbrace{\nabla_{\phi}\log q_\phi(z\mid x)}_{\color{#87BBA2}{\text{score function}}} \end{align*}

So we see that $\nabla_{\phi} (\hat{\epsilon},\phi)$ decomposes into 2 terms: a path derivative component that measures the dependence on $\phi$ only through the sample $z$, and a score function component that measures the dependence on $\phi$ through $\log q_\phi$ directly, without considering how the sample $z$ changes as a function of $\phi$.

So, it is not surprising to learn that the large variance of the score function term causes problems here: the authors discovered that even when the variational posterior $q_\phi(z\mid x)$ exactly matches the true posterior $p(z\mid x)$, so that the path derivative component in $\nabla_{\phi} (\hat{\epsilon},\phi)$ reduces to zero, the score function term still has non-zero variance.

So what do we do here? Well, the authors propose to simply drop the score function component. Since the score function has zero expectation under $q_\phi(z\mid x)$, the resulting gradient estimator is still unbiased:

\begin{align*} \hat{\nabla}_{\phi} (\hat{\epsilon},\phi) &= \underbrace{\nabla_{z} \left[ \log p_\theta(z\mid x) - \log q_\phi(z\mid x) \right] \nabla_{\phi}t(\hat{\epsilon},\phi)}_{\color{#55828B}{\text{path derivative}}} \end{align*}
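Here is a minimal sketch of the "variance at the optimum" problem, written with analytic derivatives so that no autodiff library is needed. It assumes a one-dimensional toy setting where the true posterior $p(z\mid x)=\mathcal{N}(m,s^2)$ is known and the variational posterior $q_\phi=\mathcal{N}(\mu,\sigma^2)$ is set exactly equal to it; we only look at the gradient with respect to $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 0.3, 0.8            # true posterior p(z|x) = N(m, s^2), assumed known for this toy
mu, sigma = m, s           # variational posterior set exactly to the true posterior

eps = rng.standard_normal(50_000)
z = mu + sigma * eps       # z = t(eps, phi), so dz/dmu = 1

# Path derivative: d/dz [log p(z|x) - log q(z|x)] * dz/dmu
path = (-(z - m) / s**2 + (z - mu) / sigma**2) * 1.0
# Score function: d/dmu log q(z|x) with z held fixed
score = (z - mu) / sigma**2

total_derivative = path - score    # standard reparametrised gradient samples
stl = path                         # score function term dropped

print(f"total derivative: mean = {total_derivative.mean():.4f}, var = {total_derivative.var():.4f}")
print(f"path only:        mean = {stl.mean():.4f}, var = {stl.var():.4f}")
```

Both estimators have (approximately) zero mean, as they should at the optimum, but only the path-derivative-only version is exactly zero for every sample. The total derivative keeps jittering with variance $1/\sigma^2$.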

It sounds a bit wacky at first, but this approach works miracles, as the authors show in this plot:

We can clearly see that by using the path-derivative-only gradient, the variance of the gradient estimate is much lower and $\phi$ converges to the true variational parameters much faster.

Note that this large gradient variance problem applies to any ELBO, including both standard VAE and IWAE. However, we will show in the next section that IWAE has its own unique problem caused by its $K$ importance samples. More colloquially:

4. Tighter lower bounds aren’t necessarily better

Paper discussed: Tight Variational Bounds are Not Necessarily Better, work by Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood & Yee Whye Teh

This builds on the previous Sticking the Landing paper, and discovers that the gradient variance caused by the score function becomes an even bigger problem when using a multi-sample estimator like IWAE.

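Here it’s not just a variance problem: estimators with small expected values need proportionally smaller variance to be estimated accurately. In other words, what we really care about is the ratio between the expected gradient and its standard deviation, i.e. the signal-to-noise ratio (SNR):

\begin{align*} \text{SNR}_{M,K}(\phi) = \left| \frac{\mathbb{E}\left[\nabla_{M,K}(\phi)\right]}{\sigma\left[\nabla_{M,K}(\phi)\right]} \right| \end{align*}

Here $\nabla_{M,K}(\phi)$ refers to the gradient estimate with respect to $\phi$ and $\sigma[\cdot]$ to its standard deviation. The two quantities we care about are: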

$M$: the number of samples used for Monte Carlo estimate of ELBO’s gradient;
$K$: the number of samples used for IWAE to estimate a tighter lower bound to $\log p(x)$.

Ideally we want a large SNR for the gradient estimators of both $\theta$ and $\phi$, since a small SNR means that the estimate is dominated by noise. The main contribution of this paper is discovering the following, very surprising relationships:

\begin{align*} \text{SNR}(\theta) &= \mathcal{O}(\sqrt{MK})\\ \text{SNR}(\phi) &= \mathcal{O}(\sqrt{M/K}) \end{align*}

This tells us that while increasing the number of IWAE samples $K$ gets us a tighter lower bound, it actually worsens $\text{SNR}(\phi)$, meaning that a large $K$ hurts the performance of the gradient estimator for $\phi$! Also note that the same effect is not observed for the generative model $\theta$, but the damage on inference model learning cannot simply be mitigated by increasing $M$.
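If SNR is a new quantity for you, the sketch below shows how one might measure it empirically from repeated gradient estimates. To be clear about what it is not: it uses plain Gaussian noise around a made-up "true gradient" rather than an actual IWAE gradient, so it only illustrates the $\sqrt{M}$ part of the scaling, not the dependence on $K$. The helper names are hypothetical.

```python
import numpy as np

def snr(grad_samples):
    """Empirical signal-to-noise ratio |mean| / std of repeated gradient estimates."""
    g = np.asarray(grad_samples)
    return np.abs(g.mean(axis=0)) / g.std(axis=0)

rng = np.random.default_rng(0)
true_grad, noise_std = 0.5, 2.0    # assumed values, chosen only for illustration

def averaged_estimates(M, n_repeats=20_000):
    """Each row averages M noisy single-sample estimates of the same gradient."""
    noisy = true_grad + noise_std * rng.standard_normal((n_repeats, M))
    return noisy.mean(axis=1)

snr_m1 = snr(averaged_estimates(1))
snr_m16 = snr(averaged_estimates(16))
print(f"SNR with M=1: {snr_m1:.3f}, with M=16: {snr_m16:.3f}, ratio: {snr_m16 / snr_m1:.2f}")
```

Averaging 16 times more samples buys roughly a 4 times higher SNR, which matches the $\sqrt{M}$ factor in both relationships above.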

The authors gave a very comprehensive proof to their finding, so I’m going to leave the mathy heavy lifting to the original paper :) We shall march on to the last section of this blog: an elegant solution to solve the large variance in ELBO gradient estimators — DReG.

5. How to fix IWAE?

Paper discussed: Doubly Reparametrised Gradient Estimators for Monte Carlo Objectives, work by George Tucker, Dieterich Lawson, Shixiang Gu & Chris J. Maddison.

In section 3 we talked about the large gradient variance caused by the score function lurking in the gradient estimation, and section 4 about how this is exacerbated for IWAE. I’ll put the total derivative we have seen in section 3 here as a reference, but to make it more relevant, this time we rewrite it for IWAE that uses $K$ importance samples:

\begin{align*} \nabla_{\phi} (\hat{\epsilon},\phi) = \mathbb{E}_{\hat{\epsilon}_{1:K}} \underbrace{\left[\sum_{k=1}^K w_k \nabla_{z} \left[ \log p_\theta(z\mid x) - \log q_\phi(z\mid x) \right] \nabla_{\phi}t(\hat{\epsilon},\phi)\right.}_{\color{#55828B}{\text{path derivative}}} - \underbrace{\left.\sum_{k=1}^K w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]}_{\color{#87BBA2}{\text{score function}}} , \end{align*}

where

\begin{align*} w_k =\frac{\tilde{w}_k}{\sum^K_{i=1} \tilde{w}_i}= \frac{\frac{p(x,z_k)}{q(z_k\mid x)}}{\sum^K_{i=1}\frac{p(x,z_i)}{q(z_i\mid x)}}. \end{align*}

This is not much of a change from the total derivative of the original ELBO, as we have mentioned in section 2 that IWAE simply weights the gradients of the VAE ELBO by the relative importance of each sample $w_k$.

We have learned that one way to deal with it is to completely remove the score function term. However, is there a better way than completely discarding a term in gradient estimation?

Well obviously I wouldn’t be asking this question here if the answer weren’t yes — authors in this paper proposed to reduce the variance by doing another reparametrisation on the score function term! Here’s how:

Taking the score function term in the total derivative of IWAE, we can first take the $\sum_k$ term out of the expectation:

\begin{align*} \mathbb{E}_{\hat{\epsilon}_{1:K}} \underbrace{\left[\sum_{k=1}^K w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]}_{\color{#87BBA2}{\text{score function}}} = \sum_{k=1}^K \mathbb{E}_{\hat{\epsilon}_{1:K}} \left[ w_k \nabla_{\phi}\log q_\phi(z\mid x)\right] \end{align*}

Now we can just ignore the sum and focus on what’s in the expectation $\mathbb{E}_{\hat{\epsilon}_{1:K}}$. Since the derivative is taken with respect to $\phi$, we can treat $\epsilon$, the pseudo sample we take for reparametrisation trick, as a constant. Therefore, it is possible to substitute $\epsilon$ by the actual sample from our approximated posterior $z$ — also a constant as far as $\nabla_\phi$ is concerned. This way we have:

\begin{align*} \mathbb{E}_{\hat{\epsilon}_{1:K}} \left[ w_k \nabla_{\phi}\log q_\phi(z\mid x)\right] &= \mathbb{E}_{z_{1:K}} \left[ w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]\\\\ &= \mathbb{E}_{z_{-k}} \underbrace{\mathbb{E}_{z_k} \left[ w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]}_{\text{A }\color{#EE6C4D}{\text{REINFORCE}}\text{ term appears!}} \end{align*}

By doing this substitution, a REINFORCE term appears! I’ll just let that sink in for a bit.

I should clarify that previously we only had a score function term: since the expectation was over $\hat{\epsilon}$ instead of actual samples from $q_\phi(z\mid x)$, it was not actually REINFORCE.

This is important because REINFORCE and the reparametrisation trick are interchangeable, as we see below:

\begin{align*} \underbrace{\mathbb{E}_{q_\phi (z\mid x)}\left[ f(z)\frac{\partial}{\partial \phi}\log q_\phi(z\mid x) \right]}_{\color{#EE6C4D}{\text{REINFORCE}}} = \underbrace{\mathbb{E}_{\hat{\epsilon}} \left[ \frac{\partial f(z)}{\partial z} \frac{\partial z(\hat{\epsilon}, \phi)}{\partial \phi} \right]}_{\text{reparametrisation trick}} \end{align*}

If we substitute the above back into the original total derivative of IWAE, after some math montage, we can simplify it to the following:

\begin{align*} \nabla_{\phi} (\hat{\epsilon},\phi) = \mathbb{E}_{\hat{\epsilon}_{1:K}} \left[ \sum^K_{k=1} (w_k)^2 \frac{\partial\log \tilde{w}_k}{\partial z_k} \frac{\partial z_k}{\partial \phi} \right] \end{align*}
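As a sanity check of the final expression, here is a minimal NumPy sketch with hand-written derivatives. It is not the implementation from any paper or library: it assumes a toy model ($p(z)=\mathcal{N}(0,1)$, $p(x\mid z)=\mathcal{N}(z,1)$, $q(z\mid x)=\mathcal{N}(\mu,\sigma^2)$) and only estimates the gradient with respect to $\mu$, for which $\partial z_k/\partial\mu=1$.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def dreg_grad_mu(x, mu, sigma, K, n_outer, rng):
    """Doubly reparametrised estimates of d L_IWAE / d mu, one per row of K samples."""
    eps = rng.standard_normal((n_outer, K))
    z = mu + sigma * eps                                    # dz_k/dmu = 1
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, mu, sigma)
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)                    # normalised weights w_k
    # d log w~_k / d z_k, with mu and sigma treated as constants
    dlogw_dz = -z + (x - z) + (z - mu) / sigma**2
    return np.sum(w**2 * dlogw_dz * 1.0, axis=1)

rng = np.random.default_rng(0)
x, mu, sigma = 1.5, 0.2, 0.9
grad_k1 = dreg_grad_mu(x, mu, sigma, K=1, n_outer=100_000, rng=rng)
grad_k8 = dreg_grad_mu(x, mu, sigma, K=8, n_outer=100_000, rng=rng)

# For K = 1 the bound is the plain ELBO, whose exact gradient w.r.t. mu is x - 2 * mu here.
print(f"K=1: {grad_k1.mean():.4f} (exact {x - 2 * mu:.4f}),  K=8: {grad_k8.mean():.4f}")
```

With $K=1$ the weights are all one and the estimator reduces to the path-derivative-only gradient from section 3, which is a handy way to test an implementation before trusting it at larger $K$.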

This is actually very easy to implement. Cheeky little plug: we used this objective in our paper on multimodal VAE learning, and you can find the code here, which comes with a handy implementation of DReG in PyTorch.

We are done!

A heartfelt congratulation if you got all the way here, well done! Leave a comment if you have any question, if you find this helpful please share on twitter/facebook :)

Special thanks to my supervisor Dr. Siddharth Narayanaswamy for guiding me through this literature with great insight and extreme patience.

Gaussian Process, not quite for dummies

2019-09-05T00:00:00-07:00

Before diving in

For a long time, I recall having this vague impression about Gaussian Processes (GPs) being able to magically define probability distributions over sets of functions, yet I procrastinated reading up about them for many many moons. However, as always, I’d like to think that this is not just due to my procrastination superpowers. Whenever I look up “Gaussian Process” on Google, I find these well-written tutorials with vivid plots that explain everything up until non-linear regression in detail, but shy away at the very first glimpse of any sort of information theory. The key takeaway is always,

A Gaussian process is a probability distribution over possible functions that fit a set of points.

While memorising this sentence does help if some random stranger comes up to you on the street and asks for a definition of Gaussian Process – which I’m sure happens all the time – it doesn’t get you much further beyond that. In what range does the algorithm search for “possible functions”? What gives it the capacity to model things on a continuous, infinite space?

Confused, I turned to “the Book” in this area, Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams. I have friends working in more statistical areas who swear by this book, but after spending half an hour just to read 2 pages about linear regression I went straight into an existential crisis. I’m sure it’s a great book, but the math is quite out of my league.

So what more is there? Thankfully I found the above lecture by Dr. Richard Turner on YouTube, which was a great introduction to GPs, and some of its state-of-the-art approaches. After watching this video, reading the Gaussian Processes for Machine Learning book became a lot easier. So I decided to compile some notes for the lecture, which can now hopefully help other people who are eager to more than just scratch the surface of GPs by reading some “machine learning for dummies” tutorial, but don’t quite have the claws to take on a textbook.

Acknowledgement: the figures in this blog are from Dr. Richard Turner’s talk “Gaussian Processes: From the Basic to the State-of-the-Art”, which I highly recommend! Have a lookie here: Portal to slides.

Motivation: non-linear regression

Of course, like almost everything in machine learning, we have to start from regression. Let’s revisit the problem: somebody comes to you with some data points (red points in image below), and we would like to make some prediction of the value of $y$ with a specific $x$.

In non-linear regression, we fit some nonlinear curve to the observations. The higher the degree of the polynomial you choose, the more closely it will fit the observations.

This sort of traditional non-linear regression, however, typically gives you one function that it considers to fit these observations the best. But what about the other ones that are also pretty good? What if we observed one more point, and one of those other functions ended up being a much better fit than the “best” solution?

To solve this problem, we turn to the good Ol’ Gaussians.

The world of Gaussians

Recap

Here we cover the basics of multivariate Gaussian distribution. If you’re already familiar with this, skip to the next section 2D Gaussian Examples.

The Multivariate Gaussian distribution is also known as the joint normal distribution, and is the generalisation of the univariate Gaussian distribution to high dimensional spaces. Formally, the definition is:

A random variable is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution.

Mathematically, $X = (X_1, \dots, X_k)^T$ has a multivariate Gaussian distribution if $Y=a_1X_1 + a_2X_2 + \dots + a_kX_k$ is normally distributed for any constant vector $a \in \mathbb{R}^k$.

Note: if all k components are independent Gaussian random variables, then $X$ must be multivariate Gaussian (because the sum of independent Gaussian random variables is always Gaussian).

Another note: the sum of random variables is different from the sum of distributions. The sum of two Gaussian densities gives you a Gaussian mixture, which is not Gaussian except in special cases.

2D Gaussian Examples

Covariance matrix

Here is an example of a 2D Gaussian distribution with mean 0, with the oval contours denoting points of constant probability.

The covariance matrix, denoted as $\Sigma$, tells us (1) the variance of each individual random variable (on diagonal entries) and (2) the covariance between the random variables (off diagonal entries). The covariance matrix in above image indicates that $y_1$ and $y_2$ are positively correlated (with $0.7$ covariance), therefore the somewhat “stretchy” shape of the contour. If we keep reducing the covariance while keeping the variance unchanged, the following transition can be observed:

Note that when $y_1$ is independent from $y_2$ (rightmost plot above), the contours are spherical.

Conditioning

With multivariate Gaussian, another fun thing we can do is conditioning. In 2D, we can demonstrate this graphically:

We fix the value of $y_1$ to compute the density of $y_2$ along the red line, thereby conditioning on $y_1$. Note that since the joint distribution is Gaussian, by conditioning we get a Gaussian back: $y_2 \mid y_1 \sim \mathcal{N}(\mu, \sigma^2)$.

We can also visualise how this conditioned Gaussian changes as the correlation drops: when the correlation is $0$, $y_1$ tells you nothing about $y_2$, so for $y_2$ the mean drops to $0$ and the variance becomes high.
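The conditional mean and variance have a simple closed form in 2D, which makes the behaviour above easy to reproduce. The sketch below uses the standard Gaussian conditioning formulas; the function name `condition_2d` and the example covariances are just for illustration.

```python
import numpy as np

def condition_2d(mean, cov, y1):
    """Mean and variance of y2 | y1 for a 2D Gaussian with the given mean and covariance."""
    mu1, mu2 = mean
    s11, s12, s22 = cov[0, 0], cov[0, 1], cov[1, 1]
    cond_mean = mu2 + s12 / s11 * (y1 - mu1)
    cond_var = s22 - s12 ** 2 / s11
    return cond_mean, cond_var

mean = np.zeros(2)
for rho in (0.9, 0.7, 0.0):
    cov = np.array([[1.0, rho], [rho, 1.0]])
    cond_mean, cond_var = condition_2d(mean, cov, y1=1.0)
    print(f"covariance {rho}: y2 | y1=1 has mean {cond_mean:.2f}, variance {cond_var:.2f}")
```

As the covariance shrinks, the conditional mean slides back to $0$ and the variance grows back to the marginal variance of $y_2$.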

High dimensional gaussian: a new interpretation

2D Gaussian

The oval contour graph of Gaussian, while providing information on the mean and covariance of our multivariate Gaussian distribution, does not really give us much intuition on how the random variables correlate with each other during the sampling process.

Therefore, consider this new interpretation that can be plotted as such:

Take the oval contour graph of the 2D Gaussian (left-top in below image) and choose a random point on the graph. Then, plot the value of $y_1$ and $y_2$ of that point on a new graph, at index = $1$ and $2$, respectively.

Under this setting, we can now visualise the sampling operation in a new way by taking multiple “random points” and plot $y_1$ and $y_2$ at index $1$ and $2$ multiple times. Because $y_1$ and $y_2$ are correlated ($0.9$ correlation), as we take multiple samples, the bar on the index graph only “wiggles” ever so slightly as the two endpoints move up and down together.

For conditioning, we can simply fix one of the endpoints on the index graph (in below plots, fix $y_1$ to 1) and sample from $y_2$.

Higher dimensional Gaussian

5D Gaussian

Now we can consider a higher dimension Gaussian, starting from 5D — so the covariance matrix is now 5x5.

Take a second to have a good look at the covariance matrix, and notice:

All variances (diagonal) are equal to 1;
The further away the indices of two points are, the less correlated they are. For instance, the correlation between $y_1$ and $y_2$ is quite high, between $y_1$ and $y_3$ lower, and between $y_1$ and $y_4$ lower still.

We can again condition on $y_1$ and take samples for all the other points. Notice that $y_2$ is moving less compared to $y_3$ - $y_5$ because it is more correlated to $y_1$.

20D Gaussian

To make things more intuitive, for 20D Gaussian we replace the numerical covariance matrix by a colour map, with warmer colors indicating higher correlation:

This gives us samples that look like this:

Now look at what happens to the 20D Gaussian conditioned on $y_1$ and $y_2$:

Hopefully you may now be thinking: “Ah, this is looking exactly like the nonlinear regression problem we started with!” And yes, indeed, this is exactly like a nonlinear regression problem where $y_1$ and $y_2$ are given as observations. Using this index plot with 20D Gaussian, we can now generate a family of curves that fits these observations. Even better, if we generate a number of them, we can compute the mean and variance of the fitting using these randomly generated curves. We visualise this in the plot below.

We can see from the above image that because of how the covariance matrix is structured (i.e. closer points have higher correlation), the points closer to the observations have very low uncertainty with non-zero mean, whereas the ones further from them have high uncertainty and zero mean. (Note that in reality, we don’t have to actually take many many many samples to estimate the mean and standard deviation: they can be computed analytically.)
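Here is a sketch of that analytical computation for a 20D Gaussian conditioned on $y_1$ and $y_2$. The covariance is built from a squared-exponential function of the integer indices; the lengthscale of 2 and the observed values are assumptions picked for illustration, not the numbers behind the figures.

```python
import numpy as np

def rbf(a, b, sigma=1.0, length=2.0):
    """Squared-exponential covariance between two sets of 1D inputs."""
    d = a[:, None] - b[None, :]
    return sigma ** 2 * np.exp(-0.5 * d ** 2 / length ** 2)

idx = np.arange(1, 21, dtype=float)      # indices 1..20
obs = np.array([0, 1])                   # positions of y_1 and y_2 in idx
rest = np.arange(2, 20)                  # positions of y_3..y_20
y_obs = np.array([1.0, 0.6])             # assumed observed values

K = rbf(idx, idx)
K_oo = K[np.ix_(obs, obs)]
K_ro = K[np.ix_(rest, obs)]
K_rr = K[np.ix_(rest, rest)]

# Gaussian conditioning with a zero prior mean
cond_mean = K_ro @ np.linalg.solve(K_oo, y_obs)
cond_cov = K_rr - K_ro @ np.linalg.solve(K_oo, K_ro.T)
cond_std = np.sqrt(np.clip(np.diag(cond_cov), 0.0, None))

print("y_3 :", round(cond_mean[0], 3), "+/-", round(cond_std[0], 3))
print("y_20:", round(cond_mean[-1], 3), "+/-", round(cond_std[-1], 3))
```

The point right next to the observations gets a non-zero mean and a small standard deviation, while the last point falls back to the prior: mean zero, standard deviation one.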

Here we also offer a slightly more exciting example where we condition on 4 points of the 20D Gaussian (and you wonder why everybody hates statisticians):

Getting “real”

The problem with this approach for nonlinear regression seems obvious – it feels like all the points on the x-axis have to be integers because they are indices, while in reality, we want to model observations with real values. One immediately obvious solution for this is, we can keep increasing the dimensionality of the Gaussian and calculate many many points close to the observation, but that is a bit clumsy.

The solution lies in how the covariance matrix is generated. Conventionally, $\Sigma$ is calculated using the following 2-step process:

\begin{align*} \Sigma (x_1, x_2) &= K(x_1, x_2) + I \sigma_y^2 \\ K(x_1, x_2) &= \sigma^2 e^{-\frac{1}{2l^2}(x_1 - x_2)^2} \end{align*}

The covariance matrices in all the above examples are computed using the Radial Basis Function (RBF) kernel $K(x_1, x_2)$ – all by taking integer values for $x_1$, $x_2$. This RBF kernel ensures the “smoothness” of the covariance matrix by generating large output values for inputs $x_1$, $x_2$ that are close to each other and smaller values for inputs that are further apart. Note that if $x_1=x_2$, $K(x_1, x_2)=\sigma^2$. We then take $K$ and add $I\sigma_y^2$ for the final covariance matrix to factor in noise – more on this later.

This means in principle, we can calculate this covariance matrix for any real-valued $x_1$ and $x_2$ by simply plugging them in. The real-valued $x$s effectively result in an infinite-dimensional Gaussian defined by the covariance matrix.

Now that, is a Gaussian process (mic drop).

Gaussian Process

Textbook definition

From the above derivation, you can view Gaussian process as a generalisation of multivariate Gaussian distribution to infinitely many variables. Here we also provide the textbook definition of GP, in case you had to testify under oath:

A Gaussian process is a collection of random variables, any finite number of which have consistent Gaussian distributions.

Just like a Gaussian distribution is specified by its mean and variance, a Gaussian process is completely defined by (1) a mean function $m(x)$ telling you the mean at any point of the input space and (2) a covariance function $K(x, x')$ that sets the covariance between points. The mean function can take any value, and the covariance matrix produced by the covariance function should be positive definite.

\[f(x) \sim \mathcal{G}\mathcal{P}(m(x), K(x, x'))\]

Parametric vs. non-parametric

Note that our Gaussian processes are non-parametric, as opposed to nonlinear regression models, which are parametric. And here’s a secret:

non-parametric model == model with infinite number of parameters

In a parametric model, we define the function explicitly with some parameters:

\[y(x) = f(x) + \epsilon \sigma_y\] \[p(\epsilon) = \mathcal{N}(0,1)\]

where $\sigma_y$ is the scale of the Gaussian noise, describing how noisy the observations are around the fitted function (graphically, it represents how far the data tend to lie from the fitted curve). We can place a Gaussian process prior over the nonlinear function – meaning, we assume that the function $f(x)$ above is drawn from the Gaussian process defined as follows:

\[p(f(x)\mid \theta) = \mathcal{G}\mathcal{P}(0, K(x, x'))\] \[K(x, x') = \sigma^2 \text{exp}(-\frac{1}{2l^2}(x-x')^2)\]

This GP will now generate lots of smooth/wiggly functions, and if you think your function falls into the family of functions that the GP generates, this is now a sensible way to perform non-linear regression.
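As a rough illustration of "generating functions", the sketch below draws a few samples from a zero-mean GP prior with the RBF kernel on a finite grid of inputs. The helper name `sample_gp_prior`, the `jitter` term and the `seed` argument are assumptions made for this example only.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    """RBF kernel matrix between two 1-D arrays of inputs."""
    diff = np.subtract.outer(x1, x2)
    return sigma**2 * np.exp(-diff**2 / (2.0 * length_scale**2))

def sample_gp_prior(x, n_samples=3, sigma=1.0, length_scale=1.0,
                    jitter=1e-6, seed=0):
    """Draw n_samples functions f ~ GP(0, K), evaluated at the inputs x."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(x, x, sigma, length_scale)
    # A small jitter on the diagonal keeps the factorisation numerically stable.
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))
    eps = rng.standard_normal((len(x), n_samples))  # eps ~ N(0, I)
    return (L @ eps).T                              # each row is one sampled function

x = np.linspace(0.0, 5.0, 50)
smooth = sample_gp_prior(x, length_scale=2.0)  # large l: slowly varying samples
wiggly = sample_gp_prior(x, length_scale=0.2)  # small l: rapidly varying samples
```

Plotting the rows of `smooth` and `wiggly` against `x` gives a feel for the family of functions a given kernel setting considers plausible.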

We can also add Gaussian noise $\sigma_y$ directly to the model, since the sum of Gaussian variables is also a Gaussian:

\[p(y(x)\mid \theta) = \mathcal{G}\mathcal{P}(0, K(x, x') + I\sigma_y^2)\]

In summary, GP regression is exactly the same as regression with parametric models, except you put a prior on the set of functions you’d like to consider for this dataset. The characteristics of this “set of functions” are defined by the kernel of choice ($K(x, x')$). Note that conventionally the prior has mean 0.

Hyperparameters

There are 2 hyperparameters here:

Vertical scale $\sigma$: describes how much span the function has vertically;
Horizontal scale $l$: describes how quickly the correlation between two points drops as the distance between them increases – a high $l$ gives you a smooth function, while lower $l$ results in a wiggly function.

Luckily, because $p(y \mid \theta)$ is Gaussian, we can compute its likelihood in closed form. That means we can just maximise the likelihood $p(y\mid \theta)$ with respect to these hyperparameters using a gradient optimiser.
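For a zero-mean GP with $\Sigma = K + I\sigma_y^2$, the quantity being maximised is the log marginal likelihood $\log p(y\mid\theta) = -\frac{1}{2}y^T\Sigma^{-1}y - \frac{1}{2}\log|\Sigma| - \frac{n}{2}\log 2\pi$. Below is a minimal NumPy sketch of this objective. The function name, the toy data and the default values are illustrative assumptions, and the gradient-based optimiser itself is left out.

```python
import numpy as np

def log_marginal_likelihood(x, y, sigma=1.0, length_scale=1.0, sigma_y=0.1):
    """log p(y | theta) for a zero-mean GP with an RBF kernel plus noise."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    diff = np.subtract.outer(x, x)
    K = sigma**2 * np.exp(-diff**2 / (2.0 * length_scale**2))
    Sigma = K + sigma_y**2 * np.eye(n)
    L = np.linalg.cholesky(Sigma)                        # Sigma = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Sigma^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |Sigma|
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

# Toy data (an assumption for illustration): a smooth function on [0, 5].
x = np.linspace(0.0, 5.0, 20)
y = np.sin(x)
for l in (0.05, 1.0, 5.0):
    print(l, log_marginal_likelihood(x, y, length_scale=l))
```

In practice one would hand the negative of this function (and ideally its gradient) to an optimiser and let it search over $\sigma$, $l$ and $\sigma_y$ jointly.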

Details for implementation

Before we start: here we are going to stay quite high level – no code will be shown, but you can easily find many implementations of GP on GitHub (personally I like this repo, it’s a Jupyter Notebook walk through with step-by-step explanation). However, this part is important to understanding how GP actually works, so try not to skip it.

Computation

Hopefully at this point you are wondering: this smooth function with infinite-dimensional covariance matrix thing all sounds well and good, but how do we actually do computation with an infinite by infinite matrix?

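Marginalisation baby! Imagine you have a multivariate Gaussian over two vector variables $y_1$ and $y_2$, where:

\[p(y_1, y_2) = \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}\right)\]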

Here, we partition the mean into the mean of $y_1$, $a$, and the mean of $y_2$, $b$; similarly, for the covariance matrix, we have $A$ as the covariance of $y_1$, $B$ that of $y_1$ and $y_2$, $B^T$ that of $y_2$ and $y_1$, and $C$ that of $y_2$. So now, we can easily compute the probability of $y_1$ using the marginalisation property:

\[p(y_1) = \int p(y_1, y_2)\, dy_2 = \mathcal{N}(a, A)\]

This formulation is extremely powerful – it allows us to calculate the likelihood of $y_1$ under the joint distribution $p(y_1, y_2)$, while completely ignoring $y_2$! We can now generalise from two variables to infinitely many, by altering our definition of $y_1$ and $y_2$ to:

$y_1$: contains a finite number of variables we are interested in;
$y_2$: contains all the variables we don’t care about, of which there are infinitely many.

Then, similar to the 2-variable case, we can compute the mean and covariance for the $y_1$ partition only, without having to worry about the infinite stuff in $y_2$. This nice little property allows us to think about a finite-dimensional projection of the underlying infinite object on our computer. We can forget about the infinite stuff happening under the hood.

Predictions

Taking the above $y_1$, $y_2$ example, but this time imagine all the observations are in partition $y_2$ and all the points we want to make predictions about are in $y_1$ (again, the infinite points are still in the background, let’s imagine we’ve shoved them into some $y_3$ that is omitted here).

To make predictions about $y_1$ given observations of $y_2$, we can then use Bayes' rule to calculate $p(y_1\mid y_2)$:

\[p(y_1\mid y_2) = \frac{p(y_1, y_2)}{p(y_2)}\]

Because $p(y_1)$, $p(y_2)$ and $p(y_1,y_2)$ are all Gaussians, $p(y_1\mid y_2)$ is also Gaussian. We can therefore compute $p(y_1\mid y_2)$ analytically:

\[p(y_1\mid y_2) = \mathcal{N}\left(a + BC^{-1}(y_2 - b),\; A - BC^{-1}B^T\right)\]

Note: here we catch a glimpse of the bottleneck of GP: we can see that this analytical solution involves computing the inverse of the covariance matrix of our observation $C^{-1}$, which, given $n$ observations, is an $O(n^3)$ operation. This is why we use Cholesky decomposition – more on this later.
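As a rough sketch of where the Cholesky decomposition enters (a standard trick, written with the block names $a$, $b$, $B$, $C$ from above; the symbols $L$ and $\alpha$ are introduced only for this illustration): instead of forming $C^{-1}$ explicitly, we factorise $C$ once and solve two triangular systems.

\[C = LL^T, \qquad \alpha = L^{-T}\left(L^{-1}(y_2 - b)\right), \qquad \mathbb{E}[y_1 \mid y_2] = a + B\alpha\]

The factorisation is still $O(n^3)$, but it is cheaper and numerically more stable than a general matrix inverse, and each triangular solve afterwards costs only $O(n^2)$.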

To gain some more intuition on the method, we can write out the predictive mean and predictive covariance as such:

\[\mu_{y_1\mid y_2} = a + BC^{-1}(y_2 - b)\] \[\Sigma_{y_1\mid y_2} = A - BC^{-1}B^T\]

So the mean of $p(y_1 \mid y_2)$ is linearly related to $y_2$, and the predictive covariance is the prior uncertainty minus the reduction in uncertainty after seeing the observations. Therefore, the more data we see, the more certain we are.

Higher dimensional data

You can also do this for higher-dimensional data (though of course at greater computational costs). Here we extend the covariance function to incorporate RBF kernels in 2D data:
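One common way to write this, assuming 2D inputs $x = (u, v)$, $x' = (u', v')$ and a separate length-scale per dimension (the symbols $u$, $v$, $l_u$, $l_v$ are assumptions made here, not notation from earlier in the post):

\[K(x, x') = \sigma^2 \exp\left(-\frac{(u - u')^2}{2l_u^2} - \frac{(v - v')^2}{2l_v^2}\right)\]

Setting $l_u = l_v = l$ recovers the isotropic RBF kernel $\sigma^2 \exp\left(-\frac{1}{2l^2}\lVert x - x'\rVert^2\right)$.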

Covariance matrix selection

As one last detail, let’s talk about the different covariance functions used for GP. I don’t have any authoritative advice on selecting kernels for GP in general, and I believe in practice, most people try a few popular kernels and pick the one that fits their data/problem the best. So here we will only introduce the form of some of the most frequently seen kernels, get a feel for them with some plots and not go into too much detail. (I highly recommend implementing some of them and playing around with them though! It’s good coding practice and the best way to gain intuition about these kernels.)
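For reference, the three kernels discussed below are often written in the following standard forms. These are textbook parameterisations and not necessarily the exact ones behind the plots; $\alpha$ (a shape parameter) and $p$ (the period) are symbols introduced here, while $\sigma$ and $l$ play the same roles as in the RBF kernel.

\[K_{\text{Laplacian}}(x, x') = \sigma^2 \exp\left(-\frac{|x - x'|}{l}\right)\]

\[K_{\text{RQ}}(x, x') = \sigma^2 \left(1 + \frac{(x - x')^2}{2\alpha l^2}\right)^{-\alpha}\]

\[K_{\text{periodic}}(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2\left(\pi |x - x'| / p\right)}{l^2}\right)\]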

Laplacian Function

This function is continuous but non-differentiable. It looks like this:

If you average over all samples, you get straight lines joining your datapoints, which are called Brownian bridges.

Rational quadratic

Average over all samples looks like this:

Periodic functions

Average over all samples looks like this:

Summary

There are books that you can look up for appropriate kernels for covariance functions for your particular problem, and rules you can follow to produce more complicated covariance functions (such as, the product of two covariance functions is a valid covariance function). They can give you very different results:

It is tricky to find the appropriate covariance function, but there are also methods in place for model selection. One of those methods is Bayesian model comparison, defined as follows:
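In its usual textbook form, with $\mathcal{M}_i$ denoting the candidate covariance functions (a symbol assumed here for illustration), the posterior over models and the evidence it relies on read:

\[p(\mathcal{M}_i \mid y) = \frac{p(y \mid \mathcal{M}_i)\, p(\mathcal{M}_i)}{\sum_j p(y \mid \mathcal{M}_j)\, p(\mathcal{M}_j)}, \qquad p(y \mid \mathcal{M}_i) = \int p(y \mid \theta, \mathcal{M}_i)\, p(\theta \mid \mathcal{M}_i)\, d\theta\]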

However, it does involve a very difficult integral (or sum in the discrete case, as shown above) over the hyperparameters of your GP, which makes it impractical, and it is also very sensitive to the prior you put over your hyperparameters. In practice, it is more common to use deep Gaussian Processes for automatic kernel design, which optimise the choice of covariance function for your data through training.

The end

Hopefully this has been a helpful guide to Gaussian processes for you. I wanted to keep things relatively short and simple here, so I did not delve into the complications of using GPs in practice – in reality GPs suffer from not being able to scale to large datasets, and the choice of kernels can be very tricky. There are some state-of-the-art approaches that tackle these issues (see deep GP and sparse GP), but since I am by no means an expert in this area I will leave you to explore them.

Thank you for reading! Remember to take your canvas bag to the supermarket, baby whales are dying.

Acknowledgements

I would like to thank Andrey Kurenkov and Hugh Zhang from The Gradient for helping me with the edits of this article.