<p><em>An incomplete and slightly outdated literature review on augmentation-based self-supervised learning · Yuge Shi (yshi@robots.ox.ac.uk), University of Oxford · 2021-12-14</em></p>
<script src="mj.config" type="text/javascript"></script>
<h1 id="whats-with-this-title">What’s with this title?</h1>
<blockquote>
<p>This is the equivalent to the “I invented this dish because I love my family” part in recipes — feel free to skip.</p>
</blockquote>
<p>It was four months ago that I first drafted this blog post, and I felt like I was at <strong>the cutting edge of science</strong>. I mean, self-supervised learning methods without negative examples? That is WILD. Just knowing about these works made me feel like an excellent researcher, a studious PhD student, standing on the shoulders of the most recent giants.</p>
<p>Now that I am coming back to polish it four months later, it feels like a century has passed, and models like masked autoencoders / BEiT have taken over as the new crowd favourites. I look at my blog post and realise that this is no longer a complete literature review of recent advances in self-supervised learning, but more of a weirdly specific “period piece” that covers some of the more famous works between 2020 and early 2021.</p>
<p>I have convinced myself that it is still useful to put this out there, since a lot of the work done during this time period shares very similar intuitions. Looking at these methods as a whole also provides useful insight into how our view on what boosts performance or prevents latent collapse in SSL has changed throughout the years (months?). I am currently working on another blog post on masked image models such as MAE, BEiT and iBOT – so stay tuned!</p>
<h1 id="notations">Notations</h1>
<p><strong>Consistent notations</strong>:</p>
<ul>
<li>$x$: Original image;</li>
<li>$t^A$, $t^B$: Augmentations applied to images;</li>
<li>$x^A$, $x^B$: Two augmented views of the same image $x$;</li>
<li>$h^A$, $h^B$: representations extracted from $x^A$ and $x^B$, used for <strong>downstream tasks</strong>;</li>
</ul>
<p>(I tried my best but) <strong>less consistent notations</strong>:</p>
<ul>
<li>$z^A$, $z^B$: representations extracted from $x^A$ and $x^B$, used for <strong>objective evaluations</strong> (apart from one special case in SCAN);</li>
<li>$x^{(1)}$, …, $x^{(n)}$: $n$ augmented views of the same image $x$ / $n$ different images (apologies for the abuse of notation, but it should be clear from the context which one it is)</li>
</ul>
<h1 id="before-we-start-an-overview">Before we start, an overview</h1>
<p>Almost all self-supervised learning (SSL) models share a similar goal — learning useful representations without labels. All the methods we are about to cover translate this goal as the following requirement:</p>
<blockquote>
<p>Images that are semantically similar should have representations that are close to each other in feature space.</p>
</blockquote>
<p align="center"><img src="https://i.imgur.com/zx7SKz7.png" alt="self supervised learning" width="550" /></p>
<p>In practice, “semantically similar” images are generated by <strong>image augmentations</strong>. Let’s say we have a set of available augmentations $T$. Then an image $x$ can be augmented into two different views, $x_A$ and $x_B$, through the following procedure:</p>
\[\begin{align*}
& t_A \sim T, t_B \sim T \qquad &\text{Sample augmentations}\\
& x_A := t_A(x), x_B := t_B(x) &\text{Apply to image}
\end{align*}\]
<p>Let’s say we want to learn an encoder $f_\theta$, which extracts features from images. We can acquire the representations for the two image views by</p>
\[\begin{align*}
& z_A := f_\theta(x_A), z_B := f_\theta(x_B)
\end{align*}\]
<p>The goal is then to minimise the distance between the two features, i.e.</p>
\[\begin{align*}
\min_\theta \mathbb{E}[dist(z_A, z_B)]
\end{align*}\]
<p>where $dist(\cdot)$ denotes the distance between vectors.</p>
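The recipe above can be sketched numerically. Below is a minimal toy version in NumPy, where the "augmentation" (additive noise) and the "encoder" (a single linear map) are illustrative stand-ins, not taken from any of the papers covered here:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    # toy stand-in for t ~ T; real methods use crops, flips, colour jitter, ...
    return x + 0.1 * rng.normal(size=x.shape)

def f_theta(x, W):
    # toy stand-in for the encoder f_theta: a single linear map with parameters W
    return W @ x

x = rng.normal(size=8)                       # the "image" x
W = rng.normal(size=(4, 8))                  # encoder parameters theta
x_A, x_B = augment(x), augment(x)            # two views of the same image
z_A, z_B = f_theta(x_A, W), f_theta(x_B, W)  # their representations

dist = float(np.sum((z_A - z_B) ** 2))       # dist(z_A, z_B), minimised over theta
```

Minimising `dist` alone is exactly the objective above; as the next section explains, doing so naively is where the trouble starts.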
<h2 id="whats-the-catch">What’s the catch?</h2>
<p>This idea seems easy enough, so how come there are so many papers covering this same topic? (insert bad meme about the overly competitive nature of DL community)</p>
<p>Turns out, if you naively minimise the distance between representations, the model will simply map all the representations to a constant. This trivially minimises the distance between any pair of representations, but does not give us useful representations at all. We refer to this phenomenon as <strong>latent collapse</strong>.</p>
<p>For a lot of the work you are about to see, how latent collapse is avoided is the most interesting part of the paper (but a lot of them have many more interesting contributions too, so don’t stop there!). For those of you who like tables, here is a quick summary (you can also jump to different sections of this blog post following the links):</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>How latent collapse is prevented</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="#simclr-2020-feb">SimCLR</a></td>
<td>Contrasting against negative examples, from minibatch</td>
</tr>
<tr>
<td><a href="#moco-2019--moco-v2-2020-march">MoCo</a></td>
<td>Contrasting against negative example, from dictionary</td>
</tr>
<tr>
<td><a href="#swav-2020-june">SwAV</a></td>
<td>Contrasting against negative examples, from minibatch</td>
</tr>
<tr>
<td><a href="#byol-2020-june">BYOL</a></td>
<td><del>magic</del> Iterative online update + asymmetry of two encoders</td>
</tr>
<tr>
<td><a href="#w-mse-2020-july">W-MSE</a></td>
<td>Whitening</td>
</tr>
<tr>
<td><a href="#barlow-twins-2021-march">Barlow twins</a></td>
<td>Matching cross correlation matrices</td>
</tr>
<tr>
<td><a href="#simsiam-2020-nov">SimSiam</a></td>
<td>Stop gradient operation to encoder of one view</td>
</tr>
<tr>
<td><a href="#vicreg-2021-may">VICReg</a></td>
<td>Regularise the standard deviation of representation</td>
</tr>
</tbody>
</table>
<p>That’s it! For the rest of this blog post I will be introducing these methods in rough chronological order, going through the model and objective as well as key findings and insights, which will hopefully shed some light on how these models evolved through time. Enjoy!</p>
<h1 id="simclr-2020-feb"><a href="https://arxiv.org/abs/2002.05709">SimCLR</a> (2020 Feb)</h1>
<p><strong>TL;DR:</strong> Building on prior contrastive learning approaches, the authors propose a simple contrastive framework and study the different empirical aspects that make it “work”.</p>
<h3 id="model">Model</h3>
<p align="center"><img src="https://i.imgur.com/TPZlfyK.png" alt="simclr" width="400" /></p>
<ol>
<li>Two data augmentations are applied to the same example $x$, producing $x^A$ and $x^B$. This is considered a positive pair. (<em>engineering detail: the combination of random crop and color distortion is crucial for good performance</em>)</li>
<li>Base encoder $f(\cdot)$ extracts the representation $h$ for each view, which is used for downstream tasks at test time;</li>
<li>Projection head $g(\cdot)$ takes $h$ and map to $z$.</li>
</ol>
<h3 id="objective">Objective</h3>
<p>The model parameters $\theta$ are learned by minimising the following <a href="https://arxiv.org/pdf/1807.03748.pdf">infoNCE</a> objective:</p>
\[\begin{align*}
\min_{\theta} \left( -\log \frac{\exp \text{sim}(z_i^A, z_i^B)}{\sum_{j\neq i}\exp \text{sim}(z_i^A, z_j^B)} \right)
\end{align*}\]
<p>where $\text{sim}$ denotes the cosine similarity between two vectors, i.e. sim$(u, v) = \frac{u^Tv}{||u||\cdot||v||}$.</p>
<p>Note that the objective is computed using $z$, the output of the projection head $g(\cdot)$, only. By minimising the above objective, we maximise the similarity between the representations of two views of the same image, $z^A_i$ and $z^B_i$ (also called a positive pair), and minimise it for different images, i.e. $z^A_i$ and $z^B_j$. For SimCLR, the negative examples are <strong>all</strong> the other examples in the same minibatch. (We will be looking at other methods such as MoCo which decouple the number of negative examples from the batch size)</p>
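As a sketch, the objective above can be computed over a batch as follows (NumPy, following the simplified formula above with positives on the diagonal of the similarity matrix; the temperature `tau`, which the paper divides similarities by but the formula above omits, is included here):

```python
import numpy as np

def info_nce(z_A, z_B, tau=0.5):
    """Batch InfoNCE: z_A, z_B are (N, D) projections of the two views.
    Positives are matching rows (i == j); negatives are all other rows."""
    z_A = z_A / np.linalg.norm(z_A, axis=1, keepdims=True)
    z_B = z_B / np.linalg.norm(z_B, axis=1, keepdims=True)
    sim = np.exp(z_A @ z_B.T / tau)   # exp(sim(z_i^A, z_j^B) / tau)
    pos = np.diag(sim)                # i == j: positive pairs
    neg = sim.sum(axis=1) - pos       # j != i: negatives
    return float(np.mean(-np.log(pos / neg)))
```

With aligned views the positive term dominates and the loss is low; with mismatched pairs it grows, which is exactly the pull-together/push-apart behaviour described above.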
<h3 id="key-findings">Key findings</h3>
<ul>
<li>Composition of data augmentations is important – random crop + color distortion is crucial for good performance;</li>
<li>Proposes to add a projection layer between representations for downstream task ($h$) and the representation used to compute contrastive loss ($z$), which seems to improve performance;</li>
<li>Larger batch size is better.</li>
</ul>
<h1 id="scan-2020-may"><a href="https://arxiv.org/pdf/2005.12320.pdf">SCAN</a> (2020 May)</h1>
<p><strong>TL;DR:</strong> Use the power of nearest neighbours to improve the representations learned from some pretext task (for instance a model trained using SimCLR).</p>
<h3 id="model-1">Model</h3>
<p align="center"><img src="https://i.imgur.com/X0xHE62.png" alt="scan" width="400" /></p>
<ol>
<li>First, the representations $z$ are learned through some pretext task – in the original paper for most of the experiments they used SimCLR;</li>
<li>Then, a clustering function $g_\phi$ takes $z$ as input and predicts the probability of the datapoint belonging to each cluster in $\mathcal{C}=\{1,\cdots, C\}$. We denote this probability as $h$, where $h\in[0,1]^C$;</li>
<li>The datapoint is then assigned to the cluster with the highest probability in $h$, which we denote as $c$.</li>
<li>We then select the $K$ nearest neighbours to the representation $z$ of the original image, denoting them as $\{z^{(1)}, z^{(2)},\cdots,z^{(K)}\}$. We perform the above clustering forward pass on all $K$ neighbours and acquire the probabilities that each neighbour belongs to different clusters, $\mathcal{H} = \{h^{(1)},h^{(2)},\cdots,h^{(K)}\}$.</li>
</ol>
<h3 id="objective-1">Objective</h3>
<p>We can then learn the clustering function $g_\phi$ by minimising the following objective:</p>
\[\begin{align*}
\mathcal{L}= \underbrace{-\sum_{h^{(k)} \in \mathcal{H}} \log \langle h, h^{(k)} \rangle}_{(1)} + \lambda \ \underbrace{\sum_{c\in\mathcal{C}} h_c\log(h_c)}_{(2)}
\end{align*}\]
<p>Let’s dissect the objectives a bit:</p>
<p><strong>Term (1): Consistent, confident neighbours;</strong>
The first term of the objective imposes consistent predictions for $z$ and its neighbouring samples. The inner product $\langle h, h^{(k)} \rangle$ is maximised – and the negative log term thus minimised – when the predictions are one-hot (confident) and assigned to the same cluster (consistent).</p>
<p><strong>Term (2): diversity in clusters;</strong>
The second term computes the (negative) entropy of the cluster assignment probabilities $h$. It is introduced to prevent $g_\phi$ from assigning all samples to a single cluster – minimising this term maximises the entropy, spreading predictions uniformly across the clusters $\mathcal{C}$.</p>
<p><strong>An extra “trick”:</strong> the model minimises mis-labelling during cluster assignment by picking out samples with highly confident predictions ($h_\max \approx 1$), assigning the sample to its predicted cluster, and updating $g_\phi$ based on the pseudo labels obtained. See original paper for more details.</p>
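A sketch of the objective in NumPy. One assumption to flag: the entropy term here is computed on the batch-averaged assignment probabilities (so it acts as a diversity term across samples rather than flattening individual predictions), and the default `lam` is illustrative:

```python
import numpy as np

def scan_loss(H_anchor, H_neigh, lam=5.0):
    """H_anchor: (N, C) cluster probabilities h for each anchor;
    H_neigh: (N, K, C) probabilities h^(k) for each anchor's K neighbours;
    lam: weight of the entropy term (illustrative value)."""
    # term (1): -log <h, h^(k)>, averaged over anchors and neighbours
    dots = np.einsum('nc,nkc->nk', H_anchor, H_neigh)
    consistency = -np.mean(np.log(dots + 1e-12))
    # term (2): negative entropy of the batch-averaged assignment;
    # minimising it spreads samples across clusters
    mean_h = H_anchor.mean(axis=0)
    neg_entropy = np.sum(mean_h * np.log(mean_h + 1e-12))
    return float(consistency + lam * neg_entropy)
```

Confident, consistent assignments spread over many clusters score lower than everything collapsing into one cluster, which is the intended behaviour of the two terms combined.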
<h1 id="moco-2019--moco-v2-2020-march"><a href="https://arxiv.org/pdf/1911.05722.pdf">MoCo</a> (2019) / <a href="https://arxiv.org/pdf/2003.04297.pdf">MoCo V2+</a> (2020 March)</h1>
<p><strong>TL;DR:</strong> MoCo decouples the number of negative examples from the batch size. It proposes to store representations from previous $K$ minibatches in a “keys” dictionary, which can be used for computing the contrastive loss of the current “query” minibatch.</p>
<h3 id="model-2">Model</h3>
<p align="center"><img src="https://i.imgur.com/msGuC1C.png" alt="moco" width="550" /></p>
<p>MoCo uses a similar infoNCE objective as SimCLR, and the key difference between the two approaches is how they acquire negative examples.</p>
<ul>
<li><strong>SimCLR</strong>: all the other datapoints in the minibatch are used as negative examples to the current datapoint, and therefore the number of negative examples is limited by the size of the minibatch;</li>
<li><strong>MoCo</strong>: the representations of each minibatch are stored in a fixed-size <strong>dictionary</strong>. The negative examples used for any datapoint are drawn from this dictionary. By doing so, the number of negative examples is no longer determined by the size of the minibatch.</li>
</ul>
<p>With this in mind, MoCo’s pipeline consists of the following two parts:</p>
<ul>
<li><strong>Generating positive examples</strong>: Similar to SimCLR, a forward pass is performed on two views of the same image, $x^A$ and $x^B$, with a base encoder $f_\theta(\cdot)$ and projection head $g_\theta(\cdot)$. We denote the representations acquired from this step $z^A_\theta$ and $z^B_\theta$.</li>
<li><strong>Generating negative examples</strong>: We mentioned that the representations of each minibatch get stored in a dictionary and are reused by subsequent batches as negative examples; however, naively storing the representations $z^A_\theta$ and $z^B_\theta$ can lead to poor results due to the rapidly changing $\theta$. Therefore, the authors propose to store the representations generated through the momentum encoder $\phi$, where</li>
</ul>
\[\begin{align*}
\phi \leftarrow m \phi + (1-m) \theta,
\end{align*}\]
<p>with $m\in[0,1)$ being the momentum coefficient.
Updating $\phi$ using the above assignment rule ensures that $\phi$ evolves more smoothly than $\theta$. Therefore, for each minibatch, we simply add the representations acquired from $g_\phi(f_\phi(x))$, denoted $z^A_\phi$ and $z^B_\phi$, to the dictionary, which is then made available to future minibatches as negative examples.</p>
<h3 id="objective-2">Objective</h3>
<p>The objective is very similar to SimCLR:</p>
\[\begin{align*}
\min_{\theta} \left( -\log \frac{\exp \text{sim}(z_\theta^A, z_\theta^B)}{\sum_{z_\phi \sim \texttt{dict}}\exp \text{sim}(z_\theta^A, z_\phi)} \right)
\end{align*}\]
<p>Similarly, $\text{sim}$ denotes the cosine similarity between two vectors, i.e. sim$(u, v) = \frac{u^Tv}{||u||\cdot||v||}$, and $\texttt{dict}$ denotes the dictionary.</p>
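A minimal sketch of the two moving parts, the momentum update and the key dictionary (here a plain FIFO queue; `dict_size` and `tau` are illustrative, and note that unlike the simplified formula above, the softmax denominator here includes the positive key, matching the paper's $(K+1)$-way classification view):

```python
import numpy as np
from collections import deque

def momentum_update(phi, theta, m=0.999):
    # phi <- m * phi + (1 - m) * theta, applied parameter-wise
    return m * phi + (1 - m) * theta

def moco_loss(q, k_pos, key_queue, tau=0.07):
    """q: query representation z_theta; k_pos: its positive key z_phi;
    key_queue: iterable of stored key representations (the dictionary)."""
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    negs = np.stack([k / np.linalg.norm(k) for k in key_queue])
    l_pos = np.exp(q @ k_pos / tau)
    l_neg = np.exp(negs @ q / tau).sum()
    return float(-np.log(l_pos / (l_pos + l_neg)))

# fixed-size dictionary: enqueue current keys, oldest keys fall out automatically
dict_size = 4096
key_queue = deque(maxlen=dict_size)
```

After each minibatch one would enqueue the momentum-encoder keys $z_\phi$ and apply `momentum_update` to every parameter of $\phi$.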
<h3 id="from-moco-to-moco-v2">From MoCo to MoCo V2+</h3>
<p>The model that we described here is actually MoCo V2+. The original MoCo was proposed before SimCLR. Following the key findings of SimCLR, the authors updated their model to MoCo V2+, adopting the following designs and achieving better results:</p>
<ol>
<li>use an MLP projection head $g(\cdot)$ and</li>
<li>use more data augmentations.</li>
</ol>
<h1 id="swav-2020-june"><a href="https://arxiv.org/pdf/2006.09882.pdf">SwAV</a> (2020 June)</h1>
<p><strong>TL;DR:</strong> instead of matching the representations of two views (augmentations of the same image) directly, use one representation to predict the other.</p>
<h3 id="model-3">Model</h3>
<p align="center"><img src="https://i.imgur.com/xu1EYRa.png" alt="swav" width="400" /></p>
<p>The model itself looks similar to SimCLR, but the way it works is quite different. Here $g_\theta$ is not parametrised by a learnable neural network; instead, a set of $K$ trainable prototype vectors $G=\{g_1, g_2,\dots,g_K\}$ maps $h$ into a code $z$ (this code can be discrete, but during training the authors find that leaving it continuous results in better performance). When computing the loss, instead of directly enforcing $z^A$ and $z^B$ to be similar, the model tries to associate the code of one view $x^A$ with the representation of the other view $x^B$.</p>
<h3 id="objective-3">Objective</h3>
<p>SwAV minimises the following objective</p>
\[\begin{align*}
\min_{\theta} \left(\ell(h^B, z^A) + \ell(h^A, z^B)\right),
\end{align*}\]
<p>where</p>
\[\begin{align*}
\ell(h^B, z^A) = -\sum_k z^{A(k)} \log \frac{\exp(\langle h^B, g_k\rangle )}{\sum_{k'}\exp(\langle h^B, g_{k'}\rangle)}.
\end{align*}\]
<p>Despite the conceptual differences, this is loosely still the inner product of the projected representation $z^A$ and $z^B$.</p>
<h1 id="byol-2020-june"><a href="https://arxiv.org/abs/2006.07733">BYOL</a> (2020 June)</h1>
<p><strong>TL;DR:</strong> BYOL avoids having to use negative examples in the contrastive loss by performing an iterative online update — this paper was groundbreaking at the time, as negative examples are computationally costly.</p>
<h3 id="model-4">Model</h3>
<p align="center"><img src="https://i.imgur.com/i6niwtF.png" alt="byol" width="450" /></p>
<p>Let’s unpack. Similar to MoCo, the model uses two sets of network parameters $\theta$ and $\phi$.</p>
<p>The optimisation goal of $\theta$ is to learn a projection $y_\theta$ that closely matches the representation learned from $\phi$, i.e. $z_\phi$. Implementation wise, this is done by adding yet another projection head $q_\theta(\cdot)$ that predicts $y_\theta$ from $z_\theta$. We then optimise $\theta$ using the following loss that minimises the mean squared error between $y_\theta$ and $z_\phi$:</p>
\[\begin{align*}
\mathcal{L}_\theta = \|y_\theta^A - z^B_\phi\|^2_2 = 2 - 2\cdot \frac{\langle y^A_\theta, z^B_\phi \rangle}{\|y^A_\theta \|\cdot \|z^B_\phi\|}
\end{align*}\]
<p>In the paper they also normalise $y_\theta^A$ and $z_\phi^B$ before computing this loss. Further, they symmetrise the loss by swapping the two views — feeding $x^B$ to the online network $\theta$ and $x^A$ to the target network $\phi$ — and adding the resulting loss to $\mathcal{L}_\theta$. (I’m not sure how important that is since the transforms are stochastically generated anyway, but it seems to improve empirical results)</p>
<p>$\phi$ on the other hand is not optimised via gradient descent. Similar to the momentum encoder of MoCo, it follows the following update rule at every forward pass:</p>
\[\begin{align*}
\phi \leftarrow \tau \phi + (1-\tau) \theta
\end{align*}\]
<p>where $\tau\in[0,1)$ is the coefficient that controls the smoothness of the update.</p>
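Putting the two pieces together, a sketch of the normalised MSE for the online network and the EMA update for the target (`tau=0.99` is an illustrative value):

```python
import numpy as np

def byol_loss(y_theta, z_phi):
    """MSE between the l2-normalised online prediction y_theta and target
    projection z_phi; algebraically equal to 2 - 2 * cosine similarity."""
    y = y_theta / np.linalg.norm(y_theta)
    z = z_phi / np.linalg.norm(z_phi)
    return float(np.sum((y - z) ** 2))

def ema_update(phi, theta, tau=0.99):
    # target parameters track the online parameters smoothly;
    # phi receives no gradients of its own
    return tau * phi + (1 - tau) * theta
```

The symmetrised version simply adds `byol_loss(y_theta_B, z_phi_A)` for the swapped views.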
<h3 id="why-the-hell-does-it-work">Why the hell does it work?</h3>
<p>From the above, it is not hard to notice that BYOL is very similar to both SimCLR and MoCo. However, directly removing the negative examples from either model will lead to latent collapse.</p>
<p>So what makes BYOL effective without negative examples? The paper builds intuition for this by deriving the gradient of the $\theta$ update, showing that it is the same as the gradient of the expected conditional variance, i.e.</p>
\[\begin{align*}
\nabla_\theta \mathbb{E}\left[\|y_\theta-z_\phi\|_2^2 \right] = \nabla_\theta \mathbb{E}\left[\sum_i \text{Var}(z_{\phi,i} \mid z_\theta) \right]
\end{align*}\]
<p>This finding is important for explaining why BYOL doesn’t collapse, as it provides the following three insights:</p>
<ol>
<li><strong>It is always worth it for the model to utilise stochasticities in training dynamics:</strong> Since for any random variables $X$, $Y$ and $Z$ we have $\text{Var}(X|Y,Z)\leq \text{Var}(X|Y)$, let us consider the following:
<ul>
<li>$X$: the target projection $z_\phi$</li>
<li>$Y$: the online projection $y_\theta$</li>
<li>$Z$: any additional changes introduced by stochasticities in training dynamics.
We see that the model cannot reduce variance by discarding $Z$.</li>
</ul>
</li>
<li><strong>Latent collapse avoided:</strong> following similar intuition to the above, BYOL avoids constant features in $z$, since for any constant $c$ and random variables $z_\phi$ and $z_\theta$, $\text{Var}(z_\phi|z_\theta)\leq\text{Var}(z_\phi|c)$.</li>
<li><strong>Why we can’t optimise $\phi$ with the same objective as $\theta$:</strong> if we were to minimise the variance $\text{Var}(z_\phi|z_\theta)$ directly by optimising $\phi$, $z_\phi$ could simply reduce to a constant. Therefore, BYOL instead makes $\phi$ gradually closer to $\theta$.</li>
</ol>
<p><strong>Note:</strong> It’s probably better to say that the above explains why BYOL does not fail completely, than to say that it explains why it works. In fact, the reason why latent collapse does not happen in BYOL (or any SSL algorithm for that matter) remains an open problem. See the resources listed below for further discussions on this topic:</p>
<ol>
<li><a href="https://generallyintelligent.ai/understanding-self-supervised-contrastive-learning.html">This blog on BYOL</a> attributes avoiding degenerative solutions to the batch-norm layers in the projection heads;</li>
<li><a href="https://arxiv.org/abs/2010.10241">This paper</a> then rebuts the above and shows that BYOL works even without batch statistics;</li>
<li><a href="https://arxiv.org/abs/1906.05849">Multiview contrastive coding</a> shows that using multiple views, not just two, contributes to non-collapsing solutions;</li>
<li>Works such as <a href="https://arxiv.org/abs/2011.10566">SimSiam</a> and <a href="https://arxiv.org/pdf/2007.06346.pdf">W-MSE</a> also offer interesting perspectives on the topic of avoiding latent collapse.</li>
</ol>
<h1 id="w-mse-2020-july"><a href="https://arxiv.org/pdf/2007.06346.pdf">W-MSE</a> (2020 July)</h1>
<p><strong>TL;DR:</strong> The paper has similar motivation to BYOL – it aims to develop an SSL method that requires no negative examples. Instead, it uses “whitening” to prevent latent collapse.</p>
<h3 id="prevent-latent-collapse-by-whitening">Prevent latent collapse by whitening</h3>
<p>Before we dive in, it’s helpful to first look at how the authors characterise the learning problem in this paper. Specifically, they propose to formulate the problem of SSL as follows:</p>
\[\begin{align*}
&\min_\theta \mathbb{E}[dist(z_i, z_j)], &(1) \\
s.t.\ & cov(z_i, z_i) = cov(z_j, z_j) = I &(2)
\end{align*}\]
<p>Let’s unpack. In the above equations, (1) specifies that representations from positive image pairs that share similar semantics $(z_i, z_j)$ should be clustered close together, and (2) that the image representations must form a non-degenerate distribution, i.e. the latents do not collapse to a single point.</p>
<p>More specifically, in (2), $I$ is the identity matrix. The constraint specifies that different components (dimensions) of the representation $z$ should be linearly independent, and by doing so, encourages different axes of $z$ to represent different semantic content. Importantly, by optimising this condition, the model does not need any negative examples to prevent latent collapse!</p>
<p>Now that we know the optimisation goal of the model, the pipeline and objective of this model should make much more sense.</p>
<h3 id="model-5">Model</h3>
<p><img src="https://i.imgur.com/xZs6X6v.png" alt="w-mse" /></p>
<ol>
<li>One of the most notable differences of this model is that it is not constrained to using only 2 positive examples – in the above schematic, $d$ views are generated for each image.</li>
<li>The paper again uses a similar pipeline to SimCLR, extracting the representation $v$ using first the base encoder $f(\cdot)$ and then the projection head $g(\cdot)$; $v$ is then passed to the whitening layer.</li>
<li>The whitening procedure is done using the following:</li>
</ol>
\[\begin{align*}
z = W_V(v-\mu_V),
\end{align*}\]
<p>where $\mu_V$ is the mean of the elements in $V$:</p>
\[\begin{align*}
\mu_V = \frac{1}{K} \sum_k v_k,
\end{align*}\]
<p>while the matrix $W_V$ is such that $W_V^TW_V = \Sigma_V^{-1}$, with $\Sigma_V$ being the covariance matrix of $V$:</p>
\[\begin{align*}
\Sigma_V = \frac{1}{K-1} \sum_k (v_k-\mu_V)(v_k-\mu_V)^T.
\end{align*}\]
<h3 id="objective-4">Objective</h3>
<p>The loss is then computed for pairwise $z$s in $\{z^{(1)}, \cdots, z^{(d)}\}$ as follows:</p>
\[\begin{align*}
\mathcal{L} = \frac{2}{Nd(d-1)} \sum_{i<j} dist(z^{(i)}, z^{(j)}),
\end{align*}\]
<p>where $N$ denotes the batch size and $d$ the number of augmentations for each image.</p>
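The whitening step can be sketched with a Cholesky factorisation (as in Siarohin et al.): with $\Sigma_V = LL^T$, taking $W_V = L^{-1}$ gives $W_V^T W_V = \Sigma_V^{-1}$, and the whitened batch has exactly identity covariance. A minimal NumPy version:

```python
import numpy as np

def whiten(V):
    """V: (K, D) batch of representations v_k.
    Returns z_k = W_V (v_k - mu_V) with W_V^T W_V = Sigma_V^{-1}."""
    mu = V.mean(axis=0)
    Vc = V - mu
    Sigma = (Vc.T @ Vc) / (len(V) - 1)   # covariance matrix Sigma_V
    L = np.linalg.cholesky(Sigma)        # Sigma_V = L L^T
    W = np.linalg.inv(L)                 # W^T W = L^-T L^-1 = Sigma_V^{-1}
    return Vc @ W.T                      # rows are z_k = W (v_k - mu)
```

In the paper $W_V$ is computed per (sub-)batch, and the whitened $z$s are then plugged into the pairwise distance loss above.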
<h3 id="some-extra-notes">Some extra notes</h3>
<p>The whitening “layer” maps all the representations onto a unit sphere, avoiding latent collapse and therefore the need for negative examples. Note that this whitening transform was first proposed by <a href="https://arxiv.org/abs/1806.00420">Siarohin et al., 2019</a> (also seen in <a href="https://arxiv.org/abs/1804.08450">Huang et al., 2018</a>), which uses the efficient and stable Cholesky decomposition.</p>
<p>In parallel to whitening, the authors also apply batch slicing to the representation $v$, where they further divide each batch into multiple sub-batches to compute the whitening matrix $W_V$. This is to provide more stability during training. Please refer to Page 5 and Figure 3 of the original paper for more details.</p>
<h1 id="barlow-twins-2021-march"><a href="https://arxiv.org/pdf/2103.03230.pdf">Barlow Twins</a> (2021 March)</h1>
<p><strong>TL;DR:</strong> Avoid latent collapse by matching the cross-correlation matrix between the representations of images of two different views to an identity matrix. Does not need negative examples as a result.</p>
<p><strong>Model:</strong></p>
<p><img src="https://i.imgur.com/7EcL2SD.png" alt="barlow_twins" /></p>
<p>Again, the model uses a similar pipeline to SimCLR. After the representations of the two views $z^A$ and $z^B$ are generated, we compute the cross-correlation matrix $\mathcal{C}$, where each element $\mathcal{C}_{ij}$ is computed as follows:</p>
\[\begin{align*}
\mathcal{C}_{ij} = \frac{\sum_b z^A_{b,i}z^B_{b,j}}{\sqrt{\sum_b (z^A_{b,i})^2}\sqrt{\sum_b (z^B_{b,j})^2}},
\end{align*}\]
<p>where $b$ indexes batch samples and $i,j$ index the vector dimension of $z$. The value of $\mathcal{C}_{i,j}$ is between $-1$ (perfect anti-correlation) and $1$ (perfect correlation).</p>
<p>The training objective is based on this cross correlation matrix, which consists of 2 terms:</p>
\[\begin{align*}
\mathcal{L} = \underbrace{\sum_i (1-\mathcal{C}_{ii})^2}_{\text{invariance term}} + \lambda \underbrace{\sum_i \sum_{j\neq i}\mathcal{C}_{ij}^2}_{\text{redundancy reduction term}}
\end{align*}\]
<ul>
<li><strong>Invariance term:</strong> tries to equate the diagonal elements of the cross-correlation matrix to $1$, making the representations invariant to the augmentations applied to the original image;</li>
<li><strong>Redundancy reduction term:</strong> tries to decorrelate the different vector components of the embedding by equating the off-diagonal elements of $\mathcal{C}$ to 0.</li>
</ul>
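A sketch of the full loss in NumPy. One assumption to flag: the per-dimension standardisation over the batch stands in for the normalisation in the $\mathcal{C}_{ij}$ formula above (the paper batch-normalises the embeddings), and `lam` is an illustrative trade-off weight:

```python
import numpy as np

def barlow_twins_loss(z_A, z_B, lam=5e-3):
    """z_A, z_B: (N, D) projections of the two views of a batch."""
    # standardise each dimension over the batch
    z_A = (z_A - z_A.mean(axis=0)) / z_A.std(axis=0)
    z_B = (z_B - z_B.mean(axis=0)) / z_B.std(axis=0)
    C = (z_A.T @ z_B) / len(z_A)                          # cross-correlation matrix
    on_diag = np.sum((1.0 - np.diag(C)) ** 2)             # invariance term
    off_diag = np.sum(C ** 2) - np.sum(np.diag(C) ** 2)   # redundancy reduction term
    return float(on_diag + lam * off_diag)
```

With perfectly matching views the diagonal of $\mathcal{C}$ is exactly $1$ and only the (small) off-diagonal correlations contribute, so the loss stays near zero.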
<p>The paper also mentions that Barlow Twins’ objective function can be understood as an instantiation of the information bottleneck (IB) objective, which specifies that the representation should conserve as much information about the sample as possible while being as uninformative as possible about the specific distortions applied to the sample.</p>
<h1 id="simsiam-2020-nov"><a href="https://arxiv.org/pdf/2011.10566.pdf">SimSiam</a> (2020 Nov)</h1>
<p><strong>TL;DR:</strong> Proposes that simple siamese networks can learn meaningful representations without negative samples (most contrastive methods), large batches (SimCLR) or momentum encoders (BYOL). It turns out, “stop gradient is all you need”.</p>
<h3 id="model-6">Model</h3>
<p align="center"><img src="https://i.imgur.com/XDg3Pw0.png" alt="simsiam" width="400" /></p>
<p>The proposed model is quite simple. As the authors aptly put it, SimSiam can be thought of as “BYOL without the momentum encoder”, “SimCLR without negative pairs” and “SwAV without online clustering”. It is the simplest augmentation-based SSL method in this literature review.</p>
<p>As we can see from the architecture, SimSiam shares weights between the two networks. The projection head $g_\theta$ is removed from the augmentation-B stream, and gradients from the loss are not back-propagated through that stream. The loss is computed without negative pairs as follows:</p>
\[\begin{align*}
\mathcal{L}(z^A, h^B) = -\frac{z^A \cdot h^B}{\|z^A\| \ \|h^B\|}
\end{align*}\]
<p>Note that this loss is the same as the numerator part of the SimCLR loss. Following the practice of BYOL, the authors also symmetrise the loss by swapping $t_A$ and $t_B$ to compute $\mathcal{L}(z^B, h^A)$, and take the average of the two losses.</p>
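As a sketch in NumPy (which has no autograd, so the stop-gradient is only marked in comments; in an autograd framework the $z$ arguments would be detached, e.g. `z.detach()` in PyTorch):

```python
import numpy as np

def neg_cosine(z, h):
    """L(z, h) = -cos(z, h); z is the stop-gradient branch."""
    return float(-(z @ h) / (np.linalg.norm(z) * np.linalg.norm(h)))

def simsiam_loss(z_A, h_B, z_B, h_A):
    """Symmetrised SimSiam loss. In training, z_A and z_B are treated as
    constants (stop-gradient); gradients flow only through h_A and h_B."""
    return 0.5 * neg_cosine(z_A, h_B) + 0.5 * neg_cosine(z_B, h_A)
```

Perfect alignment gives the minimum value of $-1$; orthogonal representations give $0$.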
<h3 id="empirical-findings-what-prevents-collapse-in-simsiam">Empirical findings: what prevents collapse in SimSiam?</h3>
<p>Apart from proposing this amazingly simple method, authors also performed some helpful empirical evaluations on different elements of SimSiam:</p>
<ol>
<li>
<p><strong>Stop gradient</strong> $\leftarrow$ prevents collapse
Without the stop gradient operation, the model collapses and reaches the minimum possible loss. The authors quantify model collapse by computing the standard deviation of the l2-normalised output $z/\|z\|_2$ – the std should be 0 when the model collapses (all images get encoded into a constant value), and is $1/\sqrt{d}$ if $z$ has a zero-mean isotropic Gaussian distribution, where $d$ is the dimension of $z$. The authors are able to show that with stop gradient the std is indeed $1/\sqrt{d}$, and without it the std is 0.</p>
<p>While this empirical evaluation is interesting and does show the importance of stop gradient in their architecture, quite unsatisfyingly (but also understandably), no guarantees were made about whether applying stop gradient will guarantee a non-collapsing solution. As carefully put by the authors,</p>
<blockquote>
<p>Our experiments show that there exist collapsing solutions…, their existence implies that it is insufficient for our method to prevent collapsing solely by architecture designs (e.g. predictor, BN, l2-norm). In our comparison, all these architecture designs are kept unchanged, but they do not prevent collapsing if stop gradient is removed.</p>
</blockquote>
<p>(Note that this finding that no stop gradient $\rightarrow$ latent collapse is limited to the architecture used in SimSiam and is not a general statement for all models.)</p>
</li>
<li>
<p><strong>Predictor $g_\theta$</strong> $\leftarrow$ prevents collapse
<em><strong>Config 1</strong></em>. <em>Removing the predictor when using the symmetrised loss: model collapses!</em>
When the predictor $g_\theta$ is removed, the symmetrised loss is $\frac{1}{2}\mathcal{L}(z^B, \texttt{stopgrad}(z^A)) +$ $\frac{1}{2}\mathcal{L}(z^A, \texttt{stopgrad}(z^B))$, which has the same gradient direction as $\mathcal{L}(z^A, z^B)$ – so it is as if the stop gradient operation has been removed! Collapse is observed.</p>
<p><em><strong>Config 2</strong></em>. <em>Removing the predictor when using the asymmetric loss: model collapses!</em>
There’s not as much explanation for this one – collapse is observed in experiments when using this configuration.</p>
<p><em><strong>Config 3</strong></em>. <em>Fixing the predictor at random initialisation: training does not converge</em>
If $g_\theta$ is fixed at random initialisation, training does not converge and the loss remains high (which is not the same as collapse, where the loss is minimised)</p>
</li>
<li>
<p><strong>Large batch size</strong> $\leftarrow$ not important
Compared to SimCLR and SwAV, which require a large batch size (4096) to work optimally, the optimal batch size of SimSiam is 256. Further increasing the batch size does not improve its performance. In addition, using smaller batch sizes such as 64 and 128 incurs only a small accuracy drop (2.0 and 0.8 points respectively).</p>
</li>
</ol>
<p>In addition, the following three factors are helpful for training, but do not prevent collapse:</p>
<ol>
<li><strong>Batch normalisation:</strong> Similar to supervised learning scenarios, batch normalisation is helpful for optimisation when used appropriately, but it does not help prevent collapse.</li>
<li><strong>Similarity function:</strong> Swapping the cosine similarity for a cross-entropy similarity, where $\mathcal{L}(z^A, h^B) = -\texttt{softmax}(z^A) \cdot \log \texttt{softmax}(h^B)$, results in a $5\%$ performance drop on ImageNet, but the model does not collapse.</li>
<li><strong>Symmetrisation:</strong> The asymmetric loss achieves accuracy that is $4\%$ lower than the symmetrised loss, but does not result in collapse.</li>
</ol>
<p><strong>Note:</strong> I find the empirical evaluation of the different elements of the model to be very helpful, as it really pinpoints what exactly prevents model collapse in SimSiam. Regrettably, these empirical findings do not necessarily extend beyond SimSiam’s experimental protocol, and what exactly prevents model collapse in this kind of siamese network is still unclear.</p>
<h1 id="vicreg-2021-may"><a href="https://arxiv.org/pdf/2105.04906.pdf">VICReg</a> (2021 May)</h1>
<p><strong>TL;DR:</strong> The model uses a pipeline similar to SimCLR’s. Instead of using negative examples to avoid latent collapse, the authors explicitly regularise the standard deviation of the embeddings (making sure it is non-zero).</p>
<h3 id="objective-5">Objective</h3>
<p align="center"><img src="https://i.imgur.com/TPZlfyK.png" alt="simsiam" width="400" /></p>
<p>The architecture and model pipeline are identical to SimCLR, with the objective being the only difference, so that is what we will focus on. The loss enforces constraints on three aspects of the representation, namely variance, invariance and covariance (hence the name VICReg):</p>
<p><strong>Invariance term $s$:</strong> this term is similar to all the SSL objectives above: it minimises the distance between the representations from two views of the same image. Instead of cosine similarity, the authors use the squared L2 distance:</p>
\[\begin{align*}
s(z^A, z^B) = \|z^A - z^B\|^2_2
\end{align*}\]
<p><strong>Variance term $v$:</strong> this term makes sure that the standard deviation of the projections in each dimension of $z$ is non-zero and approaches a pre-defined target value $\gamma$.
We denote dimension $j$ of representation $z$ as $z_j$, where $j\in[1,d]$. The variance term can then be written as a hinge loss</p>
\[\begin{align*}
v(z) = \frac{1}{d}\sum^d_{j=1} \max (0, \gamma - \sqrt{\text{Var}(z_j)+\epsilon}),
\end{align*}\]
<p>where $\gamma$ is the target value of standard deviation and $\text{Var}(z_j)$ the unbiased variance estimator.</p>
<p><em>Note:</em> Some might notice that it is slightly weird that this is called the “variance” term when the standard deviation is in fact used — it turns out that if we directly use the variance in the hinge loss, the gradient becomes very close to 0 when the input vector is close to its mean vector, which prevents the loss from being effective exactly when we need it most. Using the standard deviation alleviates this.</p>
<p><strong>Covariance term $c$:</strong> This term is similar to the one used in Barlow Twins, which decorrelates different dimensions of $z$ by forcing the off-diagonal coefficients of the covariance matrix $C(z)$ to be 0:</p>
\[\begin{align*}
c(z) = \frac{1}{d}\sum_{i\neq j} C(z)^2_{i,j}
\end{align*}\]
<p>The final objective looks like this:</p>
\[\begin{align*}
\mathcal{L} = \lambda s(z^A, z^B) + \mu\left(v(z^A)+v(z^B)\right) + \nu\left(c(z^A) + c(z^B)\right),
\end{align*}\]
<p>where $\lambda, \mu$ and $\nu$ scale the importance of each term.</p>
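<p>To make the three terms concrete, here is a minimal NumPy sketch of the loss (a hedged illustration, not the official implementation; the batch is hypothetical, and the weights $\lambda=\mu=25$, $\nu=1$ are in the spirit of the paper’s defaults):</p>

```python
import numpy as np

def invariance_term(z_a, z_b):
    # s: mean squared L2 distance between the two views' embeddings.
    return ((z_a - z_b) ** 2).sum(axis=1).mean()

def variance_term(z, gamma=1.0, eps=1e-4):
    # v: hinge on the per-dimension standard deviation, penalising any
    # dimension whose std falls below the target value gamma.
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return np.maximum(0.0, gamma - std).mean()

def covariance_term(z):
    # c: sum of squared off-diagonal covariance entries, scaled by 1/d.
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return (off_diag ** 2).sum() / d

def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0):
    return (lam * invariance_term(z_a, z_b)
            + mu * (variance_term(z_a) + variance_term(z_b))
            + nu * (covariance_term(z_a) + covariance_term(z_b)))

rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 4))              # hypothetical batch of embeddings
z_b = z_a + 0.1 * rng.normal(size=(8, 4))  # embeddings of the augmented views
loss = vicreg_loss(z_a, z_b)
```

<p>Note how a fully collapsed batch (all embeddings identical) is penalised by the variance term even though the invariance term would be zero.</p>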
<h1 id="some-afterthoughts">Some Afterthoughts</h1>
<p>There are a bunch of other excellent methods for self-supervised learning that I regrettably cannot cover here due to time constraints, including but not limited to:</p>
<ul>
<li><a href="https://arxiv.org/pdf/2104.13963.pdf">PAWS</a> (April, 2021): a bit different since it is for semi-supervised learning; however, the model is able to provide a theoretical guarantee of non-degenerate solutions by performing sharpening on features;</li>
<li><a href="https://arxiv.org/pdf/2107.09282.pdf">ReSSL</a> (July, 2021): instead of using cosine similarity between different augmentations, they propose to use a relation metric to capture the similarities among different instances;</li>
<li><a href="https://arxiv.org/pdf/2104.14548.pdf">NNCLR</a> (April, 2021): instead of using only the augmented view of the same image as the positive instance, they employ nearest neighbours from the dataset as well.</li>
</ul>
<p>When looking at all these works together, the common theme of augmentation-based SSL methods is clear: draw representations extracted from semantically similar images closer in feature space, while doing “something” to prevent degenerate solutions. This blog post summarises the different “something”s used by different approaches, and attempts to discuss what kind of guarantee on avoiding latent collapse each provides. Sadly, most of these discussions, with few exceptions, are limited to empirical findings. With the popularisation of augmentation-based SSL approaches, it would be really interesting to see more works examining different collapse modes or sharing insights on why any particular strategy (batch norm, stop gradient, momentum encoder) avoids them.</p>
<p>If you are interested in doing something a bit more hands-on with the SSL methods introduced above, I would highly recommend checking out the <a href="https://github.com/vturrisi/solo-learn">solo-learn library</a>, which implements a large variety of SSL approaches in PyTorch, with benchmarked results on different datasets.</p>
<h1 id="thanks">Thanks!</h1>
<p>If you liked my blog post, please share it on social media (or with your employer, I am looking for a job :p). Thanks for reading!</p>
<h1>How I learned to stop worrying and write ELBO (and its gradients) in a billion ways (2020-06-19)</h1>
<blockquote>
<p>Latex equations not rendering? Try using a different browser or this link <a href="https://hackmd.io/@5pwCvlLhSMm2E1skjPTOTQ/elbo">here</a>.</p>
</blockquote>
<script src="mj.config" type="text/javascript"></script>
<h2 id="overview">Overview</h2>
<p>I had a really hard time learning about VAEs at the beginning of my PhD. I felt very betrayed spending time deriving and memorising ELBO (the evidence lower bound objective), only to see yet another paper write it in a different way. However, as I matured, my attitude towards this changed — now I have learned to embrace the power of the seemingly infinitely many forms of ELBO.</p>
<p>Thinking back, this transformation really took place when my supervisor <a href="https://www.robots.ox.ac.uk/~nsid/">Sid</a> introduced me to this great series of papers that covers the evolution of ELBO over the last 5 or 6 years. Organising all of them and describing them in non-gibberish took some time, but I hope that this will serve as a frustration-free note-to-self for future revisits of the topic, and that it can be helpful to people out there who are feeling as bamboozled as I was a year ago.</p>
<p>I will discuss the following papers (click on links for PDF), one in each section — and trust me they each serve a purpose and tell a whole story:</p>
<ol>
<li><a href="http://approximateinference.org/accepted/HoffmanJohnson2016.pdf">ELBO surgery</a> (warm up) \(\Rightarrow\) A more intuitive (visualisable) way to write ELBO</li>
<li><a href="https://arxiv.org/pdf/1509.00519.pdf">IWAE</a> \(\Rightarrow\) “K steps away” from basic VAE ELBO</li>
<li><a href="https://arxiv.org/pdf/1703.09194.pdf">Sticking the landing</a> \(\Rightarrow\) What’s wrong with ELBO and IWAE?</li>
<li><a href="https://arxiv.org/pdf/1802.04537.pdf">Tighter isn’t better</a> \(\Rightarrow\) What is wrong with IWAE, in particular?</li>
<li><a href="https://arxiv.org/pdf/1810.04152.pdf">DReG</a> \(\Rightarrow\) How to fix IWAE?</li>
</ol>
<h2 id="0-standard-elbo">0. Standard ELBO</h2>
<p>Before we dive in, let’s look at the most basic form of ELBO first, here it is in all of its glory:</p>
<div>
\begin{align*}
\mathcal{L}(\theta, \phi) = \mathbb{E}_{z\sim q_\phi(z\mid x)} \displaystyle \left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)} \right] \leq \log p(x),\notag
\end{align*}
</div>
<p>where \(\theta\) and \(\phi\) denote the generative and inference model parameters respectively, \(x\) the observation and \(z\) a sample from the latent space. The objective serves as a lower bound to the marginal likelihood of the observation, \(\log p(x)\), and the VAE is trained by maximising ELBO, thereby maximising the likelihood of reconstruction.</p>
<p>If you have this memorised or tattooed on your arm, we are ready to go!</p>
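<p>As a quick numerical sanity check of this bound, here is a hedged sketch: for a toy conjugate-Gaussian model (all choices of \(p\), \(q\) and the observation \(x\) below are hypothetical), a Monte Carlo estimate of ELBO over reparametrised samples should match the closed-form value:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x, m, s = 1.0, 0.3, 0.8          # observation and variational parameters
# Model: p(z) = N(0, 1), p(x|z) = N(x; z, 1), q(z|x) = N(m, s^2).

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo_mc(n_samples):
    z = m + s * rng.normal(size=n_samples)           # reparametrised samples
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    return np.mean(log_joint - log_normal(z, m, s))

# Closed form: ELBO = E_q[log p(x|z)] - KL(q || p), both analytic here.
recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
kl = 0.5 * (s ** 2 + m ** 2 - 1.0 - np.log(s ** 2))
elbo_exact = recon - kl
```

<p>With enough samples, <code>elbo_mc</code> converges to <code>elbo_exact</code>, confirming the estimator is doing what the formula says.</p>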
<h2 id="1-a-more-intuitive-visualisable-guide-to-elbo">1. A more intuitive (visualisable) guide to ELBO</h2>
<blockquote>
<p>Paper discussed: <a href="http://approximateinference.org/accepted/HoffmanJohnson2016.pdf">ELBO surgery: yet another way to carve up the variational evidence lower bound</a>, work by Matthew Hoffman and Matthew Johnson.</p>
</blockquote>
<p>This work provides a very intuitive perspective of the VAE objective by decomposing and rewriting ELBO. For a batch of N observations \(X=\{x_n\}_{n=1}^N\) and their corresponding latent codes \(Z=\{z_n\}_{n=1}^N\), ELBO can be rewritten as:</p>
<div>
\begin{align*}
\mathcal{L}(\theta, \phi) &= \underbrace{\left[ \frac{1}{N} \sum^N_{n=1} \mathbb{E}_{q(z_n\mid x_n)} [\log p(x_n \mid z_n)] \right]}_{\color{#4059AD}{\text{(1) Average reconstruction}}} - \underbrace{(\log N - \mathbb{E}_{q(z)}[\mathbb{H}[q(x_n\mid z)]])}_{\color{#EE6C4D}{\text{(2) Index-code mutual info}}} \notag \\
& - \underbrace{\text{KL}(q(z)\,\|\,p(z))}_{\color{#86CD82}{\text{(3) KL between q and p}}} \notag \\
\end{align*}
</div>
<p>Where \(q(z)\) is the marginal, i.e. \(q(z)=\sum^{N}_{n=1}q(z,x_n)\), and for large N can be approximated by the average aggregated posterior \(q^{\text{avg}}(z)=\frac{1}{N}\sum^N_{n=1}q(z \mid x_n)\).</p>
<p>So what is the point of all this? Well, what’s interesting with this decomposition is that <span style="color:#4059AD">(1) average reconstruction </span> and <span style="color:#EE6C4D">(2) index-code mutual information</span> have opposing effects on the latent space:</p>
<ul>
<li><span style="color:#4059AD">Term (1)</span> encourages accurate reconstruction of observations, which typically forces separated encoding for each \(x_n\);</li>
<li><span style="color:#EE6C4D">Term (2)</span> maximises the entropy of \(q(x_n\mid z)\), thereby promoting overlapping encodings \(q(z\mid x_n)\) for distinct observations.</li>
</ul>
<p>We visualise these effects in the graph below for two observations \(x_1,x_2\) and their corresponding latents \(z_1,z_2\). Plain and simple, <span style="color:#4059AD">(1)</span> encourages separate encodings by “squeezing” each latent code, while <span style="color:#EE6C4D">(2)</span> “stretches” them, resulting in more overlap between \(z_1\) and \(z_2\).</p>
<p align="center"><img src="https://i.imgur.com/CbpEkBH.png" alt="drawing" width="450" /></p>
<p>Fig. Visualisation of effect of term <span style="color:#4059AD">(1)</span> and <span style="color:#EE6C4D">(2)</span>. Dotted lines represent inference model \(\phi\) and solid lines generative model \(\theta\).</p>
<p>This now leaves us with <span style="color:#86CD82">term (3)</span>, which is the <strong>only term that involves the prior</strong>. This term regularises the aggregated posterior towards the prior by minimising the KL distance between \(q^{\text{avg}}(z)\) and \(p(z)\). Theoretically speaking, \(q^{\text{avg}}(z)\) can be arbitrarily close to \(p(z)\) without losing expressivity of the posterior; in practice, however, when <span style="color:#86CD82">(3)</span> is too large, it always indicates an unwanted regularisation effect from the prior.</p>
<p>Paper <a href="https://arxiv.org/pdf/1812.02833.pdf"><em>Disentangling disentanglement in Variational Autoencoders</em></a> also did a great job analysing and utilising the effect of these three terms for disentanglement in VAEs, and I strongly recommend that you go and have a look.</p>
<h1 id="2-k-steps-away-from-basic-elbo-iwae">2. “K steps away” from basic ELBO: IWAE</h1>
<blockquote>
<p>Paper discussed: <a href="https://arxiv.org/pdf/1509.00519.pdf">Importance Weighted Autoencoders</a>, work by Yuri Burda, Roger Grosse & Ruslan Salakhutdinov</p>
</blockquote>
<p>Hopefully the previous section served as a good warm-up, and now you have a better intuition for how ELBO affects the graphical model. Now we will move just a tad away from the original ELBO, to a more advanced K-sample lower bound estimator: IWAE.</p>
<p>Importance Weighted Autoencoders (IWAE) are probably my favourite machine learning trick (and I know about 4). It is a simple and yet powerful way to improve the performance of VAEs, and you’re really missing out if you went through the trouble of implementing ELBO but stopped there. Here, I will talk about the formulation of IWAE and its 3 benefits: <strong><em>tighter lower bound estimate</em></strong>, <strong><em>importance-weighted gradients</em></strong> and <strong><em>complex implicit distribution</em></strong>.</p>
<h2 id="formulation">Formulation</h2>
<p>IWAE proposes a tighter estimate of \(\log p(x)\). As a reference, here’s the original ELBO again:</p>
<div>
\begin{align*}
\mathcal{L}(\theta, \phi) = \mathbb{E}_{z\sim q(z\mid x)} \left[\log \frac{p(x,z)}{q(z\mid x)}\right] \leq \log p(x) \notag
\end{align*}
</div>
<p>A common practice to acquire a better estimate to \(\log p(x)\) with ELBO is to use its multisample variations, by taking \(K\) samples from \(q(z\mid x)\):</p>
<div>
\begin{align*}
\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{z_1, z_2 \cdots z_K \sim q(z\mid x)} \left[\frac{1}{K} \sum_{k=1}^K \log \frac{p(x,z_k)}{q(z_k\mid x)}\right]\leq \log p(x) \notag
\end{align*}
</div>
<p>IWAE simply swaps the positions of the sum over \(K\) and the \(\log\) in the above, giving us:</p>
<div>
\begin{align*}
\mathcal{L}_{\text{IWAE}}(\theta, \phi) = \mathbb{E}_{z_1, z_2 \cdots z_K \sim q(z\mid x)} \left[\log \frac{1}{K}\sum_{k=1}^K \frac{p(x,z_k)}{q(z_k\mid x)}\right]\leq \log p(x) \notag
\end{align*}
</div>
<h2 id="benefit-1-tighter-lower-bound-estimate">Benefit 1: Tighter lower bound estimate</h2>
<p>It is easy to see by <a href="https://www.probabilitycourse.com/chapter6/6_2_5_jensen's_inequality.php">Jensen’s inequality</a> that \(\mathcal{L}_{\text{VAE}}(\theta, \phi)\leq\mathcal{L}_{\text{IWAE}}(\theta, \phi)\). This means IWAE is a tighter lower bound on the marginal log likelihood.</p>
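<p>This ordering is easy to verify numerically. Below is a hedged sketch on a toy linear-Gaussian model (the model, variational distribution and observation are all hypothetical) for which the exact \(\log p(x)\) is available in closed form, so both bounds can be checked against it:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x, K = 1.0, 8                       # observation and number of importance samples
# Model: p(z) = N(0, 1), p(x|z) = N(x; z, 1); for simplicity q(z|x) is the prior.

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

vae, iwae = [], []
for _ in range(20_000):
    z = rng.normal(size=K)          # K samples from q(z|x) = N(0, 1)
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, 0.0, 1.0)
    vae.append(log_w.mean())                             # average of the logs
    iwae.append(np.logaddexp.reduce(log_w) - np.log(K))  # log of the average

vae_bound, iwae_bound = np.mean(vae), np.mean(iwae)
log_px = log_normal(x, 0.0, np.sqrt(2.0))   # exact: p(x) = N(x; 0, 2) here
```

<p>The multisample VAE bound, the IWAE bound and the true marginal line up exactly as Jensen’s inequality predicts.</p>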
<h2 id="benefit-2-importance-weighted-gradients">Benefit 2: Importance-weighted gradients</h2>
<p>Things become even more interesting if we look at the gradient of IWAE compared to the original ELBO:</p>
<div>
\begin{align*}
\nabla_\Theta \mathcal{L}_{\text{VAE}}(\theta,\phi)&=\mathbb{E}_{z_1, z_2 \cdots z_K \sim q(z\mid x)} \left[\sum_{k=1}^K \frac{1}{K} \nabla_\Theta \log \frac{p(x,z_k)}{q(z_k\mid x)}\right]\\
\nabla_\Theta \mathcal{L}_{\text{IWAE}}(\theta,\phi)&=\mathbb{E}_{z_1, z_2 \cdots z_K \sim q(z\mid x)} \left[\sum_{k=1}^K w_k \nabla_\Theta \log \frac{p(x,z_k)}{q(z_k\mid x)}\right],
\end{align*}
</div>
<p>where</p>
<div>
\begin{align*}
w_k = \frac{\frac{p(x,z_k)}{q(z_k\mid x)}}{\sum^K_{i=1}\frac{p(x,z_i)}{q(z_i\mid x)}}
\end{align*}
</div>
<p>So we can see that in \(\mathcal{L}_{\text{VAE}}\) the gradients of the samples are <strong>equally weighted</strong> by \(1/K\), whereas the \(\mathcal{L}_{\text{IWAE}}\) gradient weights them by their <strong>relative importance</strong> \(w_k\).</p>
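<p>In practice the normalised weights \(w_k\) are just a softmax over the log importance weights \(\log \tilde{w}_k = \log p(x,z_k) - \log q(z_k\mid x)\), which is also how you would compute them stably. A small illustrative sketch (the numbers below are hypothetical):</p>

```python
import numpy as np

def normalised_weights(log_w):
    # Softmax over the log weights: subtracting the max first avoids the
    # underflow that a naive exp(log_w) / exp(log_w).sum() would hit.
    shifted = np.exp(log_w - log_w.max())
    return shifted / shifted.sum()

log_w = np.array([-1000.0, -1001.0, -1003.0])  # naive exp() underflows to 0 here
w = normalised_weights(log_w)
```

<p>The weights sum to one and preserve the ordering of the log weights, no matter how extreme the raw values are.</p>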
<h2 id="benefit-3-complex-implicit-distribution">Benefit 3: Complex implicit distribution</h2>
<p>However, this is not all of it — the authors of the <a href="https://arxiv.org/pdf/1509.00519.pdf">original paper</a> also showed that IWAE can be interpreted as standard ELBO, but with a more complex (implicit) posterior distribution \(q_{IW}\), thanks to importance sampling. This is probably the most important take-away of IWAE, and I always like to go back to this plot from <a href="https://arxiv.org/pdf/1704.02916v2.pdf">reinterpreting IWAE</a> as an intuitive demonstration of its power:
<img src="https://i.imgur.com/pfDciJ7.png" alt="" />
Here, K is the number of importance-weighted samples taken, and the left-most plot is the true distribution that we are trying to approximate with the 3 different \(q_{IW}\). We can see that when \(K=1\), the IWAE objective reduces to original VAE ELBO, and the approximation to true distribution is poor; as K grows, the approximation becomes more and more accurate.</p>
<blockquote>
<p>Side note: The paper <a href="https://arxiv.org/pdf/1704.02916v2.pdf">Reinterpreting IWAE</a> helped me a lot in understanding the IWAE objective — highly recommended. In addition, <a href="http://akosiorek.github.io/ml/2018/03/14/what_is_wrong_with_vaes.html">this blog post</a> by Adam Kosiorek is also a very comprehensive take on the topic.</p>
</blockquote>
<h1 id="3-big-gradient-estimator-variance">3. Big! Gradient! Estimator! Variance!</h1>
<blockquote>
<p>Paper discussed: <a href="https://arxiv.org/pdf/1703.09194.pdf">Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference</a> by Geoffrey Roeder, Yuhuai Wu & David Duvenaud.</p>
</blockquote>
<p>So far we have discussed two variational lower bounds in detail: ELBO and IWAE. Now it is high time to take them off their pedestals and talk about what’s wrong with them — and as you can guess from the title of this section, it has something to do with gradient variance.</p>
<h2 id="recap-2-types-of-gradient-estimators">Recap: 2 types of gradient estimators</h2>
<p>Despite my best effort to sound very excited about all this, I have definitely struggled to care about things like “gradient variance” in the past, largely because there seem to be so many different Monte Carlo gradient estimators out there. But not too long ago, I realised that there are only two very common ones you need to care about: the REINFORCE estimator and the reparametrisation trick. I’m leaving some details about each of them here as a note-to-self, but here’s the key thing to remember if you want to skip this part and get to the good stuff:</p>
<ul>
<li><strong>REINFORCE estimator (<span style="color:#87BBA2;font-weight:bold">score function</span>)</strong>: very general purpose, large variance;</li>
<li><strong>Reparametrisation (<span style="color:#55828B;font-weight:bold">path derivative</span>)</strong>: less general purpose, much smaller variance.</li>
</ul>
<p>Portal to <a href="#4tighter-lower-bounds-arent-necessarily-better">next section</a>.</p>
<h3 id="reinforce-estimator-score-function">REINFORCE estimator (<span style="color:#87BBA2;font-weight:bold">score function</span>)</h3>
<p>This is commonly used in Reinforcement Learning. It is named score function because it utilises this “cool little logarithm trick”:</p>
<div>
\begin{align*}
\nabla_{\theta} \log p(x;\theta) = \frac{\nabla_\theta p(x;\theta)}{p(x;\theta)}
\end{align*}
</div>
<p>So now, when we try to estimate the gradient of some function \(f(x)\) under the expectation of distribution \(p(x;\theta)\), we can do the following:</p>
<div>
\begin{align*}
\nabla_{\theta}\mathbb{E}_{x\sim p(x;\theta)}[f(x)] = \mathbb{E}_{x\sim p(x;\theta)}[f(x)\nabla_{\theta} \log p(x;\theta)]
\end{align*}
</div>
<p>and now we can easily estimate the gradient by performing MC sampling — taking \(N\) samples \(\hat{x}\sim p(x;\theta)\):</p>
<div>
\begin{align*}
\nabla_{\theta}\mathbb{E}_{x\sim p(x;\theta)}[f(x)] \approx \frac{1}{N}\sum^N_{n=1}f(\hat{x}^{(n)})\nabla_\theta \log p(\hat{x}^{(n)};\theta);
\end{align*}
</div>
<p>Keep in mind that this score function estimator, despite being unbiased, has <strong>very large variance</strong> from multiple sources (see section 4.3.1 <a href="https://arxiv.org/pdf/1906.10652.pdf">here</a> for details). It is however very flexible, placing no requirements on \(p(x;\theta)\) or \(f(x)\) — hence its popularity.</p>
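<p>As a quick worked example (a hedged sketch with a hypothetical target, not from any paper): estimating \(\nabla_\mu \mathbb{E}_{x\sim\mathcal{N}(\mu,1)}[x^2]\), whose true value is \(2\mu\) since \(\mathbb{E}[x^2]=\mu^2+1\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 0.5, 500_000

x = mu + rng.normal(size=n)               # samples from N(mu, 1)
score = x - mu                            # grad_mu log N(x; mu, 1)
grad_estimate = np.mean(x ** 2 * score)   # REINFORCE estimate of grad_mu E[x^2]
```

<p>The estimate hovers around the true value \(2\mu = 1\), but its per-sample variance is large — which is exactly the weakness discussed above.</p>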
<h3 id="reparametrisation-trick-path-derivative">Reparametrisation trick (<span style="color:#55828B;font-weight:bold">path derivative</span>)</h3>
<p>I assume you are familiar with the reparametrisation trick if you got all the way here, but I am a completionist, so here’s a quick recap:</p>
<p>The reparametrisation trick utilises the property that for continuous distribution \(p(x;\theta)\), the following sampling processes are equivalent:</p>
<div>
\begin{align*}
\hat{x} \sim p(x;\theta) \quad \equiv \quad \hat{x}=g(\hat{\epsilon},\theta)
, \hat{\epsilon} \sim p(\epsilon)
\end{align*}
</div>
<p>The most common usage of this is seen in VAE, where instead of directly sampling from the posterior, we typically take random sample from a standard Normal distribution \(\hat{\epsilon} \sim \mathcal{N}(0,1)\) and multiply it by the mean and variance of the posterior computed from our inference model. Here’s that familiar illustration again as a reminder (image from <a href="http://dpkingma.com/wordpress/wp-content/uploads/2015/12/talk_nips_workshop_2015.pdf">Kingma’s NeurIPS2015 workshop slides</a>):</p>
<p align="center"><img src="https://i.imgur.com/wh2MJO6.png" alt="drawing" width="450" /></p>
<p>This method is much less general-purpose than the score function estimator, since it requires \(p(x;\theta)\) to be a continuous distribution and requires access to its underlying sampling path. However, by trading off generality, we get an estimator with much <strong>lower variance</strong>.</p>
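<p>The variance gap is easy to see on a toy target, \(\nabla_\mu \mathbb{E}_{x\sim\mathcal{N}(\mu,1)}[x^2] = 2\mu\) (a hedged sketch; the target and numbers are hypothetical):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 0.5, 100_000
x = mu + rng.normal(size=n)        # reparametrised samples from N(mu, 1)

# Both estimators target grad_mu E[x^2] = 2*mu, but differ wildly in variance.
score_grads = x ** 2 * (x - mu)    # score function / REINFORCE estimator
reparam_grads = 2 * x              # pathwise: d(x^2)/dx * dx/dmu, with dx/dmu = 1
```

<p>Both estimators are unbiased, but the per-sample spread of the pathwise gradients is considerably smaller.</p>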
<blockquote>
<p>Side note: For readers who’re not afraid of gradients, <a href="https://arxiv.org/pdf/1906.10652.pdf">here</a> is a great survey paper on MC gradient estimators.</p>
</blockquote>
<h2 id="the-lurking-score-function-in-reparametrisation-trick">The lurking score function in reparametrisation trick</h2>
<p>At this point we should all be familiar with reparametrisation trick used in VAEs for gradient estimation, but here we need to formalise it a bit more for the derivation in this section:</p>
<blockquote>
<p>The reparametrisation trick expresses a sample \(z\) from a parametric distribution \(q_\phi(z)\) as a deterministic function of a random variable \(\hat{\epsilon}\) with some fixed distribution and the parameters \(\phi\), i.e. \(z=t(\hat{\epsilon}, \phi)\). For example, if \(q_\phi\) is a diagonal Gaussian, then for \(\hat{\epsilon} \sim \mathcal{N}(0, 1),\ z=\mu+\sigma\hat{\epsilon}\).</p>
</blockquote>
<p>We already know that reparametrisation trick (<span style="color:#55828B">path derivative</span>) has the benefit of lower variance for gradient estimation compared to <span style="color:#87BBA2">score function</span>. The kicker here is — the gradient of ELBO actually contains a <span style="color:#87BBA2">score function</span> term, causing the estimator to have large variance!</p>
<p>To see this, we can first rewrite ELBO as the following:</p>
<div>
\begin{align*}
\mathcal{L}(\theta,\phi) = \mathbb{E}_{z\sim q_\phi(z\mid x)}\left[ \log p_\theta(x\mid z) + \log p(z) - \log q_\phi(z\mid x) \right]
\end{align*}
</div>
<p>We can then take the total derivative of the term within expectation w.r.t. \(\phi\):</p>
<div>
\begin{align*}
\nabla_{\phi} (\hat{\epsilon},\phi) &= \nabla_{\phi} \left[ \log p_\theta(x\mid z) + \log p(z) - \log q_\phi(z\mid x) \right]\\
&= \nabla_{\phi} \left[ \log p_\theta(z\mid x) + \log p(x) - \log q_\phi(z\mid x) \right]\\
&= \underbrace{\nabla_{z} \left[ \log p_\theta(z\mid x) - \log q_\phi(z\mid x) \right] \nabla_{\phi}t(\hat{\epsilon},\phi)}_{\color{#55828B}{\text{path derivative}}} - \underbrace{\nabla_{\phi}\log q_\phi(z\mid x)}_{\color{#87BBA2}{\text{score function}}}
\end{align*}
</div>
<p>So we see that \(\nabla_{\phi} (\hat{\epsilon},\phi)\) decomposes into 2 terms, one <span style="color:#55828B">path derivative</span> component that measures the dependence on \(\phi\) only through sample \(z\); the <span style="color:#87BBA2">score function</span> the dependence on \(\log q_\phi\) directly, without considering how sample \(z\) changes as a function of \(\phi\).</p>
<p>So, it is not surprising to learn that the large variance of the <span style="color:#87BBA2">score function</span> term here causes problems: the authors discovered that even when the variational posterior \(q_\phi(z\mid x)\) exactly matches the true posterior \(p(z\mid x)\), the <span style="color:#55828B">path derivative</span> component of \(\nabla_{\phi} (\hat{\epsilon},\phi)\) reduces to zero, but the <span style="color:#87BBA2">score function</span> still has non-zero variance.</p>
<p>So what do we do here? Well, authors propose to simply drop the score function component to get an unbiased gradient estimator:</p>
<div>
\begin{align*}
\hat{\nabla}_{\phi} (\hat{\epsilon},\phi)
&= \underbrace{\nabla_{z} \left[ \log p_\theta(z\mid x) - \log q_\phi(z\mid x) \right] \nabla_{\phi}t(\hat{\epsilon},\phi)}_{\color{#55828B}{\text{path derivative}}}
\end{align*}
</div>
<p>It sounds a bit wacky at first, but this approach works miracles, as the authors show in this plot:</p>
<p align="center"><img src="https://i.imgur.com/oSuYLLA.png" alt="drawing" width="400" /></p>
<p>As we can see clearly here, using the path-derivative-only gradient gives much lower variance in the gradient estimates, and \(\phi\) converges to the true variational parameters much faster.</p>
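<p>The zero-variance-at-the-optimum claim can be checked in one dimension. In the hedged sketch below (a hypothetical setup, not the paper’s experiment), \(q_\phi = \mathcal{N}(\phi, 1)\) already matches the target \(p = \mathcal{N}(0,1)\), so the true gradient of \(\mathbb{E}_q[\log p(z) - \log q(z)]\) is zero; the path-derivative-only estimator returns exactly that, while the full reparametrised gradient is still noisy:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n = 0.0, 10_000
eps = rng.normal(size=n)
z = phi + eps                      # reparametrised samples from q = N(phi, 1)

# grad_z log p(z) = -z and grad_z log q(z) = -(z - phi), with dz/dphi = 1, so the
# path derivative per sample is (-z + (z - phi)) = -phi: a constant, zero variance.
path_only = -z + (z - phi)
# The full gradient also keeps the score term -grad_phi log q = -(z - phi):
full = path_only - (z - phi)
```

<p>Both estimators are unbiased here, but only the path-derivative one “sticks the landing” once \(q\) reaches the true posterior.</p>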
<p>Note that this large gradient variance problem applies to any ELBO, including both the standard VAE and IWAE objectives. However, we will show in the next section that IWAE has its own unique problem, caused by its \(K\) multiple samples. More colloquially —</p>
<h1 id="4tighter-lower-bounds-arent-necessarily-better">4. Tighter lower bounds aren’t necessarily better</h1>
<blockquote>
<p>Paper discussed: <a href="https://arxiv.org/pdf/1802.04537.pdf">Tight Variational Bounds are Not Necessarily Better</a>, work by Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood &
Yee Whye Teh</p>
</blockquote>
<p>This builds on the previous <a href="https://arxiv.org/pdf/1703.09194.pdf">Sticking the Landing</a> paper, and discovers that the gradient variance caused by the score function becomes an even bigger problem when using a multi-sample estimator like IWAE.</p>
<p>Here it’s not just a variance problem: estimators with <strong>small expected values</strong> need <strong>proportionally smaller variance</strong> to be estimated accurately. In other words, what we really care about is the expectation-to-variance, or signal-to-noise (SNR), ratio:</p>
<p align="center"><img src="https://i.imgur.com/X8ukWX7.png" alt="drawing" width="400" /></p>
<p>Here \(\nabla_{M,K}(\phi)\) refers to the gradient with respect to \(\phi\), and here the two quantities we care about are:</p>
<ul>
<li>\(M\): the number of samples used for Monte Carlo estimate of ELBO’s <strong>gradient</strong>;</li>
<li>\(K\): the number of samples used for <strong>IWAE</strong> to estimate a tighter lower bound to \(\log p(x)\).</li>
</ul>
<p>Ideally we want a large SNR for the gradient estimators of both \(\theta\) and \(\phi\), since a small SNR indicates that the gradient estimate is dominated by noise. The main contribution of this paper is discovering the following, very surprising relationships:</p>
<div>
\begin{align*}
\text{SNR}(\theta) &= \mathcal{O}(\sqrt{MK})\\
\text{SNR}(\phi) &= \mathcal{O}(\sqrt{M/K})
\end{align*}
</div>
<p>This tells us that while increasing the number of IWAE samples \(K\) gets us a tighter lower bound, it actually worsens SNR(\(\phi\)) — meaning that <strong>a large K hurts the performance of the gradient estimator for</strong> \(\phi\)! Note that the same effect is not observed for the generative model \(\theta\), and the damage to inference model learning cannot simply be mitigated by increasing \(M\).</p>
<p>The authors gave a very comprehensive proof of their findings, so I’m going to leave the heavy mathematical lifting to the original paper :) We shall march on to the last section of this blog: an elegant solution to the large variance of ELBO gradient estimators — DReG.</p>
<h1 id="5-how-to-fix-iwae">5. How to fix IWAE?</h1>
<blockquote>
<p>Paper discussed: <a href="https://arxiv.org/pdf/1810.04152.pdf">Doubly Reparametrised Gradient Estimators for Monte Carlo Objectives</a>, work by George Tucker, Dieterich Lawson, Shixiang Gu & Chris J. Maddison.</p>
</blockquote>
<p>In <a href="#3-big-gradient-estimator-variance">section 3</a> we talked about the large gradient variance caused by the score function lurking in the gradient estimation, and in <a href="#4tighter-lower-bounds-arent-necessarily-better">section 4</a> about how this is exacerbated for IWAE. I’ll put the total derivative we have seen in <a href="#3-big-gradient-estimator-variance">section 3</a> here as a reference, but to make it more relevant, this time we rewrite it for IWAE, which uses \(K\) importance samples:</p>
<div>
\begin{align*}
\nabla_{\phi} (\hat{\epsilon},\phi) = \mathbb{E}_{\hat{\epsilon}_{1:K}} \underbrace{\left[\sum_{k=1}^K w_k \nabla_{z} \left[ \log p_\theta(z\mid x) - \log q_\phi(z\mid x) \right] \nabla_{\phi}t(\hat{\epsilon},\phi)\right.}_{\color{#55828B}{\text{path derivative}}} - \underbrace{\left.\sum_{k=1}^K w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]}_{\color{#87BBA2}{\text{score function}}} ,
\end{align*}
</div>
<p>where</p>
<div>
\begin{align*}
w_k =\frac{\tilde{w}_k}{\sum^K_{i=1} \tilde{w}_i}= \frac{\frac{p(x,z_k)}{q(z_k\mid x)}}{\sum^K_{i=1}\frac{p(x,z_i)}{q(z_i\mid x)}}.
\end{align*}
</div>
<blockquote>
<p>This is not much of a change from the total derivative of the original ELBO — as we mentioned in <a href="#2-k-steps-away-from-basic-elbo-iwae">section 2</a>, IWAE simply weights the gradients of the VAE ELBO by the relative importance \(w_k\) of each sample.</p>
</blockquote>
<p>We have learned that one way to deal with it is to completely remove the score function term. However, is there a better way than completely discarding a term in gradient estimation?</p>
<p>Well, obviously I wouldn’t be asking this question if the answer weren’t yes — the authors of this paper propose to reduce the variance by <strong>doing another reparametrisation on the score function term</strong>! Here’s how:</p>
<p>Taking the score function term in the total derivative of IWAE, we can first take the \(\sum_k\) term out of the expectation:</p>
<div>
\begin{align*}
\mathbb{E}_{\hat{\epsilon}_{1:K}} \underbrace{\left[\sum_{k=1}^K w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]}_{\color{#87BBA2}{\text{score function}}} = \sum_{k=1}^K \mathbb{E}_{\hat{\epsilon}_{1:K}} \left[ w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]
\end{align*}
</div>
<p>Now we can just ignore the sum and focus on what’s in the expectation \(\mathbb{E}_{\hat{\epsilon}_{1:K}}\). Since the derivative is taken with respect to \(\phi\), we can treat \(\hat{\epsilon}\), the pseudo-sample we take for the reparametrisation trick, as a constant. Therefore, it is possible to substitute \(\hat{\epsilon}\) by \(z\), the actual sample from our approximate posterior — also a constant as far as \(\nabla_\phi\) is concerned. This way we have:</p>
<div>
\begin{align*}
\mathbb{E}_{\hat{\epsilon}_{1:K}} \left[ w_k \nabla_{\phi}\log q_\phi(z\mid x)\right] &= \mathbb{E}_{z_{1:K}} \left[ w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]\\\\
&= \mathbb{E}_{z_{-k}} \underbrace{\mathbb{E}_{z_k} \left[ w_k \nabla_{\phi}\log q_\phi(z\mid x)\right]}_{\text{A }\color{#EE6C4D}{\text{REINFORCE}}\text{ term appears!}}
\end{align*}
</div>
<p>By doing this substitution, a <span style="color:#EE6C4D">REINFORCE</span> term appears! I’ll just let that sink in for a bit.</p>
<blockquote>
<p>I should clarify that previously we just had the score function term, but since the expectation is over \(\hat{\epsilon}\) instead of actual samples from \(q_\phi(z\mid x)\), it was not actually REINFORCE.</p>
</blockquote>
<p>This is important because REINFORCE and the reparametrisation trick are interchangeable, as we see below:</p>
<div>
\begin{align*}
\underbrace{\mathbb{E}_{q_\phi (z\mid x)}\left[ f(z)\frac{\partial}{\partial \phi}\log q_\phi(z\mid x) \right]}_{\color{#EE6C4D}{\text{REINFORCE}}} = \underbrace{\mathbb{E}_{\hat{\epsilon}} \left[ \frac{\partial f(z)}{\partial z} \frac{\partial z(\hat{\epsilon}, \phi)}{\partial \phi} \right]}_{\text{reparametrisation trick}}
\end{align*}
</div>
<p>If we substitute the above back into the original total derivative of IWAE, then after some math montage we can simplify it to the following:</p>
<div>
\begin{align*}
\nabla_{\phi} \mathcal{L}(\hat{\epsilon},\phi) = \mathbb{E}_{\hat{\epsilon}_{1:K}} \left[ \sum^K_{k=1} (w_k)^2 \frac{\partial\log \tilde{w}_k}{\partial z_k} \frac{\partial z_k}{\partial \phi} \right]
\end{align*}
</div>
<p>This is actually very easy to implement. Cheeky little plug: we used this objective in our paper on multimodal VAE learning; you can find the code <a href="https://github.com/iffsid/mmvae">here</a>, which comes with a handy implementation of DReG in PyTorch.</p>
<h1 id="we-are-done">We are done!</h1>
<p>Heartfelt congratulations if you got all the way here, well done! Leave a comment if you have any questions, and if you found this helpful please share it on Twitter/Facebook :)</p>
<p>Special thanks to my supervisor Dr. Siddharth Narayanaswamy for guiding me through these literature with great insights and extreme patience.</p>Yuge Shiyshi@robots.ox.ac.ukLatex equations not rendering? Try using a different browser or this link here.Gaussian Process, not quite for dummies2019-09-05T00:00:00-07:002019-09-05T00:00:00-07:00https://yugeten.github.io/posts/2019/09/gp<h1 id="before-diving-in">Before diving in</h1>
<p>For a long time, I recall having this vague impression about Gaussian Processes (GPs) being able to magically define probability distributions over sets of functions, yet I procrastinated reading up about them for many many moons. However, as always, I’d like to think that this is not just due to my procrastination superpowers. Whenever I look up “Gaussian Process” on Google, I find these well-written tutorials with vivid plots that explain everything up until non-linear regression in detail, but shy away at the very first glimpse of any sort of information theory. The key takeaway is always,</p>
<blockquote>
<p>A Gaussian process is a probability distribution over possible functions that fit a set of points.</p>
</blockquote>
<p>While memorising this sentence does help if some random stranger comes up to you on the street and asks for a definition of a Gaussian Process – which I’m sure happens all the time – it doesn’t get you much further beyond that. In what range does the algorithm search for “possible functions”? What gives it the capacity to model things on a continuous, infinite space?</p>
<p>Confused, I turned to the “the Book” in this area, <a href="http://www.gaussianprocess.org/gpml/chapters/RW.pdf"><em>Gaussian Processes for Machine Learning</em></a> by Carl Edward Rasmussen and Christopher K. I. Williams. I have friends working in more statistical areas who swear by this book, but after spending half an hour just to read 2 pages about linear regression I went straight into an existential crisis. I’m sure it’s a great book, but the math is quite out of my league.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/92-98SYOdlY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>So what more is there? Thankfully I found the above lecture by Dr. Richard Turner on YouTube, which was a great introduction to GPs and some of their state-of-the-art approaches. After watching this video, reading the <em>Gaussian Processes for Machine Learning</em> book became a lot easier. So I decided to compile some notes for the lecture, which can now hopefully help other people who are eager to do more than just scratch the surface of GPs by reading some “machine learning for dummies” tutorial, but don’t quite have the claws to take on a textbook.</p>
<p><em><strong>Acknowledgement:</strong> the figures in this blog are from Dr. Richard Turner’s talk “Gaussian Processes: From the Basic to the State-of-the-Art”, which I highly recommend! Have a lookie here: <a href="http://cbl.eng.cam.ac.uk/pub/Public/Turner/News/imperial-gp-tutorial.pdf">Portal to slides</a>.</em></p>
<h1 id="motivation-non-linear-regression">Motivation: non-linear regression</h1>
<p>Of course, like almost everything in machine learning, we have to start from regression. Let’s revisit the problem: somebody comes to you with some data points (red points in image below), and we would like to make a prediction of the value of $y$ at a specific $x$.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60500034-8bcb1f00-9cb1-11e9-9028-c2982528a5f2.png" alt="drawing" width="300" /></p>
<p>In non-linear regression, we fit some nonlinear curves to observations. The higher the degree of polynomial you choose, the better it will fit the observations.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60500079-9be2fe80-9cb1-11e9-96ff-8f15bb902bd6.png" alt="drawing" width="300" /></p>
<p>This sort of traditional non-linear regression, however, typically gives you <strong>one</strong> function that it considers to fit these observations the best. But what about the other ones that are also pretty good? What if we observed one more point, and one of those other candidates ended up being a much better fit than the “best” solution?</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60500111-aac9b100-9cb1-11e9-8e18-181598397e29.png" alt="drawing" width="300" /></p>
<p>To solve this problem, we turn to the good Ol’ Gaussians.</p>
<h1 id="the-world-of-gaussians">The world of Gaussians</h1>
<h2 id="recap">Recap</h2>
<p>Here we cover the basics of multivariate Gaussian distribution. If you’re already familiar with this, skip to the next section <em><strong>2D Gaussian Examples</strong></em>.</p>
<p>The Multivariate Gaussian distribution is also known as the joint normal distribution, and is the generalisation of the univariate Gaussian distribution to high dimensional spaces. Formally, the definition is:</p>
<blockquote>
<p>A random variable is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution.</p>
</blockquote>
<p>Mathematically, $X = (X_1, …X_k)^T$ has a multivariate Gaussian distribution if $Y=a_1X_1 + a_2X_2 … + a_kX_k$ is normally distributed for any constant vector ${a} \in \mathcal{R}^k$.</p>
<p><em><strong>Note</strong>: if all k components are independent Gaussian random variables, then $X$ must be multivariate Gaussian (because the sum of independent Gaussian random variables is always Gaussian).</em></p>
<p><em><strong>Another note</strong>: the sum of random variables is different from the sum of distributions – summing two Gaussian densities gives you a Gaussian mixture, which is not Gaussian except in special cases.</em></p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60735059-d36bd800-9f49-11e9-870a-9c6958afc964.png" alt="drawing" width="500" /></p>
<h2 id="2d-gaussian-examples">2D Gaussian Examples</h2>
<h3 id="covariance-matrix">Covariance matrix</h3>
<p>Here is an example of a 2D Gaussian distribution with mean 0, with the oval contours denoting points of constant probability.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60500191-ccc33380-9cb1-11e9-8f80-5b695d896000.png" alt="drawing" width="350" /></p>
<p>The covariance matrix, denoted as $\Sigma$, tells us (1) the <strong>variance</strong> of each individual random variable (on diagonal entries) and (2) the <strong>covariance</strong> between the random variables (off diagonal entries). The covariance matrix in above image indicates that $y_1$ and $y_2$ are positively correlated (with $0.7$ covariance), therefore the somewhat “stretchy” shape of the contour. If we keep reducing the covariance while keeping the variance unchanged, the following transition can be observed:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60500266-f0867980-9cb1-11e9-8da6-f7f4fc5ab858.png" alt="drawing" width="220" /><img src="https://user-images.githubusercontent.com/18204038/60500289-faa87800-9cb1-11e9-84d0-50685c54a535.png" alt="drawing" width="220" /><img src="https://user-images.githubusercontent.com/18204038/60500307-03994980-9cb2-11e9-9efe-22cabecc0a27.png" alt="drawing" width="220" /></p>
<p>Note that when $y_1$ is independent from $y_2$ (rightmost plot above), the contours are circular.</p>
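<p>As a quick sanity check of this picture (a toy sketch of mine, not from the talk), we can sample from the 2D Gaussian above and verify that the empirical covariance recovers $\Sigma$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])   # the covariance from the contour example above

# draw many samples and estimate the covariance back from them
samples = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
emp_cov = np.cov(samples.T)      # recovers Sigma up to Monte Carlo error
```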
<h3 id="conditioning">Conditioning</h3>
<p>With multivariate Gaussian, another fun thing we can do is conditioning. In 2D, we can demonstrate this graphically:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60500452-478c4e80-9cb2-11e9-8df8-763938a0eca2.png" alt="drawing" width="350" /></p>
<p>We fix the value of $y_1$ to compute the density of $y_2$ along the red line – thereby conditioning on $y_1$. Note that here, since $y_2 \sim \mathcal{N}(\mu, \sigma)$, conditioning gives us a Gaussian back.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60500477-4f4bf300-9cb2-11e9-9a90-7fc316f64468.png" alt="drawing" width="350" /></p>
<p>We can also visualise how this conditioned Gaussian changes as the correlation drops – when the correlation is $0$, $y_1$ tells you nothing about $y_2$, so for $y_2$ the mean drops to $0$ and the variance becomes high.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/68397329-eb0ed380-016a-11ea-9720-ae7ce342d3d4.gif" alt="drawing" width="350" /></p>
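<p>These plots follow directly from the standard conditional Gaussian formulas; here is a minimal sketch (the helper function is my own) for the zero-mean 2D case:</p>

```python
import numpy as np

def condition_2d(Sigma, y1):
    """Conditional of a zero-mean 2D Gaussian: distribution of y2 given y1 = y1_value."""
    cond_mean = Sigma[1, 0] / Sigma[0, 0] * y1
    cond_var = Sigma[1, 1] - Sigma[1, 0]**2 / Sigma[0, 0]
    return cond_mean, cond_var

Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
m, v = condition_2d(Sigma, 1.0)       # strong correlation: informative conditional

Sigma0 = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
m0, v0 = condition_2d(Sigma0, 1.0)    # zero correlation: mean drops to 0, variance back to the prior
```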
<h1 id="high-dimensional-gaussian-a-new-interpretation">High dimensional gaussian: a new interpretation</h1>
<h2 id="2d-gaussian">2D Gaussian</h2>
<p>The oval contour graph of Gaussian, while providing information on the mean and covariance of our multivariate Gaussian distribution, does not really give us much intuition on how the random variables correlate with each other during the sampling process.</p>
<p>Therefore, consider this new interpretation that can be plotted as such:</p>
<p>Take the oval contour graph of the 2D Gaussian (left-top in below image) and choose a <strong>random point</strong> on the graph. Then, plot the value of $y_1$ and $y_2$ of that point on a new graph, at index = $1$ and $2$, respectively.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60428229-4e528d00-9bf0-11e9-8813-9931dd159fb8.png" alt="drawing" width="300" /></p>
<p>Under this setting, we can now visualise the sampling operation in a new way by taking multiple “<strong>random points</strong>” and plot $y_1$ and $y_2$ at index $1$ and $2$ multiple times. Because $y_1$ and $y_2$ are correlated ($0.9$ correlation), as we take multiple samples, the bar on the index graph only “wiggles” ever so slightly as the two endpoints move up and down together.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/68397330-eba76a00-016a-11ea-9950-d1dec3ee1285.gif" alt="drawing" width="300" /></p>
<p>For conditioning, we can simply fix one of the endpoint on the index graph (in below plots, fix $y_1$ to 1) and sample from $y_2$.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/68397339-ee09c400-016a-11ea-94c9-a3121f725deb.gif" alt="drawing" width="300" /></p>
<h2 id="higher-dimensional-gaussian">Higher dimensional Gaussian</h2>
<h3 id="5d-gaussian">5D Gaussian</h3>
<p>Now we can consider a higher dimension Gaussian, starting from 5D — so the covariance matrix is now 5x5.</p>
<p>Take a second to have a good look at the covariance matrix, and notice:</p>
<ol>
<li>All variances (diagonal) are equal to 1;</li>
<li>The further away the indices of two points are, the less correlated they are. For instance, the correlation between $y_1$ and $y_2$ is quite high, $y_1$ and $y_3$ lower, and $y_1$ and $y_4$ the lowest.</li>
</ol>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/68397343-f06c1e00-016a-11ea-8ad3-8203f940f495.gif" alt="drawing" width="300" /></p>
<p>We can again condition on $y_1$ and take samples for all the other points. Notice that $y_2$ is moving less compared to $y_3$ - $y_5$ because it is more correlated to $y_1$.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/68397344-f104b480-016a-11ea-84f3-32e7485b965e.gif" alt="drawing" width="300" /></p>
<h3 id="20d-gaussian">20D Gaussian</h3>
<p>To make things more intuitive, for 20D Gaussian we replace the numerical covariance matrix by a colour map, with warmer colors indicating higher correlation:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61445672-ba015d80-a945-11e9-95a8-e026a26f2856.png" alt="drawing" width="200" /></p>
<p>This gives us samples that look like this:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/68397345-f235e180-016a-11ea-9ebb-a81bb34d2033.gif" alt="drawing" width="300" /></p>
<p>Now look at what happens to the 20D Gaussian conditioned on $y_1$ and $y_2$:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/68397346-f2ce7800-016a-11ea-96d2-cce9ee1c8116.gif" alt="drawing" width="300" /></p>
<p>Hopefully you may now be thinking: “Ah, this is looking exactly like the nonlinear regression problem we started with!” And yes, indeed, this is exactly like a nonlinear regression problem where $y_1$ and $y_2$ are given as observations. Using this index plot with 20D Gaussian, we can now generate <strong>a family of curves</strong> that fits these observations. Even better, if we generate a number of them, we can compute the mean and variance of the fitting using these randomly generated curves. We visualise this in the plot below.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61446482-1f098300-a947-11e9-8658-18c0b6e4d50d.png" alt="drawing" width="400" /></p>
<p>We can see from the above image that because of how the covariance matrix is structured (i.e. closer points have higher correlation), the points closer to the observations have very low uncertainty with non-zero mean, whereas the ones further away have high uncertainty and zero mean. <em>(Note that in reality, we don’t have to actually take many many many samples to estimate the mean and standard deviation – they are completely analytical.)</em></p>
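<p>The conditioning behind these plots can be sketched in a few lines (my own toy setup: an RBF covariance over the 20 indices with an assumed length-scale of 2, and two made-up observations):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
idx = np.arange(20)
d = idx[:, None] - idx[None, :]
K = np.exp(-0.5 * (d / 2.0)**2) + 1e-9 * np.eye(20)   # RBF over the indices, length-scale 2

obs = np.array([3, 12])            # the two conditioned indices
y_obs = np.array([1.0, -0.5])      # their observed values
rest = np.setdiff1d(idx, obs)      # everything else

# standard Gaussian conditioning: mean and covariance of the unobserved partition
Koo = K[np.ix_(obs, obs)]
Kro = K[np.ix_(rest, obs)]
Krr = K[np.ix_(rest, rest)]
mean = Kro @ np.linalg.solve(Koo, y_obs)
cov = Krr - Kro @ np.linalg.solve(Koo, Kro.T)

# a family of curves consistent with the two observations
curves = rng.multivariate_normal(mean, cov, size=5)
```

<p>Each sampled curve stays close to the observations and varies freely far away from them, exactly as in the animation above.</p>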
<p>Here we also offer a slightly more <em>exciting</em> example where we condition on 4 points of the 20D Gaussian (and you wonder why everybody hates statisticians):</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61447393-c2a76300-a948-11e9-98da-5843098f7523.png" alt="drawing" width="200" /></p>
<h2 id="getting-real">Getting “real”</h2>
<p>The problem with this approach for nonlinear regression seems obvious – it feels like all the points on the x-axis have to be integers because they are indices, while in reality, we want to model observations with real values. One immediately obvious solution is to keep increasing the dimensionality of the Gaussian and calculate many points close to the observations, but that is a bit clumsy.</p>
<p>The solution lies in how the covariance matrix is generated. Conventionally, $\Sigma$ is calculated using the following 2-step process:</p>
\[\Sigma (x_1, x_2) = K(x_1, x_2) + I \sigma_y^2\]
\[K(x_1, x_2) = \sigma^2 e^{-\frac{1}{2l^2}(x_1 - x_2)^2}\]
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61448254-75c48c00-a94a-11e9-9150-79b03d54c694.png" alt="drawing" width="200" /></p>
<p>The covariance matrices in all the above examples are computed using the Radial Basis Function (RBF) kernel $K(x_1, x_2)$ – all by taking integer values for $x_1$, $x_2$. This RBF kernel ensures the <strong>“smoothness”</strong> of the covariance matrix, by generating large output values for $x_1$, $x_2$ inputs that are close to each other and smaller values for inputs that are further away. Note that if $x_1=x_2$, $K(x_1, x_2)=\sigma^2$. We then take $K$ and add $I\sigma_y^2$ to form the final covariance matrix, to factor in noise – more on this later.</p>
<p>This means in principle, <strong>we can calculate this covariance matrix for any real-valued $x_1$ and $x_2$ by simply plugging them in</strong>. The real-valued $x$s effectively result in an infinite-dimensional Gaussian defined by the covariance matrix.</p>
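<p>Evaluating the RBF kernel at arbitrary real-valued inputs is a one-liner; here is a minimal numpy sketch (the function name and inputs are my own):</p>

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, length=1.0):
    """Squared-exponential (RBF) kernel evaluated between two sets of 1D inputs."""
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * d**2 / length**2)

x = np.array([0.0, 0.3, 3.0])     # arbitrary real-valued inputs, no integer indices needed
K = rbf_kernel(x, x)
# diagonal entries equal sigma^2; nearby inputs get large covariance, distant ones small
```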
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60430108-61fff280-9bf4-11e9-8ef4-a989734b859f.png" alt="drawing" width="400" /></p>
<p>Now that, is a <strong>Gaussian process</strong> (mic drop).</p>
<h1 id="gaussian-process">Gaussian Process</h1>
<h2 id="textbook-definition">Textbook definition</h2>
<p>From the above derivation, you can view Gaussian process as a generalisation of multivariate Gaussian distribution to infinitely many variables. Here we also provide the textbook definition of GP, in case you had to testify under oath:</p>
<blockquote>
<p>A Gaussian process is a collection of random variables, any finite number of which have consistent Gaussian distributions.</p>
</blockquote>
<p>Just like a Gaussian distribution is specified by its mean and variance, a Gaussian process is completely defined by (1) a mean function $m(x)$ telling you the mean at any point of the input space and (2) a covariance function $K(x, x’)$ that sets the covariance between points. The mean can be any value and the covariance matrix should be positive definite.</p>
\[f(x) \sim \mathcal{G}\mathcal{P}(m(x), K(x, x'))\]
<h2 id="parametric-vs-non-parametric">Parametric vs. non-parametric</h2>
<p>Note that our Gaussian processes are non-parametric, as opposed to nonlinear regression models, which are parametric. And here’s a secret:</p>
<h3 align="center"> non-parametric model == model with infinite number of parameters </h3>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61577225-40e83e80-aadc-11e9-8b52-d001d25d7181.png" alt="drawing" width="200" /></p>
<p>In a parametric model, we define the function explicitly with some parameters:</p>
\[y(x) = f(x) + \epsilon \sigma_y\]
\[p(\epsilon) = \mathcal{N}(0,1)\]
<p>where $\sigma_y$ is Gaussian noise describing how noisy the fit is to the actual observations (graphically, it represents how closely the data lies to the fitted curve).
We can place a Gaussian process prior over the nonlinear function – meaning, we assume that the parametric function above is drawn from the Gaussian process defined as follows:</p>
\[p(f(x)\mid \theta) = \mathcal{G}\mathcal{P}(0, K(x, x'))\]
\[K(x, x') = \sigma^2 \text{exp}(-\frac{1}{2l^2}(x-x')^2)\]
<p>This GP will now generate lots of smooth/wiggly functions, and if you think your parametric function falls into this family of functions that GP generates, this is now a sensible way to perform non-linear regression.</p>
<p>We can also add Gaussian noise $\sigma_y$ directly to the model, since the sum of Gaussian variables is also a Gaussian:</p>
\[p(f(x)\mid \theta) = \mathcal{G}\mathcal{P}(0, K(x, x') + I\sigma_y^2)\]
<p>In summary, GP regression is exactly the same as regression with parametric models, except you put a prior on the set of functions you’d like to consider for this dataset. The characteristic of this “set of functions” you consider is defined by the kernel of choice ($K(x, x’)$). Note that conventionally the prior has mean 0.</p>
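<p>Sampling functions from this prior amounts to drawing from a multivariate Gaussian whose covariance is built by the kernel; here is a small sketch under the same RBF assumptions (the grid and jitter term are my own):</p>

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, length=1.0):
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * d**2 / length**2)

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 100)
K = rbf_kernel(xs, xs) + 1e-8 * np.eye(len(xs))  # tiny jitter keeps the matrix positive definite

# each row is one smooth function drawn from the zero-mean GP prior
prior_draws = rng.multivariate_normal(np.zeros(len(xs)), K, size=5)
```

<p>Plotting each row against <code>xs</code> reproduces the smooth/wiggly prior samples described above; shrinking <code>length</code> makes them wigglier.</p>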
<h2 id="hyperparameters">Hyperparameters</h2>
<p>There are 2 hyperparameters here:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60430623-a2ac3b80-9bf5-11e9-952d-7181bbec066a.png" alt="drawing" width="400" /></p>
<ul>
<li><strong>Vertical scale</strong> $\sigma$: describes how much span the function has vertically;</li>
<li><strong>Horizontal scale</strong> $l$: describes how quickly the correlation between two points drops as the distance between them increases – a high $l$ gives you a <em>smooth</em> function, while lower $l$ results in a <em>wiggly</em> function.</li>
</ul>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61697612-82245c80-ad2f-11e9-8540-cef3b0715617.png" alt="drawing" width="400" /></p>
<p>Luckily, because $p(y \mid \theta)$ is Gaussian, we can compute its likelihood in closed form. That means we can just maximise the likelihood of $p(y\mid \theta)$ under these hyperparameters using a gradient optimiser:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61696910-1b527380-ad2e-11e9-8392-5a6bfe0bdfc0.png" alt="drawing" width="150" /></p>
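<p>The closed-form objective here is $\log p(y\mid\theta) = -\tfrac{1}{2}y^\top C^{-1}y - \tfrac{1}{2}\log|C| - \tfrac{n}{2}\log 2\pi$ with $C = K + I\sigma_y^2$. A minimal sketch (the toy data and noise level are my own) comparing two length-scales on smooth data:</p>

```python
import numpy as np

def rbf(x1, x2, sigma=1.0, length=1.0):
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * d**2 / length**2)

def log_marginal_likelihood(x, y, sigma, length, noise=0.1):
    """Closed-form log p(y | theta) for a zero-mean GP."""
    C = rbf(x, x, sigma, length) + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(C)                  # avoids forming C^{-1} explicitly
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()         # equals 0.5 * log|C|
            - 0.5 * len(x) * np.log(2 * np.pi))

x = np.linspace(0, 5, 10)
y = np.sin(x)   # smooth toy data
# a length-scale matched to the data explains it far better than a tiny, wiggly one
```

<p>In practice you would hand this quantity (or its gradient) to an optimiser and let it pick $\sigma$ and $l$.</p>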
<h1 id="details-for-implementation">Details for implementation</h1>
<p><em><strong>Before we start:</strong> here we are going to stay quite high level – no code will be shown, but you can easily find many implementations of GP on GitHub (personally I like <a href="https://github.com/dfm/gp/blob/master/worksheet.ipynb">this repo</a>, it’s a Jupyter Notebook walk through with step-by-step explanation). However, this part is important to understanding how GP actually works, so try not to skip it.</em></p>
<h2 id="computation">Computation</h2>
<p>Hopefully at this point you are wondering: this smooth function with infinite-dimensional covariance matrix thing all sounds well and good, but how do we actually do computation with an infinite by infinite matrix?</p>
<p><strong>Marginalisation baby!</strong> Imagine you have a multivariate Gaussian over two vector variables $y_1$ and $y_2$, where:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61701083-7b98e380-ad35-11e9-8012-5b86e1299cf5.png" alt="drawing" width="250" /></p>
<p>Here, we partition the mean into the mean of $y_1$, $a$ and the mean of $y_2$, $b$; similarly, for covariance matrix, we have $A$ as the covariance of $y_1$, $B$ that of $y_1$ and $y_2$, $B^T$ that of $y_2$ and $y_1$ and $C$ of $y_2$.
So now, we can easily compute the probability of $y_1$ using the marginalisation property:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61701492-2b6e5100-ad36-11e9-8ad8-7c2478b29d4d.png" alt="drawing" width="500" /></p>
<p>This formation is extremely powerful — it allows us to calculate the likelihood of $y_1$ under the joint distribution of $p(y_1, y_2)$, while completely ignoring $y_2$! We can now generalise from two variables to <strong>infinitely many</strong>, by altering our definition of $y_1$ and $y_2$ to:</p>
<ul>
<li>$y_1$: contains a finite number of variables we are interested in;</li>
<li>$y_2$: contains all the variables we don’t care about, which is infinitely many.</li>
</ul>
<p>Then, just like in the 2-variable case, we can compute the mean and covariance of the $y_1$ partition only, without having to worry about the infinite stuff in $y_2$. This nice little property allows us to work with the finite-dimensional projection of the underlying infinite object on our computer, and forget about the infinite stuff happening under the hood.</p>
<h2 id="predictions">Predictions</h2>
<p>Taking the above $y_1$, $y_2$ example, but this time imagine all the observations are in partition $y_2$ and all the points we want to make predictions about are in $y_1$ (again, the infinite points are still in the background, let’s imagine we’ve shoved them into some $y_3$ that is omitted here).</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61702038-09290300-ad37-11e9-911a-346d4e5a9727.png" alt="drawing" width="250" /></p>
<p>To make predictions about $y_1$ given observations of $y_2$, we can then use Bayes’ rule to calculate $p(y_1\mid y_2)$:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61702480-c156ab80-ad37-11e9-94e7-8c7989fefd0a.png" alt="drawing" width="150" /></p>
<p>Because $p(y_1)$, $p(y_2)$ and $p(y_1,y_2)$ are all Gaussians, $p(y_1\mid y_2)$ is also Gaussian. We can therefore compute $p(y_1\mid y_2)$ analytically:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61703635-ea783b80-ad39-11e9-9606-54b5995df95c.png" alt="drawing" width="350" /></p>
<blockquote>
<p>Note: here we catch a glimpse of the bottleneck of GP: we can see that this analytical solution involves computing the inverse of the covariance matrix of our observation $C^{-1}$, which, given $n$ observations, is an $O(n^3)$ operation. This is why we use Cholesky decomposition – more on this later.</p>
</blockquote>
<p>To gain some more intuition on the method, we can write out the predictive mean and predictive covariance as such:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61703783-3e832000-ad3a-11e9-9139-3a478550e1be.png" alt="drawing" width="500" /></p>
<p>So the mean of $p(y_1 \mid y_2)$ is linearly related to $y_2$, and the predictive covariance is the prior uncertainty subtracted by the reduction in uncertainty after seeing the observations. Therefore, the more data we see, the more certain we are.</p>
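<p>Putting the predictive equations together (using the Cholesky decomposition mentioned in the note above rather than an explicit inverse), here is a minimal sketch with toy data of my own:</p>

```python
import numpy as np

def rbf(x1, x2, sigma=1.0, length=1.0):
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * d**2 / length**2)

def gp_predict(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and covariance of a zero-mean GP at x_new, given (x_obs, y_obs)."""
    C = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))  # covariance of the observations
    B = rbf(x_new, x_obs)                               # cross-covariance
    A = rbf(x_new, x_new)                               # prior covariance at the new points
    L = np.linalg.cholesky(C)                           # O(n^3), but stabler than inverting C
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mean = B @ alpha                                    # linear in the observations
    v = np.linalg.solve(L, B.T)
    cov = A - v.T @ v                                   # prior minus reduction in uncertainty
    return mean, cov

x_obs = np.array([-1.0, 0.0, 1.0])
y_obs = np.sin(x_obs)
mean, cov = gp_predict(x_obs, y_obs, np.array([0.0, 2.5]))
# uncertainty is tiny at the observed point and grows back towards the prior far away
```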
<h2 id="higher-dimensional-data">Higher dimensional data</h2>
<p>You can also do this for higher-dimensional data (though of course at greater computational costs). Here we extend the covariance function to incorporate RBF kernels in 2D data:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60501013-32fc8600-9cb3-11e9-84db-39fd11a947f5.png" alt="drawing" width="500" /></p>
<h2 id="covariance-matrix-selection">Covariance matrix selection</h2>
<p>As one last detail, let’s talk about the different covariance matrices used for GPs. I don’t have any authoritative advice on selecting kernels for GPs in general, and I believe in practice, most people try a few popular kernels and pick the one that fits their data/problem the best. So here we will only introduce the form of some of the most frequently seen kernels, get a feel for them with some plots, and not go into too much detail. (I highly recommend implementing some of them and playing around with them though! It’s good coding practice and the best way to gain intuition about these kernels.)</p>
<h3 id="laplacian-function">Laplacian Function</h3>
<p>This function is continuous but non-differentiable. It looks like this:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61705455-adae4380-ad3d-11e9-848b-0f9c49f85b99.png" alt="drawing" width="400" /></p>
<p>If you average over all samples, you get straight lines joining your datapoints, which are called Brownian bridges.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60494726-3c7ff100-9ca7-11e9-864d-e41619f2bd5c.png" alt="drawing" width="150" /><img src="https://user-images.githubusercontent.com/18204038/60494726-3c7ff100-9ca7-11e9-864d-e41619f2bd5c.png" alt="drawing" width="150" /></p>
<h3 id="rational-quadratic">Rational quadratic</h3>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61705556-e0583c00-ad3d-11e9-90d7-177c1cdb5f66.png" alt="drawing" width="400" /></p>
<p>Average over all samples looks like this:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60499570-a18c1480-9cb0-11e9-99a4-87e02f195c9e.png" alt="drawing" width="150" /><img src="https://user-images.githubusercontent.com/18204038/60499556-9b963380-9cb0-11e9-9f1c-825484a88a74.png" alt="drawing" width="150" /></p>
<h3 id="periodic-functions">Periodic functions</h3>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61706968-55794080-ad41-11e9-81ee-21043a296a08.png" alt="drawing" width="400" /></p>
<p>Average over all samples looks like this:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/60494736-430e6880-9ca7-11e9-812b-b7944305d16f.png" alt="drawing" width="150" /><img src="https://user-images.githubusercontent.com/18204038/61706395-e9e2a380-ad3f-11e9-86dc-713f14db3bae.png" alt="drawing" width="150" /></p>
<h3 id="summary">Summary</h3>
<p>There are books you can consult for appropriate covariance functions for your particular problem, and rules you can follow to produce more complicated covariance functions (for example, the product of two covariance functions is a valid covariance function). They can give you very different results:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61706573-565da280-ad40-11e9-848a-10f34fa1113b.png" alt="drawing" width="500" /></p>
<p>It is tricky to find the appropriate covariance function, but there are also methods in place for model selection. One of those methods is Bayesian model comparison, defined as follows:</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/61706770-d5eb7180-ad40-11e9-9235-b802afec85c4.png" alt="drawing" width="500" /></p>
<p>However, it does involve a very difficult integral (or sum in the discrete case, as shown above) over the hyperparameters of your GP, which makes it impractical; it is also very sensitive to the prior you put over your hyperparameters. In practice, it is more common to use deep Gaussian Processes for automatic kernel design, which optimise the choice of covariance function for your data through training.</p>
<h1 id="the-end">The end</h1>
<p>Hopefully this has been a helpful guide to Gaussian processes for you. I wanted to keep things relatively short and simple here, so I did not delve into the complications of using GPs in practice – in reality GPs suffer from not being able to scale to large datasets, and the choice of kernel can be very tricky. There are some state-of-the-art approaches that tackle these issues (see <a href="https://arxiv.org/abs/1602.04133">deep GP</a> and <a href="https://arxiv.org/abs/1605.07066">sparse GP</a>), but since I am by no means an expert in this area I will leave you to explore them.</p>
<p>Thank you for reading! Remember to take your canvas bag to the supermarket, baby whales are dying.</p>
<p align="center"><img src="https://user-images.githubusercontent.com/18204038/64785652-c03a4180-d564-11e9-9662-980fb7ecd522.png" alt="drawing" width="250" /></p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>I would like to thank Andrey Kurenkov and Hugh Zhang from <a href="https://thegradient.pub/">The Gradient</a> for helping me with the edits of this article.</p>Yuge Shiyshi@robots.ox.ac.uk