BYOL - Learning Representations Without Looking at Negatives

- 8 mins

Self-supervised learning has become one of the most exciting areas in deep learning — the idea that you can learn powerful representations from raw, unlabeled data. Most state-of-the-art approaches in this space rely on contrastive learning, which works by pulling together representations of similar items and pushing apart representations of different items. The “negative pairs” — pairs of different items that you push apart — are central to how these methods work.

BYOL (Bootstrap Your Own Latent), introduced by Grill et al. from DeepMind in 2020, asks a bold question: do we actually need negative pairs at all?

The answer, surprisingly, is no.

The Problem with Contrastive Methods

Before understanding BYOL, it helps to understand what contrastive methods like SimCLR and MoCo are doing, and where they struggle.

The contrastive objective, InfoNCE, looks roughly like this:

\[\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k) / \tau)}\]

where $z_i$ and $z_j$ are representations of two augmented views of the same image (positive pair), and the denominator sums over all other representations in the batch (negative pairs).

The objective pushes $z_i$ and $z_j$ closer while pushing $z_i$ away from all other $z_k$. This discrimination task forces the network to learn meaningful representations — it can’t collapse, because different images need to be separated.
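To make that concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss for a batch of paired views. It is a simplified variant in which each anchor's negatives are only the other images' cross-view embeddings; SimCLR's exact objective also uses same-view negatives, so treat this as illustrative rather than a faithful reimplementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Simplified InfoNCE: z_a[i] and z_b[i] are embeddings of two views of image i."""
    z_a = F.normalize(z_a, dim=1)            # cosine similarity via dot products
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature     # (N, N) pairwise similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Diagonal entries are the positive pairs; every off-diagonal entry
    # serves as a negative for its row's anchor.
    return F.cross_entropy(logits, targets)
```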

But this comes with practical baggage:

- You need a large pool of negatives for the discrimination task to be hard, which in practice means very large batches (SimCLR) or a memory bank / momentum queue (MoCo).
- The learned representation is sensitive to the augmentation recipe, because weak augmentations leave easy shortcuts for telling positives and negatives apart.

Why Collapsing is the Core Challenge

If you remove negative pairs entirely and just train the network to predict one view’s representation from another view’s representation, you hit a fundamental problem: representation collapse.

The network finds the trivially perfect solution:

\[f_\theta(x) = \mathbf{c} \quad \forall x\]

where $\mathbf{c}$ is some constant vector. If every image maps to the same vector, then prediction error between any two views is zero. Loss is minimized. But the representation is completely useless — it carries zero information.
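A three-line sanity check makes the failure mode obvious: if the encoder ignores its input, the view-prediction loss is already zero.

```python
import torch

embed_dim = 128
c = torch.randn(embed_dim)      # the constant vector a collapsed encoder emits
z_view1, z_view2 = c, c         # every augmented view of every image maps to c
print(((z_view1 - z_view2) ** 2).sum().item())  # 0.0 -- loss minimized, nothing learned
```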

A straightforward fix is to use a fixed, randomly initialized network as the prediction target. This prevents collapse because the target outputs are frozen and vary across images. The online network is forced to learn to mimic those varying outputs.

This works for preventing collapse — but a random network is a terrible teacher. It has no semantic understanding of images.

Here’s the interesting empirical finding from the paper, though: even predicting a fixed random network gives you 18.8% top-1 accuracy on ImageNet linear evaluation, while the random network itself scores only 1.4%. The act of prediction, even against noise, forces the online network to build some structure in its embedding space.

This is the core insight that motivates BYOL: if predicting a fixed random target already gives such a jump, what if we made the target progressively better?

BYOL’s Solution

BYOL uses two networks:

- An online network, parameterized by $\theta$, consisting of an encoder $f_\theta$, a projector $g_\theta$, and a predictor $q_\theta$.
- A target network, parameterized by $\xi$, consisting of an encoder $f_\xi$ and a projector $g_\xi$ with the same architecture as their online counterparts (but no predictor).

The key mechanism is the exponential moving average (EMA) update for the target network:

\[\xi \leftarrow \tau \xi + (1 - \tau)\theta\]

where $\tau \in [0, 1]$ is a decay rate (typically starting at 0.996 and increasing to 1 over training). The target network is never directly optimized — it’s a slowly lagging, smoothed version of the online network.
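In code, the update is just a parameter-wise interpolation. A minimal PyTorch sketch, assuming `online` and `target` are modules with identical architectures:

```python
import torch

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    """xi <- tau * xi + (1 - tau) * theta, applied parameter by parameter."""
    for theta, xi in zip(online.parameters(), target.parameters()):
        xi.mul_(tau).add_(theta, alpha=1.0 - tau)
    # In practice you would also copy or average non-parameter state
    # such as BatchNorm running statistics.
```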

Forward Pass

Given an image $x$, two augmented views are produced: $v = t(x)$ and $v' = t'(x)$ using augmentation distributions $\mathcal{T}$ and $\mathcal{T}'$.

Online network processes $v$:

\[y_\theta = f_\theta(v), \quad z_\theta = g_\theta(y_\theta), \quad \hat{q}_\theta = q_\theta(z_\theta)\]

Target network processes $v’$:

\[y'_\xi = f_\xi(v'), \quad z'_\xi = g_\xi(y'_\xi)\]

The online network’s prediction $q_\theta(z_\theta)$ is trained to match the target network’s projection $z'_\xi$.
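A sketch of this asymmetric forward pass, with `f`, `g`, and `q` as hypothetical attribute names for the encoder, projector, and predictor:

```python
import torch

def byol_forward(online, target, v, v_prime):
    """Online branch predicts the target branch's projection of the other view."""
    z = online.g(online.f(v))       # online projection z_theta
    p = online.q(z)                 # online prediction q_theta(z_theta)

    with torch.no_grad():           # stop-gradient: no backprop through the target
        z_prime = target.g(target.f(v_prime))  # target projection z'_xi

    return p, z_prime.detach()
```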

Loss Function

Both outputs are $\ell_2$-normalized, and the loss is the mean squared error between them:

\[\mathcal{L}_{\theta, \xi} = \left\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \right\|_2^2 = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta),\ z'_\xi \rangle}{\|q_\theta(z_\theta)\|_2 \cdot \|z'_\xi\|_2}\]

where $\bar{q}$ and $\bar{z}'$ denote $\ell_2$-normalized versions.

The loss is symmetrized by also feeding $v'$ to the online network and $v$ to the target network, giving $\tilde{\mathcal{L}}_{\theta,\xi}$. The final loss is:

\[\mathcal{L}^{BYOL}_{\theta,\xi} = \mathcal{L}_{\theta,\xi} + \tilde{\mathcal{L}}_{\theta,\xi}\]

Crucially, gradients flow only through the online network — the target network is updated via EMA, not backprop (stop-gradient on $z'_\xi$).
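Putting the normalized regression loss and the symmetrization together, reusing the `byol_forward` sketch from above:

```python
import torch.nn.functional as F

def regression_loss(p, z_prime):
    """Per-pair loss: 2 - 2 * cosine similarity of l2-normalized vectors."""
    p = F.normalize(p, dim=-1)
    z_prime = F.normalize(z_prime, dim=-1)
    return (2.0 - 2.0 * (p * z_prime).sum(dim=-1)).mean()

def byol_loss(online, target, v, v_prime):
    # Symmetrize: each view passes through both the online and the target branch.
    p1, t1 = byol_forward(online, target, v, v_prime)
    p2, t2 = byol_forward(online, target, v_prime, v)
    return regression_loss(p1, t1) + regression_loss(p2, t2)
```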

Training Dynamics

\[\theta \leftarrow \text{optimizer}(\theta,\ \nabla_\theta \mathcal{L}^{BYOL}_{\theta,\xi},\ \eta)\]

\[\xi \leftarrow \tau\xi + (1-\tau)\theta\]

At the end of training, the target network, the projector, and the predictor are all discarded. Only the encoder $f_\theta$ is kept for downstream tasks.
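A single training step then interleaves the gradient update with the EMA update; a sketch built from the hypothetical helpers above (after training, only `online.f` would be kept):

```python
def train_step(online, target, optimizer, v, v_prime, tau):
    loss = byol_loss(online, target, v, v_prime)
    optimizer.zero_grad()
    loss.backward()                  # gradients reach only the online network
    optimizer.step()                 # theta <- optimizer(theta, grad, eta)
    ema_update(online, target, tau)  # xi <- tau * xi + (1 - tau) * theta
    return loss.item()
```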

Why Doesn’t BYOL Collapse?

This is the most interesting theoretical question in the paper — and the authors give an elegant intuition.

Under the assumption that the predictor $q_\theta$ is near-optimal (i.e., it minimizes prediction error for the current $\theta$ and $\xi$), the update on $\theta$ follows the gradient of the expected conditional variance:

\[\nabla_\theta \mathcal{L} = \nabla_\theta \mathbb{E}\left[\sum_i \text{Var}(z'_{\xi,i} \mid z_\theta)\right]\]

Minimizing this variance means the online projection $z_\theta$ should be as informative as possible about the target projection $z'_\xi$. A constant feature in $z_\theta$ can only increase this conditional variance, so constant (collapsed) representations are unstable equilibria under this objective.

The EMA target plays a dual role here: it keeps the target stable enough that the predictor stays near-optimal throughout training, and it slowly propagates the variability captured by the online network into the target network. If you set $\tau = 0$ (copy online weights directly into target each step), training destabilizes and collapses immediately (0.3% top-1 accuracy in ablations). If you set $\tau = 1$ (never update target), you’re back to the fixed random network case (18.8%).

The sweet spot is $\tau \in [0.9, 0.999]$, which gives the best results.
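In practice the paper does not hold the decay rate fixed but anneals it from $\tau_{\text{base}} = 0.996$ toward 1 with a cosine schedule over training; a small sketch of that schedule:

```python
import math

def target_decay_rate(step, total_steps, tau_base=0.996):
    """tau = 1 - (1 - tau_base) * (cos(pi * k / K) + 1) / 2, rising from tau_base to 1."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```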

Architecture Details

The encoder uses a standard ResNet-50 backbone. On top of it:

- a projection MLP $g$: a linear layer to 4096 dimensions with batch normalization and ReLU, followed by a linear output layer to 256 dimensions;
- a prediction MLP $q$ with the same architecture, used only in the online network.

One important detail: unlike in SimCLR, the output of the projection MLP is not batch normalized.
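A minimal PyTorch sketch of that head, assuming the 4096-unit hidden layer and 256-d output described above (module names are illustrative):

```python
import torch.nn as nn

def byol_mlp(in_dim, hidden_dim=4096, out_dim=256):
    """Projection / prediction head: linear -> BN -> ReLU -> linear (no BN on the output)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

projector = byol_mlp(in_dim=2048)  # on top of ResNet-50's 2048-d features
predictor = byol_mlp(in_dim=256)   # online network only, maps projections to predictions
```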

Optimization uses LARS with cosine decay over 1000 epochs, batch size 4096, across 512 TPU v3 cores.

Robustness Advantages

Two ablations from the paper stand out.

Batch size sensitivity:

| Batch size | BYOL drop | SimCLR drop |
|---|---|---|
| 4096 → 256 | -0.7 pts | -3.6 pts |
| 4096 → 128 | -2.9 pts | -4.3 pts |

BYOL is remarkably stable across batch sizes because it doesn’t depend on in-batch negatives. SimCLR degrades sharply as batch size reduces — fewer negatives means easier discrimination, which means weaker signal.

Augmentation sensitivity:

When color distortion is removed entirely, SimCLR drops 22.2 accuracy points. BYOL drops only 9.1 points. When reduced to random crops only, SimCLR loses more than a third of its performance. BYOL holds at 59.4%.

The reason: contrastive methods can exploit color histogram shortcuts when augmentations are weak (crops of the same image share histograms, so color alone distinguishes positives from negatives). BYOL doesn’t have this problem — it’s incentivized to retain all information captured by the target, not just the discriminative shortcut.

Results

Under linear evaluation on ImageNet with ResNet-50:

| Method | Top-1 | Top-5 |
|---|---|---|
| SimCLR | 69.3 | 89.0 |
| MoCo v2 | 71.1 | - |
| InfoMin Aug. | 73.0 | 91.1 |
| BYOL | 74.3 | 91.6 |

With a larger ResNet-200 (2×), BYOL reaches 79.6% top-1, which was at that time competitive with strong supervised baselines.


Conclusion

BYOL is a conceptually elegant result — the claim that you can learn strong representations without ever comparing two different items feels almost paradoxical. But the mechanism is coherent: the asymmetry between online and target networks, combined with EMA updates and the predictor head, creates a self-improvement loop that avoids collapse and progressively refines representations.

For product retrieval systems, BYOL opens up a practical path to:

- pre-training product image encoders on large unlabeled catalogs, with no need to mine or sample negative pairs;
- doing so with modest batch sizes and a simpler augmentation recipe, given the robustness shown in the ablations above.

The representation quality gains from BYOL pre-training compound when combined with supervised fine-tuning — you get a product encoder that understands both feature-level semantics (from BYOL) and behavioral relevance (from interaction-supervised training).

The negative pairs, it turns out, were never strictly necessary.


References

  1. Grill, J.B., Strub, F., Altché, F., et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. arXiv:2006.07733

  2. Chen, T., Kornblith, S., Norouzi, M., Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020.

  3. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020.

  4. van den Oord, A., Li, Y., Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748, 2018.