AlphaFold3 Explained

Introduction

If you need background information about protein folding, refer to Introduction to Protein Folding.

Timeline of AlphaFold

- 2016: AlphaGo beat Lee Sedol; DeepMind hired a handful of biologists
- 2018: CASP13, AlphaFold1
- 2020: CASP14, AlphaFold2
- 2021: AlphaFold-Multimer (AlphaFold-Multimer Explained)
- 2023: AlphaFold latest (blog post)

AlphaFold3

An almighty AF model for nearly all biomolecular types

Major differences from AF2

Data type & processing

|  | AF2 | AF3 |
| --- | --- | --- |
| Data type | Protein (polypeptide) only | General molecules (protein, nucleic acids, small molecules, ions, …) |
| Tokenization | Amino acid residue frames & $\chi$ angles (for side-chain atoms) | Atoms & tokens |

In AF3, atoms are grouped into tokens.
- Standard amino acids & nucleic acids: token = entire residue or nucleotide
- Others: token = a single heavy atom
Architecture

Evoformer → MSAModule & Pairformer

|  | AF2 Evoformer | AF3 MSAModule |
| --- | --- | --- |
| # blocks | 48 | 4 |
| Sequence-level feature | MSA rep (row-wise & column-wise attention) | MSA rep (simple pair-weighted average) |
| Structure-level feature | Pair rep (triangular multiplicative & self-attention updates) | Pair rep (triangular multiplicative & self-attention updates) |
| Communication | Outer product mean, pair bias | Outer product mean, pair bias |

|  | AF2 Evoformer | AF3 Pairformer |
| --- | --- | --- |
| # blocks | 48 | 48 |
| Sequence-level feature | MSA rep (row-wise & column-wise attention) | Single rep (row-wise attention only) |
| Structure-level feature | Pair rep (triangular multiplicative & self-attention updates) | Pair rep (triangular multiplicative & self-attention updates) |
| Communication | Outer product mean, pair bias | Pair bias only |
In AF3, the MSA representation is de-emphasized: it is first processed by the smaller and simpler MSAModule, and then only its first row (the single representation) is passed on to the core Pairformer.
StructureModule → DiffusionModule

|  | AF2 StructureModule | AF3 DiffusionModule |
| --- | --- | --- |
| Model type | Predictive | Generative |
| Resolution | Residue level | All-atom level |
| Operates on | Residue frames and $\chi$ angles | Raw atom coordinates |
| Equivariance | Forced to satisfy | Not required (!) |
| Attention type | IPA (invariant point attention) | 1. Sequence-local attention (atom level) → 2. Global attention (token level) → 3. Sequence-local attention (atom level) |
| Major loss | FAPE loss | MSE loss (!) |
| Initial coordinates | Black-hole initialization | Generated conformer |
Confidence metrics

|  | AF2 confidence module | AF3 confidence module |
| --- | --- | --- |
| Heads | pAE head, pLDDT head, experimentally-resolved head | pAE head, pLDDT head, experimentally-resolved head, pDE head |
| pAE head input | Pair rep (Evoformer output) | Pair rep (mini-Pairformer output, fed with input features, single features, and distances from the rolled-out coordinates) |
| pLDDT head input | Single rep (StructureModule output) | Single rep (mini-Pairformer output, fed with input features, pair features, and distances from the rolled-out coordinates) |
| Input gradient | Not detached | Detached |

Activation function

ReLU (AF2) → SwiGLU (AF3)
Recycling scheme

|  | AF2 | AF3 |
| --- | --- | --- |
| $N_{\text{cycle}}$ | 4 | 4 |
| Recycled representations | Single rep (Evoformer output), Pair rep (Evoformer output), coordinates (StructureModule output) | Single rep (Pairformer output), Pair rep (Pairformer output); coordinates are not recycled, but iteratively denoised in the DiffusionModule |

Architecture overview

cf. AF2 architecture
Fig. 1d AF3 architecture for inference. Yellow: input data; blue: abstract network activations; green: output data.
(The training scheme differs slightly from the inference scheme; see details in Training setup.)
The overall structure of AF3 resembles that of AF2:
- A large network trunk (Evoformer in AF2, Pairformer in AF3) evolves a pairwise representation.
- The StructureModule (AF2) or DiffusionModule (AF3) uses the pairwise representation to generate explicit atomic positions.
Major modifications were made to:
1. Accommodate a wide range of chemical entities without excessive special-casing
2. Increase the accuracy of protein structure prediction
AF3 is a conditional diffusion model, where most of the computation happens in the conditioning!
In AF3, coordinates are not recycled along with the Single and Pair reps; instead, they are iteratively denoised in the DiffusionModule, as sketched below.
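To make the recycling-plus-diffusion flow concrete, here is a minimal sketch of AF3-style inference. All module and method names (`trunk`, `init_reps`, `noise_schedule`, `denoise_step`) are hypothetical placeholders, not the real API; the point is that recycling iterates only the representations, while coordinates come from iterative denoising.

```python
# Minimal sketch of AF3-style inference (hypothetical module/method names).
import torch

def inference(features, trunk, diffusion_module, n_cycle=4, n_steps=200):
    single, pair = trunk.init_reps(features)        # initial single/pair reps
    for _ in range(n_cycle):                        # recycling: reps only, no coordinates
        single, pair = trunk(features, single, pair)
    coords = torch.randn(features["num_atoms"], 3)  # start from pure noise
    for t in diffusion_module.noise_schedule(n_steps):
        # each step denoises the coordinates, conditioned on the trunk outputs
        coords = diffusion_module.denoise_step(coords, t, features, single, pair)
    return coords
```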

Input processing

Tokenization

Figure: amino acid & nucleotide structures

| Type | Tokenization | Token center atom |
| --- | --- | --- |
| Standard amino acid residue | Single token | CA |
| Standard nucleotide residue | Single token | C1′ |
| Modified amino acid / nucleotide | Tokenized per atom | First (and only) atom |
| All ligands | Tokenized per atom | First (and only) atom |

Each token is assigned a token center atom, as sketched below.
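A minimal sketch of this tokenization rule; the `residue` dict layout and helper names are hypothetical illustrations, not AF3's actual featurization code:

```python
# Minimal sketch of the tokenization rule above (hypothetical data layout).
# Standard residues -> one token centered on CA (protein) or C1' (nucleic acid);
# modified residues and ligands -> one token per heavy atom.
STANDARD_CENTER = {"protein": "CA", "nucleic_acid": "C1'"}

def tokenize(residue):
    # residue: {"kind": "protein" | "nucleic_acid" | "ligand",
    #           "is_standard": bool, "atoms": [atom names]}
    if residue["is_standard"] and residue["kind"] in STANDARD_CENTER:
        return [{"atoms": residue["atoms"],
                 "center_atom": STANDARD_CENTER[residue["kind"]]}]
    return [{"atoms": [atom], "center_atom": atom}   # one token per heavy atom
            for atom in residue["atoms"] if not atom.startswith("H")]

print(tokenize({"kind": "protein", "is_standard": True,
                "atoms": ["N", "CA", "C", "O", "CB"]}))
```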
Features

| Type | AF2 | AF3 | Description |
| --- | --- | --- | --- |
| MSA features | O | O | Features from genetic search (e.g. the MSA itself, deletion matrix, profile) |
| Template features | O | O | Features from template search (e.g. template atom positions) |
| Token features | X | O | e.g. position indices (token_index), chain identifiers (asym_id), masks (is_protein) |
| Reference features | X | O | Features from a reference conformer (generated with RDKit ETKDGv3) |
| Bond features | X | O | Features encoding bond information |

TemplateModule (TemplateEmbedder)

Baby Pair representation builder
Input: Template search info & (previous) Pair representation
Output: (updated) Pair representation
Goal: Build and update pair representation with the template information from the database
The template stack is similar to AF2's.

MSAModule

Mini-Evoformer  
Input: (previous) MSA representation & (previous) Pair representation
Output: (updated) MSA representation & (updated) Pair representation
Goal: Update MSA and pair representation while exchanging information
(Suppl. Fig. 2)
The MSAModule is practically a mini-Evoformer!
The authors split AF2's Evoformer into the MSAModule and the Pairformer.
|  | AF2 Evoformer | AF3 MSAModule |
| --- | --- | --- |
| # blocks | 48 | 4 |
| Sequence-level feature | MSA rep (row-wise & column-wise attention) | MSA rep (simple pair-weighted average) |
| Structure-level feature | Pair rep (triangular multiplicative & self-attention updates) | Pair rep (triangular multiplicative & self-attention updates) |
| Communication | Outer product mean, pair bias | Outer product mean, pair bias |
Motivation for this change:
1. Enrich the Pair rep with more information, as it forms the backbone for the rest of the network!
2. De-emphasize the MSA rep, as there were discussions that AF2 relies heavily on the MSA.
In AF3, there is no query-key-based attention in the MSA stack, and the architecture is similar to the ExtraMSAStack in AF2
→ reduced computation & memory
The MSA is used for protein and RNA sequences. The pair-weighted average is sketched below.
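A minimal single-head sketch of the pair-weighted average, omitting the multi-head split, gating, and normalization of the paper's MSAPairWeightedAveraging; channel sizes are illustrative. The key point: the attention weights come from the pair rep alone (no queries/keys), so every MSA row shares the same weights.

```python
# Minimal single-head sketch of pair-weighted averaging over the MSA rep
# (no query-key attention; gating/normalization omitted for brevity).
import torch
import torch.nn as nn

class PairWeightedAveraging(nn.Module):
    def __init__(self, c_msa=64, c_pair=128):
        super().__init__()
        self.to_v = nn.Linear(c_msa, c_msa, bias=False)   # values from the MSA rep
        self.to_w = nn.Linear(c_pair, 1, bias=False)      # weight logits from the pair rep

    def forward(self, msa, pair):
        # msa: [n_seq, n_token, c_msa]; pair: [n_token, n_token, c_pair]
        v = self.to_v(msa)
        w = torch.softmax(self.to_w(pair).squeeze(-1), dim=-1)  # [i, j], normalized over j
        # every MSA row s is averaged with the same pair-derived weights
        return torch.einsum("ij,sjc->sic", w, v)
```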

Pairformer

The legitimate successor to the Evoformer
Input: (previous) Single representation & (previous) Pair representation
Output: (updated) Single representation & (updated) Pair representation
Goal: Update single & pair representation while exchanging information
Fig. 2a Pairformer module. n: number of tokens, c: number of channels. The 48 blocks do not share weights.
The Pairformer is a lightweight Evoformer that focuses more on the Pair rep.
|  | AF2 Evoformer | AF3 Pairformer |
| --- | --- | --- |
| # blocks | 48 | 48 |
| Sequence-level feature | MSA rep (row-wise & column-wise attention) | Single rep (row-wise attention only) |
| Structure-level feature | Pair rep (triangular multiplicative & self-attention updates) | Pair rep (triangular multiplicative & self-attention updates) |
| Communication | Outer product mean, pair bias | Pair bias only |
The Single rep is the privileged first row of the MSA rep.
Only row-wise attention for the Single rep!
In the Pairformer, only row-wise attention is performed on the Single rep, since there is only one row (instead of the $N_{\text{seq}}$ rows of the MSA rep). A sketch of this pair-biased attention follows below.
No outer product mean in the Pairformer!
Unlike in AF2, the Single rep does not (directly) influence the Pair rep within a Pairformer block via the outer product mean
(the outer product mean was already used in the MSAModule).
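A minimal single-head sketch of row-wise attention with pair bias, assuming illustrative channel sizes and omitting the multi-head split and gating used in the paper:

```python
# Minimal single-head sketch of attention with pair bias: the pair rep adds
# a learned bias to every i->j attention logit of the single rep.
import torch
import torch.nn as nn

class AttentionPairBias(nn.Module):
    def __init__(self, c_single=384, c_pair=128):
        super().__init__()
        self.q = nn.Linear(c_single, c_single, bias=False)
        self.k = nn.Linear(c_single, c_single, bias=False)
        self.v = nn.Linear(c_single, c_single, bias=False)
        self.bias = nn.Linear(c_pair, 1, bias=False)  # pair rep -> scalar bias per (i, j)

    def forward(self, single, pair):
        # single: [n_token, c_single]; pair: [n_token, n_token, c_pair]
        q, k, v = self.q(single), self.k(single), self.v(single)
        logits = q @ k.T / q.shape[-1] ** 0.5
        logits = logits + self.bias(pair).squeeze(-1)  # the pair rep's only influence here
        return torch.softmax(logits, dim=-1) @ v
```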

DiffusionModule

The rightful heir of the StructureModule, off the leash and armed with modern generative technology
Input: Input feature, Single representation, Pair representation & (previous) Physical atom coordinates
Output: (updated) Physical atom coordinates
Goal: Update physical atom coordinates from original input feature and refined single & pair representation
Fig. 2b Diffusion module. Coarse: token level; fine: atom level. Green: input; blue: pair rep; red: single rep.
DiffusionModule is an all-atom, generative, non-equivariant version of the StructureModule!
AF2 StructureModule
AF3 DiffusionModule
Model type
Predictive
Generative
Resolution
Residue level
All-atom level
Operates on
Residue frames and χ\chi angles
Raw atom coordinates
Equivariance
Forced to satisfy
Not required (!)
Attention type
IPA (invariant point attention)
1. Sequence local attention (atom-level) 2. Global attention (token-level) 3. Sequence local attention (atom-level)
Major loss
FAPE loss
MSE loss (!)
Initial coordinate
Black hole initialization
Generated conformer
The DiffusionModule of AF3 is a conditional, non-equivariant generative model that works on the point cloud of atoms.
Condition: input features, Single rep, Pair rep
Non-equivariance: no geometric inductive bias involved
In AF2, the StructureModule had an invariant operation called IPA to fulfill equivariance constraints.
In AF3, the authors found that no invariance or equivariance constraints are required after all, and omitted them to simplify the architecture.
Two-level architecture:
It seems the authors wanted all atoms to attend globally to each other, but because of the memory and compute cost, they came up with a two-level architecture.
It works as follows: 1. atom level → 2. token level → 3. atom level
1. Sequence-local attention at the atom level
Allow nearby atoms to talk directly to each other!
Here, 'nearby' atoms are defined as sequence-level neighbors (but why not structure-level neighbors?).
Each subset of 32 atoms attends to a subset of 128 nearby atoms in sequence space (see the mask sketch below).
This restriction is sub-optimal, but was necessary to keep the memory and compute costs within reasonable bounds.
2. Global attention at the token level
It's time for the representatives (tokens) to gather the local information and shake hands with the other global representatives!
1. Aggregate fine-grained atom-level features into coarse-grained token-level features.
2. Full self-attention at the token level.
3. Broadcast token activations back to atom features.
3. Sequence-local attention at the atom level, AGAIN!
With that global information update, focus on local atom neighbors again.
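A minimal sketch of the sequence-local attention pattern: the block and window sizes (32 queries, 128 keys) follow the text above, but the exact window placement here is illustrative, not the paper's algorithm.

```python
# Minimal sketch of a sequence-local attention mask: each 32-atom query block
# attends to a ~128-atom window centered on the block, in sequence order.
import torch

def sequence_local_mask(n_atoms, q_block=32, k_window=128):
    idx = torch.arange(n_atoms)
    # center of the 32-atom query block that each atom belongs to
    center = (idx // q_block) * q_block + q_block // 2
    dist = (idx[None, :] - center[:, None]).abs()  # [query, key] sequence distance
    return dist < k_window // 2                    # True where attention is allowed

mask = sequence_local_mask(256)
print(mask.shape, mask[0].sum())  # each query row sees ~128 keys (fewer at the edges)
```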
DiffusionModule algorithm

Training setup

Fig. 2c Training setup (distogram head omitted for simplicity), starting from the end of the network trunk. Blue line: abstract activations; yellow line: ground truth; green line: predicted data.
DiffusionModule training
Efficient training of the DiffusionModule (a.k.a. the MULTIVERSE of the DiffusionModule!)
The DiffusionModule is much cheaper to run than the network trunk (the preceding modules).
→ To improve efficiency:
1. The network trunk is run only once.
2. 48 versions of the input structure are created (by random rotation, translation, and independent noise addition), as sketched below.
3. The DiffusionModule is trained on all versions in parallel.
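A minimal sketch of step 2, assuming a fixed noise scale for readability (in the paper, each copy gets its own noise level sampled from the diffusion schedule):

```python
# Minimal sketch: make 48 copies of one structure, each with a random rotation,
# a random translation, and its own Gaussian noise (fixed sigma for simplicity).
import torch

def make_versions(coords, n_versions=48, sigma=1.0):
    # coords: [n_atoms, 3] ground-truth atom positions
    versions = []
    for _ in range(n_versions):
        q = torch.randn(4)
        w, x, y, z = (q / q.norm()).tolist()      # random unit quaternion
        rot = torch.tensor([                      # quaternion -> rotation matrix
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        shift = torch.randn(3)
        versions.append(coords @ rot.T + shift + sigma * torch.randn_like(coords))
    return torch.stack(versions)                  # [n_versions, n_atoms, 3]
```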
Loss

$$\mathcal{L}_{\text{diffusion}} = \frac{\hat{t}^{2} + \sigma_{\text{data}}^{2}}{(\hat{t} + \sigma_{\text{data}})^{2}} \cdot \left(\mathcal{L}_{\text{MSE}} + \alpha_{\text{bond}} \cdot \mathcal{L}_{\text{bond}}\right) + \mathcal{L}_{\text{smooth\_lddt}}$$

- Weighted aligned MSE loss (with upweighting of nucleotide and ligand atoms)
  - The famous FAPE loss (from AF2) has disappeared; the authors use a relatively simple MSE loss instead.
- Smooth LDDT loss (to emphasize local geometry; sketched below)
- Bond distance loss (during fine-tuning only)
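A minimal sketch of a smooth LDDT-style loss, assuming the standard LDDT thresholds (0.5/1/2/4 Å) and a single 15 Å inclusion radius; the paper's version additionally uses a larger radius for nucleotide pairs:

```python
# Minimal sketch of a smooth LDDT-style loss: the hard LDDT threshold tests
# are replaced by sigmoids so the score becomes differentiable.
import torch

def smooth_lddt_loss(pred, gt, radius=15.0):
    # pred, gt: [n_atoms, 3]
    d_pred = torch.cdist(pred, pred)
    d_gt = torch.cdist(gt, gt)
    delta = (d_pred - d_gt).abs()
    # soft version of the four LDDT distance-difference tests
    score = sum(torch.sigmoid(t - delta) for t in (0.5, 1.0, 2.0, 4.0)) / 4
    local = (d_gt < radius) & ~torch.eye(len(gt), dtype=torch.bool)  # local pairs only
    return 1.0 - score[local].mean()
```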
Confidence module training with diffusion "roll-out"
We need to roll out the DiffusionModule to train the AF3 confidence module. Let's see why.
|  | AF2 confidence module | AF3 confidence module |
| --- | --- | --- |
| Heads | pAE head, pLDDT head, experimentally-resolved head | pAE head, pLDDT head, experimentally-resolved head, pDE head |
| pAE head input | Pair rep (Evoformer output) | Pair rep (mini-Pairformer output, fed with input features, single features, and distances from the rolled-out coordinates) |
| pLDDT head input | Single rep (StructureModule output) | Single rep (mini-Pairformer output, fed with input features, pair features, and distances from the rolled-out coordinates) |
| Input gradient | Not detached | Detached |
In AF2, the confidence module was trained on the relatively simple Pair rep and Single rep.
In AF3, the confidence module evolved to process much more information!
pDE head (new in AF3): predicts the error in absolute distances between atoms
(this head is also trained as a binned classification task)
Other details (such as converting between regression and binned classification, pTM & ipTM calculation, …) mostly follow AF2.
Distogram prediction
The distogram prediction head is identical to AF2's.

Loss

$\alpha_{\text{confidence}} = 10^{-4}$, $\alpha_{\text{diffusion}} = 4$, $\alpha_{\text{distogram}} = 3 \cdot 10^{-2}$, and $\alpha_{\text{pae}}$ is 1 for the final fine-tuning stage and 0 otherwise.
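For context, these weights combine the per-head losses into the overall training objective (reconstructed here from the weights above; $\mathcal{L}_{\text{confidence}}$, $\mathcal{L}_{\text{diffusion}}$, and $\mathcal{L}_{\text{distogram}}$ denote the losses of the corresponding heads):

$$\mathcal{L} = \alpha_{\text{confidence}} \cdot \mathcal{L}_{\text{confidence}} + \alpha_{\text{diffusion}} \cdot \mathcal{L}_{\text{diffusion}} + \alpha_{\text{distogram}} \cdot \mathcal{L}_{\text{distogram}}$$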

Results

Example structures predicted using AF3
Performance
- Accuracy ~ confidence
- Accuracy ~ MSA size
- Accuracy ~ number of seeds

Limitations

- Hallucination
- Limited conformation coverage
- Antibody-containing samples require many seeds to obtain high accuracy
- Model outputs do not always respect chirality
- Overlapping atoms

Engineering

Dataset composition (TBU)
Training protocols

Model selection

TBU

Reference