AlphaFold3 Explained

Introduction

If you need background information about protein folding, refer to Introduction to Protein Folding.

Timeline of AlphaFold

- 2016: AlphaGo beat Lee Sedol; DeepMind hired a handful of biologists
- 2018: CASP13, AlphaFold1
- 2020: CASP14, AlphaFold2
- 2021: AlphaFold-Multimer (AlphaFold-Multimer Explained)
- 2023: AlphaFold latest (blog post)

AlphaFold3

An almighty AF model for nearly all biomolecular types

Major differences from AF2

Data type & processing

|  | AF2 | AF3 |
| --- | --- | --- |
| Data type | Protein (polypeptide) only | General molecules (protein, nucleic acids, small molecules, ions, …) |
| Tokenization | Amino acid residue frames & $\chi$ angles (for side-chain atoms) | Atoms & tokens |

In AF3, atoms are grouped into tokens.
- Standard amino acids & nucleic acids: token = entire residue or nucleotide
- Others: token = a single heavy atom
Architecture

Evoformer → MSAModule & Pairformer

|  | AF2 Evoformer | AF3 MSAModule |
| --- | --- | --- |
| # blocks | 48 | 4 |
| Sequence-level feature | MSA rep (row-wise & column-wise attention) | MSA rep (simple pair-weighted average) |
| Structure-level feature | Pair rep (triangular multiplicative & self-attention updates) | Pair rep (triangular multiplicative & self-attention updates) |
| Communication | Outer product mean, pair bias | Outer product mean, pair bias |

|  | AF2 Evoformer | AF3 Pairformer |
| --- | --- | --- |
| # blocks | 48 | 48 |
| Sequence-level feature | MSA rep (row-wise & column-wise attention) | Single rep (row-wise attention only) |
| Structure-level feature | Pair rep (triangular multiplicative & self-attention updates) | Pair rep (triangular multiplicative & self-attention updates) |
| Communication | Outer product mean, pair bias | Pair bias only |
In AF3, the MSA representation is de-emphasized: it is first processed by the smaller and simpler MSAModule, and then only its first row (the single representation) is passed on to the core Pairformer.
StructureModule → DiffusionModule

|  | AF2 StructureModule | AF3 DiffusionModule |
| --- | --- | --- |
| Model type | Predictive | Generative |
| Resolution | Residue level | All-atom level |
| Operates on | Residue frames and $\chi$ angles | Raw atom coordinates |
| Equivariance | Forced to satisfy | Not required (!) |
| Attention type | IPA (invariant point attention) | 1. Sequence-local attention (atom level) → 2. Global attention (token level) → 3. Sequence-local attention (atom level) |
| Major loss | FAPE loss | MSE loss (!) |
| Initial coordinates | Black-hole initialization | Generated conformer |
Confidence metrics

|  | AF2 confidence module | AF3 confidence module |
| --- | --- | --- |
| Heads | pAE head, pLDDT head, experimentally-resolved head | pAE head, pLDDT head, experimentally-resolved head, pDE head |
| pAE head input | Pair rep (Evoformer output) | Pair rep (mini-Pairformer output, fed with input features, single features, and distances from the rolled-out coordinates) |
| pLDDT head input | Single rep (StructureModule output) | Single rep (mini-Pairformer output, fed with input features, pair features, and distances from the rolled-out coordinates) |
| Input gradient | Not detached | Detached |

Activation function

ReLU (AF2) → SwiGLU (AF3)
Recycling scheme

|  | AF2 | AF3 |
| --- | --- | --- |
| $N_{\text{cycle}}$ | 4 | 4 |
| Recycled representations | Single rep (Evoformer output), Pair rep (Evoformer output), coordinates (StructureModule output) | Single rep (Pairformer output), Pair rep (Pairformer output); coordinates are not recycled, but iteratively denoised in the DiffusionModule |

Architecture overview

cf. AF2 architecture
Fig. 1d AF3 architecture for inference. Yellow: input data; blue: abstract network activations; green: output data.
(The training scheme differs slightly from the inference scheme; see details in Training setup.)
The overall structure of AF3 resembles that of AF2:
- A large network trunk (Evoformer in AF2, Pairformer in AF3) evolves a pairwise representation.
- The StructureModule (AF2) or DiffusionModule (AF3) uses the pairwise representation to generate explicit atomic positions.
Major modifications were made to:
1. Accommodate a wide range of chemical entities without excessive special-casing
2. Increase the accuracy of protein structure prediction
AF3 is a conditional diffusion model, where most of the computation happens in the conditioning!
In AF3, coordinates are not recycled along with the Single and Pair reps; instead, they are iteratively denoised in the DiffusionModule, as sketched below.
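To make the recycling-plus-diffusion flow concrete, here is a minimal sketch of AF3-style inference. All module and method names (`trunk`, `init_reps`, `noise_schedule`, `denoise_step`) are hypothetical placeholders, not the real API; the point is that recycling iterates only the representations, while coordinates come from iterative denoising.

```python
# Minimal sketch of AF3-style inference (hypothetical module/method names).
import torch

def inference(features, trunk, diffusion_module, n_cycle=4, n_steps=200):
    single, pair = trunk.init_reps(features)        # initial single/pair reps
    for _ in range(n_cycle):                        # recycling: reps only, no coordinates
        single, pair = trunk(features, single, pair)
    coords = torch.randn(features["num_atoms"], 3)  # start from pure noise
    for t in diffusion_module.noise_schedule(n_steps):
        # each step denoises the coordinates, conditioned on the trunk outputs
        coords = diffusion_module.denoise_step(coords, t, features, single, pair)
    return coords
```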

Input processing

Tokenization

Figure: amino acid & nucleotide structures

| Type | Tokenization | Token center atom |
| --- | --- | --- |
| Standard amino acid residue | Single token | CA |
| Standard nucleotide residue | Single token | C1′ |
| Modified amino acid / nucleotide | Tokenized per atom | First (and only) atom |
| All ligands | Tokenized per atom | First (and only) atom |

Each token is assigned a token center atom, as sketched below.
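A minimal sketch of this tokenization rule; the `residue` dict layout and helper names are hypothetical illustrations, not AF3's actual featurization code:

```python
# Minimal sketch of the tokenization rule above (hypothetical data layout).
# Standard residues -> one token centered on CA (protein) or C1' (nucleic acid);
# modified residues and ligands -> one token per heavy atom.
STANDARD_CENTER = {"protein": "CA", "nucleic_acid": "C1'"}

def tokenize(residue):
    # residue: {"kind": "protein" | "nucleic_acid" | "ligand",
    #           "is_standard": bool, "atoms": [atom names]}
    if residue["is_standard"] and residue["kind"] in STANDARD_CENTER:
        return [{"atoms": residue["atoms"],
                 "center_atom": STANDARD_CENTER[residue["kind"]]}]
    return [{"atoms": [atom], "center_atom": atom}   # one token per heavy atom
            for atom in residue["atoms"] if not atom.startswith("H")]

print(tokenize({"kind": "protein", "is_standard": True,
                "atoms": ["N", "CA", "C", "O", "CB"]}))
```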
Features

| Type | AF2 | AF3 | Description |
| --- | --- | --- | --- |
| MSA features | O | O | Features from genetic search (e.g. the MSA itself, deletion matrix, profile) |
| Template features | O | O | Features from template search (e.g. template atom positions) |
| Token features | X | O | e.g. position indices (token_index), chain identifiers (asym_id), masks (is_protein) |
| Reference features | X | O | Features from a reference conformer (generated with RDKit ETKDGv3) |
| Bond features | X | O | Features encoding bond information |

TemplateModule (TemplateEmbedder)

Baby Pair representation builder
Input: Template search info & (previous) Pair representation
Output: (updated) Pair representation
Goal: Build and update pair representation with the template information from the database
The template stack is similar to AF2's.

MSAModule

Mini-Evoformer  
Input: (previous) MSA representation & (previous) Pair representation
Output: (updated) MSA representation & (updated) Pair representation
Goal: Update MSA and pair representation while exchanging information
(Suppl. Fig. 2)
The MSAModule is practically a mini-Evoformer!
The authors split AF2's Evoformer into the MSAModule and the Pairformer.
|  | AF2 Evoformer | AF3 MSAModule |
| --- | --- | --- |
| # blocks | 48 | 4 |
| Sequence-level feature | MSA rep (row-wise & column-wise attention) | MSA rep (simple pair-weighted average) |
| Structure-level feature | Pair rep (triangular multiplicative & self-attention updates) | Pair rep (triangular multiplicative & self-attention updates) |
| Communication | Outer product mean, pair bias | Outer product mean, pair bias |
Motivation for this change:
1. Enrich the Pair rep with more information, as it forms the backbone for the rest of the network!
2. De-emphasize the MSA rep, as there were discussions that AF2 relies heavily on the MSA.
In AF3, there is no query-key-based attention in the MSA stack, and the architecture is similar to the ExtraMSAStack in AF2
→ reduced computation & memory
The MSA is used for protein and RNA sequences. The pair-weighted average is sketched below.
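A minimal single-head sketch of the pair-weighted average, omitting the multi-head split, gating, and normalization of the paper's MSAPairWeightedAveraging; channel sizes are illustrative. The key point: the attention weights come from the pair rep alone (no queries/keys), so every MSA row shares the same weights.

```python
# Minimal single-head sketch of pair-weighted averaging over the MSA rep
# (no query-key attention; gating/normalization omitted for brevity).
import torch
import torch.nn as nn

class PairWeightedAveraging(nn.Module):
    def __init__(self, c_msa=64, c_pair=128):
        super().__init__()
        self.to_v = nn.Linear(c_msa, c_msa, bias=False)   # values from the MSA rep
        self.to_w = nn.Linear(c_pair, 1, bias=False)      # weight logits from the pair rep

    def forward(self, msa, pair):
        # msa: [n_seq, n_token, c_msa]; pair: [n_token, n_token, c_pair]
        v = self.to_v(msa)
        w = torch.softmax(self.to_w(pair).squeeze(-1), dim=-1)  # [i, j], normalized over j
        # every MSA row s is averaged with the same pair-derived weights
        return torch.einsum("ij,sjc->sic", w, v)
```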

Pairformer

The legitimate successor to the Evoformer
Input: (previous) Single representation & (previous) Pair representation
Output: (updated) Single representation & (updated) Pair representation
Goal: Update single & pair representation while exchanging information
Fig. 2a Pairformer module. n: number of tokens, c: number of channels. The 48 blocks do not share weights.
The Pairformer is a lightweight Evoformer that focuses more on the Pair rep.
|  | AF2 Evoformer | AF3 Pairformer |
| --- | --- | --- |
| # blocks | 48 | 48 |
| Sequence-level feature | MSA rep (row-wise & column-wise attention) | Single rep (row-wise attention only) |
| Structure-level feature | Pair rep (triangular multiplicative & self-attention updates) | Pair rep (triangular multiplicative & self-attention updates) |
| Communication | Outer product mean, pair bias | Pair bias only |
The Single rep is the privileged first row of the MSA rep.
Only row-wise attention for the Single rep!
In the Pairformer, only row-wise attention is performed on the Single rep, since there is only one row (instead of the $N_{\text{seq}}$ rows of the MSA rep). A sketch of this pair-biased attention follows below.
No outer product mean in the Pairformer!
Unlike in AF2, the Single rep does not (directly) influence the Pair rep within a Pairformer block via the outer product mean
(the outer product mean was already used in the MSAModule).
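A minimal single-head sketch of row-wise attention with pair bias, assuming illustrative channel sizes and omitting the multi-head split and gating used in the paper:

```python
# Minimal single-head sketch of attention with pair bias: the pair rep adds
# a learned bias to every i->j attention logit of the single rep.
import torch
import torch.nn as nn

class AttentionPairBias(nn.Module):
    def __init__(self, c_single=384, c_pair=128):
        super().__init__()
        self.q = nn.Linear(c_single, c_single, bias=False)
        self.k = nn.Linear(c_single, c_single, bias=False)
        self.v = nn.Linear(c_single, c_single, bias=False)
        self.bias = nn.Linear(c_pair, 1, bias=False)  # pair rep -> scalar bias per (i, j)

    def forward(self, single, pair):
        # single: [n_token, c_single]; pair: [n_token, n_token, c_pair]
        q, k, v = self.q(single), self.k(single), self.v(single)
        logits = q @ k.T / q.shape[-1] ** 0.5
        logits = logits + self.bias(pair).squeeze(-1)  # the pair rep's only influence here
        return torch.softmax(logits, dim=-1) @ v
```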

DiffusionModule

The rightful heir of the StructureModule, off the leash and armed with modern generative technology
Input: Input feature, Single representation, Pair representation & (previous) Physical atom coordinates
Output: (updated) Physical atom coordinates
Goal: Update physical atom coordinates from original input feature and refined single & pair representation
Fig. 2b Diffusion module. Coarse: token level; fine: atom level. Green: input; blue: pair rep; red: single rep.
DiffusionModule is an all-atom, generative, non-equivariant version of the StructureModule!
AF2 StructureModule
AF3 DiffusionModule
Model type
Predictive
Generative
Resolution
Residue level
All-atom level
Operates on
Residue frames and χ\chi angles
Raw atom coordinates
Equivariance
Forced to satisfy
Not required (!)
Attention type
IPA (invariant point attention)
1. Sequence local attention (atom-level) 2. Global attention (token-level) 3. Sequence local attention (atom-level)
Major loss
FAPE loss
MSE loss (!)
Initial coordinate
Black hole initialization
Generated conformer
The DiffusionModule of AF3 is a conditional, non-equivariant generative model that works on the point cloud of atoms.
Condition: input features, Single rep, Pair rep
Non-equivariance: no geometric inductive bias involved
In AF2, the StructureModule had an invariant operation called IPA to fulfill equivariance constraints.
In AF3, the authors found that no invariance or equivariance constraints are required after all, and omitted them to simplify the architecture.
Two-level architecture:
It seems the authors wanted all atoms to attend globally to each other, but because of the memory and compute cost, they came up with a two-level architecture.
It works as follows: 1. atom level → 2. token level → 3. atom level
1. Sequence-local attention at the atom level
Allow nearby atoms to talk directly to each other!
Here, 'nearby' atoms are defined as sequence-level neighbors (but why not structure-level neighbors?).
Each subset of 32 atoms attends to a subset of 128 nearby atoms in sequence space (see the mask sketch below).
This restriction is sub-optimal, but was necessary to keep the memory and compute costs within reasonable bounds.
2. Global attention at the token level
It's time for the representatives (tokens) to gather the local information and shake hands with the other global representatives!
1. Aggregate fine-grained atom-level features into coarse-grained token-level features.
2. Full self-attention at the token level.
3. Broadcast token activations back to atom features.
3. Sequence-local attention at the atom level, AGAIN!
With that global information update, focus on local atom neighbors again.
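A minimal sketch of the sequence-local attention pattern: the block and window sizes (32 queries, 128 keys) follow the text above, but the exact window placement here is illustrative, not the paper's algorithm.

```python
# Minimal sketch of a sequence-local attention mask: each 32-atom query block
# attends to a ~128-atom window centered on the block, in sequence order.
import torch

def sequence_local_mask(n_atoms, q_block=32, k_window=128):
    idx = torch.arange(n_atoms)
    # center of the 32-atom query block that each atom belongs to
    center = (idx // q_block) * q_block + q_block // 2
    dist = (idx[None, :] - center[:, None]).abs()  # [query, key] sequence distance
    return dist < k_window // 2                    # True where attention is allowed

mask = sequence_local_mask(256)
print(mask.shape, mask[0].sum())  # each query row sees ~128 keys (fewer at the edges)
```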
DiffusionModule algorithm

Training setup

Fig. 2c Training setup (distogram head omitted for simplicity), starting from the end of the network trunk. Blue line: abstract activations; yellow line: ground truth; green line: predicted data.
DiffusionModule training
Efficient training of the DiffusionModule (a.k.a. the MULTIVERSE of the DiffusionModule!)
The DiffusionModule is much cheaper to run than the network trunk (the preceding modules).
→ To improve efficiency:
1. The network trunk is run only once.
2. 48 versions of the input structure are created (by random rotation, translation, and independent noise addition), as sketched below.
3. The DiffusionModule is trained on all versions in parallel.
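A minimal sketch of step 2, assuming a fixed noise scale for readability (in the paper, each copy gets its own noise level sampled from the diffusion schedule):

```python
# Minimal sketch: make 48 copies of one structure, each with a random rotation,
# a random translation, and its own Gaussian noise (fixed sigma for simplicity).
import torch

def make_versions(coords, n_versions=48, sigma=1.0):
    # coords: [n_atoms, 3] ground-truth atom positions
    versions = []
    for _ in range(n_versions):
        q = torch.randn(4)
        w, x, y, z = (q / q.norm()).tolist()      # random unit quaternion
        rot = torch.tensor([                      # quaternion -> rotation matrix
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        shift = torch.randn(3)
        versions.append(coords @ rot.T + shift + sigma * torch.randn_like(coords))
    return torch.stack(versions)                  # [n_versions, n_atoms, 3]
```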
Loss

$$\mathcal{L}_{\text{diffusion}} = \frac{\hat{t}^{2} + \sigma_{\text{data}}^{2}}{(\hat{t} + \sigma_{\text{data}})^{2}} \cdot \left(\mathcal{L}_{\text{MSE}} + \alpha_{\text{bond}} \cdot \mathcal{L}_{\text{bond}}\right) + \mathcal{L}_{\text{smooth\_lddt}}$$

- Weighted aligned MSE loss (with upweighting of nucleotide and ligand atoms)
  - The famous FAPE loss (from AF2) has disappeared; the authors use a relatively simple MSE loss instead.
- Smooth LDDT loss (to emphasize local geometry; sketched below)
- Bond distance loss (during fine-tuning only)
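A minimal sketch of a smooth LDDT-style loss, assuming the standard LDDT thresholds (0.5/1/2/4 Å) and a single 15 Å inclusion radius; the paper's version additionally uses a larger radius for nucleotide pairs:

```python
# Minimal sketch of a smooth LDDT-style loss: the hard LDDT threshold tests
# are replaced by sigmoids so the score becomes differentiable.
import torch

def smooth_lddt_loss(pred, gt, radius=15.0):
    # pred, gt: [n_atoms, 3]
    d_pred = torch.cdist(pred, pred)
    d_gt = torch.cdist(gt, gt)
    delta = (d_pred - d_gt).abs()
    # soft version of the four LDDT distance-difference tests
    score = sum(torch.sigmoid(t - delta) for t in (0.5, 1.0, 2.0, 4.0)) / 4
    local = (d_gt < radius) & ~torch.eye(len(gt), dtype=torch.bool)  # local pairs only
    return 1.0 - score[local].mean()
```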
Confidence module training with diffusion "roll-out"
We need to roll out the DiffusionModule to train the AF3 confidence module. Let's see why.
|  | AF2 confidence module | AF3 confidence module |
| --- | --- | --- |
| Heads | pAE head, pLDDT head, experimentally-resolved head | pAE head, pLDDT head, experimentally-resolved head, pDE head |
| pAE head input | Pair rep (Evoformer output) | Pair rep (mini-Pairformer output, fed with input features, single features, and distances from the rolled-out coordinates) |
| pLDDT head input | Single rep (StructureModule output) | Single rep (mini-Pairformer output, fed with input features, pair features, and distances from the rolled-out coordinates) |
| Input gradient | Not detached | Detached |
In AF2, the confidence module was trained on the relatively simple Pair rep and Single rep.
In AF3, the confidence module evolved to process much more information!
pDE head (new in AF3): predicts the error in absolute distances between atoms
(this head is also trained as a binned classification task)
Other details (such as converting between regression and binned classification, pTM & ipTM calculation, …) mostly follow AF2.
Distogram prediction
The distogram prediction head is identical to AF2's.

Loss

$\alpha_{\text{confidence}} = 10^{-4}$, $\alpha_{\text{diffusion}} = 4$, $\alpha_{\text{distogram}} = 3 \cdot 10^{-2}$, and $\alpha_{\text{pae}}$ is 1 for the final fine-tuning stage and 0 otherwise.
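For context, these weights combine the per-head losses into the overall training objective (reconstructed here from the weights above; $\mathcal{L}_{\text{confidence}}$, $\mathcal{L}_{\text{diffusion}}$, and $\mathcal{L}_{\text{distogram}}$ denote the losses of the corresponding heads):

$$\mathcal{L} = \alpha_{\text{confidence}} \cdot \mathcal{L}_{\text{confidence}} + \alpha_{\text{diffusion}} \cdot \mathcal{L}_{\text{diffusion}} + \alpha_{\text{distogram}} \cdot \mathcal{L}_{\text{distogram}}$$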

Results

Example structures predicted using AF3
Performance
- Accuracy ~ confidence
- Accuracy ~ MSA size
- Accuracy ~ number of seeds

Limitations

- Hallucination
- Limited conformation coverage
- Antibody-containing samples require many seeds to obtain high accuracy
- Model outputs do not always respect chirality
- Overlapping atoms

Engineering

Dataset composition (TBU)
Training protocols

Model selection

TBU

Reference