Pre-training ESM-2
Pre-trained checkpoints for ESM-2 are available at the 8M, 650M, and 3B model sizes. These models were trained by the BioNeMo Framework team to reproduce the original training results from Lin et al., Science (2023), using more recent UniProt data and the BioNeMo training infrastructure. The full pre-training data and train/test splits are also available.
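Each checkpoint can be fetched by its tag with the framework's `load` helper, which is also shown at the top of each recipe below. A minimal sketch, assuming the helper is importable from `bionemo.core.data.load` (adjust the import to match your installed BioNeMo Framework version):

```python
# Sketch: download the pre-trained ESM-2 checkpoints by tag and print their local paths.
# Assumes `load` lives in bionemo.core.data.load; the tags are those used in the recipes below.
from bionemo.core.data.load import load

for tag in ("esm2/8m:2.0", "esm2/nv_650m:2.1", "esm2/nv_3b:2.1"):
    ckpt_path = load(tag)  # downloads (and caches) the checkpoint, returning a local path
    print(tag, "->", ckpt_path)
```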
Model Convergence
Validation perplexity evaluated on the NVIDIA validation set.
| Model Size | Perplexity at 500K Updates |
| ---------- | -------------------------- |
| 8M         | 10.26                      |
| 650M       | 7.14                       |
| 3B         | 6.42                       |
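Perplexity is the exponential of the mean cross-entropy over masked tokens (the usual masked-language-modeling convention), which is presumably what the monitored `val_loss` metric measures. A small illustration of that relationship; the loss values below are derived from the table above, not separately measured:

```python
# Convert a mean masked-token cross-entropy (in nats) to perplexity and back.
import math

def perplexity(mean_ce_loss: float) -> float:
    return math.exp(mean_ce_loss)

print(round(perplexity(2.328), 2))   # ~10.26, matching the 8M entry above
print(round(math.log(7.14), 3))      # ~1.966 nats, the loss implied by the 650M entry
```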
Pre-training Recipes
esm2_8m_ckpt_path = load("esm2/8m:2.0")
Training Script
| Training Parameters     | Value |
| ----------------------- | ----- |
| # of GPUs               | 32    |
| GPU Type                | A100  |
| Batch Size (per device) | 64    |
train_esm2 \
--create-tensorboard-logger \
--resume-if-exists \
--wandb-project=<wandb-project-name> \
--save-top-k=10 \
--train-cluster-path=/data/train_clusters.parquet \ # (1)!
--train-database-path=/data/train.db \
--valid-cluster-path=/data/valid_clusters.parquet \
--valid-database-path=/data/validation.db \
--num-steps=500_000 \
--metric-to-monitor-for-checkpoints=val_loss \
--micro-batch-size=64 \
--num-nodes=4 \
--num-gpus=8 \
--val-check-interval=10000 \
--limit-val-batches=1.0 \
--result-dir=/results/esm2_pretrain_8m \
--experiment-name=esm2_pretrain_8m \
--num-layers=6 \
--hidden-size=320 \
--num-attention-heads=20 \
--ffn-hidden-size=1280;
- Paths here must be mounted into the `bionemo-framework` Docker image.
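As a quick sanity check on the parameters above, the effective global batch size follows from the node count, GPUs per node, and per-device micro batch size (assuming no gradient accumulation, which this command does not set):

```python
# Effective global batch size for the 8M recipe: 4 nodes x 8 GPUs x 64 sequences per GPU.
num_nodes = 4
gpus_per_node = 8
micro_batch_size = 64

global_batch_size = num_nodes * gpus_per_node * micro_batch_size
print(global_batch_size)  # 2048 sequences per optimizer step
```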
esm2_650m_ckpt_path = load("esm2/nv_650m:2.1")
Training Script
| Training Parameters     | Value |
| ----------------------- | ----- |
| # of GPUs               | 64    |
| GPU Type                | H100  |
| Batch Size (per device) | 32    |
train_esm2 \
--create-tensorboard-logger \
--resume-if-exists \
--wandb-project=<wandb-project-name> \
--save-top-k=10 \
--train-cluster-path=/data/train_clusters.parquet \ # (1)!
--train-database-path=/data/train.db \
--valid-cluster-path=/data/valid_clusters.parquet \
--valid-database-path=/data/validation.db \
--num-steps=500_000 \
--metric-to-monitor-for-checkpoints=val_loss \
--micro-batch-size=32 \
--num-nodes=8 \
--num-gpus=8 \
--val-check-interval=10000 \
--limit-val-batches=1.0 \
--result-dir=/results/esm2_pretrain_650m \
--experiment-name=esm2_pretrain_650m \
--min-seq-length=1024 \
--max-seq-length=1024 \
--num-layers=33 \
--hidden-size=1280 \
--num-attention-heads=20 \
--ffn-hidden-size=5120;
- Paths here must be mounted into the `bionemo-framework` Docker image.
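This recipe also fixes `--min-seq-length` and `--max-seq-length` at 1024, so batches appear to use a fixed 1024-position layout. A rough back-of-the-envelope for token throughput per step (this counts padded positions, so it overstates the number of real residues processed):

```python
# Rough tokens-per-step estimate for the 650M recipe: 8 nodes x 8 GPUs x 32 seqs x 1024 positions.
num_nodes, gpus_per_node = 8, 8
micro_batch_size, seq_length = 32, 1024

global_batch_size = num_nodes * gpus_per_node * micro_batch_size  # 2048 sequences
positions_per_step = global_batch_size * seq_length               # 2,097,152 token positions
print(global_batch_size, positions_per_step)
```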
esm2_3b_ckpt_path = load("esm2/nv_3b:2.1")
Training Script
| Training Parameters     | Value  |
| ----------------------- | ------ |
| # of GPUs               | 128    |
| GPU Type                | H100   |
| Batch Size (per device) | 16     |
| Warmup Steps            | 20,000 |
train_esm2 \
--create-tensorboard-logger \
--resume-if-exists \
--wandb-project=<wandb-project-name> \
--save-top-k=10 \
--train-cluster-path=/data/train_clusters.parquet \ # (2)!
--train-database-path=/data/train.db \
--valid-cluster-path=/data/valid_clusters.parquet \
--valid-database-path=/data/validation.db \
--num-steps=500_000 \
--warmup-steps=20_000 \ # (1)!
--metric-to-monitor-for-checkpoints=val_loss \
--micro-batch-size=16 \
--num-nodes=16 \
--num-gpus=8 \
--val-check-interval=2500 \
--limit-val-batches=1.0 \
--result-dir=/results/esm2_pretrain_3b \
--experiment-name=esm2_pretrain_3b \
--min-seq-length=1024 \
--max-seq-length=1024 \
--num-layers=36 \
--hidden-size=2560 \
--num-attention-heads=40 \
--ffn-hidden-size=10240;
- We had to increase the number of warmup steps 10x over the published training recipe for ESM-2 3B, which was likely trained with fp16 precision. This gave us an overall similar initial loss curve while avoiding convergence issues at around 2,000 steps.
- Paths here must be mounted into the `bionemo-framework` Docker image.
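For intuition on the warmup change noted above: a linear warmup (the common default) ramps the learning rate from near zero to its peak over the first N steps, so stretching it 10x simply makes that ramp shallower. The sketch below shows the shape of a 20,000-step ramp; the peak learning rate and the post-warmup decay are not specified in this recipe, so the value used here is a hypothetical placeholder, not the one used in training:

```python
# Shape of a linear warmup ramp over 20,000 steps. peak_lr is a hypothetical placeholder;
# the recipe above does not state the peak learning rate or the post-warmup decay schedule.
warmup_steps = 20_000
peak_lr = 4e-4  # hypothetical, for illustration only

def warmup_lr(step: int) -> float:
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # post-warmup decay omitted

for step in (0, 2_000, 10_000, 20_000):
    print(step, f"{warmup_lr(step):.2e}")
```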