Pre-training ESM-2
Pre-trained checkpoints for ESM-2 are available at the 8M, 650M, and 3B model sizes. These models were trained by the BioNeMo Framework team to reproduce the original training results from Lin et al., Science (2023), using more recent UniProt data and the BioNeMo training infrastructure. The full pre-training data and train/test splits are also available.
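Each checkpoint can be fetched by its tag with the framework's `load` helper, which is also shown at the top of each recipe below. A minimal sketch, assuming the helper is importable from `bionemo.core.data.load` (adjust the import to match your installed BioNeMo Framework version):

```python
# Sketch: download the pre-trained ESM-2 checkpoints by tag and print their local paths.
# Assumes `load` lives in bionemo.core.data.load; the tags are those used in the recipes below.
from bionemo.core.data.load import load

for tag in ("esm2/8m:2.0", "esm2/nv_650m:2.1", "esm2/nv_3b:2.1"):
    ckpt_path = load(tag)  # downloads (and caches) the checkpoint, returning a local path
    print(tag, "->", ckpt_path)
```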
Model Convergence
Validation perplexity evaluated on the NVIDIA validation set.
| Model Size | Perplexity at 500K Updates |
| ---------- | -------------------------- |
| 8M         | 10.26                      |
| 650M       | 7.14                       |
| 3B         | 6.42                       |
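Perplexity is the exponential of the mean cross-entropy over masked tokens (the usual masked-language-modeling convention), which is presumably what the monitored `val_loss` metric measures. A small illustration of that relationship; the loss values below are derived from the table above, not separately measured:

```python
# Convert a mean masked-token cross-entropy (in nats) to perplexity and back.
import math

def perplexity(mean_ce_loss: float) -> float:
    return math.exp(mean_ce_loss)

print(round(perplexity(2.328), 2))   # ~10.26, matching the 8M entry above
print(round(math.log(7.14), 3))      # ~1.966 nats, the loss implied by the 650M entry
```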
Pre-training Recipes
esm2_8m_ckpt_path = load("esm2/8m:2.0")
Training Script
| Training Parameters     | Value |
| ----------------------- | ----- |
| # of GPUs               | 32    |
| GPU Type                | A100  |
| Batch Size (per device) | 64    |
train_esm2 \
--create-tensorboard-logger \
--resume-if-exists \
--wandb-project=<wandb-project-name> \
--save-top-k=10 \
--train-cluster-path=/data/train_clusters.parquet \ # (1)!
--train-database-path=/data/train.db \
--valid-cluster-path=/data/valid_clusters.parquet \
--valid-database-path=/data/validation.db \
--num-steps=500_000 \
--metric-to-monitor-for-checkpoints=val_loss \
--micro-batch-size=64 \
--num-nodes=4 \
--num-gpus=8 \
--val-check-interval=10000 \
--limit-val-batches=1.0 \
--result-dir=/results/esm2_pretrain_8m \
--experiment-name=esm2_pretrain_8m \
--num-layers=6 \
--hidden-size=320 \
--num-attention-heads=20 \
--ffn-hidden-size=1280;
- Paths here must be mounted into the `bionemo-framework` Docker image.
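As a quick sanity check on the parameters above, the effective global batch size follows from the node count, GPUs per node, and per-device micro batch size (assuming no gradient accumulation, which this command does not set):

```python
# Effective global batch size for the 8M recipe: 4 nodes x 8 GPUs x 64 sequences per GPU.
num_nodes = 4
gpus_per_node = 8
micro_batch_size = 64

global_batch_size = num_nodes * gpus_per_node * micro_batch_size
print(global_batch_size)  # 2048 sequences per optimizer step
```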
esm2_650m_ckpt_path = load("esm2/nv_650m:2.1")
Training Script
| Training Parameters     | Value |
| ----------------------- | ----- |
| # of GPUs               | 64    |
| GPU Type                | H100  |
| Batch Size (per device) | 32    |
train_esm2 \
--create-tensorboard-logger \
--resume-if-exists \
--wandb-project=<wandb-project-name> \
--save-top-k=10 \
--train-cluster-path=/data/train_clusters.parquet \ # (1)!
--train-database-path=/data/train.db \
--valid-cluster-path=/data/valid_clusters.parquet \
--valid-database-path=/data/validation.db \
--num-steps=500_000 \
--metric-to-monitor-for-checkpoints=val_loss \
--micro-batch-size=32 \
--num-nodes=8 \
--num-gpus=8 \
--val-check-interval=10000 \
--limit-val-batches=1.0 \
--result-dir=/results/esm2_pretrain_650m \
--experiment-name=esm2_pretrain_650m \
--min-seq-length=1024 \
--max-seq-length=1024 \
--num-layers=33 \
--hidden-size=1280 \
--num-attention-heads=20 \
--ffn-hidden-size=5120;
- Paths here must be mounted into the `bionemo-framework` Docker image.
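This recipe also fixes `--min-seq-length` and `--max-seq-length` at 1024, so batches appear to use a fixed 1024-position layout. A rough back-of-the-envelope for token throughput per step (this counts padded positions, so it overstates the number of real residues processed):

```python
# Rough tokens-per-step estimate for the 650M recipe: 8 nodes x 8 GPUs x 32 seqs x 1024 positions.
num_nodes, gpus_per_node = 8, 8
micro_batch_size, seq_length = 32, 1024

global_batch_size = num_nodes * gpus_per_node * micro_batch_size  # 2048 sequences
positions_per_step = global_batch_size * seq_length               # 2,097,152 token positions
print(global_batch_size, positions_per_step)
```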
esm2_3b_ckpt_path = load("esm2/nv_3b:2.1")
Training Script
| Training Parameters     | Value  |
| ----------------------- | ------ |
| # of GPUs               | 128    |
| GPU Type                | H100   |
| Batch Size (per device) | 16     |
| Warmup Steps            | 20,000 |
train_esm2 \
--create-tensorboard-logger \
--resume-if-exists \
--wandb-project=<wandb-project-name> \
--save-top-k=10 \
--train-cluster-path=/data/train_clusters.parquet \ # (2)!
--train-database-path=/data/train.db \
--valid-cluster-path=/data/valid_clusters.parquet \
--valid-database-path=/data/validation.db \
--num-steps=500_000 \
--warmup-steps=20_000 \ # (1)!
--metric-to-monitor-for-checkpoints=val_loss \
--micro-batch-size=16 \
--num-nodes=16 \
--num-gpus=8 \
--val-check-interval=2500 \
--limit-val-batches=1.0 \
--result-dir=/results/esm2_pretrain_3b \
--experiment-name=esm2_pretrain_3b \
--min-seq-length=1024 \
--max-seq-length=1024 \
--num-layers=36 \
--hidden-size=2560 \
--num-attention-heads=40 \
--ffn-hidden-size=10240;
- We had to increase the number of warmup steps 10x over the published training recipe for ESM-2 3B, which was likely trained with fp16 precision. This gave us an overall similar initial loss curve while avoiding convergence issues at around 2,000 steps.
- Paths here must be mounted into the `bionemo-framework` Docker image.
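For intuition on the warmup change noted above: a linear warmup (the common default) ramps the learning rate from near zero to its peak over the first N steps, so stretching it 10x simply makes that ramp shallower. The sketch below shows the shape of a 20,000-step ramp; the peak learning rate and the post-warmup decay are not specified in this recipe, so the value used here is a hypothetical placeholder, not the one used in training:

```python
# Shape of a linear warmup ramp over 20,000 steps. peak_lr is a hypothetical placeholder;
# the recipe above does not state the peak learning rate or the post-warmup decay schedule.
warmup_steps = 20_000
peak_lr = 4e-4  # hypothetical, for illustration only

def warmup_lr(step: int) -> float:
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # post-warmup decay omitted

for step in (0, 2_000, 10_000, 20_000):
    print(step, f"{warmup_lr(step):.2e}")
```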