Example notebook
import os
import tempfile
import pooch
from torch.utils.data import DataLoader
from bionemo.core import BIONEMO_CACHE_DIR
from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset
from bionemo.scdl.util.torch_dataloader_utils import collate_sparse_matrix_batch
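First, download the input data. This can be done by copying https://datasets.cellxgene.cziscience.com/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad to a directory named hdf5s. The cell below does this with pooch.retrieve, which stores the file under BIONEMO_CACHE_DIR / "hdf5s" and checks it against a known hash.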
input_data = pooch.retrieve(
    'https://datasets.cellxgene.cziscience.com/97e96fb1-8caf-4f08-9174-27308eabd4ea.h5ad',
    path=BIONEMO_CACHE_DIR / "hdf5s",
    known_hash='a0728e13a421bbcd6b2718e1d32f88d0d5c7cb92289331e3f14a59b7c513b3bc',
)
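The known_hash argument lets pooch verify the downloaded file; a hash string given without an algorithm prefix is treated as SHA256. As a rough sketch of what such a check involves, the snippet below hashes a small throwaway file in chunks. The helper sha256_of_file is hypothetical and is not part of pooch or SCDL.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, reading it in chunks.

    Hypothetical helper for illustration; pooch performs its own check.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large .h5ad files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a small throwaway file rather than the real download.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(b"example payload")
    demo_digest = sha256_of_file(demo_path)

# Compare against the digest of the same bytes computed in memory.
expected_digest = hashlib.sha256(b"example payload").hexdigest()
print(demo_digest == expected_digest)
```

Running the same helper on the downloaded file and comparing the result with the known_hash string above would be one way to re-check the file by hand.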
# Create a SingleCellMemMapDataset
dataset_temp_dir = tempfile.TemporaryDirectory()
dataset_dir = os.path.join(dataset_temp_dir.name, "97e_scmm")
data = SingleCellMemMapDataset(dataset_dir, input_data)
# Save the dataset to disk
data.save()
True
# Reload the data
reloaded_data = SingleCellMemMapDataset(dataset_dir)
The number of columns varies from observation to observation. However, for a batch size of 1 the data does not need to be collated. Each batch is then returned as a torch tensor of shape (1, 2, num_obs). The first row, of length num_obs, contains the column pointers, and the second row contains the corresponding values.
# Placeholder model that returns its input unchanged
model = lambda x: x
dataloader = DataLoader(data, batch_size=1, shuffle=True, collate_fn=collate_sparse_matrix_batch)
n_epochs = 1
for e in range(n_epochs):
    for batch in dataloader:
        model(batch)
/usr/local/lib/python3.10/dist-packages/bionemo/scdl/util/torch_dataloader_utils.py:39: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/SparseCsrTensorImpl.cpp:53.) batch_sparse_tensor = torch.sparse_csr_tensor(batch_rows, batch_cols, batch_values, size=(len(batch), max_pointer))
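To make the per-observation layout described above concrete, here is a minimal sketch in plain Python lists. It builds the two rows (column indices and values) for one observation from a dense row. The helper dense_row_to_sparse_pair and the toy data are assumptions made for illustration; this is not how SCDL produces its tensors.

```python
def dense_row_to_sparse_pair(dense_row):
    """Return [column_indices, values] for the non-zero entries of one row.

    Hypothetical helper: it only mirrors the two-row layout described in
    the text, using plain lists instead of torch tensors.
    """
    columns = [index for index, value in enumerate(dense_row) if value != 0]
    values = [dense_row[index] for index in columns]
    return [columns, values]


# One observation with 6 columns, 3 of which are non-zero.
dense_row = [0.0, 2.0, 0.0, 0.0, 5.0, 1.0]
pair = dense_row_to_sparse_pair(dense_row)

# A batch of size 1 wraps the pair in an outer list, giving shape (1, 2, 3).
batch_of_one = [pair]
print(batch_of_one)
```

Because each observation has its own number of non-zero columns, pairs from different observations generally have different lengths, which is why larger batches cannot simply be stacked.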
Collation is optional with a batch size of 1 but required for larger batch sizes. It combines several sparse matrices into a single torch tensor in the CSR (Compressed Sparse Row) format.
# Placeholder model that returns its input unchanged
model = lambda x: x
dataloader = DataLoader(data, batch_size=8, shuffle=True, collate_fn=collate_sparse_matrix_batch)
n_epochs = 1
for e in range(n_epochs):
    for batch in dataloader:
        model(batch)
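The idea behind CSR collation can be sketched without torch. The snippet below merges several [column_indices, values] pairs of different lengths into the three flat arrays that define a CSR matrix: row pointers, column indices, and values. The function collate_to_csr and the toy batch are hypothetical; the actual collate_sparse_matrix_batch may differ in its details.

```python
def collate_to_csr(batch):
    """Merge [columns, values] pairs into CSR row pointers, columns, values.

    Hypothetical sketch of the CSR layout, not the SCDL implementation.
    """
    row_pointers = [0]
    column_indices = []
    values = []
    for columns, row_values in batch:
        column_indices.extend(columns)
        values.extend(row_values)
        # Each pointer marks where the next row starts in the flat arrays.
        row_pointers.append(len(column_indices))
    return row_pointers, column_indices, values


batch = [
    [[1, 4], [2.0, 5.0]],            # observation 0: two non-zero columns
    [[0], [3.0]],                    # observation 1: one non-zero column
    [[2, 3, 5], [1.0, 1.0, 4.0]],    # observation 2: three non-zero columns
]
row_pointers, column_indices, values = collate_to_csr(batch)
print(row_pointers)
print(column_indices)
print(values)
```

Row i of the batch occupies positions row_pointers[i] to row_pointers[i + 1] in the flat arrays. The three lists play the roles of the crow_indices, col_indices, and values arguments of torch.sparse_csr_tensor.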
Alternatively, if there are multiple AnnData files, they can be converted into a single SingleCellMemMapDataset. If the hdf5s directory has one or more AnnData files, the SingleCellCollection class crawls the filesystem recursively to find AnnData files (those with the h5ad extension). The code below is taken from scripts/convert_h5ad_to_scdl.py and creates a new dataset at the given output directory. The same conversion can also be run with the convert_h5ad_to_scdl command.
# path to dir holding hdf5s data
hdf5s = BIONEMO_CACHE_DIR / "hdf5s"
# path to output dir where SCDataset will be stored
output_temp_dir = tempfile.TemporaryDirectory()
output_dir = os.path.join(output_temp_dir.name, 'scdataset_output')
from bionemo.scdl.io.single_cell_collection import SingleCellCollection
with tempfile.TemporaryDirectory() as temp_dir:
    coll = SingleCellCollection(temp_dir)
    coll.load_h5ad_multi(hdf5s, max_workers=4, use_processes=True)
    coll.flatten(output_dir, destroy_on_copy=True)
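The recursive search for h5ad files mentioned above can be approximated with the standard library. This is only a sketch of the idea using pathlib; the helper find_h5ad_files and the demo directory layout are assumptions, and SingleCellCollection may locate files differently.

```python
import tempfile
from pathlib import Path


def find_h5ad_files(root):
    """Return all files under root ending in .h5ad, sorted for a stable order.

    Hypothetical helper for illustration; not part of SCDL.
    """
    return sorted(path for path in Path(root).rglob("*.h5ad") if path.is_file())


# Build a small throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_root:
    root = Path(demo_root)
    (root / "batch_a" / "nested").mkdir(parents=True)
    (root / "top.h5ad").touch()
    (root / "batch_a" / "nested" / "deep.h5ad").touch()
    (root / "batch_a" / "notes.txt").touch()  # ignored: wrong extension
    found = [path.relative_to(root).as_posix() for path in find_h5ad_files(root)]

print(found)
```

Files at any depth are picked up as long as they carry the h5ad extension, while other files in the same tree are skipped.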
dataset_temp_dir.cleanup()
output_temp_dir.cleanup()