# CellSpliceNet: Interpretable Multimodal Modeling of Alternative Splicing Across Neurons in C. elegans
CellSpliceNet is an interpretable transformer-based multimodal deep learning framework that predicts splicing outcomes across neurons in C. elegans by integrating four complementary data modalities.
Authors: Arman Afrasiyabi, Jake Kovalic, Chen Liu, Egbert Castro, Alexis Weinreb, Erdem Varol, David M. Miller III, Marc Hammarlund, Smita Krishnaswamy
Quick links:
📄 Preprint (bioRxiv) · 🧪 Dataset · 💻 Repo
We introduce CellSpliceNet, an interpretable transformer-based multimodal deep learning framework designed to predict splicing outcomes across the neurons of C. elegans. By integrating four complementary data modalities—(1) long-range genomic sequence, (2) local regions of interest (ROIs) in the RNA sequence, (3) secondary structure, and (4) gene expression—CellSpliceNet captures the complex interplay of factors that influence splicing decisions within the cellular context. CellSpliceNet employs modality-specific transformer embeddings, incorporating structural representations guided by mutual information and scattering graph embeddings. A carefully designed multimodal multi-head attention mechanism preserves the integrity of each modality while enabling selective cross-modal interactions (e.g., allowing gene expression to inform sequence/structure signals). Attention-based pooling within each modality highlights biologically critical elements, such as canonical intron–exon splice boundaries and accessible single-stranded RNA loop structures within exons.
- Multimodal fusion: sequence (global + ROI), secondary structure, and gene expression.
- Interpretable attention: modality-specific pooling surfaces biologically relevant signals (e.g., splice boundaries, loop accessibility).
- Selective cross-modal attention: preserves modality integrity while enabling targeted information flow.
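To make the attention-based pooling idea above concrete, here is a toy, dependency-free sketch (not taken from the CellSpliceNet code; `attention_pool`, the query vector, and the token values are all invented for illustration): each token embedding is scored against a query, the scores are softmaxed, and the tokens are averaged with those weights, so high-scoring positions (e.g. splice boundaries) dominate the pooled representation.

```python
import math

def attention_pool(token_embeddings, query):
    """Toy attention-based pooling: score each token against a query vector,
    softmax the scores, and return the weighted average of the tokens.
    Illustrative only -- names and shapes are not taken from the repo."""
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in token_embeddings]
    m = max(scores)                                  # subtract max for stability
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)
    weights = [e / z for e in exp_scores]
    dim = len(token_embeddings[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, token_embeddings))
              for d in range(dim)]
    return pooled, weights

# The second token aligns with the query, so it receives almost all the weight,
# analogous to a splice boundary dominating the pooled summary of a sequence.
tokens = [[1.0, 0.0], [0.0, 5.0], [0.5, 0.5]]
pooled, weights = attention_pool(tokens, query=[0.0, 1.0])
```

In the actual model this happens per modality, with learned queries, so the attention weights themselves serve as an interpretability readout.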
- Repository Structure
- Requirements
- Installation
- Data: Download & Configure
- Quickstart: Train & Validate
- Pretrained Weights
- Troubleshooting
- Contributing
- License
- Citation
## Repository Structure

```
CellSpliceNet/
  src/
    data/            # datasets + dataloaders
    models/          # model definitions (transformers, heads, etc.)
    nn/              # neural modules and layers
    utils/           # logging, seeding, config helpers, misc
    viz/             # visualization utilities for results/attention maps
    train.py         # train/eval loops
  pp/                # (optional) pre/post-processing assets; preprocessed data provided
  requirements.txt
  LICENSE
  README.md
```
## Requirements

- OS: Enterprise Linux 8.10 (other modern Linux distros are likely fine)
- Python: 3.9.18
- CUDA: 11.3.1 (for GPU training)
- PyTorch: 1.10.2
- Dependencies: see `requirements.txt`
## Installation

```shell
git clone https://github.com/KrishnaswamyLab/CellSpliceNet
cd CellSpliceNet
```
### Conda (recommended)

```shell
# If your HPC requires modules, load them first (otherwise skip):
# module load CUDA/11.3.1 CUDAcore/11.3.1 cuDNN/8.2.1.32-CUDA-11.3.1

# Option A: from environment.yml (if present)
conda env create -f environment.yml -n CellSpliceNet

# Option B: from requirements.txt
conda create -n CellSpliceNet python=3.9
conda activate CellSpliceNet
pip install -r requirements.txt

# Install PyTorch matching your CUDA (example for CUDA 11.3; adjust to your platform):
pip install torch==1.10.2 torchvision==0.11.3 torchaudio==0.10.2
```
### Virtualenv

```shell
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Tip: If you see a CUDA version mismatch at runtime, reinstall PyTorch with the correct CUDA build.
## Data: Download & Configure

- Download the dataset: CellSpliceNet-dataset
- Set the dataset root in `src/args.py`: `dataset_root = "/path/to/your/dataset"`. (If the code supports CLI/environment overrides in your fork, you can use those instead; otherwise edit `args.py` directly.)
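If your fork does support environment overrides, a minimal pattern for resolving the dataset root looks like the sketch below. This is hypothetical glue code, not part of the repository: the environment variable name `CELLSPLICENET_DATA` and the helper `resolve_dataset_root` are invented here, and the default mirrors the placeholder path from `src/args.py`.

```python
import os
from pathlib import Path

# Placeholder default, matching the value you would otherwise edit in src/args.py.
DEFAULT_DATASET_ROOT = "/path/to/your/dataset"

def resolve_dataset_root(env_var: str = "CELLSPLICENET_DATA") -> Path:
    """Prefer an environment-variable override; fall back to the hard-coded default."""
    return Path(os.environ.get(env_var, DEFAULT_DATASET_ROOT)).expanduser()

root = resolve_dataset_root()
```

This keeps cluster-specific paths out of version control: each user exports `CELLSPLICENET_DATA` in their shell profile instead of editing the source.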
## Quickstart: Train & Validate

Run the default training loop (includes validation as configured):

```shell
python src/train.py
```
- Logs, checkpoints, and metrics are saved as defined in `src/utils` (and/or your config).
- For experiment control (epochs, batch size, etc.), update `src/args.py` (or your config system if present).
## Pretrained Weights

A pretrained model is available here: CellSpliceNet.pth. Download the weights and point your configuration/checkpoint loader to the file path per your setup.
## Troubleshooting

- CUDA mismatch / "CUDA driver version is insufficient": ensure the installed PyTorch build matches your system CUDA (or use the CPU build).
- Out of GPU memory: reduce `batch_size` and/or sequence length; consider gradient accumulation or mixed precision (AMP).
- Dataset path errors: double-check `dataset_root` in `src/args.py` and confirm the expected subfolders/files exist.
- Image not rendering in README: confirm the filename is exactly `CellSplceNet.png` in the repository root (case-sensitive on Linux).
All experiments are conducted on a single A100 GPU. Data loading and preprocessing pipelines are implemented with standard libraries. Reproducibility is ensured via fixed random seeds and environment specification. Preprocessing scripts, end-to-end training and inference scripts, and pretrained model checkpoints are available in the public repository.
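The fixed-seed setup mentioned above can be sketched as a small helper; this is an illustrative pattern, not the repository's actual utility (the name `seed_everything` and the torch guard are assumptions here), and you would extend it with NumPy seeding when that library is in play.

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Minimal seeding helper in the spirit of the fixed-seed setup described
    above. Seeds Python's RNG and hashing; torch is seeded only if installed."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:  # torch is only required for actual training runs
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

# Reseeding with the same value reproduces the same draws.
seed_everything(0)
first = [random.random() for _ in range(3)]
seed_everything(0)
second = [random.random() for _ in range(3)]
```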
We partitioned the data with a row-level IID random split into training (65%), validation (15%), and test (20%) sets by drawing a uniform random assignment for each observation. To assess robustness, we additionally performed k-fold cross-validation and repeated the entire training/testing procedure ten independent times with different random seeds. All preprocessing and partitioning scripts are available in the repository under the preprocessing (pp/) folder. To prevent leakage, all normalizers/tokenizers were fit on the training set only; genomic windows/ROIs were generated once and constrained not to cross splits; augmentation was applied to training data only; and early stopping and hyperparameters were selected on the validation set, with the test set evaluated once at the end.
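The row-level IID split described above amounts to one seeded uniform draw per observation; a minimal sketch (function name and return shape are invented here, and the actual scripts live in `pp/`):

```python
import random

def iid_split(n_rows, fracs=(0.65, 0.15, 0.20), seed=0):
    """Row-level IID split mirroring the 65/15/20 partition described above:
    each row is independently assigned by a seeded uniform draw."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for i in range(n_rows):
        u = rng.random()
        if u < fracs[0]:
            splits["train"].append(i)
        elif u < fracs[0] + fracs[1]:
            splits["val"].append(i)
        else:
            splits["test"].append(i)
    return splits

splits = iid_split(100_000)
```

Because the assignment is per-row and seeded, the split is reproducible, the three sets are disjoint by construction, and the realized fractions concentrate around 65/15/20 for large n.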
## Contributing

Contributions are welcome! Please open an issue to discuss major changes. For pull requests:
- Fork the repo and create a feature branch.
- Add or update tests if applicable.
- Ensure style/formatting is consistent.
- Open a PR with a clear description and motivation.
## License

This project is distributed under the terms specified in the LICENSE file.
## Citation

If you use this repository, models, or ideas in your research, please cite:

```bibtex
@article{Afrasiyabi2025CellSpliceNet,
  title   = {CellSpliceNet: Interpretable Multimodal Modeling of Alternative Splicing Across Neurons in C. elegans},
  author  = {Afrasiyabi, Arman and Kovalic, Jake and Liu, Chen and Castro, Egbert and Weinreb, Alexis and Varol, Erdem and Miller, David M., III and Hammarlund, Marc and Krishnaswamy, Smita},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.06.22.660966},
  url     = {https://www.biorxiv.org/content/10.1101/2025.06.22.660966v1}
}
```