FunctionLab/otari

Welcome to the Otari framework repository! Otari is a comprehensive and interpretable graph-based framework of transcript isoform regulation, powering the characterization of transcriptomic diversity and isoform-level variant effects at scale.

This repository can be used to run the Otari model and get the Otari regulatory profiles, isoform abundance predictions, and variant effect predictions for input sequences or variants.

We also provide information and instructions for how to train the Otari graph neural network model.

Requirements

To set up an Otari environment, first pull the repository:

git clone https://github.com/FunctionLab/otari.git

Change into the root of the repository, then create the Otari conda environment (Python 3):

conda env create -n otari -f requirements.yml

Activate the environment:

conda activate otari

Setup

Before proceeding, download and extract the resources subdirectory into the root directory of Otari. This subdirectory contains the Otari model weights; the ConvSplice, Sei, and Seqweaver model weights; hg38 FASTA files; GENCODE annotations; pickle files; node sequence attributes; transcript datasets; and more:

sh ./download_data.sh

Variant effect prediction

The following script can be used to obtain Otari variant effects at the isoform level (it must run on a GPU node):

  1. variant_effect_prediction.py (and its corresponding bash script, variant_effect_prediction.sh): accepts a .tsv or .vcf variant file as input and makes variant effect predictions.

Example usage:

sh variant_effect_prediction.sh <input-file> <output-dir> --annotate --visualize

Arguments:

  • <input-file>: .tsv or .vcf input file with variants. A .tsv file must have four tab-separated columns in this order: chr, pos, ref, alt.
  • <output-dir>: Path to the output directory (it will be created if it does not exist).
  • --annotate: boolean True or False (default is True). Set --annotate to False only if the variants are already annotated with genes and strands (make sure the gene column is named genes).
  • --visualize: boolean True or False (default is False). Visualize tissue-specific variant effects, transcript splice structures, and most affected nodes. .png files saved to <output-dir>/figures.
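
Because the .tsv layout is strict, it can help to check a variant file before requesting a GPU node. The following is a minimal sketch, not part of Otari: the helper names are hypothetical, and it assumes there is no header line and that alleles use only the letters A, C, G, T, and N. The README does not state whether Otari accepts header lines or other allele notations, so adjust the checks if your files differ.

```python
import re

# Assumed column order, taken from the argument description: chr, pos, ref, alt.
# The allele alphabet below is an assumption, not something Otari documents.
ALLELE = re.compile(r"[ACGTN]+", re.IGNORECASE)
POSITION = re.compile(r"[0-9]+")


def check_variant_line(line):
    """Return a list of problems found in one tab-separated variant line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return [f"expected 4 tab-separated fields, found {len(fields)}"]
    chrom, pos, ref, alt = fields
    problems = []
    if not chrom:
        problems.append("empty chromosome field")
    if not POSITION.fullmatch(pos) or int(pos) < 1:
        problems.append(f"position is not a positive integer: {pos!r}")
    if not ALLELE.fullmatch(ref):
        problems.append(f"unexpected ref allele: {ref!r}")
    if not ALLELE.fullmatch(alt):
        problems.append(f"unexpected alt allele: {alt!r}")
    return problems


def check_variant_file(path):
    """Map 1-based line numbers to the problems found on each malformed line."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            problems = check_variant_line(line)
            if problems:
                report[number] = problems
    return report
```

Calling check_variant_file("test.tsv") returns an empty dictionary when every line passes these checks. A line separated by spaces rather than tabs is reported as having the wrong number of fields.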

Expected outputs:

  • variant_effects_comprehensive.tsv: variant effect prediction for every isoform and tissue. Includes max_effect and mean_effect (absolute effects) across tissues.
  • interpretability_analysis.tsv: interpretability metrics including most impacted node and features.
  • variant_to_most_affected_node_embedding.pkl: node sequence attributes for the most impacted node for each variant and transcript.
  • figures/ containing variant effects and transcript structures.
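
To inspect the results, the comprehensive table can be ranked by effect size. This is a minimal sketch using only the Python standard library. It assumes the file has a header row containing a numeric max_effect column, as the description above suggests; the other column names are not documented here, so the function returns each row as a dictionary without relying on them. The function name top_effects is hypothetical.

```python
import csv


def top_effects(path, n=10, column="max_effect"):
    """Return the n rows with the largest values in the given column."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    # abs() changes nothing if the values are already absolute effects.
    rows.sort(key=lambda row: abs(float(row[column])), reverse=True)
    return rows[:n]
```

For example, top_effects("./test_outputs/variant_effects_comprehensive.tsv", n=5) would list the five rows with the largest max_effect, assuming the run wrote its output to ./test_outputs.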

Example variant effect prediction run

We provide test.tsv and test.vcf (hg38 coordinates) as examples so you can try running Otari VEP once you have installed all the requirements.

Example command run on a GPU node:

sh variant_effect_prediction.sh test.vcf ./test_outputs --annotate --visualize

Alternatively, submit it as a SLURM job (make sure CUDA is available):

sbatch variant_effect_prediction.sh test.vcf ./test_outputs --annotate --visualize

The log will be available in the logfiles subdirectory.
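
Since prediction requires a GPU, a quick check for CUDA before launching can save a failed job. The sketch below is not part of Otari. It assumes the otari environment provides PyTorch, which this README does not state explicitly; if PyTorch is not importable, it falls back to looking for the nvidia-smi utility, which only indicates that NVIDIA driver tools are installed. The function name cuda_status is hypothetical.

```python
import importlib.util
import shutil


def cuda_status():
    """Return (source, available) describing how CUDA was checked."""
    if importlib.util.find_spec("torch") is not None:
        import torch  # assumed to be the deep learning library in the environment

        return "torch", bool(torch.cuda.is_available())
    # Fallback: presence of nvidia-smi is only a weak hint that a GPU is usable.
    return "nvidia-smi", shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    source, available = cuda_status()
    print(f"CUDA available according to {source}: {available}")
```

Run it inside the activated environment on the node where the job will execute, because login nodes on a cluster often have no GPU.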

Training

The configuration files and scripts for training Otari are under the train directory. To train the Otari model, you will need a GPU (we ran training on a single NVIDIA A100 GPU for ~8 hours).

The training data is available here and is downloaded and extracted as part of the resources directory.

cd ./train
sh ./download_data.sh  # in the train directory

The Otari training configuration YAML file is provided as the train/configs.yml file. Please update the dataset location (same as <output_dir> below) in train/configs.yml, as well as any other hyperparameters that you would like to modify for training.

We provide an example SLURM script train/train.sh for submitting a training job to a cluster. To preprocess the data and train the model from scratch, run the following scripts in order:

sbatch preprocess/preprocess_data.sh resources/ESPRESSO_isoform_data.tsv.gz 'espresso' <output_dir>
sh train/train.sh

You can use the same conda environment to train Otari.

Help

Please post in the GitHub issues or e-mail Aviya Litman ([email protected]) with any questions.
