GitHub - FunctionLab/otari: code to run Otari and obtain Otari predictions

Welcome to the Otari framework repository! Otari is a comprehensive and interpretable graph-based framework of transcript isoform regulation, powering the characterization of transcriptomic diversity and isoform-level variant effects at scale.

This repository can be used to run the Otari model and get the Otari regulatory profiles, isoform abundance predictions, and variant effect predictions for input sequences or variants.

We also provide information and instructions for how to train the Otari graph neural network model.

Requirements

To set up an Otari environment, first pull the repository:

git clone https://github.com/FunctionLab/otari.git

In a terminal, navigate to the root of the repository. Next, create an Otari conda environment using python3:

conda env create -n otari -f requirements.yml

Activate the environment:

conda activate otari

Setup

Before proceeding, download and extract the resources subdirectory into the root directory of Otari. This subdirectory contains the Otari model weights; the ConvSplice, Sei, and Seqweaver model weights; hg38 FASTA files; GENCODE annotations; pickle files; node sequence attributes; transcript datasets; and more:

sh ./download_data.sh

Variant effect prediction

The following script can be used to obtain Otari variant effects at the isoform level (it must be run on a GPU node): variant_effect_prediction.py (and the corresponding bash script, variant_effect_prediction.sh). It accepts a .tsv or .vcf variant file as input and makes variant effect predictions.

Example usage:

sh variant_effect_prediction.sh <input-file> <output-dir> --annotate --visualize

Arguments:

<input-file>: .tsv or .vcf input file with variants. A .tsv file must be tab-separated, with columns in the order chr, pos, ref, alt.
<output-dir>: Path to the output directory (will be created if it does not exist).
--annotate: boolean, True or False (default is True). Set it to False only if the variants are already annotated with genes and strands (make sure the genes column is named genes).
--visualize: boolean, True or False (default is False). Visualizes tissue-specific variant effects, transcript splice structures, and the most affected nodes. The .png files are saved to <output-dir>/figures.
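Because a malformed input file is an easy way to waste a GPU allocation, it can help to sanity-check a .tsv before submitting it. The following is a minimal sketch, not part of Otari: the function name check_variant_tsv is hypothetical, and it assumes that alleles are written with the letters A, C, G, T, or N and that a non-numeric pos on the first line indicates a header row. Whether Otari itself expects or tolerates a header is not stated here.

```python
import csv

# Assumption: alleles are plain nucleotide strings.
VALID_BASES = set("ACGTN")


def check_variant_tsv(path):
    """Return a list of (line_number, message) problems found in a variant .tsv file.

    Checks only the layout described above: chr, pos, ref, alt, tab-separated.
    """
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) < 4:
                problems.append(
                    (line_no, f"expected 4 tab-separated fields, got {len(row)}")
                )
                continue
            chrom, pos, ref, alt = row[:4]
            if line_no == 1 and not pos.isdigit():
                continue  # assumption: treat this first line as a header row
            if not pos.isdigit() or int(pos) < 1:
                problems.append((line_no, f"pos is not a positive integer: {pos!r}"))
            for label, allele in (("ref", ref), ("alt", alt)):
                if not allele or set(allele.upper()) - VALID_BASES:
                    problems.append((line_no, f"unexpected {label} allele: {allele!r}"))
    return problems
```

An empty list means no layout problems were found; it does not guarantee that the coordinates are valid hg38 positions.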

Expected outputs:

variant_effects_comprehensive.tsv: variant effect prediction for every isoform and tissue. Includes max_effect and mean_effect (absolute effects) across tissues.
interpretability_analysis.tsv: interpretability metrics including most impacted node and features.
variant_to_most_affected_node_embedding.pkl: node sequence attributes for the most impacted node for each variant and transcript.
figures/: directory containing plots of variant effects and transcript structures.
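As a rough illustration of working with these outputs, the sketch below ranks the rows of variant_effects_comprehensive.tsv by max_effect. It is not part of Otari: the helper name top_effects is hypothetical, and it assumes the file is tab-separated with a header row and a column literally named max_effect. No other column names are assumed, so each row is returned as a plain dictionary.

```python
import csv
import math


def top_effects(tsv_path, n=10, column="max_effect"):
    """Return the n rows with the largest numeric value in `column`, as dicts."""
    with open(tsv_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))

    scored = []
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where the value is missing or non-numeric
        if math.isnan(value):
            continue  # NaN cannot be ranked meaningfully
        scored.append((value, row))

    # Largest effects first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:n]]
```

For the example run described in the next section, a call such as top_effects("./test_outputs/variant_effects_comprehensive.tsv", n=5) would return the five rows with the largest max_effect, provided the assumptions above hold.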

Example variant effect prediction run

We provide test.tsv and test.vcf (hg38 coordinates) as examples so you can try running Otari VEP once you have installed all the requirements.

Example command run on a GPU node:

sh variant_effect_prediction.sh test.vcf ./test_outputs --annotate --visualize

Alternatively, submit as a job to slurm (make sure CUDA is available):

sbatch variant_effect_prediction.sh test.vcf ./test_outputs --annotate --visualize

The log will be available in the logfiles subdirectory.
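Since both commands above require a GPU, a quick pre-flight check can save a failed run. The sketch below is an assumption-laden convenience, not part of Otari: the function name gpu_visible is hypothetical, and it only asks the NVIDIA driver tool nvidia-smi whether any GPU is listed. It does not confirm that the deep learning framework inside the otari environment can actually use CUDA.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # NVIDIA driver tools are not installed or not on PATH
    try:
        # `nvidia-smi -L` prints one line per GPU, each starting with "GPU <index>:".
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout
```

On a login node this will usually return False, which is a hint to request a GPU node or submit through sbatch instead.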

Training

The configuration files and scripts for training Otari are under the train directory. To run Otari model training, you will need GPU computing capability (we ran training on a single NVIDIA A100 GPU for ~8 hours).

The training data is available here and is downloaded and extracted as part of the resources directory.

cd ./train
sh ./download_data.sh  # in the train directory

The Otari training configuration is provided as a YAML file, train/configs.yml. Update the dataset location in that file (the same path as <output_dir> below), along with any other hyperparameters that you would like to modify for training.

We provide an example SLURM script train/train.sh for submitting a training job to a cluster. To preprocess the data and train the model from scratch, run the following scripts in order:

sbatch preprocess/preprocess_data.sh resources/ESPRESSO_isoform_data.tsv.gz 'espresso' <output_dir>
sh train/train.sh

You can use the same conda environment to train Otari.

Help

Please post in the GitHub issues or e-mail Aviya Litman ([email protected]) with any questions.
