TorkamaniLab/CAD_meta_prediction

Meta-Prediction of Coronary Artery Disease Risk

DOI | Python version | MIT license | Open Source | GitHub latest commit

Description

This repository provides the core machine learning codebase used in our Nature Medicine publication:
“Meta-prediction of coronary artery disease risk”
🔗 https://www.nature.com/articles/s41591-025-03648-0

Our study introduces a meta-prediction framework that integrates genetic and non-genetic factors (unmodifiable and modifiable risk profiles) into a unified prediction system for 10-year incident coronary artery disease (CAD). This repository shares key components of the ML pipeline.

📌 Cite us

Chen SF, et al. Meta-prediction of coronary artery disease risk. Nature Medicine. 2025. DOI: 10.1038/s41591-025-03648-0.

📎 BibTeX
@article{chen2025metapred,
  title={Meta-prediction of coronary artery disease risk},
  author={Chen, Shang-Fu and Lee, Sang Eun and Sadaei, Hossein Javedani and Park, Jun-Bean and Khattab, Ahmed and Chen, Jei-Fu and Henegar, Corneliu and Wineinger, Nathan E. and Muse, Evan D. and Torkamani, Ali},
  journal={Nature Medicine},
  year={2025},
  month={Apr},
  publisher={Nature Portfolio},
  doi={10.1038/s41591-025-03648-0},
  url={https://www.nature.com/articles/s41591-025-03648-0}
}

🧬 Study overview

[Figure: CAD_meta_prediction study overview]

👨‍💻 Insider trivia: The silhouette featured in Figure 1b isn’t just any figure; it’s based on Shaun (Chen SF), the first author and lead developer of this codebase. Fitting, since the meta-prediction includes both features about the individual… and of the individual. 😉

  • We developed a meta-prediction framework that combines unmodifiable and modifiable factors to predict 10-year risk of CAD.
  • The UK Biobank dataset was partitioned into two cohorts, each serving a distinct role in the pipeline:
    • A prevalent CAD cohort, used to train baseline models for predicting biomarker levels and diagnostic categories.
    • An incident CAD cohort, used to build the final CAD risk model based on meta-features derived from baseline model outputs.
  • The framework generated 296 meta-features from ~2,000 variables, including clinical biomarkers, diagnostic categories, and >1,000 polygenic risk scores (PRSs).
  • The final model used 50 selected features (13 measured variables, 22 PRSs, and 15 meta-features), achieving an AUROC of 0.84 in UK Biobank and 0.81 in All of Us.
  • The framework supports individualized intervention simulation and identifies subgroups with differential benefit, offering new opportunities for precision prevention.
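
The two-cohort design above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the study's implementation: the cohorts, the single biomarker and diagnosis targets, the helper `make_cohort`, and the use of scikit-learn gradient boosting in place of XGBoost, LightGBM, or CatBoost are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)


def make_cohort(n):
    """Hypothetical synthetic cohort: features, one biomarker, one diagnosis, one outcome."""
    X = rng.normal(size=(n, 6))
    biomarker = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
    diagnosis = (X[:, 2] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    logit = 0.8 * biomarker + 0.8 * diagnosis + 0.5 * X[:, 3] - 0.5
    outcome = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    return X, biomarker, diagnosis, outcome


# Cohort A plays the role of the baseline-model cohort,
# cohort B the role of the final-model cohort.
X_a, bio_a, dx_a, _ = make_cohort(1500)
X_b, _, _, y_b = make_cohort(1500)

# Stage 1: baseline models learn to predict a biomarker level and a diagnosis.
biomarker_model = GradientBoostingRegressor(random_state=0).fit(X_a, bio_a)
diagnosis_model = GradientBoostingClassifier(random_state=0).fit(X_a, dx_a)

# Stage 2: baseline-model outputs on cohort B become meta-features.
meta_features = np.column_stack([
    biomarker_model.predict(X_b),
    diagnosis_model.predict_proba(X_b)[:, 1],
])
X_final = np.hstack([X_b, meta_features])

# The final risk model is trained on measured features plus meta-features,
# then scored on rows it has not seen.
train, test = slice(0, 1000), slice(1000, None)
final_model = GradientBoostingClassifier(random_state=0).fit(X_final[train], y_b[train])
auc = roc_auc_score(y_b[test], final_model.predict_proba(X_final[test])[:, 1])
print(f"held-out AUROC on synthetic data: {auc:.2f}")
```

Keeping the baseline models and the final model on separate cohorts means the meta-features are predictions for individuals the baseline models never saw during training.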

🧰 Assets

This codebase includes components for training individual prediction models, such as CAD diagnoses and biomarker estimations. The same pipeline is applied across multiple prediction tasks to generate meta-features, which feed into our final CAD risk model. The pipeline was also used to train the final model itself.

Key features include:

  • Compatible with the tree-based ML models we tested: XGBoost, LightGBM, CatBoost
  • Custom utilities:
    • zoish: SHAP-based feature importance wrapper built on fasttreeshap
    • lohrasb: Optuna-based hyperparameter tuner (TPE + Hyperband)

⚙️ Environment Configuration

This project requires Python >=3.10 and uses Poetry for dependency management. If you prefer pip, a requirements.txt is also provided.

Option 1: Using Poetry

poetry install

Option 2: Using pip

pip install -r requirements.txt

Dependencies

Main runtime dependencies:

catboost==1.2.5
category-encoders==2.6.3
fasttreeshap==0.1.6
lightgbm==4.5.0
lohrasb==4.2.0
matplotlib==3.8.4
numpy==1.21.6
optuna-integration==3.6.0
pandas>=1.3.5
ray==2.7.1
scikit-learn==1.0.2
seaborn
shap==0.42.1
tune-sklearn==0.5.0
xgboost==1.7.5
zoish==5.0.4

🚀 Usage

You may either run the provided notebook or invoke the command-line interface:

Option 1: Run the tutorial notebook

A reproducible example is provided in Tutorial_Notebook.ipynb, demonstrating model training and SHAP analysis.

Option 2: Run the pipeline via CLI

The CLI expects a .pkl file containing a preprocessed pandas.DataFrame with appropriate data types. It must include:

  • A target column matching --y_label
  • An ID column matching --id_col
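
As a rough sketch of the input requirements above, the snippet below builds a small typed DataFrame, checks that the target and ID columns are present, and writes it to a pickle. The helper `check_input_frame`, the column names, and the toy values are hypothetical and only mirror the `--y_label` and `--id_col` arguments; the actual preprocessing lives in data/make_cardio_train_pickle.py.

```python
import os
import tempfile

import pandas as pd


def check_input_frame(df, y_label, id_col):
    """Hypothetical helper: fail early if a column the CLI needs is absent."""
    missing = [col for col in (y_label, id_col) if col not in df.columns]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    return df


# Toy data with explicit dtypes; real inputs would carry many more features.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [52.0, 61.0, 47.0, 58.0],
    "smoker": pd.Categorical(["no", "yes", "no", "yes"]),
    "cardio": [0, 1, 0, 1],
}).astype({"id": "int64", "cardio": "int8"})

check_input_frame(df, y_label="cardio", id_col="id")

# Pickle preserves dtypes (including categoricals), unlike a plain CSV round trip.
path = os.path.join(tempfile.mkdtemp(), "toy_input.pkl")
df.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.dtypes)
```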

Sample Data

This project uses the publicly available Cardiovascular Disease dataset as an example. The input file data/typed_cardio_train.pkl is generated by data/make_cardio_train_pickle.py and tracked with Git LFS.

To explore all available options and their default values, run:

python -u meta_prediction_estimator.py -h

Example command:

python -u meta_prediction_estimator.py \
  --y_label "cardio" \
  --input_pickle_fp "data/typed_cardio_train.pkl" \
  --id_col "id" \
  --pkg "xgb" \
  --estimator_type "classifier" \
  --n_features 5 \
  --n_trials 100

The pipeline will produce:

  1. A trained pipeline object (including the best estimator)
    final_pipeline__xgb_classifier__cardio.joblib

  2. A SHAP-based feature importance file (mean absolute SHAP values per feature)
    shap__xgb_classifier__cardio__preselect.tsv

See expected_output/ for example outputs.
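
To make "mean absolute SHAP values per feature" concrete, here is a small sketch of the calculation on a hand-written matrix. The SHAP values, feature names, and output column names are invented for illustration; the real TSV produced by the pipeline may use a different layout.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical SHAP matrix: one row per sample, one column per feature.
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_values = np.array([
    [0.20, -0.10, 0.00],
    [-0.40, 0.30, 0.05],
    [0.30, -0.20, -0.05],
    [-0.10, 0.20, 0.10],
])

# Importance of a feature = mean of |SHAP value| over all samples.
mean_abs = np.abs(shap_values).mean(axis=0)

importance = (
    pd.DataFrame({"feature": feature_names, "mean_abs_shap": mean_abs})
    .sort_values("mean_abs_shap", ascending=False)
    .reset_index(drop=True)
)

# Serialize as tab-separated text, the same file type as the pipeline output.
buffer = io.StringIO()
importance.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```

Taking the absolute value before averaging matters: a feature that pushes predictions up for some samples and down for others would otherwise average out toward zero despite being influential.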
