
# PROBE: Protein RepresentatiOn BEnchmark

*Function-centric evaluation of protein representation models*

*(Figure: PROBE overview)*

## Overview

PROBE evaluates how well fixed-length protein embeddings capture functional biological knowledge. It provides four complementary benchmarks:

| Benchmark | Biological Question | Key Metric(s) | Dataset & Split |
|---|---|---|---|
| Semantic Similarity Inference | Do embeddings place functionally similar proteins closer together? | Spearman ρ (embedding distance vs. Resnik GO similarity) | 500, 200, and Sparse protein sets |
| GO-Term Function Prediction | Can GO terms be predicted from embeddings? | F-max, AUROC, AUPR (5-fold CV) | Swiss-Prot Human; splits by size (High / Middle / Low); MF, BP, CC |
| Drug-Target Family Classification | Can embeddings classify proteins into drug-target families? | MCC, Accuracy, Macro-F1 (10-fold CV) | Random and identity-based splits (nc, uc50, uc30, mm15) |
| Protein–Protein Binding Affinity | Can embeddings predict ΔΔG of interface mutants? | MSE, MAE, Pearson r (10-fold CV) | SKEMPI v1 (3,047 mutants) |

Introduced in Unsal et al. 2022, DOI: 10.1038/s42256-022-00457-9
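
The semantic similarity benchmark boils down to a rank correlation between pairwise embedding distances and pairwise GO similarities. The following is a minimal sketch of that computation using random stand-in data, not PROBE's actual code or inputs:

```python
# Minimal sketch of the semantic-similarity metric: Spearman correlation
# between pairwise embedding distances and Resnik GO similarity scores.
# All data here is random stand-in data; PROBE's implementation differs in detail.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 128))         # 50 proteins, 128-dim vectors
go_similarity = rng.uniform(size=50 * 49 // 2)  # stand-in for pairwise Resnik scores

# Cosine distance between every protein pair (condensed form, same order as above).
emb_distance = pdist(embeddings, metric="cosine")

# Functionally similar proteins should be close in embedding space, so distance
# should anti-correlate with GO similarity; we report rho against -distance.
rho, _ = spearmanr(-emb_distance, go_similarity)
print(f"Spearman rho: {rho:.3f}")
```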


## Getting Started

### Option 1: Use the Web Server (Recommended)

Run benchmarks directly in the browser:

▶️ PROBE Leaderboard Space

Upload your embedding vectors, select tasks, and get results. Interactive leaderboards and plots are available.

Described in Çevrim et al. 2025, DOI: 10.1101/2025.04.10.648084


### Option 2: Run Locally

#### System Requirements

- OS: Ubuntu 20.04 or compatible
- Python: ≥ 3.9
- RAM: 8 GB minimum (16 GB+ recommended)

#### Installation

```bash
# Clone the repository
git clone https://github.com/HUBioDataLab/PROBE.git
cd PROBE

# (Optional) Set up a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

#### Setup Files

**1. Download Benchmark Datasets**

```bash
mkdir -p data
# Quote the URL so the shell does not treat '&' as a command separator
curl -L -o datasets.zip \
  "https://drive.google.com/uc?export=download&id=1elGfjI4jwzcjOBT6LoMnz-7DtdwPFV6i"
unzip datasets.zip -d data
```

This creates:

- `data/auxilary_input/`
- `data/preprocess/`

**2. Obtain Representation Vectors**

Either use existing embedding files or generate your own (see "Preparing New Embedding Files" below), then place them all under `data/representation_vectors/`.

**3. Configure the Benchmark Run**

Edit `probe_config.yaml`:

```yaml
representation_name: MyModel         # prefix for output files

benchmark: all                       # similarity | function | family | affinity | all

representation_file_human: ../data/representation_vectors/mymodel_uniprot_human.csv
representation_file_affinity: ../data/representation_vectors/mymodel_skempi.csv

similarity_tasks: ["Sparse", "200", "500"]
function_prediction_aspect: All_Aspects
function_prediction_dataset: All_Data_Sets
family_prediction_dataset: ["nc", "uc50", "uc30", "mm15"]

detailed_output: False
```

Set `detailed_output: True` to additionally save pickled models, raw predictions, confusion matrices, and per-fold score arrays.
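
Note that the representation file paths in the config are relative to the `bin/` directory (hence the `../` prefix). Before starting a long run, a quick pre-flight check can save time; the script below is a hypothetical helper, not part of PROBE, and assumes the config keys shown above:

```python
# Hypothetical pre-flight check (not part of PROBE): confirm the embedding
# file paths in probe_config.yaml resolve before starting a long benchmark run.
# Run it from the bin/ directory, since the config paths are relative to it.
import os
import yaml  # PyYAML

with open("probe_config.yaml") as fh:
    config = yaml.safe_load(fh)

for key in ("representation_file_human", "representation_file_affinity"):
    path = config.get(key)
    if path and not os.path.isfile(path):
        raise FileNotFoundError(f"{key} points to a missing file: {path}")

print(f"Config OK for representation '{config['representation_name']}'")
```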


**4. Run the Benchmark**

```bash
cd bin
python PROBE.py
```

Results are saved in the `results/` folder.


## Preparing New Embedding Files

1. Generate embeddings for:
   - Swiss-Prot human canonical proteins (FASTA)
   - SKEMPI v1 complexes (FASTA)
2. Format them as CSV (see the sketch after this list):
   - Column 0 header: `Entry` (UniProt accession or SKEMPI ID)
   - Remaining column headers: integer feature indices (0, 1, 2, …)
   - One row per sequence; all rows must have the same vector length
3. Save the files to `data/representation_vectors/` and reference their paths in `probe_config.yaml`.
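
As a concrete illustration of this format, the sketch below writes a compliant embedding CSV with pandas. The accessions, dimensionality, and output filename are placeholder examples; substitute your own model's embeddings for the random vectors:

```python
# Sketch: write an embedding CSV in the layout PROBE expects.
# Random vectors stand in for real model embeddings.
from pathlib import Path

import numpy as np
import pandas as pd

accessions = ["P69905", "P68871", "P02768"]  # example UniProt accessions
dim = 1024                                   # example embedding dimensionality

rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(accessions), dim))  # stand-in for real embeddings

df = pd.DataFrame(vectors, columns=range(dim))  # headers are 0, 1, ..., dim-1
df.insert(0, "Entry", accessions)               # column 0 header must be 'Entry'

out_dir = Path("data/representation_vectors")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_csv(out_dir / "mymodel_uniprot_human.csv", index=False)
```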


## Output Files

| Benchmark | Filename Pattern | Contents |
|---|---|---|
| Semantic Similarity | `Semantic_sim_inference_<matrix>_<rep>.csv` | Spearman correlations for the 500, 200, and Sparse sets |
| Function Prediction | `Ontology_based_function_prediction_5cv_mean_<rep>.tsv`, `..._std_<rep>.tsv` | Mean ± SD of F-max, AUROC, AUPR |
| Family Classification | `Drug_target_protein_family_classification_mean_results_<split>_<rep>.csv`, `..._class_based_results_...csv` | Overall and per-family metrics |
| Binding Affinity | `Affinit_prediction_skempiv1_<rep>.csv`, `..._detail.csv` | MSE, MAE, Pearson r, per fold |
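
The result files are plain CSV/TSV and load directly with pandas. The snippet below assumes a run with `representation_name: MyModel` and globs over the semantic-similarity pattern above, since the concrete `<matrix>` value depends on your run:

```python
# Sketch: inspect PROBE result files for a run named 'MyModel'.
# Filenames follow the patterns in the table above; adjust to your run.
from pathlib import Path

import pandas as pd

results_dir = Path("results")
for csv_path in sorted(results_dir.glob("Semantic_sim_inference_*_MyModel.csv")):
    scores = pd.read_csv(csv_path)
    print(csv_path.name)
    print(scores.head())
```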

## Acknowledgements

- Benchmark: Süleyman Unsal, Hakan Atas, Mehmet Albayrak, Kadir Turhan, Ahmet Cem Acar, Tunca Doğan
- Web Platform: Elif Çevrim, Melih Gökay Yiğit, Erva Ulusoy, Ardan Yılmaz, Tunca Doğan

## License

Released under the GNU General Public License v3.0.


## Citation

Unsal, S., Atas, H., Albayrak, M., Turhan, K., Acar, A. C., & Doğan, T. (2022). Learning functional properties of proteins with language models. Nature Machine Intelligence, 4, 227–245. https://doi.org/10.1038/s42256-022-00457-9

Çevrim, E., Yiğit, M. G., Ulusoy, E., Yılmaz, A., & Doğan, T. (2025). A benchmarking platform for assessing protein language models on function-related prediction tasks. In Protein Function Prediction: Methods and Protocols (pp. 241–268). Springer US.
