Trustworthy-ML-Lab

ThinkEdit Public
An effective weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.

Python 14 1 0 0 Updated Aug 20, 2025
Concept-Bottleneck-LLM Public

Python 5 0 0 0 Updated Aug 15, 2025
Robust_HighUtil_Smoothed_DRL Public
[ICML 24] S-DQN and S-PPO: Robust smoothed deep RL agents without sacrificing performance

Python 5 0 0 0 Updated Aug 15, 2025
CB-LLMs Public
[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.

Python 24 5 0 0 Updated Aug 15, 2025
Neuron_Eval Public
[ICML 25] A unified mathematical framework to evaluate neuron explanations of deep learning models with sanity tests

Jupyter Notebook 6 0 0 0 Updated Jul 1, 2025
efficient_neuron_eval Public

1 0 0 0 Updated Jun 10, 2025
VLG-CBM Public
[NeurIPS 24] A new training and evaluation framework for learning interpretable deep vision models and benchmarking different interpretable concept bottleneck models (CBMs)

Jupyter Notebook 21 2 1 0 Updated Jun 5, 2025
posthoc-generative-cbm Public
[CVPR 2025] Concept Bottleneck Autoencoder (CB-AE) -- efficiently transform any pretrained (black-box) image generative model into an interpretable generative concept bottleneck model (CBM) with minimal concept supervision, while preserving image quality

Jupyter Notebook 14 1 1 0 Updated Jun 4, 2025
Linear-Explanations Public
[ICML 24] A novel automated neuron explanation framework that can accurately describe polysemantic concepts in deep neural networks

Jupyter Notebook 13 0 0 0 Updated May 2, 2025
effective_skill_unlearning Public
[NAACL 25] Two novel, lightweight, and training-free skill unlearning methods for LLMs

Python 4 0 0 0 Updated Mar 27, 2025
