RFC-0042-aecf-multimodal-fusion.md #76
Conversation
… Multimodal Learning

This RFC proposes adding AECF as a standard multimodal fusion layer in PyTorch.

Key features:
- Adaptive entropy-driven curriculum masking for robust multimodal learning
- Drop-in replacement for existing fusion approaches
- Built-in robustness to missing modalities at inference time
- Superior calibration properties with an 18pp mAP improvement
- Minimal runtime overhead (<3%)

The implementation includes:
- torch.nn.CurriculumMasking for entropy-based adaptive masking
- torch.nn.MultimodalAttentionPool for attention-based multimodal fusion
- Factory functions and functional interfaces for ease of use

Based on "Robust Multimodal Learning via Entropy-Gated Contrastive Fusion" (Chlon et al., 2025) - https://arxiv.org/abs/2505.15417
This document provides:
- A high-level explanation of what AECF does and why it matters
- Technical implementation details and architecture
- Experimental results and validation
- An integration plan for PyTorch core
- An overview of the comprehensive test coverage

It serves as supplementary material to the main RFC document.
This commit adds:
- Complete working implementation (5,337 lines of Python code)
- Comprehensive test suite (765 lines of unit tests)
- Real-world MS-COCO benchmarking experiments
- Performance validation showing a +18pp mAP improvement
- Production-ready features (gradient checkpointing, numerical stability)
- Multiple fusion layer comparisons and architectures

The reference implementation demonstrates:
✅ Drop-in compatibility with existing PyTorch code
✅ Superior performance under missing-modality scenarios
✅ Robust numerical stability under all tested conditions
✅ <3% runtime overhead compared to standard attention
✅ Easy integration with vision-language, medical, and robotics models

Reviewers can test the implementation immediately:
- `cd reference-implementation/`
- `pip install -r requirements.txt`
- `python -m pytest test_suite/ -v`
- `python -m aecf.coco_tests.test_organized`

This strengthens the RFC proposal by providing concrete evidence of AECF's benefits and demonstrating implementation feasibility.
Added a complete submission guide with step-by-step instructions for submitting the AECF RFC to PyTorch. The RFC is now complete with:
✅ A 20KB+ comprehensive RFC document following the PyTorch template
✅ 5,337 lines of reference implementation code
✅ 765 lines of comprehensive unit tests
✅ Real-world MS-COCO benchmarking experiments
✅ Performance validation showing a +18pp mAP improvement
✅ Production-ready optimizations and numerical stability
✅ Complete documentation and usage examples

Ready for submission to the pytorch/rfcs repository.
Hi @leochlon! Thank you for your pull request and welcome to our community.

Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA Signed`. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Hey @leochlon, thanks for the request! Note that we maintain a very high bar for inclusion of new modules within core PyTorch. From what I can tell, this is a new technique (https://arxiv.org/html/2505.15417v1) that needs time to establish user acceptance. I'd encourage you to maintain an implementation of this technique in a separate GitHub repo to make it available for users. We can leave this issue open to gauge user interest over time and revisit this in the future if the technique becomes ubiquitous. Please let us know if there is some technical reason why it is not possible to maintain this in a separate repo, so we can evaluate the extension mechanisms we provide within PyTorch. I'll also tag @NicolasHug from torchvision here in case he has any thoughts.
Summary
We propose adding Adaptive Entropy-Gated Contrastive Fusion (AECF) to PyTorch as a core multimodal fusion layer that addresses a critical production problem: missing modalities in real-world deployments.
The Problem
Current multimodal models fail catastrophically when sensors break, data is incomplete, or modalities are unavailable at inference time. This is a major barrier to deploying multimodal AI in production environments.
The Solution
AECF uses entropy-driven curriculum learning to train models that are robust to missing modalities: during training, modalities are adaptively masked based on the entropy of the fusion attention weights, so the network learns to make well-calibrated predictions even when some inputs are absent at inference time (a rough sketch follows below).
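As a rough illustration only (my own sketch under assumed details, not the code proposed in the RFC): measure the Shannon entropy of the fusion attention weights over modalities, and mask more aggressively for samples whose attention collapses onto a single modality.

```python
import torch

def entropy_gated_mask_rate(attn_weights: torch.Tensor, base_rate: float = 0.3) -> torch.Tensor:
    """Illustrative entropy-driven masking schedule (assumed behavior, not the RFC implementation).

    attn_weights: (batch, num_modalities) attention over modalities, rows summing to 1.
    Returns a per-sample masking probability: low entropy (over-reliance on one
    modality) yields more masking; high entropy (balanced usage) yields less.
    """
    eps = 1e-8
    entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    max_entropy = torch.log(torch.tensor(float(attn_weights.shape[-1])))
    return base_rate * (1.0 - entropy / max_entropy)

# Toy example: the first sample over-relies on one modality, the second is balanced.
attn = torch.tensor([[0.90, 0.05, 0.05],
                     [0.34, 0.33, 0.33]])
rates = entropy_gated_mask_rate(attn)
drop = torch.rand_like(attn) < rates.unsqueeze(-1)  # True = drop that modality token this step
print(rates)  # the unbalanced sample gets the higher masking rate
```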
Key Results
- +18pp mAP improvement under missing-modality scenarios (MS-COCO benchmarks)
- <3% runtime overhead compared to standard attention
- Superior calibration properties
Implementation
A complete reference implementation (5,337 lines of production-ready code), comprehensive tests, and MS-COCO benchmarks are included with the RFC.
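To make the drop-in claim concrete, below is a minimal, hypothetical sketch of attention-based fusion over modality tokens built from existing PyTorch primitives (`nn.MultiheadAttention`); the class name, shapes, and use of a learned query are my assumptions for illustration and do not reproduce the reference implementation or the proposed `torch.nn.MultimodalAttentionPool` API.

```python
import torch
import torch.nn as nn

class FusionPoolSketch(nn.Module):
    """Illustrative attention pooling over per-modality embeddings (assumption, not the RFC API)."""

    def __init__(self, embed_dim: int, num_heads: int = 4):
        super().__init__()
        # A single learned query attends over the available modality tokens.
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, tokens, missing_mask=None):
        # tokens: (batch, num_modalities, embed_dim); missing_mask: True where a modality is absent.
        query = self.query.expand(tokens.shape[0], -1, -1)
        fused, _ = self.attn(query, tokens, tokens, key_padding_mask=missing_mask)
        return fused.squeeze(1)  # (batch, embed_dim)

# Toy usage: fuse image/text/audio embeddings with the audio modality missing.
tokens = torch.stack([torch.randn(4, 256) for _ in range(3)], dim=1)  # (4, 3, 256)
missing = torch.tensor([[False, False, True]] * 4)                    # audio absent for all samples
fused = FusionPoolSketch(embed_dim=256)(tokens, missing)
print(fused.shape)  # torch.Size([4, 256])
```

Because the missing modality is masked out of the attention rather than zero-filled into a fixed-width concatenation, the same module handles any subset of available inputs at inference time.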
Why This Matters
Multimodal AI is rapidly expanding (vision-language models, robotics, autonomous vehicles), but robustness to missing modalities remains an unsolved problem. AECF provides a principled, efficient solution that PyTorch users need today.
Request: Please route to multimodal/vision experts for technical review.