RFC-0042-aecf-multimodal-fusion.md #76

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · wants to merge 5 commits into master

Conversation

@leochlon

Summary

We propose adding Adaptive Entropy-Gated Contrastive Fusion (AECF) to PyTorch as a core multimodal fusion layer that addresses a critical production problem: missing modalities in real-world deployments.

The Problem

Current multimodal models fail catastrophically when sensors break, data is incomplete, or modalities are unavailable at inference time. This is a major barrier to deploying multimodal AI in production environments.

The Solution

AECF uses entropy-driven curriculum learning to train models that are robust to missing modalities (see the sketch after this list):

  • High attention entropy → Less masking → Easier learning
  • Low attention entropy → More masking → Robustness training
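
A minimal sketch of this gating rule, assuming per-sample attention weights over modalities; the function name, the linear entropy-to-mask-rate mapping, and the `base_rate` parameter are illustrative assumptions, not the RFC's exact formulation:

```python
import torch

def curriculum_mask(attn: torch.Tensor, base_rate: float = 0.3) -> torch.Tensor:
    """Entropy-gated masking sketch (illustrative, not the RFC's exact rule).

    attn: (batch, num_modalities) attention weights over modalities; rows sum to 1.
    Returns a (batch, num_modalities) binary keep-mask: low-entropy (over-confident)
    samples are masked more aggressively, high-entropy samples less.
    """
    eps = 1e-8
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)            # (batch,)
    max_entropy = torch.log(torch.tensor(float(attn.size(-1))))   # entropy of uniform dist
    norm_entropy = (entropy / max_entropy.clamp_min(eps)).clamp(0.0, 1.0)
    mask_rate = base_rate * (1.0 - norm_entropy)                  # low entropy -> more masking
    keep_prob = (1.0 - mask_rate).unsqueeze(-1).expand_as(attn)
    return torch.bernoulli(keep_prob)
```

During training, such a mask would drop modality tokens before fusion, so over-confident samples are forced to cope with missing inputs.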

Key Results

  • +18 percentage-point mAP improvement when modalities are missing
  • 2× reduction in calibration error
  • Only 1% runtime overhead
  • Drop-in replacement for existing fusion layers

Implementation

A complete reference implementation (5,337 lines of production-ready code), comprehensive tests, and MS-COCO benchmarks are included in the RFC.

Why This Matters

Multimodal AI is rapidly expanding (vision-language models, robotics, autonomous vehicles), but robustness to missing modalities remains an unsolved problem. AECF provides a principled, efficient solution that PyTorch users need today.

Request: Please route to multimodal/vision experts for technical review.

… Multimodal Learning

This RFC proposes adding AECF as a standard multimodal fusion layer in PyTorch.

Key features:
- Adaptive entropy-driven curriculum masking for robust multimodal learning
- Drop-in replacement for existing fusion approaches
- Built-in robustness to missing modalities at inference time
- Improved calibration, alongside an 18 pp mAP gain under missing modalities
- Minimal runtime overhead (<3%)

The implementation includes (usage sketched below):
- torch.nn.CurriculumMasking for entropy-based adaptive masking
- torch.nn.MultimodalAttentionPool for attention-based multimodal fusion
- Factory functions and functional interfaces for ease of use
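
For orientation, usage of the proposed modules might look like the sketch below; the constructor arguments, tensor shapes, and call pattern are assumptions based on the feature list above, not a finalized PyTorch API:

```python
import torch
import torch.nn as nn

# Hypothetical usage of the proposed modules; argument names, shapes,
# and the call pattern are assumptions, not a finalized API.
masking = nn.CurriculumMasking(base_mask_rate=0.3)   # entropy-adaptive masking
pool = nn.MultimodalAttentionPool(embed_dim=512)     # attention-based fusion

# One embedding per modality: (batch, embed_dim)
vision, text, audio = (torch.randn(8, 512) for _ in range(3))

tokens = torch.stack([vision, text, audio], dim=1)   # (batch, num_modalities, dim)
fused = pool(masking(tokens))                        # (batch, dim) fused representation
```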

Based on 'Robust Multimodal Learning via Entropy-Gated Contrastive Fusion'
(Chlon et al., 2025) - https://arxiv.org/abs/2505.15417

This document provides:
- High-level explanation of what AECF does and why it matters
- Technical implementation details and architecture
- Experimental results and validation
- Integration plan for PyTorch core
- Comprehensive test coverage overview

Serves as supplementary material to the main RFC document.

This commit adds:
- Complete working implementation (5,337 lines of Python code)
- Comprehensive test suite (765 lines of unit tests)
- Real-world MS-COCO benchmarking experiments
- Performance validation showing +18pp mAP improvement
- Production-ready features (gradient checkpointing, numerical stability)
- Multiple fusion layer comparisons and architectures

The reference implementation demonstrates:
✅ Drop-in compatibility with existing PyTorch code
✅ Superior performance under missing modality scenarios
✅ Robust numerical stability under all tested conditions
✅ <3% runtime overhead compared to standard attention
✅ Easy integration with vision-language, medical, and robotics models

Reviewers can immediately test the implementation:
  cd reference-implementation/
  pip install -r requirements.txt
  python -m pytest test_suite/ -v
  python -m aecf.coco_tests.test_organized

This strengthens the RFC proposal by providing concrete evidence
of AECF's benefits and demonstrating implementation feasibility.

Added complete submission guide with step-by-step instructions for submitting the AECF RFC to PyTorch.

The RFC is now complete with:
✅ 20KB+ comprehensive RFC document following PyTorch template
✅ 5,337 lines of reference implementation code
✅ 765 lines of comprehensive unit tests
✅ Real-world MS-COCO benchmarking experiments
✅ Performance validation showing +18pp mAP improvement
✅ Production-ready optimizations and numerical stability
✅ Complete documentation and usage examples

Ready for submission to pytorch/rfcs repository
@facebook-github-bot
Contributor

Hi @leochlon!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@mikaylagawarecki

Hey @leochlon, thanks for the request! Note that we maintain a very high bar for inclusion of new modules within torch.nn, as each comes with a substantial maintenance cost on our end. In general, we will accept new modules if the underlying techniques have already achieved widespread adoption and there is a broad expectation that PyTorch will provide such a module. It's also beneficial if there are performance reasons why the module should be provided by PyTorch itself rather than in a third-party repo.

From what I can tell, this is a new technique (https://arxiv.org/html/2505.15417v1) that needs time to establish user acceptance. I'd encourage you to maintain an implementation of this technique in a separate GitHub repo to make it available to users. We can leave this issue open to gauge user interest over time and revisit it in the future if the technique becomes ubiquitous. Please let us know if there is some technical reason why it is not possible to maintain this in a separate repo so we can evaluate the extension mechanisms we provide within PyTorch.

I'll also tag @NicolasHug here from torchvision in case he has any thoughts.
