FSDP cpu offload hits CUDA error: invalid argument when pin_memory is True #157146

@Edenzzzz

🐛 Describe the bug

When CPU offload is enabled in FSDP, the pin_memory operation sometimes fails with CUDA error: invalid argument (observed on A40 but not on H100). See https://github.com/hao-ai-lab/FastVideo/actions/runs/15932900017/job/44946105934

Image

Looking at the line, it seems confusing that the parameter is moved to CPU and then pinned to GPU. AFAIK, pinning a tensor in the CPU's page-locked memory can accelerate transfers, but I don't know if it even makes sense to pin a CPU tensor on GPU.
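To make the pinning question concrete, here is a minimal sketch of what `Tensor.pin_memory()` does in plain PyTorch, outside FSDP. The tensor stays on the CPU; only its backing memory becomes page-locked, which allows asynchronous copies to the GPU. `describe_pinning` is a hypothetical helper, and the pinning branch only runs when CUDA is available. This does not reproduce the reported error.

```python
import torch


def describe_pinning(t: torch.Tensor) -> dict:
    """Hypothetical helper: report a tensor's device type and pinned status."""
    return {"device": t.device.type, "pinned": t.is_pinned()}


cpu_param = torch.randn(4, 4)  # ordinary pageable host memory
before = describe_pinning(cpu_param)

if torch.cuda.is_available():
    # pin_memory() returns a copy in page-locked host memory; it is still a CPU tensor.
    pinned = cpu_param.pin_memory()
    after = describe_pinning(pinned)
    # Pinned host memory is what lets this copy run asynchronously.
    gpu_copy = pinned.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
else:
    # Without CUDA there is nothing to pin for, so skip the pinning step.
    after = before
```

In either branch the tensor reports a `cpu` device afterwards, so "pinning for a GPU" here would mean choosing which device's pinned-memory allocator is used, not moving the data.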

Versions

It failed in a remote CI job, so it is hard to run this ourselves, but we used torch 2.7.1+cu128 on an A40 GPU.
