@tianyu-l tianyu-l commented Feb 17, 2024

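Just found out that the HF `datasets` library has its own [API](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/main_classes#datasets.distributed.split_dataset_by_node), `split_dataset_by_node`, for splitting data across DP ranks. Verified that it gives the expected behavior: ranks that differ only in their SP rank see the same data, while different DP ranks see different data.

Note: This is still a map-style dataset, which has to be loaded into memory. Setting `streaming=True` for [load_dataset](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/loading_methods#datasets.load_dataset) returns an `IterableDataset` whose data doesn't have to fit in memory, but data loading is significantly slower.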
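
A minimal sketch of how the split could be wired up, not the code from this PR. The helper names `dp_rank_of` and `shard_for_rank` are hypothetical, and the mesh layout (SP as the innermost dimension, so consecutive global ranks share a DP rank) is an assumption. Only `split_dataset_by_node(dataset, rank, world_size)` comes from the HF `datasets` library.

```python
def dp_rank_of(global_rank: int, sp_degree: int) -> int:
    """Map a global rank to its data-parallel rank.

    Assumption (not from the PR): SP is the innermost mesh dimension,
    so global ranks [0, sp_degree) form DP rank 0, the next sp_degree
    ranks form DP rank 1, and so on.
    """
    return global_rank // sp_degree


def shard_for_rank(dataset, global_rank: int, dp_degree: int, sp_degree: int):
    """Return the slice of `dataset` that this global rank should read.

    Ranks in the same SP group get the same DP rank and therefore the
    same shard; different DP ranks get different shards.
    """
    # Imported lazily so dp_rank_of stays usable without `datasets` installed.
    from datasets.distributed import split_dataset_by_node

    return split_dataset_by_node(
        dataset,
        rank=dp_rank_of(global_rank, sp_degree),
        world_size=dp_degree,
    )
```

With `dp_degree=2` and `sp_degree=2`, global ranks 0 and 1 would request shard 0 and global ranks 2 and 3 would request shard 1. The key point is passing the DP rank and DP degree, rather than the global rank and world size, so that SP peers do not receive different data. According to the HF documentation, the same function also accepts an `IterableDataset`, such as one returned by `load_dataset(..., streaming=True)`.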

tianyu-l added a commit that referenced this pull request Feb 17, 2024
ghstack-source-id: 489d666
Pull Request resolved: #65
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 17, 2024
@wanchaol wanchaol left a comment

lgtm

tianyu-l added a commit that referenced this pull request Feb 21, 2024
ghstack-source-id: e23d5e0
Pull Request resolved: #65
@tianyu-l tianyu-l merged commit 5ebe2e7 into gh/tianyu-l/1/base Feb 21, 2024
@tianyu-l tianyu-l deleted the gh/tianyu-l/1/head branch February 21, 2024 20:08
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
ghstack-source-id: e23d5e0
Pull Request resolved: #65
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: e23d5e0
Pull Request resolved: pytorch#65
payoto pushed a commit to graphcore-research/torchtitan-fork that referenced this pull request Feb 7, 2025
Adding FP8 version, removing multinode one.