support infinite loop over alpaca dataset #66

tianyu-l · 2024-02-22T21:41:51Z

Stack from ghstack (oldest at bottom):

-> support infinite loop over alpaca dataset #66

Previously, the alpaca dataset was exhausted after only ~50 iterations with 8 data parallel ranks and a batch size of 8. This PR adds an option (enabled by default) to loop over the dataset infinitely, which unblocks integrating other functionality. Note that loss-related metrics should be read with caution, since repeating the data will cause overfitting.
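A minimal sketch of the looping behavior described above, in plain Python. The class name `LoopingDataset` and its `samples` argument are hypothetical stand-ins for illustration; only the `infinite` flag and the `if not self.infinite: break` check come from this PR's diff.

```python
from itertools import islice
from typing import Iterator, List


class LoopingDataset:
    """Hypothetical stand-in for the dataset wrapper; not the PR's actual class."""

    def __init__(self, samples: List[str], infinite: bool = True) -> None:
        self.samples = samples
        self.infinite = infinite

    def __iter__(self) -> Iterator[str]:
        if not self.samples:
            # Guard added for this sketch: avoid spinning forever on empty data.
            return
        while True:
            for sample in self.samples:
                yield sample
            # One full pass is done; stop here unless looping was requested.
            if not self.infinite:
                break


# With infinite=True (the default), the consumer decides how much to read.
first_seven = list(islice(LoopingDataset(["a", "b", "c"]), 7))
single_pass = list(LoopingDataset(["a", "b", "c"], infinite=False))
```

With looping enabled, the same samples come back on every pass, which is why the loss numbers need to be read with the overfitting caveat in mind.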

Update: moved to #92 because migrating to pytorch/ confused ghstack.

[ghstack-poisoned]

ghstack-source-id: e9fa7fd Pull Request resolved: #66

XilunWu

One question on the stop condition; otherwise LGTM.

XilunWu · 2024-02-23T04:56:32Z

torchtrain/datasets/alpaca.py

+            if not self.infinite:
+                break


Should we add some mechanism to allow a stop? self.infinite is constant once it has been initialized.

I think cmd + c should be sufficient?
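One way to read this exchange: an infinite iterator does not need its own stop flag if the consumer decides when to stop. The sketch below assumes a training loop with a fixed step budget. The names `infinite_batches` and `run_steps` are hypothetical and do not appear in the PR.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def infinite_batches() -> Iterator[int]:
    """Hypothetical endless data source; yields 0, 1, 2, ... forever."""
    return count()


def run_steps(batches: Iterable[int], max_steps: int) -> List[int]:
    """Consume at most max_steps batches, then stop without touching the source."""
    seen: List[int] = []
    for batch in islice(batches, max_steps):
        seen.append(batch)  # a real loop would run one training step here
    return seen
```

Interrupting the process by hand, as suggested in the reply, covers the interactive case; a step budget like this covers unattended runs.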

wanchaol

sgtm!

wanchaol · 2024-02-23T18:55:36Z

torchtrain/datasets/alpaca.py

-                yield input, label
+                while len(all_tokens) >= max_buffer_token_len:
+                    x = torch.LongTensor(all_tokens[:max_buffer_token_len])
+                    # batched_x = x.reshape(self.batch_size, -1)


nit: can we delete the stale comment?
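For context, the loop in this hunk cuts fixed-size chunks out of a running token buffer. A rough sketch of that pattern follows, using plain lists instead of `torch.LongTensor` so it runs without PyTorch. The function name `chunk_tokens`, the `seq_len + 1` buffer size, and the one-token shift between input and label are assumptions for illustration; only `all_tokens` and `max_buffer_token_len` are names taken from the diff.

```python
from typing import Iterable, Iterator, List, Tuple


def chunk_tokens(
    token_stream: Iterable[List[int]], seq_len: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (input, label) pairs of length seq_len from a stream of token lists."""
    # Assumed: one extra token per chunk so the label can be shifted by one.
    max_buffer_token_len = seq_len + 1
    all_tokens: List[int] = []
    for tokens in token_stream:
        all_tokens.extend(tokens)
        while len(all_tokens) >= max_buffer_token_len:
            x = all_tokens[:max_buffer_token_len]
            # Drop the consumed tokens so the buffer does not grow without bound.
            all_tokens = all_tokens[max_buffer_token_len:]
            yield x[:-1], x[1:]


pairs = list(chunk_tokens([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=3))
```

In this sketch any tokens left over at the end of the stream are discarded rather than padded.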

wanchaol · 2024-02-23T18:55:57Z

torchtrain/datasets/alpaca.py

+            if not self.infinite:
+                break


I think cmd + c should be sufficient?

support infinite loop over alpaca dataset

0642110

[ghstack-poisoned]

tianyu-l added a commit that referenced this pull request Feb 22, 2024

support infinite loop over alpaca dataset

1b736be

ghstack-source-id: e9fa7fd Pull Request resolved: #66

facebook-github-bot added the CLA Signed label Feb 22, 2024

tianyu-l requested review from wanchaol and lessw2020 February 22, 2024 21:50

XilunWu reviewed Feb 23, 2024

wanchaol approved these changes Feb 23, 2024

tianyu-l mentioned this pull request Feb 27, 2024

support infinite loop over alpaca dataset #92

Merged

tianyu-l closed this Feb 27, 2024

tianyu-l added a commit that referenced this pull request Aug 16, 2024

support infinite loop over alpaca dataset

2969f8f

ghstack-source-id: e9fa7fd Pull Request resolved: #66
