Fix bad synthetic dataloader with per device batch size < 1. #1706


Open · wants to merge 1 commit into base: main
Conversation

@wang2yn84 (Collaborator) commented May 9, 2025

Description

Fix BadSyntheticDataIterator for Grain. The local iterator attribute is missing, so the workload errors out when using a Grain dataset together with per_device_batch_size < 1.

Tests

Manually ran the following workload:

```shell
python -m MaxText.train MaxText/configs/base.yml \
  skip_jax_distributed_system=True \
  run_name=lance_test \
  attention=dot_product \
  dataset_type=grain \
  tokenizer_path=assets/tokenizer.llama2 \
  hardware=gpu \
  logits_dot_in_fp32=false \
  enable_goodput_recording=false \
  monitor_goodput=false \
  remat_policy=full \
  weight_dtype=bfloat16 \
  save_config_to_gcs=false \
  scan_layers=false \
  per_device_batch_size=0.25 \
  dcn_fsdp_parallelism=-1 \
  dcn_data_parallelism=1 \
  ici_fsdp_parallelism=1 \
  ici_tensor_parallelism=8 \
  packing=false \
  enable_checkpoint_cloud_logger=true \
  dataset_path=/scratch/lancewang/dataset_pvc/ \
  grain_train_files=/scratch/lancewang/dataset_pvc/array-record/c4/en/3.0.1/c4-train.array_record* \
  grain_worker_count=1 \
  enable_checkpointing=false \
  async_checkpointing=true \
  checkpoint_period=10 \
  save_config_to_gcs=false \
  base_output_directory=/scratch/lancewang/outputs
```

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@wang2yn84 wang2yn84 changed the title Fix the way we use bad synthetic dataloader. Fix bad synthetic dataloader with per device batch size < 1. May 9, 2025
```diff
@@ -83,6 +83,7 @@ def __init__(self, config, mesh):
     self.mesh = mesh
     dataset = BadSyntheticDataIterator.get_bad_synthetic_data(config)
     self.data_generator = multihost_dataloading.MultiHostDataLoadIterator(dataset, self.mesh)
+    self.local_iterator = self.data_generator.local_iterator
```
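The pattern the one-line fix follows can be sketched in isolation. This is a minimal, self-contained toy (not the MaxText code; both class bodies here are illustrative stand-ins): a multi-host loader wrapper holds a per-host iterator, and the outer data iterator forwards it as a `local_iterator` attribute so callers that expect that attribute keep working.

```python
# Toy sketch of the fixed structure. Names mirror the diff but the
# implementations are hypothetical simplifications.
class MultiHostDataLoadIterator:
    """Stand-in for multihost_dataloading.MultiHostDataLoadIterator."""

    def __init__(self, dataset):
        self.local_iterator = iter(dataset)  # the per-host iterator

    def __next__(self):
        return next(self.local_iterator)


class BadSyntheticDataIterator:
    """Synthetic-data iterator that now also exposes local_iterator."""

    def __init__(self, dataset):
        self.data_generator = MultiHostDataLoadIterator(dataset)
        # The line added by this PR: without it, code that reads
        # `data_iterator.local_iterator` raises AttributeError.
        self.local_iterator = self.data_generator.local_iterator

    def __next__(self):
        return next(self.data_generator)


it = BadSyntheticDataIterator([{"inputs": 1}, {"inputs": 2}])
assert next(it) == {"inputs": 1}
assert next(it.local_iterator) == {"inputs": 2}
```

Both names refer to the same underlying iterator, so advancing one advances the other, matching how a checkpoint handler and the training loop must share state.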
Collaborator:
Does it need to be used somewhere?

Collaborator (author):

Hi Branden, yes, it's used here:

```python
iter=grain.PyGrainCheckpointSave(data_iterator.local_iterator),
```
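To make the failure mode concrete, here is a toy sketch (hypothetical classes, not the MaxText/Grain code): `checkpoint_save` mimics the quoted call site reaching for `data_iterator.local_iterator`, which raises `AttributeError` unless the attribute is set as in this PR.

```python
# Hypothetical before/after iterator classes illustrating the bug this PR
# fixes; checkpoint_save stands in for the
# grain.PyGrainCheckpointSave(data_iterator.local_iterator) call above.
class SyntheticIteratorWithoutFix:
    def __init__(self, data):
        self.data_generator = iter(data)  # no local_iterator attribute


class SyntheticIteratorWithFix:
    def __init__(self, data):
        self.data_generator = iter(data)
        self.local_iterator = self.data_generator  # attribute added by the fix


def checkpoint_save(data_iterator):
    # The checkpoint path reads local_iterator off the data iterator.
    return data_iterator.local_iterator


try:
    checkpoint_save(SyntheticIteratorWithoutFix([1, 2]))
except AttributeError:
    print("without the fix: AttributeError")

print(next(checkpoint_save(SyntheticIteratorWithFix([1, 2]))))  # prints 1
```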
