Change random seeding in resnet and transformer to meet spec, delete unused code #4
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
petermattson approved these changes on Apr 20, 2018
LGTM.
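This PR changes the ResNet and Transformer references to pick their random seed at run time rather than hard-coding it, as the MLPerf rules intend. A minimal sketch of what run-time seeding can look like; the helper name and logging format are illustrative assumptions, not the reference code:

```python
import random
import time

import numpy as np


def seed_run(seed=None):
    """Choose a run-time seed instead of a hard-coded constant, log it, and seed the RNGs.

    Illustrative sketch only; the actual references seed their framework RNGs
    (TensorFlow at the time of this PR) in their own way.
    """
    if seed is None:
        # Derive the seed from the clock so each run gets a fresh value.
        seed = int(time.time())
    print("run_seed: %d" % seed)  # record the chosen seed so the run can be reproduced
    random.seed(seed)
    np.random.seed(seed)
    return seed
```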
dagarcia-nvidia pushed a commit to dagarcia-nvidia/training that referenced this pull request on Feb 27, 2019
* Update .gitignore * Update .gitignore
ekrimer pushed a commit to ekrimer/training that referenced this pull request on Sep 6, 2019
ekrimer pushed a commit to ekrimer/training that referenced this pull request on Oct 10, 2019
ekrimer pushed a commit to ekrimer/training that referenced this pull request on Oct 10, 2019
* Do not construct large single tensor of training data. For memory reasons. Remove reliance on torch in convert script. Prefer only numpy.
* Small fix for num_elems query.
* Remove torch import and clean up comments.

Adding initial compliance checking. using relative path
Extend to print more errors at a time.
Adding hooks for others to call.
Update v0.5.0_level1.yaml
Adding l2 checking for ssd.
add transformer to is_a_benchmark
Adding L2 for most benchmarks.

Unused tags (mlcommons#3)
* train_learn_rate is not defined in tags.py, opt_learning_rate is instead
* opt_learning_rate can be more frequent than train_epoch
* train_loop is once per run, train_epoch is once per epoch
* model_hp_vocab_size is not used in the Transformer reference (preproc_vocal_size is used instead)

report accuracy and target from log parser (mlcommons#4)

Several bug fixes to prevent crashes. (mlcommons#5)
* fix crash when there are no log lines
* fix crash when eval_target is a dict

don't include idea files

Allow more than one clear caches
We'd prefer to allow multiple caches to be cleared, esp. in situations where there are multiple caches to clear :)

Removing "bottleneck_block" from Resnet L2
@nvpstr for visibility The "bottleneck_block" is actually not a tag but it is a constant for logging. It was incorrectly being treated as a tag here (due to copy-pasting from the constants file in the reference).

Allow more flexibility for async eval.
Sometimes async evaluation in distributed environments can run into issues with exact tag counts.

Adding place holder for maskrcnn so there is no error
My understanding is that there isn't an annotated reference for L2. Adding trivial contents. The file can't be empty.

method to return start_time with compliance check
So it can be treated as a module

Allow a prefix to the log line
Allow (and ignore) the logline to have a prefix. Deduplicate lines. Don't remove preceeding stuff.

Add debug... remove debug. Handling duplicates.
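Two items in the commit message above, allowing a prefix on log lines and deduplicating them, are straightforward to illustrate. A rough sketch under the assumption that compliance lines carry a ":::MLP" marker; the regex and function names are hypothetical, not the actual checker code:

```python
import re

# Compliance-style log lines carry a ":::MLP..." marker; anything before it
# (e.g. a timestamp added by a launcher) is treated as an ignorable prefix.
# The marker regex and function names are illustrative assumptions.
MARKER = re.compile(r":::MLP")


def strip_prefix(line):
    """Drop any text preceding the marker; return None for non-compliance lines."""
    m = MARKER.search(line)
    return line[m.start():] if m else None


def dedup(lines):
    """Remove exact duplicate compliance lines while preserving order."""
    seen = set()
    out = []
    for line in lines:
        stripped = strip_prefix(line)
        if stripped is None or stripped in seen:
            continue
        seen.add(stripped)
        out.append(stripped)
    return out
```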
mwawrzos referenced this pull request in mwawrzos/training on Dec 20, 2019
mwawrzos referenced this pull request in mwawrzos/training on Mar 27, 2020
mwawrzos referenced this pull request in mwawrzos/training on Mar 30, 2020
johntran-nv pushed a commit that referenced this pull request on Apr 7, 2021
* RNN-T reference update for MLPerf Training v1.0
* switch to stable DALI release
* transcritp tensor building - index with np array instead of torch tensor
* fix multi-GPU bucketing
* eval every epoch, logging improvement
* user can adjust optimizer betas
* gradient clipping
* missing config file
* [README] add driver disclaimer
* right path to sentencepieces
* bind all gpus in docker/launch.sh script
* move speed perturbation out of evaluation
* remove not related code; update logging; default arguments with LAMB
* add evaluation when every sample is seen once
* add run_and_time.sh
* update logging
* missing augmentation logs
* revert unwanted dropout removal from first two encode layers
* scaling weights initialization
* limit number of symbols produced by the greedy decoder
* simplification - rm old eval pipeline
* dev_ema in tb_logginer
* loading from checkpoint restores optimizer state
* Rnnt logging update (#4)
* logging uses constants instead of raw strings
* missing log entries
* add weights initialization logging according to mlcommons/logging#80
* 0.5 wights initialization scale gives more stable convergence
* fix typo, update logging lib to include new constant
* README update
* apply review suggestions
* [README] fix model diagram 2x time stacking after 2nd encoder layer, not 3x
* transcript tensor padding comment
* DALI output doesn't need extra zeroing of padding
* Update README.md Links to code sources, fix LSTM weight and bias initialization description
* [README] model diagram fix - adjust to 1023 sentencepieces
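Two of the bullets above, gradient clipping and limiting the symbols emitted by the greedy decoder, are common RNN-T training and decoding safeguards. A minimal PyTorch sketch with illustrative values; the constants and helper names are assumptions, not the reference's actual hyperparameters:

```python
import torch

# Illustrative values only; the reference's actual hyperparameters may differ.
MAX_GRAD_NORM = 1.0        # threshold for gradient clipping
MAX_SYMBOLS_PER_STEP = 30  # cap on symbols the greedy decoder emits per frame


def training_step(model, batch, optimizer, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(batch))
    loss.backward()
    # Clip the global gradient norm before the optimizer update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    return loss.item()


def emit_symbols(step_fn, blank_id, max_symbols=MAX_SYMBOLS_PER_STEP):
    # Emit symbols for one encoder frame until a blank is produced or the cap
    # is hit, so the decoder cannot loop indefinitely on a single frame.
    symbols = []
    while len(symbols) < max_symbols:
        sym = step_fn()
        if sym == blank_id:
            break
        symbols.append(sym)
    return symbols
```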
JackCaoG referenced this pull request in pytorch-tpu/training on Aug 19, 2022
Add `DistributedStrategy` (base class for distributed ops in PyTorch CUDA and PyTorch/XLA)
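The referenced commit introduces a `DistributedStrategy` base class so the same training loop can run on PyTorch CUDA and PyTorch/XLA. A rough sketch of such an abstraction; the method names are assumptions, not the actual interface in pytorch-tpu/training:

```python
from abc import ABC, abstractmethod

import torch
import torch.distributed as dist


class DistributedStrategy(ABC):
    """Backend-agnostic interface for the distributed ops a trainer needs."""

    @abstractmethod
    def world_size(self) -> int: ...

    @abstractmethod
    def all_reduce(self, tensor: torch.Tensor) -> torch.Tensor: ...

    @abstractmethod
    def barrier(self) -> None: ...


class CudaDDPStrategy(DistributedStrategy):
    """CUDA implementation on top of torch.distributed (e.g. the NCCL backend)."""

    def world_size(self) -> int:
        return dist.get_world_size()

    def all_reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor

    def barrier(self) -> None:
        dist.barrier()


# A PyTorch/XLA strategy would implement the same methods using
# torch_xla.core.xla_model helpers instead of torch.distributed.
```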
suexu1025 added a commit to suexu1025/training that referenced this pull request on Apr 9, 2024
add_brats fix eval OOM fix mis issues
suexu1025 added a commit to suexu1025/training that referenced this pull request on Apr 9, 2024
update Unet3d