Add info about initialization method in logging #80

Open
xyhuang opened this issue Jan 9, 2021 · 3 comments

xyhuang (Contributor) commented Jan 9, 2021

No description provided.

mwawrzos (Contributor) commented Feb 7, 2021

For each weights tensor, this entry should be logged right after the weights initialization is done.
Here is an example in the RNN-T reference: mlcommons/training@66d6c2d#diff-eb3462d93ad4cb9033e2a2884ef241e7a866019ca804a0bc4a3453bcd96bf05cR105-R107
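For illustration, a minimal sketch of what this could look like in a PyTorch-based reference (not the actual reference code; the `WEIGHTS_INITIALIZATION` constant and the `tensor` metadata key are assumed to match what the mlperf_logging library provides):

```python
# Minimal sketch: emit one weights_initialization event per tensor,
# immediately after that tensor's values have been set.
import torch
from mlperf_logging import mllog
from mlperf_logging.mllog import constants

mllogger = mllog.get_mllogger()

# Example module; in a real reference this loops over the whole model.
model = torch.nn.LSTM(input_size=240, hidden_size=1024, num_layers=2)
for name, param in model.named_parameters():
    torch.nn.init.uniform_(param, -0.1, 0.1)  # example initialization
    mllogger.event(key=constants.WEIGHTS_INITIALIZATION,
                   metadata={'tensor': name})
```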

The compliance checker will validate that the number of entries in the log matches the reference, so a submitter can make sure that all initializations are reported. Here is an example of the RNN-T checks:
https://github.com/mlcommons/logging/blob/master/mlperf_logging/compliance_checker/1.0.0/closed_rnnt.yaml#L1-L16
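Conceptually, the count-based check boils down to something like the following simplified sketch (the real checker is driven by the YAML rule files linked above; the expected count and the result filename here are hypothetical):

```python
# Simplified illustration of the count-based check: parse MLLOG lines from
# a result file and verify the number of weights_initialization entries.
import json
import re

EXPECTED_TENSORS = 67  # hypothetical count for a given reference model

def count_weight_init_events(log_path):
    count = 0
    with open(log_path) as f:
        for line in f:
            match = re.search(r':::MLLOG (.*)', line)
            if match and json.loads(match.group(1)).get('key') == 'weights_initialization':
                count += 1
    return count

assert count_weight_init_events('result_0.txt') == EXPECTED_TENSORS
```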

The purpose of this change is to simplify the review process.
Thanks to these log entries, a reviewer can quickly identify which part of the code is responsible for determining the initial tensor weight values.

xyhuang (Contributor, Author) commented Feb 8, 2021

Infra WG:

  • we might want to double-check that the changes work across different frameworks
  • potential alternative solutions:
    • log the start/stop of weight initialization
    • add metadata specifying which layer is initialized, and have the checker verify that all required layers are present (see the sketch after this list)
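For the second alternative, a presence check over named tensors could look roughly like this (a hedged sketch; the tensor names and the `metadata['tensor']` layout are assumptions, not the checker's actual implementation):

```python
# Sketch of a metadata-based check: verify that every required tensor
# appears among the logged weights_initialization entries, instead of
# only comparing entry counts.
REQUIRED_TENSORS = {  # hypothetical subset of an RNN-T model's tensors
    'encoder.pre_rnn.lstm.weight_ih_l0',
    'encoder.pre_rnn.lstm.weight_hh_l0',
    'prediction.dec_rnn.lstm.weight_ih_l0',
}

def check_weight_init(entries):
    """entries: parsed MLLOG dicts from a result file."""
    logged = {e['metadata']['tensor'] for e in entries
              if e.get('key') == 'weights_initialization'}
    missing = REQUIRED_TENSORS - logged
    if missing:
        raise ValueError(f'missing weights_initialization for: {sorted(missing)}')
```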

mwawrzos (Contributor) commented Feb 9, 2021

The metadata-based solution is implemented here: aa44366

mwawrzos added a commit to mwawrzos/training that referenced this issue Feb 17, 2021
* logging uses constants instead of raw strings
* missing log entries
* add weights initialization logging according to mlcommons/logging#80
xyhuang pushed a commit that referenced this issue Feb 22, 2021
* new keys according to #78 and #80

* RNN-T constants

* RNN-T update constants

* [compliance_checker] update to rules 1.0

* [compliance_checker] add gradient_accumulation_steps

* update constants

* [compliance_checker] RNN-T rules

* add rnnt and unet3d benchmarks

* Revert "RNN-T update constants"

This reverts commit 03b986a.

* Revert "RNN-T constants"

This reverts commit b550182.

* [compliance_checker] check weights_initialization based on metadata

* Add unet3d

* [compliance_checker][RNN-T] missing weights initialization check

* [compliance_checker][Unet3D] target 0.91 -> 0.908

after mlcommons/training@149c2b8

Co-authored-by: michalm <[email protected]>
johntran-nv pushed a commit to mlcommons/training that referenced this issue Apr 7, 2021
* RNN-T reference update for MLPerf Training v1.0

* switch to stable DALI release

* transcript tensor building - index with np array instead of torch tensor

* fix multi-GPU bucketing

* eval every epoch, logging improvement

* user can adjust optimizer betas

* gradient clipping

* missing config file

* [README] add driver disclaimer

* right path to sentencepieces

* bind all gpus in docker/launch.sh script

* move speed perturbation out of evaluation

* remove unrelated code; update logging; default arguments with LAMB

* add evaluation when every sample is seen once

* add run_and_time.sh

* update logging

* missing augmentation logs

* revert unwanted dropout removal from first two encode layers

* scaling weights initialization

* limit number of symbols produced by the greedy decoder

* simplification - rm old eval pipeline

* dev_ema in tb_logger

* loading from checkpoint restores optimizer state

* Rnnt logging update (#4)

* logging uses constants instead of raw strings
* missing log entries
* add weights initialization logging according to mlcommons/logging#80

* 0.5 weights initialization scale gives more stable convergence

* fix typo, update logging lib to include new constant

* README update

* apply review suggestions

* [README] fix model diagram

2x time stacking after 2nd encoder layer, not 3x

* transcript tensor padding comment

* DALI output doesn't need extra zeroing of padding

* Update README.md

Links to code sources, fix LSTM weight and bias initialization description

* [README] model diagram fix - adjust to 1023 sentencepieces
xyhuang pushed a commit that referenced this issue Apr 13, 2021
* new keys according to #78 and #80

* RNN-T constants

* RNN-T update constants

* [compliance_checker] update to rules 1.0

* [compliance_checker] add gradient_accumulation_steps

* update constants

* [compliance_checker] RNN-T rules

* add rnnt and unet3d benchmarks

* Revert "RNN-T update constants"

This reverts commit 03b986a.

* Revert "RNN-T constants"

This reverts commit b550182.

* RNN-T constants

* [compliance_checker] check weights_initialization based on metadata

* align naming with other constants

* Add unet3d

* [compliance_checker][RNN-T] update compliance checker

* [compliance_checker][RNN-T] missing weights initialization check

* [logging][rnn-t] weights initialization scale constant

* undo unwanted change in unet3d

Co-authored-by: michalm <[email protected]>
xyhuang added the v1.1 label on Jul 27, 2021