Xvector Egs, Etc #8

Merged: danpovey merged 5 commits into danpovey:xvector on Feb 18, 2016

david-ryan-snyder

For nnet3-xvector-get-egs I'm assuming that we don't need to worry about left or right context as we do in other binaries.

Also included in this pull request are a few improvements to the xvector objf/deriv code, such as fixing numerical overflow issues, typos, and mistakes in the comments.

@danpovey
Owner

Thanks!
Merging.
Please try to finish the get-egs script, including shuffling of egs.
Add the --max-jobs-run $nj option to the $cmd when shuffling, to avoid overwhelming the disk (otherwise it will run $num_train_archives jobs at once).
For extracting the examples for the training subset and validation, you'll have to add an extra option to the python script to make the chunk sizes deterministic rather than random (note: there will typically be one job, and --num-archives=3). The left and right chunk sizes will be identical, and they will range from min-chunk-size to max-chunk-size in a geometric progression as you go from the first to the last archive.
For the 'archive-chunk-sizes' file you may have to add some way of specifying a filename suffix or pattern, so that we can get separate versions of that file for the training-subset and validation-subset egs.

Dan
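
Editorial sketch of the deterministic chunk sizes requested above, assuming "geometric" means a constant ratio between consecutive archives. The function name, the rounding to whole frames, and the single-archive fallback are assumptions for illustration; they are not taken from the actual script.

```python
def geometric_chunk_sizes(min_chunk_size, max_chunk_size, num_archives):
    """Return one chunk size (in frames) per archive, spaced geometrically.

    Hypothetical helper: the first archive gets min_chunk_size, the last
    gets max_chunk_size, and consecutive archives differ by a fixed ratio.
    """
    if num_archives < 1:
        raise ValueError("num_archives must be at least 1")
    if min_chunk_size < 1 or max_chunk_size < min_chunk_size:
        raise ValueError("need 1 <= min_chunk_size <= max_chunk_size")
    if num_archives == 1:
        # Assumption: with a single archive, fall back to the minimum size.
        return [min_chunk_size]
    ratio = (float(max_chunk_size) / min_chunk_size) ** (1.0 / (num_archives - 1))
    sizes = [int(round(min_chunk_size * ratio ** i)) for i in range(num_archives)]
    # Pin the endpoints so floating-point rounding cannot move them.
    sizes[0] = min_chunk_size
    sizes[-1] = max_chunk_size
    return sizes


if __name__ == "__main__":
    # Left and right chunk sizes are identical, so each size is printed twice.
    for archive_index, size in enumerate(geometric_chunk_sizes(100, 400, 3), start=1):
        print(archive_index, size, size)
```

With the three archives mentioned in the comment, the middle archive lands on the geometric mean of the two limits.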

@david-ryan-snyder
Author

> Please try to finish the get-egs script, including shuffling of egs.

Will do.

> For the 'archive-chunk-sizes' file you may have to add some kind of way of specifying a filename suffix or pattern so that we can get separate versions of that file for the training-subset and validation-subset egs.

If I understand this correctly, we plan on using the same utterances in train and validation (but of course, different cuts)? Edit: Never mind, I misread that.

@danpovey
Owner

No, for validation the utterances are a held-out set; see the get_egs.sh script, which creates that subset.
You'd call that python script two more times: once for the training subset and once for validation. Both calls use a different, smaller number of frames-per-archive; again, that's drafted in the script.

Dan
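
Editorial sketch of how the two extra passes (training subset and validation) could each write their own chunk-size file by way of a filename suffix, as suggested earlier in the thread. The file name `archive_chunk_sizes`, the suffix values, the one-line-per-archive layout, and the sizes used are all hypothetical; the real script may differ.

```python
import os
import tempfile


def chunk_sizes_path(egs_dir, suffix=""):
    """Build the chunk-size file path; a non-empty suffix gives a separate file.

    Hypothetical naming scheme: archive_chunk_sizes[.<suffix>]
    """
    name = "archive_chunk_sizes"
    if suffix:
        name = name + "." + suffix
    return os.path.join(egs_dir, name)


def write_chunk_sizes(egs_dir, chunk_sizes, suffix=""):
    """Write '<archive-index> <left-size> <right-size>' per line (assumed layout)."""
    path = chunk_sizes_path(egs_dir, suffix)
    with open(path, "w") as f:
        for archive_index, size in enumerate(chunk_sizes, start=1):
            # Left and right chunk sizes are identical for the diagnostic egs.
            f.write("{} {} {}\n".format(archive_index, size, size))
    return path


if __name__ == "__main__":
    egs_dir = tempfile.mkdtemp()
    # One call per diagnostic subset; the sizes here are illustrative only.
    for suffix in ("train_subset", "valid"):
        print(write_chunk_sizes(egs_dir, [100, 200, 400], suffix=suffix))
```

Keeping the suffix as an optional argument means the main training egs can keep using the unsuffixed file.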

danpovey added a commit that referenced this pull request Feb 18, 2016
danpovey merged commit cb4635c into danpovey:xvector on Feb 18, 2016
danpovey pushed a commit that referenced this pull request Nov 7, 2019