[src] Incremental Lattice Determinization for Low-Latency WFST Decoder #3317

Closed
wants to merge 49 commits into from

chenzhehuai

@chenzhehuai chenzhehuai commented May 13, 2019

The original lattice determinization algorithm always runs after the lattice for the whole utterance has been generated. The reason is that it consumes the lattice of the whole utterance in order to retain only the best output-label sequence (of HMM states) for each input-label sequence (of words).

The motivation of incremental determinization is to spread out the work of determinization over time, which can be useful for online applications. The method is non-trivial because it determinizes the lattice chunk by chunk while still guaranteeing that the successful path through all chunks is unique for any given input-label sequence.

Our method decodes with the WFST and generates the lattice frame by frame, as in the previous method. We then split the lattice into chunks over time and determinize it chunk by chunk (with specific designs). Finally, we append the chunks together incrementally (with specific designs).

We are working on summarizing the algorithm and experiments: https://www.overleaf.com/read/qmzpxkjypdvk

@hainan-xv @mahsa7823 @LvHang

chenzhehuai and others added 30 commits January 26, 2018 13:31
…the way to the currently-decoded frame, we go up to, say, t-10 (unless this is the end of the utterance), and the same way that we put in temporary initial-probs, we also put in temporary final-probs which reflect the on the states at frame t-10. (we remove them later on, of course).
1. in the determinized lattice, there could be multiple final arcs with the same state label. I need to change the logic here.
2. for the first chunk, there could be some final arcs starting from state 0, while for the last chunk, there could be some initial arcs ending in a final state. Hence, I found that we cannot distinguish final and initial arcs by simply "if (s==0)" or "if (clat.Final(arc_appended.nextstate)!=CompactLatticeWeight::Zero())"
+ grep -H Overall exp_dec/incre.fl.1f/base/ora.base exp_dec/incre.fl.1f/base/ora.den.base exp_dec/incre.fl.1f/incre/ora.base exp_dec/incre.fl.1f/incre/ora.den.base
exp_dec/incre.fl.1f/base/ora.base:LOG (lattice-oracle[5.5.276~4-6f366]:main():lattice-oracle.cc:383) Overall %WER 1.70591 [ 342 / 20048, 109 insertions, 22 deletions, 211 substitutions ]
exp_dec/incre.fl.1f/base/ora.den.base:LOG (lattice-depth[5.5.276~4-6f366]:main():lattice-depth.cc:79) Overall density is 25.1613 over 244027 frames.
exp_dec/incre.fl.1f/incre/ora.base:LOG (lattice-oracle[5.5.276~4-6f366]:main():lattice-oracle.cc:383) Overall %WER 1.80567 [ 362 / 20048, 108 insertions, 25 deletions, 229 substitutions ]
exp_dec/incre.fl.1f/incre/ora.den.base:LOG (lattice-depth[5.5.276~4-6f366]:main():lattice-depth.cc:79) Overall density is 28.1682 over 244027 frames.
+ grep -H WER exp_dec/incre.fl.1f/base/wer exp_dec/incre.fl.1f/incre/wer
exp_dec/incre.fl.1f/base/wer:%WER 12.57 [ 2532 / 20138, 305 ins, 287 del, 1940 sub ]
exp_dec/incre.fl.1f/incre/wer:%WER 12.57 [ 2532 / 20138, 305 ins, 287 del, 1940 sub ]
+ grep real exp_dec/incre.fl.1f/base/log/decode.1.log exp_dec/incre.fl.1f/incre/log/decode.1.log
exp_dec/incre.fl.1f/base/log/decode.1.log:LOG (latgen-faster-mapped[5.5.276~4-6f366]:main():latgen-faster-mapped.cc:164) Time taken 48.4324s: real-time factor assuming 100 frames/sec is 0.912442
exp_dec/incre.fl.1f/incre/log/decode.1.log:LOG (latgen-incremental-mapped[5.5.276~4-6f366]:main():latgen-incremental-mapped.cc:164) Time taken 54.6669s: real-time factor assuming 100 frames/sec is 1.0299
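
For readers comparing the numbers in the log above, here is a minimal sketch of how the reported figures relate to each other. The helper names are made up for illustration and are not part of Kaldi. The frame count of about 5308 for this decoding job is not stated in the log; it is inferred from the reported time and real-time factor, so treat it as an assumption.

```cpp
#include <cassert>

// Real-time factor as worded in the log line: decoding time divided by the
// audio duration, where duration = num_frames / frames_per_sec.
double RealTimeFactor(double elapsed_sec, double num_frames,
                      double frames_per_sec = 100.0) {
  assert(num_frames > 0 && frames_per_sec > 0);
  return elapsed_sec * frames_per_sec / num_frames;
}

// Word error rate in percent: (ins + del + sub) / number of reference words.
double WerPercent(int ins, int del, int sub, int num_ref_words) {
  assert(num_ref_words > 0);
  return 100.0 * (ins + del + sub) / num_ref_words;
}

// Relative change of `incremental` with respect to `baseline`, in percent.
double RelativeChangePercent(double baseline, double incremental) {
  assert(baseline != 0);
  return 100.0 * (incremental - baseline) / baseline;
}
```

With the logged values, `WerPercent(109, 22, 211, 20048)` reproduces the baseline oracle WER and `WerPercent(108, 25, 229, 20048)` the incremental one, while `RelativeChangePercent(0.912442, 1.0299)` puts the slowdown of the incremental decoder in this run at roughly 13%.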
…son (remove it later)

2. add determinize-beam-offset. With this option, the beam used in lattice determinization
is (determinize_beam_offset + lattice_beam)
the new algorithm is to determinize "states in the appended lattice with final-arcs to also have non-final arcs leaving them"

@hainan-xv hainan-xv left a comment

I think this is ready to merge.


@danpovey danpovey left a comment

A couple of small comments in the review... but there are also a couple of slightly bigger issues (and I'm not sure whether to just merge this now or to wait):

  • IMO the right way to demonstrate the utility of this is to have a version of an online-decoding setup that uses it, since this is mostly useful in online scenarios. E.g. modify
    online2bin/online2-wav-nnet3-latgen-faster.cc -> online2bin/online2-wav-nnet3-latgen-incremental.cc and have some code in there that calls it the way a real user would call it, i.e. not by calling Decode(), which requires all the input to be available, but by calling AdvanceDecoding() periodically and getting the lattice (possibly with a NULL pointer if it's not needed).

  • Someone needs to go over the comments with a fine-tooth comb. Some of them need to be reorganized or moved, and in general we need to make sure they are clear and consistent with how we defined things in the paper.
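
To make the calling pattern in the first point concrete, here is a minimal sketch with a mock decoder. `MockOnlineDecoder` and `DecodeOnline` are hypothetical stand-ins written for this comment; only the method names AdvanceDecoding() and GetLattice() are taken from the discussion, and the real Kaldi signatures may well differ.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the incremental decoder; NOT the Kaldi class.
// It only tracks how many frames are decoded and how many are determinized.
class MockOnlineDecoder {
 public:
  // Pretend to decode `num_new_frames` more frames of input.
  void AdvanceDecoding(int num_new_frames) {
    assert(num_new_frames >= 0);
    frames_decoded_ += num_new_frames;
  }

  // Bring determinization up to date. If `chunk_ends_out` is non-null, also
  // export the chunk end frames; a null pointer skips that export.
  void GetLattice(std::vector<int> *chunk_ends_out) {
    if (frames_decoded_ > frames_determinized_) {
      frames_determinized_ = frames_decoded_;
      chunk_ends_.push_back(frames_determinized_);
    }
    if (chunk_ends_out != nullptr) *chunk_ends_out = chunk_ends_;
  }

 private:
  int frames_decoded_ = 0;
  int frames_determinized_ = 0;
  std::vector<int> chunk_ends_;
};

// Feed `total_frames` of audio in pieces of `frames_per_call`, keeping the
// determinization current after every piece. Returns the chunk end frames.
std::vector<int> DecodeOnline(int total_frames, int frames_per_call) {
  assert(frames_per_call > 0);
  MockOnlineDecoder decoder;
  for (int t = 0; t < total_frames; t += frames_per_call) {
    decoder.AdvanceDecoding(std::min(frames_per_call, total_frames - t));
    decoder.GetLattice(nullptr);  // do the work, discard the output
  }
  std::vector<int> chunk_ends;
  decoder.GetLattice(&chunk_ends);  // only now ask for the output
  return chunk_ends;
}
```

The point of the sketch is that Decode() never appears: input arrives piecewise, and the lattice is requested for real only once at the end.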

@hainan-xv do you have any time to do the comment-related part?
I am not sure whether @chenzhehuai still has time to do any work on this, i.e. whether I should ask him or you to do the coding part and the associated testing.

bool GetBestPath(Lattice *ofst, bool use_final_probs = true);

/**
The following function is specifically designed for incremental

It's good that you are making an effort to explain things, but we should separate explanation of the interface and external behavior from explanation of the algorithm. It may be better to just refer to the paper for explanation of the algorithm.

@chenzhehuai

@danpovey Do you mean having the following files?

decoder/lattice-incremental-online-decoder.h
decoder/lattice-incremental-online-decoder.cc
online2/online-nnet3-incremental-decoding.h
online2/online-nnet3-incremental-decoding.cc
online2bin/online2-wav-nnet3-latgen-incremental.cc

btw, Hainan says he will take some time to refine the comments.

@danpovey

danpovey commented Aug 4, 2019

Yes, that's what I mean.

@danpovey

danpovey commented Aug 4, 2019

oh-- and your build is failing.

@chenzhehuai

@danpovey Done. Hainan, please review this version


@danpovey danpovey left a comment

A few comments.


decoder.AdvanceDecoding();

if (do_endpointing && decoder.EndpointDetected(endpoint_opts)) {

I think there should be somewhere inside this loop where it keeps the lattice computation up to date, e.g. call GetLattice() with a NULL argument. Otherwise it's not doing the online stuff in a meaningful way.

@chenzhehuai

For now I call GetLattice() inside AdvanceDecoding(). Do you think it would be better to move it here?

@danpovey

@chenzhehuai @hainan-xv I guess you have been super busy, but I don't think there is much to do here.


@hainan-xv hainan-xv left a comment

Added some comments. The PR looks very good overall. I'm talking to Zhehuai offline regarding some code not included in this review.


@hainan-xv hainan-xv left a comment

More comments.

}

template <typename FST, typename Token>
bool LatticeIncrementalDecoderTpl<FST, Token>::GetLattice(bool use_final_probs,

I think we need a function in the interface that does the incremental-determinization work without actually outputting the lattice. (Because determinizer_.GetDeterminizedLattice() does extra work to get the lattice and we might sometimes want to do the incremental-determinization work when we don't need the lattice).
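
A minimal sketch of the kind of interface split being asked for, assuming nothing about the real class: `SketchIncrementalDeterminizer`, `AcceptFrames()` and `UpdateDeterminization()` are hypothetical names, and the frame counters merely stand in for the actual determinization and lattice-building work.

```cpp
#include <cassert>

// Hypothetical sketch, not the Kaldi class: separates "do the
// determinization work" from "build the output lattice".
class SketchIncrementalDeterminizer {
 public:
  // Records that `num_frames` more frames of raw lattice are available.
  void AcceptFrames(int num_frames) {
    assert(num_frames >= 0);
    frames_available_ += num_frames;
  }

  // Does the incremental work only; produces no output lattice.
  void UpdateDeterminization() { frames_determinized_ = frames_available_; }

  // Does any pending work, then pays the extra cost of producing output.
  // Returns the number of frames covered by the (imaginary) lattice.
  int GetLattice() {
    UpdateDeterminization();
    ++num_exports_;
    return frames_determinized_;
  }

  int NumExports() const { return num_exports_; }
  int FramesDeterminized() const { return frames_determinized_; }

 private:
  int frames_available_ = 0;
  int frames_determinized_ = 0;
  int num_exports_ = 0;
};
```

A caller that only wants to keep up with the audio would call UpdateDeterminization() after each batch of frames and GetLattice() once when the result is actually needed, so the export cost is paid a single time.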

@stansalvador

stansalvador commented Nov 25, 2019

The work done in this branch is pretty helpful for one of our projects, thanks for working to add it to kaldi.

We've tried using this branch and it generally works well, but we've noticed some frequent problems when there are relatively long (e.g. >15-20 seconds) portions of non-speech if determinize_max_active is set to a value >=6 (more often for higher values). We see this issue for several different models/LMs. In these non-speech regions the number of states in the chunk and lattice roughly doubles after every chunk. This can quickly reduce the decoding speed to many times slower than real-time, consume large amounts of memory, and in many cases it cannot fully decode an audio file. If it makes it through the silence region it then proceeds as normal with a typical number of states in each new chunk, but the number of states in the lattice remains very large. The paper only presents results for values of 25-150... how unexpected is it to require such a small value for some audio files?

We can reproduce this behavior pretty reliably on TEDLIUM audio, where there is usually ~17 seconds of music and applause near the beginning of each audio file. Using larger chunk sizes also eliminates the problem, but the value required to work around this issue is file specific, and larger values reduce the benefits of using incremental determinization.

The results below were generated from the audio file (a TEDLIUM audio file chopped up a little so the problematic music/applause part is in the middle of a 50 second file) "https://s3.amazonaws.com/cobalt-release/perm/824594725093/AimeeMullins_2009P-middle10-first30-middle10.wav" when running the following command with different values for --determinize-max-active. The number of states in the chunk/lattice at the problematic area of the audio file is shown for each test case (other parameters are defaults).

kaldi/src/online2bin/online2-wav-nnet3-latgen-incremental --verbose=3 --determinize-delay=25 --determinize-max-active=<VARIED_THIS_PARAMETER> --redeterminize-max-frames=2147483647 --determinize-period=20 --online=true --config=/path/to/online_nnet3_decoding.conf --max-active=5000 --beam=1.0 --lattice-beam=5.0 --acoustic-scale=1.0 --word-symbol-table=/path/to/words.txt /path/to/final.mdl /path/to/HCLG.fst 'ark:echo utt_1 utt_1|' 'scp:echo utt_1 /path/to/AimeeMullins_2009P-middle10-first30-middle10.wav |' ark:/path/to/outlattice.ark

--determinize-max-active=10 (finishes in 6 seconds for the 50 second file; often a value of 5 is needed to avoid slowdowns on some audio files)
...
VLOG[2] ... Frame: ( 0 , 40 ) states of the chunk: 11 states of the lattice: 11
VLOG[2] ... Frame: ( 40 , 76 ) states of the chunk: 21 states of the lattice: 31
VLOG[2] ... Frame: ( 76 , 100 ) states of the chunk: 8 states of the lattice: 38
VLOG[2] ... Frame: ( 100 , 120 ) states of the chunk: 8 states of the lattice: 45
VLOG[2] ... Frame: ( 120 , 140 ) states of the chunk: 6 states of the lattice: 50
VLOG[2] ... Frame: ( 140 , 160 ) states of the chunk: 13 states of the lattice: 62
VLOG[2] ... Frame: ( 160 , 200 ) states of the chunk: 13 states of the lattice: 74
VLOG[2] ... Frame: ( 200 , 220 ) states of the chunk: 4 states of the lattice: 77
VLOG[2] ... Frame: ( 220 , 240 ) states of the chunk: 7 states of the lattice: 83
VLOG[2] ... Frame: ( 240 , 260 ) states of the chunk: 8 states of the lattice: 90
VLOG[2] ... Frame: ( 260 , 280 ) states of the chunk: 4 states of the lattice: 93
VLOG[2] ... Frame: ( 280 , 300 ) states of the chunk: 6 states of the lattice: 98
VLOG[2] ... Frame: ( 300 , 332 ) states of the chunk: 10 states of the lattice: 107
VLOG[2] ... Frame: ( 332 , 940 ) states of the chunk: 7 states of the lattice: 113
VLOG[2] ... Frame: ( 940 , 980 ) states of the chunk: 6 states of the lattice: 118
VLOG[2] ... Frame: ( 980 , 1000 ) states of the chunk: 6 states of the lattice: 123
VLOG[2] ... Frame: ( 1000 , 1032 ) states of the chunk: 8 states of the lattice: 130
VLOG[2] ... Frame: ( 1032 , 1060 ) states of the chunk: 5 states of the lattice: 134
VLOG[2] ... Frame: ( 1060 , 1080 ) states of the chunk: 7 states of the lattice: 140
VLOG[2] ... Frame: ( 1080 , 1100 ) states of the chunk: 10 states of the lattice: 149
VLOG[2] ... Frame: ( 1100 , 1120 ) states of the chunk: 7 states of the lattice: 155
VLOG[2] ... Frame: ( 1120 , 1140 ) states of the chunk: 5 states of the lattice: 159
VLOG[2] ... Frame: ( 1140 , 1160 ) states of the chunk: 8 states of the lattice: 166
VLOG[2] ... Frame: ( 1160 , 1180 ) states of the chunk: 5 states of the lattice: 170
VLOG[2] ... Frame: ( 1180 , 1200 ) states of the chunk: 5 states of the lattice: 174
VLOG[2] ... Frame: ( 1200 , 1239 ) states of the chunk: 28 states of the lattice: 201
VLOG[2] ... Frame: ( 1239 , 1260 ) states of the chunk: 5 states of the lattice: 205
VLOG[2] ... Frame: ( 1260 , 1300 ) states of the chunk: 13 states of the lattice: 217
VLOG[2] ... Frame: ( 1300 , 1340 ) states of the chunk: 10 states of the lattice: 226
VLOG[2] ... Frame: ( 1340 , 1378 ) states of the chunk: 9 states of the lattice: 234
VLOG[2] ... Frame: ( 1378 , 1400 ) states of the chunk: 20 states of the lattice: 253
VLOG[2] ... Frame: ( 1400 , 1440 ) states of the chunk: 13 states of the lattice: 265
VLOG[2] ... Frame: ( 1440 , 1477 ) states of the chunk: 10 states of the lattice: 274
VLOG[2] ... Frame: ( 1477 , 1500 ) states of the chunk: 17 states of the lattice: 290
VLOG[2] ... Frame: ( 1500 , 1540 ) states of the chunk: 9 states of the lattice: 298
VLOG[2] ... Frame: ( 1540 , 1560 ) states of the chunk: 7 states of the lattice: 304
VLOG[2] ... Frame: ( 1560 , 1580 ) states of the chunk: 5 states of the lattice: 308
VLOG[2] ... Frame: ( 1580 , 1600 ) states of the chunk: 9 states of the lattice: 316
VLOG[2] ... Frame: ( 1600 , 1620 ) states of the chunk: 6 states of the lattice: 321
VLOG[2] ... Frame: ( 1620 , 1640 ) states of the chunk: 10 states of the lattice: 330
VLOG[2] ... Frame: ( 1640 , 1666 ) states of the chunk: 9 states of the lattice: 338

--determinize-max-active=20 (finishes in 8.8 minutes for the 50 second file)
VLOG[2] ... Frame: ( 0 , 40 ) states of the chunk: 11 states of the lattice: 11
VLOG[2] ... Frame: ( 40 , 60 ) states of the chunk: 17 states of the lattice: 27
VLOG[2] ... Frame: ( 60 , 100 ) states of the chunk: 11 states of the lattice: 37
VLOG[2] ... Frame: ( 100 , 120 ) states of the chunk: 8 states of the lattice: 44
VLOG[2] ... Frame: ( 120 , 140 ) states of the chunk: 6 states of the lattice: 49
VLOG[2] ... Frame: ( 140 , 160 ) states of the chunk: 13 states of the lattice: 61
VLOG[2] ... Frame: ( 160 , 180 ) states of the chunk: 14 states of the lattice: 74
VLOG[2] ... Frame: ( 180 , 200 ) states of the chunk: 4 states of the lattice: 77
VLOG[2] ... Frame: ( 200 , 220 ) states of the chunk: 4 states of the lattice: 80
VLOG[2] ... Frame: ( 220 , 240 ) states of the chunk: 8 states of the lattice: 87
VLOG[2] ... Frame: ( 240 , 260 ) states of the chunk: 8 states of the lattice: 94
VLOG[2] ... Frame: ( 260 , 280 ) states of the chunk: 4 states of the lattice: 97
VLOG[2] ... Frame: ( 280 , 300 ) states of the chunk: 6 states of the lattice: 102
VLOG[2] ... Frame: ( 300 , 334 ) states of the chunk: 11 states of the lattice: 112
VLOG[2] ... Frame: ( 334 , 380 ) states of the chunk: 12 states of the lattice: 123
VLOG[2] ... Frame: ( 380 , 400 ) states of the chunk: 13 states of the lattice: 135
VLOG[2] ... Frame: ( 400 , 438 ) states of the chunk: 17 states of the lattice: 151
VLOG[2] ... Frame: ( 438 , 460 ) states of the chunk: 26 states of the lattice: 176
VLOG[2] ... Frame: ( 460 , 480 ) states of the chunk: 41 states of the lattice: 216
VLOG[2] ... Frame: ( 480 , 500 ) states of the chunk: 74 states of the lattice: 289
VLOG[2] ... Frame: ( 500 , 540 ) states of the chunk: 138 states of the lattice: 426
VLOG[2] ... Frame: ( 540 , 560 ) states of the chunk: 266 states of the lattice: 691
VLOG[2] ... Frame: ( 560 , 580 ) states of the chunk: 523 states of the lattice: 1213
VLOG[2] ... Frame: ( 580 , 600 ) states of the chunk: 1033 states of the lattice: 2245
VLOG[2] ... Frame: ( 600 , 640 ) states of the chunk: 2059 states of the lattice: 4303
VLOG[2] ... Frame: ( 640 , 660 ) states of the chunk: 4105 states of the lattice: 8407 <-starts getting very slow here
VLOG[2] ... Frame: ( 660 , 700 ) states of the chunk: 8203 states of the lattice: 16609
VLOG[2] ... Frame: ( 700 , 740 ) states of the chunk: 16396 states of the lattice: 33004 <-is really slow here
WARNING ... Last chunk processing failed. We will retry from frame 0.
VLOG[2] ... Frame: ( 0 , 780 ) states of the chunk: 67 states of the lattice: 67 <-runs at ~normal speed from here
VLOG[2] ... Frame: ( 780 , 800 ) states of the chunk: 14 states of the lattice: 80
VLOG[2] ... Frame: ( 800 , 838 ) states of the chunk: 18 states of the lattice: 97
VLOG[2] ... Frame: ( 838 , 860 ) states of the chunk: 23 states of the lattice: 119
VLOG[2] ... Frame: ( 860 , 880 ) states of the chunk: 35 states of the lattice: 153
VLOG[2] ... Frame: ( 880 , 900 ) states of the chunk: 60 states of the lattice: 212
VLOG[2] ... Frame: ( 900 , 940 ) states of the chunk: 68 states of the lattice: 279
VLOG[2] ... Frame: ( 940 , 960 ) states of the chunk: 7 states of the lattice: 285
VLOG[2] ... Frame: ( 960 , 980 ) states of the chunk: 3 states of the lattice: 287
VLOG[2] ... Frame: ( 980 , 1000 ) states of the chunk: 6 states of the lattice: 292
VLOG[2] ... Frame: ( 1000 , 1040 ) states of the chunk: 8 states of the lattice: 299
VLOG[2] ... Frame: ( 1040 , 1060 ) states of the chunk: 5 states of the lattice: 303
VLOG[2] ... Frame: ( 1060 , 1080 ) states of the chunk: 7 states of the lattice: 309
VLOG[2] ... Frame: ( 1080 , 1100 ) states of the chunk: 10 states of the lattice: 318
VLOG[2] ... Frame: ( 1100 , 1120 ) states of the chunk: 7 states of the lattice: 324
VLOG[2] ... Frame: ( 1120 , 1140 ) states of the chunk: 5 states of the lattice: 328
VLOG[2] ... Frame: ( 1140 , 1160 ) states of the chunk: 8 states of the lattice: 335
VLOG[2] ... Frame: ( 1160 , 1180 ) states of the chunk: 5 states of the lattice: 339
VLOG[2] ... Frame: ( 1180 , 1200 ) states of the chunk: 5 states of the lattice: 343
VLOG[2] ... Frame: ( 1200 , 1240 ) states of the chunk: 28 states of the lattice: 370
VLOG[2] ... Frame: ( 1240 , 1260 ) states of the chunk: 5 states of the lattice: 374
VLOG[2] ... Frame: ( 1260 , 1280 ) states of the chunk: 9 states of the lattice: 382
VLOG[2] ... Frame: ( 1280 , 1300 ) states of the chunk: 7 states of the lattice: 388
VLOG[2] ... Frame: ( 1300 , 1340 ) states of the chunk: 10 states of the lattice: 397
VLOG[2] ... Frame: ( 1340 , 1360 ) states of the chunk: 14 states of the lattice: 410
VLOG[2] ... Frame: ( 1360 , 1380 ) states of the chunk: 7 states of the lattice: 416
VLOG[2] ... Frame: ( 1380 , 1400 ) states of the chunk: 22 states of the lattice: 437
VLOG[2] ... Frame: ( 1400 , 1420 ) states of the chunk: 13 states of the lattice: 449
VLOG[2] ... Frame: ( 1420 , 1440 ) states of the chunk: 12 states of the lattice: 460
VLOG[2] ... Frame: ( 1440 , 1460 ) states of the chunk: 7 states of the lattice: 466
VLOG[2] ... Frame: ( 1460 , 1480 ) states of the chunk: 9 states of the lattice: 474
VLOG[2] ... Frame: ( 1480 , 1500 ) states of the chunk: 14 states of the lattice: 487
VLOG[2] ... Frame: ( 1500 , 1540 ) states of the chunk: 9 states of the lattice: 495
VLOG[2] ... Frame: ( 1540 , 1560 ) states of the chunk: 7 states of the lattice: 501
VLOG[2] ... Frame: ( 1560 , 1580 ) states of the chunk: 5 states of the lattice: 505
VLOG[2] ... Frame: ( 1580 , 1600 ) states of the chunk: 9 states of the lattice: 513
VLOG[2] ... Frame: ( 1600 , 1620 ) states of the chunk: 6 states of the lattice: 518
VLOG[2] ... Frame: ( 1620 , 1640 ) states of the chunk: 10 states of the lattice: 527
VLOG[2] ... Frame: ( 1640 , 1666 ) states of the chunk: 9 states of the lattice: 535

We see this behavior for all of our models that we tried (it doesn't seem specific to a particular AM/LM).
I'm wondering if there is a better way to avoid this bad behavior to improve the robustness of incremental determinization without reducing the --determinize-max-active parameter to ~5 (which will reduce lattice density).
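
In case it helps with reproducing this on other files, here is a rough sketch of how the doubling could be flagged automatically from the verbose log. The struct, the function names and the thresholds are made up for illustration; only the log line format is taken from the output above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct ChunkStat {
  int begin_frame;
  int end_frame;
  int chunk_states;
  int lattice_states;
};

// Parses one "Frame: ( b , e ) states of the chunk: N states of the
// lattice: M" log line. Returns false if the line does not match.
bool ParseChunkLine(const std::string &line, ChunkStat *out) {
  std::size_t pos = line.find("Frame:");
  if (pos == std::string::npos) return false;
  ChunkStat s;
  int n = std::sscanf(
      line.c_str() + pos,
      "Frame: ( %d , %d ) states of the chunk: %d states of the lattice: %d",
      &s.begin_frame, &s.end_frame, &s.chunk_states, &s.lattice_states);
  if (n != 4) return false;
  *out = s;
  return true;
}

// Returns the index of the chunk that completes `min_run` consecutive
// chunks each at least `ratio` times larger than its predecessor, or -1.
int FirstRunawayChunk(const std::vector<ChunkStat> &stats, double ratio,
                      int min_run) {
  int run = 0;
  for (std::size_t i = 1; i < stats.size(); ++i) {
    if (stats[i].chunk_states >= ratio * stats[i - 1].chunk_states) {
      ++run;
    } else {
      run = 0;
    }
    if (run >= min_run) return static_cast<int>(i);
  }
  return -1;
}
```

A ratio of 1.5 over three consecutive chunks is an arbitrary choice; it triggers on the 17, 26, 41, 74, ... sequence from the second run but not on the chunk sizes of the first run.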

@danpovey

Can you please look at branch #3737, which is supposed to replace this? Closing this PR.

@danpovey danpovey closed this Nov 26, 2019
@stansalvador

thanks, I'll try it out

4 participants