Skip to content

swap axis for optimization in Tensor3dCopy() #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 26, 2015

Conversation

freewym
Copy link

@freewym freewym commented Oct 24, 2015

No description provided.

@danpovey
Copy link
Owner

Thanks-- could you please test if there is any effect on speed, using that
command line I showed you in the log? If the starting .mdl doesn't exist,
you can just take the 100.mdl or some model that exists.
remember to qlogin first.
Dan

On Sat, Oct 24, 2015 at 4:12 PM, Yiming Wang [email protected]
wrote:


You can view, comment on, or merge this pull request online at:

#1
Commit Summary

  • swap axis for optimization in Tensor3dCopy()

File Changes

Patch Links:


Reply to this email directly or view it on GitHub
#1.

@danpovey
Copy link
Owner

... also the function is getting a little long-- it might be better to
declare a separate function to do the rearrangement, and call it.

On Sat, Oct 24, 2015 at 6:06 PM, Daniel Povey [email protected] wrote:

Thanks-- could you please test if there is any effect on speed, using that
command line I showed you in the log? If the starting .mdl doesn't exist,
you can just take the 100.mdl or some model that exists.
remember to qlogin first.
Dan

On Sat, Oct 24, 2015 at 4:12 PM, Yiming Wang [email protected]
wrote:


You can view, comment on, or merge this pull request online at:

#1
Commit Summary

  • swap axis for optimization in Tensor3dCopy()

File Changes

Patch Links:


Reply to this email directly or view it on GitHub
#1.

@freewym
Copy link
Author

freewym commented Oct 25, 2015

Pushed the new commit. The speedup on Tensor3dCopy seems not significant (8.79502s vs 8.62674s as shown below). The first 3 significant digits of these two time intervals keep the same over multiple runs (8.79 and 8.62 respectively). By printing out the more info, it appears that every time when the swap happens, there is only one ystride (src or dst, but not both) being 1, and x is always swapped with y.

Before optimization:
[cudevice profile]
SymAddMat2 0.484567s
CuMatrix::Resize 0.486771s
MulElements 0.48813s
ApplyLogSoftMaxPerRow 0.552064s
AddMat 0.688837s
CuMatrix::SetZero 0.866473s
AddDiagMatMat 1.04785s
CuMatrixBase::CopyFromMat(from other CuMatrixBase) 1.21041s
AddRows 1.56373s
ApplyHeaviside 2.19864s
CopyRows 4.46078s
Tensor3dCopy 8.79502s
AlphaGeneralFrame 17.1941s
BetaGeneralFrame 18.1611s
AddMatMat 55.2607s
Total GPU time: 116.411s (may involve some double-counting)

After optimization:
[cudevice profile]
SymAddMat2 0.476551s
MulElements 0.489167s
CuMatrix::Resize 0.491286s
ApplyLogSoftMaxPerRow 0.553362s
AddMat 0.687893s
CuMatrix::SetZero 0.86763s
AddDiagMatMat 1.04372s
CuMatrixBase::CopyFromMat(from other CuMatrixBase) 1.21509s
AddRows 1.56267s
ApplyHeaviside 2.20029s
CopyRows 4.46238s
Tensor3dCopy 8.62674s
AlphaGeneralFrame 17.2041s
BetaGeneralFrame 18.174s
AddMatMat 55.1232s
Total GPU time: 116.099s (may involve some double-counting)

@danpovey
Copy link
Owner

OK, thanks- I'll look at it and maybe merge it to-morrow.
Meanwhile, go through the tutorial at OpenFst.org. you may learn enough to
be able to do one of the other recent 'issues' I put on github.
Dan

On Sun, Oct 25, 2015 at 12:19 AM, Yiming Wang [email protected]
wrote:

Pushed the new commit. The speedup on Tensor3dCopy seems not significant
(8.79502s vs 8.62674s as shown below). The first 3 significant digits of
these two time intervals keep the same over multiple runs (8.79 and 8.62
respectively). By printing out the more info, it appears that every time
when the swap happens, there is only one ystride (src or dst, but not both)
being 1, and x is always swapped with y.

Before optimization:
[cudevice profile]
SymAddMat2 0.484567s
CuMatrix::Resize 0.486771s
MulElements 0.48813s
ApplyLogSoftMaxPerRow 0.552064s
AddMat 0.688837s
CuMatrix::SetZero 0.866473s
AddDiagMatMat 1.04785s
CuMatrixBase::CopyFromMat(from other CuMatrixBase) 1.21041s
AddRows 1.56373s
ApplyHeaviside 2.19864s
CopyRows 4.46078s
Tensor3dCopy 8.79502s
AlphaGeneralFrame 17.1941s
BetaGeneralFrame 18.1611s
AddMatMat 55.2607s
Total GPU time: 116.411s (may involve some double-counting)

After optimization:
[cudevice profile]
SymAddMat2 0.476551s
MulElements 0.489167s
CuMatrix::Resize 0.491286s
ApplyLogSoftMaxPerRow 0.553362s
AddMat 0.687893s
CuMatrix::SetZero 0.86763s
AddDiagMatMat 1.04372s
CuMatrixBase::CopyFromMat(from other CuMatrixBase) 1.21509s
AddRows 1.56267s
ApplyHeaviside 2.20029s
CopyRows 4.46238s
Tensor3dCopy 8.62674s
AlphaGeneralFrame 17.2041s
BetaGeneralFrame 18.174s
AddMatMat 55.1232s
Total GPU time: 116.099s (may involve some double-counting)


Reply to this email directly or view it on GitHub
#1 (comment).

@@ -24,11 +24,45 @@
namespace kaldi {
namespace ctc {

void SwapDimsForX(int32& xdim, int32& ydim, int32& zdim,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry- it's against the Google style guide to use non-const references in function parameters. These should be pointers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declared static. and move the comments from header file to .cc file.

@@ -24,11 +24,45 @@
namespace kaldi {
namespace ctc {

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a comment here briefly explaining what the does; and since this function is not exported, it's good practice to declare it 'static'.

danpovey added a commit that referenced this pull request Oct 26, 2015
swap axis for optimization in Tensor3dCopy()
@danpovey danpovey merged commit 49f91ab into danpovey:tombstone Oct 26, 2015
danpovey pushed a commit that referenced this pull request Feb 23, 2016
danpovey pushed a commit that referenced this pull request Feb 23, 2016
danpovey pushed a commit that referenced this pull request May 2, 2016
Add ivector support to online nnet3 decoder
danpovey pushed a commit that referenced this pull request May 2, 2016
danpovey pushed a commit that referenced this pull request Nov 7, 2019
csukuangfj pushed a commit to csukuangfj/kaldi that referenced this pull request Jan 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants