
Xvector #5


Merged
5 commits merged into danpovey:xvector on Feb 13, 2016
Conversation

david-ryan-snyder
Copy link

No description provided.

david-ryan-snyder and others added 5 commits February 11, 2016 22:51
… function and gradient computation for the xvector extractor training. Also adding xvector-test.* which provides a unit test for the gradient.
… a CUDA kernel. Still need to do the same for the actual derivatives
int32_cuda scores_index = i + j * scores_dim.stride;
Real K = 1.0 / (scores_dim.rows - 2.0);
Real L = scores[scores_index];
if (i < scores_dim.cols && j < scores_dim.rows && i < j) {
Owner

To avoid separately having to zero the upper triangle and the diagonal of the matrix, you might as well do it in this kernel [i.e., you could set the matrix to kUndefined before calling this kernel].
However, I suppose this all becomes moot if you end up using Pegah's idea and rely on the SoftHinge kernel and a fixed scaling matrix.
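
A minimal sketch of that suggestion, as it might look inside cu-kernels.cu: the same kernel that computes the per-pair terms also writes zeros to the upper triangle and the diagonal, so the output matrices can be allocated with kUndefined and never need a separate zeroing pass. The index names and placeholder comments are assumptions; the actual objf/derivative expressions from the diff are elided.

#include "cudamatrix/cu-matrixdim.h"  // MatrixDim, int32_cuda

template<typename Real>
__global__
static void _compute_xvector_objf(const Real* scores, MatrixDim scores_dim,
                                  Real* objf_terms, MatrixDim objf_dim,
                                  Real* objf_derivs, MatrixDim derivs_dim) {
  int32_cuda i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
  int32_cuda j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
  if (i < scores_dim.cols && j < scores_dim.rows) {
    int32_cuda objf_index = i + j * objf_dim.stride;
    int32_cuda derivs_index = i + j * derivs_dim.stride;
    if (i < j) {
      // Strict lower triangle: compute the objf term and its derivative from
      // scores[i + j * scores_dim.stride] and K = 1.0 / (scores_dim.rows - 2.0),
      // exactly as in the diff above (the actual expressions are elided here).
      objf_terms[objf_index] = 0;     // placeholder for the real objf term
      objf_derivs[derivs_index] = 0;  // placeholder for the real derivative
    } else {
      // Upper triangle and diagonal: write zeros here, so the outputs can be
      // created with kUndefined and no separate zeroing pass is needed.
      objf_terms[objf_index] = 0;
      objf_derivs[derivs_index] = 0;
    }
  }
}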

Author

After looking at it more, I think it's better to just do this in a CUDA kernel.

Also, I still need to make kernels for the actual derivatives, which are somewhat nontrivial to compute in an efficient way... I don't think it's possible to use Pegah's idea to handle them.

Owner

I think the only not-100%-trivial thing about the derivatives is the fact that different parts of the matrix have different scaling factors. You could probably compute the objf and derivs as follows, using individual kernels:

  • get the matrix of scores.
  • apply fixed-scaling-1 to the matrix of scores (to negate the different-class entries).
  • compute the soft-hinge function.
  • compute TraceMatMat of this matrix with a fixed scaling matrix fixed-scaling-2 (with 1/(num-rows-2) for different-class members) to get the objf.
  • use the Sigmoid function to compute the derivative of the soft-hinge nonlinearity.
  • multiply the derivatives by fixed-scaling-1 * fixed-scaling-2. These are the derivatives of the objective function w.r.t. the raw scores.

There may be a few signs wrong here. However, it would be more efficient to do all of the above in a single kernel. You can easily do it in the same kernel that computes the objective-function terms [do the summation via matrix-sum, though].

Dan
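
A minimal host-side sketch of the sequence of steps above, using individual Kaldi CuMatrix operations rather than the single fused kernel that the comment notes would be more efficient. The function name ComputeObjfFromScores and the inputs scale1 (the fixed matrix that negates different-class entries) and scale2 (the fixed matrix holding the 1/(num-rows-2) weights) are assumptions for illustration, and signs are glossed over, as noted above.

#include "cudamatrix/cu-matrix.h"

namespace kaldi {

BaseFloat ComputeObjfFromScores(const CuMatrixBase<BaseFloat> &scores,
                                const CuMatrixBase<BaseFloat> &scale1,
                                const CuMatrixBase<BaseFloat> &scale2,
                                CuMatrixBase<BaseFloat> *score_derivs) {
  CuMatrix<BaseFloat> scaled(scores);      // copy of the raw scores
  scaled.MulElements(scale1);              // negate the different-class terms
  CuMatrix<BaseFloat> hinge(scores.NumRows(), scores.NumCols());
  hinge.SoftHinge(scaled);                 // elementwise log(1 + exp(x))
  // Weighted sum of the soft-hinge terms: trace(hinge * scale2') is the
  // elementwise inner product of the two matrices.
  BaseFloat objf = TraceMatMat(hinge, scale2, kTrans);
  // d(soft-hinge)/dx = sigmoid(x); chain back through both scalings to get
  // the derivative of the objf w.r.t. the raw scores.
  score_derivs->Sigmoid(scaled);
  score_derivs->MulElements(scale1);
  score_derivs->MulElements(scale2);
  return objf;
}

}  // namespace kaldi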


Author

I think you're describing an alternative way to get the coefficients for the derivative terms. But, the kernel code above already does that.

On the CPU, the derivative w.r.t. S needs something like the following (NOTE: I'm ignoring peculiarities due to S being symmetric):

for i=0 ... N:
  for j = 0 ... N:
     v = xvectors(i)
     w = xvectors(j)
     deriv_S += C(i,j) * (v v' + w w')

where C(i, j) is a coefficient that depends on whether the vectors at rows i and j are from the same class or from different classes. This is what we calculated in the kernel above.

Each v,w pair results in its own matrix. I think this makes it harder to deal with in a single kernel. I think the easiest thing to do is to create an additional kernel that works like a modified form of matrix multiplication. Suppose V is the matrix of xvectors and D = NumCols(V). Then P = V' "times" V is the serialized outer product of each row of V. For example, P.Row(0) = Serialized( V.Row(0) * V.Row(0)' ). In other words, p_{i,j} = v_{i, (j / D) % D} * v_{i, j % D}.

Once that is done, it should be more straightforward to calculate S_deriv += C(i, j) * (P.Row(i) + P.Row(j)) in parallel.
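
A CPU reference sketch of the "serialized outer product" just described, to make the indexing concrete; the function name is hypothetical, and this only illustrates the proposed kernel's output, not the kernel itself.

#include "matrix/kaldi-matrix.h"

namespace kaldi {

void SerializedOuterProducts(const MatrixBase<BaseFloat> &V,  // N x D xvectors
                             Matrix<BaseFloat> *P) {          // output, N x D*D
  int32 N = V.NumRows(), D = V.NumCols();
  P->Resize(N, D * D);
  for (int32 i = 0; i < N; i++)
    for (int32 j = 0; j < D * D; j++)
      // p_{i,j} = v_{i, j / D} * v_{i, j % D}; row i of P is the row-major
      // serialization of the outer product of row i of V with itself.
      (*P)(i, j) = V(i, j / D) * V(i, j % D);
}

}  // namespace kaldi

Given P, the accumulation S_deriv += C(i, j) * (P.Row(i) + P.Row(j)) over all (i, j) pairs reduces to weighting and summing rows of P, which is the part the comment proposes to parallelize.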

Owner

I don't think you are really thinking about this in the spirit of
backprop. The general principle is that you go forward computing the
objective function, and then you do a process that is roughly the
mirror-image of the forward process to backprop the derivatives through the
computation.

What I described was getting the derivatives of the objective function
w.r.t. the matrix of scores. After that you just have to do the reverse of
the forward operations to get the derivatives w.r.t. S and the matrix of
xvectors.

Dan


Author

After that you just have to do the reverse of
the forward operations to get the derivatives w.r.t. S and the matrix of xvectors.

Right, that's what I'm referring to. Once you have the derivs of the objf w.r.t. the scores (included in C(i,j)), you still need to compute the derivative of the scores w.r.t. S. However, as far as I can tell, unless you do that in a kernel, you'll end up with an algorithm with two loops over the xvectors (see the pseudo-code in the earlier post). I proposed the kernel above to parallelize that computation.

Owner

OK, let me work this out...
The forward computation is something like:

A = X X'
cvec = diag(X S X')
u = vector of ones
scores = A - cvec u' - u cvec' + b
... compute the objf and get scores_deriv, which is d(objf)/d(scores)
A_deriv = scores_deriv
X_deriv += 2 A_deriv X (or something like that)
cvec_deriv = - sum-of-scores_deriv-cols - sum-of-scores_deriv-rows

When computing the deriv w.r.t. S, I am thinking about the expression cvec_deriv . cvec, which equals trace(diag(cvec_deriv) X S X'), where diag(cvec_deriv) is a matrix whose diagonal is cvec_deriv; we can rearrange this to trace(S (X' diag(cvec_deriv) X)). From this we get (through a mysterious process, I do it intuitively)
S_deriv = X' diag(cvec_deriv) X
which is pretty easy to compute.
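
A hedged sketch of that backward step in host-side C++ with Kaldi's CuMatrix types, showing that d(objf)/dS follows from d(objf)/d(scores) with a couple of matrix operations and no explicit loop over pairs of xvectors. The function name and the choice to pass scores_deriv as a full N x N matrix are assumptions for illustration; the signs follow the derivation above and may need checking.

#include "cudamatrix/cu-matrix.h"
#include "cudamatrix/cu-vector.h"

namespace kaldi {

void BackpropToS(const CuMatrixBase<BaseFloat> &X,            // N x D xvectors
                 const CuMatrixBase<BaseFloat> &scores_deriv, // N x N, d(objf)/d(scores)
                 CuMatrix<BaseFloat> *S_deriv) {              // D x D output
  int32 N = X.NumRows(), D = X.NumCols();
  // cvec_deriv(i) = -sum_j scores_deriv(i, j) - sum_j scores_deriv(j, i),
  // because cvec enters the scores as  - cvec u' - u cvec'.
  CuVector<BaseFloat> cvec_deriv(N);
  cvec_deriv.AddColSumMat(-1.0, scores_deriv, 0.0);  // minus the row-wise sums
  cvec_deriv.AddRowSumMat(-1.0, scores_deriv, 1.0);  // minus the column-wise sums
  // S_deriv = X' diag(cvec_deriv) X: scale row i of X by cvec_deriv(i),
  // then multiply by X' on the left.
  CuMatrix<BaseFloat> scaled_X(X);
  scaled_X.MulRowsVec(cvec_deriv);
  S_deriv->Resize(D, D);
  S_deriv->AddMatMat(1.0, X, kTrans, scaled_X, kNoTrans, 0.0);
}

}  // namespace kaldi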


Author

OK, I'll play with it some more to see if I can get it to work without a kernel and without an O(N^2) computation.

In your procedure, it isn't obvious to me (yet) that you can get terms of the form S_deriv = C(x,y) * (x x' + y y') for all combinations of (x,y) pairs. That's where the O(N^2) comes from that I'm trying to avoid.

Owner

The fact that it was possible in the forward computation generally means it's possible in the backward computation.
You'll get S_deriv = X' diag(cvec_deriv) X, I think.


danpovey added a commit that referenced this pull request Feb 13, 2016
@danpovey danpovey merged commit 598e9b1 into danpovey:xvector Feb 13, 2016
const CuMatrixBase<BaseFloat> &xvector_pairs,
const CuSpMatrix<BaseFloat> &S,
BaseFloat b, CuMatrixBase<BaseFloat> *deriv_xvector,
CuVector<BaseFloat> *deriv_S_and_b, BaseFloat *tot_objf,
Owner

Please make this two outputs, a CuVector *deriv_S and a BaseFloat *deriv_b.
I am going to give these separate output nodes in the nnet, for easier diagnostics and for easier control of their learning rates.
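
For illustration, a sketch of how the declaration above might look with the combined deriv_S_and_b output split in two, as requested. The function name, the packed-vector representation of deriv_S, and the elided trailing arguments are assumptions, not the final code.

#include "cudamatrix/cu-matrix.h"
#include "cudamatrix/cu-vector.h"
#include "cudamatrix/cu-sp-matrix.h"

namespace kaldi {

void ComputeXvectorObjfAndDeriv(const CuMatrixBase<BaseFloat> &xvector_pairs,
                                const CuSpMatrix<BaseFloat> &S,
                                BaseFloat b,
                                CuMatrixBase<BaseFloat> *deriv_xvector,
                                CuVector<BaseFloat> *deriv_S,  // d(objf)/dS, packed like an SpMatrix
                                BaseFloat *deriv_b,            // d(objf)/db
                                BaseFloat *tot_objf /* , ... remaining args as in the diff */);

}  // namespace kaldi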

danpovey pushed a commit that referenced this pull request Nov 7, 2019