
Xvector #5


Merged
5 commits merged into danpovey:xvector on Feb 13, 2016
Conversation

david-ryan-snyder
Copy link

No description provided.

david-ryan-snyder and others added 5 commits February 11, 2016 22:51
… function and gradient computation for the xvector extractor training. Also adding xvector-test.* which provides a unit test for the gradient.
… a CUDA kernel. Still need to do the same for the actual derivatives
int32_cuda scores_index = i + j * scores_dim.stride;
Real K = 1.0 / (scores_dim.rows - 2.0);
Real L = scores[scores_index];
if (i < scores_dim.cols && j < scores_dim.rows && i < j) {
Owner

To avoid separately having to zero the upper triangle and the diagonal of the matrix, you might as well do it in this kernel [i.e., you could set the matrix to kUndefined before calling this kernel].
However, I suppose this all becomes moot if you end up using Pegah's idea and rely on the SoftHinge kernel and a fixed scaling matrix.
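
A minimal sketch of that suggestion, as it might look inside cu-kernels.cu: the same kernel that computes the per-pair terms also writes zeros to the upper triangle and the diagonal, so the output matrices can be allocated with kUndefined and never need a separate zeroing pass. The index names and placeholder comments are assumptions; the actual objf/derivative expressions from the diff are elided.

#include "cudamatrix/cu-matrixdim.h"  // MatrixDim, int32_cuda

template<typename Real>
__global__
static void _compute_xvector_objf(const Real* scores, MatrixDim scores_dim,
                                  Real* objf_terms, MatrixDim objf_dim,
                                  Real* objf_derivs, MatrixDim derivs_dim) {
  int32_cuda i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
  int32_cuda j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
  if (i < scores_dim.cols && j < scores_dim.rows) {
    int32_cuda objf_index = i + j * objf_dim.stride;
    int32_cuda derivs_index = i + j * derivs_dim.stride;
    if (i < j) {
      // Strict lower triangle: compute the objf term and its derivative from
      // scores[i + j * scores_dim.stride] and K = 1.0 / (scores_dim.rows - 2.0),
      // exactly as in the diff above (the actual expressions are elided here).
      objf_terms[objf_index] = 0;     // placeholder for the real objf term
      objf_derivs[derivs_index] = 0;  // placeholder for the real derivative
    } else {
      // Upper triangle and diagonal: write zeros here, so the outputs can be
      // created with kUndefined and no separate zeroing pass is needed.
      objf_terms[objf_index] = 0;
      objf_derivs[derivs_index] = 0;
    }
  }
}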

Author

After looking at it more, I think it's better to just do this in a CUDA kernel.

Also, I still need to make kernels for the actual derivatives, which are somewhat nontrivial to compute in an efficient way... I don't think it's possible to use Pegah's idea to handle them.

Owner

I think the only not-100%-trivial thing about the derivatives is the fact that different parts of the matrix have different scaling factors. You could probably compute the objf and derivs as follows, using individual kernels:

  • get the matrix of scores.
  • apply fixed-scaling-1 to the matrix of scores (to negate the different-class entries).
  • compute the soft-hinge function.
  • compute TraceMatMat of this matrix with a fixed scaling matrix fixed-scaling-2 (with 1/(num-rows-2) for different-class members) to get the objf.
  • use the Sigmoid function to compute the derivative of the soft-hinge nonlinearity.
  • multiply the derivatives by fixed-scaling-1 * fixed-scaling-2. These are the derivatives of the objective function w.r.t. the raw scores.

There may be a few signs wrong here. However, it would be more efficient to do all of the above in a single kernel. You can easily do it in the same kernel that computes the objective-function terms [do the summation via matrix-sum, though].

Dan
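
A minimal host-side sketch of the sequence of steps above, using individual Kaldi CuMatrix operations rather than the single fused kernel that the comment notes would be more efficient. The function name ComputeObjfFromScores and the inputs scale1 (the fixed matrix that negates different-class entries) and scale2 (the fixed matrix holding the 1/(num-rows-2) weights) are assumptions for illustration, and signs are glossed over, as noted above.

#include "cudamatrix/cu-matrix.h"

namespace kaldi {

BaseFloat ComputeObjfFromScores(const CuMatrixBase<BaseFloat> &scores,
                                const CuMatrixBase<BaseFloat> &scale1,
                                const CuMatrixBase<BaseFloat> &scale2,
                                CuMatrixBase<BaseFloat> *score_derivs) {
  CuMatrix<BaseFloat> scaled(scores);      // copy of the raw scores
  scaled.MulElements(scale1);              // negate the different-class terms
  CuMatrix<BaseFloat> hinge(scores.NumRows(), scores.NumCols());
  hinge.SoftHinge(scaled);                 // elementwise log(1 + exp(x))
  // Weighted sum of the soft-hinge terms: trace(hinge * scale2') is the
  // elementwise inner product of the two matrices.
  BaseFloat objf = TraceMatMat(hinge, scale2, kTrans);
  // d(soft-hinge)/dx = sigmoid(x); chain back through both scalings to get
  // the derivative of the objf w.r.t. the raw scores.
  score_derivs->Sigmoid(scaled);
  score_derivs->MulElements(scale1);
  score_derivs->MulElements(scale2);
  return objf;
}

}  // namespace kaldi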


Author

I think you're describing an alternative way to get the coefficients for the derivative terms. But, the kernel code above already does that.

On the CPU, the derivative w.r.t. S needs something like the following (NOTE: I'm ignoring peculiarities due to S being symmetric):

for i=0 ... N:
  for j = 0 ... N:
     v = xvectors(i)
     w = xvectors(j)
     deriv_S += C(i,j) * (v v' + w w')

where C(i, j) is a coefficient that depends on whether the vectors at rows i and j are from the same class or from different classes. This is what we calculated in the kernel above.

Each v,w pair results in its own matrix. I think this makes it harder to deal with in a single kernel. I think the easiest thing to do is to create an additional kernel that works like a modified form of matrix multiplication. Suppose V is the matrix of xvectors and D = NumCols(V). Then P = V' "times" V is the serialized outer product of each row of V. For example, P.Row(0) = Serialized( V.Row(0) * V.Row(0)' ). In other words, p_{i,j} = v_{i, (j / D) % D} * v_{i, j % D}.

Once that is done, it should be more straightforward to calculate S_deriv += C(i, j) * (P.Row(i) + P.Row(j)) in parallel.
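
A CPU reference sketch of the "serialized outer product" just described, to make the indexing concrete; the function name is hypothetical, and this only illustrates the proposed kernel's output, not the kernel itself.

#include "matrix/kaldi-matrix.h"

namespace kaldi {

void SerializedOuterProducts(const MatrixBase<BaseFloat> &V,  // N x D xvectors
                             Matrix<BaseFloat> *P) {          // output, N x D*D
  int32 N = V.NumRows(), D = V.NumCols();
  P->Resize(N, D * D);
  for (int32 i = 0; i < N; i++)
    for (int32 j = 0; j < D * D; j++)
      // p_{i,j} = v_{i, j / D} * v_{i, j % D}; row i of P is the row-major
      // serialization of the outer product of row i of V with itself.
      (*P)(i, j) = V(i, j / D) * V(i, j % D);
}

}  // namespace kaldi

Given P, the accumulation S_deriv += C(i, j) * (P.Row(i) + P.Row(j)) over all (i, j) pairs reduces to weighting and summing rows of P, which is the part the comment proposes to parallelize.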

Owner

I don't think you are really thinking about this in the spirit of
backprop. The general principle is that you go forward computing the
objective function, and then you do a process that is roughly the
mirror-image of the forward process to backprop the derivatives through the
computation.

What I described was getting the derivatives of the objective function
w.r.t. the matrix of scores. After that you just have to do the reverse of
the forward operations to get the derivatives w.r.t. S and the matrix of
xvectors.

Dan


Author

After that you just have to do the reverse of
the forward operations to get the derivatives w.r.t. S and the matrix of xvectors.

Right, that's what I'm referring to. Once you have the derivs of the objf w.r.t. the scores (included in C(i,j)), you still need to compute the derivative of the scores w.r.t. S. However, as far as I can tell, unless you do that in a kernel, you'll end up with an algorithm with two loops over the xvectors (see the pseudo-code in the earlier post). I proposed the kernel above to parallelize that computation.

Owner

OK, let me work this out...
The forward computation is something like:

A = X X'
cvec = diag(X S X')
u = vector of ones
scores = A - cvec u' - u cvec' + b
... compute the objf and get scores_deriv, which is d(objf)/d(scores)
A_deriv = scores_deriv
X_deriv += 2 A_deriv X (or something like that)
cvec_deriv = - sum-of-scores_deriv-cols - sum-of-scores_deriv-rows

When computing the deriv w.r.t. S, I am thinking about the expression cvec_deriv . cvec, which equals trace(diag(cvec_deriv) X S X'), where diag(cvec_deriv) is a matrix whose diagonal is cvec_deriv; we can rearrange this to trace(S (X' diag(cvec_deriv) X)). From this we get (through a mysterious process, I do it intuitively)
S_deriv = X' diag(cvec_deriv) X
which is pretty easy to compute.
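
A hedged sketch of that backward step in host-side C++ with Kaldi's CuMatrix types, showing that d(objf)/dS follows from d(objf)/d(scores) with a couple of matrix operations and no explicit loop over pairs of xvectors. The function name and the choice to pass scores_deriv as a full N x N matrix are assumptions for illustration; the signs follow the derivation above and may need checking.

#include "cudamatrix/cu-matrix.h"
#include "cudamatrix/cu-vector.h"

namespace kaldi {

void BackpropToS(const CuMatrixBase<BaseFloat> &X,            // N x D xvectors
                 const CuMatrixBase<BaseFloat> &scores_deriv, // N x N, d(objf)/d(scores)
                 CuMatrix<BaseFloat> *S_deriv) {              // D x D output
  int32 N = X.NumRows(), D = X.NumCols();
  // cvec_deriv(i) = -sum_j scores_deriv(i, j) - sum_j scores_deriv(j, i),
  // because cvec enters the scores as  - cvec u' - u cvec'.
  CuVector<BaseFloat> cvec_deriv(N);
  cvec_deriv.AddColSumMat(-1.0, scores_deriv, 0.0);  // minus the row-wise sums
  cvec_deriv.AddRowSumMat(-1.0, scores_deriv, 1.0);  // minus the column-wise sums
  // S_deriv = X' diag(cvec_deriv) X: scale row i of X by cvec_deriv(i),
  // then multiply by X' on the left.
  CuMatrix<BaseFloat> scaled_X(X);
  scaled_X.MulRowsVec(cvec_deriv);
  S_deriv->Resize(D, D);
  S_deriv->AddMatMat(1.0, X, kTrans, scaled_X, kNoTrans, 0.0);
}

}  // namespace kaldi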


Author

OK, I'll play with it some more to see if I can get it to work without a kernel and without an O(N^2) computation.

In your procedure, it isn't obvious to me (yet) that you can get terms of the form S_deriv = C(x,y) * (x x' + y y') for all combinations of (x,y) pairs. That's where the O(N^2) comes from that I'm trying to avoid.

Owner

The fact that it was possible in the forward computation generally means it's possible in the backward computation.
You'll get S_deriv = X' diag(cvec_deriv) X, I think.


danpovey added a commit that referenced this pull request Feb 13, 2016
@danpovey danpovey merged commit 598e9b1 into danpovey:xvector Feb 13, 2016
const CuMatrixBase<BaseFloat> &xvector_pairs,
const CuSpMatrix<BaseFloat> &S,
BaseFloat b, CuMatrixBase<BaseFloat> *deriv_xvector,
CuVector<BaseFloat> *deriv_S_and_b, BaseFloat *tot_objf,
Owner

Please make this two outputs, a CuVector *deriv_S and a BaseFloat *deriv_b.
I am going to give these separate output nodes in the nnet, for easier diagnostics and for easier control of their learning rates.
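
For illustration, a sketch of how the declaration above might look with the combined deriv_S_and_b output split in two, as requested. The function name, the packed-vector representation of deriv_S, and the elided trailing arguments are assumptions, not the final code.

#include "cudamatrix/cu-matrix.h"
#include "cudamatrix/cu-vector.h"
#include "cudamatrix/cu-sp-matrix.h"

namespace kaldi {

void ComputeXvectorObjfAndDeriv(const CuMatrixBase<BaseFloat> &xvector_pairs,
                                const CuSpMatrix<BaseFloat> &S,
                                BaseFloat b,
                                CuMatrixBase<BaseFloat> *deriv_xvector,
                                CuVector<BaseFloat> *deriv_S,  // d(objf)/dS, packed like an SpMatrix
                                BaseFloat *deriv_b,            // d(objf)/db
                                BaseFloat *tot_objf /* , ... remaining args as in the diff */);

}  // namespace kaldi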

danpovey pushed a commit that referenced this pull request Nov 7, 2019