[Doc] Explain the effect of length in Wav2Vec2Model #1889

Closed
hihunjin opened this issue Oct 16, 2021 · 4 comments · Fixed by #1890
Comments

@hihunjin
🚀 The feature

A more specific explanation in the docs: I need a more detailed explanation of the length argument of Wav2Vec2Model. Is it a sample rate?

Motivation, pitch

It's unclear how this argument behaves.

Alternatives

No response

Additional context

No response

@mthrok
Collaborator

mthrok commented Oct 16, 2021

Hi @hihunjin

When batching multiple audios with different durations, the resulting Tensor is padded for the shorter audios.
The length parameter indicates the valid (unpadded) length of each sample in the batch.

Say I create a batch from a 1 second audio and a 0.8 second audio, both single channel and sampled at 16k Hz. The resulting batch Tensor will have shape [2, 16000].
The second audio in the batch actually has 12800 ( == 16000 * 0.8) valid samples, and the remaining 3200 samples are just padding. In this case the input length Tensor should look like torch.tensor([16000, 12800]).

By providing the length Tensor, Wav2Vec2Model will compute the appropriate mask when the input goes through the transformer layers, so that artifacts from the padded portion do not affect the computation.

The length parameter also provides the same sort of information for the output. Since Wav2Vec2Model changes the number of frames, the valid output lengths are not obvious from the shape of the output Tensor alone. When length is provided, the model computes the valid output lengths and returns them.
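The batching described above can be sketched in plain torch (a minimal illustration; Wav2Vec2Model itself is not invoked here, and the random data merely stands in for real waveforms):

```python
import torch

SAMPLE_RATE = 16000

# Two mono audios: 1.0 s and 0.8 s.
audio_a = torch.randn(SAMPLE_RATE)             # 16000 valid samples
audio_b = torch.randn(int(SAMPLE_RATE * 0.8))  # 12800 valid samples

# Pad the shorter audio with zeros so both fit in one batch Tensor.
batch = torch.nn.utils.rnn.pad_sequence([audio_a, audio_b], batch_first=True)
lengths = torch.tensor([audio_a.numel(), audio_b.numel()])

print(batch.shape)  # torch.Size([2, 16000])
print(lengths)      # tensor([16000, 12800])
```

This batch and lengths pair is exactly the input shape the length parameter describes: the batch is rectangular, and lengths records how much of each row is valid.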

@mthrok
Collaborator

mthrok commented Oct 16, 2021

Length computation in convolution layer

if length is not None:
    length = torch.div(length - self.kernel_size, self.stride, rounding_mode='floor') + 1
    # When input length is 0, the resulting length can be negative. So fix it here.
    length = torch.max(torch.zeros_like(length), length)
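Applying that formula once per convolution layer yields the valid output lengths. A small sketch using the example lengths from the comment above (the kernel sizes and strides below are assumed values typical of a wav2vec2 feature extractor, not read from the model):

```python
import torch

def conv_out_length(length, kernel_size, stride):
    # Same formula as the snippet above: floor((length - kernel) / stride) + 1,
    # clamped at zero so an empty input does not produce a negative length.
    length = torch.div(length - kernel_size, stride, rounding_mode='floor') + 1
    return torch.max(torch.zeros_like(length), length)

# (kernel_size, stride) per conv layer -- illustrative assumption.
layers = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

lengths = torch.tensor([16000, 12800])
for kernel_size, stride in layers:
    lengths = conv_out_length(lengths, kernel_size, stride)

print(lengths)  # valid number of frames per batch sample
```

With these assumed layer configs, the 16000-sample audio maps to 49 valid frames and the 12800-sample one to 39, which is the per-sample information the transformer mask needs.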

Mask computation in Transformer layer

if lengths is not None:
    batch_size, max_len, _ = x.shape
    # create mask for padded elements and zero-out them
    mask = torch.arange(max_len, device=lengths.device).expand(batch_size, max_len) >= lengths[:, None]
    x[mask] = 0.0
    # extend the mask to attention shape and set weight
    mask = -10000.0 * mask[:, None, None, :].to(dtype=features.dtype)
    mask = mask.expand(batch_size, 1, max_len, max_len)
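A toy run of the same mask construction, with hypothetical small sizes so the result is readable (x here plays the role of the features Tensor in the snippet above):

```python
import torch

batch_size, max_len, feat_dim = 2, 5, 4
x = torch.randn(batch_size, max_len, feat_dim)
lengths = torch.tensor([5, 3])  # second sample has 2 padded frames

# Boolean mask: True where a frame position is padding.
mask = torch.arange(max_len, device=lengths.device).expand(batch_size, max_len) >= lengths[:, None]
x[mask] = 0.0  # zero out the padded frames

# Additive attention mask: 0 for valid key positions, -10000 for padded ones,
# broadcast to [batch, 1, query, key] so it can be added to attention scores.
attn_mask = -10000.0 * mask[:, None, None, :].to(dtype=x.dtype)
attn_mask = attn_mask.expand(batch_size, 1, max_len, max_len)

print(mask)
# tensor([[False, False, False, False, False],
#         [False, False, False,  True,  True]])
```

Adding -10000 to the scores of padded keys drives their softmax weights to effectively zero, which is how the padded portion is kept from influencing attention.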

@hihunjin
Author

Thanks a lot. I appreciate it.

@mthrok
Collaborator

mthrok commented Oct 16, 2021

Glad to help.
I will keep this open until we update the doc to include the above information. Thanks for your feedback.

@mthrok mthrok reopened this Oct 16, 2021
@mthrok mthrok changed the title from "What is length in Wav2Vec2Model?" to "[Doc] Explain the effect of length in Wav2Vec2Model" Oct 16, 2021
@mthrok mthrok removed the question label Oct 16, 2021