Commit a3a238d: Add audio encoder design (parent c937f72)

1 file changed: audio_encoder_design.md (+185, -0)

Audio encoding design
=====================

Let's talk about the design of our audio encoding capabilities. This design doc
is not meant to be merged into the repo. I'm creating a PR to start a discussion
and enable comments on the design proposal. The PR will eventually be closed
without merging.

Feature space and requirements
------------------------------

When users give us the samples to be encoded, they have to provide:

- the FLTP tensor of decoded samples
- the sample rate of the samples. That's crucial for FFmpeg to know when each
  sample should be played, and it cannot be inferred.

Those are naturally supplied as 2 separate parameters (1 for the tensor, 1 for
the sample rate), but if our APIs also allowed users to pass a single
[AudioSamples](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.AudioSamples.html#torchcodec.AudioSamples)
object as a parameter, that could be good UX.

We want to enable users to encode these samples:

- to a file, like "output.mp3". When encoding to a file, we automatically infer
  the format (mp3) from the filename.
- to a file-like (NYI, will come eventually). When encoding to a file-like, we
  can't infer the format, so users have to specify it to us.
- to a tensor. Same here, users have to specify the output format.

We want to allow users to specify additional encoding options:

- The encoded bit rate, for compressed formats like mp3.
- The number of channels, to automatically encode to mono or to stereo,
  potentially different from the number of channels of the input (NYI).
- The encoded sample rate, to automatically encode into a given sample rate,
  potentially different from that of the input (NYI).
- Potentially other parameters (like codec-specific stuff).

API proposal
------------

### Option 1

A natural option is to create 3 separate stateless functions: one for each kind
of output we want to support.
```py
def encode_audio_to_file(
    samples: torch.Tensor,
    sample_rate: int,
    filename: Union[str, Path],
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> None:
    pass


def encode_audio_to_file_like(
    samples: torch.Tensor,
    sample_rate: int,
    file_like: object,
    format: str,
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> None:
    pass


def encode_audio_to_tensor(
    samples: torch.Tensor,
    sample_rate: int,
    format: str,
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> torch.Tensor:
    pass
```

A few notes:

- Both `to_file_like` and `to_tensor` need an extra `format` parameter, because
  it cannot be inferred. In `to_file`, it is inferred from `filename`.
- To avoid a collision between the input sample rate and the optional desired
  output sample rate, we have to use `output_sample_rate`. That's a bit meh.
  Technically, all of `format`, `bit_rate` and `num_channels` could also qualify
  for the `output_` prefix, but that would be very heavy.
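
To make the ergonomics concrete, here is how the three functions above might be
called. This is only a sketch against the proposed, not-yet-implemented API;
the sample values and rates are made up:

```py
samples = torch.rand(2, 16_000)  # 2 channels, 1 second at 16 kHz

# Format inferred from the filename:
encode_audio_to_file(samples, sample_rate=16_000, filename="output.mp3")

# Format must be explicit for file-likes and tensors:
with open("output.mp3", "wb") as f:
    encode_audio_to_file_like(samples, sample_rate=16_000, file_like=f, format="mp3")
encoded = encode_audio_to_tensor(samples, sample_rate=16_000, format="mp3")
```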

### Option 2

Another option is to expose each of these functions as methods on a stateless
object.
```py
class AudioEncoder:
    def __init__(
        self,
        samples: torch.Tensor,
        sample_rate: int,
    ):
        pass

    def to_file(
        self,
        filename: Union[str, Path],
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> None:
        pass

    def to_file_like(
        self,
        file_like: object,
        format: str,
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> None:
        pass

    def to_tensor(
        self,
        format: str,
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> torch.Tensor:
        pass
```

Usually, we like to expose objects (instead of stateless functions) when there
is a clear state to be managed. That's not the case here: the `AudioEncoder` is
mostly stateless. Instead, we can justify exposing an object by noting that it
allows us to cleanly separate unrelated blocks of parameters:

- the parameters relating to the **input** are in `__init__()`
- the parameters relating to the **output** are in the `to_*` methods.

A nice consequence of that is that we no longer have a collision between the 2
`sample_rate` parameters, and their purpose can be made clear through the docs.
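
Concretely, the input/output split would read like this (again a sketch against
the proposed API; the values are made up):

```py
encoder = AudioEncoder(samples, sample_rate=16_000)  # input parameters
encoder.to_file("output.mp3", bit_rate=128_000)      # output parameters
encoded = encoder.to_tensor(format="mp3", sample_rate=44_100)  # resample on encode
```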

A natural extension of option 2 is to allow users to pass an `AudioSamples`
object to `__init__()`, like so:

```py
samples = ...  # AudioSamples, e.g. coming from the decoder
AudioEncoder(samples).to_file("output.wav")
```

This can be enabled via this kind of logic:

```py
class AudioEncoder:
    def __init__(
        self,
        samples: Union[torch.Tensor, AudioSamples],
        sample_rate: Optional[int] = None,
    ):
        assert (
            (isinstance(samples, torch.Tensor) and sample_rate is not None)
            or (isinstance(samples, AudioSamples) and sample_rate is None)
        )
```
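
To check that dispatch logic without pulling in torch, here is a runnable toy
version. `FakeAudioSamples` and the plain list standing in for `torch.Tensor`
are stand-ins for illustration, not the real types:

```py
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class FakeAudioSamples:
    # Stand-in for torchcodec.AudioSamples: carries its own sample rate.
    data: List[float]
    sample_rate: int


class AudioEncoderSketch:
    def __init__(
        self,
        samples: Union[List[float], FakeAudioSamples],
        sample_rate: Optional[int] = None,
    ):
        if isinstance(samples, FakeAudioSamples):
            if sample_rate is not None:
                raise ValueError("sample_rate is already carried by AudioSamples")
            self.samples, self.sample_rate = samples.data, samples.sample_rate
        else:
            if sample_rate is None:
                raise ValueError("sample_rate is required with a raw tensor")
            self.samples, self.sample_rate = samples, sample_rate


# Both construction styles resolve to the same internal state:
from_raw = AudioEncoderSketch([0.0, 0.1], sample_rate=16_000)
from_obj = AudioEncoderSketch(FakeAudioSamples([0.0, 0.1], sample_rate=16_000))
```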

### Thinking ahead

I don't want to be prescriptive on what the video encoder should look like, but I
suspect that we will soon need to expose a **multistream** encoder, i.e. an
encoder that can encode both an audio and a video stream at the same time (think
of video generation models). I suspect the API of such an encoder will look
something like this (a bit similar to what TorchAudio exposes):

```py
Encoder().add_audio(...).add_video(...).to_file(filename)
Encoder().add_audio(...).add_video(...).to_file_like(filelike)
encoded_bytes = Encoder().add_audio(...).add_video(...).to_tensor()
```

This too will involve exposing an object, even though the actual "state" being
managed is very limited.
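
To sanity-check the ergonomics of that chained style, here is a minimal
self-contained mock. All names and signatures are assumptions; no actual
encoding happens:

```py
from typing import Any, Dict, List, Tuple


class Encoder:
    """Mock of the hypothetical multistream builder: it only records which
    streams were added, it does not do any real encoding."""

    def __init__(self) -> None:
        self._streams: List[Tuple[str, Dict[str, Any]]] = []

    def add_audio(self, **options: Any) -> "Encoder":
        self._streams.append(("audio", options))
        return self  # returning self is what enables the chained style

    def add_video(self, **options: Any) -> "Encoder":
        self._streams.append(("video", options))
        return self

    def to_tensor(self) -> List[Tuple[str, Dict[str, Any]]]:
        # A real implementation would return the encoded bytes as a tensor;
        # here we just expose the recorded streams.
        return self._streams


streams = Encoder().add_audio(sample_rate=16_000).add_video(frame_rate=30).to_tensor()
```

The key design point is that `add_audio` and `add_video` return `self`, so the
stream declarations compose left to right before a single terminal `to_*` call.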
