Audio encoding design
=====================

Let's talk about the design of our audio encoding capabilities. This design doc
is not meant to be merged into the repo. I'm creating a PR to start a discussion
and enable comments on the design proposal. The PR will eventually be closed
without merging.


Feature space and requirements
------------------------------

When users give us the samples to be encoded, they have to provide:

- the FLTP tensor of decoded samples
- the sample rate of the samples. That's crucial for FFmpeg to know when each
  sample should be played, and it cannot be inferred.

Those are naturally supplied as 2 separate parameters (1 for the tensor, 1 for
the sample rate), but if our APIs also allowed users to pass a single
[AudioSamples](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.AudioSamples.html#torchcodec.AudioSamples)
object as a parameter, that would make for a good UX.
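
To make the expected input concrete, here is a minimal sketch of such a tensor,
assuming the usual `(num_channels, num_samples)` float32 layout (the values
below are arbitrary):

```py
import torch

# 2 seconds of stereo audio at 16 kHz: shape (num_channels, num_samples),
# dtype float32, with values in [-1, 1].
sample_rate = 16_000
samples = torch.rand(2, 2 * sample_rate) * 2 - 1
```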

We want to enable users to encode these samples:

- to a file, like "output.mp3". When encoding to a file, we automatically infer
  the format (mp3) from the filename (see the sketch after this list).
- to a file-like (NYI, will come eventually). When encoding to a file-like, we
  can't infer the format, so users have to specify it to us.
- to a tensor. Same here, users have to specify the output format.
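
As an illustration, the format inference could be as simple as looking at the
filename suffix (`infer_format` is a hypothetical helper, not part of the
proposed API):

```py
from pathlib import Path
from typing import Union

def infer_format(filename: Union[str, Path]) -> str:
    # "output.mp3" -> "mp3"
    format = Path(filename).suffix.removeprefix(".")
    if not format:
        raise ValueError(f"Cannot infer the output format from {filename!r}")
    return format
```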

We want to allow users to specify additional encoding options:

- The encoded bit rate, for compressed formats like mp3.
- The number of channels, to automatically encode to mono or to stereo,
  potentially different from the number of channels of the input (NYI).
- The encoded sample rate, to automatically encode into a given sample rate,
  potentially different from that of the input (NYI).
- Potentially other parameters (like codec-specific stuff).

API proposal
------------

### Option 1

A natural option is to create 3 separate stateless functions: one for each kind
of output we want to support.

```py
from pathlib import Path
from typing import Optional, Union

import torch


def encode_audio_to_file(
    samples: torch.Tensor,
    sample_rate: int,
    filename: Union[str, Path],
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> None:
    pass


def encode_audio_to_file_like(
    samples: torch.Tensor,
    sample_rate: int,
    file_like: object,
    format: str,
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> None:
    pass


def encode_audio_to_tensor(
    samples: torch.Tensor,
    sample_rate: int,
    format: str,
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> torch.Tensor:
    pass
```
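
For illustration, hypothetical usage of these functions (argument values are
arbitrary, `samples` is the FLTP tensor from the sketch above):

```py
import io

# To a file: the mp3 format is inferred from the filename.
encode_audio_to_file(samples, sample_rate=16_000, filename="output.mp3")

# To a file-like: the format must be passed explicitly.
buffer = io.BytesIO()
encode_audio_to_file_like(samples, sample_rate=16_000, file_like=buffer, format="mp3")

# To a tensor of encoded bytes, with a desired bit rate.
encoded = encode_audio_to_tensor(samples, sample_rate=16_000, format="mp3", bit_rate=128_000)
```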

A few notes:

- Both `to_file_like` and `to_tensor` need an extra `format` parameter, because
  it cannot be inferred. In `to_file`, it is inferred from `filename`.
- To avoid a collision between the input sample rate and the optional desired
  output sample rate, we have to use `output_sample_rate`. That's a bit meh.
  Technically, all of `format`, `bit_rate` and `num_channels` could also qualify
  for the `output_` prefix, but that would be very heavy.

### Option 2

Another option is to expose each of these functions as methods on a stateless
object.

```py
class AudioEncoder:
    def __init__(
        self,
        samples: torch.Tensor,
        sample_rate: int,
    ):
        pass

    def to_file(
        self,
        filename: Union[str, Path],
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> None:
        pass

    def to_file_like(
        self,
        file_like: object,
        format: str,
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> None:
        pass

    def to_tensor(
        self,
        format: str,
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> torch.Tensor:
        pass
```
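
Hypothetical usage, with arbitrary argument values:

```py
encoder = AudioEncoder(samples, sample_rate=16_000)
encoder.to_file("output.mp3", bit_rate=128_000)
encoded = encoder.to_tensor(format="mp3", bit_rate=128_000)
```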

Usually, we like to expose objects (instead of stateless functions) when there
is a clear state to be managed. That's not the case here: the `AudioEncoder` is
mostly stateless. Instead, we can justify exposing an object by noting that it
allows us to cleanly separate unrelated blocks of parameters:

- the parameters relating to the **input** are in `__init__()`
- the parameters relating to the **output** are in the `to_*` methods.

A nice consequence of that is that we no longer have a collision between the 2
`sample_rate` parameters, and their purpose can be made clear through docs.

A natural extension of option 2 is to allow users to pass an `AudioSamples`
object to `__init__()`, like so:

```py
samples = ...  # AudioSamples, e.g. coming from the decoder
AudioEncoder(samples).to_file("output.wav")
```

This can be enabled via this kind of logic:

```py
class AudioEncoder:
    def __init__(
        self,
        samples: Union[torch.Tensor, AudioSamples],
        sample_rate: Optional[int] = None,
    ):
        # sample_rate is required when samples is a raw tensor, and must not be
        # passed when samples is an AudioSamples object, which already carries
        # its own sample rate.
        assert (
            isinstance(samples, torch.Tensor) and sample_rate is not None
        ) or (
            isinstance(samples, AudioSamples) and sample_rate is None
        )
```


### Thinking ahead

I don't want to be prescriptive on what the video encoder should look like, but
I suspect that we will soon need to expose a **multistream** encoder, i.e. an
encoder that can encode both an audio and a video stream at the same time (think
of video generation models). I suspect the API of such an encoder will look
something like this (a bit similar to what TorchAudio exposes):

```py
Encoder().add_audio(...).add_video(...).to_file(filename)
Encoder().add_audio(...).add_video(...).to_file_like(filelike)
encoded_bytes = Encoder().add_audio(...).add_video(...).to_tensor()
```
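
A minimal sketch of what such a builder could look like (all names are
hypothetical, not a committed design): each `add_*` method records a stream and
returns `self` so that calls can chain.

```py
class Encoder:
    def __init__(self):
        # The only "state": the streams registered so far.
        self._streams = []

    def add_audio(self, samples, sample_rate):
        self._streams.append(("audio", samples, sample_rate))
        return self

    def add_video(self, frames, frame_rate):
        self._streams.append(("video", frames, frame_rate))
        return self

    def to_file(self, filename):
        ...  # mux and encode all registered streams into filename
```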

This too will involve exposing an object, even though the actual "state" being
managed is very limited.