HumeAI · fern-api · Jul 16, 2025
diff --git a/.mock/definition/tts/__package__.yml b/.mock/definition/tts/__package__.yml
@@ -165,15 +165,9 @@ service:
               - text: >-
                   Beauty is no quality in things themselves: It exists merely in
                   the mind which contemplates them.
-                description: >-
-                  Middle-aged masculine voice with a clear, rhythmic Scots lilt,
-                  rounded vowels, and a warm, steady tone with an articulate,
-                  academic quality.
-            context:
-              generation_id: 09ad914d-8e7f-40f8-a279-e34f07f7dab2
-            format:
-              type: mp3
-            num_generations: 1
+                voice:
+                  name: Male English Actor
+                  provider: HUME_AI
     synthesize-json-streaming:
       path: /v0/tts/stream/json
       method: POST
@@ -206,19 +200,9 @@ service:
               - text: >-
                   Beauty is no quality in things themselves: It exists merely in
                   the mind which contemplates them.
-                description: >-
-                  Middle-aged masculine voice with a clear, rhythmic Scots lilt,
-                  rounded vowels, and a warm, steady tone with an articulate,
-                  academic quality.
-            context:
-              utterances:
-                - text: How can people see beauty so differently?
-                  description: >-
-                    A curious student with a clear and respectful tone, seeking
-                    clarification on Hume's ideas with a straightforward
-                    question.
-            format:
-              type: mp3
+                voice:
+                  name: Male English Actor
+                  provider: HUME_AI
   source:
     openapi: tts-openapi.yml
 types:
@@ -390,22 +374,19 @@ types:
           see our documentation on [instant
           mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode). 
 
-          - Dynamic voice generation is not supported with this mode; a
-          predefined
+          - A
           [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice)
-          must be specified in your request.
+          must be specified when instant mode is enabled. Dynamic voice
+          generation is not supported with this mode.
 
-          - This mode is only supported for streaming endpoints (e.g.,
+          - Instant mode is only supported for streaming endpoints (e.g.,
           [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming),
           [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
 
           - Ensure only a single generation is requested
           ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations)
           must be `1` or omitted).
-
-          - With `instant_mode` enabled, **requests incur a 10% higher cost**
-          due to increased compute and resource requirements.
-        default: false
+        default: true
     source:
       openapi: tts-openapi.yml
   ReturnTts:
@@ -514,14 +495,20 @@ types:
         docs: >-
           Natural language instructions describing how the synthesized speech
           should sound, including but not limited to tone, intonation, pacing,
-          and accent (e.g., 'a soft, gentle voice with a strong British
-          accent').
+          and accent.
 
-          - If a Voice is specified in the request, this description serves as
-          acting instructions. For tips on how to effectively guide speech
-          delivery, see our guide on [Acting
+
+          **This field behaves differently depending on whether a voice is
+          specified**:
+
+          - **Voice specified**: the description will serve as acting directions
+          for delivery. Keep directions concise—100 characters or fewer—for best
+          results. See our guide on [acting
           instructions](/docs/text-to-speech-tts/acting-instructions).
-           - If no Voice is specified, a new voice is generated based on this description. See our [prompting guide](/docs/text-to-speech-tts/prompting) for tips on designing a voice.
+
+          - **Voice not specified**: the description will serve as a voice
+          prompt for generating a voice. See our [prompting
+          guide](/docs/text-to-speech-tts/prompting) for design tips.
         validation:
           maxLength: 1000
       speed:

diff --git a/poetry.lock b/poetry.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -3,7 +3,7 @@ name = "hume"
 
 [tool.poetry]
 name = "hume"
-version = "0.9.1"
+version = "0.9.2"
 description = "A Python SDK for Hume AI"
 readme = "README.md"
 authors = []

diff --git a/reference.md b/reference.md
@@ -145,10 +145,9 @@ This setting affects how the `snippets` array is structured in the response, whi
 **instant_mode:** `typing.Optional[bool]` 
 
 Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is received. Recommended for real-time applications requiring immediate audio playback. For further details, see our documentation on [instant mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode). 
-- Dynamic voice generation is not supported with this mode; a predefined [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified in your request.
-- This mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
+- A [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.
+- Instant mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
 - Ensure only a single generation is requested ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations) must be `1` or omitted).
-- With `instant_mode` enabled, **requests incur a 10% higher cost** due to increased compute and resource requirements.
 
 </dd>
 </dl>
@@ -294,10 +293,9 @@ This setting affects how the `snippets` array is structured in the response, whi
 **instant_mode:** `typing.Optional[bool]` 
 
 Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is received. Recommended for real-time applications requiring immediate audio playback. For further details, see our documentation on [instant mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode). 
-- Dynamic voice generation is not supported with this mode; a predefined [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified in your request.
-- This mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
+- A [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.
+- Instant mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
 - Ensure only a single generation is requested ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations) must be `1` or omitted).
-- With `instant_mode` enabled, **requests incur a 10% higher cost** due to increased compute and resource requirements.
 
 </dd>
 </dl>
@@ -345,7 +343,7 @@ Streams synthesized speech using the specified voice. If no voice is provided, a
 
 ```python
 from hume import HumeClient
-from hume.tts import FormatMp3, PostedContextWithGenerationId, PostedUtterance
+from hume.tts import PostedUtterance, PostedUtteranceVoiceWithName
 
 client = HumeClient(
     api_key="YOUR_API_KEY",
@@ -354,14 +352,12 @@ client.tts.synthesize_file_streaming(
     utterances=[
         PostedUtterance(
             text="Beauty is no quality in things themselves: It exists merely in the mind which contemplates them.",
-            description="Middle-aged masculine voice with a clear, rhythmic Scots lilt, rounded vowels, and a warm, steady tone with an articulate, academic quality.",
+            voice=PostedUtteranceVoiceWithName(
+                name="Male English Actor",
+                provider="HUME_AI",
+            ),
         )
     ],
-    context=PostedContextWithGenerationId(
-        generation_id="09ad914d-8e7f-40f8-a279-e34f07f7dab2",
-    ),
-    format=FormatMp3(),
-    num_generations=1,
 )
 
 ```
@@ -441,10 +437,9 @@ This setting affects how the `snippets` array is structured in the response, whi
 **instant_mode:** `typing.Optional[bool]` 
 
 Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is received. Recommended for real-time applications requiring immediate audio playback. For further details, see our documentation on [instant mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode). 
-- Dynamic voice generation is not supported with this mode; a predefined [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified in your request.
-- This mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
+- A [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.
+- Instant mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
 - Ensure only a single generation is requested ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations) must be `1` or omitted).
-- With `instant_mode` enabled, **requests incur a 10% higher cost** due to increased compute and resource requirements.
 
 </dd>
 </dl>
@@ -494,7 +489,7 @@ The response is a stream of JSON objects including audio encoded in base64.
 
 ```python
 from hume import HumeClient
-from hume.tts import FormatMp3, PostedContextWithUtterances, PostedUtterance
+from hume.tts import PostedUtterance, PostedUtteranceVoiceWithName
 
 client = HumeClient(
     api_key="YOUR_API_KEY",
@@ -503,18 +498,12 @@ response = client.tts.synthesize_json_streaming(
     utterances=[
         PostedUtterance(
             text="Beauty is no quality in things themselves: It exists merely in the mind which contemplates them.",
-            description="Middle-aged masculine voice with a clear, rhythmic Scots lilt, rounded vowels, and a warm, steady tone with an articulate, academic quality.",
+            voice=PostedUtteranceVoiceWithName(
+                name="Male English Actor",
+                provider="HUME_AI",
+            ),
         )
     ],
-    context=PostedContextWithUtterances(
-        utterances=[
-            PostedUtterance(
-                text="How can people see beauty so differently?",
-                description="A curious student with a clear and respectful tone, seeking clarification on Hume's ideas with a straightforward question.",
-            )
-        ],
-    ),
-    format=FormatMp3(),
 )
 for chunk in response.data:
     yield chunk
@@ -596,10 +585,9 @@ This setting affects how the `snippets` array is structured in the response, whi
 **instant_mode:** `typing.Optional[bool]` 
 
 Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is received. Recommended for real-time applications requiring immediate audio playback. For further details, see our documentation on [instant mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode). 
-- Dynamic voice generation is not supported with this mode; a predefined [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified in your request.
-- This mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
+- A [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.
+- Instant mode is only supported for streaming endpoints (e.g., [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming), [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
 - Ensure only a single generation is requested ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations) must be `1` or omitted).
-- With `instant_mode` enabled, **requests incur a 10% higher cost** due to increased compute and resource requirements.
 
 </dd>
 </dl>
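
The instant mode constraints documented in this change (a voice must be specified, only one generation may be requested, and only streaming endpoints are supported) could be checked client-side before a request is sent. The following is a minimal sketch, not part of the hume SDK: it assumes the request body is a plain dict shaped like the YAML examples in this diff, and `check_instant_mode`, `STREAMING_PATHS`, and `example_body` are hypothetical names introduced for illustration. It also assumes that an omitted `instant_mode` means enabled, following the new `default: true`, and that the voice requirement applies to every utterance.

```python
from typing import Any

# Streaming endpoint paths named in the instant_mode docs of this diff.
STREAMING_PATHS = {"/v0/tts/stream/json", "/v0/tts/stream/file"}


def check_instant_mode(path: str, body: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means no conflict was found."""
    # Assumption: an omitted instant_mode counts as enabled (default: true).
    if not body.get("instant_mode", True):
        return []

    problems = []
    if path not in STREAMING_PATHS:
        problems.append("instant mode is only supported for streaming endpoints")

    # Assumption: every utterance needs a voice, since dynamic voice
    # generation from a description is not supported in this mode.
    for index, utterance in enumerate(body.get("utterances", [])):
        if not utterance.get("voice"):
            problems.append(f"utterance {index} has no voice")

    # num_generations must be 1 or omitted.
    if body.get("num_generations", 1) != 1:
        problems.append("num_generations must be 1 or omitted")
    return problems


# Request body mirroring the updated example in this diff.
example_body = {
    "utterances": [
        {
            "text": "Beauty is no quality in things themselves.",
            "voice": {"name": "Male English Actor", "provider": "HUME_AI"},
        }
    ]
}
print(check_instant_mode("/v0/tts/stream/json", example_body))
```

A request that relies on `description` alone to generate a voice would need `instant_mode` set to `False` explicitly under the new default, which is the case this check is meant to surface early.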