Contributor

@jinnigu jinnigu commented Sep 28, 2025

Summary

Adds inputAudioTranscription support to the Java ADK to achieve feature parity with Python. When enabled, the live connect config requests model-side transcription of input audio into text, allowing real-time processing of spoken input in live streaming scenarios.

Changes

Core Implementation

  • RunConfig: Added inputAudioTranscription field with getter/setter and builder support
  • Basic: Maps RunConfig.inputAudioTranscription to LiveConnectConfig.inputTranscription for model-side transcription
  • Runner: Auto-enables input/output transcription for live multi-agent scenarios to match Python behavior
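The Basic-flow mapping described above can be pictured with a minimal sketch. The three record types here are hypothetical stand-ins named after the classes mentioned in this PR, not the real ADK or GenAI SDK types, and modelling the field as an `Optional` is an assumption made only for illustration.

```java
import java.util.Optional;

// Hypothetical stand-ins for illustration only; not the real ADK/GenAI classes.
class InputTranscriptionMappingSketch {

  /** Stand-in for the transcription config object (empty here). */
  record AudioTranscriptionConfig() {}

  /** Stand-in for RunConfig, carrying only the new field. */
  record RunConfig(Optional<AudioTranscriptionConfig> inputAudioTranscription) {}

  /** Stand-in for LiveConnectConfig, carrying only the target field. */
  record LiveConnectConfig(Optional<AudioTranscriptionConfig> inputTranscription) {}

  /** Copies the run-level setting onto the live connect config, if it is set. */
  static LiveConnectConfig toLiveConnectConfig(RunConfig runConfig) {
    return new LiveConnectConfig(runConfig.inputAudioTranscription());
  }

  public static void main(String[] args) {
    RunConfig enabled = new RunConfig(Optional.of(new AudioTranscriptionConfig()));
    RunConfig disabled = new RunConfig(Optional.empty());
    System.out.println(toLiveConnectConfig(enabled).inputTranscription().isPresent());  // true
    System.out.println(toLiveConnectConfig(disabled).inputTranscription().isPresent()); // false
  }
}
```

When the field is unset, nothing is requested from the model, which matches the "when enabled" wording in the summary.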

Bug Fix

  • Runner: Fixed an unreachable branch in newInvocationContextForLive(). The outer check !CollectionUtils.isNullOrEmpty(runConfig.responseModalities()) (modalities NOT empty) meant the inner check CollectionUtils.isNullOrEmpty(runConfig.responseModalities()) (modalities IS empty) could never be true, so the "default to AUDIO modality" logic never executed.
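To make the contradiction concrete, here is a simplified, self-contained sketch of the two shapes. It is not the actual Runner code: `isNullOrEmpty`, `resolveOld`, and `resolveFixed` are hypothetical helpers, modalities are modelled as plain strings, and the multi-agent condition is left out to keep the focus on the nesting.

```java
import java.util.List;

// Hypothetical illustration of the nesting problem; not the real Runner code.
class UnreachableDefaultSketch {

  static boolean isNullOrEmpty(List<String> values) {
    return values == null || values.isEmpty();
  }

  /** Old shape: the outer check requires a non-empty list, so the inner check is never true. */
  static List<String> resolveOld(List<String> modalities, boolean live) {
    if (!isNullOrEmpty(modalities) && live) {
      if (isNullOrEmpty(modalities)) {
        return List.of("AUDIO"); // never reached
      }
    }
    return modalities;
  }

  /** Fixed shape: the emptiness check is no longer guarded by its own negation. */
  static List<String> resolveFixed(List<String> modalities, boolean live) {
    if (live && isNullOrEmpty(modalities)) {
      return List.of("AUDIO");
    }
    return modalities;
  }

  public static void main(String[] args) {
    System.out.println(resolveOld(List.of(), true));   // prints []
    System.out.println(resolveFixed(List.of(), true)); // prints [AUDIO]
  }
}
```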

Behavior Alignment with Python

  • Auto-sets inputAudioTranscription only for live multi-agent runs (when agent.subAgents() is non-empty)
  • Auto-sets outputAudioTranscription when response modalities imply audio usage
  • Leaves transcription settings unchanged for single-agent scenarios
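The three rules above can be sketched as a single decision function. This is an illustration under stated assumptions, not the Runner implementation: `Settings` and `applyLiveDefaults` are hypothetical names, transcription is reduced to booleans, and "modalities imply audio usage" is read here as "modalities are unset or do not include TEXT", which is my interpretation rather than something this PR spells out.

```java
import java.util.List;

// Hypothetical, simplified model of the defaulting rules; not the real RunConfig or Runner.
class TranscriptionDefaultsSketch {

  /** Only the settings relevant to the rules above, with transcription reduced to on/off. */
  record Settings(
      List<String> responseModalities, boolean inputTranscription, boolean outputTranscription) {}

  static Settings applyLiveDefaults(Settings current, boolean live, int subAgentCount) {
    // Single-agent or non-live runs: leave everything exactly as configured.
    if (!live || subAgentCount == 0) {
      return current;
    }
    List<String> modalities = current.responseModalities();
    boolean output = current.outputTranscription();
    if (modalities == null || modalities.isEmpty()) {
      // Assumption: unset modalities default to AUDIO, which implies audio output.
      modalities = List.of("AUDIO");
      output = true;
    } else if (!modalities.contains("TEXT")) {
      // Assumption: no TEXT modality means the response is audio.
      output = true;
    }
    // Input transcription is switched on for every live multi-agent run.
    return new Settings(modalities, true, output);
  }

  public static void main(String[] args) {
    Settings unset = new Settings(List.of(), false, false);
    System.out.println(applyLiveDefaults(unset, true, 0)); // single agent: unchanged
    System.out.println(applyLiveDefaults(unset, true, 2)); // multi-agent: defaults applied
  }
}
```

Returning the same `Settings` instance for the single-agent case makes the "unchanged" rule easy to check in a test.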

Testing

  • Added unit tests for RunConfig transcription field handling
  • Added unit tests for Basic flow mapping to LiveConnectConfig

@jinnigu jinnigu force-pushed the feature/inputAudioTranscription branch from f2d2406 to 2ff85c5 Compare September 28, 2025 06:44
@jinnigu jinnigu marked this pull request as ready for review September 28, 2025 07:48
@jinnigu jinnigu force-pushed the feature/inputAudioTranscription branch 4 times, most recently from e19b579 to 2eb051c Compare September 28, 2025 17:12
@jinnigu jinnigu force-pushed the feature/inputAudioTranscription branch from 9418e8b to e884108 Compare October 7, 2025 07:04
@jinnigu jinnigu force-pushed the feature/inputAudioTranscription branch from e884108 to c1da61c Compare October 7, 2025 07:07
Member

@vorburger vorburger left a comment

@jinnigu Thank you for contributing this! Is there any way to illustrate, in this PR, that this actually really 😆 fully works? The unit tests are... well, unit tests. Would a full-blown integration test for this be possible? Or, how would you feel if I invited you to add, as part of this PR, a minimal "MVP" in tutorials/audio with just a super simple LlmAgent (without even any sub-agents) and an AdkWebServer.start(), which would allow us to "see this work in action"? That would be awesome!

Comment on lines -369 to +371
- if (!CollectionUtils.isNullOrEmpty(runConfig.responseModalities())
-     && liveRequestQueue.isPresent()) {
+ if (liveRequestQueue.isPresent() && !this.agent.subAgents().isEmpty()) {
+   // Parity with Python: apply modality defaults and transcription settings
+   // only for multi-agent live scenarios.
Member

@jinnigu The inline comment and the code don't seem to align here? The comment says "apply modality defaults", but the change removes !CollectionUtils.isNullOrEmpty(runConfig.responseModalities()... is that intentional? (It may well be; I'm not entirely sure why this was originally written this way, but it seems worth double-checking.) Also, why would we limit transcription to multi-agent live scenarios only? I would personally love to use this even for a trivial LlmAgent-only use case: you speak to it and get a persistent transcript in your session store, which is very cool! I'd love to use this e.g. in my (personal) https://docs.enola.dev project, but I don't see why it needs to be limited to work only if !this.agent.subAgents().isEmpty().

@vorburger
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully adds inputAudioTranscription support to the Java ADK, achieving feature parity with the Python version. The changes are well-structured, including updates to RunConfig, the Basic flow, and the Runner. A significant improvement is the fix for an unreachable code block in Runner.java, which enhances correctness. The new functionality is also thoroughly covered by unit tests. I have one suggestion to refactor a small portion of the logic in Runner.java to improve maintainability by removing duplicated code. Overall, this is a solid and valuable contribution.
