feat: Add inputAudioTranscription support to Java ADK #463
Conversation
@jinnigu Thank you for contributing this! Is there any way to illustrate that this actually really 😆 fully works, in this PR? The unit tests are... well, unit tests. Would a full-blown integration test for this be possible? Or, how would you feel about it if I invited you to add, as part of this PR, a very (most) simple "MVP" in `tutorials/audio`, with just a super simple `LlmAgent` (without even any sub-agents) and just an `AdkWebServer.start()`, which allows us to "see this work in action"? That would be awesome!
```diff
-    if (!CollectionUtils.isNullOrEmpty(runConfig.responseModalities())
-        && liveRequestQueue.isPresent()) {
+    if (liveRequestQueue.isPresent() && !this.agent.subAgents().isEmpty()) {
+      // Parity with Python: apply modality defaults and transcription settings
+      // only for multi-agent live scenarios.
```
@jinnigu The inline comment and the code don't seem to align here? The "text" says "apply modality defaults", but then this removes `!CollectionUtils.isNullOrEmpty(runConfig.responseModalities())`... is that intentional? (It may well be; I'm not entirely sure why this was originally like this, but it seems worth double-checking.) Also, why would we limit transcription to multi-agent live scenarios only? I would personally love to use this even for a very simple, trivial only-`LlmAgent` use case... you speak to it and get a persistent transcript in your session store; that's very cool! I'd love to use this e.g. in my (personal) https://docs.enola.dev project, but I don't see why it needs to be limited to work only if `!this.agent.subAgents().isEmpty()`.
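The two gating strategies being debated can be sketched with plain Java stubs. Everything below is illustrative only: `RunConfig`, `appliesTranscriptionCurrent`, and `appliesTranscriptionSuggested` are hypothetical stand-ins, not the actual ADK API.

```java
import java.util.Optional;

public class TranscriptionGateSketch {
    // Hypothetical stand-in for the real RunConfig; holds only the one
    // setting relevant to this sketch.
    record RunConfig(Object inputAudioTranscription) {}

    // As reviewed in the PR: transcription is applied only for live runs
    // with at least one sub-agent.
    static boolean appliesTranscriptionCurrent(
            Optional<Object> liveRequestQueue, int subAgentCount, RunConfig config) {
        return liveRequestQueue.isPresent()
                && subAgentCount > 0
                && config.inputAudioTranscription() != null;
    }

    // Reviewer's suggestion: drop the sub-agent requirement, so a trivial
    // single-LlmAgent live session also gets a persistent transcript.
    static boolean appliesTranscriptionSuggested(
            Optional<Object> liveRequestQueue, RunConfig config) {
        return liveRequestQueue.isPresent()
                && config.inputAudioTranscription() != null;
    }

    public static void main(String[] args) {
        RunConfig config = new RunConfig(new Object());
        Optional<Object> queue = Optional.of(new Object());
        // With zero sub-agents, the current gate skips transcription...
        assert !appliesTranscriptionCurrent(queue, 0, config);
        // ...while the suggested gate still applies it.
        assert appliesTranscriptionSuggested(queue, config);
        System.out.println("ok");
    }
}
```

The sketch just makes the difference concrete: the only divergence between the two gates is the `subAgentCount > 0` clause the reviewer is questioning.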
/gemini review
Code Review

This pull request successfully adds `inputAudioTranscription` support to the Java ADK, achieving feature parity with the Python version. The changes are well-structured, including updates to `RunConfig`, the `Basic` flow, and the `Runner`. A significant improvement is the fix for an unreachable code block in `Runner.java`, which enhances correctness. The new functionality is also thoroughly covered by unit tests. I have one suggestion to refactor a small portion of the logic in `Runner.java` to improve maintainability by removing duplicated code. Overall, this is a solid and valuable contribution.
Summary

Adds `inputAudioTranscription` support to the Java ADK to achieve feature parity with Python. When enabled, the live connect config requests model-side transcription of input audio into text, allowing real-time processing of spoken input in live streaming scenarios.

Changes

Core Implementation
- Added an `inputAudioTranscription` field with getter/setter and builder support
- Mapped `RunConfig.inputAudioTranscription` to `LiveConnectConfig.inputTranscription` for model-side transcription

Bug Fix
- Fixed unreachable code in `newInvocationContextForLive()`, where the outer check `!CollectionUtils.isNullOrEmpty(runConfig.responseModalities())` (NOT empty) made the inner check `CollectionUtils.isNullOrEmpty(runConfig.responseModalities())` (IS empty) impossible to reach. This prevented the "default to AUDIO modality" logic from ever executing.

Behavior Alignment with Python
- Applies `inputAudioTranscription` only for live multi-agent runs (when `agent.subAgents()` is non-empty)
- Sets `outputAudioTranscription` when response modalities imply audio usage

Testing
- `RunConfig` transcription field handling
- `Basic` flow mapping to `LiveConnectConfig`