Hi.
LLM -> Llama-3.1-8B-Instruct
The vLLM docs say:
Tool calling in the chat completion API
vLLM supports only named function calling in the chat completion API. The tool_choice options auto and required are not yet supported but on the roadmap.
To use a named function you need to define the function in the tools parameter and call it in the tool_choice parameter.
It is the callers responsibility to prompt the model with the tool information, vLLM will not automatically manipulate the prompt. This may change in the future.
vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the tools parameter.
Please refer to the OpenAI API reference documentation for more information.
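For concreteness, the documented named-function flow I am referring to looks roughly like this with the OpenAI client pointed at a vLLM server (a minimal sketch; the model name, port, and the calculator schema are placeholders for illustration):

from openai import OpenAI

# Placeholder base URL / model; a local vLLM OpenAI-compatible server is assumed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "A simple calculator",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "integer"},
                "b": {"type": "integer"},
                "operator": {"type": "string", "enum": ["+", "-", "*", "/"]},
            },
            "required": ["a", "b", "operator"],
        },
    },
}]

# Named function calling: tool_choice explicitly names the function, and vLLM
# uses guided decoding so the arguments match the JSON schema above.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is 232 - 32?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "calculator"}},
)
print(response.choices[0].message.tool_calls)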
- Can we confirm that the statement above still holds? I see a bunch of related PRs and good progress, so I'd like to be sure.
- Since tool calling without named functions does not work, we can't use agentic AI libraries/frameworks such as AutoGen, correct?
For example, when this code is run (from AutoGen docs):
import os
from autogen import UserProxyAgent, ConversableAgent
from typing import Annotated, Literal

Operator = Literal["+", "-", "*", "/"]

def calculator(a: int, b: int, operator: Annotated[Operator, "operator"]) -> int:
    if operator == "+":
        return a + b
    elif operator == "-":
        return a - b
    elif operator == "*":
        return a * b
    elif operator == "/":
        return int(a / b)
    else:
        raise ValueError("Invalid operator")

# Let's first define the assistant agent that suggests tool calls.
assistant = ConversableAgent(
    name="Assistant",
    system_message="You are a helpful AI assistant. "
    "You can help with simple calculations. "
    "Return 'TERMINATE' when the task is done.",
    llm_config={
        "config_list": [
            {
                "model": "<YOUR MODEL NAME>",
                "api_key": "<API_KEY>",
                "base_url": "<BASE_URL_FOR_LOCAL_LLM>"
            }
        ]
    }
)

# The user proxy agent is used for interacting with the assistant agent
# and executes tool calls.
user_proxy = ConversableAgent(
    name="User",
    llm_config=False,
    is_termination_msg=lambda msg: msg.get("content") is not None and "TERMINATE" in msg["content"],
    human_input_mode="NEVER",
)

# Register the tool signature with the assistant agent.
assistant.register_for_llm(name="calculator", description="A simple calculator")(calculator)

# Register the tool function with the user proxy agent.
user_proxy.register_for_execution(name="calculator")(calculator)

chat_result = user_proxy.initiate_chat(assistant, message="What is (44232 + 13312 / (232 - 32)) * 5?")
it is supposed to produce the following, which actually includes executing the function (I'm only showing part of it):
>>>>>>>> USING AUTO REPLY...
Assistant (to User):
I apologize for the confusion, I seem to have made a mistake. Let me recalculate the expression properly.
First, we need to do the calculations within the brackets. So, calculating (1423 - 123), (32 + 23), and then performing remaining operations.
***** Suggested tool call (call_mx3M3fNOwikFNoqSojDH1jIr): calculator *****
Arguments:
{
"input": {
"a": 1423,
"b": 123,
"operator": "-"
}
}
***************************************************************************
--------------------------------------------------------------------------------
>>>>>>>> EXECUTING FUNCTION calculator...
User (to Assistant):
User (to Assistant):
***** Response from calling tool (call_mx3M3fNOwikFNoqSojDH1jIr) *****
1300
**********************************************************************
But when I run it with my local LLM on the vLLM backend, it does not execute the function; it replies normally instead (again, just a part of it):
>>>>>>>> USING AUTO REPLY...
Assistant (to User):
<|python_tag|>{"name": "calculator", "parameters": {"a": 43998.56, "b": 5, "operator": "*"}}
--------------------------------------------------------------------------------
User (to Assistant):
--------------------------------------------------------------------------------
>>>>>>>> USING AUTO REPLY...
Assistant (to User):
{"name": "calculator", "parameters": {"a": 219994, "b": 5, "operator": "*"}}
- As you can see, the local LLM's response sometimes (actually, most of the time) starts with "<|python_tag|>". This is not specific to AutoGen: I ran into the same behaviour without any 3rd-party framework/library. Even though I tried my best to hide this token by editing some lines in the config JSON files (special_tokens etc.), I failed. Is there any solution to this?
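For now I just strip the prefix client-side before parsing (that is what the slicing in my code further below does); a minimal helper for that would be something like:

# Crude client-side workaround: strip the Llama 3.1 special token before parsing the response.
PYTHON_TAG = "<|python_tag|>"

def strip_python_tag(text: str) -> str:
    return text[len(PYTHON_TAG):] if text.startswith(PYTHON_TAG) else text

print(strip_python_tag('<|python_tag|>{"name": "calculator", "parameters": {"a": 219994, "b": 5, "operator": "*"}}'))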
- My best attempt at integrating auto tool calling with vLLM is the following. I added a "default function" to the tools available to Llama; the model is supposed to call it whenever none of the other tools is appropriate:
{
    "type": "function",
    "function": {
        "name": "default_function",
        "description": "If none of the other functions is needed, simply call this.",
        "parameters": {
            "type": "object",
            "properties": {
                "normal_prompt": {
                    "type": "string",
                    "description": "The prompt the user has typed.",
                }
            },
            "required": ["normal_prompt"],
        },
    },
},
And here is the heart of the code that does what I want. For now I don't actually call the function, but the response is the function call with its full signature, so only the actual invocation is missing. I just want to be sure whether this is the best we can do with vLLM right now:
def send_request_to_llm(chat_history, use_tools=True):
    # Assumes `client` is an OpenAI client pointed at the vLLM server, `TOOLS` is the
    # tools list (including default_function) shown above, and `json` is imported.
    extra_body = {
        "stop_token_ids": [128001, 128008, 128009],  # Ensure this is included
        "temperature": 0.2,
        "top_p": 0.95,
    }
    if use_tools:
        extra_body["guided_decoding_backend"] = "outlines"

    # Prepare the arguments for the streamer call
    streamer_args = {
        "model": "<my_local_model_path>",
        "messages": chat_history,
        "extra_body": extra_body,
        "stream": True,
    }
    if use_tools:
        streamer_args["tools"] = TOOLS
    streamer = client.chat.completions.create(**streamer_args)

    # If use_tools is True, we are getting a function call at the end, so we don't show the stream.
    # Otherwise the response is a normal reply from the model and we want to see the text streaming.
    assistant_response = ""
    for chunk in streamer:
        delta = chunk.choices[0].delta
        if delta.content:
            if not use_tools:
                print(delta.content, end="", flush=True)
            assistant_response += delta.content

    if use_tools:
        if assistant_response.startswith("{"):
            json_object = json.loads(assistant_response)
        elif assistant_response.startswith("<|python_tag|>"):
            json_object = json.loads(assistant_response[len("<|python_tag|>"):])
        # Occasionally, not even the default function is called. (A bug?) Handle it this way:
        else:
            print("-----------------------")
            chat_history.append({"role": "assistant", "content": assistant_response})
            send_request_to_llm(chat_history, use_tools=False)
            print()
            return

        # Fetch all parameters
        params = json_object["parameters"]
        # Format the parameters into a single string
        formatted_params = ", ".join([f"{key}='{value}'" for key, value in params.items()])
        # This is the function call to make, such as multiply(5, 6)
        call_this = f"{json_object['name']}({formatted_params})"

        # This block only runs when none of the tools is needed (so the default function is used)
        if "default_function" in call_this:
            chat_history.append({"role": "assistant", "content": assistant_response})
            send_request_to_llm(chat_history, use_tools=False)
            print()
            return
        else:
            print(assistant_response)
            print(call_this)

    print()
    chat_history.append({"role": "assistant", "content": assistant_response})
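For completeness, a minimal driver loop around it could look like this (the system prompt and exit handling here are made up, not my actual code):

# Hypothetical driver loop; client and TOOLS are set up as above.
chat_history = [
    {"role": "system", "content": "You are a helpful assistant. Use the provided tools when they are relevant."},
]

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break
    chat_history.append({"role": "user", "content": user_input})
    send_request_to_llm(chat_history, use_tools=True)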
Here is an example output. The very last line is the function call to be made after manipulating the model's response:
- And lastly: to incorporate agentic AI workflows when using vLLM, do I have to write everything from scratch? Maybe start from the code above and work my way up? I'd be glad if you could steer me in the right direction.
Many thanks.