Description
System Info
Model - Alibaba-NLP/gte-multilingual-base
Image - text-embeddings-inference:turing-1.5
Azure VM - Standard_NC4as_T4_v3
GPU - Nvidia Tesla T4
AKS version - 1.28.14
OS - Ubuntu 22.04
Command -
command: ["text-embeddings-router"]
args:
  [
    "--model-id", "Alibaba-NLP/gte-multilingual-base",
    "--port", "8080",
    "--max-client-batch-size", "2000",
    "--payload-limit", "200000000",
    "--max-batch-tokens", "260000",
    "--revision", "refs/pr/7",
    "--auto-truncate"
  ]
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
When executing the following request the first time:
POST /v1/embeddings
{
  "input": "test",
  "model": "Alibaba-NLP/gte-multilingual-base"
}
The response is the following:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        -0.055719655,
        0.06356562,
        -0.030253513,
        ...
      ],
      "index": 0
    }
  ],
  "model": "Alibaba-NLP/gte-multilingual-base",
  "usage": {
    "prompt_tokens": 3,
    "total_tokens": 3
  }
}
However, when repeating the same request a second time, I get:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        null,
        null,
        null,
        ...
      ],
      "index": 0
    }
  ],
  "model": "Alibaba-NLP/gte-multilingual-base",
  "usage": {
    "prompt_tokens": 3,
    "total_tokens": 3
  }
}
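The difference between the two responses can be checked mechanically. A minimal sketch, assuming the OpenAI-style response shape shown above (the helper name and the truncated sample payloads are mine, not from the server):

```python
import json

def count_null_components(response: dict) -> int:
    """Count null entries across all embedding vectors in an
    OpenAI-style /v1/embeddings response body."""
    return sum(
        1
        for item in response.get("data", [])
        for component in item["embedding"]
        if component is None  # JSON null parses to Python None
    )

# Truncated versions of the two responses from the issue.
first = json.loads(
    '{"object": "list", "data": [{"object": "embedding",'
    ' "embedding": [-0.055719655, 0.06356562, -0.030253513], "index": 0}]}'
)
second = json.loads(
    '{"object": "list", "data": [{"object": "embedding",'
    ' "embedding": [null, null, null], "index": 0}]}'
)

print(count_null_components(first))   # 0 nulls on the first request
print(count_null_components(second))  # 3 nulls on the repeated request
```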
I tried setting USE_FLASH_ATTENTION=False; however, it seems that this environment variable is ignored for GTE models. I understand that Turing support is marked as experimental, but is there any way to run this on a T4, with or without Flash Attention v1?
Expected behavior
The embedding vector should be returned on every request, rather than nulls.