
Nulls instead of vector for Alibaba-NLP/gte-multilingual-base on T4 GPU #439

@superchar

Description

System Info

Model - Alibaba-NLP/gte-multilingual-base
Image - text-embeddings-inference:turing-1.5
Azure VM - Standard_NC4as_T4_v3
GPU - Nvidia Tesla T4
AKS version - 1.28.14
OS - Ubuntu 22.04
Command -

command: ["text-embeddings-router"]
args:
  [
    "--model-id", "Alibaba-NLP/gte-multilingual-base",
    "--port", "8080",
    "--max-client-batch-size", "2000",
    "--payload-limit", "200000000",
    "--max-batch-tokens", "260000",
    "--revision", "refs/pr/7",
    "--auto-truncate"
  ]
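
For reference, the same flags as a standalone Docker invocation (a sketch, not the exact deployment: it assumes the standard ghcr.io image tag and a host-side model cache directory, so adjust paths and the published port to your setup):

```shell
# Sketch: run TEI with the same arguments as the Kubernetes container spec above.
# The image path and the /data volume mount are assumptions, not from the report.
docker run --gpus all -p 8080:8080 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.5 \
  --model-id Alibaba-NLP/gte-multilingual-base \
  --revision refs/pr/7 \
  --port 8080 \
  --max-client-batch-size 2000 \
  --payload-limit 200000000 \
  --max-batch-tokens 260000 \
  --auto-truncate
```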

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When I execute the following request for the first time:

POST /v1/embeddings
{
  "input": "test",
  "model": "Alibaba-NLP/gte-multilingual-base"
}

The response is the following:

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "embedding": [
                -0.055719655,
                0.06356562,
                -0.030253513
                ......................
            ],
            "index": 0
        }
    ],
    "model": "Alibaba-NLP/gte-multilingual-base",
    "usage": {
        "prompt_tokens": 3,
        "total_tokens": 3
    }
}

However, when I repeat the same request a second time, I get:

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "embedding": [
                null,
                null,
                null
                ......................
            ],
            "index": 0
        }
    ],
    "model": "Alibaba-NLP/gte-multilingual-base",
    "usage": {
        "prompt_tokens": 3,
        "total_tokens": 3
    }
}

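To make the regression easy to check, here is a minimal Python sketch (helper names are mine, and the endpoint URL assumes a local deployment on port 8080) that sends the same embedding request and flags null entries in the response:

```python
import json
import urllib.request

# Assumed local TEI endpoint; adjust host/port to your deployment.
TEI_URL = "http://localhost:8080/v1/embeddings"


def has_null_embedding(response_json):
    """Return True if any embedding in an OpenAI-style response contains null/None."""
    return any(
        value is None
        for item in response_json.get("data", [])
        for value in item.get("embedding", [])
    )


def embed_once(text):
    """POST a single embedding request and return the parsed JSON response."""
    payload = json.dumps(
        {"input": text, "model": "Alibaba-NLP/gte-multilingual-base"}
    ).encode("utf-8")
    request = urllib.request.Request(
        TEI_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

When the bug reproduces, calling `embed_once("test")` twice and passing each result to `has_null_embedding` shows `False` on the first attempt and `True` on the second.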
I tried setting USE_FLASH_ATTENTION=False, but this environment variable appears to be ignored for GTE models. I understand that Turing support is marked as experimental, but is there any way to run this model on a T4, with or without Flash Attention v1?

Expected behavior

The same request should return a valid embedding vector on every call, not nulls.
