
Running TEI model on CPU fails (says CUDA f16 and flash attention are required) #431

@Astlaan

Description


System Info

OS: Windows 11
Rust version: cargo 1.75.0 (1d8b05cdd 2023-11-20)
Hardware: CPU AMD 6800HS

(text-generation-launcher --env didn't work)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Hi,
I am trying to run a model locally on the CPU, since I only have an AMD GPU, which is apparently not yet supported.

  1. I followed the instructions here: https://huggingface.co/docs/text-embeddings-inference/local_cpu
  2. I ran this command:
text-embeddings-router --model-id dunzhang/stella_en_400M_v5 --port 8080
  3. I get this error:
2024-10-25T21:52:54.872449Z  INFO text_embeddings_router: router\src/main.rs:175: Args { model_id: "dun*****/******_**_***M_v5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-10-25T21:52:54.875192Z  INFO hf_hub: C:\Users\user\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hf-hub-0.3.2\src\lib.rs:55: Token file not found "C:\\Users\\user\\.cache\\huggingface\\token"
2024-10-25T21:52:54.875404Z  INFO download_pool_config: text_embeddings_core::download: core\src\download.rs:38: Downloading `1_Pooling/config.json`
2024-10-25T21:52:54.875746Z  INFO download_new_st_config: text_embeddings_core::download: core\src\download.rs:62: Downloading `config_sentence_transformers.json`
2024-10-25T21:52:54.875919Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:21: Starting download
2024-10-25T21:52:54.876003Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:23: Downloading `config.json`
2024-10-25T21:52:54.876215Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:26: Downloading `tokenizer.json`
2024-10-25T21:52:54.876393Z  INFO download_artifacts: text_embeddings_backend: backends\src\lib.rs:328: Downloading `model.safetensors`
2024-10-25T21:52:54.876567Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:32: Model artifacts downloaded in 647.4µs
2024-10-25T21:52:54.886413Z  INFO text_embeddings_router: router\src/lib.rs:206: Maximum number of tokens per request: 512
2024-10-25T21:52:54.886730Z  INFO text_embeddings_core::tokenization: core\src\tokenization.rs:28: Starting 16 tokenization workers
2024-10-25T21:52:54.930092Z  INFO text_embeddings_router: router\src/lib.rs:248: Starting model backend
Error: Could not create backend

Caused by:
    Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled

It's asking for very specific GPU resources, even though I'm trying to run on the CPU.
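For comparison, a sketch of a possible workaround (my own, not confirmed in this thread): the error mentions GTE specifically, so the restriction may be tied to the GTE architecture that stella_en_400M_v5 uses rather than to TEI's CPU support in general. A model with a plain BERT backbone, such as the BGE model used in the linked local_cpu docs, might start on CPU. The model ID below is an assumption taken from those docs:

```shell
# Hypothetical sketch: try a BERT-based model, which may not require the
# CUDA/fp16/flash-attention code path that the GTE backend demands.
text-embeddings-router --model-id BAAI/bge-large-en-v1.5 --port 8080

# If the backend starts, an embedding request against it would look like:
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "test sentence"}' \
    -H 'Content-Type: application/json'
```

If this starts, it would narrow the problem to the GTE backend rather than CPU support as a whole.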

Expected behavior

Would expect the model to work :)
