
Running TEI model on CPU fails (says CUDA f16 and flash attention are required) #431

@Astlaan

Description


System Info

OS: Windows 11
Rust version: cargo 1.75.0 (1d8b05cdd 2023-11-20)
Hardware: CPU AMD 6800HS

(text-generation-launcher --env didn't work)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Hi,
I am trying to run a model locally on the CPU, since I only have an AMD GPU, which is apparently not yet supported.

  1. I followed the instructions here: https://huggingface.co/docs/text-embeddings-inference/local_cpu
  2. I ran this command:
text-embeddings-router --model-id dunzhang/stella_en_400M_v5 --port 8080
  3. I get this error:
2024-10-25T21:52:54.872449Z  INFO text_embeddings_router: router\src/main.rs:175: Args { model_id: "dun*****/******_**_***M_v5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-10-25T21:52:54.875192Z  INFO hf_hub: C:\Users\user\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hf-hub-0.3.2\src\lib.rs:55: Token file not found "C:\\Users\\user\\.cache\\huggingface\\token"
2024-10-25T21:52:54.875404Z  INFO download_pool_config: text_embeddings_core::download: core\src\download.rs:38: Downloading `1_Pooling/config.json`
2024-10-25T21:52:54.875746Z  INFO download_new_st_config: text_embeddings_core::download: core\src\download.rs:62: Downloading `config_sentence_transformers.json`
2024-10-25T21:52:54.875919Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:21: Starting download
2024-10-25T21:52:54.876003Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:23: Downloading `config.json`
2024-10-25T21:52:54.876215Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:26: Downloading `tokenizer.json`
2024-10-25T21:52:54.876393Z  INFO download_artifacts: text_embeddings_backend: backends\src\lib.rs:328: Downloading `model.safetensors`
2024-10-25T21:52:54.876567Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:32: Model artifacts downloaded in 647.4µs
2024-10-25T21:52:54.886413Z  INFO text_embeddings_router: router\src/lib.rs:206: Maximum number of tokens per request: 512
2024-10-25T21:52:54.886730Z  INFO text_embeddings_core::tokenization: core\src\tokenization.rs:28: Starting 16 tokenization workers
2024-10-25T21:52:54.930092Z  INFO text_embeddings_router: router\src/lib.rs:248: Starting model backend
Error: Could not create backend

Caused by:
    Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled

It's asking for very specific GPU resources, even though I'm trying to run on the CPU.
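For comparison, a sketch of a possible workaround (my own, not confirmed in this thread): the error mentions GTE specifically, so the restriction may be tied to the GTE architecture that stella_en_400M_v5 uses rather than to TEI's CPU support in general. A model with a plain BERT backbone, such as the BGE model used in the linked local_cpu docs, might start on CPU. The model ID below is an assumption taken from those docs:

```shell
# Hypothetical sketch: try a BERT-based model, which may not require the
# CUDA/fp16/flash-attention code path that the GTE backend demands.
text-embeddings-router --model-id BAAI/bge-large-en-v1.5 --port 8080

# If the backend starts, an embedding request against it would look like:
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "test sentence"}' \
    -H 'Content-Type: application/json'
```

If this starts, it would narrow the problem to the GTE backend rather than CPU support as a whole.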

Expected behavior

Would expect the model to work :)
