Description
System Info
OS: Windows 11
Rust version: cargo 1.75.0 (1d8b05cdd 2023-11-20)
Hardware: CPU AMD 6800HS
(text-generation-launcher --env didn't work)
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Hi,
I am trying to run a model locally on the CPU, since I only have an AMD GPU, which apparently isn't supported yet.
- I followed the instructions here: https://huggingface.co/docs/text-embeddings-inference/local_cpu
- I tried to run this:
text-embeddings-router --model-id dunzhang/stella_en_400M_v5 --port 8080
- I get this error:
2024-10-25T21:52:54.872449Z INFO text_embeddings_router: router\src/main.rs:175: Args { model_id: "dun*****/******_**_***M_v5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-10-25T21:52:54.875192Z INFO hf_hub: C:\Users\user\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hf-hub-0.3.2\src\lib.rs:55: Token file not found "C:\\Users\\user\\.cache\\huggingface\\token"
2024-10-25T21:52:54.875404Z INFO download_pool_config: text_embeddings_core::download: core\src\download.rs:38: Downloading `1_Pooling/config.json`
2024-10-25T21:52:54.875746Z INFO download_new_st_config: text_embeddings_core::download: core\src\download.rs:62: Downloading `config_sentence_transformers.json`
2024-10-25T21:52:54.875919Z INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:21: Starting download
2024-10-25T21:52:54.876003Z INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:23: Downloading `config.json`
2024-10-25T21:52:54.876215Z INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:26: Downloading `tokenizer.json`
2024-10-25T21:52:54.876393Z INFO download_artifacts: text_embeddings_backend: backends\src\lib.rs:328: Downloading `model.safetensors`
2024-10-25T21:52:54.876567Z INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:32: Model artifacts downloaded in 647.4µs
2024-10-25T21:52:54.886413Z INFO text_embeddings_router: router\src/lib.rs:206: Maximum number of tokens per request: 512
2024-10-25T21:52:54.886730Z INFO text_embeddings_core::tokenization: core\src\tokenization.rs:28: Starting 16 tokenization workers
2024-10-25T21:52:54.930092Z INFO text_embeddings_router: router\src/lib.rs:248: Starting model backend
Error: Could not create backend
Caused by:
Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled
The backend is demanding very specific GPU resources (CUDA, fp16, flash attention), even though I'm explicitly trying to run on the CPU.
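To illustrate the failure mode, here is a hypothetical sketch (not TEI's actual Rust code) of the kind of dispatch that produces this error: the router reads the model's `config.json`, and GTE-family architectures are rejected unless running on CUDA with fp16 and flash attention. The architecture name `NewModel` and the backend labels are assumptions for this sketch; `stella_en_400M_v5` is GTE-based.

```python
# Hypothetical sketch of TEI's backend selection (illustration only, not the
# real implementation in backends/candle).
def select_backend(config: dict, device: str, dtype: str, flash_attention: bool) -> str:
    arch = config.get("architectures", [""])[0]
    # stella_en_400M_v5 is GTE-based; assume its config reports a
    # GTE-family architecture such as "NewModel" (assumption for this sketch).
    if "gte" in arch.lower() or arch == "NewModel":
        if not (device == "cuda" and dtype == "float16" and flash_attention):
            raise RuntimeError(
                "GTE is only supported on Cuda devices in fp16 "
                "with flash attention enabled"
            )
        return "gte-cuda-backend"
    # Plain BERT-style architectures fall through to the CPU backend.
    return "candle-cpu-backend"


# Reproduces the failure mode from the log above: CPU, fp32, no flash attention.
try:
    select_backend({"architectures": ["NewModel"]}, "cpu", "float32", False)
except RuntimeError as e:
    print(f"Error: {e}")
```

So the rejection happens purely on the reported architecture, before any weights are loaded.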
Expected behavior
Would expect the model to work :)
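For comparison (my assumption, not something verified in this report), a plain BERT-style embedding model should start on the CPU with the same command, which would confirm the failure is specific to the GTE architecture:

```shell
# BAAI/bge-base-en-v1.5 is used here only as an example of a BERT-style
# architecture that TEI's CPU backend is expected to support.
text-embeddings-router --model-id BAAI/bge-base-en-v1.5 --port 8080
```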