
Conversation

pi314ever

What does this PR do?

Minor changes:

  • Proper formatting with cargo fmt and ruff format
  • Refactor warmup_hpu to drop the rand dependency and use max_batch_token properly

Bug fix:

  • Resolved an import loop in Python where Model was incorrectly imported in default_model.py.


yao-matrix commented Aug 30, 2024

@pi314ever, I'd prefer not to add reformatting changes to this PR, since this PR is mainly about adding the multi-backend design, not reformatting the TEI coding style; mixing things in one PR does not comply with the "one PR, one purpose" rule. If you find a bug that needs fixing and was introduced by the multi-backend design, please append a bug-fix PR to this one.

You can submit a separate reformatting PR to the TEI repo once this multi-backend PR is merged.

@@ -29,8 +29,6 @@ tracing = "0.1"
serde = { version = "1.0", features = ["serde_derive"] }
serde_json = "1.0"
thiserror = "1.0"
rand = "0.8"


Let's keep it.

@@ -15,7 +15,6 @@ text-embeddings-backend-candle = { path = "candle", optional = true }
text-embeddings-backend-ort = { path = "ort", optional = true }
tokio = { workspace = true }
tracing = { workspace = true }
rand = { workspace = true }


Keeping it is OK.

@@ -40,4 +40,4 @@ typer==0.6.1 ; python_version >= "3.9" and python_version < "3.13"
typing-extensions==4.7.1 ; python_version >= "3.9" and python_version < "3.13"
urllib3==2.0.4 ; python_version >= "3.9" and python_version < "3.13"
win32-setctime==1.1.0 ; python_version >= "3.9" and python_version < "3.13" and sys_platform == "win32"
wrapt==1.15.0 ; python_version >= "3.9" and python_version < "3.13"


LGTM

except ImportError as e:
logger.warning(f"Could not import Flash Attention enabled models: {e}")
FLASH_ATTENTION = False

if FLASH_ATTENTION:
__all__.append(FlashBert)



Let's keep it.

from text_embeddings_server.utils.device import get_device, use_ipex


Let's keep it

from optimum.habana.transformers.modeling_utils import (
adapt_transformers_to_gaudi,
)

adapt_transformers_to_gaudi()


LGTM

from text_embeddings_server.models.types import PaddedBatch, Embedding

tracer = trace.get_tracer(__name__)


LGTM

residual is not None
)
residual is not None,
)


LGTM

@@ -269,6 +273,7 @@ def __init__(self, model_path: Path, device: torch.device, dtype: torch.dtype):
model = FlashBertModel(f, device, dtype, config)
if device.type == "hpu":
from habana_frameworks.torch.hpu import wrap_in_hpu_graph

model = wrap_in_hpu_graph(model, disable_tensor_cache=False)
self.hidden_size = config.hidden_size


LGTM


from text_embeddings_server.models.types import Batch, Embedding

B = TypeVar("B", bound=Batch)


class Model(ABC):
class Model(ABC, Generic[B]):


Let's keep it. Do not use this.

@@ -8,7 +8,7 @@
from pathlib import Path
from typing import Optional

from text_embeddings_server.models import Model, get_model
from text_embeddings_server.models import get_model, Model


LGTM

@@ -27,6 +32,7 @@ def get_major_and_minor_from_version(full_version):
return False
return True


Above is OK.

def use_ipex() -> bool:
value = os.environ.get("USE_IPEX", "True").lower()
return (value in ["true", "1"] and _is_ipex_available())
return value in ["true", "1"] and _is_ipex_available()


Adding the () is better; do not change it.

device = torch.device("hpu")
elif use_ipex():
import intel_extension_for_pytorch as ipex

if hasattr(torch, "xpu") and torch.xpu.is_available():
device = torch.device("xpu")


LGTM

max_s,
softmax_scale,
is_causal=False,
)


LGTM

.output()
{
Ok(output) => output.status.success(),
Err(_) => false,


Do not change it

@pi314ever
Author

@yao-matrix I agree with "one PR, one purpose", but I believe the code should follow the formatting guidelines already used by the upstream repository. All of the changes in this PR deal with code we are adding in support of multi-backend.

The change to exclude the rand package resolves an inconsistency in how max_batch_tokens is interpreted. In the upstream warmup function it defines the maximum number of tokens allowed in each batch, but we were using it as the maximum token ID. Removing the rand package makes our warmup consistent with the upstream version.

The bug fix addresses an import loop in the default model and is a single-line change.
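
For context, here is a minimal sketch of the two interpretations being discussed; the helper names and the zero-fill are illustrative assumptions, not the actual TEI code:

// Intended semantics: max_batch_tokens caps the total token budget of one batch.
fn fits_in_batch(batch_size: usize, seq_len: usize, max_batch_tokens: usize) -> bool {
    batch_size * seq_len <= max_batch_tokens
}

// Previous (inconsistent) use: treating the value as an upper bound for random
// token IDs, roughly `rand::thread_rng().gen_range(0..max_batch_tokens)` per position.
// After the refactor: deterministic zero-filled dummy inputs, with no rand dependency.
fn warmup_input_ids(length: usize) -> Vec<u32> {
    vec![0; length]
}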

@@ -99,20 +99,19 @@ impl Backend {
#[instrument(skip(self))]
pub async fn warmup_hpu(
&self,
mut max_input_length: usize,
max_token: usize,


max_input_length may be changed in this function.

max_batch_token
)
)
);


LGTM. Adding this check is better.

let max_input_length = std::cmp::min(max_input_length, max_warmup_length);
let mut seq_lengths: Vec<usize> = (seq_bucket_size..max_input_length + 1)
.step_by(seq_bucket_size)
.collect();


LGTM

@@ -146,7 +160,7 @@ impl Backend {
}
}
for shape in shapes.iter() {
let batch = self.create_warmup_batch(*shape, max_token as u32);
let batch = self.create_warmup_batch(*shape);


Let's keep it

let input_ids: Vec<u32> = (0..length).map(|_| rand::thread_rng().gen_range(0..max_token)).collect();
let token_type_ids: Vec<u32> = vec![0; length as usize];
let input_ids = vec![0; length as usize];
let token_type_ids = vec![0; length as usize];


Let's keep it.

return self.warmup_hpu(max_input_length, max_batch_tokens, max_batch_requests).await;
return self
.warmup_hpu(max_input_length, max_batch_tokens, max_batch_requests)
.await;
}


LGTM

@kaixuanliu

@pi314ever, hi, I added some comments. Some of the code-style reformatting is good and looks better. However, I have a few suggestions:

  1. It is not recommended to rely on heavy extension tools to check code style; some changes like this one are not necessary: L11
  2. In this PR, it is better not to change original code unless necessary; we can submit a new PR to fix it. Example: R22
  3. As for the warmup logic, I want to use random input here instead of all zeros, and I need to set a reasonable (not too large) upper bound for the rand::thread_rng().gen_range API. It is not a big problem; we can get feedback from reviewers (see the sketch after this list).
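
A rough sketch of the randomized warmup input described in point 3; the bound constant is a placeholder assumption, not a value from the TEI code base:

use rand::Rng;

// Placeholder upper bound for dummy token IDs during warmup; anything safely
// below the model's vocabulary size would work, and the exact value is open
// to reviewer feedback.
const WARMUP_TOKEN_ID_BOUND: u32 = 1_000;

// Random warmup token IDs, as suggested above, instead of all zeros.
fn random_warmup_input_ids(length: usize) -> Vec<u32> {
    let mut rng = rand::thread_rng();
    (0..length)
        .map(|_| rng.gen_range(0..WARMUP_TOKEN_ID_BOUND))
        .collect()
}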

@pi314ever
Author

Closing this PR in favor of #8 as the bug is a non-issue.

pi314ever closed this Aug 30, 2024