
How to run

Duyi-Wang edited this page Jul 11, 2025 · 7 revisions

Hybrid data type with custom weight location

How to use hybrid data type with custom weight location:

  • Change the data type to "bf16_fp16" in demo.py, which uses BF16 weights for the first token and FP16 weights for subsequent tokens.
  • Set the environment variables "FIRST_TOKEN_WEIGHT_LOCATION" and "NEXT_TOKEN_WEIGHT_LOCATION", using the NUMA node ID as the value of each.

Example: FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=1 SINGLE_INSTANCE=1 OMP_NUM_THREADS=32 taskset -c 56-87 python demo.py
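The same setup can also be prepared from Python before launching demo.py. The following is only a minimal sketch: the helper name `hybrid_weight_env` is hypothetical and not part of xFasterTransformer. It assumes nothing beyond the two variable names given above and simply builds an environment dictionary that can be passed to a child process.

```python
import os


def hybrid_weight_env(first_node, next_node, base_env=None):
    """Return a copy of the environment with both weight-location variables set.

    hybrid_weight_env is a hypothetical helper for illustration only.
    first_node / next_node are NUMA node IDs, as described above.
    """
    if first_node < 0 or next_node < 0:
        raise ValueError("NUMA node IDs must be non-negative")
    # Copy so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["FIRST_TOKEN_WEIGHT_LOCATION"] = str(first_node)
    env["NEXT_TOKEN_WEIGHT_LOCATION"] = str(next_node)
    return env


if __name__ == "__main__":
    # Start from an empty base so only the two new variables are printed.
    print(hybrid_weight_env(0, 1, base_env={}))
```

To launch with these settings, the returned dictionary could be passed as `env=` to `subprocess.run(["python", "demo.py"], env=hybrid_weight_env(0, 1), check=True)`. Thread pinning (taskset or numactl) would still need to be applied separately.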

Performance on Intel(R) Xeon(R) CPU Max 9468 with the command: FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=8 OMP_NUM_THREADS=8 numactl -C 0-7 -m 8 ./example /data/chatglm2-6b-cpu 3 1024

Figures: first token latency; next token latency.

For distributed performance testing with hybrid precision:

FIRST_TOKEN_WEIGHT_LOCATION=1 NEXT_TOKEN_WEIGHT_LOCATION=3 OMP_NUM_THREADS=20 mpirun \
    -n 1 numactl -N 1 -m 3 python demo.py --dtype=bf16_fp16 --token_path /data/chatglm2-6b-hf/ --model_path /data/chatglm2-6b-cpu/ --streaming False : \
    -n 1 numactl -N 1 -m 3 python demo.py --dtype=bf16_fp16 --token_path /data/chatglm2-6b-hf/ --model_path /data/chatglm2-6b-cpu/ --streaming False

export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
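The export line above passes the printed result of `xft.get_env()` to the shell as a list of assignments. When the settings are needed inside a Python process instead, something like the following sketch could be used. It assumes the printed text is a whitespace-separated list of KEY=VALUE tokens, which is inferred from the `export $(...)` usage and not confirmed here. `parse_env_assignments` is a hypothetical helper, and the sample string contains made-up variable names.

```python
import os


def parse_env_assignments(text):
    """Split whitespace-separated KEY=VALUE tokens into a dict.

    Hypothetical helper; the KEY=VALUE format is an assumption.
    """
    result = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"not a KEY=VALUE token: {token!r}")
        result[key] = value
    return result


# Made-up sample standing in for the real output of xft.get_env().
sample = "EXAMPLE_VAR_A=1 EXAMPLE_VAR_B=off"
settings = parse_env_assignments(sample)

# Affects only the current process and its children, not the parent shell.
os.environ.update(settings)
```

With xfastertransformer installed, `sample` would be replaced by `str(xft.get_env())`. Libraries that read their settings at import time may need the variables to be set before they are imported.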

ENV for xDNN

XDNN_N64  for all gemv, but is overridden by the variables below
XDNN_N64_N  for normal gemv
XDNN_N64_NR  for normal gemv with residual
XDNN_N64_A  for batchA gemv
N64Flag_AR  for batchA gemv with residual
N64Flag_C  for batchC gemv
N64Flag_CR  for batchC gemv with residual

If a variable is > 0, the N64 version of the kernel is used.
If a variable is < 0, the N16 version of the kernel is used.
If a variable is 0, the default dispatch method is used.
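The sign rule above can be illustrated with a short sketch. This is not xDNN's actual dispatch code, only a model of the rule as described. It assumes that a kernel-specific variable, when set, takes precedence over XDNN_N64, and that an unset variable behaves like 0. The function names are hypothetical; the variable names are copied from the list above.

```python
import os

# Variable names copied verbatim from the list above.
SPECIFIC_VARS = (
    "XDNN_N64_N", "XDNN_N64_NR", "XDNN_N64_A",
    "N64Flag_AR", "N64Flag_C", "N64Flag_CR",
)


def kernel_choice(value):
    """Map a variable's integer value to the kernel variant per the sign rule."""
    if value > 0:
        return "N64"
    if value < 0:
        return "N16"
    return "default"


def resolve_kernel(var_name, env=None):
    """Resolve the variant for one gemv type (hypothetical helper).

    Assumption: the specific variable wins over XDNN_N64 when both are set,
    and an unset variable is treated as 0.
    """
    env = os.environ if env is None else env
    raw = env.get(var_name, env.get("XDNN_N64", "0"))
    return kernel_choice(int(raw))


if __name__ == "__main__":
    for name in SPECIFIC_VARS:
        print(name, resolve_kernel(name))
```

For example, with `XDNN_N64=1` and `XDNN_N64_N=-1`, this model picks the N16 kernel for normal gemv and the N64 kernel for every other type.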
