How to run
How to use the hybrid data type with custom weight locations:
- Change the data type to "bf16_fp16" in demo.py, which uses BF16 weights for the first token and FP16 weights for the subsequent tokens.
- Set the environment variables "FIRST_TOKEN_WEIGHT_LOCATION" and "NEXT_TOKEN_WEIGHT_LOCATION", using a NUMA node ID as the value of each.
Example: FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=1 SINGLE_INSTANCE=1 OMP_NUM_THREADS=32 taskset -c 56-87 python demo.py
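The same launch can be scripted. The following is a minimal sketch, assuming demo.py sits in the current directory and reads these variables from its environment; the helper name `hybrid_env` is hypothetical and is not part of xFasterTransformer.

```python
import os


def hybrid_env(first_node, next_node, threads, base=None):
    """Return a copy of the environment with the hybrid-weight variables set.

    hybrid_env is a hypothetical helper, not an xFasterTransformer API.
    """
    if first_node < 0 or next_node < 0:
        raise ValueError("NUMA node IDs must be non-negative")
    env = dict(os.environ if base is None else base)
    env["FIRST_TOKEN_WEIGHT_LOCATION"] = str(first_node)  # node for first-token weights
    env["NEXT_TOKEN_WEIGHT_LOCATION"] = str(next_node)    # node for next-token weights
    env["SINGLE_INSTANCE"] = "1"
    env["OMP_NUM_THREADS"] = str(threads)
    return env


# Same settings as the example command above.
env = hybrid_env(first_node=0, next_node=1, threads=32)
command = ["taskset", "-c", "56-87", "python", "demo.py"]
# To launch: import subprocess; subprocess.run(command, env=env, check=True)
```

Building the environment as a dictionary keeps the settings out of the parent shell, so different node assignments can be tried in one script without leftover exports.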
Performance on Intel(R) Xeon(R) CPU Max 9468 with the command:
FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=8 OMP_NUM_THREADS=8 numactl -C 0-7 -m 8 ./example /data/chatglm2-6b-cpu 3 1024
First token latency: Next token latency:
For distributed performance testing with mixed precision:
FIRST_TOKEN_WEIGHT_LOCATION=1 NEXT_TOKEN_WEIGHT_LOCATION=3 OMP_NUM_THREADS=20 mpirun \
-n 1 numactl -N 1 -m 3 python demo.py --dtype=bf16_fp16 --token_path /data/chatglm2-6b-hf/ --model_path /data/chatglm2-6b-cpu/ --streaming False : \
-n 1 numactl -N 1 -m 3 python demo.py --dtype=bf16_fp16 --token_path /data/chatglm2-6b-hf/ --model_path /data/chatglm2-6b-cpu/ --streaming False
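The multi-rank command above repeats the same program once per rank, separated by a colon. The following is a minimal sketch of assembling that argument list in Python; `build_mpirun_command` is a hypothetical helper, and the (CPU node, memory node) pairs are copied from the command above.

```python
def build_mpirun_command(bindings, program_args):
    """Build an mpirun argv with one rank per (cpu_node, mem_node) pair.

    build_mpirun_command is a hypothetical helper for illustration only.
    """
    if not bindings:
        raise ValueError("at least one rank binding is required")
    argv = ["mpirun"]
    for index, (cpu_node, mem_node) in enumerate(bindings):
        if index > 0:
            argv.append(":")  # a colon separates the per-rank segments
        argv += ["-n", "1", "numactl", "-N", str(cpu_node), "-m", str(mem_node)]
        argv += list(program_args)
    return argv


program = [
    "python", "demo.py", "--dtype=bf16_fp16",
    "--token_path", "/data/chatglm2-6b-hf/",
    "--model_path", "/data/chatglm2-6b-cpu/",
    "--streaming", "False",
]
# Two ranks, each bound with "numactl -N 1 -m 3" as in the command above.
argv = build_mpirun_command([(1, 3), (1, 3)], program)
# To launch: import subprocess; subprocess.run(argv, env=..., check=True)
```

The weight-location variables and OMP_NUM_THREADS are not part of this list; they would be passed through the environment of the launching process.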
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
XDNN_N64 applies to all gemv kernels, but is overridden by the variables below
XDNN_N64_N for normal gemv
XDNN_N64_NR for normal gemv with residual
XDNN_N64_A for batchA gemv
N64Flag_AR for batchA gemv with residual
N64Flag_C for batchC gemv
N64Flag_CR for batchC gemv with residual
If a variable is > 0, the N64 version of the kernel is used
If a variable is < 0, the N16 version of the kernel is used
If a variable is == 0, the default dispatch method is used
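The sign rule above can be illustrated with a short sketch. This is not the library's dispatch code; `select_gemv_kernel` is a hypothetical function. It assumes that a per-variant variable present in the environment takes precedence over XDNN_N64, and that unset or non-integer values count as 0.

```python
import os


def select_gemv_kernel(specific_var, env=None, global_var="XDNN_N64"):
    """Return "N64", "N16", or "default" for one gemv variant.

    Assumption: a per-variant variable that is set overrides the global one;
    unset or non-integer values are treated as 0.
    """
    env = os.environ if env is None else env

    def read(name):
        try:
            return int(env.get(name, "0"))
        except ValueError:
            return 0

    value = read(specific_var) if specific_var in env else read(global_var)
    if value > 0:
        return "N64"
    if value < 0:
        return "N16"
    return "default"


# XDNN_N64=1 selects N64 everywhere, but XDNN_N64_N=-1 overrides it for normal gemv.
settings = {"XDNN_N64": "1", "XDNN_N64_N": "-1"}
normal = select_gemv_kernel("XDNN_N64_N", env=settings)
batch_a = select_gemv_kernel("XDNN_N64_A", env=settings)
```

With these settings, `normal` resolves to the N16 kernel and `batch_a` falls back to the global value and resolves to N64.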