
Commit 4261836

add sq doc

Signed-off-by: Cheng, Zixuan <[email protected]>
2 parents 6a828ae + 004af16

File tree: 36 files changed (+845 / -293 lines)

.azure-pipelines/scripts/ut/env_setup.sh

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ elif [[ $(echo "${test_case}" | grep -c "tf pruning") != 0 ]]; then
 fi
 
 if [[ $(echo "${test_case}" | grep -c "api") != 0 ]] || [[ $(echo "${test_case}" | grep -c "adaptor") != 0 ]]; then
-    pip install auto-round
+    pip install git+https://github.com/intel/auto-round.git@ecca5349981044e1278773a251b3fc5c0a11fe7b
 fi
 
 # test deps

docs/source/3x/PT_MXQuant.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
Microscaling Quantization
===============

1. [Introduction](#introduction)
2. [Get Started with Microscaling Quantization API](#get-started-with-microscaling-quantization-api)
3. [Examples](#examples)
4. [Reference](#reference)

## Introduction

Numerous breakthroughs have emerged across various fields, such as text analysis, language translation, and chatbot technologies, fueled by the development of large language models (LLMs). Nevertheless, their increasing power comes with the challenge of explosive growth in parameters, which poses obstacles to practical deployment. To balance memory limits against accuracy preservation for AI models, the Microscaling (MX) specification was promoted from the well-known Microsoft Floating Point (MSFP) data type [1, 2]:

<table>
<tr>
<th>Format Name</th>
<th>Element Data type</th>
<th>Element Bits</th>
<th>Scaling Block Size</th>
<th>Scale Data Type</th>
<th>Scale Bits</th>
</tr>
<tr>
<td rowspan="2">MXFP8</td>
<td>FP8 (E5M2)</td>
<td rowspan="2">8</td>
<td rowspan="2">32</td>
<td rowspan="2">E8M0</td>
<td rowspan="2">8</td>
</tr>
<tr>
<td>FP8 (E4M3)</td>
</tr>
<tr>
<td rowspan="2">MXFP6</td>
<td>FP6 (E3M2)</td>
<td rowspan="2">6</td>
<td rowspan="2">32</td>
<td rowspan="2">E8M0</td>
<td rowspan="2">8</td>
</tr>
<tr>
<td>FP6 (E2M3)</td>
</tr>
<tr>
<td>MXFP4</td>
<td>FP4 (E2M1)</td>
<td>4</td>
<td>32</td>
<td>E8M0</td>
<td>8</td>
</tr>
<tr>
<td>MXINT8</td>
<td>INT8</td>
<td>8</td>
<td>32</td>
<td>E8M0</td>
<td>8</td>
</tr>
</table>

At an equivalent accuracy level, the MX data type occupies a smaller area and incurs lower energy costs for multiply-accumulate operations than conventional data types on the same silicon [1].

Neural Compressor seamlessly applies the MX data type to post-training quantization, offering carefully crafted recipes that let users quantize LLMs without sacrificing accuracy. The workflow is shown below.

<a target="_blank" href="../imgs/mx_workflow.png" text-align:left>
  <left>
    <img src="../imgs/mx_workflow.png" alt="Workflow of MX Quant (source [3])" height=120>
  </left>
</a>

The memory and computational limits of LLMs are more severe than those of general neural networks, so our exploration focuses on LLMs first. The following table shows the basic MX quantization recipes in Neural Compressor and enumerates distinctions among various data types. The MX data type replaces the general float scale with powers of two to be more hardware-friendly, and it adopts a granularity between per-channel and per-tensor to balance accuracy and memory consumption.

|             | MX Format | INT8 | FP8 |
|-------------|-----------|------|-----|
| Scale       | $2^{exp}$ | $\frac{MAX}{amax}$ | $\frac{MAX}{amax}$ |
| Zero point  | 0 (None) | $2^{bits - 1}$ or $-min * scale$ | 0 (None) |
| Granularity | per-block (default blocksize is 32) | per-channel or per-tensor | per-tensor |

Here, the exponent (exp) equals torch.floor(torch.log2(amax)), MAX is the maximum representable value of the target data type, amax is the maximum absolute value of the per-block tensor, and min is the minimum value of the tensor used to compute the zero point.
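
For intuition, below is a minimal, illustrative sketch of per-block fake quantization with a shared power-of-two (E8M0-style) scale, in the spirit of MXINT8. This is not Neural Compressor's implementation: the block size of 32, the `2^(exp-6)` element mapping, and the helper name are assumptions for illustration only.

```python
import torch


def mx_int8_like_quant_dequant(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Toy fake-quantization: each block of `block_size` values shares one power-of-two scale."""
    # Assumes x.numel() is divisible by block_size; padding/reshaping logic is omitted.
    orig_shape = x.shape
    blocks = x.reshape(-1, block_size)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-30)  # per-block max magnitude
    exp = torch.floor(torch.log2(amax))        # shared exponent, cf. the formula above
    scale = torch.pow(2.0, exp - 6)            # 2^(exp-6) maps each block into the int8 range
    q = torch.clamp(torch.round(blocks / scale), -127, 127)  # int8-like element values
    return (q * scale).reshape(orig_shape)     # dequantize so the error can be inspected


w = torch.randn(4, 64)
print((w - mx_int8_like_quant_dequant(w)).abs().max())  # quantization error of the toy scheme
```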

## Get Started with Microscaling Quantization API

To get a model quantized with Microscaling Data Types, users can use the Microscaling Quantization API as follows.

```python
from neural_compressor.torch.quantization import MXQuantConfig, quantize

quant_config = MXQuantConfig(w_dtype=args.w_dtype, act_dtype=args.act_dtype, weight_only=args.woq)
user_model = quantize(model=user_model, quant_config=quant_config)
```

## Examples

- PyTorch [huggingface models](/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/mx)


## Reference

[1]: Darvish Rouhani, Bita, et al. "Pushing the limits of narrow precision inferencing at cloud scale with Microsoft Floating Point." Advances in Neural Information Processing Systems 33 (2020): 10271-10281.

[2]: OCP Microscaling Formats (MX) Specification.

[3]: Rouhani, Bita Darvish, et al. "Microscaling Data Formats for Deep Learning." arXiv preprint arXiv:2310.10537 (2023).

docs/source/3x/PT_SmoothQuant.md

Lines changed: 4 additions & 6 deletions
@@ -341,27 +341,25 @@ To set a fixed alpha for the entire model, users can follow this example:
 ```python
 from neural_compressor.torch.quantization import SmoothQuantConfig, convert, prepare
 
-quant_config = SmoothQuantConfig(alpha=0.5, folding=False)
-example_inputs = torch.zeros([1, 3])
-
 
 def run_fn(model):
     model(example_inputs)
 
 
+quant_config = SmoothQuantConfig(alpha=0.5, folding=False)
 prepared_model = prepare(fp32_model, quant_config=quant_config, example_inputs=example_inputs)
 run_fn(prepared_model)
 q_model = convert(prepared_model)
 ```
 `SmoothQuantConfig` description:
 
-"alpha": a float value. Default is 0.5.
+`alpha`: a float value. Default is 0.5.
 
-"folding": whether to fold mul into the previous layer, where mul is required to update the input distribution during smoothing.
+`folding`: whether to fold mul into the previous layer, where mul is required to update the input distribution during smoothing.
     - True: Fold inserted mul into the previous layer. IPEX will only insert mul for layers can do folding.
     - False: Allow inserting mul to update the input distribution and no folding. IPEX (version>=2.1) can fuse inserted mul automatically. For Stock PyTorch, setting folding=False will convert the model to a QDQ model.
 
-To get more information, please refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x/pytorch/nlp/huggingface_models/language-modeling/quantization/llm).
+To get more information, please refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm).
 
 
 ## Supported Framework Matrix
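
As a side note on the `alpha` parameter described above: in the SmoothQuant formulation, `alpha` balances how much quantization difficulty is migrated from activations to weights. Below is a minimal sketch of the per-channel smoothing scale from the SmoothQuant paper, not the Neural Compressor/IPEX implementation; the tensor shapes and values are toy placeholders.

```python
import torch

alpha = 0.5
act_amax = torch.rand(768) + 0.1  # per-input-channel activation max |X_j| from calibration (toy values)
w_amax = torch.rand(768) + 0.1    # per-input-channel weight max |W_j| (toy values)

# s_j = |X_j|^alpha / |W_j|^(1 - alpha); activations are divided by s_j and weights are multiplied
# by s_j. This is the "mul" that folding either merges into the previous layer (True) or leaves
# in place for the runtime to fuse (False).
smooth_scale = act_amax.pow(alpha) / w_amax.pow(1 - alpha)
```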

docs/source/quantization_weight_only.md

Lines changed: 2 additions & 0 deletions
@@ -87,6 +87,8 @@ Notes:
 | use_max_length | False | Whether to align all calibration data to fixed length, which equals to pad_max_length. |
 | block_size | 128 | Execute GPTQ quantization per block, block shape = [$C_{out}$, block_size] |
 | static_groups | False | Whether to calculate group wise quantization parameters in advance. This option mitigate actorder's extra computational requirements |
+| true_sequential | False | Whether to quantize layers within a transformer block in their original order. This can lead to higher accuracy but slower overall quantization process. |
+| lm_head | False | Whether to quantize the lm_head (linear layer related to prediction in the end of the language models). |
 
 **Note:** Neural compressor provides `Unsigned integer for asymmetric quantization` and `Signed integer for symmetric quantization`. Please follow the below section to compress the low bit data type for saving.
 
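
To connect the two new options with the example script updated later in this commit, here is a hedged sketch of a GPTQ recipe dict that carries them. Only the key names come from the table above and the `run_clm_no_trainer.py` diff below; the values and the `PostTrainingQuantConfig` wiring are illustrative assumptions.

```python
# Sketch only: key names follow the table above and the example script's weight-only recipes;
# the chosen values and the recipes={"gptq_args": ...} wiring are assumptions for illustration.
gptq_args = {
    "use_max_length": False,
    "static_groups": False,
    "true_sequential": True,  # quantize layers within a transformer block in their original order
    "lm_head": True,          # additionally quantize the final lm_head linear layer
}
# e.g. conf = PostTrainingQuantConfig(approach="weight_only", recipes={"gptq_args": gptq_args})
```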

examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py

Lines changed: 3 additions & 1 deletion
@@ -11,7 +11,7 @@
 import datasets
 from torch.nn.functional import pad
 from torch.utils.data import DataLoader
-from transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer
+from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
 
 parser = argparse.ArgumentParser()
 parser.add_argument(
@@ -377,7 +377,9 @@ def run_fn(model):
 
     from neural_compressor.torch.quantization import load
     tokenizer = AutoTokenizer.from_pretrained(args.model)
+    config = AutoConfig.from_pretrained(args.model)
     user_model = load(os.path.abspath(os.path.expanduser(args.output_dir)))
+    setattr(user_model, "config", config)
 else:
     user_model, tokenizer = get_user_model()
 

examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static/main.py

Lines changed: 14 additions & 10 deletions
@@ -26,7 +26,6 @@
 import onnxruntime as ort
 from torch.nn.functional import pad
 from torch.utils.data import DataLoader
-from intel_extension_for_transformers.llm.evaluation.lm_eval import evaluate
 from optimum.onnxruntime import ORTModelForCausalLM
 from transformers import LlamaConfig, LlamaTokenizer
 
@@ -198,28 +197,33 @@ def replace_architectures(json_path):
         json.dump(data, file, indent=4)
 
 def eval_func(model):
+    from intel_extension_for_transformers.transformers.llm.evaluation.lm_eval import evaluate, LMEvalParser
+
     model_dir = model
     if isinstance(model, str) and model.endswith(".onnx"):
         model_dir = os.path.dirname(model)
 
     replace_architectures(os.path.join(model_dir, "config.json"))
 
-    results = evaluate(
-        model="hf-causal",
-        model_args="pretrained=" + model_dir + ",tokenizer="+ args.tokenizer,
+    eval_args = LMEvalParser(
+        model="hf",
+        model_args="pretrained=" + model_dir + ",tokenizer=" + args.tokenizer + ",model_format=onnx",
         batch_size=args.batch_size,
-        tasks=args.tasks,
-        model_format="onnx",
+        tasks=','.join(args.tasks),
+        device="cpu",
     )
+    results = evaluate(eval_args)
 
     eval_acc = 0
     for task_name in args.tasks:
         if task_name == "wikitext":
-            print("Accuracy for %s is: %s" % (task_name, results["results"][task_name]["word_perplexity"]))
-            eval_acc += results["results"][task_name]["word_perplexity"]
+            print("Accuracy for %s is: %s" %
+                  (task_name, results["results"][task_name]["word_perplexity,none"]))
+            eval_acc += results["results"][task_name]["word_perplexity,none"]
         else:
-            print("Accuracy for %s is: %s" % (task_name, results["results"][task_name]["acc"]))
-            eval_acc += results["results"][task_name]["acc"]
+            print("Accuracy for %s is: %s" %
+                  (task_name, results["results"][task_name]["acc,none"]))
+            eval_acc += results["results"][task_name]["acc,none"]
 
     if len(args.tasks) != 0:
         eval_acc /= len(args.tasks)
examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static/requirements.txt

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,6 @@ onnxruntime-extensions; python_version < '3.11'
77
datasets
88
optimum
99
evaluate
10-
intel-extension-for-transformers
10+
intel-extension-for-transformers >= 1.4.1
1111
peft
12-
git+https://github.com/EleutherAI/lm-evaluation-harness.git@cc9778fbe4fa1a709be2abed9deb6180fd40e7e2
12+
lm-eval==0.4.2

examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/weight_only/main.py

Lines changed: 14 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,6 @@
2727
import onnxruntime as ort
2828
from torch.nn.functional import pad
2929
from torch.utils.data import DataLoader
30-
from intel_extension_for_transformers.llm.evaluation.lm_eval import evaluate
3130
from optimum.onnxruntime import ORTModelForCausalLM
3231
from transformers import LlamaConfig, LlamaTokenizer
3332

@@ -135,28 +134,33 @@ def replace_architectures(json_path):
135134
json.dump(data, file, indent=4)
136135

137136
def eval_func(model):
137+
from intel_extension_for_transformers.transformers.llm.evaluation.lm_eval import evaluate, LMEvalParser
138+
138139
model_dir = model
139140
if isinstance(model, str) and model.endswith(".onnx"):
140141
model_dir = os.path.dirname(model)
141142

142143
replace_architectures(os.path.join(model_dir, "config.json"))
143144

144-
results = evaluate(
145-
model="hf-causal",
146-
model_args="pretrained=" + model_dir + ",tokenizer="+ args.tokenizer,
145+
eval_args = LMEvalParser(
146+
model="hf",
147+
model_args="pretrained=" + model_dir + ",tokenizer=" + args.tokenizer + ",model_format=onnx",
147148
batch_size=args.batch_size,
148-
tasks=args.tasks,
149-
model_format="onnx",
149+
tasks=','.join(args.tasks),
150+
device="cpu",
150151
)
152+
results = evaluate(eval_args)
151153

152154
eval_acc = 0
153155
for task_name in args.tasks:
154156
if task_name == "wikitext":
155-
print("Accuracy for %s is: %s" % (task_name, results["results"][task_name]["word_perplexity"]))
156-
eval_acc += results["results"][task_name]["word_perplexity"]
157+
print("Accuracy for %s is: %s" %
158+
(task_name, results["results"][task_name]["word_perplexity,none"]))
159+
eval_acc += results["results"][task_name]["word_perplexity,none"]
157160
else:
158-
print("Accuracy for %s is: %s" % (task_name, results["results"][task_name]["acc"]))
159-
eval_acc += results["results"][task_name]["acc"]
161+
print("Accuracy for %s is: %s" %
162+
(task_name, results["results"][task_name]["acc,none"]))
163+
eval_acc += results["results"][task_name]["acc,none"]
160164

161165
if len(args.tasks) != 0:
162166
eval_acc /= len(args.tasks)

examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/weight_only/requirements.txt

Lines changed: 2 additions & 2 deletions
@@ -7,6 +7,6 @@ onnxruntime-extensions; python_version < '3.11'
 datasets
 optimum
 evaluate
-intel-extension-for-transformers
+intel-extension-for-transformers >= 1.4.1
 peft
-git+https://github.com/EleutherAI/lm-evaluation-harness.git@cc9778fbe4fa1a709be2abed9deb6180fd40e7e2
+lm-eval==0.4.2

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py

Lines changed: 7 additions & 3 deletions
@@ -77,6 +77,8 @@
                         this should align with your model config, \
                         and your dataset builder args: args.pad_max_length')
 parser.add_argument('--gptq_static_groups', action='store_true', help='Use determined group to do quantization')
+parser.add_argument('--gptq_true_sequential', action='store_true', help="Whether to run in true_sequential model.")
+parser.add_argument('--gptq_lm_head', action='store_true', help="Whether to use GPTQ to quantize the output layer of the LLMs.")
 # ==============code generation args===========
 parser.add_argument("--code_generation", action="store_true")
 parser.add_argument("--n_samples", default=200, type=int)
@@ -278,7 +280,8 @@ def calib_func(prepared_model):
             'use_max_length': args.gptq_use_max_length,
             'pad_max_length': args.gptq_pad_max_length,
             'static_groups': args.gptq_static_groups,
-            "enable_mse_search": args.woq_enable_mse_search,
+            "true_sequential": args.gptq_true_sequential,
+            "lm_head": args.gptq_lm_head,
         }
         # GPTQ: use assistive functions to modify calib_dataloader and calib_func
         # TEQ: set calib_func=None, use default training func as calib_func
@@ -340,12 +343,13 @@ def eval_func(model):
 
     if args.ipex:
         user_model = load(os.path.abspath(os.path.expanduser(args.output_dir)))
+        tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=args.trust_remote_code)
     else:
-        user_model, _ = get_user_model()
+        user_model, tokenizer = get_user_model()
         kwargs = {'weight_only': True} if args.approach == 'weight_only' else {}
         user_model = load(os.path.abspath(os.path.expanduser(args.output_dir)), user_model, **kwargs)
 else:
-    user_model, _ = get_user_model()
+    user_model, tokenizer = get_user_model()
 
 if args.accuracy:
     user_model.eval()
