This folder contains an example of optimizing the Phi-3-Mini-4K-Instruct model from Hugging Face or Azure Machine Learning Model Catalog for different hardware targets with Olive.
Install the dependencies:
```bash
pip install -r requirements.txt
```
- einops
- PyTorch: >=2.2.0
  The official PyTorch website offers packages compatible with CUDA 11.8 and 12.1; select the appropriate version for your setup.
- Package onnxruntime: >=1.18.0
- Package onnxruntime-genai: >=0.2.0
  If you target GPU, install the GPU packages of onnxruntime and onnxruntime-genai (onnxruntime-gpu and onnxruntime-genai-cuda).
If you have not logged in to your Hugging Face account:
- Install the Hugging Face CLI and log in to your Hugging Face account for model access:
  ```bash
  huggingface-cli login
  ```
- Install Olive with the Azure Machine Learning dependency:
  ```bash
  pip install olive-ai[azureml]
  ```

If you have not logged in to your Azure account:
- Install the Azure Command-Line Interface (CLI) by following the official installation instructions.
- Run
  ```bash
  az login
  ```
  to log in to your Azure account and allow Olive to access the model.
You can use Olive CLI commands to export, fine-tune, and optimize the model for a chosen hardware target. A few examples are shown below:
```bash
# To auto-optimize the exported model
olive auto-opt -m microsoft/Phi-3-mini-4k-instruct --precision int8

# To quantize the model
olive quantize -m microsoft/Phi-3-mini-4k-instruct --implementation gptq

# To tune ONNX session params
olive tune-session-params -m microsoft/Phi-3-mini-4k-instruct --io_bind --enable_cuda_graph
```
For more information on the available options of an individual CLI command, run `olive <command-name> --help` on the command line.
We will use the `phi3.py` script to fine-tune and optimize the model for a chosen hardware target by running the following commands:
```bash
python phi3.py [--target HARDWARE_TARGET] [--precision DATA_TYPE] [--source SOURCE] [--finetune_method METHOD] [--inference] [--prompt PROMPT] [--max_length LENGTH]

# Examples
python phi3.py --target mobile

python phi3.py --target mobile --source AzureML

python phi3.py --target mobile --inference --prompt "Write a story starting with once upon a time" --max_length 200

# Fine-tune the model with lora method, optimize the model for cuda target and inference with ONNX Runtime Generate() API
python phi3.py --target cuda --finetune_method lora --inference --prompt "Write a story starting with once upon a time" --max_length 200

# Fine-tune, quantize using AWQ and optimize the model for cpu target
python phi3.py --target cpu --precision int4 --finetune_method lora --awq

# Search and generate an optimized ONNX session tuning config
python phi3.py --target cuda --precision fp16 --tune-session-params
```
Run the following to get more information about the script:
```bash
python phi3.py --help
```
This script includes the following steps:
- Generate the Olive configuration file for the chosen hardware target
- (optional) Fine-tune the model with the LoRA or QLoRA method on the nampdn-ai/tiny-codes dataset
- (optional) Quantize the original or fine-tuned model using AWQ. If AWQ is not used and the precision is int4, the model is quantized using RTN.
- Generate an optimized ONNX model with Olive based on the configuration file for the chosen hardware target
- Search for and generate an optimized ONNX session tuning config
- (optional) Run inference on the optimized model with the ONNX Runtime Generate() API for non-web targets (see the sketch after this list)
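For reference, here is a minimal sketch of what inference with the ONNX Runtime Generate() API can look like. This is not the exact code in `phi3.py`: the model folder path and generation settings are placeholder assumptions, and the API surface differs slightly across onnxruntime-genai versions.

```python
# Minimal sketch of the ONNX Runtime Generate() API (not the exact code in phi3.py).
# Assumes the Olive output folder contains the optimized ONNX model plus its genai
# config, and that a recent onnxruntime-genai release is installed (older releases
# set the prompt via GeneratorParams instead of append_tokens).
import onnxruntime_genai as og

model = og.Model("path/to/optimized-model")   # placeholder: Olive output folder
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Write a story starting with once upon a time"))

# Generate tokens until the search is done, then decode the full sequence.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```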
If you have an Olive configuration file, you can also run the olive command for model generation:
```bash
olive run [--config CONFIGURATION_FILE]

# Examples
olive run --config phi3_run_mobile_int4.json
```
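The same workflow can also be launched from Python instead of the CLI, which is convenient when scripting several runs. A minimal sketch, assuming the config file sits in the working directory:

```python
# Minimal sketch: run an Olive workflow from Python rather than the CLI.
# Assumes olive-ai is installed and phi3_run_mobile_int4.json is in the
# current working directory.
from olive.workflows import run as olive_run

olive_run("phi3_run_mobile_int4.json")
```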
Get access to the following resources on Hugging Face Hub:
- nampdn-ai/tiny-codes (the fine-tuning dataset)
The TensorRT Model Optimizer is designed to bring advanced model compression techniques, including quantization, to Windows RTX PC systems. Engineered for Windows, it delivers rapid and efficient quantization through features such as local GPU calibration, reduced memory usage, and fast processing. The primary goal of TensorRT Model Optimizer is to produce optimized, ONNX-format models compatible with DirectML backends.
Run the following commands to install the necessary packages:
```bash
pip install olive-ai[nvmo]
pip install "onnxruntime-genai-directml>=0.4.0"
pip install onnxruntime-directml==1.20.0
pip install -r requirements-nvmo-awq.txt
```
Install the CUDA version compatible with CuPy, as specified in `requirements-nvmo-awq.txt`.
After setup, confirm the correct installation of the `modelopt` package by running:
```bash
python -c "from modelopt.onnx.quantization.int4 import quantize as quantize_int4"
```
To perform quantization, use the configuration file `phi3_nvmo_ptq.json`. This config executes two passes: one for model building and one for quantization. Note that ModelOpt currently only supports quantizing models created with the `modelbuild` tool.
```bash
olive run --config phi3_nvmo_ptq.json
```
- Locate and Update Configuration File: Open `phi3_nvmo_ptq.json` in a text editor. Update the `model_path` to point to the directory or repository of the model you want to quantize. Ensure that `tokenizer_dir` is set to the tokenizer directory for the new model.
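If you prefer to make this edit from a script, here is a minimal sketch that loads the config, replaces every `model_path` and `tokenizer_dir` value it finds, and writes the file back. The new values are placeholders, and a recursive walk is used because the exact nesting of these keys inside `phi3_nvmo_ptq.json` is not assumed here.

```python
# Sketch: update model_path / tokenizer_dir in phi3_nvmo_ptq.json from a script.
# The new values below are placeholders; the recursive walk avoids assuming
# where exactly the keys live inside the Olive config.
import json

NEW_VALUES = {
    "model_path": "path/or/repo/of/your/model",  # placeholder
    "tokenizer_dir": "path/to/tokenizer",        # placeholder
}

def update(node):
    """Recursively replace matching keys anywhere in the config."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in NEW_VALUES:
                node[key] = NEW_VALUES[key]
            else:
                update(value)
    elif isinstance(node, list):
        for item in node:
            update(item)

with open("phi3_nvmo_ptq.json") as f:
    config = json.load(f)

update(config)

with open("phi3_nvmo_ptq.json", "w") as f:
    json.dump(config, f, indent=4)
```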