Quantization is a very common way to optimize model performance these days. In short, it reduces the precision of the model's floating-point numbers so that the model becomes smaller and faster, and with Optimum this is very easy to do. There are generally two techniques: dynamic quantization is the simplest and easiest to get started with, but accuracy is more likely to drop, while static quantization usually has a better chance of preserving accuracy but is harder to tune. In recent years there is also a newer technique called quantization-aware training, which, roughly speaking, inserts fake-quantization operations during training so that the 32-bit float model learns to cope with low-precision arithmetic. Today let's start with the beginner-friendly one: dynamic quantization!
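Before jumping into Optimum, here is a minimal sketch of what dynamic quantization means, using plain PyTorch on a toy model (the toy nn.Sequential model and the torch.quantization call are purely illustrative and not part of the Optimum workflow below): the weights of the Linear layers are stored as int8, and activations are quantized on the fly at inference time, so no calibration data is needed.

import torch
from torch import nn

# Toy model for illustration only.
toy_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Dynamic quantization: Linear weights become int8, activations are
# quantized dynamically at inference time; no calibration data needed.
toy_quantized = torch.quantization.quantize_dynamic(
    toy_model, {nn.Linear}, dtype=torch.qint8
)
print(toy_quantized)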
from pathlib import Path
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_name = "deepset/roberta-base-squad2"
onnx_path = Path("onnx")
onnx_path.mkdir(parents=True, exist_ok=True)
task = "question-answering"

# Load the quantizer for the question-answering checkpoint
quantizer = ORTQuantizer.from_pretrained(model_name, feature=task)

# Dynamic quantization config targeting CPUs with AVX-512 VNNI
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)

# Export the vanilla ONNX model and its quantized counterpart
quantized_path = quantizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_quantized_model_path=onnx_path / "model-quantized.onnx",
    quantization_config=qconfig,
)
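The avx512_vnni preset targets Intel CPUs with the AVX-512 VNNI instruction set; is_static=False selects dynamic quantization, and per_channel=True gives each weight channel its own scale, which usually helps accuracy at a small size cost. If you are on different hardware, AutoQuantizationConfig also exposes presets such as avx2 and arm64. A quick sketch, assuming these presets accept the same arguments (the exact set and signatures depend on your Optimum version):

# Sketch: other hardware presets (availability and arguments depend on the Optimum version).
qconfig_avx2 = AutoQuantizationConfig.avx2(is_static=False, per_channel=True)
qconfig_arm64 = AutoQuantizationConfig.arm64(is_static=False, per_channel=True)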
import os

# Compare file sizes of the vanilla and quantized ONNX models
size = os.path.getsize(onnx_path / "model.onnx") / (1024 * 1024)
print(f"Vanilla Onnx Model file size: {size:.2f} MB")
size = os.path.getsize(onnx_path / "model-quantized.onnx") / (1024 * 1024)
print(f"Quantized Onnx Model file size: {size:.2f} MB")
The output looks like this:
Vanilla Onnx Model file size: 473.34 MB
Quantized Onnx Model file size: 230.83 MB
The model really did get a lot smaller, right?
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

question = "What's my name?"
context = "My name is Ko Ko and I live in Taiwan."

# Load the quantized ONNX model and the original tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantized_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model-quantized.onnx")

# Test the quantized model with the transformers pipeline
quantized_qa = pipeline(task, model=quantized_model, tokenizer=tokenizer,
                        handle_impossible_answer=True)
prediction = quantized_qa(question=question, context=context)
prediction
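To sanity-check the speedup, a rough sketch like the following can compare the quantized model against the vanilla ONNX export. The time_pipeline helper and the run count are just illustrative; use a proper benchmarking setup if you need reliable numbers.

import time

# Load the vanilla ONNX model for comparison
vanilla_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model.onnx")
vanilla_qa = pipeline(task, model=vanilla_model, tokenizer=tokenizer,
                      handle_impossible_answer=True)

def time_pipeline(qa, n=20):
    # Warm up once, then average the latency over n runs.
    qa(question=question, context=context)
    start = time.perf_counter()
    for _ in range(n):
        qa(question=question, context=context)
    return (time.perf_counter() - start) / n

print(f"Vanilla   avg latency: {time_pipeline(vanilla_qa) * 1000:.1f} ms")
print(f"Quantized avg latency: {time_pipeline(quantized_qa) * 1000:.1f} ms")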
There are other model optimization techniques as well, such as distillation (usually called knowledge distillation) and pruning, both of which can be used to optimize model performance; knowledge distillation in particular has been quite popular in recent years. But we have already reached day 28, and the remaining two days will cover deployment and building a chatbot and related applications.
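As a tiny taste of knowledge distillation, here is a minimal sketch of the standard distillation loss (the temperature T and mixing weight alpha are illustrative defaults, not values from this article): the student is trained on a mix of the usual cross-entropy loss and a KL-divergence loss toward the teacher's softened output distribution.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target loss: KL divergence between temperature-softened
    # teacher and student distributions, scaled by T^2.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target loss: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard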