The QuantizeML package provides base layers and quantization tools for deep-learning models. It allows the quantization of CNN and Vision Transformer models using low-bitwidth weights and outputs. Once a model is quantized with the provided tools, the CNN2SNN toolkit can convert it and execute it with the Akida runtime.
The FixedPoint representation
QuantizeML uses a FixedPoint representation in place of float values for layer inputs, outputs and weights.
FixedPoint numbers are actually integers with a static number of fractional bits, so that a float value is approximated as \(x_{float} \approx x_{int} \cdot 2^{-frac\_bits}\).
The precision of the representation is directly related to the number of fractional bits. For example, representing PI using an 8-bit FixedPoint with varying fractional bits:
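The PI example can be reproduced with a short sketch. The helper names `to_fixed_point` and `to_float` are hypothetical and not part of the QuantizeML API, and the sketch assumes unsigned 8-bit storage, round-half-up and saturation; the FixedPoint class itself may make different choices.

```python
import math


def to_fixed_point(x, frac_bits, bitwidth=8):
    """Return the integer representing x with `frac_bits` fractional bits.

    Assumptions (not taken from QuantizeML): unsigned storage,
    round-half-up, saturation at the largest representable integer.
    """
    int_max = 2 ** bitwidth - 1
    value = math.floor(x * 2 ** frac_bits + 0.5)
    return max(0, min(int_max, value))


def to_float(value, frac_bits):
    """Convert the stored integer back to a float."""
    return value * 2.0 ** -frac_bits


for frac_bits in (1, 3, 6):
    q = to_fixed_point(math.pi, frac_bits)
    print(f"frac_bits={frac_bits}  bits={q:08b}  int={q}  float={to_float(q, frac_bits)}")
# frac_bits=1  bits=00000110  int=6  float=3.0
# frac_bits=3  bits=00011001  int=25  float=3.125
# frac_bits=6  bits=11001001  int=201  float=3.140625
```

Each additional fractional bit halves the quantization step, at the cost of a smaller representable range for a fixed bitwidth.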
Further details are available in the FixedPoint API documentation.
Thanks to the FixedPoint representation, all operations within layers are implemented as integer-only operations [1].
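To illustrate what integer-only arithmetic looks like, the sketch below multiplies and adds two fixed-point numbers using only integer operations. The functions `fp_mul` and `fp_add` are hypothetical helpers written for this example; they follow standard fixed-point arithmetic and are not QuantizeML functions.

```python
def fp_mul(a_int, a_frac, b_int, b_frac):
    """Multiply two fixed-point numbers: integers multiply, fractional bits add."""
    return a_int * b_int, a_frac + b_frac


def fp_add(a_int, a_frac, b_int, b_frac):
    """Add two fixed-point numbers after aligning them on the larger frac_bits."""
    frac = max(a_frac, b_frac)
    a_aligned = a_int << (frac - a_frac)  # left shift = multiply by a power of two
    b_aligned = b_int << (frac - b_frac)
    return a_aligned + b_aligned, frac


# 3.125 is stored as 25 with 3 fractional bits, 1.5 as 3 with 1 fractional bit.
prod_int, prod_frac = fp_mul(25, 3, 3, 1)
sum_int, sum_frac = fp_add(25, 3, 3, 1)
print(prod_int, prod_frac, prod_int * 2.0 ** -prod_frac)  # 75 4 4.6875
print(sum_int, sum_frac, sum_int * 2.0 ** -sum_frac)      # 37 3 4.625
```

The float conversions appear only in the `print` calls to check the results; the operations themselves never leave the integer domain.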
The first step in the workflow is to train a standard Keras model. This trained model is the starting point for the quantization stage. Once the float model yields satisfactory performance on the task, one can proceed with quantization.
Let’s take the DS-CNN model from our zoo, which targets the keyword spotting (KWS) task, as an example:
from akida_models import fetch_file
from quantizeml.models import load_model

model_file = fetch_file("https://data.brainchip.com/models/AkidaV2/ds_cnn/ds_cnn_kws.h5",
                        fname="ds_cnn_kws.h5")
model = load_model(model_file)
The QuantizeML toolkit offers a turnkey solution to quantize a model: the quantize function. It replaces the Keras layers (or custom QuantizeML layers) with quantized, integer-only layers. The resulting quantized model is still a Keras model that can be evaluated with a standard Keras pipeline.
Here’s an example for 8-bit quantization:
from quantizeml.layers import QuantizationParams

qparams8 = QuantizationParams(input_weight_bits=8, weight_bits=8, activation_bits=8)
Here’s an example for 4-bit quantization (with first layer weights set to 8-bit):
from quantizeml.layers import QuantizationParams

qparams4 = QuantizationParams(input_weight_bits=8, weight_bits=4, activation_bits=4)
Note that quantizing the first layer weights to 8-bit helps preserve accuracy.
QuantizeML uses a uniform quantization scheme centered on zero. During quantization, the floating point values are mapped to a quantization space of a given bitwidth, of the form \(data_{float} \approx data_{FixedPoint} \cdot scales\).
scales is a real number used to map the FixedPoint numbers to a quantization space. It is derived from the range of the data and the target bitwidth.
Inputs, weights and outputs scales are folded into a single output scale vector.
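As an illustration of a uniform, zero-centered scheme, the sketch below quantizes a list of floats with a single scale. It assumes a signed integer range and a scale of `max(|x|) / (2^(bitwidth-1) - 1)`; this is a common convention chosen for the example, and the exact formula used by QuantizeML may differ. The function name `quantize_symmetric` is hypothetical.

```python
def quantize_symmetric(values, bitwidth=8):
    """Map floats to signed integers with one scale; zero always maps to zero."""
    int_max = 2 ** (bitwidth - 1) - 1  # assumed signed range, e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / int_max if max_abs > 0 else 1.0
    ints = [max(-int_max, min(int_max, round(v / scale))) for v in values]
    return ints, scale


weights = [-0.6, 0.0, 0.25, 1.0]
ints, scale = quantize_symmetric(weights)
restored = [i * scale for i in ints]  # float values recovered from the integers
print(ints)  # [-76, 0, 32, 127]
```

With this scheme the reconstruction error of each in-range value is at most half a quantization step, that is `scale / 2`.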
To avoid saturation in downstream operations throughout a model graph, the bitwidth of intermediate results is decreased using OutputQuantizer. The quantize function has built-in rules to automatically isolate building blocks of layers after which such quantization is required, and it inserts the OutputQuantizer objects during the quantization process.
To operate properly, an OutputQuantizer must be calibrated so that it determines an adequate quantization range. Calibration determines the quantization range statistically. It is possible to pass samples to the quantize function so that calibration and quantization are performed simultaneously.
Calibration samples are available on the BrainChip data server for datasets used in our zoo. They must be downloaded and deserialized before being used for calibration.
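The idea of determining a range statistically can be sketched as follows. This is a conceptual illustration only: it assumes the range is estimated as a moving average of the per-batch maximum absolute value, and the names `calibrate_range`, `requantize` and `momentum` are hypothetical rather than taken from the QuantizeML implementation.

```python
def calibrate_range(batches, momentum=0.9):
    """Estimate a quantization range as a moving average of per-batch max |x|."""
    max_range = None
    for batch in batches:
        batch_max = max(abs(v) for v in batch)
        if max_range is None:
            max_range = batch_max  # the first batch initializes the estimate
        else:
            max_range = momentum * max_range + (1 - momentum) * batch_max
    return max_range


def requantize(values, max_range, bitwidth=8):
    """Map values to signed integers, saturating outside the calibrated range.

    Assumes max_range > 0.
    """
    int_max = 2 ** (bitwidth - 1) - 1
    scale = max_range / int_max
    return [max(-int_max, min(int_max, round(v / scale))) for v in values]


batches = [[1.0, -2.0], [4.0, 0.5]]
max_range = calibrate_range(batches, momentum=0.5)  # 0.5 * 2.0 + 0.5 * 4.0 = 3.0
print(requantize([0.0, 3.0, 10.0], max_range))  # [0, 127, 127]
```

The value 10.0 lies outside the calibrated range and saturates, which is why calibration samples should be representative of the data seen at inference time.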
import numpy as np
from akida_models import fetch_file

samples = fetch_file("https://data.brainchip.com/dataset-mirror/samples/kws/kws_batch1024.npz",
                     fname="kws_batch1024.npz")
samples = np.load(samples)
samples = np.concatenate([samples[item] for item in samples.files])
Quantizing the DS-CNN model to 8-bit is then done with:
from quantizeml.models import quantize

quantized_model = quantize(model, qparams=qparams8, samples=samples)
Please refer to calibrate for more details on calibration.
Direct quantization of a standard Keras model (also called Post Training Quantization, PTQ) generally introduces a drop in performance. This drop is usually small for 8-bit or even 4-bit quantization of simple models, but it can be very significant for low bitwidths and complex models (AkidaNet or transformer architectures).
If the quantized model offers acceptable performance, it can be directly converted into an Akida model (see the convert function).
However, if the performance drop is too high, a Quantization Aware Training (QAT) step is required to recover the pre-quantization performance. Since the quantized model is a Keras model, it can be trained using the standard Keras API.
Check out the examples section for tutorials on quantization, PTQ and QAT.
The toolkit supports a wide range of layers (see the supported layer types section). When it encounters a non-compatible layer, QuantizeML simply stops the quantization before this layer and adds a Dequantizer before it so that inference is still possible. When this happens, a warning naming the unsupported layer is raised.
While quantization comes with some restrictions on layer order (e.g. MaxPool2D operation should be placed before ReLU activation), the sanitize helper is called before quantization to deal with such restrictions and edit the model accordingly. sanitize will also handle some layers that are not in the supported layer types such as:
- ZeroPadding2D, which is replaced with 'same' padding convolution when possible
- Lambda layers:
  - Lambda(relu) or Activation('relu') → ReLU,
  - Lambda(transpose) → Permute,
  - Lambda(reshape) → Reshape,
  - Lambda(add) → Add.
Command line interface
In addition to the programming interface, the QuantizeML toolkit also provides a command-line interface to perform quantization, dump a quantized model configuration, check a quantized model and insert a rescaling layer.
Quantizing a model through the CLI uses almost the same arguments as the programming interface, but the quantization parameters are split into separate options: input weight bitwidth with “-i”, weight bitwidth with “-w” and activation bitwidth with “-a”.
quantizeml quantize -m model_keras.h5 -i 8 -w 8 -a 8
Note that when no calibration options are given explicitly, calibration is performed with 1024 randomly generated samples. It is generally advised to use real samples serialized in a NumPy .npz file.
quantizeml quantize -m model_keras.h5 -i 8 -w 8 -a 8 -sa some_samples.npz -bs 128 -e 2
For Akida 1.0 compatibility, activations must be quantized per-tensor instead of using the default per-axis quantization:
quantizeml quantize -m model_keras.h5 -i 8 -w 4 -a 4 --per_tensor_activations
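The difference between the two modes can be sketched in plain Python. The example assumes that per-axis means one range per channel along the last axis and per-tensor means a single range shared by all channels; the helper names are hypothetical and do not come from QuantizeML.

```python
def per_tensor_range(activations):
    """One range shared by every channel."""
    return max(abs(v) for row in activations for v in row)


def per_axis_ranges(activations):
    """One range per channel (columns of the 2D list)."""
    return [max(abs(v) for v in channel) for channel in zip(*activations)]


# Two samples (rows) with three channels (columns).
activations = [[0.1, -2.0, 8.0],
               [-0.3, 1.5, -6.0]]

print(per_tensor_range(activations))  # 8.0
print(per_axis_ranges(activations))   # [0.3, 2.0, 8.0]
```

With a single shared range, channels with small values (the first column here) use only a small part of the integer range, which is one reason per-tensor quantization can cost accuracy compared with per-axis quantization.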
Advanced users might want to customize the default quantization pattern and this is made possible by dumping a quantized model configuration to a .json file and quantizing again using the “-c” option.
quantizeml config -m model_keras_i8_w8_a8.h5 -o config.json
... manual configuration changes ...
quantizeml quantize -m model_keras.h5 -c config.json
Editing a model configuration can be complicated and might have negative effects on quantized accuracy or even on the model graph. This should be reserved for users deeply familiar with QuantizeML concepts.
It is possible to check for quantization errors using the check CLI, which reports inaccurate quantization of weight scales or saturation in integer operations.
quantizeml check -m model_keras_i8_w8_a8.h5
Some models might not include a Rescaling layer in their architecture and rely on a separate preprocessing pipeline (i.e. moving from [0, 255] images to a [-1, 1] normalized representation). As having a rescaling layer might be useful, QuantizeML offers the insert_rescaling CLI that adds a Rescaling layer at the beginning of a given model.
quantizeml insert_rescaling -m model_keras.h5 -s 0.007843 -o -1 -d model_updated.h5
where \(0.007843 = 1/127.5\).
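The arithmetic behind these values can be checked quickly, assuming the rescaling computes `input * scale + offset` as the Keras Rescaling layer does. The `rescale` helper is written for this example only.

```python
scale, offset = 1 / 127.5, -1.0


def rescale(pixel, scale=scale, offset=offset):
    """Apply the rescaling: pixel * scale + offset."""
    return pixel * scale + offset


# The ends and the middle of the [0, 255] range map to -1, 0 and 1
# (up to floating point error).
for pixel in (0, 127.5, 255):
    print(pixel, round(rescale(pixel), 6))

# With the rounded scale passed on the command line, 255 maps slightly below 1.
print(round(rescale(255, scale=0.007843), 6))  # 0.999965
```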
Supported layer types
The QuantizeML toolkit provides quantization of the following layer types, which are standard Keras layers for the most part and custom QuantizeML layers for some of them:
[1] See https://en.wikipedia.org/wiki/Fixed-point_arithmetic for more details on the arithmetic.