QuantizeML API

Layers

Reshaping

class quantizeml.layers.QuantizedFlatten(*args, **kwargs)[source]

A Flatten layer that operates on quantized inputs

class quantizeml.layers.QuantizedPermute(*args, **kwargs)[source]

A Permute layer that operates on quantized inputs

Note: the Keras Permute layer simply wraps the TensorFlow transpose op.

Parameters:

dims (tuple of ints) – Permutation pattern does not include the samples dimension. Indexing starts at 1. For instance, (2, 1) permutes the first and second dimensions of the input.

class quantizeml.layers.QuantizedReshape(*args, **kwargs)[source]

A Reshape layer that operates on quantized inputs

Parameters:

target_shape (tuple of ints) – Target shape, does not include the samples dimension (batch size).

Activations

class quantizeml.layers.QuantizedReLU(*args, **kwargs)[source]

Quantized version of the ReLU activation layer, applicable to FixedPoint tensors.

Parameters:
  • max_value (float, optional) – ReLU maximum value. Defaults to 6.

  • quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

Attention

class quantizeml.layers.Attention(*args, **kwargs)[source]

Dot-product attention layer with configurable softmax.

Inputs are a tuple of tensors:

  • a query tensor of shape [batch, tokens, hidden],

  • a key tensor of shape [batch, tokens, hidden],

  • a value tensor of shape [batch, tokens, hidden].

The calculation follows the steps:

  1. Split query, key and value per attention head, then transpose:

    q, k, v : [batch, tokens, hidden] -> [batch, num_heads, tokens, dim]

  2. Calculate cross-token scores as a query-key dot product:

    scores = tf.matmul(query, key, transpose_b=True)

    scores : [batch, num_heads, tokens, tokens]

  3. Rescale the scores by dividing by the square root of dim.

  4. Use the scores to calculate a mask:

    mask = softmax(scores)

  5. Combine the mask with value:

    output = tf.matmul(mask, value)

    output : [batch, num_heads, tokens, dim]

  6. Merge the heads to get back to the hidden dimension:

    output : [batch, num_heads, tokens, dim] -> [batch, tokens, hidden]

Parameters:
  • num_heads (int) – the number of attention heads

  • softmax (str, optional) – ‘softmax’ or ‘shiftmax’. Defaults to ‘softmax’
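
Example

A minimal usage sketch (shapes are illustrative; hidden must be divisible by num_heads, and the layer is assumed to be called on a (query, key, value) tuple as described above):

>>> import tensorflow as tf
>>> from quantizeml.layers import Attention
>>> q = tf.random.normal((1, 16, 64))  # [batch, tokens, hidden]
>>> k = tf.random.normal((1, 16, 64))
>>> v = tf.random.normal((1, 16, 64))
>>> outputs = Attention(num_heads=8)((q, k, v))  # hidden 64 -> 8 heads of dim 8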

class quantizeml.layers.QuantizedAttention(*args, **kwargs)[source]

An Attention layer that operates on quantized inputs

Parameters:
  • num_heads (int) – the number of attention heads

  • quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

  • softmax (str, optional) – ‘softmax’ or ‘shiftmax’. Defaults to ‘shiftmax’

quantizeml.layers.string_to_softmax(s)[source]

Convert a string to a softmax function. Available options are ‘softmax’ for the standard softmax and ‘shiftmax’ for shiftmax.

Parameters:

s (str) – string to convert.

Returns:

A softmax function.

Normalization

class quantizeml.layers.QuantizedBatchNormalization(*args, **kwargs)[source]

Layer that normalizes its inputs, on the last axis.

The normalization is applied like this:

\[y = \frac{(x - \mu) \cdot \gamma}{\sigma} + \beta = \frac{x \cdot \gamma}{\sigma} - \frac{\mu \cdot \gamma}{\sigma} + \beta\]

if we consider:

\[a = \frac{\gamma}{\sigma}\]

and

\[b = -\frac{\mu \cdot \gamma}{\sigma} + \beta\]

The normalization can be re-written as:

\[y = a \cdot x + b\]

Note that this layer will hold variables with names gamma, beta, moving_mean (\(\mu\)), and moving_variance (\(\sigma = \sqrt{moving\_variance + \epsilon}\)), so they can be converted from a BatchNormalization layer. However, it's a and b that are going to be quantized.
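
Example

A float sketch (values are arbitrary) of how a and b derive from the BatchNormalization variables, following the equations above:

>>> import tensorflow as tf
>>> gamma, beta = 1.2, 0.5
>>> moving_mean, moving_variance, epsilon = 0.1, 4.0, 1e-3
>>> sigma = tf.sqrt(moving_variance + epsilon)
>>> a = gamma / sigma
>>> b = -moving_mean * gamma / sigma + beta
>>> x = tf.constant([0.3, -1.0])
>>> y = a * x + b  # same result as (x - moving_mean) * gamma / sigma + beta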

Parameters:
  • quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

  • axis (int, optional) – The axis that was normalized on the BatchNormalization layer. The only supported value is the last dimension.

  • epsilon (float, optional) – Small value to avoid dividing by zero. Defaults to 1e-3.

class quantizeml.layers.LayerMadNormalization(*args, **kwargs)[source]

Approximates the keras.layers.LayerNormalization (LN), replacing the computation of the standard deviation by the mean average deviation (mad).

Given the complexity of computing the standard deviation, the LayerMadNormalization (LMN) replaces \(std(x)\) with \(mad(x)\), defined as:

\[mad(x) = \frac{sum(|x - mean(x)|)}{nb\_channels}\]

To simplify further and make the layer more hardware-friendly, \(mean(x)\) is zeroed:

\[mad(x) = \frac{sum(|x|)}{nb\_channels}\]

Then, the equation of the layer is defined as:

\[LMN(x) = \gamma\frac{x}{mad(x)} + \beta\]

Note

A tuning step is required when switching from the LN to the LMN layer, to find the \((\gamma, \beta)\) parameters that compensate for the standard deviation changes.
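
Example

A float sketch of the LMN equation above, with \(\gamma = 1\) and \(\beta = 0\):

>>> import tensorflow as tf
>>> x = tf.random.normal((1, 10, 32))
>>> mad = tf.reduce_sum(tf.abs(x), axis=-1, keepdims=True) / x.shape[-1]
>>> lmn = x / mad  # LMN(x) with gamma = 1, beta = 0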

class quantizeml.layers.QuantizedLayerNormalization(*args, **kwargs)[source]

A LayerNormalization layer that operates on quantized inputs and weights.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

Convolution

class quantizeml.layers.PaddedConv2D(*args, **kwargs)[source]

A convolutional layer that can use custom padding values.

Note that when padding values are provided, padding ‘SAME’ will be applied with the provided value (overriding the ‘padding’ parameter).

Parameters:

padding_value (float, list, tensor, optional) – the value or the list of values used when padding for the ‘same’ convolution type. Padding is per-tensor if one value is provided or per-channel otherwise. If None, zero-padding is used. Defaults to None.
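
Example

A minimal sketch, assuming the layer accepts the usual keras.layers.Conv2D arguments in addition to padding_value:

>>> from quantizeml.layers import PaddedConv2D
>>> # pads with -1.0 instead of zeros when computing the 'same' convolution
>>> conv = PaddedConv2D(32, (3, 3), padding_value=-1.0)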

class quantizeml.layers.QuantizedConv2D(*args, **kwargs)[source]

A convolutional layer that operates on quantized inputs and weights.

Note that when padding values are provided, padding ‘SAME’ will be applied with the provided value (overriding the ‘padding’ parameter).

Parameters:
  • quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

  • padding_value (float, list, tensor, optional) – the value or the list of values used when padding for the ‘same’ convolution type. Padding is per-tensor if one value is provided or per-channel otherwise. If None, zero-padding is used. Defaults to None.

class quantizeml.layers.QuantizedConv2DTranspose(*args, **kwargs)[source]

A transposed convolutional layer that operates on quantized inputs and weights.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

Depthwise convolution

class quantizeml.layers.QuantizedDepthwiseConv2D(*args, **kwargs)[source]

A depthwise convolutional layer that operates on quantized inputs and weights.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

class quantizeml.layers.DepthwiseConv2DTranspose(*args, **kwargs)[source]

A transposed depthwise convolutional layer.

It performs a transposed depthwise convolution on inputs.

class quantizeml.layers.QuantizedDepthwiseConv2DTranspose(*args, **kwargs)[source]

A transposed depthwise convolutional layer that operates on quantized inputs and weights.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

Separable convolution

class quantizeml.layers.QuantizedSeparableConv2D(*args, **kwargs)[source]

A separable convolutional layer that operates on quantized inputs and weights.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

Dense

class quantizeml.layers.QuantizedDense(*args, **kwargs)[source]

A Dense layer that operates on quantized inputs and weights

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

Skip connection

class quantizeml.layers.Add(*args, **kwargs)[source]

Wrapper class of keras.layers.Add that can optionally average its inputs.

We only support a tuple of two inputs with the same shape.

Parameters:
  • average (bool, optional) – if True, compute the average across all inputs. Defaults to False.

  • activation (bool, optional) – If True, apply an activation function after the addition. Defaults to False.
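
Example

A minimal sketch of the averaging behavior:

>>> import tensorflow as tf
>>> from quantizeml.layers import Add
>>> x1 = tf.constant([[1.0, 2.0]])
>>> x2 = tf.constant([[3.0, 4.0]])
>>> y = Add(average=True)([x1, x2])  # element-wise (x1 + x2) / 2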

class quantizeml.layers.QuantizedAdd(*args, **kwargs)[source]

Sums two inputs and quantizes the output.

The two inputs must be provided as a list or tuple of FixedPoint or Tensors.

The outputs are quantized according to the specified quantization configuration.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

class quantizeml.layers.QuantizedConcatenate(*args, **kwargs)[source]

A Concatenate layer that operates on quantized inputs

Pooling

class quantizeml.layers.QuantizedMaxPool2D(*args, **kwargs)[source]

A max pooling layer that operates on quantized inputs.

class quantizeml.layers.QuantizedGlobalAveragePooling2D(*args, **kwargs)[source]

A global average pooling layer that operates on quantized inputs.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

Shiftmax

class quantizeml.layers.Shiftmax(*args, **kwargs)[source]

Wrapper class of the shiftmax function, which calculates a softmax-like activation.

Note that the shiftmax operation is always performed along the last axis.

class quantizeml.layers.QuantizedShiftmax(*args, **kwargs)[source]

A quantized layer that computes a function similar to the softmax, but using base 2 instead of e. So we replace

\[softmax(x_i) = \frac{e^{x_i}}{sum(e^{x_k})}\]

With this:

\[softmax2(x_i) = \frac{2^{x_i}}{sum(2^{x_k})}\]

This approximation is close enough to the original function. In order to make it more hardware-friendly, we also approximate \(sum(2^{x_k})\) by the closest power of two:

\[shiftmax(x_i) = \frac{2^{x_i}}{2^{round(log2(sum(2^{x_k})))}}\]

So it can be implemented with a simple shift operation.

Implementation is inspired from this paper:

Cardarilli, G.C., Di Nunzio, L., Fazzolari, R. et al. A pseudo-softmax function for hardware-based high speed image classification. Sci Rep 11, 15307 (2021). https://doi.org/10.1038/s41598-021-94691-7

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

quantizeml.layers.shiftmax(logits, axis=-1)[source]

Computes softmax-like activations, but using base 2 for the exponential.

Used as approximation of the softmax activation.

This function performs the equivalent of

>>> logits = tf.floor(logits)
>>> exp = 2 ** logits
>>> sum_exp = tf.reduce_sum(exp, axis, keepdims=True)
>>> sum_exp_shift = tf.round(tf.math.log(sum_exp) / tf.math.log(2.0))
>>> softmax = 2 ** (logits - sum_exp_shift)  # i.e. exp / 2 ** sum_exp_shift

where 2 ** sum_exp_shift approximates sum_exp as a Power-of-Two (PoT).

To avoid large exponentials (and an inf representation in TensorFlow), we adopt the following equivalence:

Making the variable change \(y = logits - x0\), we reach the same result as \(p = shiftmax(logits)\), because

\[p' = \frac{2^y}{sum(2^y)} = \frac{2^{logits-x0}}{sum(2^{logits-x0})} = \frac{2^{logits} * 2^{-x0}}{2^{-x0} * sum(2^{logits})} = \frac{2^{logits}}{sum(2^{logits})} = p\]

We take \(x0 = max(logits)\).

Parameters:
  • logits (tf.Tensor) – a non-empty Tensor.

  • axis (int, list, optional) – the dimension shiftmax would be performed on. The default is -1 which indicates the last dimension.

Returns:

value of shiftmax function with the same type and shape as logits.

Return type:

tf.Tensor

Raises:

InvalidArgumentError – if logits is empty or axis is beyond the last dimension of logits.

Note

We floor the logits so that the results approximate those expected when the operation is quantized.
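
Example

A minimal usage sketch comparing shiftmax with the standard softmax:

>>> import tensorflow as tf
>>> from quantizeml.layers import shiftmax
>>> logits = tf.constant([[1.0, 2.0, 3.0]])
>>> p = shiftmax(logits)       # base-2, power-of-two denominator
>>> s = tf.nn.softmax(logits)  # standard softmax, for comparison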

Transformers

class quantizeml.layers.ClassToken(*args, **kwargs)[source]

Append a class token to an input layer.

Parameters:

initializer (keras.initializers.Initializer) – Initializer for the class variable. Defaults to None.

class quantizeml.layers.QuantizedClassToken(*args, **kwargs)[source]

Quantize the ClassToken layer, allowing quantization of the output.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

class quantizeml.layers.AddPositionEmbs(*args, **kwargs)[source]

Adds (optionally learned) positional embeddings to the inputs.

Parameters:

initializer (keras.initializers.Initializer) – Initializer for the class variable. Defaults to None.

class quantizeml.layers.QuantizedAddPositionEmbs(*args, **kwargs)[source]

Quantize the AddPositionEmbs layer, allowing operations in FixedPoint domain.

Parameters:

quant_config (dict, optional) – the serialized quantization configuration. Defaults to None.

class quantizeml.layers.ExtractToken(*args, **kwargs)[source]

Wrapper class of tf.gather operation that allows to extract a Token.

Parameters:
  • token (int) – the index of the token to extract.

  • axis (int, optional) – the axis over which to gather the token. Defaults to 1.
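
Example

A minimal sketch extracting a class token at index 0 along the tokens axis:

>>> import tensorflow as tf
>>> from quantizeml.layers import ExtractToken
>>> x = tf.random.normal((1, 197, 768))   # [batch, tokens, hidden]
>>> cls_token = ExtractToken(token=0)(x)  # gathers index 0 along axis 1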

class quantizeml.layers.QuantizedExtractToken(*args, **kwargs)[source]

Quantized version of the ExtractToken layer. Accepts only FixedPoint inputs.

Rescaling

class quantizeml.layers.QuantizedRescaling(*args, **kwargs)[source]

A layer that multiplies integer inputs by a scale

This is a simplified version of the keras Rescaling layer:

  • it only supports a scalar scale,

  • it only supports zero offsets.

This layer assumes the inputs are 8-bit integers: it simply wraps them into an 8-bit per-tensor QFloat with the specified scale.

Parameters:

scale (float) – a scalar scale.

Dropout

class quantizeml.layers.QuantizedDropout(*args, **kwargs)[source]

A dropout layer that operates on quantized inputs and weights.

It is only implemented as a passthrough.

Quantizers

class quantizeml.layers.Quantizer(*args, **kwargs)[source]

The base class for all quantizers.

The bitwidth defines the number of quantization levels on which the values will be quantized. For a quantizer that accepts unsigned values, the maximum quantization level is \(2^{bitwidth} - 1\). For a quantizer that accepts signed values, we lose one bit of precision to store the sign. When the quantizer is signed, the quantization interval is asymmetric around zero (i.e. range: \([-2^{bitwidth - 1}, 2^{bitwidth - 1} - 1]\)).

Parameters:
  • bitwidth (int) – the quantization bitwidth.

  • signed (bool, optional) – whether the quantizer expects signed values or unsigned. Defaults to True.
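
Example

The ranges described above, worked out for the default bitwidth of 8:

>>> bitwidth = 8
>>> unsigned_max = 2 ** bitwidth - 1  # 255
>>> signed_min, signed_max = -2 ** (bitwidth - 1), 2 ** (bitwidth - 1) - 1  # -128, 127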

class quantizeml.layers.WeightQuantizer(*args, **kwargs)[source]

Bases: Quantizer

A uniform quantizer that converts a float Tensor to a QFloat representation.

In order, the WeightQuantizer:

  • evaluates the scales required to align the values on optimal ranges for FixedPoint quantization,

  • quantizes the rescaled Tensor as a FixedPoint and returns a QFloat.

Parameters:
  • bitwidth (int, optional) – the quantization bitwidth, defaults to 4.

  • signed (bool, optional) – whether the quantizer expects signed values or unsigned. Defaults to True.

  • axis (int, optional) – the quantization range is a scalar (None) or a vector corresponding to the given axis. Defaults to -1.

  • fp_quantizer (bool, optional) – True to enable FixedPoint quantization, QFloat otherwise. Defaults to False.

Methods:

build(input_shape)

Build the layer.

call(inputs)

Quantize the float inputs

get_config()

Get the config of the layer.

build(input_shape)[source]

Build the layer.

Parameters:

input_shape (list) – the shape of input tensor.

call(inputs)[source]

Quantize the float inputs

The quantization is done in two steps:

  1. Compute the quantization ranges,

  2. Quantize the inputs.

Parameters:

inputs (tf.Tensor) – the inputs tensor.

Returns:

the quantized tensor.

Return type:

QFloat

get_config()[source]

Get the config of the layer.

Returns:

the config of the layer.

Return type:

dict

class quantizeml.layers.AlignedWeightQuantizer(*args, **kwargs)[source]

Bases: Quantizer

A uniform quantizer that converts a float Tensor to a QFloat representation.

Unlike its sibling the WeightQuantizer, it does not evaluate the fractional bits and scales of the resulting QFloat, but instead aligns them on those of another QFloat input.

Parameters:
  • bitwidth (int, optional) – the quantization bitwidth. Defaults to 8.

  • signed (bool, optional) – whether the quantizer expects signed values or unsigned. Defaults to True.

Methods:

call(inputs, other)

Quantize the float inputs, aligned on another QFloat

call(inputs, other)[source]

Quantize the float inputs, aligned on another QFloat

The quantization is done in several steps:

  1. Compute the quantization ranges,

  2. Evaluate the maximum fractional bits,

  3. Quantize the inputs as a QFloat,

  4. Align the QFloat fractional bits on the other.

Parameters:
  • inputs (tf.Tensor) – the inputs tensor.

  • other (QFloat) – a tensor to align on.

Returns:

a quantized tensor with the same scales and frac_bits as other.

Return type:

QFloat

class quantizeml.layers.OutputQuantizer(*args, **kwargs)[source]

Bases: Quantizer

A uniform FixedPoint quantizer that selects the optimal number of fractional bits for the range of its inputs and updates them accordingly.

The typical use case is to decrease the bitwidth of the result of a quantized layer operation to avoid a saturation in downstream operations.

If the input is a QFloat, it is converted to a FixedPoint before updating its bitwidth.

Parameters:
  • bitwidth (int, optional) – the quantization bitwidth. Defaults to 8.

  • signed (bool, optional) – whether the quantizer expects signed values or unsigned. Defaults to True.

  • axis (str, optional) – the quantization range is a scalar (‘per-tensor’) or a vector corresponding to the last axis (‘per-axis’). Defaults to ‘per-tensor’.

  • scale_bits (int, optional) – the bitwidth to use when quantizing output scales. Defaults to 8.

  • buffer_bitwidth (int, optional) – buffer bitwidth value. Defaults to 32.

Methods:

build(input_shape)

Build the layer.

call(inputs)

Quantize the QTensor inputs to a lower bitwidth.

get_config()

Get the config of the layer.

Attributes:

frac_bits

Compute and return the number of fractional bits for this OutputQuantizer.

build(input_shape)[source]

Build the layer.

Parameters:

input_shape (list) – the shape of input tensor.

call(inputs)[source]

Quantize the QTensor inputs to a lower bitwidth.

The quantization happens with the following steps:

  1. Evaluate the nearest power(s) of two containing the quantization range(s)

  2. Quantize the inputs.

Parameters:

inputs (QTensor) – the inputs tensor.

Returns:

the quantized tensor.

Return type:

FixedPoint

property frac_bits

Compute and return the number of fractional bits for this OutputQuantizer.

Returns:

an integer tensor of fractional bits

Return type:

tf.Tensor

get_config()[source]

Get the config of the layer.

Returns:

the config of the layer.

Return type:

dict

class quantizeml.layers.Dequantizer(*args, **kwargs)[source]

Bases: Layer

Layer that dequantizes its inputs.

Methods:

call(inputs)

Convert QTensor inputs to float.

call(inputs)[source]

Convert QTensor inputs to float.

Parameters:

inputs (tf.Tensor or QTensor) – the inputs tensor(s).

Returns:

the dequantized tensor(s).

Return type:

tf.Tensor

Calibration

class quantizeml.layers.OutputObserver(*args, **kwargs)[source]

Calibration layer.

This layer is used to compute the future range_max of the equivalent OutputQuantizer in the quantized model. It is placed where the OutputQuantizer will be inserted (at the end of blocks) and accumulates the maximum values (with momentum) observed on the inputs of the float model.

Parameters:
  • axis (str) – the quantization range is a scalar (‘per-tensor’) or a vector corresponding to the last axis (‘per-axis’). Defaults to ‘per-tensor’.

  • momentum (float) – the momentum for the moving average. Defaults to 0.9.

Recording

quantizeml.layers.recording(enable)[source]

Enable or disable recording.

Parameters:

enable (bool) – True to enable recording, False to disable it

class quantizeml.layers.TensorRecorder(*args, name='', **kwargs)[source]

Wrapper class to store and retrieve a tf.Tensor extracted from a graph.

This is mainly used to recover FixedPoint alignment shift information.

class quantizeml.layers.FixedPointRecorder(name='')[source]

Wrapper class to store and retrieve a FixedPoint extracted from a graph.

This is mainly used to recover FixedPoint quantized weights.

class quantizeml.layers.QFloatRecorder(name='')[source]

Wrapper class to store and retrieve a QFloat extracted from a graph.

This is mainly used to recover QFloat quantized weights.

class quantizeml.layers.NonTrackVariable(name='')[source]

A wrapper class for temporary Tensor variables that should be tracked only during the call and do not need to be serialized within the layer.

class quantizeml.layers.NonTrackFixedPointVariable(name='')[source]

A wrapper class for temporary FixedPoint variables that should be tracked only during the call and do not need to be serialized within the layer.

Models

Transforms

quantizeml.models.transforms.align_rescaling(model)[source]

Aligns the Rescaling layer of the model to make it quantization ready.

This folds the offset into the bias of the next layer.

The resulting Rescaling is therefore compatible with a quantization to a QuantizedRescaling.

If the source model does not contain a Rescaling or if its Rescaling is already aligned, then the original model is returned.

Parameters:

model (keras.Model) – the source Keras model

Returns:

the original model or a new model with Rescaling layer aligned

Return type:

keras.Model

quantizeml.models.transforms.invert_batchnorm_pooling(model)[source]

Inverts pooling and BatchNormalization layers in a model to have the BN layer before pooling.

Returns a new model where pooling and batch normalization layers are inverted. From a Keras model where pooling layers precede batch normalization layers, this function places the BN layers before pooling layers. This is the first step before folding BN layers into processing layers.

Note

Inversion of layers is equivalent only if the gammas of BN layers are positive. The function raises an error if not.

Parameters:

model (keras.Model) – a model

Returns:

the updated model

Return type:

keras.Model

Raises:

RuntimeError – if a candidate BatchNormalization layer has gamma values that are not strictly positive.

quantizeml.models.transforms.fold_batchnorms(model)[source]

Returns a new model where BatchNormalization layers are folded into previous layers.

From a Keras model where BN layers follow processing layers, this function removes the BN layers and updates the preceding layers weights and bias accordingly. The new model is strictly equivalent to the previous one.

Parameters:

model (keras.Model) – a model

Returns:

the original model or a model with BN folded

Return type:

keras.Model

quantizeml.models.transforms.insert_layer(model, target_layer_name, new_layer)[source]

Inserts the given layer in the model after the layer with the name target_layer_name.

Note that new_layer type is restricted to (OutputQuantizer, Dequantizer).

Parameters:
  • model (keras.Model) – the model to update

  • target_layer_name (str) – name of the layer after which to insert a layer

  • new_layer (keras.layers.Layer) – layer to insert

Raises:

ValueError – when target_layer_name is not found in model or new_layer is not in (OutputQuantizer, Dequantizer)

Returns:

the new model

Return type:

keras.Model

quantizeml.models.transforms.insert_rescaling(model, scale, offset)[source]

Inserts a Rescaling layer as the first layer of the Model (after the Input).

Parameters:
  • model (keras.Model) – the model to update

  • scale (float) – the Rescaling scale

  • offset (float) – the Rescaling offset

Raises:

ValueError – when the Model does not have an Input layer.

Returns:

the new model

Return type:

keras.Model
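
Example

A minimal sketch, assuming model is a keras.Model with an Input layer taking 8-bit images that should be mapped to [0, 1]:

>>> from quantizeml.models.transforms import insert_rescaling
>>> new_model = insert_rescaling(model, scale=1 / 255.0, offset=0.0)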

quantizeml.models.transforms.invert_relu_maxpool(model)[source]

Inverts ReLU and MaxPool2D layers in a model to have MaxPool2D before ReLU.

This transformation produces a strictly equivalent model.

Parameters:

model (keras.Model) – a model

Returns:

the original model or the updated model

Return type:

keras.Model

quantizeml.models.transforms.remove_zeropadding2d(model)[source]

Removes ZeroPadding2D layers from a model.

ZeroPadding2D layers are not supported by quantization. This transform therefore removes ZeroPadding2D layers that are immediately followed by a convolution layer with ‘valid’ padding, and updates the convolution to use ‘same’ padding instead. This is only possible when the padding specified in ZeroPadding2D actually corresponds to a ‘same’ padding.

Parameters:

model (keras.Model) – the model to update

Returns:

the original model or a new model with ZeroPadding2D removed

Return type:

keras.Model

quantizeml.models.transforms.replace_lambda(model)[source]

Replaces Lambda layers in a model with their equivalent Keras layers.

This transform handles the following replacements:

  • Lambda(relu) or Activation(‘relu’) → ReLU,

  • Lambda(transpose) → Permute,

  • Lambda(reshape) → Reshape,

  • Lambda(add) → Add,

  • Lambda(‘gelu’) → Activation(‘gelu’),

  • Lambda(‘silu’) → Activation(‘silu’).

Parameters:

model (keras.Model) – the model of interest

Returns:

the original model or a new one with lambda replaced.

Return type:

keras.Model

quantizeml.models.transforms.sanitize(model)[source]

Sanitize a model preparing it for quantization.

This wraps successive calls to several model transformations that aim at making the model quantization ready.

Parameters:

model (keras.Model) – the input model

Returns:

the sanitized model

Return type:

keras.Model

Quantization

quantizeml.models.quantize(model, q_config=None, qparams=QuantizationParams(activation_bits=8, per_tensor_activations=False, weight_bits=8, output_bits=8, input_weight_bits=8, input_dtype=uint8, buffer_bits=32), samples=None, num_samples=1024, batch_size=None, epochs=1, quantize_until=None)[source]

Quantizes a Keras or ONNX model using the provided configuration or parameters.

Details on how this function behaves:

  • q_config has priority over qparams, meaning that when a match is found in q_config the given configuration will be used instead of qparams. This is useful to handle specific cases (e.g per-tensor output quantizer). This is only used when quantizing Keras models.

  • when no configuration is given, quantization parameters are deduced from qparams and OutputQuantizers are automatically set on appropriate layers.

  • qparams are only applied to ‘float’ Keras layers when they are first quantized. As a result, when re-quantizing a model, one must provide a complete q_config. This is made easy with the dump_config helper. Note that the only configuration supported when quantizing ONNX models is 8-bit for weights and activations, but the per_tensor_activations parameter will be taken into account.

If not already present, a final Dequantizer will be added at the end of the Model.

The model will also be calibrated using the provided (or randomly generated) inputs.

Parameters:
  • model (keras.Model or ModelProto) – the model to quantize

  • q_config (dict, optional) – quantization configuration as a dictionary mapping layer names to their quantization configuration. Defaults to None.

  • qparams (QuantizationParams, optional) – global quantization parameters. Defaults to QuantizationParams().

  • samples (tf.Dataset, np.array or generator, optional) – calibration samples. When no samples are provided, random samples are generated. Defaults to None.

  • num_samples (int, optional) – number of samples to use in the provided samples or number of samples to generate. Defaults to 1024.

  • batch_size (int, optional) – the batch size. Defaults to None.

  • epochs (int, optional) – the number of epochs. This parameter must be 1 for ONNX models. Defaults to 1.

  • quantize_until (str, optional) – name of the layer/node until which to quantize: other ones after it will stay unchanged. Defaults to None.

Returns:

the quantized model

Return type:

keras.Model or ModelProto
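
Example

A minimal end-to-end sketch, assuming a pre-trained Keras model and an array of calibration samples (both are placeholders):

>>> from quantizeml.models import quantize, QuantizationParams
>>> qparams = QuantizationParams(weight_bits=8, activation_bits=8)
>>> qmodel = quantize(model, qparams=qparams, samples=samples,
...                   num_samples=1024, batch_size=32)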

quantizeml.models.dump_config(model)[source]

Dump the quantization configuration of a quantized model, exporting the configuration for each quantized layer.

Parameters:

model (keras.Model) – a quantized model.

Returns:

the configuration of the model.

Return type:

dict
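
Example

As noted in quantize above, the dumped configuration can be fed back when re-quantizing; a minimal sketch, assuming model and qmodel from the previous example:

>>> from quantizeml.models import dump_config, quantize
>>> config = dump_config(qmodel)
>>> requantized = quantize(model, q_config=config)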

quantizeml.models.record_quantization_variables(model)[source]

Helper method to record quantization objects in the graph.

By passing a dummy sample through the model in recording mode, this triggers the recording of all dynamic quantization objects.

Parameters:

model (keras.Model) – model for which objects need to be recorded.

Quantization parameters

class quantizeml.models.QuantizationParams(activation_bits=8, per_tensor_activations=False, weight_bits=8, output_bits=8, input_weight_bits=8, input_dtype='uint8', buffer_bits=32)[source]

Class that holds quantization parameters.

This is a read-only data class.

Parameters:
  • activation_bits (int, optional) – activations quantization bitwidth. Defaults to 8.

  • per_tensor_activations (bool, optional) – whether to quantize activation per-tensor or per-axis. Defaults to False.

  • weight_bits (int, optional) – weights quantization bitwidth. Defaults to 8.

  • output_bits (int, optional) – outputs quantization bitwidth. Defaults to 8.

  • input_weight_bits (int, optional) – weights quantization bitwidth for the first layer. Defaults to 8.

  • input_dtype (np.dtype or str, optional) – expected model input format. If given as a string, should follow numpy string type requirements. Defaults to ‘uint8’.

  • buffer_bits (int, optional) – maximal buffer bitwidth allowed in operations. Defaults to 32.

quantizeml.models.get_quantization_params()[source]

Returns global quantization parameters.

Returns:

the quantization parameters

Return type:

QuantizationParams

quantizeml.models.quantization(qparams)[source]

Sets quantization parameters in a context.

Parameters:

qparams (QuantizationParams) – quantization parameters
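
Example

A minimal sketch, assuming the function is used as a context manager (as "in a context" suggests):

>>> from quantizeml.models import quantization, quantize, QuantizationParams
>>> with quantization(QuantizationParams(weight_bits=4, activation_bits=4)):
...     qmodel = quantize(model)  # quantized with the 4-bit parameters above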

Calibration

quantizeml.models.calibrate(model, qmodel, samples=None, num_samples=1024, batch_size=None, epochs=1)[source]

Calibrates the model using the provided samples.

With TENN models, only np.array samples are supported for calibration. Those should contain temporally coherent data, which means that their expected shape is [batch_size*Seq, dim_0, …, dim_n] for spatiotemporal TENNs, where:

  • batch_size is the same batch_size provided to the calibration.

  • Seq is a dataset parameter that defines the temporally coherent data (e.g. the number of frames per video clip).

and [batch_size, (model.input_shape)] for recurrent TENNs.

When no samples are provided, random samples are generated.

Parameters:
  • model (keras.Model) – the original model

  • qmodel (keras.Model) – the quantized model to calibrate

  • samples (tf.Dataset, np.array or generator, optional) – calibration samples. When no samples are provided, random samples are generated. Defaults to None.

  • num_samples (int, optional) – number of samples to use in the provided samples or number of samples to generate. Defaults to 1024.

  • batch_size (int, optional) – the batch size. Defaults to None.

  • epochs (int, optional) – the number of epochs. Defaults to 1.

quantizeml.models.calibration_required(model)[source]

Checks if a model requires calibration.

If one of the ‘OutputQuantizer’ layers in the model has its range_max variable set to 1, it requires calibration.

Parameters:

model (keras.Model) – the model to check

Returns:

True if calibration is required, False otherwise.

Return type:

bool
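
Example

A minimal sketch combining both functions, assuming a float model, its quantized counterpart qmodel and calibration samples (all placeholders):

>>> from quantizeml.models import calibrate, calibration_required
>>> if calibration_required(qmodel):
...     calibrate(model, qmodel, samples=samples, num_samples=1024, batch_size=32)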

Utils

quantizeml.models.apply_weights_to_model(model, weights, verbose=True)[source]

Loads weights from a dictionary and applies them to a model.

Goes through the dictionary of weights, finds the corresponding variable in the model and partially loads its weights.

Parameters:
  • model (keras.Model) – the model to update

  • weights (dict) – the dictionary of weights

  • verbose (bool, optional) – if True, throw warning messages if a dict item is not found in the model. Defaults to True.

Tensors

QTensor

class quantizeml.tensors.QTensor(shape: TensorShape)[source]

Bases: ExtensionType

Abstract class to exchange quantized tensors between layers

Classes:

Spec

alias of Spec

Methods:

assert_per_tensor()

Asserts that a QTensor is quantized per-tensor

clone()

Returns a copy of the QTensor

to_float()

Returns a float representation of the QTensor

Attributes:

name

Returns the QTensor name

per_tensor

Returns if QTensor is quantized per-tensor

Spec

alias of Spec

assert_per_tensor()[source]

Asserts that a QTensor is quantized per-tensor

clone()[source]

Returns a copy of the QTensor

Returns:

the copy.

Return type:

QTensor

property name

Returns the QTensor name

Returns:

the QTensor name

Return type:

str

property per_tensor

Returns if QTensor is quantized per-tensor

Returns:

True if QTensor is quantized per-tensor or False on per-axis case.

Return type:

bool

to_float()[source]

Returns a float representation of the QTensor

Returns:

the float representation.

Return type:

tf.Tensor

FixedPoint

class quantizeml.tensors.FixedPoint(values, value_bits, frac_bits)[source]

Bases: QTensor

A Tensor of integer values representing fixed-point numbers

The value_bits parameter sets the maximum integer values that can be stored:

\[int\_max = 2^{bits} - 1.\]

When a FixedPoint is created, its values are clipped to [-int_max-1, int_max].

Parameters:
  • values (tf.Tensor) – a tensor of integer values

  • value_bits (int) – the number of value bits.

  • frac_bits (tf.Tensor) – an integer tensor of fractional bits.

Classes:

Spec

alias of Spec

Methods:

abs()

Returns the absolute value of the FixedPoint

align(other[, value_bits])

Align fractional bits

downscale(frac_bits)

Reduce the precision of a FixedPoint

expand(value_bits)

Expand the FixedPoint to the specified bitwidth

floor()

Floors the FixedPoint

max_frac_bits(value_bits, ranges[, clamp])

Evaluate the maximum fractional bit index for the quantization ranges.

promote(bits)

Increase the number of value bits

quantize(x, value_bits[, frac_bits])

Converts a float Tensor to a FixedPoint

rescale(frac_bits[, value_bits])

Rescale a FixedPoint to a specified precision and bitwidth

shift(s)

Apply a tensor-wide left or right shift.

to_float()

Returns a float representation of the QTensor

upscale(frac_bits[, value_bits])

Align a FixedPoint to a specified precision

Attributes:

name

Returns the QTensor name

per_tensor

Returns if QTensor is quantized per-tensor

sign

Returns the sign of the FixedPoint

Spec

alias of Spec

abs()[source]

Returns the absolute value of the FixedPoint

Returns:

the absolute value.

Return type:

FixedPoint

align(other, value_bits=None)[source]

Align fractional bits

This returns an equivalent FixedPoint with a scalar fractional bit corresponding to the maximum of the current and other FixedPoint on all channels.

This is required before performing an operation that adds or subtracts elements along the last dimension, to make sure all these elements are in the same scale.

Parameters:
  • other (FixedPoint) – a FixedPoint to align to

  • value_bits (int, optional) – the target value bits. Defaults to None.

Returns:

a new FixedPoint with aligned fractional bits and the shift that was applied.

Return type:

tuple(FixedPoint, tf.Tensor)

downscale(frac_bits)[source]

Reduce the precision of a FixedPoint

Parameters:

frac_bits (tf.Tensor) – the target fractional bits

Returns:

the downscaled FixedPoint

Return type:

FixedPoint

expand(value_bits)[source]

Expand the FixedPoint to the specified bitwidth

This returns an equivalent FixedPoint with a higher or equal number of value bits and a scalar fractional bit corresponding to the maximum of the initial fractional bits on all channels.

This is mostly used to recover a per-tensor FixedPoint that has been compressed to a lower number of value bits.

Parameters:

value_bits (int) – the target value_bits

Returns:

a new FixedPoint with expanded fractional bits and the shift that was applied.

Return type:

tuple(FixedPoint, tf.Tensor)

floor()[source]

Floors the FixedPoint

Returns:

a new FixedPoint without fractional bits and the shift that was applied.

Return type:

tuple(FixedPoint, tf.Tensor)

static max_frac_bits(value_bits, ranges, clamp=True)[source]

Evaluate the maximum fractional bit index for the quantization ranges.

This method evaluates the minimum number of integer bits required to cover the specified quantization ranges (this can be a negative number if the ranges are strictly lower than 0.5).

From that it deduces the rightmost fractional bit indices.

The resulting frac_bits can be a negative number if the ranges are higher than the biggest integer that can be represented with the specified value bits.

If specified, the maximum fractional bits are clamped to the available value_bits.

Parameters:
  • value_bits (int) – the number of value bits.

  • ranges (tf.Tensor) – a tensor of float quantization ranges.

  • clamp (bool, optional) – clamp the results to self.value_bits. Defaults to True.

Returns:

a tensor of fractional bits.

Return type:

tf.Tensor

property name

Returns the QTensor name

Returns:

the QTensor name

Return type:

str

property per_tensor

Returns if QTensor is quantized per-tensor

Returns:

True if QTensor is quantized per-tensor or False on per-axis case.

Return type:

bool

promote(bits)[source]

Increase the number of value bits

Parameters:

bits (int) – the new number of value bits

Returns:

a FixedPoint with increased value bits

Return type:

FixedPoint

static quantize(x, value_bits, frac_bits=None)[source]

Converts a float Tensor to a FixedPoint

It converts the original float values into integer values so that:

\[{x_{int}} = round(x * 2^{frac\_bits})\]

Note: \(2^{-frac\_bits}\) represents the FixedPoint precision.

Before returning, the resulting integer values are clipped to the maximum integer values that can be stored for the specified value bits:

\[[-2^{value\_bits}, 2^{value\_bits} - 1]\]

If frac_bits is not specified, the method starts by evaluating the number of bits to dedicate to represent the integer part of the float tensor, clipped to value_bits:

\[int\_bits = ceil(log2(x))\]

Note: this number can be negative when x < 0.5.

It then evaluates the offset of the least significant bit of the fractional part of the float starting from zero. This represents the fractional bits:

\[frac\_bits = value\_bits - int\_bits\]
Parameters:
  • x (tf.Tensor) – a tensor of float values.

  • value_bits (int) – the number of value bits

  • frac_bits (tf.Tensor, optional) – an integer tensor of fractional bits. Defaults to None.

Returns:

the FixedPoint tensor

Return type:

FixedPoint
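
Example

A minimal sketch of a quantize/dequantize round trip (frac_bits is given as a plain integer for illustration):

>>> import tensorflow as tf
>>> from quantizeml.tensors import FixedPoint
>>> x = tf.constant([0.5, 1.25, -0.75])
>>> fp = FixedPoint.quantize(x, value_bits=7, frac_bits=4)  # values stored as x * 2^4
>>> x_back = fp.to_float()  # approximate float reconstruction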

rescale(frac_bits, value_bits=None)[source]

Rescale a FixedPoint to a specified precision and bitwidth

This primarily rescales the FixedPoint values to match the precision specified by the target fractional bits.

Optionally, this adjusts the value bits to the specified bitwidth.

The rescaling operation is:

  • a left shift of the values when their precision increases,

  • a rounded right shift of the values when their precision decreases.

This method can be used to:

  • compress a FixedPoint to a lower bitwidth after having reduced its precision,

  • expand a FixedPoint to a larger bitwidth after having increased its precision.

Parameters:
  • frac_bits (tf.Tensor) – the target fractional bits

  • value_bits (int, optional) – the target value bits

Returns:

the rescaled FixedPoint

Return type:

FixedPoint

shift(s)[source]

Apply a tensor-wide left or right shift.

This takes a tensor of shift values and applies them to each item of the FixedPoint values.

The shift values should be positive or negative integers:

  • if the value is positive, it is a left-shift,

  • if the value is negative, it is a right-shift.

The resulting FixedPoint has the same value bits and fractional bits as the source FixedPoint, which means that clipping is applied on left-shift and flooring is applied on right-shift.

Parameters:

s (tf.Tensor) – the shift values for each pixel.

Returns:

the result as a FixedPoint

Return type:

FixedPoint

property sign

Returns the sign of the FixedPoint

Returns:

the sign as a FixedPoint.

Return type:

FixedPoint

to_float()[source]

Returns a float representation of the QTensor

Returns:

the float representation.

Return type:

tf.Tensor

upscale(frac_bits, value_bits=None)[source]

Align a FixedPoint to a specified precision

The target precision must be higher than the current one.

Parameters:
  • frac_bits (tf.Tensor) – the target fractional bits

  • value_bits (int, optional) – the target value bits

Returns:

the upscaled FixedPoint

Return type:

FixedPoint

QFloat

class quantizeml.tensors.QFloat(fp, scales)[source]

Bases: QTensor

A Tensor of FixedPoint values and scales representing float numbers

The QFloat is a dual representation of a float Tensor combining FixedPoint values and float scales.

The QFloat is typically used to represent float tensors whose quantization range is not ‘optimal’ for FixedPoint quantization: the original tensor is first divided by the scales to be aligned on optimal ranges, then quantized to FixedPoint values.

When converting back to float, values are dequantized and multiplied by the scales to obtain the approximated float tensor.

Parameters:
  • fp (FixedPoint) – a FixedPoint of values

  • scales (tf.Tensor) – a Tensor of scales

Classes:

Spec

alias of Spec

Methods:

expand(value_bits)

Expand the QFloat to the specified bitwidth

max_frac_bits(value_bits, ranges, scales[, ...])

Evaluate the maximum fractional bit index for the quantization ranges.

optimal_scales(ranges, value_bits)

Evaluates the optimal QFloat scales for quantization ranges.

promote(bits)

Increase the number of value bits

quantize(x, value_bits, scales[, frac_bits])

Converts a float Tensor to a QFloat

quantize_scales(scales, scale_bits)

Quantizes the QFloat scales with the specified bitwidth.

to_fixed_point([scale_bits])

Returns a FixedPoint representation of the QFloat

to_float()

Returns a float representation of the QFloat

upscale(frac_bits[, value_bits])

Align a QFloat to a specified precision

Attributes:

name

Returns the QTensor name

per_tensor

Returns if QTensor is quantized per-tensor

Spec

alias of Spec

expand(value_bits)[source]

Expand the QFloat to the specified bitwidth

This returns an equivalent QFloat with a higher or equal number of value bits and a scalar fractional bit corresponding to the maximum of the initial fractional bits on all channels. The scales remain unchanged.

This is mostly used to recover a per-tensor QFloat that has been compressed to a lower number of value bits.

Note that even if the frac_bits are aligned, the scales remain unchanged.

Parameters:

value_bits (int) – the target value_bits

Returns:

a new QFloat with expanded fractional bits and the shift that was applied.

Return type:

tuple(QFloat, tf.Tensor)

static max_frac_bits(value_bits, ranges, scales, clamp=True)[source]

Evaluate the maximum fractional bit index for the quantization ranges.

This method evaluates the minimum number of integer bits required to cover the specified quantization ranges after having rescaled them with the specified scales. It simply calls the equivalent FixedPoint method on the rescaled ranges. If specified, it clamps the results to the available value_bits.

Parameters:
  • value_bits (int) – the number of value bits.

  • ranges (tf.Tensor) – a tensor of float quantization ranges.

  • scales (tf.Tensor) – the scales to apply to the quantization ranges.

  • clamp (bool, optional) – clamp the results to self.value_bits. Defaults to True.

Returns:

a tensor of fractional bits.

Return type:

tf.Tensor

property name

Returns the QTensor name

Returns:

the QTensor name

Return type:

str

static optimal_scales(ranges, value_bits)[source]

Evaluates the optimal QFloat scales for quantization ranges.

We choose the optimal quantization range for a given bitwidth as:

[-int_max, int_max], with \(int\_max = 2^{bits} - 1\).

This method evaluates the scales as the ratios that align the specified ranges to the optimal ranges.

Parameters:
  • ranges (tf.Tensor) – a tensor of quantization ranges.

  • value_bits (int) – the number of value bits.

Returns:

the optimal scales.

Return type:

tf.Tensor

property per_tensor

Returns if QTensor is quantized per-tensor

Returns:

True if QTensor is quantized per-tensor or False on per-axis case.

Return type:

bool

promote(bits)[source]

Increase the number of value bits

Parameters:

bits (int) – the new number of value bits

Returns:

a QFloat with increased value bits

Return type:

QFloat

static quantize(x, value_bits, scales, frac_bits=0.0)[source]

Converts a float Tensor to a QFloat

It first evaluates and quantizes the scales required to align the quantization ranges to the optimal range for the specified value bits.

It then quantizes the inputs with the quantized scales.

The resulting integer values are clipped to [-int_max-1, int_max].

Parameters:
  • x (tf.Tensor) – a tensor of float values.

  • value_bits (int) – the number of value bits.

  • scales (tf.Tensor) – a tensor of alignment scales.

  • frac_bits (int) – the inner FixedPoint fractional bits (defaults to 0).

Returns:

the QFloat representation.

Return type:

QFloat

static quantize_scales(scales, scale_bits)[source]

Quantizes the QFloat scales with the specified bitwidth.

Parameters:
  • scales (tf.Tensor) – a tensor of float scales.

  • scale_bits (int) – the number of scales bits.

Returns:

the FixedPoint scales.

Return type:

FixedPoint

to_fixed_point(scale_bits=8)[source]

Returns a FixedPoint representation of the QFloat

Parameters:

scale_bits (int, optional) – the scales quantization bitwidth. Defaults to 8.

Returns:

the FixedPoint representation and scales.

Return type:

(FixedPoint, FixedPoint)

to_float()[source]

Returns a float representation of the QFloat

Returns:

the float representation.

Return type:

tf.Tensor

upscale(frac_bits, value_bits=None)[source]

Align a QFloat to a specified precision

The target precision must be higher than the current one.

Parameters:
  • frac_bits (tf.Tensor) – the target fractional bits

  • value_bits (int, optional) – the target value bits (defaults to current value bits)

Returns:

the upscaled QFloat

Return type:

QFloat

ONNX support

Layers

quantizeml.onnx_support.layers.OnnxLayer(base_name, name='', **kwargs)[source]

Abstract class that represents an onnx subgraph in brainchip domain.

Child classes must define their attributes in __init__ and return the node list (subgraph) in build_subgraph(). If these requirements are met, make_node() can be used to define/register the custom node.

Parameters:
  • base_name (str) – the operation type base name.

  • name (str, optional) – the node name. Defaults to ‘’.

  • kwargs (dict, optional) – the custom attributes. Each attribute type will be inferred by onnx.helper.make_attribute(). Defaults to {}.

quantizeml.onnx_support.layers.QuantizedConv2D(strides=[1, 1], pool_type='none', pool_size=(2, 2), pool_strides=(2, 2), pool_pads=[0, 0, 0, 0], activation=False, name='')[source]

Intermediate representation of QLinearConv() + MaxPool() + ReLU() as an exportable node.

Parameters:
  • strides (list of int, optional) – the convolutional strides. Defaults to [1, 1].

  • pool_type (str, optional) – the pool type, one of {“none”, “max”, “gap”}. Defaults to “none”.

  • pool_size (list of int, optional) – the kernel pool shape. Ignored when pool_type != “max”. Defaults to (2, 2).

  • pool_strides (list of int, optional) – the kernel strides. Ignored when pool_type != “max”. Defaults to (2, 2).

  • pool_pads (list of int, optional) – the size of each padding dimension. Ignored when pool_type != “max”. Defaults to [0, 0, 0, 0].

  • input_conv (bool, optional) – whether to extend the set of operations of the basic QuantizedConv2D, allowing the padding value to be modified per input channel. Defaults to False.

  • activation (bool, optional) – whether to apply relu operation. Defaults to False.

  • name (str, optional) – the node name. Defaults to ‘’.

quantizeml.onnx_support.layers.QuantizedDepthwise2D(strides=[1, 1], activation=False, name='')[source]

Intermediate representation of Conv() + MaxPool() + ReLU() as an exportable node.

Parameters:
  • strides (list of int, optional) – the convolutional strides. Defaults to [1, 1].

  • activation (bool, optional) – whether to apply relu operation. Defaults to False.

  • name (str, optional) – the node name. Defaults to ‘’.

quantizeml.onnx_support.layers.QuantizedConv2DTranspose(strides=[1, 1], pads=[0, 0, 0, 0], activation=False, name='')[source]

Intermediate representation of the upsampling layer QuantizedConv2DTranspose().

Parameters:
  • strides (list of int, optional) – the convolutional strides. Defaults to [1, 1].

  • activation (bool, optional) – whether to apply relu operation. Defaults to False.

  • name (str, optional) – the node name. Defaults to ‘’.

quantizeml.onnx_support.layers.QuantizedDepthwise2DTranspose(strides=[1, 1], pads=[0, 0, 0, 0], activation=False, name='')[source]

Intermediate representation of the upsampling layer QuantizedDepthwise2DTranspose.

Inherits from QuantizedConv2DTranspose; the only differing attribute is group.

Parameters:
  • strides (list of int, optional) – the convolutional strides. Defaults to [1, 1].

  • activation (bool, optional) – whether to apply relu operation. Defaults to False.

  • name (str, optional) – the node name. Defaults to ‘’.

quantizeml.onnx_support.layers.QuantizedDense1D(flatten=False, activation=False, name='')[source]

Intermediate representation of Flatten() + QGemm() + ReLU() as an exportable node.

Parameters:
  • flatten (bool, optional) – whether to flatten the inputs. Defaults to False.

  • activation (bool, optional) – whether to apply relu operation. Defaults to False.

  • name (str, optional) – the node name. Defaults to ‘’.

quantizeml.onnx_support.layers.QuantizedAdd(activation=False, name='')[source]

Intermediate representation of Add() as an exportable node.

Parameters:
  • activation (bool, optional) – whether to apply relu operation. Defaults to False.

  • name (str, optional) – the node name. Defaults to ‘’.

quantizeml.onnx_support.layers.QuantizedConcat(axis, name='')[source]

Intermediate representation of Concatenate() as an exportable node.

Parameters:
  • axis (int) – the axis along which to concatenate.

  • name (str, optional) – the node name. Defaults to ‘’.

quantizeml.onnx_support.layers.InputQuantizer(input_tp, perm=None, input_signed=False, name='')[source]

Intermediate representation of QuantizeLinear(), used to quantize the input.

Parameters:
  • input_tp (TensorProto) – the input of the ONNX model.

  • perm (list, optional) – list representing the permutations of the rescale node. Defaults to None.

  • input_signed (bool, optional) – whether the input is signed. Defaults to False.

  • name (str, optional) – the node name. Defaults to ‘’.

quantizeml.onnx_support.layers.Dequantizer(name='')[source]

Intermediate representation of DequantizeLinear(), used to dequantize the inputs.

Parameters:

name (str, optional) – the node name. Defaults to ‘’.

Custom patterns

quantizeml.onnx_support.quantization.custom_pattern_scope(new_patterns)[source]

Register a custom pattern in the context to be used at quantization time.

A pattern is understood as a sequence of contiguous operations in the graph whose representation can be mapped to an OnnxLayer.

Parameters:

new_patterns (dict) – a dictionary mapping sequences of nodes (keys) to their mapper functions (values).
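
Example

A hypothetical sketch: the pattern key format and the pattern_to_layer mapper are illustrative assumptions, not part of the documented API.

>>> from quantizeml.onnx_support.quantization import custom_pattern_scope
>>> # hypothetical mapper returning an OnnxLayer for the matched node sequence
>>> patterns = {("Relu", "MaxPool"): pattern_to_layer}
>>> with custom_pattern_scope(patterns):
...     qmodel = quantize(onnx_model)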

Model I/O

quantizeml.load_model(model_path, custom_layers=None, compile_model=True)[source]

Loads an ONNX or Keras model. An error is raised if the provided model extension is not supported.

Parameters:
  • model_path (str) – path of the model to load.

  • custom_layers (dict, optional) – custom layers to add to the Keras model. Defaults to None.

  • compile_model (bool, optional) – whether to compile the Keras model. Defaults to True.

Returns:

Loaded model.

Return type:

keras.models.Model or onnx.ModelProto

Raises:

ValueError – if the model could not be loaded using Keras and ONNX loaders.

quantizeml.save_model(model, path)[source]

Save an ONNX or Keras model into a path.

Note that the extension is overwritten according to the model type.

Parameters:
  • model (keras.Model, keras.Sequential or onnx.ModelProto) – model to serialize.

  • path (str) – path to save the model.

Returns:

the path where the model was saved.

Return type:

str

Raises:

ValueError – if the model to save is not a Keras or ONNX model.
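
Example

A minimal round-trip sketch (file names are placeholders):

>>> import quantizeml
>>> model = quantizeml.load_model("model.h5")
>>> saved_path = quantizeml.save_model(model, "model_copy.h5")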

Analysis

Kernel distribution

quantizeml.analysis.plot_kernel_distribution(model, logdir)[source]

Plot the kernel distribution of each layer/node in the model.

Distributions are plotted in two ways: histogram and boxplot

After exporting them, the plots can be visualized with TensorBoard from the command line:

>>> tensorboard --logdir=`logdir`
Parameters:
  • model (onnx.ModelProto or tf.keras.Model) – the model to plot the kernel distribution

  • logdir (str) – the directory to save the plots

Quantization error

quantizeml.analysis.measure_layer_quantization_error(fmodel, qmodel, target_layer=None, batch_size=16, seed=None)[source]

Measures the layer quantization error

Returns a dictionary where the keys are the name of each layer and the values are a dictionary composed of the set of the following metrics:

  • Symmetric Mean Absolute Percentage Error (SMAPE): tools.metrics.SMAPE()

  • Saturation: the percentage of values in the quantized layer that saturate

Example

>>> summary = measure_layer_quantization_error(fmodel, qmodel)
>>> assert isinstance(summary[a_layer_name], dict)
>>> assert "SMAPE" in summary[a_layer_name]
Parameters:
  • fmodel (onnx.ModelProto or tf.keras.Model) – the float model.

  • qmodel (onnx.ModelProto or tf.keras.Model) – the quantized version of fmodel.

  • target_layer (str, optional) – error computation is performed only on the target layer/node, expanding the analysis to each output channel. Defaults to None.

  • batch_size (int, optional) – the batch size of the samples to be generated. A larger batch size allows better generalization of the metrics, but consumes more resources. Defaults to 16.

  • seed (int, optional) – a random seed. Defaults to None.

Returns:

the quantization error for each layer

Return type:

dict

Notes

  • Layers/Nodes that do not produce quantization errors will not be taken into account (e.g. QuantizedReshape).

quantizeml.analysis.measure_cumulative_quantization_error(fmodel, qmodel, target_layer=None, batch_size=16, seed=None)[source]

Measures the cumulative quantization error

Returns a dictionary where the keys are the name of each layer and the values are a dictionary composed of the set of the following metrics:

  • Symmetric Mean Absolute Percentage Error (SMAPE): tools.metrics.SMAPE()

  • Saturation: the percentage of values in the quantized layer that saturate

Each metric measures the quantization error from the input to the layer.

Example

>>> summary = measure_cumulative_quantization_error(fmodel, qmodel)
>>> assert isinstance(summary[a_layer_name], dict)
>>> assert "SMAPE" in summary[a_layer_name]
Parameters:
  • fmodel (onnx.ModelProto or tf.keras.Model) – the float model.

  • qmodel (onnx.ModelProto or tf.keras.Model) – the quantized version of fmodel.

  • target_layer (str, optional) – error computation is performed only on the target layer/node, expanding the analysis to each output channel. Defaults to None.

  • batch_size (int, optional) – the batch size of the samples to be generated. A larger batch size allows better generalization of the metrics, but consumes more resources. Defaults to 16.

  • seed (int, optional) – a random seed. Defaults to None.

Returns:

the quantization error for each layer

Return type:

dict

Notes

  • Layers/Nodes that do not produce quantization errors will not be taken into account (e.g. QuantizedReshape).

Metrics

quantizeml.analysis.tools.SMAPE(name='smape', **kwargs)[source]

Compute the Symmetric Mean Absolute Percentage Error (SMAPE) as:

>>> mean(abs(x - y) / (abs(x) + abs(y)))

Reference: https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error

Parameters:

name (str, optional) – name of the metric. Defaults to “smape”.
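
Example

The same formula in plain NumPy, for reference:

>>> import numpy as np
>>> x, y = np.array([1.0, 2.0]), np.array([1.1, 1.8])
>>> smape = np.mean(np.abs(x - y) / (np.abs(x) + np.abs(y)))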

quantizeml.analysis.tools.Saturation(name='saturation', min_value=None, max_value=None, **kwargs)[source]

Returns the percentage of saturating values.

We consider a value saturated if it is one of {min_value, max_value}.

Parameters:
  • min_value (np.ndarray, optional) – the minimum of values. If not provided, it is inferred from the values type. Defaults to None.

  • max_value (np.ndarray, optional) – the maximum of values. If not provided, it is inferred from the values type. Defaults to None.

quantizeml.analysis.tools.print_metric_table(summary, model_name='')[source]

Print a table with the results of a set of metrics.

The following format is expected:

# Format for metrics
# 1. Simple set of metrics
metrics_for_key_1 = {"metric_1": key1_metric1_value, "metric_2": key1_metric2_value}
# 2. List of simple set of metrics
metrics_for_key_2 = [{"metric_1": key2_metric1_value1, "metric_2": key2_metric2_value1},
                     {"metric_1": key2_metric1_value2, "metric_2": key2_metric2_value2}]
# 3. List of complex set of metrics
metrics_for_key_3 = [metrics_for_key_2, metrics_for_key_2]

# Summary
summary = {
    "key_1": metrics_for_key_1,
    "key_2": metrics_for_key_2,
    "key_3": metrics_for_key_3,
}
Parameters:
  • summary (dict) – summary of metrics to draw

  • model_name (str, optional) – A model name to display. Defaults to “”.

Note

All metrics must contain the same set of measures.