Akida edge learning for keyword spotting

This tutorial demonstrates the Akida NSoC edge learning capabilities using its built-in learning algorithm.

It focuses on a keyword spotting (KWS) example, where an existing Akida network is re-trained to be able to classify new audio keywords.

Just a few samples (few-shot learning) of the new words are sufficient to augment the Akida model with extra classes, while preserving high accuracy.

1. Edge learning process

By “edge learning”, we mean the process of network learning in an edge device. Aside from technical requirements imposed by the device (low power, latency, etc.), the task itself will often present particular challenges:

  1. The application cannot know in advance which classes, or indeed how many, it will ultimately be trained on, so it must be possible to add new classes to the classifier online. This requires continual learning.

  2. Often, there will be no large labelled dataset for new classes, which must instead be learned from just a few samples. This requires few-shot learning.

The Akida NSoC has a built-in learning algorithm designed for training the last layer of a model and well suited for edge learning. The specific use case in this tutorial mimics the process of a mobile phone user who wants to add new speech commands, i.e. new keywords, to a pre-trained voice recognition system with a few preset keywords. To achieve this using the Akida NSoC, learning occurs in 3 stages:

  1. The Akida model preparation: an Akida model must meet specific conditions to be compatible with Akida learning.

  2. The “offline” Akida learning: the last layer of the Akida model is trained from scratch with a large dataset. In this KWS case, the model is trained with 32 keywords from the Google “Speech Commands dataset”.

  3. The “online” (edge learning) stage: new keywords are learned with few samples, adding to the pre-trained words from stage 2.

1.1 Akida model preparation

The Akida NSoC embeds a native learning algorithm allowing training of the last layer of an Akida model. The overall model can then be seen as the combination of two parts:

  • a feature extractor (or spike generator) corresponding to all but the last layer of a standard (back-propagation trained) neural network. This part of the model cannot be trained on the Akida NSoC, and would typically be prepared in advance, e.g. using the CNN2SNN conversion tool. Also, to be compatible with Akida learning, the feature extractor must return binary spikes (1-bit activations).

  • the classification layer (i.e. the last layer). It must have 1-bit weights and usually has several output neurons per class. This layer will be the only one trained using the built-in learning algorithm.

Note that, unlike a standard CNN where each class is represented by a single output neuron, Akida native training requires several neurons for each class. The number of neurons per class can be seen as the number of centroids used to represent a class; there is an analogy with k-means clustering applied to the samples of a single class, k being the number of neurons. The choice of the number of neurons is a trade-off: too many neurons per class may be computationally inefficient, whereas too few neurons per class may fail to represent within-class heterogeneity. As with k-means clustering, the choice of k depends on the cluster structure of the data.
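The grouping of neurons into classes can be sketched as follows. This is an illustration under stated assumptions, not the Akida implementation: output neurons are assumed to be laid out in contiguous blocks, one block per class, and the predicted class is assumed to be the block containing the neuron with the highest potential. The function names are hypothetical.

```python
def neuron_to_class(neuron_index, neurons_per_class):
    """Map an output neuron to its class, assuming contiguous blocks per class."""
    return neuron_index // neurons_per_class


def predict_class(potentials, neurons_per_class):
    """Return the class of the neuron with the highest potential."""
    best_neuron = max(range(len(potentials)), key=lambda i: potentials[i])
    return neuron_to_class(best_neuron, neurons_per_class)


# Toy example: 3 classes x 2 neurons per class (6 output neurons).
potentials = [0.1, 0.4, 0.2, 0.9, 0.3, 0.0]
print(predict_class(potentials, neurons_per_class=2))  # neuron 3 -> class 1
```

With several neurons per class, any one of them firing strongly is enough to select the class, which is what lets each neuron act as a separate centroid.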

Like any training process, hyper-parameters must be set appropriately. The only mandatory parameter is the number of weights (i.e. number of connections for each neuron) which must be correlated to the number of spikes at the end of the feature extractor. Other parameters, such as min_plasticity or learning_competition, are optional and mainly used for model fine-tuning: one can set them to default for a first try.
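As a rough sketch of how the number of weights can be tied to the spike count, the snippet below takes the median number of output spikes over a set of samples and applies a safety margin. The 1.2 margin matches the value used later in this tutorial; the helper name and the toy spike counts are made up for illustration.

```python
from statistics import median


def estimate_num_weights(spike_counts, margin=1.2):
    """Return int(margin * median(spike_counts)).

    spike_counts holds the number of output spikes of the feature
    extractor, one value per input sample. margin is a tunable assumption.
    """
    return int(margin * median(spike_counts))


# Toy spike counts (made-up values, for illustration only).
counts = [60, 65, 71, 78, 90]
print(estimate_num_weights(counts))  # median 71 -> int(85.2) = 85
```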

1.2 “Offline” Akida learning

The model is ready for training. Remember that the feature extractor has already been trained in stage 1. Here, only the last Akida layer is trainable. Training is still “offline” though, corresponding to the preparation of the network with the “preset” command keywords. The last layer is trained from scratch: its binary weights are randomly initialized.

A large dataset is passed through the Akida network and the on-chip learning algorithm updates the weights of the classification layer accordingly. In this KWS example, we take a dataset containing 32 words + a “silence” class (33 classes) for a total of about 94,000 samples.

Note that the dataset on which the feature extractor was trained does not need to be the same as the one used for “offline” training of the classification layer. What is important is that the features extracted are as good as possible for the expected inputs. Since the “edge” classes are, by definition, not known in advance, in practice that typically means making your feature extractor as general as possible.

1.3 “Online” edge learning

“Online” edge learning consists of adding new classes to a previously trained model and learning them. This stage is meant to be performed on a chip, with only a few examples of each new class.

In practice, edge learning with Akida is similar to “offline” learning, except that:

  • the network has already been trained on a set of classes which need to be kept, and so the novel classes are in addition to those.

  • only a few samples are available for training.

In this KWS example, 3 new keywords are learned using 4 samples per word from a single user. Applying data augmentation to these samples adds variability and improves generalization. After edge learning, the model classifies the 3 new classes with an accuracy comparable to that of the 33 existing classes, and performance on the existing classes is barely affected.

2. Dataset preparation

The data comes from the Google “Speech Commands” dataset containing audio files for 35 keywords. The number of utterances for each word varies from 1,500 to 4,000. 32 words are used to train the Akida model and 3 new words are added for edge learning.

Two datasets are loaded:

  • The first dataset contains all samples of the 32 preset keywords, extended with the “silence” samples (see the original paper for details on the dataset). In total, 94,252 samples are used. These are split into a training set (90%) and a validation set (10%), used to train the model “offline” (stage 2).

  • The second dataset contains samples of the 3 new keywords from a single speaker: ‘backward’, ‘follow’ and ‘forward’. Since the aim of edge learning is to train with few samples, only 4 utterances per word are used for training and the rest for testing (ideally, one would test with many more samples, but the number of repetitions per individual speaker in the database makes this impossible). Data augmentation is applied with time shift and additional background noise, generating 40 training samples per utterance, and therefore 4 x 40 = 160 training samples per new word.
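The augmentation arithmetic above can be illustrated with a minimal sketch. It is not the utility used to build the tutorial data: Gaussian noise stands in for recorded background noise, the signals are random toy data, and all names and parameter values are hypothetical.

```python
import random


def time_shift(signal, shift):
    """Shift a signal by `shift` samples, padding with zeros (no wrap-around)."""
    n = len(signal)
    if shift >= 0:
        return [0.0] * min(shift, n) + signal[:max(n - shift, 0)]
    return signal[min(-shift, n):] + [0.0] * min(-shift, n)


def augment(signal, copies, max_shift, noise_level, rng):
    """Generate `copies` variants with a random time shift and additive noise."""
    variants = []
    for _ in range(copies):
        shifted = time_shift(signal, rng.randint(-max_shift, max_shift))
        noisy = [x + rng.gauss(0.0, noise_level) for x in shifted]
        variants.append(noisy)
    return variants


rng = random.Random(0)
# 4 toy "utterances" of 100 samples each for one new word.
utterances = [[rng.uniform(-1, 1) for _ in range(100)] for _ in range(4)]
training_set = [
    variant
    for utterance in utterances
    for variant in augment(utterance, copies=40, max_shift=10,
                           noise_level=0.01, rng=rng)
]
print(len(training_set))  # 4 utterances x 40 copies = 160
```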

The audio files are pre-processed: the mel-frequency cepstral coefficients (MFCC) are computed as features to represent each audio sample. The obtained features for one sample are stored in an array of shape (49, 10, 1). This array of features is chosen as input in the Akida network.
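One plausible way to arrive at the (49, 10, 1) shape is sketched below. The framing parameters are assumptions that this tutorial does not state: 1-second clips sampled at 16 kHz, a 40 ms analysis window, a 20 ms stride and 10 MFCC coefficients per frame. The actual pre-processing settings may differ.

```python
def num_frames(num_samples, window, stride):
    """Number of full analysis windows that fit in the signal."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


# Assumed framing parameters (not given in the tutorial).
sample_rate = 16000                  # Hz
clip = sample_rate * 1               # 1 s clip -> 16000 samples
window = sample_rate * 40 // 1000    # 40 ms -> 640 samples
stride = sample_rate * 20 // 1000    # 20 ms -> 320 samples
num_mfcc = 10                        # coefficients kept per frame

shape = (num_frames(clip, window, stride), num_mfcc, 1)
print(shape)  # (49, 10, 1)
```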

For the sake of simplicity, the pre-processing step is not detailed here; this tutorial directly fetches the pre-processed audio data for both datasets. The pre-processing utility methods used to generate these MFCC data are available in the akida_models package.

import pickle

from tensorflow.keras.utils import get_file

# Fetch pre-processed data for 32 keywords
fname = get_file(
    fname='kws_preprocessed_all_words_except_backward_follow_forward.pkl',
    origin=
    "http://data.brainchip.com/dataset-mirror/kws/kws_preprocessed_all_words_except_backward_follow_forward.pkl",
    cache_subdir='datasets/kws')
with open(fname, 'rb') as f:
    [x_train_ak, y_train, x_val_ak, y_val, _, _, word_to_index,
     data_transform] = pickle.load(f)

# Fetch pre-processed data for the 3 new keywords
fname2 = get_file(
    fname='kws_preprocessed_edge_backward_follow_forward.pkl',
    origin=
    "http://data.brainchip.com/dataset-mirror/kws/kws_preprocessed_edge_backward_follow_forward.pkl",
    cache_subdir='datasets/kws')
with open(fname2, 'rb') as f:
    [
        x_train_new_ak, y_train_new, x_val_new_ak, y_val_new, files_train,
        files_val, word_to_index_new, dt2
    ] = pickle.load(f)

print("Wanted words and labels:\n", word_to_index)
print("New words:\n", word_to_index_new)

Out:

Wanted words and labels:
 {'six': 23, 'three': 25, 'seven': 21, 'bed': 1, 'eight': 6, 'yes': 31, 'cat': 3, 'on': 18, 'one': 19, 'stop': 24, 'two': 27, 'house': 11, 'five': 7, 'down': 5, 'four': 8, 'go': 9, 'up': 28, 'learn': 12, 'no': 16, 'bird': 2, 'zero': 32, 'nine': 15, 'visual': 29, 'wow': 30, 'sheila': 22, 'marvin': 14, 'off': 17, 'right': 20, 'left': 13, 'happy': 10, 'dog': 4, 'tree': 26, '_silence_': 0}
New words:
 {'backward': 0, 'follow': 1, 'forward': 2}

3. Prepare Akida model for learning

As explained above, to be compatible with Akida:

  • the feature extractor must return binary spikes.

  • the classification layer must have binary weights.

For this example, we load a pre-trained model from which we keep the feature extractor, returning binary spikes. This model was previously trained and quantized with Keras and the CNN2SNN tool. The first dataset with 33 classes (32 keywords + “silence”) was used for training.

However, the last layer of this pre-trained model is not compatible with Akida learning since it doesn’t have binary weights. We therefore remove this last layer and add a new classification layer with 33 classes and 15 neurons per class. One can try different numbers of neurons per class, e.g. from 1 to 500, and observe the effect on accuracy and training time.

Moreover, as for any training algorithm, the learning hyper-parameters have to be correctly set. For the Akida learning algorithm, one important parameter is the number of weights: because of the way the Akida learning algorithm works, the number of spikes at the end of the feature extractor provides a good starting point for this hyper-parameter. Here, we estimate this number of output spikes using 10% of the training set, which is enough to have a reasonable estimation.

from akida_models import ds_cnn_kws_pretrained

# Instantiate a quantized model with pretrained quantized weights
model = ds_cnn_kws_pretrained()
model.summary()

Out:

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_3 (InputLayer)         [(None, 49, 10, 1)]       0
_________________________________________________________________
conv_0 (QuantizedConv2D)     (None, 25, 5, 32)         832
_________________________________________________________________
activation_discrete_relu (Ac (None, 25, 5, 32)         0
_________________________________________________________________
separable_1 (QuantizedSepara (None, 25, 5, 64)         2400
_________________________________________________________________
activation_discrete_relu_1 ( (None, 25, 5, 64)         0
_________________________________________________________________
separable_2 (QuantizedSepara (None, 25, 5, 64)         4736
_________________________________________________________________
activation_discrete_relu_2 ( (None, 25, 5, 64)         0
_________________________________________________________________
separable_3 (QuantizedSepara (None, 25, 5, 64)         4736
_________________________________________________________________
activation_discrete_relu_3 ( (None, 25, 5, 64)         0
_________________________________________________________________
separable_4 (QuantizedSepara (None, 25, 5, 64)         4736
_________________________________________________________________
activation_discrete_relu_4 ( (None, 25, 5, 64)         0
_________________________________________________________________
separable_5 (QuantizedSepara (None, 25, 5, 64)         4736
_________________________________________________________________
separable_5_global_avg (Glob (None, 64)                0
_________________________________________________________________
activation_discrete_relu_5 ( (None, 64)                0
_________________________________________________________________
reshape_1 (Reshape)          (None, 1, 1, 64)          0
_________________________________________________________________
separable_6 (QuantizedSepara (None, 1, 1, 256)         17216
_________________________________________________________________
activation_discrete_relu_6 ( (None, 1, 1, 256)         0
_________________________________________________________________
flatten (Flatten)            (None, 256)               0
_________________________________________________________________
dense_7 (QuantizedDense)     (None, 33)                8481
_________________________________________________________________
act_softmax (Activation)     (None, 33)                0
=================================================================
Total params: 47,873
Trainable params: 47,873
Non-trainable params: 0
_________________________________________________________________
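
The parameter counts in this summary can be cross-checked by hand. The sketch below assumes that each layer has one bias per output channel and that separable layers consist of a 3x3 depthwise kernel followed by a 1x1 pointwise convolution, which is consistent with the kernel shapes reported in the converted model summary. The helper names are hypothetical.

```python
def conv_params(kernel_h, kernel_w, in_ch, out_ch):
    """Standard convolution: one kernel per output channel, plus biases."""
    return kernel_h * kernel_w * in_ch * out_ch + out_ch


def separable_params(kernel, in_ch, out_ch):
    """Depthwise kernel per input channel + pointwise 1x1 mixing + biases."""
    return kernel * kernel * in_ch + in_ch * out_ch + out_ch


def dense_params(in_units, out_units):
    """Fully connected layer: weights plus biases."""
    return in_units * out_units + out_units


counts = {
    "conv_0": conv_params(5, 5, 1, 32),
    "separable_1": separable_params(3, 32, 64),
    "separable_2": separable_params(3, 64, 64),
    "separable_3": separable_params(3, 64, 64),
    "separable_4": separable_params(3, 64, 64),
    "separable_5": separable_params(3, 64, 64),
    "separable_6": separable_params(3, 64, 256),
    "dense_7": dense_params(256, 33),
}
for name, count in counts.items():
    print(f"{name}: {count}")
print("total:", sum(counts.values()))  # 47873
```
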

import numpy as np

from math import ceil

from cnn2snn import convert

#  Convert to an Akida model
input_scaling = (255, 0)
model_ak = convert(model, input_scaling=input_scaling)
model_ak.summary()

Out:

Warning: the activation layer 'act_softmax' will be discarded at conversion. The outputs of the Akida model will be the potentials before this activation layer.
                                   Model Summary
____________________________________________________________________________________
Layer (type)                          Output shape  Kernel shape
====================================================================================
conv_0 (InputConvolutional)           [5, 25, 32]   (5, 5, 1, 32)
____________________________________________________________________________________
separable_1 (SeparableConvolutional)  [5, 25, 64]   (3, 3, 32, 1), (1, 1, 32, 64)
____________________________________________________________________________________
separable_2 (SeparableConvolutional)  [5, 25, 64]   (3, 3, 64, 1), (1, 1, 64, 64)
____________________________________________________________________________________
separable_3 (SeparableConvolutional)  [5, 25, 64]   (3, 3, 64, 1), (1, 1, 64, 64)
____________________________________________________________________________________
separable_4 (SeparableConvolutional)  [5, 25, 64]   (3, 3, 64, 1), (1, 1, 64, 64)
____________________________________________________________________________________
separable_5 (SeparableConvolutional)  [1, 1, 64]    (3, 3, 64, 1), (1, 1, 64, 64)
____________________________________________________________________________________
separable_6 (SeparableConvolutional)  [1, 1, 256]   (3, 3, 64, 1), (1, 1, 64, 256)
____________________________________________________________________________________
dense_7 (FullyConnected)              [1, 1, 33]    (1, 1, 256, 33)
____________________________________________________________________________________
Input shape: 49, 10, 1
Backend type: Software - 1.8.10

# Measure Akida accuracy on validation set
batch_size = 1000
preds_ak = np.zeros(y_val.shape[0])
num_batches_val = ceil(x_val_ak.shape[0] / batch_size)
for i in range(num_batches_val):
    s = slice(i * batch_size, (i + 1) * batch_size)
    preds_ak[s] = model_ak.predict(x_val_ak[s])

acc_val_ak = np.sum(preds_ak == y_val) / y_val.shape[0]
print(f"Akida CNN2SNN validation set accuracy: {100 * acc_val_ak:.2f} %")

# For non-regression purpose
assert acc_val_ak > 0.88

Out:

Akida CNN2SNN validation set accuracy: 91.33 %

from akida import FullyConnected

# Replace the last layer by a classification layer with binary weights
# Here, we choose to set 15 neurons per class.
num_classes = 33
num_neurons_per_class = 15

model_ak.pop_layer()
layer_fc = FullyConnected(name='akida_edge_layer',
                          num_neurons=num_classes * num_neurons_per_class,
                          activations_enabled=False)
model_ak.add(layer_fc)
# Estimate the number of spikes at the end of the feature extractor.

# Get the Observer of the feature extractor output (layer before last)
last_layer_feature_extractor = model_ak.get_layer('separable_6')
obs = model_ak.get_observer(last_layer_feature_extractor)

# Forward samples to get the number of output spikes
# 10% of the training set should be sufficient for a good estimate
num_batches = ceil(0.1 * x_train_ak.shape[0] / batch_size)
for i in range(num_batches):
    model_ak.forward(x_train_ak[i * batch_size:(i + 1) * batch_size])

# Retrieve the number of output spikes from the Observer
all_spikes = [spikes.nnz for _, spikes in obs.spikes.items()]
median_spikes = np.median(all_spikes)
print(f"Median of number of spikes: {median_spikes}")

# Fix the number of weights to 1.2 times the median of output spikes
num_weights = int(1.2 * median_spikes)
print("The number of weights is then set to:", num_weights)

# Plot a histogram of the number of output spikes
import matplotlib.pyplot as plt
plt.hist(all_spikes, bins=30)
plt.title("Number of output spikes of the feature extractor")
plt.xlabel("Number of output spikes")
plt.ylabel("Frequency")
plt.show()

[Figure: histogram of the number of output spikes of the feature extractor (x-axis: number of output spikes; y-axis: frequency)]

Out:

Median of number of spikes: 71.0
The number of weights is then set to: 85

4. Learn with Akida using the training set

This stage shows how to train the Akida model using the built-in learning algorithm in an “offline” stage, i.e. training the classification layer from scratch using a large training set. The dataset containing the 33 classes (32 keywords + “silence”) is used.

Now that the Akida model is ready for training, the hyper-parameters must be set using the compile method. Compiling configures the last layer for training. For more information about the learning hyper-parameters, check the user guide. Note that we set learning_competition to 0.1, which introduces a little competition between neurons to prevent them from learning similar features.

Once the last layer is compiled, the fit method is used to pass the dataset for training. This call is similar to the fit method in tf.keras.
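The training and evaluation loops in this tutorial feed the data in batches of 1,000 samples. The batching pattern on its own can be sketched as follows, with a toy list standing in for the dataset; `batch_slices` is a hypothetical helper, not part of the Akida API.

```python
from math import ceil


def batch_slices(num_samples, batch_size):
    """Return slices that cover num_samples items in chunks of batch_size."""
    num_batches = ceil(num_samples / batch_size)
    return [slice(i * batch_size, (i + 1) * batch_size)
            for i in range(num_batches)]


# Toy dataset of 2,500 items; the last batch is smaller than the others.
samples = list(range(2500))
batches = [samples[s] for s in batch_slices(len(samples), 1000)]
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Slicing past the end of a sequence is safe in Python, so the final, shorter batch needs no special handling.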

After training, the model is assessed on the validation set using the predict method. It returns the estimated labels for the validation samples. The model is then saved to a .fbz file.

Note that in this specific case, the same dataset was used both to train the feature extractor with the CNN2SNN tool at an earlier stage and to train this classification layer with the native learning algorithm. However, the edge learning in the next stage passes completely new data through the network.

# Compile Akida model with learning parameters
model_ak.compile(num_weights=num_weights,
                 num_classes=num_classes,
                 learning_competition=0.1)
model_ak.summary()

Out:

                                   Model Summary
____________________________________________________________________________________
Layer (type)                          Output shape  Kernel shape
====================================================================================
conv_0 (InputConvolutional)           [5, 25, 32]   (5, 5, 1, 32)
____________________________________________________________________________________
separable_1 (SeparableConvolutional)  [5, 25, 64]   (3, 3, 32, 1), (1, 1, 32, 64)
____________________________________________________________________________________
separable_2 (SeparableConvolutional)  [5, 25, 64]   (3, 3, 64, 1), (1, 1, 64, 64)
____________________________________________________________________________________
separable_3 (SeparableConvolutional)  [5, 25, 64]   (3, 3, 64, 1), (1, 1, 64, 64)
____________________________________________________________________________________
separable_4 (SeparableConvolutional)  [5, 25, 64]   (3, 3, 64, 1), (1, 1, 64, 64)
____________________________________________________________________________________
separable_5 (SeparableConvolutional)  [1, 1, 64]    (3, 3, 64, 1), (1, 1, 64, 64)
____________________________________________________________________________________
separable_6 (SeparableConvolutional)  [1, 1, 256]   (3, 3, 64, 1), (1, 1, 64, 256)
____________________________________________________________________________________
akida_edge_layer (FullyConnected)     [1, 1, 495]   (1, 1, 256, 495)
____________________________________________________________________________________

              Learning Summary
____________________________________________
Learning Layer    # Input Conn.  # Weights
============================================
akida_edge_layer  256            85
____________________________________________
Input shape: 49, 10, 1
Backend type: Software - 1.8.10

from time import time

# Train the last layer using Akida `fit` method
print(f"Akida learning with {num_classes} classes... \
        (this step can take a few minutes)")
num_batches = ceil(x_train_ak.shape[0] / batch_size)
start = time()
for i in range(num_batches):
    s = slice(i * batch_size, (i + 1) * batch_size)
    model_ak.fit(x_train_ak[s], y_train[s].astype(np.int32))
end = time()

print(f"Elapsed time for Akida training: {end-start:.2f} s")

Out:

Akida learning with 33 classes...         (this step can take a few minutes)
Elapsed time for Akida training: 163.18 s

# Measure Akida accuracy on validation set
preds_val_ak = np.zeros(y_val.shape[0])
for i in range(num_batches_val):
    s = slice(i * batch_size, (i + 1) * batch_size)
    preds_val_ak[s] = model_ak.predict(x_val_ak[s], num_classes=num_classes)

acc_val_ak = np.sum(preds_val_ak == y_val) / y_val.shape[0]
print(f"Akida validation set accuracy: {100 * acc_val_ak:.2f} %")

Out:

Akida validation set accuracy: 89.49 %

import os

from tempfile import TemporaryDirectory

# Save Akida model
temp_dir = TemporaryDirectory(prefix='edge_learning_kws')
model_file = os.path.join(temp_dir.name, 'ds_cnn_edge_kws.fbz')
model_ak.save(model_file)
del model_ak

5. Edge learning

After the “offline” training stage, we emulate the use case where the pre-trained Akida model is loaded on an Akida chip, ready to learn new classes. Our previously saved Akida model has 33 output classes with learned weights. We now add 3 classes to the existing model using the add_classes method and learn the 3 new keywords without changing the already learned weights.

There is no need to compile the final layer again; the new neurons were initialized along with the old ones, based on the learning hyper-parameters given in the compile call. The edge learning then uses the same scheme as for the “offline” Akida learning - only the number of samples used is much more restricted.

Here, each new class is trained using 160 samples, stored in the second dataset: 4 utterances per word from a single speaker, augmented 40 times each. The validation set for new words [‘backward’, ‘follow’, ‘forward’] contains respectively 6, 7 and 6 utterances.
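The label bookkeeping for the new classes can be sketched in isolation. The snippet assumes that the new classes are numbered directly after the 33 existing ones (labels 0 to 32) and that the layer keeps 15 neurons per class when classes are added; the variable names are illustrative.

```python
num_classes = 33          # existing classes, labelled 0..32
neurons_per_class = 15    # assumed unchanged when classes are added
word_to_index_new = {'backward': 0, 'follow': 1, 'forward': 2}

# Shift the new labels so that they follow the existing ones.
offset_labels = {word: label + num_classes
                 for word, label in word_to_index_new.items()}
print(offset_labels)  # {'backward': 33, 'follow': 34, 'forward': 35}

# Size of the classification layer once the 3 classes are added.
total_classes = num_classes + len(word_to_index_new)
print(total_classes * neurons_per_class)  # 36 x 15 = 540
```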

print(f"Validation set of new words ({y_val_new.shape[0]} samples):")
for word, label in word_to_index_new.items():
    print(f" - {word} (label {label}): {np.sum(y_val_new == label)} samples")

# Update new labels following the numbering of the old keywords, i.e. new word
# with label '0' becomes label '33', new word label '1' becomes '34', etc.
y_train_new += num_classes
y_val_new += num_classes

Out:

Validation set of new words (19 samples):
 - backward (label 0): 6 samples
 - follow (label 1): 7 samples
 - forward (label 2): 6 samples

from akida import Model

# Load the pre-trained model (no need to compile it again)
model_edge = Model(model_file)
model_edge.add_classes(3)

# Train the Akida model with new keywords; only few samples are used.
print("\nEdge learning with 3 new classes ...")
start = time()
model_edge.fit(x_train_new_ak, y_train_new.astype(np.int32))
end = time()
print(f"Elapsed time for Akida edge learning: {end-start:.2f} s")

Out:

Edge learning with 3 new classes ...
Elapsed time for Akida edge learning: 0.83 s

# Predict on the new validation set
preds_ak_new = model_edge.predict(x_val_new_ak, num_classes=num_classes + 3)
good_preds_val_new_ak = np.sum(preds_ak_new == y_val_new)
print(f"Akida validation set accuracy on 3 new keywords: \
        {good_preds_val_new_ak}/{y_val_new.shape[0]}")

# Predict on the old validation set. Edge learning of the 3 new keywords barely
# affects the accuracy of the old classes.
preds_ak_old = np.zeros(y_val.shape[0])
for i in range(num_batches_val):
    s = slice(i * batch_size, (i + 1) * batch_size)
    preds_ak_old[s] = model_edge.predict(x_val_ak[s],
                                         num_classes=num_classes + 3)

acc_val_old_ak = np.sum(preds_ak_old == y_val) / y_val.shape[0]
print(f"Akida validation set accuracy on 33 old classes: \
        {100 * acc_val_old_ak:.2f} %")

# For non-regression purpose
assert acc_val_old_ak > 0.88

Out:

Akida validation set accuracy on 3 new keywords:         19/19
Akida validation set accuracy on 33 old classes:         89.08 %

Total running time of the script: ( 3 minutes 2.464 seconds)
