YOLO/PASCAL-VOC detection tutorial

This tutorial demonstrates that Akida can perform object detection, using a subset of the PASCAL-VOC 2007 dataset, which contains 20 classes. The YOLOv2 architecture from Redmon et al. (2016) was chosen to tackle this object detection problem.

1. Introduction

1.1 Object detection

Object detection is a computer vision task that combines two elementary tasks:

  • object classification, which consists of assigning a class label to an image, as shown in the AkidaNet/ImageNet inference example

  • object localization, which consists of drawing a bounding box around one or several objects in an image

One can learn more about the subject by reading this introduction to object detection blog article.

1.2 YOLO key concepts

You Only Look Once (YOLO) is a deep neural network architecture dedicated to object detection.

As opposed to classic networks that handle object detection, YOLO predicts bounding boxes (localization task) and class probabilities (classification task) from a single neural network in a single evaluation. The object detection task is reduced to a regression problem to spatially separated boxes and associated class probabilities.

YOLO's base concept is to divide an input image into regions, forming a grid, and to predict bounding boxes and probabilities for each region. The bounding boxes are weighted by the prediction probabilities.
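To make the grid idea concrete, here is a minimal sketch of how an object can be assigned to a grid cell from the position of its box center. The function name `responsible_cell` and the pixel-coordinate box format are assumptions made for this illustration; they are not part of the akida_models API.

```python
def responsible_cell(box, image_size, grid_size):
    """Return the (row, col) of the grid cell that contains the box center.

    box is (x_min, y_min, x_max, y_max) in pixels, image_size is
    (width, height) and grid_size is (grid_width, grid_height).
    """
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    grid_w, grid_h = grid_size
    # Box center, normalized to the [0, 1] range
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    # Scale to grid units and truncate; clamp so that a center lying
    # exactly on the right or bottom border stays inside the grid
    col = min(int(cx * grid_w), grid_w - 1)
    row = min(int(cy * grid_h), grid_h - 1)
    return row, col


# A 128x128 box in a 224x224 image divided into a 7x7 grid
cell = responsible_cell((48, 80, 176, 208), (224, 224), (7, 7))
print(cell)
```

With a 7x7 grid over a 224x224 input, each cell covers a 32x32 pixel region of the image.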

YOLO also uses the concept of “anchor boxes” or “prior boxes”. The network does not predict the bounding boxes directly but offsets from anchor boxes, which are templates (width/height ratio) computed by clustering the dimensions of the ground truth boxes from the training dataset. The anchors thus represent the average shape and size of the objects to detect. More details on the anchor box concept are given in this blog article.
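The sketch below illustrates how raw network offsets can be turned into a box using an anchor. It follows the parameterization described in the YOLOv2 paper, where the offsets locate the box center inside its grid cell and scale the anchor dimensions. The function `decode_box` is hypothetical and the anchor values are made up; the decode_output function in akida_models may differ in its details.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(raw, cell, anchor, grid_size):
    """Decode raw offsets (tx, ty, tw, th) into a normalized box.

    cell is (row, col), anchor is (width, height) in grid units and
    grid_size is (grid_width, grid_height). The result is
    (center_x, center_y, width, height), each relative to the image size.
    """
    tx, ty, tw, th = raw
    row, col = cell
    anchor_w, anchor_h = anchor
    grid_w, grid_h = grid_size
    # The sigmoid keeps the predicted center inside its grid cell
    center_x = (col + sigmoid(tx)) / grid_w
    center_y = (row + sigmoid(ty)) / grid_h
    # The exponential rescales the anchor dimensions
    width = anchor_w * math.exp(tw) / grid_w
    height = anchor_h * math.exp(th) / grid_h
    return center_x, center_y, width, height


# Zero offsets return the anchor itself, centered in the cell
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 3),
                 anchor=(1.0, 2.0), grid_size=(7, 7))
print(box)
```

Because the network only has to learn small corrections around a plausible shape, training is generally more stable than regressing box dimensions from scratch.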

Additional information about YOLO can be found on the Darknet website. The source code for the preprocessing and postprocessing functions included in the akida_models package (see the processing section in the model zoo) is largely inspired by the experiencor GitHub repository.

2. Preprocessing tools

A subset of VOC has been prepared from VOC2007 test images, with 5 examples of each class. The dataset is stored as a TFRecord file containing images, labels, and bounding boxes.

The load_tf_dataset function is a helper function that facilitates the loading and parsing of the TFRecord file.

The YOLO toolkit offers several methods to prepare data for processing, such as load_image and preprocess_image.

import tensorflow as tf

from akida_models import fetch_file

# Download the TFRecord test set from the Brainchip data server
data_path = fetch_file(
    fname="voc_test_20_classes.tfrecord",
    origin="https://data.brainchip.com/dataset-mirror/voc/test_20_classes.tfrecord",
    cache_subdir='datasets/voc',
    extract=True)


# Helper function to load and parse the TFRecord file.
def load_tf_dataset(tf_record_file_path):
    tfrecord_files = [tf_record_file_path]

    # Feature description for parsing the TFRecord
    feature_description = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'objects/bbox': tf.io.VarLenFeature(tf.float32),
        'objects/label': tf.io.VarLenFeature(tf.int64),
    }

    def _count_tfrecord_examples(dataset):
        return len(list(dataset.as_numpy_iterator()))

    def _parse_tfrecord_fn(example_proto):
        example = tf.io.parse_single_example(example_proto, feature_description)

        # Decode the image from bytes
        example['image'] = tf.io.decode_jpeg(example['image'], channels=3)

        # Convert the VarLenFeature to a dense tensor
        example['objects/label'] = tf.sparse.to_dense(example['objects/label'], default_value=0)

        example['objects/bbox'] = tf.sparse.to_dense(example['objects/bbox'])
        # Boxes were flattened, so they must be reshaped to (num_objects, 4)
        example['objects/bbox'] = tf.reshape(example['objects/bbox'],
                                             (tf.shape(example['objects/label'])[0], 4))
        # Create a new dictionary structure
        objects = {
            'label': example['objects/label'],
            'bbox': example['objects/bbox'],
        }

        # Remove unnecessary keys
        example.pop('objects/label')
        example.pop('objects/bbox')

        # Add 'objects' key to the main dictionary
        example['objects'] = objects

        return example

    # Create a TFRecordDataset
    dataset = tf.data.TFRecordDataset(tfrecord_files)
    len_dataset = _count_tfrecord_examples(dataset)
    parsed_dataset = dataset.map(_parse_tfrecord_fn)

    return parsed_dataset, len_dataset


labels = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
          'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
          'motorbike', 'person', 'pottedplant', 'sheep', 'sofa',
          'train', 'tvmonitor']

val_dataset, len_val_dataset = load_tf_dataset(data_path)
print(f"Loaded VOC2007 sample test data: {len_val_dataset} images.")
Downloading data from https://data.brainchip.com/dataset-mirror/voc/test_20_classes.tfrecord.

8399422/8399422 [==============================] - 0s 0us/step
Download complete.
Loaded VOC2007 sample test data: 100 images.

Anchors can also be computed easily using the YOLO toolkit.

Note

The following code is given as an example. In a real use case scenario, anchors are computed on the training dataset.

from akida_models.detection.generate_anchors import generate_anchors

num_anchors = 5
grid_size = (7, 7)
anchors_example = generate_anchors(val_dataset, num_anchors, grid_size)
Average IOU for 5 anchors: 0.70
Anchors:  [[1.12454, 1.84751], [1.93628, 2.82636], [3.18201, 3.61125], [4.55423, 5.15952], [5.48737, 5.82352]]
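As an illustration of the reported "Average IOU" figure, the sketch below scores a set of anchors against box dimensions. It assumes the metric is the mean, over all ground truth boxes, of the best IoU between the box and any anchor when both share the same corner, which is the usual criterion in YOLOv2-style anchor clustering. The helper names and the sample values are made up for this example, and generate_anchors may compute the score differently.

```python
def wh_iou(box_wh, anchor_wh):
    """IoU of two boxes aligned on the same corner, so only sizes matter."""
    box_w, box_h = box_wh
    anchor_w, anchor_h = anchor_wh
    intersection = min(box_w, anchor_w) * min(box_h, anchor_h)
    union = box_w * box_h + anchor_w * anchor_h - intersection
    return intersection / union


def average_best_iou(boxes_wh, anchors_wh):
    """Mean over boxes of the IoU with their best matching anchor."""
    best = [max(wh_iou(box, anchor) for anchor in anchors_wh)
            for box in boxes_wh]
    return sum(best) / len(best)


# Illustrative (width, height) pairs expressed in grid units
sample_boxes = [(1.0, 2.0), (2.0, 2.0), (4.0, 5.0)]
sample_anchors = [(1.0, 2.0), (4.0, 4.0)]
avg_iou = average_best_iou(sample_boxes, sample_anchors)
print(f"Average IOU for {len(sample_anchors)} anchors: {avg_iou:.2f}")
```

A clustering procedure would then move the anchors to increase this average, which is why more anchors generally yield a higher score at the cost of a larger output layer.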

3. Model architecture

The model zoo contains a YOLO model built upon the AkidaNet architecture, with 3 separable convolutional layers on top for bounding box and class estimation, followed by a final separable convolutional layer, which is the detection layer. Note that for efficiency, the alpha parameter in AkidaNet (network width, i.e. the number of filters in each layer) is set to 0.5.

from akida_models import yolo_base

# Create a yolo model for 20 classes with 5 anchors and grid size of 7
classes = len(labels)

model = yolo_base(input_shape=(224, 224, 3),
                  classes=classes,
                  nb_box=num_anchors,
                  alpha=0.5)
model.summary()
Model: "yolo_base"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input (InputLayer)          [(None, 224, 224, 3)]     0

 rescaling (Rescaling)       (None, 224, 224, 3)       0

 conv_0 (Conv2D)             (None, 112, 112, 16)      432

 conv_0/BN (BatchNormalizat  (None, 112, 112, 16)      64
 ion)

 conv_0/relu (ReLU)          (None, 112, 112, 16)      0

 conv_1 (Conv2D)             (None, 112, 112, 32)      4608

 conv_1/BN (BatchNormalizat  (None, 112, 112, 32)      128
 ion)

 conv_1/relu (ReLU)          (None, 112, 112, 32)      0

 conv_2 (Conv2D)             (None, 56, 56, 64)        18432

 conv_2/BN (BatchNormalizat  (None, 56, 56, 64)        256
 ion)

 conv_2/relu (ReLU)          (None, 56, 56, 64)        0

 conv_3 (Conv2D)             (None, 56, 56, 64)        36864

 conv_3/BN (BatchNormalizat  (None, 56, 56, 64)        256
 ion)

 conv_3/relu (ReLU)          (None, 56, 56, 64)        0

 dw_separable_4 (DepthwiseC  (None, 28, 28, 64)        576
 onv2D)

 pw_separable_4 (Conv2D)     (None, 28, 28, 128)       8192

 pw_separable_4/BN (BatchNo  (None, 28, 28, 128)       512
 rmalization)

 pw_separable_4/relu (ReLU)  (None, 28, 28, 128)       0

 dw_separable_5 (DepthwiseC  (None, 28, 28, 128)       1152
 onv2D)

 pw_separable_5 (Conv2D)     (None, 28, 28, 128)       16384

 pw_separable_5/BN (BatchNo  (None, 28, 28, 128)       512
 rmalization)

 pw_separable_5/relu (ReLU)  (None, 28, 28, 128)       0

 dw_separable_6 (DepthwiseC  (None, 14, 14, 128)       1152
 onv2D)

 pw_separable_6 (Conv2D)     (None, 14, 14, 256)       32768

 pw_separable_6/BN (BatchNo  (None, 14, 14, 256)       1024
 rmalization)

 pw_separable_6/relu (ReLU)  (None, 14, 14, 256)       0

 dw_separable_7 (DepthwiseC  (None, 14, 14, 256)       2304
 onv2D)

 pw_separable_7 (Conv2D)     (None, 14, 14, 256)       65536

 pw_separable_7/BN (BatchNo  (None, 14, 14, 256)       1024
 rmalization)

 pw_separable_7/relu (ReLU)  (None, 14, 14, 256)       0

 dw_separable_8 (DepthwiseC  (None, 14, 14, 256)       2304
 onv2D)

 pw_separable_8 (Conv2D)     (None, 14, 14, 256)       65536

 pw_separable_8/BN (BatchNo  (None, 14, 14, 256)       1024
 rmalization)

 pw_separable_8/relu (ReLU)  (None, 14, 14, 256)       0

 dw_separable_9 (DepthwiseC  (None, 14, 14, 256)       2304
 onv2D)

 pw_separable_9 (Conv2D)     (None, 14, 14, 256)       65536

 pw_separable_9/BN (BatchNo  (None, 14, 14, 256)       1024
 rmalization)

 pw_separable_9/relu (ReLU)  (None, 14, 14, 256)       0

 dw_separable_10 (Depthwise  (None, 14, 14, 256)       2304
 Conv2D)

 pw_separable_10 (Conv2D)    (None, 14, 14, 256)       65536

 pw_separable_10/BN (BatchN  (None, 14, 14, 256)       1024
 ormalization)

 pw_separable_10/relu (ReLU  (None, 14, 14, 256)       0
 )

 dw_separable_11 (Depthwise  (None, 14, 14, 256)       2304
 Conv2D)

 pw_separable_11 (Conv2D)    (None, 14, 14, 256)       65536

 pw_separable_11/BN (BatchN  (None, 14, 14, 256)       1024
 ormalization)

 pw_separable_11/relu (ReLU  (None, 14, 14, 256)       0
 )

 dw_separable_12 (Depthwise  (None, 7, 7, 256)         2304
 Conv2D)

 pw_separable_12 (Conv2D)    (None, 7, 7, 512)         131072

 pw_separable_12/BN (BatchN  (None, 7, 7, 512)         2048
 ormalization)

 pw_separable_12/relu (ReLU  (None, 7, 7, 512)         0
 )

 dw_separable_13 (Depthwise  (None, 7, 7, 512)         4608
 Conv2D)

 pw_separable_13 (Conv2D)    (None, 7, 7, 512)         262144

 pw_separable_13/BN (BatchN  (None, 7, 7, 512)         2048
 ormalization)

 pw_separable_13/relu (ReLU  (None, 7, 7, 512)         0
 )

 dw_1conv (DepthwiseConv2D)  (None, 7, 7, 512)         4608

 pw_1conv (Conv2D)           (None, 7, 7, 1024)        524288

 pw_1conv/BN (BatchNormaliz  (None, 7, 7, 1024)        4096
 ation)

 pw_1conv/relu (ReLU)        (None, 7, 7, 1024)        0

 dw_2conv (DepthwiseConv2D)  (None, 7, 7, 1024)        9216

 pw_2conv (Conv2D)           (None, 7, 7, 1024)        1048576

 pw_2conv/BN (BatchNormaliz  (None, 7, 7, 1024)        4096
 ation)

 pw_2conv/relu (ReLU)        (None, 7, 7, 1024)        0

 dw_3conv (DepthwiseConv2D)  (None, 7, 7, 1024)        9216

 pw_3conv (Conv2D)           (None, 7, 7, 1024)        1048576

 pw_3conv/BN (BatchNormaliz  (None, 7, 7, 1024)        4096
 ation)

 pw_3conv/relu (ReLU)        (None, 7, 7, 1024)        0

 dw_detection_layer (Depthw  (None, 7, 7, 1024)        9216
 iseConv2D)

 pw_detection_layer (Conv2D  (None, 7, 7, 125)         128125
 )

=================================================================
Total params: 3665965 (13.98 MB)
Trainable params: 3653837 (13.94 MB)
Non-trainable params: 12128 (47.38 KB)
_________________________________________________________________
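The parameter counts in the summary above can be checked by hand. The following sketch reproduces a few of them, assuming 3x3 depthwise kernels without bias, pointwise (1x1) convolutions without bias when followed by batch normalization, and a bias on the final detection layer. The helper names are hypothetical and only serve this arithmetic.

```python
def depthwise_params(channels, kernel=3):
    # One kernel x kernel filter per input channel, no bias
    return kernel * kernel * channels


def pointwise_params(in_channels, out_channels, use_bias=False):
    # 1x1 convolution that mixes channels, with an optional bias per output
    bias = out_channels if use_bias else 0
    return in_channels * out_channels + bias


num_anchors, num_classes = 5, 20
# Each anchor predicts 4 coordinates, 1 confidence score and the class scores
out_channels = num_anchors * (4 + 1 + num_classes)

print(out_channels)                                        # pw_detection_layer channels
print(depthwise_params(1024))                              # dw_detection_layer
print(pointwise_params(1024, out_channels, use_bias=True)) # pw_detection_layer
print(pointwise_params(64, 128))                           # pw_separable_4
```

The comparison with a full 3x3 convolution shows why separable layers are used: mapping 1024 channels to 1024 channels costs 9216 + 1048576 parameters in separable form, against 9 x 1024 x 1024 for a standard convolution.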

The model output can be reshaped to a more natural shape of:

(grid_height, grid_width, anchors_box, 4 + 1 + num_classes)

where the “4 + 1” term represents the coordinates of the estimated bounding boxes (top left x, top left y, width and height) and a confidence score. In other words, the output channels are actually grouped by anchor boxes, and in each group one channel provides either a coordinate, a global confidence score or a class confidence score. This process is done automatically in the decode_output function.

from tensorflow.keras import Model
from tensorflow.keras.layers import Reshape

# Define a reshape output to be added to the YOLO model
output = Reshape((grid_size[1], grid_size[0], num_anchors, 4 + 1 + classes),
                 name="YOLO_output")(model.output)

# Build the complete model
full_model = Model(model.input, output)
full_model.output
<KerasTensor: shape=(None, 7, 7, 5, 25) dtype=float32 (created by layer 'YOLO_output')>
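To see how the 125 flat channels relate to the reshaped output, the sketch below maps a channel index to its anchor and to the value it carries. It assumes the row-major layout produced by a Keras Reshape and the field order (x, y, width, height, confidence, then class scores) suggested by the "4 + 1 + num_classes" description. The function `describe_channel` is a made-up helper for this illustration.

```python
BOX_FIELDS = ["x", "y", "w", "h", "confidence"]


def describe_channel(channel, num_anchors=5, num_classes=20):
    """Return (anchor_index, field_name) for a flat output channel index."""
    per_anchor = len(BOX_FIELDS) + num_classes
    if not 0 <= channel < num_anchors * per_anchor:
        raise ValueError(f"channel {channel} is out of range")
    # Channels are grouped by anchor: the quotient selects the anchor,
    # the remainder selects the value inside that anchor's group
    anchor, offset = divmod(channel, per_anchor)
    if offset < len(BOX_FIELDS):
        return anchor, BOX_FIELDS[offset]
    return anchor, f"class_{offset - len(BOX_FIELDS)}"


for index in (0, 4, 5, 25, 124):
    print(index, describe_channel(index))
```

In other words, element `[row, col, a, k]` of the reshaped output is channel `a * 25 + k` of the original `[row, col]` vector.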

4. Training

As the YOLO model relies on the Brainchip AkidaNet/ImageNet network, it is possible to perform transfer learning from ImageNet pretrained weights when training a YOLO model. See the PlantVillage transfer learning example for a detailed explanation of transfer learning principles. Additionally, to achieve optimal results, consider the following approach:

1. Initially, train the model on the COCO dataset. This process helps in learning general object detection features and improves the model’s ability to detect various objects across different contexts.

2. After training on COCO, transfer the learned weights to a model equipped with a VOC head.

3. Fine-tune the transferred weights on the VOC dataset. This step allows the model to adapt to the specific characteristics and nuances of the VOC dataset, further enhancing its performance on VOC-related tasks.

5. Performance

The model zoo also contains a helper method that creates a YOLO model for VOC and loads pretrained weights for the detection task along with the corresponding anchors. The anchors are used to interpret the model outputs.

The metric used to evaluate YOLO is the mean average precision (mAP), which averages the per-class average precision and is reported for a given intersection over union (IoU) threshold. Scores in this example are given for the standard IoU thresholds of 0.5 (mAP 50) and 0.75 (mAP 75), and as the mean across IoU thresholds ranging from 0.5 to 0.95. A detection is considered valid if its IoU with the matching ground truth box is above 0.5 for mAP 50, or above 0.75 for mAP 75.
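As a reminder of what the IoU threshold means in practice, here is a minimal sketch that computes the IoU of two boxes and checks it against the two thresholds mentioned above. Boxes are assumed to be given as (x_min, y_min, x_max, y_max); the function `iou` is written for this illustration and is not the implementation used by MapEvaluation.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, zero when the boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
detection = (0, 0, 10, 6)  # covers 60 of the 100 ground truth units

score = iou(detection, ground_truth)
valid_50 = score > 0.5    # counts as a correct detection for mAP 50
valid_75 = score > 0.75   # but not for the stricter mAP 75
print(f"IoU = {score:.2f}, valid at 0.5: {valid_50}, valid at 0.75: {valid_75}")
```

The same detection can therefore be a true positive at one threshold and a false positive at another, which is why mAP 75 is typically lower than mAP 50.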

Note

A call to evaluate_map will preprocess the images, make the call to Model.predict and use decode_output before computing precision for all classes.

from timeit import default_timer as timer
from akida_models import yolo_voc_pretrained
from akida_models.detection.map_evaluation import MapEvaluation

# Load the pretrained model along with anchors
model_keras, anchors = yolo_voc_pretrained()
model_keras.summary()
Downloading data from https://data.brainchip.com/dataset-mirror/coco/coco_anchors.pkl.

126/126 [==============================] - 0s 6us/step
Download complete.
Downloading data from https://data.brainchip.com/models/AkidaV2/yolo/yolo_akidanet_voc_i8_w8_a8.h5.

14926320/14926320 [==============================] - 0s 0us/step
Download complete.
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input (InputLayer)          [(None, 224, 224, 3)]     0

 rescaling (QuantizedRescal  (None, 224, 224, 3)       0
 ing)

 conv_0 (QuantizedConv2D)    (None, 112, 112, 16)      448

 conv_0/relu (QuantizedReLU  (None, 112, 112, 16)      32
 )

 conv_1 (QuantizedConv2D)    (None, 112, 112, 32)      4640

 conv_1/relu (QuantizedReLU  (None, 112, 112, 32)      64
 )

 conv_2 (QuantizedConv2D)    (None, 56, 56, 64)        18496

 conv_2/relu (QuantizedReLU  (None, 56, 56, 64)        128
 )

 conv_3 (QuantizedConv2D)    (None, 56, 56, 64)        36928

 conv_3/relu (QuantizedReLU  (None, 56, 56, 64)        128
 )

 dw_separable_4 (QuantizedD  (None, 28, 28, 64)        704
 epthwiseConv2D)

 pw_separable_4 (QuantizedC  (None, 28, 28, 128)       8320
 onv2D)

 pw_separable_4/relu (Quant  (None, 28, 28, 128)       256
 izedReLU)

 dw_separable_5 (QuantizedD  (None, 28, 28, 128)       1408
 epthwiseConv2D)

 pw_separable_5 (QuantizedC  (None, 28, 28, 128)       16512
 onv2D)

 pw_separable_5/relu (Quant  (None, 28, 28, 128)       256
 izedReLU)

 dw_separable_6 (QuantizedD  (None, 14, 14, 128)       1408
 epthwiseConv2D)

 pw_separable_6 (QuantizedC  (None, 14, 14, 256)       33024
 onv2D)

 pw_separable_6/relu (Quant  (None, 14, 14, 256)       512
 izedReLU)

 dw_separable_7 (QuantizedD  (None, 14, 14, 256)       2816
 epthwiseConv2D)

 pw_separable_7 (QuantizedC  (None, 14, 14, 256)       65792
 onv2D)

 pw_separable_7/relu (Quant  (None, 14, 14, 256)       512
 izedReLU)

 dw_separable_8 (QuantizedD  (None, 14, 14, 256)       2816
 epthwiseConv2D)

 pw_separable_8 (QuantizedC  (None, 14, 14, 256)       65792
 onv2D)

 pw_separable_8/relu (Quant  (None, 14, 14, 256)       512
 izedReLU)

 dw_separable_9 (QuantizedD  (None, 14, 14, 256)       2816
 epthwiseConv2D)

 pw_separable_9 (QuantizedC  (None, 14, 14, 256)       65792
 onv2D)

 pw_separable_9/relu (Quant  (None, 14, 14, 256)       512
 izedReLU)

 dw_separable_10 (Quantized  (None, 14, 14, 256)       2816
 DepthwiseConv2D)

 pw_separable_10 (Quantized  (None, 14, 14, 256)       65792
 Conv2D)

 pw_separable_10/relu (Quan  (None, 14, 14, 256)       512
 tizedReLU)

 dw_separable_11 (Quantized  (None, 14, 14, 256)       2816
 DepthwiseConv2D)

 pw_separable_11 (Quantized  (None, 14, 14, 256)       65792
 Conv2D)

 pw_separable_11/relu (Quan  (None, 14, 14, 256)       512
 tizedReLU)

 dw_separable_12 (Quantized  (None, 7, 7, 256)         2816
 DepthwiseConv2D)

 pw_separable_12 (Quantized  (None, 7, 7, 512)         131584
 Conv2D)

 pw_separable_12/relu (Quan  (None, 7, 7, 512)         1024
 tizedReLU)

 dw_separable_13 (Quantized  (None, 7, 7, 512)         5632
 DepthwiseConv2D)

 pw_separable_13 (Quantized  (None, 7, 7, 512)         262656
 Conv2D)

 pw_separable_13/relu (Quan  (None, 7, 7, 512)         1024
 tizedReLU)

 dw_1conv (QuantizedDepthwi  (None, 7, 7, 512)         5632
 seConv2D)

 pw_1conv (QuantizedConv2D)  (None, 7, 7, 1024)        525312

 pw_1conv/relu (QuantizedRe  (None, 7, 7, 1024)        2048
 LU)

 dw_2conv (QuantizedDepthwi  (None, 7, 7, 1024)        11264
 seConv2D)

 pw_2conv (QuantizedConv2D)  (None, 7, 7, 1024)        1049600

 pw_2conv/relu (QuantizedRe  (None, 7, 7, 1024)        2048
 LU)

 dw_3conv (QuantizedDepthwi  (None, 7, 7, 1024)        11264
 seConv2D)

 pw_3conv (QuantizedConv2D)  (None, 7, 7, 1024)        1049600

 pw_3conv/relu (QuantizedRe  (None, 7, 7, 1024)        2048
 LU)

 dw_detection_layer (Quanti  (None, 7, 7, 1024)        11264
 zedDepthwiseConv2D)

 voc_classifier (QuantizedC  (None, 7, 7, 125)         128125
 onv2D)

 dequantizer (Dequantizer)   (None, 7, 7, 125)         0

=================================================================
Total params: 3671805 (14.01 MB)
Trainable params: 3647773 (13.92 MB)
Non-trainable params: 24032 (93.88 KB)
_________________________________________________________________
# Define the final reshape and build the model
output = Reshape((grid_size[1], grid_size[0], num_anchors, 4 + 1 + classes),
                 name="YOLO_output")(model_keras.output)
model_keras = Model(model_keras.input, output)

# Create the mAP evaluator object
map_evaluator = MapEvaluation(model_keras, val_dataset,
                              len_val_dataset, labels, anchors)

# Compute the scores for all validation images
start = timer()

map_dict, average_precisions = map_evaluator.evaluate_map()
mAP = sum(map_dict.values()) / len(map_dict)
end = timer()

for label, average_precision in average_precisions.items():
    print(labels[label], '{:.4f}'.format(average_precision))
print('mAP 50: {:.4f}'.format(map_dict[0.5]))
print('mAP 75: {:.4f}'.format(map_dict[0.75]))
print('mAP: {:.4f}'.format(mAP))
print(f'Keras inference on {len_val_dataset} images took {end-start:.2f} s.\n')

aeroplane 0.7733
bicycle 0.5278
bird 0.5208
boat 0.3100
bottle 0.3783
bus 0.8013
car 0.8444
cat 0.7760
chair 0.3014
cow 0.4717
diningtable 0.4639
dog 0.4384
horse 0.5596
motorbike 0.5764
person 0.4690
pottedplant 0.0893
sheep 0.4708
sofa 0.5850
train 0.6136
tvmonitor 0.5860
mAP 50: 0.8783
mAP 75: 0.5557
mAP: 0.5279
Keras inference on 100 images took 20.56 s.

6. Conversion to Akida

6.1 Convert to Akida model

The last YOLO_output layer that was added for splitting channels into values for each box must be removed before Akida conversion.

# Rebuild a model without the last layer
compatible_model = Model(model_keras.input, model_keras.layers[-2].output)

When converting to an Akida model, we just need to pass the Keras model to cnn2snn.convert.

from cnn2snn import convert

model_akida = convert(compatible_model)
model_akida.summary()
                 Model Summary
________________________________________________
Input shape    Output shape  Sequences  Layers
================================================
[224, 224, 3]  [7, 7, 125]   1          33
________________________________________________

__________________________________________________________________________
Layer (type)                          Output shape    Kernel shape

==================== SW/conv_0-dequantizer (Software) ====================

conv_0 (InputConv2D)                  [112, 112, 16]  (3, 3, 3, 16)
__________________________________________________________________________
conv_1 (Conv2D)                       [112, 112, 32]  (3, 3, 16, 32)
__________________________________________________________________________
conv_2 (Conv2D)                       [56, 56, 64]    (3, 3, 32, 64)
__________________________________________________________________________
conv_3 (Conv2D)                       [56, 56, 64]    (3, 3, 64, 64)
__________________________________________________________________________
dw_separable_4 (DepthwiseConv2D)      [28, 28, 64]    (3, 3, 64, 1)
__________________________________________________________________________
pw_separable_4 (Conv2D)               [28, 28, 128]   (1, 1, 64, 128)
__________________________________________________________________________
dw_separable_5 (DepthwiseConv2D)      [28, 28, 128]   (3, 3, 128, 1)
__________________________________________________________________________
pw_separable_5 (Conv2D)               [28, 28, 128]   (1, 1, 128, 128)
__________________________________________________________________________
dw_separable_6 (DepthwiseConv2D)      [14, 14, 128]   (3, 3, 128, 1)
__________________________________________________________________________
pw_separable_6 (Conv2D)               [14, 14, 256]   (1, 1, 128, 256)
__________________________________________________________________________
dw_separable_7 (DepthwiseConv2D)      [14, 14, 256]   (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_7 (Conv2D)               [14, 14, 256]   (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_8 (DepthwiseConv2D)      [14, 14, 256]   (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_8 (Conv2D)               [14, 14, 256]   (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_9 (DepthwiseConv2D)      [14, 14, 256]   (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_9 (Conv2D)               [14, 14, 256]   (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_10 (DepthwiseConv2D)     [14, 14, 256]   (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_10 (Conv2D)              [14, 14, 256]   (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_11 (DepthwiseConv2D)     [14, 14, 256]   (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_11 (Conv2D)              [14, 14, 256]   (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_12 (DepthwiseConv2D)     [7, 7, 256]     (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_12 (Conv2D)              [7, 7, 512]     (1, 1, 256, 512)
__________________________________________________________________________
dw_separable_13 (DepthwiseConv2D)     [7, 7, 512]     (3, 3, 512, 1)
__________________________________________________________________________
pw_separable_13 (Conv2D)              [7, 7, 512]     (1, 1, 512, 512)
__________________________________________________________________________
dw_1conv (DepthwiseConv2D)            [7, 7, 512]     (3, 3, 512, 1)
__________________________________________________________________________
pw_1conv (Conv2D)                     [7, 7, 1024]    (1, 1, 512, 1024)
__________________________________________________________________________
dw_2conv (DepthwiseConv2D)            [7, 7, 1024]    (3, 3, 1024, 1)
__________________________________________________________________________
pw_2conv (Conv2D)                     [7, 7, 1024]    (1, 1, 1024, 1024)
__________________________________________________________________________
dw_3conv (DepthwiseConv2D)            [7, 7, 1024]    (3, 3, 1024, 1)
__________________________________________________________________________
pw_3conv (Conv2D)                     [7, 7, 1024]    (1, 1, 1024, 1024)
__________________________________________________________________________
dw_detection_layer (DepthwiseConv2D)  [7, 7, 1024]    (3, 3, 1024, 1)
__________________________________________________________________________
voc_classifier (Conv2D)               [7, 7, 125]     (1, 1, 1024, 125)
__________________________________________________________________________
dequantizer (Dequantizer)             [7, 7, 125]     N/A
__________________________________________________________________________

6.2 Check performance

Akida model accuracy is tested on the same validation images as the Keras model.

# Create the mAP evaluator object
map_evaluator_ak = MapEvaluation(model_akida,
                                 val_dataset,
                                 len_val_dataset,
                                 labels,
                                 anchors,
                                 is_keras_model=False)

# Compute the scores for all validation images
start = timer()
map_ak_dict, average_precisions_ak = map_evaluator_ak.evaluate_map()
mAP_ak = sum(map_ak_dict.values()) / len(map_ak_dict)
end = timer()

for label, average_precision in average_precisions_ak.items():
    print(labels[label], '{:.4f}'.format(average_precision))
print('mAP 50: {:.4f}'.format(map_ak_dict[0.5]))
print('mAP 75: {:.4f}'.format(map_ak_dict[0.75]))
print('mAP: {:.4f}'.format(mAP_ak))
print(f'Akida inference on {len_val_dataset} images took {end-start:.2f} s.\n')

aeroplane 0.7733
bicycle 0.5278
bird 0.5208
boat 0.3100
bottle 0.3783
bus 0.8013
car 0.8444
cat 0.7760
chair 0.3014
cow 0.4717
diningtable 0.4639
dog 0.4384
horse 0.5596
motorbike 0.5764
person 0.4690
pottedplant 0.0893
sheep 0.4708
sofa 0.5850
train 0.6136
tvmonitor 0.5860
mAP 50: 0.8783
mAP 75: 0.5557
mAP: 0.5279
Akida inference on 100 images took 13.40 s.
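
As a rough illustration of what the scores above measure, the sketch below computes the intersection over union (IoU) between two boxes and averages a set of per-threshold mAP values, mirroring the `sum(map_ak_dict.values()) / len(map_ak_dict)` line. It is not the `MapEvaluation` implementation: the `iou` helper, the threshold range (assumed to run from 0.50 to 0.95 in steps of 0.05) and the `example_map` values are hypothetical and chosen only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle (empty if the boxes are disjoint)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed IoU thresholds: 0.50, 0.55, ..., 0.95
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical per-threshold mAP values, for illustration only
example_map = {th: 1.0 - th for th in thresholds}

# Same averaging as in the tutorial: mean of the per-threshold values
mean_map = sum(example_map.values()) / len(example_map)

# A detection counts as a match at threshold th when its IoU with a
# ground-truth box of the same class is at least th
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
matches = {th: overlap >= th for th in thresholds}
```

With these two example boxes the overlap is 1/7, so the detection would not count as a match at any of the assumed thresholds. Stricter thresholds require tighter localization, which is consistent with mAP 75 being lower than mAP 50 in the scores above.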

6.2 Show predictions for a random image

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

from akida_models.detection.processing import preprocess_image, decode_output

# Shuffle the data to take a random test image
val_dataset = val_dataset.shuffle(buffer_size=len_val_dataset)

input_shape = model_akida.layers[0].input_dims

# Load the image
raw_image = next(iter(val_dataset))['image']

# Keep the original image size to rescale the bounding boxes later
raw_height, raw_width, _ = raw_image.shape

# Pre-process the image
image = preprocess_image(raw_image, input_shape)
input_image = image[np.newaxis, :].astype(np.uint8)

# Run inference on the image
pots = model_akida.predict(input_image)[0]

# Reshape the potentials to prepare for decoding
h, w, c = pots.shape
pots = pots.reshape((h, w, len(anchors), 4 + 1 + len(labels)))

# Decode potentials into bounding boxes
raw_boxes = decode_output(pots, anchors, len(labels))

# Rescale boxes to the original image size
pred_boxes = np.array([[
    box.x1 * raw_width, box.y1 * raw_height, box.x2 * raw_width,
    box.y2 * raw_height,
    box.get_label(),
    box.get_score()
] for box in raw_boxes])

fig = plt.figure(num='VOC detection by Akida')
ax = fig.subplots(1)
img_plot = ax.imshow(np.zeros(raw_image.shape, dtype=np.uint8))
img_plot.set_data(raw_image)

for box in pred_boxes:
    rect = patches.Rectangle((box[0], box[1]),
                             box[2] - box[0],
                             box[3] - box[1],
                             linewidth=1,
                             edgecolor='r',
                             facecolor='none')
    ax.add_patch(rect)
    class_score = ax.text(box[0],
                          box[1] - 5,
                          f"{labels[int(box[4])]} - {box[5]:.2f}",
                          color='red')

plt.axis('off')
plt.show()
[Figure: VOC detection by Akida, showing the image with predicted bounding boxes, class labels and scores]
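
To make the reshape and decoding steps above more concrete, here is a minimal sketch of how a [7, 7, 125] output splits into 5 anchors of 4 + 1 + 20 values each, and how one box could be decoded with the standard YOLOv2 formulas. It is not the `decode_output` implementation: the random input, the anchor size, the chosen cell, the 500x375 image size and the assumed channel order (box terms, objectness, class scores) are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


grid_h, grid_w, num_classes = 7, 7, 20
channels = 125
num_anchors = channels // (4 + 1 + num_classes)  # 125 / 25 = 5

# Random stand-in for the model output (real values come from the model)
rng = np.random.default_rng(0)
pots = rng.standard_normal((grid_h, grid_w, channels))
pots = pots.reshape((grid_h, grid_w, num_anchors, 4 + 1 + num_classes))

# Assumed channel layout: 4 box terms, 1 objectness score, 20 class scores
box_terms = pots[..., :4]
objectness = pots[..., 4]
class_scores = pots[..., 5:]

# Hypothetical anchor (width, height) expressed in grid-cell units
anchor_w, anchor_h = 1.5, 2.0

# Decode the first anchor of the cell at row 3, column 4
row, col, a = 3, 4, 0
tx, ty, tw, th = box_terms[row, col, a]
cx = (col + sigmoid(tx)) / grid_w    # box centre, normalized to [0, 1]
cy = (row + sigmoid(ty)) / grid_h
bw = anchor_w * np.exp(tw) / grid_w  # box size, normalized
bh = anchor_h * np.exp(th) / grid_h

# Rescale the normalized corners to a hypothetical 500x375 image
raw_width, raw_height = 500, 375
x1, x2 = (cx - bw / 2) * raw_width, (cx + bw / 2) * raw_width
y1, y2 = (cy - bh / 2) * raw_height, (cy + bh / 2) * raw_height
```

The sigmoid keeps each predicted centre inside its own grid cell, while the exponential scales the anchor shape, which is why the network only needs to predict offsets from the anchors rather than absolute box coordinates.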

Total running time of the script: (1 minutes 7.358 seconds)
