Today I Learned (TIL) AI Camp 2021 takeaways
This hackathon featured 3 object detection and 3 audio classification challenges of increasing difficulty.
Over the course of 2 weeks, spending much of my time outside of my internship studying and working on the challenges, my team managed to complete all tasks successfully, achieving 2nd place in the hackathon. Here are some of my takeaways from the tasks and the presentation round.
Object Detection
- We started by implementing the FasterRCNN model with various backbones, together with augmentation strategies such as Mixup and CutMix. The PyTorch Lightning Flash framework made this process straightforward. To use the existing image augmentations in the Albumentations library, we just needed to include the following code to define a wrapper around the Albumentations transforms, and pass the dictionary of transforms into the Flash DataModule:
import copy

import albumentations as A
import numpy as np
import torch
from flash.image import ObjectDetectionData
from PIL import Image
from torchvision import transforms

class A_transform(torch.nn.Module):
    """Wrapper exposing Albumentations transforms to the Flash pipeline."""

    def __init__(self, albumentation_transforms=[A.HorizontalFlip(p=0.5)]):
        super().__init__()
        self.transform = A.Compose(
            [*albumentation_transforms],
            bbox_params=A.BboxParams(format='pascal_voc', min_area=1000,
                                     min_visibility=0.2,
                                     label_fields=['category_ids']),
        )

    def forward(self, samples):
        # Work on a copy so the original sample dict is left untouched
        new_samples = copy.deepcopy(samples)
        transformed = self.transform(image=np.array(new_samples["input"]),
                                     bboxes=new_samples["target"]["boxes"],
                                     category_ids=new_samples["target"]["labels"])
        target = new_samples["target"]
        target["boxes"] = [list(box) for box in transformed["bboxes"]]
        target["labels"] = transformed["category_ids"]
        return {"input": Image.fromarray(transformed["image"]),
                "target": target}

train_transform = {
    "pre_tensor_transform": transforms.Compose(
        [A_transform([A.LongestMaxSize()])]
    )
}
datamodule = ObjectDetectionData.from_coco(..., train_transform=train_transform)
- However, after several days of working with the PyTorch Lightning Flash framework, we decided to switch to the ultralytics/yolov5 codebase instead, due to its training and inference efficiency and its simplicity to run. This made it easier to collaborate on the training script via GitHub, and reduced the code in our training notebook to a single line:
!python train.py --epochs 30 --data data.yaml --weights yolov5m6.pt --freeze 10
- In fact, it was revealed that the 3 other top teams also used YOLO models in their winning approaches, demonstrating the ease of use and effectiveness of fine-tuning pretrained YOLO models.
Training process
Our approach to training our object detection models was to train models of 4 different sizes, feed the models as much data as we could afford, and ensemble all of our models for testing and inference.
Given the similarity between the classes we were required to detect (i.e. animals) and those in Google’s Open Images dataset, and considering that much of the provided dataset was extracted from the Open Images v2 dataset, obtaining more training data from Open Images was an obvious choice. This let us tune each model’s Path Aggregation Network to better learn spatial features relevant to the challenge categories. We tuned our models on 20GB worth of images across 5 classes, with a frozen backbone, over the course of a week.
We only had 24 hours to train 2 models for the last 2 challenges, but the small number of trainable parameters once the model backbone and neck were frozen allowed us to quickly train all 4 models on the released datasets, and they performed well on the test set.
The augmentation strategies included in our training pipeline cover the common geometric and color transforms, but others such as Mixup and Mosaic were seen to have a significant impact on overall model performance.
The next essential component of our implementation is the ensembling of the different trained models: we aggregate all bounding box predictions from all models on various augmented versions of the same image, then perform non-maximum suppression (NMS) to identify the best bounding boxes for each image.
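The merging step can be sketched with torchvision's per-class NMS. This is a minimal sketch, not our exact implementation: it assumes every prediction has already been mapped back to the original image's coordinate space, and the IoU threshold is illustrative.
# Minimal sketch of merging ensemble/TTA predictions with per-class NMS.
# Assumes boxes are (x1, y1, x2, y2) in the original image's coordinates.
import torch
from torchvision.ops import batched_nms

def ensemble_boxes(boxes_list, scores_list, labels_list, iou_threshold=0.5):
    boxes = torch.cat(boxes_list)    # (N, 4) pooled over all models/views
    scores = torch.cat(scores_list)  # (N,)
    labels = torch.cat(labels_list)  # (N,)
    # batched_nms only suppresses overlapping boxes of the same class
    keep = batched_nms(boxes, scores, labels, iou_threshold)
    return boxes[keep], scores[keep], labels[keep]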
Audio Classification
- Similar to the object detection tasks, we looked for efficient models, preferably with pretrained weights, with good performance on existing benchmarks. Eventually, we decided on the MatchboxNet model introduced and implemented by NVIDIA. Having worked with the NLP module of the NVIDIA NeMo toolkit, I was confident that their framework would be easy to work with and performant. A possible alternative worth considering would be SpeechBrain. Loading a pretrained model through NeMo is essentially a one-liner, as sketched below.
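A hedged sketch of that loading step: the pretrained checkpoint name below is an example rather than the exact one we used, since available names vary between NeMo releases.
# Sketch: load a pretrained MatchboxNet speech-command classifier via NeMo.
# The model name is illustrative; list_available_models() shows the
# checkpoints offered by your installed NeMo version.
import nemo.collections.asr as nemo_asr

print(nemo_asr.models.EncDecClassificationModel.list_available_models())
model = nemo_asr.models.EncDecClassificationModel.from_pretrained(
    model_name="commandrecognition_en_matchboxnet3x1x64_v2"
)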
- Like the object detection approach, initialising the model with pretrained weights and training with a low learning rate sped up the training significantly, and allowed the model to achieve a higher test accuracy than one trained from scratch.
- We made use of various augmentation techniques, such as the addition of white noise, gain, reverb, and SpecAugment (sketched below). However, applying some augmentations such as speed perturbation or reverb at test time might improve test performance further, and adding background / foreground noise or Room Impulse Response (RIR) noise would likely benefit model training as well.
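Most of these augmentations can be illustrated with torchaudio. This is a minimal sketch with placeholder parameter values, not our exact pipeline (reverb, for instance, is omitted here):
# Illustrative waveform + SpecAugment augmentations with torchaudio.
# Noise level, gain range, and mask sizes are placeholder values.
import torch
import torchaudio.transforms as T

def augment(waveform, sample_rate=16000):
    waveform = waveform + 0.005 * torch.randn_like(waveform)   # white noise
    waveform = waveform * (0.8 + 0.4 * torch.rand(1))          # random gain
    mel = T.MelSpectrogram(sample_rate=sample_rate, n_mels=64)(waveform)
    mel = T.FrequencyMasking(freq_mask_param=15)(mel)          # SpecAugment
    mel = T.TimeMasking(time_mask_param=25)(mel)
    return mel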
- In addition, ensembling all models using a weighted sum of the class logits helped to improve the accuracy further, as sketched below.
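A minimal sketch of that logit-level ensemble; the weights here are hypothetical placeholders that would be tuned on a validation set.
# Weighted-sum logit ensembling (sketch; weights are hypothetical).
import torch

def ensemble_predict(logits_list, weights):
    stacked = torch.stack(logits_list)        # (num_models, batch, classes)
    w = torch.tensor(weights).view(-1, 1, 1)  # broadcast per-model weight
    return (w * stacked).sum(dim=0).argmax(dim=-1)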
- The use of 1D separable convolutions was highly effective in reducing the number of trainable parameters and greatly increasing the training speed.
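To see where the parameter saving comes from, here is a generic sketch of a depthwise-separable 1D convolution, the building block MatchboxNet relies on; the layer sizes are illustrative.
# Depthwise-separable 1D convolution: a depthwise conv (one filter per
# channel) followed by a pointwise 1x1 conv that mixes channels.
import torch.nn as nn

def separable_conv1d(in_ch, out_ch, kernel_size):
    return nn.Sequential(
        nn.Conv1d(in_ch, in_ch, kernel_size,
                  padding=kernel_size // 2, groups=in_ch),  # depthwise
        nn.Conv1d(in_ch, out_ch, kernel_size=1),            # pointwise
    )

# For 64 -> 64 channels with kernel size 13 this needs
# 64*13 + 64*64 = 4,928 weights, vs 64*64*13 = 53,248 for a plain Conv1d.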
Ways to improve
- Firstly, it is important to keep the entire team on the same page, and to make the most of each member’s time and abilities to maximise the team’s performance in the hackathon. Having all team members work on the same approach together, instead of attempting different approaches in parallel, saved on the training time and effort needed, and was what allowed our team to successfully complete all 6 challenges.
- It is also essential for all team members to have a clear understanding of the entire training pipeline before the presentation round, since any lack of clarity about the presentation topic can easily be flagged by the adjudicators.
- Implementing a genetic algorithm to search for optimal hyperparameters when ensembling at inference time would speed up the testing process, and allow more time to be spent on the training stage. A toy version of the idea is sketched below.
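A toy sketch of what such a search might look like, evolving a vector of ensemble weights against a validation-score fitness function; every name, size, and value here is hypothetical.
# Tiny genetic algorithm for ensemble-weight search (illustrative only).
# fitness_fn maps a weight vector to e.g. validation mAP or accuracy.
import random

def evolve(fitness_fn, dim, pop_size=20, generations=30, sigma=0.1):
    pop = [[random.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=fitness_fn, reverse=True)[:pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, dim) if dim > 1 else 0
            child = a[:cut] + b[cut:]                            # crossover
            child = [w + random.gauss(0, sigma) for w in child]  # mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness_fn)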
- In cases where the set of classes to predict varies slightly across challenges, it is also good to use the same labels for the same class, both to speed up the learning of the classifier head and to make the ensembling process easier.