I re-labelled the dataset with Roboflow, as many annotations were missing. I then split the dataset into two parts (training and validation) using a GroupKFold (k=5) grouped on the video sequences. This prevents leakage between the splits: images from the same video sequence never appear in both the training set and the validation set, which would otherwise make the validation score overly optimistic.
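Here is a minimal sketch of that kind of group-aware split with scikit-learn; the file name and column names (`annotations.csv`, `video_id`, `image_path`) are assumptions for illustration, not the competition's exact schema:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical annotation table: one row per image, with the video it came from.
df = pd.read_csv("annotations.csv")  # columns assumed: image_path, video_id, ...

gkf = GroupKFold(n_splits=5)
# Grouping on video_id guarantees that all frames of a given video sequence
# end up on the same side of the split.
for fold, (train_idx, val_idx) in enumerate(gkf.split(df, groups=df["video_id"])):
    df.loc[val_idx, "fold"] = fold

# Use fold 0 as validation and the rest as training (any fold works the same way).
train_df = df[df["fold"] != 0]
val_df = df[df["fold"] == 0]
```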
For image preprocessing I used a tiling technique, which consists of cutting each image into several smaller tiles. I made this choice because the objects (starfish) cover on average less than 1% of the image. So instead of 1280x720 images I trained on 640x360 tiles. This allowed faster training and made the objects proportionally larger, and therefore more visible.
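As a rough illustration, here is one way to cut a 1280x720 frame into four 640x360 tiles with NumPy; this helper is my own sketch, not the exact preprocessing code:

```python
import numpy as np

def make_tiles(image: np.ndarray, rows: int = 2, cols: int = 2) -> list[np.ndarray]:
    """Cut an HxWxC image into a rows x cols grid of tiles.

    For a 1280x720 frame with rows=cols=2 this yields four 640x360 tiles.
    """
    h, w = image.shape[:2]
    th, tw = h // rows, w // cols
    return [
        image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
        for r in range(rows)
        for c in range(cols)
    ]
```

Note that the bounding-box annotations have to be translated into each tile's coordinate frame as well, and boxes cut by a tile border need handling (e.g. clipped or dropped).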
Here is an interesting research paper on the subject: *The Power of Tiling for Small Object Detection*.
I decided to fine-tune a yolov5s model, changing some of the augmentation hyperparameters. My training images were the tiles (640x360), while my validation images were the full-size images (1280x720). The model selection metric was the F2 score (obtained by modifying the YOLOv5 source files).
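For reference, the F2 score is the F-beta score with beta = 2, which weights recall more heavily than precision; a minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta = 2 favours recall over precision,
    which suits a setting where missing an object costs more than a false alarm."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: precision 0.6, recall 0.8 -> F2 = 0.75
print(f_beta(0.6, 0.8))
```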
I proceeded in two steps:
- First I made predictions in the standard way, tuning the inference hyperparameters as well as possible (confidence threshold, IoU threshold, test-time augmentation, image resolution); a sketch of this inference setup follows the list.
- Then I took my best model and added tracking with the norfair library, which improved my results (see the tracking sketch after this list). Finally, I performed a model ensemble with WBF (Weighted Boxes Fusion), i.e. I merged the predictions of two models (see the WBF sketch below):
  - a model trained on the image tiles
  - a model trained on the full-size images
The results were even better.
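Here is a minimal sketch of the first step using the YOLOv5 PyTorch Hub interface; the weights path is hypothetical and the threshold values are illustrative, not my final ones:

```python
import torch

# Load the fine-tuned yolov5s weights (path is hypothetical).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

# Inference hyperparameters tuned on the validation set (illustrative values).
model.conf = 0.25  # confidence threshold
model.iou = 0.45   # NMS IoU threshold

# size controls the inference resolution; augment=True enables test-time augmentation.
results = model("frame.jpg", size=1280, augment=True)
boxes = results.xyxy[0]  # (N, 6) tensor: x1, y1, x2, y2, confidence, class
```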
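For the tracking step, here is a minimal norfair sketch, assuming per-frame detections as (x1, y1, x2, y2, score) arrays; I track box centroids here, which is one simple choice, not necessarily the exact setup I used:

```python
import numpy as np
from norfair import Detection, Tracker

def euclidean_distance(detection, tracked_object):
    # Distance between the detected centroid and the tracker's current estimate.
    return np.linalg.norm(detection.points - tracked_object.estimate)

tracker = Tracker(distance_function=euclidean_distance, distance_threshold=30)

def track_frame(frame_boxes: np.ndarray):
    """frame_boxes: (N, 5) array of x1, y1, x2, y2, score for one frame."""
    detections = [
        Detection(
            points=np.array([[(x1 + x2) / 2, (y1 + y2) / 2]]),
            scores=np.array([score]),
        )
        for x1, y1, x2, y2, score in frame_boxes
    ]
    # Tracked objects carry a stable id across frames, which lets you keep
    # (or re-score) detections that persist over several consecutive frames.
    return tracker.update(detections=detections)
```

The benefit comes from temporal consistency: a starfish detected in consecutive frames can be kept even when its single-frame confidence dips.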
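And here is a sketch of the WBF ensemble with the ensemble-boxes package, merging the tile-trained and full-size-trained models' predictions for one image; note that WBF expects coordinates normalised to [0, 1], and the dummy boxes, weights, and thresholds below are placeholders:

```python
from ensemble_boxes import weighted_boxes_fusion

# Dummy predictions for one image from the two models (normalised to [0, 1]).
boxes_list = [
    [[0.10, 0.10, 0.30, 0.30], [0.50, 0.50, 0.70, 0.70]],  # tile-trained model
    [[0.11, 0.09, 0.31, 0.29]],                            # full-size model
]
scores_list = [[0.9, 0.4], [0.8]]
labels_list = [[0, 0], [0]]  # single class: starfish

fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
    boxes_list,
    scores_list,
    labels_list,
    weights=[1, 1],    # relative trust in each model (placeholder)
    iou_thr=0.55,      # boxes with IoU above this are fused together
    skip_box_thr=0.01, # drop very low-confidence boxes before fusion
)
```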