Exploring YOLOv8: Experimental Discoveries using the xVIEW Aircraft Dataset

Vishnu Nair
19 min read · Sep 7, 2023
Image by Johny Goerend

What is YOLO-v8?
YOLOv8 is a computer vision model built by the Ultralytics team and released on January 10, 2023. It offers detection, classification, and segmentation capabilities through both a Python API and a command-line interface. It boasts a 5% accuracy improvement over the previous YOLOv5 model and is faster than previous versions as well. The YOLOv8 model was pretrained on the COCO dataset and provides accurate predictions for a wide range of classes while also being easy to fine-tune on new ones. The major differences in this iteration compared to its ancestors are anchor-free detection, the new C2f convolutional blocks (which replace YOLOv5's C3 blocks), and mosaic augmentation, all of which are explained in this blog.

Architecture Differences

Anchors

YOLOv8 uses anchor-free detection, unlike YOLOv5. Anchor-free detection is an approach in object detection that eliminates the use of predefined anchor boxes or anchor points. Traditional object detection methods rely on predefined anchor boxes of various sizes and aspect ratios to localize and classify objects. In contrast, anchor-free methods predict the bounding boxes directly, without any anchor boxes. As a side benefit, producing fewer candidate boxes also speeds up non-maximum suppression (NMS).

In an anchor-free head, there are no predefined anchor boxes associated with each grid cell.
Instead of relying on anchor boxes, the anchor-free head directly predicts bounding box coordinates for object detection. Objectness scores are also predicted to indicate the presence or absence of an object within each predicted bounding box. The predictions are often based on keypoint detection, center point estimation, or other methods that directly infer the bounding box location without relying on anchor boxes. The model does not need to learn to adjust anchor boxes during training. Instead, it directly predicts the bounding box coordinates for each object without any anchor box references. The model is thus more flexible and potentially more efficient in detecting objects with different sizes and aspect ratios.
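To make this concrete, here is a minimal NumPy sketch (not the Ultralytics implementation) of how an anchor-free head's per-cell distance predictions could be decoded into boxes; the (left, top, right, bottom) convention and tensor shapes are assumptions chosen for illustration.

```python
import numpy as np

def decode_anchor_free(ltrb, stride):
    """Decode per-cell (left, top, right, bottom) distances into xyxy boxes.

    ltrb: array of shape (H, W, 4) holding predicted distances in pixels.
    stride: downsampling factor of this feature map (e.g. 8, 16, or 32).
    """
    h, w, _ = ltrb.shape
    # Each grid cell's centre in image coordinates acts as the reference point.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    cx = (xs + 0.5) * stride
    cy = (ys + 0.5) * stride
    x1 = cx - ltrb[..., 0]
    y1 = cy - ltrb[..., 1]
    x2 = cx + ltrb[..., 2]
    y2 = cy + ltrb[..., 3]
    return np.stack([x1, y1, x2, y2], axis=-1)  # (H, W, 4) boxes, no anchors needed
```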

YOLOv5 versus YOLOv8 head. Figure by OpenMMLab

Convolutional Blocks

YOLOv8 introduces a new C2f module (a faster CSP bottleneck built around two convolutions). The stem leading into the backbone, a 6x6 convolution in YOLOv5, is now a 3x3 CBS block (Conv + BatchNorm + SiLU). Inside the C2f module, bottleneck blocks made of two 3x3 convolutions are applied and their outputs concatenated. A bottleneck block is a short sequence of convolutional layers followed by non-linear activations (SiLU in YOLOv8); its purpose is to compress the channel dimension of the input feature maps and extract more abstract representations. The first convolution of the bottleneck was 1x1 in YOLOv5 but is now 3x3. All of YOLOv5's C3 units (CSP bottlenecks with three convolutions) have been replaced with C2f modules. Additionally, YOLOv8 uses a decoupled detection head that predicts box regression and classification in separate branches and drops the separate objectness branch used in YOLOv5.
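To make the structure concrete, here is a simplified PyTorch sketch of a C2f-style block; the channel widths, defaults, and other details are assumptions and differ from the actual Ultralytics modules.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv + BatchNorm + SiLU, the 'CBS' building block."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 CBS blocks with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)
        self.add = shortcut
    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """Split the features, run n bottlenecks on one half, concatenate everything."""
    def __init__(self, c_in, c_out, n=1, shortcut=False):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))
    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split along the channel dimension
        for m in self.m:
            y.append(m(y[-1]))                   # each bottleneck feeds the next
        return self.cv2(torch.cat(y, dim=1))     # concat all intermediate maps
```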

C3 vs C2f blocks. Figure by OpenMMLab

Within the C2f module, there are split and concat nodes. The concat node combines, along the channel dimension, the two halves produced by the split together with the output of every bottleneck block in the chain. Fusing these intermediate feature maps creates a unified, more informative feature map and gives the block multiple gradient paths, enhancing the model's ability to capture fine-grained details and context for more accurate object detection.

The split node divides the incoming feature map into two halves along the channel dimension. One half is passed through the chain of bottleneck blocks while the other is carried forward untouched, and everything is concatenated at the end. Splitting before the bottlenecks serves several purposes:

  • Branch-specific feature extraction: the processed branch learns progressively more abstract features as it passes through the stacked bottlenecks, while the bypassed branch preserves the original, lower-level information. Concatenating the two gives the detector access to both kinds of features, improving detection performance
  • Cheaper, parallelizable computation: because the channels are split, the bottleneck chain operates on a narrower feature map. This reduces computation relative to processing the full-width tensor, and the independent branches can be evaluated efficiently, potentially reducing inference time
  • Diverse receptive fields: each successive bottleneck output fed into the concat has passed through more 3x3 convolutions and therefore has a larger receptive field. Combining these outputs gives the C2f module features spanning a range of receptive fields, balancing local detail against wider context, and also provides shorter gradient paths during training

About the xView Dataset

The xView dataset was created by Lam et al. in 2018. The data consists of satellite imagery collected from the WorldView-3 satellite at a 0.3m ground sampling distance. It is one of the largest publicly available object detection datasets and contains 60 classes of objects. The dataset covers a wide range of geographical locations, allowing for a broader understanding of object detection challenges across different environments, terrains, and landscapes. It has been widely used as a benchmark for evaluating the performance of object detection algorithms, enabling researchers to compare different approaches and techniques, and has spurred the development of more accurate and robust algorithms for satellite imagery analysis. For our purposes, we use the xView aircraft dataset, a subset of xView that contains 5 classes of airplanes. If you wish to learn more about the dataset, check out this paper!

Experiments and Findings

I conducted a series of experiments in which different hyperparameters, learning rates, optimizers, and image rotations were tested with the aim of improving model performance on the xView aircraft dataset. As stated before, this dataset contains images of 5 different types of aircraft.

Figure 1. Histogram of class counts

Objectness Weight

In this experiment, I aim to observe whether tuning the loss-gain hyperparameters in YOLOv8, specifically the distribution focal loss (DFL) gain, will yield improved results. The distribution focal loss extends the concept of focal loss introduced with the RetinaNet object detection model. Focal loss is designed to mitigate the impact of easy background examples during training, which can otherwise dominate the loss and lead to slow convergence and poor performance; it achieves this by assigning a larger weight to hard examples that are misclassified. DFL applies a related idea to bounding box regression: instead of regressing a single value for each box edge, the model predicts a discrete probability distribution over possible offsets, and the loss pushes that distribution to concentrate its mass on the bins closest to the ground-truth value. This gives the regression branch a richer, better-calibrated training signal and helps the model localize objects whose boundaries are ambiguous or hard to pin down. Combined with the classification loss, DFL provides a balanced and effective training signal for object detection models, improving overall performance.
Three losses were employed in the YOLOv5 model; in YOLOv8 the separate objectness loss is dropped, leaving a box loss (a combination of CIoU and DFL terms) and a class loss:

  • Box Loss (Localization Loss):
    The box loss measures the discrepancy between the predicted bounding box coordinates (x, y, width, height) and the ground-truth bounding box coordinates for objects present in the image. The model learns to minimize this loss by adjusting the predicted bounding box coordinates to match the ground-truth values more accurately. Common loss functions used for box loss include mean squared error (MSE) or smooth L1 loss.
  • Class Loss:
    The class loss is responsible for optimizing the classification accuracy of the model. It quantifies the difference between the predicted class probabilities and the true class labels associated with the objects in the image. The class loss encourages the model to assign higher probabilities to the correct object classes and lower probabilities to incorrect or background classes. Typically, cross-entropy loss is used as the class loss function.
  • Objectness Loss:
    The objectness loss evaluates the confidence or objectness score predicted by the model for each bounding box proposal. This score represents the likelihood of an object being present within the corresponding box. The objectness loss encourages the model to assign high objectness scores to boxes containing objects and low scores to empty or background regions. Binary cross-entropy loss or logistic loss is commonly used as the objectness loss function.

For initial experimentation, we chose DFL gain values of 3, 6, and 9. As we can see in Figure 2, this parameter does not make a significant impact on overall model performance. An early stopping condition was used: if there is no significant decrease in loss for 5 epochs, the training process stops. The Adam optimizer with a DFL gain of 6 performed best, and Adam outperformed SGD overall, though not by a substantial margin. These results suggest that manipulating this particular parameter alone is not sufficient to achieve significant improvements in the model's performance, and that the choice of optimizer may have a more noticeable effect than the DFL gain itself.
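For reference, a run in this kind of sweep can be launched through the Ultralytics Python API roughly as follows; the dataset YAML name is a placeholder, while dfl, optimizer, and patience are standard training arguments.

```python
from ultralytics import YOLO

# Sweep the distribution focal loss gain; the dataset config name is assumed.
for dfl_gain in (3.0, 6.0, 9.0):
    model = YOLO("yolov8m.pt")                # pretrained checkpoint
    model.train(
        data="xview_aircraft.yaml",           # assumed dataset YAML
        epochs=100,
        patience=5,                           # early stopping after 5 stagnant epochs
        optimizer="Adam",                     # or "SGD" for the comparison runs
        dfl=dfl_gain,                         # distribution focal loss gain
        name=f"dfl_{dfl_gain}",
    )
```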

Figure 2. DFL Tests

Image Rotations

In this experiment, we observe the effects of image rotation on model performance. Across the board, applying rotations to the images does not increase model performance. In fact, mAP values plateau below 50%, significantly lower than what we observed in the previous experiments, so based on this finding it would be best not to rotate the images during training. It is important to consider these results within the context of the specific experiment and dataset used: the nature of the dataset, the complexity of the objects being detected, and the specific model architecture employed could all influence the impact of image rotations on model performance. Some architectures are simply not well suited to rotated inputs; rotation-sensitive models, such as those using fixed-size convolutional filters, may struggle to detect objects in rotated images. And if the model is not trained for a sufficient number of epochs or with the right learning rate, it may not fully adapt to the augmented data either.
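For context, rotation augmentation in these runs is governed by the degrees hyperparameter (random rotation in the range ± degrees). A rough sketch with an assumed dataset YAML:

```python
from ultralytics import YOLO

# Enable random rotations of up to +/- 90 degrees during training.
model = YOLO("yolov8m.pt")
model.train(data="xview_aircraft.yaml", epochs=100, patience=5, degrees=90.0)
```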

Further analysis and investigation may be warranted to understand why image rotations did not contribute to improved model performance. This could involve exploring other augmentation techniques, adjusting the degree of rotation, or examining the underlying characteristics of the dataset and model architecture.

Figure 3. Rotation Tests

Learning Rate

In this experiment, we test different learning rates. Learning rates of 0.01 and 0.001 were tested first. For these, performance either plateaus early or training terminates early, since we set the patience to 5 (training stops if performance does not improve for 5 epochs). Our intuition was that the learning rate may be too high in later epochs, causing the optimizer to bounce around a local/global minimum. To remedy this, we ran a learning rate decay from 0.01 to 0.001. From the plot below, we notice that SGD with learning rate decay, as well as Adam with a learning rate of 0.001, performs the best. Interestingly, the learning rate decay did not work as well with the Adam optimizer. The RMSProp optimizer was also tested and performed significantly worse than the other two. SGD might converge faster and find a better minimum in the loss landscape for this specific object detection problem: Adam and RMSProp use adaptive learning rates, and sometimes they do not converge as efficiently as SGD, especially on certain datasets such as the xView aircraft dataset used here. Their adaptiveness might also lead to overfitting in some cases, especially if the dataset is not large enough or has high variance, whereas SGD may be more robust to variations in data distribution and class imbalance. For this reason, we consider only SGD a viable optimizer candidate for the final model.
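As a rough sketch of how such a decay schedule can be configured with the Ultralytics trainer: lr0 is the initial learning rate and lrf the final learning rate expressed as a fraction of lr0, so lr0=0.01 with lrf=0.1 decays from roughly 0.01 down to 0.001 over training (dataset YAML assumed).

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.train(
    data="xview_aircraft.yaml",   # assumed dataset config
    epochs=100,
    patience=5,
    optimizer="SGD",
    lr0=0.01,                     # initial learning rate
    lrf=0.1,                      # final LR = lr0 * lrf = 0.001
)
```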

Figure 4. Learning Rate Tests

Smaller Training Set

For this experiment, we explored various learning rates using smaller training sets and trained these models for 75 epochs to test model robustness. The dataset used contained 7721 image-label pairs in total. We tested using 25% (1291 samples), 50% (2582 samples), and 75% (3873 samples) of the dataset to see how mAP50-95 (mAP averaged over IoU thresholds from 0.50 to 0.95) would vary, the goal being to observe how much the model is impacted by the amount of training data it is given. From the plots below, we note that the SGD optimizer with a decayed learning rate and 3873 samples (75% of the dataset) performs the best, as expected. The models trained on 25% also performed worse, as expected, but only by a ~5% decrease in mAP50-95. This gives us a sense of how robust the pretrained YOLOv8 model is: even with such a small proportion of the dataset, we do not see a large drop in performance.
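A small helper along these lines can be used to carve out the reduced training sets; the folder layout and file extensions below are assumptions, not the exact script used.

```python
import random
import shutil
from pathlib import Path

def make_subset(img_dir, lbl_dir, out_dir, fraction, seed=0):
    """Copy a random fraction of image/label pairs into a new training folder."""
    images = sorted(Path(img_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    keep = images[: int(len(images) * fraction)]
    (Path(out_dir) / "images").mkdir(parents=True, exist_ok=True)
    (Path(out_dir) / "labels").mkdir(parents=True, exist_ok=True)
    for img in keep:
        lbl = Path(lbl_dir) / (img.stem + ".txt")
        shutil.copy(img, Path(out_dir) / "images" / img.name)
        if lbl.exists():                      # background images may have no label file
            shutil.copy(lbl, Path(out_dir) / "labels" / lbl.name)

# e.g. make_subset("train/images", "train/labels", "train_25", 0.25)
```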

Figure 5. Reduced training data tests

Deleted/Skewed Labels

In this experiment, we made adjustments to the YOLOv8 codebase directly. Parameters were added so that the user can specify what percentage of the dataset's labels to modify, either by skewing them or by deleting them. The purpose of this experiment was once again to test model robustness, except in terms of labels this time! All modifications were made in the default.yaml and dataset.py files within the codebase. A handful of experiments were then run using different percentages of modified labels (both skews and deletions) to probe the model's resilience to varying label quality, spanning the spectrum from subtle tweaks to pronounced label deletions. The plot depicted below speaks to the robust nature of our YOLOv8-based approach: surprisingly, even the wholesale deletion of 50% of the labels yields a performance dip of merely around 3-4%, a testament to the model's remarkable stability and adaptability.
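The sketch below illustrates the kind of label corruption applied, written as a standalone helper rather than the actual patch made to default.yaml/dataset.py; the fractions, jitter range, and file layout are assumptions.

```python
import random
from pathlib import Path

def corrupt_labels(lbl_dir, delete_frac=0.0, skew_frac=0.0, max_shift=0.05, seed=0):
    """Delete or perturb a fraction of YOLO-format detection labels (illustrative only)."""
    rng = random.Random(seed)
    for lbl in Path(lbl_dir).glob("*.txt"):
        kept = []
        for line in lbl.read_text().splitlines():
            if rng.random() < delete_frac:
                continue                                   # drop this annotation entirely
            # assumes detection-format labels: class x_center y_center width height
            cls, x, y, w, h = line.split()
            if rng.random() < skew_frac:                   # jitter the box centre
                x = str(min(max(float(x) + rng.uniform(-max_shift, max_shift), 0.0), 1.0))
                y = str(min(max(float(y) + rng.uniform(-max_shift, max_shift), 0.0), 1.0))
            kept.append(" ".join([cls, x, y, w, h]))
        lbl.write_text("\n".join(kept) + ("\n" if kept else ""))
```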

Figure 6. Missing/skewed label tests

Edge Analysis

Additionally, edge analysis was performed on the best model (SGD with learning rate decay from 0.01 to 0.001). A script was written to find the proportion of incorrect predictions that occur when the object of interest (an aircraft) appears on the edge of the image, the goal being to reveal any weakness the model has when the object of interest is cut off in the image. We defined objects that lie 5% or less away from the edges of the image to be "edge detections". It was observed that a majority of false positives, false negatives, and misclassifications do in fact occur on the edge of the image (see table below). This discovery not only underscores the model's sensitivity to spatial positioning but also beckons a deeper inquiry into the underlying dynamics that contribute to these edge-related predictions. A potential resolution would be to tune the stride, padding, and convolutional filter size parameters in the model to account for edge pixels better. Stride, the step size at which the convolutional filter traverses the image, bears the potential to optimize the model's perception of pixel context along the edges. Padding, on the other hand, influences the handling of border pixels during convolution, potentially bolstering the model's ability to capture critical edge information. Lastly, optimizing convolutional filter sizes within the YOLOv8 model could yield an improved receptive field, enhancing the model's capacity to discern object features at varying spatial resolutions, including the pivotal edge regions.
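A check along these lines can flag edge detections; the sketch below assumes YOLO-format normalized box coordinates and mirrors the 5% border definition above.

```python
def is_edge_box(x_center, y_center, width, height, margin=0.05):
    """Return True if a normalized YOLO box touches the 5% border band of the image."""
    x1, y1 = x_center - width / 2, y_center - height / 2
    x2, y2 = x_center + width / 2, y_center + height / 2
    return (x1 <= margin or y1 <= margin or
            x2 >= 1.0 - margin or y2 >= 1.0 - margin)
```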

A spatial heatmap of the validation set images was also plotted to see which portions of the image the model most frequently makes mistakes on. Plotting the x- and y-centers of the incorrect predictions and binning them across all of our 640x640 images, we see that the model fails most often when objects are located on the left and bottom sides of the image. We initially expected failures to be uniformly distributed among all four edges, so this was a notable idiosyncrasy. Such findings underline the significance of spatial context in determining detection success and suggest the presence of underlying visual and contextual intricacies that warrant further investigation. This insight, gained through the spatial heatmap, offers valuable guidance for refining model training, dataset augmentation, and potential architectural adaptations to address these specific spatial vulnerabilities and elevate overall detection accuracy.
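A heatmap like the one described can be produced by binning the error centers with NumPy and plotting the counts; this is a generic sketch, not the exact plotting code used.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_error_heatmap(x_centers, y_centers, img_size=640, bins=16):
    """Bin the pixel centres of incorrect predictions into a 2D histogram and plot it."""
    heat, _, _ = np.histogram2d(y_centers, x_centers, bins=bins,
                                range=[[0, img_size], [0, img_size]])
    plt.imshow(heat, cmap="hot", extent=[0, img_size, img_size, 0])
    plt.colorbar(label="error count")
    plt.xlabel("x centre (px)")
    plt.ylabel("y centre (px)")
    plt.show()
```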

Figure 7. Spatial heatmap of model failures

Below, we see a plot that highlights each error type by class. We notice that the passenger/cargo plane and fixed-wing aircraft classes contain the highest number of incorrect predictions, which is interesting as these classes possess the highest number of instances in our dataset. A compelling facet to consider is the prevalence of false positives across all classes, particularly along the image’s edge. This phenomenon potentially stems from the challenge of discerning between aircraft and the backdrop, thus prompting the exploration of targeted remedies. One potential avenue involves enriching the training dataset with a more diverse array of background imagery. This augmentation could assist the model in distinguishing aircraft from surrounding terrain more adeptly. Additionally, augmenting the training dataset’s volume while managing class imbalance through innovative sampling strategies offers an avenue for enhancing performance and addressing these specific challenges head-on. By delving into these insights and implementing them strategically, we can bolster the model’s ability to overcome these distinct hurdles and further elevate its detection prowess.

Figure 8. Error breakdown by class

Model Robustness Tests — A Deep Dive into the Effects of Missing Labels

Next, we take a deeper dive into testing model robustness by examining the effects of deleting labels. The purpose is to observe how model performance varies when we have an inadequately labeled dataset. In the plot below, we see how the number of false negatives increases as we delete more labels from the training dataset. The labels on the x-axis correspond to deleting 5%, 10%, 25%, 50%, 75%, 90%, and 95% of the labels. Even when 50% of the labels are deleted, the model does not take a significant dip in performance: of the 7,353 labels we have in total, only about 3,676 are needed for the model to avoid a significant hit in performance. This unexpected resilience in the face of substantial label deletions sparks a new thread of inquiry: does the model's proficiency rest on certain key instances, encapsulating the essence of each class? Does the model adeptly grasp the overarching patterns, transcending the intricacies of individual examples?

Figure 9. False negative count post-label deletion

Here we see another plot that lets us view the progression of false positives, false negatives, and misclassifications as we delete an increasing number of labels. As observed before, the number of false negatives begins to increase rapidly once we delete 50% of the labels. The other types of incorrect predictions converge to zero as the model becomes unable to detect any objects once we delete 90% and 95% of the labels (where we have approximately 550 labels left).

Figure 10. Error breakdown post-label deletions

From the above plots, we can visualize the robustness of the YOLOv8 model. We may attribute this to the overall architecture as well as the mosaic augmentation strategy it uses during the training phase, where images are stitched together randomly with varying degrees of overlap. The benefits of mosaic augmentation are multifaceted. Most importantly, it introduces diversity into the training dataset. This diversity helps the model develop a more comprehensive understanding of object appearances, spatial relationships, and context variations. The model becomes adept at recognizing objects from different angles, scales, and relative positions. This, in turn, fosters improved generalization, making the model less prone to overfitting to specific scenarios. The model's ability to decipher complex spatial relationships and contextual cues, aided by mosaic augmentation, may contribute greatly to this robustness in scenarios where annotations are sparse: the model leverages these mosaic-imbued insights to make sense of the images it encounters even when labels are missing.

Incorporating Additional Background Imagery

Lastly, I conducted an experiment in which additional background imagery (conditioned on the false positives obtained from the model) was incorporated into the training dataset with the aim of lowering the number of false positives. Specifically, 10,985 background images were added to the dataset. By conditioning this augmentation on the false positives identified by the model, we directly address one of the challenges it faces: over-identifying objects against the background.
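In the YOLO data format, a background image is simply an image with an empty label file (or no label file at all), so incorporating them can be as simple as the sketch below; the paths are assumptions.

```python
from pathlib import Path
import shutil

def add_background_images(bg_img_dir, train_img_dir, train_lbl_dir):
    """Add background-only images to a YOLO-format dataset."""
    for img in Path(bg_img_dir).glob("*.jpg"):
        shutil.copy(img, Path(train_img_dir) / img.name)
        # An empty label file marks the image as containing no objects.
        (Path(train_lbl_dir) / (img.stem + ".txt")).write_text("")
```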

From the table above, we can see the results I obtained after metric collection. I observed a 7% decrease in the number of false positives after incorporating the new imagery. This is a significant finding as it informs us that when we scale the model on larger datasets, integration of background imagery is a potent strategy to bolster the model’s capacity to discern between the aircraft of interest and the backdrop.

We also notice a significant decrease in the amount of misclassifications. However, these improvements come at the expense of having an increased number of false negatives. From this, we learn that it is important to have a balance of background imagery and object-containing imagery in hopes of balancing these metrics. Else, we can customize the ratio at which we want to have both classes of imagery to balance precision/recall as we see fit. Overall, this experiment underscores the customization potential that augmentation techniques offer.

In summary, thorough experimentation was conducted on the YOLOv8 model using the xView aircraft dataset, shedding light on both its strengths and areas for further refinement. Although we have gleaned much insight from these trials, there is always more that can be done to better characterize the model. New techniques, strategies, and refinements will undoubtedly emerge, potentially unveiling additional layers of optimization for the YOLOv8 model. The showcased predictions, generated via an SGD optimizer with a tuned learning rate decay, exemplify the potential of this model to capture and classify complex objects in real-world satellite imagery! These results not only highlight the model’s ability to differentiate between various classes, but also emphasize the tangible impact that the YOLOv8 model can have on diverse applications, from surveillance to environmental monitoring.

Thank you for reading :)

General Notes for YOLOv8 Beginners

What are anchor boxes?

When designing object detection models, anchor boxes are often tailored to the distribution of object sizes and aspect ratios found in a benchmark dataset used for training and evaluation. The anchor boxes are optimized to cover the typical range of objects present in that particular dataset. When applying the same model with predefined anchor boxes to a custom dataset, the distribution of objects in the custom dataset might differ from that of the benchmark dataset. Objects in the custom dataset may have varying sizes, aspect ratios, or spatial distributions that are not adequately represented by the predefined anchor boxes.

In an anchor-based head, predefined anchor boxes of various sizes and aspect ratios are associated with each grid cell of the detection network. The anchor-based head predicts bounding box coordinates (e.g., center coordinates, width, and height) relative to the anchor boxes. Objectness scores are also predicted to indicate the presence or absence of an object within each anchor box.
The predictions are typically based on the anchor box that has the highest intersection-over-union (IoU) with the ground truth bounding box.

What is reg_max?

reg_max is the parameter that sets the number of discrete bins used by the Distribution Focal Loss head for bounding box regression. Instead of regressing a single number, YOLOv8 predicts, for each side of the box (the distance from a grid cell's center to the left, top, right, and bottom edges), a probability distribution over reg_max bins; the expected value of that distribution, scaled by the feature-map stride, gives the predicted distance.

This also bounds the regression output: the largest representable offset is reg_max - 1 grid units, which keeps the predictions numerically stable and prevents unbounded values.
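A minimal sketch of how those per-side distributions can be collapsed into distances (the Ultralytics code folds this into a small fixed convolution, but the idea is the expectation below):

```python
import torch

def dfl_decode(logits, reg_max=16):
    """Turn DFL logits into one distance per box side.

    logits: tensor of shape (..., 4, reg_max) -- one discrete distribution per side.
    Returns the expected value of each distribution, i.e. a distance in
    [0, reg_max - 1] grid units (multiply by the stride to get pixels).
    """
    probs = logits.softmax(dim=-1)                      # distribution over bins
    bins = torch.arange(reg_max, dtype=probs.dtype)     # 0, 1, ..., reg_max - 1
    return (probs * bins).sum(dim=-1)                   # expectation = predicted offset
```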

What is box loss, class loss, and CIoU?

  • The box loss is computed using CIoU (complete intersection over union) in tandem with the DFL loss described above
  • The class loss is computed using classic binary cross-entropy (BCE)
  • In other words, the classification branch uses BCE loss, while the regression branch employs both the Distribution Focal Loss and the CIoU loss; the three losses are combined according to a specific weight ratio
The CIoU loss incorporates two main components: the IoU loss and the distance term. The IoU loss component is similar to traditional IoU loss and quantifies the overlapping area between the predicted and ground truth bounding boxes. It encourages the predicted box to have a high IoU with the ground truth box. The distance term, known as the CIoU distance, takes into account the distance between the centers of the predicted and ground truth bounding boxes. It penalizes larger distances between centers and encourages better localization accuracy. The CIoU distance also considers the aspect ratio difference between the predicted and ground truth boxes, promoting more accurate predictions for objects with different shapes.
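For reference, the CIoU loss from the Distance-IoU paper can be written as follows, where b and b^gt are the predicted and ground-truth box centers, ρ is the Euclidean distance between them, and c is the diagonal length of the smallest box enclosing both:

$$\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v,\qquad v = \frac{4}{\pi^2}\Big(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\Big)^2,\qquad \alpha = \frac{v}{(1-\text{IoU}) + v}$$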

What is mosaic augmentation?

Mosaic augmentation is a technique employed in computer vision to enhance the robustness and generalization capabilities of deep learning models, particularly in object detection tasks. This technique draws inspiration from the art of creating a mosaic by combining multiple images into a single canvas. In the context of deep learning, mosaic augmentation involves stitching together several images, along with their corresponding annotations, to form a composite training sample. This merged image is then used for training, providing the model with a holistic view of objects in various spatial arrangements and contexts.
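As a bare-bones illustration of the idea (real mosaic augmentation also picks a random center point, rescales each tile, and remaps the bounding-box labels accordingly), here is a minimal NumPy sketch that stitches four images into a 2x2 canvas:

```python
import numpy as np

def simple_mosaic(imgs, out_size=640):
    """Stitch four images into one 2x2 mosaic canvas.

    Assumes each input image is at least (out_size/2) x (out_size/2) pixels;
    label remapping and random placement are omitted for brevity.
    """
    h = w = out_size // 2
    tiles = [np.asarray(img)[:h, :w] for img in imgs[:4]]   # crop each image to tile size
    top = np.concatenate(tiles[:2], axis=1)
    bottom = np.concatenate(tiles[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)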
