Yolov3 and Yolov4 in Object Detection

An explanation of object detection, with various use cases and algorithms. Specifically, how the YOLOv3 and YOLOv4 architectures are structured and how they perform object detection.
Hyacinth Ampadu
May 04, 2021

Object Detection

Computer vision has many applications, one of which is object detection. Object detection is a subfield of computer vision concerned with detecting the presence, location, and type of objects in images. It combines three functions: object recognition, to find the objects in an image; object localization, to find where exactly in the image the objects are located; and object classification, to determine what kind of objects they are.

In the image above, the objects have been recognized (two objects in the image), localized (bounding boxes around the two objects), and classified (dog and cat, with confidence scores of 98.0% and 88.0% respectively). This is how object detection works. Object detection has several uses, some of which are:

  • For tracking objects, for example tracking a ball during a football match or tennis game, and for traffic monitoring
  • For automated CCTV surveillance, to help increase security and protect people and property
  • For detecting people, mostly by the police and security services, to spot suspicious activity by individuals
  • For vehicle detection, mostly for self-driving cars, to detect vehicles and respond accordingly

Over the past few years, deep learning architectures have achieved state-of-the-art performance in object detection. Many families of object detectors have been created over the years, but the most popular are the YOLO and R-CNN families. This article focuses on the YOLO architecture.

You Only Look Once (YOLO)

The YOLO algorithm was introduced in 2015. It uses a single neural network that is trained end to end: it takes an image as input and outputs the predicted bounding boxes and class labels.

An input image is split into a grid of cells, and each cell is responsible for predicting a bounding box if the center of that box falls within the cell. Each predicted bounding box has x and y coordinates, a height, and a width. In the image above, the image is divided into 9 grid cells. Five of them contain part of an object (the cars), but only 2 contain the center of a car, so those are the cells chosen to make the predictions.

Each cell is represented by a vector holding the x and y coordinates, the height and width of the box, a score for whether an object is present, and the type of object in the cell. If there is no object, the cell does not proceed with any detection. If an object is detected, its type and its bounding box (x and y coordinates, height, and width) are reported and the final prediction is made.
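The idea that only the cell containing the center of a box is responsible for it can be sketched in a few lines. This is a minimal illustration rather than the actual YOLO code; the function name, the 300*300 image size, and the two box centers are hypothetical values chosen to mimic a 3*3 grid with two cars.

```python
def responsible_cell(center_x, center_y, image_size, grid_size):
    """Return the (row, col) of the grid cell that contains a box center.

    Assumes a square image split into grid_size * grid_size equal cells.
    """
    cell_size = image_size / grid_size
    # Clamp so that a center lying exactly on the far edge stays in the grid.
    col = min(int(center_x // cell_size), grid_size - 1)
    row = min(int(center_y // cell_size), grid_size - 1)
    return row, col


# Hypothetical example: a 300*300 image, a 3*3 grid, and two car centers.
print(responsible_cell(80, 160, 300, 3))   # (1, 0)
print(responsible_cell(220, 170, 300, 3))  # (1, 2)
```

Other cells may overlap the cars, but only the two cells returned here would be asked to predict their boxes.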

There are many variants of YOLO developed by researchers. In the following sections, we look at two of them: the YOLOv3 and YOLOv4 object detection architectures.

YOLOv3 Architecture


The YOLOv3 architecture has residual skip connections and upsampling layers. The key novelty in this algorithm is that it makes its detections at three different scales. YOLOv3 is a fully convolutional network, and detection is done by applying a 1*1 kernel to feature maps of three different sizes at three different places in the network, as can be seen in the diagram above.

The shape of the detection kernel is 1*1*(B*(A+C)), where B is the number of bounding boxes a cell can predict, C is the number of classes, and A is the number of attributes per box: 5, made up of the 4 box coordinates (x, y, height, and width) plus 1 objectness score. YOLOv3 was trained on the COCO dataset, which has 80 classes, and it predicts 3 bounding boxes per cell, so the kernel size becomes 1*1*(3*(5+80)) = 1*1*255.
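The arithmetic behind the 255 channels can be checked with a small sketch. It assumes three boxes per cell, with each box carrying four coordinates, one objectness score, and one score per class; the function name is hypothetical.

```python
def detection_kernel_depth(num_boxes, num_classes, box_coordinates=4):
    """Depth of the 1*1 detection kernel.

    Each box carries its coordinates, one objectness score,
    and one score per class.
    """
    attributes_per_box = box_coordinates + 1 + num_classes
    return num_boxes * attributes_per_box


# COCO has 80 classes and YOLOv3 predicts 3 boxes per cell.
print(detection_kernel_depth(num_boxes=3, num_classes=80))  # 255
```

Training on a dataset with a different number of classes changes this depth, for example 3 * (5 + 20) = 75 for a 20-class dataset.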

To make this concrete, assume we are using YOLOv3 to detect objects in an image of size 480*480. As mentioned earlier, YOLOv3 makes detections at three scales, obtained by downsampling the input image by factors of 32, 16, and 8. Over the first 81 layers the image is downsampled until the 81st layer has a stride of 32. For our 480*480 image, the resulting feature map is 480/32 = 15 cells on each side, that is 15*15. The first detection is performed here, at the 82nd layer, using the 1*1 detection kernel described above, resulting in a detection feature map of 15*15*255.

The feature map from the 79th layer is then upsampled to twice its dimensions (30*30) and concatenated with the feature map from the 61st layer. The second detection is made at the 94th layer, generating a feature map of 30*30*255. A similar process follows: the feature map from the 91st layer is upsampled and concatenated with the feature map from the 36th layer, and the result is passed through a few 1*1 convolution layers to fuse the information from the earlier layers. The final detection is made at the 106th layer, generating a feature map of 60*60*255. Each scale has a role in the detection: the 15*15 map helps detect large objects, the 30*30 map helps detect medium-sized objects, and the 60*60 map helps detect small objects.
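The feature map sizes in this walkthrough follow directly from the three strides. The sketch below reproduces them for the 480*480 example, assuming 3 boxes per cell and 80 classes; the function names are hypothetical and this is only a calculation, not part of any YOLO library.

```python
def detection_grids(input_size, strides=(32, 16, 8),
                    boxes_per_cell=3, num_classes=80):
    """Return (height, width, depth) of the detection map at each scale."""
    depth = boxes_per_cell * (5 + num_classes)
    grids = []
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError("input size must be a multiple of every stride")
        side = input_size // stride
        grids.append((side, side, depth))
    return grids


def total_predicted_boxes(input_size, strides=(32, 16, 8), boxes_per_cell=3):
    """Total number of boxes predicted across all three scales."""
    return sum((input_size // s) ** 2 * boxes_per_cell for s in strides)


print(detection_grids(480))        # [(15, 15, 255), (30, 30, 255), (60, 60, 255)]
print(total_predicted_boxes(480))  # 14175
```

The coarse 15*15 grid has the largest cells, which is why it suits large objects, while the fine 60*60 grid suits small ones.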

YOLOv3 uses independent logistic classifiers in place of the softmax function to predict the class of each detected object, so a box can carry more than one label. It also replaces the mean squared error with the binary cross-entropy loss for the class predictions. In simpler terms, both the probability that an object is present and the class predictions are produced using logistic regression.
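The difference between independent logistic classifiers and a softmax can be seen in a small sketch. The raw scores and the class names ("person", "woman", "dog") are hypothetical; the point is only that sigmoid scores do not compete with each other, whereas softmax scores must sum to 1.

```python
import math


def sigmoid(x):
    """Logistic function: maps a raw score to an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Softmax: probabilities that compete and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over independent class predictions."""
    losses = []
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Hypothetical raw scores for the classes "person", "woman", "dog".
logits = [2.0, 1.5, -3.0]
independent = [sigmoid(v) for v in logits]
competing = softmax(logits)

print([round(p, 2) for p in independent])  # [0.88, 0.82, 0.05]
print([round(p, 2) for p in competing])    # [0.62, 0.38, 0.0]
print(binary_cross_entropy([1, 1, 0], independent))
```

With the logistic classifiers, both "person" and "woman" can score highly for the same box, which a softmax would not allow.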

YOLOv4 Architecture

YOLOv4 improves on YOLOv3, raising the mean average precision (mAP) by as much as 10% and the number of frames per second by 12%. The object detector diagram above shows 4 distinct blocks: the backbone, the neck, the dense prediction, and the sparse prediction. As a one-stage detector, YOLOv4 itself uses the first three; the sparse prediction block belongs to two-stage detectors such as the R-CNN family.

The backbone is the feature extraction network, which is CSPDarknet53. The CSP in CSPDarknet53 stands for Cross-Stage Partial connections: the feature map of the current layer is split into two parts, one that passes through the convolution layers and one that bypasses them, after which the two results are aggregated. Above is an example with DenseNet.
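The split-and-merge idea behind cross-stage partial connections can be sketched with plain lists. This is a simplified illustration, not the CSPDarknet53 implementation: the feature map is modeled as a list of channels, `toy_block` is a hypothetical stand-in for the convolution layers, and real implementations differ in how the split and the merge are carried out.

```python
def csp_stage(feature_map, block):
    """Cross-stage partial connection over a list of channels.

    One half of the channels bypasses the block untouched, the other
    half goes through it, and the two halves are concatenated again.
    """
    half = len(feature_map) // 2
    bypass_part = feature_map[:half]
    processed_part = block(feature_map[half:])
    return bypass_part + processed_part


def toy_block(channels):
    """Hypothetical stand-in for convolution layers: doubles every value."""
    return [[2 * value for value in channel] for channel in channels]


# Four channels with two values each.
features = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(csp_stage(features, toy_block))
# [[1, 2], [3, 4], [10, 12], [14, 16]]
```

Because only part of the feature map goes through the expensive layers, the stage does less computation while still passing the untouched features forward.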

The neck adds layers between the backbone and the dense prediction block (head) to collect and combine feature maps from different stages of the backbone, which is loosely similar to how the skip connections in ResNet carry earlier features forward. The YOLOv4 architecture uses a modified path aggregation network (PAN), a modified spatial attention module (SAM), and a modified spatial pyramid pooling (SPP) block, all of which aggregate information to improve accuracy. The image above shows spatial pyramid pooling.
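As a rough sketch of the pooling idea, the code below applies max pooling with several window sizes at stride 1 to a single channel and stacks the results next to the original. It assumes the YOLO-style variant of spatial pyramid pooling in which the spatial size is preserved; the kernel sizes 5, 9, and 13 are an assumption, and the function names are hypothetical.

```python
def max_pool_same(channel, kernel):
    """Stride-1 max pooling that keeps the spatial size of a 2D channel.

    Windows are clipped at the borders instead of being padded.
    """
    height, width = len(channel), len(channel[0])
    radius = kernel // 2
    pooled = []
    for i in range(height):
        row = []
        for j in range(width):
            window = [
                channel[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def spatial_pyramid_pooling(channel, kernels=(5, 9, 13)):
    """Stack the input with its pooled versions (one per kernel size)."""
    return [channel] + [max_pool_same(channel, k) for k in kernels]


channel = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
print(max_pool_same(channel, 3))  # [[5, 6, 6], [8, 9, 9], [8, 9, 9]]
print(len(spatial_pyramid_pooling(channel)))  # 4
```

Each pooled map summarizes a larger neighborhood than the last, so the stacked output mixes fine and coarse context without changing the width and height of the feature map.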

The head (dense prediction) is used for locating bounding boxes and for classification. The process is the same as the one described for YOLOv3: the bounding box coordinates (x, y, height, and width) are predicted together with a confidence score. Remember, the main idea of the YOLO algorithm is to divide an input image into grid cells and, using anchor boxes, predict the probability that each cell contains an object. The output is then a vector with the bounding box coordinates and the class probabilities.
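To illustrate what such an output vector might look like once it is read back, here is a minimal sketch that splits the vector of one grid cell into its boxes. The layout (x, y, width, height, objectness, then class scores, repeated per box) is an assumption about the ordering, the values are assumed to be probabilities already, and the function name is hypothetical.

```python
def split_cell_prediction(vector, num_boxes=3, num_classes=80):
    """Split the output vector of one grid cell into per-box predictions.

    Assumed layout per box: x, y, width, height, objectness, class scores.
    """
    per_box = 5 + num_classes
    if len(vector) != num_boxes * per_box:
        raise ValueError("vector length does not match num_boxes * (5 + num_classes)")
    predictions = []
    for b in range(num_boxes):
        chunk = vector[b * per_box:(b + 1) * per_box]
        x, y, w, h, objectness = chunk[:5]
        class_probs = chunk[5:]
        best = max(range(num_classes), key=lambda c: class_probs[c])
        predictions.append({
            "box": (x, y, w, h),
            "objectness": objectness,
            "class_id": best,
            # Confidence that the box holds an object of the best class.
            "score": objectness * class_probs[best],
        })
    return predictions


# Toy example: one box and three classes instead of 3 boxes and 80 classes.
toy_vector = [0.5, 0.5, 0.2, 0.3, 0.9, 0.1, 0.8, 0.1]
print(split_cell_prediction(toy_vector, num_boxes=1, num_classes=3))
```

In a full detector, low-scoring boxes would then be discarded and overlapping ones merged before the final detections are reported.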

The authors of YOLOv4 used further techniques to improve accuracy during training and afterward. These are the bag of freebies and the bag of specials.

The bag of freebies helps during training without increasing inference time. It has 2 groups of techniques. The first is the bag of freebies for the backbone, which uses CutMix and Mosaic for data augmentation and DropBlock for regularization. The second is the bag of freebies for the detector, which adds further techniques such as self-adversarial training, random training shapes, and others.

The bag of specials changes the architecture and increases the inference time slightly. It also has 2 groups of techniques. The first is the bag of specials for the backbone, which uses the Mish activation and cross-stage partial connections. The second is the bag of specials for the detector, which uses the SPP block, the SAM block, and others.
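As one concrete example from the bag of specials, the Mish activation is defined as x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x). A minimal sketch for a single value follows; the function names are just illustrative.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-1.0, 0.0, 1.0):
    print(value, round(mish(value), 4))
```

Unlike ReLU, Mish is smooth and lets small negative values through instead of cutting them to zero.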

There is a lot more to the bag of freebies and the bag of specials, but the key takeaways are simple. Both help greatly during and after training. The bag of freebies uses techniques such as data augmentation and DropBlock regularization, while the bag of specials involves the neck, non-maximum suppression, and the like. The image below shows the architecture of the main blocks of YOLOv4.


YOLOv3 and YOLOv4 are both excellent at object detection, but as pointed out earlier, there are several other object detection algorithms. Below are the results obtained on the COCO dataset using YOLOv3, YOLOv4, and some other detection algorithms.

The y-axis is the average precision (AP) and the x-axis is the number of frames per second (FPS). The blue shaded part of the graph is the real-time region (webcams, street cameras, etc.), and the white part covers slower detectors suited to still images (pictures). It can be seen that YOLOv4 does very well in real-time detection, achieving an average precision between 38 and 44 at between 60 and 120 frames per second. YOLOv3 achieves an average precision between 31 and 33 at between 71 and 120 frames per second.


We have looked at object detection in general, the YOLO algorithm, and specifically the YOLOv3 and YOLOv4 algorithms: their architectures and the results they achieve in object detection. Both algorithms are excellent in their own right. Even with the introduction of YOLOv4, YOLOv3 still performs relatively well and should still be considered an option when selecting an object detection algorithm for projects or industrial settings.
