Diving into Object Detection Basics

A guide to the basic concepts of Object Detection: what Object Detection is and how it works, the concept of anchor boxes, why a loss function is necessary, some free datasets, and finally an implementation of SSD.
Tejas Khare
Apr 29, 2021
Article

The prospects of Artificial Intelligence (AI) are not limited to predicting whether a person will get a loan from their credit history, annual income, annual expenses, criminal record, and so on. Computer Vision is a trending topic for AI enthusiasts of any experience level.

Let me give you a brief idea about Computer Vision. It is simply a machine identifying an 'object' in an image or video after learning from the data that was fed to it. For example, when we see an object for the first time, we usually don't know what it is called. But once we learn its name, the next time we see it we know exactly what it is. Computer Vision, and Object Detection in particular, works exactly like our brain does here.

Visualization of Computer Vision

Contents of this article -

  1. Introduction to Object Detection
  2. Anchor Boxes
  3. Loss Function
  4. Datasets to start with
  5. Implementing your first Object Detection Model
  6. Results
  7. Conclusion

1. Introduction to Object Detection

Object detection is locating and identifying an object in an image or in a video. Locating an object means giving the exact position where the object resides in the frame (here a frame can be a single image or one image in a sequence of frames, i.e., a video). To locate an object, we can use a bounding box or any other geometric shape, such as a circle. The easiest and most standard approach is the bounding box, where we first obtain the center coordinates (x, y) and the width (w) and height (h) of the box.
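To make the representation concrete, here is a minimal sketch (my own illustration, not code from any particular library) of a center-format box and its conversion to the corner format that many tools expect:

def center_to_corners(x, y, w, h):
    # Convert a (center x, center y, width, height) box
    # to (x_min, y_min, x_max, y_max) corner coordinates
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

print(center_to_corners(50, 40, 20, 10))  # (40.0, 35.0, 60.0, 45.0)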

To identify an object, the network must be trained on data, for example, images of a person. This step is called classification, and it is essential for the bounding box to be formed correctly. To train the network correctly, make sure the data itself is correct.

What I mean by this is that the labels of an image should be precise. Labels are basically the coordinates of an object in an image, and they are passed to the network with the corresponding image for training. There are many open-source datasets available where you can get the images and the labels. You can also create your own custom dataset by using labeling tools like LabelImg or the VGG Image Annotator.
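As one example of what such labels can look like, here is the popular YOLO text format (the numbers are made up for illustration): each line holds a class id followed by the box's center coordinates, width, and height, all normalized to the [0, 1] range relative to the image size.

# <class_id> <x_center> <y_center> <width> <height>
0 0.512 0.431 0.210 0.380
2 0.145 0.662 0.090 0.120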

You must be wondering: what if I have multiple objects in a frame, can I detect every object there? It turns out that yes, you can detect multiple objects, provided the classes of those objects were used for training your network.

Detected objects in an image

In the above image, two objects are detected and their corresponding coordinates have been obtained using the SSD (Single Shot MultiBox Detector) object detection algorithm.

2. Anchor Boxes

How does the network predict or identify the box?

The network first makes a random guess of the coordinates, assigning a value w for the width and h for the height and (0, 0) for the center of the box (x, y). Of course, this is not the actual prediction. So after every step of training, which is termed an iteration, the network performs regression to get better estimates.

The image/frame is divided into a grid (as shown in the above image) that splits the entire frame into multiple regions. Boxes placed over these regions are called anchor boxes. As the network iterates, it measures the error between the predicted coordinates and the ground-truth coordinates stored in the labels. Hence, after each step, it reduces the difference between the actual and predicted coordinates.
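To make the idea concrete, here is a simplified sketch of anchor-box generation (illustrative only - SSD's actual scheme uses several feature maps, scales, and aspect ratios): one box per aspect ratio is placed at the center of every grid cell.

import numpy as np

def make_anchors(grid_size=4, scale=0.2, ratios=(1.0, 2.0, 0.5)):
    anchors = []
    step = 1.0 / grid_size
    for i in range(grid_size):
        for j in range(grid_size):
            cx, cy = (j + 0.5) * step, (i + 0.5) * step  # center of the cell
            for r in ratios:
                w, h = scale * np.sqrt(r), scale / np.sqrt(r)  # same area, different shape
                anchors.append([cx, cy, w, h])
    return np.array(anchors)

anchors = make_anchors()
print(anchors.shape)  # (48, 4): 4x4 grid x 3 aspect ratios, each as (x, y, w, h)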

3. Loss Function

For the network to perform up to your expectations, you will have to define a loss function. The loss function plays a very important role in the network's training: the right loss function leads to the best predictions from your network. For this article, we will have a look at the loss function of the SSD algorithm.

Loss = (1/N) * (P + c * L)

In SSD, you can set a minimum threshold for matches of the predicted class. If the match probability is greater than the threshold, the match is positive and it is used for reducing the overall loss. So 'N' is the number of positive matches for that iteration, 'L' is the localization loss, i.e., how far the predicted bounding box is from the ground-truth box, 'P' is the class prediction loss, and finally, 'c' is the weight for the localization loss. The network's weights, which are initialized randomly before training, are updated after every iteration to minimize this loss.
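Here is a toy PyTorch version of this loss (the function and variable names are my own, and real SSD training adds details such as hard negative mining): cross-entropy serves as the class prediction loss P and smooth L1 as the localization loss L, averaged over the N positive matches.

import torch
import torch.nn.functional as F

def ssd_style_loss(pred_classes, pred_boxes, true_classes, true_boxes, c=1.0):
    positive = true_classes > 0              # anchors matched to a real object (0 = background)
    n = positive.sum().clamp(min=1).float()  # N: number of positive matches
    p = F.cross_entropy(pred_classes, true_classes, reduction='sum')  # P: class prediction loss
    l = F.smooth_l1_loss(pred_boxes[positive], true_boxes[positive], reduction='sum')  # L: localization loss
    return (p + c * l) / n

# Example with 8 anchors and 3 classes (class 0 = background)
pred_classes, pred_boxes = torch.randn(8, 3), torch.randn(8, 4)
true_classes = torch.tensor([0, 1, 0, 2, 0, 0, 1, 0])
true_boxes = torch.rand(8, 4)
print(ssd_style_loss(pred_classes, pred_boxes, true_classes, true_boxes))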

4. Datasets to start with

There are many datasets you can use to train your first object detection model. These datasets are open source, meaning anyone is free to use them, and they offer a large collection of object classes to choose from. So have fun exploring these datasets -

  1. COCO Dataset
  2. ImageNet
  3. Open Image Dataset V6
  4. Labelme
  5. CelebFaces
  6. 50 other datasets

5. Implementing your first Object Detection Model

By this time you must be excited to implement your first ever object detector. It is highly recommended that you first implement a pre-trained model. The reason is pretty simple: implementing a pre-trained model will give you the gist of what is happening in the code. This will help you understand the workflow better and will come in handy when you train your own custom object detector.

For implementing your object detector, I recommend getting familiar with libraries like OpenCV, NumPy, TensorFlow, PyTorch, Matplotlib, and scikit-learn.

You can implement your detector using PyTorch -

Note: This is the combined code for SSD detection with a simple explanation

import torch
from matplotlib import pyplot as plt
import matplotlib.patches as patches

precision = 'fp32'  # the NVIDIA hub model also supports 'fp16'
# Initializing the pretrained SSD model and its helper utilities
ssd_model = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd', model_math=precision)
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd_processing_utils')

# Using GPU acceleration for processing faster
ssd_model.to('cuda')
# Put the model in evaluation (inference) mode - this switches layers such as
# batch normalization and dropout to their testing behavior, which is required for testing
ssd_model.eval()
#Get sample images
uris = [
    'http://images.cocodataset.org/val2017/000000397133.jpg',
    'http://images.cocodataset.org/val2017/000000037777.jpg',
    'http://images.cocodataset.org/val2017/000000252219.jpg'
]
# Preprocessing the images into the same format as the model's input
inputs = [utils.prepare_input(uri) for uri in uris]
tensor = utils.prepare_tensor(inputs, precision == 'fp16')

# Running the forward pass
with torch.no_grad():
    detections_batch = ssd_model(tensor)

# Decode the raw network output into bounding boxes, class ids, and confidences
results_per_input = utils.decode_results(detections_batch)
# Filtering out predictions with probability smaller than 0.4
best_results_per_input = [utils.pick_best(results, 0.40) for results in results_per_input]

# Getting the labels of the COCO dataset to be matched with the predicted class
classes_to_labels = utils.get_coco_object_dictionary()

# Showing the bounding boxes and their classes in the input image
for image_idx in range(len(best_results_per_input)):
    fig, ax = plt.subplots(1)
    # Show original, denormalized image...
    image = inputs[image_idx] / 2 + 0.5
    ax.imshow(image)
    # ...with detections
    bboxes, classes, confidences = best_results_per_input[image_idx]
    for idx in range(len(bboxes)):
        left, bot, right, top = bboxes[idx]
        x, y, w, h = [val * 300 for val in [left, bot, right - left, top - bot]]
        rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        ax.text(x, y, "{} {:.0f}%".format(classes_to_labels[classes[idx] - 1], confidences[idx]*100), bbox=dict(facecolor='white', alpha=0.5))
plt.show()

6. Results

You will get something like this as your output with the bounding boxes on the objects -

Note: These images are cropped from the original output to fit into the article

Output images with bounding boxes

7. Conclusion

There it is. I hope you got your first object detector working and are ready to build your own. I hope this article helped you understand the basic concepts of object detection.

Thank you for reading the article

Tags: object-detection, computer-vision, ssd