Filmora - AI Video Editor
Edit Faster, Smarter and Easier!
Filmora Video Editor
Effortlessly create video with AI.
  • Various AI editing tools to increase your video creation efficiency.
  • Offer popular templates and royalty-free creative resources.
  • Cross-platform functionality for editing everywhere.

Multi Object Tracking: The Ultimate Guide

Liza Brown
Liza Brown Originally published Sep 05, 22, updated Mar 27, 24

Multi Object Tracking (MOT) in a video is a challenging process with many applications in both the public and private sectors. Surveillance cameras in public places can track potential criminals, while retail stores can use object tracking to monitor customer behavior.

Developed in 1988 by Zenon Pylyshyn, MOT is a technique first designed to study the human visual system's ability to track multiple moving objects. However, since then, various methods have been introduced for Object Tracking through computer vision.

In this article, we will explore Multi Object Tracking and provide a detailed guide on object tracking and the requirement to track multiple objects.

In this article
  1. What Is Object Tracking
    1. SOT
    2. MOT
    1. Detection
    2. Prediction
    3. Data Association
    1. OpenCV-Based Object Tracking
    2. MDNet
    3. DeepSort
    4. ROLO

Part 1. What Is Object Tracking

Object Tracking is an application of computer vision that involves tracking the movement of objects in real-time. It is a useful tool for many different purposes, such as video surveillance, human-computer interaction, and automotive safety.

The Object Tracking algorithm is a deep-learning-based program that works by developing a model for each individual object and creating a set of trajectories to represent their movement. This is done through an indication, such as a square that follows the object and tells the users about its location on the screen in real-time.

Its algorithms are designed to work with various types of inputs, including everything from images and videos to real-time footage. The input you expect to use will impact the category, use cases, and object tracking applications.

object tracking

Part 2. Types Of Object Tracking

There are two main types or levels of Object Tracking: SOT and MOT


Single Object Tracking or Visual Object Tracking is a process in which the bounding box of the target object is assigned to the tracker in the first frame. The tracker then detects the same object in all the other frames.

SOT only detects and tracks a single object and comes under the category of detection-free tracking, which implies that it is manually initialized with a fixed number of objects, even though other objects are present in the frames.

Let's understand it with an example: A police department is resolving a murder case that involves a car on the highway. They received surveillance camera footage and wanted to track the vehicle to resolve the mystery. However, it might take time to do it manually. Therefore, they will use the Single Object Tracking process and will assign the tracker a bounding box for the target car to check what happens to it.

single object tracking


Multiple Object Tracking involves tracking multiple objects in a frame. Since its development in 1988 by Zenon Pylyshyn, several experiments have been conducted to see how human and computer vision systems can detect and track multiple objects in a frame.

As an output, multiple tracking creates several bounding boxes and are identified using certain parameters such as coordinates, width, height, etc. MOT program is not pre-trained regarding the appearance or amount of objects to be tracked.

Moreover, the algorithm assigns a detection ID to each box which helps the model in identifying the objects within a class. For instance, if multiple cars are in a frame, the MOT algorithm will identify each car as a separate object and assign them a unique ID.

multiple object tracking

Part 3. What Multi Object Tracking Needs?

Above is the explaination of MOT. In this part, we will focus on its mechanism. Following are some of the most important requirements of Multi Object Tracking:

1.     Detection

The best approach to detect objects of your interest depends on what you're trying to track and if the camera is stationary or moving.

MOT Using Stationary Camera

The vision.ForegroundDetector System object can be used to detect objects in motion against a stationary background by performing background subtraction. This approach is efficient but requires that the camera be stationary.

MOT Using Moving Camera

A sliding-window detection approach is often used with a moving camera to detect objects in motion. However, this approach is slower than the background subtraction method.

Use the following approaches for tracing the given categories of objects.

Type of the Tracking Object Camera Position Approach
Custom object category Stationary/Moving custom sliding window detector using selectStrongestBbox and extractHOGFeatures or trainCascadeObjectDetector function
Pedestrians Stationary/Moving vision.PeopleDetector System object
Moving object Stationary vision.ForegroundDetector System object™
Faces, upper body, mouth, nose, eyes, etc. Stationary/Moving vision.CascadeObjectDetector System object

2.     Prediction

The second requirement for Multi Object Tracking is "Prediction." In this, you have to predict the position of the tracking object in the next frame. To do this, you can design the model to use the Kalman filter (vision.KalmanFilter).

This will help predict the next location of the object in the frames. For this, it will take into account the object's constant velocity, constant acceleration measurement noise, and process noise. Measurement noise is the detection of an error, while process noise is the variation in the object's actual motion from that of the motion model.

3.     Data Association

Data association is a critical step in Multiple Object Tracking and involves linking the data points together that represent the same thing across different frames.

A "track" is the temporal history of an object consisting of multiple detections and can include the entire history of past locations of the object or simply the object's last known location and current velocity.

Part 4. Approaches Of Object Tracking

After understanding what MOT needs, let’s learn about the theory of how Object Tracking works.

The following are the most popular approaches for Object Tracking:

1.     OpenCV-Based Object Tracking

There are many ways to approach object tracking, but one of the most popular is through the use of built-in algorithms in the OpenCV library.

The library has a tracking API containing Object Tracking algorithms and eight trackers: BOOSTING, MEDIANFLOW, MIL, KCF, CSRT, TLD, GOTURN, and MOSSE. Each tracker has its own advantages and disadvantages and has different goals. For instance, the MOSSE tracker is best for the fastest object tracking.

To have a deeper review of OpenCV Object Tracking and what is OpenCV, please read our article about: OpenCV Tracking: A complete Guide in 2022.(同期交付,可以插这个文章主题的内链)

2.     MDNet

MDNet is a breakthrough in the field of tracking because it is the first network to use classification-based models instead of the more traditional approach. This makes MDNet much faster and more accurate than other tracking methods.

Inspired by the R-CNN object detection network, the MDNet algorithm can detect objects in real-time more efficiently and with high speed, making it a state-of-the-art visual tracker.

mdnet for visual tracking

3.     DeepSort

DeepSort is the most popular Object Tracking algorithm choice. The integration of appearance information or deep appearance distance metrics vastly improves DeepSORT performance.

The addition of the "Deep Appearance" distance metric enables DeepSort to avoid identifying switches by 45% and handle complex scenarios. On the MOT17 dataset, DeepSORT has received 77.2 IDF1 and 75.4 MOTA with 239 ID switches but a lower FPS of 13.

4.     ROLO

ROLO - a combination of YOLO and LSTM is a Spatio-temporal convolutional neural network that uses the YOLO module and LTSM network for collecting visual features, location inference priors, and locating the target object's trajectory.

The LSTM network uses an input feature vector of length 4096 for each frame to predict the target object's location. This vector is obtained by combining the high-level visual features with the YOLO detection. By working together, the LSTM and YOLO can predict the target object's location more accurately.

ALT TEXT: rolo for object tracking

rolo for object tracking


In this ultimate guide, we've discussed Multi Object Tracking and its r­­equirements. We also explored different approaches for object tracking to help you determine which one is best for your needs.

Hopefully, you found this guide helpful, and your queries related to Object Tracking and its types have been resolved.

Free Download
Free Download
Liza Brown
Liza Brown Mar 27, 24
Share article: