TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT

PDF version will be published soon

Track black helicopters

Introduction

This work focuses on generic multiple object tracking (GMOT). We first introduce the Refer-GMOT dataset and then present the TP-GMOT framework, which has two key innovations: (i) CS-OD, an object detection approach that identifies novel generic objects from textual descriptions; and (ii) MAC-SORT, an object association method that balances motion and appearance cues when tracking generic objects.

Pipeline Overview

Methodology

CS-OD: To address the limitations of pre-trained vision-language (VL) models, CS-OD comprises two modules:

Module 1: Include-Exclude (IE) Strategy: The IE strategy refines object detection using a pre-trained Vision-Language Model (VLM). It first parses each caption defined in our newly introduced Refer-GMOT dataset into three prompts: general, include, and exclude. The VLM then processes these prompts to produce bounding boxes and classify them according to these criteria.


Examples illustrating the efficacy of the IE strategy. Left: output from the pre-trained VLM. Right: output from the IE strategy.
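The include/exclude filtering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the VLM has already returned one list of boxes per prompt, and that a general-prompt box is kept when it overlaps an include box and no exclude box. The function names (`iou`, `ie_filter`), the IoU threshold, and the sample boxes are all hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ie_filter(general, include, exclude, thr=0.5):
    """Keep general-prompt boxes matched by an include box and by no exclude box.

    Assumption: "matched" means IoU >= thr. Pass include=None when the
    caption has no include attribute, so only the exclude check applies.
    """
    kept = []
    for box in general:
        in_include = include is None or any(iou(box, b) >= thr for b in include)
        in_exclude = any(iou(box, b) >= thr for b in exclude)
        if in_include and not in_exclude:
            kept.append(box)
    return kept


# Hypothetical detections for three prompts derived from one caption.
general = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10)]  # general prompt
include = [(0, 0, 10, 10), (20, 0, 30, 10)]                   # include prompt
exclude = [(20, 0, 30, 10)]                                   # exclude prompt
kept = ie_filter(general, include, exclude)
```

In this toy input, only the first box survives: the second is also detected by the exclude prompt, and the third is never confirmed by the include prompt.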



Module 2: Long-Short Memory (LSM) Mechanism: To mitigate false positives (FPs) arising from challenges such as pose, illumination, and occlusion, we propose an LSM mechanism.



The LSM mechanism acts as a secondary filter: it recovers true positives that the IE strategy may have incorrectly dismissed by cross-referencing them with the highest-confidence detections stored in the memory bank.



Comparison with and without the LSM mechanism, illustrating its effectiveness.
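One way to picture the memory-bank cross-referencing is the sketch below. It is an illustration under stated assumptions, not the published mechanism: the long memory is assumed to hold the embeddings of the highest-confidence detections seen so far, the short memory the most recent ones, and a dismissed candidate is recovered when its cosine similarity to any stored embedding passes a threshold. `MemoryBank`, `lsm_filter`, the memory sizes, and the threshold are hypothetical names and values.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


class MemoryBank:
    """Hypothetical long-short memory of appearance embeddings."""

    def __init__(self, long_size=3, short_size=2):
        self.long_size = long_size
        self.long = []                         # (confidence, embedding), best so far
        self.short = deque(maxlen=short_size)  # most recently added embeddings

    def update(self, confidence, embedding):
        self.short.append(embedding)
        self.long.append((confidence, embedding))
        # Keep only the long_size highest-confidence entries.
        self.long.sort(key=lambda item: item[0], reverse=True)
        del self.long[self.long_size:]

    def similarity(self, embedding):
        refs = [e for _, e in self.long] + list(self.short)
        return max((cosine(embedding, r) for r in refs), default=0.0)


def lsm_filter(candidates, bank, thr=0.8):
    """Return the candidates whose embedding resembles a remembered detection."""
    return [c for c in candidates if bank.similarity(c["embedding"]) >= thr]


bank = MemoryBank()
bank.update(0.9, [1.0, 0.0])
bank.update(0.8, [0.9, 0.1])
dismissed = [
    {"id": "a", "embedding": [1.0, 0.05]},  # close to the remembered target
    {"id": "b", "embedding": [0.0, 1.0]},   # unlike anything in memory
]
recovered = lsm_filter(dismissed, bank)
```

Here candidate "a" is recovered and "b" stays dismissed. A real system would take embeddings from a visual encoder rather than hand-written vectors.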


Both the IE strategy and the LSM mechanism play crucial roles in accurate object detection within the CS-OD framework, as illustrated in the figures above. Each serves as a filter that increases true positives (TPs) and reduces false positives (FPs).


MAC-SORT:

The core idea of MAC-SORT is to balance the appearance cost against the motion cost (IoU, Intersection over Union). This balance is crucial in GMOT, where objects often look very similar.

Unlike traditional methods such as SORT, which rely on motion alone and struggle with occlusions or object disappearances, MAC-SORT also accounts for camera movement and detection confidence to improve accuracy.

The innovation in MAC-SORT lies in dynamically adjusting the weight given to visual appearance relative to motion cues. By balancing the two, MAC-SORT tracks objects accurately in complex environments where appearance alone is insufficient, yielding more robust tracking of highly similar objects.
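A rough sketch of such a motion-appearance cost is given below. It makes several assumptions that are not taken from the source: the combined cost is a convex mix of (1 - IoU) and cosine distance; the appearance weight is set by a hypothetical heuristic (the mean gap between each detection's best and second-best track similarity, so it shrinks toward zero when all objects look alike); and a simple greedy matcher stands in for the Hungarian assignment used in SORT-style trackers. Camera-motion compensation and detection confidence are omitted, and all function names and thresholds are illustrative.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def appearance_weight(sim, w_max=0.5):
    """Hypothetical heuristic: trust appearance less when all tracks look alike."""
    gaps = []
    for row in sim:  # one row of track similarities per detection
        top = sorted(row, reverse=True)
        gaps.append(top[0] - top[1] if len(top) > 1 else 1.0)
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return w_max * max(0.0, min(1.0, mean_gap))


def mac_cost(tracks, dets):
    """Cost matrix (rows: detections, columns: tracks) and the weight used."""
    sim = [[cosine(d["emb"], t["emb"]) for t in tracks] for d in dets]
    w = appearance_weight(sim)
    cost = [
        [(1 - w) * (1 - iou(d["box"], t["box"])) + w * (1 - sim[i][j])
         for j, t in enumerate(tracks)]
        for i, d in enumerate(dets)
    ]
    return cost, w


def greedy_match(cost, max_cost=0.9):
    """Greedy stand-in for Hungarian matching: cheapest pairs first."""
    pairs = sorted((c, i, j) for i, row in enumerate(cost) for j, c in enumerate(row))
    used_d, used_t, matches = set(), set(), []
    for c, i, j in pairs:
        if c <= max_cost and i not in used_d and j not in used_t:
            matches.append((i, j))
            used_d.add(i)
            used_t.add(j)
    return sorted(matches)


tracks = [{"box": (0, 0, 10, 10), "emb": [1.0, 0.0]},
          {"box": (20, 0, 30, 10), "emb": [0.0, 1.0]}]
dets = [{"box": (21, 0, 31, 10), "emb": [0.0, 1.0]},
        {"box": (1, 0, 11, 10), "emb": [1.0, 0.0]}]
cost, w = mac_cost(tracks, dets)
matches = greedy_match(cost)  # list of (detection index, track index)
```

With clearly distinct embeddings, as in this toy input, the heuristic gives appearance its maximum weight. If every embedding were identical, the weight would drop to zero and the cost would reduce to pure IoU, which is the behaviour the text motivates for highly similar objects.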