Qualitative Results

We conduct extensive experiments to empirically evaluate the proposed TP-GMOT on the GMOT problem, covering both detection with CS-OD and association with MAC-SORT. Our strategy helps bridge the gap between human intention and computer understanding, providing the flexibility to track objects with distinctive characteristics specified by the input text.

"Track red color car while excluding yellow, blue, black, white color car"

Quantitative Results

We present a detailed comparison in the six tables below. The first two tables highlight the differences between our TP-GMOT approach and existing one-shot GMOT methods on the Refer-GMOT dataset. In the subsequent four tables, we demonstrate the effectiveness and generalizability of TP-GMOT by comparing it with state-of-the-art fully-supervised MOT methods on the Refer-GMOT40, Refer-Animal, DanceTrack, and MOT20 datasets. We also conduct an ablation study on the parameter θ (theta), which measures the similarity between two vectors and plays a crucial role in balancing motion and appearance cues during tracking with MAC-SORT.

GMOT tracking comparison on the Refer-GMOT40 dataset between our CS-OD and the existing SOTA OS-OD, each paired with various SOTA trackers.

| Tracker | Detector | # shot | HOTA | MOTA | IDF1 |
|---|---|---|---|---|---|
| SORT | OS-OD | one-shot | 30.05 | 20.83 | 33.90 |
| SORT | CS-OD (Ours) | zero-shot | 56.43 | 66.72 | 67.51 |
| DeepSORT | OS-OD | one-shot | 27.82 | 17.96 | 30.37 |
| DeepSORT | CS-OD (Ours) | zero-shot | 50.54 | 60.21 | 57.93 |
| ByteTrack | OS-OD | one-shot | 29.88 | 20.29 | 34.70 |
| ByteTrack | CS-OD (Ours) | zero-shot | 55.87 | 64.79 | 69.79 |
| MOTRv2 | OS-OD | one-shot | 23.76 | 13.87 | 25.17 |
| MOTRv2 | CS-OD (Ours) | zero-shot | 32.93 | 18.70 | 33.48 |
| OC-SORT | OS-OD | one-shot | 29.00 | 19.96 | 32.85 |
| OC-SORT | CS-OD (Ours) | zero-shot | 56.06 | 63.69 | 68.85 |
| Deep-OCSORT | OS-OD | one-shot | 30.37 | 21.10 | 34.74 |
| Deep-OCSORT | CS-OD (Ours) | zero-shot | 55.74 | 65.53 | 66.54 |
| Average gain of CS-OD across all trackers | | | +22.78↑ | +37.61↑ | +28.73↑ |

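
As a sanity check on the final row, the average gains can be recomputed from the per-tracker scores listed above. The snippet below is a minimal sketch: the scores are copied from the table, while the dictionary layout and the helper name `average_gains` are illustrative choices and not part of TP-GMOT.

```python
# Per-tracker (HOTA, MOTA, IDF1) scores copied from the table above.
# The data layout and helper name are illustrative, not part of TP-GMOT.
SCORES = {
    "SORT":        {"OS-OD": (30.05, 20.83, 33.90), "CS-OD": (56.43, 66.72, 67.51)},
    "DeepSORT":    {"OS-OD": (27.82, 17.96, 30.37), "CS-OD": (50.54, 60.21, 57.93)},
    "ByteTrack":   {"OS-OD": (29.88, 20.29, 34.70), "CS-OD": (55.87, 64.79, 69.79)},
    "MOTRv2":      {"OS-OD": (23.76, 13.87, 25.17), "CS-OD": (32.93, 18.70, 33.48)},
    "OC-SORT":     {"OS-OD": (29.00, 19.96, 32.85), "CS-OD": (56.06, 63.69, 68.85)},
    "Deep-OCSORT": {"OS-OD": (30.37, 21.10, 34.74), "CS-OD": (55.74, 65.53, 66.54)},
}
METRICS = ("HOTA", "MOTA", "IDF1")


def average_gains(scores):
    """Mean of (CS-OD score - OS-OD score) per metric over all trackers."""
    n = len(scores)
    gains = {}
    for i, metric in enumerate(METRICS):
        total = sum(row["CS-OD"][i] - row["OS-OD"][i] for row in scores.values())
        gains[metric] = total / n
    return gains


if __name__ == "__main__":
    for metric, gain in average_gains(SCORES).items():
        print(f"{metric}: +{gain:.2f}")
```

Up to rounding in the last digit, the recomputed means agree with the reported gains of +22.78 (HOTA), +37.61 (MOTA), and +28.73 (IDF1).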

GMOT tracking comparison on the Refer-GMOT40 dataset between our MAC-SORT and various SOTA trackers.

| Tracker | HOTA | MOTA | IDF1 |
|---|---|---|---|
| SORT | 56.43 | 66.72 | 67.51 |
| DeepSORT | 50.54 | 60.21 | 57.93 |
| ByteTrack | 55.87 | 64.79 | 69.79 |
| OC-SORT | 56.06 | 63.69 | 68.85 |
| Deep-OCSORT | 55.74 | 65.53 | 66.54 |
| MOTRv2 | 32.93 | 18.70 | 33.48 |
| MAC-SORT (Ours) | 58.58 | 67.77 | 71.70 |


GMOT tracking comparison on Refer-Animal between our MAC-SORT and CS-OD and existing fully-supervised MOT methods. These methods use Faster R-CNN (FRCNN) or YOLOX trained on the AnimalTrack-trainset as their object detector. Note that these fully-supervised methods are limited in their ability to handle category-agnostic tracking.

| Tracker | Detector | Category-agnostic | HOTA | MOTA | IDF1 |
|---|---|---|---|---|---|
| SORT | FRCNN | × | 42.80 | 55.60 | 49.20 |
| DeepSORT | FRCNN | × | 32.80 | 41.40 | 35.20 |
| ByteTrack | YOLOX | × | 40.10 | 38.50 | 51.20 |
| TransTrack | YOLOX | × | 45.40 | 48.30 | 53.40 |
| QDTrack | YOLOX | × | 47.00 | 55.70 | 56.30 |
| OC-SORT | YOLOX | × | 56.93 | 65.02 | 67.48 |
| Deep-OCSORT | YOLOX | × | 57.24 | 68.05 | 62.01 |
| MOTRv2 | YOLOX | × | 52.07 | 57.08 | 62.07 |
| MAC-SORT (Ours) | YOLOX | × | 57.86 | 68.32 | 63.01 |
| MAC-SORT (Ours) | CS-OD (Ours) | ✓ | 57.29 | 66.46 | 68.37 |


Ablation study on the generalizability of the TP-GMOT framework on the DanceTrack-valset for the MOT task. We compare our MAC-SORT and CS-OD with other fully-supervised MOT methods, which use YOLOX trained on the DanceTrack-trainset as their object detector. Deep-OCSORT denotes the results reported in the original paper, whereas Deep-OCSORT† denotes the results we reproduced on our machine with the best settings suggested by the authors and the same object detector. Note that these existing fully-supervised methods are limited in their ability to handle category-agnostic tracking.

| Tracker | Detector | Category-agnostic | HOTA | MOTA | IDF1 |
|---|---|---|---|---|---|
| SORT | YOLOX | × | 47.80 | 88.20 | 48.30 |
| DeepSORT | YOLOX | × | 45.80 | 87.10 | 46.80 |
| MOTDT | YOLOX | × | 39.20 | 84.30 | 39.60 |
| ByteTrack | YOLOX | × | 47.10 | 88.20 | 51.90 |
| OC-SORT | YOLOX | × | 52.10 | 87.30 | 51.60 |
| Deep-OCSORT | YOLOX | × | 58.53 | -- | 59.06 |
| Deep-OCSORT† | YOLOX | × | 49.36 | 84.82 | 48.89 |
| MAC-SORT (Ours) | YOLOX | × | 53.78 | 86.85 | 54.06 |
| MAC-SORT (Ours) | CS-OD (Ours) | ✓ | 48.75 | 81.74 | 48.02 |



Ablation study on the effectiveness of MAC-SORT on the MOT20-testset for the MOT task. We compare our MAC-SORT with other SORT-based MOT methods. Because ByteTrack and OC-SORT use different thresholds for the test-set sequences and an offline interpolation procedure, we also report their scores with these disabled, denoted ByteTrack† and OC-SORT†. Because Deep-OCSORT uses separate weights for the YOLOX object detector, we also report scores obtained by retraining YOLOX on the MOT20-trainset, denoted Deep-OCSORT†.

| Tracker | HOTA | MOTA | IDF1 |
|---|---|---|---|
| MeMOT | 54.1 | 63.7 | 66.1 |
| FairMOT | 54.6 | 61.8 | 67.3 |
| GSDT | 53.6 | 67.1 | 67.5 |
| CSTrack | 54.0 | 66.6 | 68.6 |
| ByteTrack | 61.3 | 77.8 | 75.2 |
| OC-SORT | 62.4 | 75.7 | 76.3 |
| Deep-OCSORT | 63.9 | 75.6 | 79.2 |
| ByteTrack† | 60.4 | 74.2 | 74.5 |
| OC-SORT† | 60.5 | 73.1 | 74.4 |
| Deep-OCSORT† | 59.6 | 75.3 | 75.2 |
| MAC-SORT (Ours) | 62.6 | 75.2 | 76.9 |


Ablation study of θ in computing W_amw on the Refer-GMOT40 dataset. Columns report standard MOT metrics and ID metrics.

| θ | HOTA↑ | MOTA↑ | IDF1↑ | IDR↑ |
|---|---|---|---|---|
| 22.5° | 57.54 | 66.82 | 69.29 | 64.16 |
| 45° (default) | 59.26 | 68.03 | 70.86 | 68.39 |
| 67.5° | 58.06 | 67.37 | 70.46 | 65.24 |
| 80° | 58.21 | 65.81 | 70.13 | 68.30 |
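
The text describes θ as measuring the similarity between two vectors, and the ablation expresses it in degrees. This section does not define how θ enters the tracker's weighting, so the sketch below only illustrates the generic computation of the angle between two vectors and a comparison against a threshold θ. The function names, the example vectors, and the way the threshold is used are assumptions for illustration, not MAC-SORT's actual formulation.

```python
import math


def angle_deg(u, v):
    """Angle between two non-zero vectors, in degrees (via cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_sim = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_sim))


def within_theta(u, v, theta_deg=45.0):
    """Hypothetical check: do u and v differ in direction by at most theta?"""
    return angle_deg(u, v) <= theta_deg


if __name__ == "__main__":
    a, b = (1.0, 0.0), (1.0, 0.5)  # made-up example vectors
    print(f"angle: {angle_deg(a, b):.2f} degrees")
    for theta in (22.5, 45.0, 67.5, 80.0):  # values from the ablation table
        print(theta, within_theta(a, b, theta))
```

Under this reading, a smaller θ is a stricter similarity requirement, which is one way to see why the choice of θ can shift the balance between cues; the exact mechanism should be taken from the method description rather than from this sketch.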