TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT

Qualitative Result

We conduct extensive experiments to evaluate the performance of our proposed TP-GMOT on the GMOT problem, covering both detection with CS-OD and association with MAC-SORT. Our strategy helps bridge the gap between human intention and computer understanding, providing the flexibility to track objects whose distinctive characteristics match the input text.

"Track red color car while excluding yellow, blue, black, white color car"

Quantitative Results

We present a detailed comparison through the six tables below. The first two tables compare the components of TP-GMOT with existing approaches on the Refer-GMOT40 dataset: our CS-OD against the existing one-shot detector (OS-OD), and our MAC-SORT against SOTA trackers. The next three tables demonstrate the effectiveness and generalization of TP-GMOT by comparing it with state-of-the-art fully-supervised MOT methods on the Refer-Animal, DanceTrack and MOT20 datasets. The final table is an ablation study on the parameter θ (theta), which measures the similarity between two vectors and plays a crucial role in balancing motion and appearance during tracking with MAC-SORT.

GMOT tracking comparison on the `Refer-GMOT40` dataset between our `CS-OD` and the existing SOTA OS-OD, each paired with various SOTA trackers.
Trackers	Detectors	# shot	HOTA↑	MOTA↑	IDF1↑
SORT	OS-OD	one-shot	30.05	20.83	33.90
SORT	`CS-OD`(Ours)	zero-shot	56.43	66.72	67.51
DeepSORT	OS-OD	one-shot	27.82	17.96	30.37
DeepSORT	`CS-OD`(Ours)	zero-shot	50.54	60.21	57.93
ByteTrack	OS-OD	one-shot	29.88	20.29	34.70
ByteTrack	`CS-OD`(Ours)	zero-shot	55.87	64.79	69.79
MOTRv2	OS-OD	one-shot	23.76	13.87	25.17
MOTRv2	`CS-OD`(Ours)	zero-shot	32.93	18.70	33.48
OC-SORT	OS-OD	one-shot	29.00	19.96	32.85
OC-SORT	`CS-OD`(Ours)	zero-shot	56.06	63.69	68.85
Deep-OCSORT	OS-OD	one-shot	30.37	21.10	34.74
Deep-OCSORT	`CS-OD`(Ours)	zero-shot	55.74	65.53	66.54
Average gains by `CS-OD` across all trackers			+22.78↑	+37.61↑	+28.73↑
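
The "average gains" row can be reproduced from the table itself. The sketch below is only an illustration, not part of the TP-GMOT code: the names `OS_OD`, `CS_OD` and `average_gains` are hypothetical, and the scores are copied from the rows above. It averages the per-tracker difference (CS-OD minus OS-OD) for each metric, using decimal arithmetic so that rounding to two places is exact.

```python
from decimal import Decimal, ROUND_HALF_UP

# (HOTA, MOTA, IDF1) per tracker, copied from the table above.
OS_OD = {
    "SORT": ("30.05", "20.83", "33.90"),
    "DeepSORT": ("27.82", "17.96", "30.37"),
    "ByteTrack": ("29.88", "20.29", "34.70"),
    "MOTRv2": ("23.76", "13.87", "25.17"),
    "OC-SORT": ("29.00", "19.96", "32.85"),
    "Deep-OCSORT": ("30.37", "21.10", "34.74"),
}
CS_OD = {
    "SORT": ("56.43", "66.72", "67.51"),
    "DeepSORT": ("50.54", "60.21", "57.93"),
    "ByteTrack": ("55.87", "64.79", "69.79"),
    "MOTRv2": ("32.93", "18.70", "33.48"),
    "OC-SORT": ("56.06", "63.69", "68.85"),
    "Deep-OCSORT": ("55.74", "65.53", "66.54"),
}


def average_gains(baseline, ours):
    """Mean per-tracker gain (ours - baseline) for each metric, to 2 decimals."""
    sums = [Decimal(0)] * 3
    for tracker, base_scores in baseline.items():
        for i, (base, new) in enumerate(zip(base_scores, ours[tracker])):
            sums[i] += Decimal(new) - Decimal(base)
    count = Decimal(len(baseline))
    cent = Decimal("0.01")
    return tuple((s / count).quantize(cent, rounding=ROUND_HALF_UP) for s in sums)


hota_gain, mota_gain, idf1_gain = average_gains(OS_OD, CS_OD)
print(f"HOTA +{hota_gain}, MOTA +{mota_gain}, IDF1 +{idf1_gain}")
```

The per-tracker differences also show that the gain is smallest for MOTRv2 and of similar size for the other five trackers.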

GMOT tracking comparison on the `Refer-GMOT40` dataset between our **`MAC-SORT`** and various SOTA trackers.
Trackers	HOTA↑	MOTA↑	IDF1↑
SORT	56.43	66.72	67.51
DeepSORT	50.54	60.21	57.93
ByteTrack	55.87	64.79	69.79
OC-SORT	56.06	63.69	68.85
Deep-OCSORT	55.74	65.53	66.54
MOTRv2	32.93	18.70	33.48
`MAC-SORT` (Ours)	58.58	67.77	71.70

GMOT tracking comparison on `Refer-Animal` of our **`MAC-SORT`** and **`CS-OD`** against existing *fully-supervised MOT* methods. These methods use Faster R-CNN (FRCNN) or YOLOX trained on the AnimalTrack-trainset as their object detector. Note that these fully-supervised methods are limited in their ability to handle category-agnostic tracking.
Trackers	Detectors	Category Agnostic	HOTA↑	MOTA↑	IDF1↑
SORT	FRCNN	×	42.80	55.60	49.20
DeepSORT	FRCNN	×	32.80	41.40	35.20
ByteTrack	YOLOX	×	40.10	38.50	51.20
TransTrack	YOLOX	×	45.40	48.30	53.40
QDTrack	YOLOX	×	47.00	55.70	56.30
OC-SORT	YOLOX	×	56.93	65.02	67.48
Deep-OCSORT	YOLOX	×	57.24	68.05	62.01
MOTRv2	YOLOX	×	52.07	57.08	62.07
`MAC-SORT` (Ours)	YOLOX	×	57.86	68.32	63.01
`MAC-SORT` (Ours)	`CS-OD` (Ours)	✔	57.29	66.46	68.37

Ablation study of the generalizability of the **`TP-GMOT`** framework on the *DanceTrack-valset* with the MOT task. We compare our **`MAC-SORT`** and **`CS-OD`** with other *fully-supervised MOT* methods, which use YOLOX trained on the DanceTrack-trainset as their object detector. Deep-OCSORT denotes the results reported in the paper, whereas Deep-OCSORT† denotes the results reproduced on our machine with the best settings suggested by the authors and the same object detector. Note that these existing fully-supervised methods are limited in their ability to handle category-agnostic tracking.
Trackers	Detectors	Category Agnostic	HOTA↑	MOTA↑	IDF1↑
SORT	YOLOX	×	47.80	88.20	48.30
DeepSORT	YOLOX	×	45.80	87.10	46.80
MOTDT	YOLOX	×	39.20	84.30	39.60
ByteTrack	YOLOX	×	47.10	88.20	51.90
OC-SORT	YOLOX	×	52.10	87.30	51.60
Deep-OCSORT	YOLOX	×	58.53	--	59.06
Deep-OCSORT†	YOLOX	×	49.36	84.82	48.89
`MAC-SORT`(Ours)	YOLOX	×	53.78	86.85	54.06
`MAC-SORT`(Ours)	`CS-OD` (Ours)	✔	48.75	81.74	48.02

Ablation study on the effectiveness of MAC-SORT on the MOT20-testset with the MOT task. We compare our MAC-SORT with other SORT-based MOT methods. Because ByteTrack and OC-SORT use different thresholds for the test-set sequences and an offline interpolation procedure, we also report scores with these disabled, as ByteTrack† and OC-SORT†. Because Deep OC-SORT uses separate weights for the YOLOX object detector, we also report scores with YOLOX retrained on the MOT20-trainset, as Deep OC-SORT†.

Trackers	HOTA↑	MOTA↑	IDF1↑
MeMOT	54.1	63.7	66.1
FairMOT	54.6	61.8	67.3
GSDT	53.6	67.1	67.5
CSTrack	54.0	66.6	68.6
ByteTrack	61.3	77.8	75.2
OC-SORT	62.4	75.7	76.3
Deep-OCSORT	63.9	75.6	79.2
ByteTrack†	60.4	74.2	74.5
OC-SORT†	60.5	73.1	74.4
Deep OC-SORT†	59.6	75.3	75.2
`MAC-SORT` (Ours)	62.6	75.2	76.9

Ablation study of θ in computing `W_amw` on `Refer-GMOT40` dataset.
θ	Standard MOT metrics		ID metrics
θ	HOTA↑	MOTA↑	IDF1↑	IDR↑
22.5°	57.54	66.82	69.29	64.16
45° (default)	59.26	68.03	70.86	68.39
67.5°	58.06	67.37	70.46	65.24
80°	58.21	65.81	70.13	68.30
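
Since θ is described as measuring the similarity between two vectors, the sketch below shows one generic way to express that similarity as an angle in degrees and compare it with a threshold such as the 45° default. This is an illustrative assumption only: the helper names `angle_between_deg` and `within_theta` are hypothetical, and this page does not give the actual formula for `W_amw`, so the code should not be read as the MAC-SORT implementation.

```python
import math


def angle_between_deg(u, v):
    """Angle in degrees between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_sim = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_sim))


def within_theta(u, v, theta_deg=45.0):
    """True when the two vectors differ by at most theta_deg degrees."""
    return angle_between_deg(u, v) <= theta_deg


# Hypothetical example vectors, not taken from the paper.
print(angle_between_deg((1.0, 0.0), (0.0, 1.0)))  # orthogonal vectors
print(within_theta((1.0, 0.0), (1.0, 0.5)))       # roughly 26.6 degrees apart
```

Under this reading, a smaller θ demands closer agreement between the two vectors before they count as similar, while a larger θ is more permissive.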
