TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT

Dataset

Comparison of **existing datasets** of SOT, MOT, GSOT, GMOT. "#" represents the quantity of the respective items. Cat., Vid., NLP denote Categories, Videos, and Textual Language Descriptions.
	Datasets	NLP	#Cat.	#Vid.	#Frames	#Tracks	#Boxes
SOT	OTB2013 [1]	✖	10	51	29K	51	29K
	VOT2017 [2]	✖	24	60	21K	60	21K
	TrackingNet [3]	✖	21	31K	14M	31K	14M
	LaSOT [4]	✔	70	1.4K	3.52M	1.4K	3.52M
	TNL2K [5]	✔	-	2K	1.24M	2K	1.24M
GSOT	GOT-10 [6]	✖	563	10K	1.5M	10K	1.5M
	Fish [7]	✖	1	1.6K	527.2K	8.25K	516K
MOT	MOT17 [8]	✖	1	14	11.2K	1.3K	0.3M
	MOT20 [9]	✖	1	8	13.41K	3.45K	1.65M
	Omni-MOT [10]	✖	1	-	14M+	250K	110M
	DanceTrack [11]	✖	1	100	105K	990	-
	TAO [12]	✖	833	2.9K	2.6M	17.2K	333K
	SportMOT [13]	✖	1	240	150K	3.4K	1.62M
	Refer-KITTI [14]	✔	2	18	6.65K	637	28.72K
GMOT	AnimalTrack [15]	✖	10	58	24.7K	1.92K	429K
	GMOT-40 [16]	✖	10	40	9K	2.02K	256K
	Refer-GMOT40 (Ours)	✔	10	40	9K	2.02K	256K
	Refer-Animal (Ours)	✔	10	58	24.7K	1.92K	429K

We extend two existing generic multiple object tracking (GMOT) datasets, GMOT-40 and AnimalTrack, with textual descriptions. The extended datasets are named 'Refer-GMOT40' and 'Refer-Animal'.

'Refer-GMOT40' includes 40 videos covering 10 different types of real-world objects, with each type having 4 video sequences. 'Refer-Animal' contains 26 videos focusing on 10 common types of animals.

Each video in these datasets has been carefully annotated with several details:

For text label:

class_name: The general category of objects being tracked.
class_synonyms: Other names or terms for the class.
definition: A description of the objects being tracked.
include_attributes: Characteristics of the tracked objects based on what can be seen.
exclude_attributes: Characteristics that identify objects within the same category that are not being tracked.
caption: Descriptions of the objects being tracked. For tracking all objects in a class, the caption is in the format: "Track [visible attributes] [class name]". When tracking a specific subset, the format is: "Track [visible attributes] [class name] while excluding [untracked attributes] [class name]".
track_path: Path to the file that holds the track labels. Tracks are stored separately from the text label, in the standard format used by multiple object tracking challenges.
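
The two caption templates described above can be sketched in Python as follows. This is only an illustration: the function name `build_caption` and the choice to join attributes with single spaces are assumptions, and the released captions are written by annotators, so they may differ in wording (for example by pluralizing the class name).

```python
def build_caption(class_name, include_attributes, exclude_attributes=()):
    """Compose a caption from the documented templates (hypothetical helper)."""
    # "Track [visible attributes] [class name]"
    tracked = " ".join([*include_attributes, class_name])
    caption = f"Track {tracked}"
    if exclude_attributes:
        # "... while excluding [untracked attributes] [class name]"
        untracked = " ".join([*exclude_attributes, class_name])
        caption += f" while excluding {untracked}"
    return caption


all_objects = build_caption("helicopter", ["black", "flying"])
subset = build_caption("car", ["white headlight"], ["red taillight"])
```

The first call covers the case where every object of the class is tracked; the second covers the case where a subset is tracked and the remaining objects of the same class are excluded.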

For track label:

Each line contains 9 comma-separated elements:

frame: index of the frame in the video sequence
id: ID assigned to the tracked object
bb_left: x coordinate of the top-left corner of the bounding box
bb_top: y coordinate of the top-left corner of the bounding box
bb_width: width of the bounding box
bb_height: height of the bounding box
conf: confidence score, set to 1 by default
x: set to 1 by default
y: set to 1 by default
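
As a minimal sketch of how one such line could be read, the Python snippet below splits a line on commas and maps the nine values to named fields. The names `TrackRow` and `parse_track_line` are hypothetical, and the numeric values in the sample line are made up for illustration; they are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class TrackRow:
    """One line of a track label file, using the nine field names listed above."""
    frame: int
    id: int
    bb_left: float
    bb_top: float
    bb_width: float
    bb_height: float
    conf: float
    x: float
    y: float


def parse_track_line(line: str) -> TrackRow:
    """Parse one comma-separated track line (hypothetical helper)."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 9:
        raise ValueError(f"expected 9 comma-separated values, got {len(parts)}")
    # Frame index and object ID are integers; go through float so "1.0" also parses.
    frame, obj_id = int(float(parts[0])), int(float(parts[1]))
    left, top, width, height, conf, x, y = map(float, parts[2:])
    return TrackRow(frame, obj_id, left, top, width, height, conf, x, y)


# Illustrative line: frame 1, object 2, box at (100, 50) with size 40 x 80.
row = parse_track_line("1, 2, 100, 50, 40, 80, 1, 1, 1")
```

Rejecting lines that do not have exactly nine values is a design choice made here so that malformed files fail early rather than producing misaligned fields.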

The annotations are formatted in JSON, and we provide examples to illustrate how they are structured. This data, prepared by 4 annotators, will be shared publicly.

Text label for referring with specific attributes
{
    "video": "",
    "label": {
        "class_name": "",
        "class_synonyms": [],
        "definition": "",
        "include_attributes": [],
        "exclude_attributes": [],
        "caption": "",
        "track_path": ""
    }
}

Track label for associating objects' IDs through time
1, 1, xl, yt, w, h, 1, 1, 1
1, 2, xl, yt, w, h, 1, 1, 1
1, 3, xl, yt, w, h, 1, 1, 1
2, 1, xl, yt, w, h, 1, 1, 1
2, 2, xl, yt, w, h, 1, 1, 1
2, 3, xl, yt, w, h, 1, 1, 1
3, 1, xl, yt, w, h, 1, 1, 1
3, 2, xl, yt, w, h, 1, 1, 1
3, 3, xl, yt, w, h, 1, 1, 1

{
    "video": "airplane-1",
    "label": {
        "class_name": "helicopter",
        "class_synonyms": ["airplane", "aircraft", "jet", "plane"],
        "definition": "a vehicle designed for flight in the air",
        "include_attributes": ["black", "flying"],
        "exclude_attributes": [],
        "caption": "Track all black flying helicopters",
        "track_path": "airplane_01.txt"
    }
}

{
    "video": "car-1",
    "label": {
        "class_name": "car",
        "class_synonyms": ["vehicle", "automobile", "auto", "transport", "transportation"],
        "definition": "mechanical device designed for transportation, powered by an engine or motor, equipped by four wheels",
        "include_attributes": ["white headlight", "oncoming traffic"],
        "exclude_attributes": ["red taillight", "opposite traffic"],
        "caption": "Track white headlight cars while excluding red taillight cars",
        "track_path": "car_01.txt"
    }
}
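
A text label like the car example above could be loaded and checked in Python roughly as follows. The sketch assumes the released files are strict JSON (double-quoted keys, no trailing commas); `REQUIRED_FIELDS` and `load_text_label` are hypothetical names introduced here, not part of any released tooling.

```python
import json

# Field names documented in the "For text label" list.
REQUIRED_FIELDS = (
    "class_name", "class_synonyms", "definition",
    "include_attributes", "exclude_attributes", "caption", "track_path",
)


def load_text_label(raw: str) -> dict:
    """Parse one text-label record and check that every documented field is present."""
    record = json.loads(raw)
    label = record["label"]
    missing = [name for name in REQUIRED_FIELDS if name not in label]
    if missing:
        raise KeyError(f"label for {record.get('video')!r} lacks fields: {missing}")
    return record


example = """
{
    "video": "car-1",
    "label": {
        "class_name": "car",
        "class_synonyms": ["vehicle", "automobile", "auto", "transport", "transportation"],
        "definition": "mechanical device designed for transportation, powered by an engine or motor, equipped by four wheels",
        "include_attributes": ["white headlight", "oncoming traffic"],
        "exclude_attributes": ["red taillight", "opposite traffic"],
        "caption": "Track white headlight cars while excluding red taillight cars",
        "track_path": "car_01.txt"
    }
}
"""

record = load_text_label(example)
track_file = record["label"]["track_path"]  # file holding the matching track labels
```

The `track_path` value is what links a text label to its track label file, so checking that it is present before running a tracker avoids a failure later in the pipeline.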

TP-GMOT demo website
