Collect videos from the GMOT-40 and AnimalTrack datasets.
Download our annotations for Refer-GMOT40 and Refer-Animal:
| Task | Dataset | NLP | #Cat. | #Vid. | #Frames | #Tracks | #Boxes |
|---|---|---|---|---|---|---|---|
| SOT | OTB2013 | ✖ | 10 | 51 | 29K | 51 | 29K |
| | VOT2017 | ✖ | 24 | 60 | 21K | 60 | 21K |
| | TrackingNet | ✖ | 21 | 31K | 14M | 31K | 14M |
| | LaSOT | ✔ | 70 | 1.4K | 3.52M | 1.4K | 3.52M |
| | TNL2K | ✔ | - | 2K | 1.24M | 2K | 1.24M |
| MOT | MOT17 | ✖ | 1 | 14 | 11.2K | 1.3K | 0.3M |
| | MOT20 (Dendorfer et al., 2020) | ✖ | 1 | 8 | 13.41K | 3.45K | 1.65M |
| | Omni-MOT (Sun et al., 2020b) | ✖ | 1 | - | 14M+ | 250K | 110M |
| | DanceTrack (Sun et al., 2022) | ✖ | 1 | 10 | 105K | 990 | - |
| | TAO (Dave et al., 2020) | ✖ | 833 | 2.9K | 2.6M | 17.2K | 333K |
| | SportsMOT (Cui et al., 2023) | ✖ | 1 | 240 | 150K | 3.4K | 1.62M |
| | Refer-KITTI (Wu et al., 2023) | ✔ | 2 | 18 | 6.65K | 637 | 28.72K |
| GSOT | GOT-10k (Huang et al., 2019) | ✖ | 563 | 10K | 1.5M | 10K | 1.5M |
| | Fish (Kay et al., 2022) | ✖ | 1 | 1.6K | 527.2K | 8.25K | 516K |
| GMOT | AnimalTrack (Zhang et al., 2022b) | ✖ | 10 | 58 | 24.7K | 1.92K | 429K |
| | GMOT-40 (Bai et al., 2021) | ✖ | 10 | 40 | 9K | 2.02K | 256K |
| | Refer-Animal (Ours) | ✔ | 10 | 58 | 24.7K | 1.92K | 429K |
| | Refer-GMOT40 (Ours) | ✔ | 10 | 40 | 9K | 2.02K | 256K |
In this work, we introduce textual descriptions for two existing GMOT datasets, producing "Refer-GMOT40" and "Refer-Animal". The Refer-GMOT40 dataset includes 40 videos covering 10 real-world object categories, with four sequences per category. The Refer-Animal dataset features 58 video sequences spanning 10 common animal categories. Each video is annotated with the object's name, a description of its attributes, and its tracks. The attribute descriptions focus on noticeable characteristics of the object, while "other_attributes" provide more detailed information that may not be visible throughout the entire video.
To align with the standard format of MOT challenges, each video is accompanied by tracking ground truth in a separate text file, keeping the data consistent with MOT conventions. The textual annotation is stored in JSON format; an example of this structure is provided below. Four annotators carried out the annotation, and the data is publicly available.
Our datasets' textual description annotations are formatted in COCO format. Each generic object is labeled as follows:
Text label for referring with specific attributes:

```json
{
  "video": "",
  "label": [
    {
      "object": "",
      "object_synonym": [""],
      "attribute": [""],
      "other_attributes": [""],
      "tracks": ""
    }
  ]
}
```
Track label for associating objects' IDs through time:

```
1, 1, xl, yt, w, h, 1, 1, 1
1, 2, xl, yt, w, h, 1, 1, 1
1, 3, xl, yt, w, h, 1, 1, 1
2, 1, xl, yt, w, h, 1, 1, 1
2, 2, xl, yt, w, h, 1, 1, 1
2, 3, xl, yt, w, h, 1, 1, 1
3, 1, xl, yt, w, h, 1, 1, 1
3, 2, xl, yt, w, h, 1, 1, 1
3, 3, xl, yt, w, h, 1, 1, 1
```
For the text label:
- `video`: the video's name
- `object`: the generic object's name
- `object_synonym`: synonyms of the object's name
- `attribute`: noticeable characteristics of the object
- `other_attributes`: more detailed characteristics, which may not be visible throughout the entire video
- `tracks`: the text file holding the video's tracking ground truth
For the track label: each line contains 9 elements, separated by commas:

```
<frame_id>, <track_id>, <bb_left>, <bb_top>, <bb_width>, <bb_height>, 1, 1, 1
```
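As a minimal sketch of how such a track file can be consumed (not part of the official tooling; the helper name `load_tracks` is illustrative), the following Python snippet parses the format above into per-frame boxes:

```python
from collections import defaultdict

def load_tracks(path):
    """Parse a MOT-style track file into {frame_id: [(track_id, (x, y, w, h)), ...]}.

    Each line: <frame_id>, <track_id>, <bb_left>, <bb_top>, <bb_width>, <bb_height>, 1, 1, 1
    """
    frames = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = [v.strip() for v in line.split(",")]
            if len(fields) != 9:
                continue  # skip malformed or empty lines
            frame_id, track_id = int(fields[0]), int(fields[1])
            x, y, w, h = map(float, fields[2:6])
            frames[frame_id].append((track_id, (x, y, w, h)))
    return dict(frames)

# Example usage (file name is illustrative):
# tracks = load_tracks("stock-3.txt")
# print(tracks[1])  # all (track_id, box) pairs in frame 1
```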
Example annotations:

```json
{
  "video": "stock-3",
  "label": [{
    "object": "stock",
    "object_synonym": ["wild dog"],
    "attribute": ["gray fur"],
    "other_attributes": ["four legs", "sharp teeth", "small ears", "strong jaw"],
    "tracks": "stock-3.txt"
  }]
}
```

```json
{
  "video": "ball-0",
  "label": [{
    "object": "ball",
    "object_synonym": ["sphere", "billiard ball", "billiard sphere"],
    "attribute": ["circle", "round", "red"],
    "other_attributes": ["small", "smooth", "numbering", "glossy"],
    "tracks": "ball-0.txt"
  }]
}
```

```json
{
  "video": "car-1",
  "label": [{
    "object": "car",
    "object_synonym": ["transport", "vehicle"],
    "attribute": ["white light"],
    "other_attributes": ["frontal side", "night covered"],
    "tracks": "car-1.txt"
  }]
}
```
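To show how the text label and track label fit together, here is a hedged Python sketch that reads one annotation JSON and resolves its `tracks` file via the `load_tracks` helper above; the directory layout and file names are assumptions based on the examples, not a prescribed structure:

```python
import json
import os

def load_annotation(json_path, track_dir="."):
    """Read a Refer-GMOT text-label JSON and attach parsed tracks to each label.

    Assumes track_dir holds the per-video .txt files referenced by the
    "tracks" field, as in the examples above.
    """
    with open(json_path) as f:
        ann = json.load(f)
    for label in ann["label"]:
        track_path = os.path.join(track_dir, label["tracks"])
        label["parsed_tracks"] = load_tracks(track_path)  # helper defined above
    return ann

# Example usage (paths are illustrative):
# ann = load_annotation("stock-3.json", track_dir="tracks")
# print(ann["label"][0]["object"], ann["label"][0]["attribute"])
```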