Dataset

Download dataset

Collect videos from the GMOT-40 and AnimalTrack datasets

Download our annotations for Refer-GMOT40 and Refer-Animal:

https://drive.google.com/file/d/1TVjDD6jxL5TsjLXVEUacIYXa-0IBY0mc/view?usp=sharing

Dataset comparison

Table 1: Comparison of existing datasets for SOT, MOT, GSOT, and GMOT. "#" denotes the quantity of the respective items; Cat. and Vid. denote Categories and Videos. NLP indicates textual natural language descriptions.

Type   Dataset                             NLP  #Cat.  #Vid.  #Frames  #Tracks  #Boxes
SOT    OTB2013                             ✗    10     51     29K      51       29K
       VOT2017                             ✗    24     60     21K      60       21K
       TrackingNet                         ✗    21     31K    14M      31K      14M
       LaSOT                               ✗    70     1.4K   3.52M    1.4K     3.52M
       TNL2K                               ✓    -      2K     1.24M    2K       1.24M
MOT    MOT17                               ✗    1      14     11.2K    1.3K     0.3M
       MOT20 (Dendorfer et al., 2020)      ✗    1      8      13.41K   3.45K    1.65M
       Omni-MOT (Sun et al., 2020b)        ✗    1      -      14M+     250K     110M
       DanceTrack (Sun et al., 2022)       ✗    1      10     105K     990      -
       TAO (Dave et al., 2020)             ✗    833    2.9K   2.6M     17.2K    333K
       SportsMOT (Cui et al., 2023)        ✗    1      240    150K     3.4K     1.62M
       Refer-KITTI (Wu et al., 2023)       ✓    2      18     6.65K    637      28.72K
GSOT   GOT-10k (Huang et al., 2019)        ✗    563    10K    1.5M     10K      1.5M
       Fish (Kay et al., 2022)             ✗    1      1.6K   527.2K   8.25K    516K
GMOT   AnimalTrack (Zhang et al., 2022b)   ✗    10     58     24.7K    1.92K    429K
       GMOT-40 (Bai et al., 2021)          ✗    10     40     9K       2.02K    256K
       Refer-Animal (Ours)                 ✓    10     58     24.7K    1.92K    429K
       Refer-GMOT40 (Ours)                 ✓    10     40     9K       2.02K    256K

In this work, we introduce textual descriptions for two existing GMOT datasets, yielding "Refer-GMOT40" and "Refer-Animal". The "Refer-GMOT40" dataset contains 40 videos covering 10 real-world object categories, with four sequences per category. The "Refer-Animal" dataset features 26 video sequences covering 10 common animal categories. Each video is annotated with the object's name, a description of its attributes, and its tracks. The attribute descriptions capture the object's noticeable characteristics, while "other_attributes" provide more detailed information that may not be visible throughout the entire video.

To align with the standard format of the MOT challenges, each video is accompanied by tracking ground truth in a separate text file, ensuring consistency with MOT conventions. The textual annotations are stored in JSON format, and an example of this structure is provided below. Four annotators performed the annotation, and the data is publicly available.

Our datasets' textual description annotations follow the COCO format. Each generic object is labeled as follows:

Text label for referring to objects with specific attributes:
{
    "video": "",
    "label": [
        {
            "object": "",
            "object_synonym": [""],
            "attribute": [""],
            "other_attributes": [""],
            "tracks": ""
        }
    ]
}
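For illustration, an annotation file in this structure can be read with standard JSON tooling. The sketch below is a minimal example of ours; the file name "stock-3.json" and the flat directory layout are assumptions, not part of the dataset specification.

import json

# Read one video's textual annotation (file name is an assumption).
with open("stock-3.json") as f:
    annotation = json.load(f)

print(annotation["video"])  # e.g. "stock-3"
for label in annotation["label"]:
    print(label["object"])            # generic object category
    print(label["object_synonym"])    # list of synonyms for the category
    print(label["attribute"])         # noticeable, visible attributes
    print(label["other_attributes"])  # finer details, possibly not always visible
    print(label["tracks"])            # name of the MOT-format track file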
                
Track label for associating objects' IDs through time, in the standard MOTChallenge format (frame, object ID, x-left, y-top, width, height, confidence, class, visibility):
1, 1, xl, yt, w, h, 1, 1, 1
1, 2, xl, yt, w, h, 1, 1, 1
1, 3, xl, yt, w, h, 1, 1, 1
2, 1, xl, yt, w, h, 1, 1, 1
2, 2, xl, yt, w, h, 1, 1, 1
2, 3, xl, yt, w, h, 1, 1, 1
3, 1, xl, yt, w, h, 1, 1, 1
3, 2, xl, yt, w, h, 1, 1, 1
3, 3, xl, yt, w, h, 1, 1, 1
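A track file in this format can be parsed with a few lines of Python. The sketch below assumes the nine comma-separated fields described above and groups boxes by frame; the helper name is ours, not the dataset's.

import csv
from collections import defaultdict

def load_tracks(path):
    """Group MOT-format rows by frame: {frame: [(object_id, x, y, w, h), ...]}."""
    frames = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f, skipinitialspace=True):
            frame, obj_id = int(row[0]), int(row[1])
            x, y, w, h = map(float, row[2:6])  # x-left, y-top, width, height
            frames[frame].append((obj_id, x, y, w, h))
    return frames

tracks = load_tracks("stock-3.txt")
print(tracks[1])  # all boxes visible in frame 1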
                    
Example annotation for video "stock-3":
{
    "video": "stock-3",
    "label": [
        {
            "object": "stock",
            "object_synonym": ["wild dog"],
            "attribute": ["gray fur"],
            "other_attributes": ["four legs", "sharp teeth", "small ears", "strong jaw"],
            "tracks": "stock-3.txt"
        }
    ]
}

Example annotation for video "ball-0":
{
    "video": "ball-0",
    "label": [
        {
            "object": "ball",
            "object_synonym": ["sphere", "billiard ball", "billiard sphere"],
            "attribute": ["circle", "round", "red"],
            "other_attributes": ["small", "smooth", "numbering", "glossy"],
            "tracks": "ball-0.txt"
        }
    ]
}

Example annotation for video "car-1":
{
    "video": "car-1",
    "label": [
        {
            "object": "car",
            "object_synonym": ["transport", "vehicle"],
            "attribute": ["white light"],
            "other_attributes": ["frontal side", "night covered"],
            "tracks": "car-1.txt"
        }
    ]
}
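As a usage illustration, the annotated fields can be composed into a referring expression for a text-prompted tracker. The template below is a simple assumption of ours, not the prompt format used in the paper.

def to_referring_expression(label):
    """Join visible attributes with the object name, e.g. "circle round red ball"."""
    return " ".join(label["attribute"] + [label["object"]])

label = {
    "object": "ball",
    "object_synonym": ["sphere", "billiard ball", "billiard sphere"],
    "attribute": ["circle", "round", "red"],
    "other_attributes": ["small", "smooth", "numbering", "glossy"],
    "tracks": "ball-0.txt",
}
print(to_referring_expression(label))  # -> "circle round red ball"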