Dataset

Comparison of existing datasets of SOT, MOT, GSOT, GMOT. "#" represents the quantity of the respective items. Cat., Vid., NLP denote Categories, Videos, and Textual Language Descriptions.
Datasets                    NLP  #Cat.  #Vid.  #Frames  #Tracks  #Boxes
SOT   OTB2013 [1]                10     51     29K      51       29K
      VOT2017 [2]                24     60     21K      60       21K
      TrackingNet [3]            21     31K    14M      31K      14M
      LaSOT [4]                  70     1.4K   3.52M    1.4K     3.52M
      TNL2K [5]                  -      2K     1.24M    2K       1.24M
GSOT  GOT-10 [6]                 563    10K    1.5M     10K      1.5M
      Fish [7]                   1      1.6K   527.2K   8.25K    516K
MOT   MOT17 [8]                  1      14     11.2K    1.3K     0.3M
      MOT20 [9]                  1      8      13.41K   3.45K    1.65M
      Omni-MOT [10]              1      -      14M+     250K     110M
      DanceTrack [11]            1      100    105K     990      -
      TAO [12]                   833    2.9K   2.6M     17.2K    333K
      SportMOT [13]              1      240    150K     3.4K     1.62M
      Refer-KITTI [14]           2      18     6.65K    637      28.72K
GMOT  AnimalTrack [15]           10     58     24.7K    1.92K    429K
      GMOT-40 [16]               10     40     9K       2.02K    256K
      Refer-GMOT40 (Ours)        10     40     9K       2.02K    256K
      Refer-Animal (Ours)        10     58     24.7K    1.92K    429K

We extend two existing generic multiple object tracking (GMOT) datasets, GMOT-40 and AnimalTrack, with textual descriptions. The resulting datasets are named 'Refer-GMOT40' and 'Refer-Animal'.

'Refer-GMOT40' comprises 40 videos spanning 10 real-world object categories, with 4 video sequences per category. 'Refer-Animal' contains 26 videos covering 10 common animal categories.

Each video in these datasets is annotated with two labels: a text label for referring with specific attributes, and a track label for associating objects' IDs through time.

The annotations are formatted in JSON, and we provide examples to illustrate how they are structured. This data, prepared by 4 annotators, will be shared publicly.

Text label for referring with specific attributes
{
    "video": "",
    "label": {
        "class_name": "",
        "class_synonyms": [],
        "definition": "",
        "include_attributes": [],
        "exclude_attributes": [],
        "caption": "",
        "track_path": ""
    }
}
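To make the text label structure concrete, the following minimal Python sketch loads one annotation into a typed record. It assumes the labels are stored as standard JSON with quoted keys; the `TextLabel` class and `parse_text_label` function are hypothetical names introduced here for illustration and are not part of the released data. The sample string is abridged from the car-1 annotation in this section.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextLabel:
    """One referring annotation; field names mirror the text label template."""
    video: str
    class_name: str
    class_synonyms: List[str] = field(default_factory=list)
    definition: str = ""
    include_attributes: List[str] = field(default_factory=list)
    exclude_attributes: List[str] = field(default_factory=list)
    caption: str = ""
    track_path: str = ""


def parse_text_label(raw: str) -> TextLabel:
    """Parse one annotation from a JSON string (assumes quoted keys)."""
    record = json.loads(raw)
    label = record["label"]
    return TextLabel(
        video=record["video"],
        class_name=label["class_name"],
        class_synonyms=list(label.get("class_synonyms", [])),
        definition=label.get("definition", ""),
        include_attributes=list(label.get("include_attributes", [])),
        exclude_attributes=list(label.get("exclude_attributes", [])),
        caption=label.get("caption", ""),
        track_path=label.get("track_path", ""),
    )


# Abridged copy of the car-1 annotation, written as strict JSON.
example = """
{
    "video": "car-1",
    "label": {
        "class_name": "car",
        "class_synonyms": ["vehicle", "automobile"],
        "definition": "",
        "include_attributes": ["white headlight", "oncoming traffic"],
        "exclude_attributes": ["red taillight", "opposite traffic"],
        "caption": "Track white headlight cars while excluding red taillight cars",
        "track_path": "car_01.txt"
    }
}
"""
label = parse_text_label(example)
print(label.video, "->", label.track_path)
```

Keeping `include_attributes` and `exclude_attributes` as separate lists, as the annotation does, lets downstream code treat the attributes to track and the attributes to reject independently.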
                    
Track label for associating objects' IDs through time
1, 1, xl, yt, w, h, 1, 1, 1
1, 2, xl, yt, w, h, 1, 1, 1
1, 3, xl, yt, w, h, 1, 1, 1
2, 1, xl, yt, w, h, 1, 1, 1
2, 2, xl, yt, w, h, 1, 1, 1
2, 3, xl, yt, w, h, 1, 1, 1
3, 1, xl, yt, w, h, 1, 1, 1
3, 2, xl, yt, w, h, 1, 1, 1
3, 3, xl, yt, w, h, 1, 1, 1
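As an illustration of how such a track label could be read, the sketch below groups rows by object ID. It assumes a MOTChallenge-style column order in which the first column is the frame index and the second is the object ID, followed by the box as left x, top y, width, and height; the meaning of the last three columns is not specified here, so they are ignored. The function name `parse_tracks` and the numeric values in `sample` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (frame, x_left, y_top, width, height)
Box = Tuple[int, float, float, float, float]


def parse_tracks(lines: Iterable[str]) -> Dict[int, List[Box]]:
    """Group comma-separated track rows by object ID, ordered by frame.

    Assumed column order: frame, id, xl, yt, w, h, then three trailing
    fields that this sketch does not interpret.
    """
    tracks = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = [f.strip() for f in line.split(",")]
        frame, track_id = int(fields[0]), int(fields[1])
        xl, yt, w, h = (float(v) for v in fields[2:6])
        tracks[track_id].append((frame, xl, yt, w, h))
    for boxes in tracks.values():
        boxes.sort(key=lambda b: b[0])  # order each track by frame index
    return dict(tracks)


# Hypothetical rows: two objects observed over two frames.
sample = [
    "1, 1, 10, 20, 30, 40, 1, 1, 1",
    "1, 2, 50, 60, 30, 40, 1, 1, 1",
    "2, 1, 12, 21, 30, 40, 1, 1, 1",
    "2, 2, 52, 61, 30, 40, 1, 1, 1",
]
tracks = parse_tracks(sample)
for track_id, boxes in sorted(tracks.items()):
    print(track_id, boxes)
```

Grouping by ID rather than by frame makes each object's trajectory through time directly accessible, which matches the stated purpose of the track label.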
                        
{
    "video": "airplane-1",
    "label": {
        "class_name": "helicopter",
        "class_synonyms": ["airplane", "aircraft", "jet", "plane"],
        "definition": "a vehicle designed for flight in the air",
        "include_attributes": ["black", "flying"],
        "exclude_attributes": [],
        "caption": "Track all black flying helicopters",
        "track_path": "airplane_01.txt"
    }
}
            
{
    "video": "car-1",
    "label": {
        "class_name": "car",
        "class_synonyms": ["vehicle", "automobile", "auto", "transport", "transportation"],
        "definition": "mechanical device designed for transportation, powered by an engine or motor, equipped with four wheels",
        "include_attributes": ["white headlight", "oncoming traffic"],
        "exclude_attributes": ["red taillight", "opposite traffic"],
        "caption": "Track white headlight cars while excluding red taillight cars",
        "track_path": "car_01.txt"
    }
}