Collect videos from the GMOT-40 and AnimalTrack datasets.
Download our annotations for Refer-GMOT40 and Refer-Animal:
| Task | Dataset | NLP | #Cat. | #Vid. | #Frames | #Tracks | #Boxes |
|---|---|---|---|---|---|---|---|
| SOT | OTB2013 | ✖ | 10 | 51 | 29K | 51 | 29K |
| | VOT2017 | ✖ | 24 | 60 | 21K | 60 | 21K |
| | TrackingNet | ✖ | 21 | 31K | 14M | 31K | 14M |
| | LaSOT | ✔ | 70 | 1.4K | 3.52M | 1.4K | 3.52M |
| | TNL2K | ✔ | - | 2K | 1.24M | 2K | 1.24M |
| MOT | MOT17 | ✖ | 1 | 14 | 11.2K | 1.3K | 0.3M |
| | MOT20 (Dendorfer et al., 2020) | ✖ | 1 | 8 | 13.41K | 3.45K | 1.65M |
| | Omni-MOT (Sun et al., 2020b) | ✖ | 1 | - | 14M+ | 250K | 110M |
| | DanceTrack (Sun et al., 2022) | ✖ | 1 | 10 | 105K | 990 | - |
| | TAO (Dave et al., 2020) | ✖ | 833 | 2.9K | 2.6M | 17.2K | 333K |
| | SportsMOT (Cui et al., 2023) | ✖ | 1 | 240 | 150K | 3.4K | 1.62M |
| | Refer-KITTI (Wu et al., 2023) | ✔ | 2 | 18 | 6.65K | 637 | 28.72K |
| GSOT | GOT-10k (Huang et al., 2019) | ✖ | 563 | 10K | 1.5M | 10K | 1.5M |
| | Fish (Kay et al., 2022) | ✖ | 1 | 1.6K | 527.2K | 8.25K | 516K |
| GMOT | AnimalTrack (Zhang et al., 2022b) | ✖ | 10 | 58 | 24.7K | 1.92K | 429K |
| | GMOT-40 (Bai et al., 2021) | ✖ | 10 | 40 | 9K | 2.02K | 256K |
| | Refer-Animal (Ours) | ✔ | 10 | 58 | 24.7K | 1.92K | 429K |
| | Refer-GMOT40 (Ours) | ✔ | 10 | 40 | 9K | 2.02K | 256K |
In this work, we introduce textual descriptions for two existing GMOT datasets, producing "Refer-GMOT40" and "Refer-Animal". The Refer-GMOT40 dataset includes 40 videos covering 10 real-world object categories, with four sequences per category. The Refer-Animal dataset features 58 video sequences spanning 10 common animal categories. Each video is annotated with the object's name, a description of its attributes, and its tracks. The attribute descriptions focus on noticeable characteristics of the object, while "other_attributes" provide more detailed information that may not be visible throughout the entire video.
To align with the standard format of MOT challenges, each video is accompanied by tracking ground truth in a separate text file, keeping the data consistent with MOT conventions. The textual annotation is stored in JSON format; an example of this structure is provided below. Four annotators carried out the annotation, and the data is publicly available.
Our datasets' textual description annotations are formatted in COCO format. Each generic object is labeled as follows:
Text label for referring with specific attributes:

```json
{
  "video": "",
  "label": [
    {
      "object": "",
      "object_synonym": [""],
      "attribute": [""],
      "other_attributes": [""],
      "tracks": ""
    }
  ]
}
```
Track label for associating objects' IDs through time:

```
1, 1, xl, yt, w, h, 1, 1, 1
1, 2, xl, yt, w, h, 1, 1, 1
1, 3, xl, yt, w, h, 1, 1, 1
2, 1, xl, yt, w, h, 1, 1, 1
2, 2, xl, yt, w, h, 1, 1, 1
2, 3, xl, yt, w, h, 1, 1, 1
3, 1, xl, yt, w, h, 1, 1, 1
3, 2, xl, yt, w, h, 1, 1, 1
3, 3, xl, yt, w, h, 1, 1, 1
```
For the text label:
- `video`: the video's name
- `object`: the generic object's name
- `object_synonym`: synonyms of the object's name
- `attribute`: noticeable characteristics of the object
- `other_attributes`: more detailed characteristics, which may not be visible throughout the entire video
- `tracks`: the text file holding the video's tracking ground truth
For the track label: each line contains 9 elements, separated by commas:

```
<frame_id>, <track_id>, <bb_left>, <bb_top>, <bb_width>, <bb_height>, 1, 1, 1
```
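As a minimal sketch of how such a track file can be consumed (not part of the official tooling; the helper name `load_tracks` is illustrative), the following Python snippet parses the format above into per-frame boxes:

```python
from collections import defaultdict

def load_tracks(path):
    """Parse a MOT-style track file into {frame_id: [(track_id, (x, y, w, h)), ...]}.

    Each line: <frame_id>, <track_id>, <bb_left>, <bb_top>, <bb_width>, <bb_height>, 1, 1, 1
    """
    frames = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = [v.strip() for v in line.split(",")]
            if len(fields) != 9:
                continue  # skip malformed or empty lines
            frame_id, track_id = int(fields[0]), int(fields[1])
            x, y, w, h = map(float, fields[2:6])
            frames[frame_id].append((track_id, (x, y, w, h)))
    return dict(frames)

# Example usage (file name is illustrative):
# tracks = load_tracks("stock-3.txt")
# print(tracks[1])  # all (track_id, box) pairs in frame 1
```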
Example annotations:

```json
{
  "video": "stock-3",
  "label": [{
    "object": "stock",
    "object_synonym": ["wild dog"],
    "attribute": ["gray fur"],
    "other_attributes": ["four legs", "sharp teeth", "small ears", "strong jaw"],
    "tracks": "stock-3.txt"
  }]
}
```

```json
{
  "video": "ball-0",
  "label": [{
    "object": "ball",
    "object_synonym": ["sphere", "billiard ball", "billiard sphere"],
    "attribute": ["circle", "round", "red"],
    "other_attributes": ["small", "smooth", "numbering", "glossy"],
    "tracks": "ball-0.txt"
  }]
}
```

```json
{
  "video": "car-1",
  "label": [{
    "object": "car",
    "object_synonym": ["transport", "vehicle"],
    "attribute": ["white light"],
    "other_attributes": ["frontal side", "night covered"],
    "tracks": "car-1.txt"
  }]
}
```
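To show how the text label and track label fit together, here is a hedged Python sketch that reads one annotation JSON and resolves its `tracks` file via the `load_tracks` helper above; the directory layout and file names are assumptions based on the examples, not a prescribed structure:

```python
import json
import os

def load_annotation(json_path, track_dir="."):
    """Read a Refer-GMOT text-label JSON and attach parsed tracks to each label.

    Assumes track_dir holds the per-video .txt files referenced by the
    "tracks" field, as in the examples above.
    """
    with open(json_path) as f:
        ann = json.load(f)
    for label in ann["label"]:
        track_path = os.path.join(track_dir, label["tracks"])
        label["parsed_tracks"] = load_tracks(track_path)  # helper defined above
    return ann

# Example usage (paths are illustrative):
# ann = load_annotation("stock-3.json", track_dir="tracks")
# print(ann["label"][0]["object"], ann["label"][0]["attribute"])
```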