Software Demo for Action Spotting in Sports Videos
The concept of action spotting was first introduced with the SoccerNet dataset. It involves identifying a momentary event, known as an action, with a single timestamp (only the starting point). This differs from temporal action localization, which defines both the starting and ending points of actions. Action-spotting performance is evaluated with the loose average-mAP metric: a detection is counted as correct if it falls within a time window around the ground-truth event, with a loose error tolerance ranging from 5 to 60 seconds. For a stricter evaluation, the tight average-mAP metric uses a tighter error tolerance of 1 to 5 seconds.
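As a minimal sketch of this matching rule (the function name and tolerance grids below are illustrative, not the official evaluation code), a detection is matched to a ground-truth action if their timestamps differ by at most the current tolerance, and the AP is then averaged over the tolerance grid:

```python
import numpy as np

def is_correct_detection(pred_time_s, gt_time_s, tolerance_s):
    """A detection counts as correct if it falls within +/- tolerance_s
    seconds of the ground-truth action timestamp."""
    return abs(pred_time_s - gt_time_s) <= tolerance_s

# Loose average-mAP averages AP over tolerances from 5 to 60 seconds,
# tight average-mAP over 1 to 5 seconds (grids shown for illustration).
loose_tolerances = np.arange(5, 65, 5)   # 5, 10, ..., 60 seconds
tight_tolerances = np.arange(1, 6)       # 1, 2, ..., 5 seconds
```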
Motivation for fusing local features with global features: Current sports-video action-spotting methods rely predominantly on global features, applying a backbone network to the entire spatial frame. This approach struggles with action classes involving small objects, such as balls and yellow/red cards, which occupy only a small fraction of the frame. To address this, we introduce a novel approach that employs an object detector to extract local features and integrates them with the global features.
Motivation for using a vision-language object detector: To eliminate the need to train the object detector separately on each dataset (as required for YOLO or Faster R-CNN), to detect a wide range of objects, and to avoid relabeling data when new datasets become available, we propose to leverage the vision-language model GLIP.
Unifying Global and Local (UGL) Module: The UGL module concurrently extracts the global environment feature fEnv and the local relevant-entities feature fEnt. It then combines these features to produce a unified global-local entities-environment feature fEnt−Env, which serves as the unifying link between the local and global representations.
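The sketch below illustrates one way such a fusion could look in PyTorch. The module name, feature dimensions, pooling over detected boxes, and the concatenation-plus-projection fusion are assumptions for illustration, not the exact UGL implementation:

```python
import torch
import torch.nn as nn

class UGLFusionSketch(nn.Module):
    """Illustrative fusion of a global environment feature f_env with
    local entity features f_ent (dimensions are hypothetical)."""
    def __init__(self, env_dim=2048, ent_dim=256, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(env_dim + ent_dim, out_dim)

    def forward(self, f_env, f_ent_boxes):
        # f_env:       (B, env_dim)    global frame feature from the backbone
        # f_ent_boxes: (B, N, ent_dim) per-box features from the detector (e.g. GLIP)
        f_ent = f_ent_boxes.mean(dim=1)                    # pool over detected entities
        f_ent_env = self.proj(torch.cat([f_env, f_ent], dim=-1))
        return f_ent_env                                   # unified entities-environment feature
```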
Long-term Temporal Reasoning (LTR) Module: The LTR module estimates an action score for each frame in the δ-frame snippet. It consists of two components: semantic modeling and proposal estimation. The first component models the semantic relationships between frames within a snippet, which is achieved by applying a 1-layer bidirectional Gated Recurrent Unit (GRU).
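A minimal sketch of this idea is shown below, assuming per-frame fused features as input; the class name, feature sizes, and the linear scoring head are illustrative placeholders, while the 1-layer bidirectional GRU follows the description above:

```python
import torch
import torch.nn as nn

class LTRSketch(nn.Module):
    """Sketch of the LTR idea: a 1-layer bidirectional GRU models relations
    between the delta frames of a snippet, then a head scores each frame."""
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=17):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, num_layers=1,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, snippet_feats):
        # snippet_feats: (B, delta, feat_dim) per-frame fused features
        context, _ = self.gru(snippet_feats)   # (B, delta, 2*hidden_dim)
        return self.classifier(context)        # per-frame action scores (B, delta, num_classes)
```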