PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points
Jing Tan · Xiaotong Zhao · Xintian Shi · Bin Kang · Limin Wang

Wed Dec 07 09:00 AM -- 11:00 AM (PST)

Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting can be unrealistic, as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection, which aims to localize all action instances in a multi-label untrimmed video. Multi-label TAD is more challenging, as it requires fine-grained class discrimination within a single video and precise localization of co-occurring instances. To address these challenges, we extend the sparse query-based detection paradigm from traditional TAD and propose PointTAD, a multi-label TAD framework. Specifically, PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism to localize the discriminative frames at boundaries as well as the important frames inside the action. Moreover, we perform action decoding with the Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, PointTAD employs an end-to-end trainable framework based solely on RGB input for easy deployment. We evaluate our proposed method on two popular benchmarks and introduce the new metric of detection-mAP for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric, and also achieves promising results under the segmentation-mAP metric.
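
As a rough illustration of the point-based representation and of instance-level evaluation, the sketch below collapses a set of query points into a temporal segment and scores it against a ground-truth segment by temporal IoU. This is not the authors' implementation: the min/max conversion in `points_to_segment`, the use of temporal IoU as the matching criterion behind detection-mAP, and all names and values are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end), e.g. in seconds


def points_to_segment(points: Sequence[float]) -> Segment:
    """Collapse query points into a segment via min/max.

    Assumption: this simple conversion is illustrative only; the abstract
    does not specify how points are mapped to a segment.
    """
    if not points:
        raise ValueError("need at least one query point")
    return (min(points), max(points))


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D temporal segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical query points: two near the boundaries, two inside the action.
predicted = points_to_segment([2.0, 3.5, 5.0, 6.0])
ground_truth = (3.0, 7.0)
score = temporal_iou(predicted, ground_truth)
```

In a detection-style metric, a prediction would typically count as correct when this score exceeds a chosen threshold and the predicted class matches; in a multi-label video, several such segments of different classes may overlap in time.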

Author Information

Jing Tan (Nanjing University)

Video understanding, Temporal Action Detection.

Xiaotong Zhao (Beijing University of Posts and Telecommunications)
Xintian Shi (Platform & Content Group)
Bin Kang (QQ.com)
Limin Wang (Nanjing University)

More from the Same Authors

  • 2022 Spotlight: VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
    Zhan Tong · Yibing Song · Jue Wang · Limin Wang
  • 2022 Spotlight: Lightning Talks 5B-1
    Devansh Arpit · Xiaojun Xu · Zifan Shi · Ivan Skorokhodov · Shayan Shekarforoush · Zhan Tong · Yiqun Wang · Shichong Peng · Linyi Li · Ivan Skorokhodov · Huan Wang · Yibing Song · David Lindell · Yinghao Xu · Seyed Alireza Moazenipourasil · Sergey Tulyakov · Peter Wonka · Yiqun Wang · Ke Li · David Fleet · Yujun Shen · Yingbo Zhou · Bo Li · Jue Wang · Peter Wonka · Marcus Brubaker · Caiming Xiong · Limin Wang · Deli Zhao · Qifeng Chen · Dit-Yan Yeung
  • 2022 Spotlight: Lightning Talks 3A-3
    Xu Yan · Zheng Dong · Qiancheng Fu · Jing Tan · Hezhen Hu · Fukun Yin · Weilun Wang · Ke Xu · Heshen Zhan · Wen Liu · Qingshan Xu · Xiaotong Zhao · Chaoda Zheng · Ziheng Duan · Zilong Huang · Xintian Shi · Wengang Zhou · Yew Soon Ong · Pei Cheng · Hujun Bao · Houqiang Li · Wenbing Tao · Jiantao Gao · Bin Kang · Weiwei Xu · Limin Wang · Ruimao Zhang · Tao Chen · Gang Yu · Rynson Lau · Shuguang Cui · Zhen Li
  • 2022 Poster: VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
    Zhan Tong · Yibing Song · Jue Wang · Limin Wang