The automatic recognition of human actions in videos is a central component of modern AI systems. While modern video models and multimodal foundation models capture general action categories well, they reach their limits with fine-grained actions, which are characterized by subtle differences and short durations. Examples include detailed movement sequences in sports, such as a "backwards tucked salto with a twist," or complex interactions in AR/VR applications.
Annotating training data for fine-grained actions is time-consuming and expensive. Semi-supervised learning (SSL) is therefore a promising way to achieve good results with limited labeling effort. SeFAR (Semi-supervised Fine-grained Action Recognition) is a framework that addresses precisely this challenge.
SeFAR is characterized by several innovative design decisions. To capture fine details in the videos, SeFAR constructs so-called "Dual-Level Temporal Elements." These elements represent the temporal structure of a clip at two granularities, combining a coarse view of the overall motion with a fine view of short sub-actions, and thus enable a more precise representation of the actions.
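The two-granularity idea can be illustrated with a minimal frame-sampling sketch. The function names and parameters (`segments`, `window`) are illustrative choices, not taken from the SeFAR implementation: one sampler spreads a few frames across the whole video for context, the other takes a dense run of consecutive frames for detail.

```python
# Hedged sketch of two-granularity temporal sampling; parameter
# values are illustrative, not the paper's configuration.

def coarse_indices(num_frames: int, segments: int = 8) -> list[int]:
    """Sample one frame per equal segment spanning the whole video
    (context level: overall motion structure)."""
    seg_len = num_frames / segments
    return [int(seg_len * i + seg_len / 2) for i in range(segments)]

def fine_indices(num_frames: int, center: int, window: int = 16) -> list[int]:
    """Sample consecutive frames inside a short window around `center`
    (detail level: subtle, short-duration sub-actions)."""
    start = max(0, min(center - window // 2, num_frames - window))
    return list(range(start, start + window))

video_len = 120
coarse = coarse_indices(video_len)           # 8 frames spread across the clip
fine = fine_indices(video_len, center=60)    # 16 consecutive frames around frame 60
```

Feeding both index sets to the backbone lets the model see the same action at two temporal resolutions, which is the intuition behind the dual-level elements.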
Another core aspect of SeFAR is the use of a robust augmentation strategy within a teacher-student learning paradigm. Through moderate temporal perturbations of the videos, the student model learns to extract invariant features, thus improving generalization ability. The temporal perturbations simulate variations in execution speed and duration that frequently occur in real-world videos.
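The teacher-student setup with moderate temporal perturbation can be sketched as follows. The speed-factor range and the mean-squared-error consistency objective are illustrative stand-ins chosen for brevity, not the exact perturbation or loss used in SeFAR:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_perturb(frames: np.ndarray, rng) -> np.ndarray:
    """Moderate temporal perturbation: resample the clip at a random
    speed factor in [0.75, 1.25], simulating variation in execution
    speed. The factor range is an assumed, illustrative value."""
    t = frames.shape[0]
    speed = rng.uniform(0.75, 1.25)
    # Resample frame indices at the new speed, clipped to valid range.
    idx = np.clip((np.arange(t) * speed).round().astype(int), 0, t - 1)
    return frames[idx]

def consistency_loss(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Push the student toward the teacher's prediction; a simple MSE
    between softmax outputs stands in for the actual unsupervised loss."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return float(((softmax(teacher_logits) - softmax(student_logits)) ** 2).mean())

clip = rng.normal(size=(16, 8))        # toy data: 16 frames x 8 features
perturbed = temporal_perturb(clip, rng)
```

Because the perturbation changes timing but not content, minimizing the consistency loss encourages the student to learn features that are invariant to execution speed.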
The predictions of the teacher model in fine-grained action recognition are often associated with high uncertainty. To compensate for this, SeFAR uses adaptive regulation, which stabilizes the learning process and increases the accuracy of the predictions. This regulation dynamically adapts to the uncertainty of the predictions, thus preventing the model from being misled by erroneous information.
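One common way to realize such uncertainty-aware regulation is to weight each pseudo-labeled sample by the teacher's certainty, for example via normalized entropy. SeFAR's exact regulation mechanism differs; this is a hedged, illustrative stand-in:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adaptive_weight(teacher_logits: np.ndarray) -> np.ndarray:
    """Down-weight uncertain teacher predictions: each sample's loss
    weight is its normalized certainty, 1 - entropy / max_entropy.
    Illustrative scheme, not SeFAR's actual formulation."""
    p = softmax(teacher_logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    max_entropy = np.log(p.shape[-1])
    return 1.0 - entropy / max_entropy

confident = np.array([[8.0, 0.0, 0.0]])   # sharply peaked prediction
uncertain = np.array([[0.1, 0.0, 0.1]])   # nearly uniform prediction
w_conf = adaptive_weight(confident)[0]
w_unc = adaptive_weight(uncertain)[0]
# The confident sample contributes far more to the pseudo-label loss.
```

Scaling the unsupervised loss by such a weight keeps noisy, near-uniform teacher predictions from dominating the gradient, which is the stabilizing effect described above.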
Experimental results on the FineGym and FineDiving datasets demonstrate the performance of SeFAR. The framework achieves state-of-the-art results in various scenarios with different amounts of labeled data. Furthermore, SeFAR also shows convincing performance in recognizing coarse-grained actions on the UCF101 and HMDB51 datasets, outperforming other semi-supervised methods.
The features extracted by SeFAR can also enhance the capabilities of multimodal foundation models to understand fine-grained and domain-specific semantics. This opens up new possibilities for integrating SeFAR into more comprehensive AI systems.
The combination of Dual-Level Temporal Elements, temporal augmentation, and adaptive regulation makes SeFAR a promising approach for fine-grained action recognition. The results achieved underscore the potential of SSL for video analysis and open up new perspectives for the development of robust and efficient AI systems.
Huang, Y., Chen, H., Xu, Z., Jia, Z., Sun, H., & Shao, D. (2025). SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization. AAAI Conference on Artificial Intelligence (AAAI).

Singh, A., Chakraborty, O., Varshney, A., Panda, R., Feris, R., Saenko, K., & Das, A. (2021). Semi-Supervised Action Recognition with Temporal Contrastive Learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Dave, I. R., Rizve, M. N., & Shah, M. (2024). FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition. European Conference on Computer Vision (ECCV).

Xiao, J., Jing, L., Zhang, L., He, J., She, Q., Zhou, Z., Yuille, A., & Li, Y. (2022). Learning from Temporal Gradient for Semi-supervised Action Recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).