Real-time object tracking from drones is challenging, especially in environments with obstacles such as buildings or trees. Occlusions of the tracked object are particularly difficult for existing tracking algorithms to handle. New research presents a promising approach based on Vision Transformers (ViTs), specifically designed to improve robustness against occlusions.
Conventional tracking methods built on single-stream architectures with ViT backbones show great potential for real-time drone tracking, but they struggle with occlusions. Frequent interruptions of the line of sight by obstacles lead to tracking errors and degrade system reliability. There is therefore a pressing need for strategies that make these models more resilient to occlusion.
The new method, called ORTrack (Occlusion-Robust Tracking), focuses on learning occlusion-robust representations. The core idea is to enforce invariance of the target object's feature representation under random masking operations. The masks, modeled by a spatial Cox process, simulate occlusions of the target. By training with these simulated occlusions, the ViT learns to extract robust features that enable reliable tracking even when the target is partially obscured.
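To make the idea concrete, here is a minimal PyTorch-style sketch, not the authors' exact implementation: the masking intensity is itself random, giving a simple spatial Cox (doubly stochastic Poisson) process over the patch grid, and a consistency loss pulls features of the masked view toward those of the clean view. The `backbone` and `patch_tokens` names, the Gamma prior, and the block size are all illustrative assumptions.

```python
import torch

def sample_cox_mask(grid_h, grid_w, mean_rate=3.0, block=2):
    """Binary patch mask from a simple spatial Cox process: the Poisson
    rate is itself random (Gamma-distributed here), and each sampled
    event zeroes out a small block of ViT patches (1 = keep, 0 = masked)."""
    # Gamma(2, 2/mean_rate) has mean `mean_rate`; the random rate is what
    # makes this a Cox rather than a plain Poisson process.
    rate = torch.distributions.Gamma(2.0, 2.0 / mean_rate).sample()
    n_events = int(torch.poisson(rate).item())
    mask = torch.ones(grid_h, grid_w)
    for _ in range(n_events):
        y = int(torch.randint(0, grid_h, (1,)))
        x = int(torch.randint(0, grid_w, (1,)))
        mask[y:y + block, x:x + block] = 0.0  # simulated occluder
    return mask.flatten()  # one entry per ViT patch token


def occlusion_invariance_loss(backbone, patch_tokens, mask):
    """Pull features of the masked view toward those of the clean view,
    so the representation becomes invariant to the simulated occlusion."""
    clean = backbone(patch_tokens).detach()               # target features
    masked = backbone(patch_tokens * mask.unsqueeze(-1))  # occluded view
    return torch.nn.functional.mse_loss(masked, clean)
```

In a training loop, this loss would be added to the usual tracking objective, so the model learns both to localize the target and to produce features that do not change when parts of the target disappear.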
To enable deployment in real-time applications, the authors also developed an adaptive feature-based knowledge distillation (AFKD) method. It trains a more compact student model, ORTrack-D, to mimic the behavior of the larger teacher model ORTrack, adapting the distillation dynamically to the difficulty of the tracking task at hand. ORTrack-D thereby achieves substantially higher efficiency with only a slight loss of accuracy relative to the teacher.
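A plausible reading of such an adaptive scheme, sketched below under stated assumptions: the student's features are matched to the teacher's, with a per-frame weight that grows on harder frames. Here difficulty is proxied by the peak confidence of the teacher's score map (assumed to lie in [0, 1]); this proxy and all tensor names are illustrative, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def afkd_loss(student_feat, teacher_feat, teacher_score_map):
    """Feature-level distillation with a difficulty-adaptive weight.
    A flat, low-peak teacher score map suggests an ambiguous (hard)
    frame, so that frame's distillation term is weighted more heavily."""
    b = teacher_score_map.size(0)
    peak = teacher_score_map.view(b, -1).max(dim=1).values  # per-frame confidence
    weight = (1.0 - peak).clamp(min=0.1)  # harder frame -> larger weight
    per_frame = F.mse_loss(student_feat, teacher_feat.detach(),
                           reduction="none").view(b, -1).mean(dim=1)
    return (weight * per_frame).mean()
```

The design intuition is that easy frames are handled well by the student on its own, so the teacher's guidance is concentrated where the student is most likely to fail.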
The effectiveness of ORTrack and ORTrack-D was evaluated in extensive experiments on several benchmarks. The results show that the method matches or surpasses existing state-of-the-art trackers and markedly improves robustness against occlusions, underscoring its potential for real-world drone applications.
The presented research opens exciting perspectives for the development of robust tracking algorithms. Future work could, for example, optimize the masking strategies or integrate additional sensor data to improve robustness in even more complex scenarios. Combining ViTs with occlusion-robust learning methods thus promises an important contribution to reliable and efficient real-time drone tracking.