Autonomous driving is a complex field of research that has made significant progress in recent years. One promising approach is end-to-end learning, in which a model learns to control a vehicle directly from sensor data. Diffusion models play a key role here: their ability to model multimodal action distributions makes them increasingly important. A recent research paper introduces DiffusionDrive, a new model that brings the advantages of diffusion models to autonomous driving.
Conventional end-to-end models for autonomous driving are often based on regression methods that predict only a single trajectory. This, however, does not account for the uncertainty and multimodality of driving behavior in real-world traffic situations. Diffusion models offer an alternative here, as they can generate various plausible trajectories. However, previous diffusion models require many denoising steps, resulting in high computational costs and thus posing a challenge for real-time applications.
DiffusionDrive addresses precisely these challenges. The model aims to leverage the benefits of diffusion models while reducing the computational cost to ensure real-time capability.
DiffusionDrive is based on a novel, truncated diffusion approach. Instead of starting the denoising process from pure random Gaussian noise, DiffusionDrive uses predefined trajectory anchors that represent typical driving patterns. Gaussian noise with low variance is added around these anchors, yielding a so-called anchored Gaussian distribution from which candidate trajectories are then generated.
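The following minimal sketch illustrates the idea of sampling from such an anchored Gaussian distribution in PyTorch. The tensor shapes, the noise scale `sigma`, and the function name are illustrative assumptions, not details taken from the paper.

```python
import torch

def sample_anchored_trajectories(anchors: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Add low-variance Gaussian noise around each anchor trajectory.

    anchors: (num_anchors, T, 2) tensor of typical driving patterns.
    Returns one noisy candidate trajectory per anchor, same shape.
    """
    noise = torch.randn_like(anchors) * sigma  # low variance keeps samples close to the anchors
    return anchors + noise

# Example: 20 anchors with 8 future (x, y) waypoints each
anchors = torch.zeros(20, 8, 2)  # placeholder anchors (e.g., cluster centers over training trajectories)
candidates = sample_anchored_trajectories(anchors, sigma=0.1)
```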
By using anchor points and shortening the diffusion process, the number of required denoising steps can be significantly reduced. Compared to conventional diffusion models, DiffusionDrive requires only two steps, enabling a substantial acceleration.
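A truncated denoising loop along these lines could look as follows. This is only a sketch: it assumes a `model` callable that takes the current trajectories, scene features, and a timestep and returns refined trajectories, and the two timestep values are placeholders rather than the paper's schedule.

```python
import torch

@torch.no_grad()
def truncated_denoise(model, trajectories, scene_features, timesteps=(0.5, 0.0)):
    """Refine anchored trajectory samples in a fixed, small number of steps."""
    traj = trajectories
    for t in timesteps:  # only two steps instead of the tens used by standard diffusion policies
        traj = model(traj, scene_features, t)  # each step predicts cleaner trajectories
    return traj
```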
Additionally, DiffusionDrive uses an efficient, transformer-based decoder that improves interaction with contextual information from the environment. Through a cascading mechanism, the trajectory reconstruction is iteratively refined in each denoising step.
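To make the decoder idea concrete, here is a hypothetical refinement stage in which trajectory queries cross-attend to scene context and predict waypoint corrections. The module structure, dimensions, and names are assumptions for illustration and do not reproduce the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CascadeRefineLayer(nn.Module):
    """One refinement stage: trajectory queries cross-attend to scene context."""

    def __init__(self, dim: int = 256, num_waypoints: int = 8):
        super().__init__()
        self.traj_embed = nn.Linear(num_waypoints * 2, dim)    # embed flattened (x, y) waypoints
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.offset_head = nn.Linear(dim, num_waypoints * 2)   # predict per-waypoint corrections

    def forward(self, traj: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        # traj: (B, N, T, 2) candidate trajectories; scene: (B, S, dim) context tokens
        b, n, t, _ = traj.shape
        queries = self.traj_embed(traj.flatten(2))             # (B, N, dim) trajectory queries
        context, _ = self.cross_attn(queries, scene, scene)    # interact with environment features
        offsets = self.offset_head(context).view(b, n, t, 2)
        return traj + offsets                                  # refined trajectories
```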
DiffusionDrive was evaluated on NAVSIM, a benchmark for planning tasks in autonomous driving. With a ResNet-34 backbone, DiffusionDrive achieves a PDMS score of 88.1, surpassing the previous state of the art. At the same time, the model runs in real time at 45 FPS on an NVIDIA RTX 4090.
Further experiments on the nuScenes dataset confirm DiffusionDrive's performance. Compared to VAD, another end-to-end model, DiffusionDrive achieves a 20.8% lower L2 error and a 63.6% lower collision rate while running 1.8 times faster.
DiffusionDrive represents a promising approach for autonomous driving. By combining diffusion models with anchor points and an efficient decoder, it manages to both improve the quality of the generated trajectories and ensure real-time capability. The results on the NAVSIM and nuScenes datasets demonstrate the potential of this approach and suggest that DiffusionDrive can make a significant contribution to the further development of autonomous driving.