Diffusion Framework with Cross-Modality Fusion Perception for Autonomous Driving in Urban Traffic

2026-99-0749

5/15/2026

Abstract
End-to-end autonomous driving in urban environments faces three core challenges. First, the heterogeneity of camera and LiDAR sensors causes cross-modal perception inconsistencies and unstable sensor fusion. Second, diffusion models suffer from training instability caused by scale variance and distribution changes, which limits generalization. Third, traditional trajectory decoders lack structured interaction with semantic elements, which undermines planning rationality. To address these issues, CMFPNet provides an integrated framework with three key modules. The HGCF-Backbone integrates LiDAR and camera features using channel focus, deformable cross-focus, and state space modeling to enhance semantic alignment. The NST module maps physical trajectories to a normalized space and employs truncated diffusion sampling for stable generation in just 2–4 steps. The NDA models trajectory generation as a semantic narrative, using a six-stage semantic attention flow that incorporates BEV context, interactive dynamics, and self-states. Experiments on the NAVSIM dataset show that CMFPNet outperforms existing baselines, with strong generalization and trajectory stability in challenging scenarios. Notably, the truncated sampling strategy achieves an 8–10× acceleration during inference while maintaining decision accuracy and reducing computational costs. CMFPNet provides a scalable, semantically consistent solution for diffusion-based autonomous driving with significant potential in both research and practical deployment.
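The abstract does not give the NST formulation, so the following is only a minimal sketch of the general idea of normalizing a trajectory and then refining it with a short, truncated denoising loop. Everything in it is an assumption made for illustration: the min-max mapping to [-1, 1], the bounds `lo` and `hi`, the choice to start from a lightly noised anchor trajectory, the `t_start` noise level, and the `denoise_fn` stand-in for a trained network. None of these names or choices come from the paper.

```python
import numpy as np

def normalize(traj, lo, hi):
    """Map physical waypoints (e.g. metres) to [-1, 1] per coordinate (assumed scheme)."""
    return 2.0 * (traj - lo) / (hi - lo) - 1.0

def denormalize(z, lo, hi):
    """Inverse of normalize: map [-1, 1] back to physical units."""
    return (z + 1.0) / 2.0 * (hi - lo) + lo

def truncated_sample(anchor, denoise_fn, lo, hi, steps=2, t_start=0.1, seed=0):
    """Run a short denoising loop in normalized space.

    anchor     : (N, 2) initial trajectory in physical units (hypothetical input)
    denoise_fn : callable (z, t) -> z, a placeholder for a trained denoiser
    steps      : number of denoising steps (the abstract reports 2-4)
    t_start    : assumed starting noise level, well below a full schedule
    """
    rng = np.random.default_rng(seed)
    z = normalize(anchor, lo, hi)
    # Truncation (assumed form): begin from a lightly perturbed trajectory
    # rather than from pure Gaussian noise.
    z = z + t_start * rng.standard_normal(z.shape)
    # Visit `steps` noise levels from t_start down towards zero.
    for t in np.linspace(t_start, 0.0, steps + 1)[:-1]:
        z = denoise_fn(z, t)
        z = np.clip(z, -1.0, 1.0)  # keep samples inside the normalized range
    return denormalize(z, lo, hi)

# Usage with an identity "denoiser" as a stand-in for a real model.
lo = np.array([-10.0, -5.0])
hi = np.array([60.0, 5.0])
anchor = np.array([[0.0, 0.0], [25.0, 1.0], [60.0, 5.0]])
refined = truncated_sample(anchor, lambda z, t: z, lo, hi, steps=2)
```

Working in a fixed normalized range keeps the noise scale comparable across coordinates with very different physical extents, which is one plausible reading of the scale-variance problem the abstract mentions. With a real model, `denoise_fn` would be conditioned on the fused scene features.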
DOI
https://doi.org/10.4271/2026-99-0749
Citation
Qu, Y. and Mo, H., "Diffusion Framework with Cross-Modality Fusion Perception for Autonomous Driving in Urban Traffic," International Conference on the New Energy and Intelligent Vehicles, Hefei, China, November 2, 2025, https://doi.org/10.4271/2026-99-0749.
Additional Details
Product Code
2026-99-0749
Content Type
Technical Paper
Language
English