End-to-end autonomous driving in urban environments faces three core challenges. First, heterogeneity between camera and LiDAR sensors causes cross-modal perception inconsistencies and unstable sensor fusion. Second, diffusion models suffer from training instability due to scale variance and distribution shift, which limits generalization. Third, traditional trajectory decoders lack structured interaction with semantic elements, undermining planning rationality. To address these issues, CMFPNet introduces an integrated framework with three key modules. The HGCF-Backbone fuses LiDAR and camera features via channel focus, deformable cross-focus, and state-space modeling to strengthen semantic alignment. The NST module maps physical trajectories into a normalized space and employs truncated diffusion sampling for stable generation in only 2–4 denoising steps. The NDA models trajectory generation as a semantic narrative, using a six-stage semantic attention flow over BEV context, interactive agent dynamics, and the ego state. Experiments on the NAVSIM dataset demonstrate CMFPNet's superiority over existing baselines, with strong generalization and trajectory stability in challenging scenarios. Notably, the truncated sampling strategy yields an 8–10× inference speedup while preserving decision accuracy and reducing computational cost. CMFPNet thus offers a scalable, semantically consistent solution for diffusion-based autonomous driving, with significant potential for both research and practical deployment.
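The abstract names channel focus as one of the HGCF-Backbone's fusion mechanisms but does not define it. The PyTorch sketch below shows one plausible reading, assuming BEV-aligned camera and LiDAR feature maps of equal shape and a squeeze-and-excitation style channel gate over their concatenation; the class name, shapes, and reduction ratio are illustrative assumptions, and the deformable cross-focus and state-space components are omitted.

```python
import torch
import torch.nn as nn

class ChannelFocusFusion(nn.Module):
    """Hypothetical sketch of a 'channel focus' fusion step (not the paper's code).

    Camera and LiDAR BEV features are concatenated, reweighted per channel by a
    squeeze-and-excitation style gate, and projected back to `channels` dims.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                # squeeze: global channel statistics
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),                                           # excitation: per-channel weights in (0, 1)
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([cam_feat, lidar_feat], dim=1)  # (B, 2C, H, W)
        return self.proj(x * self.gate(x))            # gated fusion back to (B, C, H, W)

# Usage: fuse = ChannelFocusFusion(64); out = fuse(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
```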
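To make the 2–4 step claim concrete, here is a minimal sketch of few-step, DDIM-style sampling over a truncated timestep schedule, combined with the normalized-space mapping the NST module is described to perform. The schedule, the clean-sample (x0) prediction parameterization, and all function names are assumptions rather than the paper's implementation.

```python
import torch

def normalize_traj(traj: torch.Tensor, scale: float) -> torch.Tensor:
    # Map physical trajectories (meters) into a normalized space; `scale` is a
    # hypothetical bound on waypoint coordinates. Invert with traj_norm * scale.
    return traj / scale

@torch.no_grad()
def truncated_ddim_sample(denoiser, cond, traj_shape, alphas_cumprod,
                          steps=(600, 400, 200, 0)):
    """Few-step (here 4-step) deterministic DDIM-style sampling, illustrative only.

    `denoiser(x_t, t, cond)` is assumed to predict the clean trajectory x0.
    A proper truncated scheme would noise a prior/anchor trajectory to the start
    timestep; pure noise is used here only to keep the sketch self-contained.
    """
    x = torch.randn(traj_shape)
    for i, t in enumerate(steps):
        a_t = alphas_cumprod[t]
        x0_hat = denoiser(x, torch.tensor([t]), cond)               # predicted clean sample
        if i + 1 < len(steps):
            a_next = alphas_cumprod[steps[i + 1]]
            eps = (x - a_t.sqrt() * x0_hat) / (1 - a_t).sqrt()      # implied noise
            x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps  # DDIM update (eta = 0)
        else:
            x = x0_hat
    return x

# Example with a placeholder network standing in for the trained denoiser:
T = 1000
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
dummy = lambda x, t, c: 0.5 * x
traj_norm = truncated_ddim_sample(dummy, None, (1, 8, 2), alphas_cumprod)
```

Visiting only a handful of intermediate timesteps instead of the full schedule is what would yield the reported 8–10× inference speedup, since cost scales with the number of denoiser evaluations.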
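The six-stage semantic attention flow of the NDA is not specified beyond its inputs; the sketch below illustrates the general pattern with three representative cross-attention stages (BEV context, agent dynamics, ego state) successively refining trajectory queries. The stage count, ordering, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class NarrativeDecoderSketch(nn.Module):
    """Loose sketch of a staged semantic attention flow (three of six stages shown)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.bev_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.agent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ego_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)  # per-waypoint (x, y)

    def forward(self, queries, bev_tokens, agent_tokens, ego_token):
        # Each stage attends to one semantic source, in "narrative" order:
        # scene context first, then interacting agents, then the ego state.
        q, _ = self.bev_attn(queries, bev_tokens, bev_tokens)
        q, _ = self.agent_attn(q, agent_tokens, agent_tokens)
        q, _ = self.ego_attn(q, ego_token, ego_token)
        return self.head(q)  # (B, num_waypoints, 2) trajectory

# Usage: dec = NarrativeDecoderSketch(); out = dec(torch.randn(1, 8, 256),
#            torch.randn(1, 100, 256), torch.randn(1, 10, 256), torch.randn(1, 1, 256))
```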