End-to-end autonomous driving in urban environments faces three core challenges. First, heterogeneity between camera and LiDAR sensors causes cross-modal perception inconsistencies and unstable sensor fusion. Second, diffusion models suffer from training instability due to scale variance and distribution shift, which limits generalization. Third, traditional trajectory decoders lack structured interaction with semantic elements, undermining planning rationality. To address these issues, CMFPNet introduces an integrated framework with three key modules. The HGCF-Backbone fuses LiDAR and camera features via channel focus, deformable cross-focus, and state-space modeling to strengthen semantic alignment. The NST module maps physical trajectories into a normalized space and employs truncated diffusion sampling for stable generation in only 2–4 denoising steps. The NDA models trajectory generation as a semantic narrative, using a six-stage semantic attention flow over BEV context, interactive agent dynamics, and the ego state. Experiments on the NAVSIM dataset demonstrate CMFPNet's superiority over existing baselines, with strong generalization and trajectory stability in challenging scenarios. Notably, the truncated sampling strategy yields an 8–10× inference speedup while preserving decision accuracy and reducing computational cost. CMFPNet thus offers a scalable, semantically consistent solution for diffusion-based autonomous driving, with significant potential for both research and practical deployment.
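The abstract names channel focus as one of the HGCF-Backbone's fusion mechanisms but does not define it. The PyTorch sketch below shows one plausible reading, assuming BEV-aligned camera and LiDAR feature maps of equal shape and a squeeze-and-excitation style channel gate over their concatenation; the class name, shapes, and reduction ratio are illustrative assumptions, and the deformable cross-focus and state-space components are omitted.

```python
import torch
import torch.nn as nn

class ChannelFocusFusion(nn.Module):
    """Hypothetical sketch of a 'channel focus' fusion step (not the paper's code).

    Camera and LiDAR BEV features are concatenated, reweighted per channel by a
    squeeze-and-excitation style gate, and projected back to `channels` dims.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                # squeeze: global channel statistics
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),                                           # excitation: per-channel weights in (0, 1)
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([cam_feat, lidar_feat], dim=1)  # (B, 2C, H, W)
        return self.proj(x * self.gate(x))            # gated fusion back to (B, C, H, W)

# Usage: fuse = ChannelFocusFusion(64); out = fuse(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
```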
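To make the 2–4 step claim concrete, here is a minimal sketch of few-step, DDIM-style sampling over a truncated timestep schedule, combined with the normalized-space mapping the NST module is described to perform. The schedule, the clean-sample (x0) prediction parameterization, and all function names are assumptions rather than the paper's implementation.

```python
import torch

def normalize_traj(traj: torch.Tensor, scale: float) -> torch.Tensor:
    # Map physical trajectories (meters) into a normalized space; `scale` is a
    # hypothetical bound on waypoint coordinates. Invert with traj_norm * scale.
    return traj / scale

@torch.no_grad()
def truncated_ddim_sample(denoiser, cond, traj_shape, alphas_cumprod,
                          steps=(600, 400, 200, 0)):
    """Few-step (here 4-step) deterministic DDIM-style sampling, illustrative only.

    `denoiser(x_t, t, cond)` is assumed to predict the clean trajectory x0.
    A proper truncated scheme would noise a prior/anchor trajectory to the start
    timestep; pure noise is used here only to keep the sketch self-contained.
    """
    x = torch.randn(traj_shape)
    for i, t in enumerate(steps):
        a_t = alphas_cumprod[t]
        x0_hat = denoiser(x, torch.tensor([t]), cond)               # predicted clean sample
        if i + 1 < len(steps):
            a_next = alphas_cumprod[steps[i + 1]]
            eps = (x - a_t.sqrt() * x0_hat) / (1 - a_t).sqrt()      # implied noise
            x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps  # DDIM update (eta = 0)
        else:
            x = x0_hat
    return x

# Example with a placeholder network standing in for the trained denoiser:
T = 1000
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
dummy = lambda x, t, c: 0.5 * x
traj_norm = truncated_ddim_sample(dummy, None, (1, 8, 2), alphas_cumprod)
```

Visiting only a handful of intermediate timesteps instead of the full schedule is what would yield the reported 8–10× inference speedup, since cost scales with the number of denoiser evaluations.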
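The six-stage semantic attention flow of the NDA is not specified beyond its inputs; the sketch below illustrates the general pattern with three representative cross-attention stages (BEV context, agent dynamics, ego state) successively refining trajectory queries. The stage count, ordering, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class NarrativeDecoderSketch(nn.Module):
    """Loose sketch of a staged semantic attention flow (three of six stages shown)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.bev_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.agent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ego_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)  # per-waypoint (x, y)

    def forward(self, queries, bev_tokens, agent_tokens, ego_token):
        # Each stage attends to one semantic source, in "narrative" order:
        # scene context first, then interacting agents, then the ego state.
        q, _ = self.bev_attn(queries, bev_tokens, bev_tokens)
        q, _ = self.agent_attn(q, agent_tokens, agent_tokens)
        q, _ = self.ego_attn(q, ego_token, ego_token)
        return self.head(q)  # (B, num_waypoints, 2) trajectory

# Usage: dec = NarrativeDecoderSketch(); out = dec(torch.randn(1, 8, 256),
#            torch.randn(1, 100, 256), torch.randn(1, 10, 256), torch.randn(1, 1, 256))
```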