Bird's-Eye-View (BEV) perception has become pivotal in autonomous driving for 3D object detection, enabling the transformation of multi-view image features into a unified BEV representation. However, purely visual methods still struggle with depth estimation and temporal consistency, particularly in modeling long-term dependencies and avoiding trajectory confusion in dynamic scenes. Knowledge distillation (KD) offers a promising solution by transferring rich knowledge from a powerful teacher model to an efficient student model, enhancing performance without incurring significant computational overhead.
This paper aims to enhance the temporal reasoning and geometric understanding of the StreamPETR model—an object-centric, streaming-based 3D detector—via a multi-level distillation strategy.
1. Temporal Relation Distillation: The self-attention mechanisms of the teacher model are distilled to impart temporal relational knowledge, enabling the student to associate object states across time, leverage historical context for stable detection, and avoid trajectory confusion (deduplication). This structural knowledge distillation proves more effective than merely imitating spatial attention distributions, as it captures how the teacher models relationships between objects over time.
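One common way to distill relational knowledge carried by attention is to match the student's attention distribution over object queries to the teacher's with a temperature-scaled KL divergence. The sketch below is a minimal, hypothetical version of such a loss (the tensor layout and temperature scaling are our assumptions, not details given in the text):

```python
import torch
import torch.nn.functional as F

def temporal_attention_distill_loss(student_attn, teacher_attn, tau=1.0):
    """KL divergence between teacher and student self-attention
    distributions over temporally propagated object queries.

    student_attn, teacher_attn: (B, heads, Q, K) raw attention logits.
    This is an illustrative sketch, not the paper's exact loss.
    """
    s = F.log_softmax(student_attn / tau, dim=-1)
    t = F.softmax(teacher_attn / tau, dim=-1)
    # batchmean KL, scaled by tau^2 as in standard temperature KD
    return F.kl_div(s, t, reduction="batchmean") * tau * tau

# usage sketch with random logits
B, H, Q = 2, 8, 100
s_logits = torch.randn(B, H, Q, Q)
t_logits = torch.randn(B, H, Q, Q)
loss = temporal_attention_distill_loss(s_logits, t_logits)
```

Matching distributions rather than raw features is what makes this a structural loss: it constrains *which* past objects the student attends to, not the feature values themselves.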
2. Advanced Semantic Distillation: To circumvent misalignment challenges in direct cross-attention distillation (caused by disparities in feature extraction), higher-level semantic information is extracted from the teacher. This includes statistical measures (e.g., mean, variance) of features corresponding to regions highlighted by the teacher's cross-attention. A Region Decomposition Mask distinguishes foreground, background, and false-positive areas, while Spatial Attention Maps (generated via channel-wise norm aggregation) guide the student to focus on critically important regions.
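The two ingredients above can be sketched concretely: per-region feature statistics (mean and variance under a binary region mask) and a spatial attention map obtained by aggregating the channel-wise L2 norm. Both helpers below are hypothetical illustrations of those ideas, with tensor shapes assumed for clarity:

```python
import torch

def masked_feature_stats(feat, mask, eps=1e-6):
    """Per-channel mean/variance of a feature map inside a region mask.

    feat: (B, C, H, W) features; mask: (B, 1, H, W) binary region mask
    (e.g., foreground from a Region Decomposition Mask). Sketch only.
    """
    area = mask.sum(dim=(2, 3)).clamp_min(eps)                    # (B, 1)
    mean = (feat * mask).sum(dim=(2, 3)) / area                   # (B, C)
    var = ((feat - mean[..., None, None]) ** 2 * mask).sum(dim=(2, 3)) / area
    return mean, var

def spatial_attention_map(feat):
    """Channel-wise L2-norm aggregation -> (B, 1, H, W) attention map."""
    return feat.norm(p=2, dim=1, keepdim=True)

# usage sketch: uniform features inside a full mask
feat = torch.ones(1, 4, 8, 8)
mask = torch.ones(1, 1, 8, 8)
mean, var = masked_feature_stats(feat, mask)
attn = spatial_attention_map(feat)
```

Distilling these statistics instead of raw activations sidesteps the pixel-level misalignment between teacher and student feature extractors.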
3. 3D-Aware Feature Rendering: The teacher model constructs a 3D scene representation using 3D Gaussians, where each Gaussian point encapsulates color, position, and a low-dimensional feature vector. This representation enables the rendering of high-resolution, 3D-consistent feature maps from arbitrary viewpoints. These features serve as distillation targets via an L1 loss, enhancing the student's 2D backbone in geometric and depth understanding, thereby improving performance in tasks such as depth estimation and segmentation.
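Once the teacher has rendered a 3D-consistent feature map for a given camera view, the distillation step itself is a plain L1 match against the student's 2D backbone features. A minimal sketch (the resolution alignment via bilinear interpolation is an assumption; any channel projection is presumed to happen elsewhere):

```python
import torch
import torch.nn.functional as F

def rendered_feature_distill_loss(student_feat, rendered_feat):
    """L1 distillation loss between student backbone features and a
    feature map rendered from the teacher's 3D Gaussian scene.

    student_feat, rendered_feat: (B, C, H, W); spatial sizes may differ.
    Illustrative sketch, not the paper's exact implementation.
    """
    if student_feat.shape[-2:] != rendered_feat.shape[-2:]:
        # align resolutions before comparing (assumed design choice)
        student_feat = F.interpolate(
            student_feat, size=rendered_feat.shape[-2:],
            mode="bilinear", align_corners=False)
    return F.l1_loss(student_feat, rendered_feat)

# usage sketch: student features at lower resolution than the render
student = torch.randn(1, 16, 8, 8)
rendered = torch.randn(1, 16, 16, 16)
loss = rendered_feature_distill_loss(student, rendered)
```

Because the rendered targets are geometrically consistent across viewpoints, matching them pushes the 2D backbone toward depth-aware features without an explicit depth head.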
Experiments on the nuScenes dataset demonstrate that the approach enhances robustness in temporal reasoning and geometric perception, validating the efficacy of distilling temporal and structural knowledge.
This work presents a novel distillation paradigm for streaming-based visual BEV models, emphasizing temporal relational knowledge and advanced semantic distillation.