Sparse BEV 3D Object Detection with Perspective Supervision

2025-01-8019

To be published on April 1, 2025

Event
WCX SAE World Congress Experience
Abstract
Object detection is a fundamental task in autonomous driving perception systems. This paper addresses the task with a novel sparse bird's-eye-view (BEV) object detector trained with perspective supervision.

Existing BEV object detectors typically build on modern 2D image backbones. We further explore pre-trained backbones to adapt general-purpose 2D image backbones to BEV detection; with modern image backbones, inference accuracy can rival certain depth-pretrained backbones, unleashing the full potential of these networks. Many advanced dense BEV detectors are limited in model flexibility: temporal feature fusion requires stacking historical features, which consumes substantial computational resources, and modeling moving objects demands a large receptive field. Sparse BEV detectors, in turn, can converge slowly because of their sparsity. We aim for BEV detectors that model the temporal and spatial domains more flexibly and converge faster. Existing BEV detectors also generalize poorly to camera parameters, and applying large-scale augmentation to camera parameters during training can significantly slow convergence. We want BEV detectors to better exploit camera extrinsics to improve both detection accuracy and generalization.

We therefore propose a sparse BEV object detector with perspective supervision. First, we design perspective-view supervision tailored to 2D image backbones to ease their adaptation to 3D object detection, significantly improving detection performance. Next, we adopt an anchor-based sparse BEV detection scheme: the prior information carried by anchors makes sparse detection more flexible in temporal fusion. By decoupling each target's high-dimensional features from its anchor, temporal propagation only needs to update the anchor state while the target's features remain unchanged. This design achieves recursive multi-frame temporal fusion by passing sparse features frame by frame, significantly improving inference speed and memory efficiency. We then use the 3D bounding boxes from the perspective detector as accurate position and pose priors to initialize BEV detection, further accelerating model convergence. Finally, we design a camera-extrinsic Fourier encoding module that maps camera extrinsics into high-dimensional feature vectors, which are fused with each target's high-dimensional features. Combined with image and output-coordinate augmentation during training, this markedly improves generalization across camera extrinsics and yields higher perception metrics. The resulting architecture achieves high accuracy and efficiency on the nuScenes benchmark.
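The decoupling described above can be illustrated with a minimal sketch: only the anchor states are advanced to the next frame (here with a simple constant-velocity motion model), while the per-object high-dimensional features are carried over untouched. The anchor field layout and the motion model are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def propagate_anchors(anchors, velocities, dt):
    """Advance only the anchor states (center x, y, z) to the next frame;
    the per-object high-dimensional features are reused unchanged."""
    moved = anchors.copy()
    moved[:, :3] += velocities * dt   # shift box centers by v * dt
    return moved

# One object: anchor = (x, y, z, l, w, h, yaw), assumed layout.
anchors = np.array([[10.0, 2.0, 0.5, 4.5, 1.9, 1.6, 0.0]])
velocities = np.array([[5.0, 0.0, 0.0]])   # m/s along x
features = np.zeros((1, 256))              # carried across frames as-is

next_anchors = propagate_anchors(anchors, velocities, dt=0.5)
# next_anchors[0, 0] -> 12.5 (center moved 2.5 m); features untouched
```

Because only the low-dimensional anchor state is rewritten each frame, recursive multi-frame fusion avoids stacking historical feature maps, which is the source of the speed and memory gains claimed above.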
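The camera-extrinsic Fourier encoding can be sketched as a standard sinusoidal positional encoding applied to the flattened extrinsic matrix. The band count, frequency schedule, and use of the flattened 3x4 [R|t] matrix are assumptions for illustration; the paper's exact module may differ.

```python
import numpy as np

def fourier_encode_extrinsics(extrinsic, num_bands=8):
    """Map a camera extrinsic matrix to a high-dimensional feature vector
    via sinusoidal (Fourier) encoding of its flattened entries."""
    x = np.asarray(extrinsic, dtype=np.float64).reshape(-1)  # 12 values for 3x4
    freqs = 2.0 ** np.arange(num_bands)                      # 1, 2, 4, ..., 128
    angles = np.pi * x[:, None] * freqs[None, :]             # (12, num_bands)
    # Concatenate sin and cos responses into one flat feature vector.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

# A 3x4 extrinsic yields a 12 * 2 * num_bands = 192-dim encoding,
# which would then be fused with each target's high-dimensional features.
feat = fourier_encode_extrinsics(np.eye(3, 4), num_bands=8)
```

Encoding the extrinsics as smooth high-frequency features, rather than feeding raw matrix entries, is what lets the detector interpolate across the camera poses seen under augmentation.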
Citation
Yan, Y., "Sparse BEV 3D Object Detection with Perspective Supervision," SAE Technical Paper 2025-01-8019, 2025.
Additional Details
Published: To be published on Apr 1, 2025
Product Code: 2025-01-8019
Content Type: Technical Paper
Language: English