Weakly Supervised Monocular 3D Detection with a Single-View Image

1 S-Lab, Nanyang Technological University, Singapore
2 SenseTime Research, China
3 College of Computer Science and Technology, Zhejiang University of Technology, China
*Corresponding author.

CVPR 2024

Different paradigms in weakly supervised monocular 3D detection. Our approach in (c) leverages Pseudo Depth Labels from a single-view image to achieve weakly supervised monocular 3D detection, requiring no extra training data such as LiDAR point clouds or multi-view images as in (a) and (b), which greatly improves its usability and applicability. The Pseudo Depth Labels are obtained with an off-the-shelf depth estimator, without extra training or ground-truth depth labels.

Abstract

Monocular 3D detection (M3D) aims at precise 3D object localization from a single-view image, which usually involves labor-intensive annotation of 3D detection boxes. Weakly supervised M3D has recently been studied to obviate the 3D annotation process by leveraging widely available 2D annotations, but it often requires extra training data such as LiDAR point clouds or multi-view images, which greatly degrades its applicability and usability. We propose SKD-WM3D, a weakly supervised monocular 3D detection framework that exploits depth information to achieve M3D from a single-view image exclusively, without any 3D annotations or other training data. One key design in SKD-WM3D is a self-knowledge distillation framework that transforms image features into 3D-like representations by fusing depth information, effectively mitigating the inherent depth ambiguity of monocular scenarios with little computational overhead at inference. In addition, we design an uncertainty-aware distillation loss and a gradient-targeted transfer modulation strategy, which facilitate knowledge acquisition and knowledge transfer, respectively. Extensive experiments show that SKD-WM3D clearly surpasses the state of the art and is even on par with many fully supervised methods.

Method Overview

The framework of the proposed self-knowledge distillation network, which consists of a depth-guided self-teaching network and a monocular 3D detection network. The depth-guided self-teaching network acquires comprehensive 3D localization knowledge by leveraging depth information and transfers its learned expertise to the monocular 3D detection network via soft-label distillation. An uncertainty-aware distillation loss and a gradient-targeted transfer modulation strategy facilitate effective knowledge transfer between the two networks. During inference, the monocular 3D detection network extracts intrinsic depth information from single-view images independently, with little computational overhead.
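The paper does not spell out the exact form of the uncertainty-aware distillation loss here; the sketch below shows one common way such a loss is structured, where a per-sample KL distillation term between teacher and student distributions is down-weighted by a predicted log-variance (a learned-uncertainty weighting in the style of Kendall et al.). The function names and the `log_vars` parameterization are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def uncertainty_aware_distill_loss(student_logits, teacher_logits, log_vars):
    """Hypothetical uncertainty-weighted distillation loss.

    Each sample's KL term is scaled by exp(-log_var), so the student
    is penalized less where its predicted uncertainty is high; the
    +log_var regularizer prevents inflating uncertainty everywhere.
    """
    total = 0.0
    for s, t, lv in zip(student_logits, teacher_logits, log_vars):
        kl = kl_divergence(softmax(t), softmax(s))
        total += math.exp(-lv) * kl + lv
    return total / len(log_vars)
```

With `log_var = 0` the loss reduces to the plain mean KL distillation term, so the uncertainty weighting is a strict generalization of standard soft-label distillation.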

Detection Visualization

Qualitative results on the KITTI val set. Red boxes denote ground-truth annotations and green boxes denote our predictions. Ground-truth LiDAR point clouds are used for visualization purposes only. Best viewed zoomed in.

BibTeX


@inproceedings{jiang2024weakly,
  title={Weakly Supervised Monocular 3D Detection with a Single-View Image},
  author={Jiang, Xueying and Jin, Sheng and Lu, Lewei and Zhang, Xiaoqin and Lu, Shijian},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}