Camera pose estimation is a long-standing computer vision problem that to date often relies on classical methods, such as handcrafted keypoint matching, RANSAC and bundle adjustment. In this paper, we propose to formulate the Structure from Motion (SfM) problem inside a probabilistic diffusion framework, modelling the conditional distribution of camera poses given input images. This novel view of an old problem has several advantages. (i) The nature of the diffusion framework mirrors the iterative procedure of bundle adjustment. (ii) The formulation allows a seamless integration of geometric constraints from epipolar geometry. (iii) It excels in typically difficult scenarios such as sparse views with wide baselines. (iv) The method can predict intrinsics and extrinsics for an arbitrary number of images. We demonstrate that our method significantly improves over classic SfM pipelines and learned approaches on two real-world datasets. Finally, we observe that our method can generalize across datasets without further training.
We propose to formulate the Structure from Motion (SfM) problem inside a probabilistic diffusion framework. Training is supervised: given a multi-view dataset of images and camera poses, we learn a diffusion model \(D_\theta\) to model \(p(x | I)\), the distribution of camera poses \(x\) conditioned on the input images \(I\). During inference, the reverse diffusion process is guided by optimizing the geometric consistency between poses, measured by the Sampson epipolar error.
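To make the geometric consistency term concrete, the following is a minimal NumPy sketch of the Sampson epipolar error for a pair of views. It is an illustration under stated assumptions, not the authors' implementation: the function names (`skew`, `fundamental_from_pose`, `sampson_error`, `project`), the pinhole camera model, and the convention \(X_2 = R X_1 + t\) for the relative pose are all choices made here for the example. How this error is weighted and used to guide the reverse diffusion steps is not specified on this page and is not shown.

```python
import numpy as np


def skew(t):
    """Return the 3x3 cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])


def fundamental_from_pose(K1, K2, R, t):
    """Fundamental matrix F with p2^T F p1 = 0, assuming X2 = R @ X1 + t (hypothetical convention)."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)


def sampson_error(F, x1, x2):
    """Per-correspondence Sampson epipolar error.

    x1, x2: (N, 2) arrays of matched pixel coordinates in view 1 and view 2.
    Returns an (N,) array; values near zero mean the match agrees with F.
    """
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])  # homogeneous points, view 1
    p2 = np.hstack([x2, ones])  # homogeneous points, view 2
    Fp1 = p1 @ F.T              # row i is F @ p1_i
    Ftp2 = p2 @ F               # row i is F^T @ p2_i
    num = np.sum(p2 * Fp1, axis=1) ** 2
    den = Fp1[:, 0] ** 2 + Fp1[:, 1] ** 2 + Ftp2[:, 0] ** 2 + Ftp2[:, 1] ** 2
    return num / den


def project(K, R, t, X):
    """Project (N, 3) world points into a pinhole camera with pose (R, t)."""
    Xc = X @ R.T + t
    uv = Xc @ K.T
    return uv[:, :2] / uv[:, 2:3]


# Small synthetic check: points in front of two cameras with a known relative pose.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 3)) + np.array([0.0, 0.0, 5.0])
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.2  # rotation about the y-axis, in radians
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.5, 0.0, 0.1])

x1 = project(K, np.eye(3), np.zeros(3), X)
x2 = project(K, R, t, X)

err_true = sampson_error(fundamental_from_pose(K, K, R, t), x1, x2)
err_wrong = sampson_error(fundamental_from_pose(K, K, R, np.array([0.5, 0.3, 0.1])), x1, x2)
print("mean error, true pose:", err_true.mean())
print("mean error, perturbed pose:", err_wrong.mean())
```

With noise-free correspondences, the error under the true relative pose should be zero up to floating-point precision, while a perturbed translation yields a clearly larger value. A pose hypothesis that lowers this error over the available matches is more geometrically consistent, which is the quantity the guidance described above optimizes.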
We provide qualitative samples of pose estimation on the CO3Dv2 dataset. Given input images \(I\) (first row), our PoseDiffusion (second row) is compared to RelPose (third row), COLMAP+SPSG (fourth row), and the ground truth. Missing cameras indicate failure.
We also provide a video showing the results from a sweeping, fly-around viewpoint.
To illustrate the quality of our camera pose estimation, we train NeRF models with the extrinsics and intrinsics predicted by our method, with the results shown below.
One inherent advantage of using a diffusion model for camera pose estimation is its probabilistic nature. It is well known that few-view camera pose estimation is an ambiguous, non-deterministic problem, where multiple pose combinations may all be reasonable for a set of images. We provide a visualization below to verify that our method can provide several reasonable pose sets \(x\) for the same input frames \(I\). The cameras predicted for the same frame are indicated with identical colors.
@InProceedings{wang2023pd,
author = {Jianyuan Wang and Christian Rupprecht and David Novotny},
title = {{PoseDiffusion}: Solving Pose Estimation via Diffusion-aided Bundle Adjustment},
booktitle = {ICCV},
year = {2023}
}
We thank Jason Y. Zhang for generously answering questions and sharing the code for RelPose and the benchmark evaluation. We would also like to thank Nikita Karaev, Luke Melas-Kyriazi, and Shangzhe Wu for insightful discussions.
Jianyuan Wang is supported by Facebook Research. Christian Rupprecht is supported by ERC-CoG UNION 101001212 and VisualAI EP/T028572/1.