PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment

1Visual Geometry Group, University of Oxford, 2Meta AI
ICCV 2023

Given a set of input frames \(I\), the model samples \(p(x|I)\) step by step. Geometry-guided sampling begins at timestep \(t=10\), which corresponds to the notable improvement in prediction quality visible in the video at \(t=10\).

Abstract

Camera pose estimation is a long-standing computer vision problem that to date often relies on classical methods, such as handcrafted keypoint matching, RANSAC and bundle adjustment. In this paper, we propose to formulate the Structure from Motion (SfM) problem inside a probabilistic diffusion framework, modelling the conditional distribution of camera poses given input images. This novel view of an old problem has several advantages. (i) The nature of the diffusion framework mirrors the iterative procedure of bundle adjustment. (ii) The formulation allows a seamless integration of geometric constraints from epipolar geometry. (iii) It excels in typically difficult scenarios such as sparse views with wide baselines. (iv) The method can predict intrinsics and extrinsics for an arbitrary number of images. We demonstrate that our method significantly improves over classic SfM pipelines and learned approaches on two real-world datasets. Finally, we observe that our method generalizes across datasets without further training.

Framework

We propose to formulate the Structure from Motion (SfM) problem inside a probabilistic diffusion framework. Training is supervised: given a multi-view dataset of images and camera poses, we learn a diffusion model \(D_\theta\) that models \(p(x|I)\). During inference, the reverse diffusion process is guided by optimizing the geometric consistency between poses via the Sampson epipolar error.
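To make the guidance concrete, here is a minimal NumPy sketch of the Sampson epipolar error and a geometry-guidance step on a toy two-view scene. The learned denoiser \(D_\theta\), the pose parameterization, and the scheduling details are omitted; the finite-difference gradient and the toy scene are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def skew(t):
    """Cross-product matrix: skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def sampson_error(F, x1, x2):
    """Mean first-order (Sampson) approximation of the epipolar error
    for homogeneous correspondences x1 (N, 3) <-> x2 (N, 3)."""
    Fx1 = x1 @ F.T                      # epipolar lines in image 2
    Ftx2 = x2 @ F                       # epipolar lines in image 1
    num = np.einsum("ni,ni->n", x2, Fx1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return float(np.mean(num / den))

# Toy two-view scene: camera 1 at the origin, camera 2 translated along x,
# both with identity rotation and intrinsics (so F reduces to E = [t]_x).
rng = np.random.default_rng(0)
X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(50, 3))
t_true = np.array([1.0, 0.0, 0.0])
x1 = X / X[:, 2:3]
x2 = (X - t_true) / (X - t_true)[:, 2:3]
E_true = skew(t_true)                   # sampson_error(E_true, x1, x2) ~ 0

def guidance_step(t, lr=0.1, eps=1e-4):
    """One geometry-guidance update: nudge the pose estimate downhill on
    the Sampson error (finite differences stand in for autograd here).
    In PoseDiffusion this gradient is interleaved with the reverse
    diffusion steps of the learned denoiser, which this toy omits."""
    g = np.zeros(3)
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        g[i] = (sampson_error(skew(t + d), x1, x2)
                - sampson_error(skew(t - d), x1, x2)) / (2.0 * eps)
    return t - lr * g

t_est = np.array([0.6, 0.6, 0.2])       # a poor initial pose hypothesis
err_init = sampson_error(skew(t_est), x1, x2)
for _ in range(200):
    t_est = guidance_step(t_est)
err_final = sampson_error(skew(t_est), x1, x2)
```

The key property used at inference time is that the Sampson error is differentiable in the predicted poses, so its gradient can steer each denoising step toward geometrically consistent solutions.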

Teaser image

Qualitative Comparison

We provide qualitative samples of pose estimation on the CO3Dv2 dataset. Given input images \(I\) (first row), our PoseDiffusion (second row) is compared to RelPose (third row), COLMAP+SPSG (fourth row), and the ground truth. Missing cameras indicate failure.

Qualitative Samples on Co3D

We also provide a video showing the same scenes from a sweeping, fly-around viewpoint.


More Visualization of Sampling Iterations


Novel View Synthesis

To illustrate the quality of our camera pose estimation, we train NeRF models with the extrinsics and intrinsics predicted by our method, with the results shown below.


Camera Pose Uncertainty

One inherent advantage of using a diffusion model for camera pose estimation is its probabilistic nature. Few-view camera pose estimation is a well-known ambiguous problem, where multiple pose combinations may all be plausible for a set of images. The visualization below verifies that our method can propose several plausible pose sets \(x\) for the same input frames \(I\). Cameras predicted for the same frame are shown in identical colors.

Uncertainty
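This sampling-based uncertainty arises simply from running the stochastic reverse process under different random seeds. The sketch below mimics that behavior with a stub "denoiser" that pulls samples toward one of two equally plausible pose modes; the stub, the 2-D pose space, and the noise schedule are illustrative stand-ins for the trained \(D_\theta\), not the actual model.

```python
import numpy as np

def sample_pose_hypothesis(seed, n_steps=50):
    """Toy ancestral sampler standing in for drawing x ~ p(x | I).
    A stub 'denoiser' pulls the sample toward the nearer of two equally
    plausible pose modes, mimicking the ambiguity of few-view SfM."""
    rng = np.random.default_rng(seed)
    modes = np.array([[1.0, 0.0], [-1.0, 0.0]])   # two plausible pose sets
    x = rng.normal(size=2)                        # start from pure noise
    for t in range(n_steps, 0, -1):
        nearest = modes[np.argmin(np.linalg.norm(modes - x, axis=1))]
        x = x + 0.2 * (nearest - x)                               # denoise
        x = x + 0.05 * np.sqrt(t / n_steps) * rng.normal(size=2)  # re-noise
    return x

# Different seeds yield different (yet individually plausible) hypotheses.
hypotheses = [sample_pose_hypothesis(seed) for seed in range(8)]
```

Because the reverse process is stochastic, repeating it cheaply enumerates the plausible pose configurations instead of committing to a single point estimate.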

BibTeX

@InProceedings{wang2023pd,
  author    = {Jianyuan Wang and Christian Rupprecht and David Novotny},
  title     = {{PoseDiffusion}: Solving Pose Estimation via Diffusion-aided Bundle Adjustment},
  booktitle = {ICCV},
  year      = {2023}
}

Acknowledgements

We appreciate the great help from Jason Y. Zhang for generously answering questions and sharing the code for RelPose and benchmark evaluation. We would like to thank Nikita Karaev, Luke Melas-Kyriazi, and Shangzhe Wu for insightful discussions.

Jianyuan Wang is supported by Facebook Research. Christian Rupprecht is supported by ERC-CoG UNION 101001212 and VisualAI EP/T028572/1.