Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image

1HKUST, 2UC San Diego
CVPR 2022

Our method synthesizes a consistent long-term 3D scene video from a single image, even when the camera moves outside the room.


Novel view synthesis from a single image has recently attracted a lot of attention, and it has been primarily advanced by 3D deep learning and rendering techniques. However, most work is still limited to synthesizing new views within relatively small camera motions. In this paper, we propose a novel approach to synthesize a consistent long-term video given a single scene image and a trajectory of large camera motions. Our approach utilizes an autoregressive Transformer to perform sequential modeling of multiple frames, reasoning about the relations between the frames and the corresponding cameras to predict the next frame. To facilitate learning and ensure consistency among generated frames, we introduce a locality constraint based on the input cameras to guide self-attention among a large number of patches across space and time. Our method outperforms state-of-the-art view synthesis approaches by a large margin, especially when synthesizing the long-term future in indoor 3D scenes.
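The camera-based locality constraint can be illustrated as an additive bias on the self-attention logits that favors patches at nearby positions. The sketch below is a minimal, hypothetical version: `locality_bias` penalizes attention between patches in proportion to their distance (the paper derives the bias from the input cameras; the distance-proportional penalty and `scale` parameter here are assumptions for illustration).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def locality_bias(coords, scale=1.0):
    # coords: (n_tokens, 2) patch positions across space and time.
    # Penalize attention between distant patches with a
    # distance-proportional negative logit (illustrative choice).
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return -scale * d

def biased_self_attention(q, k, v, bias):
    # Standard scaled dot-product attention with an additive bias term.
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    attn = softmax(scores, axis=-1)
    return attn @ v, attn
```

With identical queries and keys, the bias alone determines the attention pattern, so each patch attends most strongly to its spatial neighbors.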



During training, images and camera transformations are first encoded into modality-specific tokens. The tokens are then fed into an autoregressive Transformer that predicts the image tokens. During inference, given a single image and a camera trajectory, novel views are generated autoregressively by the Transformer.
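The inference loop above can be sketched as follows. This is an illustrative skeleton, not the actual model: token counts (`N_IMG`), the camera/image interleaving, and the `next_token_fn` stub standing in for the Transformer are all assumptions.

```python
N_IMG = 16  # image tokens per frame (assumed size for illustration)

def generate_frame(seq, cam_tokens, next_token_fn):
    """Append one novel view: condition on the running token sequence
    plus the new camera tokens, then predict image tokens one at a time."""
    seq = list(seq) + list(cam_tokens)
    for _ in range(N_IMG):
        seq.append(next_token_fn(seq))
    return seq

def rollout(first_frame_tokens, trajectory, next_token_fn):
    """Given the tokens of the input image and a camera trajectory
    (a list of camera-token groups), generate each novel view
    conditioned on all previously generated frames."""
    seq = list(first_frame_tokens)
    frames = []
    for cam in trajectory:
        seq = generate_frame(seq, cam, next_token_fn)
        frames.append(seq[-N_IMG:])
    return frames
```

Because every generated frame is appended to the sequence, later frames are conditioned on the whole history rather than only on the previous frame, which is what enables long-term consistency.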

Comparison to methods with arbitrary camera motion

We evaluate and compare against previous approaches with arbitrary camera motion on two benchmark datasets: Matterport3D and RealEstate10K (see the paper for details).

Comparison to methods with discrete camera motion

In addition, we compare against methods with discrete camera motion.

Attention Maps

Given a future output image and the input image, we visualize the attention bias on the input image corresponding to a patch in the output image. This visualization indicates which information in the input image contributes most when that patch is synthesized. Most contributions come from patches at nearby locations along the trajectory.
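Such a visualization can be produced by slicing the attention matrix: take the row for the chosen output patch and reshape its weights over the input-image tokens into a 2D heatmap. A minimal sketch, assuming a row-major patch grid and that the input-image tokens occupy the first `h * w` columns:

```python
import numpy as np

def patch_attention_map(attn, patch_idx, grid_hw):
    """Slice the attention row for one output patch and reshape the
    weights over the input-image tokens into a (h, w) heatmap.
    attn: (n_out_tokens, n_in_tokens) attention weights."""
    h, w = grid_hw
    row = attn[patch_idx, :h * w]
    return row.reshape(h, w)
```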

Ablation Study

We report ablations of our method on long-term view synthesis on the Matterport3D dataset. The camera-aware bias and the decoupled positional embedding improve image quality and the consistency between frames. Fine-tuning the model by simulating error accumulation further benefits long-term view synthesis.
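Simulating error accumulation during fine-tuning can be sketched as a scheduled-sampling-style loop: with some probability, a past ground-truth frame in the conditioning context is replaced by the model's own prediction, so training matches the test-time setting where errors compound. The replacement probability `p_replace` and the per-frame coin flip below are assumptions for illustration, not the paper's exact schedule.

```python
import random

def build_finetune_context(gt_frames, generate_fn, p_replace=0.5, seed=None):
    """Build a conditioning context for fine-tuning. With probability
    p_replace, condition on the model's own prediction for a past frame
    instead of the ground truth (scheduled-sampling-style sketch)."""
    rng = random.Random(seed)
    context = [gt_frames[0]]  # the input frame is always real
    for gt in gt_frames[1:]:
        if rng.random() < p_replace:
            context.append(generate_fn(context))  # model's own prediction
        else:
            context.append(gt)
    return context
```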


@inproceedings{ren2022look,
  title={Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image},
  author={Ren, Xuanchi and Wang, Xiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}