Reconstructing the World from a Few Views

The DIDYMOS-XR project, funded by the European Union’s Horizon Europe programme, aims to advance the technologies required for creating digital twins and developing 3D capturing systems for both objects and complex scenes. As part of this initiative, the GDNeRF model has been developed to address one of the key challenges in this domain: enabling high-quality scene reconstruction from a few, sparsely positioned cameras.

In the rapidly evolving world of 3D reconstruction and neural rendering, one of the most exciting frontiers is rebuilding complete objects and environments from just a few images. Traditional Neural Radiance Fields (NeRFs) have already revolutionized how we synthesize new views of a scene. At its core, a NeRF trains a new network from scratch for every scene, learning the scene’s complete 3D shape and appearance just by observing how it looks from many different viewpoints. However, NeRFs typically require hundreds of closely spaced images and long training times to reconstruct a scene faithfully. That’s where GDNeRF (Generalizable Depth-based NeRF) comes in.
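The volume-rendering rule at the heart of NeRF is compact enough to sketch in a few lines. The toy function below (the function and variable names are ours, not from the GDNeRF paper) composites sampled colors along one camera ray, exactly the operation a NeRF repeats for every pixel:

```python
import math

def render_ray(densities, colors, deltas):
    """Composite colors along a ray with NeRF's volume rendering rule:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i, where
    T_i = exp(-sum_{j<i} sigma_j * delta_j) is the transmittance."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for sigma, c, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha
        for k in range(3):
            color[k] += weight * c[k]
        transmittance *= 1.0 - alpha  # light remaining after this sample
    return color

# A ray hitting one nearly opaque red sample renders almost pure red.
print(render_ray([50.0], [(1.0, 0.0, 0.0)], [1.0]))
```

Because every weight depends on the densities of all samples in front of it, occlusion falls out of the math for free; this is the renderer GDNeRF feeds with features from its probabilistic volume.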

Developed by researchers at the i2CAT Foundation in Barcelona within the DIDYMOS-XR project, GDNeRF introduces a new way to generalize NeRF models for sparse view synthesis — meaning it can reconstruct realistic scenes even when you only have a handful of widely spaced camera views. GDNeRF is designed to handle both dynamic humans and general scenes. However, within the DIDYMOS-XR project, the focus has been on object- and scene-level reconstruction — such as rooms, urban spaces, or cultural heritage sites — where the ability to work with minimal hardware setups is especially valuable.

The Challenge: From Dense Cameras to Sparse Reality

Most NeRF-based systems assume dense multi-camera rigs with overlapping fields of view. This works well in lab environments, but not in the real world — where you might only have a few cameras, or even just a smartphone moving around an object.

Sparse setups introduce major challenges:

    • Large gaps between viewpoints make it hard to infer unseen geometry.

    • Occluded regions (like the backside of an object) often remain unknown.

    • Training time and data requirements make many new view synthesis methods impractical for real-time use.

GDNeRF directly tackles these limitations, allowing digital twin creation even from minimal, low-cost camera rigs — a key enabler for DIDYMOS-XR’s goal of democratizing immersive 3D reconstruction. Figure 1 shows an example of the inputs that GDNeRF uses and how its results compare with other methods.

Figure 1: Given a few sparse input views, GDNeRF synthesizes highly realistic target views. The example shows 3 sparse views from the DTU dataset and the new view our method synthesizes from them. GT refers to the ground-truth image; ENeRF is a state-of-the-art generalizable NeRF, and FreeNeRF a state-of-the-art sparse view synthesis method. Our model leverages input depth information using a multilevel probabilistic feature volume and incorporates a generative component to produce plausible scene features rendered through volumetric rendering. It can generate new views in real time without needing to train the neural representation of the scene.

The Core Idea: Depth-Guided Probabilistic Feature Volumes

At its heart, GDNeRF integrates feature maps from sparse views into a new multilevel probabilistic representation called a probabilistic feature volume. This structure fuses features from a few input views into a consistent 3D volume, weighting them by how likely each feature is to belong to a given depth position. It implicitly models uncertainty in depth and uses that to make smarter reconstructions.
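The fusion idea can be illustrated with a toy example. This is not the authors’ multilevel implementation: the Gaussian depth likelihood, the single-voxel scope, and all names are our simplifying assumptions, chosen only to show how per-view features get blended according to depth agreement:

```python
import math

def fuse_into_voxel(view_features, view_depths, voxel_depth, sigma=0.1):
    """Blend per-view feature vectors into one voxel, weighting each view
    by how likely the voxel's depth is under that view's depth estimate
    (modeled here as a Gaussian centered on the estimated depth)."""
    weights = [math.exp(-((voxel_depth - d) ** 2) / (2 * sigma ** 2))
               for d in view_depths]
    total = sum(weights) or 1.0  # avoid division by zero far from all surfaces
    weights = [w / total for w in weights]
    dim = len(view_features[0])
    # Weighted average of the feature vectors, one channel at a time.
    return [sum(w * f[k] for w, f in zip(weights, view_features))
            for k in range(dim)]
```

A view whose depth map agrees with the voxel’s position dominates the blend, while views that place the surface elsewhere contribute almost nothing; this is the sense in which depth uncertainty shapes the volume.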

This design allows GDNeRF to handle large view gaps more gracefully. When reconstructing a building facade, a car, or a piece of furniture, the model can infer coherent geometry even when much of the surface isn’t directly visible. Figure 2 depicts a visual description of GDNeRF.

Figure 2: From a few sparse views, our method extracts a set of feature maps which are projected into our probabilistic feature volume based on the source views’ depth maps. Our generator module processes the probabilistic feature volume to generate a plausible multilevel feature volume. These volumes contain features that the volumetric renderer uses to synthesize the target view in real time, without the need for per-scene training.

Filling the Gaps: A Generative Approach to Missing Information

Sparse views always leave holes — unseen parts of the scene. To fill them, GDNeRF borrows concepts from generative adversarial networks (GANs), particularly StyleGAN. A 3D Convolutional Neural Network (CNN) generator predicts plausible features for occluded or ambiguous regions, ensuring that even unseen parts of an object are consistent with visible context.
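In GDNeRF this completion is performed by a learned, StyleGAN-inspired 3D CNN generator. The stand-in below is deliberately much simpler (a nearest-observed-neighbor average over a 1D row of voxel features, with all names ours) and is meant only to show the role that module plays: unseen entries receive values consistent with the visible context around them.

```python
def complete_features(voxels):
    """Fill unobserved entries (None) from their nearest observed neighbors.
    A toy stand-in for GDNeRF's learned generative completion module."""
    filled = list(voxels)
    for i, v in enumerate(voxels):
        if v is None:
            # Nearest observed value on each side, if any.
            left = next((voxels[j] for j in range(i - 1, -1, -1)
                         if voxels[j] is not None), None)
            right = next((voxels[j] for j in range(i + 1, len(voxels))
                          if voxels[j] is not None), None)
            observed = [x for x in (left, right) if x is not None]
            filled[i] = sum(observed) / len(observed) if observed else 0.0
    return filled

print(complete_features([1.0, None, None, 3.0]))  # → [1.0, 2.0, 2.0, 3.0]
```

The learned generator goes far beyond such interpolation — it can hallucinate structured detail for fully occluded regions — but the contract is the same: take a partially observed feature volume in, return a complete one.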

This combination of probabilistic geometry and generative reasoning is particularly valuable for DIDYMOS-XR’s broader mission: enabling digital twins that can represent entire scenes, even when parts of those scenes are hidden from sensors or cameras.

Real-Time Rendering, Real-World Applications

Beyond reconstruction accuracy, GDNeRF achieves real-time rendering at 16 FPS (frames per second), enabling dynamic applications such as:

    • Interactive 3D object visualization from a few smartphone photos.

    • Digital twins of spaces using sparse camera setups.

    • Augmented/virtual reality (AR/VR) scene reconstruction for telepresence or virtual staging.

    • Cultural heritage preservation, where limited captures are common.

Unlike many NeRFs, GDNeRF doesn’t require retraining for every new scene — it generalizes across environments, meaning a pre-trained model can adapt to new inputs on the fly.

Results: Sparse Input, Dense Reality

In benchmarks, GDNeRF dramatically outperformed prior generalizable NeRFs such as ENeRF, especially in sparse setups:

    • Up to 60% improvement in view synthesis quality under large (60°) camera spacing.

    • Consistent results across both object-level (the DTU dataset) and scene-level datasets (such as the CWI dataset).

Even without dense coverage, GDNeRF delivers detailed textures and solid geometry, bridging the gap between few-shot input and dense reconstruction. Table 1 provides quantitative results while Figure 3 shows visual comparisons.

Table 1: GDNeRF ablation study. Results on DTU, CWI and ActorsHQ for the different components of GDNeRF and for ENeRF. Our model significantly outperforms ENeRF: even the most basic version of GDNeRF, using only the probabilistic feature volume, already surpasses ENeRF’s performance.

 

Figure 3: Generalizable sparse new view synthesis on DTU, and a comparison between ENeRF and GDNeRF on ActorsHQ. ENeRF generates many artifacts compared to our GDNeRF.

These advances make GDNeRF well-suited for on-site 3D reconstruction, digital twin creation, and free-viewpoint rendering — all central goals for DIDYMOS-XR’s applied research in XR technologies.

Why It Matters

The broader implication of GDNeRF extends beyond free-viewpoint video. Its framework opens doors to generalizable 3D scene reconstruction from minimal input — a foundational capability for applications in:

    • Robotics and autonomous systems navigating with few sensors.

    • Digital heritage and conservation with limited imagery.

    • Rapid spatial scanning in XR or simulation environments.

By combining depth-based uncertainty modeling and GAN-powered feature synthesis, GDNeRF represents a meaningful step toward NeRFs that understand and reconstruct entire scenes.

Looking Ahead

The next steps in the DIDYMOS-XR roadmap explore extending GDNeRF’s framework with 3D Gaussian Splatting for faster rendering, and with synthetic depth estimation to support RGB-only inputs. These developments will push the boundaries of real-time digital twin generation, enabling scalable 3D capture across domains. Preliminary results, shown in Figure 4, already demonstrate the potential of extending this framework with Gaussian Splatting.

Figure 4: The GDNeRF framework with 3D Gaussian Splatting, applied to in-the-wild car camera sequences.

 

In short:

GDNeRF embodies the core vision of DIDYMOS-XR — transforming limited sensory input into detailed, interactive digital twins of our world. It’s not just about reconstructing humans, but about reconstructing everything: our objects, spaces, and scenes, one sparse view at a time.

By Ivan Huerta and Sergio Montoya of i2CAT, inspired by the DIDYMOS-XR paper ‘GDNeRF: Generalizable Depth-based NeRF for Sparse View Synthesis,’ ICME 2025