The DIDYMOS-XR project, funded by the European Union’s Horizon Europe programme, aims to advance the technologies required for creating digital twins and developing 3D capturing systems for both objects and complex scenes. As part of this initiative, the GDNeRF model has been developed to address one of the key challenges in this domain: enabling high-quality scene reconstruction from a few, sparsely positioned cameras.
In the rapidly evolving world of 3D reconstruction and neural rendering, one of the most exciting frontiers is rebuilding complete objects and environments from just a few images. Traditional Neural Radiance Fields (NeRFs) have already revolutionized how we synthesize new views of a scene. At its core, a NeRF trains a new network from scratch for every scene, learning its complete 3D shape and appearance simply by observing how that scene looks from many different viewpoints. However, NeRFs typically require hundreds of closely spaced images and long training times to reconstruct a scene faithfully. That’s where GDNeRF (Generalizable Depth-based NeRF) comes in.
Developed by researchers at the i2CAT Foundation in Barcelona within the DIDYMOS-XR project, GDNeRF introduces a new way to generalize NeRF models for sparse view synthesis — meaning it can reconstruct realistic scenes even when you only have a handful of widely spaced camera views. GDNeRF is designed to handle both dynamic humans and general scenes. However, within the DIDYMOS-XR project, the focus has been on object- and scene-level reconstruction — such as rooms, urban spaces, or cultural heritage sites — where the ability to work with minimal hardware setups is especially valuable.
The Challenge: From Dense Cameras to Sparse Reality
Most NeRF-based systems assume dense multi-camera rigs with overlapping fields of view. This works well in lab environments, but not in the real world — where you might only have a few cameras, or even just a smartphone moving around an object.
Sparse setups introduce major challenges:
- Large gaps between viewpoints make it hard to infer unseen geometry.
- Occluded regions (like the backside of an object) often remain unknown.
- Training time and data requirements make many new view synthesis methods impractical for real-time use.
GDNeRF directly tackles these limitations, allowing digital twin creation even from minimal, low-cost camera rigs — a key enabler for DIDYMOS-XR’s goal of democratizing immersive 3D reconstruction. Figure 1 shows an example of the inputs that GDNeRF employs and its advantage over other methods.
The Core Idea: Depth-Guided Probabilistic Feature Volumes
At its heart, GDNeRF integrates feature maps from sparse views into a new multilevel probabilistic representation called a probabilistic feature volume. This structure fuses features from a few input views into a consistent 3D volume, weighting each feature by how likely it is to belong to a given depth position. It implicitly models uncertainty in depth and uses that to make smarter reconstructions.
This design allows GDNeRF to handle large view gaps more gracefully. When reconstructing a building facade, a car, or a piece of furniture, the model can infer coherent geometry even when much of the surface isn’t directly visible. Figure 2 depicts a visual description of GDNeRF.
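To make the depth-weighted fusion idea concrete, here is a minimal NumPy sketch. It is an illustration of the general principle, not the paper’s actual implementation: `fuse_features` is a hypothetical function, and the shapes, the normalization scheme, and the depth-probability inputs are all assumptions for the example.

```python
import numpy as np

def fuse_features(view_features, depth_probs):
    """Fuse per-view features into a single volume, weighting each view's
    contribution by how likely its feature lies at each depth hypothesis.

    view_features: (V, D, H, W, C) features from V views at D depth planes
    depth_probs:   (V, D, H, W)    per-view probability of each depth
    """
    # Normalize depth probabilities across views so the weights sum to 1
    w = depth_probs / (depth_probs.sum(axis=0, keepdims=True) + 1e-8)
    # Weighted sum over views -> one probabilistic feature volume (D, H, W, C)
    return (view_features * w[..., None]).sum(axis=0)

# Toy example: 3 views, 8 depth planes, a 4x4 image grid, 16-dim features
feats = np.random.rand(3, 8, 4, 4, 16)
probs = np.random.rand(3, 8, 4, 4)
volume = fuse_features(feats, probs)
print(volume.shape)  # (8, 4, 4, 16)
```

Views that are confident about a given depth dominate the fused feature there, while uncertain views contribute little — which is what lets the volume stay coherent even with large gaps between cameras.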
Filling the Gaps: A Generative Approach to Missing Information
Sparse views always leave holes — unseen parts of the scene. To fill them, GDNeRF borrows concepts from generative adversarial networks (GANs), particularly StyleGAN. A 3D Convolutional Neural Network (CNN) generator predicts plausible features for occluded or ambiguous regions, ensuring that even unseen parts of an object are consistent with visible context.
This combination of probabilistic geometry and generative reasoning is particularly valuable for DIDYMOS-XR’s broader mission: enabling digital twins that can represent entire scenes, even when parts of those scenes are hidden from sensors or cameras.
Real-Time Rendering, Real-World Applications
Beyond reconstruction accuracy, GDNeRF achieves real-time rendering at 16 FPS (frames per second), enabling dynamic applications such as:
- Interactive 3D object visualization from a few smartphone photos.
- Digital twins of spaces using sparse camera setups.
- Augmented/virtual reality (AR/VR) scene reconstruction for telepresence or virtual staging.
- Cultural heritage preservation, where limited captures are common.
Unlike many NeRFs, GDNeRF doesn’t require retraining for every new scene — it generalizes across environments, meaning a pre-trained model can adapt to new inputs on the fly.
Results: Sparse Input, Dense Reality
In benchmarks, GDNeRF dramatically outperformed prior generalizable NeRFs such as ENeRF, especially in sparse setups:
- Up to 60% improvement in view synthesis quality under large (60°) camera spacing.
- Consistent results across both object-level (the DTU dataset) and scene-level datasets (such as the CWI dataset).
Even without dense coverage, GDNeRF delivers detailed textures and solid geometry, bridging the gap between few-shot input and dense reconstruction. Table 1 provides quantitative results, while Figure 3 shows visual comparisons.
These advances make GDNeRF well-suited for on-site 3D reconstruction, digital twin creation, and free-viewpoint rendering — all central goals for DIDYMOS-XR’s applied research in XR technologies.
Why It Matters
The broader implication of GDNeRF extends beyond free-viewpoint video. Its framework opens doors to generalizable 3D scene reconstruction from minimal input — a foundational capability for applications in:
- Robotics and autonomous systems navigating with few sensors.
- Digital heritage and conservation with limited imagery.
- Rapid spatial scanning in XR or simulation environments.
By combining depth-based uncertainty modeling and GAN-powered feature synthesis, GDNeRF represents a meaningful step toward NeRFs that understand and reconstruct entire scenes.
Looking Ahead
The next steps in the DIDYMOS-XR roadmap explore extending GDNeRF’s framework with 3D Gaussian Splatting for faster rendering, and with synthetic depth estimation to support RGB-only image inputs. These developments will push the boundaries of real-time digital twin generation, enabling scalable 3D capture across domains. Preliminary results, shown in Figure 4, already demonstrate the potential of extending this framework with Gaussian Splatting.
In short:
GDNeRF embodies the core vision of DIDYMOS-XR — transforming limited sensory input into detailed, interactive digital twins of our world. It’s not just about reconstructing humans, but about reconstructing everything: our objects, spaces, and scenes, one sparse view at a time.
By Ivan Huerta and Sergio Montoya of i2CAT, inspired by the DIDYMOS-XR paper ‘GDNeRF: Generalizable Depth-based NeRF for Sparse View Synthesis,’ ICME 2025