Abstract
Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis.
To harness and benchmark this potential implicit 3D capability, we propose an agentic framework for 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes novel views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames in both 2D image space and 3DGS reconstruction space.
Crucially, our agentic approach yields coherent and robust 3D reconstructions, producing scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we show that 2D models do indeed encapsulate a grasp of 3D worlds; by exploiting this understanding, our method synthesizes expansive, realistic, and 3D-consistent worlds.
Method
Figure 1. Overview of the WorldAgents pipeline. The Director (VLM) scans the verified scene history to generate prompts for the next view. The Generator performs sequential inpainting with re-rendering of intermediate 3D reconstructions. The Verifier (VLM) applies two-step verification: 2D consistency against scene history and 3D verification against reconstructed views, before accepting frames. Verified frames are passed to AnySplat for 3DGS reconstruction.

Director: Prompt Formulation
A VLM-based director analyzes the current world state and formulates targeted prompts to guide the image generation model. It reasons about the scene context to produce descriptive, spatially-consistent generation prompts.
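A minimal sketch of the director's role, assuming a template-based stand-in for the VLM call (the `Director` class, its field names, and the prompt template are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Director:
    """Formulates the next-view prompt from the verified scene history."""
    scene_history: list = field(default_factory=list)  # captions of verified frames

    def next_prompt(self, base_scene: str) -> str:
        # In the real pipeline a VLM reasons over rendered frames; here we
        # only sketch the structure as a template over textual history.
        context = "; ".join(self.scene_history[-3:])  # last few verified views
        return (
            f"Scene: {base_scene}. Previously verified views: {context or 'none'}. "
            "Describe the adjacent unseen region, keeping lighting and style consistent."
        )

director = Director(scene_history=["metallic wall with neon strips"])
prompt = director.next_prompt("futuristic sci-fi laboratory")
```

In practice the history would be rendered frames passed to the VLM as images, not captions; the sketch only shows how context flows into each new generation prompt.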

Generator: Novel View Synthesis
A state-of-the-art 2D foundation image generation model synthesizes new image views conditioned on the director's prompts. The generator leverages the implicit 3D understanding encoded in large-scale image model training to produce geometrically plausible views.
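The sequential inpaint-and-re-render loop from Figure 1 can be sketched as follows; `inpaint`, `reconstruct`, and `render` are hypothetical callables standing in for the image model and the intermediate 3D reconstruction:

```python
def expand_world(frames, prompts, inpaint, reconstruct, render):
    """Sequential inpainting: each new view is conditioned on a re-rendering
    of the intermediate 3D reconstruction, so geometry stays anchored."""
    for prompt in prompts:
        scene = reconstruct(frames)              # intermediate 3D model from frames so far
        partial = render(scene)                  # re-render at the next camera pose
        frames.append(inpaint(partial, prompt))  # image model fills the unseen regions
    return frames

# Toy run with string stand-ins for images and scenes.
frames = expand_world(
    frames=["view0"],
    prompts=["p1", "p2"],
    inpaint=lambda partial, p: f"{partial}+{p}",
    reconstruct=lambda fs: fs,
    render=lambda scene: scene[-1],
)
```

The key design point is that conditioning on a re-rendered partial view, rather than generating from scratch, keeps successive frames geometrically tied to the existing scene.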

Verifier: Two-Step Frame Curation
A VLM-backed two-step verifier evaluates generated frames from both 2D image quality and 3D reconstruction consistency perspectives. Frames that pass both checks are selectively incorporated into the growing world representation, filtering out artifacts and inconsistencies.
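The two-step curation logic reduces to a conjunction of the two checks; the predicates below are placeholders for the VLM-backed 2D and 3D evaluations described above:

```python
def verify(frame, history_check, reconstruction_check):
    """A frame is accepted only if it passes the 2D consistency check against
    the scene history AND the 3D check against re-rendered reconstructed views."""
    return history_check(frame) and reconstruction_check(frame)

def curate(candidates, history_check, reconstruction_check):
    """Keep only frames that survive both verification steps."""
    return [f for f in candidates
            if verify(f, history_check, reconstruction_check)]

# Toy example: one frame fails each check, one passes both.
kept = curate(
    ["good", "bad_2d", "bad_3d"],
    history_check=lambda f: f != "bad_2d",
    reconstruction_check=lambda f: f != "bad_3d",
)
```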
3D World Reconstruction & Exploration
Curated frames are fed into AnySplat, a feed-forward 3D Gaussian Splatting (3DGS) reconstruction method, to produce a coherent, explorable scene. The resulting 3DGS representation supports real-time novel view rendering, enabling users to freely navigate the synthesized 3D environment.
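The final stage can be sketched as a reconstruct-then-render loop; `reconstruct` stands in for a feed-forward 3DGS method such as AnySplat (whose actual API is not specified here), and `render` for the splatting renderer:

```python
def explore(reconstruct, render, frames, trajectory):
    """Build a 3DGS scene from curated frames, then render novel views
    along a user-defined camera trajectory for free exploration."""
    scene = reconstruct(frames)                      # feed-forward 3DGS reconstruction
    return [render(scene, pose) for pose in trajectory]

# Toy run with stand-in reconstruct/render functions.
views = explore(
    reconstruct=lambda fs: tuple(fs),
    render=lambda scene, pose: (pose, len(scene)),
    frames=["f1", "f2"],
    trajectory=[0, 1],
)
```

Because the 3DGS representation is explicit, novel views along any trajectory can be rasterized in real time rather than regenerated by the image model.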
Results
Interior of a futuristic sci-fi laboratory room. Sleek metallic wall panels with integrated glowing blue and cyan neon strips. Holographic displays floating in mid-air showing complex data projections. An advanced robotic arm structure in the center of a clean workspace. ...
Apartment of an urban industrial loft with exposed brick and ductwork.
Interior of a cluttered cyberpunk apartment room at night. A cluttered desk with multiple glowing monitors, cables, and tech parts. Contrasting pink and teal neon light spilling into the room from outside. ...
A fantasy interior inside a natural cave filled with giant glowing purple and blue crystals. Stalactites hanging from the ceiling. A calm underground pool of water reflecting the bioluminescent light. ...
Interior of an ancient medieval stone crypt or dungeon. Rough hewn stone walls and vaulted stone ceiling arches. Iron sconces with flickering torches casting long harsh shadows. A stone sarcophagus in the center covered in dust and cobwebs. ...
Apartment of an airy Hamptons beach house with white shiplap and coastal blue accents.
Comparisons
Sci-Fi Scene vs. Baselines
Foundation Models
WorldAgents is built on top of state-of-the-art foundation models spanning image generation and vision-language understanding.
Image Generation
FLUX.2 [pro] is BFL's flagship diffusion model, delivering the highest-fidelity image outputs with exceptional prompt adherence and photorealism. We use it to establish an upper bound on generation quality within the WorldAgents pipeline.
FLUX.2 [klein] is BFL's efficient 9-billion-parameter model, offering a strong quality-to-cost tradeoff. Its compact footprint makes it well-suited to the iterative, multi-step generation loop that WorldAgents performs during world expansion.
NanoBanana 1, powered by Gemini 2.5 Flash, is Google's fast multimodal model with native image generation capabilities. Its low latency and high-fidelity outputs make it an effective backbone for the generation stage of WorldAgents.
Vision-Language Models
GPT-4.1 serves as the primary VLM backbone for WorldAgents' Director and Verifier agents. Its strong visual understanding and instruction-following capabilities enable reliable prompt formulation and multi-step frame evaluation across diverse scene types.
Qwen3 8B is Alibaba's compact multimodal model with strong vision-language reasoning at a fraction of the compute cost. We use it as a lightweight alternative for the Director and Verifier roles, enabling fully local and deployable WorldAgents pipelines.
BibTeX
@article{erkoc2025worldagents,
  title   = {WorldAgents: Can Foundation Image Models be Agents for 3D World Models?},
  author  = {Erkoc, Ziya and Dai, Angela and Nie{\ss}ner, Matthias},
  journal = {arXiv preprint},
  year    = {2025},
  url     = {https://arxiv.org/abs/XXXX.XXXXX}
}