Abstract
Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis.
To harness and benchmark this potential implicit 3D capability, we propose an agentic framework for 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes novel views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames in both 2D image space and 3DGS reconstruction space.
Crucially, our agentic approach yields coherent and robust 3D reconstructions, producing scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we show that 2D models do indeed encapsulate a grasp of 3D worlds; by exploiting this understanding, our method synthesizes expansive, realistic, and 3D-consistent worlds.
Method
Figure 1. Overview of the WorldAgents pipeline. The Director (VLM) scans the verified scene history to generate prompts for the next view. The Generator performs sequential inpainting with re-rendering of intermediate 3D reconstructions. The Verifier (VLM) applies two-step verification: 2D consistency against scene history and 3D verification against reconstructed views, before accepting frames. Verified frames are passed to AnySplat for 3DGS reconstruction.

Director: Prompt Formulation
A VLM-based director analyzes the current world state and formulates targeted prompts to guide the image generation model. It reasons about the scene context to produce descriptive, spatially-consistent generation prompts.
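A minimal sketch of the director's role, assuming a template-based stand-in for the VLM call (the `Director` class, its field names, and the prompt template are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Director:
    """Formulates the next-view prompt from the verified scene history."""
    scene_history: list = field(default_factory=list)  # captions of verified frames

    def next_prompt(self, base_scene: str) -> str:
        # In the real pipeline a VLM reasons over rendered frames; here we
        # only sketch the structure as a template over textual history.
        context = "; ".join(self.scene_history[-3:])  # last few verified views
        return (
            f"Scene: {base_scene}. Previously verified views: {context or 'none'}. "
            "Describe the adjacent unseen region, keeping lighting and style consistent."
        )

director = Director(scene_history=["metallic wall with neon strips"])
prompt = director.next_prompt("futuristic sci-fi laboratory")
```

In practice the history would be rendered frames passed to the VLM as images, not captions; the sketch only shows how context flows into each new generation prompt.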

Generator: Novel View Synthesis
A state-of-the-art 2D foundation image generation model synthesizes new image views conditioned on the director's prompts. The generator leverages the implicit 3D understanding encoded in large-scale image model training to produce geometrically plausible views.
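The sequential inpaint-and-re-render loop from Figure 1 can be sketched as follows; `inpaint`, `reconstruct`, and `render` are hypothetical callables standing in for the image model and the intermediate 3D reconstruction:

```python
def expand_world(frames, prompts, inpaint, reconstruct, render):
    """Sequential inpainting: each new view is conditioned on a re-rendering
    of the intermediate 3D reconstruction, so geometry stays anchored."""
    for prompt in prompts:
        scene = reconstruct(frames)              # intermediate 3D model from frames so far
        partial = render(scene)                  # re-render at the next camera pose
        frames.append(inpaint(partial, prompt))  # image model fills the unseen regions
    return frames

# Toy run with string stand-ins for images and scenes.
frames = expand_world(
    frames=["view0"],
    prompts=["p1", "p2"],
    inpaint=lambda partial, p: f"{partial}+{p}",
    reconstruct=lambda fs: fs,
    render=lambda scene: scene[-1],
)
```

The key design point is that conditioning on a re-rendered partial view, rather than generating from scratch, keeps successive frames geometrically tied to the existing scene.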

Verifier: Two-Step Frame Curation
A VLM-backed two-step verifier evaluates generated frames from both 2D image quality and 3D reconstruction consistency perspectives. Frames that pass both checks are selectively incorporated into the growing world representation, filtering out artifacts and inconsistencies.
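The two-step curation logic reduces to a conjunction of the two checks; the predicates below are placeholders for the VLM-backed 2D and 3D evaluations described above:

```python
def verify(frame, history_check, reconstruction_check):
    """A frame is accepted only if it passes the 2D consistency check against
    the scene history AND the 3D check against re-rendered reconstructed views."""
    return history_check(frame) and reconstruction_check(frame)

def curate(candidates, history_check, reconstruction_check):
    """Keep only frames that survive both verification steps."""
    return [f for f in candidates
            if verify(f, history_check, reconstruction_check)]

# Toy example: one frame fails each check, one passes both.
kept = curate(
    ["good", "bad_2d", "bad_3d"],
    history_check=lambda f: f != "bad_2d",
    reconstruction_check=lambda f: f != "bad_3d",
)
```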
3D World Reconstruction & Exploration
Curated frames are fed into AnySplat, a feed-forward 3D Gaussian Splatting (3DGS) reconstruction method, to produce a coherent, explorable scene. The resulting 3DGS representation supports real-time novel view rendering, enabling users to freely navigate the synthesized 3D environment.
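The final stage can be sketched as a reconstruct-then-render loop; `reconstruct` stands in for a feed-forward 3DGS method such as AnySplat (whose actual API is not specified here), and `render` for the splatting renderer:

```python
def explore(reconstruct, render, frames, trajectory):
    """Build a 3DGS scene from curated frames, then render novel views
    along a user-defined camera trajectory for free exploration."""
    scene = reconstruct(frames)                      # feed-forward 3DGS reconstruction
    return [render(scene, pose) for pose in trajectory]

# Toy run with stand-in reconstruct/render functions.
views = explore(
    reconstruct=lambda fs: tuple(fs),
    render=lambda scene, pose: (pose, len(scene)),
    frames=["f1", "f2"],
    trajectory=[0, 1],
)
```

Because the 3DGS representation is explicit, novel views along any trajectory can be rasterized in real time rather than regenerated by the image model.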
Results
Interior of a futuristic sci-fi laboratory room. Sleek metallic wall panels with integrated glowing blue and cyan neon strips. Holographic displays floating in mid-air showing complex data projections. An advanced robotic arm structure in the center of a clean workspace. ...
Apartment of an urban industrial loft with exposed brick and ductwork.
Interior of a cluttered cyberpunk apartment room at night. A cluttered desk with multiple glowing monitors, cables, and tech parts. Contrasting pink and teal neon light spilling into the room from outside. ...
A fantasy interior inside a natural cave filled with giant glowing purple and blue crystals. Stalactites hanging from the ceiling. A calm underground pool of water reflecting the bioluminescent light. ...
Interior of an ancient medieval stone crypt or dungeon. Rough hewn stone walls and vaulted stone ceiling arches. Iron sconces with flickering torches casting long harsh shadows. A stone sarcophagus in the center covered in dust and cobwebs. ...
Apartment of an airy Hamptons beach house with white shiplap and coastal blue accents.
Comparisons
Sci-Fi Scene vs. Baselines
Foundation Models
WorldAgents is built on top of state-of-the-art foundation models spanning image generation and vision-language understanding.
Image Generation
FLUX.2 [pro] is BFL's flagship diffusion model, delivering the highest-fidelity image outputs with exceptional prompt adherence and photorealism. We use it to establish an upper bound on generation quality within the WorldAgents pipeline.
FLUX.2 [klein] is BFL's efficient 9-billion-parameter model, offering a strong quality-to-cost tradeoff. Its compact footprint makes it well-suited to the iterative, multi-step generation loop that WorldAgents performs during world expansion.
NanoBanana 1, powered by Gemini 2.5 Flash, is Google's fast multimodal model with native image generation capabilities. Its low latency and high-fidelity outputs make it an effective backbone for the generation stage of WorldAgents.
Vision-Language Models
GPT-4.1 serves as the primary VLM backbone for WorldAgents' Director and Verifier agents. Its strong visual understanding and instruction-following capabilities enable reliable prompt formulation and multi-step frame evaluation across diverse scene types.
Qwen3 8B is Alibaba's compact multimodal model with strong vision-language reasoning at a fraction of the compute cost. We use it as a lightweight alternative for the Director and Verifier roles, enabling fully local and deployable WorldAgents pipelines.
BibTeX
@article{erkoc2025worldagents,
  title   = {WorldAgents: Can Foundation Image Models be Agents for 3D World Models?},
  author  = {Erkoc, Ziya and Dai, Angela and Nie{\ss}ner, Matthias},
  journal = {arXiv preprint},
  year    = {2025},
  url     = {https://arxiv.org/abs/XXXX.XXXXX}
}