Large Language Models (LLMs) are increasingly being explored for tasks beyond text generation, including spatial reasoning and visual planning. In the context of image planning—such as deciding where objects should be placed in an image or scene—LLMs offer an exciting new paradigm. This task, often called “image layout generation” or “scene composition,” traditionally relies on rule-based systems or vision models trained on layout datasets. However, LLMs can reason abstractly about object relationships, functions, and user intent, making them a valuable tool in the early stages of image design.
For example, given a prompt like “design a cozy living room with a desk near the window and a sofa facing the fireplace”, an LLM can generate a structured description or even pseudo-code indicating object positions, dimensions, and spatial relationships. When combined with a visual renderer or layout model, this structured output can be transformed into an image or a bounding-box map. Moreover, LLMs can incorporate user constraints, design principles (e.g., feng shui or accessibility), and functional goals (e.g., maximizing space or light) without requiring domain-specific training.
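As a concrete illustration of this pipeline, the sketch below asks an LLM for a JSON list of labeled objects with normalized bounding boxes and renders the result as a bounding-box map with matplotlib. The prompt wording, the JSON schema, and the `call_llm` helper (which returns a canned response so the sketch runs offline) are illustrative assumptions, not any particular model's API; in practice `call_llm` would wrap whatever chat-completion client is in use.

```python
import json
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

LAYOUT_PROMPT = """\
Design a cozy living room with a desk near the window and a sofa facing the fireplace.
Return ONLY a JSON list. Each item must have:
  "name": object label,
  "box": [x, y, width, height] in normalized 0-1 room coordinates.
"""

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call.
    Returns a hard-coded layout so the example runs without an API key."""
    return json.dumps([
        {"name": "window",    "box": [0.05, 0.00, 0.30, 0.05]},
        {"name": "desk",      "box": [0.05, 0.08, 0.25, 0.15]},
        {"name": "fireplace", "box": [0.70, 0.00, 0.25, 0.10]},
        {"name": "sofa",      "box": [0.60, 0.45, 0.35, 0.20]},
    ])

def parse_layout(raw: str) -> list[dict]:
    """Parse and lightly validate the model's JSON layout."""
    layout = json.loads(raw)
    for item in layout:
        x, y, w, h = item["box"]
        assert 0 <= x <= 1 and 0 <= y <= 1 and w > 0 and h > 0, item
    return layout

def render_bounding_boxes(layout: list[dict]) -> None:
    """Draw each object as a labeled rectangle in a unit-square 'room'."""
    fig, ax = plt.subplots(figsize=(5, 5))
    for item in layout:
        x, y, w, h = item["box"]
        ax.add_patch(Rectangle((x, y), w, h, fill=False, linewidth=2))
        ax.text(x + w / 2, y + h / 2, item["name"], ha="center", va="center")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.invert_yaxis()  # place (0, 0) at the top-left, matching image coordinates
    ax.set_title("LLM-proposed living-room layout")
    plt.show()

if __name__ == "__main__":
    render_bounding_boxes(parse_layout(call_llm(LAYOUT_PROMPT)))
```

In a fuller pipeline, the same parsed JSON could instead be passed to a layout-conditioned image generator or a 3D scene builder; the bounding-box plot here simply makes the LLM's spatial proposal inspectable before any rendering step.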