Video Scene Graphs
Scene graphs provide a structured representation of the environment, capturing objects and relationships as they evolve over time.
Politecnico di Torino, Italy
Understanding activities requires more than recognizing actions or objects: it requires understanding how humans transform the world around them. Objects appear, move, interact, and change their relationships over time. Capturing this evolving structure explicitly enables richer and more grounded reasoning than raw video alone. We believe that spatio-temporal scene graphs provide the right representation to capture exactly these evolving interactions. This representation is compositional, interpretable, and editable, making it possible to reason about interactions rather than simply recognize them. Every human activity changes the environment, and scene graphs make those changes explicit. Once interactions are represented as evolving graphs, they become a common language for a wide range of downstream capabilities.
We introduce SG-Ego, the first large-scale extension of Ego4D with temporally consistent scene graph annotations generated through a fully training-free pipeline. Building on this representation, we propose GLEN, a graph-based model that reasons directly over scene graph sequences. GLEN learns to align graphs with textual actions while modeling their temporal evolution, enabling explicit reasoning about how scenes evolve through human activities.
We introduce Activity-Driven Graph Edit Forecasting (A-GEF), a new benchmark that formulates future prediction as forecasting a sequence of graph edits conditioned on ongoing actions.
We show that the same scene graph representation can support a diverse set of challenges, including retrieval, compositional reasoning, and long-horizon forecasting, where GLEN consistently achieves SOTA zero-shot performance with structured and controllable predictions.
Together, SG-Ego and GLEN demonstrate that spatio-temporal scene graphs are a powerful foundation for interpretable, compositional, and reusable reasoning over video of human activities.
SG-Ego implements a training-free VidSGG strategy divided into three stages. Our pipeline maps a sequence of T frames into a single temporally consistent spatio-temporal scene graph, capturing all the spatial and functional relations in the video. We follow a bottom-up approach: we extract a rich set of frame-level relations, ground them with an open-vocabulary detection model, and build a temporally consistent scene graph of the temporal window.
Qwen3.5 extracts spatial and functional triplets from each frame, turning raw egocentric video frames into a structured relations list.
GroundingDINO localizes the objects and relations from the triplets and builds spatially grounded frame-level scene graphs.
SAM2+DINOv2 tracks object instances through time and merges frame-level scene graphs into temporally consistent scene graphs.
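The last stage can be illustrated with a minimal sketch. It assumes that tracking has already assigned a stable instance id to every object, so a frame-level graph is simply a collection of (subject, predicate, object) triplets over those ids. The `Triplet` and `TemporalEdge` types, the `merge_frame_graphs` helper, and the id naming are hypothetical and only illustrate the idea of merging frame-level graphs into edges with a temporal extent; they are not the SG-Ego implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    subject: str    # tracked instance id, e.g. "hand_0" (hypothetical naming)
    predicate: str  # spatial or functional relation
    obj: str        # tracked instance id of the object


@dataclass
class TemporalEdge:
    triplet: Triplet
    start: int  # first frame index where the relation holds
    end: int    # last frame index where the relation holds (inclusive)


def merge_frame_graphs(frame_graphs):
    """Merge per-frame triplets into edges annotated with frame intervals.

    A relation observed in consecutive frames extends its interval;
    a relation that disappears and later reappears opens a new interval.
    """
    open_edges = {}  # triplet -> edge currently being extended
    merged = []
    for t, triplets in enumerate(frame_graphs):
        current = set(triplets)
        # Close every edge whose relation is absent from this frame.
        for trip in list(open_edges):
            if trip not in current:
                merged.append(open_edges.pop(trip))
        # Extend edges that persist, open edges that are new.
        for trip in current:
            if trip in open_edges:
                open_edges[trip].end = t
            else:
                open_edges[trip] = TemporalEdge(trip, t, t)
    merged.extend(open_edges.values())
    return sorted(
        merged,
        key=lambda e: (e.start, e.triplet.subject, e.triplet.predicate, e.triplet.obj),
    )


holds = Triplet("hand_0", "holds", "knife_0")
on = Triplet("onion_0", "on", "board_0")
cuts = Triplet("knife_0", "cuts", "onion_0")

# Three toy frames: the hand releases the knife relation in the last frame.
edges = merge_frame_graphs([[holds, on], [holds, on], [on, cuts]])
for edge in edges:
    print(edge.triplet.subject, edge.triplet.predicate, edge.triplet.obj,
          edge.start, edge.end)
```

In this toy example each relation becomes one edge with an explicit frame interval, which is one simple way to keep a clip-level graph temporally consistent.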
The Align dataset contains approximately 3.8M spatio-temporal graphs from 7297 unique Ego4D videos, each paired with its corresponding narration and centered on EgoClip action windows.
The Edit dataset contains 360k training samples from 6537 videos and 7.2k validation samples from 181 videos, providing both the frame-level start graphs and the clip-level consolidated graphs.
We formulate the dynamic evolution of a scene as a novel setting called Activity-Conditioned Graph Edit Forecasting (A-GEF), in which we ask a model to forecast how a spatial scene graph will change based on the activity described in a text prompt. To do this, we consider a conditional formulation in which future graph edits are predicted given both the current graph and the upcoming activity.
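To make the notion of a graph edit concrete, the following sketch represents a scene graph as a set of triplets and an edit as the addition or removal of one triplet. The edit vocabulary (`"add"` / `"remove"`), the `Edit` type, and the helper names are assumptions made for illustration; the exact edit operations used in A-GEF may differ.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    op: str         # "add" or "remove" (assumed edit vocabulary)
    triplet: tuple  # (subject, predicate, object)


def apply_edits(graph, edits):
    """Return a new graph obtained by applying the edits in order."""
    result = set(graph)
    for edit in edits:
        if edit.op == "add":
            result.add(edit.triplet)
        elif edit.op == "remove":
            result.discard(edit.triplet)
        else:
            raise ValueError(f"unknown edit op: {edit.op}")
    return result


def diff_graphs(before, after):
    """Edits that turn `before` into `after`: removals first, then additions."""
    removed = [Edit("remove", t) for t in sorted(before - after)]
    added = [Edit("add", t) for t in sorted(after - before)]
    return removed + added


# Current graph, before a hypothetical activity prompt such as "picks up the cup".
start = {("cup", "on", "table"), ("hand", "near", "cup")}

# A forecasting model would predict these edits; here they are written by hand.
edits = [
    Edit("remove", ("cup", "on", "table")),
    Edit("remove", ("hand", "near", "cup")),
    Edit("add", ("hand", "holds", "cup")),
]

future = apply_edits(start, edits)
print(future)
```

Under this representation, the ground-truth edit sequence for a training pair can be derived from a start graph and a later graph with `diff_graphs(start, future)`.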
Our approach GLEN consists of three main components. The Graph Encoder and the Text Encoder map scene graphs and textual prompts into an aligned embedding space, thus learning a versatile representation that summarizes the state of the scene and enables spatially grounded reasoning about objects and relations. Finally, the Graph Edit Model defines a text-conditioned transformation over scene graphs, allowing action-conditioned modifications of the graph.
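Once graphs and text live in an aligned embedding space, matching a prompt to a scene reduces to a nearest-neighbour search. The sketch below ranks a few graph embeddings against a text embedding by cosine similarity. The vectors are made-up toy values, and `cosine` and `rank_graphs` are hypothetical helpers; the actual encoders and the similarity function used by GLEN are not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_graphs(text_emb, graph_embs):
    """Return graph names sorted from most to least similar to the text."""
    scores = {name: cosine(text_emb, emb) for name, emb in graph_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings standing in for the outputs of the text and graph encoders.
text_emb = [1.0, 0.0, 0.5]
graph_embs = {
    "cut_onion": [0.9, 0.1, 0.4],
    "wash_cup": [0.0, 1.0, 0.2],
    "open_door": [-0.5, 0.2, 0.9],
}

ranking = rank_graphs(text_emb, graph_embs)
print(ranking)
```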
We validate our scene graph annotation pipeline and GLEN on several benchmarks. On EgoSchema, we demonstrate that SG-Ego scene graphs provide a structured and complete representation of the scene that can support long-range reasoning about human activities. On EgoMCQ and EgoCVR, we show that the graph embeddings produced by GLEN are well aligned with the textual description of the corresponding ongoing human activities. Finally, on A-GEF and EXPLORE-Bench, we assess the capability of GLEN to predict the dynamic evolution of a scene given an activity.
Text-to-video retrieval on egocentric clips.
| Model | R@1 (Global) | R@1 (Local) |
|---|---|---|
| CLIP | 7.4 | 26.1 |
| TFR-CVR | 14.1 | 44.2 |
| Ours | 15.3 | 47.7 |
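For readers unfamiliar with the metric, R@K is the percentage of queries whose correct item appears among the top K retrieved results. The sketch below shows this generic definition on toy data. The `recall_at_k` helper and the example rankings are illustrative assumptions, and the sketch does not reproduce the Global and Local evaluation settings reported in the table.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Percentage of queries whose ground-truth item is in the top-k results."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, ground_truth)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(ground_truth)


# Four toy queries, each with a ranked list of candidate ids.
ranked_lists = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "a", "b"],
    ["a", "c", "b"],
]
ground_truth = ["a", "a", "b", "a"]

r1 = recall_at_k(ranked_lists, ground_truth, k=1)
r2 = recall_at_k(ranked_lists, ground_truth, k=2)
print(r1, r2)  # 50.0 75.0
```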
Action-conditioned scene graph evolution.
| Model | R@20 |
|---|---|
| Qwen3.5-9B | 9.14 |
| Static baseline | 23.17 |
| Ours | 35.06 |
Long-horizon reasoning over objects and relations.
| Model | Sobj | Srel |
|---|---|---|
| Gemini-3-Pro | 60.94 | 2.75 |
| Qwen3-VL-8B-Thinking | 62.70 | 2.80 |
| Ours | 65.59 | 2.69 |
@article{pistilli2026glen,
title={Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs},
author={Pistilli, Francesca and Peirone, Simone Alberto and Averta, Giuseppe},
journal={arXiv preprint arXiv:2607.xxxxx},
year={2026}
}
This study was carried out within the FAIR - Future Artificial Intelligence Research and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) - MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 - D.D. 1555 11/10/2022, PE00000013). This manuscript reflects only the authors' views and opinions; neither the European Union nor the European Commission can be considered responsible for them. We acknowledge the CINECA award under the ISCRA initiative, for the availability of high performance computing resources and support.