Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs

Francesca Pistilli, Simone Alberto Peirone, Giuseppe Averta

Politecnico di Torino, Italy

Video Scene Graphs

👓 Video Scene Graphs

Scene graphs provide a structured representation of the environment, capturing objects and relationships as they evolve over time.

SG-Ego

🌍 Large-Scale Annotations

We introduce SG-Ego, a large-scale, training-free annotation pipeline for building egocentric scene graphs, and release annotations for Ego4D.

A-GEF  GLEN

🧠 Action-conditioned scene evolution

We introduce the A-GEF task, which casts scene dynamics as a sequence of action-conditioned graph edits, and the GLEN model, which reasons over these evolving graphs.

👓 Why scene graphs?

Understanding activities requires more than recognizing actions or objects: it requires understanding how humans transform the world around them. Objects appear, move, interact, and change their relationships over time. Capturing this evolving structure explicitly enables richer and more grounded reasoning than raw video alone. We believe that spatio-temporal scene graphs are the right representation to capture exactly these evolving interactions. This representation is compositional, interpretable, and editable, making it possible to reason about interactions rather than simply recognize them. Every human activity changes the environment, and scene graphs make those changes explicit. Once interactions are represented as evolving graphs, they become a common language for a wide range of downstream capabilities.

💡 We introduce SG-Ego, the first large-scale extension of Ego4D with temporally consistent scene graph annotations generated through a fully training-free pipeline. Building on this representation, we propose GLEN, a graph-based model that reasons directly over scene graph sequences. GLEN learns to align graphs with textual actions while modeling their temporal evolution, enabling explicit reasoning about how scenes evolve through human activities.

👓 How many things can you do with scene graphs?

  • We introduce Activity-Driven Graph Edit Forecasting (A-GEF), a new benchmark that formulates future prediction as forecasting a sequence of graph edits conditioned on ongoing actions.

  • We show that the same scene graph representation can support a diverse set of challenges, including retrieval, compositional reasoning, and long-horizon forecasting, where GLEN consistently achieves SOTA zero-shot performance with structured and controllable predictions.

  • Together, SG-Ego and GLEN demonstrate that spatio-temporal scene graphs are a powerful foundation for interpretable, compositional, and reusable reasoning over video of human activities.

🌍 SG-Ego: A large-scale video scene graph annotation pipeline

SG-Ego implements a training-free VidSGG strategy divided into three stages. Our pipeline maps a sequence of T frames into a single temporally consistent spatio-temporal scene graph, capturing all the spatial and functional relations in the video. We follow a bottom-up approach: we extract a rich set of frame-level relations, ground them with an open-vocabulary detection model, and build a temporally consistent scene graph of the temporal window.
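
As a rough illustration of the representation described above, a spatio-temporal scene graph can be held as a set of object nodes plus relation triplets annotated with the frames in which they hold. This is only a sketch: the class names, field names, and example objects are hypothetical and are not taken from the SG-Ego release.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A (subject, predicate, object) triplet between two node ids."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Hypothetical container: nodes are object ids, edges map to frame indices."""
    nodes: set[str] = field(default_factory=set)
    edges: dict[Relation, set[int]] = field(default_factory=dict)

    def add(self, rel: Relation, frame: int) -> None:
        # Register both endpoints and record the frame where the relation holds.
        self.nodes.update((rel.subject, rel.obj))
        self.edges.setdefault(rel, set()).add(frame)

    def at(self, frame: int) -> set[Relation]:
        # Relations that hold at a given frame.
        return {r for r, frames in self.edges.items() if frame in frames}


graph = SceneGraph()
graph.add(Relation("hand", "holds", "knife"), frame=0)
graph.add(Relation("hand", "holds", "knife"), frame=1)
graph.add(Relation("knife", "on", "cutting board"), frame=3)
```

Storing frame indices on each edge keeps one graph per clip while still allowing a per-frame view through `at`.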


Stage 1: Captioning

Frame-level relation extraction

Qwen3.5 extracts spatial and functional triplets from each frame, turning raw egocentric video frames into a structured list of relations.

Stage 2: Grounding

📌 Frame-level relation grounding

GroundingDINO localizes the objects and relations from the triplets and builds spatially grounded frame-level scene graphs.

Stage 3: Consolidation

⏱️ Clip-level graph consolidation over time

SAM2+DINOv2 tracks object instances through time and merges frame-level scene graphs into temporally consistent scene graphs.
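
To make the consolidation step more concrete, the sketch below merges frame-level triplets into one clip-level graph once every detected object has been assigned a clip-level track id. It assumes the tracker output is available as a plain dictionary; the function name `consolidate`, the `track_of` mapping, and the example ids are hypothetical, and the actual SG-Ego merging logic may differ.

```python
from collections import defaultdict


def consolidate(frame_graphs, track_of):
    """Merge per-frame triplets into one clip-level graph.

    frame_graphs: list indexed by frame, each a list of (subj, pred, obj)
                  triplets using frame-local object ids.
    track_of:     dict mapping (frame, local_id) -> clip-level track id
                  (assumed to come from an external tracker).
    Returns a dict mapping clip-level triplets to the sorted frames they span.
    """
    merged = defaultdict(set)
    for frame, triplets in enumerate(frame_graphs):
        for subj, pred, obj in triplets:
            # Rewrite frame-local ids into track ids so that the same
            # physical object shares one node across frames.
            key = (track_of[(frame, subj)], pred, track_of[(frame, obj)])
            merged[key].add(frame)
    return {key: sorted(frames) for key, frames in merged.items()}


frame_graphs = [
    [("cup_a", "on", "table_a")],
    [("cup_b", "on", "table_b"), ("hand_b", "holds", "cup_b")],
]
track_of = {
    (0, "cup_a"): "cup#1", (0, "table_a"): "table#1",
    (1, "cup_b"): "cup#1", (1, "table_b"): "table#1", (1, "hand_b"): "hand#1",
}
clip_graph = consolidate(frame_graphs, track_of)
```

In this toy input the relation between the cup and the table is detected in both frames under different local ids and collapses into a single edge spanning frames 0 and 1.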


With SG-Ego, we extend Ego4D with two sets of annotations:

SG-Ego-Align

🌟 Alignment split for graph-text supervision

The Align dataset contains approximately 3.8M spatio-temporal graphs from 7297 unique Ego4D videos, each paired with its corresponding narration and centered on EgoClip action windows.

3.8M samples
~11.0 nodes/graph
~16.1 edges/graph

SG-Ego-Edit

🔧 Edit split for action-conditioned graph evolution

The Edit dataset contains 360k training samples from 6537 videos and 7.2k validation samples from 181 videos, providing both the frame-level start graphs and the clip-level consolidated graphs.

360k samples
~4.1 edges/graph (start)
~17.3 edges/graph (end)
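
Since each Edit sample pairs a start graph with a consolidated end graph, one simple way to picture the supervision signal is as the difference between the two edge sets. The sketch below assumes graphs are plain sets of triplets and that an edit is either an added or a removed triplet; both are simplifying assumptions for illustration, not the documented SG-Ego-Edit format.

```python
def graph_edits(start, end):
    """Return (added, removed) triplets between two graphs given as sets."""
    added = end - start      # relations present only after the action
    removed = start - end    # relations that no longer hold
    return added, removed


start = {("knife", "on", "counter"), ("onion", "on", "cutting board")}
end = {("hand", "holds", "knife"), ("onion", "on", "cutting board")}
added, removed = graph_edits(start, end)
```

Relations shared by both graphs, such as the onion on the cutting board, produce no edit.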

🌍 Examples from SG-Ego-Edit


Scene Graph before the action (start frame)

Scene Graph after the action (start-to-end frame)

🧠 Learning to evolve scenes: the GLEN model

We formulate the dynamic evolution of a scene as a novel setting called Activity-Conditioned Graph Edit Forecasting (A-GEF), in which we ask a model to forecast how a spatial scene graph will change based on the activity described in a text prompt. To do this, we consider a conditional formulation in which future graph edits are predicted given both the current graph and the upcoming activity.
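
The conditional formulation can be pictured as a function from a current graph and an activity to a sequence of edits, which are then applied to obtain the future graph. The sketch below covers only the second half, applying edits. The `Edit` class and its two-operation vocabulary ("add" and "remove") are assumptions made for illustration, and the edits are written by hand rather than predicted by a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    op: str          # "add" or "remove" (hypothetical edit vocabulary)
    triplet: tuple   # (subject, predicate, object)


def apply_edits(graph, edits):
    """Apply a sequence of edits to a set of triplets and return a new set."""
    result = set(graph)
    for edit in edits:
        if edit.op == "add":
            result.add(edit.triplet)
        elif edit.op == "remove":
            result.discard(edit.triplet)
        else:
            raise ValueError(f"unknown edit op: {edit.op!r}")
    return result


current = {("cup", "on", "table")}
# In A-GEF these edits would be forecast from the current graph and an
# activity such as "picks up the cup"; here they are hard-coded.
predicted = [
    Edit("remove", ("cup", "on", "table")),
    Edit("add", ("hand", "holds", "cup")),
]
future = apply_edits(current, predicted)
```

Because every edit is an explicit, inspectable operation, a wrong forecast can be traced back to the specific triplet that was added or removed.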


Our approach GLEN consists of three main components. The Graph Encoder and the Text Encoder map scene graphs and textual prompts into an aligned embedding space, thus learning a versatile representation that summarizes the state of the scene and enables spatially grounded reasoning about objects and relations. Finally, the Graph Edit Model defines a text-conditioned transformation over scene graphs, allowing action-conditioned modifications of the graph.

Figure: method diagram.

The model takes a spatial graph and a conditioning action and predicts the necessary edit to evolve the scene graph into the consolidated final scene graph. For Graph-Text Alignment (GTA), we extract a graph embedding from the consolidated scene graph and align it with the text embedding of the action.
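
The page does not state which objective is used for Graph-Text Alignment. A common choice for aligning two embedding spaces is a symmetric contrastive (InfoNCE-style) loss, sketched below in plain Python purely as an assumption. The function names, the temperature value, and the toy embeddings are all hypothetical and should not be read as GLEN's actual training objective.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    # Mean of -log softmax(row_i)[i]: each row should select its own index.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_alignment_loss(graph_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired graph/text embeddings."""
    g = [_normalize(v) for v in graph_embs]
    t = [_normalize(v) for v in text_embs]
    # Cosine similarity between every graph and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t] for gi in g]
    transposed = [list(col) for col in zip(*logits)]
    # Average the graph-to-text and text-to-graph directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))


text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched = contrastive_alignment_loss(text, text)
mismatched = contrastive_alignment_loss(text[1:] + text[:1], text)
```

With these toy vectors the loss is close to zero when each graph embedding coincides with its paired text embedding and large when the pairing is rotated by one position.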

🎯 From Scene Graphs to Real-World Tasks

We validate our scene graph annotation pipeline and GLEN on several benchmarks. On EgoSchema, we demonstrate that SG-Ego scene graphs provide a structured and complete representation of the scene that can support long-range reasoning about human activities. On EgoMCQ and EgoCVR, we show that the graph embeddings produced by GLEN are well aligned with the textual description of the corresponding ongoing human activities. Finally, on A-GEF and EXPLORE-Bench, we assess the capability of GLEN to predict the dynamic evolution of a scene given an activity.


EgoCVR

Text-to-video retrieval on egocentric clips.

Model | R@1 (Global) | R@1 (Local)
CLIP | 7.4 | 26.1
TFR-CVR | 14.1 | 44.2
Ours | 15.3 | 47.7

Remark: GLEN enables fine-grained reasoning about the spatial and functional relations in the input clips.
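
For readers unfamiliar with the metric, R@K in retrieval is usually computed as the fraction of queries for which a correct item appears among the top K ranked results. The sketch below follows that common definition. The function name and the toy rankings are hypothetical, and the exact evaluation protocol of each benchmark (in particular how R@20 is defined for A-GEF) may differ.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose top-k ranked list contains a relevant item."""
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranking[:k])
    )
    return hits / len(ranked_ids)


# One ranked list of candidate video ids per query, best match first.
rankings = [["v3", "v1", "v7"], ["v2", "v9", "v4"], ["v5", "v6", "v8"]]
# Ground-truth target ids per query.
targets = [{"v3"}, {"v4"}, {"v0"}]

r1 = recall_at_k(rankings, targets, k=1)  # only the first query hits at rank 1
r3 = recall_at_k(rankings, targets, k=3)  # the second query also hits within the top 3
```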

A-GEF

Action-conditioned scene graph evolution.

Model | R@20
Qwen3.5-9B | 9.14
Static baseline | 23.17
Ours | 35.06

Remark: Graph edits enable more accurate and interpretable predictions of future scene structure.

EXPLORE-Bench

Long-horizon reasoning over objects and relations.

Model | S_obj | S_rel
Gemini-3-Pro | 60.94 | 2.75
Qwen3-VL-8B-Thinking | 62.70 | 2.80
Ours | 65.59 | 2.69

Remark: Scene graphs enable structured reasoning over long action sequences.

✏️ Cite Us

BibTeX

@article{pistilli2026glen,
  title={Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs},
  author={Pistilli, Francesca and Peirone, Simone Alberto and Averta, Giuseppe},
  journal={arXiv preprint arXiv:2607.xxxxx},
  year={2026}
}

Acknowledgements

This study was carried out within the FAIR - Future Artificial Intelligence Research and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1555 11/10/2022, PE00000013). This manuscript reflects only the authors’ views and opinions; neither the European Union nor the European Commission can be considered responsible for them. We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources and support.