Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

ICCV 2025

1University of Edinburgh, 2Huawei Noah’s Ark Lab

We present a complete affordance learning system that encompasses data collection (from egocentric videos),
effective model training, and robot deployment for manipulation tasks.


Given a task and a cluttered scene, the robot can select the object that possesses the required affordance, grasp the correct part,
and apply the functional part to the target object to perform the desired action.

Abstract

Affordance, defined as the potential actions that an object offers, is crucial for embodied AI agents. For example, such knowledge directs an agent to grasp a knife by the handle for cutting or by the blade for safe handover. While existing approaches have made notable progress, affordance research still faces three key challenges: data scarcity, poor generalization, and real-world deployment. Specifically, there is a lack of large-scale affordance datasets with precise segmentation maps, existing models struggle to generalize across different domains or novel object and affordance classes, and little work demonstrates deployability in real-world scenarios.

In this work, we address these issues by proposing a complete affordance learning system that (1) takes in egocentric videos and outputs precise affordance annotations without human labeling, (2) leverages geometric information and vision foundation models to improve generalization, and (3) introduces a framework that facilitates affordance-oriented robotic manipulation such as tool grasping and robot-to-human tool handover.

Experimental results show that our model surpasses the state-of-the-art by 13.8% in mIoU, and the framework achieves a 77.1% grasping success rate across 179 trials, including evaluations on seen and unseen classes and in cluttered scenes.

Geometry-guided Affordance Transformer (GAT)


The architecture of GAT. It consists of a DINOv2 image encoder, a depth feature injector, an embedder, and LoRA layers. The model performs segmentation by computing cosine similarity between upsampled features and learnable / CLIP text embeddings.
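To make the segmentation step concrete, the sketch below shows how per-pixel affordance logits can be computed as cosine similarities between upsampled dense features and a set of class embeddings (learnable or CLIP text embeddings). The tensor shapes, the scale factor, and the function name are illustrative assumptions rather than the paper's exact implementation.

import torch
import torch.nn.functional as F

def affordance_logits(dense_feats: torch.Tensor,
                      class_embeds: torch.Tensor,
                      scale: float = 100.0) -> torch.Tensor:
    # dense_feats:  (B, C, H, W) upsampled per-pixel features from the image encoder.
    # class_embeds: (K, C) affordance class embeddings (learnable or CLIP text).
    # Returns (B, K, H, W) per-pixel affordance logits.

    # L2-normalise both sides so the dot product becomes cosine similarity.
    feats = F.normalize(dense_feats, dim=1)      # (B, C, H, W)
    embeds = F.normalize(class_embeds, dim=1)    # (K, C)

    # Cosine similarity between every pixel feature and every class embedding.
    sim = torch.einsum("bchw,kc->bkhw", feats, embeds)

    # A temperature-like scale sharpens the maps before the loss / argmax.
    return scale * sim

# Usage: the predicted affordance map is the per-pixel argmax over classes.
logits = affordance_logits(torch.randn(1, 256, 128, 128), torch.randn(9, 256))
pred = logits.argmax(dim=1)                      # (1, 128, 128)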



Aff-Grasp


The framework of Aff-Grasp. It first employs an open-vocabulary detector to locate all objects within the scene, which are then sent to GAT to determine whether they possess the affordance required for the task. Afterwards, a 6-DoF grasp generation model, Contact-GraspNet, leverages the object's graspable affordance region and the depth map to generate dense grasp proposals. Finally, the robot executes affordance-specific sequential motion primitives to apply the functional part to the target.
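A high-level sketch of how these four stages could be wired together is given below. All component callables (detector, gat, grasp_generator, execute) and the mask keys are hypothetical placeholders supplied by the caller; they do not reproduce the real APIs of the detector, GAT, or Contact-GraspNet.

def aff_grasp(rgb, depth, required_affordance,
              detector, gat, grasp_generator, execute):
    # Select an object that offers the required affordance, grasp its
    # graspable part, and apply its functional part to the target.

    # 1. Open-vocabulary detection: locate every object in the scene.
    for crop in detector(rgb):
        # 2. GAT predicts per-pixel affordance masks for the detected object,
        #    e.g. {"graspable": mask, "cut": mask}.
        aff_masks = gat(crop)
        if required_affordance not in aff_masks:
            continue  # this object cannot serve the task

        # 3. The grasp generator (Contact-GraspNet in the paper) turns the
        #    graspable-part mask and the depth map into 6-DoF grasp proposals.
        grasps = grasp_generator(depth, aff_masks["graspable"])
        best = max(grasps, key=lambda g: g.score)

        # 4. Affordance-specific sequential motion primitives apply the
        #    functional part (e.g. the blade for "cut") to the target object.
        execute(best, aff_masks[required_affordance])
        return True

    return False  # no object in the scene possesses the required affordance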



Video for Robot Experiments

BibTeX

@inproceedings{li2024affgrasp,
  title     = {Learning Precise Affordances from Egocentric Videos for Robotic Manipulation},
  author    = {Li, Gen and Tsagkas, Nikolaos and Song, Jifei and Mon-Williams, Ruaridh and Vijayakumar, Sethu and Shao, Kun and Sevilla-Lara, Laura},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2025},
}