publications
* denotes equal contribution
An up-to-date list is available on Google Scholar.
2024
- Preprint: Learning Precise Affordances from Egocentric Videos for Robotic Manipulation. Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, and 2 more authors. arXiv, 2024.
Affordance, defined as the potential actions that an object offers, is crucial for robotic manipulation tasks. A deep understanding of affordance can lead to more intelligent AI systems. For example, such knowledge directs an agent to grasp a knife by the handle for cutting and by the blade when passing it to someone. In this paper, we present a streamlined affordance learning system that encompasses data collection, effective model training, and robot deployment. First, we collect training data from egocentric videos in an automatic manner. Different from previous methods that focus only on the graspable affordance of objects and represent it as coarse heatmaps, we cover both graspable (e.g., object handles) and functional affordances (e.g., knife blades, hammer heads) and extract data with precise segmentation masks. We then propose an effective model, termed Geometry-guided Affordance Transformer (GKT), to train on the collected data. GKT integrates an innovative Depth Feature Injector (DFI) to incorporate 3D shape and geometric priors, enhancing the model’s understanding of affordances. To enable affordance-oriented manipulation, we further introduce Aff-Grasp, a framework that combines GKT with a grasp generation model. For comprehensive evaluation, we create an affordance evaluation dataset with pixel-wise annotations, and design real-world tasks for robot experiments. The results show that GKT surpasses the state-of-the-art by 15.9% in mIoU, and Aff-Grasp achieves high success rates of 95.5% in affordance prediction and 77.1% in successful grasping across 179 trials, including evaluations with seen and unseen objects and cluttered scenes.
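The abstract names the Depth Feature Injector without describing its internals, so the snippet below is a minimal, hypothetical sketch of one way depth features could be fused into RGB features via cross-attention; the class name `DepthFeatureInjector` and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: inject geometric cues from depth tokens into RGB tokens.
import torch
import torch.nn as nn

class DepthFeatureInjector(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)
        # RGB tokens attend to depth tokens to pick up 3D shape priors.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rgb_tokens: torch.Tensor, depth_tokens: torch.Tensor) -> torch.Tensor:
        # rgb_tokens, depth_tokens: (B, N, dim) flattened spatial features.
        q = self.norm_rgb(rgb_tokens)
        kv = self.norm_depth(depth_tokens)
        injected, _ = self.cross_attn(q, kv, kv)
        return rgb_tokens + injected  # residual fusion keeps appearance features intact


if __name__ == "__main__":
    x_rgb = torch.randn(2, 196, 256)    # e.g. 14x14 patch tokens
    x_depth = torch.randn(2, 196, 256)  # depth map encoded to the same token grid
    print(DepthFeatureInjector()(x_rgb, x_depth).shape)  # torch.Size([2, 196, 256])
```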
- CVPR’24: One-Shot Open Affordance Learning with Foundation Models. Gen Li, Deqing Sun, Laura Sevilla-Lara, and Varun Jampani. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analysis of existing foundation models, to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data, and exhibits reasonable generalization capability on unseen objects and affordances.
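As a rough illustration of the vision-language alignment the abstract describes, the sketch below scores per-pixel visual features against affordance text embeddings with cosine similarity, CLIP-style; the function `affordance_logits` and the temperature value are assumptions for illustration, not the paper's actual designs.

```python
import torch
import torch.nn.functional as F

def affordance_logits(pixel_feats: torch.Tensor,
                      text_embeds: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """pixel_feats: (B, C, H, W) visual features; text_embeds: (K, C), one embedding
    per affordance label. Returns (B, K, H, W) similarity logits."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_embeds = F.normalize(text_embeds, dim=1)
    # Cosine similarity between every pixel feature and every affordance embedding.
    return torch.einsum("bchw,kc->bkhw", pixel_feats, text_embeds) / temperature

if __name__ == "__main__":
    feats = torch.randn(2, 512, 32, 32)
    texts = torch.randn(6, 512)  # e.g. embeddings of 6 affordance prompts
    print(affordance_logits(feats, texts).shape)  # torch.Size([2, 6, 32, 32])
```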
2023
- IJCNN’23: Referenceless User Controllable Semantic Image Synthesis. Jonghyun Kim, Gen Li, and Joongkyu Kim. In International Joint Conference on Neural Networks, 2023.
Despite recent progress in semantic image synthesis, complete control over image style remains a challenging problem. Existing methods require reference images to feed style information into semantic layouts, which means the style is constrained by the given image. In this paper, we propose a model named RUCGAN for user-controllable semantic image synthesis, which utilizes a single color to represent the style of a specific semantic region. The proposed network achieves reference-free semantic image synthesis by injecting colors as user-desired styles into each semantic layout, and is able to synthesize semantic images with unusual colors. Extensive experimental results on various challenging datasets show that the proposed method outperforms existing methods, and we further provide an interactive UI to demonstrate the advantage of our approach for style controllability. The code and UI are available at: https://github.com/BenjaminJonghyun/RUCGAN
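To make the reference-free style idea concrete, here is a minimal sketch of building a style input by filling each semantic region with a user-chosen color; the helper `build_color_style_map` is hypothetical and only illustrates the input construction, not RUCGAN's injection mechanism.

```python
import torch

def build_color_style_map(semantic_map, region_colors):
    """semantic_map: (H, W) integer class labels; region_colors: dict mapping class id
    to an (r, g, b) tuple in [0, 1]. Returns a (3, H, W) style map where every
    semantic region is filled with its user-chosen color."""
    h, w = semantic_map.shape
    style = torch.zeros(3, h, w)
    for cls_id, (r, g, b) in region_colors.items():
        mask = semantic_map == cls_id
        style[0][mask], style[1][mask], style[2][mask] = r, g, b
    return style

if __name__ == "__main__":
    seg = torch.randint(0, 3, (64, 64))
    style = build_color_style_map(seg, {0: (0.9, 0.2, 0.2), 1: (0.2, 0.8, 0.3), 2: (0.1, 0.3, 0.9)})
    print(style.shape)  # torch.Size([3, 64, 64])
```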
- CVPR’23: LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding. Gen Li, Varun Jampani, Deqing Sun, and Laura Sevilla-Lara. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
Humans excel at acquiring knowledge through observation. For example, we can learn to use new tools by watching demonstrations. This skill is fundamental for intelligent systems to interact with the world. A key step in acquiring this skill is to identify what part of the object affords each action, which is called affordance grounding. In this paper, we address this problem and propose a framework called LOCATE that can identify matching object parts across images, to transfer knowledge from images where an object is being used (exocentric images used for learning) to images where the object is inactive (egocentric images used for testing). To this end, we first find interaction areas and extract their feature embeddings. Then we learn to aggregate the embeddings into compact prototypes (human, object part, and background), and select the one representing the object part. Finally, we use the selected prototype to guide affordance grounding. We do this in a weakly supervised manner, learning only from image-level affordance and object labels. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a large margin on both seen and unseen objects.
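A small sketch of the selection step, under the assumption that the object-part prototype is the one most similar to the egocentric object feature; the function `select_part_prototype` is illustrative and may not match the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def select_part_prototype(prototypes: torch.Tensor,
                          ego_object_feat: torch.Tensor) -> torch.Tensor:
    """prototypes: (K, C) cluster centers from exocentric interaction regions
    (e.g. human / object part / background); ego_object_feat: (C,) feature of the
    inactive object in the egocentric image. Returns the (C,) prototype with the
    highest cosine similarity to the egocentric object feature."""
    sims = F.cosine_similarity(prototypes, ego_object_feat.unsqueeze(0), dim=1)
    return prototypes[sims.argmax()]

if __name__ == "__main__":
    protos = torch.randn(3, 256)          # three candidate prototypes
    ego = torch.randn(256)                # egocentric object feature
    print(select_part_prototype(protos, ego).shape)  # torch.Size([256])
```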
2021
- CVPR’21: Adaptive Prototype Learning and Allocation for Few-Shot Segmentation. Gen Li, Varun Jampani, Laura Sevilla-Lara, Deqing Sun, Jonghyun Kim, and 1 more author. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Prototype learning is extensively used for few-shot segmentation. Typically, a single prototype is obtained from the support feature by averaging the global object information. However, using one prototype to represent all the information may lead to ambiguities. In this paper, we propose two novel modules, named superpixel-guided clustering (SGC) and guided prototype allocation (GPA), for multiple prototype extraction and allocation. Specifically, SGC is a parameter-free and training-free approach, which extracts more representative prototypes by aggregating similar feature vectors, while GPA is able to select matched prototypes to provide more accurate guidance. By integrating SGC and GPA together, we propose the Adaptive Superpixel-guided Network (ASGNet), which is a lightweight model that adapts to object scale and shape variation. In addition, our network can easily generalize to k-shot segmentation with substantial improvement and no additional computational cost. In particular, our evaluations on COCO demonstrate that ASGNet surpasses the state-of-the-art method by 5% in 5-shot segmentation.
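The sketch below approximates the two modules: plain k-means stands in for the superpixel-guided clustering (the real SGC is superpixel-based and parameter-free), and per-pixel prototype selection by cosine similarity stands in for GPA; the helpers `cluster_prototypes` and `allocate_guidance` are simplified assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def cluster_prototypes(support_feat, support_mask, k=5, iters=10):
    """Training-free clustering of masked support features into k prototypes,
    as a stand-in for superpixel-guided clustering (SGC)."""
    C, H, W = support_feat.shape
    feats = support_feat.permute(1, 2, 0).reshape(-1, C)[support_mask.flatten() > 0]
    centers = feats[torch.randperm(feats.shape[0])[:k]]          # random initialization
    for _ in range(iters):                                        # plain k-means updates
        assign = torch.cdist(feats, centers).argmin(dim=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(dim=0)
    return centers                                                # (k, C)

def allocate_guidance(query_feat, prototypes):
    """GPA-style step: each query location picks its most similar prototype,
    which then serves as pixel-wise guidance."""
    C, H, W = query_feat.shape
    q = F.normalize(query_feat.reshape(C, -1), dim=0)             # (C, HW)
    p = F.normalize(prototypes, dim=1)                            # (k, C)
    idx = (p @ q).argmax(dim=0)                                   # best prototype per pixel
    return prototypes[idx].t().reshape(C, H, W)                   # (C, H, W) guidance map

if __name__ == "__main__":
    sup_feat, sup_mask = torch.randn(64, 32, 32), (torch.rand(32, 32) > 0.5).long()
    protos = cluster_prototypes(sup_feat, sup_mask)
    print(allocate_guidance(torch.randn(64, 32, 32), protos).shape)  # torch.Size([64, 32, 32])
```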
- BMVC’21: SuperStyleNet: Deep Image Synthesis with Superpixel Based Style Encoder. Jonghyun Kim, Gen Li, Cheolkon Jung, and Joongkyu Kim. In British Machine Vision Conference, 2021.
Existing methods for image synthesis utilize a style encoder based on stacks of convolutions and pooling layers to generate style codes from input images. However, the encoded vectors do not necessarily contain local information of the corresponding images, since small-scale objects tend to be "washed away" by such downscaling procedures. In this paper, we propose SuperStyleNet, a deep image synthesis network with a superpixel-based style encoder. First, we directly extract the style codes from the original image based on superpixels to consider local objects. Second, we recover spatial relationships in the vectorized style codes based on graphical analysis. Thus, the proposed network achieves high-quality image synthesis by mapping the style codes into semantic labels. Experimental results show that the proposed method outperforms state-of-the-art methods in terms of visual quality and quantitative measurements. Furthermore, we achieve elaborate spatial style editing by adjusting style codes.
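A minimal sketch of the per-superpixel pooling step implied by the abstract, assuming a precomputed superpixel label map; the helper `superpixel_style_codes` is illustrative and omits the graphical analysis used to recover spatial relationships.

```python
import torch

def superpixel_style_codes(image_feat: torch.Tensor, superpixels: torch.Tensor) -> torch.Tensor:
    """image_feat: (C, H, W) image or feature tensor; superpixels: (H, W) integer
    superpixel labels. Returns (S, C) style codes, one mean vector per superpixel."""
    C = image_feat.shape[0]
    labels = superpixels.flatten()
    feats = image_feat.reshape(C, -1).t()                 # (HW, C)
    n = int(labels.max().item()) + 1
    codes = torch.zeros(n, C)
    codes.index_add_(0, labels, feats)                    # sum features per superpixel
    counts = torch.bincount(labels, minlength=n).clamp(min=1).unsqueeze(1)
    return codes / counts                                 # mean per superpixel

if __name__ == "__main__":
    img = torch.rand(3, 64, 64)
    sp = torch.randint(0, 50, (64, 64))                   # e.g. SLIC-style label map
    print(superpixel_style_codes(img, sp).shape)          # torch.Size([50, 3])
```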
- PR: Weakly-supervised temporal attention 3D network for human action recognition. Jonghyun Kim, Gen Li, Inyong Yun, Cheolkon Jung, and Joongkyu Kim. Pattern Recognition, 2021.
From a series of observations, we have inferred that human actions in videos are defined by a set of significant frames. In this paper, we propose a weakly-supervised temporal attention 3D network for human action recognition, called TA3DNet, to accelerate 3D convolutional neural networks (3D CNNs) by temporally assigning different importance to each frame. First, we obtain short-term frames with long-term connections by regularly or randomly skipping frames to avoid temporal redundancy, and apply 3D convolutional layers to extract features for action recognition. Then, we apply a temporal attention module to assign different weights to each frame. We train the temporal attention module in a weakly-supervised manner, updating its weights using only class labels, without event information or extra annotations. Thus, TA3DNet reduces the number of input frames and constructs a lightweight network for action recognition. Experimental results demonstrate that TA3DNet achieves high performance on two challenging datasets (UCF101 and HMDB51) and outperforms state-of-the-art methods for action recognition.
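A hedged sketch of a per-frame attention module of the kind the abstract describes: each frame is scored from its pooled features and the clip is reweighted; the module `TemporalAttention` and its small scoring MLP are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Hypothetical per-frame attention: scores each frame from its pooled features
    and reweights the clip; trained only with the video-level classification loss."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(channels, channels // 4),
                                   nn.ReLU(inplace=True),
                                   nn.Linear(channels // 4, 1))

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, C, T, H, W) as consumed by a 3D CNN.
        pooled = clip.mean(dim=(3, 4)).transpose(1, 2)      # (B, T, C) per-frame descriptor
        weights = torch.softmax(self.score(pooled), dim=1)  # (B, T, 1) frame importance
        return clip * weights.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)

if __name__ == "__main__":
    x = torch.randn(2, 64, 16, 28, 28)
    print(TemporalAttention(64)(x).shape)  # torch.Size([2, 64, 16, 28, 28])
```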
- NC: Edge and identity preserving network for face super-resolution. Jonghyun Kim, Gen Li, Inyong Yun, Cheolkon Jung, and Joongkyu Kim. Neurocomputing, 2021.
Face super-resolution (SR) has become an indispensable function in security solutions such as video surveillance and identification systems, but distortion in facial components remains a great challenge. Most state-of-the-art methods have utilized facial priors with deep neural networks. These methods require extra labels, longer training time, and more memory. In this paper, we propose a novel Edge and Identity Preserving Network for face SR, named EIPNet, which minimizes distortion by utilizing a lightweight edge block and identity information. We present an edge block to extract perceptual edge information, and concatenate it to the original feature maps at multiple scales. This structure progressively provides edge information during reconstruction to aggregate local and global structural information. Moreover, we define an identity loss function to preserve the identity of SR images. The identity loss function compares feature distributions between SR images and their ground truth to recover identities in SR images. In addition, we provide a luminance-chrominance error (LCE) to separately infer brightness and color information in SR images. The LCE method not only reduces the dependency on color information by separating brightness and color components but also enables our network to reflect differences between SR images and their ground truth in two color spaces, RGB and YUV. The proposed method enables the SR network to elaborately restore facial components and generate high-quality 8× scaled SR images with a lightweight network structure. Furthermore, our network can reconstruct a 128×128 SR image at 215 fps on a GTX 1080Ti GPU. Extensive experiments demonstrate that our network qualitatively and quantitatively outperforms state-of-the-art methods on two challenging datasets: CelebA and VGGFace2.
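The sketch below illustrates plausible forms of the two losses named in the abstract, assuming cosine distance between identity embeddings and a BT.601 RGB-to-YUV conversion for the LCE term; the exact formulations in the paper may differ.

```python
import torch
import torch.nn.functional as F

def identity_loss(sr_embed: torch.Tensor, hr_embed: torch.Tensor) -> torch.Tensor:
    """Compare identity embeddings of the SR output and its ground truth; the
    embeddings would come from an identity encoder (any face encoder can stand in)."""
    return 1.0 - F.cosine_similarity(sr_embed, hr_embed, dim=1).mean()

def rgb_to_yuv(img: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W) in [0, 1]. BT.601 RGB -> YUV conversion."""
    m = torch.tensor([[0.299, 0.587, 0.114],
                      [-0.147, -0.289, 0.436],
                      [0.615, -0.515, -0.100]], device=img.device)
    return torch.einsum("ij,bjhw->bihw", m, img)

def luminance_chrominance_error(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Penalize errors in both RGB and YUV so brightness and color are supervised separately."""
    return F.l1_loss(sr, hr) + F.l1_loss(rgb_to_yuv(sr), rgb_to_yuv(hr))

if __name__ == "__main__":
    sr, hr = torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128)
    e_sr, e_hr = torch.randn(2, 512), torch.randn(2, 512)
    print(identity_loss(e_sr, e_hr).item(), luminance_chrominance_error(sr, hr).item())
```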
2020
- Access: Depth-Wise Asymmetric Bottleneck With Point-Wise Aggregation Decoder for Real-Time Semantic Segmentation in Urban Scenes. Gen Li, Shenlu Jiang, Inyong Yun, Jonghyun Kim, and Joongkyu Kim. IEEE Access, 2020.
Semantic segmentation is the process of linking each pixel in an image to a class label, and is widely used in the field of autonomous vehicles and robotics. Although deep learning methods have already made great progress in semantic segmentation, they either achieve strong results with numerous parameters or adopt lightweight designs that heavily sacrifice segmentation accuracy. Because of the strict requirements of real-world applications, it is critical to design an effective real-time model with both competitive segmentation accuracy and small model capacity. In this paper, we propose a lightweight network named DABNet, which employs Depth-wise Asymmetric Bottleneck (DAB) and Point-wise Aggregation Decoder (PAD) modules to tackle the challenging real-time semantic segmentation of urban scenes. Specifically, the DAB module creates a sufficient receptive field and densely utilizes the contextual information, and the PAD module aggregates the feature maps of different scales to optimize performance through the attention mechanism. Compared with existing methods, our network substantially reduces the number of parameters but still achieves high accuracy with real-time inference ability. Extensive ablation experiments on two challenging urban scene datasets (Cityscapes and CamVid) demonstrate the effectiveness of the proposed approach in real-time semantic segmentation.
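As an illustration of attention-based multi-scale fusion with point-wise convolutions, here is a hedged sketch; the class `PointwiseAggregation` is a guess at the general idea, since the abstract does not specify the PAD module's exact wiring.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointwiseAggregation(nn.Module):
    """Illustrative decoder step: fuse a low-resolution, high-level map with a
    higher-resolution one using 1x1 (point-wise) convolutions and a simple
    attention gate."""
    def __init__(self, high_ch: int, low_ch: int, out_ch: int):
        super().__init__()
        self.proj_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        self.proj_low = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.gate = nn.Conv2d(out_ch, out_ch, kernel_size=1)

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # high: deeper, lower-resolution features; low: shallower, higher-resolution.
        high = F.interpolate(self.proj_high(high), size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        low = self.proj_low(low)
        attn = torch.sigmoid(self.gate(high))   # attention derived from semantic features
        return low * attn + high                # reweighted fusion of the two scales

if __name__ == "__main__":
    deep, shallow = torch.randn(1, 128, 16, 16), torch.randn(1, 64, 64, 64)
    print(PointwiseAggregation(128, 64, 64)(deep, shallow).shape)  # torch.Size([1, 64, 64, 64])
```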
2019
- BMVC’19: DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. Gen Li and Joongkyu Kim. In British Machine Vision Conference, 2019.
As a pixel-level prediction task, semantic segmentation requires high computational cost and a large number of parameters to obtain high performance. Recently, due to the increasing demand for autonomous systems and robots, it has become important to strike a balance between accuracy and inference speed. In this paper, we propose a novel Depth-wise Asymmetric Bottleneck (DAB) module to address this dilemma, which efficiently adopts depth-wise asymmetric convolution and dilated convolution to build a bottleneck structure. Based on the DAB module, we design a Depth-wise Asymmetric Bottleneck Network (DABNet) especially for real-time semantic segmentation, which creates a sufficient receptive field and densely utilizes the contextual information. Experiments on the Cityscapes and CamVid datasets demonstrate that the proposed DABNet achieves a balance between speed and precision. Specifically, without any pretrained model or post-processing, it achieves 70.1% Mean IoU on the Cityscapes test dataset with only 0.76 million parameters and a speed of 104 FPS on a single GTX 1080Ti card.
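A hedged sketch of a DAB-style block built from factorized depth-wise convolutions, with one branch dilated, inside a residual bottleneck; channel widths, activation, and ordering in `DepthwiseAsymmetricBottleneck` are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DepthwiseAsymmetricBottleneck(nn.Module):
    """Sketch of a DAB-style block: factorized (3x1 / 1x3) depth-wise convolutions,
    one pair dilated for a larger receptive field, combined in a residual bottleneck."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        mid = channels // 2                      # bottleneck width
        self.reduce = nn.Conv2d(channels, mid, kernel_size=3, padding=1)
        # depth-wise asymmetric pair (local context)
        self.dw31 = nn.Conv2d(mid, mid, (3, 1), padding=(1, 0), groups=mid)
        self.dw13 = nn.Conv2d(mid, mid, (1, 3), padding=(0, 1), groups=mid)
        # dilated depth-wise asymmetric pair (enlarged receptive field)
        self.ddw31 = nn.Conv2d(mid, mid, (3, 1), padding=(dilation, 0),
                               dilation=(dilation, 1), groups=mid)
        self.ddw13 = nn.Conv2d(mid, mid, (1, 3), padding=(0, dilation),
                               dilation=(1, dilation), groups=mid)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.reduce(x))
        local = self.dw13(self.dw31(y))
        context = self.ddw13(self.ddw31(y))
        return x + self.expand(self.act(local + context))   # residual connection

if __name__ == "__main__":
    x = torch.randn(1, 64, 128, 128)
    print(DepthwiseAsymmetricBottleneck(64)(x).shape)  # torch.Size([1, 64, 128, 128])
```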