Xihao Chen ( 陈熙昊 )

I’m a third-year PhD student in computer science affiliated with the Integrative Sciences and Engineering Programme (ISEP) and the School of Computing, National University of Singapore. I am fortunate to be advised by Professor Roger Zimmermann and Professor Yangyang Guo.

My research interests lie at the intersection of multi-modal algorithms and the efficiency of large foundation models. Currently, I am working on reducing the resources needed for the training and inference of large vision-language models. I am also interested in agentic systems, specifically the storage and retrieval of multi-modal memories.

Prior to my PhD studies, I obtained my Bachelor of Computing (Computer Science) with Honours (Highest Distinction) from the NUS School of Computing. My thesis, supervised by Professor Kian-Lee Tan, was on Multi-Modal Entity Resolution.

I enjoy sports: gym, bouldering and fencing. Additionally, I play competitive games such as Valorant, although I can’t claim to be good at them.

News

May 2026

One paper accepted to Transactions on Machine Learning Research.

Jun 2025

Passed Qualifying Examination and became a PhD candidate.

Jun 2025

Awarded the Teaching Fellowship Scheme award for being one of the top graduate tutors in SoC.

Aug 2023

Started my PhD at ISEP and SoC at NUS.

Jun 2023

Obtained my Bachelor of Computing (Computer Science) with Honours (Highest Distinction)!

Selected Publications

Make Your LVLM KV Cache More Lightweight

Xihao Chen, Yangyang Guo, and Roger Zimmermann

Transactions on Machine Learning Research, May 2026

PDF Code

The Key-Value (KV) cache has become a de facto standard component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, including MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.

Seeing the Unseen: Unified Visible-Invisible Motion for Physically Consistent Video Generation

Mingyang Bao, Fangda Ye, Xihao Chen, Bobo Li, Xiangtai Li, Shengqiong Wu

Under review

Recent advances in video generation have achieved impressive visual quality, yet they often fail to produce physically consistent dynamics due to the lack of explicit modeling of underlying physical processes. Existing approaches either rely on simulators with strong assumptions or use coarse semantic guidance, and further struggle to effectively inject physical priors into generation models. In this work, we propose Seeing the Unseen, a unified framework for physics-grounded video generation. Our approach decomposes the problem into two stages: learning physical dynamics and injecting them into video synthesis. First, we introduce a lightweight Particle-based Graph Dynamics Simulator (PGDS) that learns generalizable physical interactions from data and predicts plausible 3D motion trajectories. Second, we propose a Visible-Invisible Motion Field (VIMF) that captures both motion observable in the input frame and motion that emerges over time due to object dynamics. This representation enables more complete and structured motion guidance compared to conventional physical signals. By integrating these trajectories into a diffusion-based generator, our method produces videos that are both visually coherent and physically consistent. Extensive experiments demonstrate improved motion accuracy, interaction consistency, and reduced physical artifacts compared to strong baselines.

Can We Make Coreset Selection for LVLM Fine-Tuning More Efficient?

Xihao Chen, Yangyang Guo, and Roger Zimmermann

Under review

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a range of cross-modal understanding tasks. However, their supervised fine-tuning (SFT) stage often requires extensive data, leading to substantial challenges given limited resource budgets. In this work, we focus specifically on visual instruction SFT, where models are trained on multimodal instruction–response pairs rather than task-specific adaptation datasets. To address this bottleneck, recent efforts in data efficiency have exclusively relied on coreset selection to produce a reduced dataset of informative samples. All these methods, as found in this work, incur significant resource burdens in both the time and the additional storage required for coreset selection. Therefore, we propose a novel, resource-light coreset selection method for alleviating this bottleneck. Our method adopts a two-stage design. First, an LLM estimates the linguistic difficulty of each sample without visual input to identify high-language-prior samples. Second, we introduce a biased sampling distribution that favors challenging samples while maintaining data diversity. We evaluate our method on three representative models, LLaVA-1.5-7B, Qwen2-VL-7B, and InternVL2-8B, trained on two general-purpose datasets for visual instruction SFT. Our method consistently outperforms existing state-of-the-art baselines at the same coreset size budgets. More importantly, our approach is significantly more efficient at coreset selection than these baselines. These results together demonstrate the effectiveness and lightweight nature of our approach for efficient LVLM SFT, especially in resource-limited settings.

Awards

Jan 2025

School of Computing Teaching Fellowship Scheme Award (11 awarded per year).

Aug 2023

NUS President's Graduate Fellowship (top 5% of PhD admissions).

Jun 2022

Dean's List Honours Roll (AY2021/22 Sem 2).