Abstract

Large Vision-Language Models (LVLMs) represent a significant advancement in AI, enabling systems to understand and generate content across both visual and textual modalities. While large-scale pretraining has driven substantial progress, fine-tuning these models to align with human values or perform specific tasks remains a critical challenge. Reinforcement Learning (RL) offers a promising framework for this fine-tuning process, enabling models to optimize their actions based on reward signals rather than relying solely on supervised preference data. This talk presents an overview of paradigms for fine-tuning LVLMs, highlighting how RL techniques can be used to align models with human values, improve task performance, and enable adaptive multimodal interaction. I categorize key approaches, examine sources of preference data and reward signals, and discuss open challenges. The goal is to provide a clear understanding of how RL contributes to the evolution of fine-tuned, robust, and human-aligned LVLMs.
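The core idea above, optimizing a model from reward signals rather than supervised labels, can be illustrated with a toy REINFORCE-style update. Everything here is an illustrative assumption rather than a detail from the talk: the policy is a softmax over a few candidate responses, and the reward function is a stand-in for a learned reward model.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, rng=random):
    """Sample one response from the policy, score it with the reward
    function, and nudge the logits by the reward-weighted gradient."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # REINFORCE gradient of log pi(action) for a softmax policy:
    # d/d_logit_i = 1[i == action] - p_i, scaled by the reward.
    new_logits = [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, action, reward

# Hypothetical reward model: candidate response 2 is the "aligned" one.
reward_model = lambda a: 1.0 if a == 2 else -0.2

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits, _, _ = reinforce_step(logits, reward_model, rng=rng)

best = max(range(3), key=lambda i: softmax(logits)[i])
print(best)
```

After a few hundred sampled interactions, the policy concentrates its probability mass on the highest-reward response, with no labeled preference pairs involved; full-scale LVLM fine-tuning replaces the toy policy with the model's generation distribution and the stand-in reward with a learned or human-derived signal.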

Guest speaker A/Prof Thanh Thi Nguyen (fourth from left) with AAII researchers and students after the AAII seminar on 10 June 2025.

Speaker

Associate Professor Thanh Thi Nguyen has been ranked among the world’s top 2% of AI scientists by Elsevier and Stanford University. He was a visiting scholar with the Computer Science Department at Stanford University in 2015 and the John A. Paulson School of Engineering and Applied Sciences at Harvard University in 2019. Dr. Nguyen received a European-Pacific Partnership for ICT Expert Exchange Program Award from the European Commission in 2018 and an Australia–India Strategic Research Fund Early- and Mid-Career Fellowship from the Australian Academy of Science in 2020. He is currently an associate professor (research) at the Faculty of Information Technology, Monash University, Melbourne, Australia.