Minttu Alakuijala

I am a Postdoctoral Researcher at Aalto University working on LLMs, reinforcement learning (RL), and structures for systematic thinking in language models. I work with Samuel Kaski and Pekka Marttinen, and I coordinate the Foundation models for language & reinforcement learning team at FCAI, the Finnish Center for Artificial Intelligence.

Previously, I obtained my PhD from Inria and École Normale Supérieure in Paris, in collaboration with Google Research through the CIFRE scheme. My PhD research focused on RL and learning from demonstration for robotic manipulation. I continue to work actively on learning reward models, multi-modality and embodied AI, topics I first explored during my PhD. I defended my thesis, Autonomous and Weakly-Supervised Learning for Robotic Manipulation, in December 2022, advised by Cordelia Schmid, Jean Ponce and Julien Mairal. At Inria, I was a member of the Willow and Thoth teams.

Email  |  CV  |  Google Scholar

Research

There are considerable mutual benefits to be gained from integrating ideas from reinforcement learning (RL) and language models. LLMs possess a vast but mostly static base of real-world knowledge, often not specific enough for a given environment. Moreover, standard decoding is myopic: tokens are generated one at a time, without planning or lookahead. RL, on the other hand, formalizes learning from interaction, trading off exploration against short-term reward maximization and planning with the future in mind.

My objective is twofold. First, I aim to equip RL agents with the world knowledge and transferable skills that models like LLMs and VLMs have learned from internet-scale text and vision corpora. Second, I aim to teach LLMs to think and plan more explicitly, using tools such as tree search, task decomposition and code, and to learn from embodied interaction as RL agents do.

Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization
Minttu Alakuijala*, Ya Gao*, Georgy Ananov*, Samuel Kaski, Pekka Marttinen, Alexander Ilin, Harri Valpola
Under review, 2025
arXiv

Through context distillation and efficient use of corrective human feedback, we train LLM agents to internalize knowledge and skills for multiple tasks without relying on ever-expanding prompts or prior demonstrations.

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, Kai Yuan
TMLR, 2025
arXiv

We train language-conditioned robotic reward functions from actor-agnostic data, outperforming five existing methods with a novel combination of contrastive and sequential ranking objectives that together yield smooth and accurate rewards.

Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search
Nicola Dainese, Matteo Merler, Minttu Alakuijala, Pekka Marttinen
NeurIPS, 2024
arXiv

We model RL environments with code written by an LLM, propose a Monte Carlo Tree Search-guided method to improve code generation for this task, and show how to plan with the resulting code world models.

Recursive Decomposition with Dependencies for Generic Divide-and-Conquer Reasoning
Sergio Hernandez-Gutierrez, Minttu Alakuijala, Alexander Nikitin, Pekka Marttinen
NeurIPS 2024 Workshop on System 2 Reasoning at Scale
Paper

We present a novel divide-and-conquer method to solve reasoning problems with LLMs, employing recursive decomposition with dependency modeling in generic settings.

Learning Reward Functions for Robotic Manipulation by Observing Humans
Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid
ICRA, 2023
project page | arXiv

We learn dense reward functions for robotic manipulation by learning distances in state space from videos of humans alone, using self-supervised contrastive and regression objectives.

Residual Reinforcement Learning from Demonstrations
Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid
RSS 2020 Workshop on Advances & Challenges in Imitation Learning for Robotics
project page | arXiv

Starting from a small number of task demonstrations on a robot arm, we learn an initial base policy and a task-relevant, low-dimensional representation space through behavioral cloning. The policy is then autonomously improved through residual reinforcement learning, using only images, proprioceptive inputs and sparse rewards.

Discovering Actions by Jointly Clustering Video and Narration Streams Across Tasks
Minttu Alakuijala, Julien Mairal, Jean Ponce, Cordelia Schmid
CVPR 2020 Workshop on Learning from Instructional Videos
video | poster

Using only weak supervision from the timing of narration and visual information, we segment narrated tutorial videos into one of k action classes or background. We use a discriminative clustering objective together with an inconsistency penalty that encourages the timing and order of actions in the visual stream to match those of the narration stream in each video.


Design and source code from Jon Barron's website.