Interview with Xue Bin (Jason) Peng: Exploring Full-Body Control for Humanoid Robots — Toward General-Purpose Controllers
Transition from Graphics to Robotics
You’ve worked with Michiel van de Panne (one of the most influential professors in character animation), Pieter Abbeel, and Sergey Levine (pioneers in RL for robotics) — how did each of them influence your research and thinking?
Starting with Michiel. He was my advisor during my undergrad and master’s studies at the University of British Columbia (UBC). Honestly, if it weren’t for him, I probably wouldn’t be doing the work I’m doing now.
Back then, I was interested in computer graphics and wanted to try research in that area. Michiel was a professor at UBC specializing in character animation, so I reached out to him to get some research experience. That’s how it all started.
At the time, I wasn’t really thinking about robotics — I just wanted to do something related to graphics. Animation seemed exciting, but there wasn’t much machine learning involved yet. Most techniques were still based on traditional kinematics or manually designed controllers.
Then, around that time, DeepMind published their paper on Deep Q-learning for motor control. When we saw that work, we began diving into reinforcement learning and deep RL as a way to train controllers for simulated characters. That became a turning point — it drew me more and more into machine learning and its application in animation.
Later, during my PhD at Berkeley, I started working with Sergey Levine and Pieter Abbeel. That’s when I began exploring how these same techniques could be applied to control real robots, not just simulated characters.
So both phases — with Michiel and then with Sergey and Pieter — had a huge influence on me. I feel incredibly lucky to have worked with all of them. Without them, I’d probably be doing something very different today.
Your research has gone through a shift from virtual character skill learning in graphics to real-world robot training — a kind of sim-to-real transition in perspective. What do you see as the biggest difference between the two?
I think one of the things that really surprised me when I started working with real robots was how hard even the simplest things can be — things I had completely taken for granted when working in simulation.
For example, state estimation turned out to be much harder in robotics than I expected. In simulation, you have full observability — you can access everything: positions, velocities, the full internal state of the agent. Everything is perfectly known.
But once I started working with real robots, even getting something as basic as the linear velocity of a robot became a real challenge. You need a lot of additional hardware and estimation techniques, and even then the results are noisy and unreliable.
So I’d say that partial observability was one of the biggest things that caught me off guard when moving from simulation to the real world — and honestly, it’s still a major challenge today.
When working with real robots, what left the deepest impression on you?
State estimation was definitely one of the big challenges that left an impression.
Another major factor is just how non-stationary the dynamics are in the real world.
When I was working in simulation, the system dynamics were essentially static — they never changed. That made it very easy to reproduce experiments. If a controller worked once, I could run it again and again, and the behavior would be pretty consistent every time.
But with real robots, the dynamics are constantly changing. A controller might work perfectly one day, and then behave very differently later. For example, even something like the robot motors heating up after extended use can change the dynamics. So a controller that worked 10 minutes ago might no longer work the same way if you run it again later.
How do you approach the challenges of sim-to-real transfer?
So I think some of the main challenges in sim-to-real transfer include partial observability, uncertainty, and the non-stationarity of real-world robot dynamics.
The general approach we’ve taken to tackle these challenges is through domain randomization. In most of our work, we train controllers entirely in simulation, and then deploy them directly onto real robots.
To bridge the reality gap, we heavily randomize the dynamics in simulation — things like mass, friction, latency, and noise — so that the controller learns robust and adaptive strategies that can generalize to the variability found in the real world.
So at this point, domain randomization is probably our main tool for sim-to-real transfer.
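To make that concrete, here is a minimal sketch of what per-episode dynamics randomization might look like in a typical simulation training loop. The parameter ranges and the `sim.set_dynamics` hook are illustrative assumptions, not taken from any specific project:

```python
import numpy as np

# Hypothetical ranges; in practice these are tuned per robot and simulator.
RANDOMIZATION_RANGES = {
    "link_mass_scale":   (0.8, 1.2),   # scale nominal link masses
    "friction":          (0.4, 1.1),   # ground contact friction
    "motor_strength":    (0.8, 1.2),   # scale commanded torques
    "action_latency_ms": (0.0, 40.0),  # delay between command and effect
    "obs_noise_std":     (0.0, 0.05),  # additive sensor noise
}

def sample_dynamics(rng: np.random.Generator) -> dict:
    """Draw one set of dynamics parameters for the next training episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANDOMIZATION_RANGES.items()}

def reset_episode(sim, rng):
    """Apply a freshly sampled set of parameters before each rollout."""
    params = sample_dynamics(rng)
    sim.set_dynamics(params)   # assumed simulator hook for overriding dynamics
    return sim.reset(), params
```

Training the controller across many such randomized episodes is what pushes it toward strategies robust enough to survive the real robot's unmodeled variability.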
Research Area
Could you give us a systematic overview of your key research topics and projects? What are the core ideas behind these works, and what key problems do they address?
DeepMimic
Maybe I can start by talking about one of our earlier works: DeepMimic.
The core idea behind DeepMimic is actually quite simple: it combines deep reinforcement learning with motion tracking. The project was initially motivated by a goal to create simulated characters that could replicate a wide range of human motor skills for animation purposes.
Before DeepMimic, a lot of work in character animation mirrored trends in robotics. People would build controllers using methods from robotics — such as trajectory optimization, optimal control, or finite state machines.
The problem with these traditional methods was that they required a lot of manual engineering and didn’t generalize well. You’d need to design a completely different controller for every new skill. So, for example, walking would need one carefully tuned controller, and jumping would require a totally different one. Even more complex skills like flips or acrobatics demanded significant effort and handcrafted logic.
This made the whole approach not scalable. As a result, many past papers focused only on very narrow, specific behaviors.
What we wanted to do with DeepMimic was to create a general framework that could reproduce any human motion. And to do that, we used recent advances in deep reinforcement learning to train neural network controllers.
Neural networks are powerful function approximators — they can learn very flexible, expressive policies. This allowed us to move away from hand-crafted structures like finite state machines and use a unified architecture for all types of skills.
The other key component of DeepMimic is motion tracking. Once we had a neural network controller, the question became: how do we train it to perform different skills?
Traditionally, you’d have to spend a lot of time designing custom reward functions for each skill — a reward for walking, another for jumping, etc. That wasn’t scalable either.
Instead, our idea was to use a single, general reward: track a reference motion. You give the agent a target motion (from motion capture data), and train the controller to follow it frame by frame.
And we found that this simple tracking-based reward, when combined with policy gradient reinforcement learning, enabled us to learn a wide range of motor skills without manually designing rewards or controllers for each one.
So rather than crafting a controller for every skill, you just feed in different reference motions — and optimize the same tracking objective — to get a controller that learns to reproduce those skills.
That’s the basic idea of DeepMimic.
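For concreteness, here is a minimal sketch of a per-frame pose-tracking reward in the spirit of DeepMimic. The actual paper uses additional terms (end-effector and root tracking, for example) and different weights; the values and accessors here are placeholders:

```python
import numpy as np

def tracking_reward(sim_pose, ref_pose, sim_vel, ref_vel,
                    w_pose=0.7, w_vel=0.3):
    """Exponentiated tracking error between the simulated character and the
    reference motion at the current frame. Higher reward = closer to the clip."""
    pose_err = np.sum((sim_pose - ref_pose) ** 2)   # joint pose error
    vel_err  = np.sum((sim_vel  - ref_vel)  ** 2)   # joint velocity error
    r_pose = np.exp(-2.0 * pose_err)
    r_vel  = np.exp(-0.1 * vel_err)
    return w_pose * r_pose + w_vel * r_vel
```

The same reward is reused for every skill; only the reference motion fed to it changes, which is what makes the framework general.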
AMP
After DeepMimic, one of the next key projects was AMP, which stands for Adversarial Motion Priors.
DeepMimic gave us a general framework for training controllers that can imitate different skills through motion tracking — frame-by-frame tracking of reference motions. This approach is very effective when you want a controller to replicate a specific motion as closely as possible. In fact, many of the impressive humanoid demos you see today are built on top of motion tracking techniques like this.
But there's a key limitation: controllers trained with motion tracking are not very flexible. They tend to be restricted to replaying the same motion again and again. The agent can’t really deviate from the reference, or adapt to new tasks that weren’t part of the original motion.
So with AMP, we wanted to create a more flexible imitation learning objective — one that didn’t force the controller to follow a fixed reference motion frame by frame.
The key insight was to replace tracking with distribution matching. Instead of forcing the agent to copy a specific trajectory, you train the controller to produce behaviors that match the overall motion distribution found in your dataset.
We implemented this using adversarial imitation learning — similar to GANs. A discriminator is trained to distinguish between motions from the real dataset and motions generated by the simulated character. Then, the controller is trained to fool the discriminator — in other words, to generate motion that "looks like" the dataset.
This approach allows for much more flexibility:
The character doesn’t have to follow any single motion exactly.
It can mix and combine different transitions.
It can generalize to new behaviors not explicitly shown in the original data.
And it can adapt those behaviors to solve new tasks — even ones not seen in the dataset.
So overall, AMP gives us a more powerful and adaptable framework for imitation learning in motor control.
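As a compressed illustration of the adversarial piece described above, here is a sketch using PyTorch. Network sizes, the transition features, and the exact GAN loss variant are assumptions for illustration; AMP itself uses a least-squares-style discriminator objective with additional regularization:

```python
import torch
import torch.nn as nn

class MotionDiscriminator(nn.Module):
    """Scores short state transitions (s, s'); higher logits mean the
    transition looks more like the motion-capture dataset."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def style_reward(disc, s, s_next):
    """Turn the discriminator score into an RL reward: transitions that the
    discriminator judges as dataset-like receive higher reward."""
    with torch.no_grad():
        p_real = torch.sigmoid(disc(s, s_next))
        return -torch.log(1.0 - p_real + 1e-6)
```

The controller is then trained with standard RL on this style reward (optionally combined with a task reward), while the discriminator is periodically updated on fresh policy rollouts versus dataset clips.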
ASE
Regarding AMP, people who try to reproduce this work often mention encountering a common issue: the policy learns only a single-frame transition and ends up trembling in place.
Have you experienced similar behavior during your experiments? Do you think this is a limitation of the method itself? And are there any possible ways to address it?
That’s definitely an important challenge. One method we explored in our follow-up work, called ASE (Adversarial Skill Embeddings), tries to address this.
In many models, especially those using VAE-style architectures, a common issue is that latent spaces often have “holes” — regions where sampled points can decode into behaviors that don’t resemble the training data at all.
While there isn’t a perfect solution yet, ASE tries to mitigate this by adopting a GAN-like approach. In ASE, we train a decoder (i.e., the controller) to map latent codes to behaviors that look like those in the original motion data. A discriminator is used to evaluate whether the generated behavior matches the style of the real data — just like how GANs are trained.
One key advantage of this adversarial training setup is that it encourages the model to cover the latent space more uniformly — i.e., the controller learns to map every point in the latent space to something that looks like real motion. This helps prevent regions in the space that would otherwise generate completely unrealistic behaviors.
That said, it’s not a complete guarantee — it’s still possible that some latent values will produce strange or unstable actions. But we’ve found that adversarially trained models like ASE tend to be more robust than simpler baseline methods.
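As a small illustration of the latent-conditioned setup described here (shapes and sampling choices are assumptions; ASE additionally uses an encoder-based objective that encourages different latent codes to produce distinct skills):

```python
import torch

def sample_skill_latent(batch_size, latent_dim):
    """Sample skill codes uniformly on the unit hypersphere, so every direction
    in the latent space is expected to decode to plausible motion."""
    z = torch.randn(batch_size, latent_dim)
    return z / z.norm(dim=-1, keepdim=True)

def policy_input(obs, z):
    """The controller (the 'decoder') sees the robot state together with the
    skill latent; the adversarial style reward pushes behavior for every z
    toward dataset-like motion."""
    return torch.cat([obs, z], dim=-1)
```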
Reinforcement Learning vs. Imitation Learning
In your view, what are the pros and cons of imitation learning and RL in terms of motor skill learning, and which one do you think has more long-term potential?
I'm not sure if imitation learning and reinforcement learning are necessarily directly comparable, because they’re kind of orthogonal paradigms.
Imitation learning can be implemented using either supervised learning or reinforcement learning, depending on the data you have access to. Reinforcement learning can actually be one way to implement imitation learning. But on the other hand, reinforcement learning doesn’t need to involve imitation — it can work without any demonstrations.
So, I don't think they're directly comparable. But if we do want to make a comparison, let's say, between imitation learning with demonstrations and pure reinforcement learning without any demonstration data:
The advantage of pure reinforcement learning is that you don't need to rely on any demonstrations — you just train a controller to optimize some objective function, which may be hand-designed. The benefit is that you might discover novel or even more optimal behaviors that humans haven’t considered. For example, systems like AlphaGo discovered strategies that no human had thought of before. That’s part of the strength of reinforcement learning.
But the downside of pure reinforcement learning is that it often requires a lot of reward engineering or “reward hacking” to produce the behavior you actually want. You might need to design a pretty complex reward function to guide the agent properly — and that can be hard.
Now with imitation learning, it kind of flips the pros and cons. One benefit is that demonstrations make it easier to shape the agent’s behavior, since the controller is learning to mimic what’s in the dataset. That way, the agent is more likely to produce something human-like or intuitive.
But the drawbacks are: (1) Data collection is not always easy — depending on the task, it may be hard to find humans who can perform it well. (2) Imitation limits the agent’s exploration — it’s more constrained to human strategies, and might not be able to discover newer, better behaviors that humans haven’t demonstrated before.
What is your viewpoint on the potential of RL in manipulation?
I think one reason why reinforcement learning (RL) hasn't been as widely used in robotic manipulation yet — compared to areas like locomotion — is that manipulation still has relatively easier alternatives, like behavior cloning and supervised learning.
For manipulation, it’s relatively straightforward to build a teleoperation system that allows you to collect large amounts of human demonstration data. Once you have that, you can train a controller using supervised learning, which works quite well in many cases.
That’s not really feasible in locomotion, especially with legged robots, where collecting high-quality demonstrations is much harder. So for locomotion, we’ve sort of been forced to use reinforcement learning from the start, while manipulation has benefited from abundant demonstration data.
As a result, most manipulation controllers today are still trained with supervised learning and behavior cloning. But I do think that, over time, manipulation models — like other domains — will follow a similar trajectory.
You can already see a similar trend in large language models (LLMs).
Initially, we train them with large-scale supervised learning on human data. But when we want better performance on specific tasks, we often fine-tune them with reinforcement learning — like reinforcement learning from human feedback (RLHF) or task-specific RL.
I imagine the same will happen in manipulation:
First, models are trained on a large dataset collected via teleoperation using supervised learning.
Then, when you want to maximize performance on a particular manipulation task, reinforcement learning fine-tuning comes into play.
VLA as robot foundation models
An increasing number of researchers are focusing on applying large models (like VLA) to robotics. How do you see the role of large-scale architectures in robot skill learning?
I think these kinds of large multimodal models are very promising. In fact, we've already observed some very interesting emergent properties from these models — complex behaviors that arise spontaneously rather than being explicitly trained for.
However, I believe the main bottleneck at the moment lies in the inference speed of these models. Their relatively slow inference makes it difficult to apply them for real-time low-level control. It's somewhat similar to the difference between "slow thinking" and "fast thinking." I think that many VLA models will, in the future, primarily serve as high-level planners that operate at a lower frequency, focusing on strategy and planning, while actual execution will be handled by smaller, faster, low-level controllers.
Thus, while these multimodal models are highly promising, there is still a lack of efficient low-level controllers capable of translating high-level commands into precise, real-time motor actions.
I think VLA models could be particularly well-suited for slower, less dynamic tasks such as simple manipulation — for example, basic pick-and-place operations. In such cases, slower inference is acceptable. But for highly dynamic, full-body agile motor control, we still need control systems that can operate at much higher frequencies. Currently, our computational capabilities are not sufficient to fully realize such ambitions with large multimodal models.
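A schematic sketch of the "slow planner, fast controller" split described above; all component names, APIs, and rates here are hypothetical placeholders rather than a reference to any particular system:

```python
import time

PLANNER_HZ = 2       # slow, e.g. a large vision-language-action model
CONTROLLER_HZ = 500  # fast, low-level whole-body controller

def control_loop(planner, controller, robot, duration_s=10.0):
    """Query the high-level model infrequently; run the low-level policy every tick."""
    steps_per_plan = CONTROLLER_HZ // PLANNER_HZ
    goal = None
    for step in range(int(duration_s * CONTROLLER_HZ)):
        obs = robot.read_sensors()            # assumed robot API
        if step % steps_per_plan == 0:
            goal = planner.plan(obs)          # slow inference, low frequency
        action = controller.act(obs, goal)    # fast inference, every tick
        robot.apply_action(action)
        time.sleep(1.0 / CONTROLLER_HZ)
```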
In the field of locomotion, do you think industry has outpaced academia in some aspects?
Yes, in certain respects, I think the industry has indeed become better at enabling robots to exhibit highly agile, human-like behaviors.
As robot manufacturers, they possess a deep understanding of the system's underlying structure and how to optimize performance for specific robots. This level of insight is often missing in academia, where researchers typically work with off-the-shelf robots rather than building them from scratch.
Thus, when it comes to delivering impressive agile demonstrations on humanoid robots, industry does have a significant advantage over academia.
However, academia has its own strengths — particularly in developing more general-purpose control models. These models aim not just to replicate individual behaviors, but to serve as foundational control systems capable of reproducing a wide range of human skills.
From this perspective, academia might be slightly ahead in exploring general controllers. That said, the motion quality of academic models still generally lags behind the highly polished demonstrations produced by industry.
What future research areas are you interested in or find particularly promising?
Currently, I'm particularly interested in developing more generalizable control models.
Rather than building controllers for individual, specific skills, our goal is to construct general controllers capable of reproducing the full spectrum of human motor skills.
There have already been some promising developments in this direction — for example, the work from the PHC team at Carnegie Mellon University (CMU).
However, key challenges remain:
How can we guarantee that these general controllers produce natural, human-like behaviors?
Once we have a general controller, how do we efficiently adapt or reuse it for new tasks?
At present, most reusable controllers are still limited to relatively simple tasks, like basic navigation. Therefore, a major focus for future research will be how to build more powerful general models and extend their applicability to increasingly complex tasks.
What do you envision for future universal controllers? What can we learn from generative models when scaling up control systems?
I believe future universal controllers will themselves be generative models.
In the future, controllers won't just be traditional modules — they will generate motor commands, much like how generative models now produce images or text.
We will likely see generative architectures becoming the core of control systems.
The key question then becomes: How can we scalably train these generative controllers to handle large datasets?
If action labels are available, supervised learning still seems to be the most scalable approach across many domains.
If action labels are not available, reinforcement learning must be employed.
From our experience, motion-tracking-based reinforcement learning has proven to be the most scalable method for training controllers on large datasets.
As you mentioned, methods like AMP and some adversarial imitation techniques often suffer from mode collapse as the dataset scales up — the model collapses to a small subset of behaviors, losing diversity.
However, we've found that using a motion tracking objective greatly mitigates this issue, similar to the differences between supervised learning and adversarial learning (e.g., GANs).
As a result, in our recent work such as MaskedMimic, we have reverted to using tracking-based training to train our generative control models.
Do you think we need a latent space, such as using Gaussian distributions or Fourier transforms, when designing controllers?
I don't think a latent space is absolutely necessary. It's not a must-have component.
A latent space is simply one mechanism for a generative model to capture complex distributions, but it is also possible to build excellent generative models without an explicit latent space.
For example, large language models (LLMs) are extremely powerful generative models that do not rely on an explicit latent space.
I believe something similar could be done for motor control tasks as well.
That said, many current top-performing generative models do employ latent spaces.
For instance, diffusion models are prime examples, and they are currently among the most effective generative architectures.
Thus, while latent spaces can be helpful, they are not strictly necessary.
What we truly need is a highly flexible and powerful generative architecture that can capture and express complex multimodal behaviors.
Will future universal controllers take the form of a Mixture of Experts (MoE)?
In general, I'm not a big fan of the traditional Mixture of Experts approach where each expert is trained separately.
The problem is that when you train separate controllers for different skills and try to combine them later, there's no guarantee they will transition robustly between skills, and the system can become fragile and unstable, which makes it hard to achieve optimal performance.
Of course, large models like LLMs do use MoE layers, but it's important to note that these are not separate models stitched together after training; instead, MoE is integrated into the end-to-end architecture and trained jointly.
I personally believe that integrated MoE architectures — where expert specialization happens within a unified, end-to-end framework — are much more promising than post hoc stitching of separately trained experts.
What advice would you give for current humanoid robot hardware design?
Since most of my research relies heavily on sim-to-real transfer, my biggest request for hardware designers would be:
Make robots easier to simulate.
Specifically:
Ensure that motors behave as closely as possible to ideal PD controllers (Proportional-Derivative controllers);
Design the kinematic structures of robots to be easily modeled within physics simulation environments.
This approach would help minimize the sim-to-real gap.
In practice, motor dynamics are probably the biggest source of the sim-to-real gap we encounter when moving from simulation to real-world deployment.
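For reference, the "ideal PD controller" behavior mentioned above is just the standard proportional-derivative torque law; a minimal sketch, with gains and targets as placeholders:

```python
import numpy as np

def pd_torque(q, q_dot, q_target, kp, kd, q_dot_target=None):
    """Ideal PD law: torque proportional to position error plus damping on the
    velocity error. Real motors deviate from this (friction, torque limits,
    bandwidth), and that deviation is a major source of the sim-to-real gap."""
    if q_dot_target is None:
        q_dot_target = np.zeros_like(q_dot)
    return kp * (q_target - q) + kd * (q_dot_target - q_dot)
```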