Reinforcement Learning + Real-World Robot Manipulation Can Really Work
Interview with Jianlan Luo
When it comes to robotics and reinforcement learning, most people — myself included — likely think of training robot control in simulators, especially for locomotion tasks. In the field of robotic manipulation, however, the combination of simulation and reinforcement learning has yet to overcome significant real-world performance barriers due to the Sim2Real gap.
Before preparing today’s article, I believed reinforcement learning for robotics could only be trained in simulators before being deployed in real-world environments. Most papers on reinforcement learning and manipulation follow this path, with only a few demonstrating real-world deployment.
At this point, many of us may assume that reinforcement learning, often thought of as a brute-force approach, can only succeed in GPU-powered simulations. As an experienced embodied-intelligence blogger, I’m excited to say that today’s article offers an excellent introduction to how reinforcement learning can be trained directly on real robots without relying on simulation.
Today’s interview features Dr. Jianlan Luo, an accomplished researcher who has dedicated seven to eight years to real-world reinforcement learning. He is the first author of two influential recent works in this area: SERL and HIL-SERL. Currently a postdoctoral researcher at the Berkeley AI Research (BAIR) lab, Dr. Luo works with Professor Sergey Levine. Before returning to academia in 2022, he spent two years as a researcher at Google X, collaborating with Professor Stefan Schaal. He earned his Master’s and Ph.D. from UC Berkeley in 2020 and has also worked at DeepMind and Everyday Robots.
I admire Dr. Luo’s steadfast dedication to his academic path. Some people pursue less-traveled roads, advancing with resilience until others finally recognize and follow them. I hope this article offers insight into this evolving path of reinforcement learning directly applied to real-world robotics.
1. Self-introduction
I began my PhD in Mechanical Engineering at Berkeley in 2015, initially focusing on robot control, at a time when the field of robot learning was just starting to gain momentum, especially during 2015–2016. From that point on, I gradually shifted my research direction towards robot learning, collaborating mainly with scholars like Pieter Abbeel and Sergey Levine. During this period, my research focused on reinforcement learning, particularly its application to real robots.
In 2019, I joined multiple robotics teams at Google, including Google X and DeepMind, where I worked with Professor Stefan Schaal. Stefan Schaal, formerly a professor at USC and a founding director of the Max Planck Institute for Intelligent Systems, was a key figure in the field of robot learning before 2015. At that time, robot learning was largely focused on learning dynamical systems for robots, rather than incorporating vision. The European school of thought, including researchers like Jan Peters and Aude Billard, developed within his academic lineage.
In 2022, I returned to Sergey Levine’s team as a postdoctoral researcher, focusing on high sample-efficiency, high-performance real-robot reinforcement learning and manipulation, as well as applications related to large models.
2. What is Real-World + Reinforcement Learning?
In the realm of simulated reinforcement learning, successful research directions mainly focus on reinforcement learning combined with motion control (RL + locomotion). These studies typically rely on state-based models or learn a representation (latent space), followed by a Sim2Real or Real2Sim2Real alignment between the real world and the simulation environment.
However, my focus is on more complex and crucial scenarios in the real world, such as object manipulation and deformable objects. In these situations, visual inputs (such as RGB images and camera perception) are essential, and policies trained in simulation often fail to perform well in the real world due to the significant gap between the two.
Reinforcement learning for object manipulation is still a relatively niche area, though around 2017–2018 many researchers explored it. At that time, reinforcement learning faced challenges like low sample efficiency, making training difficult and causing many to abandon this direction. Nevertheless, I firmly believe in the importance of this problem, which is why I’ve persisted in this research area.
In general, my research involves using reinforcement learning algorithms for real-world training with visual input, particularly for tasks where imitation learning cannot achieve 100% success. In scenarios involving physical interactions, such as object manipulation, we can train models that achieve 100% success within one to two hours for specific tasks. Although this does not generalize to all tasks, the robustness and resilience to interference in these models are much stronger than what is achievable in simulation environments.
3. The Development of Reinforcement Learning + Robotics
Reinforcement learning (RL) has its origins in the 1980s and 1990s, with early algorithms being relatively simple. Researchers like Stefan Schaal tried to apply RL to motion control, combining methods like Dynamic Movement Primitives (DMP) and policy gradients, but these early studies did not incorporate visual input.
Around 2017 and 2018, Google began applying RL to real robots, focusing primarily on grasping tasks. However, because better solutions were available at the time (such as approaches using labeled data), this work did not progress much further.
Between 2019 and 2020, significant contributions came from the ETH team, which demonstrated how RL models could be trained in simulated environments using domain randomization. These studies showed that even if the simulation is not perfectly realistic, reasonable policies can still be learned for the real world. Their focus was primarily on motion control, and this approach of training in simulation followed by teacher-student distillation has become a mainstream solution.
In motion control tasks, model-based control methods like MPC (Model Predictive Control) have been very successful. While MPC requires extensive knowledge of dynamics and control theory, and its parameter tuning is difficult, it provides a clear understanding of what each parameter represents. On the other hand, RL in motion control often achieves good performance but struggles to improve beyond a certain point, especially for edge cases. The ETH team has recently combined MPC with RL to address this limitation.
However, RL has not made significant breakthroughs in robotic manipulation, as the main challenge lies not in the model’s uncertainty but in external uncertainties. In motion control, challenges often stem from model imperfections, but these are relatively easier to tackle. In manipulation tasks, the uncertainty lies in interactions with the environment — like objects’ positions and orientations — which requires processing continuous visual input and handling complex contact forces and rigid body interactions. The challenge of real-world manipulation is far more complex than motion control, which is why training in real-world environments is essential. Simulations can only go so far; the model learned in a simulator cannot surpass the simulator’s capabilities.
While training in a real environment is difficult, it offers the most accurate feedback. In RL there is a term, RLHF (Reinforcement Learning from Human Feedback); by analogy, in the physical world we can speak of RLPF (Reinforcement Learning with Physical Feedback). The feedback provided by the physical world is rich and valuable, and it’s crucial not to settle for approximations that limit the ability to find a globally optimal solution.
Training in real environments is challenging but necessary. It requires high sample efficiency and system-level coordination, yet it is the only viable path to achieving 100% success rates in tasks. Before our recent work (e.g., HIL-SERL), many had abandoned RL for manipulation in real environments, believing it was impractical due to long training times and complexity.
However, in January of this year, we released SERL, which can learn 100% successful policies in real-world environments in just 10 to 20 minutes. This framework has attracted attention from companies like Boston Dynamics and institutions like Stanford and MIT, which are revisiting RL training for real robots. Some European universities have also started experimenting with this method and shared their training videos with us.
4. How Does RL + Real Robots Solve the Problem of Brute-Force Search?
In simulation, algorithms like PPO are typically used. Because PPO is on-policy, the data gathered with the current policy must be discarded after each update and collection must start again. This isn’t a major issue in simulation, where sample efficiency is not a concern.
Our approach focuses on improving the reuse efficiency of data. The key to achieving sample efficiency is human demonstrations. Humans can typically provide 10 to 20 reasonable demonstrations as initial solutions, and RL will then search within this range until 100% success is achieved.
We use demonstration correction combined with online human feedback. This approach doesn’t require complex reward functions. For example, AlphaGo’s strategy, regardless of the number of steps taken, ultimately focuses on one result: win or lose. Its evaluation is binary (0 or 1), reducing bias in the reasoning process and improving effectiveness.
Designing reward functions is indeed very complex, and it’s difficult to create one that works for all situations. Often, the reward functions designed in simulation do not align well with the real-world physical environment. Our reward function is sparse: if the task is completed, the reward is 1, and if it is not, the reward is 0, regardless of whether the task is long or short.
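To make this concrete, here is a minimal Python sketch (not the released SERL/HIL-SERL code) of the two ingredients just described: a sparse binary reward, and update batches that mix a small buffer of human demonstrations with the robot’s own online experience so that collected data is reused rather than discarded. The `sparse_reward`, `ReplayBuffer`, and `sample_mixed_batch` names are illustrative placeholders.

```python
import random
from collections import deque


def sparse_reward(task_success: bool) -> float:
    """Binary reward: 1.0 only when the task is judged complete, else 0.0."""
    return 1.0 if task_success else 0.0


class ReplayBuffer:
    """Minimal FIFO buffer of (obs, action, reward, next_obs, done) tuples."""

    def __init__(self, capacity: int = 100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, n: int):
        data = list(self.storage)
        return random.sample(data, min(n, len(data)))


def sample_mixed_batch(demo_buffer: ReplayBuffer,
                       online_buffer: ReplayBuffer,
                       batch_size: int = 256):
    """Draw half of each update batch from the 10-20 human demonstrations and
    half from the robot's own online experience, so the demonstrations keep
    anchoring the search while collected data is reused instead of discarded."""
    half = batch_size // 2
    return demo_buffer.sample(half) + online_buffer.sample(batch_size - half)
```

An off-policy algorithm (for example SAC) can then perform many gradient updates per environment step on such mixed batches, which is where the sample-efficiency gain over on-policy methods like PPO comes from.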
Our system is highly scalable and performs well across multiple tasks. Through careful design choices, we integrate multiple systems and algorithms to create an efficient system. In our research, RL is not just a brute-force search, but a search within an initial range. This concept is similar to control theory, where a reference trajectory is used to design controllers, and feedback controllers are designed around that trajectory to achieve robustness.
After adding visual input, our method significantly improves sample efficiency. In some complex tasks, like precision PCB assembly or car manufacturing, our system can achieve 100% success in real-world environments within 1 to 2 hours, demonstrating its efficiency and reliability.
5. Does RL + Real Robots Mean No Use of Simulation?
Currently, we are not using simulation; we directly train complex tasks in real environments. These tasks typically involve precision operations and multiple contact points, where imitation learning often faces difficulties. However, our system can achieve 100% success within one to two hours. For example, in our research, we completed 12 tasks, including dynamic manipulation and interacting with flexible objects, such as installing dashboards in cars.
In dynamic manipulation, the key is real-time adjustments based on feedback, rather than simply executing a fixed sequence of actions. We must continuously adjust our behavior based on the environmental information captured by cameras, rather than just memorizing a particular action and executing it in an open-loop fashion. If there is any deviation from the expected outcome, we need to continuously correct it. This real-time interaction with the environment is crucial to shaping robot behavior.
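As a rough illustration of this closed-loop behavior, the following sketch re-reads the camera at every control step and lets the policy issue a fresh correction, instead of replaying a memorized action sequence open-loop. The `camera`, `robot`, and `policy` interfaces are hypothetical, not part of the actual system.

```python
import time


def run_closed_loop(policy, camera, robot, hz: float = 10.0, max_steps: int = 500):
    """Closed-loop execution: every control step re-reads the camera and lets the
    policy react to what actually happened, rather than replaying a fixed,
    open-loop action sequence."""
    period = 1.0 / hz
    for _ in range(max_steps):
        image = camera.read()        # fresh visual observation each step
        action = policy(image)       # the next command depends on the current state
        robot.apply_action(action)   # small corrective motion
        if robot.task_done():        # stop once the goal is actually reached
            break
        time.sleep(period)
```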
While we are not currently using simulation, I believe pre-training can help reduce training time. The early stages of each task typically involve exploring basic manipulation capabilities. If we can use easily accessible datasets for pre-training, this could significantly shorten the training time. For instance, through training, we have reduced the time for completing complex tasks, like installing car flywheels and belts, to just 20 minutes, although robot performance in these areas is still limited.
6. What Does the Training Process of RL + Real Robots Look Like?
This is an automotive assembly task where two robotic arms need to lift a component and install it onto a dashboard. The assembly task is very precise, involving multiple interfaces and highly accurate operations. Current robots are not capable of directly completing this task, so the training process starts from scratch. Initially, the robot engages in random exploration, performing arbitrary actions in the environment. We provide demonstrations, and during this process, humans continuously correct and guide the robot.
Over time, the robot gradually masters the basic operations. The video shows the process from start to finish without editing, so you can see how the robot goes from random exploration to successfully completing the task. After a certain amount of time (e.g., two hours), its success rate approaches 100%, with almost no errors.
These behaviors are learned through interaction with the real environment. For example, the robot learns how to grasp and place a USB connector or how to pull a belt, all through autonomous exploration. Generally, only through continuous training and feedback can the robot learn these operations. The reinforcement learning algorithm, through dynamic programming, helps the robot achieve higher performance.
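For a sense of how the human corrections enter this loop, here is a hypothetical sketch of one data-collection episode: whenever the operator takes over, the human’s action is executed and the transition is stored alongside the demonstrations; otherwise the policy acts on its own. The `env`, `get_human_action`, and buffer interfaces are assumptions for illustration, not the released code.

```python
def collect_episode(env, policy, get_human_action, online_buffer, demo_buffer):
    """One training episode with optional human take-over.

    When the operator intervenes, their action is executed and the transition
    is stored with the demonstration data; otherwise the policy's own action
    is executed and stored in the online buffer."""
    obs = env.reset()
    done = False
    while not done:
        human_action = get_human_action()  # returns None while the operator is idle
        action = human_action if human_action is not None else policy(obs)
        next_obs, reward, done = env.step(action)
        buffer = demo_buffer if human_action is not None else online_buffer
        buffer.add((obs, action, reward, next_obs, done))
        obs = next_obs
```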
7. Generalization Issue
This paper does not address generalization from one task to another. SERL and HIL-SERL primarily focus on making reinforcement learning (RL) for robotic manipulation actually work in the real world. Regarding the multi-task problem, previous works have shown that a single-task RL model can quickly generate data, and this data can be distilled into a larger model that then supports multiple tasks. In this way, the model is able to adapt better to different tasks.
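One hedged sketch of that distillation idea: rollouts generated by several already-trained single-task RL policies are pooled, and a single task-conditioned student is trained on them by supervised behavior cloning. The `train_step` callable and the data layout below are assumptions for illustration, not a specific published pipeline.

```python
import random


def distill_multitask(rl_datasets, train_step, epochs: int = 10, batch_size: int = 256):
    """Distill rollouts from several single-task RL policies into one
    task-conditioned student by behavior cloning.

    rl_datasets: dict mapping task_id -> list of (observation, action) pairs
                 generated by an already-trained single-task policy.
    train_step:  callable that performs one supervised update on a batch of
                 (task_id, observation, action) examples."""
    examples = [(task_id, obs, act)
                for task_id, pairs in rl_datasets.items()
                for obs, act in pairs]
    for _ in range(epochs):
        random.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            train_step(examples[start:start + batch_size])
```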
Our goal is to provide a reliable and usable foundational tool that helps researchers perform more complex work using this tool. Based on our tool, many subsequent research works can continue to be built upon, and researchers will have a clearer idea of how to implement them.
8. Why Persist with RL + Real Robots
One of the reasons many people give up on this type of work is that it requires a deep understanding of both robotic systems and reinforcement learning (RL) algorithms. You need to reach a high level in both areas, and in reality that is not easy. Frankly speaking, there doesn’t seem to be any shortcut; you must take it step by step, learning from mistakes and accumulating experience. Many people might feel that it’s difficult to achieve quick results and give up. I personally think this process is full of risks, and the short-term gains might not be substantial. However, in the long run, I believe that RL is certainly a significant part of future solutions, and we can already see this trend. I predict that, after the startup companies that rely heavily on imitation learning identify the pain points through trial and error, they will start using RL to optimize success rates and cycle times and to achieve greater robustness, beginning next year.
As for why I still insist on RL + real robots, I remember starting to experiment with this back in 2019 while I was doing research at Google. We used RL with real robots for a plug-insertion task and achieved a 100% success rate. We conducted 13,000 trials, all of which were successful. Moreover, our success rate even exceeded that of the best robot system integrator we had paid to hire at the time. We truly saw the potential of RL.
A lot of the work on RL for robotic manipulation is very difficult to reproduce in practice. Many people struggle to make RL truly work, especially with the latest research findings, which are difficult for others to replicate. In June of last year, Sergey and I discussed whether we could provide the community with a mature, reliable tool that would allow people to reproduce RL for real-world robotic manipulation, and that was the original intention behind the SERL paper.
9. Imitation Learning (IL) or Reinforcement Learning (RL)
Imitation learning (IL) only requires collecting some data on the robot and running experiments, which can yield quick results. That is why people are willing to pursue it. However, the issue arises when you want to push imitation learning to higher precision: simply collecting more data may not be sufficient. Ultimately, imitation learning and reinforcement learning need to be combined into a complementary relationship. For example, if you want the robot to achieve 99% accuracy, reinforcement learning is certainly an indispensable part. Rich Sutton’s “bitter lesson” tells us that, historically, learning and search are the two methods that scale indefinitely. Imitation learning helps us extract features from data, but without search or optimization (RL) it cannot overcome the limitations of the data, and thus cannot address new problems in new ways.
In North America, Sergey has consistently advocated for using reinforcement learning in real environments. The works on SERL and HIL-SERL have convincingly demonstrated why reinforcement learning is necessary. Through these works, people should be able to see the potential of reinforcement learning and understand why it is critical in the process. Over the past eight or nine years in North America, I have remained committed to this direction. My goal is simple: to ensure that my technology can be widely applied and receive genuine praise. I am not aiming to write exaggerated claims in papers but hope my work can genuinely solve problems and provide real, tangible help to people.
10. How Is RL + Real Robots in 2018 Different from Today?
The core algorithms themselves have not changed significantly; the main improvements come from the integration and optimization of the entire system. HIL-SERL’s contribution lies in providing an end-to-end system and solution, rather than just a single algorithm. In the past, much of the RL research focused on results in simulation, usually tested only in simulators and significantly disconnected from the real world. HIL-SERL is different. It is a comprehensive system, ranging from low-level controllers to high-level algorithms: the low-level control includes impedance control and force-feedback control, while the higher-level algorithms are designed and integrated with it into a unified whole. It is not just an optimization of a single algorithm but a complex integrated system in which every component works together.
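To illustrate the kind of low-level layer being referred to, here is a minimal Cartesian impedance control step, written as a generic sketch rather than the controller actually used in HIL-SERL: the commanded force behaves like a spring-damper pulling the end-effector toward a target, and is clipped so that contact stays compliant and bounded.

```python
import numpy as np


def impedance_step(x, v, x_target, stiffness=300.0, damping=30.0, force_limit=20.0):
    """One step of a Cartesian impedance law: the commanded force pulls the
    end-effector toward x_target like a spring-damper, and is clipped so that
    unexpected contact produces a bounded, compliant response instead of rigid
    position tracking. x, v, x_target are 3-vectors (m, m/s, m)."""
    force = stiffness * (np.asarray(x_target) - np.asarray(x)) - damping * np.asarray(v)
    return np.clip(force, -force_limit, force_limit)
```

In a stack like the one described, the learned policy only needs to output target poses (or small pose deltas) at a modest rate, while a layer of this kind handles contact at a much higher control frequency.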
The design philosophy of the system is to integrate multiple factors into a better whole, rather than simply optimizing a single component. For example, a carefully designed reward mechanism helps the model learn faster. The result comes not from any one factor but from the combined effect of all of them, which lifts the overall system performance. This is a point I have always emphasized.
Additionally, I believe that many people who are just starting out in robotics begin by simply collecting data and running training experiments with two robots or a few devices. However, experienced robotics experts from previous generations would agree with my point: robots are complex systems, and you cannot improve overall performance by optimizing a single component, just as you cannot make a car go faster simply by replacing the engine or the onboard computer. A robotic system needs to be properly optimized across the hardware, software, and control layers to fully realize its potential. If you only focus on one part and neglect overall integration and optimization, the final outcome will be significantly limited.
11. What Are the Limitations of HIL-SERL?
The sample efficiency of reinforcement learning (RL) depends on several factors, primarily the state space, the action space, and the task’s horizon (time span). These factors directly determine how many samples are required; the more complex the task, the more samples are needed. For example, short tasks (e.g., lasting a few seconds) are relatively easy to solve with RL, but long-horizon tasks (e.g., lasting several minutes) may require other tools or methods to assist.
For tasks that involve complex contacts, deformable objects, and high precision, RL does have advantages, but it also has limitations. For long-horizon tasks, such as combining multiple sub-tasks into a single learning problem (e.g., “open the door,” “close the curtain,” “pour water”), RL may struggle when trying to learn all seven or eight sub-tasks together. To address this, the task can be split into smaller sub-tasks, each treated individually, making RL far more efficient to train and execute, as sketched below.
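A simple way to picture that decomposition is a chain of short policies, each trained separately and handing off to the next when its own success condition is met. The following sketch assumes hypothetical `env`, per-sub-task `policy`, and `is_done` interfaces, not any released code.

```python
def run_subtask_chain(env, subtasks, max_steps_per_subtask: int = 200):
    """Execute a long-horizon task as a chain of short sub-task policies, each
    trained separately with RL, advancing only once the current sub-task's own
    success check passes.

    subtasks: list of (name, policy, is_done) triples."""
    obs = env.reset()
    for name, policy, is_done in subtasks:
        for _ in range(max_steps_per_subtask):
            obs, _, _ = env.step(policy(obs))
            if is_done(obs):
                break
        else:
            raise RuntimeError(f"sub-task '{name}' did not finish in time")
    return obs
```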
12. Regarding RL + Simulation
Many human tasks, such as playing tennis, cannot be learned simply by watching YouTube videos. A person must practice and adjust their movements based on feedback from the environment, ultimately forming muscle memory. This memory is not related to the brain’s understanding of the laws of the physical world; it is purely physical. This process is difficult to replicate through simulation. Instead of spending billions to construct a virtual world model, I believe collecting real-world data is faster, more effective, and avoids the Sim2Real gap problem.
Even autonomous driving companies, despite having the world’s largest and most scalable simulators (such as Waymo or Tesla’s simulation systems), still primarily rely on real-world data when training their final models, rather than simulation data. For Tesla today, the issue isn’t collecting more data — it’s how to use the data they’ve already accumulated. Last year, I heard that they were facing issues with their data centers being unable to store all the data. Now, there’s talk that whoever has the most storage will come out on top.
Similar lessons were learned in the development of computer vision in the 1990s. At that time, Adobe tried to solve image recognition using synthetic images, investing heavily in generating synthetic data, but ultimately these synthetic images did not succeed in supporting the practical application of visual models. It was only when real image datasets, such as ImageNet, became available that visual models made breakthrough progress. Simulation has its uses, but when it comes to building a robot data pyramid, real-world data will always take the lead.
The primary argument for using simulation now is that real robot data is difficult to obtain. Simulation can instantly generate billions of data points, but no one would dispute that real data, when we have it, is the most valuable. However, this isn’t the crux of the issue. In ten years, when we have one hundred million robots deployed in the real world, constantly sharing their physical experience, we will look back at our current dilemma and find that many of these problems no longer exist, and many of today’s viewpoints will have become irrelevant. The data we have now and the number of deployed robots are insufficient to draw definitive scientific conclusions, which is why opinions still vary so widely. For instance, the first company to deploy 1,000 humanoid robots in factories will have enough data streaming back 24/7 to generate new paradigms and scientific conclusions. By starting with problems in these semi-constrained environments and coming to understand them deeply, our methodologies will naturally extend to unconstrained environments.
13. Is HIL-SERL Open Source?
Yes, we have fully open-sourced the entire project and specifically chose the MIT license to make it accessible to as many people as possible. From the robot to the reinforcement learning (RL) code, everything is open-sourced end-to-end. As for usability, some undergraduates have been able to set it up in just a few days and even extend it from there. The original intention was never simply to publish an academic paper; publishing papers holds no practical significance for me.
We spent a lot of time redesigning and verifying the entire codebase to ensure it runs smoothly in various environments. Through repeated testing, we have made sure that once people get the tool, they can easily integrate and use it without any issues.
References:
SERL: https://serl-robot.github.io/
HIL-SERL: https://hil-serl.github.io/
Chinese-language original: 对话罗剑岚：强化学习+真机操作可以很Work (Interview with Jianlan Luo: Reinforcement Learning + Real Robot Manipulation Can Really Work)