An In-Depth Guide to Robotic Reinforcement Learning
Interview with Weinan Zhang and Yunlong Song
Part 1: Interview with Dr. Zhang Weinan
Dr. Zhang Weinan is currently a Professor, Ph.D. Supervisor, and Deputy Chair of the Department of Computer Science at Shanghai Jiao Tong University. His research interests include reinforcement learning (RL) and data science. With over 20,000 citations on Google Scholar, Dr. Zhang obtained his bachelor’s degree in the ACM class from the Department of Computer Science at Shanghai Jiao Tong University in 2011 and his Ph.D. from the Department of Computer Science at University College London in 2016.
He has published textbooks titled Hands-on Reinforcement Learning and Hands-on Machine Learning. During the interview, I found that Dr. Zhang has a deep understanding of reinforcement learning and is able to explain complex concepts in a vivid and accessible way. Below is the interview record where Dr. Zhang and I discussed the classification of RL:
Classification of Reinforcement Learning
There are many ways to classify reinforcement learning, and generally, we can categorize it into the following types:
Value-based RL vs Policy-based RL
Online RL vs Offline RL
On-Policy vs Off-Policy
Model-based RL vs Model-free RL
1. Value-based RL vs Policy-based RL
Value-based RL refers to methods where an agent interacts with the environment repeatedly, accumulates rewards, and estimates the expected cumulative reward (the value). In this approach, value functions are used to evaluate how good states and actions are, and the agent selects actions based on these evaluations. Algorithms such as Q-Learning, SARSA, and DQN are examples of value-based RL.
Value-based Reinforcement Learning is a method in RL that primarily uses a value function to guide the agent’s decision-making. The core idea is to evaluate each state or state-action pair to select the optimal action strategy. In this framework, the agent doesn’t directly optimize the policy but learns a value function to evaluate the “value” of each state or the “value” of performing a particular action in a given state. Based on these evaluations, the agent selects actions.
A value function estimates the expected cumulative reward the agent can achieve from a certain state. There are two main forms:
State-value function: Represents the expected return that the agent can obtain from a given state.
Action-value function: Represents the expected return from performing a specific action in a given state.
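In standard notation (my addition, not part of the interview), these two value functions are usually written as

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a\right],$$

where $\gamma \in [0, 1)$ is the discount factor and $\pi$ is the policy being evaluated.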
On the other hand, Policy-based RL doesn’t rely on value functions but directly optimizes the policy. Pure policy methods are less common; typical examples are policy-gradient algorithms such as REINFORCE.
Policy-based RL directly optimizes the agent’s policy by adjusting the policy parameters to maximize the cumulative reward. Unlike value-based RL, policy-based methods don’t depend on value function estimates but directly learn a parameterized policy (usually a neural network) to select optimal actions. The policy is the rule that defines the agent’s action selection in a given state. It can be deterministic or stochastic.
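To make the policy-gradient idea concrete, here is a minimal REINFORCE-style sketch in PyTorch (my illustration, not code from the interview), using a Gymnasium CartPole environment as a stand-in task; the network size and hyperparameters are arbitrary.

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Small stochastic policy network (illustrative sizes).
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns G_t, computed backwards through the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # REINFORCE objective: maximize E[log pi(a_t|s_t) * G_t], so minimize the negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The essential point is that the update shifts probability mass toward actions that led to higher returns, without estimating any value function.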
Additionally, there are algorithms that combine value functions and policies, using both for learning and optimization. Common examples include A3C (Asynchronous Advantage Actor-Critic), A2C, PPO (Proximal Policy Optimization), TRPO (Trust Region Policy Optimization), and DDPG (Deep Deterministic Policy Gradient), all of which belong to the Actor-Critic framework. These methods perform exceptionally well with high-dimensional state and action spaces and are well suited to reinforcement learning tasks that require high flexibility and continuous action spaces.
2. Online RL vs Offline RL vs On-policy vs Off-policy
The main difference between Online RL and Offline RL lies in whether the agent interacts with the environment in real time during training. Online RL relies on real-time interaction with the environment, whereas Offline RL relies on pre-collected data.
On-policy and Off-policy are both subsets of Online RL. The main difference between them is whether the data used for learning comes from the same policy currently being followed. On-policy learning uses only data generated by the current policy to update the policy, while Off-policy learning can use data generated by other policies.
Online RL, On-policy, and Off-policy all fall under the category of online learning, where On-policy is a form of pure online learning that depends on interaction with the environment. Offline RL, on the other hand, is a different concept that emerged around 2018-2020: the agent doesn’t interact with the environment at all but learns from pre-collected data, which removes the need for real-time interaction and makes training comparatively fast. Both On-policy and Off-policy still require interaction with the environment during learning; the key difference is that On-policy does not reuse data from a buffer, while Off-policy stores data in a buffer and reuses it.
To make it easier to understand, I often use this analogy: On-policy is like washing your hands with tap water. The water flows from the tap, washes your hands, and then goes down the drain without being used again. After passing through your hands, the water gets dirty and is discarded. If the water flows too fast, a lot of data could be wasted.
In contrast, Off-policy is more like using a basin to collect water. You don’t directly use tap water; instead, you catch the water in the basin, and the water can be reused. When the basin fills up, the oldest water is discarded. In Off-policy methods, data is collected into a buffer pool and can be reused at different times.
Offline RL goes even further. The water doesn’t come from the tap at all. It’s like a basin that has already been filled with water, and you continuously wash your hands in the basin. There’s no reliance on real-time interaction with the environment; instead, you collect the data in advance and repeatedly use this data from the buffer for training.
Which setup you use depends on whether the environment or task allows On-policy or Off-policy learning, or whether only Offline RL is possible. For example, when playing a game, the game engine can continuously generate data, so you can use On-policy learning because there is no limit on data collection. But if the data flows so fast that much of it would be wasted, you may need a basin to collect it, which means choosing an Off-policy online learning approach. On the other hand, if data collection is very costly, such as in a factory where each data point is precious, you may choose Offline RL and learn from the data you already have.
To summarize:
If data is scarce and difficult to obtain, use Offline RL.
If data is relatively easy to obtain but still valuable, use Off-policy with a data buffer.
If data can be generated endlessly, use On-policy learning.
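To make the basin analogy concrete, here is a minimal FIFO replay buffer sketch (my illustration; the capacity and batch size are arbitrary) of the kind off-policy methods use to store and reuse transitions:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer: when full, the oldest transitions are dropped,
    much like the oldest water overflowing from the basin."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Off-policy methods reuse stored transitions many times.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```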
3. Model-based RL vs Model-free RL
Initially, many researchers chose model-based methods because, once the environment model is established, the agent can operate freely within it. After building the environment model, dynamic programming can be applied, or unlimited data can be sampled in the simulated environment without interfering with the real world. Model-based methods suit situations where the environment can be completely modeled and simulated. Many classic reinforcement learning problems, such as maze navigation and the game of Go, are tackled with model-based methods. In reinforcement learning, the term "model" usually refers to the environment model, i.e., the dynamic model of the environment. For example, Model Predictive Control (MPC) uses the dynamic model of the environment for planning. When we talk about models in RL, we typically mean the environment model, not a policy model or value-function model.
In reinforcement learning, the environment model usually includes two parts: 1) the state transition function and 2) the reward function. More commonly, modeling is done based on the dynamic environment state transition function. Once the model is built, it’s like the environment has been provided to you, and you can use the environment model to perform dynamic planning. The core idea of this method is to make decisions and plans by learning the model.
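In standard MDP notation (my addition), these two components are the state-transition function and the reward function:

$$P(s_{t+1} \mid s_t, a_t), \qquad R(s_t, a_t).$$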
However, the problem with this approach is that the environment models are often inaccurate, leading to less-than-ideal results. This is why Rich Sutton, David Silver, and others proposed model-free methods, which rely more on data and usually do not require pre-building complex environment models but instead learn directly from the data. A typical example is Q-learning, whose core idea is to learn the value function (Q-function), which tells the agent the optimal action to take in each state. Q-learning doesn’t need to know the environment’s dynamic model; as long as it can collect data in the form of state-action-reward triples, it can learn.
This process can be implemented through specific learning algorithms, such as temporal difference learning methods. There are two main types of temporal difference learning methods: one is on-policy, and the other is off-policy.
4. Q-Learning
Q-learning is an "off-policy" method, meaning its training does not have to rely on data generated by the current policy (unlike on-policy methods). For example, in Q-learning the agent can learn from data generated by other policies or by previously trained policies. This gives Q-learning greater flexibility, allowing it to use data from different policies to accelerate learning rather than being restricted to data produced by the current policy. In contrast, on-policy methods require the agent to learn only from data generated by the current policy.
Another advantage of Q-learning is that it avoids the "importance sampling correction" needed by many other off-policy methods. Those methods must reweight data generated by other policies, whereas Q-learning's update bootstraps from the maximum over next actions, so it can use data from other policies without any correction.
In practical applications, Q-learning has a significant advantage in that it is data-driven, and does not require building a complex environment model. The agent only needs to collect state, action, and reward information to directly learn the Q-function and derive the optimal policy. This means Q-learning does not need to have a clear understanding of the environment's state transition probabilities or reward distribution.
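As a hedged illustration of what "only needing state, action, and reward information" means in code, here is a minimal tabular Q-learning sketch (the sizes and hyperparameters are placeholders):

```python
import numpy as np

# Tabular Q-learning sketch (illustrative sizes and hyperparameters).
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def select_action(state):
    # Epsilon-greedy behavior policy; the learned target policy is greedy,
    # which is what makes Q-learning off-policy.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state, done):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
```

The target uses the greedy maximum over next actions regardless of which policy generated the transition, which is exactly why no correction of the data is needed.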
Q-learning was first proposed by Chris Watkins in 1989. Its main advantage is that it does not require an environment model and learns the optimal policy through interaction with the environment. From the 1990s to the early 2000s, scholars like Richard Sutton helped popularize the algorithm. In 2013, researchers at DeepMind, including David Silver, introduced the Deep Q-Network (DQN), which became one of the core algorithms of deep reinforcement learning (DRL). DQN combined Q-learning with deep neural networks, allowing Q-learning to move beyond simple state spaces and handle high-dimensional inputs such as images.
At that time, David Silver was working at UCL (University College London), and he constantly promoted his model-free reinforcement learning method to companies in London. However, it initially received little attention. Many people didn’t understand his method, thinking that learning directly from data without modeling the environment was infeasible. People were used to modeling the environment, especially in the era of limited data, and were more inclined to describe the environment through models and use dynamic programming to solve problems.
However, with the advent of the big data era, David Silver’s method began to seem more feasible, as the increased data volume made data-driven learning more reliable. Although he didn’t gain widespread recognition immediately after returning to UCL, he continued to insist on his approach and pushed forward the development of model-free methods.
Later, with the rise of deep learning, these data-driven deep reinforcement learning algorithms made significant breakthroughs. The powerful expressive capability of deep neural networks allowed Q-learning and similar algorithms to handle more complex tasks. When the Q-function is represented by a deep neural network (such as a convolutional network like AlexNet), it can naturally learn from large amounts of data, enabling finer-grained optimization.
This method is fundamentally different from traditional algorithms: earlier algorithms relied on manual modeling and limited data, while with deep learning, Q-learning can automatically learn and improve from vast amounts of data, leading to successful applications in complex tasks like Atari games. This marked a significant milestone for deep reinforcement learning, breaking past limitations.
In summary, model-based methods and model-free methods each have their advantages and disadvantages. Model-based methods can foresee future states through simulation and planning, while model-free methods rely more on data accumulation and learning. With breakthroughs in deep learning, model-free methods like Q-learning have gradually become mainstream, particularly when dealing with complex, high-dimensional environments where they show significant advantages.
5. Sim2Real RL
In the past, robots didn't face the Sim2Real problem because they weren't expected to learn; it is only now, as robots learn their controllers, that the Sim2Real issue has emerged. Using reinforcement learning to train robots for locomotion or manipulation simply wasn't effective in the past, so almost no one attempted it. The main reasons are twofold:
Reinforcement learning requires a large amount of data. Due to its trial-and-error nature, reinforcement learning heavily depends on the agent’s interactions with the environment to gather meaningful data. This is different from supervised or unsupervised learning: supervised learning can directly utilize pre-labeled data, and unsupervised learning can extract patterns from large amounts of unlabeled data. However, reinforcement learning must acquire high-value data through interaction to learn.
The deployment of robot policies must account for the nature of the real environment. Even if a policy appears effective in simulation, if the amount of interaction data is insufficient the policy cannot transfer successfully to a real robot, especially in locomotion scenarios like walking or jumping. Without sufficient training, the robot may frequently damage equipment, and it is unacceptable to break 10 or even 100 robots through falls during real-world training. Therefore, these tasks can generally only be trained in a simulated environment.
However, simulators have their own limitations. They run on preset equations and cannot perfectly reproduce the real environment, especially when it comes to contact (e.g., robot legs touching the ground) or collisions, where simulation accuracy is often limited. The gap between simulation and reality is called the reality gap.
To make reinforcement learning policies trained in simulation transfer efficiently across the reality gap and work in the real world, Sim2Real techniques were developed, with two main solutions. The first is Domain Randomization, which randomly perturbs environmental parameters along many dimensions during training to cover as large a range of variation as possible. If a policy remains effective across these fluctuations, it will also behave more robustly in the real environment. Currently, most robots (such as quadruped robots and industrial robots) use this approach. The second solution is Domain Adaptation (or Policy Adaptation), which incorporates real-world data to help the policy adapt to the real environment. Specifically, most of the training (about 95%) still happens in simulation, with roughly 5% of the training data coming from the real environment to help the policy adapt to real-world conditions. However, collecting data in the real world still faces many challenges, such as high noise levels and data-quality issues, which could cause the policy to fail completely.
To address these challenges, some domain adaptation methods adjust the simulator's parameters after collecting real-world data, making the simulated environment closer to the real one. While this doesn’t completely solve all the problems, it is relatively reliable and widely used.
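A rough sketch of what domain randomization often looks like in practice (the parameter names and ranges below are illustrative assumptions, not tied to any particular simulator API): physical parameters are resampled at the start of each training episode so the policy must remain effective across the whole range of variation.

```python
import numpy as np

# Illustrative domain-randomization ranges (values are placeholders).
RANDOMIZATION_RANGES = {
    "ground_friction": (0.4, 1.2),
    "body_mass_scale": (0.8, 1.2),
    "motor_strength_scale": (0.85, 1.15),
    "sensor_noise_std": (0.0, 0.03),
    "control_latency_s": (0.0, 0.02),
}

def sample_randomized_params(rng: np.random.Generator) -> dict:
    """Resample simulator parameters at the start of each episode."""
    return {name: rng.uniform(low, high)
            for name, (low, high) in RANDOMIZATION_RANGES.items()}

# Usage with a hypothetical `sim` object:
# params = sample_randomized_params(np.random.default_rng())
# sim.set_physics_params(params)
```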
6. Teacher-Student Learning
When performing domain randomization, a policy needs to work in many different environments. For example, imagine a "blind" robot dog: it doesn't know the exact terrain in front of it and can only "feel" its way forward. The current training method for this blind dog is exactly that, feeling its way: to the robot, the terrain is uncertain, because at any given moment it doesn't know the specific terrain or what will happen after it takes a step. The local observations it gets come mostly from sensors on its four legs, and this information alone is not sufficient.
But if the information you have is "cheating" information, meaning that in the simulated environment you can see the terrain in all the surrounding grid cells, then the environment becomes a white box for you, and you can quickly train a policy. We call this "cheating" information privileged information. Why is it called that? Because in the simulator you can see everything, essentially holding a full map, so it is easy to learn how to move.
In this case, you first train a teacher policy that has access to global information (the full map) and knows how to move. Then, you need to distill this teacher policy into a student policy that can only use local observations. In other words, in real life, the student doesn't know why it should move its leg at a certain time, but because the teacher has taught it when and how to move, it can perform the correct action. The teacher knows when to move its leg because it can see that there’s a rock ahead and needs to take a step back—this is the basic idea.
Generally, the teacher learns first, and after the teacher has learned, it can use its policy (which has global information) to guide the student. The teacher’s policy tells the student, "Move this way at this time," "Move like this at that time," using a regression approach to teach the student.
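A minimal sketch of the distillation step described here, assuming the teacher's actions have already been computed on the same states (the network sizes and observation dimension are placeholders): the student regresses onto the teacher's actions using only its local observations.

```python
import torch
import torch.nn as nn

# Illustrative student network; real observation/action sizes depend on the robot.
student = nn.Sequential(nn.Linear(48, 256), nn.ELU(), nn.Linear(256, 12))
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

def distill_step(local_obs, teacher_actions):
    """One regression step: the student imitates the teacher's action on the
    same states, using only its local (non-privileged) observation."""
    student_actions = student(local_obs)
    loss = nn.functional.mse_loss(student_actions, teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```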
7. RL + Manipulation
If your robot task is simply a path planning task—for example, first doing path planning and then executing the path—then it actually doesn't need RL at all. Colleagues in computer vision (CV) often first understand the environment through vision, then perform path planning, followed by execution, and that’s the end of it. In this case, there’s no need for any additional complex manipulation.
A key point here is: when a robot interacts with the environment, does it need high-frequency, real-time dynamic control? If not, then path planning and execution are enough.
For example, consider the task of "folding clothes". If you follow a pure path-planning approach, then as the two robotic arms move, the clothes naturally drape over them, and gravity does the folding. The speed of the motion has little to do with the folding result: whether you move fast or slow, the clothes still hang underneath. This is "static execution".
But if you need to toss the clothes up or perform a dynamic action (dynamic manipulation), then RL is necessary. For example, in quadruped robots walking, the process is dynamic; you can’t just do path planning for each leg and then execute it, because the legs move very quickly. This kind of dynamicity requires a different approach entirely, which is why locomotion (walking, running, etc.) was researched and developed early on—it requires dynamic control.
Let’s look at the concept of manipulation. If it’s just a simple “grip,” there’s not much to focus on, and the change in speed won’t be very fast. In such a case, you wouldn’t care about the process itself; you just need to set an instruction like 0 (open) and 1 (grip). This doesn’t really require RL. Many current manipulation tasks are simple 0 or 1 actions—like a clamp that closes with a "snap"—this represents early-stage scenarios and doesn't need complex strategies.
However, in the future, with the increasing use of dexterous hands, those agile movements will require RL to finely control the movements. This is because performing truly delicate and flexible manipulations requires high-frequency dynamic control, far beyond the simple 0/1 actions.
Part 2: Interview with Dr. Song Yunlong
Dr. Song Yunlong is a Ph.D. candidate in the Robotics and Perception Group in the Department of Informatics at the University of Zurich and a member of the Institute of Neuroinformatics, jointly run by the University of Zurich and ETH Zurich, under the supervision of Professor Davide Scaramuzza. During his Ph.D. studies, he collaborated with Professor Sangbae Kim's Biomimetic Robotics Lab at MIT. Before pursuing his Ph.D., he earned his Master's degree at TU Darmstadt (Technical University of Darmstadt) under the supervision of Professor Jan Peters.
(Dr. Song Yunlong's academic mentors are highly regarded in the field. Davide Scaramuzza is a leading figure in RL and drones, Sangbae Kim is a renowned professor at MIT known for the MIT Cheetah, and Jan Peters, along with his advisor Stefan Schaal, represents one of the most important branches of robotic learning in Europe. Jan Peters is also the corresponding author of the famous paper Reinforcement Learning in Robotics: A Survey.)
Before joining the University of Zurich, Dr. Song focused on theoretical research and later explored applications of reinforcement learning in drone control. During his Ph.D. he began studying traditional control methods, especially optimization-based control, found the direction very interesting, and studied it systematically. His Ph.D. research centered on optimization-based control and reinforcement learning, with progress in both theory and application. He developed the Flightmare simulator and designed the first reinforcement learning policy that pushed ultra-agile drones to peak performance in the physical world.
Before diving into Dr. Song Yunlong’s discussion of reinforcement learning in robotics, let’s watch a clip of his work in using reinforcement learning for drone racing.
8. Classification of RL
From a theoretical perspective, there isn’t much truly new content in reinforcement learning; it’s more like “old wine in a new bottle.” Before the advent of deep learning, traditional reinforcement learning had already explored many of the commonly used classification methods. The specific classification depends on the perspective you take.
The first classification method is from the highest level: deep reinforcement learning vs non-deep reinforcement learning. The difference between these two lies in whether deep learning techniques are used.
The second classification can be based on value-based vs policy-based methods. For example, AlphaGo represents a typical value-based deep reinforcement learning approach. In the field of robotics, policy-based deep reinforcement learning is more commonly used. Of course, there are also hybrid methods that combine both value and policy.
The third classification is related to on-policy vs off-policy. I’m not sure if what you mean by online vs offline refers to on-policy vs off-policy, but this is indeed another common classification method. In reinforcement learning, the typical “online” concept refers to real-time interaction with the environment and updating parameters. However, in practical applications, this fully online approach is rare. More commonly, the distinction is made between on-policy and off-policy.
On-policy means the data used to update the policy comes from the current policy that is being learned.
Off-policy means the data used to update the policy comes from other policies or sources, not the current policy.
Both have their pros and cons. On-policy methods typically require more samples, as they can only use data collected by the current policy. Off-policy methods can reuse historical data or data collected from other policies, making them more data-efficient and requiring fewer samples.
In the value-based vs policy-based classification mentioned earlier, most value-based methods (like DQN) tend to be off-policy, while most policy-based methods (like PPO) are usually on-policy. This is the third classification method: categorizing methods based on the source of sampling data—on-policy vs off-policy.
The fourth classification distinguishes between model-based and model-free methods. Model-free methods do not explicitly learn an environment model; they directly optimize the policy through interaction with the environment. Model-based methods first learn (or construct) the environment model, and then use this model to collect data or directly optimize the policy.
Additionally, another classification can be made between Reinforcement Learning (RL) and Inverse Reinforcement Learning (IRL). IRL mainly focuses on learning a reasonable reward function from data to solve the problem of designing a reward function, which is often difficult in RL.
In the field of robotics, we usually use PPO, an on-policy algorithm, because robot control signals are mostly in continuous space, whereas discrete action spaces (such as in board games) often use value-based methods. Policy-based methods are more suitable for continuous control scenarios, and therefore, they have gradually become mainstream.
Around 2019, with improvements in hardware (such as GPUs) and the emergence of open-source tools (such as OpenAI Baselines and the community-maintained Stable Baselines), applying PPO became much more convenient. As long as there is enough data, good results can be achieved.
As a result, more and more robotics research uses PPO and similar policy-based methods. Before this, research groups often decided which algorithm to use based on their own conditions or existing code. Now, PPO has almost become the "standard" algorithm for robot reinforcement learning. By contrast, off-policy methods such as SAC (Soft Actor-Critic) require less data in theory but are harder to use effectively in practice. Although GPUs can generate data at scale, off-policy methods cannot always consume that much data efficiently, which slows their training. In other words, although off-policy algorithms need fewer samples, their training periods are usually longer, so they are less widely adopted in practice.
9. Value-based RL vs Policy-based RL
My understanding is that the distinction mainly depends on how we optimize the objective function. In value-based methods, we typically start from the Bellman equation and ultimately aim to minimize the Bellman error, or TD error. By minimizing this quantity, we perform value-based reinforcement learning.
In contrast, policy-based methods primarily optimize a reward function through policy gradient, with the goal of maximizing this reward. In other words, value-based methods focus on minimizing some error (e.g., TD Error), while policy-based methods directly maximize cumulative rewards. From this perspective, the two methods are essentially optimization processes for different objective functions.
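Written out in standard notation (my addition, not Dr. Song's wording), the two objectives look like this: value-based methods minimize a squared TD error, while policy-based methods ascend the policy gradient of the expected return,

$$\mathcal{L}(\theta) = \mathbb{E}\!\left[\big(r_t + \gamma \max_{a'} Q_{\theta}(s_{t+1}, a') - Q_{\theta}(s_t, a_t)\big)^{2}\right], \qquad \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}_t\right],$$

where $\hat{A}_t$ is an estimate of the advantage (or simply the return, as in REINFORCE).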
It’s not entirely accurate to say that value-based methods are suited for discrete control and policy-based methods for continuous control. Take SAC (Soft Actor-Critic) as an example, which also works well for continuous control, and policy gradient can also be applied in discrete action spaces.
Originally, policy-based methods were more often used in continuous action domains, while value-based methods were more common in discrete action domains. However, with the advent of neural networks, we can use approximation functions to solve different types of problems, so the boundary between these two methods has become increasingly blurred.
10. On-policy vs off-policy: Are they branches of policy-based methods?
No. In reinforcement learning, on-policy and off-policy are broader categorization concepts and are not strictly tied to value-based or policy-based methods. Some value-based methods, such as SARSA, are on-policy.
On-policy refers to using data from the current policy's sampling process to update the policy. In other words, under on-policy, once the collected data is used, it is usually discarded and not reused, so these methods generally require more data.
Off-policy, on the other hand, allows the use of data collected through any means to update the current policy. This data could come from human demonstrations, previous versions of the Q-function, or other policy samplings. A defining feature of off-policy algorithms is the use of a Replay Buffer, from which data can be repeatedly sampled to update the policy, making data utilization more efficient and training faster.
Additionally, off-policy algorithms are indeed common in value-based methods, but it cannot be simply said that “Value-based means off-policy” or “Policy-based means on-policy.” These concepts are different dimensions within reinforcement learning.
11. Model-free RL vs Model-based RL
The distinction between model-free and model-based is quite intuitive: whether or not the algorithm explicitly uses an environment model during optimization. If a model is used, it's model-based; if not, it’s model-free.
However, the question of ‘using a model’ can be more complex. For instance, many model-free methods might use simulators to gather data, but as long as the simulator isn’t directly used for policy gradient calculation or policy optimization, it still remains model-free.
For example, the MPC (Model Predictive Control) algorithm uses an explicit model to calculate gradients or make predictions, which are directly used for optimization, thus it’s a clear example of a model-based method.
On the other hand, there are subtle differences among model-based RL methods and MPC, mainly in the source and use of the model: in MPC, the model is typically assumed known or derived from first principles, whereas in some model-based RL methods the model is learned from data (e.g., using neural networks to approximate the environment dynamics). Different model-based approaches may also employ different optimization strategies, such as online receding-horizon optimization as in MPC, or gradient-based and gradient-free search methods. All of these differences come down to how the environment model is used in the policy optimization process.
Some model-based methods don’t just learn an environment dynamics model, they may also learn a reward model and a value function. For example, in papers like Dreamer or PlaNet, they learn the dynamics, reward model, and value function first, and then perform online optimization based on these models.
However, learning the model itself doesn’t directly tell us what action to take, so a method is still needed to determine the optimal action. The usual approach is to use something like online MPC, where a series of candidate actions is sampled or predicted on the learned model, their long-term rewards are evaluated, and the optimal action is selected. This represents planning and decision-making on top of the model, which is common today.
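A hedged sketch of this "plan on the learned model" step, in the spirit of random-shooting MPC (the callables `dynamics_model` and `reward_model` are assumptions standing in for models learned elsewhere):

```python
import numpy as np

def plan_with_learned_model(state, dynamics_model, reward_model,
                            horizon=15, n_candidates=1000, action_dim=4,
                            rng=None):
    """Random-shooting planner on a learned model (illustrative sketch).
    dynamics_model(state, action) -> next_state
    reward_model(state, action)   -> scalar reward
    """
    rng = rng or np.random.default_rng()
    # Sample candidate action sequences uniformly in [-1, 1].
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)

    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            returns[i] += reward_model(s, a)
            s = dynamics_model(s, a)

    # Execute only the first action of the best sequence, then replan (MPC-style).
    best = int(np.argmax(returns))
    return candidates[best, 0]
```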
If we only learn a value function without learning a dynamics model, it’s more like a regression task, and it’s not directly related to whether the method is model-based or model-free. The real reinforcement learning component comes into play when performing online optimization and finding the optimal action, as RL methods are needed to update the policy parameters and determine the specific action.
In many current papers, authors might not strictly differentiate these concepts (model-based, model-free, RL optimization, etc.). For example, some works might claim to be ‘model-based’ but are actually model-free, or they might not be directly related to reinforcement learning at all. Misuse of these terms is common, so it’s important to pay attention to the details when reading papers to avoid being misled.
In the current field of locomotion (robot motion), at least 80% of successful results come from model-free methods. Model-based methods tend to perform less well because learning the environment model (dynamics model) itself is very difficult. Even once the model is obtained, further optimization based on it is not easy either.
Some model-based algorithms also learn a reward function and value function, each of which introduces some error, and the errors can add up, making the performance less stable compared to directly using model-free methods. While model-based methods theoretically require less data to learn, the optimization process is more challenging, especially when it comes to how to use the learned model to compute the optimal action.
Currently, the "ideal vision" for model-based RL is that if we can learn an environment model from a small amount of data, we can:
Reduce the sim-to-real gap (the difference between the real world and simulation), because the model is trained on real data rather than a hand-written simulator.
Reduce sample requirements: once the model is learned, only a small amount of new data may be needed to optimize the policy.
Enable reuse: the same learned model can be applied not only to the current task but also reused when the reward function or target task changes, greatly improving efficiency.
However, in practice, these advantages are often not fully realized. Many common methods still use manually written simulators to generate data, and these simulators inevitably differ from the real environment. Additionally, deploying strategies learned in simulated environments to the real world also faces many challenges. Effectively bridging the gap between simulation and reality remains an unsolved problem.
12. Application differences of RL in drones, locomotion, and manipulation
Personally, I have not worked on actual projects in the robot manipulation domain, so I can only provide some subjective opinions. Compared to drones or robot locomotion, manipulation tasks have fundamental differences:
First, the objects of focus are different. In the drone or locomotion domain, the robot mainly focuses on its own state, such as posture and joint angles, with environmental factors (like terrain) being important but relatively less emphasized. In manipulation, the environment plays a crucial role; the robot must not only understand its own position and posture but also have a deep understanding of the various objects in the environment.
Second, the diversity of the environment. In manipulation, adding or changing an object (e.g., cups or tools with different shapes, weights, and materials) can significantly affect the grasping strategy. In other words, to achieve general-purpose grasping, the robot needs to have a "general understanding" of the environment in terms of object geometry, weight, physical properties, etc. Some believe that solving general-purpose manipulation could address a core challenge of AGI, as it requires a strong perceptual and cognitive ability to handle various unknown objects in the environment.
Next, the core challenge lies in perception. From making the robot "see" and "understand" the shape, weight, and material of objects to planning grasping strategies based on this information, the difficulty in perception and cognition far exceeds simple control problems. For example, if we already know the target object's weight and shape, the control step of "grasping the object and placing it in a designated location" is not very complicated and may not even require reinforcement learning or complex neural networks.
Finally, the expectations for large models. Currently, many expectations are that large models can provide a higher level of environmental understanding, including reasoning about object characteristics and scene cognition. Once this environmental understanding is effectively solved, the execution of grasping actions itself becomes relatively simple.
In summary, for robot manipulation, the real challenge is not entirely in control but in accurately perceiving and modeling the environment, especially unknown objects. Because of this, training in simulation environments alone cannot fully address the challenges in the real world, where there are many unpredictable environmental changes and object differences. This is also a major reason why general-purpose grasping tasks have not been fully solved.
In robot manipulation, apart from perceiving and understanding the environment (Perception & Understanding), another challenge is long-horizon task planning. For example, when making food in a kitchen or tying shoes, it requires the decomposition of multiple steps and the arrangement of their order.
Such complex tasks are hard to express directly with simple mathematical formulas and design reward functions, like flying a drone from point A to point B. Turning an abstract process like "how to cook" or "how to tie shoes" into clear, quantifiable goals and substeps is itself very difficult.
Therefore, compared to traditional short-term, quantifiable tasks (like point navigation), the long-term planning and task decomposition requirements in manipulation tasks make problem modeling and reward design quite tricky. This also explains why, in real-world manipulation tasks, more breakthroughs in perception, understanding, planning, and other areas are still needed.
13. Perception in Drones and Manipulation Tasks
The perception requirements and complexities in tasks like drone racing, robotic manipulation, and autonomous driving are not of the same magnitude. For example, in drone racing, the main perception tasks involve sensing the current pose and environmental obstacles. In contrast, in manipulation tasks, the robot needs to not only sense the position and shape of objects in space but also understand their physical properties, such as material and weight, because interaction with the object is essential during the grasping process.
In autonomous driving, perception primarily involves recognizing roads, pedestrians, other vehicles, and traffic signs, most of which belong to "known" categories. Even the occasional unknown object can usually be treated simply as an obstacle. In robotic manipulation, however, every new object requires deeper understanding and adaptation, which makes perception much harder. So while all these tasks involve perception, manipulation poses significantly higher challenges because it requires not just identifying what an object is, but also understanding its physical properties and how it can be interacted with. This is why perception for manipulation is generally considered more difficult than for navigation, drone racing, or autonomous driving.
14. RL vs. MPC for Drone Control
The paper suggests that reinforcement learning (RL) outperforms traditional Model Predictive Control (MPC) in drone racing (or similar tasks). However, in reality, MPC is still sufficient for most tasks that can be easily planned with explicit trajectories. It is only when tasks cannot be easily broken down into explicit trajectory tracking (e.g., racing tasks with no set trajectory to follow) that the advantages of RL become more apparent.
For example, in locomotion tasks, before RL, the common approach was to pre-plan the trajectory and have the robot follow it. However, RL allows the robot to make decisions in real time, adjusting its actions without strictly following a predefined trajectory. For instance, if a robot dog steps on an irregular rock, it can autonomously adjust its body posture or leg movements instead of being constrained to a fixed movement path. This makes RL more adaptable and autonomous in unknown or complex environments.
On the other hand, for complex tasks like cooking, where simple mathematical formulas or differentiable cost functions cannot directly describe the task, RL can optimize strategies based on flexible reward signals. Traditional MPC, however, relies on a differentiable and analyzable objective function, making it ineffective in such cases.
In summary, RL excels in long-horizon, hard-to-plan tasks, while MPC remains efficient and stable for short-horizon, easily planned tasks.
In drone racing and locomotion, finding the right application for RL is often easier since these tasks can be modeled relatively simply, like "from point A to point B." However, drone racing presents a greater challenge because, besides moving from start to finish, it requires maximizing speed, avoiding obstacles (like passing through specific gates), and adhering to a gate-passing order. Nevertheless, these objectives can still be translated into a well-defined reward function, making them relatively easy to model within an RL framework.
15. Current Trends in Drone Development
The future goals of drones are similar to those of autonomous driving: after receiving a high-level task (e.g., delivery, inspection, firefighting), drones will be able to execute the task autonomously. Technologically, we have not yet reached full autonomy (like Level 5 of autonomous driving), but companies like Skydio in the US are already achieving high levels of autonomous flight with considerable independence.
While the drone industry has a clear advantage in productization and market share, the core algorithms behind autonomous flight, particularly those based on vision (camera), still face significant technical challenges and development potential. In contrast, lidar-based perception is relatively easier in terms of accuracy and algorithm design, but breakthroughs in vision-based solutions could open up more possibilities for drone applications.
16. Differentiable Simulation and Locomotion
The choice to work on locomotion tasks stemmed from my time at MIT, where I was collaborating with teams researching robots like the MIT Mini Cheetah. Conventional RL for locomotion was already relatively mature, and I wanted to explore the problem from multiple angles with different algorithms.
During this process, we proposed a novel idea: if we had a differentiable model, we could directly obtain gradients from this model, potentially increasing the efficiency of policy training by ten to even a hundred times. However, since the cost of data collection for locomotion tasks is not particularly high (reducing the time from one to two minutes to ten seconds may not make a significant difference), people were not particularly focused on this "extra efficiency."
However, I still believe this direction is worth pursuing, especially when the input is high-dimensional visual information, which RL alone often struggles to handle. This is why approaches like "Learning by Cheating" are used, where a teacher policy with more complete (privileged) information is trained first and then imitated; pure end-to-end RL from visual input remains challenging, and simulation capability is still a bottleneck.
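As a rough illustration of the differentiable-simulation idea (a toy one-dimensional example of my own, not the actual setup used in this research): when the simulation step is written in an autodiff framework, the rollout loss can be backpropagated through the dynamics directly into the policy parameters, instead of relying on sampled policy gradients.

```python
import torch
import torch.nn as nn

# Toy differentiable "simulator": a 1-D point mass (purely illustrative).
def sim_step(pos, vel, force, dt=0.02):
    vel = vel + dt * force   # differentiable dynamics
    pos = pos + dt * vel
    return pos, vel

policy = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
target = torch.tensor(1.0)  # desired position

for step in range(200):
    pos, vel = torch.tensor(0.0), torch.tensor(0.0)
    loss = torch.tensor(0.0)
    for t in range(50):
        force = policy(torch.stack([pos, vel])).squeeze()
        pos, vel = sim_step(pos, vel, force)
        loss = loss + (pos - target) ** 2   # tracking cost over the rollout
    # Gradients flow through the simulator itself, not just the policy.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```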