Super helpful video that's intended for people who aren't too familiar with StarCraft (or machine learning):
Next video in the series (full playlist here):
AlphaStar is my favourite example of an achievement in AI. It shows that you can get to significantly better than average human performance on a difficult, complex task with a long time horizon and multi-step sequences of actions. Moreover, it shows that you can do this simply with a large dataset of human behaviour to imitate (i.e. with imitation learning). This side-steps problems in reinforcement learning like:
- sparse rewards (e.g. victory or defeat in StarCraft occurs only every ~10-20 minutes)
- credit assignment (i.e. which actions had what effect on the reward?)
- reward hacking (i.e. the AI agent gives you exactly what you asked for, which turns out to be not what you wanted)
- enormous action spaces (e.g. DeepMind estimates ~10^26 possible actions per time step in StarCraft), which make finding the right sequences by trial and error intractable
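At its core, the imitation-learning stage is just supervised learning on (state, action) pairs harvested from human replays: no rewards, no credit assignment, no trial-and-error search. Here's a toy behavioural-cloning sketch; the states, actions, and linear policy are illustrative stand-ins, nothing like AlphaStar's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset of (state, expert_action) pairs standing in for human replays.
# States are 4-dim feature vectors; the "expert" picks action 1 when the
# first feature is negative, else action 0.
states = rng.normal(size=(500, 4))
actions = (states[:, 0] < 0).astype(int)

# Linear policy trained with softmax cross-entropy: pure supervised
# learning on what the demonstrator did, never on a reward.
W = np.zeros((4, 2))
for _ in range(300):
    logits = states @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(2)[actions]
    grad = states.T @ (probs - onehot) / len(states)
    W -= 0.5 * grad

# Fraction of time steps where the cloned policy matches the expert.
accuracy = (np.argmax(states @ W, axis=1) == actions).mean()
```

The cloned policy ends up agreeing with the demonstrator on almost every state, despite never seeing a reward signal.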
Here's a visualization showing how good AlphaStar got using various training techniques, including pure imitation learning:
AlphaStar makes me feel optimistic that the planning or behaviour generation component of autonomous driving can be solved with a big enough dataset of human driving behaviour. Imitation learning should be able to get us most of the way there. My hunch is that, if anything works, it will be a combination of imitation learning and explicit code. (Cool technical presentation on combining the two here.)
Fine-tuning with reinforcement learning may be helpful or perhaps even necessary. One way to get a reward signal would be human interventions. This approach is susceptible to reward hacking if you start from scratch, but maybe it wouldn't be if you did imitation learning first. Perhaps training via reinforcement learning in simulation would be possible using replays of real-world situations, but this approach would face the problem of accurately simulating how humans would react to the autonomous car's actions.
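To make the intervention idea concrete, here's a toy sketch of turning a log of human takeovers into a per-step reward signal. The window size and penalty value are arbitrary illustrative choices, not from any deployed system:

```python
def intervention_rewards(num_steps, interventions, window=5, penalty=-1.0):
    """Assign a reward to each time step from a log of human takeovers.

    Steps within `window` steps before an intervention (including the
    intervention step itself) get `penalty`; all other steps get 0.
    The idea: the driving just before a takeover is what the policy
    should learn to avoid, and everything else is implicitly fine.
    """
    rewards = [0.0] * num_steps
    for t in interventions:
        for s in range(max(0, t - window), min(t + 1, num_steps)):
            rewards[s] = penalty
    return rewards
```

For example, with a takeover at step 7 and a window of 2, steps 5 through 7 would be penalized and the rest left neutral.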
An alternative (or maybe supplementary) approach would be reward learning, specifically reward learning from demonstrations. This is part of an approach called inverse reinforcement learning. First, you learn a reward function from the behaviour of an “expert” demonstrator, such as a human, assuming the demonstrations represent optimal, reward-maximizing behaviour. Second, you do reinforcement learning with the learned reward function.
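A minimal sketch of that two-step structure, with a deliberately simplified linear reward over state features (real inverse-RL algorithms, like maximum-entropy IRL, are far more careful about step one):

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: learn a reward from demonstrations. With a linear reward
# r(s) = w . s, one crude simplification is to point w toward the
# demonstrator's average feature vector, so states resembling the
# demonstrations score highly.
demo_states = rng.normal(loc=[1.0, 0.0], size=(200, 2))
w = demo_states.mean(axis=0)
w /= np.linalg.norm(w)

def learned_reward(state):
    return float(w @ state)

# Step 2: optimize behaviour against the learned reward instead of a
# hand-written one (here reduced to a trivial one-step greedy choice
# where full reinforcement learning would go).
candidates = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
best = candidates[np.argmax([learned_reward(c) for c in candidates])]
```

The greedy choice picks the candidate state that most resembles the demonstrations, which is exactly the failure mode discussed next: the learned reward can't tell you how to do *better* than the demonstrator.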
A potential problem with inverse reinforcement learning is that, if you assume demonstrations are optimal, your agent may not be able to do better than your demonstrations. A pair of awesome papers attempt to solve this problem, at least for some tasks, with an approach called T-REX and a follow-up called D-REX. The researchers figured out how to get an agent to extrapolate beyond the best demonstrations it has seen by ranking the demonstrations by quality. Maybe D-REX, or something like it, could be applied to the autonomous vehicle problem. That's a complex technical question I'm not equipped to answer, but I find it a fascinating idea.
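The core trick in T-REX is a Bradley-Terry-style ranking loss: given pairs of trajectories where one is known to be better, train a reward function to assign the better trajectory a higher total reward. A toy sketch with synthetic trajectories and a linear reward (the actual papers train neural networks on pixel observations):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic ranked trajectories: quality q in [0, 1]; states from a
# better trajectory have a larger first feature on average.
def make_traj(q, length=20):
    return rng.normal(loc=[q, 0.0], size=(length, 2))

qualities = rng.uniform(size=40)
trajs = [make_traj(q) for q in qualities]

# Linear reward r(s) = w . s trained with the Bradley-Terry ranking
# loss: for a pair (worse i, better j), maximise log P(j > i), where
# P(j > i) = sigmoid(R_j - R_i) and R is the summed reward.
w = np.zeros(2)
for _ in range(2000):
    i, j = rng.integers(len(trajs), size=2)
    if qualities[i] == qualities[j]:
        continue
    if qualities[i] > qualities[j]:
        i, j = j, i  # ensure j is the better trajectory
    Si, Sj = trajs[i].sum(axis=0), trajs[j].sum(axis=0)
    p_j = 1.0 / (1.0 + np.exp((Si - Sj) @ w))  # P(j ranked above i)
    w += 0.01 * (1.0 - p_j) * (Sj - Si)        # gradient ascent step

# The learned reward should score high-quality behaviour above poor
# behaviour, averaged over fresh trajectories it never trained on.
good = np.mean([make_traj(0.9).sum(axis=0) @ w for _ in range(20)])
bad = np.mean([make_traj(0.1).sum(axis=0) @ w for _ in range(20)])
```

Because the reward is trained only to respect rankings, not to match the demonstrator, it can assign even higher reward to behaviour better than the best demonstration, which is what lets the agent extrapolate.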
The success of AlphaStar (and similar successes like OpenAI Five) has made me feel fairly relaxed about the planning/behaviour generation part of the problem. Personally, I feel a lot more worried about computer vision. As Elon put it:
“The hardest thing is having accurate representation of the physical objects in vector space. So, taking the visual input, primarily visual input, some sonar and radar and then creating an accurate vector space representation of the objects around you. Once you have an accurate vector space representation, the planning and control is relatively easier. That is relatively easy.
Basically, once you have accurate vector space representation, then you're kind of like a video game, like cars in Grand Theft Auto or something.”
Once driving is a video game, i.e. once computer vision is “solved” and the world state is known with a high degree of confidence, the problem feels a lot more tractable to me. The same techniques that made AlphaStar work can be used: namely, imitation learning and (possibly) fine-tuning with reinforcement learning. If reinforcement learning is used, ideas like D-REX, or humans providing the reward signal via interventions, could substitute for StarCraft's built-in victory and defeat conditions.
But since there is no proof of concept comparable to AlphaStar for computer vision, I worry that superhuman computer vision for autonomous vehicles may not be tractable with only incremental advances on current technology. I'm not arguing this is actually the case; I just don't have strong evidence with which to rule out this possibility.
On the computer vision front, I'm hopeful (but not necessarily super confident) that scaling up techniques like automatic curation, active learning, weakly supervised learning, sensor-supervised learning, and self-supervised learning will push the frontier all the way to superhuman vision. We all want self-driving cars and, to me, this looks like the best bet for solving computer vision right now.