
The analogy between DeepMind’s AlphaStar and Tesla’s Full Self-Driving

Here’s my hunch:

The rate of progress on Full Self-Driving depends on whether — particularly after HW3 launches — Tesla can use its training fleet of hundreds of thousands of HW3 cars to do for autonomous driving what DeepMind’s AlphaStar did for StarCraft II. That is, use imitation learning on the state-action pairs from real world human driving. Then augment in simulation with reinforcement learning.​

Imitation learning and reinforcement learning

A state-action pair in this context is everything the perception neural network perceives (the state), and the actions taken by the driver, like steering, accelerating, braking, and signalling (the action). Similar to the way a human annotator labels an image with the correct category (e.g. stop sign, pedestrian, vehicle), the human driver “labels” a set of neural network perceptions with the correct action (e.g. brake, turn, accelerate, slow down).
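
To make that concrete, here is a toy sketch of what a single state-action pair could look like. Every field name and value is invented for illustration; it is not a claim about what Tesla actually logs.

```python
# One hypothetical state-action pair from a single instant of human driving.
# The "state" is what the perception stack reports; the "action" is what the
# human driver did at that same instant. All fields are illustrative.

state = {
    "lead_vehicle_distance_m": 53.5,   # distance to the car ahead
    "lane_offset_m": 0.1,              # lateral position within the lane
    "traffic_light": "green",
    "speed_limit_kph": 80,
}

action = {
    "steering_angle_deg": -2.0,
    "accelerator": 0.15,               # 0..1 pedal position
    "brake": 0.0,
    "turn_signal": "none",
}

state_action_pair = (state, action)    # one training example, "labelled" by the driver
```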

This form of imitation learning is just the deep supervised learning we all know and love. Another name for it is behavioural cloning. A report by Amir Efrati in The Information cited an unnamed source or multiple unnamed sources claiming that Tesla is doing this:

“Tesla’s cars collect so much camera and other sensor data as they drive around, even when Autopilot isn’t turned on, that the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object. Such an approach has its limits, of course: behavior cloning, as the method is sometimes called…

But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team. They envision a future in which humans won’t need to write code to tell the car what to do when it encounters a particular scenario; it will know what to do on its own.”​
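
Mechanically, the idea in that quote (put enough good human driving through a network so it learns to predict steering, braking and acceleration directly) is just a supervised regression loop. Here is a minimal behavioural-cloning sketch in PyTorch; the layer sizes, the flattened state vector, and the three-number action are my own simplifying assumptions, not Tesla's actual setup.

```python
# Minimal behavioural cloning: regress the human driver's action from the state.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 64, 3   # assumed sizes: flattened state, (steer, accel, brake)

policy = nn.Sequential(
    nn.Linear(STATE_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(states, human_actions):
    """One gradient step: push the network's predicted action toward the driver's."""
    loss = loss_fn(policy(states), human_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake batch of 32 logged state-action pairs, just to show the call.
print(train_step(torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM)))
```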

There are other forms of imitation learning as well, such as inverse reinforcement learning. Pieter Abbeel, an expert in imitation learning and reinforcement learning, has expressed support for the idea of using inverse reinforcement learning for autonomous cars. Drago Anguelov, head of research at Waymo, says that Waymo uses inverse reinforcement learning for trajectory optimization. But from what I understand, Waymo uses supervised learning rather than inverse reinforcement learning in cases where they have more data on human driving behaviour.

Anguelov’s perspective is super interesting. In his talk for Lex Fridman’s MIT course, he used this diagram to represent machine learning replacing more and more hand coding in Waymo’s software:

[Image: Anguelov’s diagram of machine learning replacing more and more hand coding in Waymo’s software]


This diagram is strikingly similar to the one Andrej Karpathy used to visualize Tesla’s transition from Software 1.0 code (traditional hand coding) to Software 2.0 code (neural networks):

[Image: Karpathy’s diagram of Software 2.0 code (neural networks) replacing Software 1.0 code (traditional hand coding) at Tesla]


Anguelov’s talk is the most detailed explanation I’ve seen of what Waymo is doing:


As many people already know, reinforcement learning is essentially trial and error for AI. In theory, a company working on autonomous driving could do reinforcement learning in simulation from scratch. Mobileye is doing this. One of the problems is that a self-driving car has to learn how to respond appropriately to human driving behaviour. Vehicles in a simulation don’t necessarily reflect real human driving behaviour. They might be following a simple algorithm, like the cars in a video game such as Grand Theft Auto V.

If Mobileye's approach works, why wouldn't Waymo collaborate with DeepMind and solve autonomous driving with reinforcement learning? On this very topic, Oriol Vinyals, one of the creators of AlphaStar, said:

"Driving a car is harder. The lack of (perfect) simulators doesn't allow training for as much time as would be needed for Deep RL to really shine."​

Reinforcement learning from scratch worked for OpenAI Five on Dota 2. Surprisingly, OpenAI Five converged on many tactics and strategies used by human players in Dota 2, simply by playing against versions of itself. So, who knows, maybe Mobileye will be vindicated.

Perhaps one key difference between Dota 2 and driving is that there are driving laws and cultural norms. In Dota 2, everything that is possible to do in the game is allowed, and players are constantly looking for whatever play styles will lead to more victories. Driving, unlike Dota 2, is a coordination problem. Some of the rules are arbitrary and not discoverable through reinforcement learning. To use a toy example, with no prior knowledge a virtual agent might learn to drive on the right side of the road, or the left side of the road. It would have no way of guessing the arbitrary rule in the country it’s going to be deployed in.

This is solvable because you can, for example, penalize the agent for driving on the wrong side of the road. Human engineers essentially hand code the knowledge into the agent via the reward function (i.e. the points system). But what if there are more subtle norms and rules that human drivers follow? Can an agent learn all of them with no knowledge of human behaviour? Maybe, maybe not.
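
To illustrate what hand coding a norm into the reward function might look like, here is a toy points system. The field names and penalty values are invented; a real reward function would be far more involved.

```python
# Toy reward function: the "points system" encodes norms the agent can't discover
# on its own, like which side of the road is the legal one in this country.

def reward(state, crashed):
    r = 0.0
    if crashed:
        r -= 100.0                                    # big penalty for any collision
    if state["side_of_road"] != state["legal_side"]:
        r -= 10.0                                     # penalise the wrong side of the road
    r += 0.1 * state["progress_m"]                    # small credit for forward progress
    return r

print(reward({"side_of_road": "left", "legal_side": "right", "progress_m": 30.0},
             crashed=False))   # -7.0: punished for the arbitrary-but-real rule
```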

Imitation learning can be used to create so-called “smart agents” that learn to drive based on human behaviour. These agents can be used in a simulation, and reinforcement learning can occur in that simulation. In theory, this simulation would be a much better model of real world driving than one populated by agents that start from scratch and drive against versions of themselves. If imitation learning is successful in copying human behaviours, then in theory what is learned through reinforcement learning in simulation could actually transfer to the real world.

AlphaStar and Full Self-Driving

With imitation learning alone, AlphaStar achieved a high level of performance. DeepMind estimates it was equivalent to a human player in the Gold or Platinum league in StarCraft II, which is roughly around the middle of the ranked ladder. So AlphaStar may have achieved roughly median human performance just with imitation learning. When AlphaStar was augmented with population-based, multi-agent reinforcement learning — a tournament style of self-play called the AlphaStar league — it reached the level of professional StarCraft II players.

AlphaStar took about 3 years of development, with little to no publicly revealed progress. The version of AlphaStar that beat MaNa — one of the world’s top professional StarCraft II players — was trained with imitation learning for 3 days, and reinforcement learning for 14 days (on a compute budget estimated around $4 million). So that’s a total of 17 days of training.

In June, Andrej Karpathy will have been at Tesla for 2 years. He joined as Director of AI in June 2017. Since at least around that time (perhaps earlier, I don’t know), Tesla has been looking for Autopilot AI interns with expertise in (among other things) reinforcement learning. Karpathy himself spent a summer as an intern at DeepMind working on reinforcement learning. He also worked on reinforcement learning at OpenAI.

The internship job postings also mention working with “enormous quantities of lightly labelled data”. I can think of at least two interpretations:
  1. State-action pairs for supervised learning (i.e. imitation learning) of path planning and driving policy (i.e. how to drive).
  2. Sensor data weakly labelled by driver input (e.g. image of traffic light labelled as red by driver braking) for weakly supervised learning of computer vision tasks. (An example of weakly supervised learning is Facebook training a neural network on Instagram images using hashtags as labels. A toy sketch of this interpretation follows this list.)
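
Here is a toy sketch of the second interpretation: deriving a weak label for a camera frame from what the driver did, instead of from a human annotator. The thresholds and names are invented; a real pipeline would be noisier and more careful.

```python
# Weak labelling: guess the traffic light's colour from driver behaviour.
# Noisy, but free: no human annotator in the loop.

def weak_label_for_traffic_light(frame_has_traffic_light, brake_pedal, speed_kph):
    if not frame_has_traffic_light:
        return None
    if brake_pedal > 0.3 and speed_kph > 10:
        return "red_or_yellow"     # driver braked while approaching the light
    return "probably_green"        # driver kept going

print(weak_label_for_traffic_light(True, brake_pedal=0.5, speed_kph=40))  # red_or_yellow
```
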
Tesla’s Full Self-Driving is different from AlphaStar in that Tesla has a plan to roll out features to customers incrementally, so progress is a lot more publicly visible. We didn’t get to see the agents that DeepMind trained, say, 6 months ago. So, we don’t really know how fast the agents went from completely incompetent to superhuman. What’s cool and interesting, though, is that Demis Hassabis (the CEO of DeepMind) seemed totally surprised after AlphaStar beat MaNa:


I don’t think I would be super surprised if, 3 years from now, Tesla is way behind schedule and progress has been plodding and incremental. I would be amazed, but not necessarily taken totally off guard, if 3 years from now Tesla’s FSD is at an AlphaStar-like level of performance on fully autonomous (unsupervised by a human) driving.

We can’t predict how untried machine learning projects will turn out. That’s why researchers publish surprising results—we wouldn’t be surprised if we could predict what would happen in advance. The best I can do in my lil’ brain is draw analogies from completed projects like AlphaStar to what Tesla is doing (or might be doing), and then try to identify what relevant differences might change the outcome in Tesla’s case.

Some differences that come to mind:
  • perfect perception in a virtual environment vs. imperfect perception in a real world environment
  • optimizing for one long-term goal (winning the game), which allows individual mistakes vs. a task where a single error could lead to a crash
  • no need to communicate with humans vs. some need to communicate with humans
  • self-play with well-defined conditions for victory and defeat vs. no inherent win/loss conditions in driving, although maybe you could design a driving game to enable self-play
What are other important differences? What are other reasons this approach might not work?
 
Food for thought:
  • Imposed restrictions to create a fair game against human opponent vs. taking advantage of non-human states and actions

With imitation learning alone, AlphaStar achieved a high level of performance. DeepMind estimates it was equivalent to a human player in the Gold or Platinum league in StarCraft II, which is roughly around the middle of the ranked ladder. So AlphaStar may have achieved roughly median human performance just with imitation learning. When AlphaStar was augmented with population-based, multi-agent reinforcement learning — a tournament style of self-play called the AlphaStar league — it reached the level of professional StarCraft II players.
There are many great drivers out there, demonstrated by the relatively low number of deaths caused by human fault alone (i.e. a sober driver paying complete attention to the road). Yet a major goal of Tesla is to surpass that mark. How will the addition of non-human states and actions affect the reinforcement learning? How will the external environment adapt or change to these new states and actions?
 
Food for thought:
  • Imposed restrictions to create a fair game against human opponent vs. taking advantage of non-human states and actions

Do you mean that, unlike AlphaStar which was artificially limited in some ways, autonomous cars don't have to be limited? Yeah! For those who don't know... In the first 5 matches against MaNa, AlphaStar's APMs (actions per minute; for humans this would be keystrokes and clicks) were artificially limited, although a lot of gamers complained it could still do crazy superhuman bursts. We definitely don't need to limit autonomous cars' reaction time. :D

In the 6th match, in which MaNa beat AlphaStar (yay humans!), AlphaStar was restricted from looking in more than one place at once. In the previous games, it could see the whole map at once, all the time. We can let autonomous cars look in 360 degrees at all times.

There are many great drivers out there, demonstrated by the relatively low number of deaths caused by human fault alone (i.e. a sober driver paying complete attention to the road). Yet a major goal of Tesla is to surpass that mark. How will the addition of non-human states and actions affect the reinforcement learning? How will the external environment adapt or change to these new states and actions?

Hmm... One thing that surprised me is that it isn't necessarily better to have only data from the best drivers. I think Drago Anguelov from Waymo said in his talk that you want examples of cars getting into bad situations so you can train the neural network how to recover from them. Or maybe that was someone talking about AlphaStar. I can't remember.

DeepMind published this chart showing AlphaStar's progress. You can see how far it got just with imitation learning/supervised learning. Then when it did reinforcement learning in the AlphaStar league, it improved immensely. Its MMR (Matchmaking Ranking) roughly doubled. It went from the level of a roughly median human player to the pro level.

[Image: DeepMind’s chart of AlphaStar’s MMR progress during supervised training and the AlphaStar league]


Check out DeepMind's blog post. It's really well-written and goes through all the details:

AlphaStar: Mastering the Real-Time Strategy Game StarCraft II | DeepMind

If this pattern were to hold for autonomous cars, then reinforcement learning would greatly enhance the end product of imitation learning.

Hope I understood what you meant. If not, please feel free to elaborate.
 
Just in case what I said in my first post about "state-action pairs" wasn't understandable, I'll clarify what that term means in the context of Tesla's Full Self-Driving. If you already understand, don't read this. If you don't understand, I hope this is helpful.

The state would be all the judgments made by the perception neural network. The technical term for this is the mid-level representation. In verygreen's awesome videos, the mid-level representation is visualized by lines, boxes, the "green carpet" representing driveable road, and the text labels like "Car (53.5m)". For example:


The action would be whatever the human in the driver's seat does that Tesla can measure and record. Such as the angle of the steering wheel, use of the turn signal, and use of the accelerator and brake pedals. You can even see steering wheel angle represented in the bottom-right corner of verygreen's video.

Waymo published a paper on imitation learning where they described a neural network called ChauffeurNet. Waymo also published a blog post on ChauffeurNet. This is how Waymo describes the state, or the mid-level representation:

"In order to drive by imitating an expert, we created a deep recurrent neural network (RNN) named ChauffeurNet that is trained to emit a driving trajectory by observing a mid-level representation of the scene as an input. A mid-level representation does not directly use raw sensor data, thereby factoring out the perception task, and allows us to combine real and simulated data for easier transfer learning. As shown in the figure below, this input representation consists of a top-down (birds-eye) view of the environment containing information such as the map, surrounding objects, the state of traffic lights, the past motion of the car, and so on. The network is also given a Google-Maps-style route that guides it toward its destination."
The action was the drivers' behaviour:

"We trained the model with examples from the equivalent of about 60 days of expert driving data..."
The state-action pair is the mid-level representation at a given moment, paired with the driver's behaviour in that same moment.

It's important to clarify that, unlike images or video from a vehicle's cameras, state-action pairs don't need to be labelled by humans. When a vehicle's cameras capture a picture of a stop sign, a human annotator later has to look at that picture, draw a bounding box around the stop sign, and label it "stop sign". Or even select every pixel in the image that corresponds to the stop sign (this is called semantic segmentation).

In the case of state-action pairs, the mid-level representation (i.e. the state) is the equivalent of the image of the stop sign. The mid-level representation is the data that needs to be labelled. The human driver is the equivalent of the human annotator, drawing the bounding box or selecting the pixels. The driver is "labelling" the mid-level representation with their hands on the steering wheel and their foot on the pedals, rather than by clicking and typing on a computer. The "label" — the equivalent of the category "stop sign", the bounding box around the sign, or the pixel-by-pixel segmentation — is the steering and pedal action taken by the driver (i.e. the action). For example, the mid-level representation might include a yellow traffic light up ahead, and if the human driver slows down, then the mid-level representation is labelled with slowing down.

A driving neural network trained on many such pairings of 1) mid-level representations that include yellow traffic lights and 2) human drivers slowing down... That driving neural network could learn to slow down whenever the perception neural network detects a yellow light. I hope what I'm saying is clear.
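
As a toy illustration of that yellow-light example, the training pairs can come straight out of the drive log, already "labelled" by what the driver did. The field names here are invented for illustration.

```python
# Build training pairs directly from a logged drive: the driver's deceleration
# is the label for the frames where a yellow light was detected. No annotators.

drive_log = [
    {"traffic_light_ahead": "yellow", "speed_kph": 52, "driver_accel_mps2": -1.8},
    {"traffic_light_ahead": "none",   "speed_kph": 80, "driver_accel_mps2":  0.0},
    {"traffic_light_ahead": "yellow", "speed_kph": 47, "driver_accel_mps2": -2.1},
]

training_pairs = [
    ({"traffic_light_ahead": f["traffic_light_ahead"], "speed_kph": f["speed_kph"]},  # state
     {"accel_mps2": f["driver_accel_mps2"]})                                          # action
    for f in drive_log
]
print(len(training_pairs), "labelled examples, zero annotation hours")
```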

If you collect billions of miles of video from vehicles, human annotators have to sit down and painstakingly label it all. If you collect billions of miles of state-action pairs from vehicles driven by humans, the data is already labelled — the driving is the labelling. So, a nice part of this approach if you're a company like Tesla is you can upload a lot of data that already comes labelled. You don't have to pay human annotators to label it.

This is counterintuitive if you are used to thinking of a label only as a word or phrase that is attached to an object in an image. In deep supervised learning (what people are referring to most of the time when they say "deep learning"), the input data can be all kinds of things, not just images, and so can the output data — it doesn't have to just be words.
 
Reinforcement learning also learns about state-action pairs, but it learns about them from trial and error, not from humans. For example, it might learn from experience that in state X, action Y is the highest-value action. (State X could be "red light" and action Y could be "brake".) That is, the action it should take to maximize its reward (i.e. its score in a points system designed by humans).
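
Here is a toy version of that trial-and-error loop for the red-light example: a value table, a hand-designed points system, and repeated attempts. It only shows the mechanics; a real driving agent would be trained very differently.

```python
# Trial and error over a tiny value table: learn which action has the highest
# value in the state "red_light", given a human-designed reward.
import random

q = {("red_light", "brake"): 0.0, ("red_light", "proceed"): 0.0}
rewards = {("red_light", "brake"): 1.0, ("red_light", "proceed"): -10.0}  # the points system
alpha = 0.1  # learning rate

for _ in range(1000):
    action = random.choice(["brake", "proceed"])                          # explore
    r = rewards[("red_light", action)]
    q[("red_light", action)] += alpha * (r - q[("red_light", action)])    # update the estimate

print(max(q, key=q.get))   # ('red_light', 'brake') is learned as the highest-value action
```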

For an autonomous car, this trial and error would occur in simulation. The problem with simulation is that simulation is different from reality, so what an AI agent learns about state-action pairs in simulation might not translate to reality.

In principle, reinforcement learning can happen in the real world, but for cars that would be dangerous, expensive, and slow. The cars would probably crash a lot, and agents like AlphaStar and OpenAI Five trained with reinforcement learning are trained on the equivalent of thousands of years of experience in simulation.

Maybe real world reinforcement learning could work to some small degree in Teslas. Maybe Autopilot and Full Self-Driving could attempt to maximize a reward based on miles before the human takes over. Before Andrej Karpathy joined Tesla, he wrote this in a blog post on reinforcement learning:

"Another related approach is to scale up robotics, as we’re starting to see with Google’s robot arm farm, or perhaps even Tesla’s Model S + Autopilot."​

The main thing is that, to even get to this point, you have to already have a system that works well enough and safely enough to deploy to customers. You can't do reinforcement learning from scratch in the real world, at least not this way.
 
As many people already know, reinforcement learning is essentially trial and error for AI. In theory, a company working on autonomous driving could do reinforcement learning in simulation from scratch. Mobileye is doing this. One of the problems is that a self-driving car has to learn how to respond appropriately to human driving behaviour. Vehicles in a simulation don’t necessarily reflect real human driving behaviour. They might be following a simple algorithm, like the cars in a video game such as Grand Theft Auto V.

Reinforcement learning from scratch worked for OpenAI Five on Dota 2. Surprisingly, OpenAI Five converged on many tactics and strategies used by human players in Dota 2, simply by playing against versions of itself. So, who knows, maybe Mobileye will be vindicated.

Mobileye isn't doing RL from scratch.

[Image: excerpt from a Mobileye paper]
 
Hmm... One thing that surprised me is that it isn't necessarily better to have only data from the best drivers. I think Drago Anguelov from Waymo said in his talk that you want examples of cars getting into bad situations so you can train the neural network how to recover from them. Or maybe that was someone talking about AlphaStar. I can't remember.
...
Hope I understood what you meant. If not, please feel free to elaborate.

To take it to an extreme: if we get to a point where Self Driving Cars are pervasive in the world, can the few Human Driving cars do whatever they want? Will the Neural Network learn that Human Drivers are so unpredictable that it's better to avoid them at all costs?

In that extreme situation, I'm envisioning a world where obtaining a Driver's License is prohibitively difficult. Human Driving cars may be reserved for Law Enforcement or otherwise similarly highly-trained individuals.

Sorry, I know that philosophical debate of a future 5+ years away strays from your original post. It's just where my mind strayed.

Back on Topic:
Do you mean that, unlike AlphaStar which was artificially limited in some ways, autonomous cars don't have to be limited? Yeah! For those who don't know... In the first 5 matches against MaNa, AlphaStar's APMs (actions per minute; for humans this would be keystrokes and clicks) were artificially limited, although a lot of gamers complained it could still do crazy superhuman bursts. We definitely don't need to limit autonomous cars' reaction time. :D
YES! Exactly.


Check out DeepMind's blog post. It's really well-written and goes through all the details:

AlphaStar: Mastering the Real-Time Strategy Game StarCraft II | DeepMind

If this pattern were to hold for autonomous cars, then reinforcement learning would greatly enhance the end product of imitation learning.
Yes, I loved reading that Blog post. I wish they would go into detail on why Protoss had to be the chosen opponent (I have my assumptions). I wonder if they can expand to other StarCraft races with the same results? And I wonder how AlphaStar would react against multiple opponents and/or playing with a teammate?
 
Mobileye isn't doing RL from scratch.

[Image: excerpt from a Mobileye paper]

Interesting, what’s the source? On Twitter, Amnon Shashua said:

“Imitation learning is great when you have someone to imitate (like in pattern recognition & NLP). We instead created two layers – one based on “self-play” RL that learns to handle adversarial driving (including non-human) and another layer called RSS which is rule-based.”
Amnon Shashua on Twitter

So I interpreted this as no imitation learning. But maybe there is more to the story.
 

Right, so, they are doing some amount of imitation learning to kick off reinforcement learning. I wish they elaborated on that more in the paper; there is not much to go on. I don’t see anywhere where they discuss the goal of training a human-like smart agent for simulation, or an approach to reaching that goal.

Without further details, I think the question remains: will policies learned via “self-play” lead to non-human-like behaviours in complex, interactive scenarios? Will the agent be really good at interacting with virtual agents, but not so good at interacting with real humans?

How will the agent learn to interact with real humans without any practice? And without imitating how real humans interact? It might turn out that the same interactive behaviours emerge naturally from “self-play”, but this is an untested conjecture. The other possible outcome is that human driving involves a lot of behaviours that are idiosyncratically human and that can’t be learned without knowledge of human behaviour.
 
Right now it feels to me like supervised learning is hard to beat for perception, but long-term it would be ideal to get rid of supervised learning because human data labelling bottlenecks progress. If learning of both perception and action can occur without human labelling or any supervisory signal from humans, then learning will be bottlenecked by compute rather than by, essentially, human labour.

DeepMind is working on unsupervised learning of perception for StarCraft, which is sooo interesting. A future version of AlphaStar that could beat a pro using unsupervised learning on raw pixels rather than the game’s API would be as mind-blowing a step forward as the original. The problem with reality is it doesn’t have an API! So if we want robots to learn tasks, we need a more robust solution to perception. If agents can learn in simulation from raw visual input and then transfer that learning to the real world, then essentially we can solve any robotics problem using just compute.

Which is how we get to:

[Robot GIFs]


(minus being able to talk and think and all that)

For autonomous cars, supervised learning of perception might be okay because it’s a domain where companies are willing to pour billions into labelling if necessary. If we want all kinds of different robots that operate in all kinds of different domains, without each kind of robot needing to generate huge profits to pay off the labour investment in labelling, then non-supervised learning has to happen. (Alternatively, maybe supervised learning could just be scaled up. Maybe we could have giant, commoditized, general-purpose training datasets for robot perception. Similar to Google’s Cloud AutoML, maybe in the future any roboticist can train a perception neural network on a huge dataset in the cloud. That’s another conceivable path to cheap and generalized robotics.)

It might turn out that the “one domain” of autonomous driving is too general for supervised learning to efficiently handle, in which case I guess we’ll have to switch over to a compute-limited form of learning for perception. This would likely set progress back a long time, unless there are some big breakthroughs in unsupervised learning or data efficiency in supervised learning.

So, the way of the future might be unsupervised learning of perception + reinforcement learning of action, or some version of end-to-end reinforcement learning like the model-based RL approach that the Google blog post describes.

Supervised learning of action (imitation learning) might be a quirky outlier in the evolution of AI. Driving might be the only example where there is free, massive-scale supervisory signal from humans operating robots.

But, on the other hand, if you make supervised learning a lot more data efficient then you open up the possibilities of what can be done with imitation learning. Or if you learn from data that is already abundant like video of humans doing tasks.
 
Another possible approach is to gather tons of labeled data, use the data to train a simulator to behave close to the real world, then train a reinforcement learning agent to master the simulation. As new failure edge cases are found, add more data to train the simulation and add new rules, then retrain the reinforcement learning agent. I have a hard time seeing what problems can’t be addressed this way. If you have ideas why this wouldn’t work, please let me know.
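
Schematically, the loop might look something like this. Every function below is a trivial stub, just to make the control flow explicit; it is not a real training system.

```python
# Stub pipeline: fit a simulator to data, train an RL agent in it, mine new
# failure cases, fold them back into the data, and repeat.

def train_simulator(dataset):
    return {"data": list(dataset)}                 # stand-in "learned" world model

def train_rl_agent(simulator):
    return {"policy": "drive", "experience": len(simulator["data"])}

def find_failure_cases(agent):
    return []                                      # real system: mine disengagements, near-misses

dataset = [("state", "action")]                    # seed data gathered from the fleet
for iteration in range(3):
    simulator = train_simulator(dataset)           # make the simulation behave like the world
    agent = train_rl_agent(simulator)              # let RL master the current simulation
    failures = find_failure_cases(agent)           # look for new edge cases
    if not failures:
        break
    dataset.extend(failures)                       # add the edge cases and retrain
```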
 