
How does the new end-to-end FSD work? Need a block diagram of data flow from the fleet to Dojo to an individual's car.

Does Dojo actually have all the millions of sections of road, street, highway, intersections, parking lots, etc. with millions of Teslas' historical actions at those (call them nodes) stored in Dojo's memory? (Seems impossible.) Then does Dojo transmit to the individual's Tesla the best action to take at those nodes? And when a driver intervenes and responds to the question of what happened, is there a person who looks at that node and, for example, decides that the Tesla's action was not safe and then revises the model for that node? Anyway, that is how I imagine it, impossible as it may seem.
 
Does Dojo actually have all the millions of sections of road, street, highway, intersections, parking lots, etc. with millions of Teslas' historical actions at those (call them nodes) stored in Dojo's memory? ...

No, that is not how it works at all. Dojo does not have a big map with stored FSD actions. Dojo is just a training computer: it takes in data and trains the NN that goes into our cars. Dojo does not transmit anything to our cars, and Dojo does not drive our cars. All the driving that FSD does is handled inside our car's computer.

Tesla collects millions of video clips from our cars. Tesla then feeds those clips into a training computer like Dojo, which trains a neural network. Tesla then puts that NN into our cars, and the NN does the driving. When we disengage or intervene with FSD, Tesla can collect a video clip of that intervention to add to their training set. Tesla refines the training set, does more training, and then sends a new NN to our cars in the next software version, which hopefully works better. Tesla collects disengagement data to see if the new FSD version is performing better or worse and which areas need work. Tesla can also search the fleet for video clips of the specific types of issues they are looking for to put in their training set. I hope that helps.

Also, my understanding is that Dojo is basically dead; Tesla cancelled it and uses another big training computer instead.
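To make that loop concrete, here is a minimal Python sketch of it. Every name in it (Car, curate, train_neural_network) is invented for illustration; this is just the shape of the process described above, not Tesla's actual pipeline.

```python
import random

class Car:
    """Stand-in for a fleet vehicle; the NN always runs locally in the car."""
    def __init__(self):
        self.nn_version = 0
    def had_intervention(self):
        return random.random() < 0.1              # some drives need a takeover
    def upload_clip(self):
        return f"clip_{id(self)}"                 # video clip of the event
    def install(self, version):
        self.nn_version = version                 # OTA software update

def curate(clips):
    return clips    # in reality: auto-labeling and humans pick useful clips

def train_neural_network(training_set):
    return len(training_set)   # stand-in for a big run on Dojo/a GPU cluster

fleet = [Car() for _ in range(1000)]
training_set = []
for release in range(3):                          # a few release cycles
    clips = [c.upload_clip() for c in fleet if c.had_intervention()]
    training_set.extend(curate(clips))            # refine the training set
    new_version = train_neural_network(training_set)
    for c in fleet:
        c.install(new_version)                    # ship the new NN
    print(f"release {release}: trained on {len(training_set)} clips")
```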
 
Definitely not how it works.

The database (training set is a more appropriate term) that the training system uses consists of a representative set of objects (cars, signs, pedestrians, ...) and constructs (intersections, curves, ...) that it is expected to recognize or handle. If there are any real-world scenarios that it does not handle well (possibly detected through disengagements or randomly generated artificial scenarios), those video clips are requested from the fleet (or fed directly from the randomly generated training data) and added to the training set.
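Conceptually, such a fleet clip request might look like a metadata filter that cars evaluate locally. This toy sketch is purely illustrative, with invented field names:

```python
# Hypothetical "clip campaign": the training team describes the scenario
# they need, and cars whose clip metadata matches upload their video.
def matches_campaign(clip_metadata, campaign):
    return all(clip_metadata.get(key) == value
               for key, value in campaign.items())

campaign = {"scenario": "unprotected_left", "disengaged": True}

clip = {"scenario": "unprotected_left", "disengaged": True, "city": "SF"}
print(matches_campaign(clip, campaign))   # True -> this clip gets uploaded
```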

Additionally, the problem set is broken down into smaller chunks than I think you are imagining. It's not simply "here's an intersection and here's the best way to handle it." Rather, one thing that is done is that the surrounding objects and constructs are detected and located around the "ego" (a representation of the vehicle); the visualization effectively shows a representation of this. So, for example, the neural network may detect a passenger vehicle 30 feet directly ahead of me, a pickup truck 10 feet off my left side, and a bicyclist 25 feet ahead and to the right. It also detects lane markings, traffic lights, speed limit signs, curbs, etc. Now it has a picture of obstacles to avoid and valid paths to follow.
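As a rough illustration (with invented field names, not Tesla's actual representation), the output of that step might look like an ego-centric list of detected objects:

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    kind: str            # "car", "pickup", "bicyclist", ...
    ahead_ft: float      # distance ahead of ego (negative = behind)
    right_ft: float      # lateral offset (negative = left of ego)
    speed_mph: float = 0.0

# The scene from the example above, as perception might represent it:
scene = [
    DetectedObject("car",       ahead_ft=30, right_ft=0,   speed_mph=25),
    DetectedObject("pickup",    ahead_ft=0,  right_ft=-10, speed_mph=24),
    DetectedObject("bicyclist", ahead_ft=25, right_ft=8,   speed_mph=12),
]
# Plus lane markings, traffic lights, speed-limit signs, curbs, etc.
```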

But there is more to it than that: it also needs to predict the movement of those obstacles (whether they will cross our selected path or not), and to choose a path that avoids obstacles, stays within safe and legal travel lanes, and does so with "normal," comfortable behavior. The neural network outputs steering and accelerator/brake controls to achieve this. Again, if there are situations it does not handle well, those situations are added to the training set, and the neural network is fine-tuned and downloaded to the fleet as an FSD update.
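Here is a deliberately naive, hand-written sketch of that predict-then-plan idea, reusing the `scene` from the sketch above. In the real system this behavior is learned by the network, not coded as rules like this:

```python
def predict(obj, horizon_s=3.0):
    """Naive constant-velocity guess at where an object will be."""
    ft_per_s = obj.speed_mph * 5280 / 3600
    return obj.ahead_ft + ft_per_s * horizon_s, obj.right_ft

def plan(scene):
    """Pick controls that keep clear of predicted obstacle positions."""
    for obj in scene:
        ahead, right = predict(obj)
        if abs(right) < 5 and 0 < ahead < 50:   # predicted to block our lane
            return {"steering_deg": 0.0, "accel": -0.2}   # ease off
    return {"steering_deg": 0.0, "accel": +0.1}           # clear to proceed

print(plan(scene))
```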

The last point that is important to understand is that it's not like programming a robot, where you give it a precise set of movements to execute exactly the same way for each specific action. Since the training data is only a sample of all possible real-world scenarios, one way to think of it is that the network finds the best match in the training set to what it actually sees, and then performs the actions that the training process reinforced as the optimal solution. But it's even more complicated than that, because the neural network is capable of "melding" several actions into the chosen output if it doesn't find a perfect match in the training set: maybe it has seen (a) from one scenario and (b) from another, so if it sees something that looks like a combination of (a) and (b), it will output something that is roughly a combination of the best (a) and (b) outcomes.

Even this is not a truly accurate representation of how a neural network works, but perhaps at a basic level it is sufficient to think about it like that. If you really want to learn more, I would highly suggest spending time learning about NNs and how they are trained; you will have a vastly better appreciation for what is actually happening, both in the vehicle and on the training side.
 
Does Dojo actually have all the millions of sections of road, street, highway, intersections, parking lots, etc. with millions of Teslas' historical actions at those (call them nodes) stored in Dojo's memory? ...

What you are describing is similar to how the non-Tesla solutions do it. They use high-resolution mapping to guide the vehicles. The automated taxi companies spend months driving cars around a city to gather all the information and put it in the cars. And then most of them run the routes daily to see if anything has changed.

Elon realized years ago that this wasn't a sustainable model: it was neither flexible nor scalable. Tesla worked at making the car drive the way a human does. That started with using vision instead of other sensing solutions. But this showed the code needed to handle all the exceptions getting too big, so they switched to AI, and that's where we are today. The car makes decisions really close to the way you do.

You can think of FSD updates as snapshots of human brains. Each new version is a snapshot of a brain that's a little older and wiser.
 
[Attached image: Flow Chart.jpg]
 
What you are describing is similar to how the non-Tesla solutions do it. They use high-resolution mapping to guide the vehicles. ...

The idea that Tesla is building a "brain" to drive like a human whereas the competition is just guiding their cars on HD maps is a myth. The non-Tesla solutions are also building a "brain" to drive like a human, and they rely heavily on NNs. They use HD maps as a prior only, not to guide the car; they rely on the sensors and NNs to guide the car in real time, like Tesla. They are just using different sensors and building the "brain" differently.

Both Tesla and the non-Tesla approaches use maps, sensors for real-time perception, and NNs to make driving decisions. The differences are in the details: Tesla uses basic maps, cameras only, and an end-to-end NN; the non-Tesla solutions use more detailed maps, cameras + lidar + radar, and modular NNs.

And what the OP is describing is more like a central computer in the cloud that controls the cars. That is not how the non-Tesla solutions work at all. Everything is in the car: the HD maps and the NN that drives are in the cars, not in the cloud.
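To show the structural difference, here is the rough shape of the two architectures as structure-only Python. Every function is a stand-in for a learned network or fusion step, not real code from either company:

```python
perception_nn  = lambda cam, lidar, radar: {"objects": []}   # detect & locate
fuse_with_map  = lambda objects, hd_map: objects             # map as a prior
prediction_nn  = lambda objects: {"trajectories": []}        # forecast motion
planning_nn    = lambda objects, futures: {"steer": 0.0, "accel": 0.1}
driving_nn     = lambda frames, nav_map: {"steer": 0.0, "accel": 0.1}

def modular_stack(cam, lidar, radar, hd_map):
    """Waymo-style: separate NNs with defined interfaces between them."""
    objects = fuse_with_map(perception_nn(cam, lidar, radar), hd_map)
    return planning_nn(objects, prediction_nn(objects))

def end_to_end_stack(camera_frames, basic_map):
    """Tesla-style (FSD v12+): one NN from pixels plus navigation to controls."""
    return driving_nn(camera_frames, basic_map)

print(modular_stack(cam=..., lidar=..., radar=..., hd_map=...))
print(end_to_end_stack(camera_frames=..., basic_map=...))
```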
 
The idea that Tesla is building a "brain" to drive like a human whereas the competition is just guiding their cars on HD maps is a myth. ...

Not really sure what your "use HD maps as a prior only" means.

But if it takes months for Waymo or Cruise to learn a city, then there appears to be a LOT of HD information used somewhere. Or are you saying that their NNs are built for specific operating areas, not the general case?
 
Not really sure what your "use HD maps as a prior only" means.

Waymo has explained this. It means the HD map gives the car pre-knowledge of the road before it starts driving, but Waymo can change the map or ignore the map as needed. The HD map gives the car useful information, but it is not an absolute. The Waymo vehicle also relies on its sensors to drive in real time; it does not just blindly follow the HD map.

But if it takes months for Waymo or Cruise to learn a city, then there appears to be a LOT of HD information used somewhere.

HD maps do contain a lot of information, but HD mapping does not take months. Mapping only requires that they drive every route in the geofence a few times. And the Waymo Driver is generalized, so it does not need to learn a new city; it can drive anywhere. But of course, you can't launch a robotaxi service until you are sure it is safe and reliable enough. The reason it takes months to launch a new service is testing and validation to make sure the Waymo Driver is safe, plus setting up logistics for a robotaxi network, working with the community to build trust, etc.

The process might look something like this:
- 1-2 weeks to map
- 3 months of testing with safety drivers
- 2 months of driverless validation for employees only.
- Gradual rollout of driverless to the public on an "early access" list.
- General public launch.

Or are you saying that their NNs are built for specific operating areas, not the general case?

No, the NNs are built to be generalized and work everywhere. Waymo has even talked about how it is the same Waymo Driver that drives in SF, Phoenix, LA, Austin, etc. But like I said above, it still takes time to test, make improvements, validate for safety, and set up the logistics of a ride-hailing network.

One last point: Tesla fans love to say that Waymo is not scalable because it takes them months to add new cities while Tesla has deployed FSD everywhere. But Tesla does not have any driverless robotaxis yet. I highly doubt that Tesla will launch driverless robotaxis everywhere overnight. Even when FSD reaches the point where Tesla thinks the software is ready for driverless operation, Tesla will likely still need to set up geofences for the ride-hailing service, do further testing of the robotaxis, set up pick-up and drop-off points, set up remote assistance in case a robotaxi needs help, etc. The notion that Tesla will just solve L5, send us a software update, and all our cars instantly become driverless robotaxis overnight is not realistic, IMO. There is a lot of testing and preparation required for a reliable driverless robotaxi network. So it will take Tesla time to launch robotaxis, even after the software is deemed "ready."

And by the way, this is not Tesla versus Waymo. Both are working hard to solve difficult autonomous driving problems. Waymo has driverless robotaxis, but only in limited areas compared to the size of the US. Tesla has supervised self-driving everywhere, but does not have driverless yet. Neither has fully solved autonomous driving. Elon is making big promises that FSD will soon be able to drive over a year between safety-critical interventions. We shall see how quickly the intervention rate improves and, when it is "ready," how quickly Tesla can actually launch a driverless robotaxi network. I wish both approaches the best.
 
Coming from a lifetime of "if-then-else" computer programming and Boolean hardware logic, it is difficult to understand how this works. So here is what I can come up with. BTW, I really love seeing it function, especially at unprotected left turns; it is unbelievable! So the computer in the car sees a situation from its cameras and relates that to a matching situation that has been downloaded from Dojo. Hold on, there has to be some "logic" or neural nets in the car that can determine, for example, the speed of the approaching cars and the spacing of cars on the left and right to determine if it is safe to proceed. Also, BTW, I have observed the car doing this much faster than I can. Those are stressful situations. IMHO the average person has no clue about the value of this and other features until they experience it half a dozen or more times.
 
One other thing about FSD that I don't understand. I've been driving this rural highway on Autopilot for three years, and the car keeps me perfectly centered between the lane markings. Now, with FSD, it drifts over beyond the lane marking and onto the shoulder. Does this have to do with the two stacks they have been talking about, one for city streets and one for highways?
 

Definitely not how it works. The database (training set is a more appropriate term) that the training system uses consists of a representative set of objects and constructs that it is expected to recognize or handle. ...
Thank you, your description is what makes sense. I will have to read up on neural nets. I know they have been around since at least the early nineties; back then they were expected to replace PID loop tuning in process control. Don't know if that actually happened; I retired.
 
Definitely not how it works. The database (training set is a more appropriate term) that the training system uses consists of a representative set of objects and constructs that it is expected to recognize or handle. ...
Lovely walkthrough. Thanks for sharing this.

I think of it as solving an equation every 30 ms, where the given inputs are the scenes and objects captured by the cameras around the car, and the output is the steering and the speed, controlled by acceleration and braking.

What the training does is build that equation (algorithm).
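That mental model maps neatly onto a control loop. A toy sketch (all names invented, timing approximate):

```python
import time

class Camera:
    def capture(self):
        return [0.0] * 10                 # stand-in for a video frame

class Vehicle:
    def __init__(self): self.ticks = 0
    def fsd_engaged(self):
        self.ticks += 1
        return self.ticks <= 3            # run a few iterations, then stop
    def apply(self, steering, accel):
        print(f"steer={steering} accel={accel}")

trained_nn = lambda frames: (0.0, 0.1)    # the learned "equation"

cameras, car = [Camera() for _ in range(8)], Vehicle()
while car.fsd_engaged():
    frames = [cam.capture() for cam in cameras]   # inputs: camera scenes
    steering, accel = trained_nn(frames)          # evaluate the "equation"
    car.apply(steering, accel)                    # outputs: steer and speed
    time.sleep(0.03)                              # repeat every ~30 ms
```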
 
coming from a lifetime of "if then else" computer programming and Boolean hardware logic it is difficult to understand how this works.
Yeah, you need to throw out everything you know about algorithmic-style programming.

While neural networks are deterministic, they are not programmed with an algorithm. Rather, they are "trained" to respond to a given set of inputs.

As a simple primer, take the example of a network that recognizes handwritten letters A through Z. The network consists of an input layer (for example, representing a 20x20 pixel array), an output layer (for example, representing the letters A through Z), zero, one, or more hidden layers, and connections between each node in each layer and each node in the next layer. It might look something like this:

[Image: The-Architecture-of-a-Neural-Network.png]

(courtesy https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-a-neural-network/)

Each node (the circles in the diagram) has a numerical "value," and each connection between nodes (the lines) has a weight assigned to it (for purposes of this discussion, let's assume the weights are real numbers between 0 and 1, but that choice is somewhat arbitrary; you can use whatever number system makes sense in each particular implementation).

When one of the pixels on the input layer is "lit", that node has a value of "1", and if not "lit", it's "0".

For each subsequent layer, the value of each node is calculated as the sum, over all nodes feeding it, of the feeding node's value times the weight of the connecting line. Some of the weights might be close to zero (minimizing the impact of the preceding node) and some might be close to one (favoring that preceding node). If the value of the node is calculated to be above a certain threshold, it is considered a "1"; otherwise it's a "0" for purposes of calculating the next layer.

Once you reach the output layer, the goal is to have one (and only one) of the nodes above the threshold and all the others below it, so you wind up with just one answer. In the output layer, though, you could retain the calculated values of the nodes to express a mathematical confidence level in the result (e.g., the calculated value of the "correct" node could be 0.935 and the others 0.021, 0.019, 0.024, ...).
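Here is that forward pass as a minimal NumPy sketch, sized to match the example (a 20x20 input and 26 outputs). The weights are random here, so the "guess" is meaningless; training is what makes them useful:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.uniform(0, 1, size=(400, 64))   # 20x20 input layer -> 64 hidden nodes
W2 = rng.uniform(0, 1, size=(64, 26))    # hidden layer -> 26 outputs (A..Z)
THRESHOLD = 100.0                        # illustrative hidden-node threshold

def forward(pixels):
    """pixels: 400 values, 1 for 'lit' and 0 for not."""
    hidden = (pixels @ W1 > THRESHOLD).astype(float)  # weighted sums, thresholded
    scores = hidden @ W2                              # weighted sums at the output
    return scores / (scores.sum() + 1e-9)             # crude per-letter confidence

image = rng.integers(0, 2, size=400).astype(float)    # a fake 20x20 bitmap
confidence = forward(image)
print("best guess:", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[confidence.argmax()])
```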

The idea is that the input data need not be perfect (if it were, you could fall back on your Boolean logic and directly connect the input pixels to each output letter with the right combination of AND gates and inverters).

So the "training" process is one of assigning the weights to each "edge" (the technical term for the connections between nodes). This is done by feeding the network a (large) set of training data, examining the results, and then adjusting the weights using an algorithm that increases the weights that feed the correct expected answer, and decreasing the weights that feed the nodes that are "incorrect". While not necessarily complex, the training process does take a huge number of training sets to finally arrive at a set of weights that always produce the right answer for every example in the training set. And this example is a relatively simple one. It can take MANY iterations of this training before the training process is complete. You can imagine the complexity of processing 6 or more feeds of HD video, which is why the problem is broken down into different tasks, but even so real-world scenarios are quite extensive in number as well!

This is also why Tesla chose to implement their NN in hardware, versus running the NN on a GPU (which is well suited to the task, but is still a simulation of a network running in a processor). This has significant power and speed implications (and cost, since they don't have to buy processors from a third party).

Obviously, once you have your trained network, it's not really possible (or at least not easy) to pick out why the network arrived at a specific output given the inputs; it's all encoded in the weights rather than in an algorithm. However, there has been some work on getting NNs to "explain" how they arrived at a particular answer.
 
Waymo has explained this. It means the HD map gives the car pre-knowledge of the road before it starts driving, but Waymo can change the map or ignore the map as needed. ...

If it can drive without maps, then why does it need HD mapping?

You seem to imply that one doesn't need the other, yet one won't work without the other.

If the NN works everywhere, then why do they need the safety-driver test every time?

That's because the NN absolutely requires HD mapping along with a significant amount of "hints."

And updating those hints is probably the reason that they run the routes daily.
 
If it can drive without maps, then why does it need HD mapping?....
Analogy:

No HD Maps - You can go somewhere you have never been and drive. You may drive like a "tourist" but you can do it.

HD Maps - Driving to work, you know all the potholes, where all the curbs are, when the traffic is likely heavier, all the obstacles that may obscure a view, all the nearly hidden stop signs, roughly how long the red lights are and their sequence, and all the other details that allow you to drive WAY more efficiently.
 
If it can drive without maps, then why does it need HD mapping? ... And updating those hints is probably the reason that they run the routes daily.
Waymo is a glorified railway system. Just as inspectors take trolleys to inspect the tracks and ensure usability, Waymo needs to do the same, given they rely on the electronic track known as HD maps.

 
If it can drive without maps, then why does it need HD mapping?

It could drive without maps, but Waymo says that HD maps make it drive more safely. Thus, Waymo always uses HD maps.

If the NN works everywhere, then why do they need the safety-driver test every time?

For validation purposes. "Working everywhere" and "working with 99.9999% reliability" are two very different things. Waymo has said that when they dropped the Waymo Driver in LA for the first time, it drove very well right out of the gate. But driving very well is not good enough: to launch driverless robotaxis, you need better than "drives very well"; you need "drives with 99.9999% reliability."

That's because the NN absolutely requires HD mapping along with a significant amount of "hints."

And updating those hints is probably the reason that they run the routes daily.

No. The perception stack uses info from the HD map but does not require HD maps; using something and requiring it are not the same thing. And saying that the NN requires a significant amount of hints is speculation on your part. Again, the reason Waymo uses HD maps is safety, not because the NN can't work without them.
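As a hedged sketch of what "map as a prior" means in practice (invented names, not Waymo's actual code): the map says what to expect, and live perception can confirm it or override it.

```python
def lanes_ahead(map_prior, observation):
    if observation["conflicts_with_map"]:        # e.g. construction re-striping
        return observation["lanes"]              # real-time perception wins
    # Otherwise the prior adds confidence/detail to what the sensors see.
    return observation["lanes"] or map_prior["lanes"]

map_prior = {"lanes": ["left_lane", "right_lane"]}
print(lanes_ahead(map_prior, {"lanes": ["left_lane", "right_lane"],
                              "conflicts_with_map": False}))
print(lanes_ahead(map_prior, {"lanes": ["shifted_left_lane"],
                              "conflicts_with_map": True}))
```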
 