
FSD tweets

Amnon's argument has a significant foundational assumption in it, which I believe to be false. He assumes that the "long tail" of failure cases is independently distributed; that solving one of them will have no bearing on solving the others. I believe the opposite; that there will be a lot of similarity and overlap between many of the rare failure cases, and that properly learning to solve a few of them can implicitly solve many more. In this sense, I think the monolithic E2E approach is perfectly fine.

Some edge cases probably overlap and some may not. And some edge cases are perception issues while others may be behavior prediction or planning issues. For example, a person in a mascot costume might be a perception edge case if your perception system has never seen something like that. And maybe adding that edge case to your training would automatically carry over to similar perception cases like a person in a Halloween costume. Other edge cases might be prediction based. For example, Waymo encountered an edge case with a tow truck that was pulling a pickup truck at an unusual angle, and so the Waymo prediction stack misinterpreted how the combo of the tow truck and the pickup would move. You could also have a planning edge case where you encounter, say, a construction zone that requires the car to move in a way it was not trained to do.

Of course, this only applies to the modular approach with separate perception, prediction and planning stacks. With end-to-end, everything is trained together. But the point still remains that I don't think edge cases would always carry over to other cases. For example, training your end-to-end model to handle the person in the Halloween costume would not carry over to the edge case of the oddly towed pickup truck. So I think there would be some independence between edge cases, but not 100%.

Ultimately, when discussing the long tail, I don't think anyone really knows the best way to solve it. In fact, I don't think we really know how long the tail is, or more precisely, how much of the long tail needs to be solved before AVs can be scaled safely everywhere. Amnon even says as much when he says that we could be underestimating or overestimating the long tail.
 
Some edge cases probably overlap and some may not. And some edge cases are perception issues while others may be behavior prediction or planning issues. For example, a person in a mascot costume might be a perception edge case if your perception system has never seen something like that. And maybe adding that edge case to your training would automatically carry over to similar perception cases like a person in a Halloween costume. Other edge cases might be prediction based. For example, Waymo encountered an edge case with a tow truck that was pulling a pickup truck at an unusual angle, and so the Waymo prediction stack misinterpreted how the combo of the tow truck and the pickup would move. You could also have a planning edge case where you encounter, say, a construction zone that requires the car to move in a way it was not trained to do.
One hope is that by training on enough unusual long-tail examples, the network will develop the abstract concept of "weird stuff", along with a strategy for identifying and dealing with it in a generic and reasonable way, much as humans do. Agreed that there are many categories of edge cases, and it will take many training examples to cover them all. But hopefully there will still be some useful generalization happening, so the millions of long-tail cases won't have to be exhaustively enumerated in the training set.
Of course, this only applies to the modular approach with separate perception, prediction and planning stacks. With end-to-end, everything is trained together. But the point still remains that I don't think edge cases would always carry over to other cases. For example, training your end-to-end model to handle the person in the Halloween costume would not carry over to the edge case of the oddly towed pickup truck. So I think there would be some independence between edge cases, but not 100%.

Ultimately, when discussing the long tail, I don't think anyone really knows the best way to solve it. In fact, I don't think we really know how long the tail is, or more precisely, how much of the long tail needs to be solved before AVs can be scaled safely everywhere. Amnon even says as much when he says that we could be underestimating or overestimating the long tail.
Yes. This is why I think Elon is nuts to be putting all Tesla's eggs in the Robotaxi/AI basket in the short term, at the expense of e.g. Model 2 development and the Supercharger network. It does make sense to keep developing FSD with plenty of resources, but there is not yet line-of-sight to geofenced narrow-ODD L4 or even L3 FSD, let alone broad-ODD L4 or L5. As pointed out, FSD is currently at ~10 hours between safety-critical interventions, but will need to be at ~10^7 hours for L4/Robotaxi reliability. That's six orders of magnitude. We still have a very long way to go.
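For what it's worth, the six-orders-of-magnitude figure is easy to sanity-check from those (rough, unofficial) numbers:

```python
import math

# Rough figures from the discussion above (assumptions, not official numbers):
# ~10 hours between safety-critical interventions today, vs. ~10^7 hours
# needed for unsupervised L4/Robotaxi-level reliability.
current_mtbf_hours = 10
target_mtbf_hours = 10**7

gap = math.log10(target_mtbf_hours / current_mtbf_hours)
print(gap)  # 6.0 -> a millionfold improvement in reliability still required
```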
 
Amnon's argument has a significant foundational assumption in it, which I believe to be false. He assumes that the "long tail" of failure cases is independently distributed; that solving one of them will have no bearing on solving the others. I believe the opposite; that there will be a lot of similarity and overlap between many of the rare failure cases, and that properly learning to solve a few of them can implicitly solve many more. In this sense, I think the monolithic E2E approach is perfectly fine.
You are making a big assumption too - with nothing to back it up. It is equally likely that solving an edge case destabilizes the normal case.

In fact I think a lot of edge cases are exceptions. You drive within the lane, with the edge case being an obstruction or a bike on the side. An exception to that exception is when a vehicle is coming from the other side, in which case you have to wait. Etc. etc.
 
You are making a big assumption too - with nothing to back it up. It is equally likely that solving an edge case destabilizes the normal case.
What backs it up is that this is how humans learn, and neural networks are getting more and more humanlike, especially as they become E2E rather than modular / feature-engineered. Tesla has highly robust regression testing to ensure that the normal cases don't get destabilized by the retrained networks; I don't expect to see any obvious regressions going forward, or at least none that Tesla isn't aware of when it releases the software.
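None of us outside Tesla knows what that regression suite actually looks like, but conceptually it would be something like the sketch below: replay a fixed library of "normal" scenarios through the old and new models and flag any scenario whose output diverges beyond a tolerance. Everything here (the model interfaces, `scenario.sensor_frames`, the units of the tolerance) is hypothetical.

```python
import numpy as np

def regression_check(old_model, new_model, scenarios, tol=0.05):
    """Replay a fixed library of 'normal' scenarios and report any whose
    planned trajectory diverges between the old and new models by more
    than `tol` (hypothetical units, e.g. meters of lateral deviation)."""
    regressions = []
    for scenario in scenarios:
        old_traj = np.asarray(old_model(scenario.sensor_frames))
        new_traj = np.asarray(new_model(scenario.sensor_frames))
        divergence = float(np.max(np.abs(old_traj - new_traj)))
        if divergence > tol:
            regressions.append((scenario.name, divergence))
    return regressions  # empty list = no detected regressions on normal cases
```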

Of course, it may turn out that the networks that fit on HW3 are fundamentally not large or capable enough to contain all the needed "knowledge" for all the normal cases + edge cases. Or it may turn out that some edge cases require more sophisticated high-level reasoning than a closed-form network can handle. I think it's more likely that the sensor suite, rather than compute, will ultimately be the limiting factor for HW3, but that both will need to be leapfrogged (by HW5 or HW6) to achieve full-ODD L4.
In fact I think a lot of edge cases are exceptions. You drive within the lane, with the edge case being an obstruction or a bike on the side. An exception to that exception is when a vehicle is coming from the other side, in which case you have to wait. Etc. etc.
For sure. And it's literally impossible to explicitly enumerate all the exceptions in a meaningful way, which is why the 300k+ lines of C++ code approach was doomed to failure. The reason ML works so well is that it doesn't have any super-narrow bottlenecks where it throws away nearly all of the information, whereas in C++ code, each time you have an "if-then" statement you effectively reduce the entire state of knowledge to a single bit, leaving no room for subtleties or exceptions. ML logic is much fuzzier, which is essential for solving fuzzy real-world problems and situations.
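To make the "single bit" point concrete, here is a toy contrast (the scores and threshold are made up, not anything from a real stack):

```python
def heuristic_is_pedestrian(detection_score: float) -> bool:
    # Hand-coded rule: one "if-then" collapses a graded estimate into a
    # single bit; everything downstream only ever sees True or False.
    return detection_score > 0.5

def learned_pedestrian_belief(detection_score: float) -> float:
    # A learned system can carry the graded belief forward (trivially passed
    # through here for illustration), so a 0.45 "maybe a person in a mascot
    # costume" can still make the planner slow down instead of being rounded
    # down to "no pedestrian".
    return detection_score

print(heuristic_is_pedestrian(0.45))    # False -> subtlety thrown away
print(learned_pedestrian_belief(0.45))  # 0.45  -> uncertainty preserved
```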
 
ML logic is much fuzzier, which is essential for solving fuzzy real-world problems and situations.
What backs it up is that this is how humans learn, and neural networks are getting more and more humanlike, especially as they become E2E rather than modular / feature-engineered.

First - I'd argue an NN is nothing like a biological neural network. You don't have to tell someone a million times before they learn something. Even a puppy learns something after a dozen examples. Try that with an NN. There are a lot of other differences too - I made many comments on this a couple of years back. The reason is that an organic NN has the benefit of a billion years of evolution.

As to whether NNs or heuristics are better - that's anyone's guess. I've no strong arguments either way. You can show examples where NNs do better and examples where heuristics do better.

FSD seems to get so confused about actual driving lane vs shoulder. While taking an entrance to the freeway, FSD decided to use the broad shoulder instead of the lane ...
 
First - I'd argue an NN is nothing like a biological neural network. You don't have to tell someone a million times before they learn something. Even a puppy learns something after a dozen examples. Try that with an NN. There are a lot of other differences too - I made many comments on this a couple of years back. The reason is that an organic NN has the benefit of a billion years of evolution.
There's a distinction here between subconscious learning and conscious embedding. A puppy has had months of learning all the basic things about its environment through countless thousands of examples, so that when you try to train it on a new task, it already thoroughly understands the task's underlying components, which it has learned in a very NN-like way, and the new task is learned on top of that. Humans are able to consciously learn new high-level concepts and tasks in a similar way, because we've already learned their unconscious low-level components through endless examples and practice. However, learning new low-level unconscious tasks, such as mastering a complicated piano piece to the point where it can be played effortlessly/subconsciously, still takes us thousands of repetitions, very much like a computerized NN.

Another example: I can teach you how to juggle in ten minutes. But it will take you months of practice to learn how to juggle _well_, and many years after that (should you pursue it) to learn how to juggle expertly well (like circus-performer well). Driving is more akin to juggling in this respect; it requires subconscious reflexes and intuition as much as logical thinking. That's why it takes months of driver training to even acquire a license, and years after that to become an expert driver. This is very NN-like.
As to whether NNs or heuristics are better - that's anyone's guess. I've no strong arguments either way. You can show examples where NNs do better and examples where heuristics do better.
NNs scale to unbounded problem domains (such as driving, or language translation, or even audio-to-text transcription) much better than hand-designed heuristics ever could. This is also becoming true of most bounded domains, such as playing chess or Go, or protein folding. See: The Bitter Lesson. Closed-form tasks such as computing an FFT may have optimal algorithmic solutions that can be hand-coded, but even in cases where such algorithms may intuitively seem useful, neural nets often and surprisingly do just as well without them for solving real-world problems.
FSD seems to get so confused about actual driving lane vs shoulder. While taking an entrance to the freeway, FSD decided to use the broad shoulder instead of the lane ...
Agreed that it still has a ways to go. But the heuristic approach still made many of the same (or worse) mistakes. FSD v11 still regularly swerves into turnout lanes at highway speeds, which is a very similar mistake. I'm confident that v12.4 and v12.5 will make huge progress on these common cases. We should (hopefully) know quite soon!
 
Didn’t Musk say that 12.4 had been retrained from scratch? That may be the cause of the delay. There may be unexpected challenges that crop up each time a totally new version is created. Each version might have a slightly different “personality” with biases that must be taken into account, like with humans.
 
However, learning new low-level unconscious tasks, such as mastering a complicated piano piece to the point where it can be played effortlessly/subconsciously, still takes us thousands of repetitions, very much like a computerized NN.
Yes - but for an NN to do something even badly takes thousands of examples. An organic NN is very different from a CNN. You can read papers on that.

Getting back to the original point, solving one edge case may have nothing to do with other edge cases; in fact it may regress the already-learnt normal behavior. That is why learning edge cases progresses a lot more slowly than the normal cases. It is a typical S curve.
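For anyone unfamiliar with the term, the "S curve" here is the usual logistic shape: progress per unit of effort is fastest in the middle, where the common cases live, and flattens out as only rarer and rarer edge cases remain. A toy illustration with arbitrary parameters:

```python
import math

def fraction_of_cases_handled(effort, midpoint=5.0, steepness=1.0):
    """Toy logistic S curve: cumulative fraction of cases handled vs. effort."""
    return 1.0 / (1.0 + math.exp(-steepness * (effort - midpoint)))

for effort in range(0, 11, 2):
    print(effort, round(fraction_of_cases_handled(effort), 3))
# 0 0.007, 2 0.047, 4 0.269, 6 0.731, 8 0.953, 10 0.993
# -> gains come fast in the middle and slow to a crawl in the tail.
```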
 
Yes - but for an NN to do something even badly takes thousands of examples. An organic NN is very different from a CNN. You can read papers on that.

Getting back to the original point, solving one edge case may have nothing to do with other edge cases; in fact it may regress the already-learnt normal behavior. That is why learning edge cases progresses a lot more slowly than the normal cases. It is a typical S curve.
Once a neural network is trained, there are examples of zero-shot, one-shot, and few-shot learning. Not much different from a baby. No matter how many examples you give a baby of some high-level tasks, it still won't learn them.
 
Once a neural network is trained, there are examples of zero-shot, one-shot, and few-shot learning. Not much different from a baby. No matter how many examples you give a baby of some high-level tasks, it still won't learn them.
I've not seen those ... I'll look it up.

BTW, if that were the case, doesn't that go against Tesla's idea that you need a lot of data? Anyone could reproduce any edge case and train it in a day!
 
I've not seen those ... I'll look it up.

BTW, if that were the case, doesn't that go against Tesla's idea that you need a lot of data? Anyone could reproduce any edge case and train it in a day!
Quote: A prompt may include a few examples for a model to learn from, such as asking the model to complete "maison → house, chat → cat, chien →" (the expected response being dog),[9] an approach called few-shot learning.[10]

> doesn't that go against Tesla's idea that you need a lot of data?
You need a lot of data to get the model up to a certain level of intelligence. Once it is intelligent, it can learn from a few examples.
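Just to make the mechanism concrete, the few-shot example quoted above is literally nothing more than a prompt string; a minimal sketch of building it (no particular model or API assumed, "dog" is simply the hoped-for completion):

```python
# Few-shot prompt built from a handful of examples; a sufficiently pretrained
# model is expected to infer the pattern and complete the last item with "dog".
examples = [("maison", "house"), ("chat", "cat")]
query = "chien"

prompt = ", ".join(f"{fr} → {en}" for fr, en in examples) + f", {query} →"
print(prompt)  # maison → house, chat → cat, chien →
```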

> Anyone could reproduce any edge case and train it in a day!
Training for edge cases is interesting. You don't really need a lot of examples for a single case. For training on a few examples, current top-of-the-line hardware can finish training in seconds, with the risk of regressing elsewhere. There are techniques to minimize regressions, such as reserving model space for new training. Another technique is only updating near-zero weights.
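I have no idea what any particular team actually does, but the "only update near-zero weights" idea can be sketched in a few lines of PyTorch: compute a mask of small-magnitude weights up front and zero out the gradients of everything else, so the handful of new edge-case examples mostly lands in otherwise-unused capacity. All names and hyperparameters here are hypothetical.

```python
import torch

def finetune_near_zero_weights(model, edge_case_batches, loss_fn,
                               eps=1e-3, lr=1e-4, steps=10):
    """Fine-tune on a few edge-case examples while only updating weights whose
    current magnitude is below eps, as a crude way to limit regressions."""
    masks = {name: (p.abs() < eps).float() for name, p in model.named_parameters()}
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        for inputs, targets in edge_case_batches:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            # Freeze the well-established weights by zeroing their gradients.
            for name, p in model.named_parameters():
                if p.grad is not None:
                    p.grad *= masks[name]
            optimizer.step()
    return model
```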
 
Quote: A prompt may include a few examples for a model to learn from, such as asking the model to complete "maison → house, chat → cat, chien →" (the expected response being dog),[9] an approach called few-shot learning.[10]

> doesn't that go against Tesla's idea that you need a lot of data?
You need a lot of data to get the model up to a certain level of intelligence. Once it is intelligent, it can learn from a few examples.
Of course, we have to see how it works in the video / FSD world.

BTW, what is zero-shot - it just learns from existing data?
 
Yes, also known as generalization.
Anyway, I think the important thing that takes time is an engineer who debugs the edge case, figures out how to address it, retrains the network, and makes sure there are no regressions. Whether they use new training data or not is not a big concern (since they can be working on other edge cases while they gather training data).

We'll know in the next few releases how fast they can address issues like this one ... where, getting onto I-90 from 150th St, FSD took the shoulder rather than the lane.

[Attached image: satellite view of the 150th St entrance to I-90]


PS: That looks confusing because of the satellite imagery ... but the street view isn't confusing.

[Attached image: street view of the same entrance]
 
Elon is waiting on Nvidia B200s before expanding training capacity. I suspect this line of thinking will be common across the industry.

Do X's training centers share compute with Tesla?
 
Elon is waiting on Nvidia B200s before expanding training capacity. I suspect this line of thinking will be common across the industry.

Do X's training centers share compute with Tesla?
.... and they even announced a G to follow the B in 2026. Looks like Nvidia is on fire and going to keep the lead in AI GPUs for at least the next few years. Dojo is starting to look like a Model T now. I bet production on it has completely stopped and they are only using what was built already.
 