It's unclear how an all-nets approach will understand implicit human decision making. How will it understand that I made a lane change because I'm avoiding an arbitrary obstruction or situation, vs. navigating to my destination, vs. fixing a mistake I made earlier?
If you can see something and react to it, it hopefully should be more obvious to end-to-end networks that the differing control behavior is related to the obstruction, e.g., a road closed sign. Presumably Tesla is also using human labelers to approve / rate / rank driving behaviors, so if a random driving decision, e.g., "oh, I actually need to stop at the grocery store," doesn't appear to be good for training, it might not make it to the networks based on a second human opinion. More generally, neural networks with lots of training data somewhat "average" away infrequent behaviors, which can be good or bad: filtering out random mistakes, or not realizing special behavior is necessary.

Depending on how much dynamic driving context is kept, e.g., the last second vs the last minute, or upcoming information, some immediate driving prediction could be wrong or not make sense. For example, the turns to get to a destination over an hour away probably aren't as relevant as the immediate next turn, which could also be several minutes away. So does the network get the whole navigation and need to learn when to pay attention to that information, or does an engineer decide that only the next 3 intersections should be enough to pass in? Similarly, after passing a "lane closed ahead" sign at highway speeds nearly a mile back, the car should probably avoid switching into that lane, but then again, I've seen humans correctly(?) ignore the signs until forced to merge, to get ahead and save several minutes off the trip.
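To make that tradeoff concrete, here is a minimal sketch (in Python) of the "engineer decides how much navigation to pass in" option; the Maneuver type, function name, and horizon numbers are all made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Maneuver:
    kind: str          # e.g. "turn_left", "keep_right"
    distance_m: float  # distance from the car to the maneuver

def navigation_context(route: list[Maneuver], max_maneuvers: int = 3,
                       horizon_m: float = 2000.0) -> list[Maneuver]:
    """Keep only the upcoming maneuvers the driving network is allowed to see.

    This encodes the engineering decision discussed above: instead of feeding
    the whole route to the destination, pass only the next few intersections
    within some distance horizon and let the network learn from that.
    """
    upcoming = [m for m in route if m.distance_m <= horizon_m]
    return upcoming[:max_maneuvers]

# Example: an hour-long route where only the next turn really matters right now.
route = [Maneuver("keep_right", 350.0), Maneuver("turn_left", 1200.0),
         Maneuver("turn_right", 45_000.0)]
print(navigation_context(route))  # the 45 km turn gets dropped
```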
 
I'd characterize this as an exception. Even Google's router tells you which lanes to use (esp. when turning).
Google lane selection is sometimes wrong also.
[Attached: two Google Maps screenshots showing incorrect lane guidance]
 
If you can see something and react to it, it hopefully should be more obvious to end-to-end networks that the differing control behavior is related to the obstruction, e.g., a road closed sign. Presumably Tesla is also using human labelers to approve / rate / rank driving behaviors, so if a random driving decision, e.g., "oh, I actually need to stop at the grocery store," doesn't appear to be good for training, it might not make it to the networks based on a second human opinion. More generally, neural networks with lots of training data somewhat "average" away infrequent behaviors, which can be good or bad: filtering out random mistakes, or not realizing special behavior is necessary.

Depending on how much dynamic driving context is kept, e.g., the last second vs the last minute, or upcoming information, some immediate driving prediction could be wrong or not make sense. For example, the turns to get to a destination over an hour away probably aren't as relevant as the immediate next turn, which could also be several minutes away. So does the network get the whole navigation and need to learn when to pay attention to that information, or does an engineer decide that only the next 3 intersections should be enough to pass in? Similarly, after passing a "lane closed ahead" sign at highway speeds nearly a mile back, the car should probably avoid switching into that lane, but then again, I've seen humans correctly(?) ignore the signs until forced to merge, to get ahead and save several minutes off the trip.

Elon has said there are millions of edge cases. It's really unclear and unintuitive to me how a NN can generalize implicit human decision making and mistake correction. How do you even make a training set that is consistent among hundreds of human video analysts, considering how many edge cases there are?

Take, for example, your construction or "lane closed ahead" sign. Does Tesla pick videos of good drivers changing lanes 100 yards in advance, or 200, or 10? And what if the car doesn't see the sign until 5 yards before because of obstructions? It's totally unintuitive that a NN can understand the concept of avoiding a closed lane: how far away, what maneuvers to take at certain distances, among other road users, in all weather conditions, etc.
 
At least from the livestream, it got into a left-turn lane and kept right when forking to a double left turn. This particular turn ends up merging immediately after, so the lane selection didn't matter as much. Obviously this has been an ongoing issue with 11.x, so maybe 12.x's initial release quality bar will do no better here, but hopefully it also means we can try it out sooner. There's the potential for a general end-to-end data engine to "naturally" resolve this type of disengagement assuming a generic feedback loop, but Tesla could also dedicate resources to explicitly finding examples and collecting dedicated training data to address this family of issues.
Funny thing is - FSD currently behaves like a driver new to the area. If you are familiar with the roads you know which lanes to take (depending on your route). But drivers new to the area frequently take the wrong lane - just like FSD.

I usually force lane changes when using FSD - yesterday I decided to just let it select the lanes to see what it did. It made a number of bad choices which forced it to make lane changes later that were difficult because of heavy traffic.

In one instance, at a place where the road forks for a right turn: it's a regular route I use, and FSD has almost never failed to take the right fork. But yesterday it "forgot", went straight a little bit, and then tried to change lanes. The traffic was heavy and FSD managed to change lanes only because a driver behind decided to let the car merge. I don't think this would have been possible in NJ/NY ;)
 
Take, for example, your construction or "lane closed ahead" sign. Does Tesla pick videos of good drivers changing lanes 100 yards in advance, or 200, or 10? And what if the car doesn't see the sign until 5 yards before because of obstructions? It's totally unintuitive that a NN can understand the concept of avoiding a closed lane: how far away, what maneuvers to take at certain distances, among other road users, in all weather conditions, etc.
It's definitely "unintuitive" - but think about your own driving. Did someone specifically teach you what a good distance to change lanes at is, or did you just learn from experience and from seeing what others do? I don't remember anyone specifically teaching me ... so it's a learn-by-example thing ... which is exactly what a NN is supposed to do.
 
Funny thing is - FSD currently behaves like a driver new to the area. If you are familiar with the roads you know which lanes to take (depending on your route). But drivers new to the area frequently take the wrong lane - just like FSD.

I usually force lane changes when using FSD - yesterday I decided to just let it select the lanes to see what it did. It made a number of bad choices which forced it to make lane changes later that were difficult because of heavy traffic.

In one instance, at a place where the road forks for a right turn: it's a regular route I use, and FSD has almost never failed to take the right fork. But yesterday it "forgot", went straight a little bit, and then tried to change lanes. The traffic was heavy and FSD managed to change lanes only because a driver behind decided to let the car merge. I don't think this would have been possible in NJ/NY ;)

You’re lucky. My Tesla has missed highway exits multiple times because of poor lane choices. In most cases, it got puzzled and proceeded to slow down right on the highway amidst heavy traffic. I had to slam the accelerator to save my life.
 
It's definitely "unintuitive" - but think about your own driving. Did someone specifically teach you what a good distance to change lanes at is, or did you just learn from experience and from seeing what others do? I don't remember anyone specifically teaching me ... so it's a learn-by-example thing ... which is exactly what a NN is supposed to do.

It's unintuitive how v12 will be able to understand implicit human decisions from a NN technology and architecture pov.

When you look at LLMs, it makes sense that they compress their dataset, which includes all the "instructions" on how to decipher text (words are confined by their definitions and order, all of which are fed into the dataset).

But V12, based on how Elon and Ashok described it, doesn't make sense. It's unclear how a NN can understand human concepts just from video.

My driving behavior is based on multimodal learnings: based on my language understanding of road concepts, based on people honking at me in the past, based on imperfect knowledge of the law and when it can be broken / bent, etc. If you show a NN a video of seemingly the same situation conceptually but different decision making, will it understand why the decision was made? It's odd.
 
Then, how can V12 be better than V11 when it comes to lane selection, the #1 cause of disengagements?

Well, we don't know that V12 will be better than V11 in terms of lane selection. I could see e2e being better at executing lane changes. For example, if you feed lots of video of human drivers making good lane changes, it might train the e2e to make smoother lane changes, or make more assertive lane changes when needed, or make lane changes earlier, etc. Furthermore, I could see training the e2e to understand the difference between a turn-only lane and a normal lane, or to understand that you need to be in a turn lane to make a turn, or that you should not make a turn from a non-turn lane. But I think you do need to give the e2e map info, otherwise it won't know what route to take. It's fine to know that you need to be in a turn lane to make a turn, but if you don't know that you need to make a turn there to stay on route, it won't matter. And even if you somehow train the e2e to read navigation signs, it would still need to correlate that to a map or a route to know that it needs to change lanes or make a turn to stay on route.

Moreover, how would FSD handle rerouting, or lane changes based on speed or blockages ...

IMO, it needs a map so that it can figure out rerouting.

So my point is that I can see e2e being good at training many driving tasks like stopping at a red light, going around a double-parked car, etc., but I think it still needs a separate map in order to know what route to take.
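For what it's worth, a toy sketch of that separation, where a map-based router (not the e2e net) owns routing and rerouting; the toy map, node names, and functions are all hypothetical:

```python
from collections import deque

def plan_route(road_map: dict, start: str, destination: str) -> list[str]:
    """Stand-in for a map-based router: breadth-first search over a toy road graph."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        for nxt in road_map.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []

road_map = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
route = plan_route(road_map, "A", "D")           # ["A", "B", "D"]

# If the driving policy misses the turn at B and ends up at C, the router --
# not the network -- recomputes the route from the car's actual position:
current_position = "C"
if current_position not in route:
    route = plan_route(road_map, current_position, "D")
print(route)                                      # ["C", "D"]
```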

I wonder how you make a single e2e network which knows what it can’t see?

Seems hard to get that structure to arise organically.

Yet it is essential - you can’t drive without that info.

Not sure what you mean. The e2e can't know what it does not see. That is why all AV development begins with perception. I could see the e2e learning to be more cautious in situations where it can't see. So for example, it might learn to slow down on a twisty road or when driving at night on a narrow city street. And that is something that could be taught via imitation learning from videos of human driving.

This doesn’t seem right since the router would need to know what lanes were present - which is not knowledge that can be contained reliably in a map.

In a standard modular approach, they use precise maps with lane info and they have a router that uses the map to plan the route and then tells the planner what route to take. The planner takes info from both the perception and the router to plan the short-term and long-term actions of the vehicle.

[Diagram: modular AV stack with perception, behavior prediction, and planner modules between sensors and controls, with the map and router feeding the planner]


Now, in a single-model e2e approach, the green modules in between sensors and controls (perception, behavior prediction, and planner) are replaced by a single NN. But I am suggesting that the map and router would still need to be there to inform the e2e stack what route the car needs to take.
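A rough sketch of the two dataflows being contrasted, with stub functions standing in for the real components; all the names here are hypothetical, the point is only where the single NN sits:

```python
# Stubs standing in for the real modules, just to show the dataflow.
def perception(frames):             return {"objects": ["car_ahead"]}        # green module 1
def behavior_prediction(scene):     return {"car_ahead": "slowing"}          # green module 2
def planner(scene, futures, route): return {"steer": 0.0, "accel": -0.5}     # green module 3
def e2e_network(frames, route):     return {"steer": 0.0, "accel": -0.5}     # single NN

def modular_stack(frames, route_from_router):
    scene = perception(frames)
    futures = behavior_prediction(scene)
    return planner(scene, futures, route_from_router)     # -> controls

def end_to_end_stack(frames, route_from_router):
    # One network replaces perception + prediction + planning, but it is still
    # conditioned on the route that the (non-NN) map + router produce.
    return e2e_network(frames, route_from_router)          # -> controls

route = ["turn_left_in_200m"]                               # comes from map + router
print(modular_stack("camera frames", route))
print(end_to_end_stack("camera frames", route))
```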
 
How do you even make a training set that is consistent among hundreds of human video analysts, considering how many edge cases there are?
They don't have to be consistent, and variety leads to learning if given enough context and training. For example, just controlling one pawn in chess has many levels of complexity: the basic move forward one square (or conditionally two, or capturing diagonally), context awareness of opponent pieces preventing the move (blocked, or illegal because it would put your king in check), immediate tactics (unblock your bishop so it can attack a queen), complex tactics (multiple turns of captures), mid-term strategy (allow an escape path or provide safety for your king), and complex strategy (a sacrifice or seemingly bad move for long-term advantage/win). Each of these can be considered edge cases, with more edges to get into even rarer chess corner cases, and yet the neural network can learn to do the appropriate thing for each.

One downside of AlphaZero's approach to learning chess is that it requires a lot of training examples, but it was also intentionally designed to minimize human involvement while achieving superhuman capabilities. Tesla can be much more directed with curated training data, but it's still an open question how much data will need to be collected, labeled, and trained on to achieve those various levels of capability/understanding (e.g., chess basics vs tactics vs strategy). Chess is also "easy" in that perception is relatively trivial: you don't need to worry about not seeing an enemy piece because vision is occluded, nor about inconsistent map data saying the 8x8 squares are connected differently. Yet even 11.x has trouble with lane selection on empty roads, which hopefully is much improved with 12.x.

Getting a little more into how the neural network learns how to move the pawn, you can somewhat think of the likelihood of moving the piece as adding up a bunch of individual components -- some positive if the square ahead is empty, or negative if blocked. Another component could look at squares the pawn can capture, potentially with a very strong signal if it's an enemy queen or a lower one for a knight, while another component looks at whether moving the pawn would put your king in "check", resulting in a very negative signal even if you would really like to capture the queen. This "check"-checking applies not just to pawns, so the network internally could learn to share that across multiple piece moves. So for driving, each decision is adding up various components that can be individually trained but generally applied to different situations, e.g., traffic congestion, weather conditions, road layouts, navigation routes.
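A toy illustration of that "sum of components" intuition (not how any real chess or driving network is implemented, just individually learned signals adding up into one decision score):

```python
def pawn_move_score(square_ahead_empty: bool, capture_value: int,
                    leaves_king_in_check: bool) -> float:
    score = 0.0
    score += 1.0 if square_ahead_empty else -5.0   # basic legality/usefulness component
    score += 0.5 * capture_value                    # e.g. queen = 9, knight = 3, none = 0
    if leaves_king_in_check:
        score -= 100.0                              # overrides even a tempting queen capture
    return score

# Capturing a queen is attractive...
print(pawn_move_score(True, 9, False))   # 5.5
# ...unless doing so exposes your own king.
print(pawn_move_score(True, 9, True))    # -94.5
```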
 
The e2e can't know what it does not see. That is why all AV development begins with perception.
This is absolutely critical. All humans do it (a huge amount too / it’s not rare!). If you can’t, the results will be catastrophic.

No amount of caution or imitation can make up for it.

You can’t imitate what someone else did when your situation is entirely different for reasons you can’t see.
 
This is absolutely critical. All humans do it (a huge amount too / it’s not rare!). If you can’t, the results will be catastrophic.

No amount of caution or imitation can make up for it.

You can’t imitate what someone else did when your situation is entirely different for reasons you can’t see.
Perhaps what you meant to convey is that FSD needs to detect when it has visibility limitations. Obviously, FSD cannot know about a specific object that is not detectable.

And yes, the system needs to deal with cases where its vision is obscured, either due to occlusions (cars, fences, trees, etc.) or due to weather. Sometimes it is unclear whether FSD 11.x can tell the difference between not seeing another car and seeing that there is not another car.
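A tiny sketch of that distinction, using a made-up three-state occupancy cell rather than anything FSD actually exposes: "no detection" is not the same as "detected empty".

```python
from enum import Enum

class Cell(Enum):
    FREE = "seen and empty"         # we looked and there is no car there
    OCCUPIED = "seen and occupied"  # we looked and there is a car there
    UNKNOWN = "not visible"         # occluded by a truck, fence, glare, fog...

def safe_to_merge(cells: list[Cell]) -> bool:
    # Treat UNKNOWN as "maybe a car", not as "no car".
    return all(cell is Cell.FREE for cell in cells)

print(safe_to_merge([Cell.FREE, Cell.FREE]))     # True
print(safe_to_merge([Cell.FREE, Cell.UNKNOWN]))  # False: occluded, so don't assume it's clear
```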
 
It's definitely "unintuitive" - but think about your own driving. Did someone specifically teach you what a good distance to change lanes at is, or did you just learn from experience and from seeing what others do? I don't remember anyone specifically teaching me ... so it's a learn-by-example thing ... which is exactly what a NN is supposed to do.
Of course the NN can't be taught everything. For example, a sign is expected to be positioned a certain distance from an intersection. What if it's located far off from expectations/standards/training? There is no human-like reasoning to recover. It either carries on oblivious or, in the unlikely event it's smart enough to know it's failing, hands off to the driver in time.
 
It's unintuitive how v12 will be able to understand implicit human decisions from a NN technology and architecture pov.

When you look at LLMs, it makes sense that they compress their dataset, which includes all the "instructions" on how to decipher text (words are confined by their definitions and order, all of which are fed into the dataset).

But V12, based on how Elon and Ashok described it, doesn't make sense. It's unclear how a NN can understand human concepts just from video.

My driving behavior is based on multimodal learnings: based on my language understanding of road concepts, based on people honking at me in the past, based on imperfect knowledge of the law and when it can be broken / bent, etc. If you show a NN a video of seemingly the same situation conceptually but different decision making, will it understand why the decision was made? It's odd.
It's not really understanding human concepts. It extracts the relevant information it needs from the dataset and represents it in an efficient way as billions of parameters. But it's damned good at extracting information from the signal.

If there is some simple concept such as "pixels that look like this combination of letters if you rotate and scale them -> drive in Y way", it will extract that signal from the data. Even more complex concepts, such as "that guy is erratic, move away from him", it will just figure out from the data.
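As a minimal sketch of that "pixels in, controls out" idea, here's a toy convolutional policy in PyTorch; the layer sizes and the two control outputs are arbitrary, and this is obviously not Tesla's architecture:

```python
import torch
import torch.nn as nn

class TinyDrivingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                    # extracts features from camera pixels
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)                     # outputs [steering, acceleration]

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(frames))

policy = TinyDrivingPolicy()
frame = torch.rand(1, 3, 96, 96)                         # one fake camera frame
print(policy(frame))                                     # a 1x2 tensor: [steering, acceleration]
```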
 
It's not really understanding human concepts. It extracts the relevant information it needs from the dataset and represents it in an efficient way as billions of parameters. But it's damned good at extracting information from the signal.

If there is some simple concept such as "pixels that look like this combination of letters if you rotate and scale them -> drive in Y way", it will extract that signal from the data. Even more complex concepts, such as "that guy is erratic, move away from him", it will just figure out from the data.

I do understand the general concept of how NNs work and generalize. Based on all the advances so far, I have a hard time seeing how V12 will make sense of human driving concepts based on videos of specific situations.

Chess, LLMs, vision, StarCraft, Dota, protein folding, etc. - all of these are intuitive to me from a NN capability POV. What v12 is purportedly designed to accomplish isn't intuitive to me. It seems heuristic controls with ever-increasing perceptual ability and coarse aids is a more tractable approach (basically software 2.0, where NNs are slowly consuming heuristics). V12 isn't software 2.0 IMO; it's getting rid of the human software paradigm and going the black-box route.
 
Based on all the advances so far, I have a hard time seeing how V12 will make sense of human driving concepts based on videos of specific situations.

By making connections between the video input and what the human driver did next in the video. So for example, if you see a bunch of videos where a car is approaching a stop sign and the next action the car takes is to stop at the stop sign, you can make a connection between the two things: "if stop sign, then car should stop". If you see a bunch of videos where a car is stopped at a red light and then, when the light turns green, the next action is to go, you can infer a connection: "if light turns green, then go". Maybe there are a bunch of videos where the light turns green but a pedestrian is still crossing or another car passes in front; in those clips, the next action of the car is to wait a bit and then go. So now you can make a better connection: "if light turns green and path is clear, then go". Or there are video clips of a double-parked car and the next action of the car is to go around it, so you can make a connection: "if path is blocked, go around it". e2e maps the visual input directly to a certain control output. So when the visual input matches "stop sign", the control output is to brake and stop; when the visual input is a green light, the control output is to accelerate, etc.

Now imagine scaling that up to billions of video clips. In theory, the NN should make enough connections to be able to "make sense" of all human driving concepts. At that point, the AV should be able to drive in all situations. At least that is the idea. We are talking billions of connections that need to be optimized just right. That is why it requires massive computing power like Dojo. It remains to be seen how well the approach will work in real life.
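A toy sketch of that imitation idea as a behavior-cloning loop in PyTorch, where the loss is just "how far the network's action is from the human's action"; the dataset and model here are fake stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 64), nn.ReLU(),
                      nn.Linear(64, 2))                # -> [steering, acceleration]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Fake "billions of clips": here, 256 random frames paired with human actions.
frames = torch.rand(256, 3, 96, 96)
human_actions = torch.rand(256, 2)

for epoch in range(10):
    predicted = model(frames)                           # what the network would do
    loss = loss_fn(predicted, human_actions)            # how far from what the human did
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                    # nudge the weights toward the human behavior
```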
 
By making connections between the video input and what the human driver did next in the video. So for example, if you see a bunch of videos where a car is approaching a stop sign and the next action the car takes is to stop at the stop sign, you can make a connection between the two things: "if stop sign, then car should stop". If you see a bunch of videos where a car is stopped at a red light and then, when the light turns green, the next action is to go, you can infer a connection: "if light turns green, then go". Maybe there are a bunch of videos where the light turns green but a pedestrian is still crossing or another car passes in front; in those clips, the next action of the car is to wait a bit and then go. So now you can make a better connection: "if light turns green and path is clear, then go". Or there are video clips of a double-parked car and the next action of the car is to go around it, so you can make a connection: "if path is blocked, go around it". Now imagine scaling that up to billions of video clips. In theory, the NN should make enough connections to be able to "make sense" of all human driving concepts. At that point, the AV should be able to drive in all situations. At least that is the idea. We are talking billions of connections that need to be optimized just right. That is why it requires massive computing power like Dojo. It remains to be seen how well the approach will work in real life.

There are just so many unanswered questions for me wrt V12. Everything from how many examples to curate of each situation (because even a rare situation is just as important as a routine one), to whether you need the same # of examples in all lighting and weather conditions. How can Tesla curate all these situations when they can't even detect wiper needs properly lol
 
Elon has said there are millions of edge cases. It's really unclear and unintuitive to me how a NN can generalize implicit human decision making and mistake correction. How do you even make a training set that is consistent among hundreds of human video analysts, considering how many edge cases there are?

Take, for example, your construction or "lane closed ahead" sign. Does Tesla pick videos of good drivers changing lanes 100 yards in advance, or 200, or 10? And what if the car doesn't see the sign until 5 yards before because of obstructions? It's totally unintuitive that a NN can understand the concept of avoiding a closed lane: how far away, what maneuvers to take at certain distances, among other road users, in all weather conditions, etc.
I stopped reading after "Elon has said" ... ☺️
 