There are just so many unanswered questions for me wrt V12. Everything from how many examples to curate of each situation (because even a rare situation is just as important as a routine situation) to whether you need the same # of examples in all lighting and weather conditions. How can Tesla curate all these situations when they can't even detect wiper needs properly lol
These are the same kind of unanswered questions as "how will the neural network tell the difference between a dog and a cat?" It just does it, the same way the human brain does it: by taking the input, doing some function on it, and out comes the answer. Then we can try to rationalize how we came to the conclusion, but that's not how the brain does it...

The wiper issue is that the camera isn't placed somewhere that gives the same focus on rain drops as the eyes of a human in the driver's seat. If the car had a camera where the human sits, it would be a much easier problem.
 
You would think so, but if we're talking about a NN's ability to predict based on pixel patterns, it should be able to see a pattern of visual distortions from a sequence of water drops on the camera, despite the focal distance.
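Just as a rough sketch of what I mean (hypothetical toy code, nothing to do with Tesla's actual wiper network): a tiny classifier over a short stack of frames could pick up the droplet-distortion pattern over time, even if the drops themselves are out of focus:

```python
# Hypothetical toy example: classify "wipers needed" from a short clip of frames.
# Input: a stack of T grayscale frames; output: probability the lens is wet.
import torch
import torch.nn as nn

class RainClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),   # (B, 1, T, H, W) -> (B, 16, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                       # pool over time and space
            nn.Flatten(),
            nn.Linear(32, 1),                              # logit for "rain on lens"
        )

    def forward(self, clip):
        return torch.sigmoid(self.net(clip))

model = RainClassifier()
clip = torch.rand(2, 1, 8, 96, 160)    # batch of 2 clips, 8 frames each
print(model(clip).shape)               # torch.Size([2, 1])
```

Trained on enough labeled clips, the temporal dimension is what would let it separate persistent droplet smear from, say, glare or dirt.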
 
  • Like
Reactions: pilotSteve
Funny thing is - FSD currently behaves like a driver new to the area. If you are familiar with the roads you know which lanes to take (depending on your route). But drivers new to the area frequently take the wrong lane - just like FSD.
Tesla is still making navigation map updates, as a local driver, or even someone who has driven through an intersection just once before, probably has some memory of the layout and of potential things to watch out for that differ from an average intersection. The neural networks can have a general understanding of many types of intersections, but they aren't databases designed to recall any particular intersection. Even without maps, the networks can probably drive safely and might even have enough time to comfortably make the lane change, but to increase the consistency of safe and comfortable drives, maps can provide that memory, especially when vision cannot get a good view.

I've had plenty of unnecessary lane changes with 11.x, but there have been a few times where, even on local roads I take infrequently, FSD Beta was actually in the correct lane because of map data, so then I apologize in the voice notes that I was wrong. :p Hopefully end-to-end can reduce the need to have correct maps in the first place as well as make better use of map data when it does have it, even when the map is wrong.
 
  • Funny
Reactions: EVNow
V12 isn't software 2.0 imo, it's getting rid of the human software paradigm and going with the black box route
From what Tesla has presented about their foundational world model, it seems like it's generic enough to predict video from video and other sensors. This doesn't require knowing about the particular task of controlling the vehicle, so it can be trained on a larger set of data, independent of whether the driving is good and actually without any labels at all.

For such a model to accurately predict what the other cameras will see when continuing straight or when making a turn, it needs to have some internal understanding of how objects behave as time moves on, how an object moves from one camera view to another, or even more basically what an object is in the first place. Even more complex concepts would be learning that some objects like traffic lights behave a certain way, such as multiple signals changing from red to green at approximately the same time, which in turn causes other objects like vehicles to change their behavior.

If Tesla has a model that is able to internalize these concepts to predict future video frames from past video, that same understanding could be used to train an end-to-end control network to stop at red lights and go on green, etc.
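As a toy sketch of that recipe (my own guess at the shape of it, definitely not Tesla's architecture): pretrain an encoder and a dynamics model purely on next-frame prediction, with no driving labels at all:

```python
# Toy self-supervised "world model" pretraining: predict the next frame's latent
# from the current one. No labels about driving quality are needed.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent),
        )

    def forward(self, x):
        return self.net(x)

encoder = Encoder()
dynamics = nn.GRUCell(128, 128)   # rolls the latent state forward in time
optim = torch.optim.Adam(list(encoder.parameters()) + list(dynamics.parameters()), lr=1e-4)

def pretrain_step(frames):        # frames: (B, T, 3, H, W) from any fleet clip
    B, T = frames.shape[:2]
    state = torch.zeros(B, 128)
    loss = 0.0
    for t in range(T - 1):
        state = dynamics(encoder(frames[:, t]), state)
        target = encoder(frames[:, t + 1]).detach()   # next frame's latent is the label
        loss = loss + nn.functional.mse_loss(state, target)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

print(pretrain_step(torch.rand(2, 5, 3, 64, 64)))
```

The point is just that the "label" comes for free from the next frame, so any clip from the fleet is usable, good driving or not.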
 
  • Informative
Reactions: pilotSteve
During the livestream, Elon and Ashok emphasized the importance of having only good drivers in the dataset though
Yes, for training good control you need appropriate/curated training data. You can train the world model to understand how things behave, then finetune the control outputs. It's similar to how large language models are generally trained on any text to understand how words go together and their underlying meaning, while ChatGPT additionally uses reinforcement learning from human feedback to learn what people expect from a conversation as opposed to generic text completion.
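Continuing the toy sketch idea (again, just my assumption about the general recipe, not Tesla's pipeline), the finetune stage would freeze the pretrained encoder and fit only a small control head on curated good-driver clips:

```python
# Toy finetune stage: keep a pretrained encoder frozen and train only a control
# head on curated clips of good drivers (their recorded steering/accel as targets).
import torch
import torch.nn as nn

# Minimal stand-in for a pretrained encoder like the one sketched above.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
)
for p in encoder.parameters():
    p.requires_grad_(False)            # keep the "world knowledge", fit only control

control_head = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 2),                  # outputs: steering angle, acceleration
)
optim = torch.optim.Adam(control_head.parameters(), lr=1e-4)

def finetune_step(frames, expert_controls):   # expert_controls: (B, 2) from good drivers
    with torch.no_grad():
        latent = encoder(frames)              # frames: (B, 3, H, W)
    loss = nn.functional.mse_loss(control_head(latent), expert_controls)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()

print(finetune_step(torch.rand(4, 3, 64, 64), torch.rand(4, 2)))
```

That's the structural analogy to pretraining on all text and then aligning on curated conversations: the expensive general model is reused, and only the behavior on top gets shaped by the curated data.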
 
Yea, that's analogous to how LLMs are created, but I think Tesla is solely using good-driver videos both to generate the world model and to output the controls. I haven't heard Ashok or Elon say anything to the contrary.
 
Good crowdsourced metadata along with maps is necessary for good local driving in heavy traffic
Part of the practical question is what's good enough for an initial release of end-to-end. If Tesla can improve maps in parallel to significantly improve the driving behavior of already deployed models, Tesla might even build out something that reacts closer to real-time, such as detecting debris in the road and having other vehicles avoid it. (Even further out would be changing driving behavior/routes based on automated reports of partial/uncertain issues to get a better view to verify.)

While annoying to miss a turn, it's not necessarily a safety issue, as there's the option to reroute unless the human decides to force a last-second lane change. Overall it seems like the main benefit of 11.x to 12.x is increasing comfort while also maintaining or improving safety, so maybe it will be comfortable for average traffic initially and improve later for heavy traffic.
 
This is absolutely critical. All humans do it (a huge amount too / it’s not rare!). If you can’t, the results will be catastrophic.

No amount of caution or imitation can make up for it.

You can’t imitate what someone else did when your situation is entirely different for reasons you can’t see.
A twist to this is that humans predict what they can't see. For example, when you turn a blind corner, whether from experience with that exact corner or from having travelled similar corners, most people mentally predict the road structure around the corner before they even see it. FSD Beta has been doing similar things for a while now.

There was a presentation that talked about how, theoretically, you can build a NN that knows every single road in an area even without explicitly mapping it, just by having the information stored in the NN's weights. It's like a map, but not really a map. It would be similar to a very experienced driver driving around their local area vs using a detailed map. In this context, lane choices may potentially be handled in a similar way (though obviously you still need a rough map, at minimum a navigation map, just like humans do).
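A toy illustration of "a map stored in weights" (purely hypothetical numbers and layout): a tiny coordinate network that, after overfitting a local area, answers road-layout queries from positions without any explicit map lookup:

```python
# Toy "implicit map": an MLP memorizes whether a local (x, y) position is drivable
# road, purely in its weights. It behaves like a map, but is queried like a function.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),                   # logit: is this position on the road?
)
optim = torch.optim.Adam(net.parameters(), lr=1e-3)

# Fake survey of a small area: a horizontal road band around y = 0.
xy = torch.rand(4096, 2) * 2 - 1                        # positions in [-1, 1]^2
on_road = (xy[:, 1].abs() < 0.2).float().unsqueeze(1)   # "ground truth" layout

for _ in range(500):                                    # overfitting = memorizing the area
    loss = nn.functional.binary_cross_entropy_with_logits(net(xy), on_road)
    optim.zero_grad(); loss.backward(); optim.step()

query = torch.tensor([[0.5, 0.05], [0.5, 0.8]])         # one point on the road, one off
print(torch.sigmoid(net(query)))                        # ~high for the first, ~low for the second
```

Scale that idea up and you get the "very experienced local driver" behavior: the layout is recalled from weights rather than looked up in a database.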
 
  • Informative
Reactions: APotatoGod
Yeah, there's a lot of talk about this. Prediction is a whole other issue which I wasn't addressing (it's also important, and requires good memory). Anyway, in general there's a lot of talk about NNs and how good they are, accompanied by tons of handwaving and saying "we'll throw a few trillion parameters at it." But I haven't seen evidence of a system that works yet. Seems likely they are a very good "99% solution," or perhaps even a higher percentage.

But I was just talking more narrowly about the concept of things that are blocked from view, and how important it is to realize they are blocked from view. It's definitely something you have to be aware of.

It's always been unclear to me how much of the visual renderings on the screen of things that are out of view comes from the LD maps, and how much from the NN's predictions. Maybe someone's done a detailed analysis, but it would take some careful observation to figure out (can't do that while driving).

Anyway whatever. It seems really hard to me to produce a system that goes truly end to end and reliably captures the concepts of lanes, obstructions, etc., and I hope they can figure out how to do it. Would be a truly remarkable accomplishment!
 
I was just talking more narrowly about the concept of things that are blocked from view, and how important it is to realize they are blocked from view
10.69 introduced the explicit perception output of occlusion with the occupancy network, and this was used by control to decide whether it should creep up to the limit line to get a better view. If the world model is an evolution of the occupancy network, it would seem natural for it to understand occlusions too. Understanding occlusions might be even more fundamental for predicting how the world behaves, such as object permanence when a crossing vehicle temporarily blocks the view. Potentially this more nuanced understanding of occlusions will allow end-to-end to behave differently when at a solid wall vs chain-link fencing that can easily be seen through vs short hedges/walls.

Similarly the 11.x object detection and visualization seems to require the thing to be in sight whereas a visualization output trained on top of the more fundamental understanding could continue to visualize things even if they temporarily disappear from view. However, this could also result in wrong predictions/visualizations if say a person unexpectedly turns around when walking behind a large truck, but this is probably no different than the existing predictions of road layout based on common patterns, e.g., a roundabout is generally circular even if you can't actually see the far side.
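As a hand-wavy sketch of how an explicit occlusion output could feed a creep decision (not Tesla's code; the grid encoding and thresholds are made up): given an occupancy grid where cells are free, occupied, or unobserved, control can check how much of the crossing corridor it can actually see before committing:

```python
# Hypothetical creep logic on top of an occupancy grid. Cell values: 0 = free,
# 1 = occupied, 2 = occluded/unobserved. Creep forward until enough of the
# crossing-traffic corridor is actually visible, then go.
import numpy as np

FREE, OCCUPIED, OCCLUDED = 0, 1, 2

def corridor_visibility(grid, corridor_rows):
    """Fraction of the crossing-traffic corridor that is observed (not occluded)."""
    corridor = grid[corridor_rows, :]
    return float(np.mean(corridor != OCCLUDED))

def decide(grid, corridor_rows, visibility_needed=0.9):
    if np.any(grid[corridor_rows, :] == OCCUPIED):
        return "yield"                          # something is actually there
    if corridor_visibility(grid, corridor_rows) < visibility_needed:
        return "creep"                          # can't see enough yet: inch forward
    return "go"

# 10 x 10 grid; rows 0-2 are the cross street. A wall on the left occludes part of it.
grid = np.zeros((10, 10), dtype=int)
grid[0:3, 0:4] = OCCLUDED
print(decide(grid, slice(0, 3)))                # "creep" until the view opens up
```

The interesting part for end-to-end is whether that kind of rule gets learned implicitly, including the difference between "occluded by a wall" and "occluded by a fence you can mostly see through".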
 
I mean, that is kind of the question that people want to know the answer to, and it’s not clear which we will get - and how much “glue” the “nothing but nets” approach has. There’s a big difference between just photons and map info (and other input modes) directly to output, and something much more observable, with nets that perform specific tasks and then interface to other nets. One is “actual end-to-end” (using a reasonable definition) and the other is something else (also can be called “end-to-end, but in pieces”).

It's semantics; the training of either approach is the same. You still have backprop through the modules, it's just that each piece has been pre-trained. That's not a bad thing; it assuredly allows your total model to converge to decent results more rapidly. And it can still allow the perception component to evolve over time. It doesn't have to be static.

The whole video-in, controls-out paradigm doesn't make sense to me from an implementation POV. Like evnow pointed out, there are situations where it's not clear how this approach will improve or solve the problem.

It's unclear how an all-nets approach will understand implicit human decision making. How will it understand that I made a lane change because I'm avoiding an arbitrary obstruction or situation, vs navigating to my destination, vs fixing a mistake I made earlier?

Feedback from pedal & wheel control backward toward trying to figure out which component of the network needs updating is indeed a weak signal. But this is counterbalanced by the sheer volume of data available. If the labeled data shows 100k cases where the driver changes lanes when they see an object (and 0 where they run over the object), then the NN will learn to always move over for the object. Even if there are other reasons you may change lanes, where a few training samples wouldn't be sufficient to clarify, the volume wins out.
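A trivial way to see the "volume wins" point (toy numbers of my own): a behavior-cloning objective just ends up matching the empirical action frequencies for each situation, so 100k lane changes and 0 run-overs push the predicted probability of hitting the object to zero:

```python
# Toy behavior cloning: the cross-entropy-minimizing policy for a given situation
# is simply the empirical action frequency in the training data.
from collections import Counter

# (situation, action) counts harvested from fleet clips (made-up numbers).
data = Counter({
    ("object_in_lane", "change_lane"): 100_000,
    ("object_in_lane", "run_over"): 0,
    ("clear_road", "keep_lane"): 250_000,
    ("clear_road", "change_lane"): 5_000,   # e.g. routing or driver preference
})

def cloned_policy(situation):
    actions = {a: n for (s, a), n in data.items() if s == situation}
    total = sum(actions.values())
    return {a: n / total for a, n in actions.items()}

print(cloned_policy("object_in_lane"))   # {'change_lane': 1.0, 'run_over': 0.0}
print(cloned_policy("clear_road"))       # mostly keep_lane, occasionally change_lane
```

The catch, as noted above, is the rarer reasons for a lane change: if they aren't distinguishable from the inputs, they just get averaged into the dominant behavior.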

I don't think people are really understanding what is happening in machine learning with transformers and compute. Look at this post going around these days in ML world, as well as this paper on scaling laws.

Intuition and domain knowledge eventually lose out to compute and data. These models just keep improving in a consistent, predictable fashion as long as you allow enough compute, data, and parameters in the model. The actual architecture / depth / width matters less!
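For reference, the kind of relationship those scaling-law papers fit is roughly this (the parametric form from the LLM scaling-law literature, e.g. Hoffmann et al. 2022; the constants for a driving model are obviously unknown):

```latex
% Parametric loss fit from LLM scaling-law work (Hoffmann et al., 2022):
% N = parameter count, D = amount of training data, E = irreducible loss.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Loss falls predictably as N and D grow together; architecture details mostly
% shift the constants A, B, alpha, beta rather than the power-law form itself.
```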

Self driving cars at scale is one of the hardest problems, which is why it will need to be solved by ML models. There is no reason to think it will behave differently than other complex models. What wins? Compute and data. Clean data. The Attention mechanism / transformer models have shown strong ability to generalize from a data set and store nuanced understandings given enough data.

Worried about navigation? Just literally feed a picture of the navigation screen into the training, like a human.

The main issues going forward I see are:

1) Cleaning the data. You are ingesting so much data, and you want to remove contradictory data & data from say distracted drivers.

2) Inference compute. All of what I said above does not imply a great model can work on the limited compute resources.
 
I don't think people are really understanding what is happening in machine learning with transformers and compute. Look at this post going around these days in ML world
Do you think moving to self-supervised pre-training takes even more advantage of Tesla's potential for fleet data and training compute? Instead of "just" building from existing human knowledge of 11.x's network predictions for objects, occupancy, lanes, traffic controls, etc., the neural network architecture and training data could be reshaped to be more general, so it could potentially learn things that human understanding might have overlooked.

I wonder if self-supervised training including videos of crashes would even help the neural network better understand which situations are more likely to result in accidents. If there are enough examples for that level of understanding, then a finetuned control head could potentially even learn to generally avoid these dangerous situations.
 
10.69 introduced the explicit perception output of occlusion with the occupancy network, and this was used by control to decide whether it should creep up to the limit line to get a better view. If the world model is an evolution of the occupancy network, it would seem natural for it to understand occlusions too. Understanding occlusions might be even more fundamental for predicting how the world behaves, such as object permanence when a crossing vehicle temporarily blocks the view. Potentially this more nuanced understanding of occlusions will allow end-to-end to behave differently when at a solid wall vs chain-link fencing that can easily be seen through vs short hedges/walls.

Similarly the 11.x object detection and visualization seems to require the thing to be in sight whereas a visualization output trained on top of the more fundamental understanding could continue to visualize things even if they temporarily disappear from view. However, this could also result in wrong predictions/visualizations if say a person unexpectedly turns around when walking behind a large truck, but this is probably no different than the existing predictions of road layout based on common patterns, e.g., a roundabout is generally circular even if you can't actually see the far side.
We’ll see. There’s been a lot of talk about AI for a long time, but only very limited, specific, well-bounded applications have been successful to date. It seems like AI/ML will be able to provide excellent “assistant” type capabilities. These are extremely valuable tools of course, and have the potential to radically change the world and hopefully make life even more comfortable for humans. But these tools are not what is being discussed here.

Self driving cars at scale is one of the hardest problems, which is why it will need to be solved by ML models. There is no reason to think it will behave differently than other complex models. What wins? Compute and data. Clean data. The Attention mechanism / transformer models have shown strong ability to generalize from a data set and store nuanced understandings given enough data.

We’ll see. Seems like a very tough problem.

Regarding the semantics: I feel like something that has a representation of a stop sign included within is quite different from a model which has no such distinct thing inside it. (This is the issue: it's been stated that no such accessible representation exists, which is confusing, since stop signs were still visualized.) To a complete moron in this field, it seems like a distinction that may actually have some consequences for training, visibility, pace of development, ability to retrain particular parts of the model, etc.
 
It's semantics; the training of either approach is the same. You still have backprop through the modules, it's just that each piece has been pre-trained. That's not a bad thing; it assuredly allows your total model to converge to decent results more rapidly. And it can still allow the perception component to evolve over time. It doesn't have to be static.



Feedback from pedal & wheel control backward toward trying to figure out which component of the network needs updating is indeed a weak signal. But this is counterbalanced by the sheer volume of data available. If the labeled data shows 100k cases where the driver changes lanes when they see an object (and 0 where they run over the object), then the NN will learn to always move over for the object. Even if there are other reasons you may change lanes, where a few training samples wouldn't be sufficient to clarify, the volume wins out.

I don't think people are really understanding what is happening in machine learning with transformers and compute. Look at this post going around these days in ML world, as well as this paper on scaling laws.

Intuition and domain knowledge eventually lose out to compute and data. These models just keep improving in a consistent, predictable fashion as long as you allow enough compute, data, and parameters in the model. The actual architecture / depth / width matters less!
Yes, but only when the amount of supervised labels (which for LLMs is next-token prediction) grows in direct proportion to that.

The problem with L4 is the significant out-of-bounds and controllability requirements. GPT-4 works really well because almost any statement it needs to think about is not outside its training domain, as it has ingested almost all computer-readable text.

We can't get there like that with thin human control as the only supervision signal. There is plenty of data for the auxiliary task of predicting images 50 ms ahead, but that doesn't give you a driving system. Predicting words one token ahead is very close to a useful chatbot, but in driving the auxiliary task is much further away from the end goal.

This self-supervised, general-purpose approach will be good at making a very natural L2+++ ADAS system, and then probably get stuck, and then it's a grey goo we don't understand with no path to true controllability. Humans need to guarantee controllability against ideal laws in order to pass regulation and gain acceptance, not just show statistical behavior near the training set.

Potentially, if massive compute is able to generate full-fidelity simulated data indistinguishable from real in the lab (simulating control of ego, the responses of other objects, and the photon response of the sensors including weather and glare), then the massive compute-and-search-based directives of Sutton would work, as that would give the massive quantity of simulated correct and incorrect labels that adhere to the regulatory requirements.

Self driving cars at scale is one of the hardest problems, which is why it will need to be solved by ML models. There is no reason to think it will behave differently than other complex models. What wins? Compute and data.
What level of compute is needed? If it is 500x what is on board today it might not work, as Moore's law has stopped.
 
Part of the practical question is what's good enough for an initial release of end-to-end. If Tesla can improve maps in parallel to significantly improve the driving behavior of already deployed models, Tesla might even build out something that reacts closer to real-time, such as detecting debris in the road and having other vehicles avoid it. (Even further out would be changing driving behavior/routes based on automated reports of partial/uncertain issues to get a better view to verify.)

While annoying to miss a turn, it's not necessarily a safety issue, as there's the option to reroute unless the human decides to force a last-second lane change. Overall it seems like the main benefit of 11.x to 12.x is increasing comfort while also maintaining or improving safety, so maybe it will be comfortable for average traffic initially and improve later for heavy traffic.
I don't think they need crowdsourced metadata for the initial release, of course. But to make the drive feel like anything other than a fumbling new driver's, they need to get some of the local knowledge we use to drive.

Missing turns can be more than an annoyance - sometimes FSD can put you in dangerous situations by stopping in a lane while trying to change lanes, and it can make you late for important appointments... or make you miss flights!
 
  • Like
Reactions: Mardak