Anyone have any ideas on how V12 works?

1) If V12 is not taught / labeled what a traffic light / lane / pedestrian / sign is, how will it be able to extract what information is relevant in a video stream?
Assuming it isn't started from a pretrained set of weights, it picks it up on its own during training, the same way NNs learn to recognize boats vs. planes vs. a thousand other categories in the ImageNet dataset.
After millions of iterations over millions of samples it gains red_in_these_regions -> stop and green_in_these_regions -> go connections.
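As a loose illustration of that idea (nothing to do with Tesla's actual pipeline; all names and data below are made up), here is a single weighted summer learning the red -> stop association purely from examples:

```python
# Toy sketch: one "summer with weights" learning that red pixels in a certain
# region correlate with "stop", purely from (frame, driver action) examples.
import numpy as np

rng = np.random.default_rng(0)

def make_frame(light_is_red):
    # Three fake features: mean red / green intensity in the "traffic light"
    # region, plus background clutter that carries no signal at all.
    red = rng.normal(0.9 if light_is_red else 0.1, 0.05)
    green = rng.normal(0.1 if light_is_red else 0.9, 0.05)
    clutter = rng.normal(0.5, 0.3)
    return np.array([red, green, clutter])

# "Human driver" labels: 1 = stop, 0 = go.
X = np.array([make_frame(i % 2 == 0) for i in range(2000)])
y = np.array([1.0 if i % 2 == 0 else 0.0 for i in range(2000)])

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):                             # "millions of iterations", scaled down
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted stop probability
    grad_w = X.T @ (p - y) / len(y)              # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w)  # the weight on the "red" feature ends up strongly positive -> stop
```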
 
It surely will be able to perfect your commute in no time.
Um. Being of the DSP persuasion, although long before Markov Chains became a thing, I've been following the foibles of Neural Networks for quite some time. Not that I've programmed same, but, still.

There's no question that NN's can be rigged up in control loops with feedback that, in turn, changes the weights of the NN and, thus, its behaviors. That's the math part of it. From another perspective, we know that our brains work primarily with NNs all the time. As users of this wetware, we have a fairly decent idea of How It All Works, including how to train children up to become adults, including learning how to walk, talk, and probably rub one's belly at the same time.

So, in a way, it's easy to think that, with a bit of self-training along the lines of what we do when we teach a newbie driver how to drive, an NN'd car is trainable like that.

And it's right about here where it all falls apart. The NNs that our wetware brains work with are not just blind, unprogrammed NNs; they've got a couple-three billion years (or however long it's been since nerves first evolved) of evolution behind them. Evolution is Nature that's bloody in tooth and claw; if a built-in NN isn't as good as the NN in the same species (natural variation) or as good as the NN in the predator species, those that don't measure up end up dead.

And this impacts everything in our wetware. Sense of balance? Ability to suss out sounds? Image recognition? Instinctive (that's a good word..) fear of situations that raise the hairs on the back of our necks?

It's that good old argument of nature vs. nurture for particular traits, which people can argue about endlessly in terms of which is more important. But, no question, we've got loads of the first (nature) stuff, without which we wouldn't survive, period.

The fact that we even have the ability to drive a car is pretty amazing. First, my spouse, a Human Factors engineer, will absolutely tell anyone who'll listen that it's perfectly possible (and has been done, not on purpose) to design a vehicle that can't be driven by mere mortals, be they test pilots or no. Second, it doesn't take much thought to see that, as descendants of a long line of omnivores (and who knows what even before there were primates), being able to dodge predators and work in concert with others was probably a survival-enhancing thing. At which point Nature takes over and puts in instincts.

So, as a stupid example, running into trees at full tilt is something that humans have a decided aversion to. But think about what that means: the ability to recognize a tree, how the energy in speed translates to splat and What Happens if One Hits, the instinct towards self-preservation. If one takes one of those robots Tesla is working on, those things don't have a built-in aversion to trees at high speeds; they've just got NN/algorithmic rules written, more or less, on white sheets of paper. Which is not the same thing as a human's thought (or lack thereof) processes when approaching a tree at speed.

It's arguments like the above that probably fuel the naysayers around these parts who are of the unsubtle opinion that FSD, from whoever, not just Tesla, is a research project that might take decades to solve. They may have a point.

But I'm not one of those people. As weird and varied as roads are, it's Not The General Universe. And, well, in a couple hundred years we may very well be building truly intelligent beings that do have all the bells and whistles; but, at the moment, we're coming into the shallows with the technology that we do have; and, that technology, as limited as it is, is ridiculously more powerful than, say, what was available fifty years ago. So I think it quite likely that we'll have the equivalent of the smarts of a rat's brain to traverse the landscape in a car.

But, a generally self-programmable NN/Algorithmic computer that copies a driver? No way, not now, possibly never, and, if it ever does appear, it will have some core self-preservation system (perhaps modeled on today's work) that prevents it from doing stupid stuff. Ten, fifty years from now? If one were lucky.
 
Anyone have any ideas on how V12 works?

1) If V12 is not taught / labeled what a traffic light / lane / pedestrian / sign is, how will it be able to extract what information is relevant in a video stream?

I think Elon was not very precise when he said that V12 is not taught what a traffic light/lane/pedestrian/sign is. What he meant is that humans don't manually label traffic lights/lanes/pedestrians/signs, and there are no hard-coded heuristics that say "this is a traffic light." But V12 is taught what traffic lights, lanes, and signs are; the learning is just done more implicitly. Basically, the NN is trained on video and extracts the meaning of traffic lights, signs, etc. on its own.
 
Do you mean that Tesla feeds these labels to V12 during the training?
If I understand Elon and his buddy in the car correctly, it's more fundamental than that.

It's not labels that get fed into the NN. An NN has multiple inputs and multiple stages of summers, with each summer multiplying each of its inputs by a weight; the outputs of the summers in one stage are then the inputs to the summers in the next stage, and so on. In addition, at each stage, the outputs of that stage may be fed back as inputs to that stage or to previous stages as well, with more weights thrown in.

At the end of all this malarkey come forces on the steering wheel, brakes, and what-not. That's what end-to-end NN means.
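A minimal sketch of that "pixels in, forces out" shape, assuming PyTorch and a made-up toy resolution (the real network is vastly bigger and built from convolutional/transformer stages, not plain summers):

```python
# Illustrative only: stacked stages of weighted sums (linear layers) taking raw
# camera pixels in and putting steering / brake / throttle commands out.
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                 # raw camera pixels, flattened
            nn.Linear(3 * 64 * 64, 256),  # one "stage of summers" with weights
            nn.ReLU(),
            nn.Linear(256, 64),           # another stage
            nn.ReLU(),
            nn.Linear(64, 3),             # outputs: steering, brake, throttle
        )

    def forward(self, frames):
        return self.net(frames)

driver = EndToEndDriver()
controls = driver(torch.rand(1, 3, 64, 64))  # one fake 64x64 RGB frame
print(controls.shape)                        # torch.Size([1, 3])
```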

Thing is, one can have algorithmic (i.e., C++ or the like) code in there as well, to a greater or lesser extent, depending upon the design.

The general idea behind training an NN is to have known inputs and known, correct outputs; the actual outputs are compared to the known, correct outputs and, if they're off by yea, the weights are manipulated until the outputs of the NN match up with the known, correct (externally generated) outputs.

What Elon more or less said was that, originally, the NN would come out with outputs that then went into an algorithmic section that had The Rules Of Driving built in and curated by software designers. The change is that they dumped the Rules of Driving and now compare the NN outputs with what the car should be doing.
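In training terms, that change looks roughly like the sketch below: the "known, correct outputs" are whatever the human driver actually did. This is imitation learning in the generic sense, not Tesla's actual code; clip_loader is a hypothetical dataset of (camera frames, recorded human controls) pairs.

```python
import torch
import torch.nn as nn

# A stand-in end-to-end network (pixels -> steering/brake/throttle); see the
# earlier sketch for a slightly bigger version.
driver = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(driver.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# clip_loader is hypothetical: it would yield (camera frames, what the human
# driver actually did) pairs harvested from the fleet.
for frames, human_controls in clip_loader:
    predicted = driver(frames)                 # what the NN would have done
    loss = loss_fn(predicted, human_controls)  # how far off it is from the human
    opt.zero_grad()
    loss.backward()                            # nudge the weights...
    opt.step()                                 # ...until the outputs match the driver
```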

So, when the car approaches a Stop Sign, does it read the stop sign, understand what the stop sign is saying, or does the stop sign get labeled as a stop sign so that some algorithms with the $STOP_SIGN variable set do their bit? None of that. The NN simply puts out outputs that Bring The Car To A Halt. Or whatever is supposed to happen.

From this view, the car knows nothing. It.. just... does it. Whatever "it" is.

So, a street sign that you or I would read that says, "No turn on red"? Naw, the car doesn't understand that. It just, NN style, doesn't turn right on red. No thought; no consciousness; no introspection; no reading; it's a ridiculously complicated machine that Just Does That.

The part that I don't understand is that, while I was watching the demo, the car was showing the usual: predicted path, when it was going to come to a halt, position of the other cars, and so on. In a pure NN system, none of that would even exist as internal variables; it'd just be this morass of NN inputs, outputs, and weights, with no spatial identity whatsoever. Since there was that spatial view, that means that, to some extent, what I wrote above is probably not quite true. Instead, there may very well be an NN-based occupancy-generating pile with all its outputs of place and time; but instead of all that going into Algorithm City, it may go into another NN, with the second NN doing the actual driving. And figuring out (well, no figuring, exactly) what all the signs mean.
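A sketch of that two-network speculation, again assuming PyTorch, with all shapes and names invented: one net turns pixels into an intermediate "scene" representation (which could also feed the on-screen visualization), and a second net turns that representation into controls.

```python
import torch
import torch.nn as nn

class PerceptionNet(nn.Module):
    """Pixels -> intermediate scene features (e.g. occupancy, object positions)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
            nn.Linear(256, 128),          # a 128-d "scene" vector
        )

    def forward(self, frames):
        return self.backbone(frames)

class PlannerNet(nn.Module):
    """Scene features -> steering / brake / throttle."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, scene):
        return self.head(scene)

perception, planner = PerceptionNet(), PlannerNet()
scene = perception(torch.rand(1, 3, 64, 64))  # could drive the on-screen visualization
controls = planner(scene)                     # drives the car
```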

Fun days.
 
That's my interpretation as well, so it goes back to my original question:

1) If V12 is not taught / labeled what a traffic light / lane / pedestrian / sign is, how will it be able to extract what information is relevant in a video stream?
Feed V12 all the images and all the data. If it doesn't do the Right Thing, feed the errors back into the training, which results in the weights being changed. Lather, rinse, repeat until the outputs match the desired ones.

The video stream is the input into the NN, pixels and all. (Or at least one set of the inputs.)
 
Anyone have any ideas on how V12 works?
Only the guys at Tesla
1) If V12 is not taught / labeled what a traffic light / lane / pedestrian / sign is, how will it be able to extract what information is relevant in a video stream?
Correlation. Let's say that you see 10,000 clips of people stopping at stop signs. In those clips there are trees randomly showing up, mailboxes occasionally, people here and there, various road intersection geometries, etc., but the one constant is that there's a blob of red pixels with some white pixels in the middle. So you key off that. In all of those clips, the control outputs were to slow the car, so that's what you learn to do: red pixels with white pixels off to the right means slow the car to a stop.

Note that you don't know what trees, mail boxes, people or roads are, but for the purposes of that training episode you know that they're not significant to what you're supposed to be learning. It is the red and white blob that correlates to coming to a stop.

It's a ridiculous way to teach anything to anyone because it doesn't take any advantage of prevailing concepts. Elon is famous for observing that people drive with just their eyes. Well, we also drive with our understanding of the world via a certain set of concepts and with a mental model of how it all fits together. Autonomy doesn't require sentience, but it can sure make good use of learning based on the prevailing concepts that people work with.
 
That's my interpretation as well, so it goes back to my original question:

1) If V12 is not taught / labeled what a traffic light / lane / pedestrian / sign is, how will it be able to extract what information is relevant in a video stream?

It extracts the information on its own.

For example, Tesla feeds Dojo 10k video clips of a human driver driving through an intersection. The machine learns by matching the human driver's behavior to the features in the video. So it learns to stop at a red light because, in all the video clips, that is what the human driver does. V12 does not know what a traffic light is per se; it just knows that when an object that looks like a traffic light turns red, it should stop.
 
Correlation. Let's say that you see 10,000 clips of people stopping at stop signs. ... Autonomy doesn't require sentience, but it can sure make good use of learning based on the prevailing concepts that people work with.

Is this your current belief about how V12 works? If so, does it change your interpretation about what James Douma said about V12 building on top of V11?
 
Is this your current belief about how V12 works?
I'm still assuming that V12 is V11 with the control module replaced by a neural network. I would consider inferring control outputs based entirely on pixel inputs to be another one of those things where Tesla would beat its head against the problem for years without a good solution ever coming into focus. It just makes zero sense to me.

The solution I expect to work is one where there are a variety of neural networks that collect this, that, or the other bit of information about the driving environment. The labeler can identify all the essential objects, but there are also various cues that people rely on to drive safely. Is the rear end of a car lifting suddenly? Are its wheels slowly turning? Is the car's center of mass high or low? Is there a bunch of crap mounted to the vehicle? Is the driver of the car looking in my direction? What is the angle of the sun? These are the sorts of things that we consider significant when we drive. Tesla may already be doing all this. If so, then replacing the control module with a neural network should do the trick. Even without the additional cues they may end up with something that works pretty well.

Once they get some form of that working, only then should they consider removing the boundaries between the various neural networks. Each network will have been trained independently, but by allowing them to continue their training together there's the potential for the system to create its own correlations and such that produce some kind of superior driving intuition. But V11 plus a neural network control module is the first step.
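In code, that staged idea might look like the sketch below (purely illustrative, not Tesla's training setup; module shapes are made up): first train the neural-network control module against a frozen, independently-trained perception module, then "remove the boundary" by unfreezing everything and fine-tuning jointly.

```python
import itertools
import torch
import torch.nn as nn

# Stand-ins for independently-trained modules (shapes are invented).
perception = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))        # pixels -> scene features
planner = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))     # features -> controls

# Stage 1: keep the (already trained) perception weights frozen and train only
# the new neural-network control module on top of it.
for p in perception.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.Adam(planner.parameters(), lr=1e-4)

# Stage 2: remove the boundary -- unfreeze everything and fine-tune both
# networks together so they can form their own internal correlations.
for p in perception.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.Adam(
    itertools.chain(perception.parameters(), planner.parameters()), lr=1e-5
)
```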

Reminder: I am not a machine learning engineer
 
I'm still assuming that V12 is V11 with the control module replaced by a neural network. I would consider inferring control outputs based entirely on pixel inputs to be another one of those things where Tesla would beat its head against the problem for years without a good solution ever coming into focus. It just makes zero sense to me.

The solution I expect to work is one where there are a variety of neural networks that collect this, that, or the other bit of information about the driving environment.

I'm curious about this interpretation.

If the car is controlling for the "red and white blob" video stream which it "interprets" as a "stop sign" in its black box, what's the point of all the prior autolabels of stop signs (in relation to V12's actual driving)? What's even the point of a bird's-eye view of a stop sign coming up?

To me, it seems like V12 is a "one box" solution for perception, planning, and control in one NN architecture, where the good-driver videos define perception, planning, and control in one package, without the need to separate these concepts into separate tasks.
 
Um. Being of the DSP persuasion, although long before Markov Chains became a thing, I've been following the foibles of Neural Networks for quite some time. ... Ten, fifty years from now? If one were lucky.
I do agree with you and your wife: self-preservation, and that includes discrimination instincts, is what keeps us alive, not just when driving but in pretty much anything in life.

To that point, I have let FSDb drive on a road with no lanes and just cars parked on both sides. It was able to weave around them and keep going. That is the self preservation instinct that has been programmed in.

Now the hard part is actually following the road rules as they vary by neighborhood, city, county, and state. In that sense it is best that the NN learns from each Tesla in its own locale. My car and I don't need to learn how to navigate the city of San Francisco, and vice versa for the car and driver in San Francisco.

Good discussion….
 
I do agree with you and your wife: self-preservation, and that includes discrimination instincts, is what keeps us alive, not just when driving but in pretty much anything in life. ... Good discussion….
It's a fine discussion, as long as you preface it with the fact that that's not Tesla's plan currently.
 
If the car is controlling for the "red and white blob" video stream which it "interprets" as a "stop sign" in its black box, what's the point of all the prior autolabels of stop signs (in relation to V12's actual driving)? What's even the point of a bird's-eye view of a stop sign coming up?
Autolabeling has no role in the monolithic neural network. It's pixels-in and controls-out. In between it's all unstructured magic. It is only when you structure the solution with multiple networks with defined interfaces that you can use something like an autolabeler. V11 is the structured solution that relies on heuristics for vehicle control. In contrast, V12 is, in theory, the structured solution that relies on a neural network for vehicle control. V11 to V12 represents a stepwise change which seems completely manageable, especially in an eight-month timeframe.
 
Autolabeling has no role in the monolithic neural network. It's pixels-in and controls-out. In between it's all unstructured magic. It is only when you structure the solution with multiple networks with defined interfaces that you can use something like an autolabeler. V11 is the structured solution that relies on heuristics for vehicle control. In contrast, V12 is, in theory, the structured solution that relies on a neural network for vehicle control. V11 to V12 represents a stepwise change which seems completely manageable, especially in an eight-month timeframe.

If V12 is based on top of V11's structured solution for perception (autolabels and all), how do you reconcile Elon and Ashok's comments during the livestream about V12 not being taught what traffic lights / signs / lane lines / VRUs are?
 
If V12 is based on top of V11's structured solution for perception (autolabels and all), how do you reconcile Elon and Ashok's comments during the livestream about V12 not being taught what traffic lights / signs / lane lines / VRUs are?
I'm rewatching the drive and the bulk of his statements are about "we don't have a line of code for [behavior]", each of which is a reference to removal of the C++ code.

At one point he says
We have not programmed in the concept of traffic lights. So there's not like 'this is a red light', 'this is a green light', and 'this is the traffic light position'. We have that in the normal stack, but we do not have that in V12. This is just video. Video training. Like I said, nothing but neural nets.
I think this is more of the same, but using vague language that can be interpreted multiple ways. It seems like he's just trying to explain that all the heuristic code that explicitly knew about red lights, green lights and so on has been replaced with a neural network, which obviously tickles Elon pink. I assume that the labeling is still taking place, but the fact that the labeler is kicking out REDLIGHT doesn't mean anything to the control network per se. It's just a thing in the field of view and it correlates to the driver stopping the car. So it doesn't know what a red light 'is'. In contrast, the engineers working on the C++ code knew what a red light was and structured the logic with that in mind.
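A toy contrast of that distinction (all names invented, and only an illustration of the idea): the old stack's hand-written logic explicitly "knows" about red lights, while in the new arrangement the same signal would just be one more anonymous number the control network learned to correlate with stopping.

```python
import torch
import torch.nn as nn

# Old style: an engineer wrote a rule that "knows" the concept of a red light.
def heuristic_control(scene: dict) -> dict:
    if scene["red_light"]:
        return {"brake": 1.0}
    return {"brake": 0.0}

# New style: whatever the perception stack emits (including something that
# happens to encode "red light") is just another input feature; nothing in the
# code singles it out, the association lives in the learned weights.
control_net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
scene_features = torch.rand(1, 16)       # 16 made-up perception outputs
controls = control_net(scene_features)   # steering / brake / throttle
print(heuristic_control({"red_light": True}), controls.shape)
```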

I believe that's all Elon was saying. It would be nice to be able to sit down with Ashok and ask him pointed questions, but I'm assuming that he or a member of his team will eventually present something at some technical conference and we'll get the definitive statements that way.
 
If V12 is based on top of V11's structured solution for perception (autolabels and all), how do you reconcile Elon and Ashok's comments during the livestream about V12 not being taught what traffic lights / signs / lane lines / VRUs are?
It is the other way around.

When the video is recorded, it captures the surrounding environmental variables and how the driver reacted to that combination of variables. From that it learns how to behave (react) to a set of variables in and around the vehicle. No one tells it that the red/white sign is a stop sign or that the three lights hanging in the sky are a traffic light.

That is how training works. Structured training is by definition non-AI.
 