Waymo really needs a dedicated robotaxi vehicle. Since it looks like Cruise is done, maybe they can buy Origins from GM.
This might be getting too far ahead of 12.x, but Tesla will potentially need a new 13.x architecture to support many types of robotaxi vehicles. Sensors will most likely be in different places on an Origin-style robotaxi vs. Tesla S/X/3/Y vs. Cybertruck vs. Semi vs. the incoming Robotaxi / compact vehicle vs. potentially other automakers licensing Autopilot. Even HW3/HW4/HW5+ differences raise potential training concerns.

Tesla should have plenty of growing compute to specially train 12.x neural networks to support these other vehicles and hardware, but it might become training-data limited. If it is a shared network, hopefully training on, say, Semi data doesn't regress FSD for passenger vehicles, and maybe a new architecture will be needed to share common end-to-end learnings (e.g., how to stay in a lane) more efficiently even with very different inputs, to speed up deployment across many vehicles.
 
Google just announced their latest AI, Gemini 1.5, which can understand contexts of up to 10M tokens with >99.7% recall. I bring it up in this thread because I think it shows the potential of FSD V12's end-to-end approach: we are getting to the point where AI can process very large amounts of data with very high recall. So it seems to me we could see V12 reach 99%+ reliability in the not-too-distant future. It would take massive amounts of training data, but I think it would be doable. I don't know if that would be good enough for eyes-off, but if V12 could achieve 99% reliability everywhere, that would be an incredible eyes-on system!

 
I'm not clear what Waymo's business model is atm (or even whether they have one).

Exactly. I've been waiting for Waymo to say something about this. But we never even hear, "We see a clear path to profitability."

About a year ago, George Hotz argued that Waymo should be shut down to stop the massive cash burn.

Using some napkin math, Hotz claims:
Every year, GOOG spends $2.7B to make $631k. In order to just break even, assuming miraculously that operating costs stay the same, Waymo would need 4278x growth.
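
For what it's worth, the arithmetic behind that figure checks out (using Hotz's numbers, which I haven't audited):

```python
# Sanity check on Hotz's napkin math, using his figures (not audited numbers).
annual_spend = 2_700_000_000    # $2.7B spent per year
annual_revenue = 631_000        # $631k made per year

growth_needed = annual_spend / annual_revenue
print(f"{growth_needed:,.0f}x revenue growth to break even")
# -> 4,279x (Hotz truncates this to 4278x)
```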

This is why Tesla is trying to build a profitable robotaxi on the cheap. And if they do, it will be the end of Waymo. At that point, who else will be able to compete at all?
 
About a year ago, George Hotz argued that Waymo should be shut down to stop the massive cash burn. Using some napkin math, Hotz claims: [...]

Hotz is biased: he has his own company trying to do self-driving on the cheap, so he has a clear business interest in seeing Waymo fail. I would not put much stock in a competitor arguing that a rival should be shut down. Besides, it is up to investors and shareholders to shut down Waymo if they don't like the cash burn, not competitors; Hotz does not get a say. The only company with the power to shut down Waymo is Alphabet, and they have not shown any interest in doing that.

In fact, I saw an interview with the President of Alphabet who was very positive about Waymo. She also mentioned that Alphabet has over $100B in cash. So I don't think Alphabet is too worried about Waymo losing a couple billion.

Hotz's calculation is also very flawed since it assumes nothing changes at Waymo, and we know that won't be true. Hardware costs will come down. Operating costs will come down. Scaling will become more efficient as the Waymo Driver becomes more generalized and requires less validation. Waymo won't need 4278x growth to break even; it will likely get there well before that.

This is why Tesla is trying to build a profitable robotaxi on the cheap. And if they do it will be the end of Waymo. At that point, who else will be able to compete at all?

That is a big IF. Tesla needs to achieve reliable L4 first.
 
In fact, I saw an interview with the President of Alphabet who was very positive about Waymo. She also mentioned that Alphabet has over $100B in cash.
I basically agree with what you said, but I couldn't find the interview. Did she give any hints as to Waymo's financials?

The one thing I found is that there were reports of belt tightening in Alphabet's "other bets" division, which Waymo is part of.
 
I basically agree with what you said, but I couldn't find the interview. Did she give any hints as to Waymo's financials?

Here is the interview, queued to where she starts talking about Waymo. She does not talk about Waymo's financials; she just briefly mentions that she loves riding in Waymo and that they are very safe. The interview is mostly about other topics.


The one thing I found is that there were reports of belt tightening in Alphabet's "other bets" division, which Waymo is part of.

Yes, I heard those reports, and I think Waymo did face some layoffs. Waymo will definitely see some belt tightening. I am just saying that I don't think Alphabet will shut down Waymo completely.
 
About a year ago, George Hotz argued that Waymo should be shut down to stop the massive cash burn. [...] This is why Tesla is trying to build a profitable robotaxi on the cheap.

Yesterday Hotz said full autonomy is still 10+ years away. If so, TSLA is clearly on the wrong path, and its profitability today can only be had from a diminishing pool of crowdsourced suckers.
 
Basic NN question, sorry to hijack: how does a neural net actually work? So let's say it's trained on billions of videos so it "knows" what to do in most situations. What does that "knowledge" actually look like?

I mean, the NN can't just sit there and continuously look at all the videos all the time and choose something to do, that's impossible. It must have some kind of generalized understanding that it actually uses in different contexts. Can anyone probe the NN and really see what's actually going on, in the immediate functioning of the system? Can we look at its "understanding" (whatever form that really is) and see a finished form of the net result of all its training?

I guess another way of asking is -where- does it store -what- information, and how is that information processed from perception to action?

I know, go take a class in computer science, or at least google this, but y'all seem really knowledgeable about how it all works, so maybe someone can provide an executive summary for newbies?

My biggest problem is understanding how billions of frames of sequential images made up of pixels (video) can make an impression on some mysterious NN "program" that can be separated from the training and loaded into a pretty simple computer in the car.
 
Basic NN question, sorry to hijack: how does a neural net actually work? [...] I guess another way of asking is -where- does it store -what- information, and how is that information processed from perception to action?

My admittedly layman understanding is that NNs are black boxes: you cannot see what is inside them. The way a NN works is that it takes certain data in and statistically outputs some response. A simple example might be a NN trained to recognize cats: it takes images of animals as input and outputs the word "cat" when it thinks the image has a cat in it.

The "knowledge" of the NN is just a connection based on all the training that when it sees X, it should do Y.

In the case of FSD V12, Tesla is training the NN to take in video from the cameras as input and output a driving control, like turning the steering wheel x degrees or applying the brakes.

My biggest problem is understanding how billions of frames of sequential images made up of pixels (video) can make an impression on some mysterious NN "program" that can be separated from the training and loaded into a pretty simple computer in the car.

The training does not happen in the computer in the car. The computer in the car only gets the final NN, which takes input in and outputs something.
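
To make that concrete, here's a toy sketch (illustrative only; made-up weights and sizes, nothing like Tesla's actual network) of what "the final NN" amounts to on the car's computer:

```python
import numpy as np

# The "knowledge" is just these fixed numbers, learned offline during training.
W1 = np.array([[0.8, -0.3],
               [0.1,  0.9],
               [-0.5, 0.4]])        # layer 1 weights
b1 = np.array([0.05, -0.02])        # layer 1 biases
W2 = np.array([[1.2], [-0.7]])      # layer 2 weights

def drive(features):
    """One forward pass: camera-derived numbers in, a control number out."""
    hidden = np.maximum(0, features @ W1 + b1)  # weighted sums + ReLU
    return (hidden @ W2).item()                 # e.g., a steering request

print(drive(np.array([0.2, -0.1, 0.7])))  # three made-up input features
```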
 
Basic NN question, sorry to hijack: how does a neural net actually work? [...] My biggest problem is understanding how billions of frames of sequential images made up of pixels (video) can make an impression on some mysterious NN "program" that can be separated from the training and loaded into a pretty simple computer in the car.
All righty then. I've actually been through some training.

So, it all starts out with Real Neurons, like the ones in your brain. A given brain cell of the appropriate type (a neuron) has a large number of fibers that stretch far away from the cell body. If my memory is working OK today, these things tend to be one-way: that is, the "output" of a given cell nearly touches the "input" of another cell.

Now for the tricky bit. Say that, during some time interval, a number of these inputs get tickled by the outputs of other neurons. Each input has a "weight". If the zaps (nerves tend to run with discharge pulses that vary in rate, they're not continuous like, say, a steady tone or DC value), multiplied by the weights of each input and summed, are above some threshold, the output(s) of the cell then put out their own zap, which is picked up by a further layer of neurons and cells.

Now, a neural network device: each stage in such a device has individual cells with connections to cells in the previous stage, and each of those connections has a weight. And, just to add to the fun, in some designs the outputs of the final stage are fed back into the whole array, and the fed-back signals run into weights on the input cells of (usually) the first stage of all this.

All right. Set up the whole process for detecting Giraffes. Repetitively feed pixel images of Giraffes into a NN as above, starting with random weights on all the inputs, on all of the stages. Decide that one (or several) of the outputs will be Logic One (or a code) when a Giraffe is sighted. When one has a Giraffe and the code isn't present, change the weights and Do It Again until the code shows up. Keep this up with Giraffes right side up, upside down, coming at you, running away, grabbing leaves off of trees, and so on. Keep on varying the weights until, with the Giraffes present, you always get the code. For that matter, while you're doing this, check for not-a-giraffe: Make sure that the code is not present on a similarly large set of images where there's no Giraffe to be seen. Just keep on a-changing the weights so one has a solid, "There's a giraffe" and "there's no giraffe" with one's training sets.
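
In code, that "change the weights and Do It Again" loop is, at its smallest, the classic perceptron update. A one-cell sketch (real vision nets have millions of cells and fancier update rules, but the spirit is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=100)      # start with random weights

def cell(pixels):
    """The 'code' is 1 when the weighted sum clears the threshold."""
    return 1 if pixels @ weights > 0 else 0

def train_step(pixels, is_giraffe, lr=0.1):
    """Nudge the weights whenever the cell gets an image wrong."""
    global weights
    error = is_giraffe - cell(pixels)   # +1: missed giraffe, -1: false alarm
    weights += lr * error * pixels      # classic perceptron weight update
```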

This is where it gets weird. If one has trained up a NN like this and one shows it a picture of a giraffe in a forest, where maybe there's only a couple of spots to be seen, or an odd horn sticking out here or there - the NN finds it. NNs are Really Good at image recognition.

First time I heard of this approach was when taking a tour of the Remote Sensing lab at Purdue. These guys were taking pictures of the Earth with sensors tuned to specific wavelengths of light. They were using these to figure out how many acres of which crops were being planted, whether the crops were infected with something, and other socially and scientifically useful ways of predicting what the crop tonnages would be that year. So they were running something very like a neural net (except that it was actually FORTRAN image processing, back in the day, and a run took hours) for this detection.

The fun part was the previous users of the software systems had been looking for airplanes. And the crop-types were looking at the output of a run and saw Yea Many Acres of Corn, Yea many Acres of Wheat, Yea Many Acres of Soybeans, ... and three airplanes. They went back to these shot-from-space 50 mile by 50 mile images in Impossible Resolution and, yeah - there were three airplanes flying by at different locations and altitudes. The researchers thought it was pretty amusing.

And now we get into the engineering aspects of all this. Digital computers work on steps, one step at a time. (There might be multiple cores, but, still.) They can go ridiculously fast - but it's still one step at a time.

Analog computers (at least, the old, traditional types, which may or may not be electronic) aren't like that. They have continuous inputs and, critically, continuous outputs. Change an input? The output changes nearly immediately. Yeah, there are propagation delays (gear lash on Norden Bombsights, transistor propagation delays with capacitance), but these are small.

A single cell in a Neural Network is designed to real-time multiply and add up a bunch of inputs, all at once.

And an overall NN? It's a massively parallel analog computer that can, for example, look at all the pixels in an image at the same time and come up Giraffe. Or dog.

There's this phrase: "The right tool for the right job." NNs are Really Good at image recognition and much faster than a digital computer trying to do the same thing. Our brains (which, admittedly, aren't that close to the NNs used in Teslas) have Issues with, say, multiplying 18-digit numbers, which a digital computer handles easily. But they're pretty blasted good at spotting that Lion in the Weeds.

Now, the people at Tesla are the experts at how to use NNs. If they can shape the problem to be solved to something that can be handled by a NN, the advantage is typically in speed.

And I guess that this is the point: V11 and earlier used the NNs in the computer to identify objects around the car; C++ code or whatever took those objects and synthesized a way of driving the car through the real world. The change with V12 is to take that C++ code and swap it out for a NN approach... and that's where the speedup kicks in.
 
The training does not happen in the computer in the car. The computer in the car only gets the final NN, which takes input in and outputs something.
That last sentence is my big question. What's actually in the car? What does it look like? There have to be 0s and 1s at some level somewhere; that's all a computer can do, deal with 0s and 1s. I'm just super curious how connections between nodes can equal instructions for perception/outputs.
 
That last sentence is my big question. What's actually in the car? What does it look like? There have to be 0s and 1s at some level somewhere; that's all a computer can do, deal with 0s and 1s. I'm just super curious how connections between nodes can equal instructions for perception/outputs.

The only thing in the car is the final "code" inside the NN: the learned weights. The computer in the car executes that, running inputs through the weights to get outputs.
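
A toy sketch of the split (file name and sizes made up; Tesla's actual format is certainly different):

```python
import numpy as np

# Training side (data center): after training, save only the learned weights.
trained = np.random.default_rng(1).normal(size=(3, 2))  # stand-in for training
np.save("model_weights.npy", trained)   # hypothetical file of plain floats

# Car side: load the numbers and run inputs through them. No videos, no
# training code -- just arrays of floats, which are 0s and 1s on disk.
weights = np.load("model_weights.npy")
camera_features = np.array([0.2, -0.1, 0.7])   # made-up input
print(camera_features @ weights)               # the "instructions" in action
```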
 
All righty then. I've actually been through some training. [...] The change with V12 is to take that C++ code and swap it out for a NN approach... and that's where the speedup kicks in.
OK, so if I might summarize what you said back to you: straight up magic.

Seriously, thanks for that explanation. I kinda start to imagine that I've seen similar ideas looking at 3D images of brain neurons lighting up based on similar visual inputs.

I also vaguely remember reading that the Google Translate NNs had created a rudimentary middle language on their own, in secret, which the researchers working on the system later discovered. At this point, I want to watch a three-part Nova special on how NNs work.

One last thing I'd still like to see: how is a NN actually compiled in a computer, and what's it actually doing? How are nodes and connections and weights implemented with 0s and 1s?

Maybe when I retire in 10 years I'll go study computer science for a hobby. This stuff is super cool to think about.
 
How is a NN actually compiled in a computer, and what's it actually doing? How are nodes and connections and weights implemented with 0s and 1s?

The weights of the network are essentially just large matrices, and the act of running an input through the network to achieve an output is mostly matrix multiplication.
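
For example (sizes made up), two layers of a net are literally just this:

```python
import numpy as np

x  = np.random.rand(1, 512)    # one input with 512 features
W1 = np.random.rand(512, 256)  # layer 1 weights: a 512x256 matrix
W2 = np.random.rand(256, 10)   # layer 2 weights: a 256x10 matrix

h = np.maximum(0, x @ W1)      # matrix multiply, then ReLU nonlinearity
y = h @ W2                     # matrix multiply again -> 10 output numbers
```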

I would recommend anyone interested check out Karpathy's video lecture series. He starts from square one, and ends up demonstrating how to build networks like GPT from scratch: Neural Networks: Zero To Hero
 
How is a NN actually compiled in a computer, and what's it actually doing? How are nodes and connections and weights implemented with 0s and 1s?
Um. Individual cells in a Neural Network are analog computers.

Say that a cell has 100 connections to cells in the previous stage. Each input to a cell has a weight on it; think: a digital value, a digital-to-analog converter, and the analog value thus generated going into an analog multiplier.

So, for this hypothetical scenario: 100 inputs, some at zero, some at one, some at some intermediate value; each input gets multiplied by a weight from the DAC; then the sum of the resulting 100 products goes up against a comparator (whose other input probably comes from Yet Another DAC); if the summed value is bigger than yea, then the output of the cell goes to 1; else, it's 0.
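
Simulated in software, just to pin down the math being described (made-up values, obviously not the actual silicon):

```python
import numpy as np

rng = np.random.default_rng(2)
inputs = rng.random(100)           # 100 analog-ish values between 0 and 1
weights = rng.normal(size=100)     # what the DACs would be programmed with
reference = 0.5                    # the comparator's other input

summed = np.sum(inputs * weights)  # multiply each input by its weight, add up
output = 1 if summed > reference else 0
print(output)
```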

This is purpose-built hardware; it doesn't resemble a digital CPU without a lot of squinting. Typically, one applies a zillion inputs on one clock, waits out the propagation delay (which is short, this is all analog), samples the outputs, then loads in the next set of weights. The loading of ye weights is probably Extremely Pipelined.

I'm guessing, since I'm not really somebody who's ever worked on these things, but one can see how it goes: These NN's probably have huge numbers of inputs and outputs and can do multiple jobs at the same time: One looking for dogs, another for humans, another for horses, cars, and on and on.

Heh. Sort of reminds me of using ROM to do complicated logic functions. Say one has a ROM chip with, I dunno, 16 address bits and 8 output bits. Some of the data bits can be turned around and fed into some of the address bits, or all the address bits can be input data. One can write up a bunch of boolean equations that would max out a Xilinx device and simply use them to program the ROM, so long as the number of inputs is no more than 16 (minus whatever feedback one has in mind). On one ROM read cycle, then, one gets 8 outputs, and those 8 outputs can be any function of the 16 address bits. Handy for nasty-looking state machines, and it runs at ROM cycle speeds, which can be fast.
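
A miniature of that ROM trick, with 4 address bits instead of 16 (the boolean equation is invented for illustration):

```python
# Precompute a boolean function of the address bits into a lookup table.
rom = []
for addr in range(16):
    a, b, c, d = [(addr >> i) & 1 for i in range(4)]
    rom.append((a & b) | (c ^ d))   # some nasty-looking boolean equation

# "Run time": the whole equation evaluates in a single ROM read.
print(rom[0b1010])  # -> 1
```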
 
I would recommend anyone interested check out Karpathy's video lecture series. He starts from square one, and ends up demonstrating how to build networks like GPT from scratch: Neural Networks: Zero To Hero
Here's another resource: 3Blue1Brown. It's a four-part series where he walks through how neural networks work, without actually getting into implementation details.

Some people here disagree with me when I say V12 is magical
It's sufficiently advanced technology.
 
I am not sure we can say that. Both Tesla and Elon have used the SAE levels. Tesla referenced the SAE levels in their communication with the CA DMV. And I think Elon has referenced the SAE levels in earnings calls or on X.
I have not heard that, but I am pretty sure Elon basically said something along the lines of "it is similar to L2 or L3" without officially categorizing FSD at a specific level.
 