FSD AP improvements in upcoming v11 from Lex Fridman interview


Cosmacelf

Well-Known Member
Supporting Member
Mar 6, 2013
12,699
46,797
San Diego
The recent Lex Fridman interview of Elon had some interesting tidbits about the next major FSD beta version (https://www.youtube.com/watch?v=DxREm3s1scA).

First, something simple. Tesla's AI inference chip (each car has two of them) has an image signal processor that processes each image frame from the eight cameras, and the resulting processed image is what the car's neural net then sees, and more importantly, what it was trained on. This processing is similar to what any digital camera does. The unprocessed image looks like a raw image file (for the photo buffs) and isn't very useful to a human, since a human really can't see much in it.

But that initial processing takes a huge 13 milliseconds. For a system that runs at a 27ms frame rate, that's like half your time budget. Moreover, the processed image has a lot less data in it than the raw image (which is basically photon counts). In particular, the raw image would allow a computer to see much better in very low light situations than a processed image would.
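
To put some rough numbers on that (the 12-bit raw depth and 8-bit processed depth below are my own illustrative assumptions, not confirmed specs):

```python
# Back-of-envelope numbers behind the ISP-bypass argument.
# Assumptions (mine, for illustration only): raw sensor data is ~12-bit
# photon counts per photosite; the ISP output is 8 bits per channel.

frame_period_ms = 27.0   # the stated frame-rate budget
isp_time_ms = 13.0       # ISP time for all 8 cameras combined

print(f"ISP share of the frame budget: {isp_time_ms / frame_period_ms:.0%}")  # ~48%

raw_levels = 2 ** 12        # distinct intensity levels in a 12-bit raw value
processed_levels = 2 ** 8   # levels per channel after tone mapping
print(f"Intensity levels per pixel: raw {raw_levels} vs processed {processed_levels}")
```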

So, Tesla is bypassing the image processor entirely in V11 and the neural net will work directly on the raw, photon-count images. Doing this means Tesla will have to completely retrain its neural nets from scratch since the input is so different. This is yet another example of "the best part is no part" thinking and should offer huge improvements, since they will now have more accurate input and more time for other kinds of processing. BTW, the top left of this image of the Tesla inference chip is what they are bypassing. It isn't a complete waste since they still need the processed images for Sentry Mode and whatnot, but it won't be part of the critical FSD time loop.

[Image: Tesla FSD inference chip]

v11 will also push more of what the C code does into the neural net. Currently, the neural net outputs what Elon called "a giant bag of points" that is labeled. Then C code turns that into vector space, which is a 3D representation of the world outside the car. V11 will expand the neural net so that it produces the vector space itself, leaving the C code with only the planning and driving portions. Presumably this will be both more accurate and faster.
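
As a toy illustration of that shift (the names, shapes, and clustering logic below are made up for clarity; this is not Tesla's actual code):

```python
import numpy as np

# Old arrangement (conceptually): the network emits a labeled "bag of points",
# and hand-written post-processing (C code in the real system) turns it into
# vector-space objects.
def postprocess_bag_of_points(points, labels):
    """Crude stand-in: group labeled 3D points into per-object boxes."""
    objects = []
    for label in np.unique(labels):
        pts = points[labels == label]
        objects.append({
            "label": int(label),
            "center": pts.mean(axis=0),                   # rough object position
            "extent": pts.max(axis=0) - pts.min(axis=0),  # rough object size
        })
    return objects

# New arrangement (conceptually): a bigger network head regresses those
# vector-space objects directly, so the remaining C code only plans and drives.

# Toy usage of the old path:
points = np.random.rand(200, 3) * 50.0        # fake 3D points, in meters
labels = np.random.randint(0, 3, size=200)    # fake class labels
print(postprocess_bag_of_points(points, labels)[0])
```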

Not all the neural nets in the car use the surround video pipeline yet. Some still process perception camera by camera, so that’s getting addressed as well.
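
In rough code terms, the difference looks something like this (purely illustrative function names and shapes, and the time/history dimension is omitted):

```python
import numpy as np

def extract_features(frame):
    """Stand-in for a per-camera backbone; here it's just a channel mean."""
    return frame.mean(axis=-1)

def per_camera_perception(frames):
    # Older style: each camera is processed on its own, and the results have
    # to be stitched together and deduplicated afterwards.
    return [extract_features(f) for f in frames]

def surround_perception(frames):
    # Surround style: features from all 8 cameras are fused into one shared
    # representation before anything is detected.
    return np.stack([extract_features(f) for f in frames]).sum(axis=0)

frames = [np.random.rand(96, 128, 3) for _ in range(8)]  # 8 fake camera frames
print(len(per_camera_perception(frames)), surround_perception(frames).shape)
```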

In the end, lines of code will actually drop with this release.

Other interesting things: they have their own custom C compiler that generates machine code for the specific CPUs and GPUs on the AI chip, and all the hardcore, time-sensitive code is written in C.

So, that’s the info Elon told us from the interview. As I wrote in my last deep dive about FSD (Layman's Explanation of Tesla AI Day), Tesla has a lot of optimizations it can still do, and this is an example of a couple of them.
 
Don’t worry about people fussing about whether you have posted this in the correct thread … this is a brilliant plain English explanation of that part of the interview. Keep posting wherever and whenever you have more insights to share!
 
Doing this means Tesla will have to completely retrain its neural nets from scratch since the input is so different.
This is the most interesting part. How long to retrain? Did they already collect raw pictures from the cars? Do they have to label all again? Or do they already have the training set ready to go? It sounds like the n'th rewrite...

And we all know raw files from our DSLRs are giant compared to any compressed format. Do the cars have enough bandwidth for this, and enough compute power? Do they plan on compressing the raw data?
 
This is the most interesting part. How long to retrain? Did they already collect raw pictures from the cars? Do they have to label all again? Or do they already have the training set ready to go? It sounds like the n'th rewrite...

Great questions. Presumably you can go backwards from processed images and recreate a raw image. It won't be 100% the same, but it might be enough for training. Going forward they can grab raw images (and probably have been for a while).
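
For what it's worth, "going backwards" from processed images is a known trick in the imaging literature (sometimes called unprocessing); here's a very rough sketch of the idea, with made-up constants and none of the real tone-mapping details:

```python
import numpy as np

def approx_unprocess(rgb8, gamma=2.2, white_balance=(2.0, 1.0, 1.6)):
    """Very rough inverse-ISP sketch: undo gamma and white balance, then
    re-mosaic to an RGGB Bayer pattern. All constants are illustrative."""
    img = (rgb8.astype(np.float64) / 255.0) ** gamma   # undo the gamma curve
    img = img / np.array(white_balance)                # undo white-balance gains
    h, w, _ = img.shape
    bayer = np.zeros((h, w))
    bayer[0::2, 0::2] = img[0::2, 0::2, 0]   # R
    bayer[0::2, 1::2] = img[0::2, 1::2, 1]   # G
    bayer[1::2, 0::2] = img[1::2, 0::2, 1]   # G
    bayer[1::2, 1::2] = img[1::2, 1::2, 2]   # B
    return (bayer * 4095).astype(np.uint16)  # pretend 12-bit photon counts

fake_processed = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(approx_unprocess(fake_processed).dtype, approx_unprocess(fake_processed).max())
```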

And we all know raw files from our DSLRs are giant compared to any compressed format. Do the cars have enough bandwidth for this, and enough compute power? Do they plan on compressing the raw data?

Note that for transmission purposes, you can still compress raw images. But yes, there is a lot of devil in the details that Elon didn’t tell us!
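
For example, even plain lossless compression shrinks raw-ish data noticeably; a toy check (nothing Tesla-specific, and real scenes with spatial structure compress better than this synthetic noise):

```python
import zlib
import numpy as np

# Fake 12-bit raw frame stored in 16-bit words. Poisson noise around a mean
# of 200 "photons" is a pessimistic stand-in for real image data.
raw = np.random.poisson(200, size=(960, 1280)).clip(0, 4095).astype(np.uint16)
packed = raw.tobytes()
compressed = zlib.compress(packed, level=6)
print(f"{len(packed) / 1e6:.2f} MB -> {len(compressed) / 1e6:.2f} MB, lossless")
```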
 
I got the distinct impression from Elon in the Lex Fridman interview that we can expect v11 to be "worse" before things get better. As I said in another thread, I was quite surprised because I thought a lot of this (e.g., the 8-camera surround video and neural networks for vector-space generation) was already done in the current "pure vision" stack.
 
I hate pinned threads. They're mostly useless, stale info and come across like spam to me. The good stuff is below. I prefer non-consolidated threads with accurate subjects.
I prefer well-kept and up-to-date pinned threads that people actually use, like the market thread in the investors forum.

Just look at all the FSD Beta threads - several of them for each release.

In this case, the whole is more than the sum of its parts.

PS: Yes, they do need to be kept updated and cleaned up, like the one below. I use it quite a bit (well, I used to, back when I followed deliveries more closely).

 
@EVNow, the reason I didn't use your pinned thread is that most pinned threads contain stale info (like @Terminator857 said). Yours may not, but how would I know that, since the rest of TMC isn't like that? Also, your pinned thread has two subjects, one of which was anticipation, so it wasn't relevant. That's why I never even looked at it. To each his own, but now you know why.
 
Small clarification: the 13ms is for all 8 cameras combined, something like ~1.5ms per camera, FYI. Elon makes it sound like it's per camera at first, but his next couple of sentences sort of clarify it. I'm also not entirely swallowing his great night vision statement, because he's not accounting for sensor noise. It's not like the sensors produce nothing when no photons are hitting them. I'm not saying it's not better than expected, just that he doesn't fill in the details.
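
To put a number on the noise concern: photon arrival is itself Poisson-distributed, so SNR falls off quickly at low light even before read noise is added (rough illustrative figures, not Tesla sensor specs):

```python
import math

# Photon shot noise: collecting N photons gives noise ~ sqrt(N), so the best
# possible SNR is ~ sqrt(N). Read noise (in electrons) hurts dim pixels more.
read_noise_e = 3.0  # illustrative read noise, electrons RMS

for photons in (10000, 1000, 100, 10):
    snr = photons / math.sqrt(photons + read_noise_e ** 2)
    print(f"{photons:>6} photons -> SNR ~ {snr:.1f}")
```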
 
Small clarification: the 13ms is for all 8 cameras combined, something like ~1.5ms per camera, FYI. Elon makes it sound like it's per camera at first, but his next couple of sentences sort of clarify it. I'm also not entirely swallowing his great night vision statement, because he's not accounting for sensor noise. It's not like the sensors produce nothing when no photons are hitting them. I'm not saying it's not better than expected, just that he doesn't fill in the details.

Quite right on the 13ms being combined. And yes, sensor noise is a thing. But neural nets can mitigate that in rather interesting ways. Past knowledge allows the neural net to be more confident about a noisy image. For instance, if a car off in the distance passes under a streetlight and then is in shadows as it comes towards you, the neural net can merge the earlier confident prediction of a car with the later noisy images and be fairly confident that the noisy block is still a car (pretend the car has no headlights, or that you're looking at a bicyclist).
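
Here's a toy version of that "carry confidence forward through noisy frames" idea (a simple recursive blend I made up for illustration; not how Tesla actually does it):

```python
# Toy temporal fusion: blend the previous belief that "there's a car there"
# with each new frame's detection score, trusting clear frames more than
# dark, noisy ones.

def update_belief(prior, frame_score, frame_quality):
    """Weighted blend of the prior belief and the current frame's score."""
    return (1.0 - frame_quality) * prior + frame_quality * frame_score

belief = 0.0
# (score, quality): car clearly seen under the streetlight, then noisy shadows.
observations = [(0.95, 0.9), (0.97, 0.9), (0.30, 0.1), (0.25, 0.1), (0.35, 0.1)]
for t, (score, quality) in enumerate(observations, start=1):
    belief = update_belief(belief, score, quality)
    print(f"frame {t}: score {score:.2f}, quality {quality:.1f} -> belief {belief:.2f}")
```

The belief stays high through the noisy frames because the earlier, clear sightings carry most of the weight.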

Little side story that happened to me. I was fumbling in the dark trying to find the keyhole for my key. All I could see was typical sensor noise, nothing distinct. By accident, my key fell into the keyhole and at that moment, all of a sudden I could “see” the key and keyhole. My brain used the extra information of the fact that there must be a key and keyhole where my hand was to disambiguate my vision sensor noise, and voila I could literally see the keyhole whereas 1 second before I could not.
 
Great, sounds like 20 steps back, one step forward to me.
I think that's the nature of "foundational rewrites" in a machine learning system. The idea is that the performance after multiple iterations of v11 will be better than what could be achieved through additional iterations of v10.x. I don't expect to see a 10.9 at this point. If they are already saying that v11 is "fundamentally" different and requires retraining with "8-camera surround video," then what would be the point in putting out another version on the now defunct vision stack?
 
I think that's the nature of "foundational rewrites" in a machine learning system. The idea is that the performance after multiple iterations of v11 will be better than what could be achieved through additional iterations of v10.x. I don't expect to see a 10.9 at this point. If they are already saying that v11 is "fundamentally" different and requires retraining with "8-camera surround video," then what would be the point in putting out another version on the now defunct vision stack?

That's what happened with normal FSD: the rewrite started and we got a year of no updates. I hope they are further along this time, and that they don't put everyone on that task. We could still use some updates on the current builds.
 
And we all know raw files from our DSLRs are giant compared to any compressed format. Do the cars have enough bandwidth for this, and enough compute power? Do they plan on compressing the raw data?

I just realized I didn’t do a great job of explaining this above. The current neural net does not process a compressed image (like a jpeg image), it works on the uncompressed output of the image signal processor. So processing raw would maybe add 1 or 2 more bits per channel (RGB) to process.
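
To illustrate why raw doesn't have to mean DSLR-sized files inside the car (the 1280x960 resolution and 12-bit depth below are my assumptions, not actual camera specs, and it depends on how the Bayer mosaic is fed to the net):

```python
# Per-frame data for one camera under assumed numbers (not actual specs):
# 1280x960 sensor, 12-bit raw Bayer (one value per photosite) versus the
# ISP's 8-bit-per-channel RGB output.

w, h = 1280, 960
raw_bytes = w * h * 12 / 8   # one 12-bit sample per pixel
rgb_bytes = w * h * 3        # three 8-bit channels per pixel
print(f"raw Bayer frame: {raw_bytes / 1e6:.2f} MB")
print(f"ISP RGB frame:   {rgb_bytes / 1e6:.2f} MB")
```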
 
That's what happened with normal FSD: the rewrite started and we got a year of no updates. I hope they are further along this time, and that they don't put everyone on that task. We could still use some updates on the current builds.
Sounds like they are very close to what will be v11 already. However, as far as I can tell, they are still working to finish implementing the overall 8-camera vision system and hydranets that Karpathy talked about like two or three years ago. I have always been skeptical of how far FSD can go since I started driving on EAP/NOA in 2018, but I am still surprised that these "foundational rewrites" are taking so long. And now, yet again, Elon is predicting they won't be done until the end of the year (2022), which makes me think they still aren't close to "complete," whatever that means. At some point I imagine they will accept the limitations of the current sensor suite, pop a bottle of champagne and call it "done," and then move on to HW4 or HW5 with new sensors that will be the base of the L5/Robotaxi that Elon has been promising. Maybe by the end of 2023? ;)
 
Sounds like they are very close to what will be v11 already. However, as far as I can tell, they are still working to finish implementing the overall 8-camera vision system and hydranets that Karpathy talked about like two or three years ago. I have always been skeptical of how far FSD can go since I started driving on EAP/NOA in 2018, but I am still surprised that these "foundational rewrites" are taking so long. And now, yet again, Elon is predicting they won't be done until the end of the year (2022), which makes me think they still aren't close to "complete," whatever that means. At some point I imagine they will accept the limitations of the current sensor suite, pop a bottle of champagne and call it "done," and then move on to HW4 or HW5 with new sensors that will be the base of the L5/Robotaxi that Elon has been promising. Maybe by the end of 2023? ;)

Yes. Unlike Elon, I always thought it was going to take a long time. I based this on my knowledge of what neural nets, especially the kind Tesla is using, are capable of. They will get there, but it may take another Tesla AI chip version and, most likely, a better camera suite. The current system has a hard time seeing 90 degrees left and right for, really, any kind of intersection.
 
I just realized I didn’t do a great job of explaining this above. The current neural net does not process a compressed image (like a jpeg image), it works on the uncompressed output of the image signal processor. So processing raw would maybe add 1 or 2 more bits per channel (RGB) to process.
Ok, I am confused now. You wrote:
First, something simple. Tesla's AI inference chip (each car has two of them) has an image signal processor that processes each image frame from the eight cameras, and the resulting processed image is what the car's neural net then sees, and more importantly, what it was trained on. This processing is similar to what any digital camera does. The unprocessed image looks like a raw image file (for the photo buffs) and isn't very useful to a human, since a human really can't see much in it.

But that initial processing takes a huge 13 milliseconds. For a system that runs at a 27ms frame rate, that's like half your time budget. Moreover, the processed image has a lot less data in it than the raw image (which is basically photon counts). In particular, the raw image would allow a computer to see much better in very low light situations than a processed image would.
To me, this sounds very much like compressing the raw image data before NN interpretation.
So with this new architecture, what really happens? The "photon to NN" line sounds mostly like some salesman mumbo jumbo.

Does a decent technical explanation exist somewhere?