FSD AP improvements in upcoming v11 from Lex Fridman interview

Ok, I am confused now. You wrote:
to me, this sounds very much like compressing the raw image data, before NN interpretation.
So with this new architecture, what really happens? The "photon to NN" phrasing sounds mostly like salesman mumbo jumbo.

Does a decent technical explanation exist somewhere?
You can look at some Karpathy talks on YouTube, but they don't specifically address the image pre-processing issue. From what I have read, the vision stack comprises multiple levels, including multiple NNs and conventional image pre-processing. This includes stitching together the 8-camera images into "surround vision" while correcting for the varying focal lengths, color response, light intensity levels, etc. across the cameras. There is some image compression as well, although not the kind we think of when talking about JPEG files and such - more the kind of compression used when preparing images for processing by neural networks. In the LF interview, Elon made it sound like a lot of this was really only needed for a human to perceive information from the camera images, and was wasted effort for neural networks. It sounds to me like what he is saying is to just feed the outputs from the CMOS sensors in the 8 cameras as inputs into the first-level neural networks (thus "photons to NN") and cut out much of this image pre-processing. I imagine there will still have to be SOME pre-processing and/or compression of the images to at least account for different camera types across the fleet, camera calibration, and the like. Or maybe camera type and calibration information can just be provided as additional inputs to the NN.
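To make the distinction concrete, here's a rough sketch (PyTorch, with made-up shapes, bit depths, and a toy stem layer, not Tesla's actual stack) of the difference between feeding ISP-processed 8-bit RGB versus lightly scaled raw sensor counts into the first layer of a network:

```python
# Hypothetical sketch: ISP-processed images vs. raw sensor counts as network input.
# Shapes, bit depths, and the tiny stem layers are illustrative assumptions.
import torch
import torch.nn as nn

# Conventional path: ISP output, 8-bit RGB normalized to [0, 1]
isp_frame = torch.randint(0, 256, (1, 3, 960, 1280), dtype=torch.uint8)
isp_input = isp_frame.float() / 255.0

# "Photons to NN" path: raw sensor counts (e.g., 10-bit), minimal processing
raw_frame = torch.randint(0, 1024, (1, 1, 960, 1280), dtype=torch.int16)
raw_input = raw_frame.float() / 1023.0  # simple scaling; no demosaic, tone mapping, etc.

# The first conv layer just needs to match the channel count of whatever it is fed.
stem_rgb = nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2)
stem_raw = nn.Conv2d(1, 32, kernel_size=5, stride=2, padding=2)

print(stem_rgb(isp_input).shape, stem_raw(raw_input).shape)
```

The point is just that the first layer doesn't care whether the numbers came through an ISP or straight off the sensor; only the channel count has to match.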
 
Tesla is bypassing the image processor entirely in V11 and the neural net will use just the raw photon-count images
Is this different from what was released in 10.8 ("Improved photon-to-control vehicle response latency by 20% on average")? Maybe it was intended for 11 but got cherry-picked into 10.x. Do we know when the interview happened? Lex did say "10.6 just came out recently" and "10.7 is on the way," so assuming 10.6.1 hadn't even been released yet, this interview potentially happened in early December, with enough time to get the raw-photon behavior into the holiday update.

In terms of neural network training, different layers and connections can be kept relatively static to train specific portions of the neural networks. Karpathy has talked about this multiple times in terms of fine tuning specific outputs while keeping the main trunk stable to reduce unexpected regressions. For example, some papers have described only needing to retrain the last layer to completely change the type of output as the internal layers have the core understanding of the problem, so similarly the earlier layers could be reworked and retrained to feed into the core understanding of lane lines, etc. Sounds like the networks periodically get fully retrained anyway to better propagate the signals through these boundaries.
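As a rough illustration of that freeze-the-trunk idea (a toy PyTorch sketch with invented module names, not Tesla's training code):

```python
# Minimal sketch of "keep the trunk stable, retrain one head".
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
lane_head = nn.Linear(16, 8)   # hypothetical lane-geometry output

# Freeze the shared trunk so training the head can't cause regressions elsewhere.
for p in trunk.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(lane_head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

images = torch.rand(4, 3, 64, 64)
targets = torch.rand(4, 8)

with torch.no_grad():            # trunk features are fixed
    features = trunk(images)
loss = loss_fn(lane_head(features), targets)
loss.backward()                  # gradients flow only into the head
optimizer.step()
```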
 
From watching the Lex interview with Elon, I get the sense that Elon really believes that humans drive by constructing 3D vector spaces in our brains. It's just as crazy as the suggestion that you need to shoot lasers out of your head in order to see obstacles and drive. I can't speak for everyone but I'm pretty sure we don't drive like FSD and draw bounding boxes around every car and pedestrian and lane marker, and then do a tree search for a steering trajectory. When you drive you just drive, humans don't have any comprehensible feature set in-between perception and steering wheel output. When people crash, they don't even really know what happened, they just make up a story after the fact.
 
Well, our brains have to have a mental model of the world around us, or else we'd be banging into things constantly. When we perceive walls and furniture, our brains have to construct a 3D mental model of it, which is then used for motor functions. You are correct that the brain is a lot more interconnected than Tesla's current (or even V11) neural net, in that perception, planning and movement are all intertwined. But our brains essentially do the same thing that Tesla's perception NN does. Tesla does not have NNs for planning and motor functions ... yet. They are working on that to get to a fully NN-integrated driving function.

Also, don't confuse training with inference. When you say bounding boxes, that's what is used to train a neural net. But that isn't what the car's NN produces. The V11 NN produces an abstract 3D representation of the labeled world, which is what our brains do too, only more sophisticated.
 
I think it's obvious that the abstraction of the world that the brain builds internally is far beyond what FSD is doing, and if you tried to reverse engineer the brain's internal representations while driving then it would be utterly incomprehensible, it definitely wouldn't look anything like FSD's 3D world. This is because the abstractions that you use to drive are learned and not explicitly chosen by Tesla engineers so they're not going to look pretty and fit into a bulleted list of objects. It's probably not even going to be in Cartesian space.

Tesla has dozens/hundreds of models that attempt to answer questions like: where are all the cars in X,Y,Z coordinates and their corresponding width/height/depth (that's what a bounding box is!)? Where are the pedestrians in X,Y,Z, where are the traffic cones, where are the signs, the traffic lights, the lane lines, the speed bumps, and so on. I.e., an explicit list of feature/object detectors. What they end up with is what you see rendered on the screen, a 3D scene of objects like a video game. Then they take that and feed it into a quasi-learned/hand-coded C++ planner system that produces the lateral and longitudinal control of the car. Every version, Tesla keeps revving these models over and over, "turning the data crank": they make the cone detector better, they make the velocity estimation better, they add more clips to the lane predictor, etc. But ultimately, the list of tasks they've settled on is not learned; it was explicitly chosen by engineers. This is not how humans drive. As long as the feature detectors are explicitly defined by engineers, it will never be as robust as you can be. It might be able to get pretty good, maybe even tolerable to use, but it will eventually hit some local maximum. It's definitely not going to lead to a cooking and cleaning Tesla Bot that way.

It's just like in the early days of computer vision, before deep learning: if you wanted to find a car in an image, you couldn't just train a model and ask it to tell you where the cars are; it wouldn't scale. You had to hand-select different feature detectors for the parts of a car: a detector that looks for the shape of the wheels, the tail lights maybe, the geometry of the hood/trunk. Today when we want to make a model to detect cars, we don't hand-choose features anymore; we train it on entire cars and then ask it where the cars are. If you peered into the network layers, it may have automatically built detectors for tail lights and wheels, and probably a lot of incomprehensible things, weird figments of cars and edges and shapes. But it'll be far more robust than doing it by hand. In a way, systems like Tesla's FSD are just repeating the same mistake of the past, just on a larger scale. They aren't asking the models, where would you drive given this image? They are asking it for hand-selected features, where are the cars, the lanes, the signs, etc. Ultimately I think that approach is just doomed to be thrown out in time, and is only being built now to satisfy this business need to ship an intermediate product before Tesla tosses it for a more end-to-end approach that more closely imitates human driving behavior.
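To make the contrast concrete, here's a toy sketch of the two approaches being described; the task list, shapes, and planner rule are all invented for illustration:

```python
# Sketch: explicit per-task heads feeding a hand-coded planner, vs. a single
# end-to-end head that outputs controls directly. Everything here is assumed.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

# "Feature detector" route: engineers choose the task list.
heads = nn.ModuleDict({
    "vehicles":   nn.Linear(16, 4),   # e.g., nearest box (x, y, w, h)
    "lane_lines": nn.Linear(16, 6),   # e.g., polynomial coefficients
    "signs":      nn.Linear(16, 3),   # e.g., class logits
})

def hand_coded_planner(outputs):
    # Stand-in for the hand-coded planner: hard rules over the detected features.
    steer = outputs["lane_lines"][:, 0]          # follow the lane
    brake = (outputs["vehicles"][:, 3] > 0.5)    # brake if the lead box looks close
    return steer, brake

# End-to-end route: the network is asked "where would you drive?" directly.
control_head = nn.Linear(16, 2)                  # (steering, acceleration)

features = backbone(torch.rand(1, 3, 64, 64))
steer, brake = hand_coded_planner({k: h(features) for k, h in heads.items()})
control = control_head(features)
```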
 
Your understanding of how the Tesla NN works now isn't quite correct, but whatever. You are correct in that they need to evolve into an end-to-end NN. Tesla does know this (now). Elon's stupid timelines and, frankly, his lack of understanding have been part of the problem. Tesla has been forced to deliver a piecemeal solution just to make deadlines. Elon himself finally acknowledged that the v10 NN has "probably" hit a local maximum. V11 will hit one too. End-to-end NN is definitely where they need to go.

IMHO, that still won't be the end though. Even an end-to-end NN will approach the same road on its 100th trip as if it has never seen it before. When you or I drive a mountain road we've never driven before, we notice that the locals drive it much faster. That's because they literally remember every little jiggle and hidden intersection in the road. Tesla does not use a continuous-learning NN, so this will never be possible for them. It isn't a knock against Tesla per se, because NO ONE uses a continuous-learning NN; they haven't been invented for production use yet.
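To illustrate that point about the 100th trip (toy model, obviously not any production system): a deployed, non-continual-learning network is stateless across trips, because nothing about the drive updates its weights.

```python
# Identical input gives identical output on trip 1 and trip 100,
# because nothing learned during the drive changes the weights.
import torch
import torch.nn as nn

policy = nn.Linear(10, 2)
policy.eval()                      # inference mode

road_features = torch.rand(1, 10)  # stand-in for "the same mountain road"

with torch.no_grad():              # no gradients, no weight updates while driving
    first_trip = policy(road_features)
    hundredth_trip = policy(road_features)

assert torch.equal(first_trip, hundredth_trip)  # the road is "new" every time
```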
 
An end-to-end NN is a black box. It can't be used in real-world driving; demos, yes.
 
It's not interpretable. When it fails, you don't know how it failed, why it failed, or how to fix it.
Well, only if you built it the way an academic paper would build it. Give the Tesla AI team some credit: they'll build it with lots of instrumentation. The same thing could be said of their perception NN, but they instrument it to see what the AI sees. Likewise, they'll be able to interrogate the planning and driving portions to see what they're doing.

I mean, the current FSD beta makes bad decisions, it can’t get any worse 😄
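For what that "instrumentation" might look like in practice, here's a minimal sketch using PyTorch forward hooks to capture intermediate activations for visualization, even when the final output is just controls. The model and layer names are made up, not Tesla's tooling.

```python
# Register hooks that capture intermediate activations so they can be inspected later.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),              # end-to-end output: (steering, acceleration)
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

model[0].register_forward_hook(save_activation("early_conv"))
model[3].register_forward_hook(save_activation("embedding"))

controls = model(torch.rand(1, 3, 64, 64))
print(controls, captured["early_conv"].shape, captured["embedding"].shape)
```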
 
Care to expand on that? I’m not saying you’re wrong, but would like to hear more about it.
Isn't that the nature of the NN? You feed in stimuli and backpropagate the right answers, and hopefully the mistakes get smaller, but you really don't know what happens inside.

I think driving decisions need to be based on traditional code, so that you know what happens inside.

Disclaimer: I'm a layman.
 
Isn't that the nature of the NN? You feed in stimuli and backpropagate the right answers, and hopefully the mistakes get smaller, but you really don't know what happens inside.
Tesla's neural network approach seems to have a lot of training targets, including both intermediate and aggregate features, e.g., the main camera should predict a vehicle ahead, and the birds-eye view fusing the fisheye camera should also predict the vehicle ahead. So one can indeed add training data for situations it gets wrong and concretely evaluate and visualize whether the networks have learned.

Some companies like Wayve skip the feature detection completely and aim to directly predict controls, but the neural network approach would still be the same: find problematic situations and train the network to be better. It is indeed much more of a black box, as one can't easily ask the network what it perceives in order to show a visualization like FSD Beta's.
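A hedged sketch of what "intermediate and aggregate training targets" could look like (camera names, the fusion step, and shapes are all assumptions): the per-camera head and the fused head are each supervised on the same label, so each stage can be measured and debugged on its own.

```python
# A per-camera head and a fused birds-eye-view head both trained on "vehicle ahead".
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())

main_cam_head = nn.Linear(16, 1)       # intermediate target: vehicle ahead (main camera)
bev_head = nn.Linear(32, 1)            # aggregate target: vehicle ahead (fused view)

main_img = torch.rand(8, 3, 64, 64)
fisheye_img = torch.rand(8, 3, 64, 64)
vehicle_ahead = torch.randint(0, 2, (8, 1)).float()

main_feat = encoder(main_img)
fused_feat = torch.cat([main_feat, encoder(fisheye_img)], dim=1)

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(main_cam_head(main_feat), vehicle_ahead) \
     + loss_fn(bev_head(fused_feat), vehicle_ahead)
loss.backward()   # both targets shape the shared encoder
```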
 
They aren't asking the models, where would you drive given this image? They are asking it for hand-selected features, where are the cars, the lanes, the signs, etc. Ultimately I think that approach is just doomed to be thrown out in time, and is only being built now to satisfy this business need to ship an intermediate product before Tesla tosses it for a more end-to-end approach that more closely imitates human driving behavior.
I think the progression is natural, though, and follows how a human learns to drive. It's been interesting beta testing FSD and teaching my 15-year-old to drive at the same time. In the early days, steering and acceleration were conscious efforts based on consciously evaluated rules: press the accelerator to go, press harder to accelerate faster, press the brake hard to stop, turn left to make the car go left, etc. Those conscious efforts had to be combined with a conscious understanding of where the lanes were and where other cars were, just to keep the car in the lane and going at a controlled speed without hitting anything. Within weeks, though, that effort was becoming autonomic (call it "muscle memory"), and more conscious effort could now be devoted to understanding the context of the driving situation, what other cars were likely to do, how the interaction of cars at an intersection was going to unfold, etc. Before long he could drive to school and back, going through lights and intersections and interacting with other cars, all while carrying on a decent conversation.

Tesla FSD has unfolded in a similar way, albeit MUCH SLOWER. But I think there will likely be some balance between muscle memory (neural nets) and conscious rule evaluation (procedural code) in just about every new version, with the former eclipsing the latter over time. That doesn't mean that current FSD is "doomed," however; it's just the natural way for it to progress.
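A toy way to picture that balance (entirely hypothetical structure, just to make the division concrete): a learned policy proposes the controls, and explicit procedural rules are evaluated on top of it.

```python
# Learned "muscle memory" proposes controls; procedural rules check and clamp them.
import torch
import torch.nn as nn

policy = nn.Linear(10, 2)                     # learned part: (steering, acceleration)

def rule_check(controls, obstacle_distance_m):
    steering, accel = controls[0].tolist()
    # Procedural, human-readable rules layered on the learned output.
    if obstacle_distance_m < 5.0:
        accel = min(accel, 0.0)               # never accelerate toward a close obstacle
    steering = max(-0.5, min(0.5, steering))  # clamp the steering command
    return steering, accel

scene_features = torch.rand(1, 10)
with torch.no_grad():
    proposed = policy(scene_features)
print(rule_check(proposed, obstacle_distance_m=3.2))
```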
 
Yes, that is how humans learn. It isn't just muscle memory per se, but more the non-conscious parts of the brain taking over, leaving the conscious part to supervise and give occasional inputs. But none of that is happening yet in Tesla Autopilot, since all planning and car control is done via C code.

Basically, our brains are composed of many separate neural nets connected to each other. The "muscle memory" and non-conscious parts of the brain are constantly looking over the shoulder, as it were, watching what the body is doing based on sensor input and the outputs that result. Given enough repetition, it copies the actions based on sensor input, so when it sees similar inputs it can take over, should the conscious part of the brain allow it to.

This can work in reverse too. When asked how to do something you've committed to muscle memory (like tying shoelaces), you sometimes have to watch yourself automatically do it to understand how you do it. Which is pretty freaky.
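That "looking over the shoulder and copying" process is essentially what behavior cloning does in ML terms: record (sensor input, action taken) pairs while the deliberate controller drives, then train a network to reproduce the same actions from the same inputs. A hedged toy sketch (invented controller and data, not Tesla's pipeline):

```python
# Behavior cloning: a student network imitates a logged "conscious" controller.
import torch
import torch.nn as nn

def conscious_controller(obs):
    # Stand-in for the deliberate, rule-following driver being imitated.
    return torch.stack([obs[:, 0] - obs[:, 1], obs[:, 2]], dim=1)

observations = torch.rand(256, 8)
actions = conscious_controller(observations)      # logged demonstrations

student = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)

for _ in range(300):                              # "enough repetition"
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(student(observations), actions)
    loss.backward()
    optimizer.step()
```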
 
Ultimately I think that approach is just doomed to be thrown out in time, and is only being built now to satisfy this business need to ship an intermediate product before Tesla tosses it for a more end-to-end approach that more closely imitates human driving behavior.
While the individual intermediate features can be thrown out (assuming no need for visualizations for humans), a lot of the neural network architecture will likely be reused even when going to a more end-to-end approach. The current structure supporting 360º video with temporal and spatial memory is quite different from a "plain" convolutional neural network that "just" reduces inputs to a more abstract output at each layer. The types of situations and variations that the neural network can learn are heavily influenced by that structure.
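As a rough sketch of that structural difference (a toy feature queue with invented shapes and queue length, not Tesla's architecture): per-frame features get pushed into a short memory that a downstream head consumes, instead of each frame being processed in isolation.

```python
# Per-frame features are cached in a short queue so the head sees temporal context.
from collections import deque
import torch
import torch.nn as nn

frame_encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten())
temporal_head = nn.Linear(16 * 8, 4)           # consumes the last 8 frames of features

feature_queue = deque(maxlen=8)
for _ in range(8):                              # fill the queue with a short video clip
    frame = torch.rand(1, 3, 64, 64)
    feature_queue.append(frame_encoder(frame))

clip_features = torch.cat(list(feature_queue), dim=1)   # (1, 16 * 8)
prediction = temporal_head(clip_features)
```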

Andrej Karpathy is well aware of the ideal as he explained to Pieter Abbeel on The Robot Brains Podcast:

"Maybe at the end of the day, this can all just be a neural net. So maybe there's very little room for engineering. Maybe the images just come in and what comes out is just what you really want which is the steering and the acceleration. Easily said; hard to do. But that is the final conclusion of this kind of transition. And there's very little software written by people. It's just a neural network that does the whole thing. That's the holy grail I would say."

Practically, this will likely depend on Dojo, as well as on having FSD Beta at a good enough safety level to benchmark against. This is similar to DeepMind iterating from AlphaGo to AlphaGo Zero to AlphaZero to MuZero, where each time they could benchmark against the previous best approach automatically (without needing to play against humans). Tesla is well positioned to crank the data engine and generate training data from the fleet to actually deliver an end-to-end approach.
 
Two major changes in 10.9. Handling intersections using a new auto-regression neural net should be a huge improvement. "Auto-regression" means using past information to make future predictions, so the network is finally starting to reason about what is happening in the intersection using a continuous memory of past results. Sounds like this is only being used for intersections right now; it should be a big improvement when the technique is rolled out for the rest of the planning work. I am very interested to see how this affects actual intersection behavior (I used to have FSD Beta, but I just bought a new car, so I don't have it anymore).
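A minimal, hedged illustration of auto-regression in that sense (a toy GRU, not Tesla's network): the model's previous prediction is fed back in as the input for the next step, so the rollout carries a memory of what it has already predicted.

```python
# Each step is conditioned on the model's own previous output and hidden state.
import torch
import torch.nn as nn

class AutoRegressivePredictor(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.cell = nn.GRUCell(dim, 32)
        self.out = nn.Linear(32, dim)

    def rollout(self, first_obs, steps=5):
        hidden = torch.zeros(first_obs.shape[0], 32)
        prediction, outputs = first_obs, []
        for _ in range(steps):
            hidden = self.cell(prediction, hidden)   # memory of past results
            prediction = self.out(hidden)            # next step conditioned on it
            outputs.append(prediction)
        return torch.stack(outputs, dim=1)

preds = AutoRegressivePredictor().rollout(torch.rand(1, 8))
print(preds.shape)   # (1, 5, 8)
```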

And it looks like we have 10-bit raw camera inputs instead of 8-bit processed images. That should help in low light and other obscured-camera situations.
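Back-of-the-envelope on why the extra bits matter (simple arithmetic, assuming linear sensor counts):

```python
# 10-bit raw has four times as many intensity levels as 8-bit,
# so dark regions that would be crushed into a few 8-bit codes keep more detail.
levels_8bit = 2 ** 8      # 256
levels_10bit = 2 ** 10    # 1024
print(levels_10bit / levels_8bit)  # 4.0: the extra precision mostly matters in the shadows
```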