V12.1 going out to employees with internally-terse release notes:

Looks like the same "FSD Beta v12" release notes from a few weeks ago for 2023.38.10. But presumably they made improvements to bump it to "FSD Beta v12.1" as well as updating to 2023.44.30.10, so that employee vehicles already on the holiday update could get this version. The release notes seem to be remotely hosted anyway, so theoretically customer vehicles could update to this same version, and if wider employee testing doesn't turn up issues, the rollout could slowly expand even further.
 
When I listen to Elon talking about the 300k lines of code, he's constantly referring to behaviors at traffic elements / lanes / etc.

If turning left at a roundabout, check the left quadrant for any cars > 5 mph with a predicted path intersecting the ego path within the next 3 seconds; if clear, proceed after 3 seconds
Thanks for the example. I'm not sure if FSD Beta has had that type of control logic specifically for roundabouts, as it seems to be more of a general driving behavior to continuously check for intersecting paths of relevant cars, in addition to all the other things to consider like lanes. The complexity comes from all the other things to simultaneously control for, like an intersection that looks like a roundabout but actually has traffic lights, or a pedestrian in the center, or construction cones on the edge, etc. Or even more basic: the vehicles in the roundabout suddenly stop, so proceeding still needs to avoid hitting the new lead vehicle.
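For illustration, the kind of hand-written heuristic quoted above might look roughly like this. This is a minimal sketch; the data structure, thresholds, and function name are all hypothetical, not anything from Tesla's actual code:

```python
from dataclasses import dataclass

# Hypothetical track representation and thresholds, for illustration only.
@dataclass
class Track:
    speed_mph: float            # current speed of the tracked car
    seconds_to_conflict: float  # predicted time until its path crosses the ego path


def clear_to_enter_roundabout(left_quadrant_tracks: list[Track],
                              speed_threshold_mph: float = 5.0,
                              horizon_s: float = 3.0) -> bool:
    """Return True if no relevant car in the left quadrant is predicted
    to intersect the ego path within the time horizon."""
    return not any(
        t.speed_mph > speed_threshold_mph and t.seconds_to_conflict < horizon_s
        for t in left_quadrant_tracks
    )
```

Multiply something like that by every intersection type, region, and edge case, and it's easy to see how the heuristic approach grows into hundreds of thousands of lines.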

If a neural network was trained to handle all the potential behaviors needed at a roundabout, it seems like it might as well be trained to handle every other intersection as well as non-intersections. But I suppose if there were special roundabout-specific control logic and this control neural network was only trained for roundabouts, that portion of software 1.0 could switch over to 2.0, and the rest of control could follow the pattern and incrementally get to end-to-end modular control.

Perhaps Tesla did look into incrementally increasing the scope of a control network vs. a complete control network for every situation vs. a single end-to-end network, and decided it would be best to go with the single network.
 
I still complain by voice on important disengagements. I doubt every single one is listened to by a human, but I bet somewhere in their sea of metrics, they're at least giving more weight to disengagements that were voice-tagged at all, and who knows maybe they count swear words or something.
The smart way to do it, and the way I suspect they're actually doing it, is to send the recording through voice recognition, then look for keywords. For example, "speed bump".

Then suddenly you have 50,000 immediate examples of the car failing to slow down for a speed bump. Run those clips through the auto-labeler (or, if it's early in the process, label them manually as needed), have a human review the auto-labeled results, and adjust as necessary. It would be a great way to get a lot of pertinent clips quickly.
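A rough sketch of that keyword-mining step, assuming the voice notes have already been transcribed somewhere upstream (the keyword list and the record fields here are hypothetical):

```python
# Hypothetical mining of voice-tagged disengagements by keyword.
KEYWORDS = {"speed bump", "pothole", "missed exit", "wrong lane"}

def find_clips_by_keyword(disengagements: list[dict]) -> dict[str, list[str]]:
    """Group disengagement clip IDs by which keyword their voice note mentions.

    Each disengagement record is assumed to look like:
        {"clip_id": "abc123", "transcript": "it didn't slow down for the speed bump"}
    """
    matches: dict[str, list[str]] = {kw: [] for kw in KEYWORDS}
    for event in disengagements:
        text = event["transcript"].lower()
        for kw in KEYWORDS:
            if kw in text:
                matches[kw].append(event["clip_id"])
    return matches
```

From there, the matching clips could be queued straight into the auto-labeling flow described above.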
 
Perhaps Tesla did look into incrementally increasing the scope of a control network vs. a complete control network for every situation vs. a single end-to-end network, and decided it would be best to go with the single network.

Yes, I believe this is the case. A modular software 2.0 strategy is actually compute-inefficient for inference. Karpathy has talked about FSD being a balancing act of inference compute since 2020, when he mentioned that different FSD teams would be rationed some portion of the compute for the 1000+ outputs from HW3. Over time, it follows that you'd need more efficient architectures as you replace more heuristics with NNs.
 
Anyone notice that the release notes for V12 say that it upgrades "city streets driving"? This seems to imply to me that the new end-to-end stack is only for city streets right now and therefore V12 still uses the "V11 stack" for highway driving.
Tesla has been very cautious in modifying Navigate on Autopilot, likely due to its level of performance in that domain. The methodology may be to test new stuff in city driving, which has more stressors, and only migrate it to NoA when it's solid (and after running in parallel with NoA).
 
Anyone notice that the release notes for V12 say that it upgrades "city streets driving"? This seems to imply to me that the new end-to-end stack is only for city streets right now and therefore V12 still uses the "V11 stack" for highway driving.

Teslascope did answer the question. I think it's a possibility, but their understanding is that this is just draft text and not really indicative of what we'll eventually see as public release notes:

 
I am almost positive they are reusing a lot, if not all, of the perception pieces from version 11. V12 is about feeding these inputs into a new neural network, or set of neural networks, that handles all the decision-making and control outputs for the driving task.

Is this a broadly agreed on community consensus?

I really thought they were trying to end-to-end one single neural network where it takes in video (and other sensor inputs) and spits out a path and velocity plan.

A previous post makes the point that if they're reusing the existing perception layer, they could be building V12 more incrementally.

Another downside I see is that separate layers for perception and planning imply a connection between them in the middle (think "public API" in computer-engineering terminology). I think people have used the term "Vector Space" for the output from the perception stack (locations of vehicles, locations of VRUs, lane lines, drivable space, signs, signals, metadata about these objects, etc.).

The downside I see is that these are all human-picked and curated concepts. By building planning on top of them, the planner can only know about the concepts that humans decided to build into the perception layer. Combining all of it into one neural net, end to end, eliminates the need to even pick what stuff is represented in the intermediary "Vector Space", or how it's represented.
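To make that concrete, a hand-designed perception-to-planning interface might look something like this (an illustrative sketch with invented field names, not Tesla's actual output schema). Anything not listed in the interface is simply invisible to the planner, which is exactly the limitation being described:

```python
from dataclasses import dataclass, field

# Hypothetical "Vector Space" handed from perception to planning.
# The planner can only reason about what these fields capture.
@dataclass
class PerceptionOutput:
    vehicles: list[dict] = field(default_factory=list)       # position, velocity, heading per vehicle
    vrus: list[dict] = field(default_factory=list)            # pedestrians, cyclists, etc.
    lane_lines: list[list[tuple[float, float]]] = field(default_factory=list)
    drivable_space: list[tuple[float, float]] = field(default_factory=list)   # polygon boundary
    traffic_controls: list[dict] = field(default_factory=list)                # signs, signals, metadata
```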
 
The downside I see is that these are all human-picked and curated concepts. By building planning on top of them, the planner can only know about the concepts that humans decided to build into the perception layer. Combining all of it into one neural net, end to end, eliminates the need to even pick what stuff is represented in the intermediary "Vector Space", or how it's represented.

It's still possible to connect intermediate layers of one module to another, prior to the high-dimensional vector space being reduced down to the human-interpretable concepts.

The only requirement I see for a neural network to be truly end to end is that the back propagation during training is done in one continuous step. Tesla could hold the current final layer of the perception stack constant and separate, for visualization purposes, and train the rest of the network end to end.

But throwing out the training efficiency of actual end-to-end for the sake of visualizations would be throwing the baby out with the bathwater.
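A minimal PyTorch-style sketch of that idea, where a frozen visualization head hangs off the intermediate features while the backbone and planner still train end to end. The module names, sizes, and overall layout are assumptions for illustration, not Tesla's architecture:

```python
import torch
import torch.nn as nn

# Hypothetical modules; sizes are placeholders.
backbone = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 256))  # shared features
viz_head = nn.Linear(256, 64)   # produces human-readable objects for the visualization only
planner = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))      # e.g. steering/accel

# Freeze the visualization head so it keeps producing stable outputs for display,
# while gradients still flow in one continuous pass through planner + backbone.
for p in viz_head.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(planner.parameters()), lr=1e-4)

def training_step(sensor_features, target_controls):
    feats = backbone(sensor_features)        # intermediate "vector space" features
    _ = viz_head(feats)                      # visualization only; no loss, weights never update
    controls = planner(feats)                # planner learns directly from the raw features
    loss = nn.functional.mse_loss(controls, target_controls)
    optimizer.zero_grad()
    loss.backward()                          # one continuous backward pass, end to end
    optimizer.step()
    return loss.item()

# Example usage with made-up data:
# training_step(torch.randn(8, 1024), torch.randn(8, 2))
```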
 
I see, sort of. This is where my lack of ML experience isn't helping...



What sort of information is passed around in this architecture? What are the inputs and outputs of "layered" NNs like this? Are those inputs and outputs completely non-human-readable? How do you train it then?

If you have a couple hours and are interested in learning, this video from Karpathy is really useful for understanding the core mechanics of modern NNs:


It also comes with a GitHub repository: GitHub - karpathy/ng-video-lecture

I can't speak for C++ implementations, but the inputs and outputs in Python are just a special type of numerical matrix (a tensor) with defined dimensions. The architecture that defines the connections between the weights is built from modules, and training is a process of initializing random weights, running the inputs through them to get a predicted output, and then updating the values of the weights from back to front based on how the predicted output compares to the desired output.
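A bare-bones PyTorch example of that loop (the two-layer network and the made-up data are just for illustration):

```python
import torch
import torch.nn as nn

# Tiny network: the weights start out random.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Made-up training data: inputs and desired outputs are just tensors
# (numerical matrices with defined dimensions).
inputs = torch.randn(32, 4)
targets = torch.randn(32, 1)

for step in range(100):
    predictions = model(inputs)                          # forward pass through the modules
    loss = nn.functional.mse_loss(predictions, targets)  # compare predicted vs. desired output
    optimizer.zero_grad()
    loss.backward()                                      # backpropagate: gradients flow back to front
    optimizer.step()                                     # nudge the weights to reduce the loss
```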
 
I see, sort of. This is where my lack of ML experience isn't helping...



What sort of information is passed around in this architecture? What are the inputs and outputs of "layered" NNs like this? Are those inputs and outputs completely non-human-readable? How do you train it then?
In a discrete, human-understandable NN pipeline, you feed data in one end and get human labels out the other. Video -> {lanes, cars, pedestrians}, that kind of thing. So it might go video -> objects -> path planning.
However, that intermediate labeling step throws out all the additional data in the feed and only leaves the next NN an N-label vector of probabilities/values. Maybe there is a low-confidence pedestrian object that is actually something important.
Connecting all the NNs together without the downsampled intermediate stages lets the final output pull in more of that information.
The result is fewer (or no) human-friendly quantization layers.

If doing OCR:
NN1: circle, arc, line detection
NN2 or C++: based on NN1 outputs, determine letters
NN3 or C++: based on NN2 outputs, determine word
vs
NN: determine word
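In code terms, the contrast might look like this (a toy sketch with made-up module sizes, not a real OCR system):

```python
import torch.nn as nn

# Modular pipeline: each stage outputs a human-defined intermediate representation,
# and downstream stages only ever see that reduced representation.
stroke_detector = nn.Linear(784, 32)    # NN1: image patch -> {circle, arc, line} scores
letter_classifier = nn.Linear(32, 26)   # NN2: stroke scores -> letter probabilities
# Word lookup could then be NN3, or plain C++/Python logic over the letter sequence.

# End-to-end alternative: one network maps pixels straight to a word prediction,
# free to learn whatever internal representation works best.
word_reader = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10_000))
```

The end-to-end version never has to commit to "circle/arc/line" as the thing worth representing, which is the point about losing the human-friendly quantization layers.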
 
It's completely unrealistic to me that FSD will be L3 at highway speed next year. Perhaps in a few years at lower speeds, in daytime, on dry roads, and without lane changes, like Mercedes, but likely never, due both to a lack of business incentives and to technical limitations. L4? LOL.

Just the certification for (L3) UNECE R157 will take 3-6 months, and Tesla currently doesn't have the reliability required, nor required features like a handover protocol, MRM, an emergency corridor, etc. Earliest 2025, probably never on current hardware, is my guess.

Instead they're targeting the new DCAS regulations from UNECE, if they can ever get those approved. Tesla has been chairing that L2 effort for 3 years with limited success thus far. Perhaps implemented by 2025-2026?
My experience with AP and, more recently, FSD on the highway has been excellent. The only time I've had to take over for FSD is because it misses exits, and those are primarily cloverleaf exits, which are somewhat difficult anyway because there's such a short merge period. If you restrict it to straight interstate driving with no exits, then it absolutely could be ready.

I don’t think it’s ready for all highway driving but definitely some limited use scenarios, and even those would be huge.

Edit: my other minor gripe is it doesn’t reliably merge back over to the right lane after passing unless there’s a car on your tail. That should be a trivial fix for them to make, though.
 
My experience with AP and, more recently, FSD on the highway has been excellent. The only time I've had to take over for FSD is because it misses exits, and those are primarily cloverleaf exits, which are somewhat difficult anyway because there's such a short merge period. If you restrict it to straight interstate driving with no exits, then it absolutely could be ready.

I don’t think it’s ready for all highway driving but definitely some limited use scenarios, and even those would be huge.

Edit: my other minor gripe is it doesn’t reliably merge back over to the right lane after passing unless there’s a car on your tail. That should be a trivial fix for them to make, though.
If you feel it’s ready, why not try it with a blindfold and see if it still feels ready? 😉

I love my AP but I also understand that I am driving.