Layman's Explanation of Tesla AI Day


Cosmacelf

Well-Known Member
Supporting Member
Mar 6, 2013
12,686
46,769
San Diego
This is my attempt to explain the technical information that was presented at Tesla AI Day.

Part 1: Overview

Tesla Autopilot needs to be understood as a general purpose driving system that can, in theory, handle any driving situation in any location. This is in contrast to many other self driving systems from Ford, GM, Mercedes, etc. that rely on pre-defined high definition maps and geographically locked self driving areas.

The Tesla system perceives the driving environment in real time through its eight cameras. The Tesla vision and car control system uses backpropagation trained neural networks in combination with complex C++ coded algorithms.

A " backpropagation trained neural network" is the same kind of neural network that is currently used throughout the AI industry today. It is what powers voice assistants like Alexa and Siri, Netflix recommendations, and Apple's face recognition technology.

These neural networks (NN) have some differences and some similarities with the way our brains work.

Production systems train such a NN using millions of examples that explicitly tell the NN what it is supposed to be learning. For instance, a visual NN will be shown millions of images, each carrying one or more labels identifying what is in the image. To train a NN for complex scene analysis, a person would draw lines around objects, identifying what is in each drawn polygon.

For example, in the image below, labels are assigned to each object you want the system to know about.

[Image: example driving scene with a label assigned to each object of interest]

Once you have millions of labeled images, you can train a NN on this data. Training is very compute intensive and is normally done on huge GPU clusters. It isn't unusual for training on a large data set to take several days, even on a very large GPU cluster.
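To make "training" a bit more concrete, here is a minimal sketch of what backpropagation training on labeled images looks like in PyTorch. Everything here (the random stand-in data, the five classes, the off-the-shelf ResNet) is illustrative, not Tesla's actual code or data:

```python
import torch
import torch.nn as nn
import torchvision

# Stand-in data: in reality this would be millions of labeled camera images.
images = torch.randn(1024, 3, 224, 224)
labels = torch.randint(0, 5, (1024,))            # e.g. 5 made-up object classes
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(images, labels), batch_size=64, shuffle=True)

model = torchvision.models.resnet18(num_classes=5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        pred = model(x)                  # forward pass through the network
        loss = loss_fn(pred, y)          # how wrong were the guesses on this batch?
        loss.backward()                  # backpropagation: compute gradients of the error
        optimizer.step()                 # nudge the millions of internal parameters
```

The loss measures how wrong the network's guesses are; backpropagation turns that error into tiny adjustments of every internal parameter, repeated over millions of examples.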

After training, you now have a NN with millions of internal parameters that were generated during training. You can then download this NN onto a typically much smaller "inference processor" and run the NN to do your scene analysis or whatever you are trying to do. Running a scene analysis will typically take a fraction of a second per image on an inference processor with a trained NN. Tesla, of course, built their own AI inference chip and each car has two of these.

Now, when Elon says each person using Autopilot is helping to train the NN, this is only partially true. First, your particular car isn't learning anything. It can only run the downloaded NN to understand visual scenes. What it can do, though, is upload 10-second video snapshots when requested by Tesla.

For example, when Tesla wanted to improve detection of "cut-ins", or cars merging into your lane on the freeway, they wrote a query that ran in each car in the Tesla fleet. The car computers would trigger whenever they saw a car moving into the lane and then upload these roughly 10 second clips to the Tesla datacenter. After a few days, Tesla would have, say, 10,000 video clips of this happening. These clips are "auto labelled" since they each contain video of a car cutting in on the freeway. Tesla then trains their NN on these labelled video clips. The NN picks out and learns clues from the video by itself, like when a blinker is on, or a car starts to drift over, or its angle subtly changes. That's the fundamental power of a NN – with labelled data, it can learn how to recognize patterns automatically, without (much) human programming.
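As a hedged sketch of what such a fleet "campaign" might look like conceptually (the trigger condition, attribute names, and upload_clip() are all invented for illustration, not Tesla's actual API):

```python
from collections import deque

# Invented sketch of an in-car trigger for a "cut-in" campaign; the attribute
# names, thresholds, and upload function are placeholders, not Tesla's real code.

RECENT_SECONDS = 10
FRAMES_PER_SECOND = 36
buffer = deque(maxlen=RECENT_SECONDS * FRAMES_PER_SECOND)   # rolling ~10 s of frames

def on_new_frame(frame, tracked_cars):
    buffer.append(frame)
    for car in tracked_cars:
        # Fire when a neighboring car looks like it is drifting into our lane.
        if car.lane_offset_m < 1.0 and car.lateral_velocity_mps > 0.3:
            upload_clip(list(buffer), label="cut_in")   # labelled by the trigger itself
            break

def upload_clip(frames, label):
    ...  # queue the clip for upload to the datacenter (e.g. later, over wifi)
```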

So after retraining the NN, Tesla deploys the new NN in shadow mode into the Tesla fleet and asks the fleet to observe when cut-in predictions are incorrect. Both failing to predict that a car will cut in and predicting a cut-in that never happens are flagged by the cars running in shadow mode. These exceptions are then uploaded again to the datacenter and a new NN is trained on this exception data. A newer NN is then deployed and run again in shadow mode. This cycle repeats for however many times it takes for the network to get really good at prediction, at which time it is finally deployed to the fleet as a working update.
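The shadow-mode check itself boils down to comparing a silent prediction against what actually happened a few seconds later. A minimal sketch, with all the object and method names invented for illustration:

```python
# Invented sketch of the shadow-mode check: the candidate NN predicts silently,
# and clips where its prediction disagrees with what actually happened are kept.
# clip.first_seconds(), clip.cut_in_actually_occurred and predict_cut_in() are
# placeholders, not real APIs.

def shadow_mode_check(candidate_nn, clip):
    predicted = candidate_nn.predict_cut_in(clip.first_seconds(3))   # before the fact
    actual = clip.cut_in_actually_occurred                           # known in hindsight
    if predicted != actual:
        return clip            # flagged: training data for the next retraining round
    return None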

When Tesla says its fleet is its secret weapon versus everyone else, it isn't boasting. Tesla effectively has 1M+ mobile Tesla AI chips at its disposal to do this kind of training. Thinking about it in these terms, Tesla has by far the biggest AI supercomputer on the planet.

OK, with that as background, let's get to how the Tesla NN actually works.

Part 2: Vision Architecture

This figure shows the overall architecture of the vision system. This is only for perception, meaning understanding what you are seeing. Not shown is the driving and control system which will be discussed later.

[Image: high-level block diagram of the Autopilot vision architecture]

This is very high level – each of these boxes will typically contain a few to a dozen neural nets each connected in various complex ways.

All 8 cameras first pass through a calibration neural net which warps each camera into a "standard" image that should be the same across the entire fleet. When you first purchase the car, Autopilot goes through a several day calibration exercise where it trains an in-car neural network to learn how each camera deviates from baseline. Each camera could be slightly twisted, or pointed a bit too far skyward, or whatever, and the car needs to understand these deviations so that the image presented to the Autopilot neural network looks the same across all cars. This is the only time a neural network is actually trained in the car itself.
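One common way to "warp each camera into a standard image" is to estimate a small per-camera correction once and then apply it to every frame. Here's a hedged OpenCV sketch under that assumption; the matrix values are made up, and Tesla's real calibration is learned and certainly more sophisticated:

```python
import cv2
import numpy as np

# Hypothetical sketch: rectify one camera toward a fleet-wide "standard" view.
# H would be estimated once during the calibration phase by comparing what this
# camera sees against what the baseline camera geometry predicts it should see.
H = np.array([[1.0, 0.02, -4.0],     # made-up values: slight rotation plus shift
              [-0.02, 1.0,  6.0],
              [0.0,  0.0,   1.0]])

def rectify(frame: np.ndarray) -> np.ndarray:
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))   # every frame gets the same correction
```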

Once you get past this calibration (warping) stage, standard low level features are extracted from the images to reduce or compress the huge amount of image data into something more manageable without losing any important information.

The next step is to merge all eight cameras into one view while at the same time building a 3D vector space representation of the world.

Creating a 3D vector space representation from a flat 2D image is something our human visual cortex does so well and so automatically that you probably don't even realize you are doing it.

When you look at an image like this:

[Image: street scene with a red sign, several cars, an American flag, and a white van at different distances]


You just know that the red sign is further away than those three cars, and the American flag is at about the same distance away as the car with the headlights and that car is well back of the white van. The Autopilot NN has to figure this out as well and the resulting output is a three dimensional model of the world with key features located in that 3D space.

Tesla's 3D vector space model is a fairly dense raster with each point having a distance measurement away from the car. It is similar to the point cloud a LIDAR system would generate:

[Image: Tesla's 3D vector space output, which resembles a LIDAR point cloud]


Tesla's 3D vector space additionally adds velocity to moving objects, locates curbs and driveable areas, is in color, and has higher resolution. It is also quite a bit more sophisticated than a LIDAR plot since its 3D picture includes occluded objects – for instance, a pedestrian that is currently behind a car and can't be seen in the current video frame. The network tracks these occluded objects using a memory based video module described below.
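Purely as a mental model of the kind of information described here (as discussed further down the thread, the real representation is an internal NN encoding, not a literal data structure), you could picture the per-frame output as something like this; every field and name below is illustrative:

```python
import numpy as np
from dataclasses import dataclass, field

# Illustration only: one way to picture the *kind* of information in the output.
# The real representation is an internal NN encoding, not a literal struct.

@dataclass
class TrackedObject:
    kind: str                     # "car", "pedestrian", "cone", ...
    position_m: np.ndarray        # (x, y, z) relative to the car
    velocity_mps: np.ndarray      # (vx, vy, vz); zero for static objects
    occluded: bool = False        # still tracked from memory when hidden this frame

@dataclass
class VectorSpaceFrame:
    depth_raster: np.ndarray                      # dense grid of distances from the car
    drivable_area: np.ndarray                     # boolean mask of where the car may go
    curbs: np.ndarray                             # boolean mask of curb locations
    objects: list = field(default_factory=list)   # TrackedObject instances
```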

Note that Tesla's low level NN vision system is very sophisticated. Just like the human cortex, these networks share information from the top level of the hierarchy to the lowest. Here's an example where you are trying to figure out what a 10x10 pixel image is (on the left).

[Image: a blurry 10x10 pixel crop on the left, shown alongside the full scene it came from]


The NN knows that the 10x10 blurry pixels are at the vanishing point of the image, and thus it can infer they are a car's headlights.

Once the cameras are fused and the 3D vector space is created, Tesla adds in time and location based memory. The example they gave is when you drive towards an intersection, you might roll over lane markings indicating the leftmost lane is a turn lane.

[Image: approaching an intersection after driving over a left-turn-only lane marking]


By the time you reach the intersection, it's been a couple of seconds since you passed over the left hand turn lane marker, so there needs to be some kind of memory of these features as you pass them by. The vision engine snapshots the extracted features it is seeing every short time interval or every meter of travel, creating a list of recently seen visual clues. Again, this is very similar to the way humans drive – we may not consciously remember that we rolled over a left hand turn marker, but our own neural network does, and uses that information to guide our driving.
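A hedged sketch of what such a rolling feature cache might look like; the snapshot conditions (every meter or every quarter second) and sizes here are guesses for illustration only:

```python
from collections import deque

# Illustrative feature memory: cache a snapshot of the extracted features either
# every short time interval or every meter of travel, so clues like a rolled-over
# turn-lane marking survive until the car reaches the intersection.

class FeatureMemory:
    def __init__(self, max_entries=20):
        self.entries = deque(maxlen=max_entries)
        self._last_push_odometer = 0.0

    def maybe_push(self, features, odometer_m, timestamp_s, every_s=0.25):
        moved_1m = odometer_m - self._last_push_odometer >= 1.0
        waited = not self.entries or timestamp_s - self.entries[-1][0] >= every_s
        if moved_1m or waited:
            self.entries.append((timestamp_s, odometer_m, features))
            self._last_push_odometer = odometer_m

    def recent(self):
        return list(self.entries)   # consumed by the network along with the live frame
```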

This "video module" where Tesla adds in a running memory of recent features is very powerful and I'm not aware of any other self driving system that does anything like this.

Finally, the output of this video network – a combination of the current visual scene and recent visual clues, projected onto a 3D vector space – is consumed by around 80 other neural networks that Tesla calls "tasks". Each task does a very specific thing like locate and understand traffic lights, or perceive traffic cones, or see stop signs, or identify cars and their velocities, etc. The output of these tasks is high level semantic information like: in 30m there is a red traffic light, or the car 30m in front of me is likely parked.
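Structurally, "one shared vision output feeding ~80 task heads" is a standard multi-task network design (Tesla has called this style of network a HydraNet in earlier presentations). A toy PyTorch sketch with invented head names and sizes:

```python
import torch
import torch.nn as nn

class MultiTaskPerception(nn.Module):
    """Illustrative multi-head network: one shared feature trunk, many task heads."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(              # stand-in for the shared vision backbone
            nn.Conv2d(3, feature_dim, 3, stride=2, padding=1), nn.ReLU())
        # Each "task" gets its own small head reading the shared features.
        self.heads = nn.ModuleDict({
            "traffic_lights": nn.Conv2d(feature_dim, 4, 1),   # made-up output sizes
            "stop_signs":     nn.Conv2d(feature_dim, 2, 1),
            "vehicles":       nn.Conv2d(feature_dim, 8, 1),
            "curbs":          nn.Conv2d(feature_dim, 1, 1),
        })

    def forward(self, x):
        shared = self.trunk(x)
        return {name: head(shared) for name, head in self.heads.items()}

# outputs = MultiTaskPerception()(torch.randn(1, 3, 96, 96))  # dict of per-task tensors
```

The advantage of this shape is that the expensive trunk runs once per frame while each task head stays small and can be retrained somewhat independently.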

Part 3: Planning and Control

Now that the NN understands the locations of all objects, what they are, how fast and in what direction they are moving, what the road surface is, where the curbs are, what the signal environment (traffic lights, road signs) is, and what recently happened, the autopilot must do three things.

First, it must predict what all the other moving objects are going to do in the next short while. Second, it must plan out what it is going to do based on an overall plan (like following a GPS route). And third, it has to tell the car what to do.

This can get quite complicated. Let's explain using a simple example where the car is in the rightmost lane of a three lane road and needs to make a left turn up ahead while there are cars in the middle lane. To do this properly, the car must merge in between the two cars in the center lane and then merge into the far left lane, and do this without going too fast or too slow or with too much acceleration or deceleration.

Currently, Tesla uses a search algorithm coded in C++ to figure this out. It essentially simulates a couple of thousand different scenarios in which the car and the other cars on the road all interact. It does this at a fairly coarse grained level; the number they gave is that the simulator/search system can evaluate 2,500 different scenarios in 1.5 ms. All these simulations use physics based models, meaning they are idealized representations of cars on roads in a simple representation of lanes, dividers, etc.
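To give a flavor of what a coarse search over candidate maneuvers might look like (in Python rather than C++, with trivially simple constant-acceleration physics and made-up numbers):

```python
# Illustrative coarse search: enumerate a handful of ego maneuvers against a few
# assumed behaviours of the car ahead, roll each out with simple kinematics, and
# keep the smoothest options that leave a safe gap. Numbers are invented.

def gap_after(ego_speed, ego_accel, other_speed, initial_gap_m, horizon_s=5.0):
    ego_dist = ego_speed * horizon_s + 0.5 * ego_accel * horizon_s ** 2
    other_dist = other_speed * horizon_s
    return initial_gap_m + other_dist - ego_dist

def coarse_search(ego_speed=25.0, lead_gap_m=20.0, lead_speed=24.0):
    candidates = []
    for accel in (-2.0, -1.0, 0.0, 1.0, 2.0):               # ego maneuvers, m/s^2
        for assumed_lead in (lead_speed, lead_speed - 3.0):  # lead car holds speed or slows
            gap = gap_after(ego_speed, accel, assumed_lead, lead_gap_m)
            if gap > 5.0:                                    # crude safety margin
                candidates.append((abs(accel), accel, assumed_lead, gap))
    return sorted(candidates)[:3]                            # smoothest safe options win

print(coarse_search())
```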

At the end of this exercise, the path planner has a rough plan for the next 15 seconds of travel. At that point, it creates a very detailed plan that exactly maps out the specific turning radius, travel path and speed changes required to make the driving as smooth as possible. This too is done using a C++ search algorithm, but with a different technique that takes into account lateral acceleration, lateral jerk, collision risk and traversal time, merging all of these into a total cost function and minimizing the "total cost" for the refined route. The example they showed of planning a smooth path for 10 seconds of driving took about 100 iterations through this search function.
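The cost-minimization idea can be sketched as a weighted sum of exactly the terms listed above; the weights, the trajectory attributes, and the perturb() helper are all invented for illustration:

```python
# Hypothetical sketch of the "total cost" the refinement step minimizes.
# Weights, attribute names, and perturb() are placeholders, not Tesla's values.

def trajectory_cost(traj):
    return (1.0 * max(abs(a) for a in traj.lateral_accels)   # comfort: sideways acceleration
          + 2.0 * max(abs(j) for j in traj.lateral_jerks)    # comfort: how abruptly it changes
          + 50.0 * traj.collision_risk                       # safety dominates everything else
          + 0.5 * traj.traversal_time_s)                     # still want to get there promptly

def perturb(traj):
    """Stand-in for generating small local variations of a trajectory."""
    return [traj]

def refine(candidate):
    best = candidate
    for _ in range(100):                   # roughly the iteration count quoted above
        best = min(perturb(best) + [best], key=trajectory_cost)
    return best
```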

This planning function, along with all the visual perception before it, happens every 27 ms (36 times a second). So Autopilot is sensing and recreating plans 36 times a second to ensure that nothing catches it unprepared for whatever weird crap happens around the car. Autopilot isn't omniscient and can't tell that the idiot driver you are merging in front of is rage texting a twitter response, but it can notice, more quickly than you can, that said driver's car is on a collision course, and it will react accordingly.

Tesla is working on a new planning algorithm that mostly uses a planning NN instead of hand coded C++ as is currently done. The benefit of a NN algorithm will probably be speed, but it might provide some better plans as well.

Oddly enough, one of the toughest things to plan is navigating around a parking lot. Tesla showed how their NN planner in development would handle this task in much less time than their current planner.

Part 4: Training

OK, I've talked about how the various Tesla NNs work when perceiving the world and, eventually, planning to drive around it. But how do you teach the NN?

As I described in part 1, you need to give a NN millions of labeled examples for it to learn what you are trying to teach it.

Tesla started off doing this the same way the rest of the industry does, by having humans draw polygons around features in images. This does not scale. Tesla soon created tools to label directly in their vector space – the stitched-together, eight-camera 3D representation of the world. Label a curb in this enhanced view and it is automatically labeled in the corresponding camera images, perhaps across three cameras at once.

They then realized that their fleet drives through the same intersection thousands of times, so by tagging an intersection using GPS co-ordinates, a single human labelling effort in vector space could auto label thousands of images taken in all sorts of weather conditions, times of day, and viewing angles. So, in 3D vector space, once you know a stoplight is a stoplight, thousands of future drives through the same intersection – from different directions, times of day, weather and lighting – can all label that same stoplight automatically with no further human input.
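A hedged sketch of that reuse step, matching fleet clips to a human-labelled landmark by GPS proximity (the clip/label objects and the 30 m radius are invented for illustration):

```python
import math

# Illustrative only: one human-labelled stoplight, tagged with GPS coordinates,
# gets propagated to every fleet clip recorded near the same intersection.

def distance_m(lat1, lon1, lat2, lon2):
    # Small-distance approximation; good enough for "same intersection" checks.
    dlat = (lat2 - lat1) * 111_320
    dlon = (lon2 - lon1) * 111_320 * math.cos(math.radians(lat1))
    return math.hypot(dlat, dlon)

def propagate_label(human_label, fleet_clips, radius_m=30.0):
    lat, lon = human_label.gps
    auto_labelled = []
    for clip in fleet_clips:                     # night, rain, different approach angles...
        if distance_m(lat, lon, *clip.gps) < radius_m:
            clip.labels.append(human_label)      # no additional human input needed
            auto_labelled.append(clip)
    return auto_labelled
```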

Tesla then added completely automatic auto labelling. A car's AI chip can only recognize so much but a 1,000 GPU cluster is much more powerful and doesn't have millisecond time constraints, so such a cluster can be used to automatically label novel images.

Tesla used this auto labelling capability to remove radar with 3 months of effort in early 2021. Part of the problem they had when removing radar was degraded video – for example, when a passing snowplow dumps tons of snow onto the car, occluding, well, everything.

Tesla realized they needed to auto label poor visibility situations, so they asked the Tesla fleet for examples of when the car NN had temporary zero visibility. They auto labeled 10,000 poor visibility videos in a week, something Tesla said would have taken several months with humans.

After retraining the visual NN with this batch of labelled poor visibility video, the system was able to remember and predict where everything was in the scene as you go through temporary poor visibility, similar to how humans handle such situations.

Using Simulations

Simulation – creating a 3D world like you would in a detailed video game – is something Tesla is starting to use, mostly for edge cases. They showed examples of people or dogs running on a freeway, a scene with hundreds of pedestrians, and a new type of truck that no one's seen before (the Cybertruck!).

Even the auto labeler has a hard time with hundreds of pedestrians as you might have in Hong Kong or NYC, so creating a simulation where the system knows exactly where everyone is and how fast they are walking is a better solution to train with.

They work really hard to make accurate simulations. They have to mimic what the camera will see in real world conditions. Lighting and ray tracing must be very accurate. Noise in road surfaces must be introduced and they even have a special NN used just to add noise and texture to generated images.

Tesla now has thousands of unique vehicles, pedestrians, animals and props in their simulation engine. Each moves and looks real. They also have 2,000 miles of very diverse and unique roads.

And when they want to create a scene, the scene is most often created procedurally via computer algorithms as opposed to having an artist create the scene. They can ask for nighttime, daytime, raining, snowing, etc.

Finally, because they can automatically build whatever kind of visual world they want, they can now use adversarial machine learning techniques, where one computer system tries to create scenes that break the car's visual perception system, and the breakage is then used to train the vision system better, all inside a closed training loop.
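A very rough sketch of that closed loop (every object and method name here is invented; this is just the shape of the idea):

```python
# Invented sketch of the closed loop: generate simulated scenes that try to break
# the vision stack, keep the failures, and fold them into the next training set.

def adversarial_round(scene_generator, vision_nn, training_set, num_scenes=1000):
    failures = []
    for _ in range(num_scenes):
        scene = scene_generator.propose()      # deliberately hard: odd lighting, rare objects
        image = scene.render()                 # the simulator knows the true labels exactly
        if vision_nn.perceive(image) != scene.ground_truth():
            failures.append((image, scene.ground_truth()))
    training_set.extend(failures)              # next training run learns from the breakage
    return len(failures)
```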

To say that Tesla's learning pipeline is next level is an understatement. It is now extremely sophisticated and is arguably the most powerful in the world.

Dojo

Dojo is Tesla's next generation training data center. Right now, Tesla "makes do" with thousands of current state-of-the-art Nvidia A100 GPUs clustered together. Tesla thought they could do better, so they custom designed their own AI training chip (not to be confused with the AI inference chip inside every Tesla car). The training chip is far more powerful and has been designed to work as a piece of a huge supercomputer cluster.

I won't say much about Dojo, but suffice it to say that it is world class in many ways: packaging, cooling, power integration, and huge communications bandwidth (easily 10x current capabilities). Dojo is still being built; I'd guess it is 4-6 months away from being used in a production environment. When it does start getting used, Tesla's productivity will increase – they'll basically be able to do more, faster.

Part 5: Future Work & Conclusions

From watching this and other Tesla AI videos from the last two years, you can tell that Tesla has been racing to get to a working FSD vision system. When Tesla releases their v10 FSD Beta (only weeks away 😀), it will be a great system, but there are a lot of optimizations Tesla is working on that will be rolled out over the next year or so. Here are a few they talked about.

You might have noticed that Tesla's vision system merges all eight cameras quite late in processing. You could do this earlier in the pipeline and potentially save a lot of in car processing time.

The vision system produces a fairly dense 3D raster of the world. The rest of the processing system (like those 80+ tasks) must then interpret this raster. Tesla is exploring ways of producing a simpler, object based representation of the world which would be much easier for the recognition tasks to process. By the way, this is one of those problems that sounds easy but is actually mind bendingly challenging.

As mentioned, Tesla is working on a neural network planner which would reduce path planning compute time significantly.

Their core in-car NNs use high precision floating point, but there is an opportunity to use lower precision floating point, and maybe to go all the way down to 8 bit integer computations in certain cases.
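For anyone curious what "going down to 8 bit integers" means in practice, here's a tiny self-contained sketch of the basic scale-and-round idea behind int8 quantization (a real deployment pipeline is far more careful about where and how it quantizes):

```python
import numpy as np

# Sketch of the basic idea behind int8 inference: map float values onto 256
# integer levels with a scale factor, do the heavy math in int8, convert back.

def quantize(x: np.ndarray):
    scale = np.abs(x).max() / 127.0          # symmetric quantization, zero point = 0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, s = quantize(weights)
print(np.abs(weights - dequantize(q, s)).max())   # small error, 4x less storage than fp32
```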

And these are only some broad areas that are being worked on. No doubt each individual NN and piece of code could use some optimization care and love.

Conclusions

All these words, and I've only given you a top level summary of what was presented at Tesla AI day. Tesla really opened their kimono and gave way, way more details than written here, at a fairly deep technical level. Indeed, you had to be an AI researcher or programmer to truly understand everything that was said.

Companies simply do not give deep technical roadmaps like this to the public. Obviously Tesla was trying to recruit AI researchers, but even so, exposing so much of their architecture was a real OG move. The subtext is that Tesla doesn't care if you try to copy them, because by the time you re-create everything, they will have advanced that much further ahead.

The bottom line is that, in my opinion, Tesla can indeed deliver Full Self Driving with this architecture. "When" is always a tough thing to predict, but it appears that all the pieces are in place, and certainly we've seen some pretty impressive FSD beta videos.
 
Tesla's 3D vector space model is a fairly dense raster with each point having a distance measurement away from the car. It is similar to the point cloud a LIDAR system would generate:
I'm not sure I agree with this. While they did talk about pixel space associated with the 3D model, I think those were earlier attempts (essentially a synthetic Z-buffer). The impression I had was that they were building a true 3D vector space, with the car aware of the camera positions in that space.
 
Having watched the video a couple of times now, my take-away was just how much Tesla is invested in FSD and car autonomy:

-- 1,000 people in the labeling team alone. That is stunning.
-- Dojo .. this is a MAJOR investment in dollars, expertise, and pushing the boundaries of NN computer design. I'm impressed. Clearly there is a way to go yet, but the magnitude of the dataset they should be able to train with is going to be impressive.
-- The way they use the fleet to gather data on specific tricky situations. People have speculated about this, it's interesting to see it confirmed that they do this. This remains a significant advantage of Tesla .. the large fleet of cars that they can use as data sources.
-- The sophistication of the simulation software, even down to simulating the visual properties of the cars cameras.

Overall, this is clearly a company that is VERY serious about building car autonomy. However, my take-away was that this is still very much a work in progress, and I think there are still quite a few years of work to be done as they progress up the learning curve to higher SAE levels (if they can).
 
I'm not sure I agree with this. While they did talk about pixel space associated with the 3D model, I think those were earlier attempts (essentially a synthetic Z-buffer). The impression I had was that they were building a true 3D vector space, with the car aware of the camera positions in that space.

I may not have described it very well - how would you describe their vector space?
 
I may not have described it very well - how would you describe their vector space?
I think one way to explain it is as the "inverse" of how modern computer/console games work.

The computer game starts with a "world model" in 3D, with all the objects represented by geometry. So a cube, for example, has 6 sides (faces) positioned in the 3D space, etc. There are no pixels here, just a description, something like "there is a cube of size X at location XYZ in 3D space, facing in direction ABC". Part of that world includes a "camera", also positioned in the world, and the job of the CPU/GPU is to convert the geometry into a flat plane of pixels as they would appear from the camera's point of view. To change the view, you just move the camera and re-render a new set of pixels from the new camera position. (Of course, this is all a massive over-simplification.) The point is that the 3D geometry, or "vector space", is good for manipulations such as moving objects around and computing whether objects intersect or overlap, since only math on the object locations is involved. It's also a good space in which to compute changes over time, since you can "move" an object easily (you just update the co-ordinates that say where that object is).

What Tesla have done with vector space (my understanding, which is very imprecise) is essentially the inverse operation. They start with the flat pixel views from all the cameras and work backwards, using the NNs and a lot of very careful math and training to reconstruct that 3D geometry (this is of course massively complex). So a car in the camera view starts off as flat pixels that "look like" a car to the NN, which then works to position and orient that car in the 3D world the Tesla is constructing moment-by-moment. Of course, the software isn't really interested in the detailed shape of the car so much as its "bounding box" (you could see these as the colored boxes in the 8.x FSD beta videos, eliminated in the later betas). With the roads, cars, pedestrians, obstacles etc. as bounding boxes, the car can start to apply path-finding rules based upon its 3D vector space world view.
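To make the forward/inverse idea concrete, here's a toy pinhole-camera example with made-up intrinsics: the forward direction is what a game renderer does, and the inverse (pixel plus depth back to a 3D point) is roughly what the vision NNs have to learn, except they also have to infer the depth:

```python
import numpy as np

# Toy pinhole-camera model. Forward: 3D point -> 2D pixel (what a game does).
# Inverse: pixel + depth -> 3D point (roughly what the NNs learn to do).

fx = fy = 800.0          # focal lengths in pixels (made-up intrinsics)
cx, cy = 640.0, 480.0    # principal point

def project(point_3d):
    x, y, z = point_3d
    return np.array([fx * x / z + cx, fy * y / z + cy])     # perspective divide

def unproject(pixel, depth_z):
    u, v = pixel
    return np.array([(u - cx) * depth_z / fx, (v - cy) * depth_z / fy, depth_z])

p = np.array([2.0, -1.0, 20.0])            # a point 20 m ahead, 2 m to the side
print(unproject(project(p), 20.0))         # recovers [2, -1, 20]
```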
 
I think one way to explain it is as the "inverse" of how modern computer/console games work. ...

To my way of thinking, regardless of how you get there, a 3D vector space represents the distance and direction of different points in space relative to the observer (or any other reference point). In other words, it doesn't matter how that information is derived and encoded, it's what it represents.

You two are both saying essentially the same thing.
 
To my way of thinking, regardless of how you get there, a 3D vector space represents the distance and direction of different points in the vector space relative to the observer (or any other reference point). In other words, it doesn't matter how that information is encoded, it's what it represents.

You two are both saying essentially the same thing.

The reality is that the vector space is actually some internal NN representation of the world. All the visualizations we’ve seen of it are just recreations based on the NN representation. What’s important is the information that is encoded, which is a 3D color raster of the world.

Just like what is in our brains, the 3D model of the world isn't a 3D grid of points. It is an amalgamation of low level features (edges, lines, high/low contrast surfaces) and high level features (cars, animals, people). But there's no point getting into this much detail for this level of explanation. You really need to understand how all the various types of NN work to delve deeper.
 
To my way of thinking, regardless of how you get there, a 3D vector space represents the distance and direction of different points in space relative to the observer (or any other reference point). In other words, it doesn't matter how that information is derived and encoded, it's what it represents.

You two are both saying essentially the same thing.
To an extent. The point is that vector space deals with entities like cars and pedestrians, each bounded and placed in a 3D space. Camera/image space is just flat RGB data from which vector space can be derived using the NNs. I was clarifying because at one point early in the AI Day talk they talked about a "pixel" of vector space, which in fact was really (I'm surmising) just a Z-buffer reverse-mapping that the pre-FSD NN stack used to use (and was discarded as inadequate for FSD).

The value of the vector space is not just the vectors, but that they are vectors of classified objects. So the car assigns very different analysis goals to (say) a stationary pedestrian than to a stationary traffic cone.
 
From what I understand watching AI day, the inference stack does not use a 3D vector space point cloud like you describe. There was a visualization with a 3D point cloud but that's an output of the neural network rather than how it functions.

The Transformer combines the 2D feature maps from multiple cameras (1280x960 -> 20x15x512 each) into a 2D Bird's Eye View (BEV) that is 20x80 pixels with 256 layers of channels/features. The re-projection is done by tiling sine and cosine embeddings over the raw pixels for each camera and then training the Transformer NN to output 2D BEV rasters that match labelled training data.

They typically train the full stack only once in a while to get the Transformer trained. Then they lock that part of the stack and below, and fine tune the heads above for specific uses like road edges and vehicles. But training the Transformer relies on the full stack being functional and having an output space that matches labelled 2D training data. If they want the neural network to output high quality 2D BEV raster predictions of road edges, vehicles, etc., then they have to train it on high quality labelled 2D BEV data.
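A rough PyTorch sketch of that re-projection step, using the dimensions quoted above but with everything else simplified or invented (random stand-ins for the sine/cosine embeddings, and a single cross-attention layer instead of the real multi-layer Transformer):

```python
import torch
import torch.nn as nn

# Sketch only: a learned grid of BEV "query" vectors cross-attends over the
# flattened, position-embedded per-camera feature maps to produce a 20x80 raster.

cams, h, w, c = 8, 15, 20, 512          # eight cameras, 20x15 feature maps, 512 channels
bev_h, bev_w, bev_c = 20, 80, 256       # 20x80 BEV grid with 256 channels per cell

cam_feats = torch.randn(1, cams * h * w, c)          # flattened multi-camera features
cam_pos = torch.randn(1, cams * h * w, c)            # stand-in for sine/cosine embeddings
bev_queries = nn.Parameter(torch.randn(1, bev_h * bev_w, bev_c))

to_kv = nn.Linear(c, bev_c)                          # project camera features to query width
attn = nn.MultiheadAttention(embed_dim=bev_c, num_heads=8, batch_first=True)

kv = to_kv(cam_feats + cam_pos)
bev, _ = attn(query=bev_queries, key=kv, value=kv)   # (1, 1600, 256)
bev_raster = bev.reshape(1, bev_h, bev_w, bev_c)     # bird's-eye grid fed to the heads above
```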

[Images: AI Day slides of the multi-camera-to-BEV Transformer architecture]


When the time dimension is added for video you can see that everything is still 2D (Width X Height X Channels X Time)

[Image: AI Day slide showing the video module with the time dimension added]
 
From what I understand watching AI day, the inference stack does not use a 3D vector space point cloud like you describe. There was a visualization with a 3D point cloud but that's an output of the neural network rather than how it functions.

The Transformer combines the 2D feature maps from multiple cameras (1280x960 -> 20x15x512 each) into a 2D Bird's Eye View (BEV) that is 20x80 pixels with 256 layers of channels/features. The re-projection is done by tiling sine and cosine embeddings over the raw pixels for each camera and then training the Transformer NN to output 2D BEV rasters that match labelled training data.
You could be right, the presentation was interesting but at such a high level it was tricky to figure out the entire stack. However, it's clear that they do most of the planning based on a vector space, but keep in reserve the various intermediate results should the planning layer wish to probe deeper.

Overall I felt what we were seeing was the evolution of the software teams approach to solving the basic problems .. they are trail-blazing and, like others in similar circumstances, are learning as much by trial and error as by reasoning (which is not a bad thing).
 
My understanding listening to the presentation is as follows, and I'd be interested to know if you think I am getting this right. It has been 35-years since I worked on this stuff so please be gentle.

The classified objects must be in vector space, and there must be some (memory-based) feedback loop (or is this a feedforward loop? easy to get confused) from the vector space, via the probability/physics engine, to the later cycles of the inference engine. That is how the pedestrian emerging into view from behind the vehicle can be recognised as such before coming fully into view, i.e. it is the same pedestrian that went out of view a short time before. In this example both the vehicle and the pedestrian are in vector space, but knowledge of them and their likely future positions is being fed into the inference engine alongside the (processed) camera input.

I think that at present unclassified stuff doesn't get fed forward into the inference engine. So if the pedestrian goes behind some screening foliage, then that foliage is not in vector space. My guess is they are trying to suss out what is worth vectorising and remembering, and what is not and can just be forgotten. (It's something we humans struggle with despite all our evolution – see the "clever girl" scene in Jurassic Park; bet he wished he'd kept a better accounting of vector space.) As the various chips become more capable, and as the NN becomes better trained, the optimal forget/remember decision will migrate over time.

I wonder to what extent they are adding attribute labels to classified objects? Clearly there is some sorting of vehicles going on, but how deeply? Vehicle: car, truck, motorbike ... parked, moving, indicating ... on fire ... flashing blue lights ... acting normally/erratically ... guns? Ditto for traffic lights (red, green, amber ... broken?). But what about other objects – pedestrians: blue top, yellow top, man, woman, child, medic, uniformed, armed, firefighter, walking, running, seated, etc.? These labels would have a role to play both in the vector-space input into the inference engine (hey, that pedestrian emerging into view is NOT the expected child in the blue t-shirt but the lady in the yellow dress, now running – WHERE is the child? OK, only some of that example takes place in the inference engine) and in the physics/probability engine (pedestrians: walking commuters waiting at the roadside, likely destination the metro entrance on the other side of the road, so it will be difficult to route a turn through them).

I am sure you all appreciate that this is dramatically changing (accelerating the ongoing direction of travel) re the defence technology assessment and the surveillance state equation.
 
Just a reminder that when we saw the video of the pedestrians still being shown even though they were temporarily occluded by a passing car, that video was a recreated training video, built by analyzing the entire 30 second clip on the 1,000 GPU cluster. I.e., by spending a lot of offline processing, they could auto build that video in vector space, which was then used to train the production NN so that it would learn to predict the paths of occluded objects in production (with a 27 ms inference loop).
 
I'd like to know how what we saw on AI Day relates to software in cars now or coming soon. For example, a year ago Elon talked about a big rewrite to do labeling in vector space so I imagine this is in cars now, but perhaps only in beta-9. I'd really like to see how the AI Day features relate to real world performance.
 
a year ago Elon talked about a big rewrite to do labeling in vector space so I imagine this is in cars now, but perhaps only in beta-9
Karpathy briefly talked about manual 4D labeling as an improvement over their 2D labeling, so most likely it was used for earlier FSD Beta to correctly position objects outside the view of the main camera, e.g., positioning a cone based on the right pillar and repeater camera views:
[Image: 4D labeling example – a cone positioned using the right pillar and repeater camera views]


This was surpassed/augmented by Auto Labeling that uses larger neural networks that cannot run realtime in vehicles and also get the benefit of future/hind-sight as well as data collected from the rest of the fleet. It sounds like this new technique was used as part of the Vision-only velocity detections that are already live in production non-FSD-beta vehicles (as well as FSD Beta 9+).
 
I'd like to know how what we saw on AI Day relates to software in cars now or coming soon. For example, a year ago Elon talked about a big rewrite to do labeling in vector space so I imagine this is in cars now, but perhaps only in beta-9. I'd really like to see how the AI Day features relate to real world performance.
I think all of us not running FSD beta are running a much older architecture. Tesla does talk about merging the FSD stack to be used for highway driving, implying it isn't being used for highway right now.
 
I think all of us not running FSD beta are running a much older architecture. Tesla does talk about merging the FSD stack to be used for highway driving, implying it isn't being used for highway right now.
yep .. the only “new” bit is the very simple traffic light beta. However, it sounds as if they are trying to move NoA to the new stack as part of the v10 FSD beta, which is gutsy (I was expecting this much later). I suspect this is driven by resources .. they simply want to retire the old NNs to free up resources in the car for the new stack (so they don’t have to run both stacks side by side).
 
They stated that the whole team was located in California. In fact, they built the team because they were using an outside contractor and they were getting poor quality results.
The team that does the labeling software, yes, but I'm dubious that there are 1K grunts in California doing labeling. That's an unnecessary expense for Tesla when it can be done so much cheaper elsewhere. Mechanical Turk will do it cheaply. Where in California? In Palo Alto, where a shack costs $2M? Doesn't make sense to me.