Tesla.com - "Transitioning to Tesla Vision"

Karpathy is once again hammering home the point that, currently, only Tesla can source the data required for FSD.

Thank you diplomat for posting the video.

The new bits that I gleaned from Karpathy's talk:
1) Tesla is using transformer NNs for fusing surround video (see the sketch after this list).
2) Tesla is now training with video, not only on images.
3) Tesla has built a new Nvidia-based supercomputer to train their NNs.
4) Tesla is still working on Dojo.
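
To make point 1 a bit more concrete (this is my own generic sketch, not anything shown in the talk): the rough idea of fusing surround video with a transformer is to turn each camera's features into tokens and let self-attention mix them across cameras. PyTorch, the toy backbone, and all the dimensions below are my assumptions; only the eight-camera count matches Tesla's hardware.

```python
# Generic sketch of multi-camera fusion with a transformer encoder (PyTorch).
# Only the 8-camera count matches Tesla hardware; every other choice is made up.
import torch
import torch.nn as nn

class SurroundFusion(nn.Module):
    def __init__(self, n_cameras=8, d_model=256):
        super().__init__()
        # Tiny stand-in for a real per-camera image backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=4, padding=1), nn.ReLU(),
        )
        # Learned embedding telling attention which camera a token came from.
        self.camera_embed = nn.Embedding(n_cameras, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, images):
        # images: (batch, n_cameras, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.reshape(b * n, c, h, w))   # (b*n, d_model, h', w')
        tokens = feats.flatten(2).transpose(1, 2)               # (b*n, tokens_per_cam, d_model)
        per_cam = tokens.shape[1]
        tokens = tokens.reshape(b, n * per_cam, -1)             # all cameras in one token set
        cam_ids = torch.arange(n).repeat_interleave(per_cam)    # camera id for every token
        tokens = tokens + self.camera_embed(cam_ids)
        return self.fusion(tokens)                              # attention mixes across cameras

fused = SurroundFusion()(torch.randn(1, 8, 3, 128, 256))
print(fused.shape)  # torch.Size([1, 1024, 256])
```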

You are welcome.

By the way, I did start a separate thread just to discuss the Karpathy talk:

So if you want to discuss this or other bits that you got from the video, feel free to share them in the other thread.

Thanks.
 
Karpathy is once again hammering home the point that, currently, only Tesla can source the data required for FSD.

Thank you diplomat for posting the video.

The new bits that I gleaned from Karpathy's talk:
1) Tesla is using transformer NNs for fusing surround video.
2) Tesla is now training with video, not only on images.
3) Tesla has built a new Nvidia-based supercomputer to train their NNs.
4) Tesla is still working on Dojo.
Clarification on point 2, "Tesla is now training with video, not only on images":

Video is processed on the fly into frames (images)... that's how they are fed to a CNN. Tesla might be using what they call "surround video", but the processing of video into frames is still required (frames played in sequence == video).
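
If you haven't seen that step before, a minimal sketch of the usual "parse video into frames" loop with OpenCV looks like this (the file name and resize target are placeholders, nothing Tesla-specific):

```python
# Minimal example of the usual "video to frames" step with OpenCV.
# "clip.mp4" and the resize target are placeholders, nothing Tesla-specific.
import cv2

cap = cv2.VideoCapture("clip.mp4")
frames = []
while True:
    ok, frame = cap.read()                         # each read() yields one decoded frame (an image)
    if not ok:
        break
    frames.append(cv2.resize(frame, (640, 480)))   # whatever size the network expects
cap.release()

# "Video" to the network is just this ordered list of images.
print(f"decoded {len(frames)} frames")
```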
 
Video is processed on the fly into frames (images)... that's how they are fed to a CNN. Tesla might be using what they call "surround video", but the processing of video into frames is still required (frames played in sequence == video).

It's possible you're right, but Elon has said that Tesla is working on video labeling and training directly. And this is the first time I've heard Karpathy talking about video training / data directly. In all prior talks, Karpathy has always referred to Tesla's data as images.
 
Karpathy is presenting at CVPR. He gave some specific info on why Tesla dropped radar: basically, radar and sensor fusion were not working well. Karpathy explains that Tesla could have spent time fixing radar and sensor fusion but decided their engineering resources would be better spent on just doing camera vision instead.

I grabbed these screenshots.

The first screenshot shows a case of a car in front braking hard. The graphs show the acceleration of the Tesla over time, with radar and without radar.

Basically, radar would lose track of the car in front, causing very hard braking. Now with Tesla Vision, the braking is much smoother.

[Screenshot: acceleration-over-time graphs, radar vs. Tesla Vision]


The next screenshot shows the case of an overpass. Radar mistakes the overpass for a stationary object; Tesla Vision does not make that mistake, so it is very smooth now.

[Screenshot: overpass scenario, radar vs. Tesla Vision]


Here is the live stream. Karpathy's talk starts at 7:51:46:

Thanks for posting this Diplomat. I found it quite encouraging, and it definitely boosts my confidence that Tesla will figure this out soon. It looks like Tesla is doing a good job incorporating time domain information into their processing (i.e., videos), and this seems essential if they expect to estimate distance, velocity, and acceleration from cameras alone. I also think the post processing analysis they are doing to learn from the video data looks very promising.

It may be a long road, but I really think they are going to crack this, and it will eventually surpass the best we can do with radar at present. My only concern, as a new customer, is whether this will be one more year or more like five.
 
Clarification on point 2, "Tesla is now training with video, not only on images":

Video is processed on the fly into frames (images)... that's how they are fed to a CNN. Tesla might be using what they call "surround video", but the processing of video into frames is still required (frames played in sequence == video).
That's not the distinction, from my understanding. The way AP worked is frame by frame: the NN analyzes things per frame and then spits out the result for that frame. That is why things are so jumpy (anyone who has used AP for a while will notice this).

If instead the NN takes in a video (multiple frames at once, maybe even working with the video compression info, like keyframes, interframes, macroblocks, motion vectors, etc.), then that jumpiness should not happen: the NN has info from previous frames, so it shouldn't be possible for something to suddenly jump.
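
To make the distinction I mean concrete, here is a toy comparison (invented shapes and layers, not Tesla's networks): a per-frame 2D convolution sees one image at a time, while a 3D convolution over a short clip mixes several consecutive frames, so its output at a given moment carries information from earlier frames.

```python
# Toy illustration, not Tesla's networks: a per-frame 2D conv has no memory of
# earlier frames, while a 3D conv over a short clip mixes consecutive frames.
import torch
import torch.nn as nn

per_frame = nn.Conv2d(3, 16, kernel_size=3, padding=1)                    # one image at a time
clip_model = nn.Conv3d(3, 16, kernel_size=(4, 3, 3), padding=(0, 1, 1))   # four frames at once

frame = torch.randn(1, 3, 96, 160)       # (batch, channels, H, W)
clip = torch.randn(1, 3, 4, 96, 160)     # (batch, channels, time, H, W)

print(per_frame(frame).shape)   # torch.Size([1, 16, 96, 160]) - depends only on this frame
print(clip_model(clip).shape)   # torch.Size([1, 16, 1, 96, 160]) - each output mixes 4 frames
```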
 
Tesla will figure this out soon

Definitely “soon” is not a precise enough term when it comes to Tesla and FSD.

My only concern, as a new customer, is whether this will be one more year or more like five.

Yeah, that is the big question. Everyone wants to know, and some of the early adopters, from lived experience, are definitely guessing the longer end of that timeframe.

I have a lot of questions, which I wrote two paragraphs on, but then deleted. I guess “we will see.”
 
That's not the distinction, from my understanding. The way AP worked is frame by frame: the NN analyzes things per frame and then spits out the result for that frame. That is why things are so jumpy (anyone who has used AP for a while will notice this).

If instead the NN takes in a video (multiple frames at once, maybe even working with the video compression info, like keyframes, interframes, macroblocks, motion vectors, etc.), then that jumpiness should not happen: the NN has info from previous frames, so it shouldn't be possible for something to suddenly jump.
Everyone has their opinion on how they think AP works, but what I've told you are facts regardless of opinion. My background is actually in computer vision; I have been working on dedicated computer vision projects since 2016 and have personal friends on the AP team at Tesla.

Convolutional neural networks take in frames. You can create a pipeline that starts with stitched frames (i.e. video) and have it parsed out on the fly, which makes it seem like you are working with "video" only. The training that occurs is a supervised approach in which annotators look over parsed video frames and label objects of interest. The labeling process creates an XML file with the coordinate points of what you labeled and the classification. This is what then goes into training, and it is why this is a supervised approach.

At inference time (i.e. when you are driving your car on Autopilot), the video is read in, parsed into frames, and run through the layers of the convolutional neural network, and inferences on object coordinates are made. Fast-forward this to 20 FPS of processing through the GPU and you have live "video" inferencing.
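
For illustration, reading one frame's annotation back for training looks roughly like this. I'm assuming a Pascal-VOC-style XML layout here; the tag names and paths are generic examples, not Tesla's actual tooling.

```python
# Generic example of reading one frame's annotation for supervised training.
# Pascal-VOC-style layout assumed; tag names and paths are illustrative only.
import xml.etree.ElementTree as ET

def load_labels(xml_path):
    """Return (class_name, xmin, ymin, xmax, ymax) for every labeled object in one frame."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.findtext("name"),
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

# A training sample is then simply (frame image, load_labels(matching XML)).
# At inference the trained network runs on each decoded frame; do that roughly
# 20 times per second on the GPU and you have "live video" inferencing.
# boxes = load_labels("frames/frame_000123.xml")
```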
 
That's not the distinction from my understanding. The way AP worked is frame by frame. The NN analyzes things per frame and then spits out the result for that frame. That is why things are so jumpy (anyone who have used AP for a while will notice this).

If instead the NN takes in a video (multiple frames at once, maybe even work with the video compression info, like keyframes, interframes, macroblocks, motion vectors, etc), then that should not be possible, given the NN has info from previous frames, so it shouldn't be possible for something to suddenly jump.
Read my message above.

Now, about labeling "on video": look at the computer-vision offerings from IBM, for example. You load in your video and you can label right on the video, but it parses the video into frames, and corresponding XML files are saved to a directory to be used for training. A CNN takes in video frames (pics); your pipeline can start with video, but somewhere in that pipeline you'll see a block of OpenCV code to parse out frames for CNN ingestion (and it has to be a CNN, not just any NN; basic neural nets would make AP unfeasible).
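
And the write side of the same thing, sketched with the same generic assumptions (a Pascal-VOC-style layout with made-up paths; not IBM's or Tesla's exact format): one XML file per parsed frame ends up in the training directory next to its image.

```python
# Generic example of the write side: one Pascal-VOC-style XML per parsed frame,
# saved next to the image for training. Paths and tag names are illustrative only.
import os
import xml.etree.ElementTree as ET

def write_annotation(xml_path, frame_name, objects):
    """objects: list of (class_name, xmin, ymin, xmax, ymax) drawn on one frame."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = frame_name
    for name, xmin, ymin, xmax, ymax in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(val)
    ET.ElementTree(root).write(xml_path)

os.makedirs("frames", exist_ok=True)
write_annotation("frames/frame_000123.xml", "frame_000123.jpg",
                 [("truck", 412, 220, 655, 430)])
```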
 
Thanks for posting this Diplomat. I found it quite encouraging, and it definitely boosts my confidence that Tesla will figure this out soon. It looks like Tesla is doing a good job incorporating time domain information into their processing (i.e., videos), and this seems essential if they expect to estimate distance, velocity, and acceleration from cameras alone. I also think the post processing analysis they are doing to learn from the video data looks very promising.

It may be a long road, but I really think they are going to crack this, and it will eventually surpass the best we can do with radar at present. My only concern, as a new customer, is whether this will be one more year or more like five.

That is 2011- and 2014-era low-resolution 2D ACC radar, which no one uses for AVs.

Critical thinking isn't illegal. This is like getting excited and hyped because Samsung or LG came out and said their 2021 TV beats their 2011 and 2014 models in some scenarios.
 
Everyone has their opinion on how they think AP works, but what I've told you are facts regardless of opinion. My background is actually in computer vision; I have been working on dedicated computer vision projects since 2016 and have personal friends on the AP team at Tesla.

So you're telling us it's not possible to use videos as training data? What are you actually saying?
 
Clarification on point 2, "Tesla is now training with video, not only on images":

Video is processed on the fly into frames (images)... that's how they are fed to a CNN. Tesla might be using what they call "surround video", but the processing of video into frames is still required (frames played in sequence == video).
Video is nothing more than frames (images) in sequence.
 
Read my message above.

Now, about labeling "on video": look at the computer-vision offerings from IBM, for example. You load in your video and you can label right on the video, but it parses the video into frames, and corresponding XML files are saved to a directory to be used for training. A CNN takes in video frames (pics); your pipeline can start with video, but somewhere in that pipeline you'll see a block of OpenCV code to parse out frames for CNN ingestion (and it has to be a CNN, not just any NN; basic neural nets would make AP unfeasible).


Wasn't the idea that video would let them do far LESS supervised labeling?

Like, when they did it frame by frame, you had to label an object TRUCK in every frame, but with video it would understand that if a human labels a thing TRUCK in frame 1, the system can self-label that same object in future frames so long as it remains visible, thus saving a ton of human effort?
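
To picture what I mean, here is a crude sketch of the idea (the clip path and box are made up, the tracker is OpenCV's CSRT from opencv-contrib-python, and Tesla's actual auto-labeling would surely be far more sophisticated): a human draws the TRUCK box once on the first frame, and a tracker carries it forward while the object stays visible.

```python
# Crude sketch of the idea: propagate one human-drawn box with a tracker.
# Uses OpenCV's CSRT tracker (opencv-contrib-python); clip path and box are made up,
# and a real auto-labeling system would be far more sophisticated than this.
import cv2

cap = cv2.VideoCapture("clip.mp4")
ok, first_frame = cap.read()

human_box = (412, 220, 243, 210)          # (x, y, w, h) someone drew around a TRUCK in frame 1
tracker = cv2.TrackerCSRT_create()
if ok:
    tracker.init(first_frame, human_box)

auto_labels = []                           # boxes the machine fills in for later frames
while ok:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)     # follow the same object into this frame
    if not found:
        break                              # object no longer visible; stop self-labeling
    auto_labels.append(box)

cap.release()
print(f"propagated one human label to {len(auto_labels)} more frames")
```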
 
that's it, guys. that's all there is to this. it was too hard for them.
LOL, that is what you got out of that?

What I heard from this talk was that now they have ALL NN engineers focused on Tesla Vision; there is no sensor fusion team that takes away skill/resources from the main team.
And the fact that vision is already better/smoother at velocity and depth estimation than the legacy radar, at this stage, sounds like they "solved the problem the correct way" instead of "barking up the wrong tree".
 
long story short (imho): fusing the 2 domains of sensor inputs is 'too hard' so we're giving up.

that's it, guys. that's all there is to this. it was too hard for them.

(sigh)

and yes, it's not trivial, but I think it's absurd that they threw in the towel wrt radar. wrong direction, guys. you won't ever get there THIS way ;(

Theoretically, they could fuse 3 or 4 or 5 domains of sensor inputs, right? Too hard, or unnecessary?