Tesla.com - "Transitioning to Tesla Vision"

Karpathy is once again hammering home the point that, currently, only Tesla can source the data required for FSD.

Thank you diplomat for posting the video.

The new bits that I gleaned from Karpathy's talk:
1) Tesla is using transformer NNs for fusing surround video (see the sketch after this list).
2) Tesla is now training with video, not only on images.
3) Tesla has built a new Nvidia-based supercomputer to train their NNs.
4) Tesla is still working on Dojo.
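
To make point 1 a bit more concrete (this is my own generic sketch, not anything shown in the talk): the rough idea of fusing surround video with a transformer is to turn each camera's features into tokens and let self-attention mix them across cameras. PyTorch, the toy backbone, and all the dimensions below are my assumptions; only the eight-camera count matches Tesla's hardware.

```python
# Generic sketch of multi-camera fusion with a transformer encoder (PyTorch).
# Only the 8-camera count matches Tesla hardware; every other choice is made up.
import torch
import torch.nn as nn

class SurroundFusion(nn.Module):
    def __init__(self, n_cameras=8, d_model=256):
        super().__init__()
        # Tiny stand-in for a real per-camera image backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=4, padding=1), nn.ReLU(),
        )
        # Learned embedding telling attention which camera a token came from.
        self.camera_embed = nn.Embedding(n_cameras, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, images):
        # images: (batch, n_cameras, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.reshape(b * n, c, h, w))   # (b*n, d_model, h', w')
        tokens = feats.flatten(2).transpose(1, 2)               # (b*n, tokens_per_cam, d_model)
        per_cam = tokens.shape[1]
        tokens = tokens.reshape(b, n * per_cam, -1)             # all cameras in one token set
        cam_ids = torch.arange(n).repeat_interleave(per_cam)    # camera id for every token
        tokens = tokens + self.camera_embed(cam_ids)
        return self.fusion(tokens)                              # attention mixes across cameras

fused = SurroundFusion()(torch.randn(1, 8, 3, 128, 256))
print(fused.shape)  # torch.Size([1, 1024, 256])
```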

You are welcome.

By the way, I did start a separate thread just to discuss the Karpathy talk:

So if you want to discuss this or other bits that you got from the video, feel free to share them in the other thread.

Thanks.
 
Karpathy is once again hammering home the point that, currently, only Tesla can source the data required for FSD.

Thank you diplomat for posting the video.

The new bits that I gleaned from Karpathy's talk:
1) Tesla is using transformer NNs for fusing surround video.
2) Tesla is now training with video, not only on images.
3) Tesla has built a new Nvidia-based supercomputer to train their NNs.
4) Tesla is still working on Dojo.
Clarification on point 2, "Tesla is now training with video, not only on images":

Video is processed on the fly into frames (images)... that's how they are fed to a CNN. Tesla might be using what they call "surround video", but the processing of video into frames is still required (frames played in sequence == video).
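
If you haven't seen that step before, a minimal sketch of the usual "parse video into frames" loop with OpenCV looks like this (the file name and resize target are placeholders, nothing Tesla-specific):

```python
# Minimal example of the usual "video to frames" step with OpenCV.
# "clip.mp4" and the resize target are placeholders, nothing Tesla-specific.
import cv2

cap = cv2.VideoCapture("clip.mp4")
frames = []
while True:
    ok, frame = cap.read()                         # each read() yields one decoded frame (an image)
    if not ok:
        break
    frames.append(cv2.resize(frame, (640, 480)))   # whatever size the network expects
cap.release()

# "Video" to the network is just this ordered list of images.
print(f"decoded {len(frames)} frames")
```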
 
Video is processed on the fly into frames (images)... that's how they are fed to a CNN. Tesla might be using what they call "surround video", but the processing of video into frames is still required (frames played in sequence == video).

It's possible you're right, but Elon has said that Tesla is working on video labeling and training directly. And this is the first time I've heard Karpathy talking about video training / data directly. In all prior talks, Karpathy has always referred to Tesla's data as images.
 
Karpathy is presenting at CVPR. He gave some specific info on why Tesla dropped radar: basically, radar and sensor fusion were not working well. Karpathy explains that Tesla could have spent time fixing radar and sensor fusion but decided their engineering resources would be better spent on just doing camera vision instead.

I grabbed these screenshots.

The first screenshot shows a case of a car in front braking hard. The graphs show the acceleration of the Tesla over time, with radar and without radar.

Basically, radar would lose track of the car in front, causing very hard braking. Now with Tesla Vision, the braking is much smoother.

[Screenshot: acceleration-over-time graphs, radar vs. Tesla Vision]


The next screenshot shows the case of an overpass. Radar mistakes the overpass for a stationary object; Tesla Vision does not make that mistake, so it is very smooth now.

[Screenshot: overpass scenario, radar vs. Tesla Vision]


Here is the live stream. Karpathy's talk starts at 7:51:46:

Thanks for posting this Diplomat. I found it quite encouraging, and it definitely boosts my confidence that Tesla will figure this out soon. It looks like Tesla is doing a good job incorporating time domain information into their processing (i.e., videos), and this seems essential if they expect to estimate distance, velocity, and acceleration from cameras alone. I also think the post processing analysis they are doing to learn from the video data looks very promising.

It may be a long road, but I really think they are going to crack this, and it will eventually surpass the best we can do with radar at present. My only concern, as a new customer, is whether this will be one more year or more like five.
 
Clarification on point 2, "Tesla is now training with video, not only on images":

Video is processed on the fly into frames (images)... that's how they are fed to a CNN. Tesla might be using what they call "surround video", but the processing of video into frames is still required (frames played in sequence == video).
That's not the distinction, from my understanding. The way AP worked is frame by frame: the NN analyzes things per frame and then spits out the result for that frame. That is why things are so jumpy (anyone who has used AP for a while will notice this).

If instead the NN takes in a video (multiple frames at once, maybe even working with the video compression info, like keyframes, interframes, macroblocks, motion vectors, etc.), then that jumpiness should not happen: the NN has info from previous frames, so it shouldn't be possible for something to suddenly jump.
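
To make the distinction I mean concrete, here is a toy comparison (invented shapes and layers, not Tesla's networks): a per-frame 2D convolution sees one image at a time, while a 3D convolution over a short clip mixes several consecutive frames, so its output at a given moment carries information from earlier frames.

```python
# Toy illustration, not Tesla's networks: a per-frame 2D conv has no memory of
# earlier frames, while a 3D conv over a short clip mixes consecutive frames.
import torch
import torch.nn as nn

per_frame = nn.Conv2d(3, 16, kernel_size=3, padding=1)                    # one image at a time
clip_model = nn.Conv3d(3, 16, kernel_size=(4, 3, 3), padding=(0, 1, 1))   # four frames at once

frame = torch.randn(1, 3, 96, 160)       # (batch, channels, H, W)
clip = torch.randn(1, 3, 4, 96, 160)     # (batch, channels, time, H, W)

print(per_frame(frame).shape)   # torch.Size([1, 16, 96, 160]) - depends only on this frame
print(clip_model(clip).shape)   # torch.Size([1, 16, 1, 96, 160]) - each output mixes 4 frames
```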
 
Tesla will figure this out soon

Definitely “soon” is not a precise enough term when it comes to Tesla and FSD.

My only concern, as a new customer, is whether this will be one more year or more like five.

Yeah, that is the big question. Everyone wants to know, and some of the early adopters, from lived experience, are definitely guessing the longer end of that timeframe.

I have a lot of questions, which I wrote two paragraphs on, but then deleted. I guess “we will see.”
 
That's not the distinction, from my understanding. The way AP worked is frame by frame: the NN analyzes things per frame and then spits out the result for that frame. That is why things are so jumpy (anyone who has used AP for a while will notice this).

If instead the NN takes in a video (multiple frames at once, maybe even working with the video compression info, like keyframes, interframes, macroblocks, motion vectors, etc.), then that jumpiness should not happen: the NN has info from previous frames, so it shouldn't be possible for something to suddenly jump.
Everyone has their opinion on how they think AP works, but what I've told you are facts regardless of opinion. My background is actually in computer vision; I have been working on dedicated computer vision projects since 2016 and have personal friends on the AP team at Tesla.

Convolutional neural networks take in frames. You can create a pipeline that starts with stitched frames (i.e. video) and have it parsed out on the fly, which makes it seem like you are working with "video" only. The training that occurs is a supervised approach in which annotators look over parsed video frames and label objects of interest. The labeling process creates an XML file with the coordinate points of what you labeled and the classification. This is what then goes into training, and it is why this is a supervised approach.

At inference time (i.e. when you are driving your car on Autopilot), the video is read in, parsed into frames, and run through the layers of the convolutional neural network, and inferences on object coordinates are made. Fast-forward this to 20 FPS of processing through the GPU and you have live "video" inferencing.
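
For illustration, reading one frame's annotation back for training looks roughly like this. I'm assuming a Pascal-VOC-style XML layout here; the tag names and paths are generic examples, not Tesla's actual tooling.

```python
# Generic example of reading one frame's annotation for supervised training.
# Pascal-VOC-style layout assumed; tag names and paths are illustrative only.
import xml.etree.ElementTree as ET

def load_labels(xml_path):
    """Return (class_name, xmin, ymin, xmax, ymax) for every labeled object in one frame."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.findtext("name"),
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

# A training sample is then simply (frame image, load_labels(matching XML)).
# At inference the trained network runs on each decoded frame; do that roughly
# 20 times per second on the GPU and you have "live video" inferencing.
# boxes = load_labels("frames/frame_000123.xml")
```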
 
That's not the distinction from my understanding. The way AP worked is frame by frame. The NN analyzes things per frame and then spits out the result for that frame. That is why things are so jumpy (anyone who have used AP for a while will notice this).

If instead the NN takes in a video (multiple frames at once, maybe even work with the video compression info, like keyframes, interframes, macroblocks, motion vectors, etc), then that should not be possible, given the NN has info from previous frames, so it shouldn't be possible for something to suddenly jump.
Read my message above.

Now, about labeling "on video": look at the computer-vision offerings from IBM, for example. You load in your video and you can label right on the video, but it parses the video into frames, and corresponding XML files are saved to a directory to be used for training. A CNN takes in video frames (pics); your pipeline can start with video, but somewhere in that pipeline you'll see a block of OpenCV code to parse out frames for CNN ingestion (and it has to be a CNN, not just any NN; basic neural nets would make AP unfeasible).
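
And the write side of the same thing, sketched with the same generic assumptions (a Pascal-VOC-style layout with made-up paths; not IBM's or Tesla's exact format): one XML file per parsed frame ends up in the training directory next to its image.

```python
# Generic example of the write side: one Pascal-VOC-style XML per parsed frame,
# saved next to the image for training. Paths and tag names are illustrative only.
import os
import xml.etree.ElementTree as ET

def write_annotation(xml_path, frame_name, objects):
    """objects: list of (class_name, xmin, ymin, xmax, ymax) drawn on one frame."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = frame_name
    for name, xmin, ymin, xmax, ymax in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(val)
    ET.ElementTree(root).write(xml_path)

os.makedirs("frames", exist_ok=True)
write_annotation("frames/frame_000123.xml", "frame_000123.jpg",
                 [("truck", 412, 220, 655, 430)])
```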
 
Thanks for posting this Diplomat. I found it quite encouraging, and it definitely boosts my confidence that Tesla will figure this out soon. It looks like Tesla is doing a good job incorporating time domain information into their processing (i.e., videos), and this seems essential if they expect to estimate distance, velocity, and acceleration from cameras alone. I also think the post processing analysis they are doing to learn from the video data looks very promising.

It may be a long road, but I really think they are going to crack this, and it will eventually surpass the best we can do with radar at present. My only concern, as a new customer, is whether this will be one more year or more like five.

That is 2011- and 2014-era low-resolution 2D ACC radar, which no one uses for AVs.

Critical thinking isn't illegal. This is like getting excited and hyped because Samsung or LG came out and said their 2021 TV beats their 2011 and 2014 models in some scenarios.
 
Everyone has their opinion on how they think AP works, but what I've told you are facts regardless of opinion. My background is actually in computer vision; I have been working on dedicated computer vision projects since 2016 and have personal friends on the AP team at Tesla.

So you're telling us it's not possible to use videos as training data? What are you actually saying?
 
Clarification on point 2, "Tesla is now training with video, not only on images":

Video is processed on the fly into frames (images)... that's how they are fed to a CNN. Tesla might be using what they call "surround video", but the processing of video into frames is still required (frames played in sequence == video).
Video is nothing more than frames (images) in sequence.
 
Read my message above.

Now, about labeling "on video": look at the computer-vision offerings from IBM, for example. You load in your video and you can label right on the video, but it parses the video into frames, and corresponding XML files are saved to a directory to be used for training. A CNN takes in video frames (pics); your pipeline can start with video, but somewhere in that pipeline you'll see a block of OpenCV code to parse out frames for CNN ingestion (and it has to be a CNN, not just any NN; basic neural nets would make AP unfeasible).


Wasn't the idea that video would let them do far LESS supervised labeling?

Like, when they did it frame by frame, you had to label an object TRUCK in every frame, but with video it would understand that if a human labels a thing TRUCK in frame 1, the system can self-label that same object in future frames so long as it remains visible, thus saving a ton of human effort?
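
To picture what I mean, here is a crude sketch of the idea (the clip path and box are made up, the tracker is OpenCV's CSRT from opencv-contrib-python, and Tesla's actual auto-labeling would surely be far more sophisticated): a human draws the TRUCK box once on the first frame, and a tracker carries it forward while the object stays visible.

```python
# Crude sketch of the idea: propagate one human-drawn box with a tracker.
# Uses OpenCV's CSRT tracker (opencv-contrib-python); clip path and box are made up,
# and a real auto-labeling system would be far more sophisticated than this.
import cv2

cap = cv2.VideoCapture("clip.mp4")
ok, first_frame = cap.read()

human_box = (412, 220, 243, 210)          # (x, y, w, h) someone drew around a TRUCK in frame 1
tracker = cv2.TrackerCSRT_create()
if ok:
    tracker.init(first_frame, human_box)

auto_labels = []                           # boxes the machine fills in for later frames
while ok:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)     # follow the same object into this frame
    if not found:
        break                              # object no longer visible; stop self-labeling
    auto_labels.append(box)

cap.release()
print(f"propagated one human label to {len(auto_labels)} more frames")
```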
 
that's it, guys. that's all there is to this. it was too hard for them.
LOL, that is what you got out of that?

What I heard from this talk was that now they have ALL NN engineers focused on Tesla Vision; there is no sensor fusion team that takes away skill/resources from the main team.
And the fact that vision is already better/smoother at velocity and depth estimation than the legacy radar, at this stage, sounds like they "solved the problem the correct way" instead of "barking up the wrong tree".
 
long story short (imho): fusing the 2 domains of sensor inputs is 'too hard' so we're giving up.

that's it, guys. that's all there is to this. it was too hard for them.

(sigh)

and yes, it's not trivial, but I think it's absurd that they threw in the towel wrt radar. wrong direction, guys. you won't ever get there THIS way ;(

Theoretically, they could fuse 3 or 4 or 5 domains of sensor inputs, right? Too hard, or unnecessary?