
4D vision


shrineofchance


This is 2D vision:


[Attached image]

Image credit: greentheonly.

Projecting out 3D models from detections within individual 2D images.
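To picture what "projecting out" means here, a minimal sketch (the intrinsics and depth are illustrative values, nothing from Tesla's actual stack): take the centre of a 2D detection box plus an estimated depth, and back-project it into 3D camera coordinates with a pinhole model.

```python
import numpy as np

def backproject_detection(box_xyxy, depth_m, K):
    """Lift the centre of a 2D box (pixels) to a 3D point in the camera frame."""
    x1, y1, x2, y2 = box_xyxy
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # box centre in pixel coordinates
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Pinhole back-projection: pixel position + depth -> 3D point.
    return np.array([(u - cx) * depth_m / fx,
                     (v - cy) * depth_m / fy,
                     depth_m])

# Hypothetical camera intrinsics and a single detection about 25 m away.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
print(backproject_detection((600, 300, 700, 420), depth_m=25.0, K=K))
```

The weakness being pointed at: each camera and each frame gets lifted to 3D on its own, so the separate per-camera, per-frame models have to be stitched together and smoothed afterwards.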





This is 3D vision:


[Attached images]

Image source: Karpathy at CVPR 2020.

The neural network directly outputs a 3D model from the 8 camera images.
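As a rough sketch of that pattern (the layer sizes and structure are guesses for illustration, not Tesla's actual architecture): one shared backbone runs over every camera image, and the per-camera features are fused into a single output representing the 3D space around the car.

```python
import torch
import torch.nn as nn

class MultiCamFusion(nn.Module):
    """Toy multi-camera network: shared per-camera backbone -> one fused output."""
    def __init__(self, num_cams=8, out_channels=1):
        super().__init__()
        # Small shared backbone applied to each of the 8 camera images.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Naive fusion: concatenate per-camera features and mix with a 1x1 conv.
        # (A real system would project features into a common 3D/top-down space.)
        self.fuse = nn.Conv2d(32 * num_cams, 64, 1)
        self.head = nn.Conv2d(64, out_channels, 1)   # e.g. per-cell occupancy logits

    def forward(self, images):                        # images: (B, 8, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.view(b * n, c, h, w))
        feats = feats.view(b, -1, feats.shape[-2], feats.shape[-1])
        return self.head(self.fuse(feats))            # single output for all cameras

net = MultiCamFusion()
print(net(torch.randn(1, 8, 3, 128, 256)).shape)      # dummy 8-camera input
```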



This is 4D vision (3D + time):


“[The neural network] has seen the images over time and [it has] done the tracking. And having accumulated information from all those frames, here's actually what the world looks like around you.” -Karpathy


Original source.
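One way to picture the "accumulated information from all those frames" part, as a sketch only (the recurrent cell and feature sizes are my assumptions, not Karpathy's description of the actual network): per-frame features pass through a recurrent module, so each frame's output carries state from the frames before it.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Toy '3D + time' module: a GRU carries state across the frame sequence."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, frame_feats):            # frame_feats: (B, T, feat_dim)
        fused, _ = self.rnn(frame_feats)       # hidden state accumulates past frames
        return self.head(fused)                # time-aware output for every frame

model = TemporalFusion()
seq = torch.randn(2, 12, 256)                  # 12 frames of (assumed) per-frame features
print(model(seq).shape)                        # torch.Size([2, 12, 256])
```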
 
The temporal part is just prediction.

Nope. Prediction is a different thing. The vision neural networks are using temporal information for perception. Karpathy hasn't elaborated on exactly how they're doing this, but occlusion is one example of how sequences of frames could yield useful information.

If you enforce internal consistency within a sequence of frames, together with the assumption that objects don't just pop in and out of existence at random, you could train the vision NNs to "remember" objects that become occluded by other objects.

Enforcing temporal consistency might give better vision NN predictions overall.
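To make that concrete (pure speculation on my part, not a known Tesla training objective): a consistency term could penalize frame-to-frame jumps in each object's predicted existence score, so a briefly occluded object isn't dropped and then re-detected as a new one.

```python
import torch

def temporal_consistency_loss(existence_logits):
    """existence_logits: (B, T, N) logits that object n exists at frame t."""
    probs = torch.sigmoid(existence_logits)
    # Differences between consecutive frames; big jumps = objects popping in/out.
    frame_diff = probs[:, 1:, :] - probs[:, :-1, :]
    return frame_diff.abs().mean()

# Example: 4 frames, 3 tracked objects; this term would be added to the main loss.
logits = torch.randn(1, 4, 3, requires_grad=True)
loss = temporal_consistency_loss(logits)
loss.backward()
print(float(loss))
```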
 
I don't know what other companies are doing, but Tesla is moving to training on 4D information, i.e. training on temporal sequences of 3D representations. This is essential not just for object permanence but also for depth perception.

Other companies could certainly build the same models. In the past, video-based training has been limited by training and inference compute; transformers may have helped reduce these requirements.

Of course, as with most things in deep learning, 4D transformers probably need lots and lots of data to work well.
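For a sense of what a transformer over video could look like (sizes are illustrative, and this says nothing about Tesla's actual networks): a stock transformer encoder over one feature token per frame lets every frame attend to every other frame in the clip.

```python
import torch
import torch.nn as nn

d_model, num_frames = 128, 16
# Stock PyTorch transformer encoder applied over the time dimension.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)

clip_tokens = torch.randn(1, num_frames, d_model)  # one (assumed) feature token per frame
out = encoder(clip_tokens)                          # each frame attends to all frames
print(out.shape)                                    # torch.Size([1, 16, 128])
```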
 