
Transformers (Tesla’s next big rewrite)


heltok

Active Member
I wanted to catch up on the progress in using transformers for computer vision, so here's a quick primer for you all.

Transformers in Computer Vision: Farewell Convolutions!
In parallel to how Transformers leveraged self-attention to model long-range dependencies in text, recent works have presented techniques that use self-attention to efficiently overcome the limitations imposed by convolutional inductive biases.

These works have already shown promising results in multiple Computer Vision benchmarks in fields such as Object Detection, Video Classification, Image Classification and Image Generation. Some of these architectures are able to match or outperform SOTA results even when getting rid of convolutional layers and relying solely on self-attention.

The visual representations generated from self-attention components do not carry the spatial constraints imposed by convolutions. Instead, they can learn the most suitable inductive biases for the task and for the stage at which the layer sits in the pipeline. It has been shown that self-attention used in the early stages of a model can learn to behave similarly to a convolution.
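
To make that concrete, here is a minimal sketch (my own toy example, not from any of the papers above) of self-attention over image patches: every patch can attend to every other patch, and the attention pattern is computed per input rather than fixed like a convolution kernel.

import torch
import torch.nn as nn

# Toy example (not from the papers above): self-attention over image patches.
class PatchSelfAttention(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, heads=8):
        super().__init__()
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))      # learned positions
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, imgs):                      # imgs: (B, 3, H, W)
        x = self.to_patches(imgs)                 # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        x = x + self.pos
        out, weights = self.attn(x, x, x)         # every patch attends to every patch
        return out, weights                       # weights: (B, num_patches, num_patches)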

Here is a paper for video where they set SOTA on Charades (you know, the game), which seems very relevant for understanding traffic, pedestrians, police, etc.:
https://arxiv.org/pdf/1711.07971.pdf
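
For anyone who doesn't want to read the whole thing: the core of that paper is the "non-local block", which is basically self-attention applied across space and time in a video. A compressed sketch of the embedded-Gaussian version (channel sizes are illustrative, not the paper's exact configuration):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Compressed sketch of a non-local block (self-attention over space and time).
class NonLocalBlock3D(nn.Module):
    def __init__(self, channels=256, inner=128):
        super().__init__()
        self.theta = nn.Conv3d(channels, inner, 1)    # queries
        self.phi   = nn.Conv3d(channels, inner, 1)    # keys
        self.g     = nn.Conv3d(channels, inner, 1)    # values
        self.out   = nn.Conv3d(inner, channels, 1)    # project back before the residual

    def forward(self, x):                             # x: (B, C, T, H, W) video features
        B, C, T, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (B, THW, inner)
        k = self.phi(x).flatten(2)                    # (B, inner, THW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (B, THW, inner)
        attn = F.softmax(q @ k, dim=-1)               # every position attends across space AND time
        y = (attn @ v).transpose(1, 2).reshape(B, -1, T, H, W)
        return x + self.out(y)                        # residual connection, as in the paper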

Watch this video starting at 16 min:
(SOTA and trains much faster)

All in all, I believe changing from CNN/RNN to transformers will:
*Improve performance on simple tasks such as detecting vehicles
*Greatly improve hard tasks such as predicting how a pedestrian will walk or what a police officer is doing
*Be able to train on the larger datasets generated by 4D labeling
*Require a huge dataset to be useful (which Tesla happens to have)

The good news is that development of this can happen in parallel with development of their current system, since they can use the same dataset. One day they just swap in the new neural network and get a bump in performance on tasks that benefit from more connections between parts of videos.

Bonus: Karpathy tweets about transformers:
https://twitter.com/karpathy/status/1312279279741276161?lang=en
https://twitter.com/karpathy/status/1284928564530278400?lang=en
https://twitter.com/karpathy/status/1305302243449516032?lang=en
"Transformers
Specifically, organizing information processing into multiplicative message passing in graphs; generalizing, simplifying, unifying, improving neural nets across domains. For a while there I was growing bit jaded with slowing progress on neural net architectures
feels like a lot is kicked up in dust, and the closest we've come to a full refactor of your typical neural net. stop me if I'm being overly dramatic :)"
https://twitter.com/karpathy/status/1265742371649544192?lang=en
https://twitter.com/karpathy/status/1305312717364887552
https://twitter.com/karpathy/status/1304142058345512960

Source code by Karpathy for image classification using transformers:
karpathy/minGPT
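
To be clear, the lines below are not minGPT itself, just my own toy illustration of the idea it demonstrates: cut an image into tokens, run a plain transformer encoder over them, and classify from a CLS token.

import torch
import torch.nn as nn

# Toy illustration (not minGPT): image classification by treating patches as tokens.
class TinyImageTransformer(nn.Module):
    def __init__(self, num_classes=10, patch=4, dim=128, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patches -> tokens
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                   # classification token
        self.pos = nn.Parameter(torch.zeros(1, 1 + (32 // patch) ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                              # x: (B, 3, 32, 32), e.g. CIFAR-10
        tok = self.embed(x).flatten(2).transpose(1, 2)
        tok = torch.cat([self.cls.expand(len(x), -1, -1), tok], dim=1) + self.pos
        return self.head(self.encoder(tok)[:, 0])      # classify from the CLS token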
 
*Greatly improve hard tasks such as predicting how a pedestrian will walk or what a police officer is doing
It would seem like transformers could be particularly useful in finding dependencies across multiple cameras over time for these examples. Instead of explicitly stitching together a view with CNN/RNN, the network could learn how the different views relate to each other and when they're important. Although I wonder if this would be sensitive to the different camera positions for vehicles, e.g., sedan vs. SUV vs. truck.
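
Something like this is what I have in mind (purely speculative sketch, nothing to do with Tesla's actual implementation): tag each camera's feature tokens with a learned camera embedding and let one transformer attend across all of them, so the "stitching" is learned rather than hand-built.

import torch
import torch.nn as nn

# Speculative sketch: fuse per-camera features with one transformer.
# A learned per-camera embedding could also absorb pose differences (sedan vs. SUV vs. truck).
class MultiCamFusion(nn.Module):
    def __init__(self, num_cams=8, dim=256, heads=8, depth=2):
        super().__init__()
        self.cam_embed = nn.Embedding(num_cams, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, depth)

    def forward(self, feats):                          # feats: (B, num_cams, tokens_per_cam, dim)
        B, C, N, D = feats.shape
        cam_ids = torch.arange(C, device=feats.device)
        feats = feats + self.cam_embed(cam_ids)[None, :, None, :]   # tag tokens with their camera
        return self.fuse(feats.reshape(B, C * N, D))   # attention spans all cameras at once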
 
It would seem like transformers could be particularly useful in finding dependencies across multiple cameras over time for these examples. Instead of explicitly stitching together a view with CNN/RNN, the network could learn how the different views relate to each other and when they're important. Although I wonder if this would be sensitive to the different camera positions for vehicles, e.g., sedan vs. SUV vs. truck.
Great point. I think they will handle different camera poses the same as a language model will handle different authors, different dialects or even different languages. Like if I you know add some you know kind of extra California kind of like annoying filler words and you still kind of like understand what I am saying.
 
... Like if I you know add some you know kind of extra California kind of like annoying filler words and you still kind of like understand what I am saying.

As a Californian, I legitimately read this as "For example, if I add some extra California kind of annoying filler words, you still understand what I am saying". Had to read it another 2 times to actually see what you meant! :p

Only one of the "kind of"s didn't fit naturally, and stood out to me as odd, so good job, you're a 90% convincing Californian!
 
Important question: how does this all map to the Tesla HW3 chip, whose neural processing units are optimized for convolutions with their 96 × 96 multiply-accumulate arrays, etc.:

https://en.wikichip.org/wiki/tesla_(car_company)/fsd_chip
It’s basically the same multiply-accumulate computations. As with a CNN, they just use the same part of the processor over and over again until they reach the end of the network. I don’t think they will run out of memory (I think, could be wrong).
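
To put rough numbers on it (my own back-of-envelope with made-up sizes, nothing official): one attention layer is just a few dense matrix multiplies, i.e. exactly the multiply-accumulate work a MAC array is built for.

# Back-of-envelope MAC count for a single attention layer (made-up sizes, nothing official).
def attention_macs(seq_len, dim):
    qkv = 3 * seq_len * dim * dim        # project inputs to queries, keys, values
    scores = seq_len * seq_len * dim     # Q @ K^T
    mix = seq_len * seq_len * dim        # softmax(scores) @ V
    out = seq_len * dim * dim            # output projection
    return qkv + scores + mix + out

# e.g. 1024 tokens with 256-dim features: about 0.8 billion multiply-accumulates per layer
print(f"{attention_macs(1024, 256) / 1e9:.2f} GMAC")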

There might be some benefit to changing Dojo to be more optimized for larger models. I'm not sure; I think we'd need to ask an expert about this. But at least Tesla has one of the experts in charge, so they probably know what they are doing.
 
... Like if I you know add some you know kind of extra California kind of like annoying filler words and you still kind of like understand what I am saying.
Like, brah, I just, you know almost like right this minute, did Like your post. It's like super annoying me all the time when dudes do that, you know for no reason.
But yeah AI literally rocks.
 
Great point. I think they will handle different camera poses the same as a language model will handle different authors, different dialects or even different languages. Like if I you know add some you know kind of extra California kind of like annoying filler words and you still kind of like understand what I am saying.
Seriously, you motivated me to experiment with my Amazon Echo. Indeed, I can throw almost any amount of useless filler speech into a command to turn on the hallway light, and the command is still processed perfectly. I don't know the core parsing techniques there, which may be quite different from the FSD NN model; I'm just noting your point about the robust capability to extract the key words from the extraneous ones. Interesting.

I haven't tried a barrage of profanity; that probably gets ignored too, but it goes into my Voice Command History and for whatever reason I'm not anxious to have that stored by AWS.
 
Did you see that Andrej has a podcast?

 
Did you see that Andrej has a podcast?

Yes! Really great stuff. Even made a thread where I intend to list all his podcasts:

https://teslamotorsclub.com/tmc/threads/karpathy’s-talk-unrelated-to-tesla.221765/
 
Here's the slide where he talks about transformers:
[Attached slide: transformer.png]

"Our architecture roughly looks like this. We have these images coming in from multiple cameras on the top. All of them are processed by an image extractor like a backbone -- think ResNet kind-of style. Then there's a multicam fusion that fuses the information from all the 8 views, and this is kind of a transformer to fuse this information. We fuse the information first across all the cameras and then across all of time, and that is also done by a transformer, by a recurrent neural network or just by 3-dimensional convolutions. We've experimented with a lot of fusion strategies here to get this to work really well.

And then what we have after the fusion is done, we have this branching structure that doesn't just consist of heads -- but actually we've expanded this over the last year or so -- where we now have heads that branch into trunks that branch into terminals. So there's a lot of branching structure. And the reason why you want this branching structure is because there's a huge amount of outputs that you're interested in, and you can't afford to have a single neural network for every one of the individual outputs. You have to of course amortize the forward pass for efficient inference at test time, so there's a lot of feature sharing here.

The other nice benefit of the branching structure is that it decouples at the terminals all of these signals. So if I'm someone working on velocity for a particular object type or something like that, I have a small piece of neural network that I can actually fine-tune without touching any of the other signals. So I can work in isolation to some extent and actually get something to work pretty well. Basically the iteration scheme is that a lot of people are fine-tuning, and once in a while we do an uprev of all the backbone end-to-end."
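
Rough sketch of how I picture that heads → trunks → terminals branching (names and sizes made up, obviously not Tesla's actual network): the trunk features are computed once per forward pass and each terminal is a small piece that can be fine-tuned on its own.

import torch.nn as nn

# Made-up sketch of the branching structure: shared trunks, small decoupled terminals.
def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, out), nn.ReLU(), nn.Linear(out, out))

class BranchingHead(nn.Module):
    def __init__(self, fused_dim=512):
        super().__init__()
        self.trunks = nn.ModuleDict({
            "vehicles":    mlp(fused_dim, 256),
            "pedestrians": mlp(fused_dim, 256),
        })
        self.terminals = nn.ModuleDict({               # each one can be fine-tuned in isolation
            "vehicle_velocity":      nn.Linear(256, 3),
            "vehicle_trajectory":    nn.Linear(256, 30),
            "pedestrian_attributes": nn.Linear(256, 8),
        })
        self.terminal_trunk = {                        # which trunk feeds each terminal
            "vehicle_velocity": "vehicles",
            "vehicle_trajectory": "vehicles",
            "pedestrian_attributes": "pedestrians",
        }

    def forward(self, fused):                          # fused: (B, fused_dim) after multicam/time fusion
        trunk_feats = {k: t(fused) for k, t in self.trunks.items()}   # trunks computed once, shared
        return {name: term(trunk_feats[self.terminal_trunk[name]])
                for name, term in self.terminals.items()}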

Is there a significance to the left side of the backbone vs the right side? Both sides seem to be showing that previous frames/times of multicam fusion are used as inputs to the various heads, which seems to be duplicated but might represent object types, e.g., vehicles and pedestrians, each outputting attributes, trajectory, etc.? Unless the two sides are representing some A/B node processing split?