I wanted to catch up on the progress of using transformers for computer vision, so here's a quick primer for you all.
Transformers in Computer Vision: Farewell Convolutions!
In parallel to how Transformers leveraged self-attention to model long-range dependencies in text, recent works have presented techniques that use self-attention to efficiently overcome the limitations imposed by the inductive biases of convolutions.
These works have already shown promising results in multiple Computer Vision benchmarks in fields such as Object Detection, Video Classification, Image Classification and Image Generation. Some of these architectures are able to match or outperform SOTA results even when getting rid of convolutional layers and relying solely on self-attention.
The visual representations generated by self-attention components are not subject to the spatial constraints imposed by convolutions. Instead, they can learn the most suitable inductive biases depending on the task and on the stage where the layer is placed within the pipeline. It has been shown that self-attention used in the early stages of a model can learn to behave similarly to a convolution.
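To make that concrete, here is a minimal sketch of global self-attention over image patches in plain PyTorch (my own toy illustration, not code from any particular paper): nothing in it hard-codes a local receptive field, so every patch is free to attend to every other patch in the image.

```python
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    def __init__(self, patch_size=16, in_chans=3, embed_dim=256, num_heads=4):
        super().__init__()
        self.patch_size = patch_size
        # Linear projection of each flattened patch into a token embedding.
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Cut the image into non-overlapping p x p patches and flatten each.
        x = x.unfold(2, p, p).unfold(3, p, p)    # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = self.proj(x)                    # (B, N, embed_dim)
        # Global self-attention: each of the N tokens attends to all N tokens.
        # (A real model would also add positional embeddings here.)
        out, weights = self.attn(tokens, tokens, tokens)
        return out, weights                      # weights: (B, N, N)

# A 224x224 image becomes a 14x14 = 196-token sequence.
out, w = PatchSelfAttention()(torch.randn(1, 3, 224, 224))
print(out.shape, w.shape)  # torch.Size([1, 196, 256]) torch.Size([1, 196, 196])
```

The learned (N, N) attention map is what replaces the fixed k x k neighborhood of a convolution.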
Here is a paper on video understanding where they set SOTA on the Charades dataset (yes, it's named after the game), which seems very relevant for understanding traffic, pedestrians, police, etc.:
https://arxiv.org/pdf/1711.07971.pdf
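That paper's core contribution is the non-local block, which is essentially self-attention across all space-time positions of a video feature map. Here is a minimal sketch of the embedded-Gaussian version in PyTorch; it's my own simplification (the paper adds details like BatchNorm on the output conv), but the core operation is as in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        # 1x1x1 convs play the role of the theta / phi / g embeddings.
        self.theta = nn.Conv3d(channels, reduced, 1)
        self.phi = nn.Conv3d(channels, reduced, 1)
        self.g = nn.Conv3d(channels, reduced, 1)
        self.out = nn.Conv3d(reduced, channels, 1)

    def forward(self, x):                              # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, THW, C')
        k = self.phi(x).flatten(2)                     # (B, C', THW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, THW, C')
        # Every space-time position attends to every other position.
        attn = F.softmax(q @ k, dim=-1)                # (B, THW, THW)
        y = (attn @ v).transpose(1, 2).reshape(B, -1, T, H, W)
        return x + self.out(y)                         # residual, as in the paper

# Example: plug it into a 3D-CNN feature map of a short clip.
feat = torch.randn(2, 64, 4, 14, 14)
print(NonLocalBlock(64)(feat).shape)                   # torch.Size([2, 64, 4, 14, 14])
```

Because it's residual, the block can be dropped into an existing video network without breaking its pretrained behavior.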
Watch this video starting at the 16-minute mark (SOTA, and trains much faster):
All in all, I believe changing from CNNs/RNNs to transformers will:
* Improve performance on simple tasks such as detecting vehicles
* Greatly improve performance on hard tasks such as predicting how a pedestrian will walk or what a police officer is doing
* Be able to train on the larger datasets generated by 4D labeling
* Require a huge dataset to be useful (which Tesla happens to have)
The good news is that development of this can happen in parallel with development of their current system, since both can use the same dataset. One day they just swap in the new neural network and get a bump in performance on tasks that benefit from more connections between parts of videos.
Bonus: Karpathy tweets about transformers:
https://twitter.com/karpathy/status/1312279279741276161?lang=en
https://twitter.com/karpathy/status/1284928564530278400?lang=en
https://twitter.com/karpathy/status/1305302243449516032?lang=en
“Transformers
Specifically, organizing information processing into multiplicative message passing in graphs; generalizing, simplifying, unifying, improving neural nets across domains. For a while there I was growing a bit jaded with slowing progress on neural net architectures
feels like a lot is kicked up in dust, and the closest we've come to a full refactor of your typical neural net. stop me if I'm being overly dramatic”
https://twitter.com/karpathy/status/1265742371649544192?lang=en
https://twitter.com/karpathy/status/1305312717364887552
https://twitter.com/karpathy/status/1304142058345512960
Source code by Karpathy: a minimal transformer (GPT) implementation that also includes an image demo:
karpathy/minGPT
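Note that minGPT is at heart a minimal language-model codebase (its image demo models pixels autoregressively rather than classifying them). For a standalone picture of what transformer-based image classification itself looks like, here's a tiny ViT-style sketch in plain PyTorch; this is my own toy code, not Karpathy's:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=32, patch=4, chans=3, dim=128, heads=4, depth=4, classes=10):
        super().__init__()
        n = (img // patch) ** 2
        # A strided conv is just the standard efficient way to compute the
        # per-patch linear projection (one p x p patch -> one token).
        self.embed = nn.Conv2d(chans, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # learnable positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                     # x: (B, C, H, W)
        B = x.shape[0]
        tok = self.embed(x).flatten(2).transpose(1, 2)        # (B, n, dim)
        tok = torch.cat([self.cls.expand(B, -1, -1), tok], 1) # prepend [CLS]
        out = self.encoder(tok + self.pos)                    # global attention
        return self.head(out[:, 0])                # classify from the [CLS] token

logits = TinyViT()(torch.randn(8, 3, 32, 32))      # e.g. CIFAR-sized input
print(logits.shape)                                # torch.Size([8, 10])
```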