A short post on the best architectures for real-time image and video processing. TL;DR: use convolutions with stride or pooling at the low levels, where the input is still a dense pixel grid, and stick self-attention circuits at the higher levels, where feature vectors represent objects.
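To make the TL;DR concrete, here is a rough sketch of that layout in plain NumPy: a few strided convolutions shrink the pixel grid, and a single self-attention layer then mixes the handful of feature vectors that remain. Everything here is an assumption for illustration only (the names `strided_conv2d`, `self_attention`, `hybrid_forward`, `init_params`, the layer widths, the 64x64 input, the random weights); it is not a trained model or anyone's production architecture.

```python
import numpy as np

def strided_conv2d(x, w, stride=2):
    """Valid 2D convolution with stride, followed by ReLU.
    x: (H, W, C_in), w: (k, k, C_in, C_out) -> (H_out, W_out, C_out)."""
    k = w.shape[0]
    h_out = (x.shape[0] - k) // stride + 1
    w_out = (x.shape[1] - k) // stride + 1
    out = np.zeros((h_out, w_out, w.shape[3]))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Contract the patch against the kernel over (k, k, C_in).
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over (N, D) tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def init_params(rng, c_in=3, widths=(8, 16, 32), k=3):
    """Random weights; hypothetical sizes chosen only for this sketch."""
    convs, prev = [], c_in
    for c_out in widths:
        convs.append(rng.standard_normal((k, k, prev, c_out)) * 0.1)
        prev = c_out
    wq, wk, wv = (rng.standard_normal((prev, prev)) * 0.1 for _ in range(3))
    return {"convs": convs, "wq": wq, "wk": wk, "wv": wv}

def hybrid_forward(image, params):
    """Low levels: strided convs. High level: attention over feature vectors."""
    x = image
    for w in params["convs"]:
        x = strided_conv2d(x, w, stride=2)
    tokens = x.reshape(-1, x.shape[-1])  # flatten the small grid into tokens
    return self_attention(tokens, params["wq"], params["wk"], params["wv"])

rng = np.random.default_rng(0)
params = init_params(rng)
image = rng.standard_normal((64, 64, 3))
out = hybrid_forward(image, params)
print(out.shape)  # (49, 32): a 7x7 grid of 32-dim feature vectors
```

The point of the ordering is cost: attention is quadratic in the number of tokens, so running it on 49 downsampled feature vectors is cheap, while running it directly on 64 x 64 = 4096 pixels would not be.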
PS: ready to bet that Tesla FSD uses convolutions (or...