So if we cautiously assume the current network handles data at 60 fps (2 cameras at 30 fps each), and we get 2 more NNs of similar complexity at another 60 fps each (2x repeater + 2x pillar), that's 180 fps in total, plus a hopefully simpler wide-camera NN. That just leaves the backup camera, which has a totally different picture pattern but, on the other hand, wouldn't need to be used all the time.
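To put rough numbers on that, here's a quick frame-budget sketch; the per-camera rate, the camera-to-NN pairing, and the GPU throughput figure are all my assumptions for illustration, not known values:

```python
# Rough frame-budget sketch. The 30 fps per camera, the camera-to-NN pairing,
# and the usable GPU throughput are all assumptions for illustration.
CAMERA_FPS = 30

nn_inputs = {
    "main + narrow (current NN)": 2 * CAMERA_FPS,  # 60 fps
    "repeater pair (2nd NN)":     2 * CAMERA_FPS,  # 60 fps
    "pillar pair (3rd NN)":       2 * CAMERA_FPS,  # 60 fps
}

total_fps = sum(nn_inputs.values())
print(f"total frames to process: {total_fps} fps")   # 180 fps

GPU_FLOPS = 10e12  # assumed usable throughput, purely illustrative
print(f"per-frame budget: {GPU_FLOPS / total_fps / 1e9:.0f} GFLOPs")
```

Whatever the real throughput number is, dividing it by ~180 fps gives the per-frame budget each of those networks has to live within.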
So it looks like there's a chance the overall performance would land about where it needs to be, unless they drastically redo their NNs to make them much heavier, I imagine.
I think the hardware isn't underpowered at all.
So one very important thing this analysis leaves out is that you're basing your capacity estimate on the current system, which, as previously discussed, is fed very low-resolution images. As @jimmy_d rightly pointed out, NNs often perform quite well on images that would look very low-res to a person, but I also need to point out that the current system (which emulates Mobileye) is clearly looking only at what's pretty much immediately in front of the car, by which I mean objects at rather short range. You can tell when driving this thing that it can't see clearly very far away. If you want long-distance vision (and believe me, for L3+ you need long-distance vision), you need higher resolution. This is what the long camera is for, after all. From what I can tell from the info in this thread, combined with my own experience with AP2, not only is it downsampling the resolution, it's also cropping the image to a narrow window directly in front of the car.
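A quick back-of-the-envelope illustration of why range demands resolution; the FOV, object size, and pixel widths below are assumptions picked for the example, not Tesla's actual camera specs:

```python
import math

# How many horizontal pixels a 1.8 m-wide car covers at various ranges, for an
# assumed 50-degree horizontal FOV and two input widths. Illustrative numbers
# only, not Tesla's actual camera specs.
HFOV_DEG = 50.0
CAR_WIDTH_M = 1.8

def pixels_on_target(distance_m, h_pixels):
    """Approximate horizontal pixels subtended by the car at this distance."""
    angle_deg = math.degrees(2 * math.atan((CAR_WIDTH_M / 2) / distance_m))
    return h_pixels * angle_deg / HFOV_DEG

for dist in (25, 50, 100, 150):
    full = pixels_on_target(dist, 1280)   # assumed full-width read-out
    small = pixels_on_target(dist, 416)   # assumed cropped/downsampled NN input
    print(f"{dist:>3} m: {full:5.1f} px at 1280 wide vs {small:4.1f} px at 416 wide")
```

At highway ranges the downsampled input leaves only a handful of pixels on a car, which is consistent with the system not being able to see clearly very far away.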
So fixing all of that, which is required for L3+ (or even just for bearable, smooth, reliable L2), is going to dramatically increase the required GPU capacity.
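The reason is that for a convolutional network the per-frame cost grows roughly in proportion to the number of input pixels, so undoing the crop and the downsample multiplies the compute bill. A rough sketch, with the baseline input size and per-frame cost as assumed values:

```python
# For a convolutional backbone, per-frame FLOPs scale roughly with the number
# of input pixels. The baseline resolution and cost below are assumed values
# for illustration, not measurements of Tesla's network.
BASELINE_W, BASELINE_H = 416, 416
BASELINE_GFLOPS = 30.0

def estimated_gflops(width, height):
    """Scale the assumed baseline cost by the increase in input pixels."""
    return BASELINE_GFLOPS * (width * height) / (BASELINE_W * BASELINE_H)

for w, h in [(416, 416), (1280, 416), (1280, 960)]:
    print(f"{w}x{h}: ~{estimated_gflops(w, h):.0f} GFLOPs per frame")
```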
I haven't seen a network that is known to be able to do FSD, so it's only speculation on my part, but I'd expect this chip to be able to do it if the FSD algorithm is decently mature.
Wow. Let me rephrase that for you: "Nobody has ever done anything like this before, despite trying for 30 years, and nobody has even developed algorithms capable of it. Notwithstanding this, I am confident that I know how much computing power is required, based on a cursory analysis of a primitive system that, as I'm about to point out, is not even of the correct architecture to do the job and looks like it was put together by an intern copying and pasting from Stack Overflow." You're clearly a smart guy who knows more than a little about CV and ML, but this is wild speculation.
I mean seriously, it's like an intern did the network architecture for this. That's a big part of why I think they must be working on something else, because this sure doesn't feel like the product of a world-class team with tons of resources.
On this I agree.
I was really not expecting to see something like this. Which is kind of what makes it interesting. I mean, I spend A LOT of time studying DL networks on the cutting edge of research, and I was excited to get this thing because, hey, what kind of amazing stuff must Tesla have under the hood? Elon co-founded OpenAI and those guys are friggin' rock stars. Karpathy came straight over from that group. I was expecting... not this.
They are slammed, under-resourced, and being asked to meet impossible deadlines for nearly-impossible tasks with inadequate supporting infrastructure. This is a complex system involving way more than just CV and ML algorithms: a real-life software/hardware system requires a huge amount of more boring software infrastructure that they are probably also way behind on. Just the infrastructure to handle the large amounts of training data they're supposedly collecting from the fleet requires a solid team of software engineers and several months. This isn't an academic exercise; this is the real world.
But there's this one thing I can say: the hard part of this network is a total cut-and-paste from the single most popular demo network out there today. Google includes it with their free TensorFlow framework when you download it. This is the demo network that launched a thousand "deep learning 101" class projects.
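Just to illustrate how off-the-shelf such a demo network is: if the backbone really is one of the stock pretrained classifiers shipped with the framework's model zoo (an Inception-style net is my guess here, not something confirmed above), grabbing it takes a few lines:

```python
# Hypothetical illustration: pulling a stock, pretrained Inception-style
# classifier straight out of the Keras model zoo. Whether this is the exact
# network in the AP2 firmware is my assumption, not something the post confirms.
import tensorflow as tf

backbone = tf.keras.applications.InceptionV3(
    weights="imagenet",         # freely downloadable demo weights
    include_top=False,          # keep the feature extractor, drop the classifier head
    input_shape=(416, 416, 3),  # arbitrary input size for the example
)
backbone.summary()
```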
Which makes it not the hard part, as others have already alluded to. Perception is just the beginning, and of the very hard problems in the autonomy space, it's the closest to being "easy" given ample modern hardware (and, alas, Tesla does not have ample hardware...).