For a given neural network, inference latency and memory usage are constant, i.e. its computational characteristics are invariant regardless of the inputs.
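A minimal sketch of that point, assuming PyTorch (the toy architecture and sizes are my own, purely for illustration): the parameter count, and with it the weight memory and per-pass compute, is fixed by the architecture, not by what you feed in.

```python
import torch
import torch.nn as nn

# Toy network; the specific layers and sizes are arbitrary.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 64 * 64, 10),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")  # fixed by the architecture alone

for _ in range(3):
    x = torch.randn(1, 3, 64, 64)  # different inputs each time...
    y = model(x)                   # ...identical compute and memory every pass
```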
Assuming we agree on the inference computation characteristics of neural networks, I interpret what you said to mean that a neural network capable of handling FSD is larger than what can fit inside AP3's memory. Why do you think so? And do you have a rough estimate of the minimum neural network size needed for FSD?
Yes, I agree with what you wrote.
Although I do work in deep learning, I generally work with physiological time signals, so the networks are much smaller than what I imagine Tesla would need, and I don't really have to worry about memory limits. So I don't have the direct experience to make a good estimate of what size network it would take, but I think we can look at some historical priors to get a sense...
The null hypothesis here should be that the available compute and memory are not sufficient for the load until proven otherwise. Companies such as Waymo (I've talked with their engineers) chose not to focus more heavily on camera-only vision precisely because of the vast computational load it would require, much, much more than LIDAR obviously. Of course, they made those decisions before the growth of deep learning.
In neuroscience, it is well known how amazing and intricate the human visual system is. Not really the eyes, but the brain. The amount of processing being done is insane, and HW3 is a joke compared to it. I forget the numbers, but it's orders of magnitude more.
But of course maybe advanced neural networks can be optimized and pruned to achieve the desired accuracy with much less compute than humans use. But at what point is that threshold reached?
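To make "pruned" concrete, here's a hedged illustration of one common optimization, using PyTorch's torch.nn.utils.prune to zero out the smallest 50% of weights in a layer. The layer size and pruning amount are arbitrary; whether this kind of compression preserves FSD-level accuracy is exactly the open question above.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)
# Magnitude pruning: zero the 50% of weights with the smallest |w|.
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")  # ~50% of the weights are now zero
```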
We know that on HW2.0 (I believe) Tesla wasn't even feeding the raw images into the neural network. They were downsampling the images (by a factor of at least 2, maybe 4) before feeding them in. I knew at the time it was a joke to think they were going to have FSD on that sort of hardware.
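The rough arithmetic shows why they'd do that (the 1280x960 resolution here is a stand-in, not a confirmed camera spec): downsampling by a factor of k cuts the pixel count, and with it the convolution FLOPs of each layer, by roughly k².

```python
w, h = 1280, 960  # stand-in resolution, not a confirmed camera spec
for k in (1, 2, 4):
    px = (w // k) * (h // k)
    print(f"downsample x{k}: {w//k}x{h//k} = {px:,} px (~{k*k}x less conv compute)")
```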
Well, now we know they at least don't have to do that. Read this post (literally the most "informative" one ever in the TMC Neural Networks thread).
But here it seems they are still feeding in only a few time snapshots at a time. I can tell you this will absolutely not do for sufficient FSD accuracy. I don't even know if humans could take just a few snapshots over 2 or 3 seconds and produce a robust enough object prediction.
No, to achieve good enough FSD, Tesla is going to have to feed in "video": information spanning 3-6 seconds. At what sampling rate and resolution I have no idea, but that is going to be needed to bring static object detection up to a sufficient level. I am confident of that.
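A back-of-the-envelope sketch of what that asks of memory, under assumed numbers (30 fps, 1280x960 RGB, fp16; none of these are confirmed Tesla specs):

```python
fps, seconds = 30, 6              # assumed sampling rate and time window
w, h, channels = 1280, 960, 3     # assumed resolution, RGB
bytes_per_value = 2               # fp16
frames = fps * seconds
clip_bytes = frames * w * h * channels * bytes_per_value
print(f"{frames} frames -> {clip_bytes / 1e9:.2f} GB per camera per clip")
# ~1.33 GB of raw input per camera, before any activations are even computed
```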
Now, the four-dimensional labeling is presumably operating along those lines. This is itself a deep learning task that must take some amount of video to reconstruct the 3D space ("VIDAR") before feeding that reconstruction into the other modules (perception, planning, etc.).
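A purely illustrative skeleton of that two-stage idea, in PyTorch: a video module produces a spatio-temporal representation, which downstream modules consume. Every module name, layer, and shape here is my invention, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class VideoTo3D(nn.Module):  # stand-in for the "VIDAR" reconstruction step
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2))

    def forward(self, clip):           # clip: (batch, 3, frames, H, W)
        return self.encoder(clip)      # learned spatio-temporal feature volume

class PerceptionHead(nn.Module):       # downstream module consuming the reconstruction
    def __init__(self):
        super().__init__()
        self.head = nn.Conv3d(32, 8, kernel_size=1)

    def forward(self, volume):
        return self.head(volume)

clip = torch.randn(1, 3, 16, 96, 128)      # tiny toy clip
features = VideoTo3D()(clip)
detections = PerceptionHead()(features)
print(detections.shape)                    # torch.Size([1, 8, 16, 96, 128])
```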
And I truly believe that Tesla's approach can work. The camera resolution is good enough, I think. The amount of data they can collect is assuredly good enough. But at what point can they make each component accurate enough? How many images will they need to feed in? How good are their computer vision skills? (They will likely need to make the video processing as efficient as possible before feeding it into the NNets.)
In some ways Mobileye has more domain knowledge, and if it came down to that alone I would trust them more to make their models the most memory-efficient. But Tesla's data advantage may lead them to the best approaches as well.
All of that to say, I am bullish on the approach in the long term. It's generalized and supported by a lot of data.
But there is NO basis to think HW 3.0 is enough to handle all of this, and Elon's words definitely aren't one. If Karpathy said it confidently, I would be more optimistic.