I don't know if Tesla will max out AP3 soon, since I have very little info on how much of AP3 Tesla is currently using. But "city NOA" will probably take up a significant portion of it. As you point out, there is a lot for the camera vision to process and render in vector space, so AP3 will definitely be busy.
I'm not an expert on this topic so I could be wrong, but here's my current understanding. I think neural networks can, in principle, be scaled up and down in size almost arbitrarily. So, Tesla may have a version of Hydranet (or whatever the proper name is) that is far more computationally intensive than what HW3 can run. Then they might “squeeze” that bigger network down until it's exactly HW3-sized. (That might be why DeepScale was acquired.) Or they might simply work on an HW3-sized network from the start.
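To make the "squeeze" idea a bit more concrete, here's roughly what knowledge distillation looks like. This is a generic PyTorch sketch with made-up layer sizes, not Tesla's or DeepScale's actual method: a big "teacher" network produces soft targets and a smaller, chip-sized "student" is trained to match them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical teacher/student: a large "offline" vision network and a smaller,
# HW3-sized one. Both map camera frames to per-pixel class logits.
teacher = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(256, 10, 1))          # big, too slow for the car
student = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 10, 1))           # small, fits the target chip

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
temperature = 4.0

def distill_step(frames):
    """One distillation step: the student learns to match the teacher's softened outputs."""
    with torch.no_grad():
        teacher_logits = teacher(frames)
    student_logits = student(frames)
    # KL divergence between softened distributions (the standard distillation loss).
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# frames = torch.randn(8, 3, 128, 256)  # a batch of camera crops
# distill_step(frames)
```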
As I understand it, there would be no point in using only half of HW3, because you could just double the size of the network and get better accuracy on its predictions.
IIRC, one limitation on scaling up neural network size is that you need to scale up your training datasets along with your network, or else your network will overfit to your specific datasets rather than generalize. But particularly with self-supervised learning (which will be accelerated by Dojo), it doesn't seem like that will be much of an issue.
Karpathy was talking about having a "black box" that takes in sensor input and spits out driving policy. That completely skips the vector space visible to humans, and may require HW4. The advantage is that you can train the NN on both sensor input and driver control input, which Tesla has loads of.
When did Karpathy talk about this? What you're describing is end-to-end imitation learning. Nvidia did a demo of this concept a few years back:
“We trained our network to steer the car by having it study human drivers. The network recorded what the driver saw using a camera on the car, and then paired the images with data about the driver’s steering decisions. We logged a lot of driving hours in different environments: on roads with and without lane markings; on country roads and highways; during different times of day with different lighting conditions; in a variety of weather conditions.
The trained network taught itself to drive BB8 without ever receiving a single hand-coded instruction. It learned by observing. And now that we’ve trained the network, it can provide real-time steering commands when it sees new environments. See it in action in the video below.”
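For anyone curious what that looks like in code, end-to-end imitation learning of this kind is essentially behavioral cloning: regress the human's steering angle straight from pixels. Here's a toy sketch of the idea; the architecture is a placeholder I made up, not Nvidia's actual PilotNet.

```python
import torch
import torch.nn as nn

# Tiny stand-in for a pixels-to-steering network (Nvidia's real model is a
# deeper CNN; this just shows the shape of the idea).
class EndToEndPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 48, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(48, 1)  # single output: steering angle

    def forward(self, frames):
        return self.head(self.backbone(frames))

policy = EndToEndPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(frames, human_steering):
    """Pair what the camera saw with what the human driver did, and regress."""
    pred = policy(frames)
    loss = loss_fn(pred, human_steering)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# frames = torch.randn(16, 3, 66, 200)    # logged camera images
# human_steering = torch.randn(16, 1)     # logged steering angles
# train_step(frames, human_steering)
```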
Elon briefly mentioned on Autonomy Day (I think in response to a question from Tasha Keeney at ARK Invest) that he expected the system would “eventually” move to pixels in, steering and acceleration out. But I think he meant in the long-term future, not anytime soon.
Karpathy and Elon have also both talked about self-supervised learning, which is a different concept from end-to-end learning but easy to confuse with it. I originally thought Elon's comments on Dojo were about end-to-end learning, but now (thanks to @jimmy_d) I'm pretty sure he was talking about self-supervised learning for computer vision (i.e. creating the vector space representations).
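To illustrate the difference: in self-supervised learning the labels come from the data itself, for example predicting the next video frame, so fleet video can be used without human annotation. Here's a generic sketch of one such pretext task; it's my own toy example, not Tesla's actual training objective.

```python
import torch
import torch.nn as nn

# Self-supervised pretext task: predict the next camera frame from the current
# one, so the "label" comes from the video itself rather than human annotators.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

def pretrain_step(frame_t, frame_t_plus_1):
    """Frames t and t+1 come straight from fleet video; no labeling needed."""
    pred_next = decoder(encoder(frame_t))
    loss = nn.functional.mse_loss(pred_next, frame_t_plus_1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After pretraining, the encoder's features can be fine-tuned on a much smaller
# amount of human-labeled data for the actual vision tasks.
```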
The current "conventional" method of having a NN generate vector space requires human coded driving policy, which can be quite the bottleneck.
Fortunately, this isn't true. You can do “mid-to-mid” imitation learning, in which the imitation network takes the vector space representations as its input rather than pixels. Its output is then a plan/path/trajectory/action. That gets sent to the control software (which is hand-coded) and turned into low-level steering and acceleration commands. In training, the human's plan/path/trajectory/action is paired with the vector space representations; that's the state-action pair (“state” as in world state or environment state), which serves as the input-output pair for deep supervised learning.
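In rough PyTorch terms, the mid-to-mid training loop might look something like this. The state layout, sizes, and trajectory format are invented purely for illustration:

```python
import torch
import torch.nn as nn

STATE_DIM = 128   # flattened vector-space state: nearby objects, lanes, ego speed, route
HORIZON = 10      # predict 10 future (x, y) waypoints

# Mid-to-mid planner: vector-space state in, planned trajectory out.
planner = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, HORIZON * 2),
)
optimizer = torch.optim.Adam(planner.parameters(), lr=1e-4)

def imitation_step(state, human_trajectory):
    """State-action pairs: perception's vector space paired with the path the human actually drove."""
    pred = planner(state).view(-1, HORIZON, 2)
    loss = nn.functional.mse_loss(pred, human_trajectory)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At runtime the predicted waypoints would go to hand-coded control software,
# which turns them into low-level steering and acceleration commands.
# state = torch.randn(32, STATE_DIM)
# human_trajectory = torch.randn(32, HORIZON, 2)
# imitation_step(state, human_trajectory)
```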
Waymo did mid-to-mid imitation learning with their ChauffeurNet research project:
Learning to Drive: Beyond Pure Imitation
I believe Tesla's approach to planning/driving policy is a combination of hand-coded elements and mid-to-mid imitation learned elements.