Firstly, Tesla isn't saying much about this - and they shouldn't; they are the industry leader and have plenty of proprietary secrets to protect.
But I got the impression that they have two main methods of data collection: Autopilot disengagement events, and filter conditions they send out to the fleet. So if, say, Andrej Karpathy is interested in disengagement events around "tunnels", they can pull events from just the ~100 cars a day that disengage near tunnels, instead of wading through millions of disengagement events per day.
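To make the idea concrete, here's a rough sketch of what such a fleet-side filter could look like. This is my own illustration, not anything Tesla has published: the event fields, the `tunnel_campaign` predicate, and the `on_disengagement` hook are all made up for the example.

```python
# A minimal sketch of the idea, not Tesla's actual system: a hypothetical
# filter condition ("campaign") pushed to each car and evaluated locally on
# every Autopilot disengagement, so only the matching handful of events get
# uploaded instead of every disengagement in the fleet.
from dataclasses import dataclass


@dataclass
class DisengagementEvent:
    speed_mph: float
    headlights_on: bool
    map_tags: set[str]          # e.g. {"tunnel", "bridge"} from the local map


def tunnel_campaign(event: DisengagementEvent) -> bool:
    """Fleet-side predicate: did this disengagement happen around a tunnel?"""
    return "tunnel" in event.map_tags


def on_disengagement(event: DisengagementEvent, campaigns) -> bool:
    # The car uploads the snapshot only if some active campaign wants it,
    # so millions of daily disengagements shrink to the ~100 interesting ones.
    return any(campaign(event) for campaign in campaigns)


# Example: this event matches the "tunnel" campaign and would be uploaded.
event = DisengagementEvent(speed_mph=55.0, headlights_on=True,
                           map_tags={"tunnel", "highway"})
print(on_disengagement(event, [tunnel_campaign]))   # True
```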
I believe Tesla's labeled image database must number in the millions - perhaps hundreds of millions of images - and I wouldn't be surprised if it were above a billion.
They are not doing stereoscopic vision the usual way; instead their networks recognize 3D perspective patterns in 2D frames and turn them into distance and perhaps object attitude (object movement vector) attributes. So their networks can, in essence, do things like this (there's a rough sketch of the idea after the examples below):
Note that these are actually processed 2D images of cars: the neural network recognized them as 3D cars, and the rotation in the output is produced by the network itself. I.e. after training you can feed in a rotation of "144°" as an input parameter, and you'll get back a rotated car generated by the neural network.
So the GIFs above are not real photos of cars; they are the output of the trained neural network given a camera/view position parameter.
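This isn't Tesla's actual architecture and I'm only guessing at the setup, but here's roughly what a network like that could look like: an encoder that turns a 2D crop of a car into a latent description, small heads that read out 3D attributes (distance, heading) from that latent, and a decoder conditioned on a requested viewing angle that re-renders the car from that angle. All the names, layer sizes, and the (sin, cos) angle encoding are my own illustration.

```python
# A minimal sketch (PyTorch), not Tesla's actual model: a small autoencoder
# whose decoder is conditioned on a target viewing angle, so that after
# training you can feed an image of a car plus e.g. 144 degrees and get back
# a re-rendered, rotated view - the kind of output the GIFs above show.
import torch
import torch.nn as nn


class ViewConditionedAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Encoder: 2D frame -> latent description of the object.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )
        # Heads that read 3D attributes straight from the latent:
        # distance to the object and its heading (attitude) in the frame.
        self.distance_head = nn.Linear(latent_dim, 1)
        self.heading_head = nn.Linear(latent_dim, 2)   # (sin, cos) of yaw
        # Decoder: latent + requested viewing angle -> re-rendered image.
        self.decoder_fc = nn.Linear(latent_dim + 2, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image: torch.Tensor, view_angle_deg: torch.Tensor):
        z = self.encoder(image)
        distance = self.distance_head(z)
        heading = self.heading_head(z)
        # Encode the requested angle as (sin, cos) so 0° and 360° coincide.
        rad = torch.deg2rad(view_angle_deg).unsqueeze(1)
        angle_feat = torch.cat([torch.sin(rad), torch.cos(rad)], dim=1)
        h = self.decoder_fc(torch.cat([z, angle_feat], dim=1))
        rendered = self.decoder(h.view(-1, 128, 8, 8))
        return rendered, distance, heading


# Usage: ask the (hypothetically trained) network for a 144° view of the car.
model = ViewConditionedAutoencoder()
frame = torch.rand(1, 3, 64, 64)                  # one 64x64 RGB crop of a car
rotated, dist, heading = model(frame, torch.tensor([144.0]))
print(rotated.shape, dist.shape, heading.shape)   # (1,3,64,64) (1,1) (1,2)
```

The point of the sketch is just the interface: a single 2D frame goes in, and both the 3D attributes (distance, heading) and a view-angle-conditioned rendering come out of the same latent, without any stereo pair.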
See this:
This is pretty close to how the visual cortex in the human brain works, I believe. It's how we can see in 3D just fine even with a single eye.