How does "fleet learning" work?

Apologies if this is in the wrong section....

I have a few questions about how this fleet learning works, at a technical level, and I'm wondering if somebody out there might be able to shed some light?

I get how these deep neural networks can be trained to drive by following a human driver. But to do that requires a LOT of information (and bandwidth) and a lot of processing power. I doubt that the cars have the processing power to do any meaningful training in-car, and so that implies that they must send information back to the mother ship to help refine the training.

So...

1. What do the cars send back? Clearly the cars can't all be sending continuous raw video and sensor data back. The bandwidth demands would be absurd.

2. Is the data limited simply to the location and correct response to sensed/known obstacles and road features? That would seem to be useful but... more minor. It would seem to not be as helpful when the car is learning more complex behaviors: navigating intersections, construction areas and parking lots; learning how to deal with on-road obstacles like birds, junk dropped off of trucks, and so on.

3. Are the cars smart enough to send back detailed video on exceptional circumstances, like dropped objects and so-on?

Just curious.
 
The car, with its 8 cameras, can create a holistic 3D view of its world in real time, with annotations.

If an interesting situation were to take place, or an interesting intersection or stretch of road were marked on the map for recording, all that needs to happen is for the car to record the last 30 seconds of its encounter and send it over to Tesla HQ.

But wait, you don't have to send raw video over. You people keep forgetting that the Nvidia PX2 runs the DNN in real time and already processes the raw video data.

Tesla would only need the metadata. What do I mean by metadata? Everything already tagged and processed by the DNN: a numerical and temporal representation of the space around the car, the coordinates of the objects around it (other cars, obstacles, pedestrians, traffic signs, lanes, road edges, etc.) and their velocities, and more.

All of that data could come to less than 1 MB and can then be loaded into Tesla's simulator.
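To make that concrete, here is a rough sketch of what such a per-frame metadata record might look like once the DNN has done the heavy lifting. The field names and structure are purely hypothetical (Tesla hasn't published its format), but it shows why a 30-second clip of detections compresses to far less than raw video:

```python
# Hypothetical sketch of per-frame metadata a car might upload instead of raw video.
# Field names and structure are illustrative only; Tesla has not published its format.
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class DetectedObject:
    kind: str            # "car", "pedestrian", "traffic_sign", "lane_edge", ...
    x: float             # position relative to the ego car, metres
    y: float
    vx: float            # estimated velocity, m/s
    vy: float

@dataclass
class FrameMetadata:
    timestamp: float     # seconds since start of clip
    ego_speed: float     # m/s
    steering_angle: float
    objects: List[DetectedObject]

# One frame with a handful of detections...
frame = FrameMetadata(
    timestamp=12.4,
    ego_speed=17.8,
    steering_angle=-2.1,
    objects=[DetectedObject("car", 23.0, -1.5, -3.2, 0.0),
             DetectedObject("pedestrian", 8.5, 4.0, 0.0, 1.1)],
)

encoded = json.dumps(asdict(frame)).encode()
# ...is only a few hundred bytes, so even 30 s of frames at 10 Hz stays far below 1 MB.
print(len(encoded), "bytes per frame")
```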

They will have metadata of the exact position of every traffic light in every city. They will have metadata of the positions of traffic signs, stop signs, speed limits, lane markings, etc.

They will have data on every parking lot and spot a Tesla has ever parked in during manual mode.

So if a Tesla drives to and parks at a McDonald's or any other business in manual mode, the car will save data on the parking structure, its spots, and exactly how to navigate in and out of it, and beam it up to HQ.
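As an aside, aggregating those reports into a shared map is conceptually simple. Here's a toy sketch (not anything Tesla has described) of how noisy stop-sign detections from many cars could be merged into single landmark positions by grouping nearby reports and averaging them:

```python
# Toy illustration of fleet map-building: merge noisy stop-sign detections
# reported by many cars into single landmark positions by simple grid clustering.
# Illustrative only; any real pipeline would be far more sophisticated.
from collections import defaultdict

def merge_detections(reports, cell_size=0.0001):
    """reports: list of (lat, lon) detections from many cars.
    Groups detections that fall in the same small grid cell and averages them."""
    cells = defaultdict(list)
    for lat, lon in reports:
        key = (round(lat / cell_size), round(lon / cell_size))
        cells[key].append((lat, lon))
    landmarks = []
    for points in cells.values():
        avg_lat = sum(p[0] for p in points) / len(points)
        avg_lon = sum(p[1] for p in points) / len(points)
        landmarks.append((avg_lat, avg_lon, len(points)))  # position + report count
    return landmarks

# Three cars report roughly the same stop sign; a fourth reports a different one.
reports = [(42.33101, -83.04600), (42.33102, -83.04601),
           (42.33103, -83.04599), (42.34000, -83.05000)]
print(merge_detections(reports))   # two landmarks, one backed by three reports
```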

If the car were to encounter a place it fails at during shadow mode, it simply does what the Xbox does: records the last 30 seconds and beams it up to HQ.
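That "last 30 seconds" trick is easy to picture as a rolling buffer that is only flushed when some trigger fires. A minimal sketch, assuming per-frame metadata at 10 Hz and a hypothetical shadow-mode trigger:

```python
# Sketch of the "always keep the last 30 seconds" idea: a rolling buffer of
# per-frame metadata that is only uploaded when a trigger fires
# (e.g. a shadow-mode disagreement). Hypothetical; not Tesla's actual code.
from collections import deque

FRAME_RATE_HZ = 10
BUFFER_SECONDS = 30

class RollingRecorder:
    def __init__(self):
        # Old frames fall off the left automatically once the buffer is full.
        self.frames = deque(maxlen=FRAME_RATE_HZ * BUFFER_SECONDS)

    def on_frame(self, frame_metadata):
        self.frames.append(frame_metadata)

    def on_trigger(self, reason):
        """Called when something interesting happens; returns the clip to upload."""
        clip = {"reason": reason, "frames": list(self.frames)}
        self.frames.clear()
        return clip

recorder = RollingRecorder()
for t in range(1000):                      # normal driving: the buffer just rolls over
    recorder.on_frame({"t": t})
clip = recorder.on_trigger("shadow_mode_disagreement")
print(len(clip["frames"]), "frames queued for upload")   # 300 = 30 s at 10 Hz
```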
 
Thanks for the response, Bladerskb.

I have a couple of follow-on questions. But a quick disclaimer... my understanding of DNNs is fairly basic.

So... I was of the understanding that a key difference between the EyeQ system and the Tesla system is that the system employed more unsupervised learning.. that it didn't tag objects in the environment. If tha

Nobody actually knows for sure. Mobileye had "curated" vision for tagging objects - the learning was done at Mobileye and downloaded to Teslas. Tesla however did its own sensor fusion and decision making.

The *new* Tesla system runs on NVIDIA's Drive PX2 - and yes, NVIDIA claims it is capable of doing unsupervised learning - and NVIDIA gives OEMs a software stack for learning. In the case of Tesla, however, Tesla claims they wrote all their own software and that it can run on anyone's hardware. Whether or not the object recognition in Tesla's new software is all unsupervised learning, nobody at Tesla has publicly said, as far as I know.
 
Sorry... accidentally cut-off my post. But you answered my question anyway!

It would be fascinating to understand how Tesla incorporates this, presumably huge, stream of data into their training process. If you imagine that each jurisdiction has unique standards for traffic signs, road markings, quasi-traffic signs (handicapped parking, for example), etc., I struggle to believe that it would be possible for the system to learn all of these variations without some sort of directed training process.

And is the implication, then, that the system would be structured in two levels? The lower-level DNN is responsible for recognizing and tagging the objects in the environment using video and sensors? And then a second-level DNN is fed the metadata "vector" and the high-level navigation intent vector, and then makes driving output command decisions?

Or is that too simplistic?
 

Yes, it would be fascinating, but the trouble is that most of this cutting-edge research is not happening at universities that publish papers, but at auto companies with a strong incentive to keep their secret sauce secret.

Your two level model sounds plausible to me but I'm no computer scientist.
 
And further, I struggle to believe that they wouldn't have some sort of procedurally programmed input for traffic rules. Such as... "no right turn on red in this state". Or "don't park within X meters of a stop light". That's not the sort of thing that you'd want to learn the hard way.

I suppose they could include a penalty for traffic tickets in the performance scoring function. :)
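Purely as an illustration of that idea (every name and number here is made up), hard-coded rules plus a ticket penalty in the scoring function might look something like this:

```python
# Hypothetical sketch of hard-coded traffic rules layered on top of a learned
# planner, plus a "ticket" penalty term in the scoring function. Illustrative only.

NO_RIGHT_ON_RED_STATES = {"NY"}          # e.g. a New York City-style rule

def violates_rules(state, light_color, proposed_action):
    if proposed_action == "right_turn" and light_color == "red":
        return state in NO_RIGHT_ON_RED_STATES
    return False

def score(progress, comfort, ticket_risk, ticket_weight=100.0):
    # A plan that would likely earn a ticket is heavily penalized rather than
    # being learned "the hard way" on the road.
    return progress + comfort - ticket_weight * ticket_risk

print(violates_rules("NY", "red", "right_turn"))    # True: veto the manoeuvre outright
print(score(progress=10.0, comfort=2.0, ticket_risk=0.3))
```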
 
Nvidia has a paper that briefly describes the data training process. It seems far from unsupervised:

The first step to training a neural network is selecting the frames to use. Our collected data is labeled with road type, weather condition, and the driver’s activity (staying in a lane, switching lanes, turning, and so forth). To train a CNN to do lane following we only select data where the driver was staying in a lane and discard the rest.

To remove a bias towards driving straight the training data includes a higher proportion of frames that represent road curves.

After selecting the final set of frames we augment the data by adding artificial shifts and rotations to teach the network how to recover from a poor position or orientation.

http://images.nvidia.com/content/te...2016/solutions/pdf/end-to-end-dl-using-px.pdf
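For the curious, the selection and augmentation steps the paper describes are easy to sketch. The labels, thresholds, and steering-correction gain below are invented for illustration; only the overall shape follows the paper:

```python
# Rough sketch of the frame-selection and augmentation steps described in the
# NVIDIA paper. Labels and thresholds here are made up for illustration.
import random

def select_frames(frames):
    """Keep only lane-keeping frames; oversample curves to fight the straight-road bias."""
    selected = []
    for f in frames:
        if f["activity"] != "staying_in_lane":
            continue                              # discard lane changes, turns, etc.
        selected.append(f)
        if abs(f["steering_angle"]) > 5.0:        # treat as a curve: include it twice
            selected.append(f)
    return selected

def augment(frame):
    """Add a small artificial lateral shift and adjust the steering label so the
    network learns to recover from poor positions (gain of 2.0 is arbitrary)."""
    shift = random.uniform(-1.0, 1.0)             # metres of lateral offset
    corrected = dict(frame)
    corrected["steering_angle"] = frame["steering_angle"] + 2.0 * shift
    corrected["lateral_shift"] = shift
    return corrected

frames = [{"activity": "staying_in_lane", "steering_angle": 12.0},
          {"activity": "turning", "steering_angle": 90.0}]
print(len(select_frames(frames)))   # 2: the curve frame is duplicated, the turn dropped
```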
 
And is the implication, then, that the system would be structured in two levels? The lower-level DNN is responsible for recognizing and tagging the objects in the environment using video and sensors? And then a second-level DNN is fed the metadata "vector" and the high-level navigation intent vector, and then makes driving output command decisions?
That seems to be the "traditional" approach and is certainly what would make sense to me. The idea is that object detection would need to take place at a relatively low level, to "boil down" all of the raw sensor data, essentially gobs of pixels, to higher level constructs. This would include geo-rectifying the image/video data.
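Structurally, that two-level split could be as simple as this (purely hypothetical; nobody outside Tesla knows how their stack is actually divided):

```python
# Structural sketch of the "two level" idea: a perception network boils raw pixels
# down to a compact metadata vector, and a planning network maps that vector plus
# the navigation intent to driving commands. Hypothetical structure only.
def drive_one_step(camera_frames, nav_intent, perception_net, planning_net):
    metadata = perception_net(camera_frames)       # level 1: tag objects, positions, velocities
    command = planning_net(metadata + nav_intent)  # level 2: steering / throttle / brake
    return command
```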

Nvidia has a paper that briefly describes the data training process. It seems far from unsupervised:
http://images.nvidia.com/content/te...2016/solutions/pdf/end-to-end-dl-using-px.pdf
Wow, in that paper, a CNN (convolutional neural network) is used to "map raw pixels from a single front-facing camera directly to steering commands". On the face of it, this seems quite simple, yet could be feasible only because of recent improvements in GPU-based processing. Tesla is using eight cameras pointing in all directions, so Tesla's approach would obviously be more complex.
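For anyone who wants to picture it, here's a rough PyTorch sketch of that kind of end-to-end network: raw pixels in, one steering value out. The layer sizes are illustrative, not copied from the paper:

```python
# Rough PyTorch sketch of an end-to-end steering CNN of the kind the NVIDIA
# paper describes: raw pixels in, a single steering value out. Sizes are illustrative.
import torch
import torch.nn as nn

class SteeringCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),                 # steering command
        )

    def forward(self, x):
        # x: a batch of normalised camera frames, shape (N, 3, 66, 200)
        return self.head(self.features(x))

net = SteeringCNN()
frame = torch.rand(1, 3, 66, 200)             # one dummy front-camera frame
print(net(frame).shape)                       # torch.Size([1, 1])
```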

Personally, I find the whole idea of CNNs quite interesting, as this could reduce the need for explicit object detection. As that paper points out, a good deal of experimentation would be needed to choose optimal kernels. And the initial normalization phase would be pretty important to nail down, as one would not want to have to re-train the entire system every time the camera hardware changes slightly.

That said, I agree with sandpiper that some rule-based decision making would be needed for traffic laws and the like. To support that, objects like speed limit signs, lane markings, and traffic lights would need to be recognized explicitly. I wouldn't be surprised if Tesla utilizes multiple, separate neural networks (including CNNs) in parallel, with perhaps a rule-based system to integrate everything at the top level.
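To illustrate that last idea (all names and thresholds hypothetical), the rule-based layer could sit on top of whatever the networks propose and simply veto or clamp it:

```python
# Sketch of the "multiple networks in parallel, rule-based integration on top"
# idea from the paragraph above. All names and thresholds are hypothetical.
def integrate(perception_outputs, learned_command, speed_limit_mps):
    """perception_outputs: merged dict from separate nets (signs, lanes, objects...);
    learned_command: the planner's proposal, e.g. {"steering": ..., "target_speed": ...}"""
    command = dict(learned_command)
    # Rules sit above the networks and can veto or clamp their output.
    if perception_outputs.get("red_light_ahead"):
        command["target_speed"] = 0.0
    command["target_speed"] = min(command["target_speed"], speed_limit_mps)
    return command

proposal = {"steering": 0.02, "target_speed": 20.0}
print(integrate({"red_light_ahead": False}, proposal, speed_limit_mps=15.6))
# Target speed is clamped to 15.6 m/s (~35 mph) even though the net proposed more.
```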