@jimmy_d feel free to enlighten us
I can't compete with the internet in terms of educational material. There's so much great stuff out there explaining deep learning that it's just crazy. Of course if there's any particular thing you would like pointers on I'll try to help.
I haven't been posting much stuff on the AP2 NN lately because real facts are hard to come by and I'm not sure how much people want to listen to me speculate. Certainly there seems to be a cohort which is allergic to any speculation that doesn't confirm their biases, and the blowback gets old. Our mutual friend gave me access to copies of recent and some older data on the NNs and I took them apart to see what I could learn from them. I also tried building and analyzing various parts of the network description on the frameworks they seem to have been created with, which yielded some ideas but mostly it just killed off some theories that I had.
But here's what I've got as of today:
Facts:
1) There's a vision NN running in AP2 that takes images from the main and the narrow cameras and processes them to extract feature maps. There's only one network, but it runs as two instances in two threads, each processing one camera independently.
2) The front half of the NN in AP2 is basically Googlenet with a few notable differences:
- The input is 416x640 (original Googlenet was 224x224)
- The working frame size in Googlenet is reduced by 1/2 in each dimension between each of the 5 major blocks. The AP network omits the reduction between blocks 4 and 5, so the final set of features is 2x larger in each dimension (4x the area) than in Googlenet. (See the shape sketch after this list.)
3) The final set of output features from the googlenet "preprocessor" is digested in several different ways to generate output. All of these outputs are floating point tensor fields with 3 dimensions, and they all have frame sizes that either match the input frame size of 416x640 or a reduced version of it at 104x160 (1/4 of the input in each dimension).
4) All of these output tensors are constructed by deconvolution from the output of the googlenet preprocessor.
5) There are no scalar or vector outputs from the NN, so it is not end-to-end in the usual sense. Its output must be interpreted by some downstream process to make driving decisions.
6) Between versions 40 and 42, new output categories were added to the NN.
7) Sometime in the last four months major changes were made to the deconvolution portions of the NN, including the removal of network sections that are normally associated with improving the accuracy of segmentation maps. This happened after version 26 but before version 40, an interval in which users reported substantial improvements in operation at highway speed.
8) Between versions 40 and 42, two additional NNs were added to the code. Neither has been seen in operation.
9) One of the new NNs is named fisheye_wiper. The network itself is a substantially simplified googlenet implementation. Its output is a five-way classifier, i.e. it reports a single one of 5 detected categories. This is exactly the output you would expect from a rain intensity detector intended to control the wiper blades. (A sketch of such a head follows this list.)
10) The other new NN is named repeater. This network includes a googlenet preprocessor almost identical to the one used for the main and narrow cameras. Its outputs are similar to a subset of the main/narrow outputs.
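To make the frame size arithmetic in facts 2-4 concrete, here's a minimal sketch in PyTorch. Only the 416x640 input, the skipped block 4-to-5 reduction, and the 104x160 reduced output frame come from the network dumps; the channel counts and the single stride-4 deconvolution are placeholders of mine, not anything recovered from the AP2 network.

```python
import torch
import torch.nn as nn

# Stock googlenet halves the working frame 5 times, so 416x640 would end
# up at 416/32 x 640/32 = 13x20. The AP2 variant skips the reduction
# between blocks 4 and 5, leaving 4 halvings instead:
H, W = 416 // 2**4, 640 // 2**4
print(H, W)  # 26 40 -- 2x larger in each dimension than 13x20

# Hypothetical preprocessor output; 1024 channels is a guess of mine.
feat = torch.randn(1, 1024, H, W)

# A single stride-4 deconvolution (a stand-in for whatever stack the real
# network uses) brings 26x40 back up to the reduced output frame size:
deconv = nn.ConvTranspose2d(in_channels=1024, out_channels=16,
                            kernel_size=4, stride=4)
print(deconv(feat).shape)  # torch.Size([1, 16, 104, 160])
```

Getting from 26x40 all the way back to the full 416x640 frame would take 16x upsampling by the same mechanism.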
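And for fact 9: I obviously don't know the real fisheye_wiper layer sizes, but a five-category rain intensity head on top of a slimmed-down feature extractor would look roughly like the sketch below. Every dimension here except the 5-way output is a guess of mine.

```python
import torch
import torch.nn as nn

# Hypothetical fisheye_wiper-style head: global-average-pool the backbone
# features, then one linear layer to 5 rain intensity categories.
class WiperHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feats):
        x = self.pool(feats).flatten(1)        # (N, in_channels)
        return self.fc(x).softmax(dim=1)       # one of 5 categories

head = WiperHead()
feats = torch.randn(1, 256, 13, 20)            # placeholder backbone output
print(head(feats).argmax(dim=1))               # predicted wiper category
```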
Stuff that's probably true:
1) Forward vision is probably binocular, using cropped, scaled, and undistorted segments of main and narrow that are trimmed to provide identical FOV. This allows the car to determine depth from vision independently of whatever depth information it gets from other sensors (radar). The depth extraction is not being done in the NN, but there are function names in the code that refer to stereo functions and frame undistortion. Depth extraction can be done efficiently using conventional vision processing (that is, non-NN) techniques, so that's probably why it's not included in the NN. I haven't seen or heard about any AP2 behavior that can only be explained by the existence of binocular vision, so it's possible this is a shadow feature. No real idea. (A disparity sketch follows this list.)
2) Output from the forward vision NN probably includes at least one full resolution semantic segmentation map with bounding boxes for that map. It might include several maps, including some at lower resolution. Output from the repeater NN is similar, though it eliminates some classes that are present in the main/narrow NN. It is extremely likely that both the main/narrow and repeater networks are detecting objects in multiple classes and generating bounding boxes for those objects. (A decoding sketch follows this list.)
3) The NN is probably getting a lot of development. Not only does it change with every single firmware revision, some of the changes are drastic. Some look like experiments, others look like diagnostic features. Some look like transient workarounds that later disappear once the underlying problem is fixed (eliminating the need for the hack).
4) The network is probably being trained in part on simulated driving data. Generating labeled training data for semantic segmentation outputs of the kind the vision NN seems to be producing is extremely labor intensive if done manually, and simulation is a common approach to augmenting training data. My review of Tesla's open positions in their Autopilot division found that they were recruiting simulation experts and simulation artists with a bias towards driving environments and the Unreal 4 engine. I found no positions of the type I would expect to see if they were manually labeling large volumes of real world data. Of course, labeling is the kind of thing you might outsource, so it's not a very strong datapoint except that it fails to show they are manually labeling lots of data.
5) Tesla is probably developing fairly advanced customizations to the tools used for testing, training, developing, and deploying neural networks. I found evidence for custom libraries and custom tools all over the place as I was trying to track down clues to how they were doing their development. There isn't anything I could find in their system that you could just drop into publicly available tools and make sense of, but I find references to parts of publicly available tools and libraries all over the place. It seems like they are pulling from a lot of different sources but then building their own tools.
6) After analyzing the options, I think the calibration phase for the cameras exists to match the main and narrow cameras to a high enough accuracy to enable stereo vision. When I look at manufacturing variances for cameras and compare them to the operational variance that the system has to deal with in use, the only thing I can come up with that can't be compensated on the fly is support for stereo vision. In order to support rectified, aligned-image stereo processing you need to pre-calculate the alignment transformations for the two cameras to sub-pixel accuracy. I don't think this can be factory calibrated because the calibration probably wouldn't survive transport of the vehicle to its delivery destination. That's not true of any other vision process I can come up with, and it's certainly not true of NN vision processes that are appropriate to vehicle applications. Calibration probably has to be redone frequently by the car because even normal operation will lead to the alignment drifting enough to become a problem. And it probably needs to go through the cal process any time maintenance is performed that requires manipulation of the forward camera assembly. If this is true then AP2 must have been using stereo since its first incarnation, as the calibration period has always been present, and it must be using it for driving decisions since you can't use the AP features until the cameras have been calibrated.
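If item 6 is right, the cal data exists to feed a conventional rectify-then-match pipeline, which also fits item 1's point that depth extraction isn't in the NN. The sketch below is just the standard OpenCV approach under my assumptions - the file names, focal length, and baseline are placeholders, not Tesla values, and it presumes the frames have already been undistorted and rectified using the per-car calibration.

```python
import cv2
import numpy as np

# Assumes the cropped main/narrow frames were already undistorted and
# row-aligned (rectified) to sub-pixel accuracy using the cal data.
left = cv2.imread("main_rect.png", cv2.IMREAD_GRAYSCALE)     # placeholder
right = cv2.imread("narrow_rect.png", cv2.IMREAD_GRAYSCALE)  # placeholder

# Conventional (non-NN) disparity via semi-global block matching.
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)
disp = sgbm.compute(left, right).astype(np.float32) / 16.0  # fixed point -> px

# Depth from disparity: Z = f * B / d. The focal length (pixels) and
# baseline (meters) below are made-up values for illustration.
f_px, baseline_m = 1000.0, 0.1
with np.errstate(divide="ignore"):
    depth_m = f_px * baseline_m / disp
```

The sub-pixel accuracy requirement falls straight out of that last line: a fraction of a pixel of vertical misalignment degrades the matching, and a fraction of a pixel of disparity error translates directly into depth error at range.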
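On item 2, and tying back to fact 5: if the output really is a per-pixel class map, the downstream interpretation can be as simple as running connected components per class and taking their extents as boxes. A toy sketch with a fabricated class map at the reduced 104x160 frame size:

```python
import numpy as np
from scipy import ndimage

# Fabricated per-pixel class map; 0 = background, nonzero = object class.
seg = np.zeros((104, 160), dtype=np.uint8)
seg[40:60, 30:70] = 2                  # pretend this blob is a "vehicle"

# Downstream (non-NN) step: connected components -> bounding boxes.
labeled, n = ndimage.label(seg == 2)
for ys, xs in ndimage.find_objects(labeled):
    print("box:", (xs.start, ys.start, xs.stop, ys.stop))  # x0, y0, x1, y1
```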
Pure speculation:
After my first look at the version 40 NN I was surprised at how conceptually simple it was and how 'old' the circa-2015 architectural concepts were, and I speculated that perhaps this version of EAP was not getting much effort. (In the deep learning world 2 years is an eternity.) I thought this might make sense if the company were pushing to release a separate and much more sophisticated package, and needed the current NN to be just a placeholder substitute for the missing AP1 vision chip while they readied a much more ambitious system, which might be FSD. After reviewing older versions of the network I found that the output varied from version to version, so this system can't be merely a substitute for a missing module - if that were the case the I/O wouldn't be changing. Also, stereo vision is clearly a feature that wasn't possible in AP1. And now we see NNs being added for cameras not present in the AP1 hardware - the repeaters. Finally, I am finding small features being included in the NN that did not appear in the public literature until fairly recently, implying that the EAP team is actively trying out cutting edge ideas in limited domains. So at this point I think EAP is getting the kind of development that suggests it's not a placeholder.
I recall that at some point one of the differentiating features between EAP and FSD was the number of cameras, with EAP having 4 in use and FSD the full suite of 8. With the addition of the repeaters to main and narrow we would see 4 driving cameras in use, assuming the fisheye is just for the wipers in the EAP use case. That would be an interesting match-up, and it might indicate that these four are the candidates for on-ramp to off-ramp level functionality in EAP.