For the first several hours after I read that AlphaGo Zero paper I felt a kind of visceral dread. I think it was the first time I've felt that maybe AI is moving too fast. That's not the attitude I normally bring to this topic, but the paper was a genuine shock: it was years too soon. Everything about AlphaGo has been 'years too soon', but this was really, really crazy too soon.
When we invented nuclear weapons one of the saving graces was that making one was not something that was doable by some pissed off asshole with a PhD, a basement, and a grudge. What would the world be like if you could make a nuke from stuff that anybody could get their hands on? Think about that for a minute.
I still believe that AI is unlikely to become the kind of technology that is the "superempowerment of the angry young man". But stuff like that AlphaGo Zero paper makes me wonder if some critical breakthrough might actually be close at hand, one that makes these systems suddenly a million times more effective, and that such a thing would turn the world into something none of us recognize.
Returning to the topic of neural networks:
It's been a month since AlphaGo Zero. Time for DeepMind to revolutionize AI again:
Entire human chess knowledge learned and surpassed by DeepMind's AlphaZero in four hours
Google's AlphaZero Destroys Stockfish In 100-Game Match - Chess.com
And if you want to read the paper it's here: https://arxiv.org/pdf/1712.01815.pdf
I read it and it blew my mind. Again. AlphaZero, which is not even really a Go program, beat AlphaGo Zero with half the training (under 36 hours). And AlphaGo Zero had beaten AlphaGo Master with 20x less training and no human assistance. AlphaGo Master was the program that won 60 straight games against the world's top Go players after training for about a month. And AlphaZero, this new program, is a general-purpose algorithm that can learn any board game without tuning, examples, or any kind of human intervention. In quick succession it bested the three hardest games in the world (chess, shogi, and Go), each time taking just a few hours to learn the game starting with nothing but the rules.
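The paper doesn't ship code, but the core idea (play against yourself starting from nothing but the rules, then nudge your value estimates toward the actual outcomes of those games) can be sketched in miniature. Here is a toy tabular version for tic-tac-toe; the real AlphaZero uses a deep network plus Monte Carlo tree search, so everything below, names and numbers included, is my own simplified stand-in:

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def legal_moves(board):
    return [i for i, c in enumerate(board) if c == " "]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_train(episodes=5000, alpha=0.3, eps=0.2, seed=0):
    """Learn a value table for tic-tac-toe purely from self-play.

    V maps a board string to an estimated outcome from X's point of
    view: +1 means X wins, -1 means O wins, 0 means draw.
    """
    rng = random.Random(seed)
    V = {}
    for _ in range(episodes):
        board = [" "] * 9
        history = []
        player = "X"
        while True:
            moves = legal_moves(board)
            if rng.random() < eps:  # explore: random move
                m = rng.choice(moves)
            else:                   # exploit: greedy w.r.t. current V
                def after_value(mv):
                    board[mv] = player
                    v = V.get("".join(board), 0.0)
                    board[mv] = " "
                    return v if player == "X" else -v
                m = max(moves, key=after_value)
            board[m] = player
            history.append("".join(board))
            w = winner(board)
            if w or not legal_moves(board):
                z = {"X": 1.0, "O": -1.0}.get(w, 0.0)
                # Nudge every visited state toward the final outcome.
                for s in history:
                    V[s] = V.get(s, 0.0) + alpha * (z - V.get(s, 0.0))
                break
            player = "O" if player == "X" else "X"
    return V
```

The point isn't strength; it's that the only inputs are the rules of the game and the outcomes of its own games, which is the same training signal AlphaZero gets.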
Spacing of DeepMind's world-shaking papers: 12 months, then 9 months, then 6 months, then 3 months, and now this one comes after only one month. Can't wait to see what January brings. Or will it be just 2 weeks this time?
One of the frequently made observations about the limits of AI is that it's narrow. Sebastian Thrun was fond of saying about AlphaGo that, as amazing an achievement as it was, it was so narrow that it still couldn't play chess. Well, now it can. It can play chess, and if you give it a couple of hours it will master any other board game to a superhuman level.
I used to think (in ancient times, about 3 months ago) that RL (reinforcement learning) was a silly thing to use for training a self-driving car. It would take way too long, you wouldn't get a good result, and nobody would understand what it was doing. If things keep going at this rate, everything other than RL is going to be obsolete pretty soon.
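For anyone who hasn't touched RL: the core loop really is tiny. A textbook-style sketch of tabular Q-learning on a toy corridor world (every name and number here is my own illustration, not from any paper):

```python
import random

def q_learning(n_states=6, episodes=1000, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a 1-D corridor.

    The agent starts at cell 0; reaching the last cell pays reward 1
    and ends the episode. Actions: 0 = step left, 1 = step right.
    """
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy action selection.
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # One-step Q-learning update toward reward plus discounted future value.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy (take the action with the larger Q value in each cell) walks straight to the goal. The hard part was never this loop; it was making it work at the scale of Go, and that's exactly what just happened.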