Referencing this post:
Seeing the world in autopilot, part deux
I’m posting this here because it’s mostly about neural networks. I’ll add a link over there pointing to this post.
“Seeing the world in autopilot, part deux” observations:
First, some background:
Based on taking apart a set of AP2 binaries earlier this year, I came up with a general structure for how data flows through the system, and I was able to identify some intermediate processing products from the nature of the data structures, how they were being used, and the names of variables.
This general structure has a group of advanced CNNs processing the output from each of the seven navigation cameras (excluding the backup camera), followed by a second set of networks that I called post-processing networks. The camera networks identify and localize several classes of objects in the field of view of all of the cameras; among the types of objects detected appear to be vehicles, traffic signals, and lane markings. The second layer of networks generates outputs focused on identifying and understanding the shape of lanes, assigning vehicles to lanes, predicting whether other vehicles are moving, stopped, or parked, and identifying physical landmarks (mainly poles and maybe the corners of buildings).
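To make that structure concrete, here’s a minimal sketch of the two-stage data flow as I understand it. Every name in it (run_camera_network, CameraDetections, and so on) is my own invention for illustration - nothing here is recovered from the binaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CameraDetections:
    """Stage 1 output: one frame, one camera, no cross-frame identity."""
    camera_id: str                       # e.g. "main", "narrow", "pillar_left"
    vehicles: List[dict] = field(default_factory=list)
    traffic_signals: List[dict] = field(default_factory=list)
    lane_markings: List[dict] = field(default_factory=list)

def run_camera_network(camera_id: str, frame) -> CameraDetections:
    """Stage 1: a per-camera CNN that sees only a single frame."""
    return CameraDetections(camera_id)   # stub

def run_post_processing(stage1: List[CameraDetections], prev_state: dict) -> dict:
    """Stage 2: the only place where outputs from all cameras (and prior
    state) can be combined into lane shape, lane assignments, motion
    states, and landmarks."""
    return {"lanes": [], "objects": [], "landmarks": []}   # stub

def tick(frames: Dict[str, object], prev_state: dict) -> dict:
    stage1 = [run_camera_network(cid, f) for cid, f in frames.items()]
    return run_post_processing(stage1, prev_state)
```

The important property is that stage 1 is stateless and per-camera, while stage 2 is where multi-camera and multi-frame context can live.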
All of this was from looking at the code. I had started that analysis hoping to find a way to understand the system’s capabilities, but without being able to see the code in action, what I could come up with was pretty limited.
Now we see some beautiful output from the efforts of @verygreen and @DamianXVI, which extends my earlier observations by giving us examples of what comes out of the network.
So I’m going to interpret what’s happening in @verygreen’s video here in light of what I’ve seen in the code.
First - the annotations here seem to be output from the second layer of networks, not from the primary camera networks. There are various ways to show this, but a simple one is this: vehicle IDs in the video persist from frame to frame. It’s not possible for the camera networks to do that, because they only process one frame at a time and have no knowledge of other frames or of any machine state beyond a single frame of camera output. Downstream networks have to correlate the output from successive frames of camera network output in order to allow the ID to persist.
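To see why persistent IDs imply a stateful downstream layer, here’s a toy tracker sketch: per-frame detections only acquire stable IDs by being matched against the previous frame’s tracks. I’m not claiming this greedy IoU matcher is what AP does - only that *some* cross-frame state like this has to live downstream of the camera networks.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign_ids(prev_tracks, detections, next_id, thresh=0.3):
    """Greedy cross-frame matching: each new detection inherits the ID of
    the best-overlapping track from the previous frame, else a fresh ID.
    The per-frame camera networks cannot do this -- only a stateful
    downstream consumer of their output can."""
    tracks, used = {}, set()
    for det in detections:
        best_id, best = None, thresh
        for tid, box in prev_tracks.items():
            score = iou(det, box)
            if tid not in used and score > best:
                best_id, best = tid, score
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        used.add(best_id)
        tracks[best_id] = det
    return tracks, next_id
```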
The major categories of annotation are: predicted lane boundaries and boundary type, vehicles, trucks, pedestrians, motorcycles, bicycles, and drivable space.
Lane boundaries predict the left and right edges of the acceptable driving lane that the AP vehicle currently occupies, with color coding indicating whether a lane boundary separates oncoming or same-direction traffic. At junctions where a turn might optionally occur, multiple lane boundaries are identified, representing the edges of the driving lane for each of the optional paths; I never saw more than two options at once. Options are shown for the occupied lane, but otherwise the only lane boundaries predicted are the far boundaries of adjacent lanes.
It’s notable that the lane boundaries aren’t just identifying pavement markings or curbs. The boundaries are present even when pavement markings are absent, and both the left and right lane boundaries appear even when only one of them is easily identifiable from what’s in the camera view. Aside from lane markings, AP seems to use the presence and state of other vehicles, and the presence of obstacles, to predict lane boundaries.
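One plausible way to represent what we’re seeing, sketched as a data structure - the names and fields here are my guesses from the video, not anything from the code:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

class Separates(Enum):
    ONCOMING = "oncoming"          # boundary against opposing traffic
    SAME_DIRECTION = "same"        # boundary against a same-direction lane

@dataclass
class LaneBoundary:
    polyline: List[Tuple[float, float]]   # points in the vehicle frame, meters
    separates: Separates
    # Present even with no visible paint: inferred from markings, curbs,
    # other vehicles, or obstacles.

@dataclass
class EgoLaneOption:
    """One candidate driving lane at a junction; at most two options
    seem to appear at once in the video."""
    left: LaneBoundary
    right: LaneBoundary
```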
Objects all seem to carry a confidence value in percent, which probably represents the system’s confidence in its object class prediction. Identified objects optionally carry several attributes, including a lane assignment, a motion state (moving, stopped, or stationary), distance, and relative velocity. Objects also seem to be labeled as to whether AP has a corresponding radar return associated with them. Notably, lane assignment includes distinguishing whether a lane is a parking lane (off-road) or not - a distinction that requires a lot of context. Lane assignment also seems to have a lot of states, covering not just whether the object is in your lane, to the left, or to the right, but also whether it’s straddling lanes, plus something labeled IMM - which might mean “immediately adjacent”.
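Pulling those attributes together, here’s one hypothetical way to organize a tracked object as shown in the overlay. The enum values - especially the reading of IMM - are guesses:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MotionState(Enum):
    MOVING = "moving"
    STOPPED = "stopped"          # halted but expected to move (e.g. at a light)
    STATIONARY = "stationary"    # parked / not expected to move

class LaneAssignment(Enum):
    IN_LANE = "in_lane"
    LEFT = "left"
    RIGHT = "right"
    STRADDLING = "straddling"
    PARKING = "parking"          # off-road parking lane
    IMM = "imm"                  # "immediately adjacent"? -- a guess

@dataclass
class TrackedObject:
    object_id: int
    object_class: str                    # vehicle, truck, pedestrian, ...
    class_confidence_pct: float
    lane: Optional[LaneAssignment] = None
    motion: Optional[MotionState] = None
    distance_m: Optional[float] = None
    rel_velocity_mps: Optional[float] = None
    has_radar_return: bool = False
```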
Drivable space represents unobstructed area that the AP vehicle has physical access to, bounded by edge markings that indicate the kind of obstacle limiting the drivable space at that section of the edge. Vehicles and pedestrians are obstacles in a different class from other sorts of barriers. While traffic cones, bollards, and fencing aren’t called out as discretely identified objects, it’s clear that AP is seeing them and recognizing them functionally, because it adjusts the drivable space according to their presence.
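That suggests a representation along these lines - again hypothetical, with boundary segments typed by the class of obstacle limiting them:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

class EdgeType(Enum):
    VEHICLE = "vehicle"
    PEDESTRIAN = "pedestrian"
    BARRIER = "barrier"    # cones, bollards, fencing, curbs, ...

@dataclass
class DrivableEdge:
    polyline: List[Tuple[float, float]]   # boundary segment, vehicle frame, meters
    limited_by: EdgeType

# Drivable space: a closed region whose boundary is a list of typed edges.
DrivableSpace = List[DrivableEdge]
```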
And finally we have that beautiful, beautiful path prediction arrow in orange. I love this element because it probably gives us the most abstracted and subtle insight into what AP is ‘thinking’ as it moves around the world. I’ll make the strong claim that the path prediction is the output of a neural network, because it behaves probabilistically, seems to be affected by the full context of a scene, lacks hysteresis, and presents a continuous selection space; a human-written heuristic is unlikely to show this behavior. From the shape of the path prediction we can see that AP is making a nuanced prediction of the road shape extending out at least a couple of hundred meters and - and this is really amazing to me - is able to usefully predict the rising/falling shape of the road ahead and predict the probable path of road sections *which it cannot see*. So it estimates that a road around a blind curve will continue curving, and it estimates the direction a road takes over a blind rise even when the road shape leading up to the rise is pretty complicated. This latter is probably the critical capability that solved the disastrous ‘cresting hill’ failure, which finally went away when 2018.10.4 shipped.
So some interesting things we know from this:
- AP2 estimates distance to objects and their relative velocity from vision alone, even when no radar return is available for an object; radar thus looks like a fully redundant backup for the vision capabilities. Whether that comes from stereo vision processing or from scale estimates is still open (see the scale-based sketch after this list), but there’s clearly a useful degree of distance estimation being extracted even for items with no radar return signal.
- The FOV of the camera seems to be quite a bit wider than the FOV of the radar - objects lose their radar signal at the edges of the camera FOV. This seems to be a view from the main camera; if the wide-angle camera has comparable recognition accuracy, then the useful FOV of the vision system is going to be enormously larger than that of the radar.
- AP2 identifies vehicles even when they are substantially occluded, and at considerable distances.
- Strange backgrounds seem to confound identification as much as occlusion does - maybe more. Cyclists seen against a background of traffic have much lower confidence than cyclists with a background of pavement or buildings.
- Radar still seems to be bad at seeing cross traffic - though it seems like vision makes up for that pretty well.
- At a minimum we can see that radar and vision are being fused, since single objects are given both radar and vision attributes (see the association sketch after this list). Is forward sonar also being fused? There doesn’t seem to be any good evidence of that here.
- This video doesn’t rule out the possibility that high-definition maps are being used for driving, but it also doesn’t present any evidence to support it.
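On the scale-estimate possibility from the first bullet above: with a pinhole camera model, any object class of roughly known physical size yields a usable monocular range estimate. A minimal sketch - the focal length and vehicle height below are illustrative numbers, not AP2’s:

```python
def distance_from_scale(focal_px: float, real_height_m: float,
                        bbox_height_px: float) -> float:
    """Pinhole-camera range estimate: an object of known physical size
    subtends fewer pixels the farther away it is, so d = f * H / h."""
    return focal_px * real_height_m / bbox_height_px

# A car roughly 1.5 m tall spanning 25 px, with a focal length of ~1400 px:
print(distance_from_scale(1400.0, 1.5, 25.0))   # -> 84.0 (meters)
```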
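And on fusion: the simplest way a single object could end up carrying both radar and vision attributes is a nearest-neighbor association step like the toy version below. The gate value is arbitrary, and this is not a claim about Tesla’s actual fusion logic:

```python
import math

def associate_radar(vision_objects, radar_returns, gate_m=3.0):
    """Attach each radar return to the closest vision object within a
    distance gate; unmatched objects keep vision-only estimates.
    vision_objects: {object_id: (x, y)} positions in the vehicle frame.
    radar_returns:  [{"position": (x, y), ...}, ...]"""
    fused = {obj_id: None for obj_id in vision_objects}
    for ret in radar_returns:
        best_id, best_d = None, gate_m
        for obj_id, pos in vision_objects.items():
            d = math.dist(pos, ret["position"])
            if fused[obj_id] is None and d < best_d:
                best_id, best_d = obj_id, d
        if best_id is not None:
            fused[best_id] = ret
    return fused
```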
I’d be really interested to know if there’s any evidence of navigation data being fed into AP2 for use in the driving task. And it would also be interesting to know if there’s any evidence of AP gathering data to be used to create HD maps. A lot of groups seem to be relying on HD maps as a critical part of their driver assistance systems (comma.ai, Cruise, Waymo), but so far I haven’t seen any evidence that Tesla is actually doing that - aside from some claims from a few years ago.