Using driver behaviour as a source of automatic labels for semantic segmentation

Paper: “Minimizing Supervision for Free-space Segmentation”

Here's an awesome application of weakly supervised learning for semantic segmentation of free space (i.e. unobstructed roadway that a car can safely drive on). The researchers use human driving as a form of automatic labelling instead of manual annotation. They exploit the fact that wherever humans drive is free space (or at least it is 99.99%+ of the time). The researchers note:
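
To make the general idea concrete: if you log where the vehicle subsequently drove, you can project that future path back into the current camera image and mark those pixels as free space. This is not the paper's method (their superpixel approach is sketched further down), just a minimal illustration of the "wherever humans drive is free space" trick; the camera matrices, log format, and dilation width below are all hypothetical:

```python
import numpy as np

def auto_label_free_space(future_positions, K, T_world_to_cam, image_shape):
    """Project the vehicle's future driven path into the current camera
    frame and mark those pixels as free space. All inputs are hypothetical:

    future_positions: (N, 3) world-frame points the vehicle later drove over,
      e.g. from odometry logs.
    K: (3, 3) camera intrinsic matrix.
    T_world_to_cam: (4, 4) world-to-camera extrinsic transform.
    image_shape: (height, width) of the camera image.
    """
    h, w = image_shape
    mask = np.zeros((h, w), dtype=np.uint8)  # 1 = auto-labelled free space

    # Homogeneous world points -> camera frame.
    pts_h = np.hstack([future_positions, np.ones((len(future_positions), 1))])
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0]  # keep only points in front of the camera

    # Pinhole projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    for u, v in uv.astype(int):
        if 0 <= v < h and 0 <= u < w:
            # Dilate each projected point to roughly the vehicle's track width.
            mask[max(0, v - 2):v + 3, max(0, u - 15):u + 15] = 1
    return mask
```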

“Of course, fully supervised somewhat outperforms our results (0.853 vs 0.835). Nonetheless, it is impressive that our technique achieves 98% of the IoU [Intersection over Union] of the fully supervised model, without requiring the tedious pixel-wise annotations for each image. This indicates that our proposed method is able to perform proper free-space segmentation while using no manual annotations for training the CNN [convolutional neural network].”

If you can automatically label 10,000x as much data (or more) as you can afford to manually label — which is true for a company like Tesla — then I would imagine weakly supervised learning would outperform fully supervised learning. A hybrid approach in which you use a combination of manually labelled and automatically labelled data might outperform both.

According to research from Baidu, test error on benchmarks like ImageNet decreases roughly as a power law with the amount of labelled training data, such that a 10,000x increase in data would yield very roughly a 10x to 100x reduction in error (assuming the neural network doesn't run into any fundamental limits).
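
For what it's worth, the Baidu result ("Deep Learning Scaling Is Predictable, Empirically") is usually stated as test error falling as a power law in training-set size. A toy back-of-the-envelope version, where the exponent is an assumption (Baidu reported task-dependent exponents roughly between -0.07 and -0.35):

```python
# Toy power-law scaling: test error ~ n^beta for training-set size n.
# The exponent beta = -0.3 is an assumption for illustration.
def scaled_error(base_error, data_multiplier, beta=-0.3):
    return base_error * data_multiplier ** beta

# 10,000x more data with beta = -0.3:
print(scaled_error(0.10, 10_000))  # ~0.0063, i.e. roughly a 16x reduction in error
```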

Paper abstract:

“Identifying "free-space," or safely driveable regions in the scene ahead, is a fundamental task for autonomous navigation. While this task can be addressed using semantic segmentation, the manual labor involved in creating pixel-wise annotations to train the segmentation model is very costly. Although weakly supervised segmentation addresses this issue, most methods are not designed for free-space. In this paper, we observe that homogeneous texture and location are two key characteristics of free-space, and develop a novel, practical framework for free-space segmentation with minimal human supervision. Our experiments show that our framework performs better than other weakly supervised methods while using less supervision. Our work demonstrates the potential for performing free-space segmentation without tedious and costly manual annotation, which will be important for adapting autonomous driving systems to different types of vehicles and environments.”

A key excerpt:

“We now describe our technique for automatically generating annotations suitable for training a free-space segmentation CNN. Our technique relies on two main assumptions about the nature of free-space: (1) that free-space regions tend to have homogeneous texture (e.g., caused by smooth road surfaces), and (2) there are strong priors on the location of free-space within an image taken from a vehicle. The first assumption allows us to use superpixels to group similar pixels. ... The second assumption allows us to find “seed” superpixels that are very likely to be free-space, based on the fact that free-space is usually near the bottom and center of an image taken by a front-facing in-vehicle camera.”
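
A rough sketch of the seed-selection step this excerpt describes, using SLIC superpixels from scikit-image. The bottom-center thresholds are my guesses at the location prior, not values from the paper:

```python
import numpy as np
from skimage.segmentation import slic

def seed_superpixels(image, n_segments=300):
    """Pick "seed" superpixels likely to be free space, following the
    paper's location prior: free space tends to sit near the bottom
    and center of a front-facing camera image. The thresholds below
    are illustrative guesses, not the paper's values.
    """
    h, w = image.shape[:2]
    segments = slic(image, n_segments=n_segments, compactness=10)

    seeds = []
    for label in np.unique(segments):
        ys, xs = np.nonzero(segments == label)
        cy, cx = ys.mean(), xs.mean()
        # Keep superpixels centred in the bottom quarter of the image
        # and the middle half horizontally (assumed prior).
        if cy > 0.75 * h and 0.25 * w < cx < 0.75 * w:
            seeds.append(label)
    return segments, seeds
```

In the paper, seeds like these are then used to train the segmentation CNN without any manual pixel-wise annotation.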

Open access PDF:

http://openaccess.thecvf.com/conten...inimizing_Supervision_for_CVPR_2018_paper.pdf

Examples of segmentations included in the paper:

[Image: example free-space segmentation results from the paper]
 
On the topic of weakly supervised learning, I'm reminded of something Elon said on the ARK Invest podcast in February. At 14:25, he said:

“...and we're really starting to get quite good at not even requiring human labeling. Basically, the person, say, drives the intersection... and is thereby training Autopilot what to do.”

There are different ways to interpret what this quote could mean. For example, Elon could be referring to imitation learning. I think it's interesting, though, to consider other ways driver behaviour could provide automatic labels for computer vision tasks. For example, when a traffic light is detected, what about using human driving behaviour to label the light as red, yellow, or green?
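
A toy sketch of how that traffic-light idea might work. None of the signals, thresholds, or field names below come from Tesla or from the paper; they are purely illustrative guesses about what driving logs might contain:

```python
def weak_traffic_light_label(ego_speeds, light_visible_mask, stop_speed=0.5):
    """Infer a weak red/green label for a detected traffic light from
    driver behaviour. Inputs are hypothetical log fields:

    ego_speeds: vehicle speeds (m/s) sampled while approaching the
      intersection.
    light_visible_mask: parallel booleans, True while the detector
      reports a traffic light ahead.
    """
    speeds_at_light = [v for v, seen in zip(ego_speeds, light_visible_mask) if seen]
    if not speeds_at_light:
        return None
    if min(speeds_at_light) < stop_speed:
        return "red"    # the driver came to a stop: weakly label the light red
    if min(speeds_at_light) > 3.0:
        return "green"  # the driver sailed through: weakly label it green
    return None         # slowed but never stopped: too ambiguous (maybe yellow)
```

Labels produced this way would be noisy (drivers also stop for cross traffic, pedestrians, and so on), which is exactly why this kind of supervision is "weak" and needs a large volume of data to wash out the noise.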
 
Imitation learning, as a class of reinforcement learning, relies on diverse techniques to discover a better policy (a behavioural algorithm), with explore-exploit being a common one.

You would precisely **not** want to run explore-exploit while driving: you don't occasionally run a red light to see if you've been wrong all along. But you can do it in shadow mode. In intuitive terms: while in shadow mode, say your cameras classify an image as containing a red light for your lane and other lights for the other lanes. Then you want to make sure your behaviour matches what humans would do. Occasionally you could log “with what probability would I run this red light under these perceived inputs?” and then classify the driver's behaviour over the next few seconds as having run it or not (or having slammed the brakes vs. not, etc.).
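
A minimal sketch of that shadow-mode comparison, with all field names and thresholds invented for illustration:

```python
def shadow_mode_log_entry(model_stop_probability, human_stopped):
    """Compare the shadow policy's output against what the human did.
    A large gap between the model's stop probability and the human's
    actual stop/go decision flags the frame for upload and review.
    All field names and thresholds here are invented for illustration.
    """
    disagreement = abs(model_stop_probability - (1.0 if human_stopped else 0.0))
    return {
        "model_stop_probability": model_stop_probability,
        "human_stopped": human_stopped,
        "disagreement": disagreement,
        "flag_for_upload": disagreement > 0.5,
    }

# e.g. the shadow policy was 90% sure it should stop, but the human drove through:
print(shadow_mode_log_entry(0.9, human_stopped=False))
```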

One of the good things about imitation learning and shadow-mode collection is that it's safe. However, (a) it is very slow to learn robust policies, since it can't expose itself to all sorts of states and decisions, and (b) it can never exceed (collective) human performance.

I think that, more than analysis of what comes down in new firmware versions, an exhaustive analysis of what information is being sent up to Tesla could be very illustrative of the ML techniques possibly being used.
 

Right, I think maybe you're describing supervised imitation learning (a.k.a. behavioural cloning) for planning tasks like traversing an intersection.
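
For concreteness, behavioural cloning is just supervised learning on (observation, human action) pairs. A minimal PyTorch-style sketch, with the architecture and data shapes as assumptions:

```python
import torch
import torch.nn as nn

# Minimal behavioural cloning sketch: regress the human driver's control
# outputs from perception features. Shapes and architecture are assumptions.
policy = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),  # 128-dim perception features (assumed)
    nn.Linear(64, 2),               # 2 outputs: steering, throttle
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def clone_step(features, human_actions):
    """One supervised step: minimise the MSE between the policy's action
    and the action the human driver actually took."""
    loss = nn.functional.mse_loss(policy(features), human_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch of 32 logged frames.
print(clone_step(torch.randn(32, 128), torch.randn(32, 2)))
```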

In my post above, I was referring more to weakly supervised learning for computer vision tasks like classifying a traffic light as red, green, or yellow, similar to the technique used in the paper I cited above, where human driving behaviour is used to classify pixels from a vehicle camera as corresponding to free space.

I'm wondering if similar techniques could be used for other computer vision tasks besides free space segmentation. What about traffic light classification, for example? What about something else? Any ideas?
 