Facebook paper on using unlabeled data to augment datasets

heltok · Oct 19, 2019

I think this paper deserves a mention here:
Billion-scale semi-supervised learning for state-of-the-art image and video classification

“Accurate image and video classification is important for a wide range of computer vision applications, from identifying harmful content, to making products more accessible to the visually impaired, to helping people more easily buy and sell things on products like Marketplace. Facebook AI is developing alternative ways to train our AI systems so that we can do more with less labeled training data overall, and also deliver accurate results even when large, high-quality labeled data sets are simply not available. Today, we are sharing details on a versatile new model training technique that delivers state-of-the-art accuracy for image and video classification systems.

This approach, which we call semi-weak supervision, is a new way to combine the merits of two different training methods: semi-supervised learning and weakly supervised learning. It opens the door to creating more accurate, efficient production classification models by using a teacher-student model training paradigm and billion-scale weakly supervised data sets. If the weakly supervised data sets (such as the hashtags associated with publicly available photos) are not available for the target classification task, our method can also make use of unlabeled data sets to produce highly accurate semi-supervised models.”

“Semi-supervised learning offers a different approach to decreasing AI systems’ dependence on labeled data sets. The method trains a target model using large amounts of unlabeled data in combination with a small set of labeled examples.

The first step is to train a larger-capacity and highly accurate “teacher” model with all available labeled data sets. The teacher model is designed to predict the labels and corresponding soft-max scores for all the unlabeled examples. These examples are then ranked against each concept class. Top-scoring examples are used for pretraining the lightweight, computationally highly efficient “student” classification model. The final step is to fine-tune the student model with all the available labeled data. The target model learns both from its teacher and the unlabeled datasets at the pre-training stage.

This proposed model training framework produces models with higher accuracy compared with the fully supervised regime, in which the target model is trained only on labeled data.

While this high-level description outlines the basic principles for semi-supervised learning, we have found that many nuanced decisions affect the performance of semi-supervised frameworks in practice. Furthermore, semi-supervised training has not previously been explored at this scale (with billions of content examples) for image and video classification models evaluated on competitive academic benchmarks.”
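To make the selection step in the quote above concrete, here is a minimal sketch of ranking unlabeled examples per class by teacher softmax score and keeping the top K. It is only an illustration, not the paper's implementation: the names `softmax`, `top_k_per_class`, `teacher_logits`, and `k` are made up for this example, and the hard-coded logits stand in for the outputs of a real teacher model.

```python
import math


def softmax(logits):
    """Convert one example's raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_per_class(scores, k):
    """Rank examples against each class and keep the k highest-scoring ones.

    scores: list of per-example probability lists, shape (n_examples, n_classes).
    Returns {class_index: [example indices, best first]}.
    """
    n_classes = len(scores[0])
    selected = {}
    for c in range(n_classes):
        # Sort example indices by the teacher's score for class c, descending.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i][c], reverse=True)
        selected[c] = ranked[:k]
    return selected


# Hypothetical teacher outputs for 4 unlabeled examples and 2 classes.
teacher_logits = [
    [2.0, 0.1],
    [0.2, 1.5],
    [1.0, 0.6],
    [0.3, 1.1],
]
teacher_scores = [softmax(row) for row in teacher_logits]

# Examples chosen to pretrain the student, grouped by pseudo-label.
pseudo_labeled = top_k_per_class(teacher_scores, k=2)
print(pseudo_labeled)
```

Note that in this sketch the same example can land in more than one class's list if K is large enough. The student would then be pretrained on these pseudo-labeled examples and fine-tuned on the real labels, which is not shown here.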

“We believe that learning from unlabeled data sets is the path forward for improving state-of-the-art classification models. Human annotation resources will continue to be resource intensive, difficult to scale, and sometimes simply unavailable. But ongoing hardware advances are making it easier to train on extremely large sets of photos or videos. Billion-scale unlabeled data sets will be an important tool for training highly accurate visual understanding models.

By developing training methods that do not rely solely on data that’s been labeled for training purposes by humans, we hope to develop systems that are more versatile and are able to generalize to unseen tasks — potentially bringing us closer to our goal of achieving AI with human-level intelligence.”

My take:
Tesla happens to have access to a lot of unlabeled data and can request more data of a specific type whenever they need it. The competition relies on less, but perhaps higher-quality, data. IMO, once Tesla implements this (is this related to Project Dojo, perhaps?), they can expect another step-change improvement in performance for scenarios where they have less labeled data. Tesla might have something similar already, and maybe this can help the competition catch up with Tesla. Anyway, it seems that Tesla’s approach is the right one.

Your takes on this?

alsetym · Oct 20, 2019

My understanding is that the biggest gains available for self-driving are still from software. I think the leaders won't just be inching the bar forward; it will be silicon transistors all over again.

Case in point: the human brain learns an untold number of times faster than the best AI, so we know we have lots of room to grow.
