
Relevant new papers


heltok

Can we have a thread where we post all new papers presenting results relevant to Tesla and autonomous vehicles?

Will start off with this one:
https://arxiv.org/pdf/1904.04998.pdf

We present a novel method for simultaneous learning of depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as supervision signal. Similarly to prior work, our method learns by applying differentiable warping to frames and comparing the result to adjacent ones, but it provides several improvements: We address occlusions geometrically and differentiably, directly using the depth maps as predicted during training. We introduce randomized layer normalization, a novel powerful regularizer, and we account for object motion relative to the scene. To the best of our knowledge, our work is the first to learn the camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing us to extract accurate depth and motion from arbitrary videos of unknown origin at scale. We evaluate our results on the Cityscapes, KITTI and EuRoC datasets, establishing new state of the art on depth prediction and odometry, and demonstrate qualitatively that depth prediction can be learned from a collection of YouTube videos.
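
For anyone wondering what the "differentiable warping" in that abstract comes down to geometrically, here is a minimal sketch, not the paper's code. It assumes a plain pinhole camera with no lens distortion, and the function name `reproject_pixel`, the intrinsics tuple, and the sample numbers are all made up for illustration. A pixel is back-projected using its predicted depth, moved by the predicted camera motion, and projected into the neighboring frame; the training signal would then come from comparing image values at the two locations.

```python
def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Map pixel (u, v) with known depth from frame A into frame B.

    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    rotation: 3x3 nested list, translation: 3-vector (frame A -> frame B).
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in camera A coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    point = (x, y, depth)
    # Apply the rigid camera motion to express the point in camera B.
    moved = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    # Project back to pixel coordinates in frame B.
    u_b = fx * moved[0] / moved[2] + cx
    v_b = fy * moved[1] / moved[2] + cy
    return u_b, v_b, moved[2]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)  # hypothetical intrinsics

# The camera moves 1 m to the right, so the point sits 1 m further left in frame B.
u_b, v_b, z_b = reproject_pixel(420.0, 240.0, 10.0, K, IDENTITY, [-1.0, 0.0, 0.0])
```

In the paper's setting the depth, the motion, and the intrinsics are all network outputs, which is why every step in this chain has to be differentiable.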
 
https://arxiv.org/pdf/1903.12650.pdf

Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds

Abstract—There has been a strong demand for algorithms that can execute machine learning as fast as possible and the speed of deep learning has accelerated by 30 times only in the past two years. Distributed deep learning using the large mini-batch is a key technology to address the demand and is a great challenge as it is difficult to achieve high scalability on large clusters without compromising accuracy. In this paper, we introduce optimization methods which we applied to this challenge. We achieved the training time of 74.7 seconds using 2,048 GPUs on ABCI cluster applying these methods. The training throughput is over 1.73 million images/sec and the top-1 validation accuracy is 75.08%.

Karpathy (9:30 into the video) said that he recommends retraining the network after every commit (including commits of new labeled data). With this, they might be able to bring build times down a lot.
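
A quick back-of-the-envelope check of the numbers in that abstract. Only the throughput, the wall time, and the GPU count come from the abstract. The ImageNet-1k training set size is the standard figure, and treating the quoted throughput as constant over the whole run is my own simplification, so the "passes over the dataset" number is a rough upper-end estimate, not the paper's epoch count.

```python
THROUGHPUT_IMAGES_PER_SEC = 1.73e6  # "over 1.73 million images/sec"
TRAIN_TIME_SEC = 74.7
NUM_GPUS = 2048
IMAGENET_TRAIN_IMAGES = 1_281_167   # assumption: standard ImageNet-1k train split

# Total images seen if the quoted throughput held for the full run.
images_processed = THROUGHPUT_IMAGES_PER_SEC * TRAIN_TIME_SEC
# Share of the throughput carried by each GPU.
per_gpu_throughput = THROUGHPUT_IMAGES_PER_SEC / NUM_GPUS
# How many times the training set would have been traversed.
equivalent_passes = images_processed / IMAGENET_TRAIN_IMAGES

print(f"images processed: {images_processed:.3g}")
print(f"images/sec per GPU: {per_gpu_throughput:.0f}")
print(f"equivalent passes over ImageNet: {equivalent_passes:.0f}")
```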
 
https://arxiv.org/pdf/1812.07179.pdf
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies — a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that data representation (rather than its quality) accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations — essentially mimicking LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance — raising the detection accuracy of objects within 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo image based approaches.
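
The "pseudo-LiDAR" conversion in that abstract can be pictured as turning each depth pixel into a 3D point. Below is a rough sketch under my own assumptions: a pinhole camera, a hypothetical helper called `depth_map_to_points`, toy intrinsics, and a cut-off on forward depth rather than true Euclidean range. The paper's actual pipeline may differ in its details.

```python
def depth_map_to_points(depth_map, fx, fy, cx, cy, max_range=None):
    """Back-project a dense depth map (rows of depths in meters) to 3D points.

    Returns (x, y, z) tuples in the camera frame: x right, y down, z forward.
    Pixels with non-positive depth, or with depth beyond max_range, are dropped.
    """
    points = []
    for v, row in enumerate(depth_map):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # no valid depth estimate for this pixel
            if max_range is not None and z > max_range:
                continue  # too far away to be useful
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x3 "depth image" with hypothetical intrinsics.
depth = [
    [10.0, 20.0, 40.0],
    [0.0, 5.0, 30.0],
]
cloud = depth_map_to_points(depth, fx=2.0, fy=2.0, cx=1.0, cy=0.5, max_range=30.0)
```

The resulting point list has the same shape of data a LiDAR-based detector expects, which is the representation change the authors argue for.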
 
Billion-scale semi-supervised learning for image classification
This paper presents a study of semi-supervised learning with large convolutional networks. We propose a pipeline, based on a teacher/student paradigm, that leverages a large collection of unlabelled images (up to 1 billion). Our main goal is to improve the performance for a given target architecture, like ResNet-50 or ResNext. We provide an extensive analysis of the success factors of our approach, which leads us to formulate some recommendations to produce high-accuracy models for image classification with semi-supervised learning. As a result, our approach brings important gains to standard architectures for image, video and fine-grained classification. For instance, by leveraging one billion unlabelled images, our learned vanilla ResNet-50 achieves 81.2% top-1 accuracy on the ImageNet benchmark.

This should be relevant to Tesla: they can label a small part of their dataset and use this approach to automatically label a much larger unlabelled dataset.
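
To make the teacher/student idea concrete, here is a small sketch of the auto-labeling step. The abstract only names the paradigm, so ranking the unlabelled images per class by teacher confidence and keeping the top K is my assumption of how the selection could work, and `select_pseudo_labels` plus the toy scores are hypothetical.

```python
def select_pseudo_labels(teacher_scores, top_k):
    """Pick the top_k most confident unlabelled images for each class.

    teacher_scores: one list of class probabilities per unlabelled image.
    Returns {class_index: [image_index, ...]} ordered by confidence.
    """
    num_classes = len(teacher_scores[0])
    selected = {}
    for cls in range(num_classes):
        ranked = sorted(
            range(len(teacher_scores)),
            key=lambda i: teacher_scores[i][cls],
            reverse=True,
        )
        selected[cls] = ranked[:top_k]
    return selected


# Hypothetical teacher outputs for 5 unlabelled images and 2 classes.
scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
    [0.7, 0.3],
]
pseudo_labeled = select_pseudo_labels(scores, top_k=2)
```

The student would then be trained on the selected images with the teacher's class as the label, and fine-tuned on the small hand-labeled set.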
 

The ability to decompose scenes in terms of abstract building blocks is crucial for general intelligence. Where those basic building blocks share meaningful properties, interactions and other regularities across scenes, such decompositions can simplify reasoning and facilitate imagination of novel scenarios. In particular, representing perceptual observations in terms of entities should improve data efficiency and transfer performance on a wide range of tasks. Thus we need models capable of discovering useful decompositions of scenes by identifying units with such regularities and representing them in a common format. To address this problem, we have developed the Multi-Object Network (MONet). In this model, a VAE is trained end-to-end together with a recurrent attention network -- in a purely unsupervised manner -- to provide attention masks around, and reconstructions of, regions of images. We show that this model is capable of learning to decompose and represent challenging 3D scenes into semantically meaningful components, such as objects and background elements.
MONet: Unsupervised Scene Decomposition and Representation
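
One way to picture the recurrent attention part: each step claims a share of every pixel, and a running "scope" tracks what is still unexplained, so the masks always add up to one. This is a sketch of that bookkeeping only, written from my understanding of the approach rather than taken from the abstract; `decompose_masks` and the toy attention values are hypothetical, and the attention network and VAE are left out entirely.

```python
def decompose_masks(alphas):
    """Turn per-step attention values into masks that sum to 1 per pixel.

    alphas: list of K-1 lists, each holding a value in [0, 1] per pixel
    (what a hypothetical attention network would output at each step).
    Returns K masks; the last one takes whatever is still unexplained.
    """
    num_pixels = len(alphas[0])
    scope = [1.0] * num_pixels  # fraction of each pixel still unexplained
    masks = []
    for alpha in alphas:
        masks.append([s * a for s, a in zip(scope, alpha)])
        scope = [s * (1.0 - a) for s, a in zip(scope, alpha)]
    masks.append(scope)  # final slot absorbs the remainder
    return masks


# Two attention steps over three pixels give three masks.
masks = decompose_masks([[0.5, 1.0, 0.0], [0.5, 0.3, 1.0]])
per_pixel_total = [sum(m[p] for m in masks) for p in range(3)]
```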
 
https://openreview.net/pdf?id=rJl-b3RcF7

Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.
We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation— reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.

So two different goodies:
1. They found a good way to prune NNs (to make them faster, smaller, and less power-hungry).
2. They do 1 by finding the lucky randomness, i.e. the subnetwork whose initial weights happened to train well. This can be used when re-training a new NN (let's say after you commit some new data to the training dataset), making training faster and able to reach even higher performance.
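
A toy sketch of the two steps above, assuming the "standard pruning technique" is magnitude pruning (drop the weights that ended up smallest after training) and that the surviving weights are then rewound to their original initial values. The function names and the six-weight "layer" are hypothetical, and a real run would repeat train, prune, and rewind over several rounds.

```python
def magnitude_mask(trained_weights, prune_fraction):
    """Return a 0/1 mask that drops the smallest-magnitude trained weights."""
    num_pruned = int(len(trained_weights) * prune_fraction)
    order = sorted(range(len(trained_weights)), key=lambda i: abs(trained_weights[i]))
    pruned = set(order[:num_pruned])
    return [0 if i in pruned else 1 for i in range(len(trained_weights))]


def winning_ticket(initial_weights, trained_weights, prune_fraction):
    """Keep the surviving connections but rewind them to their initial values."""
    mask = magnitude_mask(trained_weights, prune_fraction)
    ticket = [w * m for w, m in zip(initial_weights, mask)]
    return ticket, mask


# Hypothetical flattened layer: weights at initialization and after training.
init = [0.10, -0.20, 0.05, 0.30, -0.15, 0.25]
trained = [0.02, -0.90, 0.01, 0.70, -0.05, 0.60]
ticket, mask = winning_ticket(init, trained, prune_fraction=0.5)
```

The mask is decided by the trained weights, but the ticket itself carries the initial values, which is the "lucky randomness" part.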
 
This one could be useful for increasing accuracy on very rare events:
https://arxiv.org/pdf/1905.09272.pdf

Large scale deep learning excels when labeled images are abundant, yet data-efficient learning remains a long-standing challenge. While biological vision is thought to leverage vast amounts of unlabeled data to solve classification problems with limited supervision, computer vision has so far not succeeded in this ‘semi-supervised’ regime. Our work tackles this challenge with Contrastive Predictive Coding, an unsupervised objective which extracts stable structure from still images. The result is a representation which, equipped with a simple linear classifier, separates ImageNet categories better than all competing methods, and surpasses the performance of a fully-supervised AlexNet model. When given a small number of labeled images (as few as 13 per class), this representation retains a strong classification performance, outperforming state-of-the-art semi-supervised methods by 10% Top-5 accuracy and supervised methods by 20%. Finally, we find our unsupervised representation to serve as a useful substrate for image detection on the PASCAL-VOC 2007 dataset, approaching the performance of representations trained with a fully annotated ImageNet dataset. We expect these results to open the door to pipelines that use scalable unsupervised representations as a drop-in replacement for supervised ones for real-world vision tasks where labels are scarce.
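
The abstract does not spell out the objective, but Contrastive Predictive Coding is normally trained with a contrastive (InfoNCE-style) loss: the model has to pick the true target patch out of a set of distractors. Here is a minimal sketch of that loss, assuming plain dot-product scoring and hand-written toy vectors; `info_nce_loss` is a hypothetical helper, not the authors' code.

```python
import math


def info_nce_loss(context, candidates, positive_index):
    """Cross-entropy of picking the true target among the candidates.

    context: prediction vector; candidates: list of target vectors, one of
    which (positive_index) is the true match while the rest are negatives.
    """
    scores = [sum(c * t for c, t in zip(context, target)) for target in candidates]
    # Numerically stable log-sum-exp over all candidate scores.
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[positive_index]


context = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
loss_aligned = info_nce_loss(context, candidates, positive_index=0)
loss_misaligned = info_nce_loss(context, candidates, positive_index=2)
```

The loss is small when the prediction lines up with the true target and large when a distractor scores higher, which is what pushes the representation to capture stable structure without labels.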
 
https://arxiv.org/pdf/1905.11946.pdf

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.
To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.
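
A small sketch of what a "compound coefficient" can look like in practice: one knob, phi, raises fixed per-dimension base factors to the same power. The default base factors below are the values I recall from the paper and should be treated as assumptions, as should the rule of thumb that convolution cost grows linearly with depth and quadratically with width and resolution. `compound_scale` is a hypothetical name.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth/width/resolution multipliers for compound coefficient phi.

    alpha, beta, gamma are per-dimension base factors; the defaults are
    assumed values, not something stated in the abstract.
    """
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution_mult = gamma ** phi
    # Conv FLOPs grow roughly linearly in depth and quadratically in
    # width and input resolution.
    flops_mult = depth_mult * width_mult ** 2 * resolution_mult ** 2
    return depth_mult, width_mult, resolution_mult, flops_mult


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these base factors each step of phi roughly doubles the compute, so the network grows in all three dimensions at once instead of only getting deeper or only getting wider.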
 
Here's a thought-provoking high-level review of developments in self-supervised learning,
with references to AlexNet, ResNet, newer work by Kaiming He about masked auto-encoders
providing "latent representations", Jean-Rémi King's work with transformers, etc.

The idea of self-supervised networks training highly recurrent networks is intriguing.
We hope Tesla has room to leverage some of this in its FSD efforts:

from Quanta Magazine: Self-Taught AI Shows Similarities to How the Brain Works