jimmy_d
Deep Learning Dork
My takeaway from that great Karpathy talk is that NNs have shifted the problem from direct coding to data labeling and categorization, which was somewhat known anyway, but it's nice to have it spelled out so clearly.
However, to speculate on my own question to @jimmy_d upthread -- namely, why are Teslas crashing into fire trucks, street sweepers, etc. but braking for stopped cars? -- it seems that the answer might be as simple as this: they haven't categorized them because they make up a tiny fraction (something like 1e-3) of the vehicles they capture.
In other words, how many Teslas are approaching street sweepers and fire trucks in normal driving? And even when they do -- how does Tesla find, categorize, and train on this data?
If this is true -- then it really highlights the incredible difficulty of solving vision by labeled examples: all the crashes and accidents will simply involve things you haven't trained the system to recognize, i.e. the rare occurrences -- like a Winnebago towing a car, or an earth mover that parked in a lane...
Man I'm glad you came up with that conclusion on your own (it's a good answer) because I tried writing an answer to your question and it kept getting long and messy. It's a hard question to answer, at least for me, in a way that's short, true, complete and easy to understand. Not a natural teacher I guess.
Your insight about the difficulty of dealing with rare events is correct. It's possible to deal with them by 'whitening' the database -- you filter the data to change the relative representation of events, reducing stuff that's common and augmenting events that are important and less common. There are lots of different tricks that can be used, but it requires studying your training data and how it affects the performance of your network, and then adjusting the set of training data to improve the network's behavior. If you have a lot of data it can be pretty time-consuming.
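To make the 'whitening' idea concrete, here's a minimal sketch of one of the simplest tricks in that family: undersample the common classes and oversample the rare ones so the network sees the important-but-rare events more often. This isn't Tesla's actual pipeline -- the class names, counts, and target are all invented for illustration:

```python
import random
from collections import Counter

def rebalance(samples, target_per_class, seed=0):
    """Return a new list of (features, label) pairs with roughly equal class counts.

    Common classes are randomly undersampled down to the target;
    rare classes are repeated (a stand-in for real augmentation) up to it.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)

    out = []
    for label, group in by_label.items():
        if len(group) >= target_per_class:
            # Common class: randomly drop examples.
            out.extend(rng.sample(group, target_per_class))
        else:
            # Rare class: cycle through the few examples we have.
            out.extend(group[i % len(group)] for i in range(target_per_class))
    rng.shuffle(out)
    return out

# Made-up dataset: 10,000 ordinary cars vs. 12 fire trucks.
data = [("img", "car")] * 10000 + [("img", "fire_truck")] * 12
balanced = rebalance(data, target_per_class=500)
print(Counter(label for _, label in balanced))
# Both classes now appear 500 times each.
```

In practice you'd augment the rare examples (crops, flips, color jitter) rather than repeat them verbatim, but the effect on the class distribution is the same.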
There's another thing about your query that I thought was worth addressing too, which is the 'why are things that seem simple to me hard for the network to do' element. What I wanted to offer there is this: NNs as they are used today are basically making decisions based on very complex statistical analysis of pixel distributions in an image, and they are doing it without any high-level priors. A 'prior' is something that you know to be true about the data that you tell the network to assume in its analysis. For instance, we humans know the world is 3D, so that's a 'prior' for humans in terms of how they interpret what they see. If you assume that what your eyes are looking at is a 3D world, that simplifies things a lot, and because it's *true* it will almost never steer you wrong. Other priors are things like simple physics, the fact that light travels in straight lines and is occluded by opaque bodies, and the notion of time. All of this stuff is so obvious to a person that we're not even aware that we know it. But today we don't tell our networks any of that stuff. As far as the network knows, it's looking at a 4-dimensional window on a 10-dimensional universe. It's just looking for a way, any way, to match up patterns in the pixels to what it's told is a car, a lane, a sign, and so forth.
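The word 'prior' comes from Bayesian statistics, and a toy Bayesian example shows what a prior buys you: the same pixel evidence leads to different conclusions depending on what you assume up front. (This is a simplified cousin of the structural priors discussed above -- 3D geometry, physics -- and every number here is invented.)

```python
def posterior(likelihoods, priors):
    """Return normalized P(class | evidence) from per-class likelihoods and priors."""
    unnorm = {c: likelihoods[c] * priors[c] for c in likelihoods}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Suppose the raw pixel statistics weakly favor "fire_truck" over "car"...
likelihoods = {"car": 0.4, "fire_truck": 0.6}
# ...but we *know* fire trucks are rare on the road (the prior).
priors = {"car": 0.999, "fire_truck": 0.001}

post = posterior(likelihoods, priors)
print(post)  # "car" dominates despite the weaker pixel evidence
```

A network with no priors is stuck working from the likelihood side alone; baking in what we already know about the world changes what the same evidence means.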
So to a human there are a lot of things that are so crazy simple to "see" that we have a hard time understanding why a camera cannot "see" them. And it's because the job that NNs are doing when they "look" is much harder, in a sense, than what humans have to do.
So why don't we tell networks all this simple stuff we know is true? One reason is that we can get pretty impressive performance even if we don't tell the network this stuff. Another is that the techniques for 'telling' the network these things are very new, quite computationally demanding, and nobody really trusts them yet. So right now networks are being built in the way that is well understood and mature and trustworthy (relatively, anyway). But these other techniques will gradually mature and come into use, and when they do, the 'mistakes' that networks make will look more like the 'mistakes' that humans make and it'll all make more sense. All this deep learning stuff is brand-stinking-new, and it's frankly amazing that, just a short few years after the very first working deep networks, we are already able to use them in the real world. But we've barely begun to scratch the surface of what these things will eventually be able to do. Which is why I'm personally pretty sanguine about the potential for my current car to have real FSD *eventually*. I think it's a hard problem, but I also see crazy fast progress.
<oh no - another blithering digression>
Where we are with NNs today feels to me like where electronics was in the decade after the invention of the transistor. At the time people were figuring out NPN and PNP, thinking about thyristors, trying to make them work better over temperature, get consistent behavior, increase the gain, and so forth. They were doing all that stuff with single transistors, and it was mostly guys in academia doing research. Transistors were so amazing compared to tubes that it was just mind-boggling, and for a long time we just worked on making the best transistor that we could. Nobody was really working on what to do with a thousand, much less a million, much less a *billion* transistors. NNs are advancing much, much faster than the transistor did - 10x per year right now. In the not too distant future we'll be building NNs that are as far ahead of what we have today as a smartphone is ahead of a transistor radio.