Basic NN question, sorry to hijack: how does a neural net actually work? So let's say it's trained on billions of videos so it "knows" what to do in most situations. What does that "knowledge" actually look like?
I mean, the NN can't just sit there and continuously look at all the videos all the time and choose something to do, that's impossible. It must have some kind of generalized understanding that it actually uses in different contexts. Can anyone probe the NN and really see what's actually going on, in the immediate functioning of the system? Can we look at its "understanding" (whatever form that really is) and see a finished form of the net result of all its training?
I guess another way of asking is -where- does it store -what- information, and how is that information processed from perception to action?
I know, go take a class in computer science, or at least google this, but y'all seem really knowledgeable about how it all works, so maybe someone can provide an executive summary for newbies?
My biggest problem is understanding how billions of frames of sequential images made up of pixels (video) can make an impression on some mysterious NN "program" that can be separated from the training and loaded into a pretty simple computer in the car.
All righty then. I've actually been through some training.
So, it all starts out with Real Neurons, like the ones in your brain. A given brain cell of the appropriate type has a large number of fibers (dendrites on the input side, an axon on the output side) that stick far out from the cell body. If my memory is working OK today, these connections tend to be one-way: That is, the "output" fiber of a given cell nearly touches an "input" fiber of another cell, at a junction called a synapse.
Now for the tricky bit. Say that, during some time interval, a number of these input fibers get tickled by zaps arriving from other cells. Each input has a "weight". (Nerves run with discharge pulses that vary in rate rather than amplitude; each pulse is all-or-nothing, not continuous like, say, a steady tone or DC value.) If the incoming zaps, multiplied by the weights of their inputs and summed up, come out above some threshold, the cell puts out its own zap. Which is picked up by a further layer of cells.
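That weighted-sum-and-threshold behavior is exactly what an artificial neuron imitates. Here's a minimal sketch in Python; the weights and threshold are made-up numbers, purely for illustration:

```python
# A single artificial neuron: weighted sum of inputs, then a threshold.
# The weights and threshold are arbitrary illustration values.
def neuron(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0  # fire a "zap", or stay quiet

# Three inputs getting "tickled" with different strengths:
print(neuron([1.0, 0.5, 0.0], [0.4, 0.6, 0.9], 0.5))  # 0.4 + 0.3 = 0.7 > 0.5, fires: 1
print(neuron([0.2, 0.1, 0.0], [0.4, 0.6, 0.9], 0.5))  # 0.08 + 0.06 = 0.14, quiet: 0
```

All the "knowledge" in the cell lives in those weight numbers; change the weights and the cell fires on different patterns.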
Now, a neural network device: Each stage (layer) in such a device has individual cells with connections to the cells in the previous stage; each of those connections has a weight. And, just to add to the fun, during training the outputs of the final stage are compared against the desired answer, and the resulting error is fed back through the whole array to adjust the weights at every stage, all the way down to the inputs of the (usually) first stage. (The usual name for this is backpropagation.)
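Stacked into stages, the forward (pixels-to-answer) pass of a tiny two-stage network can be sketched like this; the layer sizes and weight values are invented for the example:

```python
# Forward pass through a tiny two-stage network.
# Each cell: activation of (sum of inputs * that cell's weights).
def layer(inputs, weights):
    return [max(0.0, sum(x * w for x, w in zip(inputs, row)))  # ReLU activation
            for row in weights]

# Invented weights: 2 inputs -> 3 middle cells -> 1 output cell.
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
W2 = [[1.0, 0.5, -0.5]]

hidden = layer([1.0, 2.0], W1)   # first stage
output = layer(hidden, W2)       # second stage picks up the first stage's zaps
print(output)                    # [0.7]
```

The stored "understanding" the original question asks about is nothing more than these weight tables; the training videos themselves are thrown away once the weights are learned.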
All right. Set up the whole process for detecting Giraffes. Repetitively feed pixel images of Giraffes into a NN as above, starting with random weights on all the inputs, on all of the stages. Decide that one (or several) of the outputs will be Logic One (or a code) when a Giraffe is sighted. When one has a Giraffe and the code isn't present, change the weights and do it again until the code shows up. Keep this up with Giraffes right side up, upside down, coming at you, running away, grabbing leaves off of trees, and so on. Keep on varying the weights until, with a Giraffe present, you always get the code. For that matter, while you're doing this, check for not-a-Giraffe: Make sure that the code is not present on a similarly large set of images where there's no Giraffe to be seen. Just keep on a-changing the weights until one has a solid "there's a Giraffe" and "there's no Giraffe" on one's training sets.
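That change-the-weights-and-try-again loop is, in spirit, the perceptron learning rule, the granddaddy of modern training methods. A toy version, with two made-up numeric features standing in for giraffe images:

```python
# Perceptron-style training: nudge the weights whenever the output is wrong.
def predict(inputs, weights, bias):
    return 1 if sum(x * w for x, w in zip(inputs, weights)) + bias > 0 else 0

# Toy training set: ([invented features], 1 = Giraffe, 0 = not-a-Giraffe).
data = [([1.0, 1.0], 1), ([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0)]

weights, bias, rate = [0.0, 0.0], 0.0, 0.1
for epoch in range(20):                 # "do it again", repeatedly
    for inputs, target in data:
        error = target - predict(inputs, weights, bias)
        weights = [w + rate * error * x for w, x in zip(weights, inputs)]
        bias += rate * error

print([predict(x, weights, bias) for x, _ in data])  # [1, 1, 0, 0], matching the targets
```

Real networks use gradient descent over millions of weights rather than this simple nudge, but the shape of the loop is the same: show an example, compare the output to the wanted code, adjust, repeat.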
This is where it gets weird. If one has trained up a NN like this and one shows it a picture of a Giraffe in a forest, where maybe there are only a couple of spots to be seen, or an odd horn sticking out here or there - the NN finds it. NNs are Really Good at image recognition.
First time I heard of this approach was when taking a tour of the Remote Sensing lab at Purdue. These guys were taking pictures of the Earth with sensors tuned to specific wavelengths of light. They were using these to figure out how many acres of which crops were being planted, whether the crops were infected with something, and other socially and scientifically useful ways of predicting what the crop tonnages would be that year. So, they were running something very like a neural net (except that it was actually FORTRAN image processing, back in the day, and a run took hours) for this detection.
The fun part was that the previous users of the software system had been looking for airplanes. The crop people looked at the output of a run and saw Yea Many Acres of Corn, Yea Many Acres of Wheat, Yea Many Acres of Soybeans, ... and three airplanes. They went back to these shot-from-space 50 mile by 50 mile images in Impossible Resolution and, yeah - there were three airplanes flying by at different locations and altitudes. The researchers thought it was pretty amusing.
And now we get into the engineering aspects of all this. Digital computers work on steps, one step at a time. (There might be multiple cores, but, still.) They can go ridiculously fast - but it's still one step at a time.
Analog computers (at least, the old, traditional types, which may or may not be electronic) aren't like that. They have continuous inputs and, critically, continuous outputs. Change an input? The output changes nearly immediately. Yeah, there are propagation delays (gear lash on Norden Bombsights, transistor propagation delays with capacitance), but these are
small.
A single cell in a Neural Network is designed to multiply and add up a bunch of inputs in real time, all at once.
And an overall NN? It's a massively parallel analog computer that can, for example, look at
all the pixels in an image at the same time and come up Giraffe. Or dog.
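In software, that all-pixels-at-once behavior shows up as matrix arithmetic: flatten the image into one vector, and a single multiply-and-add sweep gives every cell its weighted view of every pixel (on a GPU or NN accelerator, those multiplies genuinely run in parallel). A sketch with a tiny 2x2 "image" and invented weights:

```python
# The whole image flattened into one vector; one multiply-and-add sweep
# computes every cell's weighted sum over all of the pixels.
image = [0.0, 1.0,
         1.0, 0.0]                       # tiny 2x2 "image", flattened

# One row of weights per cell: 2 cells, each looking at all 4 pixels.
W = [[0.50, -0.50, -0.50, 0.50],         # cell 0: responds to the diagonal
     [0.25,  0.25,  0.25,  0.25]]        # cell 1: average brightness

sums = [sum(p * w for p, w in zip(image, row)) for row in W]
print(sums)  # [-1.0, 0.5] -- each cell's view of the entire image
```

Python loops over the pixels here, but the math is one matrix-vector product, which dedicated hardware evaluates in a handful of parallel steps regardless of how many cells there are.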
There's this phrase: "The right tool for the right job." NNs are Really Good at image recognition and much faster than a digital computer trying to do the same thing step by step. Our brains (which, admittedly, aren't that close to the NN used in Teslas) have Issues with, say, multiplying 18-digit numbers, something a digital computer handles without breaking a sweat. But they're pretty blasted good at spotting that Lion in the Weeds.
Now, the people at Tesla are the experts at how to use NNs. If they can shape the problem to be solved to something that can be handled by a NN, the advantage is typically in speed.
And I guess that this is the point: V11 and earlier were using the NNs in the computer to identify objects around the car; C++ code or whatever took those objects and synthesized a way of driving the car through the real world. The change with V12 is to take that C++ code and swap it out for a NN approach as well - and that's where the speedup kicks in.