
Interview with Andrej Karpathy about AI, autonomy, and Tesla


shrineofchance



 
I haven't finished listening to the whole thing yet. The part about showing a NN 50,000 variations of a fire hydrant got me thinking. Would a two-stage NN work, where the first stage just determines primitive geometric shapes such as cylinders, boxes, and spheres? That stage would be trained to see such shapes in many different lighting conditions and surface textures. That NN spits out a rudimentary jumble of 3D shapes, which goes into a second NN that can determine whether it's a human, fire hydrant, car, etc.
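(For concreteness, here is a rough sketch of what that two-stage idea could look like as code. This is purely illustrative: the module names, layer sizes, and class counts are made up and have nothing to do with Tesla's actual networks.)

```python
# Hypothetical two-stage setup: stage 1 guesses primitive shapes per image
# region, stage 2 classifies objects from that "jumble of shapes".
import torch
import torch.nn as nn

class PrimitiveShapeNet(nn.Module):
    """Stage 1: map an image to a coarse grid of primitive-shape scores
    (e.g. cylinder / box / sphere / background per cell)."""
    def __init__(self, num_primitives=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_primitives, 1),  # per-cell primitive scores
        )

    def forward(self, img):
        return self.features(img)  # (N, num_primitives, H/4, W/4)

class ObjectClassifier(nn.Module):
    """Stage 2: map the grid of primitive-shape scores to object classes
    (human, fire hydrant, car, ...)."""
    def __init__(self, num_primitives=4, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(num_primitives, num_classes),
        )

    def forward(self, primitives):
        return self.head(primitives)

img = torch.randn(1, 3, 224, 224)      # dummy camera frame
shapes = PrimitiveShapeNet()(img)      # stage 1: primitive shapes
logits = ObjectClassifier()(shapes)    # stage 2: object classes
```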
 
> I haven't finished listening to the whole thing yet. The part about showing a NN 50,000 variations of a fire hydrant got me thinking. Would a two-stage NN work, where the first stage just determines primitive geometric shapes such as cylinders, boxes, and spheres? That stage would be trained to see such shapes in many different lighting conditions and surface textures. That NN spits out a rudimentary jumble of 3D shapes, which goes into a second NN that can determine whether it's a human, fire hydrant, car, etc.
I noticed how he mentioned that a new, untrained NN is useless; it can't do anything. As you add neural connections, it starts to get better at recognizing things. Fine, I suppose that's NN 101.

So in its current state, the NN in FSD Beta is driving around and it "recognizes" things it's been trained on. It seems to be good with garbage cans, pylons, and stop signs. That's great, but identifying/displaying those covers maybe 0.0001% of the items in the world. OK, so the rest it's putting boxes around. Fine. As was said, it's not really necessary to know whether something is a dog or a cat, just that it's a dog/cat-sized animal.

Is the ongoing goal at Tesla to identify more things? Is that the priority? Is a NN that identifies things more useful than a NN that drives better and doesn't run into things? It's an interesting project to identify things better, but they are developing a driving AI, not a tour guide. Well, so long as they are doing both things at once, fine.
 
> Is a NN that identifies things more useful than a NN that drives better and doesn't run into things?
I think it's pretty clear that they want to identify things in the images that the cameras capture ... so as not to run into things. These folks are pretty smart, and fairly unlikely to be making a mistake like developing a silly electronic tour guide :)
 
@Dan D. There are 4 main parts to a self-driving system:

1. Computer vision & radar-, sonar-, and/or lidar-based perception.

2. Prediction: predicting the future behaviour of vehicles, pedestrians, and cyclists over the next few seconds.

3. Planning (a.k.a. behaviour generation): the generation of driving decisions like “at the intersection, when the light turns green, turn left along such-and-such a trajectory”.

4. Control: the software that translates planning into low-level actuator commands, i.e. steering, acceleration, and braking commands.

Tesla is working on all 4 of these parts of their FSD system.

When the system fails, it is not always clear which part (or parts) of the system failed. Casual observers often make assumptions, but to make an informed guess, you need information like you see in Tesla’s developer mode (“Augmented Vision”).

An important 5th part is integration with maps, whether that be the low-definition navigation maps (similar to Google Maps) that Tesla uses, or the high-definition lidar maps that Waymo uses. AFAIK, in Tesla’s case, this is much less of a technical challenge than (1) through (4). As I understand it, it’s more or less an extension of the GPS navigation technology that has existed in cars for years.
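To make that division of labour concrete, here is a minimal, hypothetical sketch of how the four parts (plus map integration) could fit together as a pipeline. All the function names and signatures below are invented for illustration; this is not how Tesla's stack is actually organized.

```python
# Hypothetical self-driving pipeline: perception -> prediction -> planning
# (with maps) -> control. Stubs only; the structure is the point.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrackedObject:
    kind: str                                    # e.g. "car", "pedestrian", "cyclist"
    position: tuple                              # (x, y) in the ego frame, metres
    predicted_path: Optional[List[tuple]] = None # filled in by prediction

def perceive(camera_frames, radar, sonar) -> List[TrackedObject]:
    """(1) Perception: detect and localise objects from sensor data."""
    ...

def predict(objects: List[TrackedObject]) -> List[TrackedObject]:
    """(2) Prediction: estimate each object's path over the next few seconds."""
    ...

def plan(objects: List[TrackedObject], route, nav_map):
    """(3) Planning: choose a manoeuvre and trajectory, e.g. 'when the light
    turns green, turn left along such-and-such a trajectory'. The nav_map
    argument is the (5) map-integration piece."""
    ...

def control(trajectory):
    """(4) Control: translate the planned trajectory into steering,
    acceleration, and braking commands."""
    ...

def drive_one_tick(sensors, route, nav_map):
    objects = perceive(sensors.cameras, sensors.radar, sensors.sonar)
    objects = predict(objects)
    trajectory = plan(objects, route, nav_map)
    control(trajectory)
```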
 
> I haven't finished listening to the whole thing yet. The part about showing a NN 50,000 variations of a fire hydrant got me thinking. Would a two-stage NN work, where the first stage just determines primitive geometric shapes such as cylinders, boxes, and spheres? That stage would be trained to see such shapes in many different lighting conditions and surface textures. That NN spits out a rudimentary jumble of 3D shapes, which goes into a second NN that can determine whether it's a human, fire hydrant, car, etc.
AFAIK, this is already sort of how convolutional neural networks work.
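(Roughly speaking: in an ordinary convolutional network the early layers end up responding to low-level structure like edges and blobs, the middle layers to simple shapes and parts, and the later layers to whole objects, without anyone hand-specifying a "primitive shapes" stage. A toy sketch, assuming PyTorch; the layer sizes and class count are arbitrary.)

```python
# A plain end-to-end CNN: no explicit "shape primitives" stage, but the
# intermediate activations play a similar role and can be inspected.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # early layers: edge/blob-like features
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # middle layers: simple shapes/parts
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # later layers: object-level features
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 5),                            # e.g. human / hydrant / car / sign / other
)

img = torch.randn(1, 3, 64, 64)
early_features = cnn[:2](img)   # peek at the first conv block's activations
logits = cnn(img)               # full end-to-end prediction
print(early_features.shape, logits.shape)
```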
 
> @diplomat33 Have you listened to this?

Yes.

It's a good interview. Karpathy interviews are always good, although I already knew most of what they discussed, so there was not a ton of new info. I like Karpathy. He is smart and knowledgeable, he makes ML interesting, and he always does good interviews. One thing that I especially like about Karpathy is that he is a "true scientist": he is focused on the science and the engineering, like how ML works, how NNs work, how the data fits, etc. He does not offer crazy predictions on when FSD will be finished or when Tesla will achieve L5. He is simply focused on the task at hand and solving technical problems.
 
> Yes.
>
> It's a good interview. Karpathy interviews are always good, although I already knew most of what they discussed, so there was not a ton of new info. I like Karpathy. He is smart and knowledgeable, he makes ML interesting, and he always does good interviews. One thing that I especially like about Karpathy is that he is a "true scientist": he is focused on the science and the engineering, like how ML works, how NNs work, how the data fits, etc. He does not offer crazy predictions on when FSD will be finished or when Tesla will achieve L5. He is simply focused on the task at hand and solving technical problems.

He loves to make fun of the lidar approach though. He's more diplomatic nowadays, but in the past, it was funny when he kept bringing up the "exact distance to the leaf on that tree."
 
> He loves to make fun of the lidar approach though. He's more diplomatic nowadays, but in the past, it was funny when he kept bringing up the "exact distance to the leaf on that tree."

Yeah, that is the only thing that irks me. In the past, he's flat-out misrepresented the Waymo approach. And yes, he is more diplomatic about it now. Thankfully, in this interview he did not do the really dumb "exact distance to the leaf on the tree" bit.
 
I found a lot of new ideas and sentiments from Karpathy to appreciate and ruminate over in the interview. I am more excited for FSD Beta v9 than ever because Karpathy speaks persuasively about the new ideas Tesla is integrating into its AI and about the power of neural networks themselves. And he seems excited himself.
 
> One interesting bit from the podcast: Karpathy said "you can drive around with a few cars with lidar and use sensor annotation". (Don't remember if this is an exact quote or a paraphrase; just saw it in my notes.) The first hint we got that Tesla might be doing this was in November 2016. The concept is similar to the radar-supervised learning that Karpathy talked about on Autonomy Day.

Yes, we knew this. Tesla basically has a couple of cars with lidar that drive around, and they use the lidar data only to train the camera vision. Since they know the lidar data is accurate, they can use it to "calibrate" the camera vision. For example, lidar gives extremely accurate depth measurements, so they can compare lidar depth with the depth estimated from camera vision to see how good the camera vision is, and keep training the camera vision until its depth estimates are as close as possible to the lidar data. It's like checking the answer key to see if you have the correct answers on your exam: lidar is the "answer key". Continuing the analogy, Tesla wants to practice (train the camera vision) until they can get an A+ on the exam without needing the answer key.
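A minimal sketch of that "answer key" idea, assuming a PyTorch-style setup: a camera-only depth network is trained to match lidar-derived ground-truth depth. The DepthNet here is a made-up stand-in, not anything Tesla has published.

```python
# Supervise camera-only depth estimation with lidar ground truth.
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Predicts a per-pixel depth map from a single camera image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus(),  # depth >= 0
        )

    def forward(self, img):
        return self.net(img)

model = DepthNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step with dummy data standing in for a camera frame and the
# corresponding lidar depth map ("the answer key").
camera_frame = torch.randn(1, 3, 128, 128)
lidar_depth  = torch.rand(1, 1, 128, 128) * 80.0   # metres
valid_mask   = lidar_depth > 0                     # real lidar returns are sparse

pred_depth = model(camera_frame)
loss = nn.functional.l1_loss(pred_depth[valid_mask], lidar_depth[valid_mask])
loss.backward()
optimizer.step()
```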
 
> I haven't finished listening to the whole thing yet. The part about showing a NN 50,000 variations of a fire hydrant got me thinking. Would a two-stage NN work, where the first stage just determines primitive geometric shapes such as cylinders, boxes, and spheres? That stage would be trained to see such shapes in many different lighting conditions and surface textures. That NN spits out a rudimentary jumble of 3D shapes, which goes into a second NN that can determine whether it's a human, fire hydrant, car, etc.
This is called "feature engineering", and presumes that we (humans) know exactly what intermediate stages of object classification are most useful and relevant, and just as importantly, which are irrelevant. The catch is that a lot of the "fluff" surrounding the shape identification (e.g. "this is pretty much a sphere, but deviates from a sphere in such-and-such hard-to-enumerate ways") turns out to be highly useful, and would be lost in the process you describe. It nearly always ends up being more effective to just let the NN figure out the whole thing for itself end-to-end.

An exception might be in something like audio processing, where pre-chewing the waveform data through an FFT or wavelet transform (to provide a clean frequency spectrum) might make the neural net's job a lot easier. This is sort of the equivalent of "giving the neural net a calculator", but the important thing is that the transform preserves all the information from the input and doesn't exclude anything.
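A small example of that kind of pre-chewing, assuming NumPy: the raw waveform is split into overlapping windowed frames and each frame is run through an FFT, so a network downstream would see a clean frequency picture while, thanks to the overlapping windows, essentially nothing from the original signal is thrown away.

```python
# Naive short-time FFT as a preprocessing step for audio.
import numpy as np

def stft(waveform, frame_len=512, hop=256):
    """Split the signal into overlapping Hann-windowed frames and take the
    FFT of each one."""
    window = np.hanning(frame_len)
    frames = [
        np.fft.rfft(waveform[start:start + frame_len] * window)
        for start in range(0, len(waveform) - frame_len + 1, hop)
    ]
    return np.stack(frames)           # (num_frames, frame_len // 2 + 1), complex

# Dummy 1-second "audio" signal at 16 kHz: a 440 Hz tone plus noise.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sample_rate)

spectrum = stft(waveform)
magnitude = np.abs(spectrum)          # a clean frequency picture for the NN
print(magnitude.shape)
```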