Neural Networks

My takeaway from that great Karpathy talk is that NNs have shifted the problem from direct coding to data labeling and categorization, which was somewhat known anyway, but it's nice to have it spelled out so clearly.

However, to speculate on my own question to @jimmy_d upthread -- namely, why are Teslas crashing into fire trucks, street sweepers, etc. but braking for stopped cars? -- it seems that the answer might be as simple as this: Tesla hasn't categorized them because they're in the 1e-3 percentile of vehicles they capture.

In other words, how many Teslas are approaching street sweepers and fire trucks in normal driving? And even when they do -- how does Tesla find, categorize, and train on that data?

If this is true -- then it really highlights the incredible difficulty of solving vision with labeled examples, in that all the crashes and accidents will simply involve things you haven't trained the system to recognize, i.e. the rare occurrences -- like a Winnebago towing a car, or an earth mover parked in a lane...

Man I'm glad you came up with that conclusion on your own (it's a good answer) because I tried writing an answer to your question and it kept getting long and messy. It's a hard question to answer, at least for me, in a way that's short, true, complete and easy to understand. Not a natural teacher I guess.

Your insight about the difficulty of dealing with rare events is correct. It's possible to deal with them by 'whitening' the database: you filter the data to change the relative representation of events, reducing stuff that's common and augmenting events that are important but less common. There are lots of different tricks that can be used, but it requires studying your training data and how it affects the performance of your network, then adjusting the set of training data to improve the network's behavior. If you have a lot of data it can be pretty time consuming.
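To make the rebalancing idea concrete, here's a minimal toy sketch in PyTorch (my own example, not Tesla's pipeline; the class names and counts are made up) of oversampling a rare class so the network sees it far more often per epoch than its raw frequency would allow:

```python
# Toy illustration of rebalancing a rare class by oversampling.
# Not Tesla's pipeline -- class names, counts, and image sizes are made up.
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Fake labels: 1,000 "ordinary car" frames vs 10 "fire truck" frames.
labels = torch.cat([torch.zeros(1_000, dtype=torch.long),   # class 0: ordinary car
                    torch.ones(10, dtype=torch.long)])      # class 1: fire truck
images = torch.randn(len(labels), 3, 16, 16)                # stand-in pixel data
dataset = TensorDataset(images, labels)

# Weight each sample inversely to its class frequency so the rare class
# shows up far more often per epoch than its raw count would allow.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=64, sampler=sampler)
x, y = next(iter(loader))
print("fire trucks in this batch:", (y == 1).sum().item())  # roughly half the batch, not ~1%
```

The catch, as noted above, is knowing how far to push this for each rare event without skewing the network's behavior on the common cases -- which is why it takes studying the data rather than just flipping a switch.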

There's another thing about your query that I thought was worth addressing too, which is the "why are things that seem simple to me hard for the network to do" element. What I wanted to offer there is this: NNs as they are used today are basically making decisions based on very complex statistical analysis of pixel distributions in an image, and they are doing it without any high-level priors. A 'prior' is something that you know to be true about the data that you tell the network to assume in its analysis. For instance, we humans know the world is 3D, so that's a prior for humans in terms of how they interpret what they see. If you assume that what your eyes are looking at is a 3D world, that simplifies things a lot, and because it's *true* it will almost never steer you wrong. Other priors are things like simple physics, the fact that light travels in straight lines and is occluded by opaque bodies, and the notion of time. All of this stuff is so obvious to a person that we're not even aware that we know it. But today we don't tell our networks any of that stuff. As far as the network knows, it's looking at a 4-dimensional window on a 10-dimensional universe. It's just looking for a way, any way, to match up patterns in the pixels to what it's told is a car, a lane, a sign, and so forth.

So to a human there are a lot of things that are so crazy simple to "see" that we have a hard time understanding why a camera cannot "see" them. And it's because the job that NNs are doing when they "look" is much harder, in a sense, than what humans have to do.

So why don't we tell networks all this simple stuff we know is true? One reason is that we can get pretty impressive performance even if we don't tell the network this stuff. Another is that the techniques for 'telling' the network these things are very new, quite computationally demanding, and nobody really trusts them yet. So right now networks are being built in the way that is well understood and mature and trustworthy (relatively, anyway). But these other techniques will gradually mature and come into use, and when they do the 'mistakes' that networks make will look more like the 'mistakes' that humans make and it'll all make more sense. All this deep learning stuff is brand-stinking-new and it's frankly amazing that, just a short few years after the very first working deep networks, we are already able to use them in the real world. But we've barely begun to scratch the surface of what these things will eventually be able to do. Which is why I'm personally pretty sanguine about the potential for my current car to have real FSD *eventually*. I think it's a hard problem, but I also see crazy fast progress.

<oh no - another blithering digression>

Where we are with NNs today feels to me like where electronics was in the decade after the invention of the transistor. At the time people were figuring out NPN and PNP, thinking about thyristors, trying to make them work better over temp, get consistent behavior, increase the gain and so forth. They were doing all that stuff with single transistors and it was mostly guys in academia doing research. Transistors were so amazing compared to tubes that it was just mind boggling and for a long time we just worked on making the best transistor that we could. Nobody was really working on what to do with a thousand, much less a million, much less a *billion* transistors. NNs are advancing much, much faster than the transistor did - 10x per year right now. In the not too distant future we'll be building NNs that are as far ahead of what we have today as a smartphone is ahead of a transistor radio.
 
I suspect that manpower for labeling isn't a huge problem. They probably only need on the order of a dozen or so people who just do labeling. No need to outsource or crowdsource.

Depends how well the tools can interpolate classifications across intervening frames (see the sketch below). Very parallelizable, and one situation where man-months are not so mythical.

As the NN gets better and interprets the scene, it may turn into more confirmation than annotation.

I like the idea of free supercharging in exchange for image classifications.
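As a toy illustration of the interpolation idea (purely hypothetical, not Tesla's actual labeling tooling), a bounding box labeled by hand at two keyframes can be filled in automatically for every frame between them, with the human -- or, as suggested above, the NN itself -- just confirming or nudging the result:

```python
# Toy sketch: linearly interpolate a bounding box between two hand-labeled
# keyframes so a human only annotates a fraction of the frames.
# Purely illustrative -- not Tesla's labeling tool.

def lerp_box(box_a, box_b, t):
    """Linearly interpolate two (x1, y1, x2, y2) boxes; t in [0, 1]."""
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

def interpolate_track(frame_a, box_a, frame_b, box_b):
    """Fill in boxes for every frame between two labeled keyframes."""
    span = frame_b - frame_a
    return {f: lerp_box(box_a, box_b, (f - frame_a) / span)
            for f in range(frame_a, frame_b + 1)}

# Label frame 0 and frame 30 by hand; frames 1..29 come for free and only
# need confirmation.
track = interpolate_track(0, (100, 200, 180, 260), 30, (130, 190, 230, 270))
print(track[15])  # box halfway between the two keyframes
```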
 
I was going to start a new thread on the Karpathy video but since we are talking about it here...

I too am interested in @jimmy_d 's take on the TRAIN AI 2018 vid here:

It's a great talk and I am happy to see him open up a little about what's happening inside the black box of Tesla's development efforts. Considering the consternation in these forums, it seems like even basic info can go a long way. Most of the stuff he talks about with respect to Tesla's efforts tells us that Tesla isn't doing anything exotic. They're labeling data and doing supervised training at scale, just like 99% of the other production systems out there. And he says they're growing the NN relative to the overall system capabilities, which makes me really happy. I think he said something like "deep learning guys think that whole box should be blue (i.e. all NN)" and I thoroughly agree. I'm jazzed that he feels the same, seems to have the wherewithal to push things in that direction, and is being successful enough with that approach that they're letting him do it.

I don't think anything he talks about there is going to be surprising to anyone who works on building NNs for commercial applications today. Quite the opposite really - the approach is very much by the book. I like to think that they have some secret sauce in the works for future developments (Karpathy's certainly capable of it), but it's unsurprising that they're sticking with the basics right now and just trying to push that as far as they can for the time being.
 
Hmm, I didn't really find the talk all that enlightening - a bit worrying really, since they're still talking about labelling and dataset issues. Really, at this stage I'd have hoped they'd be talking about more groundbreaking stuff than the well-known "NNs require a lot of good, well-labelled data" mantra. I know this was a conference specifically about training AI, but the fact that he's still talking about basic vehicle detection and lane line detection is a bit disappointing.

I suspect that manpower for labeling isn't a huge problem. They probably only need on the order of a dozen or so people who just do labeling. No need to outsource or crowdsource.

Mobileye are said to have nearly one thousand (!) employees just doing labelling all day every day - for years already. They clearly understand the power of a well labelled dataset.

Waymo/Google even brought the labelling issue for driving into captcha solvers - I've certainly done a few myself where I have to point out the cars in an image and various other things to "prove I'm a human" (and thus label the image for them).

Also a bit worrying that some of the "crazy" example pictures are actually relatively normal occurrences. For example, the first image, showing a short section of diagonal zig-zag lines that appear before a roundabout (or pedestrian crossing), is standard here in the UK (Zebra crossing - Wikipedia). I have noticed that in previous versions those lines would indeed send Autopilot a bit crazy.

The idea that my AP2 car would actually ever be able to handle the zebra crossing itself is seeming more and more like a pipe dream: it'd have to watch for pedestrians intending to cross (versus just walking by but not crossing), stop the car smoothly, wait for them to cross, then resume - and for a busy crossing, know when to push forward slowly so as not to get stuck there forever. Even with HD maps telling the car in advance exactly where the crossing is, the level of decision making there is way beyond anything we see mentioned or hinted at by Tesla. They're seemingly still at the basic classification stages.
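Just to make that decision-making gap concrete, here's an entirely hypothetical toy state machine for the crossing behaviour described above -- nothing Tesla has shown, just an illustration of how much logic sits on top of "detect pedestrian":

```python
# Toy state machine for handling a zebra crossing -- entirely hypothetical,
# just to show how much decision logic sits on top of perception outputs.
from enum import Enum, auto

class CrossingState(Enum):
    APPROACH = auto()
    STOPPING = auto()
    WAITING = auto()
    CREEPING = auto()
    RESUME = auto()

def step(state, pedestrian_intends_to_cross, crossing_clear, waited_s):
    if state is CrossingState.APPROACH:
        return CrossingState.STOPPING if pedestrian_intends_to_cross else CrossingState.RESUME
    if state is CrossingState.STOPPING:
        return CrossingState.WAITING
    if state is CrossingState.WAITING:
        if crossing_clear:
            return CrossingState.RESUME
        # Busy crossing: after a long wait, creep forward to assert intent
        # without forcing anyone off the crossing.
        return CrossingState.CREEPING if waited_s > 20 else CrossingState.WAITING
    if state is CrossingState.CREEPING:
        return CrossingState.RESUME if crossing_clear else CrossingState.CREEPING
    return CrossingState.RESUME

print(step(CrossingState.APPROACH, True, False, 0))  # -> CrossingState.STOPPING
```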

The unknown and difficult bit of FSD isn't the image collection or data labelling, but the path planning, assertive decision making, negotiation with other actors, the mapping etc etc. Whether or not there's a training aspect to any NNs used here, I don't know... but it would have been nice to even pick up a tiny hint that they're actually looking at this stuff...

It might be, of course, that Mobileye simply talk about this stuff in the open a lot more -- which I'm glad they do, because it is actually fascinating. For example, the most recent talk goes into more detail and shows that they're working on some really interesting stuff.
 
I'm happy that they stick to a pragmatic approach, going step by step - and letting us participate in each step. Isn't that fun for us nerds? What other car gives us this opportunity?
Telling us now that they are indeed working on all sorts of local-road stuff (lights, signs), even trying to watch out for blinkers on cars ahead. It's just a matter of time before we see these features coming. It's a when now, not an if.

I completely agree that Mobileye is far more evolved on the whole matter, but I don't care because I can't participate.
 
I think he said something like "deep learning guys think that whole box should be blue (i.e. all NN)" and I thoroughly agree. I'm jazzed that he feels the same, seems to have the wherewithal to push things in that direction, and is being successful enough with that approach that they're letting him do it.
If the whole system is one big NN, how can you handle different traffic rules in different regions?
 
If the whole system is one big NN, how can you handle different traffic rules in different regions?

Feed region info to the single NN via a precategorized GPS location.
(And hope the tools allow for giving strong hints to the NN; though with a sim that knows the laws, it could self-train to correlate location to rule, but that seems really painful.)

While I understand where Karpathy and @jimmy_d are coming from with an all SW 2.0 / NN stack, it seems like some things, such as which side of the street to park on during winter based on plate # and date, should stay hard-coded...
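For what it's worth, a minimal sketch of what "feed region info to the single NN" could look like: a region code derived offline from GPS, embedded and concatenated onto the vision features before the rule-dependent outputs. The RegionAwareHead name, layer sizes, and region list are my own assumptions for illustration, not anything Tesla has described:

```python
# Minimal sketch of feeding a region code into an otherwise vision-only NN.
# The module name, embedding size, and layer sizes are assumptions for
# illustration -- not anything Tesla has described.
import torch
import torch.nn as nn

NUM_REGIONS = 4  # e.g. US, EU, UK, JP -- mapped offline from GPS

class RegionAwareHead(nn.Module):
    def __init__(self, feature_dim=512, num_outputs=10):
        super().__init__()
        self.region_embed = nn.Embedding(NUM_REGIONS, 16)
        self.head = nn.Sequential(
            nn.Linear(feature_dim + 16, 128),
            nn.ReLU(),
            nn.Linear(128, num_outputs),  # rule-dependent outputs (paths, right-of-way, etc.)
        )

    def forward(self, image_features, region_id):
        region = self.region_embed(region_id)             # (batch, 16)
        return self.head(torch.cat([image_features, region], dim=1))

head = RegionAwareHead()
feats = torch.randn(2, 512)                               # stand-in for vision backbone output
out = head(feats, torch.tensor([0, 2]))                   # same pixels, different region -> can differ
print(out.shape)  # torch.Size([2, 10])
```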
 
@mrkisskiss
Thanks for the link!
Whenever I watch Amnon talk and explain Mobileye's current state of the art, I'm always kind of overwhelmed by how far along they are and especially how well designed their system is. From my outside view it's just phenomenal. I then think: is it even possible that others might surpass them?

Some points:
  • ME does it with vision only too; radar and lidar are used only as backup and redundancy. ME seems to create a very accurate 3D model out of vision alone.
  • ME's approach is to hide the sensors, contrary to Waymo and most others, so as not to get bullied by other drivers.
  • ME wants to use the fleet of BMWs, VWs, Nissans, etc. to create HD maps from their cameras - something like Tesla's (infamous) shadow mode? Are they allowed to do that? I have never seen a BMW ask me to share data with a mothership. But maybe in new models?
 
Feed region info to the single NN via a precategorized GPS location.
(And hope the tools allow for giving strong hints to the NN; though with a sim that knows the laws, it could self-train to correlate location to rule, but that seems really painful.)

While I understand where Karpathy and @jimmy_d are coming from with an all SW 2.0 / NN stack, it seems like some things, such as which side of the street to park on during winter based on plate # and date, should stay hard-coded...

IMO this is what the map data is needed for.

Things like timed/zoned parking controls, correct side of the road, roundabout rules (i.e. direction, who has right of way), nose in or tail in at superchargers, etc, etc.

I am sure that some of that stuff can be interpreted by the NN directly from the map data, but for each feature there would have to be a cost/benefit analysis. Is it really worth training the NN to understand every parking permit scheme in the world?
 
Pulling from general investor:
Full Self Driving and Stock Valuation.

Elon was talking about driving to work and back as an early milestone.

If you look at Self Driving, there are two major value propositions:

1) low cost (more fully utilized capital) rentals (sharing economy) stuff
2) helping out the person who bought the car on routine routes that they have driven before - like Elon's first milestone.

You get margin, market share and stock price appreciation, even if you only do item 2.

Please ignore the low information stream below this line.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Now to the safety model. Humans (well, this human at least) use a 'difference engine' on routine routes to improve signal to noise.
I only see differences along the route. Anything that is the 'same as it ever was' isn't processed or remembered. Improving signal to noise is a key step toward reducing response time with no false triggers.

If hardware could subtract off all the stationary/static information on habitual routes, the car could focus on differences and motion. It would demonstrate a better safety record. Tesla started a program to filter out stationary landmarks like road signs with radar after the Florida truck run-in. I think the goal was to filter less aggressively in general after selectively filtering out those known landmarks. Did that work out?

I also don't know if they are doing something like this to increase the frame rate on their cameras (another way of saying: filter less in the time domain). They have likely turned up the frame rate to meet some sort of Nyquist sampling rate on blinker turn signals. Faster computers help with this. But I don't know if they are subtracting out the background on a high-speed video signal. The distortion from being in a moving vehicle with a wide-angle lens seems almost impossible to deal with, versus a stationary radar like, say, at an airport.

Maybe something could be done on just the center section of the camera sensor data, where the lens distortion is not so bad? The part near the path of the vehicle that you care most about.

Does anyone know if Tesla is doing anything with vision to subtract routine stationary objects so they don't have to be processed? If this works to improve signal to noise on a high data rate video stream, this will let the car trigger action with fewer false alarms. This means safer for everybody.
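For reference, the "only process what changed" idea is easy to sketch from a *stationary* camera with plain frame differencing (OpenCV below; "dashcam.mp4" is just a hypothetical input clip). From a moving car every pixel changes between frames, so you'd first need ego-motion compensation -- warping the previous frame by the estimated camera motion -- which is exactly the hard part being asked about here:

```python
# Rough sketch of "only look at what changed" via simple frame differencing.
# Works cleanly only from a stationary camera; from a moving vehicle you'd
# need ego-motion compensation before differencing. Input file is hypothetical.
import cv2

cap = cv2.VideoCapture("dashcam.mp4")   # hypothetical input clip
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not read video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)                  # pixels that changed since the last frame
    _, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    moving_pixels = cv2.countNonZero(motion_mask)        # crude "how much is new/moving" signal
    prev_gray = gray
```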

Your below the line section is underrated.
Much of what you are referring to depends on whether the processing is CPU/GPU limited. If it is, then they need pre-filters / downsampling. However, it also means they have a serious issue when presented with a new route that they don't have the ability to pre-process (akin to screen lag in a video game when there are many objects on screen). Karpathy's recent talk showed the NN running in constant time, so no region-specific downsampling.

That said, the image preprocessing likely has image dewarping, vehicle motion compensation, and such.
 
However, to speculate on my own question to @jimmy_d upthread -- namely, why are Teslas crashing into fire trucks, street sweepers, etc. but braking for stopped cars? -- it seems that the answer might be as simple as this: Tesla hasn't categorized them because they're in the 1e-3 percentile of vehicles they capture.

Seems like they really need to concentrate on recognizing when the car is hurtling face-first toward a stationary object. They don't seem to track objects from frame to frame, though, so all the AI sees is a series of unknown objects, without connecting that they are the same object getting closer.
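Just to illustrate what frame-to-frame tracking involves, here's a toy sketch of associating detections across frames by bounding-box overlap (IoU), so a series of "unknown objects" becomes one track whose growth can be watched. This is a generic textbook approach, not a claim about what Tesla's software does:

```python
# Toy sketch of associating detections across frames by IoU overlap so that
# "unknown object at frame N" and "unknown object at frame N+1" become the
# same track. Generic illustration only -- not Tesla's tracker.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, threshold=0.3):
    """Greedily match each new detection to the existing track with the best IoU."""
    matches = {}
    for det_idx, det in enumerate(detections):
        best = max(tracks, key=lambda tid: iou(tracks[tid], det), default=None)
        if best is not None and iou(tracks[best], det) >= threshold:
            matches[det_idx] = best
            tracks[best] = det          # the track keeps following the object
    return matches

tracks = {0: (100, 100, 140, 140)}      # one known object from the previous frame
print(associate(tracks, [(105, 102, 150, 148)]))  # {0: 0}: same object, slightly bigger/closer
```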
 
Seems like they really need to concentrate on recognizing when the car is hurtling face-first toward a stationary object. They don't seem to track objects from frame to frame, though, so all the AI sees is a series of unknown objects, without connecting that they are the same object getting closer.

All of this depends on how radar works.
http://paos.colorado.edu/~fasullo/1060/resources/radar.htm
So, it depends on which form of radar you are using?
 
Seems like they really need to concentrate on recognizing when the car is hurtling face-first toward a stationary object. They don't seem to track objects from frame to frame, though, so all the AI sees is a series of unknown objects, without connecting that they are the same object getting closer.

Elon Tweet Jun 10th:
That issue is better in latest Autopilot software rolling out now & fully fixed in August update as part of our long-awaited Tesla Version 9. To date, Autopilot resources have rightly focused entirely on safety. With V9, we will begin to enable full self-driving features.
 
Seems like they really need to concentrate on recognizing when the car is hurtling face-first toward a stationary object. They don't seem to track objects from frame to frame, though, so all the AI sees is a series of unknown objects, without connecting that they are the same object getting closer.

They are probably going to need multiple independent ways of sensing a clear driving path. If radar can't do it reliably then something like lidar will need to be used. It's obvious at this point that Waymo has solved this issue, at least at 40 mph. Musk's claim that it can be done with just cameras and radar is just conjecture at this point.
 
Seems like they really need to concentrate on recognizing when the car is hurtling face-first toward a stationary object. They don't seem to track objects from frame to frame, though, so all the AI sees is a series of unknown objects, without connecting that they are the same object getting closer.

I would hope their NN is doing more than just single image frame analysis, because being able to stitch together a moving image scene temporally is a huge scene analysis win.

And yes, specifically for stationary objects, as I said upthread, you should be able to determine rough distance from objects based on how quickly they are growing from frame to frame.
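A quick back-of-envelope version of that "growing from frame to frame" idea: for an object you're closing on, time-to-contact is roughly its apparent size divided by the rate that size is growing, with no need to know the object's true size or distance. Numbers below are made up, just to show the arithmetic:

```python
# Back-of-envelope "looming" calculation: apparent size grows as you close on
# an object, so time-to-contact ~ size / (rate of size growth), independent of
# the object's real size. Made-up numbers for illustration.

def time_to_contact(height_prev, height_now, dt):
    """tau = h / (dh/dt): seconds until contact if the closing speed is constant."""
    dh_dt = (height_now - height_prev) / dt
    if dh_dt <= 0:
        return float("inf")             # not getting closer
    return height_now / dh_dt

# Bounding-box height grows from 40 px to 44 px over 0.1 s:
print(round(time_to_contact(40, 44, 0.1), 2))   # ~1.1 s to contact -- time to brake
```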
 
I would hope their NN is doing more than just single image frame analysis, because being able to stitch together a moving image scene temporally is a huge scene analysis win.

And yes, specifically for stationary objects, as I said upthread, you should be able to determine rough distance from objects based on how quickly they are growing from frame to frame.

The image pre-filter could be doing the motion estimation - similar to the Pong NN, where the data fed in was the difference between the current and previous frames.
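A tiny sketch of that Pong-style trick, assuming the pre-filter simply hands the network a difference channel alongside the current frame (the shapes and two-channel layout are arbitrary stand-ins, not Tesla's format):

```python
# Sketch of the Pong-style preprocessing mentioned above: give the network an
# explicit motion channel (current frame minus previous frame). Shapes are
# arbitrary stand-ins for illustration.
import numpy as np

def preprocess(frame, prev_frame):
    """Return a 2-channel input: current frame plus what changed since last frame."""
    diff = frame.astype(np.float32) - prev_frame.astype(np.float32)
    return np.stack([frame.astype(np.float32), diff], axis=0)   # (2, H, W)

prev = np.zeros((96, 96), dtype=np.uint8)
cur = np.zeros((96, 96), dtype=np.uint8)
cur[40:50, 60:70] = 255                     # something moved into view
x = preprocess(cur, prev)
print(x.shape, x[1].max())                  # (2, 96, 96) 255.0 -- motion lights up channel 1
```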
 
As I elaborated, I think it's very unlikely the triple cameras are usable for stereoscopy.
However, Tesla supposedly uses temporal smoothing (can't find a source for the claim atm), which allows for roughly 4" spatial resolution from a single camera stream by checking differences over several frames rather than across frames from different cameras. That could also be used to compare the results of temporal smoothing between the different cameras, which might give a somewhat accurate representation of the 3D environment in view.

EDIT: the correct term seems to be visual SLAM [simultaneous localization and mapping].
A few months ago there was a leak of a Tesla with a developer camera screen on the MCU, which displayed tracking points eerily similar to what can be found if you look up older videos on SLAM on YouTube.
Tesla engineering car leaked picture shows what Autopilot sees live with settings for ‘Full Self-Driving’

The method seems quite fast and reliable, given a solid NN as foundation.
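For anyone curious what that monocular approach looks like in practice, here's a rough sketch of the standard textbook pipeline: track feature points across frames with optical flow, then recover the camera's relative motion (with something like wheel speed providing the scale). The camera intrinsics below are assumed values, and this is generic OpenCV -- not Tesla's implementation:

```python
# Rough sketch of a generic monocular structure-from-motion / visual-odometry
# step: track corners across frames and recover relative camera motion.
# Intrinsics are assumed values; this is a textbook pipeline, not Tesla's code.
import cv2
import numpy as np

K = np.array([[700.0, 0, 640], [0, 700.0, 360], [0, 0, 1]])  # assumed camera intrinsics

def relative_motion(prev_gray, gray):
    # 1. Pick trackable corners in the previous frame.
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=7)
    # 2. Track them into the current frame with sparse optical flow.
    pts_cur, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts_prev, None)
    good_prev = pts_prev[status.ravel() == 1]
    good_cur = pts_cur[status.ravel() == 1]
    # 3. Recover rotation and (unit-scale) translation direction between frames.
    E, mask = cv2.findEssentialMat(good_cur, good_prev, K,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, good_cur, good_prev, K, mask=mask)
    return R, t   # vehicle speed would set the absolute scale
```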
 
As I elaborated, I think it's very unlikely the triple cameras are usable for stereoscopy.
However, Tesla supposedly uses temporal smoothing (can't find a source for the claim atm), which allows for roughly 4" spatial resolution from a single camera stream by checking differences over several frames rather than across frames from different cameras. That could also be used to compare the results of temporal smoothing between the different cameras, which might give a somewhat accurate representation of the 3D environment in view.

They are using 2 cameras for stereoscopy - @jimmy_d and @verygreen reported this long ago, back in the 18.10 days or earlier.