Welcome to Tesla Motors Club

FSD Beta 10.69

It's also possible the low resolution of the cameras (only 1280x960) means that recognition and classification of imagery cannot occur at far enough distances, as the imagery is too coarse compared to good human foveal vision, meaning the reaction comes too late. Better cameras aren't a problem to install now, but they would greatly increase the data rate into the neural networks, and that requires significantly faster compute, which is a big issue.
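To put rough numbers on that, here's a quick pinhole-camera estimate of how few pixels a car occupies at distance. The 50° horizontal field of view and 1280-pixel width are illustrative assumptions, not confirmed specs for Tesla's cameras:

```python
import math

def pixels_subtended(object_width_m, distance_m, hfov_deg=50.0, h_res=1280):
    """Approximate horizontal pixels an object spans for a pinhole camera.

    hfov_deg and h_res are illustrative values (a 50-degree camera with a
    1280-pixel-wide sensor), not confirmed Tesla specs.
    """
    # Angle subtended by the object, then converted to pixels at this FOV.
    angle_deg = math.degrees(2 * math.atan(object_width_m / (2 * distance_m)))
    return angle_deg * (h_res / hfov_deg)

# A 1.8 m-wide car at 100 m spans only a couple dozen pixels:
print(round(pixels_subtended(1.8, 100)))   # 26
# And at a quarter mile (~400 m), only a handful:
print(round(pixels_subtended(1.8, 400)))   # 7
```

With those assumptions, everything past about 100 m is being classified from blobs a few dozen pixels wide at best.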

Faster compute without the most efficient chips means high expense and high power consumption. I wouldn't be surprised if the Waymo cars have $10-20K of compute and 4000 watts of power consumption, which would obviously kill efficiency. It would be like resistive heaters running at full blast. And in the summer that would have to be A/C'ed out too.
I think this is indeed the case. In my area there are 4 lane roads where the average speed is 60mph. For an unprotected right on red at a traffic light the cameras have to look across a wide intersection and then work out the velocity of cars that are approaching the traffic light - about 100m. At least with 10.12 it seems to not see the traffic at all - when it is 100% clear it pauses a good 15 seconds and does a timid commit to the turn. When there is approaching traffic it does the same thing - except there is traffic and I have to disengage or slam on the accelerator / change lanes. Curious to see if 10.69 is any better on this type of turn.

Phantom braking is probably similar: at 1280x960 you're calculating the risk of hitting fuzzy pixelated blobs at 50 meters. AIs can do amazing things, but they will be limited by the cameras. If the cameras were sharper, there would be much better distinction at distance between shadows / lighting and a concrete object in the road. And yeah, the compute on higher-res cameras would be way too much.
 
Um. There’s this thing in traditional DSP audio processing where one is trying to recognize diphthongs and words and such.

As some of you who are older may remember, getting those early algorithms to work reliably involved training with the user with many sounds, and if users were changed, the user who had done the original training had a cold, or whatever, the accuracy fell off a cliff. And it wasn’t that great in the first place. Think early versions of Dragon Dictate.

One day, this all changed dramatically: the algorithms began to use noisy Markov chain math (essentially, hidden Markov models).

A Markov chain algorithm is built on the idea that one knows the past history of a number of events, a probability matrix of what the next event might be, and the current raw data input. Into such an algorithm one could place a series of already-detected diphthongs, a probability matrix based upon the language being used, and the current raw detected audio. It worked, but was not much better than Dragon Dictate.
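The core of that idea fits in a few lines: a transition matrix over events, and a lookup for the most probable successor. The probabilities below are invented purely for illustration, not drawn from any real speech model:

```python
# Toy Markov-chain step: given the current event and a transition matrix,
# pick the most probable next event. Values are made up for illustration.
transitions = {
    "th": {"e": 0.6, "a": 0.3, "o": 0.1},
    "e":  {" ": 0.5, "n": 0.3, "r": 0.2},
}

def most_likely_next(current, matrix):
    """Return the highest-probability successor of `current`."""
    return max(matrix[current], key=matrix[current].get)

print(most_likely_next("th", transitions))  # prints "e"
```

A real recognizer would keep the whole probability distribution and combine it with the raw audio evidence rather than committing to one successor, but the shape of the computation is the same.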

The breakthrough was to realize that one was looking at an ideal bunch of diphthongs masked with colored noise, the noise being generated by variations in people's vocal tracts, the ambient environment, and so on. There's some serious math involving detecting signals in the presence of noise (think: Information Theory, a la Shannon), which could then be applied. Suddenly, there was a quantum jump in the effectiveness of voice recognition technology: speaker-independent voice recognition could be deployed, and it Just Worked. That gave rise to Iron Ladies everywhere and, eventually, to those minuscule processors on your cell phone correctly responding to "Hey Google!" or whatever.

Thing is, a low resolution camera is looking at a full resolution image. The image on the CCD is the real image with (wait for it..) quantization noise. Which is there because the real world is continuous, but the image sensor is discrete.
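That spatial discretization is easy to demonstrate: sample a continuous edge with a coarse grid of "pixels" and measure how far the pixelated version drifts from the underlying signal. This is a pure toy, not a real imaging model:

```python
import math

def sample_edge(n_pixels):
    """Sample a continuous intensity ramp (a soft edge) at n pixel centres:
    the spatial discretization a sensor applies to a continuous scene."""
    return [math.tanh(8 * ((i + 0.5) / n_pixels - 0.5)) for i in range(n_pixels)]

def reconstruction_error(n_pixels, probes=1000):
    """Mean absolute error when the coarse samples stand in for the
    continuous signal (nearest-pixel reconstruction)."""
    coarse = sample_edge(n_pixels)
    err = 0.0
    for j in range(probes):
        x = (j + 0.5) / probes
        true = math.tanh(8 * (x - 0.5))           # the "real" continuous edge
        err += abs(true - coarse[min(int(x * n_pixels), n_pixels - 1)])
    return err / probes

# Fewer pixels means more discretization error, roughly in proportion:
print(reconstruction_error(32) > reconstruction_error(64))  # True
```

The point being: the "noise" here has a known structure, which is exactly what makes it tractable.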

But we know how to handle noise. In fixed algorithms like speech, it's that noisy Markov chain stuff. I have to believe that neural network image processing handles it at least as well, and probably better.

Admittedly, a one pixel image receiver blinking on and off isn’t going to do much for one. But I strongly suspect that increasing the image resolution is going to fairly quickly reach the point of negative returns, where it takes longer to process the increased number of pixels than any added benefit it might create in determining whether there’s a car up there a quarter mile off.
 
In the AI workshop video from a few weeks ago, I thought he said something about putting more compute resources on the parts of the scene that needed it the most. I’m hoping that concept would also extend to using higher resolution cameras and using an early stage process to lower the resolution on the parts of the incoming image that don’t need it and leaving the parts of the image with distant objects (including the horizon) intact.

I also vaguely recall him mentioning there was a downscaling and subsequent upscaling step due to some in between processing only working on lower resolution imagery. I wonder if it’d be possible to just not downscale the important parts of the image and potentially get even more out of the current cameras.
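If that recollection is right, the selective-resolution idea might look something like this sketch: keep a band of rows (say, around the horizon) at native resolution and 2x2-average everything else. The function name and the fixed row band are my invention; a real pipeline would presumably choose regions dynamically:

```python
def foveated_downscale(image, keep_rows):
    """Toy version of selective resolution: keep the first `keep_rows` rows
    (e.g. a horizon band) at full resolution, 2x2-average the rest.
    `image` is a list of equal-length rows of pixel intensities."""
    sharp = [row[:] for row in image[:keep_rows]]   # untouched full-res band
    coarse = []
    rest = image[keep_rows:]
    for r in range(0, len(rest) - 1, 2):            # average each 2x2 block
        coarse.append([
            (rest[r][c] + rest[r][c + 1] + rest[r + 1][c] + rest[r + 1][c + 1]) / 4
            for c in range(0, len(rest[r]) - 1, 2)
        ])
    return sharp, coarse

# 6x4 synthetic frame; keep the top 2 rows sharp, downscale the bottom 4.
img = [[float(r * 10 + c) for c in range(4)] for r in range(6)]
sharp, coarse = foveated_downscale(img, keep_rows=2)
print(len(sharp[0]), len(coarse[0]))  # prints "4 2"
```

The payoff is that the distant-object band keeps every pixel while the near field, where objects are already many pixels wide, sheds three quarters of them.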

I’ll need to see the video again…
 
Admittedly, a one pixel image receiver blinking on and off isn’t going to do much for one. But I strongly suspect that increasing the image resolution is going to fairly quickly reach the point of negative returns, where it takes longer to process the increased number of pixels than any added benefit it might create in determining whether there’s a car up there a quarter mile off.
I know something about DSP (in audio), but not nearly to the depth that you seem to have. But I still think higher resolution cameras could help a lot in two ways. The most obvious is that pixels can be combined to form a higher quality image that is at the same resolution as those provided by the current cameras. The same processing with better quality images should yield real benefits.

But when we drive, we are not constantly looking at what is happening a quarter mile away. We look occasionally and retain what we have observed until we get the chance to look again. Tesla could do something similar. Think of it as two threads of processing, one a very high frame rate of low-res images (formed by pixel binning), the other a much lower frame rate of native high-res images.
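The binned low-res stream could be as simple as 2x2 pixel averaging. Here's a toy sketch of that step, with plain lists standing in for a real frame buffer:

```python
def bin2x2(image):
    """Average each 2x2 block of a high-res frame into one pixel: the
    binning step the high-frame-rate thread above would run.
    `image` is a list of equal-length rows of intensities."""
    return [
        [(image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
         for c in range(0, len(image[0]) - 1, 2)]
        for r in range(0, len(image) - 1, 2)
    ]

frame = [[1, 3, 5, 7],
         [1, 3, 5, 7],
         [2, 2, 8, 8],
         [2, 2, 8, 8]]
print(bin2x2(frame))  # prints [[2.0, 6.0], [2.0, 8.0]]
```

Binning also averages out per-pixel sensor noise, so the fast low-res stream is cleaner than a native low-res sensor would be, which is the "higher quality image at the same resolution" benefit mentioned above.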
 
I wouldn't anticipate huge changes to pulling off maneuvers etc. in something like an x.x.x release; it seems like they may have jacked up the sensitivity to objects/VRUs.

The first 10.12 iteration came out in May? Feels like Tesla needs to speed things up to hit a wide release by the end of the year, I don't know if one more big update this year will do it
 
It's also possible the low resolution of the cameras (only 1280x960) means that recognition and classification of imagery cannot occur at far enough distances, as the imagery is too coarse compared to good human foveal vision, meaning the reaction comes too late.
One of Andrej Karpathy's many talks at various tech conferences touched on this very issue. His specific example was recognizing a car coming towards you from an image that was maybe 8 pixels across. He did say that this was a case where AI was better at recognizing items from very low resolution images than humans were.

They may have an advantage here since the front-facing cameras give them three versions of overlapping image data…
 
Actually it didn't run the Red Light since it is an offset intersection
I also think FSD Beta 10.69.1.1 correctly interpreted that large intersection split with 3 crosswalks and knew it passed the initial stop line and still needed to unblock and exit the intersection. Even if the light was still green, it might have still stopped for the pedestrian in the crosswalk -- just the usual pedestrian behavior.

However, here's 10.69.2 clearly running a red light when exiting a highway and slowing down only to 11mph:

I believe it misinterpreted this as a slip lane even though the traffic lights were directly above the lane, maybe a regression from:
  • Increased smoothness for protected right turns by improving the association of traffic lights with slip lanes vs yield signs with slip lanes. This reduces false slowdowns when there are no relevant objects present and also improves yielding position when they are present.
Although, even interpreting it as a yield, it knows to check for cross traffic, so it was safe; but technically it broke the law by not stopping on red.
 
But I strongly suspect that increasing the image resolution is going to fairly quickly reach the point of negative returns, where it takes longer to process the increased number of pixels than any added benefit it might create in determining whether there’s a car up there a quarter mile off.
Yes - and remember, the problem isn't to reliably detect a car in a single image. The problem is to reliably detect a car in a continuous series of images (also known as a video stream).

Remember playing Doom on 640x480 or 1024x768 displays? A semi-distant demon might only occupy a couple of pixels, but the combination of motion from the first person shooter perspective, the changing geometry, and the trajectory of the demon made it possible to quickly and reliably detect, categorize, prioritize, and react to threats.

That's exactly what Tesla is claiming to have implemented in this new release:

"Upgraded Occupancy Network to use video instead of images from single time step. This temporal context allows the network to be robust to temporary occlusions and enables prediction of occupancy flow."
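Tesla hasn't published how that temporal fusion works, but the benefit of video context can be illustrated with a much cruder stand-in: a per-cell running occupancy estimate that a single occluded frame can't wipe out. The exponential moving average below is my simplification, not the network's actual mechanism:

```python
def update_occupancy(prior, observation, alpha=0.3):
    """Blend one frame's occupancy evidence into a running per-cell
    estimate. A crude stand-in for learned temporal fusion: the real
    network learns this, it isn't a fixed EMA. `None` marks cells
    occluded in this frame, so the prior simply carries them over.
    """
    return [
        prior_cell if obs is None else (1 - alpha) * prior_cell + alpha * obs
        for prior_cell, obs in zip(prior, observation)
    ]

grid = [0.9, 0.1, 0.5]                      # running estimates for 3 cells
grid = update_occupancy(grid, [None, 0.0, 1.0])  # cell 0 occluded this frame
print([round(c, 2) for c in grid])          # prints [0.9, 0.07, 0.65]
```

Note how the occluded cell keeps its prior high occupancy instead of dropping to zero, which is exactly the robustness to temporary occlusion the release notes claim.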