The next big milestone for FSD is v11. It is a significant upgrade, with fundamental changes to several parts of the FSD stack, including a totally new way to train the perception NN.

From AI Day and the Lex Fridman interview, we have a good sense of what might be included.

- Object permanence both temporal and spatial
- Moving from “bag of points” to objects in NN
- Creating a 3D vector representation of the environment all in NN
- Planner optimization using NN / Monte Carlo Tree Search (MCTS)
- Change from processed images to “photon count” / raw image
- Change from single image perception to surround video
- Merging of city, highway and parking lot stacks a.k.a. Single Stack

Lex Fridman's interview of Elon, starting with the FSD-related topics.


Here is a detailed explanation of Beta 11 in "layman's language" by James Douma, from an interview done after the Lex podcast.


Here is the AI Day explanation, in 4 parts.




Here is a useful blog post asking Tesla a few questions about AI Day. The useful part is the comparison of Tesla's methods with Waymo's and others' (detailed papers linked).

 
We don't know that. Don't you think v12 came up when he met with senior FSD staff? Of course that doesn't mean Tesla would share any v12 information with him but to just state he doesn't know anything may not be accurate either.
I'm sure what information is shared is tightly controlled.

PS: Legally, I believe they actually can't share any information that is not publicly known. Otherwise it would be securities fraud!
 
View attachment 988032

This was taken at a distance of about 30 feet with my iPhone 13 mini 12MP camera. It’s cropped from a 2.9MB HEIF image. Now it’s 73x34 and it’s an image of text about 8 inches wide.

I took a few to make sure focus, etc. were not an issue.

It's much clearer to my human eyes. I can read the time and date, and would be able to read the month and day number if it changed (and besides, I didn't know the date for certain until reading this).

Why is this? This is actually a question. I don’t know whether I am making a valid comparison or there is something wrong with this simple test case.

To me, corrected human vision seems amazingly sharp compared to a 12MP image. Need that 48MP camera I guess!
You had me at HEIF. That is compression, perhaps 20 to 1.

How do you get 12,000,000 pixels x three colors x 2 bytes each, which equals 72 MB, into a 2.9 MB file? Compression. Does image quality suffer? Yep.
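Here's that arithmetic as a quick Python sketch, keeping the post's 2-bytes-per-channel assumption (a typical 8-bit-per-channel pipeline would halve the raw figure):

```python
# Back-of-envelope: uncompressed size of a 12 MP image vs. the 2.9 MB HEIF file.
# Assumes 3 color channels and 2 bytes per channel, as in the post above
# (an 8-bit-per-channel pipeline would halve the raw figure).
pixels = 12_000_000
channels = 3
bytes_per_channel = 2

raw_mb = pixels * channels * bytes_per_channel / 1_000_000
heif_mb = 2.9

print(f"Uncompressed: {raw_mb:.0f} MB")                 # 72 MB
print(f"Compression ratio: {raw_mb / heif_mb:.0f}:1")   # ~25:1
```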

Try again with a "raw" file. I don't think iPhone can do that, sorry, you'll need an actual camera.

Really, this discussion belongs in some other forum. I brought it up in the context of concerns about FSD not seeing what was going on. I think it sees it just fine, but understanding what it sees, and then deciding what to do about it, is a very different and difficult issue, still under construction.
 
Compression. Does image quality suffer? Yep.
Yeah I definitely would like to have seen the raw (need the Pro 48MP for that I think).

However, I doubt it makes a huge difference. Compression works OK, though it certainly has some impact; the whole point of good compression is to NOT lose information like this. It's not really fair to make the comparison you made, because even lossless compression would get you a fair bit closer to the JPEG/HEIF size (I would guess about 3x that size, but a typical number can be Googled; it looks like lossy can be 10x-15x, though comparisons are a bit tricky and take some detailed reading, and part of the savings is in color bits, too, which is not very relevant here). I guess someone should try it with raw image data.

I do have an actual old Canon camera capable of RAW. I could try it if it still works. 4MP I think.

But yeah, perhaps a bit off topic.

I very much doubt that a 1MP image is anywhere close to the performance of my eyes, though!

It’s just incredibly coarse, only on the order of 1000 pixels each way.

(In fact, I remember looking at RAW output from that Canon camera many years ago. For landscapes it just didn’t compare to human vision. Small figures coming over ridge lines, etc., in landscape shots are just completely lost, while my human sight could easily determine what they were. )

I feel like it is relatively obvious that the current cameras do not come close to human vision. You just have to look at images as we did above (and get through the compression argument, of course). I don’t think it is even remotely close. This certainly would impact their ability to distinguish what is happening at a distance.

Someone should just do the math and figure it out, in terms of what size object can be resolved at 200 yards or so; not "resolution" or other detached metrics like that.
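Here's a rough version of that math as a Python sketch. The sensor width and field of view are illustrative assumptions only, not confirmed Tesla specs:

```python
import math

# Rough estimate of what one pixel "covers" at a given distance.
# Camera numbers are illustrative assumptions, NOT confirmed Tesla specs.
h_pixels = 1280        # assumed horizontal resolution of a ~1.2 MP sensor
h_fov_deg = 50.0       # assumed horizontal field of view of the main camera
distance_m = 183.0     # ~200 yards

# Angle subtended by a single pixel, then the width it covers at that distance
# (small-angle approximation).
rad_per_pixel = math.radians(h_fov_deg) / h_pixels
m_per_pixel = distance_m * rad_per_pixel
print(f"One pixel covers ~{m_per_pixel:.2f} m at {distance_m:.0f} m")   # ~0.12 m

# So a 1.8 m wide car at 200 yards spans roughly:
car_width_m = 1.8
print(f"~{car_width_m / m_per_pixel:.0f} pixels across")                # ~14 px
```

On those assumed numbers a car at 200 yards is only about 14 pixels wide; whether that is "enough" then depends on the NN and on how much it can accumulate across frames, which nobody outside Tesla can really quantify.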
 
Isn't compression used to reduce image size? Does Tesla use compressed images or raw photon count to detect objects?
And compared to earlier versions, v11.4 FSD can see objects in front of the ego a lot farther (25-40%?). I wonder how Tesla achieves that?
 
Does Tesla use compressed images or raw photon count
They use direct from photons allegedly, which makes sense.

I doubt compression is the reason for the lack of clarity above. It probably makes it a little more fuzzy but removing it won’t bring it to human-level vision. And that’s a 12MP sensor.
 
They use direct from photons allegedly, which makes sense.

I doubt compression is the reason for the lack of clarity above. It probably makes it a little more fuzzy but removing it won’t bring it to human-level vision. And that’s a 12MP sensor.
Speaking as an only-sometimes DSP and algorithm guy: There's lossless compression and lossy compression. RAW is lossless; the files are a lot bigger. JPEG and HEIF are lossy.

It is precisely when one is blowing up a cropped image that the vagaries of the lossy compression will show up.
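If anyone wants to see that for themselves, here's a minimal Pillow sketch (the filename is a placeholder for whatever uncompressed source image you have on hand) that writes the same frame losslessly and lossily so the sizes, and the crops, can be compared:

```python
import os
from PIL import Image

# Save the same frame losslessly (PNG) and lossily (JPEG) and compare sizes.
# "source.tif" is just a placeholder for whatever uncompressed source you have.
img = Image.open("source.tif").convert("RGB")

img.save("lossless.png", optimize=True)    # lossless: every pixel preserved
img.save("lossy_q85.jpg", quality=85)      # lossy: typical "looks fine" setting
img.save("lossy_q40.jpg", quality=40)      # lossy: heavy compression

for name in ("lossless.png", "lossy_q85.jpg", "lossy_q40.jpg"):
    print(f"{name}: {os.path.getsize(name) / 1_000_000:.2f} MB")
```

Crop the same small patch out of the PNG and the quality-40 JPEG and blow both up; the block artifacts only show in the lossy copies, which is exactly the "blowing up a cropped image" case.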

Most of the garden variety digital cameras I've owned over the years don't even have RAW as an option. And, mostly, I haven't cared: If it looks good to an eyeball, that was fine.

I did a little checking, out of curiosity. iPhone 12 Pro and up have RAW as an option; so do more recent Samsungs. The iPhone 12 non-Pro (which is what I have) needs additional software ($20 and under) in order to take RAW pictures. In any case, one needs a RAW-aware software program (MS Paint won't do it) in order to process RAW images. The article I read on the subject kind of claimed that this was expensive, rent-not-buy software. I just checked: GIMP coupled with a couple of different plugins (RawTherapee, darktable) can do the job without forking over valuta.
 
They use direct from photons allegedly, which makes sense.

I doubt compression is the reason for the lack of clarity above. It probably makes it a little more fuzzy but removing it won’t bring it to human-level vision. And that’s a 12MP sensor.
Be very careful when you talk about "human vision". The image coming from your eye is actually pretty bad, and is very low resolution outside of the fovea (though it is optimized for things other than acuity at the edges). Your brain does a VAST amount of processing to create what we think of as "the real world", much of which is extrapolated from the rather poor data coming from the retina. For example, your eye is never still, and in fact if you TRY to keep your eye truly still you will find you cannot actually see very well at all. This is because your brain continually oversamples information from the retina to fill in extra detail. The car's NN can do this as well (not saying they DO this in FSD, but it can be done) .. it can get far more data from a video stream than from a still image, because something that is ambiguous in one frame can be validated across multiple frames.

As an example, here is a famous optical illusion. Note that the "real image" you see is VERY FAR from what actually arrives at your retina. The squares "A" and "B" are the EXACT same gray level in the image, yet your brain creates a visual representation that is very far removed from the objective image.

 
RAW is lossless; the files are a lot bigger. JPEG and HEIF are lossy.

It is precisely when one is blowing up a cropped image that the vagaries of the lossy compression will show up
Not really sure I believe this. When someone posts a RAW image section showing the massive difference, I'll believe it.

Compression works well. Of course some detail is lost. But contrasts will tend to be preserved. See the image above. It clearly just doesn’t have the resolution to render the detail!

Really doubt it has anything to do with compression. Remember, this image only had 4032 pixels across. That’s not that many! It only had 73 horizontal pixels (actually 64) to represent “Saturday November 4.” (19 letters). That is only 3.5 pixels per letter! This has no dependence on compression.

Not going to look very good at all. It does seem like it should look better than it does, though, so that may be the effect of compression - not sure.

Really crisp and clear to my human vision though. I have to work at it, but when I do, I can see it clearly.

In any case this is a 12MP image. Not a 1MP one. Roughly 3x denser in pixels. So with 1MP those 19 letters would have to be represented with about 20 horizontal pixels. Seems like that’s not going to look great even direct to photons with none of the compression losses.
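For what it's worth, here's that per-character arithmetic as a small Python sketch; the counts come from the posts above, and the 1280-pixel width for a ~1.2 MP sensor is my assumption:

```python
# Pixels available per character, 12 MP photo vs. a ~1.2 MP sensor.
# Counts taken from the posts above; the 1280-pixel width is an assumption.
chars = 19                 # "Saturday November 4"
text_px_12mp = 64          # horizontal pixels the text occupied in the 12 MP crop
width_12mp = 4032          # horizontal resolution of the 12 MP image
width_1mp = 1280           # assumed horizontal resolution of a ~1.2 MP sensor

print(f"12 MP:  {text_px_12mp / chars:.1f} px per character")       # ~3.4
text_px_1mp = text_px_12mp * width_1mp / width_12mp
print(f"1.2 MP: {text_px_1mp / chars:.1f} px per character "        # ~1.1
      f"({text_px_1mp:.0f} px for the whole phrase)")
```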
 
Your brain does a VAST amount of processing to create what we think of as "the real world", much of which is extrapolated from the rather poor data coming from the retina.
Sure. This is my point, though.

The question is how good it is. And how good the car can be.

I don’t really care about the retinal image. I care what I can see.

Maybe the car can do it with sufficient processing. Certainly many successive images should help! But the question remains: what can it do?

I am completely unconvinced it has visual range even approaching that of a competent human. And I think that is the reason for v11 behavior. There is not a way to prove it though.

Circumstantially from Chuck’s ULT it appears the range is at least 130 meters and probably closer to 150 meters from the sides. But I doubt it is much better than that and probably not much better in front with more cameras, either. Would guess 200-250 meters is about the limit roughly for reliable perception. 250 meters is being generous I would guess. All guesses.
 
Not really sure I believe this. When someone posts a RAW image section showing the massive difference, I'll believe it.

Compression ~~works~~ can work well. Of course some detail is lost. But contrasts will tend to be preserved. See the image above. It clearly just doesn't have the resolution to render the detail!
FTFY!

You’re exposing your ignorance. Do some more reading on compression, optics and human vision.

‘Compression’ is not uniform or consistent. There are many different types and algorithms, each with different strengths and weaknesses.

Fundamentally, the goal of compression is to reduce file size. There are inherent compromises that must be made in this process, the biggest being file size versus quality. You can achieve a high-quality compressed image, but it will be relatively large. Conversely, you can have a very compact image, but quality will suffer. Without knowing the balance that the iPhone strikes with its HEIF format, it's impossible to know how much quality was lost. Of course this is also completely ignoring the quality of the optics.
 
You’re exposing your ignorance
Remember I was the one who asked “why” in the first place! I am not sure why saying modern compression works well is ignorant either.

And remember I have repeatedly said that compression probably causes some of the issue with this image.
HEIF format
It’s a modern compression format, superior to JPEG, and has some advantages for burst photos and video compression.

Of course this is also completely ignoring the quality of the optics.

Anyway the point is this:

In any case this is a 12MP image. Not a 1MP one. Roughly 3x denser in pixels. So with 1MP those 19 letters would have to be represented with about 20 horizontal pixels.

I really should have stuck with that, rather than posting an image with compression artifacts. (Or at least I should have blown off the blurred nature of the image, since it is irrelevant. Also optics are irrelevant.)

“Shockingly” the human eye can easily resolve these letters, and I would estimate based on my own visual image that it has (at least!) something like an effective 50-100x higher (than 1MP) resolution (7x7 to 10x10 more) for objects that are focused on (obviously it is a bit lower for a quick glance - I’d estimate 10x lower - but of course “quick glance” is not relevant for this application, and that would still be way better than 1MP) in the middle of the visual field. Just a rough estimate, and I am probably shortchanging my eyes a bit.

So, is anyone still trying to make the claim that the car cameras will not limit the car’s visual range below that of a human?

I have no idea whether that is the current limit of course. All I know is it doesn’t seem to react until a couple hundred meters in advance of an issue (sometimes less, perhaps). Any ideas why?
 
Not sure what's going on over the last few pages, but Tesla has told us exactly how far the cameras see. Forward camera is about 820 feet (HW3). We can do elementary school level math to calculate stopping times, etc. More MP, further distance seen, higher CPU requirements. Cover the car with 50MP cameras and put in 5000+ TOPS processors, and perhaps we can end this debate.
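A rough sketch of that "elementary school level math", with generic (assumed, not Tesla-published) deceleration and reaction-time numbers:

```python
# How much of the ~820 ft (~250 m) forward-camera range does a highway stop need?
# Deceleration and reaction time are generic assumptions, not Tesla figures.
camera_range_m = 250.0          # ~820 ft
speed_ms = 75 * 0.44704         # 75 mph in m/s
reaction_s = 1.0                # assumed perception + decision latency
decel = 6.0                     # m/s^2, firm braking on dry pavement

reaction_dist = speed_ms * reaction_s
braking_dist = speed_ms**2 / (2 * decel)
total = reaction_dist + braking_dist
print(f"Stopping distance: ~{total:.0f} m of the {camera_range_m:.0f} m available")
```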
 
Remember I was the one who asked “why” in the first place! I am not sure why saying modern compression works well is ignorant either.

And remember I have repeatedly said that compression probably causes some of the issue with this image.

It’s a modern compression format, superior to JPEG, and has some advantages for burst photos and video compression.



Anyway the point is this:



I really should have stuck with that, rather than posting an image with compression artifacts. (Or at least I should have blown off the blurred nature of the image, since it is irrelevant.)

“Shockingly” the human eye can easily resolve these letters, and I would estimate based on my own visual image that it has (at least!) something like an effective 50-100x higher (than 1MP) resolution (7x7 to 10x10 more) for objects that are focused on (obviously it is a bit lower for a quick glance - I’d estimate 10x lower - but of course “quick glance” is not relevant for this application, and that would still be way better than 1MP) in the middle of the visual field. Just a rough estimate, and I am probably shortchanging my eyes a bit.

So, is anyone still trying to make the claim that the car cameras will not limit the car’s visual range below that of a human?

I have no idea whether that is the current limit of course. All I know is it doesn’t seem to react until a couple hundred meters in advance of an issue (sometimes less, perhaps). Any ideas why?
You may have asked 'why,' but you're completely missing all of the nuances that people are trying to explain to you. The rest of your statements continue to ignore these details and make assumptions that can't be substantiated.

The bottom line is no one here can actually say without more information whether the imaging hardware is the limiting factor. It may be and it may not be, but taking a single compressed image from an iPhone and blowing it up proves nothing other than showing the limits of the iPhone.

Edit: when I said you were exposing your ignorance in my previous post, I did not mean it to be derogatory, simply that you have incomplete knowledge of the subject. I hope you didn't take offense at the wording.
 
Not sure what's going on over the last few pages, but Tesla has told us exactly how far the cameras see. Forward camera is about 820 feet (HW3). We can do elementary school level math to calculate stopping times, etc. More MP, further distance seen, higher CPU requirements. Cover the car with 50MP cameras and put in 5000+ TOPS processors, and perhaps we can end this debate.
Do we know how far the new HW4 cameras can see?
Most of this discussion, though, doesn't really matter to me until Tesla fixes the B-pillar problem. I had to perform 2 disengagements today caused by obstructions to the B-pillar cameras. At least you know when the view is obstructed, so I know when to lean forward, knowing a disengagement is likely.
I just spent a week in Houston using FSD and didn't encounter one obstructed-view intersection. Same in Florida several months ago. But in Massachusetts, obstructed-view intersections are extremely common. The only word I associate with Tesla and the B-pillar problem is stupidity, because the problem is so obvious.
The car is 3 feet into the crossing road at the point where I took this picture. When I drive this manually I'm less than a foot into the intersection, since I lean forward. To the right is a blind hill with a 13-degree grade, so you cannot afford to be sticking out, since cars coming from the right don't see this intersection until they are on top of it as they come over the hill. And of course cars from the left are way too close when FSD decides it's safe to go.
 

[Attachment: Left Pillar Camera.jpg]
I really should have stuck with that, rather than posting an image with compression artifacts. (Or at least I should have blown off the blurred nature of the image, since it is irrelevant. Also optics are irrelevant.)

“Shockingly” the human eye can easily resolve these letters, and I would estimate based on my own visual image that it has (at least!) something like an effective 50-100x higher (than 1MP) resolution (7x7 to 10x10 more) for objects that are focused on (obviously it is a bit lower for a quick glance - I’d estimate 10x lower - but of course “quick glance” is not relevant for this application, and that would still be way better than 1MP) in the middle of the visual field. Just a rough estimate, and I am probably shortchanging my eyes a bit.

So, is anyone still trying to make the claim that the car cameras will not limit the car’s visual range below that of a human?

I have no idea whether that is the current limit of course. All I know is it doesn’t seem to react until a couple hundred meters in advance of an issue (sometimes less, perhaps). Any ideas why?
I'm not really sure what can be determined here, as there are so many variables. How well does a human need to be able to see to drive safely? Certainly many people do not have 20/20 vision, even with correction. Are they safe to drive? What about at night, when the cameras clearly have an advantage (ever looked in your rear view mirror vs the backup camera at night?).

Before FSD was in wide beta there were many posts in these forums calculating that the car could never see far enough and "proving" that it would not see a car beyond 50 meters. Yet Chuck's famous UPL tests show that the car can indeed do that (mostly I suspect thanks to the NN training on moving images).

I've been out driving at night in pouring rain when I can hardly make out the lane markings 20 feet in front of me, what with all the reflections and glare from oncoming car headlights. And yet the car continues to pick out these lines with amazing accuracy, to the point where even when manually driving I double-check the car's lane markings to augment what I can see myself (and I DO have 20/20 vision according to my optician).

As for image compression, most lossy algorithms are tuned to generate images that seem "close" to the original when viewed by humans, where "close" means "pleasing" and "lacking in jarring artifacts". These criteria are not really in line with the requirements of an NN, where large distortions in (say) color accuracy are far less important than (say) edge retention in signage lettering. There is also the additional time delay when compressing the image, which is why Tesla are moving (moved?) to photon counting (which is just a fancy name for taking the raw sensor data).

In fact, the ideal situation for a car would be for cameras that rapidly generate images with just enough color/luminance data across a wide range of absolute scene brightness levels and no more than that, since extra detail beyond that simply slows down processing and/or increases the cost of the hardware to do that. If the NN takes (say) 50ms to work on a high-resolution image but could take 10ms to do the exact same thing on a lower resolution image, then the lower resolution image is "safer" in the sense that FSD could react faster to a given situation.
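To put rough numbers on that trade-off, using the hypothetical 50 ms and 10 ms figures from the paragraph above:

```python
# Extra distance travelled during the added inference latency of the
# higher-resolution image, using the hypothetical 50 ms vs. 10 ms figures above.
speed_ms = 65 * 0.44704                 # 65 mph in m/s
extra_latency_s = 0.050 - 0.010         # 40 ms difference

extra_m = speed_ms * extra_latency_s
print(f"~{extra_m:.1f} m (~{extra_m * 3.28:.0f} ft) of extra travel per decision")
```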

I've no idea if the current hardware, or even HW4, can indeed achieve/exceed human-level skill in driving, but I don't think there is any actual evidence to show that it cannot. It's all speculation, since no one has ever attempted this before.
 
Completely missing the point of raw files. RAW is unaltered, with no in-camera processing, so you can apply your own. That means you can go anywhere with your processing path.
Post-processing and compression imply knowledge of the final purpose, which may or may not be suitable if you decide to change the purpose later, like using the image for further editing.
One format isn't "better", just suited for different purposes.
 
Not until Tesla officially releases statements. Currently the only benefit is improved image quality, so it can read signs with more clarity, and identify objects at distance with more accuracy. Visual distance may still be exactly the same as HW3 (250 meters).
You're contradicting yourself a bit - identifying objects at a distance is exactly what seeing at a distance is. The distance limit of the cameras is not determined by the maximum distance a photon can travel to the camera, rather by the maximum distance at which the camera can resolve objects well enough to be useful.
 
Tesla has told us exactly how far the cameras see. Forward camera is about 820 feet (HW3).
What does this mean?

completely missing all of the nuances that people are trying to explain to you
Which ones?
make assumptions that can't be substantiated.
Which ones?
taking a single compressed image from an iPhone and blowing it up proves nothing other than showing the limits of the iPhone.

Completely missing the point of raw files.

As for image compression,

Let's stop the image compression discussion, as I mentioned, it's completely irrelevant!!!

A 1MP sensor has 20 pixels to represent the 19 letters in the image. Those letters were quite readily readable to my eyes, at the distance of 30-33 feet mentioned (the letters spanned about 8 inches).

The point of the image was not the fuzziness (though that is how I erroneously presented it). It's how many pixels are available to represent the required detail. I think we can all agree that pixels can be separated from various compression artifacts. If there aren't enough pixels, you're done.

I don't think there is any debate about the equivalent MP of the human eye. It's super high. For quick glances it sounds like it's on the order of 10-20MP, and for dwelling on a scene, it's more like ~500MP. There isn't anyone here who disputes these numbers (right order of magnitude), is there?

(And as mentioned above, it seems that you can determine this yourself! Take a picture, and measure the pixel dimensions of a piece of that image, then roughly estimate how many pixels your eyes have in that same section of image, based on what you can see and how clear it is. Just a rough estimate (that's how I came up with the conservative ~100MP number earlier).)
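For reference, the usual back-of-envelope behind those "equivalent megapixel" figures looks like this; the field-of-view and acuity values are the commonly used assumptions, and the results only make sense as orders of magnitude since real acuity falls off sharply away from the fovea:

```python
# The usual back-of-envelope behind "equivalent megapixels" of the eye.
# FOV and acuity figures are the commonly used assumptions; real acuity is only
# this good near the fovea, so treat the results as order-of-magnitude only.
def equivalent_mp(fov_deg, arcmin_per_pixel):
    pixels_per_side = fov_deg * 60 / arcmin_per_pixel
    return pixels_per_side**2 / 1e6

# Dwelling on a scene: ~120 deg field at ~0.3 arcmin effective resolution (the
# eye scans and the brain integrates), giving the often-quoted ~500+ MP figure.
print(f"Dwelling:     ~{equivalent_mp(120, 0.3):.0f} MP")   # ~576 MP

# A single glance: ~60 deg of useful field at ~1 arcmin (20/20 acuity).
print(f"Quick glance: ~{equivalent_mp(60, 1.0):.0f} MP")    # ~13 MP
```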

How well does a human need to be able to see to drive safely? Certainly many people do not have 20/20 vision, even with correction. Are they safe to drive? What about at night, when the cameras clearly have an advantage (ever looked in your rear view mirror vs the backup camera at night?).

These are good questions. I think we should strive to ensure that the system can see as well as a person with 20/20 vision. People who have inadequate vision struggle to drive safely. I reduce my driving at night when I can (specifically for road trips - in town is less of an issue since it is well lit), since my scleral contacts can cause fogging and loss of contrast sensitivity.

I certainly think we should make sure the system can figure out what is happening 1/4 or 1/2 a mile ahead, if that's what a human can do in those particular circumstances (obviously it's not always possible to see 1/4 mile ahead).

Before FSD was in wide beta there were many posts in these forums calculating that the car could never see far enough and "proving" that it would not see a car beyond 50 meters. Yet Chuck's famous UPL tests show that the car can indeed do that
And you'll see I am on the record saying that that claim of only 50 meters was nonsense, before it was proven to be nonsense.

That never made any sense. It was clear from even early in Chuck's ULT days that the car could see at least 100 meters fairly reliably (more like 130-150 meters probably).

I haven't seen a screw up on Chuck's turn with recent software that appears to be related to perception. As long as the car moves quickly enough, the sensors appear adequate.

but I don't think there is any actual evidence to show that it cannot.

I guess I would say that as long as the car is failing to react to stopped traffic in a timely manner, we need to figure out what is the reason for that (or at least Tesla does).
 