
lunitiks

Cool James & Black Teacher
Nov 19, 2016
2,698
5,996
Prawn Island, VC
Dear camera-savvy, driver-assistance-knowledgeable and AP2 HW-insightful fellows. My head is full of questions about Tesla's 8-camera sensor suite. Such questions get touched upon now and then in many different threads, but I would really like to read some more focused discussion on the topic. So:
  • What capabilities and limitations can/should we expect from Tesla's 8 camera sensors, given the available information about their physical positions, FOVs, range, color, resolution etc.?
  • How do the cameras compare to our human eyesight?
  • How much of a difference do ambient light and artificial light (headlamps etc.) make for the camera vision?
  • Are there certain scenarios where we can absolutely and definitely say that the cameras are of no help to the AP2-system?
  • Also, what do you think about the physical positioning of the cameras on the vehicle? I'm also wondering about the so-called "heating elements", the absence of water and dust wiping/blasting mechanisms, etc.
I'll kick off with some "maximum range" info from Tesla's site and Wikipedia (which reminds me of another question: what does "maximum range" even mean? Surely the 60 m range fisheye cam is in principle able to spot the moon, 1.3 light seconds away (yeah, but you get my point), so I guess "maximum range" has to do with the ability to distinguish certain critical details in traffic, i.e. something to do with resolution, or...? [/technical competence revealed/])

Forward cameras:
Narrow: 250 m (820 ft)
Main: 150 m (490 ft)
Wide: 60 m (195 ft)

Forward / side looking B-pillar cameras:
80 m (260 ft)

Rearward / side looking side repeater cameras:
100 m (330 ft)

Backup (rear view) camera:
50 m (165 ft)

And here's a quick sketch of the camera vision, based on the official renderings:


Please, please don't destroy this thread with long discussions about radars, lidars, GPS, ultrasonic or other sensor types. OTOH, all camera talk must pass :)
 
How do the cameras compare to our human eyesight?
The dynamic range of the human eye, I believe, still surpasses that of the best cameras. That said, dynamic range doesn't play a big role WRT computer vision in the sense of autonomous driving.

How much of a difference do ambient light and artificial light (headlamps etc.) make for the camera vision?
Depends on the algorithms, right? Ideally, it would be minimal.

Detecting something in pitch dark isn't the same as detecting something in the daylight. So that would play a bigger role.

surely the 60 m range fisheye cam is in principle able to spot the moon,
Depending on the camera resolution, with a fisheye lens, the moon might be a pixel, right? So how do you know it's really the moon as compared to just noise? You don't.

The 60 m figure for the fisheye probably means it still has enough resolution at that range to distinguish objects (i.e. you need a handful of pixels to tell whether it's a car, a boat, a person, a dog, or the moon; I don't have an answer for how many a handful is).
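
For a rough sense of scale, here's a back-of-the-envelope sketch. The 1280-pixel width is borrowed from the firmware info later in this thread, and fisheye distortion is ignored (so pixels-per-degree is treated as uniform across the frame, which it really isn't):

```python
import math

PX_PER_DEG = 1280 / 120   # assumed: ~1280 px spread over the 120 deg fisheye FOV

def pixels_across(width_m: float, distance_m: float) -> float:
    """Approximate horizontal pixels an object of width_m subtends at distance_m."""
    angle_deg = math.degrees(2 * math.atan(width_m / (2 * distance_m)))
    return angle_deg * PX_PER_DEG

print(round(pixels_across(1.0, 60)))   # ~10 px: a 1 m wide object at 60 m
print(round(0.52 * PX_PER_DEG))        # ~6 px: the moon, ~0.52 deg in diameter
```

With a VGA-class assumption (640 px across the same FOV) those numbers drop to roughly 5 and 3 pixels, which is right in that "handful of pixels" territory.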

Also, what do you think about the physical positioning of the cameras on the vehicle?
It's hard to tell from the images how much overlap the cameras have. Setting redundancy aside, since the cameras seem to have varying fields of view, stitching them together into a true 360° view isn't trivial.

In order to get good coverage, you either need a good 360-degree view stitched into one "large" image, or you need non-trivial overlap. (Imagine you see something at a camera's edge: if you don't have enough overlap with the next camera, how will you know whether it's a car you need to worry about or a shadow you don't?)
 
How do the cameras compare to our human eyesight?

Typically, automotive cameras use a Red/Clear (RCCC) color filter, meaning there's one red channel and three clear channels, so it's effectively monochrome vision plus red information.
Elon Musk on Twitter

The dynamic range on these sensors is also typically higher than with normal camera sensors.
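
To make the RCCC idea a bit more concrete, here's a toy sketch of pulling a luminance image plus a quarter-resolution red channel out of an RCCC mosaic. This is purely illustrative, not Tesla's or any sensor vendor's actual pipeline, and the position of the red pixel within each 2x2 tile is an assumption:

```python
import numpy as np

def split_rccc(raw: np.ndarray):
    """Toy RCCC 'demosaic': raw is a single-channel mosaic where each 2x2 tile
    is assumed to be [[R, C], [C, C]] (red position varies by sensor)."""
    luma = raw.astype(np.float32)
    # Red pixels carry no clear/luminance sample; fill them in crudely from
    # the clear neighbours to the right and below.
    luma[0::2, 0::2] = (luma[0::2, 1::2] + luma[1::2, 0::2]) / 2
    red = raw[0::2, 0::2].astype(np.float32)   # quarter-resolution red channel
    return luma, red

# Usage on a fake 4x4 mosaic (even dimensions assumed)
raw = np.arange(16, dtype=np.uint16).reshape(4, 4)
luma, red = split_rccc(raw)
print(luma.shape, red.shape)   # (4, 4) (2, 2)
```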
 
Thanks for the input!

Is it possible for anyone with math/physics skills to determine the camera resolution, if only for the FishEye cam, which we know the following about:
- Maximum range: 60 m
- FOV: 120 degrees

The B-pillar cams we know have 80 m maximum range and 90 degrees FOV...

Or are we missing some crucial data points before resolution can be determined?
 
Technically it's possible to determine the MINIMAL necessary resolution. But the camera can have 10x more res than that.
 
I'll preface this by saying that I may have made some stupid mistake here:

Draw a triangle, take the FOV angle and the maximum range distance, and determine the field of view in meters.
For the fisheye, you have a 120-degree FOV and a 60 m range, and you need to find your FOV in meters = 2*tan(120 deg / 2) * 60 m = about 200 meters.
Then you need to determine what your smallest feature will be; let's say it's 1 m (you're only looking for things that are 1 m or bigger).

So you get your sensor resolution = 2 * (200 m / 1 m) = 400 pixels in each direction, or 0.16 MP [this assumes you want the same resolution in az and el, and are not confined to a conventional camera's 4:3 or 3:2 sensor].
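
Here's that arithmetic as a quick Python sketch, under the same assumptions as above (1 m smallest feature, 2 pixels per feature, square pixels, and a flat-plane approximation that ignores fisheye distortion):

```python
import math

def min_pixels_per_axis(fov_deg: float, range_m: float,
                        feature_m: float = 1.0, px_per_feature: int = 2) -> int:
    """Minimum pixels along one axis to land px_per_feature pixels on a
    feature_m-sized object at the far edge of the stated range."""
    width_m = 2 * math.tan(math.radians(fov_deg) / 2) * range_m   # ~208 m here
    return math.ceil(px_per_feature * width_m / feature_m)

px = min_pixels_per_axis(120, 60)
print(px, f"px per axis, ~{px * px / 1e6:.2f} MP for a square sensor")
# 416 px per axis, ~0.17 MP; rounding the 208 m down to ~200 m gives the
# 400 px / 0.16 MP figure above.
```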


But that doesn't mean your algorithm will be able to detect something that's 1 m square at 60 m range; it just means it'll see it as a pixel. If you need a certain number of pixels to do classification, your required resolution goes up. And if you also want to do noise reduction, your required resolution goes up again.


I'll end this with saying, I may have made some stupid mistake here. Please correct me where I'm wrong.
 
As far as resolution goes, you don't need much to outdo a human, and you don't need a huge amount for image recognition either. Larger pixels are also better at capturing light, so it's in the manufacturer's interest to keep the resolution fairly low (compared to modern digital cameras). This also serves to decrease processing time.
 
You might be right, I haven't done the math, but I think that's debatable.

If you're confined in your sensor size, there might be (?) merit to keeping a smaller pixel size while having more pixels. If you average neighboring pixels, you get a sqrt(n) reduction in noise, so averaging every 4 pixels halves the noise magnitude (a 6 dB reduction). There are more sophisticated approaches too.
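
A quick numpy sanity check of that sqrt(n) behaviour, just binning pure Gaussian noise 2x2 (a toy simulation, not a statement about any particular sensor):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=(1024, 1024))   # sigma = 1 "read noise"

# Average each 2x2 block of pixels -> n = 4 samples per output pixel
binned = noise.reshape(512, 2, 512, 2).mean(axis=(1, 3))

print(round(noise.std(), 2), round(binned.std(), 2))   # ~1.0 vs ~0.5, i.e. sqrt(4) = 2x lower
```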

I'm not sure how much SNR you gain by increasing your pixel size.
 
As I understand it, at 60 metres the fisheye cam is at its absolute limit in resolving objects/pixels in any meaningful way; 60 metres is therefore the fisheye cam's "maximum range". Combine this with the fact that FOV = 120 degrees, and you can illustrate the camera's view as a "cone", or a "pizza slice" seen from above, where each point on the edge (crust) is 60 metres from the camera lens. This should mean that, at 60 metres, the *length* of the "end" or "crust" of the pizza cone represents the maximum FOV in metres.
 
  • How do the cameras compare to our human eyesight?
I touched on this subject in an argument about whether Tesla will use stereo cameras. The human eye has about 10 MP resolution in the center, and about 576 MP if you count peripheral vision and the fact that the eyes can move.
Tesla Model 3 will have the new Autopilot 2.0 with dual cameras

However, as @Max* points out, matching human eyesight does not play a big role for computer vision. As an example, early versions of the Mobileye system used only VGA cameras (0.3 MP), but could already do a lot of recognition in the driving context. This is only a guess, but I would assume Tesla has moved to HD cameras (~1-2 MP) for the latest systems.

The main reason is that humans basically do not use, or need to use, their eyesight to its full capability while driving. Sign reading is probably the application where the eyesight requirements are highest, but signs are deliberately designed to be legible even to people with poorer vision.

As for the "range" of the cameras, @Max* covers most of it, but basically has to do with what point things become indistinguishable (sub-pixel). I also talk about that in the other tread.
Tesla Model 3 will have the new Autopilot 2.0 with dual cameras

The way the cameras were used in the Mobileye system to measure distance was with monocular visual cues (the height of the camera and the bottom of the target vehicle relative to the video frame). There is a minimum accuracy required, which is largely subjective and varies by application. Basically, a shift of one pixel in height represents a given amount of distance, and the farther away the object is, the more distance each pixel represents, so the estimate gets less accurate. As an example, in the Mobileye paper I linked, a VGA (640x480 pixel) camera gives 10% error at 90 meters and 5% error at 45 meters. If you use a higher resolution camera, that decreases the error.
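
To make that concrete, here's a rough sketch of the pinhole geometry behind that kind of monocular range estimate. The ~740 px focal length and ~1.2 m camera height are my own assumptions, chosen because they roughly reproduce the quoted error figures, not numbers taken from the post above:

```python
F_PX = 740.0   # assumed focal length of a VGA-class camera, in pixels
H_M = 1.2      # assumed camera mounting height, in metres

def range_from_row(y_px: float) -> float:
    """Pinhole model: a point on the road at distance Z appears y = f*H/Z pixels
    below the horizon, so Z = f*H/y."""
    return F_PX * H_M / y_px

def one_pixel_error(z_m: float) -> float:
    """Approximate range error from being off by one pixel at range z_m
    (dZ ~ Z^2 / (f*H))."""
    return z_m ** 2 / (F_PX * H_M)

for z in (45, 90):
    err = one_pixel_error(z)
    print(f"{z} m: +/-{err:.1f} m per pixel (~{100 * err / z:.0f}%)")
# 45 m: +/-2.3 m per pixel (~5%)
# 90 m: +/-9.1 m per pixel (~10%)
```

The quadratic growth of that error with distance is why a higher-resolution sensor (smaller angle per pixel) helps so much at longer range.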
 
This should mean that, at 60 metres, the *length* of the "end" or "crust" of the pizza cone represents the maximum FOV in metres.
Which is where I got my "about 200 m" from. It's actually 207.85 m, but I rounded down.
 
It's hard to tell from the images how much overlap the cameras have. Setting redundancy aside, since the cameras seem to have varying fields of view, stitching them together into a true 360° view isn't trivial.

In order to get good coverage, you either need a good 360-degree view stitched into one "large" image, or you need non-trivial overlap. (Imagine you see something at a camera's edge: if you don't have enough overlap with the next camera, how will you know whether it's a car you need to worry about or a shadow you don't?)
I agree with your other points, but not really on this part. I highly doubt there is any image stitching going on in any of the multi-camera implementations. Image stitching is only required to give a picture that a human can use to visualize; it is completely unnecessary for a computer.

Basically, the computer knows each camera's position/angles relative to the vehicle and the ground, and can use that to map a 3D environment. The cameras are used to place recognized objects as boxes in that environment. The images from different cameras do not have to be stitched to accomplish this.

As an object moves out of the view of one camera and into the next, the system will know that and map it accordingly, but it does not need a continuous image to do that.
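
As a minimal sketch of that idea (entirely illustrative, with made-up mounting positions and angles, nothing to do with Tesla's actual software): if you know a camera's pose on the car, a detection expressed as bearing-plus-estimated-range in that camera's frame drops straight into one vehicle-centred map, with no stitched panorama needed.

```python
import math

def to_vehicle_frame(cam_x_m: float, cam_y_m: float, cam_yaw_deg: float,
                     bearing_deg: float, range_m: float):
    """Place a detection (bearing/range relative to one camera) into x/y
    coordinates in the vehicle frame (x forward, y left), using that camera's
    known mounting position and yaw."""
    ang = math.radians(cam_yaw_deg + bearing_deg)
    return (cam_x_m + range_m * math.cos(ang),
            cam_y_m + range_m * math.sin(ang))

# Two hypothetical detections from two different cameras land in the same map:
print(to_vehicle_frame(0.0, 0.9, 90.0, -10.0, 30.0))    # left B-pillar-ish camera
print(to_vehicle_frame(-0.5, 0.9, 150.0, 5.0, 20.0))    # left repeater-ish camera
```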

Here's Mobileye's: [image: 3D-Modelling.png]

Here's Tesla's: [image: tesla-self-driving-demonstration-video-screenshot_100581834_m.jpg]
 
I haven't professionally worked with camera sensors much (I have messed with them for fun though), so you might be right.

But generally speaking, with regard to other sensors, when trying to run classification algorithms the edge cases are a pain. I'm aware of two approaches to fix that: stitch the sensor data together, or create an ignore section at the edges, which requires overlap from the next sensor. Though in our case we were stitching, in part, for visualization to the operator.

So you're right, you don't have to do it. But you will need non-trivial overlap, even in a moving-car scenario. In the above example, the objects it detects are stationary [parked cars]. Imagine a moving car at the edge of two cameras driving towards you: with no overlap, you may not be able to classify it as a car. And if you keep all the edge data, you significantly increase your false-alarm rate too.


So while I do agree with you, I somewhat stand by my original assessment.
 
@stopcrazypp, the more I think about it, the more I think you're probably right. In the context of a moving car (the Tesla, or whatever) against a second moving car (car B), the edge cases don't matter, because by the next frame car B will be more in one camera's view than the other, and by the Nth frame it'll be fully in it.

If your latency (N frames) gets too high, then this becomes a problem. But as long as N isn't too high, the Tesla should have enough time to register it and react.
 
The human eye has about 10MP resolution in center
This is not really true, considering that sharp focus covers only around 1-2 degrees in the center (the fovea), and you have around 6-7 million cones (which perceive color) in the whole retina. Around 200,000 of those are in the fovea itself. You can't really count each cone as a pixel, because it can only perceive one pigment out of the three possible types ("red" cones (64%), "green" cones (32%) and "blue" cones (2%)). In the central 1 degree there are only about 17,500 cones and zero rods.

[images: 500px-Vis_Fig2.jpg, visual_field_large.png]


The only reason humans think they have such great resolution is due to the movement of the eye and the brain stitching it all together. :(
 
So while I do agree with you, I somewhat stand by my original assessment.
I get what you are trying to say about overlap and an object at the edges, but my point was mainly focused on the need for image stitching specifically (which comes with its own challenges and artifacts; if you have used Google Street View, for example, you will see what I mean).

I agree that ideally there would be overlap so edge cases can be handled (even in stationary cases, where you are only initially acquiring the object and half of it is in one view and half in the other), especially if attempting to recognize a larger object as a car, for example. However, the Mobileye and Tesla implementations, at least judging from the visualizations, seem to treat everything the same: the system does not necessarily recognize cars as cars, but rather as objects in its path. So even with a partial view near the edges, it just draws a smaller box.
 
The only reason humans think they have such great resolution is due to the movement of the eye and the brain stitching it all together. :(
Well there's a bunch of different ways to get an equivalent MP, and adding color makes things even more complicated.

The 10 MP I refer to is mainly the rough equivalent resolving power over the normal field of view (ignoring peripheral vision). There may be some micromovement of the eye in this case (meaning no change in viewing angle, but perhaps some movement), but it excludes any change in viewing angle (for example, rolling the eyes left-right or up-down, or any combination, to look at objects that aren't exactly in the center).
The Resolution of the Human Eye Tested Is 10MP
 
That's not a scientific test: it used low-quality cameras, made no mention of lens resolution, and allowed eye micromovements. Without micromovements, 10 MP is not possible in the fovea of the human eye; there simply aren't enough cells to accomplish this (even if color weren't a factor). Granted, a human eye is constantly sweeping that 1-degree, high-cell-density FOV all over the road while driving.

Luckily, large resolution is not needed for image recognition. Here's a 0.3 MP video:
It doesn't seem to have very much trouble.
 
Well, you can make a lot of different assumptions: looking at cones, looking only at the number of nerves, etc. I agree there weren't many controls in that experiment, but it resulted in a relatively conservative number.

In general, the larger MP estimates focus more on the practical resolving power of the eye, not on a situation where the eye is not allowed any micromovements, since such micromovements are involuntary and play a critical role in the way the eye works. Cameras generally do not replicate this function (with some rare exceptions, like the "pixel shift"/"hi-res" modes in cameras with in-body stabilization, which work similarly by shifting the sensor a small amount and combining images for higher resolution).

The 576 MP number, for example, comes from the human eye being able to resolve a pixel spacing of 0.3 arc-minutes, combined with the eye's full range of movement covering a 120 x 120 degree field of view.
Clarkvision Photography - Resolution of the Human Eye

If you work that back to the 60x60 degree from above, you get 144 MP, which is a drastically higher number. So the 10 MP number is quite conservative.
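
For reference, the arithmetic behind those two figures (0.3 arc-minute resolvable spacing treated as one "pixel", over a square field of view):

```python
def eye_megapixels(fov_deg: float, arcmin_per_px: float = 0.3) -> float:
    """Pixel count for a square FOV if the eye resolves arcmin_per_px spacing."""
    px_per_side = fov_deg * 60 / arcmin_per_px   # degrees -> arc-minutes -> "pixels"
    return px_per_side ** 2 / 1e6

print(eye_megapixels(120))   # 576.0 MP over 120 x 120 deg
print(eye_megapixels(60))    # 144.0 MP over 60 x 60 deg
```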

In general, you get no argument from me about large resolution being unnecessary, as that was my point in the first place.
 
Or are we missing some crucial data points before resolution can be determined?
Not sure if this was conclusively resolved anywhere, so I am adding my data here.

I've no physics skills, but the resolution of the 7 cameras directly available to the APE is 1280x964 grayscale, 16 bits per pixel; it's directly visible in the APE firmware file.
There's also the backup camera, but it's accessed via the CID; its resolution is 1160x720.

The 7 cameras are:
Main
FishEye
LeftPillar
LeftRep
Narrow
RightPillar
RightRep

Only Main and Narrow are currently used by the autopilot (as of 17.11.45 firmware, anyway).
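
Just as a ballpark from those numbers (the ~30 fps frame rate is my assumption, it's not something stated in the firmware info above):

```python
WIDTH, HEIGHT, BYTES_PER_PX = 1280, 964, 2   # 16-bit pixels, per the firmware info
CAMERAS, FPS = 7, 30                          # 30 fps is an assumption

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PX
total_mb_s = frame_bytes * CAMERAS * FPS / 1e6
print(f"{frame_bytes / 1e6:.2f} MB per frame per camera, "
      f"~{total_mb_s:.0f} MB/s of raw pixel data for all 7")
# 2.47 MB per frame per camera, ~518 MB/s of raw pixel data for all 7
```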