
New AU FSD vs USA FSD post June 21

Will the Australian Model 3 be capable of operating under the new FSD, i.e. camera only, no radar?

To clarify: Tesla's belief is that Tesla Vision will provide superhuman vision, so there is no need for radar or lidar. That belief applies not just to Tesla in North America but globally, so as Tesla's plan progresses you can expect Australia to get the radarless setup as well. But don't quote me on the timeline.
 
I mean, it's fairly simple logic...

Humans seem to drive OK with 2 eyes and a few mirrors.

Your Tesla has 8 cameras, far better positioned than the human eye.
The human eye uses two eyes to help the brain assess depth or distance. It's possibly why no creatures have only one eye. The only exceptions I can think of are pirates and the Tesla camera system, where each camera does not see anything the other cameras see, so each camera is one eye. It's interesting that technology can overcome this depth-of-field limitation, but maybe that's why we have phantom braking and panic braking when cars are a long way off.
 
It seems, too, that the cameras' night performance is way down compared to their daytime performance, and way below our own eyes' night vision. I'm unconvinced the removal of radar is a positive step towards better Autopilot performance. I suspect it's more about supply chain issues, shortages and reducing cost.
Doesn't the radar see under or through the car in front? Mine has reacted many times to a fast-stopping car two cars ahead that I cannot see.
 
Yep, there is a good YouTube video of a Tesla giving an emergency warning / braking when the radar picks up an incident ahead of the car in front.

I just can't see, in this example, how the camera alone at this distance would have picked up the deceleration of the vehicle ahead of the vehicle in front.

 
I see many people making this incorrect assumption. The biggest thing to remember when evaluating a vision-based system is that the neural net's input has far more data than a YouTube video.

Let me explain. The video you are watching is from a BlackVue dashcam. That is a consumer-grade camera with an RGB pixel sensor and automatic exposure control, which in essence means the video is processed so that it looks good to a human viewer. Fine details required for automated driving tasks are lost because they are not visually appealing or necessary for our brains. Furthermore, the dynamic range of the sensor output is squeezed heavily in order to keep the compressed H.264/H.265 video compatible with SDR playback (i.e. around 100 nits of brightness).

In other words, the visual data stream in the YouTube video is missing most of the data available to the Tesla FSD computer. To elaborate, the Autopilot cameras capture RCCB video, which essentially means only two colour channels (red and blue) are captured in order to prioritise luminance data (i.e. changes in brightness, which reveal edges and therefore objects). Moreover, the FSD computer's input is capable of 12 bits per channel plus a dynamic-range pre-processor. This conditions the video sensor data to have the strongest possible signal for the neural net in situations where there is, say, bright sunlight and a shadowy overpass. To a human viewer this image would look terribly washed out and devoid of detail (as you would need to map all this information down to 8-bit RGB pixels at 100-nit SDR), but all the data necessary for perception is there in the raw data, which you cannot see on a traditional monitor.
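To make the dynamic-range point concrete, here is a rough sketch of my own (not Tesla's pipeline; the sensor values and gain are made up) showing how detail that survives in 12-bit raw data is destroyed once a frame is tone-mapped down to 8-bit SDR for a dashcam or YouTube viewer:

Code:
# Rough illustration (not Tesla's pipeline): detail present in 12-bit raw
# sensor data is destroyed when the frame is tone-mapped to 8-bit SDR for
# human viewing, which is all a dashcam/YouTube clip ever contains.
import numpy as np

def to_sdr_8bit(raw_12bit, exposure_gain=8.0):
    """Naive auto-exposure gain plus quantisation to 8 bits, dashcam-style."""
    scaled = raw_12bit.astype(np.float64) * exposure_gain / 4095.0
    return np.clip(scaled * 255.0, 0, 255).astype(np.uint8)

# Simulated 12-bit luminance samples: deep shadow under an overpass (values
# near 0) and bright sunlit road (values near full scale, 4095).
shadow = np.array([3, 5, 9, 14])                 # subtle edges in shadow
highlight = np.array([4050, 4070, 4090, 4095])   # subtle edges in sunlight

print(to_sdr_8bit(shadow))     # [1 2 4 6]  -> shadow detail survives the gain
print(to_sdr_8bit(highlight))  # [255 255 255 255] -> sunlit edges are gone

Exposing for the shadows clips the highlights (and vice versa) once you are stuck with 8 bits, whereas the 12-bit raw stream keeps both ends for the neural net.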

What about occluded objects, I hear you ask? Well, this is where having multiple cameras with multiple vantage points, operating at 10x human processing speed, helps. A neural net can be trained to perform object detection and classification on partially occluded objects with super-human ability (given the right training corpus). The reason is that one of the fundamental principles of convolutional neural nets is the ability to detect fine edges and pass these features down the layers of abstraction until you end up with a higher-level label like car, dog, person etc. If the object shows up in even a single frame (frames arrive 36 times per second, i.e. roughly every 28 ms) from any one of the cameras (which have different vantage points compared to your two eyeballs), that prediction can be added to the environmental entities at each timestep via a function called an orthographic feature transform. The Bird's Eye View network Tesla has developed can then stitch these entities together in a 3D world to more accurately determine whether there is indeed a car in front of your lead car and, more importantly, what trajectory that car is on (including being stopped). The BEV is the missing piece that allows a radar-less sensor suite, and I believe even basic Autopilot will use the BEV in the fullness of time.
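As a rough illustration of the stitching idea (this is not Tesla's BEV network; the camera poses, ranges and bearings below are invented), here is the geometric core of it: a detection from any single camera, in any single frame, can be dropped into one shared, vehicle-centred top-down frame:

Code:
# Minimal sketch of stitching per-camera detections into one top-down
# (bird's eye view) frame. NOT Tesla's BEV network - just the geometric
# intuition that every camera's detections become entities in a single
# vehicle-centred coordinate frame.
import math
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    range_m: float      # estimated distance from that camera
    bearing_deg: float  # bearing within that camera's field of view

# Hypothetical camera mounting poses relative to the car (x forward, y left).
CAMERA_POSES = {
    "main_forward": (2.0, 0.0, 0.0),      # (x, y, yaw_deg)
    "right_pillar": (1.0, -0.8, -60.0),
}

def to_bev(camera: str, det: Detection):
    """Place a single-camera detection into the shared top-down frame."""
    cx, cy, cyaw = CAMERA_POSES[camera]
    theta = math.radians(cyaw + det.bearing_deg)
    x = cx + det.range_m * math.cos(theta)
    y = cy + det.range_m * math.sin(theta)
    return det.label, x, y

# A partially occluded car glimpsed by the forward camera for one frame
# still becomes an entity in the shared world model.
print(to_bev("main_forward", Detection("car", 60.0, 2.0)))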

Secondly, the AP suite has one long-range forward-facing camera. Its field of view would be similar to this:

[Attached image: frame from the long-range forward-facing camera]


Incidentally, this frame was captured about a second before the AP forward collision alarm sounds. It's obvious even in this low-resolution image that there is a car with its brake lights on directly in front of the lead car. The FSD computer can see up to 250m ahead in great detail, so things like cars going around slight bends to reveal their sides, objects visible through windshields etc. will all bubble up into the BEV representation of the world. Compare this to the radar equivalent, where the radar returns need to be heuristically filtered to work out which one is the car in front of the lead car. That is far more error-prone and temperamental. Ever heard of phantom braking? That is basically Autopilot trying to resolve a disagreement between what the radar says and what the vision sees.

As an aside, keep in mind that the brake lights themselves are emitting electromagnetic radiation at a specific wavelength. With a training corpus annotating brake lights, a neural net can detect brake lights to a superhuman level, then accurately determine relative distance and, given a second frame, estimate velocity/acceleration/trajectory. This is a huge freebie when it comes to detection and is very close in effect to vehicle-to-vehicle communication, just done via visible light emitters rather than radio emitters.
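As a back-of-envelope sketch of that last step (using the ~36 fps figure mentioned above; the distances are made up), closing speed falls straight out of two per-frame distance estimates:

Code:
# Back-of-envelope sketch: once distance to the braking car is estimated in
# successive frames, relative (closing) speed falls out by finite differences.
FRAME_DT = 1.0 / 36.0  # ~0.028 s between frames at the quoted ~36 fps

def closing_speed(d_old: float, d_new: float, frames_apart: int = 9) -> float:
    """Closing speed (m/s) from two distance estimates N frames apart."""
    return (d_old - d_new) / (frames_apart * FRAME_DT)

# Example: the gap to the car ahead of the lead car shrinks from 80 m to 75 m
# over 9 frames (a quarter of a second) -> closing at ~20 m/s (~72 km/h).
print(closing_speed(80.0, 75.0))   # ~20.0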

TL;DR: There is no fundamental reason that vision-only Autopilot could not perform the same as radar sensor fusion. In fact, I believe it will eventually perform far better. The AP cameras have a better field of view, capture 10x more data and can operate at 10x the speed of a human. This, coupled with specialised neural nets, will allow Autopilot to have super-human perception and reaction times.
 
The human eye uses two eyes to help the brain assess depth or distance.
Wrong. Human eyes can only use binocular cues from roughly 5 cm out to about 5 m. Our eyes are only around 7 cm apart, which makes stereoscopic triangulation of objects more than about 5 metres away imperceptible given the eye's angular resolution. The vast majority of depth perception (especially when driving) comes from relative size and motion, which is why you can, in theory, drive just fine with one eye closed.
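For anyone who wants to sanity-check the geometry, here is a quick calculation (the ~7 cm baseline and the 1-arcminute disparity threshold are rough assumptions, not measured values) showing how the smallest resolvable depth step grows with the square of distance:

Code:
# Why binocular depth cues fade with distance: the smallest depth difference
# you can resolve grows with the square of the viewing distance.
import math

BASELINE_M = 0.07                          # ~7 cm between the eyes
MIN_DISPARITY_RAD = math.radians(1 / 60)   # assume ~1 arcminute resolvable

def depth_resolution(distance_m: float) -> float:
    """Smallest resolvable depth step (m) at a given distance."""
    return MIN_DISPARITY_RAD * distance_m**2 / BASELINE_M

for d in (2, 5, 20, 50):
    print(f"{d:>3} m -> +/-{depth_resolution(d):.2f} m")
# ~+/-0.02 m at 2 m and +/-0.10 m at 5 m, but ~+/-1.7 m at 20 m
# and ~+/-10 m at 50 m, i.e. useless at highway distances.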

It's possibly why no creatures have only one eye.
Most animals on earth have eyes that do not operate in stereoscopic mode (i.e. binocular vision), so no, stereo vision is not the reason no animals are born with fewer than two eyes. Ducks and rabbits, for example, have eyes whose fields of view do not overlap, so they have to move their heads rapidly to determine distance via motion. Also, in the same way humans do, they have an intrinsic understanding of how large certain objects should be, and by seeing those objects at a certain apparent size in their field of view they can infer distance.
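The "known size" cue is easy to see in pinhole-camera terms. This small sketch (the focal length and car width below are illustrative numbers, not Tesla camera specs) shows distance dropping straight out of apparent width:

Code:
# The "known size" depth cue in pinhole-camera terms: if you know roughly how
# wide a car is and how wide it appears in the image, distance follows.
FOCAL_PX = 1400.0      # assumed focal length in pixels (illustrative)
CAR_WIDTH_M = 1.85     # typical car width

def distance_from_apparent_width(width_px: float) -> float:
    """Estimate range (m) to a car from its apparent width in pixels."""
    return CAR_WIDTH_M * FOCAL_PX / width_px

print(distance_from_apparent_width(260))  # ~10 m away
print(distance_from_apparent_width(26))   # ~100 m away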

The only exceptions I can think of are pirates and the Tesla camera system, where each camera does not see anything the other cameras see, so each camera is one eye. It's interesting that technology can overcome this depth-of-field limitation, but maybe that's why we have phantom braking and panic braking when cars are a long way off.
Again, you are completely and utterly incorrect. The cameras do overlap quite considerably. However, that overlap doesn't bring much useful depth information (i.e. geometric pixel triangulation) compared to a trained neural net that works much the same way humans do, where depth is simply an emergent property with no explicit triangulation calculations at all.

The reason we have panic braking and phantom braking is the radar and the hand-written control policy code, not an inherent problem with the placement of the cameras or with the neural net processing of the video.
 
One of the things I’ve wondered about in the new software is whether they are going to include object persistence in the algorithm.

The idea being that something of potential interest is sighted by the system, and even when it becomes occluded from view the system uses its last position and relative velocity to keep track of it and anticipate where it might appear next.

This would substantially improve situational awareness.
 
One of the things I’ve wondered about in the new software is whether they are going to include object persistence in the algorithm.

That's exactly what the Bird's Eye View network (BEV net) does. Basically, you are asking a neural net to predict the bird's eye view of the environment based on a fused multi-camera video stream. This forces the model to resolve object permanence in order to make a prediction.

As there is only a finite set of ways any piece of road infrastructure can look, the NN builds an understanding of how to fill in details that are not directly in view of the cameras. If a car appears in one frame and disappears in the next, the predicted BEV view of the scene will still include that car, just in a slightly different position based on its previous timestep, because in the training data there is never a case where an object suddenly disappears.
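Here is a toy version of that coasting behaviour (my own illustration, not Tesla's implementation): when a tracked car drops out of view for a couple of frames, its last position and velocity are simply propagated forward rather than the entity vanishing:

Code:
# Toy object-permanence behaviour: while a tracked car is occluded, keep
# predicting from its last known position and velocity instead of dropping it.
from dataclasses import dataclass
from typing import Optional

FRAME_DT = 1.0 / 36.0  # assumed frame period (~28 ms)

@dataclass
class Track:
    x: float    # position along the road (m)
    vx: float   # relative velocity (m/s)

def step(track: Track, measured_x: Optional[float]) -> Track:
    """Advance one frame; coast on the previous velocity while occluded."""
    predicted_x = track.x + track.vx * FRAME_DT
    if measured_x is None:          # occluded this frame: keep predicting
        return Track(predicted_x, track.vx)
    vx = (measured_x - track.x) / FRAME_DT
    return Track(measured_x, vx)

track = Track(x=50.0, vx=15.0)
for obs in (50.4, 50.8, None, None, 52.1):  # two occluded frames in the middle
    track = step(track, obs)
    print(round(track.x, 2), round(track.vx, 1))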

I believe this approach doesn't get any benefit from radar, hence radar being removed. There is a short period during which the non-BEV Autopilot is still in use without radar, and that's going to cause some slight regression, but the long-term benefit is clear.
 
>>HW3 runs a neural network<<

Is that actually IN the MCU or OTA? I would have thought an NN would be too much for the car to handle. Presumably, too, HW3 will have to be retrofitted?

The MCU and the FSD computer (i.e. HW3) are two entirely different things.

The FSD computer is an NN accelerator capable of processing enormous amounts of data very quickly at low power.

The NNs that the FSD computer runs are updated OTA via regular software updates over Wi-Fi.
 