Tesla.com - "Transitioning to Tesla Vision"

Wasn't the idea that video would let them do far LESS supervised labeling?

Like, when they did it frame by frame, you had to label an object TRUCK in every frame, but with video, if a human labels a thing TRUCK in frame 1, the system can self-label that same object in future frames so long as it remains visible, thus saving a ton of human effort?
You can create a labeling system that takes in "video": on the video itself you manually label the objects of interest (drawing a bounding box) and then select the number of frames (e.g. 20) you would like to use for training. This bypasses labeling each of the 20 frames by hand; the backend code still works at the frame level, but in essence you are only annotating one screenshot (which, at 20 fps, corresponds to 20 images in 1 second).

Now take everything I wrote above and, instead of labeling manually, have the system that is already trained (your model) infer where the object of interest is and create more training data that way.
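A minimal sketch of that kind of semi-automatic labeling, assuming OpenCV (opencv-contrib-python): a human draws one box on the first frame and a tracker propagates it across the next frames, so each frame gets a label without being hand-annotated. The function name, file path, and 20-frame window are illustrative only, not anything Tesla-specific.

```python
import cv2

def propagate_label(video_path, init_box, n_frames=20):
    """init_box is an (x, y, w, h) box a human drew on frame 0."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError("could not read first frame")

    # CSRT tracker ships with opencv-contrib-python; some builds expose it
    # as cv2.legacy.TrackerCSRT_create() instead.
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frame, init_box)

    labels = [(0, init_box)]                 # (frame index, box) pairs
    for i in range(1, n_frames):
        ok, frame = cap.read()
        if not ok:
            break
        ok, box = tracker.update(frame)      # self-label the next frame
        if ok:
            labels.append((i, tuple(int(v) for v in box)))
    cap.release()
    return labels                            # one label per frame, one human annotation
```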
 
LOL, that is what you got out of that?
yep. I have no dog in this fight other than my personal car; I don't own Tesla stock, so I'm not a fanboy in that respect. I want the whole industry to move forward with tech, and I see Tesla as taking a cop-out, with a fake explanation of "you don't need radar" just to cover their asses.

plain as day to anyone who ever did software work before. yes, I know, it's "hard" work, but giving up should not have been an option.
 
Theoretically, they could fuse 3 or 4 or 5 domains of sensor inputs, right? Too hard, or unnecessary?
thought experiment: would you willingly give away any one of your 5 senses? do you think anyone is ever more capable in the world with FEWER senses than they were born with?

it's really simple. richer info is always good. always. it's 101-level stuff!
 
There is lane departure avoidance. There was so much FUD about vision-only that people don't even know what is what:
The only source of FUD around Vision Only is Tesla, as they did a poor job of explaining what this change really impacted.

They're the ones that removed AEB from their website for a time. They're the ones that didn't have that table on their page for days after they announced the change, just saying that "Emergency Lane Departure Avoidance" would be removed. They're the ones that didn't deal with the NHTSA rating removals ahead of time and allowed it to become a broad news story.

It's pretty understandable that the average reader would not realize that there are two forms of LDA, and only one was missing. There's only one buried blog post that explains the difference on Tesla's site, or you have to read the manuals.
 
Oh, dear God! 🤦‍♂️:rolleyes:


You do know that they (Tesla) take raw, unprocessed streams of "video" (i.e. streams of frames) directly from the 8 camera sensors, right?
Of course I know that. Do you understand what I am explaining? It doesn't matter whether you are using 8 or 12 cameras: you have business rules as an "application" and then you have your vision models. Now, do you understand why they are moving to Tesla Vision? ^^^ Master's in AI, concentration in computer vision. God, I feel like I am talking to people on the business side.
 
video as a whole (.avi, .mp4, etc.) can't be processed by a CNN. You need a pipeline to preprocess the video into frames first and then feed them to the CNN layers.

What you’re saying still doesn’t make sense. Your second sentence is saying that yes, you can use videos as training data. Your first sentence is implying you can’t use videos as training data.

Again, what are you saying?

I can't believe you work in CV, because preprocessing training data is routine in the field. For the purposes of AVs, video data vs. image data implies that the labeled data moves across time rather than being static (as with images).

After all, NNs are simply data crunchers outputting predictions. As long as you provide NNs with clean labels and a structured bitstream, they will give you the right predictions (with a large enough dataset).
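As an aside on how routine that preprocessing is, here is a minimal sketch assuming PyTorch/torchvision (and a video backend such as PyAV): decode a clip into a stack of frame tensors and run an ordinary image CNN on every frame. The file name and the ResNet-18 backbone are placeholders; this is generic pipeline code, not Tesla's stack.

```python
import torch
from torchvision.io import read_video
from torchvision.models import resnet18

# Decode the clip into a (T, H, W, C) uint8 tensor of frames.
frames, _, _ = read_video("clip.mp4", pts_unit="sec")
frames = frames.permute(0, 3, 1, 2).float() / 255.0   # (T, C, H, W), values in [0, 1]

# Any per-frame CNN works here; weights=None keeps the example self-contained.
model = resnet18(weights=None).eval()
with torch.no_grad():
    per_frame_preds = model(frames)                   # one prediction per frame
print(per_frame_preds.shape)                          # (T, 1000)
```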
 
We just took quite a trip to trade in our old Mid Range and pick up my wife's new Model 3. I had over 500 miles each way in each car to directly compare them in the same conditions. I wanted to download my thoughts on Tesla Vision. I think the TLDR version is that this new system was simultaneously the best and worst version of Autopilot I have ever used. Some features are way beyond the system running in the radar cars, and some are probably dangerous in their current state. This was a former demo car, so it was already fully calibrated.

The Good

The ability of the car to manage its speed smoothly in traffic has taken a quantum leap. The vehicle clearly utilizes visual cues such as brake lights to determine when to start slowing down. The car also appears to manage its deceleration curve similarly to the cars surrounding it. In heavy traffic I used to experience it slamming on the brakes way too late, braking too hard, and then leaving a huge gap in front of the vehicle for others to slip into. I would describe the current acceleration curve as human-like. It was scary good.

The autowipers are actually finished. I can't wait to get this part of the code ported over to my Model S, and I have no idea why it isn't fleet-wide at this point. This is another feature that I would describe as being at a human level at this point. I had about 3 hours of scattered thunderstorms, with all of the typical associated conditions such as heavy and dynamic road spray. The new autowiper system handled the spray flawlessly. It unparked the wipers as I came up on trucks, used the appropriate amount of wiping as we passed through the spray cone, and immediately modulated it as we passed out of it. As we were moving from the road-spray section of the highway into a dry, sunny one, it kept the wipers unparked until the spray went away on the cars too far ahead to affect us. Clearly the car is identifying and reacting to spray. Normal rain performance was also immediate and used a perfect amount of wiper speed for the rain at hand.

The Bad

The summary is that anything besides full manual driving at night is currently impossible. The system is clearly only functional with the high beams on. While manually driving down a highway at night without even TACC engaged, and without high beams, every forward-facing camera is indicated as being blinded. It's a typical nighttime situation where, on a highway, the cars may be spaced an average of 1/8 mile apart, and you never have an occasion to turn your high beams on; otherwise you would be pulsing your beams every 8 seconds or so. Well, this is exactly the behavior that the car wants. As soon as it sees that there are no cars within about 300' of you (in either direction), it will activate the high beams. It's the kind of behavior that will absolutely result in a ticket from a state trooper out here in the midwest. And even when driving on portions of the highway alone, anything reflective like signs or barrels will cause the system to oscillate every 3 seconds or so. And whenever the car is forced by surrounding traffic to stay on low beams while AP/TACC is engaged, cameras start registering as "Blinded" after about 30 seconds. This portion of the drive did not coincide with the rain described above.

So yeah... AP/TACC is unusable at night.

The Ugly

The 75 MPH Autopilot speed cap does not allow the car to keep up with the flow of traffic out here; there are roads where that is the posted speed limit. This means that I end up spending a lot of time looking in mirrors to avoid getting into trouble with people moving at the pace of traffic. The DMS really does not like it when you do this. After some experimentation, we were able to determine that even when applying torque, looking in the mirror for too long will cause the car to go straight to the highest-level warning beep with no intermediate nags. Only applying torque in a different direction or looking ahead again will dismiss this warning. That being said, it does not appear that these warnings add a count toward AP jail. MCU usage for the same length of time does not result in this warning. Phone usage or perceived phone usage (I mimed it) also triggered this alert. It's strange that I can mess with the MCU indefinitely while applying torque, but certain ways of paying attention to traffic (that's what you're supposed to be doing, right?) are punished swiftly. I'm glad this is not our highway cruiser in its current state.
 
thought experiment: would you willingly give away any one of your 5 senses? do you think anyone is ever more capable in the world with FEWER senses than they were born with?

it's really simple. richer info is always good. always. it's 101-level stuff!
that's what gets me... somehow other manufacturers prefer an abundance of various sensor inputs to cover all cases and fuse the inputs.
The Mercedes EQS is expected to be fully L3 compliant with ultrasound, camera, Lidar and radar sensors.

the approach of giving up smell, hearing and tasting - because eyesight is "good enough" - isn't exactly convincing... yes, I can visually identify bulls**t but just to be sure smelling it helps....
 
Of course I know that. Do you understand what I am explaining? It doesn't matter whether you are using 8 or 12 cameras: you have business rules as an "application" and then you have your vision models. Now, do you understand why they are moving to Tesla Vision? ^^^ Master's in AI, concentration in computer vision. God, I feel like I am talking to people on the business side.
LOL,
video as a whole (.avi, .mp4, etc.) can't be processed by a CNN.
The fact that you think that there would be a processed video (i.e. in avi or mp4 format) anywhere in the FSD stack is really telling.

They are dealing in raw streams of frames (that is what a video is, after all); the NNs are running on each frame (i.e. image).
There is no need to muddy the waters here, and no need to throw around titles either.

Karpathy describes the architecture in pretty good detail at the 8:09:30 mark of the CVPR 2021 video:
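Roughly, the idea is per-frame (and per-camera) feature extraction followed by fusion across time. Here is a toy sketch of just that general shape, assuming PyTorch, with made-up layer sizes; this is not Tesla's actual network:

```python
import torch
import torch.nn as nn

class FrameBackbone(nn.Module):
    """Tiny stand-in for a per-frame CNN feature extractor."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):                    # x: (B, 3, H, W), one frame per item
        return self.proj(self.conv(x).flatten(1))

class TemporalFusion(nn.Module):
    """Per-frame features pushed through a recurrent module to fuse time."""
    def __init__(self, feat_dim=128, hidden=128):
        super().__init__()
        self.backbone = FrameBackbone(feat_dim)
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 4)     # placeholder output head

    def forward(self, clip):                 # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).view(B, T, -1)
        fused, _ = self.rnn(feats)           # information flows across frames
        return self.head(fused[:, -1])       # prediction from the latest step

out = TemporalFusion()(torch.randn(2, 8, 3, 96, 96))   # 2 clips of 8 frames
print(out.shape)                                        # torch.Size([2, 4])
```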
[Screenshot of the architecture slide from the talk]

Better video:
 
thought experiment: would you willingly give away any one of your 5 senses? do you think anyone is ever more capable in the world with FEWER senses than they were born with?

it's really simple. richer info is always good. always. it's 101-level stuff!

If your driving is assisted by your sense of taste, you're doing it wrong
 
What you’re saying still doesn’t make sense. Your second sentence is saying that yes, you can use videos as training data. Your first sentence is implying you can’t use videos as training data.

Again, what are you saying?

I can't believe you work in CV, because preprocessing training data is routine in the field. For the purposes of AVs, video data vs. image data implies that the labeled data moves across time rather than being static (as with images).

After all, NNs are simply data crunchers outputting predictions. As long as you provide NNs with clean labels and a structured bitstream, they will give you the right predictions (with a large enough dataset).
Nah, you are not understanding. Man, go get a master's and a PhD, then you will understand. My knowledge does not come from Medium articles. First off, try using a plain NN for object detection and see what happens. You have to use CNNs (HUGE difference). I can't get mad at you because, well, ignorance is bliss. CNNs only take in FRAMES. Go to graduate school to understand this.

What I said is that you can create a "labeling system" that takes video. Say you record with your iPhone, hit the pause button, and extract 1 second of data; that 1 second of data is equivalent to 20 images (or more, depending on your camera settings, e.g. 60 fps). Just like a flip book, get the correct images moving in the correct time sequence and you have video. Now, if you have an "auto-labeling system", say for example PowerAI, you can use video to label, but know that the instance you have labeled will be PARSED OUT INTO FRAMES (by the back-end code before it gets fed to any AI system)... that 1 second of video where you (or a model) highlighted bounding boxes around objects of interest will be split into frames, which do take time into account (frame_time1.jpg, frame_time2.jpg, frame_time3.jpg, ...).
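That parse-out step is straightforward; here is a minimal sketch assuming OpenCV, with placeholder file names (nothing to do with PowerAI's or Tesla's tooling):

```python
import cv2

cap = cv2.VideoCapture("labeled_clip.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 20.0            # fall back to 20 fps if unknown

frame_idx = 0
while frame_idx < int(fps):                        # roughly one second of video
    ok, frame = cap.read()
    if not ok:
        break
    t = frame_idx / fps                            # timestamp carried in the file name
    cv2.imwrite(f"frame_time{t:.3f}.jpg", frame)   # e.g. frame_time0.050.jpg
    frame_idx += 1
cap.release()
```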
 
that's what gets me... somehow other manufacturers prefer an abundance of various sensor inputs to cover all cases and fuse the inputs.

And yet all their systems still aren't as good as Tesla's was years ago.

Weird.

The Mercedes EQS is expected to be fully L3 compliant with ultrasound, camera, Lidar and radar sensors.

Audi's was expected to be L3 in like 2018. It never worked out.

No reason to think the EQS one will either until we see it (or see how limited the ODD is)


the approach of giving up smell, hearing and tasting - because eyesight is "good enough" - isn't exactly convincing... yes, I can visually identify bulls**t but just to be sure smelling it helps....


Did you watch the Karpathy video? He does a pretty good job showing all the situations where the low-res radar makes things worse, and removing it makes things better.
 
LOL,

The fact that you think that there would be a processed video (i.e. in avi or mp4 format) anywhere in the FSD stack is really telling.

They are dealing in raw streams of frames (that is what a video is, after all); the NNs are running on each frame (i.e. image).
There is no need to muddy the waters here, and no need to throw around titles either.

Karpathy describes the architecture in pretty good detail at the 8:09:30 mark of the CVPR 2021 video:

Correct, and where do you think the raw streams of frames are coming from? They are being parsed. The reference I made to video formats was me trying to explain things using formats you are likely familiar with. What was presented is what normally gets shown to people on the business side (and to customers).

Example of a real architecture for detection:
 

[Attachment: sample.png (example detection architecture), 149.4 KB]
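Since the attached diagram isn't reproduced here, a generic stand-in assuming torchvision: an off-the-shelf frame-level detector, not the poster's architecture.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# weights=None keeps the example self-contained (random weights, no download).
model = fasterrcnn_resnet50_fpn(weights=None).eval()

frame = torch.rand(3, 480, 640)                # one decoded camera frame in [0, 1]
with torch.no_grad():
    detections = model([frame])[0]             # dict with boxes, labels, scores
print(detections["boxes"].shape)               # (num_detections, 4)
```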
via a lousy LTE / 3G connection?
No.

The HW3 neural-net computer processes the video in the car for real-time driving.

Over WiFi, when parked, Tesla can have your car's raw video and its responses uploaded to the Tesla cloud. The video in the Tesla cloud is analyzed with supercomputers to generate a new neural net that is sent over the air (OTA) periodically to "improve" your Tesla's real-time driving. LTE and 3G are not used for real-time processing.
 
No.

The HW3 neural-net computer processes the video in the car for real-time driving.

Over WiFi, when parked, Tesla can have your car's raw video and its responses uploaded to the Tesla cloud. The video in the Tesla cloud is analyzed with supercomputers to generate a new neural net that is sent over the air (OTA) periodically to "improve" your Tesla's real-time driving. LTE and 3G are not used for real-time processing.
All he has to do is watch the short ~30-minute video, where all this is covered.
But then his fantasy will fall apart, and he will have nothing to Sh!tpost about....
texas_star_TM3: Please watch the video:
 
Nah, you are not understanding. Man, go get a master's and a PhD, then you will understand. My knowledge does not come from Medium articles. First off, try using a plain NN for object detection and see what happens. You have to use CNNs (HUGE difference). I can't get mad at you because, well, ignorance is bliss. CNNs only take in FRAMES. Go to graduate school to understand this.

What I said is that you can create a "labeling system" that takes video. Say you record with your iPhone, hit the pause button, and extract 1 second of data; that 1 second of data is equivalent to 20 images (or more, depending on your camera settings, e.g. 60 fps). Just like a flip book, get the correct images moving in the correct time sequence and you have video. Now, if you have an "auto-labeling system", say for example PowerAI, you can use video to label, but know that the instance you have labeled will be PARSED OUT INTO FRAMES (by the back-end code before it gets fed to any AI system)... that 1 second of video where you (or a model) highlighted bounding boxes around objects of interest will be split into frames, which do take time into account (frame_time1.jpg, frame_time2.jpg, frame_time3.jpg, ...).

Nah, you can have a master's and still be wrong. And yes, you are wrong.