You can easily see the driver's hands manipulate the steering wheel multiple times on these turns.
I watched this on a big screen, and I can't really see this. Even at 5:05 (linked below), when the vehicle steers to the right, that seems like it's most likely FSD. It's hard to tell, for sure, but I feel that steering motion is somewhat unnatural for a human even when avoiding a turning vehicle (a human would just wait a bit longer, I think).


I can't explain the one where the car does not stop at the stop line. Might mean it's FSD Beta I guess, though I'd never stop at the stop line myself unless there were pedestrians. So hard to say.

In the end it's "hard to say." I couldn't see any very clear evidence one way or the other for manual driving vs. v11/"v11.5"/v12.

I think on balance I would support FSDb being in charge in most of these cases, with the support being:
1) Weird steering to avoid turning truck.
2) Creep behavior pretty unnatural for a human
3) Stopping at the stop line (as required by law) somewhat unnatural for a human

Crossing speed seems like it might be slightly improved (4 seconds???) but a little hard to say without going back to past videos. Also during development they might dial up the sliders for how aggressive it is. The extremely slow crossing speed (about 11-12mph is the max speed for a median stop) is a major shortcoming of FSDb currently.
 
2) Creep behavior pretty unnatural for a human
Seems they would stop at the stop line if visibility is good, otherwise make that full stop near/at the creep line. Should make NHTSA happy. Win/win.

The extremely slow crossing speed (about 11-12mph is the max speed for a median stop) is a major shortcoming of FSDb currently.

This is a head-scratcher for me. When it makes a right turn with oncoming traffic approaching from behind, the acceleration is quite brisk. So it's certainly capable. It should have that same acceleration on a UPL regardless of any following traffic. Heck, I'd prefer the brisk acceleration all of the time, or at least make it part of the "aggressive" option.
 
I’m not sure how you would distinguish driving from ADAS unless there was an intervention. Driver’s hands certainly should be at 9 and 3 or so consistently when FSD is engaged. Definitely there should be a LOT of clear manipulation of the steering wheel when the car is “driving!”

But anyway it could be manual control, I have no idea. It is hard to tell. Is there a specific time stamp? I did not watch in detail obviously.
If I remember correctly 4:50 is the stamp where you can easily see the driver manipulating the steering wheel. The car also takes off at speeds far greater than we have ever seen from FSD.
 
If I remember correctly 4:50 is the stamp where you can easily see the driver manipulating the steering wheel. The car also takes off at speeds far greater than we have ever seen from FSD.
Don't see anything indicating that. I mentioned the unnatural right-turn motion above, which occurs after your time stamp. Did not see excess speed in that case. Though of course the driver can always apply some extra acceleration without disengaging, so that doesn't tell us much one way or the other.
 
Some parts are the same but the problem to be solved is significantly different and more difficult in a fundamental way.

The LLMs get the bulk of their training from predicting words forward in text, and estimate an approximate world model in order to do so. The video vision systems do the same, predicting video forward in time given the automotive ego telemetry.

The LLMs are then tuned with directives to produce "nice" or "helpful" text according to human feedback, but their intended mode of operation, producing new text, is linked directly to their primary training objective: predict the probability distribution of the next token. Given a probability distribution you can stochastically sample from it, and that's what they do in a chatbot; predicting and generating text is its reason for being.
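To make that concrete, here's a toy sketch of the "predict a distribution, then sample it" loop. The tiny fake_model and six-word vocabulary are purely illustrative stand-ins, not anyone's actual system:

```python
import numpy as np

vocab = ["the", "car", "turns", "left", "right", "."]

def fake_model(context):
    # Stand-in for a real network: returns one logit (unnormalized score) per vocab entry.
    rng = np.random.default_rng(abs(hash(tuple(context))) % (2**32))
    return rng.normal(size=len(vocab))

def sample_next(context, temperature=1.0):
    logits = fake_model(context) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax -> probability distribution over the next token
    return np.random.choice(vocab, p=probs)   # stochastic sampling, as in a chatbot

context = ["the", "car"]
for _ in range(4):
    context.append(sample_next(context))
print(" ".join(context))
```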

But that equivalent doesn't give you a self-driving car at all, because the ultimate goal for the user is not to produce simulated video scenes but to drive a car. The task to be solved is much more remote from the ML training task. Of course you can include auto telemetry and input controls offset in time and predict video scenes a second or two in advance, but that still doesn't serve the ultimate goal.

It's much harder to enforce a human-directed desired policy. In an LLM, which understands human words up to some level of abstraction, you can literally tell it what you want in natural language. There is nothing like this for video scene prediction.

The problem needs to be set up entirely the other way around, and there is much less data for it: predict desired behavior given an analysis of video scenes, and then bound that with rules.
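One way to picture the "bound it with rules" part; just a sketch with made-up field names and limits, not any actual stack:

```python
# Hypothetical rule layer: the network proposes controls, hard rules clamp them.
def bound_controls(proposed, speed_limit_mps, max_steer_rad=0.5, max_accel=3.0, max_decel=-6.0):
    steer = max(-max_steer_rad, min(max_steer_rad, proposed["steer_rad"]))
    accel = max(max_decel, min(max_accel, proposed["accel_mps2"]))
    if proposed["target_speed_mps"] > speed_limit_mps:
        accel = min(accel, 0.0)   # never keep accelerating past the posted limit
    return {"steer_rad": steer, "accel_mps2": accel}

print(bound_controls({"steer_rad": 0.9, "accel_mps2": 5.0, "target_speed_mps": 30.0},
                     speed_limit_mps=25.0))
# -> {'steer_rad': 0.5, 'accel_mps2': 0.0}
```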
 
No one dies if the music sucks.
Nice, glib answer.

People walk, too: that's neural network city. We're pretty good at it, too. One hardly thinks about the muscles moving or how to take the steps; maybe a little more work on a woodland trail, where one has to (subliminally) track where each footstep, or the next three or four footsteps, is going to go. By and large, people can get where they're going that way.

Having said all that: people stumble, trip, and fall when walking about. It's not at all uncommon, especially for the young (who are learning) and the old (whose reflexes and such are suspect). Even people in their prime can and do get hurt.

It's easy to imagine a computer-based NN system walking; in fact, that's what those autonomous robots that Tesla is working on are doing. Clearly, they don't do it well right now. But there's no particular reason to believe that, eventually, on average, such a construct couldn't be better at it than a human.

So, in the not-too-distant future: strap an exoskeleton to an elderly, near wheelchair-bound person, with a battery and computer on the back; now said person can walk, with assistance, to the grocery store, get food, and return. It could even watch for cars when crossing roads. Do we throw out the technology because there's a once-in-XX chance that it'll make a mistake and injure the person so assisted? What if the technology is safer than attempting to run around in a motorized wheelchair?

I should note that these are not never-in-a-blue-moon rhetorical questions. There's at least one Japanese group I've heard of with an exoskeleton approach, although I think they had some serious computer hardware attached by wires. Still.

Your glib statement, I fear, is much like the argument against traveling by air. Air travel is demonstrably safer by orders of magnitude than driving around. Yet, airplanes do crash. Should we abandon air travel because crashes do occur? Be careful about your answer.
 
It's much harder to enforce a human-directed desired policy. In an LLM, which understands human words up to some level of abstraction, you can literally tell it what you want in natural language. There is nothing like this for video scene prediction.
ChatGPT's reinforcement learning from human feedback of "good" / "bad" / "better" rating of text responses seems like it should apply for end-to-end control as well. Tesla has human labeling as well as autolabeling that can find video examples of the fleet with desired control. Somewhat tricky is that driving control can be mostly good except for a brief moment of bad, so it's unclear whether the whole clip would be labeled bad or just the specific frames of bad control. But this type of mixed / fuzzy reinforcement is probably not too dissimilar to language models trained with a general notion that someone didn't like this particular response, and in aggregate, enough good examples will still lead to desired behaviors.
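For what it's worth, the preference-tuning step itself is simple to sketch. Here's a toy Bradley-Terry-style reward-model update on "preferred vs. rejected" clip embeddings; the names and sizes are made up and have nothing to do with Tesla's actual pipeline:

```python
import torch

# Toy reward model: scores a clip embedding; trained so human-preferred clips score higher.
reward_model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_loss(preferred, rejected):
    # Bradley-Terry style objective: maximize P(preferred scores higher than rejected).
    return -torch.nn.functional.logsigmoid(
        reward_model(preferred) - reward_model(rejected)).mean()

preferred = torch.randn(8, 128)   # embeddings of clips a human rated "better"
rejected = torch.randn(8, 128)    # embeddings of the alternatives rated "worse"
loss = preference_loss(preferred, rejected)
loss.backward()
opt.step()
print(float(loss))
```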
 
ChatGPT's reinforcement learning from human feedback of "good" / "bad" / "better" rating of text responses seems like it should apply for end-to-end control as well.
Yes it could---but the density of these valuable labels will be low, relative to the chatbots.

Tesla has human labeling as well as autolabeling that can find video examples of the fleet with desired control

"finding video examples of the fleet with desired control" is a major scientific/engineering problem once you go to end-to-end and don't build in a physics/robotics based control system like the previous one had. As I understand it now, the conventional (Karpathy-designed) system is heavy on the machine learning for perception (as it should be) with human and later automatic labeling of video patches for perception. They have tons of data for that. The auto-labeling works because you can jump forward in time and label an object from a condition where it is big and close and clear (low error rate) and transfer that reliable label back to when it was further away and fuzzy.

Conventionally the vehicle control is more physics based, i.e. they have state variables for all the Newtonian parameters (estimated position, velocity, steering angle, motor torque, vehicle turning torque, etc.), and then an optimization-based planner, with the allowable path estimated from perception constraining the solutions. They can find relevant examples by writing rules that filter on these human-understandable concepts for vehicle control and on human-labeled concepts for the situation.
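That rule-based mining is easy to picture: something like the filter below, run over logs of those human-understandable state variables. The field names and thresholds are invented; Tesla's actual log schema isn't public:

```python
# Toy filter over conventional-stack logs: keep unprotected left turns taken at
# speed, plus any hard-braking events. All fields/thresholds are illustrative.
def is_unprotected_left(log):
    return (log["maneuver"] == "left_turn"
            and not log["has_green_arrow"]
            and log["oncoming_traffic_present"])

def is_interesting(log):
    return (is_unprotected_left(log) and log["speed_mps"] > 5.0) or log["decel_mps2"] < -4.0

fleet_logs = [
    {"maneuver": "left_turn", "has_green_arrow": False, "oncoming_traffic_present": True,
     "speed_mps": 6.2, "decel_mps2": -1.0},
    {"maneuver": "straight", "has_green_arrow": False, "oncoming_traffic_present": False,
     "speed_mps": 20.0, "decel_mps2": -5.5},
]
print([log for log in fleet_logs if is_interesting(log)])   # both match one rule or the other
```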

With end-to-end, none of that is built in. So there is no obvious way to "find examples of the fleet with desired control", and no clear way to separate the various conditions (you need to overemphasize the tails that have risky/difficult situations) to begin with; it's a big grey neural goo with incomprehensible internal states, like wetware neurons in brains are.

Somewhat tricky is that driving control can be mostly good except for a brief moment of bad, so it's unclear whether the whole clip would be labeled bad or just the specific frames of bad control. But this type of mixed / fuzzy reinforcement is probably not too dissimilar to language models trained with a general notion that someone didn't like this particular response, and in aggregate, enough good examples will still lead to desired behaviors.
Sure, it will help, but I worry it will be too sparse (it's a "WTF" response from the parent when the teen driver screws up) to train an immense policy/perception network. The LLMs work because there is a giant amount of text available.

Training on the natural distribution (average driving), assuming all positive examples (everyone with a safety score > 85 or 90), where data will be numerous and easily collectible, will I think help get a good and natural high L2 assist, but the leap to L3/4 might be much more difficult because of the lack of internal understanding and the inability to find and validate many corner cases. They will have GPS information, I assume, and would be able to do simple filtering like "Am I on the freeway?" and "Which city/state am I in?", so they can do some very basic dataset balancing with rules, but not the fine-grained feedback the chatbots have. Specialized chatbots (e.g. medical/technical) are trainable because humans have already annotated and separated human knowledge and subjects (literally what libraries do), so we can feed in known, reliable, curated medical texts. There is nothing like that for driving, nor anything with the same density of important policy knowledge and directives. Language, like this text, has high density of useful semantics per bit. Video of cars driving has much lower density of important information per bit, because the bit rate of video is far higher and those extra bits are a burden.
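The "very basic dataset balancing with rules" could look roughly like this: bucket clips by coarse, GPS-derivable attributes and cap each bucket so freeway cruising doesn't swamp everything else. All fields and caps are made up for illustration:

```python
import random
from collections import defaultdict

def balance(clips, cap_per_bucket=2, seed=0):
    # Group clips by coarse attributes a rule can compute (road type, state),
    # then keep at most cap_per_bucket clips from each group.
    buckets = defaultdict(list)
    for clip in clips:
        buckets[(clip["road_type"], clip["state"])].append(clip)
    rng = random.Random(seed)
    balanced = []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        balanced.extend(bucket[:cap_per_bucket])
    return balanced

clips = [{"road_type": "freeway", "state": "CA"}] * 10 + [{"road_type": "city", "state": "NY"}] * 3
print(len(balance(clips)))   # 2 freeway/CA + 2 city/NY = 4
```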

Demonstrations that synthesize video forward in time (what we've seen so far) make cool demos, but they don't solve the driving problem at all. They might eventually compete with CGI-type image synthesis for simulated traffic, producing scenarios that humans can label and feed in as training examples.
 
Nice, glib answer.

People walk, too: that's neural network city. We're pretty good at it, too. One hardly thinks about the muscles moving or how to take the steps; maybe a little more work on a woodland trail, where one has to (subliminally) track where each footstep, or the next three or four footsteps, is going to go. By and large, people can get where they're going that way.

Having said all that: people stumble, trip, and fall when walking about. It's not at all uncommon, especially for the young (who are learning) and the old (whose reflexes and such are suspect). Even people in their prime can and do get hurt.
Are you seriously claiming that computers with neural networks work like human brains and have the same capabilities?

It's easy to imagine a computer-based NN system walking; in fact, that's what those autonomous robots that Tesla is working on are doing. Clearly, they don't do it well right now. But there's no particular reason to believe that, eventually, on average, such a construct couldn't be better at it than a human.

So, in the not-too-distant future: strap an exoskeleton to an elderly, near wheelchair-bound person, with a battery and computer on the back; now said person can walk, with assistance, to the grocery store, get food, and return. It could even watch for cars when crossing roads. Do we throw out the technology because there's a once-in-XX chance that it'll make a mistake and injure the person so assisted? What if the technology is safer than attempting to run around in a motorized wheelchair?
What about Iron Man suits? Why settle for walking in this fantasy future of yours?
Your glib statement, I fear, is much like the argument against traveling by air. Air travel is demonstrably safer by orders of magnitude than driving around. Yet, airplanes do crash. Should we abandon air travel because crashes do occur? Be careful about your answer.
If airplanes had the same risks and reliability as the Wright brothers' first planes, they wouldn't be very popular. That's where end-to-end machine learning using computer vision is today, and likely also in "a not too distant future". Autocomplete and driving have different requirements.
 
"finding video examples of the fleet with desired control" is a major scientific/engineering problem once you go to end-to-end and don't build in a physics/robotics based control system like the previous one had
Yeah, the existing autolabeling pipeline will likely be repurposed from generating explicit training targets to filtering / finding examples of desired control. For example, autolabeled "future" video of late-appearing stop signs after a curve or other occlusion could generally find situations where human drivers approach intersections more cautiously. This still builds on top of the existing perception capability for offline usage even if it's not needed at inference with end-to-end. Similarly, physics-based calculations can be used to find examples of good stopping or turning behavior to bias end-to-end to follow at certain follow distances or to drive in ways that tend to avoid unnecessary hard braking or aggressive turning.
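As a rough illustration of the late-appearing-stop-sign mining; the field names are invented, not Tesla's actual autolabel output:

```python
# Toy offline filter: using "future"-informed perception over a whole clip, find
# cases where a stop sign first became visible close to the intersection.
def first_visible_distance(clip):
    # Distance to the intersection at the first frame where the sign was detected.
    for frame in clip["frames"]:
        if frame["stop_sign_detected"]:
            return frame["dist_to_intersection_m"]
    return None

def is_late_appearing_stop(clip, threshold_m=40.0):
    d = first_visible_distance(clip)
    return d is not None and d < threshold_m

clip = {"frames": [
    {"stop_sign_detected": False, "dist_to_intersection_m": 80.0},
    {"stop_sign_detected": True,  "dist_to_intersection_m": 25.0},   # seen late, after the curve
]}
print(is_late_appearing_stop(clip))   # True -> candidate "cautious approach" example
```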

With enough examples, the end-to-end system doesn't need explicit runtime perception of blind curves or of sudden slowdowns in adjacent lanes in order to slow down sooner. Potentially, given enough video, the general world model will capture some aspects of these concepts even without the finetuning step for training a control head.

Sure, it will help, but I worry it will be too sparse (it's a "WTF" response from the parent when the teen driver screws up) to train an immense policy/perception network. The LLMs work because there is a giant amount of text available.
From Tesla's CVPR presentations about the general world model / vision foundation model, these seem to be trained generally, not for specific control behaviors, so they can use the giant amount of video available to Tesla without the extra step of preprocessing which videos to use. The example of predicting future video based on past video seems like a relatively straightforward self-supervised "pre-training" step for developing a general world model. I believe that for language models as well, the pre-training of the foundation model uses multiple orders of magnitude more data than the finetuning step that biases responses, or in this case control.
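That pre-training step really is as plain as "past frames in, next frame out". A toy PyTorch version, with a trivial model and random 32x32 "frames" standing in for a real world model and real video:

```python
import torch

# Self-supervised next-frame prediction: no labels, the next frame itself is the target.
model = torch.nn.Sequential(
    torch.nn.Linear(32 * 32, 256), torch.nn.ReLU(), torch.nn.Linear(256, 32 * 32))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

video = torch.rand(16, 32 * 32)          # 16 consecutive (flattened) frames
past, future = video[:-1], video[1:]     # input: frame t, target: frame t+1

for _ in range(5):                       # a few gradient steps on the prediction loss
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(past), future)
    loss.backward()
    opt.step()
print(float(loss))
```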
 
Are you seriously claiming that computers with neural networks work like human brains and have the same capabilities?


What about Iron Man suits? Why settle for walking in this fantasy future of yours?

If airplanes had the same risks and reliability as the Wright brothers' first planes, they wouldn't be very popular. That's where end-to-end machine learning using computer vision is today, and likely also in "a not too distant future". Autocomplete and driving have different requirements.
Um.

Look: We run on neural networks. We are not, except by emulation, a von Neumann architecture. We're massively parallel processing neural networks running on (for a processing unit) dead-slow chemical processes, put together by chance and Nature's bloody tooth and claw into something reasonably efficient. How neurons actually work was figured out in, well, my lifetime, at least. Research on how nerves, brain cells, and the like work continues, of course, but the bloody building blocks are known.

And, once How Neurons Work was figured out, bright people said, more or less, "Well, this works for living things. How's about we take a look at the kind of problems it can solve?"

And, just like practically everything else in engineering and mathematics, it turns out that approaching certain problems with neural networks results in much faster solutions. Image recognition. Object recognition. In their own weird way, neural networks act almost like analog computers, with fixed and ridiculously fast input-to-output propagation delays. In fact, from reading the popular engineering literature, it's clear that the people investigating what NNs were capable of, and how, were starting with problems that the human brain was known to solve efficiently, but, at the start, nobody had a clear idea of how.

We now have a much better idea of how NNs work in living things. And those ideas have clearly been transferred to how humans can make NNs that have desired processing characteristics. When one listens to hardware researchers talking about what NNs are doing in their hardware, it's clear that there's a foundation of biological methods to their madness. And probably the reverse, with biological NN researchers getting insight from the hardware types.

As far as integrating NNs into hardware: as these things go, as fancy a job as is currently being done, these are the Early Days.

As far as Iron Man suits go: they run on unobtainium. Real, live exoskeletons are commercially available. Extending one of the latter with a driving computer off of a Tesla, like what they're using with the Optimus robots over at Tesla, is hardly a stretch.

As far as the Wright Brothers and early fliers go: The math of the time was barely up to getting a machine up into the air. As you may recall, the Wright Brothers approached the problem with one heck of a lot of guided cut-and-try and the math of the day. Even with all that, they had to invent their own ICE that was light and powerful enough to do the job. And, even after their first successful flight, it was still dangerous as all get-out. Research project, yes. Not for the general public, you betcha. But everybody involved (well, all except the Old Geezers who'd start things with, "Back in My Day...") saw the capabilities and were All For It. It took from 1903 until some time after WW1 before commercial flights began.

We have some notable advantages over the Wrights, not the least of which is a heck of a lot more computational ability with much better math and simulation.

So, what's your point?
 
Um.

Look: We run on neural networks. We are not, except by emulation, a von Neumann architecture. We're massively parallel processing neural networks running on (for a processing unit) dead-slow chemical processes, put together by chance and Nature's bloody tooth and claw into something reasonably efficient. How neurons actually work was figured out in, well, my lifetime, at least. Research on how nerves, brain cells, and the like work continues, of course, but the bloody building blocks are known.
Today's computer architectures are a crude and incorrect guess at how the brain works. Just because the computer architecture shares a name with the brain doesn't make it the same... They share some theories at best.

We really don't understand how the brain works, other than at a very high level. If you know exactly how the brain works, feel free to solve ADHD, Asperger's, blindness, etc. You'll get a Nobel prize.

Have you heard of Moravec's Paradox? Explain why it exists please.

While you're at it, please explain the differences between the capability of the human brain and the computer architectures in modern ML to start with and then dive into the not so subtle differences in how they learn, and explain why the brain is so efficient both in terms of the data needed and the energy needed for training.
So, what's your point?
That you are randomly jumping to conclusions and rambling about exoskeletons, which have absolutely nothing to do with either the brain or ML.
 
Your glib statement, I fear, is much like the argument against traveling by air. Air travel is demonstrably safer by orders of magnitude than driving around. Yet, airplanes do crash. Should we abandon air travel because crashes do occur? Be careful about your answer.
Air travel is safer per mile travelled, but about the same as auto travel per hour in the seat. I found that interesting.