Knightshade
Well-Known Member
Well, there are others pointing out that the recent presentations conflict with your assessment
Naah- there's one guy who doesn't appear to understand much about computers, either software or hardware, or Tesla's previously stated design intents either; he's just misunderstanding Karpathy is all.
I haven't followed both in detail so I won't jump into that (and for my point it doesn't matter which side is correct, given either version can satisfy SAE L4/L5). As for Tesla talking about redundancy early on, another comment pointed out Tesla also talked about how important radar was, and they have now removed it. Nothing is stopping them from changing their approach as they see fit.
Well, yes, one thing is stopping them- Nobody will approve a non-redundant L4/L5 system.
The fact that (nearly) everyone knows and understands this is why Tesla put so much effort into designing their hardware inherently around redundancy.
ARM big.LITTLE - Wikipedia
What you describe here sounds like a heterogeneous architecture like ARM's big.LITTLE. There are huge disadvantages to doing so (it makes the architecture hugely inflexible). You aren't necessarily saving money either, given running duplicate cores gives you economy of scale (which Tesla seemed to have achieved with HW3, I remember seeing the cost is down to $190, even less than the HW2.5 solution).
The cost savings is not in the design--- indeed, designing it as I describe would've been CHEAPER by a fair bit, since you wouldn't have needed nearly as much resource-wise on Node B.
The cost savings is in Moore's law- compute hardware gets cheaper over time, and HW3 went into production roughly two Moore's-law cycles after HW2 did.
(plus of course there's cost savings in-housing the thing in general vs having to pay a third party like Nvidia- and also from designing a specific compute solution instead of a general one)
But point being- there's no reason to design the system as they did OTHER than FULL redundancy.
If they didn't INTEND full redundancy, you'd either (if you wanted to go big.LITTLE) have different hardware on each node- which they don't.
Or, if they DID intend the nodes to share compute, you'd have gone with an entirely different design that didn't so heavily isolate the two nodes... because since they DID so heavily isolate them, they're now running into significant performance and programming inefficiencies trying to force them to share compute for the same stuff- the hardware was never meant to be used that way.
That's not an issue as long as either node can fall back to code that can achieve the "minimal risk condition". For example:
1) Node 1 crashes, Node 2 performs the steps necessary to reach minimal risk condition.
2) Node 2 crashes, Node 1 performs the steps necessary to reach minimal risk condition.
Running some of the regular driving code in both nodes does not conflict with this, given you have free computing resources available (because the other node only needs to perform the minimal-risk task in the event that one node crashes).
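The two failover cases above can be sketched as simple supervisory logic (names and structure are hypothetical, just to illustrate the idea of either node falling back to a minimal-risk routine- not Tesla's actual software):

```python
# Hypothetical sketch of two-node failover to a minimal risk condition (MRC).
# All names here are illustrative assumptions, not Tesla's real stack.

def supervise(node1_alive: bool, node2_alive: bool) -> str:
    """Decide the system's driving mode from node health."""
    if node1_alive and node2_alive:
        return "normal driving (workload split across both nodes)"
    if node1_alive or node2_alive:
        # The surviving node runs only the small MRC routine:
        # pull over / stop the car safely.
        return "minimal risk condition (surviving node stops the car)"
    return "total failure"

print(supervise(True, False))   # Node 2 crashed -> Node 1 reaches MRC
print(supervise(False, True))   # Node 1 crashed -> Node 2 reaches MRC
```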
Except that's a terrible user experience. Especially for something like robotaxis- or for when you tell your car to go somewhere without a human in it.
It means it'd fail-back to the "just stop" behavior vastly more often.
In full redundant mode if one node crashes, the car keeps driving itself just fine. No issues.
In "both nodes are needed to self drive at all" if one node (EITHER node) crashes- the self driving system fails and has to revert to stopping.
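To put rough numbers on why that difference matters: if each node independently crashes with some small probability p per trip, the two designs behave very differently (back-of-envelope only; p here is a made-up illustrative figure, not a real reliability number):

```python
# Back-of-envelope comparison of the two architectures.
# p is a MADE-UP per-trip crash probability for ONE node.
p = 0.01

# Fully redundant: self-driving is lost only if BOTH nodes crash.
p_fail_redundant = p * p

# Both-nodes-required: EITHER node crashing stops the car.
p_fail_shared = 1 - (1 - p) ** 2  # roughly 2p for small p

print(round(p_fail_redundant, 6))  # 0.0001 -> ~1 failure per 10,000 trips
print(round(p_fail_shared, 6))     # 0.0199 -> ~1 failure per 50 trips
```

Same hardware, same crash rate per node- but the both-nodes-required design fails roughly 200x more often, which is the "slamming on the brakes" scenario described above.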
Tesla would never have designed, or intended, a system that sucks that badly.
They're only using compute on node B for the driving computer stack because they are out of compute on A.
And (and this is the only thing here you could consider "speculation" but it's supported by literally everything everyone on the Tesla side has ever said about redundancy)- they won't roll out anything they consider genuinely self-driving until they have actual redundancy that doesn't have to keep slamming on the brakes every time ONE node crashes.
No, it won't, given the "fail safely" code is much less than the code required to drive the car in normal circumstances. Say for example it takes 10% of the resources, and let's assume what you say is true (Tesla is using 100% of resources of Node 1 already or close).
They're well past 100% on node A.
That's the problem.
Here's the scenarios, just to illustrate the idea (made up numbers just to illustrate the point):
1) Duplicate code:
Node 1 100% driving code = 100% utilization
Node 2 100% driving code = 100% utilization
Tesla is SOL and needs new HW
2) "fail safely" taking 10% of resources:
Node 1 90% original driving code + 10% failsafe code = 100% utilization
Node 2 10% driving code offloaded from Node 1 + 10% failsafe code = 20% utilization
Tesla now has 80% left on Node 2 to use for other things.
You can replace the 10% assumption with another, but you will still end up with a better situation than duplicating the code in both.
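The made-up numbers in the two scenarios above can be checked with trivial arithmetic (same illustrative assumptions: driving code takes 100% of one node, failsafe code takes 10%):

```python
# Illustrative compute-budget arithmetic using the post's made-up numbers.
NODE_CAPACITY = 100  # percent of one node's compute

# Scenario 1: full duplicate driving code on both nodes.
node1_s1 = 100                     # 100% driving code
node2_s1 = 100                     # 100% driving code (duplicate)
headroom_s1 = 2 * NODE_CAPACITY - (node1_s1 + node2_s1)

# Scenario 2: small failsafe on each node, driving code split.
failsafe = 10
node1_s2 = 90 + failsafe           # 90% driving + 10% failsafe = 100%
node2_s2 = 10 + failsafe           # 10% offloaded driving + 10% failsafe = 20%
headroom_s2 = NODE_CAPACITY - node2_s2

print(headroom_s1)  # 0  -> no spare compute anywhere, new HW needed
print(headroom_s2)  # 80 -> 80% of Node 2 left for other things
```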
You won't though- because of the situation I described above. EITHER node crashing takes the whole self-driving system down and it has to stop the car.
That's hilariously terrible and they're not gonna roll out a system like that to production/robotaxis.
Remember how much crap Waymo got for that ONE on-video failure with a traffic cone?
Youtube'd be flooded with "Check out my Tesla robotaxi slamming on the brakes in the middle of I-95 because the computer rebooted" videos.
Plus again since the HW was never designed to share compute like this, you're losing some capacity just doing that at all.
And lastly- since we (and Tesla) still don't know how much compute you ACTUALLY NEED to do L4 or L5, they may well not -have- an extra 10% on either node to spare (or whatever the fail-code needs).
That's the thing I mention where ideal case is they can get ONE spin-up of the stack CAPABLE of L4 or L5 running, non-redundant, using the entire compute of both A and B on HW3.
If they do, then they know they can just upgrade FSD owners to HW4 and they're all good for full redundancy.
The worst case is they run out of compute on both nodes and STILL haven't solved it. Because then they still have no answer for how much do they need, and they're back to square one where HW4 may or may not be enough- and it'd be SUPER useful for them to know that before they finalize HW5 which you know will be coming.
Also note, Node 1 likely only needs to have a low demand watchdog on Node 2 and won't need to run a full copy of the failsafe code, given Node 1 would have a bulk of the driving code, which likely can already perform the failsafe function just using existing code.
There is... a lot of guesses and likelies there...
But again this produces a garbage end-user experience if any node failing knocks you out of self driving in an allegedly L4/L5 car every time.
That EU document only mentions it in a very vague way: "put in place adequate design and redundancy to cope with these risk and hazards". But this conversation is talking specifically about regulations/specifications requiring two identical computers running identical code.
Well, that's where the goalposts moved to after one user who insisted that NOBODY requires ANY redundancy was corrected with the EU doc anyway.
But see above for why that is, effectively, going to BE a requirement for a product anybody will approve or want to use.
It's not just Tesla that knows (and has said) this-- Even the more advanced L2 systems out there do this now- Supercruise for example
The Caddy lead Super Cruise engineer said: "We have two central computing systems that are both running continuously so that if we have issues with one, we have the other as a backup."
Likewise Nvidia offers the AP2x, which is NOT fully HW redundant, and they only cert it for up to L2 driving aids... and they offer the AGX Pegasus, which they class for L4 solutions and is... two FULL SoCs and two FULL GPUs... to permit perfect redundancy, because "pulls over any time one node crashes" is a bad solution.
I see nothing so far that indicates that is a requirement in SAE's L4/L5 specification.
SAE is not a government regulatory body. They won't "require" anything, legally. They can't.
But commercially nobody is going to approve a robotaxi system that fails to MRC every time a single computer node crashes.
And I doubt the vast majority of governing bodies would approve such a system for regular consumer consumption either.
Again this won't prevent Tesla from (if they can do it within HW3s total compute available across both nodes plus the efficiency hit for running stuff that way) offering what is EFFECTIVELY an L4 or L5 system but they still only officially call it an L2 system and still require a driver (with driver monitoring).
Because in THAT case the human is there to ensure an MRC that doesn't suck nearly as badly as just stopping on the interstate.
At which point they'd wait to upgrade the driving computer before they just flipped a switch and made the same code "L4" or whatever once it could be redundant enough to keep safely driving even with a single node failure.