FSD v12.x (end to end AI)

diplomat33 · Dec 31, 2023

Supcom said:
It's actually appropriate. Unless one of the two stacks crashes or has a detectable failure, how do you determine which of the two is wrong? Maybe they are both acceptable, in different ways. Or, maybe they are both wrong.

Yes, the idea is to prevent the entire system from failing if one stack crashes or has a detectable failure. If one stack crashes, the other stack can take over. I believe Tesla even had plans to do this by having FSD beta run in parallel on two separate chips so that if one chip failed, FSD beta running on the other chip could continue operating the vehicle. It just makes sense to have a backup if your system crashes or has a detectable failure.

If both stacks offer different outputs that might both be correct, there are ways to judge. For example, Mobileye uses RSS to set safety parameters. So if one stack proposes a path that violates your safety rules and the other stack proposes a path that does not violate your safety rules, you go with the path that does not violate your safety rules. And if both stacks propose paths that don't violate your safety rules, you go with the path that is more efficient. The whole point is to increase safety. With one stack, if it proposes an unsafe path, your system fails every time. With two stacks, if one stack proposes an unsafe path, your system will no longer fail if the other stack proposes a safe path. Your system will only fail when both stacks propose an unsafe path, which will happen less often.

The problem with no redundancy is that, yes, you avoid the problem of having to figure out which is wrong, but when something goes wrong, there is no fail safe to correct the mistake. You are entirely dependent on that one system being reliable enough. If the single system cannot achieve the reliability you need for safety, then you are out of luck. And AVs need to be incredibly reliable, on the order of 5-10x safer than humans. IMO, you are unlikely to achieve that level of safety with just one single stack, with no redundancy.

Supcom said:
The two watch issue comes about because the wearer has no way to determine which of the two watches is closest to the 'correct' time, so is always unsure which to use.

You set both watches to the correct time at first. Then if one watch breaks, you have another watch to tell time.

drtimhill · Dec 31, 2023

diplomat33 said:
Thanks. What you call "unified NN" is what I call "end-to-end". I define "end-to-end" as a specific NN architecture where there is just one single NN with video in and control out. The literature that I have read describes the single NN, video in and control out as "end-to-end". What you call "end-to-end" is what I call "modular NN", where it is all nets from start to finish but split into separate NN, like one NN for perception and one for planning/control.

Agreed, I'm not saying my terms are better or even correct, but I don't want people to assume "end to end" means a single camera-to-control NN, which (as I explained) I consider unlikely.

drtimhill · Dec 31, 2023

diplomat33 said:
Yes, the idea is to prevent the entire system from failing if one stack crashes or has a detectable failure. If one stack crashes, the other stack can take over. I believe Tesla even had plans to do this by having FSD beta run in parallel on two separate chips so that if one chip failed, FSD beta running on the other chip could continue operating the vehicle. It just makes sense to have a backup if your system crashes or has a detectable failure.

If both stacks offer different outputs that might both be correct, there are ways to judge. For example, Mobileye uses RSS to set safety parameters. So if one stack proposes a path that violates your safety rules and the other stack proposes a path that does not violate your safety rules, you go with the path that does not violate your safety rules. And if both stacks propose paths that don't violate your safety rules, you go with the path that is more efficient. The whole point is to increase safety. With one stack, if it proposes an unsafe path, your system fails every time. With two stacks, if one stack proposes an unsafe path, your system does not fail if the other stack proposes a safe path. Your system will only fail when both stacks propose an unsafe path, which will happen less often.

The problem with no redundancy is that, yes, you avoid the problem of having to figure out which is wrong, but when something goes wrong, there is no fail safe to correct the mistake. You are entirely dependent on that one system being reliable enough. If the single system cannot achieve the reliability you need for safety, then you are out of luck. And AVs need to be incredibly reliable, on the order of 5-10x safer than humans. IMO, you are unlikely to achieve that level of safety with just one single stack, with no redundancy.

This kind of redundancy only really protects against hardware failures or (less often) hard crashes. In theory, if the stacks are the same, then BOTH will crash given the same inputs. In practice, small differences in timing might keep that from happening. Dual (identical) stacks offer no protection against coding (or training) errors, since they will both generate the same erroneous results. For true protection, you need different stacks that are designed to do the same thing using different approaches, but this creates issues when comparing outputs (especially when NNs are involved).

Mardak · Dec 31, 2023

drtimhill said:
it's unlikely that they are running two stacks like that

Like which? To have the old perception networks only for visualization purposes during this transition? Why do you think the framerate halved to 10fps, and do you think control is similarly slow or is it deciding faster, closer to 36fps?

This scheduling/utilization visualization from AI Day 2022 shows Lanes and Moving objects and Traffic controls could be kept but run say half as often to free up compute for a separate end-to-end network without dependency on existing perception. As you suggest, HW3 is probably close to its limits with existing networks, so something needs to be traded off.

[Attached image: ai-day-2022-fsd-networks-in-car.png]

sleepydoc · Dec 31, 2023

EVNow said:
Obviously we need NN for some stuff. The question really is - are heuristics needed at all or can NN do everything.

Specifically planning & control - can NN do it alone.

There are some aspects of driving for which heuristics fit quite well. It’s more a matter of how to integrate them with the other techniques, IMO

drtimhill · Dec 31, 2023

Mardak said:
Like which? To have the old perception networks only for visualization purposes during this transition? Why do you think the framerate halved to 10fps, and do you think control is similarly slow or is it deciding faster, closer to 36fps?

This scheduling/utilization visualization from AI Day 2022 shows Lanes and Moving objects and Traffic controls could be kept but run say half as often to free up compute for a separate end-to-end network without dependency on existing perception. As you suggest, HW3 is probably close to its limits with existing networks, so something needs to be traded off.

Which is an argument in favor of the cascaded stacks, which was my point. If you have a pure end-to-end camera-to-control stack, then any visualization stack is pure overhead, something that they can't really afford any more. So you either have a VERY poor visualization stack, with all that implies, no visualization AT ALL, or use a combined back-end that can act as both the root data for visualization AND as input to the control stack.

(There is of course the issue of photon-to-action time, but that's not directly related to the depth of the stack.)

stopcrazypp · Dec 31, 2023

Bitdepth said:
I’m sorry, that’s too much information. How will they know which one is more accurate? They should use only one. Lol

It's actually trivial when you have 3, just have majority rules. That's why a lot of systems use triple redundancy. It's dead simple.

2 only works in certain logic cases, for example both must agree (accelerator is an example, if both methods don't agree, then don't apply accelerator).

However, I think the point is that this doesn't work in this example. You don't want the car to just disable itself when both methods happen to disagree. In fact, you expect them to disagree frequently. So you need to set a priority, and that complicates things much more.
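To make the two cases concrete, here is a minimal Python sketch. The function names and the tolerance value are made up for illustration and are not taken from any real vehicle code. `majority_vote` is the triple-redundancy case; `dual_agree` is the two-channel case, where the only options are "both agree" or "do nothing".

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of channels, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def dual_agree(a: float, b: float, tolerance: float = 0.05) -> Optional[float]:
    """Two-channel check: pass a value through only if both channels agree.

    Returns None (meaning "do not act") on disagreement. Two channels can
    detect that something is wrong, but not which of them is wrong.
    The tolerance is an arbitrary illustrative number.
    """
    return min(a, b) if abs(a - b) <= tolerance else None


if __name__ == "__main__":
    print(majority_vote(["left", "left", "right"]))  # two of three agree
    print(majority_vote(["left", "right", "stop"]))  # no majority
    print(dual_agree(0.30, 0.32))                    # channels agree
    print(dual_agree(0.30, 0.80))                    # channels disagree
```

Note that `dual_agree` falls back to "do nothing", which is acceptable for an accelerator pedal but not for a planner that is expected to disagree with its twin frequently.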

diplomat33 · Dec 31, 2023

Supcom said:
Only setting a hypothetical where the output is processed data, not a recording straight from a sensor.

In the case of true end-to-end (i.e., video in, control out), the NN is getting raw input directly from a sensor. In the case of a modular design where the planning NN is getting input from a perception stack, yes, it would be getting processed data. But that is why it is key to make the perception stack as accurate and reliable as possible, to make sure that the planner is getting the best data.

Supcom said:
BTW, there are many workplaces where audio/video recording is forbidden.

Audio recordings are allowed in my workplace for this meeting.

mongo · Jan 1, 2024

diplomat33 said:
So if one stack proposes a path that violates your safety rules and the other stack proposes a path that does not violate your safety rules, you go with the path that does not violate your safety rules. And if both stacks propose paths that don't violate your safety rules, you go with the path that is more efficient. The whole point is to increase safety. With one stack, if it proposes an unsafe path, your system fails every time. With two stacks, if one stack proposes an unsafe path, your system will no longer fail if the other stack proposes a safe path. Your system will only fail when both stacks propose an unsafe path, which will happen less often.

That scenario is based on the premise of a 100% infallible safety rule filter module. Where is it getting its data from, and how is it so reliable?

I feel like most of your examples have been dependent on a high level arbiter with better knowledge and judgment than the rest of the system...

diplomat33 said:
You set both watches to the correct time at first. Then if one watch breaks, you have another watch to tell time.

And then they drift...
Again, you are using an externally obvious "break" criterion, not an inaccuracy with no absolute reference scenario.

diplomat33 · Jan 1, 2024

mongo said:
That scenario is based on the premise of a 100% infallible safety rule filter module. Where is it getting its data from, and how is it so reliable?

I feel like most of your examples have been dependent on a high level arbiter with better knowledge and judgment than the rest of the system...

Nothing is 100% infallible. You try to make your system as internally reliable as possible and then you add redundancy to help make it more reliable. I am simply saying that redundancy makes the system MORE reliable, not that it makes it 100% reliable. Redundancy will increase reliability to help you get to your deployment goals. The goal is to be "good enough" and "safe enough" to deploy. So in the case of building autonomous driving, you build hardware and software as robust as possible. They will never be 100%. But you add redundancy to make them more reliable, which makes it easier to deploy an AV that is safe enough. Put simply, all things being equal, which robotaxi would you ride in: the robotaxi that uses just one stack and is 2x safer than humans, or the robotaxi that uses redundant systems and is 10x safer than humans? I know I would pick the robotaxi that is 10x safer.

mongo said:
And then they drift...
Again, you are using an externally obvious "break" criterion, not an inaccuracy with no absolute reference scenario.

Of course they drift. But that is why we design watches to drift as little as possible. We used to have watches that had to be wound up manually and they drifted a lot. Then we developed digital watches that drifted a lot less. Now, we have atomic clocks that have infinitesimally small drift. Atomic clocks only lose about 1 second every 100M years. So yes, they still drift too, but the drift is so small that it does not matter. Since it would take longer than our entire civilization will probably exist for that drift to add up, we can use atomic clocks as an absolute reference.

Using two watches that drift does not mean that you will have perfect time. Yes, if one watch breaks, the other watch will still give you imperfect time. It simply means that if one watch breaks, you will still have a watch to tell time rather than not know what time it is at all. It is better to have imperfect time than to not know the time at all. In the case of AVs, it is better for the AV to still be able to drive if one system breaks than to crash or be stranded. You don't want a situation where if one system fails/crashes, the entire AV fails and has to stop. The AV will of course still fail and have to stop but you want to make it as rare as possible. Redundancy will help do that.

mongo · Jan 1, 2024

diplomat33 said:
Nothing is 100% infallible. You try to make your system as internally reliable as possible and then you add redundancy to help make it more reliable. I am simply saying that redundancy makes the system MORE reliable, not that it makes it 100% reliable. Redundancy will increase reliability to help you get to your deployment goals. The goal is to be "good enough" and "safe enough" to deploy. So in the case of building autonomous driving, you build hardware and software as robust as possible. They will never be 100%. But you add redundancy to make them more reliable, which makes it easier to deploy an AV that is safe enough. Put simply, all things being equal, which robotaxi would you ride in: the robotaxi that uses just one stack and is 2x safer than humans, or the robotaxi that uses redundant systems and is 10x safer than humans? I know I would pick the robotaxi that is 10x safer.

Except your examples of redundancy aren't. You talk of two data streams that feed something that can tell which is right and which is wrong. Akin to a variation of the halting problem.

You're also setting up a contrived scenario of a multi-redundant system being 5x safer purely due to its topology without ever explaining how that realistically makes it safer (while also ignoring the increase in false positives).

diplomat33 said:
Of course they drift. But that is why we design watches to drift as little as possible. We used to have watches that had to be wound up manually and they drifted a lot. Then we developed digital watches that drifted a lot less. Now, we have atomic clocks that have infinitesimally small drift. Atomic clocks only lose about 1 second every 100M years. So yes, they still drift too, but the drift is so small that it does not matter. Since it would take longer than our entire civilization will probably exist for that drift to add up, we can use atomic clocks as an absolute reference.

Again, you are missing the point. The original statement about watches is that given only two (not clearly invalid) data points, there is no way to judge which is correct (not that the still-working watch is necessarily correct, mind you).
Saying your watch is super accurate and only hard fails sets up the parallel that my NN stack is super accurate and only hard fails, in which case it is also super safe, especially if I have two of them and I just don't use the failed one's data.
But that's not reality. Reality is some false positive and false negative rate for any system and, if one only has two data sources, no way to accurately choose which of the two doors has the lion behind it.

diplomat33 · Jan 1, 2024

mongo said:
Except your examples of redundancy aren't. You talk of two data streams that feed something that can tell which is right and which is wrong. Akin to a variation of the halting problem.

You're also setting up a contrived scenario of a multi-redundant system being 5x safer purely due to its topology without ever explaining how that realistically makes it safer (while also ignoring the increase in false positives).

Again, you are missing the point. The original statement about watches is that given only two (not clearly invalid) data points, there is no way to judge which is correct (not that the still-working watch is necessarily correct, mind you).
Saying your watch is super accurate and only hard fails sets up the parallel that my NN stack is super accurate and only hard fails, in which case it is also super safe, especially if I have two of them and I just don't use the failed one's data.
But that's not reality. Reality is some false positive and false negative rate for any system and, if one only has two data sources, no way to accurately choose which of the two doors has the lion behind it.

I am not suggesting two data sources. Both stacks would have the same data source since they would both get the same raw data from the sensors. They would simply process the same data differently. And I explained how you tell which one is wrong. I said you use safety rules to check the final output. So you check the control output of each stack and see if there is a safety violation:
1) If one stack has a safety violation and the other does not, you select the control output that does not produce a safety violation.
2) If neither stack produces a safety violation then you select the control output that is more efficient.
3) If both stacks produce a safety violation, then you stop the AV or pull over depending on the specifics of the safety violation.
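The three rules can be written down as a small Python sketch. Everything here is hypothetical and only for illustration: the `Plan` fields, the 2-meter gap in `simple_rule`, and the use of travel time as the "efficiency" measure are my assumptions, not anything Mobileye or Tesla has published. The sketch also assumes the safety check itself is correct, which is the assumption being questioned in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Plan:
    name: str
    min_gap_m: float      # hypothetical: closest distance to any obstacle
    travel_time_s: float  # hypothetical efficiency measure, lower is better


def arbitrate(a: Plan, b: Plan, is_safe: Callable[[Plan], bool]) -> Optional[Plan]:
    """Pick between two proposed plans; None means stop or pull over."""
    safe = [p for p in (a, b) if is_safe(p)]
    if not safe:
        return None  # rule 3: both plans violate the safety rules
    # Rule 1 (only one safe plan) and rule 2 (both safe, take the more
    # efficient one) collapse into the same line.
    return min(safe, key=lambda p: p.travel_time_s)


def simple_rule(plan: Plan) -> bool:
    """Stand-in safety rule: keep at least 2 m from any obstacle."""
    return plan.min_gap_m >= 2.0


if __name__ == "__main__":
    fast_but_close = Plan("stack A", min_gap_m=0.5, travel_time_s=10.0)
    slow_but_clear = Plan("stack B", min_gap_m=3.0, travel_time_s=12.0)
    print(arbitrate(fast_but_close, slow_but_clear, simple_rule))
```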

The idea is that, depending on how you build each stack, since the two stacks are processing the same sensor input differently, the safety violations will likely occur at different times. So the chance of both stacks producing a safety violation at the same time will be less than the chance of just one stack producing a safety violation. So the AV will fail, i.e. have to stop or pull over, less often.
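As a rough worked version of that claim, let A and B be the events that stack 1 and stack 2 each propose an unsafe path at a given moment. The failure rates below are invented round numbers, and the safety check is assumed to be perfect:

```latex
\[
P(A \cap B) = P(A)\,P(B \mid A) \le \min\bigl(P(A),\, P(B)\bigr)
\]
\[
\text{independent stacks: } P(A \cap B) = P(A)\,P(B), \quad
\text{e.g. } 10^{-3} \times 10^{-3} = 10^{-6}
\]
\[
\text{identical stacks: } P(B \mid A) = 1 \;\Rightarrow\; P(A \cap B) = P(A)
\]
```

So the benefit depends entirely on how correlated the two stacks' errors are: the more alike the stacks, the closer you are to the last line, where the second stack adds nothing.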

drtimhill · Jan 1, 2024

stopcrazypp said:
It's actually trivial when you have 3, just have majority rules. That's why a lot of systems use triple redundancy. It's dead simple.

Not really that simple. First, if you simply replicate the code and run it three times then you are only really protected against hardware failures, since the same code will always generate the same output given the same input (though there are environment variables that can make this tricky). Bugs in the code will cause ALL THREE to agree on the WRONG output, so your redundancy buys you nothing at all in this case. The only way to guard against this kind of cloning issue is to design three DIFFERENT stacks that solve the same problem using different code (this is analogous to the dual sensors on drive by wire accelerators). But this is of course very expensive and also very hard to do. Worse, in many cases you end up with all three stacks generating different results, and your voting goes out the window.

Finally, there is also the issue of the code that compares the output and makes the choice of which stack to believe. THIS code is NOT redundant, and is a single point of failure. And in practice this code is not as trivial as you might imagine, since the output being compared can be complex.
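A toy Python sketch of the cloning problem, with a deliberately planted bug. The functions and the speed-limit example are invented purely for illustration. Three copies of the same buggy code vote unanimously for the wrong answer, while a diverse set outvotes the bug, and three-way disagreement leaves the voter with nothing.

```python
from collections import Counter
from typing import Callable, List, Optional


def vote(outputs: List[int]) -> Optional[int]:
    """Strict majority vote; None when no value has a majority."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def speed_limit_buggy(sign_mph: int) -> int:
    # Planted bug for illustration: one input is read incorrectly.
    return 75 if sign_mph == 65 else sign_mph


def speed_limit_alt(sign_mph: int) -> int:
    # Hypothetical independently written version without that bug.
    return sign_mph


def run(channels: List[Callable[[int], int]], sign_mph: int) -> Optional[int]:
    """Feed the same input to every channel and vote on the outputs."""
    return vote([channel(sign_mph) for channel in channels])


cloned = [speed_limit_buggy] * 3
diverse = [speed_limit_buggy, speed_limit_alt, speed_limit_alt]

if __name__ == "__main__":
    print(run(cloned, 65))   # all three clones agree on the wrong value
    print(run(diverse, 65))  # the two correct channels outvote the bug
    print(vote([1, 2, 3]))   # three different answers: no majority
```

Note that `vote` itself runs once, outside the redundant channels, so it is exactly the single point of failure described above.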

Mardak · Jan 1, 2024

drtimhill said:
There is of course the issue of photon-to-action time, but that's not directly related to the depth of the stack.

If you're referring to depth of the stack as how long it takes a sequence of modules to process inputs to perception to control along with all its data shuffling, the framerate and time from photon to action are both heavily tied to the slowest path. Extending the stack with a new control neural network will probably make things take longer and slow down the framerate, but Elon Musk said during the August livestream that end-to-end is faster on HW3:

The pure AI version runs faster than the version that is a mixture of normal software and AI. In fact it would run it faster than 36 frames per second except the cameras are currently only capable of 36 FPS. Our current back-of-the-envelope frame number is we think it could probably run 50 frames a second.

If the end-to-end network is able to run faster without depending on existing networks that are slower, there is actually excess compute to run the old perception, potentially completely on a separate SoC in parallel, to provide visualization context at a lower framerate during this transition.
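For reference, the frame rates mentioned in this exchange translate into the following per-frame time budgets. This is just T = 1/f, and it only says how often a new output appears:

```latex
\[
T = \frac{1}{f}: \qquad
\frac{1}{10\ \text{fps}} = 100\ \text{ms}, \quad
\frac{1}{36\ \text{fps}} \approx 27.8\ \text{ms}, \quad
\frac{1}{50\ \text{fps}} = 20\ \text{ms}
\]
```

Photon-to-action time is a separate number. In a pipelined stack it can be longer than one frame period, since a frame may still be working its way through later stages while the next one is already being captured.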

kpanda17 · Jan 1, 2024

I’m feeling v12 will be pushed to us in the USA this month

sleepydoc · Jan 1, 2024

Supcom said:
Or, you have two people taking notes. One writes, "The CEO wants donuts, not bagels, at the next meeting." The other writes, "The CEO wants bagels, not donuts, at the next meeting." You weren't at the meeting but are responsible for refreshments. Which is correct? Should you start looking for a new job now?

No, you just bring bagels AND donuts!

Bitdepth · Jan 1, 2024

Seems we're on a redundancy train, so here is a good read on redundancy. It is a well understood problem and there are various ways to achieve it. This is the process NASA used in designing the redundancy systems of the space shuttle.

Computers in Spaceflight: The NASA Experience

- Chapter Four -- Computers in the Space Shuttle Avionics System -

Computer synchronization and redundancy management

One key goal shaping the design of the Shuttle was "autonomy." Multiple missions might be in space at the same time, and large crews, many with nonpilot passengers, were to travel in space in craft much more self-sufficient than ever before. These circumstances, the desire for swift turnaround time between launches, and the need to sustain mission success through several levels of component failure meant that the Shuttle had to incorporate a large measure of fault tolerance in its design. As a result, NASA could do what would have been unthinkable 20 years earlier: put men on the Shuttle's first test flight. The key factor in enabling NASA to take such a risk was the redundancy built into the orbiter [60]. Fault tolerance on the Shuttle is achieved through a combination of redundancy and backup. Its five general-purpose computers have reliability through redundancy, rather than the expensive quality control employed in the Apollo program [61]. Four of the computers, each loaded with identical software, operate in what is termed the "redundant set" during critical mission phases such as ascent and descent. The fifth, since it only contains software to accomplish a "no frills" ascent and descent, is a backup. The four actuators that drive the hydraulics at each of the aerodynamic surfaces are also redundant, as are the pairs of computers that control each of the three main engines. Management of redundancy raised several difficult questions. How are failures detected and certified? Should the system be static or dynamic? Should the computers run separately without communication and be used to replace the primary computer one by one as failures occur? Could the computers, if running together, stay in step? Should redundancy management of the actuators be at the computer or subsystem level? Fortunately, NASA experience on other aircraft and spacecraft programs could provide data for making the final decisions.

Redundant Precursors

Several systems that incorporated redundancy preceded the Shuttle. The computer used in the Saturn booster instrument unit that contained the rocket's guidance system used triple modular redundant (TMR) circuits, which means that there was one computer with redundant components. Disadvantages to using such circuits in larger computers are that they are expensive to produce, and an event such as the explosion on Apollo 13 could damage enough of the computer that it ceases to function. By spreading redundancy among several simplex circuit computers scattered in various parts of the spacecraft, the effects of such catastrophic failures are minimized [62]. Skylab's two computers each could perform all the functions required on its mission. If one failed, the other would automatically take over, but both computers were not up and running simultaneously. The computer taking over would have to find out where the other had left off by using the contents of the 64-bit transfer register located in the common section built with TMR circuits. The Skylab computers were able to have such a relatively leisurely switch-over system because they were not responsible for navigation or high-frequency flight control functions. If there were a failure, it would be possible for the Skylab to drift in its attitude without serious danger; the Shuttle would have no such margin of safety.

Figure 4-3. The F-8 aircraft that proved the redundant set configuration planned for the Shuttle would work. (NASA photo ECN-6988)

The need for the redundant computers on the Shuttle to process information simultaneously, while still staying closely synchronized for rapid switch-over, seriously challenged the designers of the system. Such a close synchronization between computers had not been done before, and its feasibility would have to be proven before NASA could make a full commitment to a particular design. Most of the necessary confidence resulted from a digital fly-by-wire testing program NASA started at the Dryden Flight Research Center in the early 1970s [63]. The first computer used in the F-8 "Crusader" aircraft chosen for the program was a surplus AGC in simplex, with an electronic analog backup. Later, the project engineers wanted a duplex system using a more advanced computer. Johnson Space Center avionics people noted the similarities between the digital fly-by-wire program and the Shuttle. Dr. Kenneth Cox of JSC suggested that Dryden go with a triplex system to move beyond simple one-for-one redundancy. By coordinating procurement, NASA outfitted both the F-8 aircraft and the Shuttle with AP-101 processors. Draper Laboratory produced software for the F-8, and its flight tests proved the feasibility of computers operating in synchronization, as it suffered several single point computer failures but successfully flew on without loss of control. This flight program did much to convince NASA of the viability of the synchronization and redundancy management schemes developed for the Shuttle.

How Many Computers?

One key question in redundancy planning is how many computers are required to achieve the level of safety desired. Using the concept of fail operational/fail operational/fail-safe, five computers are needed. If one fails, normal operations are still maintained. Two failures result in a fail-safe situation, since the three remaining prevent the feared standoff possible in dual computer systems (one is wrong, but which?). Due to cost considerations of both equipment and time, NASA decided to lower the requirement to fail operational/fail-safe, which allowed the number of computers to be reduced to four. Since five were already procured and designed into the system, the fifth computer evolved into a backup system, providing reduced but adequate functions for both ascent and descent in a single memory load. NASA's decision to use four computers has a basis in reliability projections done for fly-by-wire aircraft. Triplex computer system failures were expected to cause loss of aircraft three times in a million flights, whereas quadruple computer system failures would cause loss of aircraft only four times in a thousand million flights [64]. At first the backup flight system computer was not considered to be a permanent fixture. When safety level requirements were lowered, some IBM and NASA people expected the fifth computer to be removed after the Approach and Landing Test phase of the Shuttle program and certainly after the flight test phase (STS-1 through 4) [65]. However, the utility of the backup system as insurance against a generic software error in the primary system outweighed considerations of the savings in weight, power, and complexity to be made by eliminating it [66]. In fact, as the first Shuttle flights approached, Arnold Aldrich, Director of the Shuttle Office at Johnson Space Center, circulated a memo arguing for a sixth computer to be carried along as a spare [67]! He pointed out that since 90% of avionics component failures were expected to be computer failures and that since a minimum of three computers and the backup should exist for a nominal re-entry, aborts would then have to take place after one failure. By carrying a spare computer preloaded with the entry software, the primary system could be brought back to full strength. The sixth computer was indeed carried on the first few flights. In contrast with this "suspenders and belt" approach, John R. Garman of the Johnson Space Center Spacecraft Software Division said that "we probably did more damage to the system as a whole by putting in the backup" [68]. He felt that the institution of the backup took much of the pressure off the developers of the primary system. No longer was their software solely responsible for survival of the crew. Also, integrating the priority-interrupt-driven operating system of the primary computers with the time-slice system of the backup caused compromises to be made in the primary.

Figure 4-4. The intercommunication system used in the F-8 triplex computer system.
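To put the triplex and quadruple loss projections quoted above side by side (a "thousand million" is 10^9), the ratio works out as:

```latex
\[
\frac{3 \times 10^{-6}\ \text{(triplex)}}{4 \times 10^{-9}\ \text{(quadruple)}} = 750
\]
```

In other words, by those projections, adding the fourth computer was expected to cut the loss rate by a factor of about 750.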

Synchronization

Computer synchronization proved to be the most difficult task in producing the Shuttle's avionics. Synchronizing redundant computers and comparing their current states is the best way to decide if a failure has occurred. There are two types of synchronization used by the Shuttle's computers in determining which of them has failed: one for the redundant set of computers established for ascent to orbit and descent from orbit, and one for synchronizing a common set while in orbit. It took several years in the early 1970s to discover a way to accomplish these two synchronizations.

The essence of Shuttle redundancy is that each computer in the redundant set could do all the functions necessary at a particular mission phase. For true redundancy to take place, all computers must listen to all traffic on all buses, even though they might be commanding just a few. That way they know about all the data generated in the current phase. They must also be processing that data at the same time the other computers do. If there is a failure, then the failed computer could drop out of the set without any functional degradation whatever. At the start, the Shuttle's designers thought it would be possible to run the redundant computers separately and then just compare answers periodically to make sure that the data and calculations matched [69]. As it turned out, small differences in the oscillators that acted as clocks within the computers caused the computers to get out of step fairly quickly. The Spacecraft Software Division formed a committee, headed by Garman, made up of representatives from Johnson Space Center, Rockwell International, Draper Laboratory and IBM Corporation, to study the problem caused by oscillator drift [70]. Draper's people made the suggestion that the computers be synchronized at input and output points [71]. This concept was later expanded to also place synchronization points at process changes, when the system makes a transition from one software module to another. The decision to put in the synchronization points "settled everyone's mind" on the issue [72].

Intercomputer communication is what makes the Shuttle's avionics system uniquely advanced over other forms of parallel computing. The software required for redundancy management uses just 3K of memory and around 5% or 6% of each central processor's resources, which is a good trade for the results obtained [78]. An increasing need for redundancy and fault tolerance in non-avionics systems such as banks, using automatic tellers and nationwide computer networks, proves the usefulness of this system. But this type of synchronization is so little known or understood by people outside the Shuttle program that carryover applications will be delayed.

One reason why the redundancy management software was able to be kept to a minimum is that NASA decided to move voting to the actuators, rather than to do it before commands are sent on buses. Each actuator is quadruple redundant. If a single computer fails, it continues to send commands to an actuator until the crew takes it out of the redundant set. Since the Shuttle's other three computers are sending apparently correct commands to their actuators, the failed computer's commands are physically out-voted79. Theoretically, the only serious possibility is that three computers would fail simultaneously, thus negating the effects of the voting. If that occurs, and if the proper warnings are given, the crew can then engage the backup system simply by pressing a button located on each of the forward rotational hand controllers.
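To make the out-voting idea concrete, here is a minimal Python sketch. It is only a software analogy: per the passage above, the real voting happens physically at the quadruple-redundant actuators, not in code. The function name `outvote` and the integer command values are made up for illustration.

```python
from collections import Counter


def outvote(commands):
    """Return the command sent by a strict majority of channels, or None.

    `commands` holds one command per computer (four in the redundant set).
    Hypothetical helper: a software stand-in for voting that the Shuttle
    performs physically at the actuator.
    """
    value, count = Counter(commands).most_common(1)[0]
    # A strict majority is required; a 2-2 split yields no winner.
    return value if count > len(commands) / 2 else None


# One failed computer is out-voted by the three good ones.
print(outvote([12, 12, 12, 40]))
# Three computers failing the same way out-vote the one good computer,
# which is the case the passage calls the only serious possibility.
print(outvote([12, 40, 40, 40]))
```

Note that the vote only protects you while the good channels are in the majority, which is why the crew still has the backup system behind it.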

Does the redundant set synchronization work? As described, the F-8 version, with redundancy management identical to the Shuttle, survived several in-flight computer failures without mishap. On the first Shuttle Approach and Landing Test flight, a computer failed just as the Enterprise was released from the Boeing 747 carrier; yet the landing was still successful. That incident did a lot to convince the astronaut pilots of the viability of the concept.

Synchronization and redundancy together were the methods chosen to ensure the reliability of the Shuttle avionics hardware. With the key hardware problems solved, NASA turned to the task of specifying the most complex flight software ever conceived.

Box 4-2: Redundant Set Synchronization: Key to Reliability

Synchronization of the redundant set works like this: When the software accepts an input, delivers an output, or branches to a new process, it sends a 3-bit discrete signal on the intercomputer communication (ICC) buses, then waits up to 4 milliseconds for similar discretes from the other computers to arrive. The discretes are coded for certain messages. For example, 010 means an I/O is complete without error, but 011 means that an I/O is complete with error73. This allows more information than just "here I am" to be sent. If another computer either sends the wrong synchronization code or is late, the computer detecting either of these conditions concludes that the delinquent computer has failed, and refuses from then on to listen to it or acknowledge its presence. Under normal circumstances, all three good computers should have detected the single computer's error. The bad computer is announced to the crew with warning lights, audio signals, and CRT messages. The crew must purposely kill the power to the failed computer, as there is no provision for automatic powerdown. This prevents a generic software failure causing all the computers to be automatically shut off.
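The check each computer makes at a synchronization point can be sketched roughly as follows. The two code values and the 4 millisecond window come from the passage; everything else is an assumption for illustration: the function name `failed_peers`, the peer labels, the simulated arrival times, and the reading of "wrong code" as "different from the code this computer sent".

```python
SYNC_TIMEOUT_MS = 4.0      # from the text: wait up to 4 milliseconds
IO_COMPLETE_OK = 0b010     # from the text: I/O complete without error
IO_COMPLETE_ERROR = 0b011  # from the text: I/O complete with error


def failed_peers(expected_code, arrivals):
    """Return the set of peers this computer would declare failed.

    `arrivals` maps a peer label to (code, elapsed_ms), or to None if no
    discrete arrived at all. Assumption: a code is "wrong" when it differs
    from `expected_code`, the code this computer itself sent.
    """
    failed = set()
    for peer, arrival in arrivals.items():
        if arrival is None:
            failed.add(peer)  # nothing arrived: treated as late
            continue
        code, elapsed_ms = arrival
        if code != expected_code or elapsed_ms > SYNC_TIMEOUT_MS:
            failed.add(peer)  # wrong code, or outside the 4 ms window
    return failed


arrivals = {
    "computer_2": (IO_COMPLETE_OK, 1.2),     # on time, matching code
    "computer_3": (IO_COMPLETE_OK, 5.0),     # matching code, but late
    "computer_4": (IO_COMPLETE_ERROR, 0.8),  # on time, different code
}
print(sorted(failed_peers(IO_COMPLETE_OK, arrivals)))
```

The sketch only flags peers. It never powers anything down, which mirrors the point above that shutting off a failed computer is left to the crew.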

This form of synchronization creates a tightly coupled group of computers constantly certifying that they are at the same place in the software. To certify that they are achieving the same solutions, a "sumword" is used. While computers are in a redundant set, a sumword is exchanged 6.25 times every second on the ICC buses74. A sumword typically consists of 64 bits of data, usually the least significant bits of the last outputs to the solid rocket boosters, orbital maneuvering engines, main engines, body flap, speed brake, rudder, elevons, throttle, the system discretes, and the reaction control system75. If there are three straight miscomparisons of a sumword, the detecting computers declare the computer involved to be failed76.
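The "three straight miscomparisons" rule is easy to picture as a per-peer counter that resets on any match. This is a rough Python sketch under that reading. The class name `SumwordMonitor`, the peer labels, and the sample sumword values are hypothetical, and how the real sumword is assembled is not modeled here.

```python
MISCOMPARE_LIMIT = 3  # from the text: three straight miscomparisons


class SumwordMonitor:
    """Tracks consecutive sumword mismatches for each peer computer."""

    def __init__(self):
        self.streak = {}     # peer label -> current run of mismatches
        self.failed = set()  # peers declared failed so far

    def compare(self, my_sumword, peer_sumwords):
        """Compare one exchange of sumwords and return the failed peers."""
        for peer, word in peer_sumwords.items():
            if word == my_sumword:
                self.streak[peer] = 0  # a match breaks the streak
            else:
                self.streak[peer] = self.streak.get(peer, 0) + 1
                if self.streak[peer] >= MISCOMPARE_LIMIT:
                    self.failed.add(peer)
        return set(self.failed)


monitor = SumwordMonitor()
mine = 0x1234
# Two mismatches followed by a match: the peer is not declared failed.
for theirs in (0xFFFF, 0xFFFF, 0x1234):
    monitor.compare(mine, {"computer_3": theirs})
print(monitor.failed)
# Three mismatches in a row: the peer is declared failed.
for theirs in (0xFFFF, 0xFFFF, 0xFFFF):
    monitor.compare(mine, {"computer_3": theirs})
print(monitor.failed)
```

Requiring a streak instead of a single mismatch means one transient disagreement does not knock a healthy computer out of the set.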

Both the 3-bit synchronization code and sumword comparison are characteristics of the redundant set operations. During noncritical mission phases such as on-orbit, the computers are reconfigured. Two might be left in the redundant set to handle guidance and navigation functions, such as maintaining the state vector. A third would run the systems management software that controls life support, power, and the payload. The fourth would be loaded with the descent software and powered down, or "freeze dried," to be instantly ready to descend in an emergency and to protect against a failure of the two MMUs. The fifth contains the backup flight system. This configuration of computers is not tightly coupled, as in the redundant set. All active computers, however, do continue the 6.25/second exchange of sumwords, called the common set synchronization77.

Figure 4-5. The various computer configurations used during a Shuttle mission. The names of the operational sequences loaded into the machines are shown.

Supcom · Jan 1, 2024

sleepydoc said:
No, you just bring bagels AND donuts!

The CEO fired you for not bringing the biscuits he requested.

aronth5 · Jan 1, 2024

diplomat33 said:
Uh that is not how it works. I am the meeting secretary. I am the only one responsible for taking notes and writing the minutes from those notes. That is why I have two audio recordings so I can go back and listen to what was actually said to make sure my minutes are 100% accurate.

And meeting secretary is an additional responsibility I volunteered for. It is not my main job.

One of my favorite C-level executives had only one requirement for the meeting minutes. Identify who owned what action items. And get them sent out fast.

sleepydoc · Jan 1, 2024

aronth5 said:
One of my favorite C-level executives had only one requirement for the meeting minutes. Identify who owned what action items. And get them sent out fast.

“I want stuff broken by noon tomorrow!”
