Seems we're on a redundancy train, here is a good read on redundancy. It is a well understood problem and there are various ways to achieve it. This is the process NASA used in designing the redundancy systems of the space shuttle.
Computers in Spaceflight: The NASA Experience
- Chapter Four -- Computers in the Space Shuttle Avionics System -
Computer synchronization and redundancy management
[
100] One key goal shaping the design of the Shuttle was "autonomy." Multiple missions might be in space at the same time, and large crews, many with nonpilot passengers, were to travel in space in craft much more self-sufficient than ever before. These circumstances, the desire for swift turnaround time between launches, and the need to sustain mission success through several levels of component failure meant that the Shuttle had to incorporate a large measure of fault tolerance in its design. As a result, NASA could do what would have been unthinkable 20 years earlier: put men on the Shuttle's first test flight. The key factor in enabling NASA to take such a risk was the redundancy built into the orbiter
60. Fault tolerance on the Shuttle is achieved through a combination of redundancy and backup. Its five general-purpose computers have reliability through redundancy, rather than the expensive quality control employed in the Apollo program
61. Four of the computers, each loaded with identical software, operate in what is termed the "redundant set" during critical mission phases such as ascent and descent. The fifth, since it only contains software to accomplish a "no frills" ascent and descent, is a backup. The four actuators that drive the hydraulics at each of the aerodynamic surfaces are also redundant, as are the pairs of computers that control each of the three main engines. Management of redundancy raised several difficult questions. How are failures detected and certified? Should the system be static or dynamic? Should the computers run separately without communication and be used to replace the primary computer one by one as failures occur? Could the computers, if running together, stay in step? Should redundancy management of the actuators be at the computer or subsystem level? Fortunately, NASA experience on other aircraft and spacecraft programs could provide data for making the final decisions.
Redundant Precursors
Several systems that incorporated redundancy preceded the Shuttle. The computer used in the Saturn booster instrument unit that contained the rocket's guidance system used triple modular redundant (TMR) circuits, which means that there was one computer with redundant components. Disadvantages to using such circuits in larger computers [
101] are that they are expensive to produce, and an event such as the explosion on Apollo 13 could damage enough of the computer that it ceases to function. By spreading redundancy among several simplex circuit computers scattered in various parts of the spacecraft, the effects of such catastrophic failures are minimized
62. Skylab's two computers each could perform all the functions required on its mission. If one failed, the other would automatically take over, but both computers were not up and running simultaneously. The computer taking over would have to find out where the other had left off by using the contents of the 64-bit transfer register located in the common section built with TMR circuits. The Skylab computers were able to have such a relatively leisurely switch-over system because they were not responsible for navigation or high-frequency flight control functions. If there were a failure, it would be possible for the Skylab to drift in its attitude without serious danger; the Shuttle would have no such margin of safety.
Figure 4-3. The F-8 aircraft that proved the redundant set configuration planned for the Shuttle would work. (NASA photo ECN-6988)
The need for the redundant computers on the Shuttle to process information simultaneously, while still staying closely synchronized for rapid switch-over, seriously challenged the designers of the system. Such a close synchronization between computers had not been done before, and its feasibility would have to be proven before NASA could make a full commitment to a particular design. Most of the [
102] necessary confidence resulted from a digital fly-by-wire testing program NASA started at the Dryden Flight Research Center in the early 1970s
63. The first computer used in the F-8 "Crusader" aircraft chosen for the program was a surplus AGC in simplex, with an electronic analog backup. Later, the project engineers wanted a duplex system using a more advanced computer. Johnson Space Center avionics people noted the similarities between the digital fly-by-wire program and the Shuttle. Dr. Kenneth Cox of JSC suggested that Dryden go with a triplex system to move beyond simple one-for-one redundancy. By coordinating procurement, NASA outfitted both the F-8 aircraft and the Shuttle with AP-101 processors. Draper Laboratory produced software for the F-8, and its flight tests proved the feasibility of computers operating in synchronization, as it suffered several single point computer failures but successfully flew on without loss of control. This flight program did much to convince NASA of the viability of the synchronization and redundancy management schemes developed for the Shuttle.
How Many Computers?
One key question in redundancy planning is how many computers are required to achieve the level of safety desired. Using the concept of fail operational/fail operational/fail-safe, five computers are needed. If one fails, normal operations are still maintained. Two failures result in a fail-safe situation, since the three remaining prevent the feared standoff possible in dual computer systems (one is wrong, but which?). Due to cost considerations of both equipment and time, NASA decided to lower the requirement to fail operational/fail-safe, which allowed the number of computers to be reduced to four. Since five were already procured and designed into the system, the fifth computer evolved into a backup system, providing reduced but adequate functions for both ascent and descent in a single memory load. NASA's decision to use four computers has a basis in reliability projections done for fly-by-wire aircraft. Triplex computer system failures were expected to cause loss of aircraft three times in a million flights, whereas quadruple computer system failures would cause loss of aircraft only four times in a
thousand million flights
64. At first the backup flight system computer was not considered to be a permanent fixture. When safety level requirements were lowered, some IBM and NASA people expected the fifth computer to be removed after the Approach and Landing Test phase of the Shuttle program and certainly after the flight test phase (STS-1 through 4)
65. However, the utility of the backup system as insurance against a generic software error in the primary system outweighed considerations of the savings in weight, power, and complexity to be made by....
[
103]
Figure 4-4. The intercommunication system used in the F-8 triplex computer system.
[
104] ....eliminating it
66. In fact, as the first Shuttle flights approached, Arnold Aldrich, Director of the Shuttle Office at Johnson Space Center, circulated a memo arguing for a sixth computer to be carried along as a spare
67! He pointed out that since 90% of avionics component failures were expected to be computer failures and that since a minimum of three computers and the backup should exist for a nominal re-entry, aborts would then have to take place after one failure. By carrying a spare computer preloaded with the entry software, the primary system could be brought back to full strength. The sixth computer was indeed carried on the first few flights. In contrast with this "suspenders and belt" approach, John R. Garman of the Johnson Space Center Spacecraft Software Division said that "we probably did more damage to the system as a whole by putting in the backup"
68. He felt that the institution of the backup took much of the pressure off the developers of the primary system. No longer was their software solely responsible for survival of the crew. Also, integrating the priority-interrupt-driven operating system of the primary computers with the time-slice system of the backup caused compromises to be made in the primary.
Synchronization
Computer synchronization proved to be the most difficult task in producing the Shuttle's avionics. Synchronizing redundant computers and comparing their current states is the best way to decide if a failure has occurred. There are two types of synchronization used by the Shuttle's computers in determining which of them has failed: one for the redundant set of computers established for ascent to orbit and descent from orbit, and one for synchronizing a common set while in orbit. It took several years in the early 1970s to discover a way to accomplish these two synchronizations.
The essence of Shuttle redundancy is that each computer in the redundant set could do all the functions necessary at a particular mission phase. For true redundancy to take place, all computers must listen to all traffic on all buses, even though they might be commanding just a few. That way they know about all the data generated in the current phase. They must also be processing that data at the same time the other computers do. If there is a failure, then the failed computer could drop out of the set without any functional degradation whatever. At the start, the Shuttle's designers thought it would be possible to run the redundant computers separately and then just compare answers periodically to make sure that the data and calculations matched
69. As it turned out, small differences in the oscillators that acted as clocks within the computers caused the computers to get out of step fairly [
105] quickly. The Spacecraft Software Division formed a committee, headed by Garman, made up of representatives from Johnson Space Center, Rockwell International, Draper Laboratory and IBM Corporation, to study the problem caused by oscillator drift
70. Draper's people made the suggestion that the computers be synchronized at input and output points
71.This concept was later expanded to also place synchronization points at process changes, when the system makes a transition from one software module to another. The decision to put in the synchronization points "settled everyone's mind" on the issue
72.
Intercomputer communication is what makes the Shuttle's avionics system uniquely advanced over other forms of parallel computing. The software required for redundancy management uses just 3K of memory and around 5% or 6% of each central processor's resources, which is a good trade for the results obtained
78. An increasing need for redundancy and fault tolerance in non-avionics systems such as banks, using automatic tellers and nationwide computer networks, proves the usefulness of this system. But this type of synchronization is so little known or understood by people outside the Shuttle program that carryover applications will be delayed.
One reason why the redundancy management software was able to be kept to a minimum is that NASA decided to move voting to the actuators, rather than to do it before commands are sent on buses. Each actuator is quadruple redundant. If a single computer fails, it continues to send commands to an actuator until the crew takes it out of the redundant set. Since the Shuttle's other three computers are sending apparently correct commands to their actuators, the failed computer's commands are physically out-voted
79. Theoretically, the only serious possibility is that three computers would fail simultaneously, thus negating the effects of the voting. If that occurs, and if the proper warnings are given, the crew can then engage the backup system simply by pressing a button located on each of the forward rotational hand controllers.
Does the redundant set synchronization work? As described, the F-8 version, with redundancy management identical to the Shuttle, survived several in-flight computer failures without mishap. On the first Shuttle Approach and Landing Test flight, a computer failed just as the Enterprise was released from the Boeing 747 carrier; yet the landing was still successful. That incident did a lot to convince the astronaut pilots of the viability of the concept.
Synchronization and redundancy together were the methods chosen to ensure the reliability of the Shuttle avionics hardware. With the key hardware problems solved, NASA turned to the task of specifying the most complex flight software ever conceived.
[
106] Box 4-2: Redundant Set Synchronization: Key to Reliability
Synchronization of the redundant set works like this: When the software accepts an input, delivers an output, or branches to a new process, it sends a 3-bit discrete signal on the intercomputer communication (ICC) buses, then waits up to 4 milliseconds for similar discretes from the other computers to arrive. The discretes are coded for certain messages. For example, 010 means an I/O is complete without error, but 011 means that an I/O is complete with error
73. This allows more information other than just "here I am" to be sent. If another computer either sends the wrong synchronization code, or is late the computer detecting either of these conditions concludes that the delinquent computer has failed, and refuses from then on to listen to it or acknowledge its presence. Under normal circumstances, all three good computers should have detected the single computer's error. The bad computer is announced to the crew with warning lights, audio signals, and CRT messages. The crew must purposely kill the power to the failed computer, as there is no provision for automatic powerdown. This prevents a generic software failure causing all the computers to be automatically shut off.
This form of synchronization creates a tightly coupled group of computers constantly certifying that they are at the same place in the software. To certify that they are achieving the same solutions, a "sumword" is used. While computers are in a redundant set, a sumword is exchanged 6.25 times every second on the ICC buses
74. A sumword typically consists of a 64 bits of data, usually the least significant bits of the last outputs to the solid rocket boosters, orbital maneuvering engines, main engines, body flap, speed brake, rudder, elevons, throttle, the system discretes, and the reaction control system
75. If there are three straight miscomparisons of a sumword, the detecting computers declare the computer involved to be failed
76.
Both the 3-bit synchronization code and sumword comparison are characteristics of the redundant set operations. During noncritical mission phases such as on-orbit, the computers are reconfigured. Two might be left in the redundant set to handle guidance and navigation functions, such as maintaining the state vector. A third would run the systems management software that controls life support, power, and the payload. The fourth would be loaded with the descent software and powered down, or "freeze dried," to be instantly ready to descend in an emergency and to protect against a failure of the two MMUs. The fifth contains the backup flight system. This configuration of computers is not tightly coupled, as in the redundant set. All active computers, however, do continue the 6.25/second exchange of sumwords, called the common set synchronization
77.
[
107]
Figure 4-5. The various computer configurations used during a Shuttle mission. The names of the operational sequences loaded into the machines are shown.