18 Aug

On Self-Modifying Code and the Space Shuttle OS

I was doing some reading about metaprogramming and self-modifying code on Wikipedia, a fascinating topic with many uses, including optimization, patching, and genetic programming.

It reminded me of my days in the early 1990s working as a software engineer on the Space Shuttle operating system (FCOS). Many people don’t know that the Space Shuttle OS implements self-modifying code for the purpose of “fault-tolerance”. The Shuttle computer systems consist of four primary computers running the same software, and a fifth backup computer running different software that is equal in functionality. The goal is to be Fail Operational if one or more computers fail, and Fail Safe if all primary computers fail; this is called a Fail Operational/Fail Safe system. The four primary computers run redundantly during critical phases such as launch or re-entry, synchronizing many times per second: at critical points in the code, at every I/O, and at every timer expiration. Those computers are all connected to 24 I/O buses controlling different parts of the vehicle. Of the 24 buses, 8 are flight-critical data buses (to fly the vehicle), 5 are for inter-computer communication, 4 are for the displays, and so on (see the diagram below):

(Image credit: NASA Office of Logic Design)

Each data bus is controlled by a microcomputer called the Bus Control Element (BCE), which runs I/O code. A simple yet effective self-patching mechanism is used to bypass I/O errors: when an I/O error is encountered, the OS changes the BCE code at runtime so that it branches around the I/O instructions that caused the error. This is referred to as “BCE bypass”, and it is self-modifying code for the purpose of fault tolerance.
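As a rough illustration of the BCE bypass idea, here is a minimal Python sketch. It is a simulation only, not the actual FCOS or BCE code: the `Instr` type, the `IO` and `SKIP` opcodes, the device names, and the `run_bce_program` function are all hypothetical names made up for this example. The I/O program is modeled as a list of instructions, and an instruction that fails is overwritten in place so that later passes branch around it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instr:
    op: str      # hypothetical opcodes: "IO" (do the transfer) or "SKIP" (bypassed)
    device: str  # hypothetical device name on the bus


def run_bce_program(program: List[Instr],
                    read_device: Callable[[str], int]) -> Dict[str, int]:
    """Run one pass over the I/O program, patching failed I/O instructions in place."""
    results: Dict[str, int] = {}
    for pc, instr in enumerate(program):
        if instr.op == "SKIP":
            continue  # this instruction was patched out on an earlier pass
        try:
            results[instr.device] = read_device(instr.device)
        except IOError:
            # The "self-modifying" step: rewrite the program itself so that
            # this I/O instruction is bypassed from now on.
            program[pc] = Instr("SKIP", instr.device)
    return results


def flaky_bus(device: str) -> int:
    """Hypothetical bus read in which one device never answers."""
    if device == "sensor_b":
        raise IOError("no response")
    return 1


program = [Instr("IO", "sensor_a"), Instr("IO", "sensor_b"), Instr("IO", "sensor_c")]
first_pass = run_bce_program(program, flaky_bus)   # hits the error and patches it out
second_pass = run_bce_program(program, flaky_bus)  # never touches sensor_b again
```

After the first pass the program no longer contains a live I/O instruction for the failing device, so no per-pass check of an error flag is needed; the decision is recorded in the code itself.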

Another fault-tolerance technique implemented in the vehicle is redundant set voting, where I/O is also self-patched to assign which computer controls (talks and listens to) which data buses. As mentioned above, the four primary computers can be configured into a redundant set, synchronizing many times per second. If a computer fails to synchronize, the other computers will vote against the failing computer. While no computer can shut down any other computer, the others can take control of the flight-critical data buses assigned to the failed (voted-out) computer(s). I once worked on a mission where a single computer failed to sync, and I was put under tremendous pressure trying to investigate the cause, using only the limited data that is downlinked every 160 ms. I ended up attributing the fail-to-sync to a specific code section where the code paths diverged (branching) because space radiation had altered non-radiation-hardened registers. The worst-case fail-to-sync would be two computers voting against the other two, but that has never happened during a mission, and it is unlikely to ever happen. The action to take if that were to happen is to engage the Backup Flight Computer, a kind of scary thought, as the backup computer has never been engaged to fly the vehicle during a real mission.
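The voting and bus takeover described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the Shuttle's actual algorithm: it assumes each computer exchanges a single "sync word" at a sync point, that a computer is voted out when it disagrees with a strict majority, and that orphaned buses are handed out round-robin. The function names, bus names, and sync values are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple


def redundant_set_vote(sync_words: Dict[int, str]) -> Optional[Tuple[Set[int], Set[int]]]:
    """Return (healthy, voted_out), or None when there is no strict majority.

    sync_words maps a computer id to the value it presented at the sync point.
    """
    counts = Counter(sync_words.values())
    value, n = counts.most_common(1)[0]
    if 2 * n <= len(sync_words):
        return None  # e.g. a 2-versus-2 split: no majority to trust
    healthy = {c for c, w in sync_words.items() if w == value}
    return healthy, set(sync_words) - healthy


def reassign_buses(bus_owner: Dict[str, int],
                   healthy: Set[int],
                   voted_out: Set[int]) -> Dict[str, int]:
    """Give buses owned by voted-out computers to healthy ones (round-robin).

    Nothing is shut down here; only bus ownership changes.
    """
    takers = sorted(healthy)
    new_owner = dict(bus_owner)
    i = 0
    for bus in sorted(bus_owner):
        if bus_owner[bus] in voted_out:
            new_owner[bus] = takers[i % len(takers)]
            i += 1
    return new_owner


# Computer 4 presents a different sync word and is voted out by the other three.
result = redundant_set_vote({1: "A", 2: "A", 3: "A", 4: "B"})
if result is not None:
    healthy, voted_out = result
    owners = reassign_buses({"FC1": 1, "FC2": 2, "FC3": 3, "FC4": 4}, healthy, voted_out)

# A two-against-two split yields no majority; in this sketch the caller would
# treat None as the cue to fall back to the backup computer.
split = redundant_set_vote({1: "A", 2: "A", 3: "B", 4: "B"})
```

Returning `None` for the split case, rather than picking a side, mirrors the point that neither pair can be trusted over the other.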

An alternative to self-modifying or patching code is to use conditions (if statements) together with flags kept in local storage (such as memory) that are used to determine the paths to take (assuming such paths are known ahead of time, which might not be the case in genetic programming). Because the Space Shuttle is a memory-limited environment with a static memory map and no dynamic memory management, it is cheaper, less complex, and safer to patch specific instructions directly.
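For comparison, the flag-based alternative might look like the sketch below. Again this is only an illustration in Python with hypothetical names (`run_io_with_flags`, the `bypassed` set, the device names); the point is that the code never changes, but every pass has to consult extra state held in memory.

```python
from typing import Callable, Dict, List, Set


def run_io_with_flags(devices: List[str],
                      read_device: Callable[[str], int],
                      bypassed: Set[str]) -> Dict[str, int]:
    """Run one I/O pass, using stored flags instead of patching any code."""
    results: Dict[str, int] = {}
    for device in devices:
        if device in bypassed:   # condition evaluated on every pass
            continue
        try:
            results[device] = read_device(device)
        except IOError:
            bypassed.add(device)  # remember the failure as a flag in memory
    return results


def unreliable_read(device: str) -> int:
    """Hypothetical bus read in which one device never answers."""
    if device == "sensor_b":
        raise IOError("no response")
    return 1


flags: Set[str] = set()  # the extra storage this approach requires
devices = ["sensor_a", "sensor_b", "sensor_c"]
pass_one = run_io_with_flags(devices, unreliable_read, flags)
pass_two = run_io_with_flags(devices, unreliable_read, flags)
```

The trade-off is visible even at this scale: the flag set costs memory and a check per device per pass, which is cheap on a modern machine but not free in a tightly constrained one.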

I’ve modified the Wikipedia entry on Self-modifying code and added “fault-tolerance” as one of the uses for self-modifying code.

ceo