Related to this, see Launch of STS-135 Atlantis (final mission of the Space Shuttle).
Tomorrow I leave for Cape Canaveral (Kennedy Space Center) to see the last launch of the Space Shuttle, STS-135.
(I hope it doesn’t get delayed).
I am very excited about this. I was able to get some tickets at the last minute to see this launch from ‘up close’ (~7 miles away) from the KSC Visitor’s Center. Back in 1994 when I used to work on the Shuttle, I was able to see the launch of STS-60 from really up-close (~3 miles); and there is nothing like seeing a Shuttle launch.
Below is a post I made to the Illlist group about my Shuttle story, which I am posting here on my blog as well. Some aspects of it are a bit technical. I am sharing it here so that everyone who reads this understands the awesomeness of the Space Shuttle and of our manned space program.
Looking back at the Space Shuttle Program
C. Enrique Ortiz | July 4, 2011
For months I have been looking to get tickets to see the last Shuttle mission from up-close. I was lucky to get the tickets to go see the launch of Atlantis thanks to @hugs (Jason) and the Illlist.
I mentioned to Jason that going to see this last mission of the Space Shuttle was a special event for me, and he asked me to post my story to the Illlist. It has been a very long time since I worked on the Shuttle program, so bear with me, but here it goes…
This last Shuttle mission has a lot of meaning to me because it is the last flight on which my contributions to the Space Shuttle onboard system software (SSW) will fly. As for many thousands of others, STS-135 with Atlantis is the last flight that will carry to space all of our contributions to the USA manned space program.
I grew up loving Space (and Astronomy). Movies like 2001: A Space Odyssey changed the way I saw the future and space exploration, and introduced me to that cool thing called computers, which later on turned out to be how I made my living. After I graduated from college my first job was in computers (software) and the space program, specifically the manned space program. Awesome. And to Houston I went. It was so exciting: the environment, the history, the astronauts, the vehicles, and writing software for them.
This was back in the 1990s. I worked on the Shuttle program for ~5 years. My job was as a SW engineer on the Flight Computer OS (FCOS). I “quickly” learned the codebase, which was very complex and took me a year to master, or at least I thought I did. Then I began supporting flight missions (~20 in total) from launch to landing, and writing code in support of new CPU capabilities and/or new features, some small, some very large and all very critical.
The team I worked with was the best, and I miss them all. A bunch of great people, smart people, very passionate people. The space program was in our veins. We were all so proud of the space program and contributing to the mission’s success. There was nothing like getting the vehicle ready for flight (from our software point of view), and seeing that bird roaring up to space.
The Space Shuttle computer systems are *old*, very old. Many of the drivers or requirements for them came from the lessons learned from the Apollo missions: things like screen refresh rates, the redundancy and fail-operational/fail-safe requirements, fly-by-wire and digital control, and others. The SW implements a number of very important concepts that at the time were, and still today are, very awesome and unique.
First, it probably still is the most complex real-time system out there. There are 5 general purpose computers (GPC). Each GPC has one processor, the AP-101S, and about 1MB of main memory. Everything ran in less than that. For storage, it relies on magnetic tape! (Not sure if that got upgraded after I left, but those tapes gave us grief as they aged, and I had to code to take into consideration changes in spin start/stop and read/write times.) There are 24 I/O buses for commanding/controlling different subsystems of the vehicle. Of the 24 buses, 8 are considered critical, for example, controlling critical sensors and/or controlling the main engines. Access to each subsystem is redundant via two different paths. A given GPC is assigned a String, which consists of two critical I/O buses to command (one flight-forward and one flight-aft bus), and it listens to all the critical buses (so as to maintain redundancy). That means a given GPC commands a set of 2 buses while it is listening and ready to take over others in case another commanding GPC fails.
Four of the GPCs run the Primary Avionics Software System (PASS) written by IBM (I was an IBMer at the time); GPCs 1-4 run that exact same software image. The 5th computer runs the Backup Flight System (BFS), written by a different vendor, totally independently, from the same requirements given to IBM for the PASS. The idea of PASS vs. BFS is that no single bug should affect all of the computers: if the primary systems fail because of a common issue in that version of the software, the astronauts can engage the backup system. Note that the BFS has never been engaged during an actual mission.
(As a side note, it typically takes around one year for a new version of the software to go from completion to actual flight, due to all the testing and then astronaut training.)
The idea behind this fail-operational/fail-safe modus operandi is a common theme across the Shuttle. Everything is redundant. The system must handle failures such that a single system failure keeps the mission and its crew operational (fail-operational) and two system failures keep the vehicle and its crew safe and able to land (fail-safe). This is the reason there are five GPCs: four primary ones and one separate backup. In addition, the PASS GPCs run in a Redundant Set during the critical phases of the mission (lift-off, on-orbit and re-entry/landing). In a redundant set, each of the primary GPCs, as previously mentioned, commands two (of the eight) critical I/O buses at a given time, while listening to the other buses. The idea is that if all computers are running the same software and are receiving all the same inputs, then all execution should be the same (and all outputs should be the same as well). All computers in the redundant set, which again are running the same SW, sync up at every interrupt (I/O, timer). If a computer fails to sync (does not show up on time) twice in a row, it is voted out of the redundant set by the rest of the computers, and the designated bus-listener takes command of its critical buses. The failed computer is halted as soon as possible by the astronauts. (While all this is happening, a number of audible alarms are going off.)
The computers also form what is called a Common Set, which can include redundant computers (in a redundant set of their own) and non-redundant computers; these sync up every 160 ms. An example of a redundant set is when the computers are in guidance, navigation and control mode for launch, orbit or re-entry, and an example of a common set is having two computers in a redundant set in orbit, while a 3rd computer does systems management dedicated to the robotic arm or the payload (the 4th computer is in stand-by, conserving power).
It is uncommon for a GPC to fail to sync from a redundant set, and even more uncommon for one to fail to sync from a common set.
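To make the voting rule above concrete, here is a toy sketch in Python. It is only an illustration of the "miss two syncs in a row and you are voted out" idea, not the flight code. The class names, the `sync_point` method, the bus labels, and the rule for picking which surviving GPC takes over the buses are all assumptions made for this example.

```python
from dataclasses import dataclass, field

MISSES_BEFORE_VOTE_OUT = 2  # "twice in a row", as described in the text


@dataclass
class Gpc:
    name: str
    buses: list                    # critical buses this GPC currently commands
    consecutive_misses: int = 0
    in_set: bool = True


@dataclass
class RedundantSet:
    gpcs: dict = field(default_factory=dict)

    def add(self, name, buses):
        self.gpcs[name] = Gpc(name, list(buses))

    def sync_point(self, arrived):
        """Process one sync point.

        `arrived` is the set of GPC names that showed up on time.
        Returns the names of the GPCs voted out at this sync point.
        """
        voted_out = []
        for gpc in self.gpcs.values():
            if not gpc.in_set:
                continue
            if gpc.name in arrived:
                gpc.consecutive_misses = 0
            else:
                gpc.consecutive_misses += 1
                if gpc.consecutive_misses >= MISSES_BEFORE_VOTE_OUT:
                    voted_out.append(gpc)
        for failed in voted_out:
            self._vote_out(failed)
        return [gpc.name for gpc in voted_out]

    def _vote_out(self, failed):
        failed.in_set = False
        survivors = [g for g in self.gpcs.values() if g.in_set]
        if not survivors:
            return  # nobody left in the set to take over the buses
        # Assumption for this sketch: the survivor commanding the fewest
        # buses acts as the designated listener and takes over.
        listener = min(survivors, key=lambda g: len(g.buses))
        listener.buses.extend(failed.buses)
        failed.buses = []


if __name__ == "__main__":
    rs = RedundantSet()
    for i in range(1, 5):
        rs.add(f"GPC{i}", [f"FF{i}", f"FA{i}"])  # illustrative bus labels
    rs.sync_point({"GPC1", "GPC3", "GPC4"})          # GPC2 misses once
    print(rs.sync_point({"GPC1", "GPC3", "GPC4"}))   # GPC2 misses twice
    print(rs.gpcs["GPC1"].buses)
```

A single miss followed by an on-time arrival resets the counter in this sketch, so only two consecutive misses remove a computer from the set.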
During my Shuttle days I was exposed to a number of great concepts that today are common. There I was first exposed to vector graphics, used in the Shuttle UI/display units, and to the heads-up display (HUD), which I see as my first exposure to “augmenting reality”. I was exposed to deep embedded real-time programming, hard-core scheduling, and redundant systems (as any man-rated software should be) where computers can vote each other out to maintain safety. And I was also exposed to what probably was one of the first real uses of metaprogramming and self-modifying code. Many people don’t know that the Space Shuttle OS implements self-modifying code for the purpose of “fault tolerance”: the I/O code will overwrite itself at runtime in order to bypass faulty I/O elements and take control of I/O buses when needed.
Back then I contributed in many ways. We were put under a lot of pressure to find “answers” to issues before we could go for launch or re-enter for landing.
One example was a ‘random’ issue that was showing up where blocks of memory were getting zeroed. It took me 3 weeks to figure that one out. Management was impatient; everyone was. But I finally nailed it when I was able to trace the issue to a *single* instruction of Assembler code. But how could this be? The answer: a microcode bug. The HW folks at first couldn’t believe it; microcode issues are almost unheard of. This was an issue related to how the Move-HalfWord (MHV) instruction behaved when the destination and source addresses overlapped, which was a ‘trick’ used to clear out memory (here 0xdeadbeef helped me find the source of the issue). Once it was found, code audits were done, the code was patched, and we were Go.
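For readers who have not seen the overlapping-move ‘trick’ before, here is a small Python sketch of the general idea: zero the first halfword, then do a forward move whose destination starts one halfword after its source, so each step copies the zero that the previous step just wrote. The function names are made up for this illustration, and the sketch models a plain element-by-element copy, not the actual instruction or its microcode; it does not reproduce the bug described above.

```python
def move_forward(mem, dst, src, count):
    """Naive forward copy, one halfword at a time, with no overlap handling."""
    for i in range(count):
        mem[dst + i] = mem[src + i]


def move_buffered(mem, dst, src, count):
    """Copy as if through a temporary buffer (overlap-safe semantics)."""
    mem[dst:dst + count] = mem[src:src + count]


def clear_block(mem, start, length, move=move_forward):
    """Clear `length` halfwords at `start` using the overlapping-move trick."""
    mem[start] = 0
    move(mem, start + 1, start, length - 1)


if __name__ == "__main__":
    # Fill memory with a recognizable pattern so stale data stands out.
    mem = [0xDEAD, 0xBEEF] * 4
    clear_block(mem, 2, 4)
    print([hex(h) for h in mem])

    # The same trick with an overlap-safe move does NOT clear the block.
    mem = [0xDEAD, 0xBEEF] * 4
    clear_block(mem, 2, 4, move=move_buffered)
    print([hex(h) for h in mem])
```

The second call shows why the trick depends entirely on how the move behaves when source and destination overlap: with overlap-safe semantics the old contents are simply shifted by one position.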
Another experience was when one of the computers actually failed to sync in orbit during the STS-51 mission. GPC2 was voted out of the redundant set. This particular issue brought extreme pressure, as our astronauts were in orbit and it was imperative to know whether this was a problem that would affect re-entry. Because the fail-to-sync had occurred on the first or second day of a 2-week mission, we had some time to figure this one out. Using downlink data (every 160 ms), memory dumps, knowledge of the code, and the help of other experts, we all got to work. A lot of detective work. In the end, I could come up with only one answer to the issue, which to this date has remained. After lots of careful analysis I was able to trace the fail-to-sync to a single If-statement or ‘branch out’ instruction, which seems to have taken GPC2 down a different route, so it didn’t show up to sync when it was supposed to. It happened in the DEU UI code (which is coded in the HAL/S programming language). It was as if the contents of the variable being tested were different on this particular computer. This was hard to prove, as I could not see the actual value on the downlink; it was loaded into registers for the actual branch-out/test. But how could that be? The Space Shuttle computers are space-radiation hardened, but are susceptible to soft errors or single-event memory upsets. In space, cosmic radiation will flip bits in memory all the time, especially over the South Atlantic Anomaly (when entering the anomaly region, you can see the bit-flip count going up like crazy on the monitor screens during mission support). As a side note, GPC memory can sustain a 1-bit flip in a given word (32 bits) and will self-correct it during memory scrubs, but 2+ bit flips will crash the computer.
Back to the fail-to-sync story: the only part of the processor that is NOT radiation hardened is the registers themselves, so my only possible answer was that when the branch instruction executed, which uses registers (R2 in this case), the value of the register itself must have flipped. Everyone was like, huh? But there we were, there I was, with the analysis, the dumps and the explanation. That was the only explanation; everyone agreed, though some had doubts. We got the go-ahead, and all went well.
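The side note above about memory scrubs (one flipped bit per 32-bit word gets corrected, two or more cannot be) matches the behavior of a single-error-correcting, double-error-detecting code. This post does not say which code the GPC memory actually used, so the Python sketch below is an assumption: a textbook extended Hamming code that stores 39 bits per 32-bit word. The names `encode` and `scrub` are made up for the example.

```python
DATA_BITS = 32
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32]
CODE_LEN = 39  # position 0: overall parity; positions 1..38: Hamming code
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in PARITY_POSITIONS]


def encode(word):
    """Encode a 32-bit word into a list of 39 bits."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    # Each parity bit makes the parity of the positions it covers even.
    for p in PARITY_POSITIONS:
        bits[p] = sum(bits[pos] for pos in range(1, CODE_LEN) if pos & p) % 2
    # Overall parity bit makes the parity of the whole codeword even.
    bits[0] = sum(bits) % 2
    return bits


def scrub(stored):
    """Check one stored codeword.

    Returns (word, status), where status is "ok", "corrected" or
    "uncorrectable"; word is None when the error cannot be corrected.
    """
    bits = list(stored)
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            syndrome ^= pos
    overall_odd = sum(bits) % 2 == 1
    if overall_odd:
        # Odd overall parity: treat as a single flipped bit at `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        if syndrome >= CODE_LEN:
            return None, "uncorrectable"
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even overall parity but a non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        word |= bits[pos] << i
    return word, status


if __name__ == "__main__":
    code = encode(0xDEADBEEF)
    code[17] ^= 1                      # one upset
    print(scrub(code))
    code[3] ^= 1                       # a second upset in the same word
    print(scrub(code))
```

In this sketch a double flip is only reported; what the system does next (per the text, the computer crashes) is outside the scope of the example.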
BTW, one of the reasons memory dump analysis as above was possible is that the memory model used by FCOS is static. It is a very deterministic model, from the rate-monotonic process scheduling, to the I/O profile at any given time, to the memory layout: the I/O and process queues, the interrupt vectors, every piece of code, the patch areas, everything; you knew the exact layout and location of everything. I could take a memory dump, read it (manually), and tell you exactly what was going on with that particular computer. Today I still believe that for any man-rated software system, static-deterministic models are best; you need to be able to see, explain and save lives by reading a memory dump. I then wrote tools in OS/2 that would take a given dump and tell you what was going on.
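The reason a static layout makes dump reading practical can be shown with a few lines of Python: if every region lives at a fixed, known offset, a dump decoder is little more than a lookup table. Everything specific here is hypothetical and chosen only for illustration: the region names, offsets and sizes in `LAYOUT`, the big-endian 16-bit halfword format, and the function names. This is not the real FCOS layout or the original OS/2 tooling.

```python
import struct

# Hypothetical static layout: (region name, byte offset, byte length).
LAYOUT = [
    ("interrupt_vectors", 0x0000, 16),
    ("process_queue",     0x0010, 16),
    ("io_queue",          0x0020, 16),
    ("patch_area",        0x0030, 16),
]


def decode_dump(dump, layout=LAYOUT):
    """Split a raw dump into named regions of big-endian 16-bit halfwords."""
    report = {}
    for name, offset, length in layout:
        chunk = dump[offset:offset + length]
        if len(chunk) != length:
            raise ValueError(f"dump too short for region {name!r}")
        report[name] = list(struct.unpack(f">{length // 2}H", chunk))
    return report


def locate(address, layout=LAYOUT):
    """Map a byte address to (region name, offset within region), or None."""
    for name, offset, length in layout:
        if offset <= address < offset + length:
            return name, address - offset
    return None


if __name__ == "__main__":
    dump = bytes(range(64))          # stand-in for a real memory dump
    report = decode_dump(dump)
    for name, halfwords in report.items():
        print(name, [f"{h:04x}" for h in halfwords])
    print(locate(0x0024))
```

With dynamic allocation the table would have to be reconstructed from the dump itself; with a static model it can be written down once and trusted.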
Before I left the program, I helped in the analysis in preparation for GPS I/O support and the new Glass-Cockpit, but I didn’t get to work on their implementation.
And there are other stories like the above, not only from me, but from many others; amazing stories.
I enjoyed working with the astronauts themselves, and it was very cool to meet in person John Young, first Shuttle astronaut, who also walked on the moon (he visited the moon twice!). And I enjoyed working with the other amazing individuals, super sharp, super smart. It was super cool listening to the astronauts as they used the code I had written, especially when it was used for the first time in orbit and all worked well; it was great. And I always had a blast during mission support. Behind the big room where the flight controllers are (the one you see on TV) there is another room, called the back-room or Mission Evaluation Room, where all the engineers for each subsystem of the vehicle are located. Typically a flight controller consults with the back-room engineers when making decisions; in my case I was one of the engineers for anything related to the flight computer operating system.
I loved my time at the manned space program. I am very proud to have received the Silver Snoopy. While it didn’t pay a whole lot, it was the best job I have ever had; the people I worked with, the missions, the space program, the pride.
(A cool ‘family fact’ is that my brother also worked in the Shuttle program at the same time I did; he worked on the Thermal Protection System (the tiles). During a number of missions we saw each other in the back-room/Mission Evaluation Room while giving mission support. I am not sure how many brothers have worked together on the same mission giving mission support, but that is pretty cool. My brother is also a recipient of the Silver Snoopy Award.)
And with this last mission of the Space Shuttle, an era of the manned space program ends, and a new one begins, I hope. I am thankful for such experiences and proud of the USA space program, and specifically the manned space program, and what it has accomplished in its 50 years (and 30 years of the Space Shuttle program).
It is of great importance that we as a Nation get back into the manned space program as soon as possible. Otherwise we are going to lose lots of experience and expertise; gone. The manned space program requires practice, and it is not like riding a bicycle, which you can pick back up easily with little practice. In the manned space program, if you do not practice, if you forget, people will die. Unacceptable. For the next five years we are going to be relying on our friends the Russians to get to space, but we really need to get back to it by ourselves. The research that comes out of this, the jobs, and our nation’s independence (and status) when it comes to space exploration depend on it.
Today I dedicate my time to working on mobile and wireless technologies and software, but I always look back at my days at the Space Shuttle program, and remember…
Godspeed to the crew of STS-135 Atlantis…
/C. Enrique Ortiz (CEO)
The photo below is of Hoot Gibson handing me the Silver Snoopy Award and congratulating me:
Space Shuttle Computers and Avionics
7/16/2011 STS-135 Fail-to-Sync
Source: Huffington Post
NASA declared all five of Atlantis’ primary computers to be working, pending evaluation of the latest shutdown.
Computer failures like this are extremely rare in orbit, said lead flight director Kwatsi Alibaruho. The two problems appear to be quite different, he noted. The first was caused by a bad switch throw; the second possibly by cosmic radiation.