In-House Hardware Development
When Johnson Space Center’s Matt Lemke showed up for work as the project manager of the space-to-space communications system at the end of 1994, he looked forward to leading a team of NASA designers on the biggest project in his division. Lemke was an experienced avionic design engineer who was relatively new to project management. He would soon discover that he was starting with little more than an immature prototype system and an unforgiving schedule. He did not anticipate that the project would have to reverse-engineer its drawings from scratch, unravel major latent design defects, extend its delivery date by 300 percent, limit its systems testing to make up for lost time, or test the radios for anomalies on the launchpad right before its first in-flight trial on a shuttle.
The space-to-space communications system (SSCS) is designed to provide voice and telemetry among three on-orbit systems: the Space Shuttle orbiter, the International Space Station (ISS), and the Extravehicular Mobility Unit (EMU), the space suit worn by an astronaut during a spacewalk or extravehicular activity. SSCS is designed to allow simultaneous communication among up to five users. The system consists of space suit radios (SSER), the shuttle orbiter radio (SSOR), and the space station radio (SSSR). The three have common elements but also unique features and different designs.
NASA decided to treat the SSCS as an in-house development, meaning that its own personnel would design and deliver the system. The Agency held a competitive bidding process and selected a prime contractor to refine the design and manufacture the radios.
A Difficult Reorganization
The formal start of the SSCS project coincided with a reorganization within the engineering directorate at Johnson. Two divisions, the Tracking and Communications Division and the Flight Data Systems Division, merged into a new Avionic Systems Division. At the same time, a new project management office was created to manage the engineering project teams that in the past had interacted directly with the Space Shuttle or ISS programs. Both administrative changes affected morale, and several key engineers with radio expertise opted not to work for the project management office, which now had oversight over the SSCS project. At about the same time, the Johnson engineering directorate awarded a new general engineering support contract. As a result, all the contractor designers on SSCS left the project before Lemke took the reins. The engineering drawings those designers had completed for the prototype were nowhere to be found.
In short, Lemke began his first significant NASA project management assignment under a new internal organization, with no engineering drawings, none of the designers who had worked on the earlier phase of the development, and a project team with no expertise in the complex SSCS radio system architecture.
In Lemke’s estimation, hard work was the answer. He relied on a team that was ready to give its all, despite its inexperience with the inherited technical design. The project itself was a motivator: it was the biggest project in the division, the work was important and challenging, and it offered a rare opportunity to do hands-on hardware development.
The in-house team of designers began the painstaking process of deriving drawings from the prototypes, using calipers, ohmmeters, and other reverse-engineering tools to determine the exact specifications of the boards. Every measurement was an opportunity for a mistake; a single missed connection might mean that an entire circuit wouldn’t work. The team’s progress proved excruciatingly slow, and Lemke realized that at this pace the project would never be completed.
When he explained the situation to the contractor, he was assured its engineers could recreate the drawings. Lemke initiated a contract change and handed the boards over to his contractor, which was eager to prove itself on this project, its first at Johnson. Eight months later, the drawings were complete. The project was now where it should have been when Lemke arrived for his first day on the job.
The time the project had lost recreating the drawings inhibited the maturation of the design. The contractor was supposed to have spent those months turning the engineering prototypes into radios that could be manufactured and building test units. Instead, it recreated the laboratory units, which didn’t meet the project’s requirements. This became clear from the performance of the design verification test units (DVTU) the contractor had faithfully built based on the reengineered drawings. The DVTUs didn’t work well as a five-radio network for multiparty conversations.
With the scheduled delivery date for the space station radio closing in, Lemke elected to make the necessary fixes in a piecemeal fashion rather than add another DVTU cycle to address the problems on a systems level. Hoping to meet the delivery schedule for the space station, division management agreed.
At this stage, the contractor informed Lemke that none of the units would consistently pass the specification tests. In response to growing concern that the NASA design had problems, the SSCS chief engineer expressed confidence in the design and asserted that the problem was the contractor’s manufacturing processes. Lemke pressed his contractor to stick to the design and build the qualification test units as though they were flight units.
The problems the contractor had predicted began to surface in the qualification units. Since the modem and receiver boards were identical for all three radios, flaws in one were reproduced in the others. A seemingly endless series of quick fixes was being made at the same time that the contractor kept producing more radios. This led to constant reworking of all the existing radios. The project operated in fire-drill mode, scrambling from one problem to the next, leading to schedule changes on a weekly basis and no time for rigorous systems testing.
Division program management was assured that the fundamental problems were understood; all that remained was hard work to get the units delivered. This seemed a reasonable time in the project life cycle for Lemke to transition into another job opportunity while his deputy, Dave Lee, took the helm for the remainder of the project. The managerial transition was a smooth one, but no one involved recognized the hidden defects in the design that would soon emerge. Within a year, Lemke would be re-enlisted, along with temporary reinforcements from some of the division’s best engineers.
The radios made it through acceptance, performance, and qualification testing. Some individual radios did not perform as well as expected, but they passed. The time came to modify the Space Shuttle orbiter and the space suits to accommodate the new radios. In the fall of 1998, the SSCS underwent a test flight on Space Shuttle mission STS-95. The flight uncovered some minor glitches, including an instance in which one radio would not talk to another. This problem was attributed to operator error and solved by cycling the system’s power (turning it off and back on). The SSCS team thought the radios were ready for a real in-flight trial.
The project delivered its first radio to the space station in November 1998. At this point, problems seemed to be decreasing; there were still lots of fixes, but the technical work seemed manageable. The delivery schedule, however, remained daunting, as the project team faced demands for twenty-four flight radios and almost 300 spare modules. The next major effort was preparation at Kennedy Space Center for mission STS-96, which would launch in the spring of 1999.
One month before the launch of STS-96, the project was granted special access on the launchpad to conduct burn-in testing of the radios. (Burn-in testing typically involves running electronics products with the power on for a number of hours to uncover defects resulting from manufacturing aberrations.) SSCS project manager Dave Lee and Lemke, who at this point was a consultant to the project, flew to Kennedy to lead the test. After the problems on STS-95, this test was established to regain the program’s confidence that the SSCS was stable, reliable, and error-free.
The SSCS team was granted permission to spend the entire evening of May 10 on the launchpad with the shuttle orbiter Discovery for dedicated SSCS radio tests. The first few hours were uneventful. A few anomalies were noticed, but the team, still committed to vindicating the radio’s reputation, rationalized them as flaws of the ground support equipment.
Then a thunderstorm approached, producing severe radio frequency turbulence across the marshy plains of the launchpad vicinity. With each crack of thunder, the SSCS radio signals buzzed and oscillated crazily. As they heard new sounds in their headsets, the radio operators characterized them with descriptive nicknames: motor-boating, rain on the roof, laryngitis. Even after this test, many latent defects remained undiscovered, the most punishing of which would prove to be the radio’s hypersensitivity to other signals near its frequency range.
By morning, the radio’s reliability problem was evident to every senior manager at the center. With the launch seventeen days away, the shuttle crew had to be trained in recovery procedures in case the radios malfunctioned in flight. A highly talented mission operations engineer, David Simon, helped the SSCS team characterize the problems, and he taught the crew how to respond by creating a cue card describing all the potential problems and mitigations. (The crew was already trained in the use of hand signals in the event of a complete radio failure.)
The last-minute training proved necessary. Astronauts experienced motor-boating during a spacewalk. The pre-established procedures allowed them to recover gracefully from the malfunction, and the crew successfully carried out its mission despite the problems with the radios. On the ground, the SSCS team was ecstatic that nothing derailed the overall success of the mission.
Aftermath and Recovery
But SSCS had failed dramatically in a high-risk and high-visibility situation, and the debriefings of the STS-96 crew drew the attention of NASA’s senior management. The shuttle and the EMUs had to be retrofitted with the original radios for the next flight. The failure also marked the beginning of the project’s turnaround. Management ordered the SSCS team to fix the system. Cost and schedule were secondary to finding the root causes of the problems. Every element of the design was reviewed. This allowed the team to conduct the extensive systems testing that it had forgone in the run-up to its first flight. The project also received resources to bring in experts who could help solve the problems.
One of these experts was Mark Chavez, a soft-spoken and highly gifted radio frequency (RF) engineer, who took the helm as chief engineer. Troubleshooting a complicated design that was never fully documented or understood is a challenging reverse-engineering task. After hundreds of hours of testing and analysis, Chavez and a talented support team found the key issue plaguing the radios: a hypersensitive demodulator circuit that saturated itself every time another signal was near the SSCS frequency. The effects of the storm on the launchpad were now understood, as were other performance problems that seemingly occurred at random, such as on-orbit interference that was probably caused by taxicabs in South America.
Latent defects were isolated one by one in a focused and deliberate process that brought in the division’s best design engineers in RF, software, and electronics. Each discovery helped explain the next problem in line, a phenomenon that division chief engineer Paul Shack described as peeling away the layers of an onion.
Each engineer took responsibility for a specific known problem and ran extensive isolated tests to address every issue. The final graduation test was an extensive system test using every imaginable configuration for a five-radio network, all conducted in the noisy open-air environment of the Johnson Space Center’s back lot field, adjacent to all the RF noises of fast food restaurants, two-way commercial radios, and noisy cars. A huge space station airlock mock-up was trucked into the test field to serve as a simulated space station structure to reflect and diffract RF energy from the transmitters. Simulated space suits were outfitted with radios and installed in the beds of two pickup trucks that were driven around the field to try to destabilize the radio network. At the end of this grueling process, the team could claim success at last.
A year after STS-96, the SSCS was redeployed for STS-101 in May 2000. The phantom noises that had plagued the system previously were gone. By the time of STS-106 four months later, the SSCS achieved error-free operations for the first time. It has continued to do so ever since.
The SSCS project went through seventy-five contract modifications and six contract analysts in the process. The story has no single hero: a minimum of 181 people were directly and significantly involved in the project’s ultimate success.
In hindsight, Lemke can point to three major lessons of his SSCS experience. “The first lesson is that technical performance must come first. Schedules and cost projections are meaningless if the design isn’t solid. My first priority on projects today is to get the right technical team in place with the right experience and the right mentors.
“The second lesson I learned is the need for validation testing in realistic environments. The radios were fully tested, verified, and certified to meet all requirements prior to flying. Unfortunately, hundreds of hours of successful testing provided no assurance of proper operation if the testing wasn’t thorough enough. It wasn’t until we took a system view of the radios and tested them as they would fly that we uncovered our design flaws.
“The final major lesson was to communicate schedule issues early and effectively. Had I fought harder and more effectively for my team to have the needed time up front, we would have saved countless contract modifications, configuration changes, and fixes in flight hardware that should have been done on development hardware,” he said.
He saw the failure during the STS-96 mission as a turning point that led to the resolution of the project’s difficulties. “That’s where we got to spend the time with our design to really get to the root cause. We got to do the testing, got to find out where the flaws were, and fix it,” he said. “It was just getting the team, the time, and the management support to solve it. There were no more Band-Aids. ‘Go solve it, and whatever it takes, you do it.’”