Aviation safety – theory and practice

Simon Bennett provides an overview of aviation safety


Although one of the safest forms of transport, aviation has suffered some tragic losses. In 1974, an explosive decompression downed a Turkish Airlines DC-10 with the loss of 346 lives. In 1977, a Pan American 747 collided with a KLM 747 killing 583. In 1979, following the failure of an engine mounting, an American Airlines DC-10 spiralled into the ground killing 273. In 1980, a flash-fire on a Saudia L-1011 TriStar killed 301. In 1983, an off-course Korean Airlines 747 was blown out of the sky by a Soviet fighter, killing 269. In 1985, an explosive decompression downed a Japan Airlines 747 with the loss of 520 lives. In 1988, a US Navy surface-to-air missile downed an Iran Air A300 with the loss of 290 lives. In 1996, a mid-air collision between a Saudia 747 and a Kazakhstan Airlines Il-76 killed 349. In 2014, a Russian-made missile downed a Malaysia Airlines 777 killing 298. In October 2018, a Lion Air Boeing 737 Max 8 crashed shortly after take-off from Jakarta. In March 2019, an Ethiopian Airlines 737 Max 8 crashed shortly after take-off from Addis Ababa. A total of 346 passengers and crew died in the two disasters.

Aviation has, by its nature, the capacity to kill large numbers of people in an instant. Given the human, financial and reputational costs of disaster, the industry is tireless in its pursuit of new thinking on safety. Occasionally, academia helps. For example:

Coupling, complexity and the human touch

In the 1980s, American safety advocate Professor Charles Perrow suggested that coupling and complexity are inversely related to reliability and resilience. That is, the more tightly coupled and interactively complex a system, the less reliable it is (in Perrow’s scheme, complexity and coupling are distinct dimensions, and systems that combine both are the most accident-prone). Perrow’s safety formula links reliability with simplicity, redundancy and ‘slack’ (buffers).
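Perrow’s formula can be illustrated with a back-of-the-envelope reliability calculation. The sketch below is hypothetical (the figures are invented, not drawn from Perrow’s work): it shows how chaining components in series degrades reliability, while redundancy – Perrow’s ‘slack’ – restores it.

```python
# Hypothetical illustration of Perrow's safety formula: reliability
# falls as components are chained in series (tight coupling), and
# rises when redundant back-ups ('slack') are added. Figures invented.

def series_reliability(r, n):
    """All n components must work; each works with probability r, independently."""
    return r ** n

def parallel_reliability(r, n):
    """System works if any one of n independent redundant components works."""
    return 1 - (1 - r) ** n

# A single component that works 99% of the time:
r = 0.99

# Ten such components in series degrade overall reliability to about 90%:
print(round(series_reliability(r, 10), 3))

# Triplicating one component raises reliability to 'six nines' territory:
print(round(parallel_reliability(r, 3), 6))
```

The asymmetry is the point: coupling multiplies fragility, redundancy multiplies resilience – provided, as the next sections show, the redundant channels really are independent.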

More recently, Professor Erik Hollnagel has argued that when things go awry, operators (for example, pilots) must be able to make timely and effective interventions. In Hollnagel’s socio-technical conception of safety (known as Safety-II), the operator is considered an asset rather than a liability. Frequently, it is the employee who saves the day. In September 1983, weeks before NATO’s Able Archer exercise, a Soviet military satellite mistook reflected sunlight for missile launches. Soviet early-warning computers informed the duty officer, Stanislav Petrov, that the United States had launched five ballistic missiles against the USSR. Protocol required Petrov to recommend a retaliatory strike. However, Petrov judged that the satellite trace was a false alarm (a ‘false positive’). He reasoned that the United States would have launched all its missiles in a first strike. Using his judgement and intuition, Petrov decoupled the USSR’s tightly-coupled missile-defence system and, arguably, saved the world from accidental nuclear war.

In a paper titled Close Calls with Nuclear Weapons, the Union of Concerned Scientists notes: “The strongest, and one of the few, safety links in the chain was the judgement of the officer in command of the early warning centre”. Despite the bellicose context – heightened East-West tension fomented by US and Soviet leaders, a large military exercise and the recent downing by Soviet fighters of a Korean Airlines Boeing 747 – Petrov maintained his capacity for reason. The 1983 near-miss demonstrated the importance of affording those in charge of automated systems opportunities to exercise judgement. Petrov’s management of the incident to a satisfactory conclusion supports Hollnagel’s argument that operators make a net positive contribution to safety and reliability: “Things do not go right because people behave as they are supposed to, but because people can, and do, adjust what they do to match the conditions …. As systems … introduce more complexity, these adjustments become increasingly important”.

The saving of US Airways Flight 1549 in 2009 by Captain Chesley Sullenberger, First Officer Jeff Skiles and flight attendants Donna Dent, Doreen Welsh and Sheila Dail, is the best recent example of operators’ contribution to safety. On January 15, 2009, US Airways Flight 1549, an Airbus A320 carrying 155 passengers and crew, encountered a flight of Canada geese. Canada geese have a wingspan of around six feet and can weigh up to eighteen pounds. The bird-strike, which happened at 2,900ft with the aircraft making 200 knots, disabled both engines. Drawing on their experience, skill and judgment, Sullenberger and Skiles ‘worked the problem’. In his autobiography, Sullenberger recalled the moment the birds struck: “The symmetrical loss of thrust … was shocking and startling …. I could feel the momentum stopping, and the airplane slowing …. Within eight seconds of the bird strike … I knew that this was the worst aviation challenge I’d ever faced. It was the most sickening, pit-of-your-stomach, falling-through-the-floor feeling I had ever experienced”. Having taken control, Captain Sullenberger elected to ditch his stricken aircraft in the Hudson River. For Sullenberger, this was the least-worst option. “We’re gonna be in the Hudson” he announced to air traffic control, then proceeded to execute a textbook ditching.

Flight 1549’s passengers were saved not by automation, but by the flight-deck’s capacity for seat-of-the-pants flying. They were saved by a Captain who had soloed after just seven hours and twenty-five minutes of flying, had flown fast jets in the United States Air Force and had been a commercial pilot for nearly 30 years. They were saved by a Captain who, drawing on his experience, picked the least-worst option. As happened during the 1983 missile crisis, lives were saved because operators were able and willing to exercise judgement. In Perrow’s argot, lives were saved by introducing slack into a tightly-coupled system.

In July 2018, Reuters reported that: “Airplane manufacturers are working … to build new cockpits designed for a single aviator in order to ease a global pilot shortage and cut airline costs”. Asked to comment, Cranfield University’s Professor Graham Braithwaite said: “The technology to fly an aircraft on automatic is brilliant …. We are really short of pilots. They are a very expensive resource”. Those who advocate single-pilot flight-decks should ask themselves whether Sullenberger could have saved Flight 1549 on his own. In his autobiography, Sullenberger praised his First Officer: “Jeff and I had met just three days before …. Yet during this dire emergency – with no time to verbalise every action and discuss our situation – we communicated extraordinarily well. Thanks to our training and our immediate observations in the moment of crisis, each of us understood the situation, knew what needed to be done, and had already begun doing our parts in an urgent, yet co-operative fashion”.

Arguably, if Sullenberger had been alone on the flight-deck he would have been overwhelmed. As to Braithwaite’s claim that today’s automatics are ‘brilliant’, during my two decades on the flight-deck I have witnessed numerous malfunctions, including autopilot failures.

Despite the proven safety benefits of a two-person flight-deck, the one-person flight-deck is still mooted. In a September 2010 interview with the Financial Times, Ryanair Chief Executive Michael O’Leary claimed that short-haul flights could be operated safely by a single pilot. Following O’Leary’s claims, Ryanair said: “We are starting the debate so that we can look to reduce costs without compromising safety …. Given the sophistication of our aircraft [the Boeing 737 was designed in the 1960s] we believe that one pilot flying can operate safely on short routes and reduce fares for all passengers”. In August 2018, Boeing’s vice president Steve Nordlund talked about the airframer’s plans for the flight-deck: “I don’t think you’ll see a pilotless aircraft … in the near future …. But what you may see is more automation and aiding in the cockpit, maybe a change in the crew number up in the cockpit”.

The single-pilot flight-deck, while economically attractive, is an accident waiting to happen. What if the pilot is taken ill? What if, as happened to Germanwings Flight 9525, the pilot decides to commit suicide-by-aircraft? What if, as happened to US Airways Flight 1549, the flight-deck finds itself processing inputs and deciding options under extreme time pressure? Airframers and CEOs should re-familiarise themselves with the phenomenon of ‘task saturation’ – something I have witnessed on the flight-deck. One of the pillars of safety is the monitoring and cross-checking by one pilot of the decisions and actions of the other. Removing one of the pilots removes this safeguard. Having a back-up pilot on the ground – so-called ‘distributed crewing’ – is no substitute for having a pilot on the flight-deck. The industry would be well advised not to jeopardise its improving safety record to save a few dollars. Professor Erik Hollnagel has spent his career encouraging designers to think of operators (for example, pilots) not as liabilities, but as assets. “Humans are … a resource necessary for system flexibility and resilience [providing] flexible solutions to many potential problems” notes the Professor. For his part, Chesley Sullenberger is convinced that he would have lost his aircraft had Jeff Skiles not been at his side: “If Jeff Skiles had been on the ground … there’s absolutely no way. It could not have been”.

Common-cause failures

A common-cause or common-mode failure sees multiple systems disabled by a single failure. The International Organisation for Standardisation (ISO) defines common-cause failures as “failures of different items, resulting from a single event, where these failures are not consequences of each other”. The complexity, coupling and density of aircraft create opportunities for common-cause failure. In 1989, a United Airlines DC-10 suffered a catastrophic engine failure at 37,000ft that severed all three of the aircraft’s hydraulic lines, leaving the crew having to use differential thrust (from the number one and number three engines) to control the aircraft. Author Andrew Brookes explains that the destruction wrought by the disintegration of engine number two was absolute: “When the [engine number two] fan-disk disintegrated … it flung out shrapnel in all directions. Fifty hits were found in the tail structure, including one measuring 10 x 12 inches, and among other effects, the debris burst severed the three separate hydraulic lines …. Experts were unsure whether any hydraulic system could have survived the disintegration that befell United Flight 232”. The loss of control experienced by Captain Al Haynes and his crew was the product of a common-cause or common-mode failure – a single event (the fan-disk disintegration) taking down multiple systems (the DC-10’s three independent hydraulic systems). There is a positive relationship between density and vulnerability. The denser an aircraft (that is, the more tightly-packed its systems), the more vulnerable it is to common-cause or common-mode failure. The addition of new elements such as in-flight entertainment (IFE) systems is making aircraft denser – and more vulnerable.
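Why common-cause failure so thoroughly defeats redundancy can be shown with a rough calculation using the standard beta-factor model, in which a fraction β of each channel’s failure probability is assumed to be a shared cause that strikes every channel at once. The sketch below is hypothetical: the failure rates and beta value are invented for illustration and are not real hydraulic-system data.

```python
# Hypothetical sketch of why common-cause failure defeats redundancy,
# using the standard beta-factor model. All figures are invented for
# illustration; they are not real hydraulic-system failure rates.

def p_all_fail_independent(p, n):
    """Probability that all n channels fail, assuming full independence."""
    return p ** n

def p_all_fail_beta(p, n, beta):
    """Beta-factor model: a fraction beta of each channel's failure
    probability is a common cause that disables every channel at once."""
    p_common = beta * p          # shared-cause contribution
    p_indep = (1 - beta) * p     # genuinely independent contribution
    return p_common + p_indep ** n

p = 1e-4   # assumed per-flight failure probability of one channel
n = 3      # three nominally redundant systems

print(p_all_fail_independent(p, n))      # ~1e-12: vanishingly rare
print(p_all_fail_beta(p, n, beta=0.05))  # ~5e-6: dominated by the common cause
```

Under independence, triple redundancy makes total loss almost unimaginable; with even a modest common-cause fraction, the shared term dominates and the triplication buys almost nothing – which is precisely what the fan-disk did to the DC-10’s three hydraulic systems.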

Photo: Steve Riot

Opacity and intractability

Complex, high-speed systems with limited feedback are difficult to control. Operators struggle to understand them. Opacity and intractability make it difficult, or impossible, for operators to take back control in an emergency. The problems of opacity and intractability were vividly dramatised by political scientists Eugene Burdick and Harvey Wheeler in their best-selling 1962 novel Fail-Safe. The novel’s premise – that machines are fallible – is pertinent today. The novel describes how an unanticipated and difficult-to-analyse malfunction in a defence computer sees a squadron of B-58 Hustlers tasked to eliminate Moscow. The film of the book contains noteworthy observations. During a seminar, General Black (played by Dan O’Herlihy), observes: “We’re going too fast. Things are getting out of hand …. We are setting up a war machine that acts faster than the ability of men to control it. We are putting men into situations that are getting too tough for men to handle …. We have got to slow down”.

Automation is framed as the answer to society’s many ills, from creaking health services (computer-based triage) to road deaths (driverless cars). In Fail-Safe, Burdick and Wheeler drew attention to automation’s inherent dangers. In 1983, social scientist Lisanne Bainbridge revisited the evils of opacity and intractability in her seminal paper ‘Ironies of Automation’. She wrote: “If the human operator is not involved in on-line control, she or he will not have detailed knowledge of the current state of the system. One can ask what limitations this places on the possibility for effective manual takeover, whether for stabilisation … or for fault diagnosis”.

More recently, Pamela Munro, a Boeing human-factors specialist, has argued that pilots must be kept in the loop: “Engineers don’t always realise that automation can lull people into complacency …. People are expected to be able to jump in when something goes wrong, but if they haven’t been getting feedback, they lose the ability to analyse the situation”. In his June 2019 testimony to the Aviation Subcommittee of the United States House Committee on Transportation and Infrastructure on the Boeing 737 MAX 8 crashes, Captain Chesley Sullenberger said: “We must … provide detailed system information to pilots that is more complete …. We should all want pilots to experience … challenging situations for the first time in a simulator, not in flight with passengers and crew on board”.

In 2009, an Air France Airbus A330, en route from Rio to Paris, plunged into the sea from cruise altitude, killing everyone on board. According to the Bureau d’Enquêtes et d’Analyses (BEA), immediate causes included “obstruction of the pitot probes by ice crystals that … caused … autopilot disconnection”, “the crew not identifying the approach to stall” and “the crew’s … lack of inputs that would have made it possible to recover from [the high-altitude, high-speed stall]”.

Proximate causes included the flight-crew’s inability to quickly access angle-of-attack data, and the airline’s failure to train stall-identification and recovery skills.

Regarding the former issue, the BEA noted in its July 2012 Final Report: “The aeroplane’s angle-of-attack is not directly displayed to the pilots …. It is essential in order to ensure flight safety to reduce the angle-of-attack when a stall is imminent. Only a direct readout of the angle-of-attack could enable crews to rapidly identify the aerodynamic situation of the aeroplane and take the actions that may be required. Consequently, the BEA recommends … that the European Union Aviation Safety Agency and the Federal Aviation Administration evaluate the relevance of requiring the presence of an angle-of-attack indicator directly accessible to pilots on board aeroplanes”.

Regarding the latter issue, the BEA’s report stated: “Examination of their last training records and check-rides made it clear that the co-pilots had not been trained for manual aeroplane handling of approach to stall and stall recovery at high altitude”. The angle-of-attack data display issue, together with the crew’s inadequate stick-and-rudder skills, created a perfect storm of latent errors (see the work of Professor James Reason) that increased the chance of mishap. A latent error is an accident waiting to happen.

When automation inhibits situation awareness to the point where it is no longer possible for an operator to remedy a malfunction, consideration should be given to removing or re-engineering the system in question. Lives may depend on it. Despite the claims of Steve Nordlund, Michael O’Leary and others, automation is not a panacea. Sometimes it is lethal. The compensation claims and premium hikes resulting from an aircraft careening into a terminal because its single pilot suffered a heart-attack on short finals will eclipse the dollars saved by eliminating a crew member. United Airlines Flight 232 was saved neither by a computer nor by distributed crewing. It was saved collegially, in situ, by Captain Al Haynes, First Officer William Records, Flight Engineer Dudley Dvorak and Check Pilot Dennis Fitch. Haynes is convinced that he could not have saved the aircraft on his own: “We had 103 years of flying experience there in the [DC-10’s] cockpit, trying to get that airplane on the ground, not one minute of which we had actually practised, any one of us. So why would I know more about getting that airplane on the ground … than the other three? So if … we had not let everybody put their input in, it’s a cinch we wouldn’t have made it”. In 1991, The Honourable Company of Air Pilots bestowed on Captain Al Haynes the Hugh Gordon-Burge Memorial Award. An industry that ignores men of integrity like Al Haynes and Chesley Sullenberger is heading for a fall.