British Airways IT Systems failure: Engineer unplugged and replugged the wrong plug!

Yossarian

British Airways says IT chaos was caused by human error

The boss of British Airways' parent company says that human error caused an IT meltdown that led to travel chaos for 75,000 passengers.

Willie Walsh, chief executive of IAG, said an engineer disconnected a power supply, with the major damage caused by a surge when it was reconnected.

He said there would now be an independent investigation "to learn from the experience".

However, some experts say that blaming a power surge is too simplistic.

Mr Walsh, appearing at an annual airline industry conference in Mexico on Monday, said: "It's very clear to me that you can make a mistake in disconnecting the power.

"It's difficult for me to understand how to make a mistake in reconnecting the power," he said.

He told reporters that the engineer was authorised to be in the data centre, but was not authorised "to do what he did".

[...]

Scepticism

However, an email leaked to the media last week suggested that a contractor doing maintenance work inadvertently switched off the power supply.

The email said: "This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries... After a few minutes of this shutdown, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the systems and significantly exacerbated the problem."

http://www.bbc.co.uk/news/business-40159202
:))) :))) :)))

On a serious note:

No way can that be the only explanation. The Disaster Recovery procedures of such a major corporation, with multiple Data Centres at different physical locations, are far too robust for something that simple to cause a meltdown of such key systems.

The IT systems would be designed with failsafe backups such that, even if one data centre suffered a major disaster (e.g. a fire destroying the whole building), the backup systems physically located elsewhere (which were 'mirroring' the primary systems) would have taken over instantly.
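To illustrate what 'taking over instantly' looks like in practice, here is a minimal failover sketch in Python. The hostnames and the health-check URL are invented for the example, so treat it as the shape of the idea rather than anything BA actually runs:

# A minimal failover sketch. The hostnames and the /health path are invented
# for this example; a real setup would live in the load balancers and
# monitoring, not in a script like this.
import urllib.request

PRIMARY = "https://dc-primary.example.com/health"      # hypothetical primary DC
SECONDARY = "https://dc-secondary.example.com/health"  # hypothetical mirrored DC

def is_healthy(url, timeout=2):
    """Return True if the data centre answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_active_site():
    """Send traffic to the primary; fail over to the mirror if it stops answering."""
    if is_healthy(PRIMARY):
        return PRIMARY
    if is_healthy(SECONDARY):
        return SECONDARY      # the mirrored site takes over
    raise RuntimeError("Both sites down: invoke the manual DR plan")

# e.g. active = choose_active_site()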
 

Yes, it's all too easy to blame the poor non-employee, timesheet-submitting Nigerian IT worker. It's more down to the lack of a DR strategy and the cost cutting in the form of offshore service management that all the larger firms are now practising. I work for Coca-Cola in London; mini IT meltdowns happen quite often, with the support services farmed out to places like Manila, India and Poland. Pay peanuts, get monkeys!
 
Sure. But this was not just a mini-meltdown. Apparently every key system was affected: booking, ticketing, check-in, customer services, flight planning, the systems that calculate optimum payloads and fuel loads per flight, back-office systems, even basic communication systems, i.e. virtually everything.

No way can that be caused by just a plug being unplugged and re-plugged. Backup systems/databases 'mirroring' transactions, running on dedicated servers in multiple Data Centres that are physically/geographically separated from each other, cannot all be affected in this way. Something is being hidden.
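Roughly what I mean by 'mirroring' transactions, as a toy sketch only; the in-memory dicts stand in for the real databases in two data centres, and every name in it is mine, not BA's:

# Toy sketch of transaction mirroring. The in-memory dicts stand in for the
# real databases in two separate data centres; the names are invented.
class MirroredStore:
    def __init__(self):
        self.primary = {}     # stands in for the primary DC database
        self.secondary = {}   # stands in for the remote mirror

    def write(self, key, value):
        """A write is only acknowledged once both copies have it."""
        self.primary[key] = value
        self.secondary[key] = value   # the remote DC gets the same transaction
        return "committed"

    def read(self, key, primary_up=True):
        """Serve from the primary; fall back to the mirror if the primary site is down."""
        return self.primary[key] if primary_up else self.secondary[key]

store = MirroredStore()
store.write("booking:123", "LHR-JFK, 27 May")
print(store.read("booking:123", primary_up=False))   # the mirror still has it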
 
Didn't the union blame Indian IT offshore for it?
Yes. That's why it can't be just a simple 'unplugged the wrong plug' as per my posts above.

If BA admits it's related to offshoring to India, it brings into question far too many things, affecting not just BA but virtually every bank, organisation and utility that subcontracts its systems to offshore providers in India. That's why there's more to it than is being told. I wouldn't be surprised if the govt. is involved in deciding what/whom to blame.

It's like an airliner crash. If an aspect of the design of the crashed plane is brought into question, then it raises question marks over the safety of all the planes of that design and/or using the same component and/or using the same maintenance procedures/facilities.

So you find a scapegoat to blame.

 


If it's related to the data centre, then it might not necessarily be the offshore model's fault, as various centres will be involved, and surely BA hasn't opted for the complete outsourcing model.

In that scenario the DCs will, at worst, be handled by BA three times out of five, with the electrical and mechanical work done by a third party, in this case CBRE, which is on site and has agreed to an enquiry.

IIRC, TCS is the offshore company handling BA's support systems, and I don't think any company, BA included, outsources the entire operation; nothing to do with the DCs would be operated by TCS.
 
Well, Indian IT companies offer cheap support but the quality is really very poor. The employees do not take ownership, management is poor and projects are generally short-staffed. Having said that, if it was a power failure then it's not the outsourced company's fault.
 

I have worked on a lot of projects and can assure you a lot of big companies around the globe do not have proper DR procedures. It wouldn't surprise me if BA's network is the same.
 
DR is something that the beancounters can't get their heads around. "We've never had a failure before" is often their stock answer. To which one says: "The airliner you see flying above has also not crashed before... but if some component fails...". It usually does the trick in getting them to see the light.

Nearly everyone, from the home PC user to the largest corporation, rigorously takes backups on a regular basis, often for legal purposes, with historical data kept for years. But no one ever tries to see whether those backups can actually be used.

No one considers that backups taken a year or two ago may be useless when it comes to getting the data off them, because in the meantime the hardware, the OS, systems software, DBMS etc. will all have been upgraded, to the point where backups taken years ago can no longer be restored and made operable on the current setup.

In the days of reel-to-reel backup tapes, data cartridges and other media, companies would keep backups for years whilst forgetting to retain the hardware devices needed to read that media!
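The only cure I have seen work is to make restore testing part of the routine, not just backup taking. A rough sketch of the idea, with a made-up archive path: restore into a scratch directory and read every file back to prove the backup is still usable.

# Rough sketch of a restore test: extract a backup into a scratch directory
# and read every file back. The archive path is made up for the example.
import hashlib
import pathlib
import tarfile
import tempfile

BACKUP = pathlib.Path("/backups/full-backup.tar.gz")   # hypothetical archive

def sha256(path):
    """Hash a file; reading it end to end proves the restored copy is usable."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(archive):
    """Restore into a throwaway directory and check every file can be read."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        files = [p for p in pathlib.Path(scratch).rglob("*") if p.is_file()]
        for p in files:
            sha256(p)
        print(f"Restored and read back {len(files)} files OK")

# verify_restore(BACKUP)   # run against a real archive, on a schedule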

One of my roles for one of my employers involved turning up unannounced at one of our European subsidiaries, declaring that a fire (or some other disaster) had just taken place and that everyone needed to vacate the offices (or warehouse, or computer suite, i.e. whatever aspect of the business the DR test covered), and telling the site/warehouse/customer/IT manager(s) to invoke the DR plan to keep the key aspects of the business functioning.

The manager(s) in charge of whatever aspect of DR was being tested had their jobs depending on the outcome of the test, since it was written into their contracts of employment.
 
At this stage it is not clear what aspect failed. So you cannot rule anything out. You cannot say whether it is, or it is not, the fault of TCS, BA or CBRE.

In fact, you could argue that it's the fault of all of them because the systems failed, along with any Disaster Recovery Plan, which they all have been involved in designing, testing, maintaining, and regularly simulating.
 

I agree with the second point. What I meant was that the offshore model can lead to a mini app meltdown, but not such a major one.
 
Nah, I had a team member who knocked a dollar or so off RadioShack's share price a while back by bringing down 2,500+ stores. It wasn't exactly his fault; the design was so poor that a single device was used to generate the certificate renewals for 5,000+ stores, and that device could not stand the volume of requests that came in during the renewal stage.

All of this should have been tested, planned and designed in a better manner, but when you have cost cutting, a lack of quality people and poor management at the offshore level, mistakes like these can happen.
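I don't know how that renewal system actually worked, but the usual fix for that kind of overload is to spread the requests over a window rather than firing them all at once. A minimal sketch of the idea, with placeholder store IDs and a placeholder renewal call:

# Sketch of spreading certificate renewals over a window instead of firing
# them all at once. The store IDs and the renewal call are placeholders.
import random
import sched
import time

def renew_certificate(store_id):
    # Placeholder for the real renewal request to the signing device.
    print(f"renewing certificate for store {store_id}")

def schedule_renewals(store_ids, window_seconds=6 * 3600):
    """Give each store a random slot in the window so no single spike hits the device."""
    scheduler = sched.scheduler(time.time, time.sleep)
    for store_id in store_ids:
        delay = random.uniform(0, window_seconds)
        scheduler.enter(delay, 1, renew_certificate, argument=(store_id,))
    scheduler.run()

# e.g. schedule_renewals(range(5000), window_seconds=60)   # toy run, short window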
 


I agree, but it's not just the fault of poor offshore quality alone. Some of the infrastructure in the banking projects I have worked on was so old that it was asking for trouble. However, being banks, they were not willing to risk any upgrade that might cause a loss of service. And that is just upgrading the software on the existing infrastructure; imagine how they would react if someone suggested updating the entire design and hardware to the latest standards.

Ultimately, as long as it's working, no one cares enough to make major infrastructure changes. Once a major issue happens, questions get raised about why it wasn't updated before, and then the blame game begins. Also, DR drills are supposed to happen every six months to check that everything is fine in case of a major outage, but I have seen clients (major banks) who didn't run any DR drill for years.
 

:O Remarkable story about that certificate renewal setup. Hopefully nothing ever gets designed that way again.
 
It's the fault of everyone. The offshore providers should not take on the contracts unless they can guarantee the service, because once the contract has commenced they carry the responsibility for the consequences if something goes wrong.

As for banks not being willing to upgrade, sure, such attitudes do prevail. But if they are afraid to upgrade in case the upgrade goes wrong, then those ultimately in charge of the systems should be fired. The systems need to be upgraded at some point, and the longer the upgrade is put off, the greater the likelihood not just of the system crashing, but also of the eventual upgrade becoming even more difficult because of the legacy software.

In my experience, this is where the offshore operations, and those in charge of them, leave a lot to be desired: too much of a "yes sir!, no sir!" mentality, with engineers/developers/managers at offshore service providers from the subcontinent often afraid to argue their case.
 