Every second the equivalent of 63 billion CDs of data transit through the world’s internet (source: Cisco). That’s 1.5ZB per year (1ZB= 270 B!). As of December 31st 2011, 2.3 billion persons are using the internet, a 5.3X increase from the previous year! (source: internetworldstat.com). As lifestyle in almost every country in every continent is moving towards ubiquitous mobile lifestyle, demand for remote mass storage and cloud computing capacity is rapidly increasing. These numbers are mindboggling and leave us to estimate the impact of failure of this infrastructure leading to service disruption: we are not talking about thousands nor millions of users affected. We are talking about hundred of millions!
Obviously cloud services architectures involve heavy redundancies, mirror imaging of servers in different geographical location, disaster recovery procedures…Still, isn’t there some single point of failure? As it is a well adopted fact that software can show bugs, viruses, worms…how about hardware? When firewalls, watchdogs and other software procedure are commonly put in place, aren’t we less keen to accept hardware failure? Even trickier: what about hardware generated data corruption? The hardware shows no sign of failure, the software is not infected by viruses….still something’s not right.
Soft errors, even as being a small contributor of the overall reliability of systems, can still be the source of undetected failures that propagate to whole systems.
What are we doing to mitigate this problem, especially in cloud computing and data storage infrastructure?