Poll: Common Causes Of Downtime In Your Data Center

Unplanned downtime may be the result of a software bug, human error, equipment failure, power failure, and much more. Last week was a bad one. We faced three separate outages:

  • First, a fiber cut at one of our data centers caused routing anomalies while BGP rerouted traffic. Traffic was rerouted, but the BGP tables took some time to update.
  • Someone from the networking team failed to follow proper maintenance procedures for a network device, which resulted in 55 minutes of downtime.
  • One of our SANs had a hardware failure. Many internal UNIX / Linux web applications use the SAN to store data, so the file server, tracking apps, R&D apps, IT help desk, and LAN and WAN servers all failed. This outage lasted 12 hours and started around midnight. The vendor replaced the entire SAN hardware. We now have a dual-stacked SAN as a backup device for internal usage.


Our critical public services (such as web, mail, proxy, and LDAP) and a few internal services (such as database) run on highly available network infrastructure with redundant systems, as part of our business continuity plan. However, the cost of managing such a system is pretty high. Now, I’d like to see the most common causes of downtime in your data center:

[poll id=”5″]


6 comments… add one
  • snash May 27, 2009 @ 15:22

    Hello,

    I’m a sysadmin. Our last downtime was an electrical problem that caused an air cooler problem, and the high temperature caused the servers to shut down.
    It was terrifying!

  • gbor May 27, 2009 @ 20:33

    Sorry for going a little off-topic, but this may be useful for admins 🙂 [From Hungary]
    http://vod.niif.hu/play2/index.php?eid=3&lid=358&bw=500K&lg=hu

  • Ken Carroll May 27, 2009 @ 22:39

    Power/HVAC definitely…our data center is growing too quickly 🙂

  • Dave May 29, 2009 @ 14:17

    Maybe I’m reading into this too much, but don’t all of these errors come down to human error? Either a failure to plan or a decision at the top not to spend the money necessary to have redundant systems. It’s hard to place blame on hardware or software because if proper procedures are in place, none of these should fail!

  • Gaurish Sharma Jun 8, 2009 @ 19:54

    You forgot to add downtime due to hacking attempts and DDoS :p

  • Christer Edwards Jun 10, 2009 @ 3:06

    Our downtime is most often caused by heat. We’ve grown a lot, and quickly, and our cooling capacity doesn’t match our heat output anymore. We’ve had to shut off a number of non-critical servers and we’ve moved in two industrial fans. It’s really annoying, but management won’t spring for HVAC improvements… yet.
