Unplanned downtime may be the result of a software bug, human error, equipment failure, power failure, and much more. Last week was a bad one. We faced three different downtime:
- First, there was a fiber cut for one of our data center resulting into routing anomalies due BGP reroute. Traffic was rerouted but updating those BGP tables took some time to update.
- Someone from networking team failed to follow proper maintenance procedures for network device resulted into 55 minutes downtime.
- One of our SAN hardware failure - Many internal UNIX / Linux web applications use SAN to store data including file server, tracking apps, R&D apps, IT help desk, LAN and WAN servers failed. This one lasted for 12 hrs. It was stared around midnight. The vendor replaced entire SAN hardware. Now we have dual stacked SAN as a backup device for internal usage.
Our critical public services (such as web, mail, proxy, LDAP ) and few internal services (such as database) are on high availability network infrastructure, redundant systems by having a business continuity plan. However, the cost for managing such system is pretty high. Now, I'd like to see most common causes of downtime in your data center:
- 30 Handy Bash Shell Aliases For Linux / Unix / Mac OS X
- Top 30 Nmap Command Examples For Sys/Network Admins
- 25 PHP Security Best Practices For Sys Admins
- 20 Linux System Monitoring Tools Every SysAdmin Should Know
- 20 Linux Server Hardening Security Tips
- Linux: 20 Iptables Examples For New SysAdmins
- Top 20 OpenSSH Server Best Security Practices
- Top 20 Nginx WebServer Best Security Practices
- 20 Examples: Make Sure Unix / Linux Configuration Files Are Free From Syntax Errors
- 15 Greatest Open Source Terminal Applications Of 2012
- My 10 UNIX Command Line Mistakes
- Top 10 Open Source Web-Based Project Management Software
- Top 5 Email Client For Linux, Mac OS X, and Windows Users
- The Novice Guide To Buying A Linux Laptop