Today I upgraded a total of 8 servers from 4GiB to 8GiB by installing additional memory modules to improve system performance. We started each server and checked the memory count at the console. All servers booted normally after the upgrade, and services such as SMTP, NFS, CIFS, and HTTP started as expected. Shortly afterwards, I got a call from the help desk about slow performance on the POP3 server.
The POP3 server node was returning timeout errors, and download speed was very slow for all MUAs. I tried to ssh into the box, but it bounced back with a "port 22: Connection refused" error. I wasn’t ready to pull the server from the rack, so I fired up the KVM over IP Java client. Eventually, I found that the server was reporting only 2GiB of RAM instead of the expected 8GiB. This was bad. Worse, the POP3 server node did not fail over to the backup node: our LVS (Linux Virtual Server) based cluster failed to detect the problem. So I made a few changes so that it would pick up a working POP3 node.
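I haven't listed the exact LVS changes here, but the idea behind a service-level health check is easy to sketch. The snippet below is only an illustration, not what our cluster actually runs: the function name `pop3_banner_ok` and the 5 second timeout are assumptions of mine. It connects to the POP3 port, waits for the greeting, and reports the node as unhealthy if the greeting does not start with `+OK` or does not arrive in time.

```python
import socket
import time


def pop3_banner_ok(host, port=110, timeout=5.0):
    """Return (healthy, seconds_taken) for a POP3 node.

    A POP3 server greets each new connection with a line starting
    with "+OK". A node that refuses the connection, times out, or
    sends anything else is treated as unhealthy.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(512)
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False, time.monotonic() - start
    elapsed = time.monotonic() - start
    return banner.startswith(b"+OK"), elapsed
```

A monitor could call this for each real server every few seconds and drop a node from the pool when it returns `False`, or when the elapsed time creeps above whatever threshold is acceptable.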
My investigation revealed that the memory problem occurred because the new RAM was incompatible with the server motherboard. I had verified the available RAM on the first five nodes and then went back to the office for something else. Another person hooked up the rest and told me that he had verified the available RAM. Whenever I perform a memory upgrade, I always verify the amount of memory reported by the system after it reboots; I never assume the memory is there.
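The post-reboot check is simple enough to script so that nobody has to eyeball it. Here is a minimal sketch, assuming a Linux box where `/proc/meminfo` is available; the helper names and the 10% tolerance are my own choices for illustration. The tolerance is there because the kernel reserves some RAM for itself, so `MemTotal` always comes out a little below the installed amount.

```python
import re


def mem_total_kib(meminfo_text):
    """Extract the MemTotal value (in KiB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))


def memory_looks_right(total_kib, expected_gib, tolerance=0.10):
    """True if the reported total is no more than `tolerance` below expected."""
    expected_kib = expected_gib * 1024 * 1024
    return total_kib >= expected_kib * (1 - tolerance)


def check_host(expected_gib, path="/proc/meminfo"):
    """Read the local meminfo file and compare it with the expected size."""
    with open(path) as f:
        total = mem_total_kib(f.read())
    return memory_looks_right(total, expected_gib), total
```

Running `check_host(8)` on each upgraded node right after the reboot would have flagged a box reporting about 2GiB before any user noticed.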
Another lesson learned: never, ever trust a third person. If he had verified the available RAM immediately after installing the new modules, we would have noticed the problem right away instead of waiting for users to complain. Another reason not to perform upgrades on Fridays.