Missing Memory

Posted in categories Howto, Linux, RedHat/Fedora Linux, Sys admin, Troubleshooting, UNIX, Windows server. Last updated April 22, 2009.

Today, I upgraded a total of 8 servers from 4GiB to 8GiB by inserting additional memory modules to improve system performance. We started each server and checked the memory count at the console. All servers booted normally after the upgrade, and services such as SMTP, NFS, CIFS, and HTTP started as expected. Shortly afterwards, I got a call from the help desk about slow performance on the pop3 server.

The pop3 server node was giving timeout errors, and download speed was very slow for all MUAs. I tried to ssh into the box and it bounced back with a 22: Connection refused error. I wasn’t ready to pull the server from the rack, so I fired up the KVM over IP java client. Eventually, I found that the server was reporting only 2GiB of RAM instead of the doubled total of 8GiB. This was bad. Worse, the POP3 server node did not fail over to the backup node: our LVS (Linux Virtual Server) based cluster failed to detect the problem. So I made a few changes to pick up the working POP3 node.
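A health check that actually talks to the service can catch this kind of half-dead node. The following is only a minimal sketch in Python, not the change I made to the cluster: the function names, the port and the timeout are assumptions for illustration. It relies on one fact from the POP3 protocol, namely that a working server opens the session with a line starting with +OK.

```python
import socket


def greeting_is_ok(greeting):
    """Return True if the raw greeting bytes start with the POP3 '+OK' marker."""
    return greeting.startswith(b"+OK")


def pop3_is_healthy(host, port=110, timeout=5.0):
    """Connect to a POP3 node and check its greeting.

    Returns False on refusal, timeout, or a bad greeting, so a node that is
    up but too slow to answer within `timeout` also counts as unhealthy.
    The 5 second timeout is an assumed value, not a recommendation.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            greeting = conn.recv(512)
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
    return greeting_is_ok(greeting)
```

A monitoring script could call pop3_is_healthy() for each real server and take a node out of rotation when it returns False; how that is wired into LVS depends on the tooling in use.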

My investigation revealed that the memory problem occurred because the new RAM was incompatible with the server motherboard. I had verified the available RAM on the first five nodes and then went back to the office for something else. Another person hooked the rest back up and told me that he had verified the available RAM. Whenever I perform a memory upgrade, I always verify the amount of memory reported by the system after it reboots; I never assume the memory is there.
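The post-reboot check can be scripted so that nobody has to remember it. This is a rough sketch in Python, assuming a Linux host where /proc/meminfo is available; the function names and the 10% tolerance are my own illustrative choices. The tolerance exists because the kernel always reports a little less than the installed amount.

```python
import re


def parse_mem_total_kib(meminfo_text):
    """Extract the MemTotal value (in KiB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))


def memory_looks_right(meminfo_text, expected_gib, tolerance=0.10):
    """Return True if reported memory is within `tolerance` of the expected size.

    The tolerance (10% by default) is an assumed value that allows for
    memory reserved by the kernel and firmware.
    """
    expected_kib = expected_gib * 1024 * 1024
    return parse_mem_total_kib(meminfo_text) >= expected_kib * (1 - tolerance)


def check_host(expected_gib, path="/proc/meminfo"):
    """Read the local meminfo file and compare it with the expected size."""
    with open(path) as fh:
        return memory_looks_right(fh.read(), expected_gib)
```

Running check_host(8) on each upgraded node right after the first boot would have flagged the box that came up with only 2GiB.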

Another lesson learned: never take a third person’s word for it. If he had verified the available RAM immediately after installing the new modules, we would have noticed the problem right away instead of waiting for users to complain. Another reason not to perform upgrades on Fridays.

8 comments

  1. This mailing list seemed pretty interesting at first glance but has time after time proven itself rather low-tech. When upgrading servers it could be useful to make sure you stick the right kinds of memory in the machines, how insightful.

  2. We always run memtest (until test#4) on all our new servers; you’d be amazed how many memory modules actually fail. (In my experience, maybe 1 in 1000, including single-bit errors and non-“critical” errors.)

    I’m sure you could run many of these systems without noticing this problem, but better safe than sorry! :D

  3. On Red Hat, I’ve seen several issues where PAE kernels were not installed. Someone called their data center and had the RAM upgraded, but failed to install the appropriate kernel. At one very large server provider, I saw them go through 4 sticks of RAM before realizing it was the OS.
