Storage

Linux and other Unix-like operating systems use the term “swap” to describe both the act of moving memory pages between RAM and disk and the region of a disk where the pages are stored. It is common to use a whole partition of a hard disk for swapping. However, with the 2.6 Linux kernel, swap files are just as fast as swap partitions. Many admins (both Windows and Linux/UNIX) still follow an old rule of thumb that your swap partition should be twice the size of your main system RAM. Let us say I have 32 GB of RAM; should I set swap space to 64 GB? Is 64 GB of swap space really required? How big should your Linux / UNIX swap space be?
[continue reading…]
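
As a quick illustration (a minimal sketch only; the file path and the 4 GB size are placeholders, not a recommendation), a swap file can be added on a 2.6 kernel like this:
# dd if=/dev/zero of=/swapfile1 bs=1M count=4096    # create a 4 GB file
# chmod 600 /swapfile1                              # restrict permissions
# mkswap /swapfile1                                 # set up the swap area
# swapon /swapfile1                                 # enable it
# swapon -s                                         # verify active swap areas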

If your network is heavily loaded, you may see some problems with the Common Internet File System (CIFS) and NFS under Linux. By default, the Linux CIFS mount command will try to cache files opened by the client. You can use the mount option forcedirectio when mounting the CIFS filesystem to disable caching on the CIFS client. This has been tested with NetApp and other storage devices, and with Novell, CentOS, UNIX, and Red Hat Linux systems. In my experience, this is the only way to avoid data miscompares and related problems.

The default is to attempt to cache, i.e. to request an oplock on files opened by the client (forcedirectio is off). forcedirectio can also indirectly alter the network read and write sizes, since I/O will now match what was requested by the application, as readahead and writebehind are not performed by the page cache when forcedirectio is enabled for a mount.

mount -t cifs //mystorage/data2 -o username=vivek,password=myPassword,rw,bg,vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,forcedirectio,llock /data2

Refer to the mount.cifs man page and the docs stored at Documentation/filesystems/cifs.txt and fs/cifs/README in the Linux kernel source tree for additional options and information.
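
To make such a mount persistent across reboots, the same options can be placed in /etc/fstab (a sketch reusing the share, credentials, and mount point from the example above; note that some versions of the Linux cifs client spell this option directio or cache=none, so check your mount.cifs man page first):
//mystorage/data2  /data2  cifs  username=vivek,password=myPassword,rw,hard,rsize=32768,wsize=32768,forcedirectio  0  0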

NFS is a pretty old file-sharing technology for UNIX-based systems and storage systems, and it suffers from performance issues. NFSv4.1 addresses data access issues by adding a new feature called parallel NFS (pNFS), a method of introducing data access parallelism. The end result is ultra-fast file sharing for clusters and high-availability configurations.

The Network File System (NFS) is a stalwart component of most modern local area networks (LANs). But NFS is inadequate for the demanding input- and output-intensive applications commonly found in high-performance computing — or, at least it was. The newest revision of the NFS standard includes Parallel NFS (pNFS), a parallelized implementation of file sharing that multiplies transfer rates by orders of magnitude.

In addition to pNFS, NFSv4.1 provides Sessions, Directory Delegation and Notifications, Multi-server Namespace, ACL/SACL/DACL, Retention Attributes, and SECINFO_NO_NAME.

Fig.01: The conceptual organization of pNFS - Image credit IBM

According to Wikipedia:

The NFSv4.1 protocol defines a method of separating the meta-data (names and attributes) of a filesystem from the location of the file data; it goes beyond the simple name/data separation of striping the data amongst a set of data servers. This is different from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server. There exist products which are multi-node NFS servers, but the participation of the client in the separation of meta-data and data is limited. The NFSv4.1 client can be enabled to be a direct participant in the exact location of file data and avoid solitary interaction with the single NFS server when moving data.

The NFSv4.1 pNFS server is a collection of server resources or components; these are assumed to be controlled by the meta-data server.

The pNFS client still accesses a single meta-data server for traversal or interaction with the namespace; when the client moves data to and from the server it may be directly interacting with the set of data servers belonging to the pNFS server collection.
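
On a Linux client with pNFS support, requesting an NFSv4.1 mount is a matter of the vers mount option (a minimal sketch; the server name and export path are placeholders, and older nfs-utils use -o vers=4,minorversion=1 instead):
# mount -t nfs -o vers=4.1 nfs-server.example.com:/export/data /mnt/data
# nfsstat -m     # confirm the negotiated NFS version and mount options
Whether pNFS layouts are actually used then depends on the server exporting a supported layout type.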

More information about pNFS

  1. Scale your file system with Parallel NFS
  2. Linux NFS Overview, FAQ and HOWTO Documents
  3. NFSv4 delivers seamless network access
  4. Nfsv4 Status Pages
  5. NFS article from the Wikipedia

The Linux target framework (tgt) aims to simplify the creation and maintenance of various SCSI target drivers (iSCSI, Fibre Channel, SRP, etc.). The key goals are clean integration into the SCSI mid-layer and implementing a large portion of tgt in user space.

The developer of IET is also helping to develop the Linux SCSI target framework (stgt), which looks like it might lead to an iSCSI target implementation with an upstream kernel component. An iSCSI target can be useful to:

a] Set up stateless servers / clients (used in diskless setups).
b] Share disks and tape drives with remote clients over a LAN, WAN, or the Internet.
c] Set up a SAN (storage array).
d] Set up a load-balanced web cluster using a cluster-aware Linux file system, etc.

In this tutorial you will learn how to build a fully functional Linux iSCSI SAN using the tgt framework.
[continue reading…]
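
As a taste of what the tutorial covers, the sketch below creates a target and exports a block device with tgtadm (assumptions: the tgtd daemon is already running, /dev/sdb1 is a spare partition, and the IQN is made up):
# tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2008-09.in.nixcraft:storage.disk1    # define target 1
# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/sdb1                   # export /dev/sdb1 as LUN 1
# tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL                                     # allow all initiators
# tgtadm --lld iscsi --op show --mode target                                                    # verify the target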

You can easily start / stop / pause or take a snapshot of a VM from a shell prompt under a Linux / Windows host using the vmrun command. This is useful if you do not want to run the web interface just to start and/or stop VMs.

vmrun commands

vmrun -u USER -h 'https://vmware.server.com:8333/sdk' -p PASSWORD COMMAND [PARAMETERS]
OR
vmrun -u USER -h 'https://vmware.server.com:8333/sdk' -p PASSWORD start "[storage] Path/to/.vmx"

Where,
=> -u USER : VMware Server username

=> -h 'https://vmware.server.com:8333/sdk' : Local or remote server FQDN / IP address

=> -p PASSWORD : VMware Server password

=> COMMAND [PARAMETERS] : Command can be any one of the following:

POWER COMMANDS           PARAMETERS           DESCRIPTION
--------------           ----------           -----------
start                    Path to vmx file     Start a VM
                         [gui|nogui]

stop                     Path to vmx file     Stop a VM
                         [hard|soft]

reset                    Path to vmx file     Reset a VM
                         [hard|soft]

suspend                  Path to vmx file     Suspend a VM
                         [hard|soft]

pause                    Path to vmx file     Pause a VM

unpause                  Path to vmx file     Unpause a VM

Start a VM called CentOS

To start a virtual machine with VMware Server 2.0 on a Linux host, stored on storage called iSCSI:
vmrun -T server -h 'https://vms.nixcraft.in:8333/sdk' -u root -p 'secrete' start "[iSCSI] CentOS52_64/CentOS52_64.vmx"
To start a virtual machine with Workstation on a Windows host (open a command prompt via Start > Run > cmd > [Enter]):
vmrun -T ws start "c:\My VMs\centos\centos.vmx"

Stop a VM called CentOS

To stop a virtual machine with VMware Server 2.0 on a Linux host, stored on storage called iSCSI:
vmrun -T server -h 'https://vms.nixcraft.in:8333/sdk' -u root -p 'secrete' stop "[iSCSI] CentOS52_64/CentOS52_64.vmx"

Reset a VM called Debian

To reset a virtual machine with VMware Server 2.0 on a Linux host, stored on storage called DISK3:
vmrun -T server -h 'https://sun4k.nixcraft.co.in:8333/sdk' -u root -p 'secrete' reset "[DISK3] Debian5/Debian5.vmx"
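
The intro also mentioned snapshots; here is a sketch of listing running VMs and taking a snapshot (the host, credentials, datastore, and snapshot name are placeholders, and snapshot support depends on the product version):
vmrun -T server -h 'https://vms.nixcraft.in:8333/sdk' -u root -p 'secrete' list
vmrun -T server -h 'https://vms.nixcraft.in:8333/sdk' -u root -p 'secrete' snapshot "[iSCSI] CentOS52_64/CentOS52_64.vmx" mysnap1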

A Redundant Array of Independent Drives (or Disks), also known as a Redundant Array of Inexpensive Drives (or Disks) (RAID), is a term for data storage schemes that divide and replicate data among multiple hard drives. RAID can be designed to provide increased data reliability or increased I/O performance, though one goal may compromise the other. There are different RAID levels. But which one should you use for data safety and performance, considering that hard drives are commodity priced?
[continue reading…]
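
For instance, a software RAID 1 mirror (good for data safety at the cost of half the raw capacity) can be built with mdadm; a minimal sketch, where /dev/sdb1 and /dev/sdc1 are placeholder partitions:
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
# cat /proc/mdstat              # watch the initial resync
# mkfs.ext3 /dev/md0            # create a filesystem on the new array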

Wow, this is a large desktop hard disk for storing movies, TV shows, music / MP3s, and photos. You can also load multiple operating systems using VMware or other software for testing purposes. This hard disk comes with a 5-year warranty and can transfer at up to 300 MB/s. From the article:

It’s been more than 18 months since Hitachi reached the terabyte mark with the Deskstar 7K1000. In that time, all the major players in the hard drive industry have spun up terabytes of their own, and in some cases, offered multiple models targeting different markets. With so many options available and more than enough time for the milestone capacity’s initial buzz to fade, it’s no wonder that the current crop of 1TB drives is more affordable than we’ve ever seen from a flagship capacity. The terabyte, it seems, is old news.

Fig.01: Seagate's Barracuda 7200.11 1.5TB hard drive

The real question is about reliability. How reliable is this hard disk? So far my Seagate 500GB hard disk is working fine. I might get one to dump all my multimedia data / files 🙂

A few days ago I noticed that NFS performance between a web server node and an NFS server went down by 50%. NFS was already optimized, and the only thing that had changed was an updated Red Hat kernel (v5.2). I also noticed the same trend on the CentOS 5.2 64-bit edition.

The NFS server crashed each and every time the web server node tried to store a large file (20-100 MB each). Read performance was fine, but write performance went to hell. Finally, I had to roll back the updates. Recently, while reading the Red Hat site, I came across the solution.

Updated kernel packages that fix various security issues and several bugs are now available for Red Hat Enterprise Linux 5:

* a 50-75% drop in NFS server rewrite performance, compared to Red Hat
Enterprise Linux 4.6, has been resolved.

After upgrading the kernel on both the server and the client, my issue was resolved:
# yum update
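
A quick sanity check (not part of the erratum itself) is to confirm that both server and client are actually running the new kernel before retesting NFS writes:
# uname -r          # kernel currently running
# rpm -q kernel     # installed kernel packages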

The following are a few situations where you may be interested in performing filesystem benchmarking.

=> Deploying a new application that is very read and write intensive.
=> Purchased a new storage system and would like to measure the performance.
=> Changing the RAID level and would like to measure the performance of the new RAID.
=> Changing the storage parameters and would like to know the performance impact of this change.

This article gives you a jumpstart on benchmarking a filesystem using iozone, a free filesystem benchmark utility.

1. Download and Install iozone software

Go to iozone.org and download iozone for your platform. I downloaded the “Linux i386 RPM”. Install iozone from the RPM as shown below. By default, this installs iozone under /opt/iozone/bin
# cd /tmp
# wget http://www.iozone.org/src/current/iozone-3-303.i386.rpm
# rpm -ivh iozone-3-303.i386.rpm

Sample output:

Preparing...                ########################################### [100%]
   1:iozone                 ########################################### [100%]

Note: You can install iozone under any UNIX / Linux or Windows operating system.

2. Start the performance test

Execute the following command in the background to begin the performance test.

# /opt/iozone/bin/iozone -R -l 5 -u 5 -r 4k -s 100m -F /home/f1 /home/f2 /home/f3 /home/f4 /home/f5 | tee -a /tmp/iozone_results.txt &

Let us review the individual parameters passed to the iozone command.

  • -R : Instructs iozone to generate Excel-compatible text output.
  • -l : The lower limit on how many processes/threads iozone starts during execution. In this example, iozone will start 5 threads.
  • -u : The upper limit on how many processes/threads iozone starts during execution. In this example, iozone will not exceed a maximum of 5 threads. If you set -l and -u to the same value, it will run exactly that many processes/threads. In this example, it will execute exactly 5 threads.
  • -r : Specifies the record size. In this example, the record size for benchmark testing is 4k. This is an important parameter to set appropriately depending on the purpose of your filesystem performance testing. For example, if you are benchmarking a filesystem that will host a database, it is appropriate to set this value to the DB block size of the database.
  • -s : Specifies the size of the file to be tested. In this example, iozone will perform the test on a 100 MB file.
  • -F : Specifies the temporary filenames to be used by iozone during testing. The total number of files specified here should match the value specified in the -l and -u parameters.
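
Before the full five-process run, a quick pass in iozone's automatic mode can be handy to sanity check the setup (a sketch; the 64 MB size and test file path are arbitrary choices):
# /opt/iozone/bin/iozone -a -r 4k -s 64m -f /home/ioztest | tee /tmp/iozone_auto.txt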

3. Analyze the iozone output

The first part of the output contains details about every individual filesystem performance metric that was tested, for example initial write, rewrite, etc., as shown below.

        Iozone: Performance Test of File I/O
                Version $Revision: 3.303 $
                Compiled for 32 bit mode.
                Build: linux

        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                     Randy Dunlap, Mark Montague, Dan Million,
                     Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
                     Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

        Run began: Thu Jun  22 00:08:51 2008

        Excel chart generation enabled
        Record Size 4 KB
        File size set to 102400 KB
        Command line used: /opt/iozone/bin/iozone -R -l 5 -u 5 -r  4k -s 100m -F /home/f1 /home/f2 /home/f3 /home/f4 /home/f5
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Min process = 5
        Max process = 5
        Throughput test with 5 processes
        Each process writes a 102400 Kbyte file in 4 Kbyte records

        Children see throughput for  2 initial writers  =   60172.28 KB/sec
        Parent sees throughput for  2 initial writers   =   45902.89 KB/sec
        Min throughput per process                      =   28564.52 KB/sec
        Max throughput per process                      =   31607.76 KB/sec
        Avg throughput per process                      =   30086.14 KB/sec
        Min xfer                                        =   92540.00 KB

        Children see throughput for  2 rewriters        =   78658.92 KB/sec
        Parent sees throughput for  2 rewriters         =   34277.52 KB/sec
        Min throughput per process                      =   35743.92 KB/sec
        Max throughput per process                      =   42915.00 KB/sec
        Avg throughput per process                      =   39329.46 KB/sec
        Min xfer                                        =   85296.00 KB

Similar values will be generated for readers, re-readers, reverse readers, stride readers, random readers, mixed workload, random writers, pwrite writers, and pread readers. The last part of the iozone output contains the throughput summary for the different metrics, as shown below.

Throughput report Y-axis is type of test X-axis is number of processes
Record size = 4 Kbytes
Output is in Kbytes/sec

Initial write       60172.28
Rewrite             78658.92
Read              2125613.88
Re-read           1754367.31
Reverse Read      1603521.50
Stride read       1633166.38
Random read       1583648.75
Mixed workload    1171437.78
Random write         5365.59
Pwrite              26847.44
Pread             2054149.00


(Fig.01: iozone in action)

Iozone benchmarks different types of filesystem performance metrics, for example read, write, and random read. Depending on the application you plan to deploy on that particular filesystem, pay attention to the appropriate items. For example, if the filesystem hosts a read-intensive OLTP database, pay attention to random read, random write, and mixed workload. If the application serves a lot of streaming media content, pay attention to sequential read. On a final note, you can generate graphs from the iozone output using the Generate_Graphs and gengnuplot.sh scripts located under /opt/iozone/bin.
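
If you only need the throughput summary back out of the saved results from step 2, a little shell is enough (a sketch, assuming the /tmp/iozone_results.txt file created earlier):
# grep -A 15 'Throughput report' /tmp/iozone_results.txt    # print the summary table at the end of the run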

References:

  • Iozone PDF documentation – Full documentation from iozone.org explaining all the iozone command line options and more.
  • Linux Iozone graph example – This is a sample *.xls file from iozone that shows the kind of excel output that can be generated from iozone.

Good news and a great contribution from HP. You can study all of these advanced features for an academic project.

AdvFS is a file system that was developed by Digital Equipment Corp and continues to be part of HP’s Tru64 operating system. It offers many advanced features. Continuing its efforts to advance customer adoption of Linux, HP today announced the contribution of its Tru64 UNIX Advanced File System (AdvFS) source code to the open source community. The code on this site is licensed under the GPLv2 to be compatible with the Linux kernel.

The AdvFS source code includes capabilities that increase uptime, enhance security, and help ensure maximum performance of Linux file systems. HP will contribute the code as a reference implementation of an enterprise Linux file system under the terms of the General Public License Version 2, for compatibility with the Linux kernel, as well as provide design documentation, test suites, and engineering resources.

Now the million-dollar question: is there any reason to pick AdvFS over any of the other 20+ file systems, such as XFS, ext2, or ext3, under Linux?