
Beowulf Bulk Data Server - Disk Test Results


Testing Methodology

Goals
The goal of these tests is to evaluate a disk's performance in a bulk-data sense. This means we're interested in the maximum throughput a disk can provide and not very interested (if at all) in its performance on small random reads and writes. These tests were done to find the configuration of hardware (and, to some degree, software) that will give us the maximum sustained throughput.

Problems with testing disks from user space
or: Why Erik thought it was really necessary to write a kernel module
In previous rounds of disk testing that I've done, I've noticed that Linux's buffer cache can really get in the way of getting a consistent picture of the hardware performance. The worst-case scenario looks something like this: A process starts writing to a disk. The process has a lot more data to write than the size of main RAM and wants to write it a lot faster than the disk can. The kernel does write caching, so eventually main memory gets filled up with dirty disk blocks. I haven't fooled with this on the 2.1.x kernels, but on the 2.0.x kernels memory can get full to the point where normal system operations are severely hampered. For example, memory can get full enough that network packets get dropped because the network drivers can't allocate new memory to store the new packets.

There's a kernel thread called bdflush whose task it is to select buffers to flush to the disk when more free buffers are needed. Depending on how much memory you have, the process of walking the buffer lists looking for victim buffers to flush can become quite expensive. For example, on a 200 MHz PPro system with 128 MB of RAM, the flushing system was expensive enough that the hard disk was not even kept busy with the writes it selected. (As evidenced by the hard disk light turning off at times.)

Although these problems are most pronounced when writing, reading still suffers from the memory flooding effect (although more slowly and with clean buffers). bdflush is still needed to come along and toss out those buffers and, as I mentioned, it can take a while to decide which ones to kill.

These huge backlogs can also lead to very inconsistent timings when doing tests from user land with commands like time dd if=/dev/zero of=/dev/hda.

Another item that is not a major concern while doing tests but will be very important in our final application is what kind of impact doing this kind of IO will have on other operations in the system. Doing this kind of IO with unrestricted caching leads to a situation where you can't keep your process's pages in memory.

In an effort to decrease the influence of the buffer cache on my tests, I wrote a kernel module to manage the buffers myself and also control when blocks are actually scheduled to be written. (In the normal scheme of things, you depend on bdflush to schedule your blocks to be written.)

The key to solving these problems is controlling your backlog of dirty blocks and, most importantly, telling the buffer cache when it can forget blocks. (See the bforget() function in linux/fs/buffer.c.) You can control your backlog from user space using O_SYNC, but anybody who has used that knows what the performance impact of that is like.
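As a rough user-space illustration of bounding the backlog (this is not the kernel module described here, which manages its own buffers and can tell the cache to forget them), the sketch below writes fixed-size blocks and calls os.fsync() every few blocks, so that no more than max_dirty_blocks blocks are left unflushed at a time. The function name, block size, and file path are assumptions made up for this example. Setting max_dirty_blocks to 1 gives roughly the write-then-wait pattern that makes O_SYNC so slow.

```python
import os
import tempfile
import time


def write_with_bounded_backlog(path, total_blocks, block_size=64 * 1024,
                               max_dirty_blocks=16):
    """Write total_blocks blocks, flushing every max_dirty_blocks blocks.

    Returns (bytes_written, elapsed_seconds). Hypothetical helper: it only
    limits how much dirty data this process leaves unflushed; it does not
    tell the kernel to forget the blocks afterwards, as bforget() would.
    """
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    written = 0
    start = time.monotonic()
    try:
        for i in range(1, total_blocks + 1):
            remaining = memoryview(block)
            while remaining:  # os.write may write fewer bytes than asked
                n = os.write(fd, remaining)
                remaining = remaining[n:]
                written += n
            if i % max_dirty_blocks == 0:
                os.fsync(fd)  # wait until the current backlog is on disk
        os.fsync(fd)  # flush the last partial batch before stopping the clock
    finally:
        os.close(fd)
    return written, time.monotonic() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bulk.dat")
        nbytes, secs = write_with_bounded_backlog(target, total_blocks=64)
        print(f"{nbytes} bytes in {secs:.3f} s")
```

Because the final os.fsync() happens before the clock stops, the elapsed time includes the data actually reaching the device rather than just landing in the cache, which is the main thing a plain timed dd run does not guarantee.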

Testing software
The goal of the testing software is to run each device as fast as possible and, in particular, to keep a fast device from waiting on a slow device. In this respect it's different from the MD driver (in raid0 striping mode), which forces blocks to be read in a round-robin fashion. For example, when testing two devices on the same channel with a 2.0.x kernel, the master device can completely starve the slave device. (I guess this is really a problem with the disk queue, but it illustrates the point that one device can get ahead of another.)
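To see why letting each device run at its own pace matters, here is a small idealized model. It assumes the devices do not contend for a shared bus and that strict round-robin striping cannot start the next round until the slowest device has delivered its block. The function names and the example rates are hypothetical and are not taken from the test software.

```python
def independent_throughput(rates):
    """Each device runs at its own pace, so the rates simply add up."""
    return sum(rates)


def round_robin_throughput(rates):
    """Strict round-robin: one block per device per round, and every
    round lasts as long as the slowest device needs for its block."""
    return len(rates) * min(rates)


# Hypothetical per-device rates in the same unit (e.g. MB per sec).
rates = [11.0, 7.0]
print(independent_throughput(rates))   # 18.0
print(round_robin_throughput(rates))   # 14.0
```

In this model the gap between the two figures grows with the spread between the fastest and slowest device, and vanishes when all devices are equally fast.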

The timing stops when the requested number of blocks have been read into memory. The tester will take a few moments more to wait for (and clean up) the blocks that were on the request queue but hadn't finished yet.

The test software is available here.

Interesting Discoveries

The PIIX3's "independent" IDE channels aren't really that independent after all.
You would expect to be able to get roughly twice the bandwidth with two IDE channels as you would with one. This turns out to not be the case with the PIIX3. If you look in the performance section you'll see that the performance of two disks on two channels was roughly the same as a single disk.
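The scaling can be checked with a little arithmetic on the read figures for the Quantum Fireball SE in the performance section: hda alone versus hda and hdc together on the PIIX3, and, for comparison, hde alone versus hde and hdg together on the Promise Ultra33. The helper name is an assumption for this sketch. Since it only computes ratios, the unit of the throughput figures does not matter.

```python
def scaling(single, combined):
    """Ratio of combined throughput to single-disk throughput."""
    return combined / single


# Read throughput figures from the Quantum Fireball SE results.
piix3_two_channels = scaling(single=11.220, combined=12.098)
promise_two_channels = scaling(single=11.269, combined=22.512)

print(f"PIIX3, hda + hdc:   {piix3_two_channels:.2f}x")
print(f"Promise, hde + hdg: {promise_two_channels:.2f}x")
```

The PIIX3 ratio comes out close to 1, while the Promise ratio comes out close to 2, which is the behavior you would expect from channels that really are independent.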

Side note: You don't gain anything performance-wise by adding a slave disk on a channel, since only one disk can be used at a time and you can't select the other disk while a request is in progress. The ATA/ATAPI-4 specs talk about an optional "overlapped feature set" which would allow you to select the other device on the bus while one is busy. This should allow a performance gain with a slave disk but, to my knowledge, there's no support for this in the Linux kernel.

The PIIX3 is prone to fits of nervousness
When writing continuously for 30 seconds or so (it varies), the Quantum disks were able to cause the following error:
hdc: timeout waiting for DMA
hdc: irq timeout: status=0x58 { DriveReady SeekComplete DataRequest }
I'm not entirely sure what was causing that. It turns out that by turning up the PCI Master Latency Timer, you can make this message go away. (MLT was set to 32; I changed it to 64.) This suggests that the PIIX3's timer was running out before it was finished, and the chip apparently can't handle this situation gracefully.
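For a sense of scale, the latency timer value is counted in PCI bus clocks. The sketch below converts the two settings into time and into an idealized upper bound on the data moved in that window. It assumes a nominal 33 MHz, 32-bit PCI bus and one data phase per clock with no wait states or address phase; those are assumptions for illustration, not measurements from this system, and the function name is made up.

```python
PCI_CLOCK_HZ = 33_333_333   # assumed nominal 33 MHz PCI clock
BUS_WIDTH_BYTES = 4         # assumed 32-bit PCI bus


def latency_window(timer_clocks):
    """Return (microseconds, idealized max bytes) for a timer setting.

    The byte count is an upper bound: it pretends every clock in the
    window carries one full-width data phase.
    """
    micros = timer_clocks / PCI_CLOCK_HZ * 1e6
    max_bytes = timer_clocks * BUS_WIDTH_BYTES
    return micros, max_bytes


for setting in (32, 64):
    micros, max_bytes = latency_window(setting)
    print(f"MLT {setting}: about {micros:.2f} us, at most {max_bytes} bytes")
```

Under these assumptions, going from 32 to 64 doubles the window from roughly one microsecond to roughly two, which is consistent with the guess that the chip was being cut off before it had finished a burst.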

Test Setup and results

How to read these performance results

The Test Equipment
The device name tells you how a disk was connected:

Device   Interface         IDE Channel
hda      Intel PIIX3       0 Master
hdb      Intel PIIX3       0 Slave
hdc      Intel PIIX3       1 Master
hdd      Intel PIIX3       1 Slave
hde      Promise Ultra33   0 Master (*)
hdg      Promise Ultra33   1 Master (*)

(*) Tests involving these devices were done with the 2.1.105 kernel only, since there is no Promise Ultra33 support in the 2.0.x kernel.

Quantum Fireball SE 8.4 GB
                     Throughput (MB per sec)
Disks used           Read       Write
hda                  11.220      9.723
hdb                  11.226      9.723
hda hdb              11.193      9.607
hda hdc              12.098     10.750
hda hdb hdc hdd      12.098     10.799
hde                  11.269     11.268
hde hdg              22.512     22.497
hdc hde hdg          32.884     30.540

Maxtor Diamond MAX 88400D8
                     Throughput (MB per sec)
Disks used           Read       Write
hda                  11.307      7.111
hdb                  11.306      7.198
hda hdb              11.197      7.397
hda hdc              11.937     10.801
hda hdb hdc hdd      11.939     10.785
hde                  11.292      7.443
hde hdg              22.599     14.656
hdc hde hdg          32.830     21.327

IBM Deskstar 8 DHEA-38451
                     Throughput (MB per sec)
Disks used           Read       Write
hda                   9.732      7.145
hdb                   9.740      9.124
hda hdb               9.724      9.984
hda hdc              12.070     10.735
hda hdb hdc hdd      12.071     10.787
hde                   9.756      9.609
hde hdg              19.493     19.161
hdc hde hdg          28.902     28.129


Contact: Erik Hendriks hendriks@cesdis.gsfc.nasa.gov
Page last modified: 1998/08/21 20:39:06 GMT
CESDIS is operated for NASA by the USRA