The preceding experiments demonstrate two qualitatively distinct operational behaviors: that of the Beowulf system based on 10 Mbps Ethernet and that of the system based on the new Fast Ethernet technology. The first system is characterized by a parallel disk bandwidth that significantly exceeds the interprocessor communications bandwidth; the network therefore imposes a bottleneck on the system. The second system provides sufficient interprocessor bandwidth to match the demands of the disk throughput tested. The key qualitative distinction, therefore, is that the Fast Ethernet based Beowulf is a balanced architecture while its predecessor is unbalanced.
It had been previously shown that 10 Mbps Ethernet channels could be used in parallel to provide increased bandwidth. While no one packet would go faster, the aggregate communications bandwidth supporting multiple concurrent traffic sources could be effectively increased. Figure 2 showed that under favorable conditions, the bandwidth gain of two channels with respect to one could reach 70%.
It is a new finding of this paper that Fast Ethernet channels can also scale comparably. Figure 4 demonstrates that Fast Ethernet can achieve a gain in bandwidth from one to two channels of about 74%, essentially the same as the 10 Mbps case. In terms of utilization, the slower channels are superior, achieving 80% of peak compared to Fast Ethernet's 65% of peak for a single channel. While the ratio of the peak performance of Fast Ethernet to that of regular Ethernet is 10, the ratio of their respective sustained throughputs is lower, at a factor of about 8, which is still a respectable enhancement. To be fair, the current per-port cost of Fast Ethernet compared to regular Ethernet is also a factor of 8, so there is no direct price-performance advantage when considering the networks alone. The performance-to-cost gain occurs when this network capability is related to the file transfer traffic created by the parallel disk array.
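The arithmetic behind the sustained-throughput and price-performance comparison can be checked with a short sketch. It uses only the figures quoted above (peak rates of 10 and 100 Mbps, single-channel utilizations of 80% and 65%, and a per-port cost ratio of 8); the function and variable names are illustrative and do not come from the experiments.

```python
# Sketch of the single-channel comparison, using only figures quoted in the text.
# All names here are illustrative assumptions, not part of the original study.

def sustained_mbps(peak_mbps, utilization):
    """Sustained throughput as a fraction of the nominal peak rate."""
    return peak_mbps * utilization

ethernet = sustained_mbps(10.0, 0.80)        # regular Ethernet, 80% of peak
fast_ethernet = sustained_mbps(100.0, 0.65)  # Fast Ethernet, 65% of peak

peak_ratio = 100.0 / 10.0                    # ratio of nominal peak rates
throughput_ratio = fast_ethernet / ethernet  # ratio of sustained rates, about 8
cost_ratio = 8.0                             # per-port cost ratio quoted in the text

# A value near 1 means no network-only price-performance advantage.
price_performance_ratio = throughput_ratio / cost_ratio

print(f"sustained: {ethernet:.1f} vs {fast_ethernet:.1f} Mbps")
print(f"throughput ratio: {throughput_ratio:.3f} (peak ratio {peak_ratio:.0f})")
print(f"price-performance ratio: {price_performance_ratio:.3f}")
```

The price-performance ratio comes out very close to 1, which is the sense in which the networks alone offer no direct advantage.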
As previously discussed, Figure 3 clearly demonstrates the bandwidth bottleneck caused by the limitations of 10 Mbps Ethernet, even when more than one channel is used. Because all the disk traffic is forced to go between processors, the total sustained file transfer bandwidth is throttled down to what the network can support, about 1.7 MBytes per second. This is a degradation of more than a factor of 4 relative to the best concurrent local disk file transfer rate. It should be noted that a higher aggregate local file transfer bandwidth could have been achieved than was measured in the actual experiment. To hold the traffic demand constant and at the same time avoid processing node contention, each processor was used only in the role of either a producer or a consumer of a file, and not both.
The operational behavior of the Fast Ethernet based system is not only better than that of the prototype system; it operates in a new regime, one that achieves a balance between the demands of the parallel disk array throughput and the capacity of the interprocessor communications to support concurrent remote file transfers. Figure 5 shows the sustained throughput of the system as the seven concurrent file transfers range from entirely local to entirely remote. While the corresponding curves for the Beowulf prototype drop dramatically as file copies become exclusively remote, constrained by the 10 Mbps Ethernets, the Fast Ethernet based system experiences only a small degradation in system file transfer throughput. Instead of a loss of a factor of 4 to 5, the throughput degrades by only about 20% across the range of tests.
Figure 6: Beowulf Disk Bandwidth Comparison
(2 channels, Total of 7 local and remote files)
The comparative behavior is best represented in Figure 6. Here, the curves in Figures 3 and 5 for the single case of 2 MByte file transfers are repeated to expose the differences between the 10 Mbps and 100 Mbps Ethernet channels. Even where all the file copies are to their respective local disks, a significant file transfer throughput gain is achieved. This has nothing to do with the networks and is entirely a consequence of the processing node. The improved processor architecture of the Pentium versus the 80486 and the superior properties of the PCI bus [5] with respect to the VESA local bus combine to provide a performance gain of about 62%. The relative gain increases to about 550% when all file copies are remote, at a system cost increase of about 14% for the Fast Ethernet channels.
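As a rough illustration of what these percentages imply, the sketch below converts the quoted gains into throughput multipliers and relates the all-remote gain to the quoted cost increase. It assumes that a "gain of N%" means the new throughput is (1 + N/100) times the old one; if the 550% figure was instead meant as a 5.5x ratio, the numbers shift accordingly. The names are illustrative only.

```python
# Sketch relating the quoted gains to the quoted cost increase.
# Assumption: "gain of N%" means new throughput = (1 + N/100) * old throughput.
# All names are illustrative; only the 62%, 550%, and 14% figures come from the text.

def multiplier_from_gain(gain_percent):
    """Convert a percentage gain into a throughput multiplier."""
    return 1.0 + gain_percent / 100.0

local_multiplier = multiplier_from_gain(62.0)    # all-local copies (node improvements only)
remote_multiplier = multiplier_from_gain(550.0)  # all-remote copies
cost_factor = 1.0 + 14.0 / 100.0                 # system cost increase for Fast Ethernet channels

# Throughput gained per unit of relative system cost in the all-remote case.
remote_gain_per_cost = remote_multiplier / cost_factor

print(f"local multiplier:  {local_multiplier:.2f}x")
print(f"remote multiplier: {remote_multiplier:.2f}x")
print(f"remote throughput multiplier per unit of relative cost: {remote_gain_per_cost:.2f}")
```

Under that reading, the all-remote case gains far more than the 14% added system cost, which is consistent with the argument that the benefit of Fast Ethernet appears at the system level rather than in the network alone.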