The fundamental contribution of this paper is that a new balanced operating point for distributed computing systems has been identified and evaluated with direct implications for real-world computing. It has been shown that within the cost constraints of scientific workstations, such systems can be implemented that incorporate close to order-of-magnitude greater mass storage capacity and data access bandwidth than conventional workstations. This achievement is enabled through the exploitation of PC mass market commodity computing subsystems in a parallel structure and demonstrated by a series of experiments conducted with the Beowulf Parallel Workstation.
The seminal finding implied by the results presented in this paper is that the sustained data transfer rates possible with a 16-way parallel disk array can be supported by the parallel combination of dual 100 Mbps Ethernet channels. Using Beowulf as a testbed, it was shown that this configuration is both necessary and sufficient to achieve interprocessor communications rates comparable to those of the disk array. The application of this rapidly emerging Fast Ethernet technology makes scientific workstations of this type feasible for the first time.
The implications for real world computing are significant in that for specific environments substantial benefits can be achieved at little additional cost. In the realm of scientific computing, large data sets must often be manipulated, explored, and visualized. The working set size may easily exceed the capacity of conventional scientific workstations and require repeated access to shared file servers over common local area networks. The reason for repeated access is because ordinarily the entire data set can not be maintained in a conventional workstation and because the data usually requires many repeated examinations. The new capability provided by Beowulf is to stage such large working sets entirely in the workstation with only a single access required to the remote file server upon which the data set resides. The impact on the user is much faster response time, approaching an order of magnitude in some cases, permitting innovative ways of working with research data. For the more global system, it means significantly reduced burden on shared resources, reducing contention and improving response time there as well.
There remains the open question of software support for the use of PopC systems like Beowulf. Distributed computing systems such as PVM and MPI for message passing applications programming and Condor for job stream scheduling have been brought up on Beowulf and are being used. Management of parallel disk arrays, especially from the applications programmer perspective, continues to be a challenge and is the topic of active research by a number of groups around the country. The Beowulf project is evaluating some experimental packages. While the findings reported here do accurately characterize the capabilities of the Beowulf architecture, it does not represent a programming interface that is transparent to the user. Currently, only low level techniques have made this opportunity for superior mass storage in a single user context available. Without improved software tools, the potential for distributed disk arrays in a workstation environment will be lost.