This paper was published in SUPERCOMPUTER, Volume XII, Number 2, March 1996, pp 4-20.
For a reprint of the paper, send e-mail to grl@iastate.edu

Performance Comparison of Workstation Clusters for Scientific Computing


Glenn Luecke

Department of Mathematics and Computation Center,
Iowa State University, Ames, Iowa 50011, U.S.A.

James Coyle, Waqar Haque, James Hoekstra and Howard Jespersen

Computation Center, Iowa State University,
Ames, Iowa 50011, U.S.A.


Abstract


This paper compares the performance of workstation clusters from DEC (Alpha Farm), HP, and IBM (SP2) for scientific computing on a selected collection of test suites. These test suites have been designed to evaluate both serial and parallel performance.


Introduction


The power of workstations for scientific computation can be enhanced by interconnecting them via a high-speed communication network so that they can be used to execute not only serial but also parallel programs. Computers that use this mode of operation include the IBM SP2, the DEC Alpha Farm and clusters of HP workstations. This study compares the performance of these computers on a collection of test suites designed to evaluate serial and parallel performance for scientific computing. Parallelism is expressed in the parallel test suites using PVM (Parallel Virtual Machine) from Oak Ridge National Laboratory [8]. IBM also provides an optimized version of PVM, called PVMe. Performance results for both PVM and PVMe are reported for the IBM SP2.


Methodology


Comparison among the machines is made by measuring their performance on a variety of test suites. Single node performance is compared by measuring the performance of a collection of scientific kernels and application codes. Parallel performance is determined by single node performance and communication performance. Communication performance is evaluated by measuring the time required to send varying sized messages in a variety of communication patterns. Parallel performance is evaluated by measuring the performance of a parallel matrix multiply and three parallel application codes. In order to make the parallel test programs portable, message passing is done using PVM. Performance was enhanced by calling the PVM subroutine pvmfsetopt.
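The paper does not state which option was passed to pvmfsetopt. As a minimal sketch, the fragment below assumes the common choice of requesting direct task-to-task routing (PVMROUTE/PVMROUTEDIRECT), which lets messages bypass the PVM daemons; it is an illustration, not the study's actual code.

*     Sketch: request direct task-to-task routing in PVM 3.  The option
*     shown (PVMROUTE/PVMROUTEDIRECT) is assumed, not taken from the paper.
      program setroute
      include 'fpvm3.h'
      integer mytid, oldval, info

*     Enroll this task in PVM and obtain its task identifier.
      call pvmfmytid( mytid )

*     Ask PVM to route messages directly between tasks instead of
*     through the daemons; oldval receives the previous setting.
      call pvmfsetopt( PVMROUTE, PVMROUTEDIRECT, oldval )

*     ... message-passing calls would go here ...

      call pvmfexit( info )
      end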

Often a significant portion of the total execution time of large scientific applications is due to extensive I/O to and from temporary storage. An advantage of the workstation clusters considered in this report is that each node has its own local disk for fast storage of temporary data. The performance of I/O to the local disk is measured for reads and writes of files of various sizes.

Whenever feasible, test suites are designed to evaluate performance for small, medium and large problems so that the dependence of performance on problem size can be seen. All performance results are obtained using 64-bit real arithmetic. The Fortran and C compilers were set for high optimization, see Table 2. To ensure accurate timings, short tests are looped a sufficient number of times to obtain a time of at least one second. Wall-clock timers were used to measure elapsed time for all parallel and I/O test suites. CPU timers were used for measuring single node performance. No effort is made to hand-optimize any of the codes. For a given vendor, the same compiler option(s) were used for all tests.
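To illustrate the looping technique used for the short tests, the sketch below doubles the repetition count until at least one second has elapsed. It is only a sketch: the standard SYSTEM_CLOCK intrinsic stands in for the vendor wall-clock and CPU timers actually used, and the dot-product kernel is a placeholder.

*     Repeat a short kernel until at least one second accumulates, then
*     report the time per repetition.  SYSTEM_CLOCK is a stand-in for
*     the vendor timers used in the study; the kernel is a placeholder.
      program timeit
      integer n
      parameter ( n = 1000 )
      double precision x(n), y(n), s, elapsed
      integer count0, count1, rate, nrep, i, j

      do 5 j = 1, n
         x(j) = 1.0d0
         y(j) = 2.0d0
 5    continue

      nrep = 1
 10   continue
      call system_clock( count0, rate )
      s = 0.0d0
      do 20 i = 1, nrep
*        placeholder kernel: a simple dot product
         do 15 j = 1, n
            s = s + x(j) * y(j)
 15      continue
 20   continue
      call system_clock( count1 )
      elapsed = dble( count1 - count0 ) / dble( rate )
      if ( elapsed .lt. 1.0d0 ) then
         nrep = 2 * nrep
         go to 10
      end if

*     Print the checksum so the compiler cannot discard the kernel.
      print *, 'time per repetition (sec): ', elapsed / dble( nrep )
      print *, 'checksum: ', s
      end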

A few of the factors which influence performance are:

Observed performance is in general a complex interaction of the above factors along with others. The purpose of this paper is not to explain the performance in terms of these factors, but simply to report the performance results.


Machine Characteristics


Tables 1 and 2 summarize some of the important characteristics of the machines used for this study.

The HP workstations used for this study are interconnected via an FDDI ring. A multistage communication network is used to interconnect the SP2 nodes. The IBM SP2 can be configured with thin and/or wide nodes, both of which are based on the 66.5 MHz RS6000 microprocessor; only wide nodes are used for this study. The DEC workstations are interconnected via an FDDI crossbar switch.


Single Node Performance


Single node performance is compared by measuring the performance of a collection of scientific kernels and application codes mostly selected from [3].


Matrix Multiplication


Table 3 contains performance results for the vendor-optimized matrix-times-matrix operation, C = C + A*B. Performance results for the Fortran ijk and kij versions are presented in Tables 4 and 5. In these tables, A, B and C are square, real arrays of the indicated size, and the ijk and kij versions refer to the order of the DO loops around the update statement c(i,j) = c(i,j) + a(i,k)*b(k,j).
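For reference, the two variants differ only in the nesting order of the three DO loops around this statement. A minimal sketch of both orders follows (shown for order 50); in Fortran's column-major storage, the ijk inner loop touches a(i,k) with stride n, while the kij inner loop touches both b(k,j) and c(i,j) with stride n, which is what makes these versions sensitive to the memory hierarchy.

*     Sketch of the ijk and kij loop orders timed in Tables 4 and 5.
*     Both nests appear in one program only to contrast their memory
*     access patterns; running them back to back accumulates into c twice.
      program mmvar
      integer n
      parameter ( n = 50 )
      double precision a(n,n), b(n,n), c(n,n)
      integer i, j, k

      do 2 j = 1, n
         do 1 i = 1, n
            a(i,j) = 1.0d0
            b(i,j) = 2.0d0
            c(i,j) = 0.0d0
 1       continue
 2    continue

*     ijk order: inner loop over k; a(i,k) has stride n, b(k,j) stride 1.
      do 30 i = 1, n
         do 20 j = 1, n
            do 10 k = 1, n
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
 10         continue
 20      continue
 30   continue

*     kij order: inner loop over j; b(k,j) and c(i,j) both have stride n.
      do 60 k = 1, n
         do 50 i = 1, n
            do 40 j = 1, n
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
 40         continue
 50      continue
 60   continue

      print *, 'c(1,1) = ', c(1,1)
      end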

For the vendor-coded serial matrix multiply, the performance was nearly constant for the three problem sizes. However, an HP node was 40% slower than an SP2 node and a DEC node was about 30% slower, see Table 3.

Tables 4 and 5 illustrate the performance of matrix multiplication for non-unit stride memory accesses. Notice that the SP2 node outperformed the other vendors for problem sizes 50 and 300; however, for problem size 1000, the performance degraded sharply and the SP2 did not perform as well as the other vendors. Tables 4 and 5 also show that the performance of these Fortran variants is significantly lower than that of the vendor-optimized routine, see Table 3. Clearly, these compilers are not generating code that can efficiently utilize the underlying hardware.


Linear Equation Solver


Table 6 gives the performance in CPU seconds of the LAPACK [2] routine dgesv, which solves the real, general linear system Ax=b using LU factorization with partial pivoting. A high percentage of the CPU time for dgesv is spent in the vendor-optimized dgemm. Since dgemm on an SP2 node outperforms HP and DEC, it is not surprising that the SP2 outperforms the others on this test.
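A minimal sketch of the LAPACK call being timed is given below (link against LAPACK and the BLAS). The matrix, right-hand side and problem size are placeholders chosen only to keep the example self-contained; they are not the data or sizes used in the study.

*     Sketch of the solver timed in Table 6: dgesv factors A with
*     partial pivoting and overwrites b with the solution x.
*     The data below are placeholders, not the study's test problems.
      program lusolve
      integer n, nrhs, lda, ldb
      parameter ( n = 300, nrhs = 1, lda = n, ldb = n )
      double precision a(lda,n), b(ldb,nrhs)
      integer ipiv(n), info, i, j

*     Fill A with a simple diagonally dominant matrix and b with ones.
      do 20 j = 1, n
         do 10 i = 1, n
            a(i,j) = 1.0d0 / dble( i + j )
 10      continue
         a(j,j) = a(j,j) + dble( n )
         b(j,1) = 1.0d0
 20   continue

*     Solve A x = b; info = 0 on exit indicates success.
      call dgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
      print *, 'dgesv info = ', info, '  x(1) = ', b(1,1)
      end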


Serial Application Codes


The purpose of this section is to compare the performance of application codes from a variety of engineering and scientific disciplines on a single node of the computers used in this study. The application codes used for single node performance assessment have been selected from [3]. These include a mesoscale weather model, an astrophysics code that simulates the evolution of self-gravitating systems, and ADM, MDG, FLO52Q, and QCD from the PERFECT Club [7].

The SP2 node performed best on the meteorology code, the HP node performed best on the astrophysics code, and the DEC node performed best on MDG, FLO52Q and QCD codes, see Table 7.


Node I/O Performance


The I/O tests for this study initialize a specified amount of data, write it to a local disk file and then read it back. Table 8 shows I/O transfer rates in MB/sec for files of sizes 5, 25, 100, and 200 MB. Initialization time is not included. The I/O performance results are mixed. For file sizes 5 and 25 MB, the SP2 node performed well on the READs and not well on the WRITEs when compared with the other vendors. For the 200MB file, the SP2 node's I/O performance degraded significantly for both the READ and WRITE operations.
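A hedged Fortran sketch of this kind of test follows. The file name, record size, record count and use of SYSTEM_CLOCK are illustrative choices, not the parameters or timers of the actual test codes, and initialization of the buffer is excluded from the timed phases as described above.

*     Write a buffer of 64-bit reals to an unformatted local file, then
*     read it back, timing each phase separately (about 50 MB total here).
      program iotest
      integer nwords, nrec
      parameter ( nwords = 131072, nrec = 50 )
      double precision buf(nwords), twrite, tread, mbytes
      integer count0, count1, rate, i, j

      do 10 i = 1, nwords
         buf(i) = dble( i )
 10   continue
      mbytes = dble( nrec ) * dble( nwords ) * 8.0d0 / 1.0d6

      open( 10, file = 'scratch.dat', form = 'unformatted',
     &      status = 'unknown' )

*     Timed WRITE phase.
      call system_clock( count0, rate )
      do 20 j = 1, nrec
         write( 10 ) buf
 20   continue
      call system_clock( count1 )
      twrite = dble( count1 - count0 ) / dble( rate )

*     Timed READ phase (note that data may still be in the file cache).
      rewind( 10 )
      call system_clock( count0 )
      do 30 j = 1, nrec
         read( 10 ) buf
 30   continue
      call system_clock( count1 )
      tread = dble( count1 - count0 ) / dble( rate )

      close( 10, status = 'delete' )
      print *, 'write MB/sec: ', mbytes / twrite
      print *, 'read  MB/sec: ', mbytes / tread
      end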

Notice the large transfer rates for the READ operation for file sizes up to 100 MB on the SP2 node. This is probably due to buffering performed during the WRITE process, thereby eliminating the need for a READ from the local disk.


Communication Performance


Evaluating the performance of the inter-node communication network is an important part of evaluating the performance of a parallel computer. There are so many different ways communication can occur among nodes that it is not feasible to measure the performance of all of them. Tests are designed to evaluate, under heavy and light loads, some of the communication patterns that we feel are likely to occur during the execution of parallel scientific application codes. Thus, communication performance is measured for each of the following scenarios for 2, 4 and 8 nodes (except item 1, which applies only to two nodes) using PVM with messages of size 8 bytes, 1 KB, 100 KB, and 10 MB. Performance results for both PVM and PVMe are reported. Test routines are written with one node designated as the PVM master and all other nodes designated as PVM slaves. The communication tests are divided into two categories: (1) node-to-node communication, and (2) concurrent communication.

Node-to-Node communication. These tests are designed to measure communication rates from:

Ideally, communication performance results for tests 1.a-1.c would be the same for a given machine. However, Table 20 shows that this is not the case. Communication rates for test 1.a are not available for HP since the call to pvmfsetopt was accidentally commented out for this one test. This problem was not discovered until after the HP cluster was no longer available for dedicated usage. Notice that the communication rate drops when going from the 100 KB message to 10 MB for each vendor and for each of the three tests (except for the SP2 PVM results for test 1.b). This drop is probably a result of network saturation. In all cases, SP2 nodes with PVMe significantly outperform the others.
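For concreteness, the sketch below shows the master side of a simple PVM round-trip (ping-pong) measurement of the kind used for the node-to-node tests. The slave task name ('echo'), the 8-byte message, the repetition count and the use of SYSTEM_CLOCK are all illustrative; the slave program (which just receives each message from its parent and returns it with a different tag), error checking, and the conversion to a one-way data rate are omitted.

*     Master side of a node-to-node ping-pong test with PVM 3.
*     All names and parameters are illustrative, not the study's code.
      program master
      include 'fpvm3.h'
      integer mytid, stid(1), numt, info, bufid, i, nrep
      integer count0, count1, rate
      double precision x, secs

      call pvmfmytid( mytid )
      call pvmfsetopt( PVMROUTE, PVMROUTEDIRECT, info )
      call pvmfspawn( 'echo', PVMDEFAULT, '*', 1, stid, numt )

      nrep = 1000
      x = 1.0d0
      call system_clock( count0, rate )
      do 10 i = 1, nrep
*        pack and send one 64-bit real, then wait for the echo
         call pvmfinitsend( PVMDATADEFAULT, bufid )
         call pvmfpack( REAL8, x, 1, 1, info )
         call pvmfsend( stid(1), 1, info )
         call pvmfrecv( stid(1), 2, bufid )
         call pvmfunpack( REAL8, x, 1, 1, info )
 10   continue
      call system_clock( count1 )
      secs = dble( count1 - count0 ) / dble( rate )

      print *, 'average round trip (sec): ', secs / dble( nrep )
      call pvmfexit( info )
      end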

Concurrent Communication. The following tests have been designed to measure the performance of concurrent communication between nodes:

To better evaluate the performance of the broadcast operation, we define a Normalized Broadcast Rate as (total data rate)/(N-1), where the total data rate is measured in KB/sec and N is the total number of nodes involved in the communication. Let r1 be the data rate, in KB/sec, when a message is sent from node 1 to node 2. Let rB be the total data rate for broadcasting the same message from node 1 to the other N-1 nodes. If the broadcast operation and communication network were able to transmit messages to all other nodes concurrently, then rB = (N-1)*r1. In this case, the Normalized Broadcast Rate would remain constant as N increases, and hence the rate at which the Normalized Broadcast Rate decreases as N increases indicates how far from optimum the broadcast operation is actually performing.
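As a hypothetical worked example (the numbers are illustrative, not measured values): if a single point-to-point transfer runs at 4000 KB/sec, an ideal broadcast from node 1 to the other seven nodes of an 8-node cluster would deliver a total data rate of 7 x 4000 = 28000 KB/sec, for a Normalized Broadcast Rate of 28000/7 = 4000 KB/sec. A measured normalized rate well below 4000 KB/sec would therefore indicate that the broadcast is being at least partially serialized.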

Tables 9 through 12 summarize the Normalized Broadcast Rate performance results (items 2.a - 2.d above) using 2, 4 and 8 nodes. Tests 2.a and 2.c are designed to determine whether broadcasting from the master node gives different performance than broadcasting from a slave node. Tables 9 and 11 show that this is not the case. This is also true for tests 2.b and 2.d, see Tables 10 and 12. For each vendor, the normalized communication rate drops as the number of nodes increases. DEC outperforms HP for most of the 2.a-2.d tests. However, in all these cases, the concurrent communication rate is significantly higher for the SP2 with PVMe.

As above, let N be the number of nodes, numbered from 1 to N. With this numbering, tests 2.e - 2.g are designed to measure the performance of communication between neighboring nodes, where nodes 1 and N are considered neighbors. Test 2.g, a variation of test 2.e, is chosen to determine the impact of node ordering on performance. Also observe that the total data rate for these tests will increase proportionally with the number of nodes being utilized, since communication can be done in parallel. Thus, in a manner similar to the Normalized Broadcast Rate, for these tests we define a Normalized Data Rate to be (total data rate)/N, where the data rate is measured in KB/sec. In an ideal communication network, the Normalized Data Rate would remain constant as N increases, and hence the degree to which the rate is not constant indicates how far from ideal the given communication network is actually performing. Tables 13 through 15 show that the Normalized Data Rate for the SP2-PVMe remains nearly constant as the number of processors increases, whereas this is not the case for the others.

Ideally, the normalized data rates for tests 2.e and 2.g should be the same for a given vendor. This is in fact nearly true for each vendor (except some of the SP2-PVM results), see Tables 13 and 15. This shows that the communication rate is independent of node ordering, at least for the 2.e and 2.g tests for 2, 4 and 8 processors. For all communication tests, PVMe on the SP2 significantly outperformed the other vendors.


Parallel Performance


In this section, performance results of a parallel matrix multiply code and three application codes on various workstation clusters are presented.

The parallel matrix-times-matrix multiplication, C = C + A*B, is evaluated for square matrices of sizes 10, 100, 500 and 1000 for 1, 2, 4 and 8 nodes. For these tests, matrix multiplication is parallelized as follows: Let n be the size of each square matrix and let p be the number of nodes being utilized. For ease of illustration, assume p divides n and let m = n/p. First, from node 1, broadcast all of A to each of the other nodes and send the second m columns of B and C to node 2, the next m columns of B and C to node 3, ..., the last m columns of B and C to node p. Each node then computes A times the appropriate column block of B and adds these results to the appropriate column block of C. All updated column blocks of C are then sent back to node 1. The same PVM code is used for all these tests. Thus, for a single node, both the master and slave programs execute on the same node. Table 16 presents the performance in Mflops based on wall-clock timings. Notice that the fast communication rates of PVMe allow the IBM SP2 to perform very well compared with the other vendors.
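The master-side distribution step of this scheme might look roughly like the Fortran sketch below. As a simplification of the description above, the sketch ships all p column blocks to p previously spawned slave tasks rather than keeping the first block on node 1; spawning, the slave program and error checking are omitted, and all names are illustrative rather than taken from the actual test code.

*     Sketch of the master-side distribution for C = C + A*B with
*     column blocking: broadcast A, scatter column blocks of B and C,
*     then gather the updated blocks of C.  m = n/p columns per slave.
      subroutine distrib( a, b, c, n, p, stid )
      include 'fpvm3.h'
      integer n, p, stid(p)
      double precision a(n,n), b(n,n), c(n,n)
      integer m, j, k, info, bufid

      m = n / p

*     Broadcast all of A to every slave.
      call pvmfinitsend( PVMDATADEFAULT, bufid )
      call pvmfpack( REAL8, a, n*n, 1, info )
      call pvmfmcast( p, stid, 1, info )

*     Send the k-th block of m columns of B and C to slave k; the
*     columns of a block are contiguous in Fortran's column-major order.
      do 10 k = 1, p
         j = ( k - 1 ) * m + 1
         call pvmfinitsend( PVMDATADEFAULT, bufid )
         call pvmfpack( REAL8, b(1,j), n*m, 1, info )
         call pvmfpack( REAL8, c(1,j), n*m, 1, info )
         call pvmfsend( stid(k), 2, info )
 10   continue

*     Each slave adds A times its block of B into its block of C and
*     returns it; unpack the updated blocks back into C in place.
      do 20 k = 1, p
         j = ( k - 1 ) * m + 1
         call pvmfrecv( stid(k), 3, bufid )
         call pvmfunpack( REAL8, c(1,j), n*m, 1, info )
 20   continue

      return
      end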

The following scenario will typically occur when measuring parallel performance of application codes. For small problems the ratio of communication to computation time will usually be large, thus making performance results for small problems highly dependent on the performance of the communication network. In contrast, for large problems, the ratio of communication to computation time will usually be small, thus making performance results for large problems highly dependent on the performance of each node. For these reasons, the performance of the parallel application codes is measured for small, medium and large problem sizes. All three of the application codes considered in this study assume that the number of nodes is small compared with the problem size.

The first parallel application code considered was obtained from Peter Michielse [5][6]. It is written in Fortran and is based on a two-dimensional oil reservoir simulation that uses multigrid and domain decomposition techniques. The master program distributes the initial domain decomposition, after which each processor handles part of the computational domain. Communication takes place in various stages of the program: during the computation of residuals, during the actual smoothing process (which is a variant of block Gauss-Seidel), and during the restriction to coarser multigrid levels. The coarsest levels are handled by applying a stepwise agglomeration/de-agglomeration technique. The results are summarized in Table 17. Notice that the performance results are mixed, with no machine outperforming the others in all cases. The HP cluster performs the worst on all tests. For a single node, the DEC Alpha cluster performs best. For two and four nodes, the SP2 outperforms the Alpha cluster for low multigrid levels and vice versa for high multigrid levels.

The second parallel application code was obtained from Ruud van der Paas [9]. It is written in C and is a generalized red/black Poisson solver. The application applies a general domain decomposition technique to a two-dimensional computational domain. Communication is needed across the internal boundaries between the subdomains and consists of exchanging data in overlap regions. Within each subdomain, a generalized red/black Poisson solver is applied, which has the flexibility to adjust the number of so-called inner iterations to the number of data-exchange sweeps. The results are summarized in Table 18. Notice that the SP2 with PVMe outperforms the other machines in most cases.

The third parallel application code is written in Fortran and was obtained from Jean Castel-Branco from the Universite Catholique de Louvain, Belgium. This code uses a finite difference method and domain decomposition to solve a two-dimensional diffusion equation for hydrodynamic simulations. The PVM master performs the domain decomposition by breaking the 512x512, two-dimensional domain into subdomains of size 512x(512/p), where p is the number of nodes used. The PVM slaves solve the diffusion equation on the subdomains and pass messages to contiguous neighboring subdomains [1]. Table 19 summarizes the performance results for this code. For this application code, the comparative performance results are mixed, with the DEC Alpha Farm outperforming the others in two out of three cases.


Conclusions


The performance data contained in this study is from a limited set of scientific kernels and application codes, and extrapolating this data to other applications may lead to incorrect conclusions. The performance of a collection of workstations interconnected via a communication network for the execution of parallel application codes will depend on the performance of each workstation node, on the performance of the communication network, and on how the application code is parallelized. Single node performance results were mixed, with different vendors outperforming the others depending on the test chosen. The I/O performance results were also mixed, with no single vendor outperforming the others. Notice that IBM's optimization of PVM for the SP2 (that is, PVMe) provides a significant improvement in performance over the non-optimized PVM. PVMe on the SP2 significantly outperformed the others on the communication tests. The fast communication rates achieved by PVMe helped the SP2 outperform the other vendors on many of the test cases for the parallel application codes; however, DEC and HP did outperform the SP2 on several of these tests. For all of the machines evaluated, broadcast rates did not scale well as the number of processors increased from two to eight. This is also true for the other concurrent communication tests, with the exception of the SP2-PVMe results.


Acknowledgements


The authors would like to thank the Cornell Theory Center, the Pittsburgh Supercomputing Center and the Maui High Performance Computing Center for allowing us to use their machines for this study. The authors would also like to thank Jean Castel-Branco from the Universite Catholique de Louvain for allowing us to use his hydrodynamic code, Ruud van der Paas for the generalized red/black Poisson solver, and Peter Michielse for the oil reservoir simulation code. We also thank Bill Celmaster from DEC for providing performance results for the Alpha Farm, and Dan Nordhues from HP for providing us with performance results on HP workstations.



