This paper compares the performance of workstation clusters from DEC (Alpha Farm), HP, and IBM (SP2) for scientific computing on a selected collection of test suites. These test suites have been designed to evaluate both serial and parallel performance.
It is possible to enhance the power of workstations for scientific
computations by interconnecting them via a high-speed communication network so
that they can execute parallel as well as serial programs.
Computers that use this mode of operation include
the IBM SP2,
the DEC Alpha Farm and clusters of HP workstations.
This study compares the performance of these computers
on a collection of test suites designed to evaluate serial
and parallel performance for scientific computing.
Parallelism is expressed in the parallel test suites by using
PVM (Parallel Virtual Machine) from
Oak Ridge National Laboratory [8].
IBM also provides an optimized version of PVM, called PVMe.
Performance results for both PVM and PVMe are reported for the IBM SP2.
Often a significant portion of the total execution time of large scientific applications is due to extensive I/O to and from temporary storage. An advantage of the workstation clusters considered in this report is that each node has its own local disk for fast storage of temporary data. The performance of I/O to the local disk is measured for reads and writes of files of various sizes.
Whenever feasible, test suites are designed to evaluate performance for small, medium and large problems so that the dependence of performance on problem size can be seen. All performance results are obtained using 64-bit real arithmetic. The Fortran and C compilers were set for high optimization; see Table 2. To ensure accurate timings, short tests are looped a sufficient number of times to obtain a time of at least one second. Wall-clock timers were used to measure elapsed time for all parallel and I/O test suites. CPU timers were used to measure single-node performance. No effort was made to hand-optimize any of the codes. For a given vendor, the same compiler option(s) were used for all tests.
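The looping strategy for short tests can be illustrated with a minimal sketch. This is not the timing harness used in the study (those codes were Fortran and C); the function name `time_kernel`, the doubling strategy, and the choice of Python timers are assumptions made for illustration only.

```python
import time


def time_kernel(kernel, min_seconds=1.0, timer=time.perf_counter):
    """Time `kernel` by repeating it until the measured time is long enough.

    `timer` is a wall-clock timer by default; pass time.process_time to
    measure CPU time instead. Returns (seconds per call, repetitions).
    The doubling of the repetition count is an illustrative assumption.
    """
    reps = 1
    while True:
        start = timer()
        for _ in range(reps):
            kernel()
        elapsed = timer() - start
        if elapsed >= min_seconds:
            return elapsed / reps, reps
        reps *= 2  # too short to time reliably; loop more times


if __name__ == "__main__":
    # Hypothetical short kernel: a small dot product.
    x = [1.0] * 1000
    per_call, reps = time_kernel(lambda: sum(a * a for a in x))
    print(f"{per_call:.3e} s per call over {reps} repetitions")
```

Passing `timer=time.process_time` gives a CPU-time measurement, which corresponds to the single-node tests; the default wall-clock timer corresponds to the parallel and I/O tests.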
A few of the factors which influence performance are:
The HP workstations used for this study are
interconnected via an FDDI ring.
A multistage communication
network is used to interconnect the SP2 nodes.
The IBM SP2 can be configured with thin and/or wide nodes, both of which are
based on the 66.5 MHz RS6000 microprocessor;
only wide nodes are used for this study.
The DEC workstations are interconnected via an
FDDI crossbar switch.
Single node performance is compared by measuring the performance of a
collection of scientific kernels and application codes mostly selected from
[3].
For the vendor-coded serial matrix multiply, the performance was nearly constant across the three problem sizes. However, the HP node was 40% slower than an SP2 node and a DEC node was about 30% slower; see Table 3.
Tables 4 and 5
illustrate the performance of
matrix multiplication for non-unit stride memory accesses.
Notice that the SP2 node outperformed the other vendors for
problem sizes 50 and 300; however, for problem size 1000, the
performance degraded sharply and the SP2 did not perform as well as
the other vendors.
Tables 4 and 5
also show that the performance of these Fortran variants is
significantly lower than that of the vendor-optimized routine; see Table 3.
Clearly, these compilers are not generating code that can efficiently
utilize the underlying hardware.
The SP2 node performed best on the meteorology code, the HP node performed
best on the astrophysics code, and the DEC node performed best on
the MDG, FLO52Q and QCD codes; see Table 7.
The I/O tests for this study initialize a specified amount of data, write it to a local disk file and then read it back. Table 8 shows I/O transfer rates in MB/sec for files of sizes 5, 25, 100, and 200 MB. Initialization time is not included. The I/O performance results are mixed. For file sizes 5 and 25 MB, the SP2 node performed well on the READs but poorly on the WRITEs when compared with the other vendors. For the 200 MB file, the SP2 node's I/O performance degraded significantly for both the READ and WRITE operations.
Notice the large transfer rates
for the READ operation for file sizes up to 100 MB on
the SP2 node. This is probably due to buffering during the
WRITE process, which eliminates the need for a
READ from the local disk.
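The write-then-read-back procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the test code used in the study: the function name `io_rates`, the 1 MB chunk size, the use of a temporary file, and the `fsync` call are all choices made here.

```python
import os
import tempfile
import time


def io_rates(size_mb, directory=None):
    """Write `size_mb` MB to a file in `directory`, then read it back.

    Returns (write MB/sec, read MB/sec, bytes read). Data initialization
    is done before the timers start, so it is excluded from the timings.
    """
    chunk = bytes(1024 * 1024)  # 1 MB of zeros, initialized outside timing
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk
        write_seconds = time.perf_counter() - start

        bytes_read = 0
        start = time.perf_counter()
        with open(path, "rb") as f:
            while True:
                block = f.read(len(chunk))
                if not block:
                    break
                bytes_read += len(block)
        read_seconds = time.perf_counter() - start
    finally:
        os.remove(path)
    return size_mb / write_seconds, size_mb / read_seconds, bytes_read


if __name__ == "__main__":
    for size in (5, 25):
        w, r, _ = io_rates(size)
        print(f"{size:4d} MB  write {w:8.1f} MB/s  read {r:8.1f} MB/s")
```

On a modern operating system the read-back in such a test is often served from the file cache rather than from the disk, which is the same kind of buffering effect suggested above for the large READ rates on the SP2 node.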
Evaluating the performance of the inter-node communication network is an important part of evaluating the performance of a parallel computer. There are so many different ways communication can occur among nodes that it is not feasible to measure the performance of all of them. Tests are designed to evaluate the performance of some of the communication patterns, under heavy and light loads, that we feel are likely to occur during the execution of parallel scientific application codes. Thus, communication performance is measured for each of the following scenarios for 2, 4 and 8 nodes (except item 1, which applies to only two nodes) using PVM with messages of size 8 bytes, 1 KB, 100 KB, and 10 MB. Performance results for both PVM and PVMe are reported. Test routines are written with one node designated as the PVM master and all other nodes designated as PVM slaves. Communication tests are divided into two categories: (1) node-to-node communication, and (2) concurrent communication.
Ideally, communication performance results for tests 1.a-1.c would be the same for a given machine. However, Table 20 shows that this is not the case. Communication rates for test 1.a are not available for HP since the call to pvmfsetopt was accidentally commented out for this one test. This problem was not discovered until after the HP cluster was no longer available for dedicated usage. Notice that the communication rate drops when going from the 100 KB message to the 10 MB message for each vendor and for each of the three tests (except for the SP2 PVM results for test 1.b). This drop is probably a result of network saturation. In all cases, SP2 nodes with PVMe significantly outperform the others.
To better evaluate
the performance of the broadcast operation,
we define a Normalized Broadcast Rate as
$$\text{Normalized Broadcast Rate} = \frac{\text{total data rate}}{N-1}$$
where the total data rate
is measured in KB/sec and N is the total number of nodes involved
in the communication.
Let $R$
be the data rate, in KB/sec, when a message
is sent from node 1 to node 2.
Let $R_N$
be the data rate for broadcasting the same
message from node 1 to the other N-1 nodes. If the broadcast
operation and communication network were able to concurrently
transmit messages to all other nodes, then
$R_N = (N-1)R$. In this case, the Normalized Broadcast Rate would
remain constant as N increases, and hence the rate at which the Normalized
Broadcast Rate decreases as N increases indicates how far from optimum the
broadcast operation is actually performing.
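As a small worked illustration, the sketch below assumes the Normalized Broadcast Rate is the total broadcast data rate divided by N-1, which is the reading consistent with the constancy argument above. The function name and all numbers are hypothetical and are not measurements from this study.

```python
def normalized_broadcast_rate(total_rate_kb_s, n_nodes):
    """Total broadcast data rate (KB/sec) divided by the N-1 receiving nodes.

    Assumes the normalization by N-1 described in the lead-in.
    """
    if n_nodes < 2:
        raise ValueError("a broadcast needs at least two nodes")
    return total_rate_kb_s / (n_nodes - 1)


if __name__ == "__main__":
    point_to_point = 1000.0  # hypothetical node 1 -> node 2 rate, KB/sec

    # Ideal case: all N-1 transfers proceed concurrently at full rate,
    # so the total rate is (N-1) times the point-to-point rate.
    for n in (2, 4, 8):
        ideal_total = (n - 1) * point_to_point
        print(n, normalized_broadcast_rate(ideal_total, n))

    # Fully serialized case: the total rate never exceeds the
    # point-to-point rate, so the normalized rate falls as 1/(N-1).
    for n in (2, 4, 8):
        print(n, normalized_broadcast_rate(point_to_point, n))
```

In the ideal case the printed normalized rate stays at the point-to-point value for every N; in the serialized case it shrinks as N grows, which is the behavior the metric is meant to expose.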
Tables 9 through 12 summarize the Normalized Broadcast Rate performance results (items 2.a - 2.d above) using 2, 4 and 8 nodes. Tests 2.a and 2.c are designed to determine whether broadcasting from the master node gives different performance from broadcasting from a slave node. Tables 9 and 11 show that this is not the case. The same holds for tests 2.b and 2.d; see Tables 10 and 12. For each vendor, the normalized communication rate drops as the number of nodes increases. DEC outperforms HP for most of the 2.a-2.d tests. However, in all these cases, the concurrent communication rate is significantly higher for the SP2 with PVMe.
As above, let N be the number of nodes, numbered from 1 to
N. With this numbering, tests 2.e - 2.g are designed to measure the
performance of communication between neighboring nodes, where nodes 1 and N are
considered neighbors.
Test 2.g, a variation of test 2.e, is chosen to determine the
impact of node ordering on performance.
Also observe
that the data rate for these tests will increase proportionally with the number
of nodes being utilized since communication can be done in parallel.
Thus, in a manner similar to the Normalized
Broadcast Rate, for these tests we define a Normalized Data Rate to be
$$\text{Normalized Data Rate} = \frac{\text{total data rate}}{N}$$
where the data rate is measured in KB/sec. In an ideal communication
network, the Normalized Data Rate should be constant as N increases, and hence
the degree to which the rate is not constant indicates how far from ideal
the given communication network is actually performing.
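The following sketch shows how such a normalized rate can be used to quantify departure from ideal scaling. It assumes the Normalized Data Rate is the total data rate divided by N, in line with the proportionality argument above; the function names and the sample rates are hypothetical and are not taken from Tables 13 through 15.

```python
def normalized_data_rate(total_rate_kb_s, n_nodes):
    """Total data rate (KB/sec) divided by the number of nodes N."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    return total_rate_kb_s / n_nodes


def scaling_loss(total_rates_by_nodes):
    """Fractional drop in normalized rate relative to the smallest N.

    `total_rates_by_nodes` maps N to the total data rate in KB/sec.
    A value of 0.0 means ideal scaling; 0.5 means half the per-node
    rate has been lost.
    """
    base_n = min(total_rates_by_nodes)
    base = normalized_data_rate(total_rates_by_nodes[base_n], base_n)
    return {
        n: 1.0 - normalized_data_rate(rate, n) / base
        for n, rate in sorted(total_rates_by_nodes.items())
    }


if __name__ == "__main__":
    ideal = {2: 2000.0, 4: 4000.0, 8: 8000.0}      # hypothetical
    saturated = {2: 2000.0, 4: 3000.0, 8: 4000.0}  # hypothetical
    print(scaling_loss(ideal))
    print(scaling_loss(saturated))
```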
Tables 13 through 15 show that the Normalized Data
Rate for the SP2-PVMe remains nearly constant as the number of processors
increases, whereas this is not the case for the others.
Ideally, the normalized data rates for tests 2.e and 2.g should be
the same for a given vendor.
This is in fact nearly true for each vendor (except some of the
SP2-PVM results), see
Tables 13 and 15.
This shows that the communication rate is independent of node ordering, at
least for the 2.e and 2.g tests for 2, 4 and 8 processors.
For all communication tests, PVMe on the SP2 significantly outperformed the
other vendors.
In this section, performance results of a parallel matrix multiply code and three application codes on various workstation clusters are presented.
The parallel matrix-times-matrix multiplication, $C = C + AB$, is
evaluated for square matrices of sizes 10, 100, 500 and 1000 for 1, 2, 4 and 8
nodes. For these tests, matrix multiplication is parallelized as follows:
Let $n$ be the size of each square matrix and let $p$ be the number of
nodes being utilized. For ease of illustration, assume $p$ divides $n$
and let $k = n/p$. First from node 1, broadcast all of $A$ to each of
the other $p-1$ nodes and send the second $k$ columns of $B$ and $C$ to
node 2, the next $k$ columns of $B$ and $C$ to node 3, ..., the
last $k$ columns of $B$ and $C$ to node $p$. Each node then
computes $A$ times the appropriate column block of $B$ and adds these results
to the appropriate column block of $C$. All updated column blocks of $C$ are
then sent back to node 1.
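The column-block scheme just described can be sketched serially, with an ordinary loop standing in for the nodes. This is a minimal illustration, not the PVM code used in the study: the update is taken to be C = C + AB with A available to every node, the names `column_blocks`, `block_update` and `column_block_matmul` are hypothetical, and no message passing is performed.

```python
def column_blocks(n, p):
    """Split column indices 0..n-1 into p contiguous blocks of k = n/p."""
    if n % p != 0:
        raise ValueError("p must divide n in this sketch")
    k = n // p
    return [(q * k, (q + 1) * k) for q in range(p)]


def block_update(A, B_cols, C_cols):
    """Work of one node: C_cols + A * B_cols for its own column block."""
    n, k = len(A), len(B_cols[0])
    return [
        [C_cols[i][j] + sum(A[i][l] * B_cols[l][j] for l in range(n))
         for j in range(k)]
        for i in range(n)
    ]


def column_block_matmul(A, B, C, p):
    """Compute C + A*B by handing one column block of B and C to each node.

    The loop over blocks plays the role of the p nodes; in a real run each
    iteration would execute on a different node after A is broadcast.
    """
    n = len(A)
    result = [row[:] for row in C]
    for lo, hi in column_blocks(n, p):
        B_cols = [row[lo:hi] for row in B]   # "sent" to the node
        C_cols = [row[lo:hi] for row in C]   # "sent" to the node
        updated = block_update(A, B_cols, C_cols)
        for i in range(n):                   # "sent back" to node 1
            result[i][lo:hi] = updated[i]
    return result


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[1, 1], [1, 1]]
    print(column_block_matmul(A, B, C, p=2))
```

Because each column block of C depends only on A and the matching column block of B, the blocks can be computed independently; the only communication is the initial distribution and the final gather.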
The same PVM code is used for all these tests. Thus, for a single node,
both the master and slave programs execute on the same node.
Table 16
presents the performance in Mflops
based on wall-clock timings.
Notice that the fast communication rates of PVMe allow the IBM SP2 to
perform very well compared with the other vendors.
The following scenario will typically occur when measuring parallel performance of application codes. For small problems the ratio of communication to computation time will usually be large, thus making performance results for small problems highly dependent on the performance of the communication network. In contrast, for large problems, the ratio of communication to computation time will usually be small, thus making performance results for large problems highly dependent on the performance of each node. For these reasons, the performance of the parallel application codes is measured for small, medium and large problem sizes. All three of the application codes considered in this study assume that the number of nodes is small compared with the problem size.
The first parallel application code considered was obtained
from Peter Michielse [6][5].
It is written in Fortran and is
based on a two-dimensional oil reservoir simulation that uses multigrid and
domain decomposition techniques. The
master program distributes the initial domain decomposition, after which each
processor handles part of the computational domain. Communication takes place
in various stages of the program: during the computation of residuals, during
the actual smoothing process (which is a variant of block Gauss-Seidel), and
during the restriction to coarser multigrid levels. The coarsest levels are
handled by applying a stepwise agglomeration/de-agglomeration technique.
The results are summarized
in Table 17.
Notice that performance results are mixed, with no machine outperforming the
others in all cases. The HP cluster performs the worst for all
tests.
For a single node, the DEC Alpha cluster performs the best. For two and
four nodes, the SP2 outperforms the Alpha cluster for low multigrid levels
and vice versa for high multigrid levels.
The second parallel application code was obtained from Ruud van der Paas [9]. It is written in C and is a generalized red/black Poisson solver. This application applies a general domain decomposition technique to a two-dimensional computational domain. Communication is needed across the internal boundaries between the subdomains, and consists of an exchange of data in overlap regions. Within each subdomain, a generalized red/black Poisson solver is applied, which has the flexibility to adjust the number of so-called inner iterations to the number of data exchange sweeps. The results are summarized in Table 18. Notice that the SP2 with PVMe outperforms the other machines in most cases.
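For readers unfamiliar with red/black ordering, the sketch below shows a plain red/black Gauss-Seidel sweep for the five-point discretization of the Poisson equation on a single square grid. It is not the application code described above: the generalization, the domain decomposition, the overlap-region exchange, and the tuning of inner iterations are all omitted, and the function names and the test problem are assumptions made for illustration.

```python
def red_black_sweep(u, f, h):
    """One red/black Gauss-Seidel sweep for the 5-point Laplacian, in place.

    Points with (i + j) even are updated first ("red"), then points with
    (i + j) odd ("black"). Points of one color depend only on points of
    the other color, so each half-sweep could be done in parallel.
    """
    n = len(u)
    for color in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == color:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1]
                                      - h * h * f[i][j])
    return u


def solve_laplace_demo(n=6, sweeps=200):
    """Hypothetical test problem: zero source term, boundary values of 1.

    The exact solution is u = 1 everywhere, which makes convergence easy
    to check.
    """
    h = 1.0 / (n - 1)
    on_boundary = lambda i, j: i in (0, n - 1) or j in (0, n - 1)
    u = [[1.0 if on_boundary(i, j) else 0.0 for j in range(n)]
         for i in range(n)]
    f = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        red_black_sweep(u, f, h)
    return u


if __name__ == "__main__":
    u = solve_laplace_demo()
    print(max(abs(v - 1.0) for row in u for v in row))
```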
The third parallel application code is written in Fortran and was obtained
from Jean Castel-Branco from the Universite Catholique de Louvain, Belgium.
This code uses a finite difference method and domain decomposition to solve a
two-dimensional diffusion equation for hydrodynamic simulations. The PVM
master performs domain decomposition by breaking the 512x512, two-dimensional domain
into subdomains of size 512x(512/p), where p is the number of nodes
used. The PVM slaves solve the diffusion equation on subdomains and
pass messages to contiguous neighboring subdomains [1].
Table 19 summarizes the performance results for this code.
For this application code, the comparative performance results are mixed
with the DEC Alpha Farm outperforming the others in two out of three
cases.
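The strip decomposition used by this third code can be sketched as follows. This is a hypothetical illustration rather than the author's Fortran: it assumes the number of nodes divides 512, that the domain is cut along its second dimension into equal strips, and that each strip exchanges data only with the strips directly to its left and right. The function name and the dictionary layout are invented here.

```python
def strip_decomposition(n=512, nodes=4):
    """Split an n x n domain into `nodes` strips of size n x (n / nodes).

    Returns one dictionary per slave with its column range, subdomain
    shape, and the indices of its contiguous neighbors.
    """
    if n % nodes != 0:
        raise ValueError("number of nodes must divide n in this sketch")
    width = n // nodes
    strips = []
    for q in range(nodes):
        strips.append({
            "slave": q,
            "cols": (q * width, (q + 1) * width),  # half-open column range
            "shape": (n, width),
            "neighbors": [r for r in (q - 1, q + 1) if 0 <= r < nodes],
        })
    return strips


if __name__ == "__main__":
    for strip in strip_decomposition(512, 8):
        print(strip)
```

Interior strips have two neighbors and the two end strips have one, so the number of messages per slave stays fixed as nodes are added while the work per slave shrinks, which is the trade-off between communication and computation discussed earlier for small and large problems.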
The performance data contained in this study are from a limited set of
scientific kernels and application codes, and extrapolating these data to other
applications may lead to incorrect conclusions.
The performance of a collection of workstations interconnected via
a communication network for execution of parallel application codes
will depend on the performance of each workstation node, on the
communication network, and on how the application code is parallelized.
Single node performance results were mixed with different vendors
outperforming the others depending on the test chosen. The I/O
performance results were also mixed with no single vendor outperforming
the others.
Notice that IBM's optimization of PVM for the SP2 (that is, PVMe) provides
significant improvement in performance over the non-optimized PVM.
PVMe on the SP2 significantly outperformed the others
on the communication tests.
The fast communication rate achieved by PVMe helped the SP2 to
outperform the other vendors for many of the test cases for
the parallel application codes;
however, DEC and HP did outperform the SP2 on several of these
tests.
For all of the machines evaluated, broadcast rates did not scale
well as the number of processors increased from two to eight. This is
also true for the other concurrent communication tests with
the exception of SP2-PVMe results.
The authors would like to thank the Cornell Theory Center, the Pittsburgh
Supercomputing Center and the Maui High Performance Computing Center for
allowing us to use their machines for this study. The authors would also
like to thank Jean Castel-Branco from the Universite Catholique de Louvain
for allowing us to use his hydrodynamic code,
Ruud van der Paas for the generalized red/black Poisson solver
and Peter Michielse for the oil reservoir simulation code for this study.
We also thank Bill Celmaster from DEC for providing performance results
for the Alpha Farm,
and Dan Nordhues from HP for providing us
with performance results on HP workstations.