Scalability and Performance of OpenMP and MPI on a 128-Processor SGI Origin 2000

Glenn R. Luecke and Wei-Hua Lin

grl@iastate.edu, whlin@iastate.edu

291 Durham Center

Iowa State University

Ames, Iowa 50011-2251, USA

August 16, 2000

Abstract

The purpose of this paper is to investigate the scalability and performance of seven simple OpenMP test programs and to compare their performance with equivalent MPI programs on an SGI Origin 2000. Since the memory in the SGI Origin 2000 is physically distributed among its nodes, special data distribution directives were used to make sure that the OpenMP implementation had the same data distribution as the MPI implementation. Without specifying the data distribution, performance of the OpenMP implementation would be poorer. Each test was executed for a large message and a small message, making a total of 14 experiments. Even with data distributions carefully specified for the OpenMP implementations, the MPI implementations performed and scaled significantly better than the corresponding OpenMP implementations in 9 of the 14 experiments. The authors recommend that the OpenMP standard be enhanced to allow the reduction option to be used for arrays.

 

 

1 Introduction

It is well known that writing programs for distributed memory parallel computers is often significantly more difficult than writing programs for shared memory parallel computers. It is also well known that shared memory parallel computers do not scale well to hundreds of processors whereas distributed memory parallel computers can scale well to hundreds of processors. Thus, those who need hundreds of processors to execute their programs are currently forced to write programs for distributed memory computers with MPI [2] instead of writing their programs for shared memory computers with OpenMP [1]. The SGI Origin 2000 is a "hybrid" between shared and distributed memory since memory is physically distributed among nodes (each node has two processors sharing a common memory) but there is a single address space for all memory. Programmers have the option of writing their parallel programs using either OpenMP or MPI on this machine. The purpose of this paper is to investigate the scalability and performance of seven simple OpenMP test programs and to compare their performance with equivalent MPI programs on an SGI Origin 2000.

2 Test Environment

The SGI Origin 2000 used for this study is a 128-processor machine located in Eagan, Minnesota. It has 64 nodes, each consisting of two 300 MHz MIPS R12000 processors sharing a common memory. Each processor has two levels of cache. The primary data cache and instruction cache are both two-way set associative, each of size 32*1024 bytes. The 8*1024*1024 byte secondary cache is used for both data and instructions. The communication network is a hypercube for up to 32 processors. For more than 32 processors, multiple hypercubes are interconnected via a CrayLink Interconnect; see [3] for more information. All programs were written in Fortran 90 using compiler version 7.3.1.m, compiled with optimization (-O3) in 64-bit mode (-64), and run under IRIX 6.5. All tests were run one at a time with nobody else using the machine.

 

 

3 Timing Methodology

Timings were done by first flushing the caches on all processors by changing the values in the real*8 array flush(8*1024*1024/8). Thus, flush has the size of the secondary cache, 8 Mbytes. The cache was flushed by adding .0000123 to flush; however, any number small enough to prevent floating point overflow would also work. Throughout this paper, p is the number of processors/threads used and ntest is the number of timing tests performed in a single job. ntest was set to 100 for all tests. For the OpenMP tests, the following was used for timing.

    integer,parameter :: ntest=100
    real*8 :: time(ntest), local_time(0:p-1,ntest)   ! indexed by thread number 0..p-1
    . . .
    !$omp parallel default(none) private(t,k,flush) shared(time,local_time)
    . . .
    t = timef()                      ! the first call may be undefined
    do k = 1, ntest
       flush = flush + .0000123      ! flush the cache
    !$omp barrier
       t = timef()                   ! time in milliseconds, 1/1000 of a second
       ... OpenMP code to be timed ...
       local_time(omp_get_thread_num(),k) = timef() - t
    !$omp barrier
       A(k) = A(k) + flush(k)        ! prevent loop splitting
       time(k) = maxval(local_time(0:p-1,k))*0.001d0   ! time in seconds
    enddo
    !$omp end parallel

The following was used to time the MPI tests.

    integer,parameter :: ntest=100
    real*8 :: time(ntest), local_time(ntest)
    . . .
    do k = 1, ntest
       flush = flush + .0000123      ! flush the cache
       call mpi_barrier(mpi_comm_world,ierror)
       t = timef()
       ... MPI code to be timed ...
       local_time(k) = timef() - t
       call mpi_barrier(mpi_comm_world,ierror)
       A(k) = A(k) + flush(k)        ! prevent loop splitting
    enddo
    call mpi_reduce(local_time,time,ntest,mpi_real8,mpi_max,0,mpi_comm_world,ierror)
    time = 0.001d0*time              ! time in seconds

The first barrier in both of the above code segments ensures that all processors reach this point before the timing is started by calling the wall-clock timer. The second barrier makes sure that all processors have completed the "code to be timed". To prevent the compiler from splitting out the cache flushing, the statement "A(k) = A(k) + flush(k)" was added, where A is an array involved in the communication being timed. To prevent the compiler from considering part or all of the "code to be timed" as dead code and eliminating it, the values of A(1) and flush(1) were written later in the program. Each processor/thread records its time in local_time. The maximum over all processors/threads is then stored in the time array.

Figure 1 shows times in seconds for 100 trials for the dot product test with n = 10240 and p = 128, where p is the number of processors used. Notice that there are several "spikes" in the data, with the first spike being the first timing. This kind of behavior is typical of the data found for all tests. The first timing usually was significantly longer than most of the other timings (likely due to the additional setup time required for the first call to subroutines and functions), so the time for the first trial was always removed. The other spikes are probably due to the operating system interrupting the execution of the program. The average of the remaining 99 trials is 14.3 milliseconds, which is much longer than most of the individual trials. The authors decided to measure times for each operation by first filtering out the spikes as follows. Compute the median value after the first time trial is removed. All times that are greater than 1.8 times this median value are then removed. The authors consider it inappropriate to remove more than 10% of the data. If more than 10% of the data would be removed by the above procedure, then only the largest 10% of the spikes are removed. Using this procedure, the filtered value for the time for figure 1 is 5.4 milliseconds instead of 14.3 milliseconds.

Figure 1. 100 trials for the OpenMP Dot Product test with n = 10240 and p = 128.
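The filtering procedure described in section 3 can be sketched as follows. This is a minimal illustration written in Python rather than the Fortran used for the tests. The function name filtered_time and its parameter names are hypothetical, and two points are assumptions that the text does not state explicitly: the filtered value is taken to be the mean of the retained trials, and when the 10% cap applies, the largest 10% of the trials are the ones removed.

```python
from statistics import median


def filtered_time(times, spike_factor=1.8, max_removed_fraction=0.10):
    """Return the mean of the timings after removing the first trial and the spikes.

    times: one timing per trial, in trial order.
    A spike is any timing greater than spike_factor times the median.
    At most max_removed_fraction of the trials are ever removed (assumption:
    when capped, the largest values are the ones dropped).
    """
    trials = sorted(times[1:])                 # the first trial is always discarded
    cutoff = spike_factor * median(trials)
    n_spikes = sum(1 for t in trials if t > cutoff)
    n_removed = min(n_spikes, int(max_removed_fraction * len(trials)))
    kept = trials[:len(trials) - n_removed]    # spikes are the largest values
    return sum(kept) / len(kept)               # assumption: mean of what is kept


if __name__ == "__main__":
    # Synthetic data, not measurements from the paper.
    sample = [50.0] + [5.0] * 95 + [100.0] * 4
    print(filtered_time(sample))
```

With the synthetic sample above, the four large values exceed 1.8 times the median and fall under the 10% cap, so only the 95 ordinary trials contribute to the result.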

 

 

4 Data Distribution on the Origin 2000

On the SGI Origin 2000, data for MPI programs is local to the executing processor unless there is not enough local memory, in which case memory on other nodes is used. There was sufficient local memory for all the tests in this paper. OpenMP is designed for shared memory machines and hence does not contain any data distribution directives. Since the memory on the SGI Origin 2000 is not physically shared, SGI provides data distribution directives that allow users to specify how data is placed on processors. If no data distribution directives are used, data is automatically distributed via the "first touch" mechanism [3], which places the data on the processor where it is first used. Since placement of data can significantly affect performance, data distribution directives were used for each test. There are two sets of data distribution directives available on the Origin 2000: the !$sgi distribute directives and the !$sgi distribute_reshape directives; see [3]. All tests use the !$sgi distribute_reshape directive in order to ensure that the data is distributed as specified. To illustrate the difference in performance of OpenMP with and without data distribution directives, test 3 with n = 4*1024*1024/8 was run without data distribution directives. The resulting time was about four times longer than with data distribution directives.

 

 

5 Communication Tests and Performance Results

For all tests p denotes the number of processes/threads used for the test. All floating-point variables are declared as real*8. Unless specified differently, arrays x and y are declared as real*8. The integer variables rank and thread indicate the rank/thread of the executing processor. Except for the ping-pong test, all tests were run with p = 8, 16, 32, 64, and 128. All !$omp do directives are scheduled with the static option to ensure that arrays are distributed as desired. There are three possible scheduling options for !$omp do directives: static, dynamic, and guided. Testing showed that the static option always provided the best performance for the machine used.

 

Test 1: Sending Messages between "Close" and "Distant" Processors.

The purpose of this test is to compare the time required by the OpenMP and MPI implementations to send messages between "close" and "distant" processors. This was done by measuring the time to ‘ping pong’ array A from processor 0 to processor 1, from processor 0 to processor 2, …, from processor 0 to processor p-1. Each ‘ping pong’ is timed separately, as indicated in section 3. Arrays A and B are block distributed across p processors for the OpenMP implementation. Notice that for both the MPI and OpenMP versions of this test, both processors send A and use B instead of A to receive the message. If A were also used to receive the message, A would be placed in the cache of the receiving processor, so the subsequent sending of A would likely be faster than the first send.

One must call the barrier twice in the OpenMP code since mpi_recv is blocking. The OpenMP code used is:

    real*8 :: A(n*p), B(n*p)
    !$sgi distribute_reshape A(block), B(block)
    . . .
    !$omp parallel shared(A,B) private(thread,i,j,k)
    thread = omp_get_thread_num()
    do j = 1, p-1
       do k = 1, ntest
          . . .
          t = timef()                                        !! Start Timer
          if (thread == j) B(j*n + 1:j*n + n) = A(1:n)
    !$omp barrier
          if (thread == 0) B(1:n) = A(j*n + 1:j*n + n)
    !$omp barrier
          local_time(omp_get_thread_num(),k) = timef() - t   !! End of Timer
          . . .
       enddo
    enddo
    !$omp end parallel

Notice that since A is block distributed among p threads and has n*p elements, A(1:n) is on thread 0 and A(j*n + 1:j*n + n) is on thread j.
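The block distribution used in test 1 can be illustrated with a short sketch of the index arithmetic. It is written in Python for illustration only, the helper names block_range and block_owner are hypothetical, and it assumes that an array of n*p elements is split into p contiguous blocks of exactly n elements, with block j placed on thread j.

```python
def block_range(thread, n):
    """Return the 1-based (first, last) indices of the block held by a thread."""
    return thread * n + 1, thread * n + n


def block_owner(i, n, p):
    """Return the thread holding element i (1-based) of an array of n*p elements."""
    if not 1 <= i <= n * p:
        raise IndexError("index outside the distributed array")
    return (i - 1) // n


if __name__ == "__main__":
    n, p = 4, 3
    for thread in range(p):
        first, last = block_range(thread, n)
        print(f"thread {thread} holds A({first}:{last})")
```

Under these assumptions, block_range(0, n) gives A(1:n) for thread 0 and block_range(j, n) gives A(j*n + 1:j*n + n) for thread j, matching the slices used in the OpenMP code.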

The MPI code used is:

    real*8 :: A(n), B(n)
    . . .
    do j = 1, p-1
       do k = 1, ntest
          . . .
          t = timef()                   !! Start Timer
          if (rank == 0) then
             call mpi_send(A,n,mpi_real8,j,1,mpi_comm_world,ierr)
             call mpi_recv(B,n,mpi_real8,j,2,mpi_comm_world,status,ierr)
          endif
          if (rank == j) then
             ! tags must match those used by rank 0: receive tag 1, reply with tag 2
             call mpi_recv(B,n,mpi_real8,0,1,mpi_comm_world,status,ierr)
             call mpi_send(A,n,mpi_real8,0,2,mpi_comm_world,ierr)
          endif
          local_time(k) = timef() - t   !! End of Timer
          . . .
       enddo
    enddo

 

Figures 2 and 3 contain the results in milliseconds for sending 4 Mbyte and 8 Byte messages, respectively. For a "perfectly" scalable computer, these graphs would be horizontal lines. This is not the case in figures 2 and 3, but the OpenMP version does give graphs that are closer to a horizontal line than the MPI version. Notice that for 4 Mbyte messages the OpenMP version is about 2.5 times slower than the MPI version. For the 8 Byte message, the OpenMP version is a little faster than the MPI version. The authors do not know why the OpenMP version is so slow for large messages.

 

Figure 2. Test 1 for a 4 Mbyte message.

 

Figure 3. Test 1 for an 8 Byte message.

Test 2: The Right Shift Test

This test compares the performance of the right shift operation implemented using OpenMP and using MPI. The following is the OpenMP code for this test.

    real*8 :: A(n*p), B(n*p)
    !$sgi distribute_reshape A(block), B(block)
    . . .
    !$omp parallel private(i, j, k) shared(A, B)
    . . .
    t = timef()                                         !! Start Timer
    j = omp_get_thread_num()
    i = modulo(j+1,p)
    B(i*n+1:i*n+n) = A(j*n+1:j*n+n)
    local_time(omp_get_thread_num(),k) = timef() - t    !! End of Timer
    . . .
    !$omp end parallel

For the MPI implementation, the processor of rank k sends array A(1:n) to processor modulo(k+1, p) and receives B(1:n) from processor modulo(k-1, p), for k = 0, ..., p-1.

    real*8 :: A(n), B(n)
    . . .
    t = timef()                   !! Start Timer
    i = modulo(rank+1, p)
    j = modulo(rank-1, p)
    call mpi_sendrecv(A,n,mpi_real8,i,1,B,n,mpi_real8,j,1,mpi_comm_world,status,ierr)
    local_time(k) = timef() - t   !! End of Timer

Figures 4 and 5 contain the performance results for the right shift test for 4 Mbyte and 8 Byte messages, respectively. In a "perfectly" scalable computer, all right shifts would be executed concurrently and hence the time would be independent of the number of processors and the graphs would be horizontal lines. The graphs in figure 4 for the 4 Mbyte message are roughly horizontal lines for 32 to 128 processors. However, the MPI implementation for the 8 Byte message is far from a horizontal line and performs significantly worse than the OpenMP implementation. For the 4 Mbyte messages, the MPI implementation is roughly 1.3 times faster than the OpenMP implementation. There appears to be a significant performance problem with SGI’s implementation of mpi_sendrecv for small messages.

Figure 4. Test 2 results in milliseconds for 4 MByte messages.

Figure 5. Test 2 results in milliseconds for 8 Byte messages.
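The index arithmetic of the right shift in test 2 can be checked with a small serial sketch. It is written in Python for illustration, the function name right_shift is hypothetical, and the loop over j stands in for the p threads that would each copy one block concurrently; it models only where the data goes, not the timing behavior.

```python
def right_shift(A, n, p):
    """Copy block j of A into block (j+1) mod p of a new array B.

    A is a flat list of n*p elements. Python slices are 0-based, so the
    Fortran slice A(j*n+1:j*n+n) corresponds to A[j*n:(j+1)*n].
    """
    B = [None] * (n * p)
    for j in range(p):                 # j plays the role of omp_get_thread_num()
        i = (j + 1) % p                # destination block, as in modulo(j+1,p)
        B[i * n:(i + 1) * n] = A[j * n:(j + 1) * n]
    return B


if __name__ == "__main__":
    print(right_shift(list(range(6)), n=2, p=3))
```

Each block moves one position to the right, and the last block wraps around to position 0, which is why all p copies are independent and could in principle run concurrently.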

 

 

Test 3: Summing Elements of an Array

This test measures the time to sum the components of a one-dimensional array A whose elements are block distributed across p processors in the OpenMP implementation. The following is the OpenMP code.

    real*8 :: A(n*p)
    !$sgi distribute_reshape A(block)
    sum = 0.0
    !$omp parallel shared(A,sum)
    . . .
    t = timef()                                          !! Start Timer
    !$omp do schedule(static),reduction(+:sum)
    do j = 1, n*p
       sum = sum + A(j)
    enddo
    !$omp enddo
    local_time(omp_get_thread_num(), k) = timef() - t    !! End of Timer
    . . .
    !$omp end parallel

For the MPI implementation, s stores the value of the local sum and sum holds the value of the sum of all components of the array. The MPI code used for this test is:

    real*8 :: A(n)
    . . .
    t = timef()                   !! Start Timer
    s = 0.0
    do j = 1, n
       s = s + A(j)
    enddo
    call mpi_reduce(s,sum,1,mpi_real8,mpi_sum,0,mpi_comm_world,ierr)
    local_time(k) = timef() - t   !! End of Timer

Figures 6 and 7 contain the performance results for this test for n = 4*1024*1024/8 and n = 1, respectively. Since the "reduction(+:sum)" option was used in the OpenMP implementation, one would expect the OpenMP implementation to perform well compared to the MPI implementation. However, the OpenMP implementation does not scale nearly as well as the MPI implementation for n = 1. They do perform about the same for n = 4*1024*1024/8. The OpenMP code could have also been written by first adding local sums and then updating the global sum variable in a critical region. However, the performance of this modified code is a little worse than when the above code is executed.

Figure 6. Test 3 results in milliseconds for n = 4*1024*1024/8.

Figure 7. Test 3 results in milliseconds for n = 1.
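Test 3 mentions an alternative OpenMP formulation in which each thread first adds up a local sum and then updates the global sum inside a critical region. The structure of that alternative can be sketched as follows. This is an illustration in Python using the standard threading module, not the authors' Fortran code: the function name sum_with_critical is hypothetical, the lock plays the role of the critical region, and the array length is assumed to be an exact multiple of p.

```python
import threading


def sum_with_critical(A, p):
    """Sum the list A with p threads: local partial sums, then a locked update."""
    n = len(A) // p                    # block length, assuming len(A) == n*p
    total = 0.0
    lock = threading.Lock()            # stands in for the critical region

    def worker(thread):
        nonlocal total
        local = 0.0
        for x in A[thread * n:(thread + 1) * n]:
            local += x                 # no synchronization needed here
        with lock:                     # one thread at a time updates the total
            total += local

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(sum_with_critical([float(i) for i in range(1, 13)], p=4))
```

The locked update is executed once per thread, so the serialized part of the work grows with p even though the local sums are independent.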

Test 4: Dot Product

This test compares the performance of calculating the dot product of two arrays x and y using OpenMP and MPI. Elements of x and y are block distributed across p processors for the OpenMP implementation. The OpenMP code used for this test is:

    real*8 :: x(n*p), y(n*p)
    !$sgi distribute_reshape x(block), y(block)
    . . .
    t = timef()                                          !! Start Timer
    sum = 0.0
    !$omp parallel shared(sum, x, y)
    !$omp do schedule(static),reduction(+:sum)
    do i = 1, n*p
       sum = sum + x(i)*y(i)
    enddo
    !$omp enddo
    local_time(omp_get_thread_num(), k) = timef() - t    !! End of Timer
    . . .
    !$omp end parallel

For the MPI implementation, s stores the value of the local sum and sum holds the value of the dot product of x and y. The MPI code used for this test is:

    real*8 :: x(n), y(n)
    . . .
    t = timef()                   !! Start Timer
    s = 0.0
    do i = 1, n
       s = s + x(i)*y(i)
    enddo
    call mpi_reduce(s,sum,1,mpi_real8,mpi_sum,0,mpi_comm_world,ierror)
    local_time(k) = timef() - t   !! End of Timer

Figures 8 and 9 contain the performance results for this test for n = 80*1024/8 and n = 1, respectively. Notice that the times for n = 80*1024/8 in figure 8 are nearly the same as those for n = 1 in figure 9. This is because the time required for the extra additions is small compared with the rest of the time. In both cases, the MPI implementation performs and scales better than the OpenMP implementation. There appears to be a significant performance problem with OpenMP for this test even though the "reduction(+:sum)" option was used.

Figure 8. Test 4 results in milliseconds for n = 80*1024/8.

Figure 9. Test 4 results in milliseconds for n = 1.

Test 5: Matrix Times Vector

This test compares the performance of OpenMP and MPI implementations of the matrix times vector operation, y = y + Ax, where for the OpenMP implementation A is column block distributed across p processors, x is block distributed, and y is replicated on all processors. The OpenMP version of this test also used the temporary array psum and the code used is:

    real*8 :: y(n), A(n,n*p), x(n*p), psum(n)
    !$sgi distribute_reshape A(*,block), x(block)
    . . .
    !$omp parallel shared(A,x,y) private(psum)
    t = timef()                                          !! Start Timer
    psum = 0.0
    !$omp do schedule(static)
    do j = 1, n*p
       psum(:) = psum(:) + A(:,j)*x(j)
    enddo
    !$omp enddo
    !$omp critical
    y(:) = y(:) + psum(:)
    !$omp end critical
    local_time(omp_get_thread_num(), k) = timef() - t    !! End of Timer
    . . .
    !$omp end parallel

The MPI version requires the use of the temporary arrays psum and sum. The following is the MPI version for this test:

    real*8 :: y(n), A(n,n), x(n), psum(n), sum(n)
    . . .
    t = timef()                   !! Start Timer
    psum = 0.0
    do j = 1, n
       psum(:) = psum(:) + A(:,j)*x(j)
    enddo
    call mpi_reduce(psum,sum,n,mpi_real8,mpi_sum,0,mpi_comm_world,ierror)
    y(:) = y(:) + sum(:)
    local_time(k) = timef() - t   !! End of Timer

Figures 10 and 11 contain the performance results for this test for n = 128 and n = 1, respectively. In both cases, the MPI implementation performs and scales significantly better than the OpenMP implementation. This is likely due to the fact that the critical region in the OpenMP program is executed serially whereas mpi_reduce uses parallelism (probably a binary tree algorithm) for updating y. If OpenMP would allow one to use the "reduction(+:sum)" option for arrays, then vendors would have the opportunity to generate efficient parallel code for the above operation.

Figure 10. Test 5 results in milliseconds for n = 128.

Figure 11. Test 5 results in milliseconds for n = 1.
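The contrast drawn for test 5 between a serially executed critical region and a reduction that "probably" uses a binary tree can be made concrete by counting sequential steps. The sketch below is in Python and purely illustrative: the function names are hypothetical, it is not SGI's implementation of mpi_reduce, and it assumes that all pairwise additions within one round of the tree can proceed at the same time.

```python
def serial_update(partials):
    """Add the partial vectors one at a time, as a critical region forces.

    Returns (total, steps), where steps is the number of sequential updates.
    """
    total = [0.0] * len(partials[0])
    steps = 0
    for psum in partials:
        total = [a + b for a, b in zip(total, psum)]
        steps += 1
    return total, steps


def tree_reduce(partials):
    """Combine the partial vectors pairwise in rounds (binary tree).

    Returns (total, rounds). The additions inside one round are independent,
    so on a parallel machine each round could count as one sequential step.
    """
    vals = [list(v) for v in partials]
    rounds = 0
    while len(vals) > 1:
        nxt = [[a + b for a, b in zip(vals[k], vals[k + 1])]
               for k in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:              # an odd vector out waits for the next round
            nxt.append(vals[-1])
        vals = nxt
        rounds += 1
    return vals[0], rounds


if __name__ == "__main__":
    p = 128
    partials = [[float(r), 1.0] for r in range(p)]
    print(serial_update(partials)[1], tree_reduce(partials)[1])
```

For p = 128 the serial update takes 128 sequential steps while the tree needs 7 rounds, which is consistent with the explanation offered in the text for the gap between the two implementations, although the actual cause was not measured directly.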

Test 6: Matrix Addition

This test compares the performance of OpenMP and MPI implementations of the sum of two matrices, C = A + B, where for the OpenMP implementation these arrays are column block distributed across p processors. Notice that no communication is required for this test. The OpenMP version of this test is:

    real*8,dimension(n,n*p) :: A, B, C
    !$sgi distribute_reshape A(*,block), B(*,block), C(*,block)
    . . .
    !$omp parallel shared(A,B,C) private(j)
    . . .
    t = timef()                                          !! Start Timer
    !$omp do schedule(static)
    do j = 1, n*p
       C(:,j) = A(:,j) + B(:,j)
    enddo
    !$omp enddo
    local_time(omp_get_thread_num(), k) = timef() - t    !! End of Timer
    . . .
    !$omp end parallel

The MPI version of this test is:

    real*8,dimension(n,n) :: A, B, C
    . . .
    t = timef()                   !! Start Timer
    C = A + B
    local_time(k) = timef() - t   !! End of Timer

Figures 12 and 13 contain the performance results for this test for n = 512 and n = 1, respectively. For this very simple test, one would expect OpenMP and MPI to perform about the same. However, for n = 1, the MPI implementation performs and scales significantly better than the OpenMP implementation.

 

 

 

Figure 12. Test 6 results in milliseconds for n = 512.

Figure 13. Test 6 results in milliseconds for n = 1.

 

Test 7: Matrix-Matrix Multiplication

This test compares the performance of OpenMP and MPI implementations of the product of two matrices, C = C + AB. For the OpenMP implementation, A is an n × (n*p) array column distributed across p processors, B is an (n*p) × n array row distributed across p processors, and C is an n × n array on processor 0. The OpenMP version of this test is:

    real*8 :: A(n,n*p), B(n*p,n)
    !$sgi distribute_reshape A(*,block), B(block,*)
    real*8,dimension(n,n) :: local_C, C
    . . .
    !$omp parallel shared(A,B,C) private(local_C,i,j,k,l)
    . . .
    t = timef()                                          !! Start Timer
    local_C = 0.0d0
    !$omp do
    do l = 1, n*p                 ! l is the summation index; k remains the trial index
       do i = 1, n
          do j = 1, n
             local_C(i,j) = local_C(i,j) + A(i,l) * B(l,j)
          enddo
       enddo
    enddo
    !$omp enddo
    !$omp critical
    C = C + local_C
    !$omp end critical
    local_time(omp_get_thread_num(), k) = timef() - t    !! End of Timer
    . . .
    !$omp end parallel

The MPI version of this test is:

    real*8,dimension(n,n) :: A, B, local_C, C
    . . .
    t = timef()                   !! Start Timer
    local_C = 0.0d0
    do l = 1, n                   ! l is the summation index; k remains the trial index
       do i = 1, n
          do j = 1, n
             local_C(i,j) = local_C(i,j) + A(i,l) * B(l,j)
          enddo
       enddo
    enddo
    ! reduce once, after all three loops have completed
    call mpi_reduce(local_C,C,n*n,mpi_real8,mpi_sum,0,mpi_comm_world,ierror)
    local_time(k) = timef() - t   !! End of Timer

Figures 14 and 15 contain the performance results for this test for n = 512 and n = 1, respectively. Notice that in both cases, the MPI implementation scales well and the OpenMP implementation scales very poorly. As in test 5, this poor performance of OpenMP is likely due to the fact that the critical region must be executed serially (as required by the OpenMP specification) whereas mpi_reduce is executed in parallel.

Figure 14. Test 7 results in seconds for n = 512.

Figure 15. Test 7 results in milliseconds for n = 1.
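The decomposition used in test 7 relies on the fact that the product of an n × (n*p) matrix A and an (n*p) × n matrix B is the sum of p partial products, one per column block of A and matching row block of B. A small serial check of that identity is sketched below in Python. The function name block_product is hypothetical, the loop over thread stands in for the p threads or processes, and it is assumed that thread t works on the t-th block of n consecutive summation indices.

```python
def block_product(A, B, n, p):
    """Compute A*B as a sum of p partial products over blocks of the inner index.

    A is n x (n*p) and B is (n*p) x n, both given as lists of rows.
    """
    C = [[0.0] * n for _ in range(n)]
    for thread in range(p):
        block = range(thread * n, (thread + 1) * n)   # this thread's inner indices
        local_C = [[sum(A[i][l] * B[l][j] for l in block) for j in range(n)]
                   for i in range(n)]
        # Accumulating local_C corresponds to the critical region (OpenMP)
        # or to the reduction onto processor 0 (MPI).
        for i in range(n):
            for j in range(n):
                C[i][j] += local_C[i][j]
    return C


if __name__ == "__main__":
    n, p = 2, 3
    A = [[i + l for l in range(n * p)] for i in range(n)]
    B = [[l - j for j in range(n)] for l in range(n * p)]
    print(block_product(A, B, n, p))
```

Only the final accumulation of the n × n partial results requires communication or synchronization, which is the step the text identifies as the likely source of the difference between the two implementations.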

6 Conclusions

The purpose of this paper is to investigate the scalability and performance of seven simple OpenMP test programs and to compare their performance with equivalent MPI programs on an SGI Origin 2000. Since the memory in the SGI Origin 2000 is physically distributed among its nodes, special data distribution directives were used to make sure that the OpenMP implementation had the same data distribution as the MPI implementation. Without specifying the data distribution, performance of the OpenMP implementation would be poorer.

Each test was executed for a large message and a small message, making a total of 14 experiments. For 9 of the 14 experiments, the MPI implementations performed and scaled significantly better than the corresponding OpenMP implementations. For 4 of the experiments, the MPI and OpenMP implementations performed roughly the same. For the small message size for the right shift test, the OpenMP implementation performed and scaled significantly better than the corresponding MPI implementation. Therefore, even with data distributions carefully specified for the OpenMP implementations, the MPI implementations performed and scaled significantly better than the corresponding OpenMP implementations in 9 of the 14 experiments. Reduction operations are commonly used in scientific calculations. OpenMP allows vendors to optimize reductions for scalars, as illustrated in test 4 (although this optimization could be improved on the Origin 2000). However, OpenMP does not allow this optimization for arrays. Without this capability it is difficult to write efficient OpenMP implementations of common operations such as matrix-times-vector (test 5) and matrix-times-matrix (test 7). Thus, the authors recommend that the OpenMP standard be enhanced to allow the reduction option to be used for arrays.

 

7 References

  1. OpenMP Specification Web Site. http://www.openmp.org.
  2. MPI standard Web Site. http://www-unix.mcs.anl.gov/mpi/index.html.
  3. J. Fier. Performance Tuning Optimization for Origin 2000 and Onyx 2. Silicon Graphics, 1996. http://techpubs.sgi.com.
  4. G. R. Luecke, B. Raffin, and J. J. Coyle. Comparing the Communication Performance and Scalability of a SGI Origin 2000, a cluster of Origin 2000’s and a Cray T3E-1200 using SHMEM and MPI Routines. The Journal of Performance Evaluation and Modeling for Computer Systems, October 1999. http://hpc-journal.ecs.soton.ac.uk/PEMCS/.

8 Acknowledgements

We would like to thank Silicon Graphics, Incorporated for giving us access to their Origin 2000 for this study.

Appendix A: Test Results

Proc   MPI       OpenMP     Proc   MPI       OpenMP     Proc   MPI       OpenMP
   1   0.17405   0.15008      44   0.21591   0.12906      87   0.22096   0.12961
   2   0.12182   0.14522      45   0.17916   0.13894      88   0.19076   0.14820
   3   0.16571   0.14277      46   0.23042   0.13239      89   0.20166   0.12636
   4   0.19085   0.14719      47   0.28015   0.14007      90   0.14439   0.13140
   5   0.16627   0.13491      48   0.17647   0.14547      91   0.20940   0.13128
   6   0.12343   0.13164      49   0.19756   0.13676      92   0.19943   0.14033
   7   0.16327   0.13114      50   0.18801   0.13456      93   0.22540   0.13284
   8   0.14519   0.13105      51   0.18802   0.13660      94   0.18121   0.13263
   9   0.19190   0.13610      52   0.14985   0.13545      95   0.20179   0.13870
  10   0.20274   0.13469      53   0.22870   0.13108      96   0.14388   0.13264
  11   0.13138   0.13488      54   0.13996   0.13140      97   0.18474   0.13171
  12   0.12486   0.13612      55   0.18234   0.13284      98   0.18857   0.13354
  13   0.19060   0.14004      56   0.23602   0.13748      99   0.14492   0.13087
  14   0.12313   0.13479      57   0.17676   0.13010     100   0.16613   0.13204
  15   0.17426   0.12552      58   0.19133   0.13132     101   0.21266   0.13484
  16   0.23257   0.13684      59   0.23990   0.15771     102   0.13638   0.13348
  17   0.16562   0.13487      60   0.14532   0.12908     103   0.18436   0.13333
  18   0.14639   0.13191      61   0.22833   0.13895     104   0.22527   0.14045
  19   0.25602   0.13908      62   0.22987   0.13153     105   0.16358   0.13188
  20   0.17378   0.14093      63   0.18280   0.14304     106   0.17440   0.13409
  21   0.21135   0.13332      64   0.19241   0.13155     107   0.19220   0.13380
  22   0.21415   0.13554      65   0.23377   0.13976     108   0.13345   0.13060
  23   0.16044   0.14030      66   0.13980   0.13347     109   0.21971   0.13500
  24   0.12146   0.13375      67   0.21764   0.14062     110   0.22867   0.12665
  25   0.22458   0.12798      68   0.19259   0.13076     111   0.16924   0.13223
  26   0.18167   0.12873      69   0.14971   0.13566     112   0.15843   0.14749
  27   0.14861   0.13321      70   0.19627   0.12851     113   0.23977   0.14305
  28   0.22148   0.13373      71   0.23699   0.13109     114   0.13678   0.13592
  29   0.16978   0.12744      72   0.15842   0.13482     115   0.16650   0.13735
  30   0.13542   0.13251      73   0.23731   0.13963     116   0.21497   0.14444
  31   0.23334   0.13285      74   0.22426   0.13178     117   0.13168   0.14608
  32   0.21790   0.13736      75   0.17767   0.13138     118   0.14887   0.12680
  33   0.13955   0.13340      76   0.17346   0.12828     119   0.21047   0.13369
  34   0.22368   0.13450      77   0.21418   0.12524     120   0.14707   0.12980
  35   0.27179   0.13617      78   0.13332   0.12954     121   0.15392   0.13511
  36   0.17159   0.13688      79   0.18440   0.12953     122   0.22933   0.13740
  37   0.23084   0.13441      80   0.19136   0.12657     123   0.14722   0.15063
  38   0.22310   0.12913      81   0.17811   0.13692     124   0.15193   0.12760
  39   0.13992   0.12844      82   0.17975   0.13463     125   0.23120   0.13372
  40   0.19126   0.13077      83   0.27187   0.13758     126   0.15502   0.13288
  41   0.25451   0.12957      84   0.16608   0.13891     127   0.15160   0.13032
  42   0.14966   0.13816      85   0.20117   0.13920
  43   0.22806   0.13455      86   0.25354   0.12833

Table 1: Test 1 results in milliseconds for an 8 Byte message.

Proc Number   MPI         OpenMP      Proc Number   MPI         OpenMP      Proc Number   MPI         OpenMP
          1    89.72394   251.36440            44    99.10971   266.29721            87   109.49805   255.87039
          2    83.21764   262.22881            45    99.81645   254.37840            88   113.08969   250.00200
          3    84.59500   256.03840            46   100.30452   252.15640            89   112.94395   260.95801
          4    90.21360   255.79720            47    99.63263   258.75440            90   112.78191   256.03840
          5    89.34736   316.21880            48   102.01353   249.94360            91   112.29680   255.02239
          6    90.01320   256.03280            49   102.68031   252.69240            92   116.68670   256.83761
          7    90.31889   256.69440            50   101.72310   257.12080            93   117.71979   252.59840
          8    90.94697   303.74240            51   102.52377   265.41121            94   117.18499   259.04240
          9    88.95202   256.63280            52   116.37313   261.15840            95   118.09192   255.48640
         10    88.60717   247.68600            53   115.96083   261.45081            96   118.26320   247.42280
         11    88.26615   253.63000            54   116.17596   258.43200            97   118.44874   285.90640
         12    91.08663   254.79919            55   117.02779   269.68800            98   118.29102   303.74240
         13    90.72441   258.00400            56   123.59173   260.61400            99   118.00852   256.63280
         14   219.94830   260.75400            57   116.26953   254.65280           100   109.63821   262.22881
         15   221.62021   251.68080            58   115.61564   260.73880           101   108.99699   256.03840
         16   105.22085   253.48840            59   114.23834   253.65160           102   109.85847   255.79720
         17   105.16950   258.41200            60   103.58529   258.84239           103   110.08379   316.21880
         18   105.64853   255.87039            61   103.58559   257.18320           104   108.52510   256.03280
         19   105.70492   250.00200            62   103.13302   249.45760           105   108.66865   256.69440
         20    95.35535   260.95801            63   103.79404   265.94521           106   108.99980   249.94360
         21    95.48839   256.03840            64   120.36163   253.65160           107   108.61672   252.69240
         22    95.25715   255.02239            65   119.11013   256.69440           108   104.09929   257.12080
         23    95.01758   256.83761            66   120.53880   303.74240           109   103.87025   265.41121
         24    94.78233   252.59840            67   121.83490   256.63280           110   103.07906   256.63280
         25    94.99303   259.04240            68   105.56076   262.22881           111   104.52920   247.68600
         26    94.06268   255.48640            69   105.27497   256.03840           112   111.04492   253.63000
         27    94.88901   247.42280            70   105.17023   255.79720           113   111.30107   254.79919
         28    94.61384   285.90640            71   105.78888   316.21880           114   111.31593   258.00400
         29    93.97203   257.09840            72   104.22440   256.03280           115   111.65165   260.75400
         30    94.11628   258.54520            73   105.26745   256.69440           116   103.38948   251.68080
         31    94.99695   257.13680            74   104.06732   249.94360           117   103.72090   253.48840
         32   109.56971   254.95841            75   104.87533   252.69240           118   103.80603   258.41200
         33   109.98735   250.16480            76   134.65164   257.12080           119   103.50697   255.87039
         34   109.78217   250.88640            77   132.45814   265.41121           120   111.59939   250.00200
         35   111.58338   253.93120            78   127.91500   256.63280           121   104.72835   260.95801
         36   112.11820   270.78401            79   128.49435   247.68600           122   102.24204   256.03840
         37   111.63712   265.46280            80   108.85060   253.63000           123   102.48340   255.02239
         38   111.72617   263.14839            81   108.81991   254.79919           124    97.60257   256.83761
         39   112.20314   259.74559            82   109.54080   258.00400           125    97.23977   252.59840
         40   110.70665   254.02280            83   109.27527   260.75400           126    97.83089   259.04240
         41   109.72619   252.91601            84   109.13360   251.68080           127    97.93619   255.48640
         42   110.27547   255.92880            85   109.00524   253.48840
         43   110.90849   251.28201            86   108.78884   258.41200

Table 2: Test 1 results in milliseconds for a 4 MByte message.

 

 

Number of      Small Message Size       Large Message Size
Processors     MPI        OpenMP        MPI          OpenMP
    2          0.08836    0.01210       110.05308    123.70044
    4          0.20996    0.01226       102.71691    129.79605
    8          0.18834    0.01516       107.30495    130.20557
   16          0.20204    0.02982       120.10400    169.79771
   32          0.31571    0.04249       122.63455    187.59256
   64          0.43802    0.03771       140.20393    197.29718
  128          0.99056    0.04144       131.67589    181.64581

Table 3: Results of Right Shift Test in milliseconds.

 

Number of      Small Message Size       Large Message Size
Processors     MPI        OpenMP        MPI          OpenMP
    2          0.17184    0.10950       17.74027     15.14267
    4          0.26932    0.07985       18.10493     15.71384
    8          0.30780    0.09961       19.77006     21.56141
   16          0.25433    0.12885       19.48238     19.16049
   32          0.32323    0.43332       16.21149     18.01448
   64          0.48466    1.04917       24.48468     25.25756
  128          1.27333    5.01547       22.84760     26.76295

Table 4: Results of Summing Elements of an Array Test in milliseconds.

 

Number of      Small Message Size       Large Message Size
Processors     MPI        OpenMP        MPI          OpenMP
    2          0.35820    0.47300       0.25754      0.48000
    4          0.30335    0.81600       0.38257      0.74700
    8          0.38689    0.82500       0.31821      0.82700
   16          0.47501    0.85900       0.40480      0.85000
   32          0.51352    1.17900       0.49341      1.21400
   64          0.71350    1.44600       0.67548      1.43500
  128          0.75809    5.40200       0.83057      5.36700

Table 5: Results of Dot Product Test in milliseconds.

 

 

Number of      Small Message Size       Large Message Size
Processors     MPI        OpenMP        MPI          OpenMP
    2          0.40409     0.34849      0.41317       0.32861
    4          0.43980     0.73026      0.46872       0.74449
    8          0.58427     0.89533      0.59253       0.88674
   16          0.77829     1.19132      0.78144       1.25176
   32          0.79476     3.45444      0.81653       3.56695
   64          0.96169     8.08260      1.05923       8.35535
  128          1.16052    16.32955      1.38008      85.11001

Table 6: Results of Matrix Times Vector Test in milliseconds.

 

Number of      Small Message Size       Large Message Size
Processors     MPI        OpenMP        MPI          OpenMP
    2          0.00074    0.05029       24.943927    30.652509
    4          0.00099    0.04630       24.979418    32.587022
    8          0.00154    0.07439       24.995871    32.759396
   16          0.00144    0.08476       18.718656    32.559974
   32          0.00227    0.15187       19.566440    43.461355
   64          0.00138    0.22214       30.855798    45.372073
  128          0.00162    0.13986       25.493576    49.484702

Table 7: Results of Matrix Addition Test in milliseconds.

 

Number of      Small Message Size       Large Message Size
Processors     MPI        OpenMP        MPI          OpenMP
    2          0.17111     0.05103      681.48316     2030.20619
    4          0.30371     0.08360      718.58947     2028.06075
    8          0.25569     0.15246      750.87955     2054.26156
   16          0.39853     0.40401      648.17432     2346.63203
   32          0.61716     1.57737      839.89392     3371.24213
   64          0.93830     4.79924      968.12697     6446.41928
  128          1.81589    18.40094      905.37065    65270.26787

Table 8: Results of Matrix-Matrix Multiplication Test in milliseconds.