Scalability and Performance of OpenMP and MPI on
a 128-Processor SGI Origin 2000
Glenn R. Luecke and Wei-Hua Lin
grl@iastate.edu, whlin@iastate.edu
291 Durham Center
Iowa State University
Ames, Iowa 50011-2251, USA
August 16, 2000
Abstract
The purpose of this paper is to investigate the scalability and performance of seven simple OpenMP test programs and to compare their performance with equivalent MPI programs on an SGI Origin 2000. Since the memory in the SGI Origin 2000 is physically distributed among its nodes, special data distribution directives were used to make sure that the OpenMP implementation had the same data distribution as the MPI implementation. Without specifying the data distribution, performance of the OpenMP implementation would be poorer. Each test was executed for a large message and a small message, making a total of 14 experiments. Even with carefully specified data distributions for the OpenMP implementations, 9 of the MPI implementations performed and scaled significantly better than the corresponding OpenMP implementations for these simple tests. The authors recommend that the OpenMP standard be enhanced to allow the reduction option to be used for arrays.
1 Introduction
It is well known that writing programs for distributed memory parallel computers is often significantly more difficult than writing programs for shared memory parallel computers. It is also well known that shared memory parallel computers do not scale well to hundreds of processors whereas distributed memory parallel computers can scale well to hundreds of processors. Thus, those who need hundreds of processors to execute their programs are currently forced to write programs for distributed memory computers with MPI [2] instead of writing their programs for shared memory computers with OpenMP [1]. The SGI Origin 2000 is a "hybrid" between shared and distributed memory since memory is physically distributed among nodes (each node has two processors sharing a common memory) but there is a single address space for all memory. Programmers have the option of writing their parallel programs using either OpenMP or MPI on this machine. The purpose of this paper is to investigate the scalability and performance of seven simple OpenMP test programs and to compare their performance with equivalent MPI programs on an SGI Origin 2000.
2 Test Environment
The SGI Origin 2000 used for this study is a 128-processor machine located in Eagan, Minnesota. It has 64 nodes, each consisting of two 300 MHz MIPS R12000 processors sharing a common memory. This processor has two levels of cache. The primary data cache and instruction cache are both two-way set associative, each of size 32*1024 bytes. The 8*1024*1024 byte secondary cache is used for both data and instructions. The communication network is a hypercube for up to 32 processors. For more than 32 processors, multiple hypercubes are interconnected via a CrayLink Interconnect; see [3] for more information. All programs were written in Fortran 90 using version 7.3.1.m, compiled with optimization (-O3) and in 64-bit mode (-64), and run under IRIX 6.5. All tests were run one at a time with nobody else using the machine.
3 Timing Methodology
Timings were done by first flushing the caches on all processors by changing the values in the real*8 array flush(8*1024*1024/8). Thus, flush has the size of the secondary cache, 8 Mbytes. The cache was flushed by adding .0000123 to flush; however, any number small enough to prevent floating point overflow would also work. Throughout this paper, p is the number of processors/threads used and ntest is the number of timing tests performed in a single job. ntest was set to 100 for all tests. For the OpenMP tests, the following was used for timing.
integer,parameter :: ntest=100
real*8 :: time(ntest), local_time(0:p-1,ntest) ! first index is the thread number, 0..p-1
. . .
!$omp parallel default(none) private(t,k,flush) shared(time,local_time)
. . .
t = timef() ! the first call may be undefined
do k = 1, ntest
   flush = flush + .0000123 ! flush the cache
!$omp barrier
   t = timef() ! time in milliseconds, 1/1000 of a second
   ... OpenMP code to be timed ...
   local_time(omp_get_thread_num(),k) = timef() - t
!$omp barrier
   A(k) = A(k) + flush(k) ! prevent loop splitting
   time(k) = maxval(local_time(0:p-1,k))*0.001d0 ! time in seconds
enddo
!$omp end parallel
The following was used to time the MPI tests.
integer,parameter :: ntest=100
real*8 :: time(ntest), local_time(ntest)
. . .
do k = 1, ntest
   flush = flush + .0000123 ! flush the cache
   call mpi_barrier(mpi_comm_world,ierror)
   t = timef()
   ... MPI code to be timed ...
   local_time(k) = timef() - t
   call mpi_barrier(mpi_comm_world,ierror)
   A(k) = A(k) + flush(k) ! prevent loop splitting
enddo
call mpi_reduce(local_time,time,ntest,mpi_real8,mpi_max,0,mpi_comm_world,ierror)
time = 0.001d0*time ! time in seconds
The first barrier in both of the above code segments ensures that all processors reach this point before starting the timing by calling the wall-clock timer. The second barrier makes sure all processors have completed the "code to be timed". To prevent the compiler from splitting out the cache flushing, the statement "A(k) = A(k) + flush(k)" was added, where A is an array involved in the communication being timed. To prevent the compiler from considering part or all of the "code to be timed" as dead code and eliminating it, the values of A(1) and flush(1) were written later in the program. Each processor/thread records its time in local_time. The maximum over all processors/threads is then stored in the time array.
Figure 1 shows times in seconds for 100 trials for the dot product test with n = 10240 and p = 128, where p is the number of processors used. Notice that there are several "spikes" in the data, with the first spike being the first timing. This kind of behavior is typical of the data found for all tests. The first timing usually was significantly longer than most of the other timings (likely due to the additional setup time required for the first call to subroutines and functions), so the time for the first trial was always removed. The other spikes are probably due to the operating system interrupting the execution of the program. The average of the 99 trials (the first trial is removed) is 14.3 milliseconds, which is much longer than most of the individual trials. The authors decided to measure times for each operation by first filtering out the spikes as follows. Compute the median value after the first time trial is removed. All times that are greater than 1.8 times this median value are then removed. The authors consider it inappropriate to remove more than 10% of the data. If more than 10% of the data would be removed by the above procedure, then only the largest 10% of the spikes are removed. Using this procedure, the filtered value for the time for figure 1 is 5.4 milliseconds instead of 14.3 milliseconds.
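The filtering procedure just described can be sketched as a small standalone program. This is only an illustration under stated assumptions: the timing data are synthetic, the names t, nspike, nremove and nkeep are hypothetical, "10% of the data" is taken as integer truncation of 10% of the 99 remaining trials, and the filtered value is taken to be the mean of the trials that survive the filter.

```fortran
program filter_timings
  implicit none
  integer, parameter :: ntest = 100
  real*8 :: time(ntest), t(ntest-1), median, tmp
  integer :: m, i, j, nspike, nremove, nkeep

  ! Synthetic data (an assumption for illustration): a flat 5 ms signal
  ! with a slow first trial and three later "spikes".
  time = 5.0d0
  time(1)  = 40.0d0
  time(20) = 30.0d0
  time(57) = 25.0d0
  time(83) = 90.0d0

  ! Step 1: always discard the first trial.
  m = ntest - 1
  t(1:m) = time(2:ntest)

  ! Step 2: sort ascending (insertion sort) so the median and the
  ! largest values are easy to locate.
  do i = 2, m
     tmp = t(i)
     j = i - 1
     do while (j >= 1)
        if (t(j) <= tmp) exit
        t(j+1) = t(j)
        j = j - 1
     end do
     t(j+1) = tmp
  end do

  ! Step 3: median of the remaining trials.
  if (mod(m,2) == 1) then
     median = t((m+1)/2)
  else
     median = 0.5d0*(t(m/2) + t(m/2+1))
  end if

  ! Step 4: count spikes above 1.8*median, but never drop more than 10%.
  nspike  = count(t(1:m) > 1.8d0*median)
  nremove = min(nspike, m/10)
  nkeep   = m - nremove

  ! Step 5: report the mean of the retained (smallest nkeep) trials.
  print *, 'spikes removed:', nremove
  print *, 'filtered mean (ms):', sum(t(1:nkeep))/nkeep
end program filter_timings
```

With the synthetic data above, three spikes are removed and the filtered mean is 5 ms, whereas the unfiltered mean of the 99 trials would be noticeably larger.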

4 Data Distribution on the Origin 2000
On the SGI Origin 2000, data for MPI programs is local to the executing processor unless there is not enough local memory. When there is not enough local memory, memory on the other nodes is used. There was sufficient local memory for all the tests in this paper. OpenMP is designed for shared memory machines and hence does not contain any data distribution directives. Since the memory on the SGI Origin is not physically shared, SGI provides data distribution directives to allow users to specify how data is placed on processors. If no data distribution directives are used, then data is automatically distributed via the "first touch" mechanism [3] which places the data on the processor where it is first used. Since placement of data can significantly affect performance, data distribution directives were used for each test. There are two sets of data distribution directives available on the Origin: the !$sgi distribute directives, and the !$sgi distribute_reshape directives, see [3]. All tests use the !$sgi distribute_reshape in order to ensure that the data is distributed as specified. To illustrate the difference in performance of OpenMP with and without data distribution directives, test 3 with n = 4*1024*1024/8 was run without data distribution directives. The resulting time was about four times longer without data distribution directives than with data distribution directives.
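To illustrate the "first touch" mechanism mentioned above, the following sketch initializes an array in parallel with the same static schedule that the later compute loop uses, so that each thread first touches the elements it will later read. This is not the approach used in this paper, which relies on the !$sgi distribute_reshape directive. Whether first touch yields an equivalent placement is an assumption that depends on the operating system's page placement policy and page size. The program name and the variable total are hypothetical.

```fortran
program first_touch_sketch
  implicit none
  integer, parameter :: n = 1000000
  real*8, allocatable :: A(:)
  real*8 :: total
  integer :: j

  allocate(A(n))

  ! First touch: each thread initializes the block it will use later.
!$omp parallel do schedule(static) shared(A)
  do j = 1, n
     A(j) = 1.0d0
  end do
!$omp end parallel do

  ! Compute loop with the same static schedule, so each thread reads
  ! the elements it initialized above.
  total = 0.0d0
!$omp parallel do schedule(static) shared(A) reduction(+:total)
  do j = 1, n
     total = total + A(j)
  end do
!$omp end parallel do

  print *, 'sum =', total
  deallocate(A)
end program first_touch_sketch
```

The program also compiles and runs serially when OpenMP is disabled, since the directives are then treated as comments.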
5 Communication Tests and Performance Results
For all tests p denotes the number of processes/threads used for the test. All floating-point variables are declared as real*8. Unless specified differently, arrays x and y are declared as real*8. The integer variables rank and thread indicate the rank/thread of the executing processor. Except for the ping-pong test, all tests were run with p = 8, 16, 32, 64, and 128. All !$omp do directives are scheduled with the static option to ensure that arrays are distributed as desired. There are three possible scheduling options for !$omp do directives: static, dynamic, and guided. Testing showed that the static option always provided the best performance for the machine used.
Test 1: Sending Messages between "Close" and "Distant" Processors.
The purpose of this test is to compare the performance of OpenMP and MPI implementations of the time for sending messages between "close" and "distant" processors. This was done by measuring the time to ‘ping pong’ array A from processor 0 to processor 1, from processor 0 to processor 2, …, from processor 0 to processor p-1. Each ‘ping pong’ is timed separately, as indicated in section 3. Arrays A and B are block distributed across p processors for the OpenMP implementation. Notice that for both the MPI and OpenMP versions of this test, both processors send A and use B instead of A to receive the message. If A were also used to receive the message, this would place A in the cache of this processor so the subsequent sending of A would likely be faster than the first send.
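The OpenMP code calls the barrier twice to mirror the blocking behavior of mpi_recv: the first barrier ensures that the copy to thread j has completed before the copy back to thread 0 starts, and the second ensures that the copy back has completed before the timer is stopped. The OpenMP code used is: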
real*8 :: A(n*p), B(n*p)
!$sgi distribute_reshape A(block), B(block)
. . .
!$omp parallel shared(A,B) private(thread,i,j,k)
thread = omp_get_thread_num()
do j = 1, p-1
   do k = 1, ntest
      . . .
      t = timef() !! Start Timer
      if (thread == j) B(j*n + 1:j*n+n) = A(1:n)
!$omp barrier
      if (thread == 0) B(1:n) = A(j*n + 1:j*n + n)
!$omp barrier
      local_time(omp_get_thread_num(),k) = timef()-t !! End of Timer
      . . .
   enddo
enddo
!$omp end parallel
Notice that since A is block distributed among p threads and has n*p elements, A(1:n) is on thread 0 and A(j*n + 1:j*n + n) is on thread j.
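The index arithmetic behind this statement can be checked with a short sketch. It assumes that a block distribution of n*p elements over p threads gives each thread one contiguous block of exactly n elements, so that the 1-based element i belongs to thread (i-1)/n. The sizes and the names lo and hi are illustrative only.

```fortran
program block_owner
  implicit none
  integer, parameter :: n = 4, p = 8   ! illustrative sizes, not those of the paper
  integer :: j, lo, hi

  do j = 0, p-1
     lo = j*n + 1        ! first element of the block of thread j
     hi = j*n + n        ! last element of the block of thread j
     ! Under the stated assumption, element i is owned by thread (i-1)/n,
     ! so both ends of the block should map back to j.
     print *, 'thread', j, ' holds A(', lo, ':', hi, ')  owners:', &
              (lo-1)/n, (hi-1)/n
  end do
end program block_owner
```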
The MPI code used is:
real*8 :: A(n), B(n)
. . .
do j = 1, p-1
   do k = 1, ntest
      . . .
      t = timef() !! Start Timer
      if (rank == 0) then
         call mpi_send(A,n,mpi_real8,j,1,mpi_comm_world,ierr) ! "ping", tag 1
         call mpi_recv(B,n,mpi_real8,j,2,mpi_comm_world,status,ierr) ! "pong", tag 2
      endif
      if (rank == j) then
         call mpi_recv(B,n,mpi_real8,0,1,mpi_comm_world,status,ierr) ! matches the tag 1 send from rank 0
         call mpi_send(A,n,mpi_real8,0,2,mpi_comm_world,ierr) ! matches the tag 2 receive on rank 0
      endif
      local_time(k) = timef() - t !! End of Timer
      . . .
   enddo
enddo
Figures 2 and 3 contain the results in milliseconds for sending 4 Mbyte and 8 Byte messages, respectively. For a "perfectly" scalable computer, these graphs would be horizontal lines. This is not the case in figures 2 and 3, but the OpenMP version does give graphs that are closer to a horizontal line than the MPI version. Notice that for 4 Mbyte messages the OpenMP version is about 2.5 times slower than the MPI version. For the 8 Byte message, the OpenMP version is a little faster than the MPI version. The authors do not know why the OpenMP version is so slow for large messages.


Test 2: The Right Shift Test
This test compares the performance of the right shift operation implemented using OpenMP and using MPI. The following is the OpenMP code for this test.
real*8 :: A(n*p), B(n*p)
!$sgi distribute_reshape A(block), B(block)
. . .
!$omp parallel private(i, j, k) shared(A, B)
. . .
t = timef() !! Start Timer
j = omp_get_thread_num()
i = modulo(j+1,p)
B(i*n+1:i*n+n) = A(j*n+1:j*n+n)
local_time(omp_get_thread_num(),k) = timef() - t !! End of Timer
. . .
!$omp end parallel
For the MPI implementation, the processor of rank k sends array A(1:n) to processor modulo(k+1, p) and receives B(1:n) from processor modulo(k-1, p), for k = 0, ..., p-1.
real*8 :: A(n), B(n)
. . .
t = timef() !! Start Timer
i = modulo(rank+1, p)
j = modulo(rank-1, p)
call mpi_sendrecv(A,n,mpi_real8,i,1,B,n,mpi_real8,j,1,mpi_comm_world,status,ierr)
local_time(k) = timef() - t !! End of Timer
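The neighbor computation relies on the Fortran intrinsic modulo rather than mod. The sketch below, with an illustrative p = 8 and a hypothetical program name, prints the right and left neighbors of each rank and shows why the distinction matters for rank 0: modulo(-1, p) wraps around to p-1, whereas mod(-1, p) returns -1, which is not a valid rank.

```fortran
program ring_neighbours
  implicit none
  integer, parameter :: p = 8   ! illustrative number of processes
  integer :: rank

  print *, 'rank  right  left(modulo)  left(mod)'
  do rank = 0, p-1
     ! modulo takes the sign of p, so the left neighbour of rank 0 is p-1;
     ! mod takes the sign of its first argument and gives -1 instead.
     print *, rank, modulo(rank+1,p), modulo(rank-1,p), mod(rank-1,p)
  end do
end program ring_neighbours
```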
Figures 4 and 5 contain the performance results for the right shift test for 4 Mbyte and 8 Byte messages, respectively. In a "perfectly" scalable computer, all right shifts would be executed concurrently and hence the time would be independent of the number of processors and the graphs would be horizontal lines. The graphs in figure 4 for the 4 Mbyte message are roughly horizontal lines for 32 to 128 processors. However, the MPI implementation for the 8 Byte message is far from a horizontal line and performs significantly worse than the OpenMP implementation. For the 4 Mbyte messages, the MPI implementation is roughly 1.3 times faster than the OpenMP implementation. There appears to be a significant performance problem with SGI’s implementation of mpi_sendrecv for small messages.


Figure 5. Test 2 results in milliseconds for 8 Byte messages
Test 3: Summing Elements of an Array
This test measures the time to sum the components of a one-dimensional array A whose elements are block distributed across p processors in the OpenMP implementation. The following is the OpenMP code.
real*8 :: A(n*p)
!$sgi distribute_reshape A(block)
sum = 0.0
!$omp parallel shared(A,sum)
. . .
t = timef() !! Start Timer
!$omp do schedule(static),reduction(+:sum)
do j = 1, n*p
   sum = sum + A(j)
enddo
!$omp enddo
local_time(omp_get_thread_num(), k) = timef() - t !! End of Timer
. . .
!$omp end parallel
For the MPI implementation, s stores the value of the local sum and sum holds the value of the sum of all components of the array. The MPI code used for this test is:
real*8 :: A(n)
. . .
t = timef() !! Start Timer
s = 0.0
do j = 1, n
   s = s + A(j)
enddo
call mpi_reduce(s,sum,1,mpi_real8,mpi_sum,0,mpi_comm_world,ierr)
local_time(k) = timef() - t !! End of Timer
Figures 6 and 7 contain the performance results for this test for n = 4*1024*1024/8 and n = 1, respectively. Since the "reduction(+:sum)" option was used in the OpenMP implementation, one would expect the OpenMP implementation to perform well compared to the MPI implementation. However, the OpenMP implementation does not scale nearly as well as the MPI implementation for n = 1. They do perform about the same for n = 4*1024*1024/8. The OpenMP code could have also been written by first adding local sums and then updating the global sum variable in a critical region. However, the performance of this modified code is a little worse than when the above code is executed.
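The alternative mentioned above, in which each thread first accumulates a local sum and then updates the global sum inside a critical region, might look like the following sketch. It is a reconstruction for illustration, not the authors' modified code: the data distribution directive and the timing calls are omitted, and the names s and total are hypothetical (total is used instead of sum to avoid shadowing the Fortran intrinsic).

```fortran
program sum_with_critical
  implicit none
  integer, parameter :: n = 1000   ! illustrative array length
  real*8 :: A(n), total, s
  integer :: j

  do j = 1, n
     A(j) = dble(j)
  end do
  total = 0.0d0

!$omp parallel shared(A,total) private(s,j)
  s = 0.0d0                      ! private partial sum of this thread
!$omp do schedule(static)
  do j = 1, n
     s = s + A(j)
  end do
!$omp end do nowait
  ! Threads add their partial sums one at a time.
!$omp critical
  total = total + s
!$omp end critical
!$omp end parallel

  print *, 'sum =', total        ! 1 + 2 + ... + 1000 = 500500
end program sum_with_critical
```

Because the critical region admits only one thread at a time, the final accumulation takes p serial steps, which is consistent with this variant being no faster than the reduction clause.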
Figure 6. Test 3 results in milliseconds for n = 4*1024*1024/8.

Test 4: Dot Product
This test compares the performance of calculating the dot product of two arrays x and y using OpenMP and MPI. Elements of x and y are block distributed across p processors for the OpenMP implementation. The OpenMP code used for this test is:
real*8 :: x(n*p), y(n*p)
!$sgi distribute_reshape x(block), y(block)
. . .
t = timef() !! Start Timer
sum = 0.0
!$omp parallel shared(sum, x, y)
!$omp do schedule(static),reduction(+:sum)
do i = 1, n*p
   sum = sum + x(i)*y(i)
enddo
!$omp enddo
local_time(omp_get_thread_num(), k) = timef() - t !! End of Timer
. . .
!$omp end parallel
For the MPI implementation, s stores the value of the local sum and sum holds the value of the dot product of x and y. The MPI code used for this test is:
real*8 :: x(n), y(n)
. . .
t = timef() !! Start Timer
s = 0.0
do i = 1, n
   s = s + x(i)*y(i)
enddo
call mpi_reduce(s,sum,1,mpi_real8,mpi_sum,0,mpi_comm_world,ierror)
local_time(k) = timef() - t !! End of Timer
Figures 8 and 9 contain the performance results for this test for n = 80*1024/8 and n = 1, respectively. Notice that the times for n = 80*1024/8 in figure 8 are nearly the same as those for n = 1 in figure 9. This is because the time required for the extra additions is small compared with the rest of the time. In both cases, the MPI implementation performs and scales better than the OpenMP implementation. There appears to be a significant performance problem with OpenMP for this test even though the "reduction(+:sum)" option was used.


Figure 9. Test 4 results in milliseconds for n = 1.
Test 5: Matrix Times Vector
This test compares the performance of OpenMP and MPI implementations of the matrix times vector operation, y = y + Ax, where for the OpenMP implementation A is column block distributed across p processors, x is block distributed, and y is replicated on all processors. The OpenMP version of this test also used the temporary array psum and the code used is:
real*8 :: y(n), A(n,n*p), x(n*p), psum(n)
!$sgi distribute_reshape A(*,block), x(block)
. . .
!$omp parallel shared(A,x,y) private(psum)
t = timef() !! Start Timer
psum = 0.0
!$omp do schedule(static)
do j = 1, n*p
   psum(:) = psum(:) + A(:,j)*x(j)
enddo
!$omp enddo
!$omp critical
y(:) = y(:) + psum(:)
!$omp end critical
local_time(omp_get_thread_num(), k) = timef() - t !! End of Timer
. . .
!$omp end parallel
The MPI version requires the use of the temporary arrays psum and sum. The following is the MPI version for this test:
real*8 :: y(n), A(n,n), x(n), psum(n), sum(n)
. . .
t = timef() !! Start Timer
psum = 0.0
do j = 1, n
   psum(:) = psum(:) + A(:,j)*x(j)
enddo
call mpi_reduce(psum,sum,n,mpi_real8,mpi_sum,0,mpi_comm_world,ierror)
y(:) = y(:) + sum(:)
local_time(k) = timef() - t !! End of Timer
Figures 10 and 11 contain the performance results for this test for n = 128 and n = 1, respectively. In both cases, the MPI implementation performs and scales significantly better than the OpenMP implementation. This is likely because the critical region in the OpenMP program is executed serially, whereas mpi_reduce uses parallelism (probably a binary tree algorithm) for updating y. If OpenMP allowed the "reduction(+:sum)" option to be used for arrays, then vendors would have the opportunity to generate efficient parallel code for the above operation.
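For illustration, the sketch below shows what the matrix times vector update could look like if an array were accepted in the reduction clause. It assumes a compiler implementing version 2.0 or later of the OpenMP Fortran specification, which permits array reduction variables. It is not the code used in this study, the sizes are illustrative, and m is a hypothetical name that plays the role of n*p.

```fortran
program matvec_array_reduction
  implicit none
  integer, parameter :: n = 4, m = 8   ! m plays the role of n*p
  real*8 :: A(n,m), x(m), y(n)
  integer :: i, j

  ! Simple test data: A(i,j) = i and x(j) = 1, so (A x)(i) = m*i.
  do j = 1, m
     x(j) = 1.0d0
     do i = 1, n
        A(i,j) = dble(i)
     end do
  end do
  y = 0.0d0

  ! Each thread accumulates into its own private copy of y; the copies
  ! are combined into the shared y by the OpenMP runtime at the end.
!$omp parallel do schedule(static) shared(A,x) reduction(+:y)
  do j = 1, m
     y(:) = y(:) + A(:,j)*x(j)
  end do
!$omp end parallel do

  print *, 'y =', y               ! expected values: 8, 16, 24, 32
end program matvec_array_reduction
```

Compared with the listing for this test, the private array psum and the critical region disappear, and how the partial results are combined is left to the implementation.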


Test 6: Matrix Addition
This test compares the performance of OpenMP and MPI implementations of the sum of two matrices, C = A + B, where for the OpenMP implementation these arrays are column block distributed across p processors. Notice that no communication is required for this test. The OpenMP version of this test is:
real*8,dimension(n,n*p) :: A, B, C
!$sgi distribute_reshape A(*,block), B(*,block), C(*,block)
. . .
!$omp parallel shared(A,B,C) private(j)
. . .
t = timef() !! Start Timer
!$omp do schedule(static)
do j = 1, n*p
   C(:,j) = A(:,j) + B(:,j)
enddo
!$omp enddo
local_time(omp_get_thread_num(), k) = timef() - t !! End of Timer
. . .
!$omp end parallel
The MPI version of this test is:
real*8,dimension(n,n) :: A, B, C
. . .
t = timef() !! Start Timer
C = A + B
local_time(k) = timef() - t !! End of Timer
Figures 12 and 13 contain the performance results for this test for n = 512 and n = 1, respectively. For this very simple test, one would expect OpenMP and MPI to perform about the same. However, for n = 1, the MPI implementation performs and scales significantly better than the OpenMP implementation.


Test 7: Matrix-Matrix Multiplication
This test compares the performance of OpenMP and MPI implementations of the product of two matrices, C = C + AB. For the OpenMP implementation, A is an n × (n*p) array column distributed across p processors, B is an (n*p) × n array row distributed across p processors, and C is an n × n array on processor 0. The OpenMP version of this test is:
real*8 A(n,n*p),B(n*p,n)
!$sgi distribute_reshape A(*,block), B(block,*)
real*8,dimension(n,n) :: local_C, C
. . .
!$omp parallel shared(A,B,C) private(local_C,i,j,k,l)
. . .
t = timef() !! Start Timer
local_C = 0.0d0
!$omp do schedule(static)
do l = 1, n*p ! l is the summation index; k remains the timing-trial index
   do i = 1, n
      do j = 1, n
         local_C(i,j) = local_C(i,j) + A(i,l) * B(l,j)
      enddo
   enddo
enddo
!$omp enddo
!$omp critical
C = C + local_C
!$omp end critical
local_time(omp_get_thread_num(), k) = timef() - t !! End of Timer
. . .
!$omp end parallel
The MPI version of this test is:
real*8,dimension(n,n):: A, B, local_C, C
. . .
t = timef() !! Start Timer
do l = 1, n ! l is the summation index; k remains the timing-trial index
   do i = 1, n
      do j = 1, n
         local_C(i,j) = local_C(i,j) + A(i,l) * B(l,j)
      enddo
   enddo
enddo
! one reduction of the complete local result, after the loops have finished
call mpi_reduce(local_C,C,n*n,mpi_real8,mpi_sum,0,mpi_comm_world,ierror)
local_time(k) = timef() - t !! End of Timer
Figures 14 and 15 contain the performance results for this test for n = 512 and n = 1, respectively. Notice that in both cases, the MPI implementation scales well and the OpenMP implementation scales very poorly. As in test 5, this poor performance of OpenMP is likely due to the fact that the critical region must be executed serially (as required by the OpenMP specification) whereas mpi_reduce is executed in parallel.
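The difference between a serial critical region and a tree-based reduction can be made concrete with a small serial simulation. This is only a sketch: the paper states that mpi_reduce probably uses a binary tree, so the pairwise scheme below is an assumption about the implementation, and the names part, stride and rounds are hypothetical. It counts how many rounds a pairwise tree needs to combine p partial results, compared with the p one-at-a-time updates forced by a critical region.

```fortran
program tree_vs_serial
  implicit none
  integer, parameter :: p = 128        ! number of partial results
  real*8 :: part(0:p-1)
  integer :: r, stride, rounds

  do r = 0, p-1
     part(r) = dble(r+1)               ! partial result held by rank r
  end do

  rounds = 0
  stride = 1
  do while (stride < p)
     ! All additions within one round touch disjoint pairs, so on a
     ! parallel machine they could be carried out at the same time.
     do r = 0, p-1, 2*stride
        if (r + stride < p) part(r) = part(r) + part(r+stride)
     end do
     stride = 2*stride
     rounds = rounds + 1
  end do

  print *, 'tree rounds    =', rounds   ! 7 for p = 128
  print *, 'serial updates =', p        ! 128 with a critical region
  print *, 'total          =', part(0)  ! 1 + 2 + ... + 128 = 8256
end program tree_vs_serial
```

For p = 128 the tree needs 7 rounds while the critical region needs 128 serialized updates, which is in line with the widening gap between the two implementations as p grows.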


6 Conclusions
The purpose of this paper is to investigate the scalability and performance of seven simple OpenMP test programs and to compare their performance with equivalent MPI programs on an SGI Origin 2000. Since the memory in the SGI Origin 2000 is physically distributed among its nodes, special data distribution directives were used to make sure that the OpenMP implementation had the same data distribution as the MPI implementation. Without specifying the data distribution, performance of the OpenMP implementation would be poorer.
Each test was executed for a large message and a small message, making a total of 14 experiments. For 9 of the 14 experiments, the MPI implementations performed and scaled significantly better than the corresponding OpenMP implementations. For 4 of the experiments, the MPI and OpenMP implementations performed roughly the same. For the small message size for the right shift test, the OpenMP implementation performed and scaled significantly better than the corresponding MPI implementation. Therefore, even with carefully specified data distributions for the OpenMP implementations, 9 of the MPI implementations performed and scaled significantly better than the corresponding OpenMP implementations for these simple tests. Reduction operations are commonly used in scientific calculations. OpenMP allows vendors to optimize reductions for scalars, as illustrated in test 4 (although this optimization could be improved on the Origin). However, OpenMP does not allow this optimization for arrays. Without this capability it is difficult to write efficient OpenMP implementations of common operations such as matrix-times-vector (test 5) and matrix-times-matrix (test 7). Thus, the authors recommend that the OpenMP standard be enhanced to allow the reduction option to be used for arrays.
7 References
8 Acknowledgements
We would like to thank Silicon Graphics, Incorporated for giving us access to their Origin 2000 for this study.
Appendix A: Test Results
| Proc | MPI | OpenMP | Proc | MPI | OpenMP | Proc | MPI | OpenMP |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.17405 | 0.15008 | 44 | 0.21591 | 0.12906 | 87 | 0.22096 | 0.12961 |
| 2 | 0.12182 | 0.14522 | 45 | 0.17916 | 0.13894 | 88 | 0.19076 | 0.14820 |
| 3 | 0.16571 | 0.14277 | 46 | 0.23042 | 0.13239 | 89 | 0.20166 | 0.12636 |
| 4 | 0.19085 | 0.14719 | 47 | 0.28015 | 0.14007 | 90 | 0.14439 | 0.13140 |
| 5 | 0.16627 | 0.13491 | 48 | 0.17647 | 0.14547 | 91 | 0.20940 | 0.13128 |
| 6 | 0.12343 | 0.13164 | 49 | 0.19756 | 0.13676 | 92 | 0.19943 | 0.14033 |
| 7 | 0.16327 | 0.13114 | 50 | 0.18801 | 0.13456 | 93 | 0.22540 | 0.13284 |
| 8 | 0.14519 | 0.13105 | 51 | 0.18802 | 0.13660 | 94 | 0.18121 | 0.13263 |
| 9 | 0.19190 | 0.13610 | 52 | 0.14985 | 0.13545 | 95 | 0.20179 | 0.13870 |
| 10 | 0.20274 | 0.13469 | 53 | 0.22870 | 0.13108 | 96 | 0.14388 | 0.13264 |
| 11 | 0.13138 | 0.13488 | 54 | 0.13996 | 0.13140 | 97 | 0.18474 | 0.13171 |
| 12 | 0.12486 | 0.13612 | 55 | 0.18234 | 0.13284 | 98 | 0.18857 | 0.13354 |
| 13 | 0.19060 | 0.14004 | 56 | 0.23602 | 0.13748 | 99 | 0.14492 | 0.13087 |
| 14 | 0.12313 | 0.13479 | 57 | 0.17676 | 0.13010 | 100 | 0.16613 | 0.13204 |
| 15 | 0.17426 | 0.12552 | 58 | 0.19133 | 0.13132 | 101 | 0.21266 | 0.13484 |
| 16 | 0.23257 | 0.13684 | 59 | 0.23990 | 0.15771 | 102 | 0.13638 | 0.13348 |
| 17 | 0.16562 | 0.13487 | 60 | 0.14532 | 0.12908 | 103 | 0.18436 | 0.13333 |
| 18 | 0.14639 | 0.13191 | 61 | 0.22833 | 0.13895 | 104 | 0.22527 | 0.14045 |
| 19 | 0.25602 | 0.13908 | 62 | 0.22987 | 0.13153 | 105 | 0.16358 | 0.13188 |
| 20 | 0.17378 | 0.14093 | 63 | 0.18280 | 0.14304 | 106 | 0.17440 | 0.13409 |
| 21 | 0.21135 | 0.13332 | 64 | 0.19241 | 0.13155 | 107 | 0.19220 | 0.13380 |
| 22 | 0.21415 | 0.13554 | 65 | 0.23377 | 0.13976 | 108 | 0.13345 | 0.13060 |
| 23 | 0.16044 | 0.14030 | 66 | 0.13980 | 0.13347 | 109 | 0.21971 | 0.13500 |
| 24 | 0.12146 | 0.13375 | 67 | 0.21764 | 0.14062 | 110 | 0.22867 | 0.12665 |
| 25 | 0.22458 | 0.12798 | 68 | 0.19259 | 0.13076 | 111 | 0.16924 | 0.13223 |
| 26 | 0.18167 | 0.12873 | 69 | 0.14971 | 0.13566 | 112 | 0.15843 | 0.14749 |
| 27 | 0.14861 | 0.13321 | 70 | 0.19627 | 0.12851 | 113 | 0.23977 | 0.14305 |
| 28 | 0.22148 | 0.13373 | 71 | 0.23699 | 0.13109 | 114 | 0.13678 | 0.13592 |
| 29 | 0.16978 | 0.12744 | 72 | 0.15842 | 0.13482 | 115 | 0.16650 | 0.13735 |
| 30 | 0.13542 | 0.13251 | 73 | 0.23731 | 0.13963 | 116 | 0.21497 | 0.14444 |
| 31 | 0.23334 | 0.13285 | 74 | 0.22426 | 0.13178 | 117 | 0.13168 | 0.14608 |
| 32 | 0.21790 | 0.13736 | 75 | 0.17767 | 0.13138 | 118 | 0.14887 | 0.12680 |
| 33 | 0.13955 | 0.13340 | 76 | 0.17346 | 0.12828 | 119 | 0.21047 | 0.13369 |
| 34 | 0.22368 | 0.13450 | 77 | 0.21418 | 0.12524 | 120 | 0.14707 | 0.12980 |
| 35 | 0.27179 | 0.13617 | 78 | 0.13332 | 0.12954 | 121 | 0.15392 | 0.13511 |
| 36 | 0.17159 | 0.13688 | 79 | 0.18440 | 0.12953 | 122 | 0.22933 | 0.13740 |
| 37 | 0.23084 | 0.13441 | 80 | 0.19136 | 0.12657 | 123 | 0.14722 | 0.15063 |
| 38 | 0.22310 | 0.12913 | 81 | 0.17811 | 0.13692 | 124 | 0.15193 | 0.12760 |
| 39 | 0.13992 | 0.12844 | 82 | 0.17975 | 0.13463 | 125 | 0.23120 | 0.13372 |
| 40 | 0.19126 | 0.13077 | 83 | 0.27187 | 0.13758 | 126 | 0.15502 | 0.13288 |
| 41 | 0.25451 | 0.12957 | 84 | 0.16608 | 0.13891 | 127 | 0.15160 | 0.13032 |
| 42 | 0.14966 | 0.13816 | 85 | 0.20117 | 0.13920 | | | |
| 43 | 0.22806 | 0.13455 | 86 | 0.25354 | 0.12833 | | | |
Table 1: Test 1 results in milliseconds for an 8 Byte message.
| Proc Number | MPI | OpenMP | Proc Number | MPI | OpenMP | Proc Number | MPI | OpenMP |
|---|---|---|---|---|---|---|---|---|
| 1 | 89.72394 | 251.36440 | 44 | 99.10971 | 266.29721 | 87 | 109.49805 | 255.87039 |
| 2 | 83.21764 | 262.22881 | 45 | 99.81645 | 254.37840 | 88 | 113.08969 | 250.00200 |
| 3 | 84.59500 | 256.03840 | 46 | 100.30452 | 252.15640 | 89 | 112.94395 | 260.95801 |
| 4 | 90.21360 | 255.79720 | 47 | 99.63263 | 258.75440 | 90 | 112.78191 | 256.03840 |
| 5 | 89.34736 | 316.21880 | 48 | 102.01353 | 249.94360 | 91 | 112.29680 | 255.02239 |
| 6 | 90.01320 | 256.03280 | 49 | 102.68031 | 252.69240 | 92 | 116.68670 | 256.83761 |
| 7 | 90.31889 | 256.69440 | 50 | 101.72310 | 257.12080 | 93 | 117.71979 | 252.59840 |
| 8 | 90.94697 | 303.74240 | 51 | 102.52377 | 265.41121 | 94 | 117.18499 | 259.04240 |
| 9 | 88.95202 | 256.63280 | 52 | 116.37313 | 261.15840 | 95 | 118.09192 | 255.48640 |
| 10 | 88.60717 | 247.68600 | 53 | 115.96083 | 261.45081 | 96 | 118.26320 | 247.42280 |
| 11 | 88.26615 | 253.63000 | 54 | 116.17596 | 258.43200 | 97 | 118.44874 | 285.90640 |
| 12 | 91.08663 | 254.79919 | 55 | 117.02779 | 269.68800 | 98 | 118.29102 | 303.74240 |
| 13 | 90.72441 | 258.00400 | 56 | 123.59173 | 260.61400 | 99 | 118.00852 | 256.63280 |
| 14 | 219.94830 | 260.75400 | 57 | 116.26953 | 254.65280 | 100 | 109.63821 | 262.22881 |
| 15 | 221.62021 | 251.68080 | 58 | 115.61564 | 260.73880 | 101 | 108.99699 | 256.03840 |
| 16 | 105.22085 | 253.48840 | 59 | 114.23834 | 253.65160 | 102 | 109.85847 | 255.79720 |
| 17 | 105.16950 | 258.41200 | 60 | 103.58529 | 258.84239 | 103 | 110.08379 | 316.21880 |
| 18 | 105.64853 | 255.87039 | 61 | 103.58559 | 257.18320 | 104 | 108.52510 | 256.03280 |
| 19 | 105.70492 | 250.00200 | 62 | 103.13302 | 249.45760 | 105 | 108.66865 | 256.69440 |
| 20 | 95.35535 | 260.95801 | 63 | 103.79404 | 265.94521 | 106 | 108.99980 | 249.94360 |
| 21 | 95.48839 | 256.03840 | 64 | 120.36163 | 253.65160 | 107 | 108.61672 | 252.69240 |
| 22 | 95.25715 | 255.02239 | 65 | 119.11013 | 256.69440 | 108 | 104.09929 | 257.12080 |
| 23 | 95.01758 | 256.83761 | 66 | 120.53880 | 303.74240 | 109 | 103.87025 | 265.41121 |
| 24 | 94.78233 | 252.59840 | 67 | 121.83490 | 256.63280 | 110 | 103.07906 | 256.63280 |
| 25 | 94.99303 | 259.04240 | 68 | 105.56076 | 262.22881 | 111 | 104.52920 | 247.68600 |
| 26 | 94.06268 | 255.48640 | 69 | 105.27497 | 256.03840 | 112 | 111.04492 | 253.63000 |
| 27 | 94.88901 | 247.42280 | 70 | 105.17023 | 255.79720 | 113 | 111.30107 | 254.79919 |
| 28 | 94.61384 | 285.90640 | 71 | 105.78888 | 316.21880 | 114 | 111.31593 | 258.00400 |
| 29 | 93.97203 | 257.09840 | 72 | 104.22440 | 256.03280 | 115 | 111.65165 | 260.75400 |
| 30 | 94.11628 | 258.54520 | 73 | 105.26745 | 256.69440 | 116 | 103.38948 | 251.68080 |
| 31 | 94.99695 | 257.13680 | 74 | 104.06732 | 249.94360 | 117 | 103.72090 | 253.48840 |
| 32 | 109.56971 | 254.95841 | 75 | 104.87533 | 252.69240 | 118 | 103.80603 | 258.41200 |
| 33 | 109.98735 | 250.16480 | 76 | 134.65164 | 257.12080 | 119 | 103.50697 | 255.87039 |
| 34 | 109.78217 | 250.88640 | 77 | 132.45814 | 265.41121 | 120 | 111.59939 | 250.00200 |
| 35 | 111.58338 | 253.93120 | 78 | 127.91500 | 256.63280 | 121 | 104.72835 | 260.95801 |
| 36 | 112.11820 | 270.78401 | 79 | 128.49435 | 247.68600 | 122 | 102.24204 | 256.03840 |
| 37 | 111.63712 | 265.46280 | 80 | 108.85060 | 253.63000 | 123 | 102.48340 | 255.02239 |
| 38 | 111.72617 | 263.14839 | 81 | 108.81991 | 254.79919 | 124 | 97.60257 | 256.83761 |
| 39 | 112.20314 | 259.74559 | 82 | 109.54080 | 258.00400 | 125 | 97.23977 | 252.59840 |
| 40 | 110.70665 | 254.02280 | 83 | 109.27527 | 260.75400 | 126 | 97.83089 | 259.04240 |
| 41 | 109.72619 | 252.91601 | 84 | 109.13360 | 251.68080 | 127 | 97.93619 | 255.48640 |
| 42 | 110.27547 | 255.92880 | 85 | 109.00524 | 253.48840 | | | |
| 43 | 110.90849 | 251.28201 | 86 | 108.78884 | 258.41200 | | | |
Table 2: Test 1 results in milliseconds for a 4 MByte message.
| Number of Processors | Small Message: MPI | Small Message: OpenMP | Large Message: MPI | Large Message: OpenMP |
|---|---|---|---|---|
| 2 | 0.08836 | 0.01210 | 110.05308 | 123.70044 |
| 4 | 0.20996 | 0.01226 | 102.71691 | 129.79605 |
| 8 | 0.18834 | 0.01516 | 107.30495 | 130.20557 |
| 16 | 0.20204 | 0.02982 | 120.10400 | 169.79771 |
| 32 | 0.31571 | 0.04249 | 122.63455 | 187.59256 |
| 64 | 0.43802 | 0.03771 | 140.20393 | 197.29718 |
| 128 | 0.99056 | 0.04144 | 131.67589 | 181.64581 |
Table 3: Results of Right Shift Test in milliseconds.
| Number of Processors | Small Message: MPI | Small Message: OpenMP | Large Message: MPI | Large Message: OpenMP |
|---|---|---|---|---|
| 2 | 0.17184 | 0.10950 | 17.74027 | 15.14267 |
| 4 | 0.26932 | 0.07985 | 18.10493 | 15.71384 |
| 8 | 0.30780 | 0.09961 | 19.77006 | 21.56141 |
| 16 | 0.25433 | 0.12885 | 19.48238 | 19.16049 |
| 32 | 0.32323 | 0.43332 | 16.21149 | 18.01448 |
| 64 | 0.48466 | 1.04917 | 24.48468 | 25.25756 |
| 128 | 1.27333 | 5.01547 | 22.84760 | 26.76295 |
Table 4: Results of Summing Elements of an Array Test in milliseconds.
| Number of Processors | Small Message: MPI | Small Message: OpenMP | Large Message: MPI | Large Message: OpenMP |
|---|---|---|---|---|
| 2 | 0.35820 | 0.47300 | 0.25754 | 0.48000 |
| 4 | 0.30335 | 0.81600 | 0.38257 | 0.74700 |
| 8 | 0.38689 | 0.82500 | 0.31821 | 0.82700 |
| 16 | 0.47501 | 0.85900 | 0.40480 | 0.85000 |
| 32 | 0.51352 | 1.17900 | 0.49341 | 1.21400 |
| 64 | 0.71350 | 1.44600 | 0.67548 | 1.43500 |
| 128 | 0.75809 | 5.40200 | 0.83057 | 5.36700 |
Table 5: Results of Dot Product Test in milliseconds.
| Number of Processors | Small Message: MPI | Small Message: OpenMP | Large Message: MPI | Large Message: OpenMP |
|---|---|---|---|---|
| 2 | 0.40409 | 0.34849 | 0.41317 | 0.32861 |
| 4 | 0.43980 | 0.73026 | 0.46872 | 0.74449 |
| 8 | 0.58427 | 0.89533 | 0.59253 | 0.88674 |
| 16 | 0.77829 | 1.19132 | 0.78144 | 1.25176 |
| 32 | 0.79476 | 3.45444 | 0.81653 | 3.56695 |
| 64 | 0.96169 | 8.08260 | 1.05923 | 8.35535 |
| 128 | 1.16052 | 16.32955 | 1.38008 | 85.11001 |
Table 6: Results of Matrix Times Vector Test in milliseconds.
| Number of Processors | Small Message: MPI | Small Message: OpenMP | Large Message: MPI | Large Message: OpenMP |
|---|---|---|---|---|
| 2 | 0.00074 | 0.05029 | 24.943927 | 30.652509 |
| 4 | 0.00099 | 0.04630 | 24.979418 | 32.587022 |
| 8 | 0.00154 | 0.07439 | 24.995871 | 32.759396 |
| 16 | 0.00144 | 0.08476 | 18.718656 | 32.559974 |
| 32 | 0.00227 | 0.15187 | 19.566440 | 43.461355 |
| 64 | 0.00138 | 0.22214 | 30.855798 | 45.372073 |
| 128 | 0.00162 | 0.13986 | 25.493576 | 49.484702 |
Table 7: Results of Matrix Addition Test in milliseconds.
| Number of Processors | Small Message: MPI | Small Message: OpenMP | Large Message: MPI | Large Message: OpenMP |
|---|---|---|---|---|
| 2 | 0.17111 | 0.05103 | 681.48316 | 2030.20619 |
| 4 | 0.30371 | 0.08360 | 718.58947 | 2028.06075 |
| 8 | 0.25569 | 0.15246 | 750.87955 | 2054.26156 |
| 16 | 0.39853 | 0.40401 | 648.17432 | 2346.63203 |
| 32 | 0.61716 | 1.57737 | 839.89392 | 3371.24213 |
| 64 | 0.93830 | 4.79924 | 968.12697 | 6446.41928 |
| 128 | 1.81589 | 18.40094 | 905.37065 | 65270.26787 |
Table 8: Results of Matrix-Matrix Multiplication Test in milliseconds.