
The Twisted Decomposition

From RS/6000 SP: Practical MPI Programming (pages 99-103)

Case 2. Synchronizing data; send and receive buffers overlap


         ENDDO
         CALL MPI_ISEND(x(ii,jend), iblklen, MPI_REAL,
     &                  inext, 1, MPI_COMM_WORLD, ireqs, ierr)
         CALL MPI_WAIT(ireqs, istatus, ierr)
      ENDDO
      ...

The above program is best understood by putting yourself in the place of one process and thinking about what you have to do, rather than by playing the role of the conductor who orchestrates all the processes. One remark about the program: the calls to MPI_WAIT immediately follow MPI_IRECV and MPI_ISEND, which is logically equivalent to using blocking subroutines. Is there any danger of deadlock? (See “Case 2. Receive first and then send” on page 27.) The answer is no, because there is always one process (process 0) that does not receive but only sends.

Figure 77. Data Flow in the Pipeline Method

In Figure 77, data transmissions are completed from left to right. In an open string topology like this, deadlock cannot occur as long as the communication is one-directional. On the other hand, in a closed string topology, deadlock may occur even with one-directional communication. (See 3.5.7, “The Twisted Decomposition” below for an example of a closed topology.)

3.5.7 The Twisted Decomposition

Consider the following program, a typical kernel of the ADI (alternating direction implicit) method. Whether you distribute the matrix row-wise or column-wise, you cannot escape from the dependences in both dimensions.

      PROGRAM main
      PARAMETER (mx = ..., my = ...)
      REAL x(0:mx,0:my)
      ...
! Loop A
      DO j = 1, my
         DO i = 1, mx
            x(i,j) = x(i,j) + x(i-1,j)
         ENDDO
      ENDDO
! Loop B
      DO j = 1, my
         DO i = 1, mx
            x(i,j) = x(i,j) + x(i,j-1)
         ENDDO
      ENDDO
      ...
      END

One way to parallelize this program is to use the pipeline method described in 3.5.6, “The Pipeline Method” on page 79, but the average degree of parallelism might not meet your performance requirement. You can increase the degree of parallelism by using a smaller block size in the pipeline method, but doing so increases the communication overhead. The twisted decomposition method described below allows all processes to start computing at once, without any stand-by time for some of the processes, and therefore gives better load balance than the pipeline method. The downside is that the program is difficult and time-consuming to write, because the data is distributed to the processes in a complex way.
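To put rough numbers on the trade-off (the figures are illustrative, not from the book): with nprocs = 3 processes and the pipelined dimension split into three blocks, the 3 x 3 = 9 block computations are spread over 3 + 3 - 1 = 5 pipeline steps, giving an average degree of parallelism of 9/5 = 1.8 when communication time is ignored. The twisted decomposition keeps all three processes busy at every step, for a degree of parallelism of 3.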

Figure 79. The Twisted Decomposition

In applying the twisted decomposition method, the rows and columns of the two-dimensional array x() are each divided into nprocs blocks, giving nprocs² blocks in total, where nprocs is the number of processes. A block coordinate system (I, J) is used to identify the location of each block (Figure 79 (c)). In the twisted decomposition, block (I, J) is assigned to process (J - I + nprocs) modulo nprocs, which is the mapping the program below uses. In Loop A, all processes compute the blocks of row I=0, then I=1, and finally I=2. In Loop B, all processes compute the blocks of column J=0, then J=1, and finally J=2. Before proceeding to the next block, each process sends and receives boundary elements to and from its neighboring processes. Note that every row and every column in the block coordinate system contains all ranks, so the degree of parallelism is always equal to the number of processes. The mapping can be checked with the short sketch below; the parallelized version of the program follows after that.
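The following few lines are not from the book; they simply print, for nprocs = 3, the rank that owns each block (I, J) under the assignment used in Loop A of the parallel program, rank = MOD(J - I + nprocs, nprocs). Every row and every column of the printed map contains each rank exactly once.

      PROGRAM blockmap
!     Print the owner rank of each block (I,J) of the twisted
!     decomposition for nprocs = 3.
      PARAMETER (nprocs = 3)
      INTEGER iown(0:nprocs-1)
      DO i = 0, nprocs - 1
         DO j = 0, nprocs - 1
            iown(j) = MOD(j - i + nprocs, nprocs)
         ENDDO
         PRINT *, 'block row I =', i, '  owner ranks:', iown
      ENDDO
      END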

      PROGRAM main
      INCLUDE 'mpif.h'
      INTEGER istatus(MPI_STATUS_SIZE)
      INTEGER, ALLOCATABLE :: is(:), ie(:), js(:), je(:)
      PARAMETER (mx = ..., my = ..., m = ...)
      DIMENSION x(0:mx,0:my)
      DIMENSION bufs(m), bufr(m)
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      ALLOCATE (is(0:nprocs-1), ie(0:nprocs-1))
      ALLOCATE (js(0:nprocs-1), je(0:nprocs-1))
      DO ix = 0, nprocs - 1
         CALL para_range(1, mx, nprocs, ix, is(ix), ie(ix))
         CALL para_range(1, my, nprocs, ix, js(ix), je(ix))
      ENDDO
      inext = myrank + 1
      IF (inext == nprocs) inext = 0
      iprev = myrank - 1
      IF (iprev == -1) iprev = nprocs - 1
      ...
! Loop A
      DO ix = 0, nprocs - 1
         iy   = MOD(ix + myrank, nprocs)
         ista = is(ix)
         iend = ie(ix)
         jsta = js(iy)
         jend = je(iy)
         jlen = jend - jsta + 1
         IF (ix /= 0) THEN
            CALL MPI_IRECV(bufr(jsta), jlen, MPI_REAL, inext, 1,
     &                     MPI_COMM_WORLD, ireqr, ierr)
            CALL MPI_WAIT(ireqr, istatus, ierr)
            CALL MPI_WAIT(ireqs, istatus, ierr)
            DO j = jsta, jend
               x(ista-1,j) = bufr(j)
            ENDDO
         ENDIF
         DO j = jsta, jend
            DO i = ista, iend
               x(i,j) = x(i,j) + x(i-1,j)
            ENDDO
         ENDDO
         IF (ix /= nprocs - 1) THEN
            DO j = jsta, jend
               bufs(j) = x(iend,j)
            ENDDO
            CALL MPI_ISEND(bufs(jsta), jlen, MPI_REAL, iprev, 1,
     &                     MPI_COMM_WORLD, ireqs, ierr)
         ENDIF
      ENDDO
! Loop B
      DO iy = 0, nprocs - 1
         ix   = MOD(iy + nprocs - myrank, nprocs)
         ista = is(ix)
         iend = ie(ix)
         jsta = js(iy)
         jend = je(iy)
         ilen = iend - ista + 1
         IF (iy /= 0) THEN
            CALL MPI_IRECV(x(ista,jsta-1), ilen, MPI_REAL, iprev, 1,
     &                     MPI_COMM_WORLD, ireqr, ierr)
            CALL MPI_WAIT(ireqr, istatus, ierr)
            CALL MPI_WAIT(ireqs, istatus, ierr)
         ENDIF
         DO j = jsta, jend
            DO i = ista, iend
               x(i,j) = x(i,j) + x(i,j-1)
            ENDDO
         ENDDO
         IF (iy /= nprocs - 1) THEN
            CALL MPI_ISEND(x(ista,jend), ilen, MPI_REAL, inext, 1,
     &                     MPI_COMM_WORLD, ireqs, ierr)
         ENDIF
      ENDDO
      DEALLOCATE (is, ie)
      DEALLOCATE (js, je)
      ...
      CALL MPI_FINALIZE(ierr)
      END
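The subroutine para_range, which is defined earlier in the book, returns the contiguous block of indices that a given rank owns out of a global index range. The book's exact listing is not reproduced here; a minimal sketch consistent with how para_range is called above (assuming any remainder elements go to the lowest ranks) is:

      SUBROUTINE para_range(n1, n2, nprocs, irank, ista, iend)
!     Divide the index range n1..n2 into nprocs nearly equal
!     contiguous blocks and return the block ista..iend owned
!     by rank irank (ranks are numbered from 0).
      iwork1 = (n2 - n1 + 1) / nprocs
      iwork2 = MOD(n2 - n1 + 1, nprocs)
      ista = irank * iwork1 + n1 + MIN(irank, iwork2)
      iend = ista + iwork1 - 1
      IF (iwork2 > irank) iend = iend + 1
      END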

Be careful about the location of the MPI_WAIT statements that complete the MPI_ISEND requests, that is, the calls MPI_WAIT(ireqs, istatus, ierr) inside the IF blocks; their placement is what avoids deadlock. First, confirm that data transmissions take place correctly among the processes. Consider Loop B, for example. The structure of Loop B is:

! Loop B

DO iy = 0, nprocs - 1
   ...
   IF (iy /= 0) THEN
      Receive
      Wait for receive to complete
      Wait for send to complete
   ENDIF
   ...
   IF (iy /= nprocs - 1) THEN
      Send
   ENDIF
ENDDO

In this example, there is a chain of data flow: each process always receives data from the specific process that points to it in Figure 80, and always sends data to the specific process that it points to.

Figure 80. Data Flow in the Twisted Decomposition Method

The above loop is expanded into flat statements as shown in Figure 81. It is easy to see that the program works correctly and that deadlocks will not occur.

Figure 81. Loop B Expanded

Next, consider what happens if MPI_WAIT is called immediately after MPI_ISEND. In that case, every process effectively calls the blocking subroutines MPI_SEND and MPI_RECV, in this order, which may cause deadlock due to a shortage of the system buffer for the outgoing data. See the discussion in “Case 1. Send first and then receive” on page 26. The root cause of the problem is that all processes do a blocking send followed by a blocking receive, and that the topology of the processes is closed in terms of data flow (see Figure 80). Together, these two conditions make deadlock possible. Note that in the pipeline method (see 3.5.6, “The Pipeline Method” on page 79), the second condition is not satisfied; that is, the process topology is open.
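The same ordering issue can be seen in isolation with a plain ring shift; the sketch below is not from the book. If every rank called MPI_SEND and then MPI_RECV, completion of the sends would depend entirely on system buffering, and the program could hang for large messages. Posting the send with MPI_ISEND, completing the receive, and only then waiting on the send request, as the twisted decomposition program does, is safe regardless of buffering.

      PROGRAM ringshift
      INCLUDE 'mpif.h'
      PARAMETER (n = 100000)
      INTEGER istatus(MPI_STATUS_SIZE)
      REAL sbuf(n), rbuf(n)
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      inext = MOD(myrank + 1, nprocs)
      iprev = MOD(myrank - 1 + nprocs, nprocs)
      sbuf = REAL(myrank)
!     Unsafe on a closed ring: every rank may block in MPI_SEND if
!     the message does not fit in the system buffer:
!       CALL MPI_SEND(sbuf, n, MPI_REAL, inext, 1,
!    &                MPI_COMM_WORLD, ierr)
!       CALL MPI_RECV(rbuf, n, MPI_REAL, iprev, 1,
!    &                MPI_COMM_WORLD, istatus, ierr)
!     Safe: post the send, complete the receive, then wait on the
!     send request.
      CALL MPI_ISEND(sbuf, n, MPI_REAL, inext, 1, MPI_COMM_WORLD,
     &               ireqs, ierr)
      CALL MPI_RECV(rbuf, n, MPI_REAL, iprev, 1, MPI_COMM_WORLD,
     &              istatus, ierr)
      CALL MPI_WAIT(ireqs, istatus, ierr)
      PRINT *, 'rank', myrank, ' received data from rank', iprev
      CALL MPI_FINALIZE(ierr)
      END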

