From: owner-discuss@xxxxxxxxxxxxxxx [mailto:owner-discuss@xxxxxxxxxxxxxxx] On Behalf Of Bob Walkup
Sent: Friday, June 03, 2005 8:21 AM
To: Rajeev Thakur
Cc: discuss@xxxxxxxxxxxxxxx
Subject: RE: [bgl-discuss] MPI Buffer Problem
To continue the discussion of running out of memory due to unexpected messages, it would be possible to ensure that messages are expected by posting MPI_Irecvs on rank 0, but you would need a barrier like this:
    if (myrank .eq. 0) then
       post mpi_irecv() for each expected message
    end if
    call mpi_barrier()
    if (myrank .ne. 0) call mpi_send() to rank 0
    if (myrank .eq. 0) call mpi_waitall() on all the irecv requests
Without the barrier, it is possible for senders to initiate sends before the irecvs are posted, but the barrier forces the senders to wait until the irecvs are posted. If rank 0 posts all MPI_Irecv() calls first, there has to be a valid recv buffer for each MPI_Irecv(). This may require more memory than the serialized version with MPI_Recv() - it just depends on what needs to be done with the data that is coming in. For a sobering thought, consider the memory required to gather a very modest amount of data, say 10 KB, from each of 64K tasks ... that would take 640 MB, which is more than can fit on a BG/L node. So on a Blue Gene system, you really have to watch out for memory issues if you want to scale to very large numbers of processors.
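The pre-posted receive pattern described above could be written out roughly as follows. This is only a minimal sketch: the message length nwords, the tag value 0, the use of double precision data, and the names rbuf, sbuf, and requests are assumptions made for illustration, not details from the original code.

```fortran
program gather_preposted
  use mpi
  implicit none
  integer, parameter :: nwords = 1024   ! hypothetical message length per sender
  integer :: ierr, myrank, nprocs, pe
  integer, allocatable :: requests(:)
  double precision, allocatable :: rbuf(:,:)
  double precision :: sbuf(nwords)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  sbuf = dble(myrank)

  if (myrank .eq. 0) then
     ! Every posted receive needs its own valid buffer: one column per sender.
     allocate(rbuf(nwords, nprocs-1), requests(nprocs-1))
     do pe = 1, nprocs - 1
        call MPI_Irecv(rbuf(1,pe), nwords, MPI_DOUBLE_PRECISION, pe, 0, &
                       MPI_COMM_WORLD, requests(pe), ierr)
     end do
  end if

  ! No sender can pass this point until rank 0 has posted all its receives.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  if (myrank .ne. 0) then
     call MPI_Send(sbuf, nwords, MPI_DOUBLE_PRECISION, 0, 0, &
                   MPI_COMM_WORLD, ierr)
  else
     call MPI_Waitall(nprocs-1, requests, MPI_STATUSES_IGNORE, ierr)
     deallocate(rbuf, requests)
  end if

  call MPI_Finalize(ierr)
end program gather_preposted
```

Note that rbuf grows with the number of senders, which is exactly the memory cost discussed above: rank 0 holds nwords elements for every other rank at the same time.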
I am not very familiar with the implementation of MPI_Gather() or MPI_Gatherv(). I think they use some kind of tree algorithm designed to perform well on most systems, but I would not assume that skimping on memory or protecting against unexpected messages were major concerns. So on Blue Gene in some cases it might be necessary to use a flow-control approach like the one outlined below.
Regards,
Bob Walkup (walkup@xxxxxxxxxx, 914-945-1512)
-------------------------------------------------------------------------
"Rajeev Thakur" <thakur@xxxxxxxxxxx>
Sent by: owner-discuss@xxxxxxxxxxxxxxx
06/02/2005 09:46 PM
To: Bob Walkup/Watson/IBM@IBMUS, <discuss@xxxxxxxxxxxxxxx>
cc:
Subject: RE: [bgl-discuss] MPI Buffer Problem
Would it work if the user posted Irecvs on rank 0 and then did a Waitall? Another option is to use MPI_Gatherv.
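For reference, a call to MPI_Gatherv might look like the sketch below. The per-rank message lengths (rank r contributes r+1 values) and the names sbuf, rbuf, recvcounts, and displs are hypothetical, chosen only to make the example self-contained. Unlike the send/receive versions, rank 0 also contributes its own data.

```fortran
program gatherv_sketch
  use mpi
  implicit none
  integer :: ierr, myrank, nprocs, pe, nsend
  integer, allocatable :: recvcounts(:), displs(:)
  double precision, allocatable :: sbuf(:), rbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Hypothetical sizes: rank r contributes r+1 values.
  nsend = myrank + 1
  allocate(sbuf(nsend))
  sbuf = dble(myrank)

  ! Counts and displacements are only significant on the root, but here
  ! every rank can compute them, so they are filled in everywhere.
  allocate(recvcounts(nprocs), displs(nprocs))
  do pe = 0, nprocs - 1
     recvcounts(pe+1) = pe + 1
  end do
  displs(1) = 0
  do pe = 2, nprocs
     displs(pe) = displs(pe-1) + recvcounts(pe-1)
  end do

  if (myrank .eq. 0) then
     allocate(rbuf(sum(recvcounts)))   ! root holds all gathered data
  else
     allocate(rbuf(1))                 ! receive buffer is ignored on non-root ranks
  end if

  call MPI_Gatherv(sbuf, nsend, MPI_DOUBLE_PRECISION, &
                   rbuf, recvcounts, displs, MPI_DOUBLE_PRECISION, &
                   0, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program gatherv_sketch
```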
Rajeev
From: owner-discuss@xxxxxxxxxxxxxxx [mailto:owner-discuss@xxxxxxxxxxxxxxx] On Behalf Of Bob Walkup
Sent: Thursday, June 02, 2005 3:47 PM
To: discuss@xxxxxxxxxxxxxxx
Subject: Fw: [bgl-discuss] MPI Buffer Problem
----- Forwarded by Bob Walkup/Watson/IBM on 06/02/2005 04:48 PM -----
Bob Walkup 06/02/2005 04:25 PM
To: Douglas Sondak <sondak@xxxxxxxxxx>
cc:
From: Bob Walkup/Watson/IBM@IBMUS
Subject: Re: [bgl-discuss] MPI Buffer Problem
This sounds like MPI rank 0 is running out of memory because it is allocating buffers for messages that have been sent before a matching mpi_recv was posted. This kind of problem is common, and generally occurs during operations that gather data onto one MPI process. It should be possible to slightly re-structure the MPI code, to eliminate "unexpected" messages. For example, code like this can fail:
    if (myrank .eq. 0) then
       do pe = 1, numpes
          call mpi_recv(rbuf, count, type, pe, ...)
       end do
    else
       call mpi_send(sbuf, count, type, 0, ...)
    end if
In the code above, the mpi_recv() calls are blocking and complete one at a time, while all of the senders try to send at once. That can force the receiver to allocate buffers for the "unexpected" messages (messages that do not yet have a corresponding receive posted). One solution is to introduce flow control. For example:
    if (myrank .eq. 0) then
       do pe = 1, numpes
          call mpi_send(flag, 1, mpi_integer, pe, ...)
          call mpi_recv(rbuf, count, type, pe, ...)
       end do
    else
       call mpi_recv(flag, 1, mpi_integer, 0, ...)
       call mpi_send(sbuf, count, type, 0, ...)
    end if
The modified code changes the sequence so that each MPI process sends its data to rank 0 only when rank 0 asks for it, and not before. This eliminates unexpected messages and completely serializes the "gather" operation. There are other potential solutions, but the idea is to make sure that matching receives are posted before the sends start pouring in.
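Filled out as a complete program, the flow-control version might look something like the sketch below. The message length nwords, the tag values 1 and 2, and the double precision payload are assumptions for illustration only. The loop runs over ranks 1 to nprocs-1, where nprocs is the communicator size.

```fortran
program gather_flow_control
  use mpi
  implicit none
  integer, parameter :: nwords = 1024   ! hypothetical message length
  integer, parameter :: tag_go = 1      ! hypothetical tag for the "go ahead" flag
  integer, parameter :: tag_data = 2    ! hypothetical tag for the payload
  integer :: ierr, myrank, nprocs, pe, flag
  integer :: status(MPI_STATUS_SIZE)
  double precision :: buf(nwords)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  flag = 1

  if (myrank .eq. 0) then
     do pe = 1, nprocs - 1
        ! Tell rank pe it may send, then receive its data.
        call MPI_Send(flag, 1, MPI_INTEGER, pe, tag_go, &
                      MPI_COMM_WORLD, ierr)
        call MPI_Recv(buf, nwords, MPI_DOUBLE_PRECISION, pe, tag_data, &
                      MPI_COMM_WORLD, status, ierr)
        ! Use or store buf here; the next iteration overwrites it.
     end do
  else
     buf = dble(myrank)
     ! Wait for permission from rank 0 before sending anything.
     call MPI_Recv(flag, 1, MPI_INTEGER, 0, tag_go, &
                   MPI_COMM_WORLD, status, ierr)
     call MPI_Send(buf, nwords, MPI_DOUBLE_PRECISION, 0, tag_data, &
                   MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program gather_flow_control
```

Because only one sender has permission at any time, rank 0 needs a single receive buffer and there is never more than one data message in flight toward it.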
The same kind of problem can occur on other platforms, but this would normally hit Blue Gene first, because of the small amount of memory on each node. Also, this problem tends to be much more severe for very large parallel jobs (thousands of processes), but can still be managed by making the messages "expected".
Regards,
Bob Walkup (walkup@xxxxxxxxxx, 914-945-1512)
---------------------------------------------------------------
Douglas Sondak <sondak@xxxxxxxxxx>
Sent by: owner-discuss@xxxxxxxxxxxxxxx
06/02/2005 03:09 PM
To: discuss@xxxxxxxxxxxxxxx
cc: sondak@xxxxxxxxxx
Subject: [bgl-discuss] MPI Buffer Problem
I'm getting the following error message when I try to run a
Fortran 90/MPI code on our newly-installed Blue Gene/L at Boston
University:
RVZ: cannot allocate unexpected buffer
The code has run successfully in the past on a wide variety of
platforms including IBM p690, SGI Origin3000, linux clusters, SGI
Altix, etc.
I am running on 4 processors. (This is a test case.) The routine in
which the problem is occurring collects arrays on one processor using
standard blocking sends and receives. A total of 4 messages are sent:
proc. 1 sends one message to proc. 0
proc. 2 sends one message to proc. 0
proc. 3 sends two messages to proc. 0
Each message is 5,752,300 bytes. The error occurs when receiving the
message from proc. 1. The messages from procs. 2 and 3 work fine. I
found that if I reduce the size of the proc. 1 message to 43,928
bytes, it works fine (leaving the sizes of the other 3 messages at
5,752,300 bytes).
I suspected a memory problem, so I tried eliminating the 3 messages
that work, and only sent the single message from proc. 1 to proc. 0.
This still fails with the same error message.
I looked at the postings at the ANL web site, and found a posting
about a similar error message, with the word "eager" rather than
"RVZ." I tried changing the eager limit to 6,000,000 (larger than
the message), and this didn't help. As a shot in the dark I also
tried a small eager limit (1024), and this didn't help either.
Has anyone seen anything like this? Might anyone have a suggestion
about diagnosing the problem? I'm now at something of a loss as to
how to proceed. Thanks!
___________________________________________________________
Doug Sondak Boston University
email: sondak@xxxxxx Office of Information Technology
phone: (617)353-8273 111 Cummington Street
fax : (617)353-6260 Boston, MA 02215