Re: Fw: [bgl-discuss] MPI failure, simplified
- To: <discuss@xxxxxxxxxxxxxxx>
- Subject: Re: Fw: [bgl-discuss] MPI failure, simplified
- From: "Rajeev Thakur" <thakur@xxxxxxxxxxx>
- Date: Sat, 6 May 2006 23:31:15 -0500
I was looking up something else in the MPI Standard when I came across this
example under "Semantics of point-to-point communication". Example 3.7, pg
32, MPI-1.1.
It says:
"Example 3.7 An exchange of messages.
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE ! rank.EQ.1
CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF
This program will succeed even if no buffer space for data is available."
Note that it does not put any restriction on the size of the sendbuf. It
says the program will succeed even if there is no buffer space available.
Going by this, I would say that the Standard says that Steve's code should
work.
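To make the "no buffer space" claim concrete, here is a minimal sketch in plain C (not MPI code, and not taken from the Standard). It checks whether two ranks that use only blocking sends and receives can finish when a send cannot complete until the other rank reaches the matching receive. The Op type and completes_without_buffering are hypothetical names, and the model deliberately ignores nonblocking calls such as MPI_Isend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One blocking operation addressed to the other rank (hypothetical model). */
typedef enum { OP_SEND, OP_RECV } Op;

/*
 * Returns true if rank 0 (ops r0[0..n0-1]) and rank 1 (ops r1[0..n1-1])
 * both run to completion when no message can be buffered, i.e. each send
 * must pair up with a receive that the other rank is currently executing.
 */
bool completes_without_buffering(const Op *r0, size_t n0,
                                 const Op *r1, size_t n1) {
    size_t i = 0, j = 0;
    assert(r0 != NULL && r1 != NULL);
    while (i < n0 && j < n1) {
        bool match = (r0[i] == OP_SEND && r1[j] == OP_RECV) ||
                     (r0[i] == OP_RECV && r1[j] == OP_SEND);
        if (!match)
            return false;   /* both ranks are blocked: deadlock */
        i++;
        j++;
    }
    /* A rank with operations left over would block forever. */
    return i == n0 && j == n1;
}
```

For Example 3.7, rank 0 is {OP_SEND, OP_RECV} and rank 1 is {OP_RECV, OP_SEND}, and the function returns true. If both ranks send first, it returns false, which is the pattern that depends on buffering.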
Rajeev
-----------------------------------------------
* To: Stephen Siegel <siegel@xxxxxxxxxxxx>
* Subject: Re: Fw: [bgl-discuss] MPI failure, simplified
* From: Susan Coghlan <smc@xxxxxxxxxxx>
* Date: Tue, 21 Feb 2006 09:59:59 -0600 (CST)
* Cc: discuss@xxxxxxxxxxxxxxx
* Delivered-to: bglm-discuss-outgoing@xxxxxxxxxxxxxxxxxxxxxxx
* Delivered-to: bglm-discuss@xxxxxxxxxxxxxxxxxxxxxxx
* In-reply-to: <Pine.LNX.4.58.0602131507430.24253@xxxxxxxxxxxxxxx>
* References:
<OF9F54A6BD.6DD0F830-ON85257109.007316EB-85257109.00733C08@xxxxxxxxxx>
<Pine.LNX.4.58.0602131416160.20888@xxxxxxxxxxxxxxxxxxxx>
<Pine.LNX.4.58.0602131507430.24253@xxxxxxxxxxxxxxx>
* Sender: owner-discuss@xxxxxxxxxxxxxxx
I finally have a response back from the IBM developers, actually two
responses. I've included them below.
Development update:
In the BG/L MPI implementation, both sends are taking place simultaneously
(see commentary in the code below). However, since rank 0's receive has
not been posted yet, the send from rank 1 arrives as an unexpected
message, causing the memory allocation to be attempted.
if (myRank == 0) {
MPI_Isend(ptr,80000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req0);
// 1. Nothing is sent yet.
MPI_Wait(&req0,MPI_STATUS_IGNORE); // 2. Data is sent now.
MPI_Irecv(ptr+80000000,160000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req1);
// 3. This probably does not occur
// before rank 1 starts sending
// its data (see line 3 in rank 1
// below).
MPI_Wait(&req1,MPI_STATUS_IGNORE);
} else {
MPI_Irecv(ptr,80000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req0);
// 1. It is likely that this receive
// was posted before the data
// arrived, so the data is not
// unexpected.
MPI_Isend(ptr+80000000,160000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req1);
// 2. Nothing is sent yet.
MPI_Wait(&req0,MPI_STATUS_IGNORE); // 3. This causes the data to be sent,
// in parallel with data being
// received.
MPI_Wait(&req1,MPI_STATUS_IGNORE);
}
According to the MPI spec, this application is "safe" because it works
when the Isend's are replaced with Ssend's. Our implementation is perhaps
too aggressive in assuming there is enough memory to accommodate the data
being sent.
We can think of ways to change MPI to make this work, but this runs the
risk of negatively impacting the performance of the "normal case" where
the amount of data is significantly less and there is enough memory to
handle the unexpected messages. It is likely that an option would need to
be introduced requesting this change in behavior. As such, this would not
be a trivial change and it would need to be considered together with other
enhancements being made.
Even after such a change was made available, we would not recommend that
this application take advantage of it because it is not optimal. Rather,
we recommend two changes for this application, which can be implemented
immediately, and which will perform optimally:
1. Post the MPI_Irecv's first. This will eliminate the overhead of
handling unexpected messages and subsequent copying of the data.
2. Post a barrier after issuing the MPI_Irecv's to guarantee that the
receives are posted before data is sent.
The following is the modified version of this, as recommended:
if (myRank == 0) {
MPI_Irecv(ptr+80000000,160000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req1);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Isend(ptr,80000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req0);
MPI_Wait(&req0,MPI_STATUS_IGNORE);
MPI_Wait(&req1,MPI_STATUS_IGNORE);
} else {
MPI_Irecv(ptr,80000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req0);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Isend(ptr+80000000,160000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req1);
MPI_Wait(&req0,MPI_STATUS_IGNORE);
MPI_Wait(&req1,MPI_STATUS_IGNORE);
}
First, I want to clarify a statement that I made previously: The
statement was "According to the MPI spec, this application is "safe"
because it works when the Isend's are replaced with Ssend's." This is not
true. This application is *not* safe because (according to the MPI spec),
"A program is 'safe' if no message buffering is required for the program
to complete." In the original program, message buffering is required
because a message from rank 1 can arrive at rank 0 before rank 0 has
posted its receive for it. Even when the Isends are replaced with Ssends,
rank 1 can still send its data before rank 0 has posted a receive for it
(rank 1's sequence of operations is Irecv, Ssend, Wait).
There can easily be confusion if you interpret an Ssend as waiting for the
corresponding receive to be posted BEFORE SENDING ANY DATA. However, the
MPI spec defines Ssend as follows: "A send that uses the synchronous mode
can be started whether or not a matching receive was posted. However, the
send will complete successfully only if a matching receive is posted, and
the receive operation has started to receive the message sent by the
synchronous send."
Thus, to be safe (portable to any platform), and to ensure that there is
enough space on the receiving end to receive the data, and to perform
optimally (receiving unexpected data is not optimal, even when there is
enough space to hold it), application writers should do as we recommended
previously:
1. Post the MPI_Irecv's first. This will eliminate the overhead of
handling unexpected messages and subsequent copying of the data.
2. Post a barrier after issuing the MPI_Irecv's to guarantee that the
receives are posted before data is sent.
This latest program that the customer sent in is also not safe. While
rank 0 is receiving data from rank 1, data is arriving at rank 0 from
ranks 2 and 3 (remember, an Ssend "can be started whether or not a
matching receive was posted"). To make this program safe (portable) and
optimal, change it according to the recommendation. It will look
something like the following (new or changed lines are marked with //
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<).
#include<stdlib.h>
#include<assert.h>
#include<stdio.h>
#include "mpi.h"
int main (int argc, char *argv[]) {
int myRank, numProcs, i;
unsigned char* ptr;
int MB = 1000000;
MPI_Request reqs[4]; // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
assert(numProcs == 4);
/* 374 MB or more required for eventual crash */
ptr = (unsigned char*)malloc(374*MB);
assert(ptr);
if (myRank == 0) {
for (i = 1; i < numProcs; i++)
MPI_Irecv(ptr+(i-1)*80*MB,80*MB,MPI_BYTE,i,0,MPI_COMM_WORLD,&reqs[i]); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
MPI_Barrier(MPI_COMM_WORLD); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
for (i = 1; i < numProcs; i++) // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
MPI_Wait(&reqs[i],MPI_STATUS_IGNORE); //
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
}
else {
MPI_Barrier(MPI_COMM_WORLD); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
MPI_Ssend(ptr,80*MB,MPI_BYTE,0,0,MPI_COMM_WORLD); /* or MPI_Send */
}
free(ptr);
printf("Proc %d has completed successfully\n", myRank);
fflush(stdout);
MPI_Finalize();
// printf("All done %d\n",myRank);
// fflush(stdout);
exit(0);
}
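As an illustration of why posting the receives first avoids the allocation failure, here is a minimal sketch in plain C (no MPI calls). It is a toy model, not the BG/L implementation: the Receiver type and all function names are hypothetical, and it assumes that a message arriving with no matching posted receive must be buffered in full, which is the behavior described in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 8

/* Toy receiver state; every name in this sketch is hypothetical. */
typedef struct {
    long free_bytes;         /* memory still available for unexpected buffers */
    int  unexpected_count;   /* messages buffered because no receive was posted */
    bool posted[MAX_RANKS];  /* posted[r]: a receive from rank r is pending */
} Receiver;

static void receiver_init(Receiver *rx, long free_bytes) {
    rx->free_bytes = free_bytes;
    rx->unexpected_count = 0;
    for (int r = 0; r < MAX_RANKS; r++)
        rx->posted[r] = false;
}

static void post_recv(Receiver *rx, int src) {
    assert(src >= 0 && src < MAX_RANKS);
    rx->posted[src] = true;
}

/* A message arrives from src. Returns false if it is unexpected and
   there is not enough memory left to buffer it. */
static bool message_arrives(Receiver *rx, int src, long bytes) {
    assert(src >= 0 && src < MAX_RANKS);
    if (rx->posted[src]) {        /* expected: goes straight to the user buffer */
        rx->posted[src] = false;
        return true;
    }
    if (bytes > rx->free_bytes)   /* unexpected and no room: the modeled failure */
        return false;
    rx->free_bytes -= bytes;
    rx->unexpected_count++;
    return true;
}

/* Original order: only the receive from rank 1 is posted when
   ranks 1, 2 and 3 all start sending. */
bool run_blocking_order(long free_bytes, long msg_bytes) {
    Receiver rx;
    bool ok = true;
    receiver_init(&rx, free_bytes);
    post_recv(&rx, 1);
    for (int src = 1; src <= 3; src++)
        ok = message_arrives(&rx, src, msg_bytes) && ok;
    return ok;
}

/* Recommended order: every receive is posted before any message arrives. */
bool run_irecv_first(long free_bytes, long msg_bytes) {
    Receiver rx;
    bool ok = true;
    receiver_init(&rx, free_bytes);
    for (int src = 1; src <= 3; src++)
        post_recv(&rx, src);
    for (int src = 1; src <= 3; src++)
        ok = message_arrives(&rx, src, msg_bytes) && ok;
    return ok;
}
```

With 80 MB messages and 159 MB left for unexpected buffers (533 MB minus the 374 MB malloc, using the speculative 533 MB figure from the original report quoted below), run_blocking_order returns false because the second unexpected message no longer fits. run_irecv_first returns true even with no spare memory, since nothing arrives unexpected.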
On Mon, 13 Feb 2006, Susan Coghlan wrote:
>
> Stephen,
>
> I can add this information to the existing PMR. PMR stands for Problem
> Management Report. I can, and will, notify you of any developments in the
> PMR. Right now, all that has happened on the existing PMR is that it has
> been sent to development (on 2/2), nothing has been posted for that PMR
> since then.
>
> Susan.
>
> On Mon, 13 Feb 2006, Stephen Siegel wrote:
>
> > This concerns the MPI failure on BGL when sending large messages in
> > certain scenarios. Thanks to everyone who replied to my earlier post.
> > I agree that the best solution would be an MPI implementation that
> > fully complies with the MPI Standard in all cases, so we don't have to
> > rewrite our code. However, in the interest of expediency, I did try
> > Steven Pieper's suggestion of replacing my MPI_Isend/MPI_Waits with
> > MPI_Ssends. Unfortunately, my program still failed, with a similar
> > error message. Below is another simple program that produces a
> > similar failure with numProcs=4 (co-proc mode again), using only
> > MPI_Ssend and MPI_Recv. (The MPI_Ssend can be replaced with MPI_Send
> > and it still fails in the same way.) Again, this is a program that I
> > think should always work, if the MPI implementation conforms to the
> > Standard.
> >
> > Notice that Process 0 mallocs 374 MB, and then tries to receive 3
> > messages, each 80 MB, into the malloced region. The error says
> > something to the effect that 2 of the messages are "unexpected" and
> > there isn't enough memory for them. This may have something to do
> > with the fact that 373+160=533 and perhaps that is how many MB are
> > actually available to a node. If the MPI implementation is trying to
> > store the messages in some other region of memory before transferring
> > them into the receive buffer I have allocated, it might discover there
> > isn't sufficient memory to do that. This is all speculation, but I
> > point it out in case it helps someone figure out what is going on.
> >
> > Question: What is a PMR? Is there any way I can be notified of
> > developments in this PMR?
> >
> > Thanks again,
> >
> > Steve
> >
> > Stephen Siegel
> > Senior Research Scientist
> > Department of Computer Science
> > University of Massachusetts Amherst
> >
> >
> > #include<stdlib.h>
> > #include<assert.h>
> > #include<stdio.h>
> > #include "mpi.h"
> >
> > int main (int argc, char *argv[]) {
> > int myRank, numProcs, i;
> > unsigned char* ptr;
> > int MB = 1000000;
> >
> > MPI_Init(&argc, &argv);
> > MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
> > MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
> > assert(numProcs == 4);
> > /* 374 MB or more required for eventual crash */
> > ptr = (unsigned char*)malloc(374*MB);
> > assert(ptr);
> > if (myRank == 0)
> > for (i = 1; i < numProcs; i++)
> > MPI_Recv(ptr+(i-1)*80*MB,80*MB,MPI_BYTE,i,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
> > else
> > MPI_Ssend(ptr,80*MB,MPI_BYTE,0,0,MPI_COMM_WORLD); /* or MPI_Send */
> > free(ptr);
> > printf("Proc %d has completed successfully\n", myRank);
> > fflush(stdout);
> > MPI_Finalize();
> > }
> >
> >
> > Rzv:cannot allocate unexpected buffer from R:2 T:0 C:0
> > Dumping 9 frames
> > Frame 0: 0x206fa4
> > Frame 1: 0x20940c
> > Frame 2: 0x23e16c
> > Frame 3: 0x237b14
> > Frame 4: 0x23a0c4
> > Frame 5: 0x2071c0
> > Frame 6: 0x2047d4
> > Frame 7: 0x200618
> > Frame 8: 0x20016c
> > Posted Queue:
> > -------------
> > Posted Requests 0, Total Mem: 0 bytes
> > Unexpected Queue:
> > -----------------
> > Unexpected Requests 2, Total Mem: 160000000 bytes
> > Fatal: Cannot allocate buffer for unexpected message
> > <Feb 13 12:14:46.721155> BE_MPI (Info) : \IO - Output thread terminated
> >
> >
> > - --------------------------------------------------------------------
> > To add or remove yourself from this mailing list, use the 'notifyme'
> > command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
> >
> >
>