
Re: [bgl-discuss] MPI failure, simplified



At 10:31 PM 2/1/2006, Stephen Siegel wrote:
I posted an earlier message about an MPI failure I was getting on BGL
when passing some large messages.  I can now produce a similar failure
with a very simple program.  The code is below, followed by the
(excerpted) output to stderr when run on 2 procs (co-proc mode).

Each proc allocates a 400 MB buffer.  Proc 0 posts a send to proc 1 of
the first 80 MB, waits for that send to complete, then posts a receive
into the next 160 MB and waits for that request to complete.  Proc 1
posts a recv from proc 0 for the first 80 MB, then posts a send of the
next 160 MB, then waits for both requests to complete.  It seems to me
that this is a correct "safe" MPI program, going by the MPI Standard.

The error message, "...cannot allocate unexpected buffer from...
unexpected requests 1, Total Mem: 160 MB ..." suggests that the MPI
implementation is trying to allocate 160 MB and it can't.  It seems to
me that it shouldn't have to allocate this memory--it should just
deliver the message directly into the receive buffer.  (That is the
point of the rendezvous protocol.)

Question: is this a bug in the MPI implementation on BGL, or am I
missing something?

Bug might be too strong a statement, but I agree with your
interpretation: the MPI implementation should not be allocating space
for Irecvs, and this program should work. There's always some tension
over where to set the eager vs. rendezvous threshold, for both
performance and space reasons, and the MPI standard doesn't specify
when eager, rendezvous, or something else should be used. But this
program should work.
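
To make the eager vs. rendezvous distinction concrete, here is a minimal
sketch in plain C (no MPI calls) of how much payload a receiver would have
to buffer when a message arrives before the matching receive is posted.
Everything in it is an assumption for illustration: the 1024-byte
EAGER_LIMIT_BYTES cutoff, the names choose_protocol, unexpected_payload_bytes
and demo_checks, and the simplification that the choice depends only on
message size. It is not the BGL implementation's logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical eager/rendezvous cutoff; real implementations choose
 * their own value, and the MPI standard does not prescribe one. */
#define EAGER_LIMIT_BYTES ((size_t)1024)

typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol_t;

/* Pick a protocol from the message size alone (a simplification). */
protocol_t choose_protocol(size_t msg_bytes) {
    return msg_bytes <= EAGER_LIMIT_BYTES ? PROTO_EAGER : PROTO_RENDEZVOUS;
}

/* Payload bytes the receiver must stage temporarily when a message
 * shows up.
 *  - Receive already posted: data can go straight to the user buffer.
 *  - Eager, no receive posted: the payload has already been sent, so
 *    all of it must be held until a matching receive appears.
 *  - Rendezvous, no receive posted: only the envelope is queued; the
 *    payload is transferred after the match, so none is staged. */
size_t unexpected_payload_bytes(size_t msg_bytes, bool recv_already_posted) {
    if (recv_already_posted) {
        return 0;
    }
    if (choose_protocol(msg_bytes) == PROTO_EAGER) {
        return msg_bytes;
    }
    return 0;
}

/* Small usage example using the sizes from the test program. */
void demo_checks(void) {
    /* A small unexpected message is buffered in full. */
    assert(unexpected_payload_bytes(512, false) == 512);
    /* The 160 MB message is far above the cutoff: no payload staged. */
    assert(unexpected_payload_bytes(160000000, false) == 0);
    /* With the receive posted first, nothing is staged either way. */
    assert(unexpected_payload_bytes(512, true) == 0);
}
```

Under this model the 160 MB message from proc 1 would need no temporary
payload buffer on proc 0, whereas the stderr output in the original report
shows 160000000 bytes held in the unexpected queue.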

Bill


Thanks,

Steve


---------------------------------------------------------------------
#include <stdlib.h>
#include <assert.h>
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[]) {
  int myRank, numProcs;
  unsigned char* ptr;
  MPI_Request req0;
  MPI_Request req1;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
  if (numProcs != 2) {
    fprintf(stderr, "Usage: mpiexec -np 2 ./exp2c\n");
    fflush(stderr);
    MPI_Finalize();  /* shut MPI down cleanly before exiting */
    return 1;
  }
  /* Each proc allocates a 400 MB buffer. */
  ptr = (unsigned char*)malloc(400000000);
  assert(ptr);
  if (myRank == 0) {
    /* Send the first 80 MB and wait for completion, then receive
       into the next 160 MB and wait for that. */
    MPI_Isend(ptr,80000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req0);
    MPI_Wait(&req0,MPI_STATUS_IGNORE);
    MPI_Irecv(ptr+80000000,160000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req1);
    MPI_Wait(&req1,MPI_STATUS_IGNORE);
  } else {
    /* Post the 80 MB receive and the 160 MB send, then wait for both. */
    MPI_Irecv(ptr,80000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req0);
    MPI_Isend(ptr+80000000,160000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req1);
    MPI_Wait(&req0,MPI_STATUS_IGNORE);
    MPI_Wait(&req1,MPI_STATUS_IGNORE);
  }
  free(ptr);
  printf("Proc %d has completed successfully\n", myRank);
  fflush(stdout);
  MPI_Finalize();
  return 0;
}

---------------------------------------------------------------------

.
.
.
<Feb 01 22:11:58.663360> BE_MPI (Info) : IO - Threads initialized
Rzv:cannot allocate unexpected buffer from R:1 T:0 C:0
Dumping 9 frames
Frame 0: 0x2078f0
Frame 1: 0x209da8
Frame 2: 0x23e25c
Frame 3: 0x237c04
Frame 4: 0x23a1b4
Frame 5: 0x207b0c
Frame 6: 0x2052b0
Frame 7: 0x200614
Frame 8: 0x20016c
Posted Queue:
-------------
Posted Requests 0, Total Mem: 0 bytes
Unexpected Queue:
-----------------
Unexpected Requests 1, Total Mem: 160000000 bytes
Fatal: Cannot allocate buffer for unexpected message<Feb 01 22:12:03.767341> BE_MPI (Info) : IO - Output thread terminated
<Feb 01 22:12:03.898684> BE_MPI (Info) : Job 44154 switched to state TERMINATED ('T')
.
.
.

- --------------------------------------------------------------------
To add or remove yourself from this mailing list, use the 'notifyme'
command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
William Gropp
http://www.mcs.anl.gov/~gropp