[bgl-discuss] MPI failure, simplified
I posted an earlier message about an MPI failure I was getting on BGL
when passing some large messages. I can now produce a similar failure
with a very simple program. The code is below, followed by the
(excerpted) output to stderr when run on 2 procs (co-proc mode).
Each proc allocates a 400 MB buffer. Proc 0 posts a nonblocking send of
the first 80 MB to proc 1, waits for that send to complete, then posts a
nonblocking receive into the next 160 MB and waits for that request to
complete. Proc 1 posts a nonblocking receive from proc 0 into its first
80 MB, then posts a nonblocking send of the next 160 MB, then waits for
both requests to complete. It seems to me that this is a correct, "safe"
MPI program (one that does not rely on system buffering), going by the
MPI Standard.
The error message, "Rzv:cannot allocate unexpected buffer from R:1 ...
Unexpected Requests 1, Total Mem: 160000000 bytes", suggests that the
MPI implementation is trying to allocate 160 MB for an unexpected
message and failing. It seems to me that it shouldn't have to allocate
this memory--it should just deliver the message directly into the
receive buffer once the receive is posted. (That is the point of the
rendezvous protocol.)
Question: is this a bug in the MPI implementation on BGL, or am I
missing something?
Thanks,
Steve
---------------------------------------------------------------------
#include <stdlib.h>
#include <assert.h>
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[]) {
  int myRank, numProcs;
  unsigned char* ptr;
  MPI_Request req0;
  MPI_Request req1;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
  if (numProcs != 2) {
    fprintf(stderr, "Usage: mpiexec -np 2 ./exp2c\n");
    fflush(stderr);
    MPI_Finalize();
    return 1;
  }

  /* 400 MB buffer on each proc */
  ptr = (unsigned char*)malloc(400000000);
  assert(ptr);

  if (myRank == 0) {
    /* send the first 80 MB, wait, then receive 160 MB into the next region */
    MPI_Isend(ptr,80000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req0);
    MPI_Wait(&req0,MPI_STATUS_IGNORE);
    MPI_Irecv(ptr+80000000,160000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req1);
    MPI_Wait(&req1,MPI_STATUS_IGNORE);
  } else {
    /* post the 80 MB receive, post the 160 MB send, then wait for both */
    MPI_Irecv(ptr,80000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req0);
    MPI_Isend(ptr+80000000,160000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req1);
    MPI_Wait(&req0,MPI_STATUS_IGNORE);
    MPI_Wait(&req1,MPI_STATUS_IGNORE);
  }

  free(ptr);
  printf("Proc %d has completed successfully\n", myRank);
  fflush(stdout);
  MPI_Finalize();
  return 0;
}
---------------------------------------------------------------------
.
.
.
<Feb 01 22:11:58.663360> BE_MPI (Info) : IO - Threads initialized
Rzv:cannot allocate unexpected buffer from R:1 T:0 C:0
Dumping 9 frames
Frame 0: 0x2078f0
Frame 1: 0x209da8
Frame 2: 0x23e25c
Frame 3: 0x237c04
Frame 4: 0x23a1b4
Frame 5: 0x207b0c
Frame 6: 0x2052b0
Frame 7: 0x200614
Frame 8: 0x20016c
Posted Queue:
-------------
Posted Requests 0, Total Mem: 0 bytes
Unexpected Queue:
-----------------
Unexpected Requests 1, Total Mem: 160000000 bytes
Fatal: Cannot allocate buffer for unexpected message
<Feb 01 22:12:03.767341> BE_MPI (Info) : IO - Output thread terminated
<Feb 01 22:12:03.898684> BE_MPI (Info) : Job 44154 switched to state TERMINATED ('T')
.
.
.