
[bgl-discuss] MPI failure, simplified



I posted an earlier message about an MPI failure I was getting on BGL
when passing some large messages.  I can now produce a similar failure
with a very simple program.  The code is below, followed by the
(excerpted) output to stderr when run on 2 procs (co-proc mode).

Each proc allocates a 400 MB buffer.  Proc 0 posts a send to proc 1 of
the first 80 MB, waits for that send to complete, then posts a receive
into the next 160 MB and waits for that request to complete.  Proc 1
posts a recv from proc 0 for the first 80 MB, then posts a send of the
next 160 MB, then waits for both requests to complete.  It seems to me
that this is a correct "safe" MPI program, going by the MPI Standard.
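
As a sanity check on the buffer arithmetic described above, here is a
small sketch in plain C (no MPI needed).  The helper names are made up
for illustration; only the three sizes come from the test program.  It
just confirms that the 80 MB and 160 MB regions fit inside the 400 MB
allocation and do not overlap.

```c
#include <stddef.h>

/* Sizes used by the test program in this message. */
enum {
  BUF_BYTES    = 400000000,  /* malloc'd buffer on each proc */
  FIRST_BYTES  = 80000000,   /* proc 0 -> proc 1 */
  SECOND_BYTES = 160000000   /* proc 1 -> proc 0 */
};

/* Hypothetical helper: returns 1 if [off, off+len) lies inside a
   buffer of `total` bytes. */
static int region_in_bounds(size_t off, size_t len, size_t total) {
  return off <= total && len <= total - off;
}

/* Hypothetical helper: returns 1 if [off_a, off_a+len_a) and
   [off_b, off_b+len_b) share no bytes. */
static int regions_disjoint(size_t off_a, size_t len_a,
                            size_t off_b, size_t len_b) {
  return off_a + len_a <= off_b || off_b + len_b <= off_a;
}
```

Calling region_in_bounds(FIRST_BYTES, SECOND_BYTES, BUF_BYTES) and
regions_disjoint(0, FIRST_BYTES, FIRST_BYTES, SECOND_BYTES) should both
return 1, so the failure does not look like a bounds or aliasing
problem in the user buffers.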

The error message, "...cannot allocate unexpected buffer from...
Unexpected Requests 1, Total Mem: 160000000 bytes ..." suggests that the
MPI implementation is trying to allocate a 160 MB temporary buffer and
can't.  It seems to me that it shouldn't have to allocate this memory;
it should just deliver the message directly into the receive buffer
once the matching receive is posted.  (That is the point of the
rendezvous protocol.)
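
To make that expectation concrete, here is a toy model in plain C of
how much temporary memory a receiver would need when a message arrives.
Everything in it (the types, the function name, the eager/rendezvous
split) is a hypothetical illustration of the textbook behavior I am
assuming, not the actual BGL MPI internals.

```c
#include <stddef.h>

/* Toy model only: hypothetical types, not the BGL MPI implementation. */
typedef struct {
  int    source;  /* sending rank */
  int    tag;     /* message tag */
  size_t bytes;   /* payload size */
} Envelope;

typedef enum { EAGER, RENDEZVOUS } Protocol;

/* Bytes of temporary payload storage the receiver needs when `msg`
   arrives.  `posted` is nonzero if a matching receive is already
   posted, in which case the data can land in the user buffer. */
static size_t temp_bytes_needed(Protocol p, int posted, Envelope msg) {
  if (posted)
    return 0;          /* deliver straight into the receive buffer */
  if (p == EAGER)
    return msg.bytes;  /* payload came with the header; must be stashed */
  return 0;            /* rendezvous: queue only the envelope and pull
                          the payload once the receive is posted */
}
```

Under this model, the unexpected 160000000-byte message from proc 1
costs nothing extra with RENDEZVOUS and the full 160000000 bytes with
EAGER.  The log below looks like the second case, which is why I
suspect the implementation.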

Question: is this a bug in the MPI implementation on BGL, or am I
missing something?

Thanks,

  Steve


---------------------------------------------------------------------
#include <stdlib.h>
#include <assert.h>
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[]) {
  int myRank, numProcs;
  unsigned char* ptr;
  MPI_Request req0;
  MPI_Request req1;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
  if (numProcs != 2) {
    fprintf(stderr, "Usage: mpiexec -np 2 ./exp2c\n");
    fflush(stderr);
    MPI_Finalize();  /* shut MPI down cleanly before bailing out */
    return 1;
  }
  ptr = (unsigned char*)malloc(400000000);
  assert(ptr);
  if (myRank == 0) {
    /* Send the first 80 MB and wait for that send to complete... */
    MPI_Isend(ptr,80000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req0);
    MPI_Wait(&req0,MPI_STATUS_IGNORE);
    /* ...and only then post the receive for the next 160 MB. */
    MPI_Irecv(ptr+80000000,160000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req1);
    MPI_Wait(&req1,MPI_STATUS_IGNORE);
  } else {
    /* Post the 80 MB receive and the 160 MB send, then wait on both. */
    MPI_Irecv(ptr,80000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req0);
    MPI_Isend(ptr+80000000,160000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req1);
    MPI_Wait(&req0,MPI_STATUS_IGNORE);
    MPI_Wait(&req1,MPI_STATUS_IGNORE);
  }
  free(ptr);
  printf("Proc %d has completed successfully\n", myRank);
  fflush(stdout);
  MPI_Finalize();
  return 0;
}

---------------------------------------------------------------------

.
.
.
<Feb 01 22:11:58.663360> BE_MPI (Info) : IO - Threads initialized
Rzv:cannot allocate unexpected buffer from R:1 T:0 C:0
Dumping 9 frames
        Frame 0:  0x2078f0
        Frame 1:  0x209da8
        Frame 2:  0x23e25c
        Frame 3:  0x237c04
        Frame 4:  0x23a1b4
        Frame 5:  0x207b0c
        Frame 6:  0x2052b0
        Frame 7:  0x200614
        Frame 8:  0x20016c
Posted Queue:
-------------
Posted Requests 0, Total Mem: 0 bytes
Unexpected Queue:
-----------------
Unexpected Requests 1, Total Mem: 160000000 bytes
Fatal:  Cannot allocate buffer for unexpected message<Feb 01 22:12:03.767341> BE_MPI (Info) : IO - Output thread terminated
<Feb 01 22:12:03.898684> BE_MPI (Info) : Job 44154 switched to state TERMINATED ('T')
.
.
.
