RE: [bgl-discuss] MPI failure, simplified




I would suggest opening a PMR with IBM.  Although it might be possible to code around this, it would be preferable to get a fix in the library.  There is a good test case ready to go - and working through IBM support is probably the best approach.  I think this is an important issue - it is just too easy to run out of memory with the current implementation.
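Bob's remark that it might be possible to code around this can be illustrated with a reordering. This sketch rests on an assumption the thread does not confirm: that the library allocates a full-size buffer only when a large message arrives before its matching receive is posted, as the "cannot allocate unexpected buffer" error suggests. Under that assumption, proc 0 would post its MPI_Irecv for the 160 MB before its MPI_Isend and then wait on both (for example with MPI_Waitall). Proc 1 would wait for its 80 MB receive to complete before posting its MPI_Isend. Proc 1's receive can only complete after proc 0 has started its send, and proc 0 posts its Irecv before that send, so the 160 MB message can no longer arrive unexpected. The C below does not call MPI. It is a hypothetical toy model (Receiver, message_arrives, post_recv, original_order and reordered are made-up names) that only counts the unexpected-buffer bytes for each ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of a receiver's matching logic; not BG/L or MPICH code. */
typedef struct {
    bool recv_posted;        /* has the matching MPI_Irecv been posted yet? */
    size_t unexpected_bytes; /* extra memory the library had to allocate */
} Receiver;

/* ASSUMPTION (suggested by the error text, not confirmed): a message that
   arrives before its receive is posted gets a buffer for its full payload. */
void message_arrives(Receiver *r, size_t payload_bytes) {
    assert(r != NULL);
    if (!r->recv_posted) {
        r->unexpected_bytes += payload_bytes;
    }
    /* otherwise the payload would go straight into the posted buffer */
}

void post_recv(Receiver *r) {
    assert(r != NULL);
    r->recv_posted = true;
}

/* Original test: proc 0 waits on its send first, so proc 1's large
   message arrives before proc 0's Irecv is posted. */
size_t original_order(size_t payload_bytes) {
    Receiver r = { false, 0 };
    message_arrives(&r, payload_bytes);
    post_recv(&r);
    return r.unexpected_bytes;
}

/* Reordered: proc 0's Irecv is guaranteed to be posted before the
   large message can arrive. */
size_t reordered(size_t payload_bytes) {
    Receiver r = { false, 0 };
    post_recv(&r);
    message_arrives(&r, payload_bytes);
    return r.unexpected_bytes;
}
```

With a 160000000-byte payload, original_order returns 160000000 and reordered returns 0 under this model. A reordering like this only works around the symptom; it does not replace getting the library fixed.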

Regards,
Bob Walkup (walkup@us.ibm.com, 914-945-1512)
--------------------------------------------------------------


"Andrew Siegel" <siegela@mcs.anl.gov>
Sent by: owner-discuss@bgl.mcs.anl.gov

02/02/2006 12:24 PM
To: <discuss@bgl.mcs.anl.gov>
cc:
Subject: RE: [bgl-discuss] MPI failure, simplified



To be clear, this isn't an EAGER vs. RENDEZVOUS issue, since we are way
past the EAGER limit (as the error message also indicates). My
understanding of the MPI standard is that the implementation is free to
buffer or not, but that whatever it does, this program is guaranteed to
work (at least for "small" numbers of outstanding messages). Anyhow, would
it be wise to contact the IBM MPI people (George Almasi?) and ask some
specific questions? Not being able to do such operations reliably totally
kills several of the important algorithms that we're relying on.
-andrew

-----Original Message-----
From: owner-discuss@bgl.mcs.anl.gov [mailto:owner-discuss@bgl.mcs.anl.gov]
On Behalf Of William Gropp
Sent: Thursday, February 02, 2006 8:12 AM
To: Stephen Siegel
Cc: discuss@bgl.mcs.anl.gov; support@bgl.mcs.anl.gov
Subject: Re: [bgl-discuss] MPI failure, simplified

At 10:31 PM 2/1/2006, Stephen Siegel wrote:
>I posted an earlier message about an MPI failure I was getting on BGL
>when passing some large messages.  I can now produce a similar failure
>with a very simple program.  The code is below, followed by the
>(excerpted) output to stderr when run on 2 procs (co-proc mode).
>
>Each proc allocates a 400 MB buffer.  Proc 0 posts a send to proc 1 of
>the first 80 MB, waits for that send to complete, then posts a receive
>into the next 160 MB and waits for that request to complete.  Proc 1
>posts a recv from proc 0 for the first 80 MB, then posts a send of the
>next 160 MB, then waits for both requests to complete.  It seems to me
>that this is a correct "safe" MPI program, going by the MPI Standard.
>
>The error message, "...cannot allocate unexpected buffer from...
>unexpected requests 1, Total Mem: 160 MB ..." suggests that the MPI
>implementation is trying to allocate 160 MB and it can't.  It seems to
>me that it shouldn't have to allocate this memory--it should just
>deliver the message directly into the receive buffer.  (That is the
>point of the rendezvous protocol.)
>
>Question: is this a bug in the MPI implementation on BGL, or am I
>missing something?

Bug might be too strong a statement, but I agree with your interpretation -
the MPI implementation should not be allocating space for Irecvs and this
program should work.  There's always some tension over where to set the
eager vs. rendezvous threshold for both performance and space reasons, and
the MPI standard doesn't specify when eager, rendezvous, or something else
should be used.  But this program should work.
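The eager/rendezvous trade-off described above can be sketched as a toy cost model. Everything here is an assumption for illustration: choose_protocol, unexpected_cost, the eager_limit parameter and the 64-byte ENVELOPE_BYTES are hypothetical, and the MPI standard leaves the actual protocol choice to the implementation. In this model an eager message that arrives before its receive is posted must be buffered in full, while a rendezvous message costs the receiver only its envelope.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; not the BG/L or MPICH implementation. */
enum Protocol { EAGER, RENDEZVOUS };

/* Made-up size standing in for the (source, tag, communicator, size) header. */
#define ENVELOPE_BYTES 64u

/* At or below the implementation-chosen limit the payload travels with the
   header (eager); above it only a request-to-send envelope travels until
   the receiver is ready (rendezvous). */
enum Protocol choose_protocol(size_t payload_bytes, size_t eager_limit) {
    return payload_bytes <= eager_limit ? EAGER : RENDEZVOUS;
}

/* Bytes the receiver must hold for a message that arrives before its
   receive is posted. */
size_t unexpected_cost(size_t payload_bytes, size_t eager_limit) {
    if (choose_protocol(payload_bytes, eager_limit) == EAGER) {
        /* the payload has already arrived, so it must be buffered */
        return ENVELOPE_BYTES + payload_bytes;
    }
    /* the payload stays in the sender's buffer until the receive is posted */
    return ENVELOPE_BYTES;
}
```

For example, with a hypothetical 1000-byte eager limit, unexpected_cost(160000000, 1000) is 64, whereas the BG/L log in this thread reports 160000000 bytes held for the single unexpected request.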

Bill


>Thanks,
>
>   Steve
>
>
>---------------------------------------------------------------------
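>#include <stdio.h>
>#include <stdlib.h>
>#include "mpi.h"
>
>int main (int argc, char *argv[]) {
>   int myRank, numProcs;
>   unsigned char* ptr;
>   MPI_Request req0;
>   MPI_Request req1;
>
>   MPI_Init(&argc, &argv);
>   MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
>   MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
>   if (numProcs != 2) {
>     fprintf(stderr, "Usage: mpiexec -np 2 ./exp2c\n");
>     fflush(stderr);
>     MPI_Finalize();  /* every rank takes this branch, so finalize cleanly */
>     return 1;
>   }
>   /* 400 MB per rank: bytes [0,80M) hold proc 0's message,
>      bytes [80M,240M) hold proc 1's 160 MB reply */
>   ptr = (unsigned char*)malloc(400000000);
>   if (ptr == NULL) {  /* explicit check: assert() vanishes under NDEBUG */
>     fprintf(stderr, "Proc %d: malloc failed\n", myRank);
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
>   if (myRank == 0) {
>     /* send 80 MB to proc 1 and wait for the send to complete ... */
>     MPI_Isend(ptr,80000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req0);
>     MPI_Wait(&req0,MPI_STATUS_IGNORE);
>     /* ... only then post the receive for proc 1's 160 MB message */
>     MPI_Irecv(ptr+80000000,160000000,MPI_BYTE,1,0,MPI_COMM_WORLD,&req1);
>     MPI_Wait(&req1,MPI_STATUS_IGNORE);
>   } else {
>     /* post the 80 MB receive, then immediately send 160 MB back */
>     MPI_Irecv(ptr,80000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req0);
>     MPI_Isend(ptr+80000000,160000000,MPI_BYTE,0,0,MPI_COMM_WORLD,&req1);
>     MPI_Wait(&req0,MPI_STATUS_IGNORE);
>     MPI_Wait(&req1,MPI_STATUS_IGNORE);
>   }
>   free(ptr);
>   printf("Proc %d has completed successfully\n", myRank);
>   fflush(stdout);
>   MPI_Finalize();
>   return 0;
>}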
>
>---------------------------------------------------------------------
>
>.
>.
>.
><Feb 01 22:11:58.663360> BE_MPI (Info) : IO - Threads initialized
>Rzv:cannot allocate unexpected buffer from R:1 T:0 C:0
>Dumping 9 frames
>         Frame 0:  0x2078f0
>         Frame 1:  0x209da8
>         Frame 2:  0x23e25c
>         Frame 3:  0x237c04
>         Frame 4:  0x23a1b4
>         Frame 5:  0x207b0c
>         Frame 6:  0x2052b0
>         Frame 7:  0x200614
>         Frame 8:  0x20016c
>Posted Queue:
>-------------
>Posted Requests 0, Total Mem: 0 bytes
>Unexpected Queue:
>-----------------
>Unexpected Requests 1, Total Mem: 160000000 bytes
>Fatal:  Cannot allocate buffer for unexpected message
><Feb 01 22:12:03.767341> BE_MPI (Info) : IO - Output thread terminated
><Feb 01 22:12:03.898684> BE_MPI (Info) : Job 44154 switched to state TERMINATED ('T')
>.
>.
>.
>
>---------------------------------------------------------------------
>To add or remove yourself from this mailing list, use the 'notifyme'
>command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.

William Gropp
http://www.mcs.anl.gov/~gropp
