
Re: Fw: [bgl-discuss] MPI failure, simplified



Stephen,

I can add this information to the existing PMR.  PMR stands for Problem
Management Report.  I can, and will, notify you of any developments in the
PMR.  So far, the only activity on the existing PMR is that it was sent to
development (on 2/2); nothing has been posted for that PMR since then.

Susan.

On Mon, 13 Feb 2006, Stephen Siegel wrote:

> This concerns the MPI failure on BGL when sending large messages in
> certain scenarios.  Thanks to everyone who replied to my earlier post.
> I agree that the best solution would be an MPI implementation that
> fully complies with the MPI Standard in all cases, so we don't have to
> rewrite our code.  However, in the interest of expediency, I did try
> Steven Pieper's suggestion of replacing my MPI_Isend/MPI_Waits with
> MPI_Ssends.  Unfortunately, my program still failed, with a similar
> error message.  Below is another simple program that produces a
> similar failure with numProcs=4 (co-proc mode again), using only
> MPI_Ssend and MPI_Recv.  (The MPI_Ssend can be replaced with MPI_Send
> and it still fails in the same way.)  Again, this is a program that I
> think should always work, if the MPI implementation conforms to the
> Standard.
>
> Notice that Process 0 mallocs 374 MB, and then tries to receive 3
> messages, each 80 MB, into the malloced region.  The error says
> something to the effect that 2 of the messages are "unexpected" and
> there isn't enough memory for them.  This may have something to do
> with the fact that 373+160=533 and perhaps that is how many MB are
> actually available to a node.  If the MPI implementation is trying to
> store the messages in some other region of memory before transferring
> them into the receive buffer I have allocated, it might discover there
> isn't sufficient memory to do that.  This is all speculation, but I
> point it out in case it helps someone figure out what is going on.
>
> Question: What is a PMR?  Is there any way I can be notified of
> developments in this PMR?
>
> Thanks again,
>
> Steve
>
> Stephen Siegel
> Senior Research Scientist
> Department of Computer Science
> University of Massachusetts Amherst
>
>
> #include<stdlib.h>
> #include<assert.h>
> #include<stdio.h>
> #include "mpi.h"
>
> int main (int argc, char *argv[]) {
>   int myRank, numProcs, i;
>   unsigned char* ptr;
>   int MB = 1000000;
>
>   MPI_Init(&argc, &argv);
>   MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
>   MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
>   assert(numProcs == 4);
>   /* 374 MB or more required for eventual crash */
>   ptr = (unsigned char*)malloc(374*MB);
>   assert(ptr);
>   if (myRank == 0)
>     for (i = 1; i < numProcs; i++)
>       MPI_Recv(ptr+(i-1)*80*MB,80*MB,MPI_BYTE,i,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
>   else
>     MPI_Ssend(ptr,80*MB,MPI_BYTE,0,0,MPI_COMM_WORLD); /* or MPI_Send */
>   free(ptr);
>   printf("Proc %d has completed successfully\n", myRank);
>   fflush(stdout);
>   MPI_Finalize();
> }
>
>
> Rzv:cannot allocate unexpected buffer from R:2 T:0 C:0
> Dumping 9 frames
>         Frame 0:  0x206fa4
>         Frame 1:  0x20940c
>         Frame 2:  0x23e16c
>         Frame 3:  0x237b14
>         Frame 4:  0x23a0c4
>         Frame 5:  0x2071c0
>         Frame 6:  0x2047d4
>         Frame 7:  0x200618
>         Frame 8:  0x20016c
> Posted Queue:
> -------------
> Posted Requests 0, Total Mem: 0 bytes
> Unexpected Queue:
> -----------------
> Unexpected Requests 2, Total Mem: 160000000 bytes
> Fatal:  Cannot allocate buffer for unexpected message
> <Feb 13 12:14:46.721155> BE_MPI (Info) : \IO - Output thread terminated
>
>
> - --------------------------------------------------------------------
> To add or remove yourself from this mailing list, use the 'notifyme'
> command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
>
>
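
The memory arithmetic in the quoted message (373+160=533) can be written out
as a small C sketch.  It only restates the poster's speculation: it assumes
the MPI library copies each "unexpected" 80 MB message into a temporary
buffer of its own, and that roughly 533 MB is usable on a node.  Neither
assumption is confirmed anywhere in this thread.  The helper names
rank0_demand_bytes and fits_in_node are hypothetical and are not part of
MPI or of any BGL tool.

```c
#include <assert.h>

/* Decimal megabyte, matching `int MB = 1000000;` in the quoted program. */
#define MB 1000000LL

/*
 * Hypothetical helper: bytes in use on rank 0, ASSUMING the MPI library
 * stores every unexpected message in its own temporary buffer in addition
 * to the buffer the user malloc'ed.
 */
long long rank0_demand_bytes(long long user_buffer_mb,
                             int unexpected_msgs,
                             long long msg_mb)
{
    assert(user_buffer_mb >= 0 && unexpected_msgs >= 0 && msg_mb >= 0);
    return user_buffer_mb * MB + (long long)unexpected_msgs * msg_mb * MB;
}

/*
 * Hypothetical helper: returns 1 if that demand fits within node_budget_mb,
 * 0 otherwise.  The budget itself is a guess (533 MB in the quoted message).
 */
int fits_in_node(long long user_buffer_mb,
                 int unexpected_msgs,
                 long long msg_mb,
                 long long node_budget_mb)
{
    return rank0_demand_bytes(user_buffer_mb, unexpected_msgs, msg_mb)
           <= node_budget_mb * MB;
}
```

Under these assumptions, fits_in_node(373, 2, 80, 533) is true and
fits_in_node(374, 2, 80, 533) is false, which is consistent with the
"374 MB or more required for eventual crash" comment in the quoted program
and with the "Total Mem: 160000000 bytes" line in the error output.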