Re: Fw: [bgl-discuss] MPI failure, simplified
Stephen,
I can add this information to the existing PMR. PMR stands for Problem
Management Report. I can, and will, notify you of any developments in the
PMR. So far, the only activity on the existing PMR is that it was sent to
development on 2/2; nothing has been posted for it since then.
Susan.
On Mon, 13 Feb 2006, Stephen Siegel wrote:
> This concerns the MPI failure on BGL when sending large messages in
> certain scenarios. Thanks to everyone who replied to my earlier post.
> I agree that the best solution would be an MPI implementation that
> fully complies with the MPI Standard in all cases, so we don't have to
> rewrite our code. However, in the interest of expediency, I did try
> Steven Pieper's suggestion of replacing my MPI_Isend/MPI_Waits with
> MPI_Ssends. Unfortunately, my program still failed, with a similar
> error message. Below is another simple program that produces a
> similar failure with numProcs=4 (co-proc mode again), using only
> MPI_Ssend and MPI_Recv. (The MPI_Ssend can be replaced with MPI_Send
> and it still fails in the same way.) Again, this is a program that I
> think should always work, if the MPI implementation conforms to the
> Standard.
>
> Notice that Process 0 mallocs 374 MB, and then tries to receive 3
> messages, each 80 MB, into the malloced region. The error says
> something to the effect that 2 of the messages are "unexpected" and
> there isn't enough memory for them. This may have something to do
> with the fact that 373+160=533 (373 MB being the largest allocation
> that does not crash), and perhaps 533 MB is what is actually
> available to a node. If the MPI implementation is trying to
> store the messages in some other region of memory before transferring
> them into the receive buffer I have allocated, it might discover there
> isn't sufficient memory to do that. This is all speculation, but I
> point it out in case it helps someone figure out what is going on.
>
> Question: What is a PMR? Is there any way I can be notified of
> developments in this PMR?
>
> Thanks again,
>
> Steve
>
> Stephen Siegel
> Senior Research Scientist
> Department of Computer Science
> University of Massachusetts Amherst
>
>
> #include <stdlib.h>
> #include <assert.h>
> #include <stdio.h>
> #include "mpi.h"
>
> int main(int argc, char *argv[]) {
>   int myRank, numProcs, i;
>   unsigned char *ptr;
>   int MB = 1000000;
>
>   MPI_Init(&argc, &argv);
>   MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
>   MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
>   assert(numProcs == 4);
>   /* 374 MB or more required for eventual crash */
>   ptr = (unsigned char *)malloc(374*MB);
>   assert(ptr);
>   if (myRank == 0) {
>     /* rank 0 receives one 80 MB message from each other rank, in rank order */
>     for (i = 1; i < numProcs; i++)
>       MPI_Recv(ptr+(i-1)*80*MB, 80*MB, MPI_BYTE, i, 0, MPI_COMM_WORLD,
>                MPI_STATUS_IGNORE);
>   } else {
>     MPI_Ssend(ptr, 80*MB, MPI_BYTE, 0, 0, MPI_COMM_WORLD); /* or MPI_Send */
>   }
>   free(ptr);
>   printf("Proc %d has completed successfully\n", myRank);
>   fflush(stdout);
>   MPI_Finalize();
>   return 0;
> }
>
>
> Rzv:cannot allocate unexpected buffer from R:2 T:0 C:0
> Dumping 9 frames
> Frame 0: 0x206fa4
> Frame 1: 0x20940c
> Frame 2: 0x23e16c
> Frame 3: 0x237b14
> Frame 4: 0x23a0c4
> Frame 5: 0x2071c0
> Frame 6: 0x2047d4
> Frame 7: 0x200618
> Frame 8: 0x20016c
> Posted Queue:
> -------------
> Posted Requests 0, Total Mem: 0 bytes
> Unexpected Queue:
> -----------------
> Unexpected Requests 2, Total Mem: 160000000 bytes
> Fatal: Cannot allocate buffer for unexpected message
> <Feb 13 12:14:46.721155> BE_MPI (Info) : \IO - Output thread terminated
>
>
> - --------------------------------------------------------------------
> To add or remove yourself from this mailing list, use the 'notifyme'
> command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
>
>