
[bgl-discuss] error messages



Could anyone help me interpret the error messages generated by my
MPI/C program, which ran on 32 nodes (co-processor mode) on BGL?
(See the stderr output below.)  Some relevant facts:

  - the program sends a lot of messages from each proc to every other
    proc.  However, I don't think the number of posted messages
    is ever "excessive".  I can be more specific if need be.
  - the size of each message is 10 MB
  - all communication uses MPI_Isend and MPI_Irecv, plus a few
    MPI_Sendrecv_replace calls
  - I don't think any malloc from my application failed -- I check the
    pointer returned by malloc with an assertion every time.

It seems like some component of the MPI infrastructure ran out of
memory.  If that's the case, I don't think it should cause a crash:
according to the MPI Standard, if the MPI implementation doesn't have
enough memory to buffer the message, it should just block the sender
until it does have enough memory, or until a matching receive is
posted and the message can be delivered directly into the receive
buffer.  This might slow things down, but it shouldn't crash.
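
For a sense of scale, here is a minimal sketch in C of the buffer
arithmetic involved.  It is not taken from the program or from the MPI
library; the helper names and the msgs_per_peer parameter are
hypothetical.  The first helper multiplies a message count by a message
size, which reproduces the "Unexpected Requests 14, Total Mem:
140000000 bytes" line in the dump below.  The second gives a worst case
for one rank, assuming every peer's message arrives before its matching
receive has been posted.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold `count` unexpected messages of `msg_bytes` each. */
static uint64_t unexpected_bytes(uint64_t count, uint64_t msg_bytes)
{
    return count * msg_bytes;
}

/*
 * Worst case for a single rank in an all-to-all exchange: each of the
 * other (nprocs - 1) ranks has `msgs_per_peer` messages arrive before
 * the matching receives are posted.  `msgs_per_peer` is a hypothetical
 * parameter, not a value reported in the log.
 */
static uint64_t worst_case_unexpected_bytes(uint64_t nprocs,
                                            uint64_t msgs_per_peer,
                                            uint64_t msg_bytes)
{
    assert(nprocs >= 1);
    return (nprocs - 1) * msgs_per_peer * msg_bytes;
}
```

For example, unexpected_bytes(14, 10000000) gives 140000000, and with
32 ranks and one 10000000-byte message per peer,
worst_case_unexpected_bytes(32, 1, 10000000) gives 310000000.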

Is there any documentation for the error messages returned below?  Are
there any other documents people can recommend that will help me
become more familiar with MPI on BGL?

Thanks,

Steve Siegel
Senior Research Scientist
Dept. of Computer Science
UMass Amherst


<Jan 18 16:40:53.076072> FE_MPI (Info) : Initializing MPIRUN
/bin/bash: SHELL: readonly variable
/bin/bash: PATH: readonly variable
/bin/bash: line 1: dircolors: No such file or directory
<Jan 18 16:40:54.463549> BRIDGE (Info) : The machine serial number (alias) is BGL
<Jan 18 16:40:54.605361> FE_MPI (Info) : Back-End invoked:
<Jan 18 16:40:54.605391> FE_MPI (Info) :   - Service Node: 172.30.1.100
<Jan 18 16:40:54.605401> FE_MPI (Info) :   - Back-End pid: 19490 (on service node)
<Jan 18 16:40:54.605411> FE_MPI (Info) : Preparing partition
<Jan 18 16:40:54.605569> BE_MPI (Info) : Examining specified partition
<Jan 18 16:40:54.703550> BE_MPI (Info) : Checking partition R000_J104-32 initial state ...
<Jan 18 16:40:54.703585> BE_MPI (Info) : Partition R000_J104-32 initial state = FREE ('F')
<Jan 18 16:40:54.703600> BE_MPI (Info) : Checking partition owner...
<Jan 18 16:40:54.703611> BE_MPI (Info) : Checking if the partition is busy ...
<Jan 18 16:40:54.742059> BE_MPI (Info) : Checking partition size ...
<Jan 18 16:40:54.742083> BE_MPI (Info) : Setting new owner
<Jan 18 16:40:54.794105> BE_MPI (Info) : Booting partition
<Jan 18 16:41:26.342133> BE_MPI (Info) : Partition is ready
<Jan 18 16:41:26.443370> BE_MPI (Info) : Done preparing partition
<Jan 18 16:41:26.483456> FE_MPI (Info) : Adding job
<Jan 18 16:41:26.734384> FE_MPI (Info) : Job added with the following id: 42777
<Jan 18 16:41:26.734678> FE_MPI (Info) : Starting job 42777
<Jan 18 16:41:26.932378> FE_MPI (Info) : Waiting for job to terminate
<Jan 18 16:41:37.652380> BE_MPI (Info) : IO - Connection from Ciodb established on fd=12
<Jan 18 16:41:37.656316> BE_MPI (Info) : IO - Threads initialized
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
Dumping 12 frames
	Frame 0:  0x23482c
	Frame 1:  0x2372bc
	Frame 2:  0x256fe8
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
	Frame 3:  0x250b5c
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
Dumping 12 frames
	Frame 4:  0x252f40
Dumping 12 frames
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
	Frame 0:  0x23482c
	Frame 5:  0x234a48
	Frame 0:  0x23482c
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
Dumping 12 frames
	Frame 1:  0x2372bc
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
	Frame 6:  0x231b80
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
	Frame 1:  0x2372bc
Dumping 12 frames
Rzv:cannot allocate unexpected buffer from R:21 T:0 C:124
	Frame 0:  0x23482c
	Frame 2:  0x256fe8
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
Dumping 12 frames
Rzv:cannot allocate unexpected buffer from R:23 T:0 C:124
	Frame 7:  0x204c94
Dumping 12 frames
	Frame 2:  0x256fe8
	Frame 0:  0x23482c
Dumping 12 frames
	Frame 1:  0x2372bc
	Frame 3:  0x250b5c
Dumping 12 frames
	Frame 0:  0x23482c
Dumping 12 frames
	Frame 8:  0x208dcc
	Frame 0:  0x23482c
	Frame 3:  0x250b5c
	Frame 1:  0x2372bc
	Frame 0:  0x23482c
	Frame 2:  0x256fe8
	Frame 4:  0x252f40
	Frame 0:  0x23482c
	Frame 1:  0x2372bc
	Frame 0:  0x23482c
	Frame 9:  0x200b20
	Frame 1:  0x2372bc
	Frame 4:  0x252f40
	Frame 2:  0x256fe8
	Frame 1:  0x2372bc
	Frame 3:  0x250b5c
	Frame 5:  0x234a48
	Frame 1:  0x2372bc
	Frame 2:  0x256fe8
	Frame 1:  0x2372bc
	Frame 10:  0x200710
	Frame 2:  0x256fe8
	Frame 5:  0x234a48
	Frame 3:  0x250b5c
	Frame 2:  0x256fe8
	Frame 4:  0x252f40
	Frame 6:  0x231b80
	Frame 2:  0x256fe8
	Frame 3:  0x250b5c
	Frame 2:  0x256fe8
	Frame 11:  0x20016c
	Frame 3:  0x250b5c
	Frame 6:  0x231b80
	Frame 4:  0x252f40
	Frame 3:  0x250b5c
	Frame 5:  0x234a48
	Frame 7:  0x204c94
	Frame 3:  0x250b5c
	Frame 4:  0x252f40
	Frame 3:  0x250b5c
Posted Queue:
	Frame 4:  0x252f40
	Frame 7:  0x204c94
	Frame 5:  0x234a48
	Frame 4:  0x252f40
Rzv:cannot allocate unexpected buffer from R:30 T:0 C:124
	Frame 6:  0x231b80
	Frame 8:  0x208dcc
	Frame 4:  0x252f40
	Frame 5:  0x234a48
	Frame 4:  0x252f40
-------------
	Frame 5:  0x234a48
	Frame 8:  0x208dcc
	Frame 6:  0x231b80
	Frame 5:  0x234a48
Dumping 12 frames
	Frame 7:  0x204c94
	Frame 9:  0x200b20
	Frame 5:  0x234a48
	Frame 6:  0x231b80
	Frame 5:  0x234a48
Posted Requests 8, Total Mem: 10000000 bytes
	Frame 6:  0x231b80
	Frame 9:  0x200b20
	Frame 7:  0x204c94
	Frame 6:  0x231b80
	Frame 0:  0x23482c
	Frame 8:  0x208dcc
	Frame 10:  0x200710
	Frame 6:  0x231b80
	Frame 7:  0x204c94
	Frame 6:  0x231b80
Unexpected Queue:
	Frame 7:  0x204c94
	Frame 10:  0x200710
	Frame 8:  0x208dcc
	Frame 7:  0x204c94
	Frame 1:  0x2372bc
	Frame 9:  0x200b20
	Frame 11:  0x20016c
	Frame 7:  0x204c94
	Frame 8:  0x208dcc
	Frame 7:  0x204c94
-----------------
	Frame 8:  0x208dcc
	Frame 11:  0x20016c
	Frame 9:  0x200b20
	Frame 8:  0x208dcc
	Frame 2:  0x256fe8
	Frame 10:  0x200710
Posted Queue:
	Frame 8:  0x208dcc
	Frame 9:  0x200b20
	Frame 8:  0x208dcc
Unexpected Requests 14, Total Mem: 140000000 bytes
	Frame 9:  0x200b20
Posted Queue:
	Frame 10:  0x200710
	Frame 9:  0x200b20
	Frame 3:  0x250b5c
	Frame 11:  0x20016c
-------------
	Frame 9:  0x200b20
	Frame 10:  0x200710
	Frame 9:  0x200b20
Fatal:  Cannot allocate buffer for unexpected message	Frame 10:  0x200710
-------------
	Frame 11:  0x20016c
	Frame 10:  0x200710
	Frame 4:  0x252f40
Posted Queue:
Posted Requests 7, Total Mem: 8750000 bytes
	Frame 10:  0x200710
	Frame 11:  0x20016c
	Frame 10:  0x200710
	Frame 11:  0x20016c
Posted Requests 6, Total Mem: 7500000 bytes
Posted Queue:
	Frame 11:  0x20016c
	Frame 5:  0x234a48
-------------
Unexpected Queue:
	Frame 11:  0x20016c
Posted Queue:
	Frame 11:  0x20016c
Posted Queue:
Unexpected Queue:
-------------
Posted Queue:
	Frame 6:  0x231b80
Posted Requests 5, Total Mem: 6250000 bytes
-----------------
Posted Queue:
-------------
Posted Queue:
-------------
-----------------
Posted Requests 4, Total Mem: 5000000 bytes
-------------
	Frame 7:  0x204c94
Unexpected Queue:
Unexpected Requests 14, Total Mem: 140000000 bytes
-------------
Posted Requests 3, Total Mem: 3750000 bytes
-------------
Posted Requests 3, Total Mem: 3750000 bytes
Unexpected Requests 14, Total Mem: 140000000 bytes
Unexpected Queue:
Posted Requests 3, Total Mem: 3750000 bytes
	Frame 8:  0x208dcc
-----------------
Fatal:  Cannot allocate buffer for unexpected messagePosted Requests 2, Total Mem: 2500000 bytes
Unexpected Queue:
Posted Requests 7, Total Mem: 8750000 bytes
Unexpected Queue:
Fatal:  Cannot allocate buffer for unexpected message-----------------
Unexpected Queue:
	Frame 9:  0x200b20
Unexpected Requests 14, Total Mem: 140000000 bytes
Unexpected Queue:
-----------------
Unexpected Queue:
-----------------
Unexpected Requests 14, Total Mem: 140000000 bytes
-----------------
	Frame 10:  0x200710
Fatal:  Cannot allocate buffer for unexpected message-----------------
Unexpected Requests 14, Total Mem: 140000000 bytes
-----------------
Unexpected Requests 14, Total Mem: 140000000 bytes
Fatal:  Cannot allocate buffer for unexpected messageUnexpected Requests 14, Total Mem: 140000000 bytes
	Frame 11:  0x20016c
Unexpected Requests 14, Total Mem: 140000000 bytes
Fatal:  Cannot allocate buffer for unexpected messageUnexpected Requests 14, Total Mem: 140000000 bytes
Fatal:  Cannot allocate buffer for unexpected messageFatal:  Cannot allocate buffer for unexpected messagePosted Queue:
Fatal:  Cannot allocate buffer for unexpected messageFatal:  Cannot allocate buffer for unexpected message-------------
Posted Requests 2, Total Mem: 2500000 bytes
Unexpected Queue:
-----------------
Unexpected Requests 14, Total Mem: 140000000 bytes
Fatal:  Cannot allocate buffer for unexpected message<Jan 18 16:41:42.761326> BE_MPI (Info) : IO - Output thread terminated
<Jan 18 16:41:42.896776> BE_MPI (Info) : Job 42777 switched to state TERMINATED ('T')
<Jan 18 16:41:42.896804> BE_MPI (Info) : Job successfully terminated
<Jan 18 16:41:43.334884> BE_MPI (ERROR): The error message in the job record is as follows:
<Jan 18 16:41:43.334910> BE_MPI (ERROR):   "killed by exit(1) on node 10"
<Jan 18 16:41:43.479685> FE_MPI (Info) : BG/L job exit status = (1)
<Jan 18 16:41:43.479715> FE_MPI (Info) : Job terminated normally
<Jan 18 16:41:43.479856> BE_MPI (Info) : Starting cleanup sequence
<Jan 18 16:41:43.479876> BE_MPI (Info) : BG/L Job alredy terminated / hasn't been added
<Jan 18 16:41:43.597821> BE_MPI (ERROR): The error message in the job record is as follows:
<Jan 18 16:41:43.597845> BE_MPI (ERROR):   "killed by exit(1) on node 10"
<Jan 18 16:41:43.698575> BE_MPI (Info) : Destroying partition R000_J104-32
<Jan 18 16:41:54.392182> BE_MPI (Info) : Partition R000_J104-32 switched to state FREE ('F')
<Jan 18 16:41:54.501141> BE_MPI (Info) : ==   BE completed   ==
<Jan 18 16:41:54.514211> FE_MPI (Info) : ==   FE completed   ==
<Jan 18 16:41:54.514246> FE_MPI (Info) : == Exit status:   1 ==
