
Re: [bgl-discuss] MPI failure, simplified



A few points to add to what has already been said:

First, I agree that the failure is occurring in the way that the IBM
developer has described.  I also received a message from Doug Sondak
(Boston Univ.) who gave me the same explanation and another way to fix
the problem: alter the application code so that (1) the receiver first
posts the receive and then sends a short message to the sender saying
"I'm ready to receive now", and (2) the sender first receives the
short message, then posts the big send.  This ensures that the receive
is posted before the send, and I think it is a better solution than the
barrier proposed by IBM.  (For one thing, it involves only the two
processes concerned with this communication, not all processes in the
communicator.)

I tried Doug's solution and my application runs fine.  However, I think
the extra synchronization it introduces is taking a big toll on
performance; I will have to look into it further and try some ways to
ameliorate the problem.

Second, the argument that the simple example program I wrote is
"unsafe" can't possibly be correct.  By the same reasoning, the simple
program

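/* buf, count, and tag are placeholders for "some message" */
if (myRank == 0) {
  MPI_Send(buf, count, MPI_INT, 1, tag, MPI_COMM_WORLD);  /* to proc 1 */
} else if (myRank == 1) {
  MPI_Recv(buf, count, MPI_INT, 0, tag, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);                            /* from proc 0 */
}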

would also be unsafe because the message could arrive at proc 1 before
proc 1 begins to execute the MPI_Recv and so would be buffered there
by the BGL MPI implementation.  Of course, this program and my
example are both safe, because the MPI implementation always has the
option of blocking the sender until a matching receive is posted and
then delivering the message directly into the receive buffer.  No
message buffering is *required* by either program.  For some reason it
seems the IBM developers have chosen not to use the blocking
option (via a rendezvous or some other protocol).
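
The argument above can be illustrated with a toy model in plain C.
This is only a sketch of the reasoning, not the BGL MPI implementation;
all type and function names are hypothetical, and "eager" is used here
as a label for sending without waiting for the receiver.  It models
what happens to one incoming message depending on whether a matching
receive has been posted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types, for illustration only. */
typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol_t;

typedef enum {
    DELIVERED_DIRECT,  /* copied straight into the user's receive buffer */
    BUFFERED,          /* held in system buffer space on the receiver    */
    SENDER_BLOCKED,    /* sender waits until a receive is posted         */
    RESOURCE_ERROR     /* system buffer space exhausted                  */
} outcome_t;

typedef struct {
    size_t capacity;   /* bytes of system buffer on the receiving side  */
    size_t used;       /* bytes currently holding unexpected messages   */
} recv_buffer_t;

/* Decide the fate of one message of msg_bytes bytes. */
outcome_t deliver(recv_buffer_t *rb, protocol_t proto,
                  size_t msg_bytes, int receive_posted)
{
    assert(rb != NULL && rb->used <= rb->capacity);

    /* A posted receive never needs system buffer space. */
    if (receive_posted) return DELIVERED_DIRECT;

    /* Rendezvous: hold the sender back instead of buffering. */
    if (proto == PROTO_RENDEZVOUS) return SENDER_BLOCKED;

    /* Eager with no posted receive: buffer it, or fail if it cannot fit. */
    if (msg_bytes > rb->capacity - rb->used) return RESOURCE_ERROR;
    rb->used += msg_bytes;
    return BUFFERED;
}
```

In this model only the eager path ever touches the system buffer, so it
is the only path that can fail for lack of space.  With a rendezvous
the outcome is either SENDER_BLOCKED or DELIVERED_DIRECT, even when
capacity is 0.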

I agree with William Gropp that a clarification of what is
"pathological" would help.  However, it seemed pretty clear to me that
the MPI Standard was referring to a program that allows an enormous
number of outstanding requests to accumulate (or an implementation
that can only handle a very small number), which is not the case here.
I also thought the "pathological" discussion only applied to
nonblocking functions, while the problem on BGL can occur with just
plain old blocking MPI_Send/MPI_Recv (and even MPI_Ssend/MPI_Recv), as
I showed with the second example I sent in.

I think a more serious problem with the MPI Standard is this, from
MPI-2, Section 2.8, "Error Handling":

  In addition, a *resource error* may occur when a program exceeds
  the amount of available system resources (number of pending
  messages, system buffers, etc.).  The occurrence of this type
  of error depends on the amount of available resources in the
  system and the resource allocation mechanism used; this may differ
  from system to system.  A high-quality implementation will provide
  generous limits on the important resources so as to alleviate
  the portability problem this represents.

I see the reason for including "pending messages" --- in a finite
universe it is not possible to handle an arbitrary number of postings.
However, I don't see the need for including "system buffers".  Since
the implementation can always force the sender to block, it seems to
me there is always a good way for it to proceed even if it has 0
system buffer space.

I think a more precise description of when resource errors can and
cannot occur would be helpful.  As it stands, the formulation above is
so vague (e.g., "etc.") that there is no "guaranteed minimum" that the
user can expect from the MPI implementation when it comes to memory
use, which makes it very hard to write a portable program.
