
Re: [bgl-discuss] Blue Gene and mpirun error codes



 Hi Fiona,
we also ran into the problem with the compiler messages.
Our IBM site engineer Michael Hennecke found a workaround and
we have opened PMR 92722,033,724 for this.
Regards
Jutta


Problem: XL compiler runtime errors on BlueGene only produce a 15xx-xxx error number, but no error text. This is apparently because the message catalogs reside in /opt/ibmcmp/msg, which exists on the SN and FENs but not on the BG nodes.

Here is a workaround, verified with V1R1M0:

# copy the message catalogs to a place below /bgl...
mkdir -p /bgl/local/opt/ibmcmp/msg/
cp -pr /opt/ibmcmp/msg/en_US /bgl/local/opt/ibmcmp/msg/

# create a SITEDIST rc file to point /opt to it on the IONs...
mkdir -p /bgl/dist/etc/rc.d/init.d
cd       /bgl/dist/etc/rc.d/init.d
cat <<E_O_F >> ./xlmsg
# link to XL compiler runtime message catalogs
#
export LANG=en_US
ln -s /bgl/local/opt /opt
E_O_F

# activate the new rc file for runlevel 3...
mkdir -p /bgl/dist/etc/rc.d/rc3.d
cd       /bgl/dist/etc/rc.d/rc3.d
ln -s ../init.d/xlmsg ./S10xlmsg
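
To check the copy step before rebooting the IONs, something like the following could be used. This is only a minimal sketch: `check_catalogs` and its root-directory argument are hypothetical names made up for illustration, not part of any IBM tool, and it only tests that the en_US entry exists below the given root, not that the catalogs are complete.

```shell
# check_catalogs ROOT
# Hypothetical helper: report whether the XL message catalog entry
# exists below ROOT, using the layout from the workaround above.
check_catalogs() {
    path="$1/opt/ibmcmp/msg/en_US"
    if [ -e "$path" ]; then
        echo "found: $path"
    else
        echo "missing: $path"
        return 1
    fi
}

# Example call on the service node (assumes the layout above):
#   check_catalogs /bgl/local
```

The function returns non-zero when the entry is missing, so it can also be used in an `if` or with `&&`.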

Fiona Reid wrote:
Hi Everyone,

I have a question regarding the error codes returned by mpirun:

For helloworld which runs normally:
<Nov  4 10:50:30> FE_MPI (Info) : BG/L job exit status = (0)
<Nov  4 10:50:44> FE_MPI (Info) : == Exit status:   0 ==

For a helloworld which is forced to core dump:
<Nov 4 10:53:05> BE_MPI (ERROR): The error message in the job record is as follows:
<Nov 4 10:53:05> BE_MPI (ERROR): "killed with signal 5"
<Nov 4 10:53:06> FE_MPI (Info) : BG/L job exit status = (133)
<Nov 4 10:53:21> FE_MPI (Info) : == Exit status: 133 ==


For the OCCAM code which crashes but doesn't produce a core:
1525-003
1525-003
1525-001
<Nov 4 10:57:10> BE_MPI (ERROR): The error message in the job record is as follows:
<Nov 4 10:57:10> BE_MPI (ERROR): "killed by exit(1) on node 0"
<Nov 4 10:57:11> FE_MPI (Info) : BG/L job exit status = (1)
<Nov 4 10:57:26> FE_MPI (Info) : == Exit status: 0 ==


Can anyone explain what the different error codes mean?
E.g. what do 1525-001  and 1525-003 mean?
     what does "killed with signal 5" mean?
     what does "killed by exit(1) on node 0" mean?
     what does "Exit status: 133" mean?
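
The numbers in the logs above are consistent with the common Unix shell convention that a process killed by signal N is reported with exit status 128+N, so 133 would correspond to signal 5 (SIGTRAP on Linux). Whether mpirun follows this convention in every case is an assumption here, not something the logs confirm. A minimal sketch of the arithmetic, where `decode_status` is a hypothetical helper and not part of mpirun:

```shell
# decode_status STATUS
# Hypothetical helper: interpret an exit status using the common
# 128+N convention for "killed by signal N".
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

decode_status 133   # prints "signal 5" (the core-dump run)
decode_status 1     # prints "exit 1"   (the OCCAM run)
decode_status 0     # prints "exit 0"   (the normal run)
```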

Many thanks,

Fiona

- --------------------------------------------------------------------
To add or remove yourself from this mailing list, use the 'notifyme'
command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.


--
--------------------------------------------------------------
Jutta Docter                    E-mail: J.Docter@xxxxxxxxxxxxx
Forschungszentrum Juelich GmbH  Phone:  (+49) 2461 61-6763
ZAM                             Fax:    (+49) 2461 61-6656
D 52425 Juelich                 GERMANY
--------------------------------------------------------------

