RE: [bgl-discuss] Can BGL jobs wind up working with corrupted memory?
This is a BG/L thing, not a Linux thing. It has always disturbed me, though
I never imagined that it would cause corrupted data that might not be
noticed. Anyhow, I believe that Bob Walkup has written a simple tool that
will check for this and abort if it occurs. Katherine: did Bob provide this
tool at the workshop? If not, it should be no problem to get it or write it
quickly ourselves.
-----Original Message-----
From: owner-discuss@xxxxxxxxxxxxxxx [mailto:owner-discuss@xxxxxxxxxxxxxxx]
On Behalf Of Steven Pieper
Sent: Wednesday, May 11, 2005 10:29 AM
To: discuss@xxxxxxxxxxxxxxx
Subject: [bgl-discuss] Can BGL jobs wind up working with corrupted memory?
Below is a question I originally sent to bgl-support and some responses
that Susan got from IBM about it. The summary is that a program
that has run many times failed with a segmentation error. This was
after 28 hours and some 5,000,000 uses of the loop that the core dump
pointed to. I did not pursue IBM's first response, but I am very
disturbed by the second response. The key parts are
>>> Note that the "end of heap" is greater than the current "stack frame
>>> pointer" - there's a good chance the application picked up a base address
>>> from a corrupted section of memory and attempted to dereference it.
>>> The stack and heap can potentially collide if the application allocates a
>>> fair amount of data on the stack. When memory is allocated, the brk()
>>> routine assures that the end-of-heap is less than the current stack
>>> pointer. However, the reverse is not true - an application can allocate
>>> space on the stack without checking the current sbrk() value. It can do
>>> this via normal function calls, or compiler-generated operations like C's
>>> alloca().
If I read this correctly, they are saying I could be working with
corrupted data anytime I am using most of the memory. In a Fortran
program with the compiler creating arrays of its own, there is no real
way I can know. Corrupted data will result in a seg. fault as in this
case if I am lucky; it might just give me a bad numerical result.
Is this true? Is it a bug of BGL or a general Unix bug?
This application is a good candidate to trip up on this sort of
problem. It is working very close to the 256 MB limit for VN mode. The
Monte Carlo nature of the program results in fluctuations in the amount
of memory explicitly allocated by Fortran ALLOCATE and DEALLOCATE
commands. From time to time the job does fail because an allocate
command cannot get memory -- that is fine, the allocate command returns
an error code and I know. But I am very worried that calculations that
look like they worked may have used corrupted memory at some point.
Comments?
Steve
-------------------------------------------------------
My original note to bgl-support:
6 May 05 - x.mc.a9_t1-new - be9_gs_il2t-a-Vijk5_old-gij.gfmc1c.pt1
Job 2456 failed after 28 hours with the following core dump. The dump
points to a loop that had been successfully run some 5,000,000 times in
this job (on all slaves or ~80,000 times on this slave). Many other
jobs using the same module have successfully finished after equally long
runs.
Is there a chance this is a machine problem?
Can I learn more about what caused the seg vio, like subscript values?
Steve
software signal..................0x0000000b (SIGSEGV - segmentation violation)
generated by interrupt...........0x0000000d (data TLB error interrupt)
while executing instruction at...0x00174470
dereferencing memory at..........0xdb208da0
general purpose registers:
r00=0x0f927ee0 r01=0x0f91fb90 r02=0x1eeeeeee r03=0x0053f25f
r04=0x00000008 r05=0x00000004 r06=0x00000010 r07=0x00000008
r08=0x00000020 r09=0x00000010 r10=0x04801998 r11=0x08dbe76c
r12=0x0b0a802c r13=0x1eeeeeee r14=0xfffffff4 r15=0x0f935260
r16=0xdb210da0 r17=0x00000e1e r18=0x0000018d r19=0xffffffe8
r20=0x00000d47 r21=0x0f935250 r22=0x00000008 r23=0x071f9780
r24=0x071f9768 r25=0x0a2ba684 r26=0x071f9778 r27=0x0a2ba660
r28=0x071f9770 r29=0x0a2ba65c r30=0x0a2ba668 r31=0x0fa73720
special purpose registers:
lr=0x001734e0 cr=0x24424440 xer=0x00000002 ctr=0x00000054
memory:
stack top............0x0ff00000
stack frame pointer..0x0f91fb90
end of heap..........0x0fae6000
start of program.....0x00100000
Personality
XYZ Coordinates..... 1, 3, 1
MPI Rank............ 29
# of interrupts... 726878
Interrupt History (current TB=000040b3 2af55495)
TB=000040b305f5e90f iar=002f95a8 sp=0fd8dde0 (system call interrupt)
TB=000040b305f6037f iar=002fbdf4 sp=0fd8d0e0 (system call interrupt)
TB=000040b305f61caf iar=002fbdf4 sp=0fce4f80 (system call interrupt)
TB=000040b31b7167f7 iar=002fbdf4 sp=0fce4f80 (system call interrupt)
TB=000040b31b7681d3 iar=0030efe4 sp=0fd8d230 (system call interrupt)
TB=000040b31b768a1b iar=0030efe4 sp=0fd8d230 (system call interrupt)
TB=000040b31b76a00f iar=002fbdf4 sp=0fa44cc0 (system call interrupt)
TB=000040b32343b8c7 iar=00174470 sp=0f91fb90 (data TLB error interrupt)
DCRs:
DCR BGL_TSDCR_RE_SND_XP = 0
DCR BGL_TSDCR_RE_SND_XM = 0
DCR BGL_TSDCR_RE_SND_YP = 0
DCR BGL_TSDCR_RE_SND_YM = 0
DCR BGL_TSDCR_RE_SND_ZP = 0
DCR BGL_TSDCR_RE_SND_ZM = 0
Function call chain:
0x00174470
0x001734dc
0x00168090
0x00168d10
0x00144820
0x0013b86c
0x00135a1c
0x00100168
End of stack
these are:
wav
/bgl/home1/spieper/gfmc/mc/fbnwaves.f:1109
wav
/bgl/home1/spieper/gfmc/mc/fbnwaves.f:837
wftn
/bgl/home1/spieper/gfmc/mc/wftn.f:113
kinetic
/bgl/home1/spieper/gfmc/mc/kinetic.f:42
get_expect
/bgl/home1/spieper/gfmc/mc/get_expect.f:124
make_expect
/bgl/home1/spieper/gfmc/mc/master.f:4218
_main
/bgl/home1/spieper/gfmc/mc/master.f:3913
_start_blrts
Line 1109 of fbnwaves.f is the ctrij(i,j) = assignment in this loop:
      do n=nmod+1,nml,4
        ctrij(i,j)=ctrij(i,j)
     &    +phimc(iphi)*cffmc(mlij(iphi))
     &    +phimc(iphi+1)*cffmc(mlij(iphi+1))
     &    +phimc(iphi+2)*cffmc(mlij(iphi+2))
     &    +phimc(iphi+3)*cffmc(mlij(iphi+3))
        iphi=iphi+4
      end do !n
-------------------------------------------------------
first response from IBM
I can see from the dump that the bad address touched is 0xdb208da0
(remember with 512MB there are no addrs above 0x20000000 except for I/O
registers). I see that R16 has a similar value that is off by 32k. So
it's possible that R16 is being used as a base for subscripting. It's hard
to tell without seeing a disassembly of the code, and even then I'm not
experienced reading asm generated from Fortran. The easiest thing to do,
probably, is add manual bounds checking or use a compiler option to get
that if there is one.
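
The address arithmetic in this response can be checked directly from the
dump. The following is only a minimal sketch in Python (not something IBM
or the list supplied); the constant names are made up for illustration,
and the values are copied from the core dump above.

```python
# Values copied from the core dump; the names are illustrative only.
BAD_ADDRESS = 0xDB208DA0          # "dereferencing memory at"
R16 = 0xDB210DA0                  # general purpose register r16
NODE_MEMORY = 512 * 1024 * 1024   # 512 MB per node, per the response above

# Distance between r16 and the faulting address.
offset = R16 - BAD_ADDRESS

# Any address at or above NODE_MEMORY lies outside the 512 MB of RAM.
outside_memory = BAD_ADDRESS >= NODE_MEMORY

print(f"r16 - bad address         = {offset:#x} ({offset // 1024} KiB)")
print(f"first address past 512 MB = {NODE_MEMORY:#x}")
print(f"bad address outside RAM   = {outside_memory}")
```

The difference comes out to 0x8000, i.e. 32 KiB, which matches the "off by
32k" remark, and the faulting address is far above the 512 MB boundary.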
-------------------------------------------------------
second response from IBM
It appears that the application's stack and heap collided:
stack top............0x0ff00000
stack frame pointer..0x0f91fb90
end of heap..........0x0fae6000
start of program.....0x00100000
Note that the "end of heap" is greater than the current "stack frame
pointer" - there's a good chance the application picked up a base address
from a corrupted section of memory and attempted to dereference it.
The stack and heap can potentially collide if the application allocates a
fair amount of data on the stack. When memory is allocated, the brk()
routine assures that the end-of-heap is less than the current stack
pointer. However, the reverse is not true - an application can allocate
space on the stack without checking the current sbrk() value. It can do
this via normal function calls, or compiler-generated operations like C's
alloca().
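
To put a number on the collision described above, here is a minimal Python
sketch using the three addresses from the dump. It assumes, as the response
implies, that the stack grows downward from "stack top" and the heap grows
upward toward it; the function name overlap_bytes is hypothetical and is
not part of any BG/L tool.

```python
# Addresses copied from the "memory:" section of the core dump.
STACK_TOP = 0x0FF00000
STACK_FRAME_POINTER = 0x0F91FB90
END_OF_HEAP = 0x0FAE6000


def overlap_bytes(end_of_heap, stack_pointer):
    """Return how many bytes the stack extends below the end of the heap.

    Assumes a downward-growing stack and an upward-growing heap.
    Returns 0 when the two regions do not overlap.
    """
    return max(0, end_of_heap - stack_pointer)


overlap = overlap_bytes(END_OF_HEAP, STACK_FRAME_POINTER)
stack_in_use = STACK_TOP - STACK_FRAME_POINTER

print(f"stack in use : {stack_in_use:#x} ({stack_in_use} bytes)")
print(f"overlap      : {overlap:#x} ({overlap} bytes)")
```

With the dump's values the stack frame pointer sits 0x1c6470 bytes (about
1.8 MB) below the end of the heap, so that much memory was claimed by both
the stack and the heap at the time of the fault.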
- --------------------------------------------------------------------
To add or remove yourself from this mailing list, use the 'notifyme'
command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
- --------------------------------------------------------------------