
Re: [bgl-discuss] Can BGL jobs wind up working with corrupted memory?



Hi Pete,
> Last week the TAU folks visited, and we discussed a new memory tool
>which could give you a "headroom" measurement over the life of your
>program, essentially looking at the difference between the stack and
>heap and how much you have left.  You could then see which routines
>put you closest to overflow.

>Such a tool is in the works....
>
>-Pete
	We've released TAU v2.14.4 with support for memory headroom evaluation options.
	I've enclosed a description of the various options below.
	http://www.cs.uoregon.edu/research/paracomp/tau
	Thanks,
	- Sameer

TAU's memory headroom API and -PROFILEHEADROOM measurement option
-----------------------------------------------------------------

TAU's memory evaluation options fall into two categories:
1) Memory utilization options that examine how much heap memory is currently
used, and
2) Memory headroom evaluation options that examine how much a program can grow
(or how much headroom it has) before it runs out of free memory on the heap.
TAU tries to call malloc with chunks that progressively increase in size, until
all memory is exhausted. Then it frees those chunks, keeping track of how much
memory it successfully allocated.

In this document, we examine the second set of options.
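
The probing strategy described above (malloc chunks of increasing size until
a request fails, then free them all) can be sketched roughly as follows. This
is an illustration only, not TAU's actual code: the names probe_headroom and
Chunk, the halving step after a failed request, and the "limit" safety cap are
assumptions made for the example. The cap keeps the probe bounded on systems
that overcommit memory, where malloc may keep succeeding without physical
backing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical bookkeeping: each allocated block stores a link to the
// previously allocated one, so no extra memory is needed to remember them.
struct Chunk { Chunk* next; };

// Hypothetical helper (not part of TAU): returns how many bytes malloc was
// able to hand out, probing no more than `limit` bytes in total.
std::size_t probe_headroom(std::size_t first_chunk, std::size_t limit)
{
    assert(first_chunk >= sizeof(Chunk));

    Chunk* head = nullptr;           // list of blocks obtained so far
    std::size_t total = 0;           // bytes successfully allocated
    std::size_t size = first_chunk;  // size of the next request
    bool growing = true;             // false after the first failed request

    // Invariant: total <= limit, so `limit - total` cannot underflow.
    while (size >= first_chunk && size <= limit - total) {
        void* p = std::malloc(size);
        if (p == nullptr) {
            growing = false;         // stop doubling, try smaller requests
            size /= 2;
            continue;
        }
        Chunk* c = static_cast<Chunk*>(p);
        c->next = head;
        head = c;
        total += size;
        if (growing && size <= static_cast<std::size_t>(-1) / 2)
            size *= 2;               // progressively larger chunks
    }

    // Give everything back; only the total is kept.
    while (head != nullptr) {
        Chunk* next = head->next;
        std::free(head);
        head = next;
    }
    return total;
}
```

For example, probe_headroom(1 << 20, 64u << 20) starts with a 1 MB request and
never holds more than 64 MB at once. The result is a lower bound on the
headroom whenever the cap, rather than a failed malloc, ends the probe.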

2a) TAU_TRACK_MEMORY_HEADROOM()
This call sets up a signal handler that is invoked every 10 seconds by an
interrupt. Inside, it evaluates how much memory it can allocate and associates
it with the callstack. The user can vary the size of the callstack by setting
the environment variable TAU_CALLSTACK_DEPTH (default is 2).
The examples/headroom/track subdirectory has an example that illustrates the
use of this call.  To disable headroom tracking at runtime, call
TAU_DISABLE_TRACKING_MEMORY_HEADROOM(); to re-enable it, call
TAU_ENABLE_TRACKING_MEMORY_HEADROOM().
To set a different interrupt interval, call
TAU_SET_INTERRUPT_INTERVAL(value)
where value is the interval between interrupts, in seconds.
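
The interrupt mechanism can be pictured with a small POSIX sketch: a repeating
interval timer raises SIGALRM, and a handler notes each tick. This is not TAU's
implementation. The names g_ticks, on_tick, start_sampling, stop_sampling and
wait_for_ticks are made up for the example, and, unlike the description above,
the handler here only counts ticks and leaves the actual measurement to normal
code, because malloc is not async-signal-safe.

```cpp
#include <cassert>
#include <signal.h>     // sigaction, SIGALRM (POSIX)
#include <sys/time.h>   // setitimer (POSIX)
#include <unistd.h>     // usleep (POSIX)

// Number of timer interrupts seen so far; written only by the handler.
volatile sig_atomic_t g_ticks = 0;

// Handler: just record the tick. Any allocation is deferred to normal code.
extern "C" void on_tick(int /*signo*/)
{
    g_ticks = g_ticks + 1;
}

// Arm a repeating real-time timer that raises SIGALRM every interval_usec
// microseconds. Returns true on success.
bool start_sampling(long interval_usec)
{
    assert(interval_usec > 0);

    struct sigaction sa = {};
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;        // restart interrupted system calls
    if (sigaction(SIGALRM, &sa, nullptr) != 0)
        return false;

    struct itimerval tv = {};
    tv.it_interval.tv_sec  = interval_usec / 1000000;
    tv.it_interval.tv_usec = interval_usec % 1000000;
    tv.it_value = tv.it_interval;    // first tick after one full interval
    return setitimer(ITIMER_REAL, &tv, nullptr) == 0;
}

// Disarm the timer.
void stop_sampling()
{
    struct itimerval off = {};
    setitimer(ITIMER_REAL, &off, nullptr);
}

// Poll (about 1 ms per poll) until `wanted` ticks have arrived or
// `max_polls` polls have been made; returns the tick count observed.
int wait_for_ticks(int wanted, int max_polls)
{
    for (int i = 0; i < max_polls && g_ticks < wanted; ++i)
        usleep(1000);
    return static_cast<int>(g_ticks);
}
```

A 10-second interval would correspond to start_sampling(10 * 1000000L). A real
tool would take one headroom sample per tick, at a point where allocating
memory is safe.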

A sample generated profile:
USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0
---------------------------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
---------------------------------------------------------------------------------------
         3       4067       4061       4065      2.828  Memory Headroom Left (in MB)
         3       4067       4061       4065      2.828  Memory Headroom Left (in MB) : void quicksort(int *, int, int)   => void quicksort(int *, int, int)
--------------------------------------------------------------------------------

2b) TAU_TRACK_MEMORY_HEADROOM_HERE()
Sometimes it is useful to track the memory available at a certain point in the
program, rather than rely on an interrupt. TAU_TRACK_MEMORY_HEADROOM_HERE()
allows us to examine the memory available at a particular location in the source
code and associate it with the currently executing callstack.
The examples/headroom/here subdirectory has an example that illustrates this usage.

  ary = new double [1024*1024*50];  /* 50M doubles = 400 MB */
  TAU_TRACK_MEMORY_HEADROOM_HERE(); /* takes a sample here!  */
  sleep(1);

A sample profile looks like this:

USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0
---------------------------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
---------------------------------------------------------------------------------------
         3       3672       3672       3672          0  Memory Headroom Left (in MB)
         1       3672       3672       3672          0  Memory Headroom Left (in MB) : main() (calls f1, f5) => f1() (sleeps 1 sec, calls f2, f4)
         1       3672       3672       3672          0  Memory Headroom Left (in MB) : main() (calls f1, f5) => f1() (sleeps 1 sec, calls f2, f4) => f4() (sleeps 4 sec, calls f2)
         1       3672       3672       3672          0  Memory Headroom Left (in MB) : main() (calls f1, f5) => f5() (sleeps 5 sec)
---------------------------------------------------------------------------------------

2c) -PROFILEHEADROOM
Similar to the -PROFILEMEMORY configuration option that takes a sample of the
memory utilization at each function entry, we now have -PROFILEHEADROOM. With
this option, a sample is taken at each instrumented function's entry and
associated with the function name. This option is meant to be used as a
debugging aid due to the high cost associated with executing a series of malloc
calls. The cost was 106 microseconds on an IBM BG/L (700 MHz CPU). To use this
option, simply configure TAU with the -PROFILEHEADROOM option and choose any
method for instrumentation (PDT, MPI, hand instrumentation). You do not need
to annotate the source code in any special way (as is required for 2a and 2b).
The examples/headroom/available subdirectory has a simple example that produces the following profile when TAU is configured with the -PROFILEHEADROOM option.

USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0
---------------------------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
---------------------------------------------------------------------------------------
         1       4071       4071       4071          0  f1() (sleeps 1 sec, calls f2, f4) - Memory Headroom Available (MB)
         2       3671       3671       3671          0  f2() (sleeps 2 sec, calls f3) - Memory Headroom Available (MB)
         2       3671       3671       3671          0  f3() (sleeps 3 sec) - Memory Headroom Available (MB)
         1       3671       3671       3671          0  f4() (sleeps 4 sec, calls f2) - Memory Headroom Available (MB)
         1       3671       3671       3671          0  f5() (sleeps 5 sec) - Memory Headroom Available (MB)
         1       4071       4071       4071          0  main() (calls f1, f5) - Memory Headroom Available (MB)
---------------------------------------------------------------------------------------
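
The 400 MB step between the entries for main() and f1() (4071 MB) and those for
f2() through f5() (3671 MB) is what one would expect if the example allocates
an array like the 50M-element double array shown in 2b between those entry
points. That the example does so is an assumption here, not something stated
above. The arithmetic can be checked with a short sketch, where the helper
names are made up for the example and 1 MB is taken to be 2^20 bytes:

```cpp
#include <cassert>
#include <cstddef>

// Size, in whole MB (2^20 bytes), of an array of `elements` doubles.
// Allocator bookkeeping overhead is ignored.
std::size_t double_array_mb(std::size_t elements)
{
    return elements * sizeof(double) / (1024 * 1024);
}

// Headroom values (MB) copied from the -PROFILEHEADROOM profile above.
const std::size_t kHeadroomBefore = 4071;  // at entry to main() and f1()
const std::size_t kHeadroomAfter  = 3671;  // at entry to f2() through f5()

// True when the observed drop equals the size of the 50M-double array.
bool drop_matches_array()
{
    assert(kHeadroomBefore >= kHeadroomAfter);
    return kHeadroomBefore - kHeadroomAfter
           == double_array_mb(1024 * 1024 * 50);
}
```

With 8-byte doubles, 1024*1024*50 elements occupy 419,430,400 bytes, which is
400 MB.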

If you have any suggestions for memory options in TAU, please send us an e-mail at
tau-team@xxxxxxxxxxxxxxx

On Wed, 11 May 2005, Pete Beckman wrote:

> At 10:29 AM -0500 5/11/05, Steven Pieper wrote:
>
> >   Is this true?  Is it a bug of BGL or a general Unix bug?
>
> A couple notes:
>
> In this case, it is not a Unix bug, or Linux bug, as Andrew points
> out, since there is no Unix/Linux OS running on the compute nodes.
> Instead, IBM has a super-lightweight program loader that essentially
> loads your application and then lets it loose.  It does not do too
> much more.
>
> >   If I read this correctly, they are saying I could be working with
> >corrupted data anytime I am using most of the memory.  In a Fortran
> >program with the compiler creating arrays of its own, there is no real
> >way I can know.  Corrupted data will result in a seg. fault as in this
> >case if I am lucky; it might just give me a bad numerical result.
>
> Because the OS is so small and lightweight on BG/L, it is skimpy on
> memory bounds tests. On normal Unix-based operating systems, memory
> is protected and divided into chunks so overruns are easy to spot.
> On BG/L, to keep things simpler, there is no security :-)  you can
> touch any part of memory, and corrupt it in any way you like.
>
> The stack/alloc bug described by you and IBM is a little different
> from simple memory corruption: it actually happens because you run
> out of memory, but nobody discovers it.  Instead of actually
> generating an error and explaining that you are out of space, it just
> walks over memory already in use.
>
> So... we should try to get IBM to add bounds-checking so when you run
> out of memory you get a good error.  In the meantime, tools can help
> let you know when you are actually very very low on memory.
>
> Last week the TAU folks visited, and we discussed a new memory tool
> which could give you a "headroom" measurement over the life of your
> program, essentially looking at the difference between the stack and
> heap and how much you have left.  You could then see which routines
> put you closest to overflow.
>
> Such a tool is in the works....
>
> -Pete
>
>
> >
>
> >Below is a question I originally sent to bgl-support and some responses
> >that Susan got from IBM about it.  The summary is that a program
> >that has run many times failed with a segmentation error.  This was
> >after 28 hours and some 5,000,000 uses of the loop that the core dump
> >pointed to.  I did not pursue IBM's first response, but I am very
> >disturbed by the second response.  The key parts are
> >
> >>>>  Note that the "end of heap" is greater than the current "stack frame
> >>>>  pointer" - there's a good chance the application picked up a base address
> >>>>  from a corrupted section of memory and attempted to dereference it.
> >
> >>>>  The stack and heap can potentially collide if the application allocates a
> >>>>  fair amount of data on the stack.  When memory is allocated, the brk()
> >>>>  routine assures that the end-of-heap is less than the current stack
> >>>>  pointer.   However, the reverse is not true - an application can allocate
> >>>>  space on the stack without checking the current sbrk() value.  It can do
> >>>>  this via normal function calls, or compiler-generated operations like C's
> >>>>  alloca().
> >
> >
> >
> >
> >   This application is a good candidate to trip up on this sort of
> >problem.  It is working very close to the 256 MB limit for vn mode.  The
> >Monte Carlo nature of the program results in fluctuations in the amount
> >of memory explicitly allocated by Fortran ALLOCATE and DEALLOCATE
> >commands.  From time to time the job does fail because an allocate
> >command cannot get memory -- that is fine, the allocate command returns
> >an error code and I know.  But I am very worried that calculations that
> >look like they worked may have used corrupted memory at some point.
> >
> >
> >Comments?
> >
> >Steve
> >
> >-------------------------------------------------------
> >
> >My original note to bgl-support:
> >
> >6 May 05 - x.mc.a9_t1-new - be9_gs_il2t-a-Vijk5_old-gij.gfmc1c.pt1
> >
> >Job 2456 failed after 28 hours with the following core dump.  The dump
> >points to a loop that had been successfully run some 5,000,000 times in
> >this job (on all slaves or ~80,000 times on this slave).  Many other
> >jobs using the same module have successfully finished after equally long
> >runs.
> >
> >Is there a chance this is a machine problem?
> >Can I learn more about what caused the seg vio, like subscript values?
> >
> >Steve
> >
> >
> >software signal..................0x0000000b (SIGSEGV - segmentation violation)
> >generated by interrupt...........0x0000000d (data TLB error interrupt)
> >while executing instruction at...0x00174470
> >dereferencing memory at..........0xdb208da0
> >
> >general purpose registers:
> >r00=0x0f927ee0 r01=0x0f91fb90 r02=0x1eeeeeee r03=0x0053f25f
> >r04=0x00000008 r05=0x00000004 r06=0x00000010 r07=0x00000008
> >r08=0x00000020 r09=0x00000010 r10=0x04801998 r11=0x08dbe76c
> >r12=0x0b0a802c r13=0x1eeeeeee r14=0xfffffff4 r15=0x0f935260
> >r16=0xdb210da0 r17=0x00000e1e r18=0x0000018d r19=0xffffffe8
> >r20=0x00000d47 r21=0x0f935250 r22=0x00000008 r23=0x071f9780
> >r24=0x071f9768 r25=0x0a2ba684 r26=0x071f9778 r27=0x0a2ba660
> >r28=0x071f9770 r29=0x0a2ba65c r30=0x0a2ba668 r31=0x0fa73720
> >
> >special purpose registers:
> >lr=0x001734e0  cr=0x24424440 xer=0x00000002 ctr=0x00000054
> >
> >memory:
> >stack top............0x0ff00000
> >stack frame pointer..0x0f91fb90
> >end of heap..........0x0fae6000
> >start of program.....0x00100000
> >
> >Personality
> >XYZ Coordinates..... 1, 3, 1
> >MPI Rank............ 29
> >
> ># of interrupts... 726878
> >
> >Interrupt History  (current TB=000040b3 2af55495)
> >TB=000040b305f5e90f  iar=002f95a8 sp=0fd8dde0 (system call interrupt)
> >TB=000040b305f6037f  iar=002fbdf4 sp=0fd8d0e0 (system call interrupt)
> >TB=000040b305f61caf  iar=002fbdf4 sp=0fce4f80 (system call interrupt)
> >TB=000040b31b7167f7  iar=002fbdf4 sp=0fce4f80 (system call interrupt)
> >TB=000040b31b7681d3  iar=0030efe4 sp=0fd8d230 (system call interrupt)
> >TB=000040b31b768a1b  iar=0030efe4 sp=0fd8d230 (system call interrupt)
> >TB=000040b31b76a00f  iar=002fbdf4 sp=0fa44cc0 (system call interrupt)
> >TB=000040b32343b8c7  iar=00174470 sp=0f91fb90 (data TLB error interrupt)
> >
> >DCRs:
> >DCR BGL_TSDCR_RE_SND_XP = 0
> >DCR BGL_TSDCR_RE_SND_XM = 0
> >DCR BGL_TSDCR_RE_SND_YP = 0
> >DCR BGL_TSDCR_RE_SND_YM = 0
> >DCR BGL_TSDCR_RE_SND_ZP = 0
> >DCR BGL_TSDCR_RE_SND_ZM = 0
> >
> >Function call chain:
> >0x00174470
> >0x001734dc
> >0x00168090
> >0x00168d10
> >0x00144820
> >0x0013b86c
> >0x00135a1c
> >0x00100168
> >End of stack
> >
> >these are:
> >
> >wav
> >/bgl/home1/spieper/gfmc/mc/fbnwaves.f:1109
> >wav
> >/bgl/home1/spieper/gfmc/mc/fbnwaves.f:837
> >wftn
> >/bgl/home1/spieper/gfmc/mc/wftn.f:113
> >kinetic
> >/bgl/home1/spieper/gfmc/mc/kinetic.f:42
> >get_expect
> >/bgl/home1/spieper/gfmc/mc/get_expect.f:124
> >make_expect
> >/bgl/home1/spieper/gfmc/mc/master.f:4218
> >_main
> >/bgl/home1/spieper/gfmc/mc/master.f:3913
> >_start_blrts
> >
> >
> >1109 of fbnwaves.f is ctrij(i,j) = in loop:
> >
> >                    do n=nmod+1,nml,4
> >                       ctrij(i,j)=ctrij(i,j)
> >      &                   +phimc(iphi)*cffmc(mlij(iphi))
> >      &                   +phimc(iphi+1)*cffmc(mlij(iphi+1))
> >      &                   +phimc(iphi+2)*cffmc(mlij(iphi+2))
> >      &                   +phimc(iphi+3)*cffmc(mlij(iphi+3))
> >                       iphi=iphi+4
> >                    end do       !n
> >
> >
> >-------------------------------------------------------
> >
> >first response from IBM
> >
> >
> >
> >I can see from the dump that the bad address touched is 0xdb208da0
> >(remember with 512MB there are no addrs above 0x20000000 except for I/O
> >registers).  I see that R16 has a similar value that is off by 32k.  So
> >it's possible that R16 is being used as a base for subscripting.  It's hard
> >to tell without seeing a disassembly of the code, and even then I'm not
> >experienced reading asm generated from Fortran.  The easiest thing to do,
> >probably, is add manual bounds checking or use a compiler option to get
> >that if there is one.
> >
> >
> >-------------------------------------------------------
> >
> >second response from IBM
> >
> >
> >It appears that the application's stack and heap collided:
> >
> >stack top............0x0ff00000
> >stack frame pointer..0x0f91fb90
> >end of heap..........0x0fae6000
> >start of program.....0x00100000
> >
> >Note that the "end of heap" is greater than the current "stack frame
> >pointer" - there's a good chance the application picked up a base address
> >from a corrupted section of memory and attempted to dereference it.
> >
> >The stack and heap can potentially collide if the application allocates a
> >fair amount of data on the stack.  When memory is allocated, the brk()
> >routine assures that the end-of-heap is less than the current stack
> >pointer.   However, the reverse is not true - an application can allocate
> >space on the stack without checking the current sbrk() value.  It can do
> >this via normal function calls, or compiler-generated operations like C's
> >alloca().
> >
> >
> >- --------------------------------------------------------------------
> >To add or remove yourself from this mailing list, use the 'notifyme'
> >command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
>
>
> --
> ---
> Pete Beckman                                Phone: 630-252-9020
> Argonne National Laboratory                 Email: beckman@xxxxxxxxxxx
> MCS-221
> 9700 South Cass Avenue
> Argonne, Illinois 60439-4844, USA
> PGP: 12C0 4357 1197 7BC7 8BBB  B38A 869A ECE1 D7F0 6CD5
