
Re: [bgl-discuss] Can BGL jobs wind up working with corrupted memory?



> Yes, the BG/L software should automatically report a clear error,
> explaining where the application ran out of memory, and what the call
> stack looked like.

I agree that the system should provide a protection mechanism.

>
> Everything else is just a work-around for an obvious IBM bug.

Especially in computational programs, there are stack-consuming
constructs such as recursive functions.

I ran a program like chkstack.c (attached).
It simply exited with "Job terminated normally"
instead of reporting a stack overflow. That is clearly an abnormal exit
being reported as a normal one.

[0]: stack usage = 533594112 bytes
[0]: stack usage = 533725184 bytes
<May 20 08:53:08> FE_MPI (Info) : Job terminated normally

* 533725184 bytes -> 509MB


BTW, on a regular system chkstack just dies with a segmentation fault
and a message like "stack overflow". On my Linux box, it hung at around
10 MB of stack usage.

>
> That said, the TAU memory tool for measuring "memory headroom" during
> an application is very nice. It works with Fortran, and is more than
> just a tool to address this particular IBM bug. It is really a tool
> for understanding how your memory gets used during your program
> execution, and where you reach memory "low water marks".
>
> For example, you may find out that one routine is consistently a
> memory hog, and that fixing that area could let you crank the problem
> size for all the other parts of the code.
>
> With these small-memory machines, the problem will be around for a
> while. Machines built on the cell processor will also require a lot of
> memory management from both the compiler, run-time system, and
> middleware.
>
> -Pete
>
> At 4:34 PM -0500 5/19/05, Steven Pieper wrote:
>
>> What really bothers me is that this `feature' of the BGL kernel seems
>> to allow for incorrect numerical results with no warning. The user
>> has to be aware that this problem might be happening and build in
>> checks, if that is possible. How do I know how big to increase
>> the program's data space? How do I know how much of the stack
>> or heap the compiler is using in each subroutine? etc.
>>
>> steve
>>
>>
>>
>>>>> Hi,
>>>>>
>>>>> The problem is that the BG/L compute node kernel doesn't check for
>>>>> stack overflow. However, I believe the OS can check for overflow of
>>>>> heap memory. In fact, the malloc() routine on BG/L returns 0 if it
>>>>> runs out of memory.
>>>>>
>>>>> Fortunately, sbrk() is available in the BG/L glibc,
>>>>> so we can increase the program's data space to
>>>>> reduce the possibility of a stack overflow eating into heap
>>>>> memory.
>>>>>
>>>>> For example, in C code,
>>>>>
>>>>> sbrk( 1024*1024 ); // increments the program's data space by 1M bytes
>>>>>
>>>>> Sorry, I don't know how to call a glibc function such as sbrk() from
>>>>> Fortran.
>>>>>
>>>>>
>>>>> I wrote a small C program to check whether sbrk() works.
>>>>>
>>>>> This is not a perfect solution, but it works.
>>>>> The perfect solution is to add a stack overflow check to
>>>>> the BG/L compute node kernel.
>>>>>
>>>>>
>>>>> Sorry if I missed the point..
>>>>>
>>>>>
>>>>> ------------------------
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> int main(int argc, char* argv[])
>>>>> {
>>>>> unsigned long size,totalsize,s;
>>>>> void *p;
>>>>> printf("sbrk()=%p\n", sbrk(0));
>>>>> if( argc > 1 ) {
>>>>> int inc_kb;
>>>>> inc_kb = atoi(argv[1]);
>>>>> printf("inc. stack size %dkb\n", inc_kb);
>>>>> sbrk(inc_kb*1024);
>>>>> printf("sbrk()=%p\n", sbrk(0));
>>>>> }
>>>>> size = 256*1024*1024;
>>>>> totalsize = 0;
>>>>>
>>>>> for(s=size ; s>1024; s>>=2 ) {
>>>>> for(;;) {
>>>>> p = malloc( s );
>>>>> if( !p ) break;
>>>>> totalsize += s;
>>>>> }
>>>>> }
>>>>> printf("%6.2f MB alloc'ed\n", (double)totalsize/1024.0/1024.0);
>>>>> return 0;
>>>>> }
>>>>>
>>>>>
>>>>> # bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc
>>>>> -Wall a.c
>>>>> # cqsub -t 10 -n 1 a.out
>>>>> # cat 2958.output
>>>>> [0]: sbrk()=0x162740
>>>>> [0]: inc. stack size 1024kb
>>>>> [0]: sbrk()=0x278000
>>>>> [0]: 507.94 MB alloc'ed
>>>>>
>>>>>
>>>>>
>>>>> >Does this help me with my Fortran application?
>>>>> >
>>>>> >Even if it does, is IBM being complained to? I find it completely
>>>>> unacceptable
>>>>> >if the user has to monitor that the operating system is
>>>>> mismanaging memory.
>>>>> >
>>>>> >Thanks
>>>>> >Steve
>>>>> >
>>>>> > >
>>>>> >>>>Hi Pete,
>>>>> >>>> >>>>
>>>>> >>>>>Last week the TAU folks visited, and we discussed a new
>>>>> memory tool
>>>>> >>>>>which could give you a "headroom" measurement over the life
>>>>> of your
>>>>> >>>>>program, essentially looking at the difference between the
>>>>> stack and
>>>>> >>>>>heap and how much you have left. You could then see which
>>>>> routines
>>>>> >>>>>put you closest to overflow.
>>>>> >>>>> >>>>>
>>>>> >>>>>Such a tool is in the works....
>>>>> >>>>>
>>>>> >>>>>-Pete
>>>>> >>>>> >>>>>
>>>>> >>>> We've released TAU v2.14.4 with support for memory headroom
>>>>> evaluation options.
>>>>> >>>> I've enclosed a description of the various options below.
>>>>> >>>> http://www.cs.uoregon.edu/research/paracomp/tau
>>>>> >>>> Thanks,
>>>>> >>>> - Sameer
>>>>> >>>>
>>>>> >>>> >>>>
>>>>> >
>>>>> >-
>>>>> --------------------------------------------------------------------
>>>>> >To add or remove yourself from this mailing list, use the 'notifyme'
>>>>> >command on any BGL machine. To remove: notifyme -n, to add:
>>>>> notifyme -y.
>>>>> >
>>>>> > >
>>>>>
>>>>>
>>>>>
>>
>
>
>

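/* chkstack.c: recurse forever, passing a 128 KB struct by value,
   to force a stack overflow. */
#include <stdio.h>

struct stupid {
  char buf[1024*128];   /* 128 KB copied onto the stack at every call */
};

/* Running total of struct bytes pushed so far (frame overhead not counted). */
static unsigned long stack_usage=0;

static void foo( struct stupid a )
{
   stack_usage += sizeof( struct stupid );
   fprintf(stderr, "stack usage  = %lu bytes\n", stack_usage );
   foo( a );   /* unbounded recursion: a checked stack should fault here */
}

int main(int argc, char *argv[])
{
  struct  stupid a;
  /* sizeof yields size_t; cast to match the %lu conversion */
  fprintf(stderr, "sizeof stupid=%lu\n",  (unsigned long)sizeof(a) );
  foo(a);
  return 0;
}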