
Re: [bgl-discuss] Can BGL jobs wind up working with corrupted memory?



What really bothers me is that this `feature' of the BGL kernel seems
to allow incorrect numerical results with no warning.  The user
has to be aware that this problem might be happening and build in
checks, if that is even possible.  How do I know how much to increase
the program's data space by?  How do I know how much of the stack
or heap the compiler is using in each subroutine?  etc.

steve
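
One rough way to put a number on the stack/heap question is to measure
the gap between the top of the heap and the current stack position,
which is the "headroom" idea mentioned further down in this thread.
The sketch below is only an illustration: headroom_bytes() is a made-up
helper name, and it assumes the stack sits above the heap and grows
down toward it, and that sbrk(0) reports the top of the heap.

```c
#define _DEFAULT_SOURCE          /* make sbrk() visible under strict -std modes */
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper (not part of any BG/L or TAU API).
 * Returns the distance in bytes between the current program break and
 * the address of a local variable, which approximates the current stack
 * pointer.  ASSUMPTION: the stack lies above the heap and grows down. */
long long headroom_bytes(void)
{
    volatile char marker;                        /* lives in this stack frame */
    uintptr_t stack_addr = (uintptr_t)&marker;   /* roughly the stack pointer */
    uintptr_t heap_top   = (uintptr_t)sbrk(0);   /* current program break     */

    return (long long)stack_addr - (long long)heap_top;
}
```

Calling this on entry to the routines with the largest local arrays,
and printing the smallest value seen, gives a crude picture of which
routine comes closest to running the stack into the heap.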



>>> Hi,
>>> 
>>> The problem is that the BG/L compute node kernel doesn't check for stack
>>> overflow.  However, I believe that the OS can check for overflow of heap
>>> memory.  In fact, the malloc() routine on BG/L returns 0 if it runs out
>>> of memory.
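
Since malloc() does report failure, one check that can be built in is
to test every allocation instead of assuming it worked.  A minimal
sketch follows; checked_malloc() is a hypothetical helper name, not
something provided by the BG/L toolchain.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: allocate nbytes or stop the program with a message.
 * 'what' is a label used only in the error text. */
void *checked_malloc(size_t nbytes, const char *what)
{
    void *p = malloc(nbytes);

    /* malloc(0) may legitimately return NULL, so only treat NULL as an
     * error when a non-zero size was requested. */
    if (p == NULL && nbytes != 0) {
        fprintf(stderr, "out of memory: %lu bytes for %s\n",
                (unsigned long)nbytes, what);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

This only guards the heap side.  It does nothing about a stack that
grows down into memory that malloc() has already handed out.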
>>> 
>>> Fortunately, sbrk() is available in the BG/L glibc,
>>> so we can increase the program's data space to
>>> reduce the possibility of a stack overflow eating into the heap memory.
>>> 
>>> For example, in C code,
>>> 
>>> sbrk( 1024*1024 ); // increments the program's data space by 1M bytes
>>> 
>>> Sorry, I don't know how to call a glibc function such as sbrk() from Fortran.
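
One common way to reach a C library routine from Fortran is a small C
wrapper that takes its arguments by reference.  The sketch below is an
assumption-laden illustration, not a tested BG/L recipe: the name
grow_data_space_ is made up, and it assumes the Fortran compiler
lowercases external names, appends a single underscore, and passes a
default INTEGER as a C int by reference.  Name mangling differs between
compilers and compiler options, so check your compiler's documentation.

```c
#define _DEFAULT_SOURCE          /* make sbrk() visible under strict -std modes */
#include <stdint.h>
#include <unistd.h>

/* Hypothetical wrapper, intended to be called from Fortran as
 *     call grow_data_space(nkb, ierr)
 * ASSUMPTION: trailing-underscore name mangling, INTEGER == C int.
 * nkb  : number of kilobytes to add to the program's data space
 * ierr : set to 0 on success, 1 on failure */
void grow_data_space_(const int *nkb, int *ierr)
{
    void *old;

    if (*nkb < 0) {              /* refuse to shrink the data space */
        *ierr = 1;
        return;
    }
    old = sbrk((intptr_t)(*nkb) * 1024);
    *ierr = (old == (void *)-1) ? 1 : 0;   /* sbrk returns (void *)-1 on error */
}
```

The wrapper would be compiled with the C compiler and linked into the
Fortran executable, with the call made once at program start.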
>>> 
>>> 
>>> I wrote a small C program to check whether sbrk() works.
>>> 
>>> This is not a perfect solution, but it works.
>>> The perfect solution would be to add a stack overflow check to the
>>> BG/L compute node kernel.
>>> 
>>> 
>>> Sorry if I missed the point.
>>> 
>>> 
>>> ------------------------
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>> 
>>> int main(int argc, char* argv[])
>>> {
>>>     unsigned long size, totalsize, s;
>>>     void *p;
>>> 
>>>     printf("sbrk()=%p\n", sbrk(0));    /* current program break */
>>>     if( argc > 1 ) {
>>>         /* optionally grow the data space by argv[1] kilobytes */
>>>         int inc_kb = atoi(argv[1]);
>>>         printf("inc. stack size %dkb\n", inc_kb);
>>>         sbrk(inc_kb*1024);
>>>         printf("sbrk()=%p\n", sbrk(0));
>>>     }
>>>     size = 256*1024*1024;
>>>     totalsize = 0;
>>> 
>>>     /* malloc (and deliberately never free) until it fails, retrying
>>>        with a block size 4x smaller each time, down to 4 KB */
>>>     for( s=size; s>1024; s>>=2 ) {
>>>         for(;;) {
>>>             p = malloc( s );
>>>             if( !p ) break;
>>>             totalsize += s;
>>>         }
>>>     }
>>>     printf("%6.2f MB alloc'ed\n", (double)totalsize/1024.0/1024.0);
>>>     return 0;
>>> }
>>> 
>>> 
>>> # bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc -Wall a.c
>>> # cqsub -t 10 -n 1 a.out
>>> # cat 2958.output
>>> [0]: sbrk()=0x162740
>>> [0]: inc. stack size 1024kb
>>> [0]: sbrk()=0x278000
>>> [0]: 507.94 MB alloc'ed
>>> 
>>> 
>>> 
>>> >Does this help me with my Fortran application?
>>> >
>>> >Even if it does, is IBM being complained to?  I find it completely unacceptable
>>> >if the user has to monitor that the operating system is mismanaging memory.
>>> >
>>> >Thanks
>>> >Steve
>>> >
>>> >  
>>> >
>>> >>>>Hi Pete,
>>> >>>>        
>>> >>>>
>>> >>>>>Last week the TAU folks visited, and we discussed a new memory tool
>>> >>>>>which could give you a "headroom" measurement over the life of your
>>> >>>>>program, essentially looking at the difference between the stack and
>>> >>>>>heap and how much you have left.  You could then see which routines
>>> >>>>>put you closest to overflow.
>>> >>>>>          
>>> >>>>>
>>> >>>>>Such a tool is in the works....
>>> >>>>>
>>> >>>>>-Pete
>>> >>>>>          
>>> >>>>>
>>> >>>>	We've released TAU v2.14.4 with support for memory headroom evaluation options.
>>> >>>>	I've enclosed a description of the various options below.
>>> >>>>	http://www.cs.uoregon.edu/research/paracomp/tau
>>> >>>>	Thanks,
>>> >>>>	- Sameer
>>> >>>>
>>> >>>>        
>>> >>>>
>>> >
>>> >- --------------------------------------------------------------------
>>> >To add or remove yourself from this mailing list, use the 'notifyme'
>>> >command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
>>> >
>>> >  
>>> >
>>> 
>>> 
>>> 
