
Re: [bgl-discuss] 30% faster DGEMM than ESSL DGEMMS



Jaewook,

I currently have fast matrix-matrix product routines for small
matrix sizes ( N ~ 12--200 ), which form the core of our code.
These were hand-coded by Vernon Austel and John Gunnels at IBM
and use the double-hummer.  They give a significant performance
boost over standard compiled code.

I realize this is in a slightly different parameter space than
you've considered, so this is just fyi.

Paul


On Thu, 18 May 2006, Jaewook Shin wrote:

> Hi all,
>
> Just in case this is useful to someone.
>
> Recently, I tuned a DGEMM code in C for the Double Hummer on BG/L. The code
> achieved a 30% speedup over DGEMMS in ESSL. I am sure this is not the best
> that can be achieved, and I didn't explore many alternatives, but I offer it
> as one reference in case someone is serious about squeezing the last drop of
> performance out of the 440.
>
> http://www-unix.mcs.anl.gov/~jaewook/pub/dgemm.pub.c
>
> I would like to hear from anyone who has attained a comparable or higher
> speedup on a single processor for square matrix multiplication.
>
> Experimental settings
>
> - 1024x1024 double precision matrices
> - manual coding based on the built-in functions to use Double Hummer
> - Two-level cache tiling: 256x256 for L3, 32x32 for L1
> - Superword-Level Locality: 4x4x4 for i, j, k-loop
> - Loop interchange and loop coalescing on L1-tiled k-loop
> - Some loop invariant code motion and redundant code elimination
>
> The iteration counts are multiples of the tile sizes, so there are no
> trailing loops. If the tile sizes do not divide the iteration counts evenly,
> you need trailing loops, which can simply be the original loop. One caveat:
> you have to guarantee that the start addresses of the arrays are aligned to
> 16-byte boundaries.
>
> Compilation command:
>
> > blrts_xlc -I/bgl/BlueLight/ppcfloor/bglsys/include -O5 -qarch=440d
> -qtune=440 -qmaxmem=64000 dgemm.pub.c -L/bgl/BlueLight/ppcfloor/bglsys/lib
> -lrts.rts -ldevices.rts -o dgemm.pub.exe -DPRINT -lessln
> -L/soft/tools/essl-rev1/ -lxlfmath -L/opt/ibmmath/lib -I/opt/ibmmath/include
> -L/opt/ibmcmp/xlf/bg/10.1/blrts_lib -lxlf90 -L/opt/ibmcmp/xlf/bg/10.1/lib
> -qlist -qsource
>
>
> By the way, does anyone know if DGEMMS in ESSL uses Double Hummer
> instructions? I read that a subset of ESSL routines use the SIMD
> instructions, but I didn't dare to disassemble the code.
>
> -jaewook-
>
> - --------------------------------------------------------------------
> To add or remove yourself from this mailing list, use the 'notifyme'
> command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
>
>
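For readers who want to see the loop structure described in the quoted message, here is a minimal sketch in portable C of two-level tiling with 16-byte-aligned arrays. It is not the code from dgemm.pub.c and uses no Double Hummer built-ins. The tile edges 256 and 32 are taken from the post; the function names (alloc_matrix, dgemm_tiled, max_abs_diff_vs_naive), the row-major layout, and the choice to handle leftover iterations by clamping tile bounds instead of writing separate trailing loops are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Tile edges quoted in the post: 256 for the L3 level, 32 for the L1 level. */
enum { TILE_L3 = 256, TILE_L1 = 32 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical helper: allocate an n x n matrix on a 16-byte boundary. */
static double *alloc_matrix(size_t n)
{
    size_t bytes = n * n * sizeof(double);
    bytes = (bytes + 15) / 16 * 16;   /* C11 aligned_alloc wants a multiple of 16 */
    double *p = aligned_alloc(16, bytes);
    assert(p != NULL && (uintptr_t)p % 16 == 0);
    return p;
}

/* C += A * B for row-major n x n matrices, tiled at two levels.
 * Clamping each tile end to the enclosing bound covers sizes that are
 * not multiples of the tile edges (the "trailing" iterations). */
static void dgemm_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i3 = 0; i3 < n; i3 += TILE_L3)
    for (size_t k3 = 0; k3 < n; k3 += TILE_L3)
    for (size_t j3 = 0; j3 < n; j3 += TILE_L3) {
        size_t i3e = min_sz(i3 + TILE_L3, n);
        size_t k3e = min_sz(k3 + TILE_L3, n);
        size_t j3e = min_sz(j3 + TILE_L3, n);
        for (size_t i1 = i3; i1 < i3e; i1 += TILE_L1)
        for (size_t k1 = k3; k1 < k3e; k1 += TILE_L1)
        for (size_t j1 = j3; j1 < j3e; j1 += TILE_L1) {
            size_t i1e = min_sz(i1 + TILE_L1, i3e);
            size_t k1e = min_sz(k1 + TILE_L1, k3e);
            size_t j1e = min_sz(j1 + TILE_L1, j3e);
            for (size_t i = i1; i < i1e; i++)
            for (size_t k = k1; k < k1e; k++) {
                double a = A[i * n + k];          /* invariant in the j-loop */
                for (size_t j = j1; j < j1e; j++)
                    C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}

/* Compare the tiled product with a plain triple loop; returns the largest
 * absolute difference over all entries. */
static double max_abs_diff_vs_naive(size_t n)
{
    double *A = alloc_matrix(n), *B = alloc_matrix(n);
    double *C = alloc_matrix(n), *R = alloc_matrix(n);
    for (size_t x = 0; x < n * n; x++) {
        A[x] = (double)(x % 7) - 3.0;
        B[x] = (double)(x % 5) * 0.5;
        C[x] = 0.0;
        R[x] = 0.0;
    }
    dgemm_tiled(n, A, B, C);
    for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
    for (size_t k = 0; k < n; k++)
        R[i * n + j] += A[i * n + k] * B[k * n + j];
    double worst = 0.0;
    for (size_t x = 0; x < n * n; x++) {
        double d = C[x] - R[x];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    free(A); free(B); free(C); free(R);
    return worst;
}
```

Calling max_abs_diff_vs_naive with a size such as 70 or 300 exercises the clamped tile bounds at the L1 and L3 levels respectively. A tuned version would replace the innermost loops with SIMD code, which is where the 16-byte alignment requirement matters.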
