[bgl-discuss] 30% faster DGEMM than ESSL DGEMMS
- To: discuss@xxxxxxxxxxxxxxx
- Subject: [bgl-discuss] 30% faster DGEMM than ESSL DGEMMS
- From: Jaewook Shin <jaewook@xxxxxxxxxxx>
- Date: Thu, 18 May 2006 14:39:18 -0500
- Organization: ANL/MCS
Hi all,
Just in case this is useful to someone.
I recently tuned a DGEMM code in C for the Double Hummer on BG/L. It runs 30%
faster than DGEMMS in ESSL. I am sure this is not the best that can be
achieved, and I did not explore many alternatives, but it may serve as a
reference for anyone who is serious about squeezing the last drop of
performance out of the 440.
http://www-unix.mcs.anl.gov/~jaewook/pub/dgemm.pub.c
I would like to hear from anyone who has attained a comparable or higher
speedup on a single processor for square matrix multiplication.
Experimental settings
- 1024x1024 double precision matrices
- manual coding based on the built-in functions to use Double Hummer
- Two-level cache tiling: 256x256 for L3, 32x32 for L1
- Superword-Level Locality: 4x4x4 for i, j, k-loop
- Loop interchange and loop coalescing on L1-tiled k-loop
- Some loop invariant code motion and redundant code elimination
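
For readers who have not seen two-level cache tiling before, here is a minimal
sketch in portable C. It is not the code at the URL above: it uses no Double
Hummer built-ins and no 4x4x4 register blocking, and the row-major layout, the
loop order, and the names dgemm_tiled, dgemm_naive, block_kernel, and min_sz
are assumptions made for illustration. Only the tile sizes (256 and 32) come
from the settings listed above. Tile bounds are clamped, so the sketch also
accepts sizes that are not multiples of the tile sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Tile sizes from the settings above; everything else is an assumption. */
enum { TILE_L3 = 256, TILE_L1 = 32 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference: C += A * B, row-major n x n, plain i-k-j loop nest. */
void dgemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Multiply one block; the bounds are already clamped to the matrix size. */
static void block_kernel(size_t n, const double *A, const double *B, double *C,
                         size_t i0, size_t i1, size_t k0, size_t k1,
                         size_t j0, size_t j1)
{
    for (size_t i = i0; i < i1; i++)
        for (size_t k = k0; k < k1; k++) {
            double a = A[i * n + k];  /* invariant in the j-loop, hoisted */
            for (size_t j = j0; j < j1; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

/* C += A * B with two tiling levels: outer tiles for L3, inner for L1. */
void dgemm_tiled(size_t n, const double *A, const double *B, double *C)
{
    assert(TILE_L3 % TILE_L1 == 0);
    for (size_t ii = 0; ii < n; ii += TILE_L3)
        for (size_t kk = 0; kk < n; kk += TILE_L3)
            for (size_t jj = 0; jj < n; jj += TILE_L3) {
                size_t ie = min_sz(ii + TILE_L3, n);
                size_t ke = min_sz(kk + TILE_L3, n);
                size_t je = min_sz(jj + TILE_L3, n);
                for (size_t i = ii; i < ie; i += TILE_L1)
                    for (size_t k = kk; k < ke; k += TILE_L1)
                        for (size_t j = jj; j < je; j += TILE_L1)
                            block_kernel(n, A, B, C,
                                         i, min_sz(i + TILE_L1, ie),
                                         k, min_sz(k + TILE_L1, ke),
                                         j, min_sz(j + TILE_L1, je));
            }
}
```

Both functions accumulate into C, so the caller is expected to zero C first
when a plain product is wanted. For each element of C the k-index still runs in
ascending order, so dgemm_tiled sums the terms in the same order as
dgemm_naive.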
The iteration counts are multiples of the tile sizes, so there are no
trailing loops. If the matrix dimensions are not multiples of the tile sizes,
you need trailing loops, which can simply be the original loop. One caveat:
you have to guarantee that the start addresses of the arrays are aligned to
16-byte boundaries.
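
The 16-byte alignment requirement can be met and checked along the following
lines. This is a sketch only, assuming a toolchain that provides C11
aligned_alloc; the helper names is_aligned16 and alloc_matrix16 are
hypothetical and do not appear in dgemm.pub.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Allocate an n x n matrix of doubles starting on a 16-byte boundary.
 * Returns NULL on failure; release the memory with free(). */
double *alloc_matrix16(size_t n)
{
    size_t bytes = n * n * sizeof(double);
    /* Round up: C11 aligned_alloc expects a multiple of the alignment. */
    bytes = (bytes + 15u) & ~(size_t)15u;
    double *m = aligned_alloc(16, bytes);
    assert(m == NULL || is_aligned16(m));
    return m;
}
```

Calling is_aligned16 on each array before entering the tuned kernel makes a
cheap guard: it turns a misaligned input into an immediate, visible failure.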
Compilation command:
> blrts_xlc -I/bgl/BlueLight/ppcfloor/bglsys/include -O5 -qarch=440d
-qtune=440 -qmaxmem=64000 dgemm.pub.c -L/bgl/BlueLight/ppcfloor/bglsys/lib
-lrts.rts -ldevices.rts -o dgemm.pub.exe -DPRINT -lessln
-L/soft/tools/essl-rev1/ -lxlfmath -L/opt/ibmmath/lib -I/opt/ibmmath/include
-L/opt/ibmcmp/xlf/bg/10.1/blrts_lib -lxlf90 -L/opt/ibmcmp/xlf/bg/10.1/lib
-qlist -qsource
By the way, does anyone know whether DGEMMS in ESSL uses Double Hummer
instructions? I read that a subset of ESSL routines use the SIMD
instructions, but I didn't dare to disassemble the code.
-jaewook-