[bgl-discuss] Re: the simple test



>>> 
>>> About the lack of improved performance, this is consistent with what I've
>>> seen in general -- this is two fpos per three loads from memory, and if
>>> the memory is out of l1-cache then you don't get 2x increase for the
>>> quadword load, and essentially the loop time is dominated by memory
>>> references. Try a*x[i]+b, where a and b are in registers, and you can get
>>> almost a 2x speedup.
>>> 
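
For reference, the two loop shapes contrasted in the quoted paragraph
might look roughly like this in C. This is only a sketch: the actual
test loop is not shown in this message, so the function names
(three_loads, scale_shift) and the exact three-load expression are
assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical memory-bound form: three loads (x, y, z) feed two
   floating-point operations (one multiply, one add) per iteration. */
void three_loads(size_t n, const double *x, const double *y,
                 const double *z, double *w)
{
    for (size_t i = 0; i < n; i++)
        w[i] = x[i] * y[i] + z[i];
}

/* Form suggested in the quoted text: a and b are loop-invariant
   scalars that can stay in registers, so each iteration needs only
   one load for the same two floating-point operations. */
void scale_shift(size_t n, double a, double b,
                 const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + b;
}
```

Timing both with and without 440d, on arrays that fit in L1 and on
arrays that do not, would be one way to check the quoted claim.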

I was not saying that 440d is slower for these very simple loops -- I
don't know about that.  For my real subroutine, 440d is still slower.
Since the compiler was claiming that it was not vectorizing the real
subroutine, I decided to try this very simple test and got
conflicting simdization messages and assembler output.

steve

--------------------------------------------------------------------
To add or remove yourself from this mailing list, use the 'notifyme'
command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.