[bgl-discuss] Re: the simple test
Steve: It might be wise to copy this to the group since I'm not sure my
answer is correct, and we can get other feedback that way. In any case,
here is what I notice.
First, to restate the question (I'll just deal with the first loop; the
others are the same):
Question: The new vectorizer report says that the following loop doesn't
vectorize, and in any case it always runs slower. What is happening?
subroutine test ( a, b, c )
complex*16, dimension(100) :: a, b, c
integer :: i;
(alignment assertions seem optional in this case)
do i = 1, 100
a(i) = b(i)*c(i)
enddo
return
end
When compiled as:
blrts_xlf -S -c -O3 -qnounroll -qarch=440d -qhot -qreport -qlist -qsource -qdebug=diagnostic test.f
(I'm turning off unrolling so the assembler is easier to read)
the vectorizer says:
** test === End of Compilation 1 ===
Examine loop <1> on line 7 ignore( )
non supported vector element types <complex double>: ((complex double
*)((char *).b + -16))->b[].rns2.[$.CIV0 + 1] * ((complex double *)((char
*).c + -16))->c[].rns1.[$.CIV0 + 1]
mem access with nonnatural alignment ((char *).a + -16 + (16)*($.CIV0
+ 1))
not stream processed
(non_simdizable)
My take: First, the loop clearly is vectorizing, regardless of what the
vectorization report says. Below I copy the main loop from the assembler
listing, where you can see the quadword loads and the "complex" and
"replicated" fma instructions. Note also that this could be hand-coded a
little more efficiently.
lfpdx ... (start of main loop)
addi r4,r4,16 (increment address in register by 16 bytes)
lfpdx ...
fxcpmadd ...
addi r5,r5,16
fsmr fp32, fp35
fmr fp0,fp3
fxcxnpma ...
addi r3,r3,16
stfpdx ... (store result)
fmr fp5,fp2
bc BO_dCTR_NZERO,CR0_LT,$-0x2c (i.e., jump up 11 instructions ...)
My _guess_ is that the vectorization report only gives info on what the
back-end simd'izer was able to do. Note that even without -qhot the
double-hummer instructions are generated, so this presumably is happening
at an earlier stage.
About the lack of improved performance, this is consistent with what I've
seen in general -- this is two FP ops per three loads from memory, and if
the data is not in L1 cache then you don't get a 2x increase from the
quadword load, and essentially the loop time is dominated by memory
references. Try a*x[i]+b, where a and b are in registers, and you can get
almost a 2x speedup.
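For comparison, here is a minimal sketch of the a*x[i]+b kind of loop I
mean. The subroutine name axpb and the array names x and y are made up for
illustration and are not part of the test file; the array length just
mirrors the test loops. I have not timed this exact version.

```fortran
! Hypothetical comparison kernel (axpb, x, y are illustrative names only).
! a and b are loop-invariant scalars, so the compiler can keep them in
! registers; each iteration then touches memory only to load x(i) and
! store y(i).
      subroutine axpb ( y, x, a, b )
      real*8, dimension(100) :: y, x
      real*8 :: a, b
      integer :: i
      do i = 1, 100
         y(i) = a*x(i) + b
      enddo
      return
      end
```

It should compile with the same command line as above, and the thing to
look for in the listing is whether the multiply-add operates on two array
elements per instruction.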
This is all speculation. Please chime in if you've noticed anything
inconsistent ...
-andrew
subroutine test ( a, b, c )
real*8, dimension(100) :: r
complex*16, dimension(100) :: a, b, c, d
real*8, dimension(2, 100) :: a2, b2, c2
do i = 1, 100
a(i) = b(i)*c(i)
enddo
return
entry test2 ( a, r, d )
do i = 1, 100
d(i) = r(i)*a(i)
enddo
return
entry test3 ( a2, b2, c2 )
do i = 1, 100
a2(1,i) = b2(1,i)*c2(1,i) - b2(2,i)*c2(2,i)
a2(2,i) = b2(1,i)*c2(2,i) + b2(2,i)*c2(1,i)
enddo
return
end