Closed
Description
Bugzilla Link | 33674 |
Version | trunk |
OS | All |
CC | @topperc,@chriselrod,@hfinkel,@RKSimon,@ZviRackover |
Extended Description
This is the simplified C code:
if (B[i] > 1)
Sum += A[i];
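For context, here is a minimal sketch of a complete loop around the fragment above. The function name `conditional_sum`, its signature, and the `n` bound are assumptions for illustration only; the original report shows just the loop body.

```c
#include <stddef.h>

/* Hypothetical wrapper around the reported loop body.
   Adds A[i] to the running sum only where B[i] > 1. */
int conditional_sum(const int *A, const int *B, size_t n)
{
    int Sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (B[i] > 1)
            Sum += A[i];
    }
    return Sum;
}
```

Because A[i] is read only under the condition, a vectorized version of this loop needs a masked load of A, which is where the pattern in this report comes from.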
The corresponding vectorized LLVM IR:
%8 = load <16 x i32>, <16 x i32>* %7, align 4, !dbg !27, !tbaa !30
%9 = icmp sgt <16 x i32> %8, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%10 = getelementptr inbounds i32, i32* %0, i64 %4
%11 = bitcast i32* %10 to <16 x i32>*
%12 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %11, i32 4, <16 x i1> %9, <16 x i32> undef)
%13 = select <16 x i1> %9, <16 x i32> %12, <16 x i32> zeroinitializer
%14 = add nsw <16 x i32> %5, %13
This IR generates the following sequence, in which the zero-masked vmovdqa32 repeats the zeroing already done by the masked load:
vmovdqu32 zmm2, zmmword ptr [rsi + rax]
vpcmpgtd k1, zmm2, zmm0
vmovdqu32 zmm2 {k1} {z}, zmmword ptr [rdi + rax]
vmovdqa32 zmm2 {k1} {z}, zmm2
vpaddd zmm1, zmm1, zmm2
A better sequence folds each load into its consumer and uses a merge-masked add:
vpcmpd k1, zmm3, ZMMWORD PTR [rsi+rcx*4], 1
vpaddd zmm4{k1}, zmm4, ZMMWORD PTR [rdi+rcx*4]
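The equivalence of the two sequences above can be checked with a lane-wise scalar model. This is only a sketch: the function names, the `LANES` constant, and the junk value standing in for the `undef` passthrough are assumptions made for illustration, and the model ignores overflow, which `add nsw` rules out anyway.

```c
#include <stdint.h>
#include <string.h>

#define LANES 16

/* Models the IR as emitted: masked load, select against zero, full-width add. */
void add_via_select(int32_t acc[LANES], const int32_t a[LANES],
                    const int32_t b[LANES])
{
    for (int i = 0; i < LANES; i++) {
        int mask = b[i] > 1;
        /* Inactive lanes of the masked load are undef; use a junk stand-in. */
        int32_t loaded = mask ? a[i] : 0x5A5A5A5A;
        int32_t selected = mask ? loaded : 0;
        acc[i] += selected;
    }
}

/* Models the proposed sequence: a merge-masked add that leaves
   inactive lanes of the accumulator untouched. */
void add_via_masked_add(int32_t acc[LANES], const int32_t a[LANES],
                        const int32_t b[LANES])
{
    for (int i = 0; i < LANES; i++) {
        if (b[i] > 1)
            acc[i] += a[i];
    }
}

/* Returns 1 if both models produce the same accumulator from the same inputs. */
int sequences_agree(const int32_t init[LANES], const int32_t a[LANES],
                    const int32_t b[LANES])
{
    int32_t x[LANES], y[LANES];
    memcpy(x, init, sizeof x);
    memcpy(y, init, sizeof y);
    add_via_select(x, a, b);
    add_via_masked_add(y, a, b);
    return memcmp(x, y, sizeof x) == 0;
}
```

Adding zero to an inactive lane and leaving that lane unchanged give the same result, so the select and the extra masked move carry no information that the mask on the add does not already provide.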