[AVX-512] select + add can produce better sequence #33021

Closed
@delena

Description

| | |
| --- | --- |
| Bugzilla Link | 33674 |
| Version | trunk |
| OS | All |
| CC | @topperc, @chriselrod, @hfinkel, @RKSimon, @ZviRackover |

Extended Description

This is the simplified C code:

    if (B[i] > 1)
      Sum += A[i];
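
For context, here is a minimal self-contained sketch of the kind of loop this fragment likely comes from. The function name conditional_sum, its signature, and the int32_t element type (chosen to match the i32 lanes in the IR) are assumptions for illustration and are not taken from the original report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around the fragment above: sum A[i] over the
   positions where B[i] > 1. Name and signature are illustrative only. */
int32_t conditional_sum(const int32_t *A, const int32_t *B, size_t n)
{
    assert(n == 0 || (A != NULL && B != NULL));
    int32_t Sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (B[i] > 1)      /* vectorizes to the <16 x i1> compare mask */
            Sum += A[i];   /* vectorizes to a masked load of A plus an add */
    }
    return Sum;
}
```

With A = {10, 20, 30, 40} and B = {0, 2, 1, 5}, only indices 1 and 3 pass the condition, so the function returns 60.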

The corresponding vectorized LLVM IR:

    %8 = load <16 x i32>, <16 x i32>* %7, align 4, !dbg !27, !tbaa !30
    %9 = icmp sgt <16 x i32> %8, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
    %10 = getelementptr inbounds i32, i32* %0, i64 %4
    %11 = bitcast i32* %10 to <16 x i32>*
    %12 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %11, i32 4, <16 x i1> %9, <16 x i32> undef)
    %13 = select <16 x i1> %9, <16 x i32> %12, <16 x i32> zeroinitializer
    %14 = add nsw <16 x i32> %5, %13

This IR currently generates the following sequence:

        vmovdqu32       zmm2, zmmword ptr [rsi + rax]
        vpcmpgtd        k1, zmm2, zmm0
        vmovdqu32       zmm2 {k1} {z}, zmmword ptr [rdi + rax]
        vmovdqa32       zmm2 {k1} {z}, zmm2
        vpaddd          zmm1, zmm1, zmm2

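A better sequence folds the load of B into the compare and replaces the masked load, select, and add with a single merge-masked vpaddd:

        vpcmpd    k1, zmm3, ZMMWORD PTR [rsi+rcx*4], 1
        vpaddd    zmm4{k1}, zmm4, ZMMWORD PTR [rdi+rcx*4]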
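
The fold is valid because adding a zeroed lane leaves the accumulator unchanged, which is exactly what a merge-masked add does for inactive lanes. The following is a scalar C model of that argument, not real AVX-512 code: the function names, the LANES constant, and the use of a uint16_t as a stand-in for the k1 mask register are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 16 };  /* one zmm register holds 16 x i32 */

/* Current lowering: zero the inactive lanes (select), then add every lane. */
void select_then_add(int32_t acc[LANES], const int32_t a[LANES], uint16_t k)
{
    for (int i = 0; i < LANES; ++i) {
        int32_t sel = ((k >> i) & 1) ? a[i] : 0;
        acc[i] += sel;
    }
}

/* Proposed lowering: merge-masked add, inactive lanes keep their old value. */
void masked_add(int32_t acc[LANES], const int32_t a[LANES], uint16_t k)
{
    for (int i = 0; i < LANES; ++i) {
        if ((k >> i) & 1)
            acc[i] += a[i];
    }
}

/* Asserts that both forms produce the same accumulator in every lane. */
void check_lowerings_agree(const int32_t acc[LANES], const int32_t a[LANES],
                           uint16_t k)
{
    int32_t x[LANES], y[LANES];
    for (int i = 0; i < LANES; ++i) {
        x[i] = acc[i];
        y[i] = acc[i];
    }
    select_then_add(x, a, k);
    masked_add(y, a, k);
    for (int i = 0; i < LANES; ++i)
        assert(x[i] == y[i]);
}
```

Calling check_lowerings_agree with any accumulator, addend, and mask should pass as long as the additions do not overflow. Both functions read a[i] only in active lanes, mirroring the masked load in the IR.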
