Newest 'sse4' Questions

1 vote

1 answer

100 views

sse4.2 _mm_cmpistrm/_mm_cmpestrm instruction get wrong result

I want to use the following code to compute the intersection of array a and array b: #include <nmmintrin.h> #include <cstdint> #include <cstdio> void test(uint16_t *a, uint16_t *b) { ...

zelin

21

asked Nov 12, 2024 at 9:58

3 votes

2 answers

153 views

Zero remaining Bytes after first Zero in SSE Register

For this question, I will use the notation 1 for a byte with all ones (0xFF) and 0 for a byte with all zeros. I am looking for a way to zero the remaining bytes in a SSE register after the first zero ...

Crigges

1,243

asked May 24, 2024 at 6:54

2 votes

1 answer

287 views

What does "SSE 4.2 insanity" mean in the "if consteval" proposal paper?

I was reading a C++ paper on if consteval (§3.2), and saw a code showing a constexpr strlen implementation: constexpr size_t strlen(char const* s) { if constexpr (std::is_constant_evaluated()) { ...

Chi_Iroh

1,170

asked Jun 3, 2023 at 11:05

1 vote

0 answers

188 views

The correct way to search for a substring in a string

The main part of my question is: how to deal with cases where a string loaded into an __m128i contains only part of a substring? The requirement: to search for escaped sequences or the '"' (double ...

niXman

71

asked Oct 26, 2022 at 9:42

3 votes

1 answer

1k views

Intrinsic inverse to _mm_movemask_epi8

So first I'll just describe the task: I need to: Compare two __m128i. Somehow do the bitwise and of the result with a certain uint16_t value (probably using _mm_movemask_epi8 first and then just &...

Andrew S.

467

asked Jul 7, 2022 at 13:29

1 vote

0 answers

75 views

Auto-vectorization for hand-unrolled initialized tiled-computation versus simple loop with no initialization

In optimizing the inner-most 4-versus-4 comparison part of an AABB collision detection algorithm, I am stuck at simplifying the code while gaining (or just retaining) performance. Here is the ...

huseyin tugrul buyukisik

11.9k

asked Apr 6, 2022 at 16:35

1 vote

1 answer

277 views

Why does the pseudocode of _mm_insert_ps calculate %8?

In the Intel Intrinsics Guide, the pseudocode for the operation of _mm_insert_ps is defined as follows: FOR j := 0 to 3 i := j*32 IF imm8[j%8] dst[i+31:i] := 0 ELSE ...

Brotcrunsher

2,264

asked Jan 28, 2022 at 11:29

3 votes

1 answer

236 views

Is there a way to cast integers to bytes, knowing these ints are in range of bytes. Using SSE?

In an xmm register I have 3 integers with values less than 256. I want to cast these to bytes, and save them to memory. I don't know how to approach it. I was thinking about getting those numbers from ...

thomas113412

79

asked Jan 15, 2022 at 12:20
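One possible approach to the narrowing asked about above, as a minimal sketch rather than the accepted answer: two saturating pack steps reduce 32-bit lanes to bytes using only SSE2. The helper name `store3_bytes` is hypothetical, and the sketch assumes the three values sit in the low three 32-bit lanes of the register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: narrow the three low 32-bit lanes of v (each assumed
   to be in the range 0..255) to bytes and store them to out[0..2]. */
static void store3_bytes(__m128i v, uint8_t out[3])
{
    __m128i w = _mm_packs_epi32(v, v);   /* 32 -> 16 bits, signed saturation */
    __m128i b = _mm_packus_epi16(w, w);  /* 16 -> 8 bits, unsigned saturation */
    uint32_t low = (uint32_t)_mm_cvtsi128_si32(b);  /* bytes 0..3 of b */
    memcpy(out, &low, 3);                /* little-endian: lanes 0, 1, 2 */
}
```

Copying only three bytes through `memcpy` avoids writing past the end of a 3-byte destination, which a 4-byte store would do.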

0 votes

0 answers

334 views

Intel Intrinsics Comparing Two Strings

I am attempting to build a header parser with fast processing. I have two issues; one is that there is a bug in the code below. void parse_with_simd(const char *buffer, const int buffer_len) { const ...

Christopher Clark

35

asked Aug 23, 2021 at 0:19

1 vote

1 answer

183 views

Optimizing find_first_not_of with SSE4.2 or earlier

I am writing a textual packet analyzer for a protocol and in optimizing it I found that a great bottleneck is the find_first_not_of call. In essence, I need to find if a packet is valid if it contains ...

senseiwa

2,499

asked Mar 8, 2021 at 16:34

0 votes

1 answer

522 views

Undefined intel_sse4_strlen

I am running into an issue. My program compiled with no problem, but when I ran it I got an error that I could not figure out. I ran "nm -u 64rm | grep intel" and got the following: ...

inflator

39

asked Jan 8, 2021 at 20:15

4 votes

2 answers

342 views

SSE4.1 unsigned integer comparison with overflow

Is there any way to perform a comparison like C >= (A + B) with SSE2/4.1 instructions, considering that 16-bit unsigned addition (_mm_add_epi16()) can overflow? The code snippet looks like: #define ...

Kaustubh

73

asked Dec 17, 2020 at 15:18
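A minimal SSE2-only sketch of one way to evaluate C >= (A + B) per 16-bit unsigned lane without being fooled by wraparound. It is not taken from the question or its answers, and the function names `ge_sum_epu16` and `ge_sum_mask_u16` are hypothetical. The idea: the wrapping sum equals the saturating sum exactly when no overflow occurred, and `x <= y` (unsigned) holds exactly when the saturating difference `x - y` is zero.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Per-lane mask: 0xFFFF where c >= a + b holds for the true (non-wrapped)
   sum of unsigned 16-bit a and b, 0 otherwise. */
static __m128i ge_sum_epu16(__m128i a, __m128i b, __m128i c)
{
    __m128i wrap = _mm_add_epi16(a, b);    /* sum modulo 65536 */
    __m128i sat  = _mm_adds_epu16(a, b);   /* sum clamped to 65535 */
    /* The two sums agree only when the addition did not overflow. */
    __m128i no_overflow = _mm_cmpeq_epi16(wrap, sat);
    /* Unsigned wrap <= c  <=>  saturating (wrap - c) == 0. */
    __m128i le = _mm_cmpeq_epi16(_mm_subs_epu16(wrap, c),
                                 _mm_setzero_si128());
    return _mm_and_si128(no_overflow, le);
}

/* Convenience wrapper (hypothetical): bit i of the result is set when
   c[i] >= a[i] + b[i]. */
static int ge_sum_mask_u16(const uint16_t a[8], const uint16_t b[8],
                           const uint16_t c[8])
{
    __m128i m = ge_sum_epu16(_mm_loadu_si128((const __m128i *)a),
                             _mm_loadu_si128((const __m128i *)b),
                             _mm_loadu_si128((const __m128i *)c));
    /* Narrow each 16-bit mask lane to one byte, then collect the sign bits. */
    return _mm_movemask_epi8(_mm_packs_epi16(m, _mm_setzero_si128()));
}
```

A lane that overflows is always reported as false, since the true sum then exceeds 65535 and no 16-bit C can be greater than or equal to it.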

5 votes

3 answers

484 views

How to simulate pcmpgtq on sse2?

PCMPGTQ was introduced in SSE4.2, and it provides a signed greater-than comparison for 64-bit numbers that yields a mask. How does one support this functionality on instruction sets predating SSE4.2? ...

Dan Weber

431

asked Dec 6, 2020 at 8:36
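A sketch of one commonly used way to emulate the comparison with SSE2 alone, offered as an illustration rather than as the answer chosen on the question page. The names `cmpgt_epi64_sse2` and `gt64_lanes` are hypothetical. The high 32-bit halves are compared as signed values; when they are equal, the 64-bit difference b - a has a high half of all ones exactly when the low half of a is greater (unsigned) than the low half of b.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Signed 64-bit a > b per lane, using only SSE2. Returns an all-ones
   64-bit lane where the comparison holds and zero elsewhere. */
static __m128i cmpgt_epi64_sse2(__m128i a, __m128i b)
{
    /* Where the 32-bit halves are equal, keep the matching half of b - a.
       In the high half this is all ones exactly when a_lo > b_lo (unsigned). */
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
    /* Or in the signed comparison of the halves (only the high one is used). */
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
    /* Broadcast each high half over its whole 64-bit lane. */
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));
}

/* Convenience wrapper (hypothetical): bit 0 is set if a0 > b0,
   bit 1 is set if a1 > b1. */
static int gt64_lanes(int64_t a0, int64_t a1, int64_t b0, int64_t b1)
{
    __m128i m = cmpgt_epi64_sse2(_mm_set_epi64x(a1, a0),
                                 _mm_set_epi64x(b1, b0));
    return _mm_movemask_pd(_mm_castsi128_pd(m));
}
```

For example, `gt64_lanes(5, 3, 3, 5)` compares 5 > 3 in lane 0 and 3 > 5 in lane 1, so only bit 0 is set.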

0 votes

1 answer

565 views

Is it beneficial to use glibc's strlen()/strcmp() or roll your own based on SSE4.2? [closed]

According to "Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)" (Intel, 2008) [they] added instructions to assist in character searches and comparison on two operands ...

user1016031

123

asked Oct 26, 2020 at 2:24

5 votes

2 answers

7k views

How do I enable SSE4.1 and SSE3 (but NOT AVX) in MSVC

I am trying to enable different SIMD support using MSVC. There is a page that covers enabling some SIMD levels, such as SSE2, AVX, and AVX2: https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?...

knightyangpku

77

asked Sep 24, 2020 at 19:59
