1 vote
1 answer
100 views

SSE4.2 _mm_cmpistrm/_mm_cmpestrm instructions give the wrong result

I want to use the following code to compute the intersection of array a and array b: #include <nmmintrin.h> #include <cstdint> #include <cstdio> void test(uint16_t *a, uint16_t *b) { ...
zelin's user avatar
  • 21
3 votes
2 answers
153 views

Zero remaining Bytes after first Zero in SSE Register

For this question, I will use the notation 1 for a byte with all ones (0xFF) and 0 for a byte with all zeros. I am looking for a way to zero the remaining bytes in an SSE register after the first zero ...
Crigges's user avatar
  • 1,243
2 votes
1 answer
287 views

What does "SSE 4.2 insanity" mean in the "if consteval" proposal paper?

I was reading a C++ paper on if consteval (§3.2), and saw a code showing a constexpr strlen implementation: constexpr size_t strlen(char const* s) { if constexpr (std::is_constant_evaluated()) { ...
Chi_Iroh's user avatar
  • 1,170
1 vote
0 answers
188 views

The correct way to search for a substring in a string

The main part of my question is: how do I deal with cases where a string loaded into an __m128i contains only part of a substring? The requirement: to search for escaped sequences or the '"' (double ...
niXman's user avatar
  • 71
3 votes
1 answer
1k views

Intrinsic inverse to _mm_movemask_epi8

So first I'll just describe the task: I need to: Compare two __m128i. Somehow do the bitwise and of the result with a certain uint16_t value (probably using _mm_movemask_epi8 first and then just &...
Andrew S.'s user avatar
  • 467
1 vote
0 answers
75 views

Auto-vectorization for hand-unrolled initialized tiled-computation versus simple loop with no initialization

While optimizing the innermost 4-versus-4 comparison of an AABB collision detection algorithm, I am stuck trying to simplify the code while gaining (or at least retaining) performance. Here is the ...
huseyin tugrul buyukisik's user avatar
1 vote
1 answer
277 views

Why does the pseudocode of _mm_insert_ps calculate %8?

In the Intel Intrinsics Guide, the pseudocode for the operation of _mm_insert_ps is defined as follows: FOR j := 0 to 3 i := j*32 IF imm8[j%8] dst[i+31:i] := 0 ELSE ...
Brotcrunsher's user avatar
  • 2,264
3 votes
1 answer
236 views

Is there a way to cast integers to bytes, knowing these ints are in range of bytes. Using SSE?

In an xmm register I have 3 integers with values less than 256. I want to cast these to bytes, and save them to memory. I don't know how to approach it. I was thinking about getting those numbers from ...
thomas113412's user avatar
0 votes
0 answers
334 views

Intel Intrinsics Comparing Two Strings

I am attempting to build a header parser with fast processing. I have two issues, one is that there is a bug in the code below. void parse_with_simd(const char *buffer, const int buffer_len) { const ...
Christopher Clark's user avatar
1 vote
1 answer
183 views

Optimizing find_first_not_of with SSE4.2 or earlier

I am writing a textual packet analyzer for a protocol and in optimizing it I found that a great bottleneck is the find_first_not_of call. In essence, I need to find if a packet is valid if it contains ...
senseiwa's user avatar
  • 2,499
0 votes
1 answer
522 views

Undefined intel_sse4_strlen

I am running into an issue. After I compiled my program with no problem, then I ran it and got an error that I could not figure out: I did "nm -u 64rm | grep intel" and got the following: ...
inflator's user avatar
4 votes
2 answers
342 views

SSE4.1 unsigned integer comparison with overflow

Is there any way to perform a comparison like C >= (A + B) with SSE2/4.1 instructions, considering that 16-bit unsigned addition (_mm_add_epi16()) can overflow? The code snippet looks like: #define ...
Kaustubh's user avatar
5 votes
3 answers
484 views

How to simulate pcmpgtq on sse2?

PCMPGTQ was introduced in SSE4.2; it provides a signed greater-than comparison for 64-bit numbers that yields a mask. How does one support this functionality on instruction sets predating SSE4.2? ...
Dan Weber's user avatar
  • 431
0 votes
1 answer
565 views

Is it beneficial to use glibc's strlen()/strcmp() or roll your own based on SSE4.2? [closed]

According to "Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)" (Intel, 2008) [they] added instructions to assist in character searches and comparison on two operands ...
user1016031's user avatar
5 votes
2 answers
7k views

How do I enable SSE4.1 and SSE3 (but NOT AVX) in MSVC

I am trying to enable support for different SIMD instruction sets using MSVC. There is a page that covers enabling some of them, such as SSE2, AVX, and AVX2: https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?...
knightyangpku's user avatar
