57 questions
1 vote, 1 answer, 100 views
SSE4.2 _mm_cmpistrm/_mm_cmpestrm instructions give wrong results
I want to use the following code to compute the intersection of array a and array b:
#include <nmmintrin.h>
#include <cstdint>
#include <cstdio>
void test(uint16_t *a, uint16_t *b) {
...
3 votes, 2 answers, 153 views
Zero remaining Bytes after first Zero in SSE Register
For this question, I will use the notation 1 for a byte with all ones (0xFF) and 0 for a byte with all zeros.
I am looking for a way to zero the remaining bytes in an SSE register after the first zero ...
2 votes, 1 answer, 287 views
What does "SSE 4.2 insanity" mean in the "if consteval" proposal paper?
I was reading a C++ paper on if consteval (§3.2) and saw code showing a constexpr strlen implementation:
constexpr size_t strlen(char const* s) {
if constexpr (std::is_constant_evaluated()) {
...
1 vote, 0 answers, 188 views
The correct way to search for a substring in a string
The main part of my question is: how do I handle the case where a string loaded into an __m128i contains only part of a substring?
The requirement: to search for escaped sequences or the '"' (double ...
3 votes, 1 answer, 1k views
Intrinsic inverse to _mm_movemask_epi8
So first I'll just describe the task:
I need to:
Compare two __m128i.
Somehow do the bitwise AND of the result with a certain uint16_t value (probably using _mm_movemask_epi8 first and then just &...
1 vote, 0 answers, 75 views
Auto-vectorization for hand-unrolled initialized tiled-computation versus simple loop with no initialization
While optimizing the innermost 4-versus-4 comparison of an AABB collision detection algorithm, I am stuck trying to simplify the code while gaining (or just retaining) performance.
Here is the ...
1 vote, 1 answer, 277 views
Why does the pseudocode of _mm_insert_ps calculate %8?
In the Intel Intrinsics Guide, the pseudocode for the operation of _mm_insert_ps is defined as follows:
FOR j := 0 to 3
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := 0
    ELSE
...
3 votes, 1 answer, 236 views
Is there a way to cast integers to bytes using SSE, knowing these ints are in the range of a byte?
In an xmm register I have 3 integers with values less than 256. I want to cast these to bytes, and save them to memory. I don't know how to approach it.
I was thinking about getting those numbers from ...
0 votes, 0 answers, 334 views
Intel Intrinsics Comparing Two Strings
I am attempting to build a header parser with fast processing. I have two issues; one is that there is a bug in the code below.
void parse_with_simd(const char *buffer, const int buffer_len) {
const ...
1 vote, 1 answer, 183 views
Optimizing find_first_not_of with SSE4.2 or earlier
I am writing a textual packet analyzer for a protocol, and while optimizing it I found that a major bottleneck is the find_first_not_of call.
In essence, I need to find if a packet is valid if it contains ...
0 votes, 1 answer, 522 views
Undefined intel_sse4_strlen
I am running into an issue. My program compiled with no problems, but when I ran it I got an error that I could not figure out:
I did "nm -u 64rm | grep intel" and got the following:
...
4 votes, 2 answers, 342 views
SSE4.1 unsigned integer comparison with overflow
Is there any way to perform a comparison like C >= (A + B) with SSE2/4.1 instructions, given that 16-bit unsigned addition (_mm_add_epi16()) can overflow?
The code snippet looks like this:
#define ...
5 votes, 3 answers, 484 views
How to simulate pcmpgtq on sse2?
PCMPGTQ was introduced in SSE4.2, and it provides a signed greater-than comparison for 64-bit numbers that yields a mask.
How does one support this functionality on instruction sets predating SSE4.2?
...
0 votes, 1 answer, 565 views
Is it beneficial to use glibc's strlen()/strcmp() or roll your own based on SSE4.2? [closed]
According to "Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)" (Intel, 2008) [they] added instructions to assist in character searches and comparison on two operands ...
5 votes, 2 answers, 7k views
How do I enable SSE4.1 and SSE3 (but NOT AVX) in MSVC
I am trying to enable support for different SIMD extensions using MSVC.
There is a page that covers enabling some of them, such as SSE2, AVX, and AVX2:
https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?...