Without the complete code, it is hard to make a precise assessment. Nonetheless, you can ensure that no race conditions occur during the evaluation and reallocation of the array by encapsulating those operations in a mutual exclusion region. That can be achieved by adding a #pragma omp barrier before the array's reallocation, followed by a #pragma omp single region (so that only one thread reallocates the array A, if needed). The single construct has an implicit barrier at its end, so all threads wait until the array has been reallocated.
#pragma omp parallel
{
    #pragma omp for schedule(static)
    for (int i = 0; i < n; i++) {
        /* get a stuff array with an unpredictable number of elements */
        int k = foo(i);              // determine the number of elements
        double stuff[k];
        bar(i, k, stuff);            // fill the stuff array
        #pragma omp atomic
        needed += k;                 // 'needed' is a shared int, initialized to 0 (like len and capacity)
        #pragma omp barrier
        #pragma omp single
        {
            if (len + needed > capacity)
            {
                capacity = 2 * capacity + needed;
                A = realloc(A, capacity * sizeof(*A));
            }
            needed = 0;
        } // implicit barrier: every thread waits until A is large enough
        int offset;
        #pragma omp atomic capture
        { offset = len; len += k; }  // reserve a disjoint slice of A
        for (int j = 0; j < k; j++)  // copy into the reserved slice
            A[offset + j] = stuff[j];
    }
    // if n does not divide evenly among the threads, the threads that
    // performed fewer iterations must match the barriers of the others
    int rest = n % total_threads;
    if (rest != 0) {
        int thread_id = omp_get_thread_num();
        if (thread_id >= rest)
        {
            #pragma omp barrier
            #pragma omp barrier
        }
    }
}
The downside of this approach is that you can only use a static scheduler on the #pragma omp for; because we are calling #pragma omp barrier inside the loop, we need to ensure that every thread calls that barrier the same number of times. Hence, after the loop has finished, we also need to figure out whether the number of iterations divides evenly among the threads (i.e., whether all threads have called the barrier the same number of times).
int rest = n % total_threads;
if (rest != 0) {
    int thread_id = omp_get_thread_num();
    if (thread_id >= rest)
    {
        #pragma omp barrier
        #pragma omp barrier
    }
}
If it does not, the threads that performed fewer iterations must call the barrier the missing number of times. The barrier is called twice because two barriers are executed per loop iteration: the explicit #pragma omp barrier and the implicit barrier at the end of the single construct.
In the code above, I am assuming the default chunk size for the static distribution. For n = 10 and 4 threads, it would mean that:
thread 0 : works with iterations 0, 1, and 2
thread 1 : works with iterations 3, 4, and 5
thread 2 : works with iterations 6, 7
thread 3 : works with iterations 8, 9
So threads with IDs 2 and 3 need to call the barrier pair one extra time. Since rest = 2, the test if (thread_id >= rest) selects exactly those two threads.
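For reference, here is a tiny helper (the name iterations_for is mine) that reproduces this assumed division, where the first n % p threads receive one extra iteration; mainstream implementations such as GCC's libgomp split the default static schedule this way, although the OpenMP standard leaves the exact split implementation-defined:

/* Iterations assigned to a thread under the assumed static distribution:
 * the first (n % p) threads receive one extra iteration. */
int iterations_for(int thread_id, int n, int p)
{
    int q = n / p, rest = n % p;
    return q + (thread_id < rest ? 1 : 0); /* threads with id >= rest do one fewer */
}
/* iterations_for(t, 10, 4) yields 3, 3, 2, 2 for t = 0, 1, 2, 3 */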
This approach is error-prone, and one might say it is hackish. Alternative approaches that you can try (if they fit your requirements) are:
1. Have an array A per thread: each thread can safely resize its own array without having to synchronize with the others. After the parallel region, the initial thread merges all those per-thread arrays into the single array A (see the first sketch below).
2. Find a safe upper bound for the size of A and just pre-allocate that array once (see the second sketch below).
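A minimal sketch of the first alternative, assuming the foo() and bar() helpers from the snippets above; the names collect, parts, part_len, and local are mine:

#include <stdlib.h>
#include <string.h>
#include <omp.h>

int  foo(int i);                        /* as in the snippets above */
void bar(int i, int k, double *stuff);

double *collect(int n, int *out_len)
{
    int p = omp_get_max_threads();
    double **parts = calloc(p, sizeof *parts);   /* one array per thread */
    int *part_len  = calloc(p, sizeof *part_len);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double *local = NULL;           /* private: no synchronization needed */
        int llen = 0, lcap = 0;

        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) {
            int k = foo(i);
            double stuff[k];
            bar(i, k, stuff);
            if (llen + k > lcap) {      /* resize the private array freely */
                lcap = 2 * lcap + k;
                local = realloc(local, lcap * sizeof *local);
            }
            for (int j = 0; j < k; j++)
                local[llen++] = stuff[j];
        }
        parts[tid]    = local;          /* publish for the merge */
        part_len[tid] = llen;
    }

    /* after the parallel region, the initial thread merges everything into A */
    int len = 0;
    for (int t = 0; t < p; t++)
        len += part_len[t];

    double *A = malloc(len * sizeof *A);
    for (int t = 0, pos = 0; t < p; t++) {
        if (parts[t]) {
            memcpy(A + pos, parts[t], part_len[t] * sizeof *A);
            pos += part_len[t];
            free(parts[t]);
        }
    }
    free(parts);
    free(part_len);

    *out_len = len;
    return A;
}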
IMO, the second option (pre-allocation) is by far the best one:
- it is more performant, because you avoid resizing the array and synchronizing among threads;
- the code is much more readable and less error-prone.
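A minimal sketch of the pre-allocation alternative; MAX_PER_ITER is a hypothetical safe upper bound on how many elements a single iteration can produce, and a single atomic reservation replaces the barriers (each thread claims a disjoint slice of A):

#include <stdlib.h>

int  foo(int i);                        /* as in the snippets above */
void bar(int i, int k, double *stuff);

#define MAX_PER_ITER 64                 /* hypothetical upper bound per iteration */

double *collect_prealloc(int n, int *out_len)
{
    /* pre-allocate the worst case once: no realloc, no barriers */
    double *A = malloc((size_t)n * MAX_PER_ITER * sizeof *A);
    int len = 0;

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        int k = foo(i);
        double stuff[k];
        bar(i, k, stuff);

        int offset;
        #pragma omp atomic capture
        { offset = len; len += k; }     /* reserve a disjoint slice of A */

        for (int j = 0; j < k; j++)
            A[offset + j] = stuff[j];
    }

    *out_len = len;
    return A;
}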