
Suppose many threads try to append elements at the end of a dynamically allocated array. If there is not enough room, the array must be reallocated, but then its address in memory may change, and other threads must be aware of the change. Assume we have the following sequential code:

#include <stdlib.h>

int capacity = 0;
int len = 0;
double *A = NULL;

for (int i = 0; i < n; i++) {
    /* get a stuff array with unpredictably many elements */
    int k = foo(i);     // determine number of elements
    double stuff[k];
    bar(i, k, stuff);   // fill the stuff array
    
    /* Append stuff at the end of A */

    if (len + k > capacity) {  // not enough space in A? Grow the allocation
        capacity = 2*capacity + k;
        A = realloc(A, capacity * sizeof(*A));  // pointer may change
    }
    for (int j = 0; j < k; j++)  // copy
        A[len++] = stuff[j];
}

I would like to parallelize the iterations of the loop. foo and bar are thread-safe. I would like to minimize the use of critical sections; in particular, I would like to leave the copy loop out of any critical section.

A possible solution would consist in using a fair reader/writer lock. Threads would write-acquire the lock to realloc the array, and hold a read lock during the copy loop. If a thread waits on a write lock, it should block others from acquiring a read lock.

What is the best way to do this in pure OpenMP?

  • Hi Charles, do you have further questions? Unfortunately, I don't think OpenMP is a good fit for this type of problem.
    – dreamcrash
    Commented Mar 18, 2021 at 21:30

1 Answer

Without the complete code, it is hard to make a precise assessment. Nonetheless, you can ensure that no race conditions occur during the evaluation and reallocation of the array by encapsulating those operations within a mutual exclusion region.

That can be achieved by adding a barrier before the array's reallocation, followed by a single region (so that only one thread reallocates the array A if needed). The single construct has an implicit barrier at its end, so all threads wait until the array has been reallocated.

#pragma omp parallel
{
    #pragma omp for schedule(static)
    for (int i = 0; i < n; i++) {
        /* get a stuff array with unpredictably many elements */
        int k = foo(i);     // determine number of elements
        double stuff[k];
        bar(i, k, stuff);   // fill the stuff array
    
        #pragma omp barrier
        #pragma omp single
        {
           if (len + k > capacity) 
           {
               capacity = 2*capacity + k;
               A = realloc(A, capacity * sizeof(*A)); 
           }
        }
        int offset;
        #pragma omp atomic capture
        { offset = len; len += k; }   // reserve a private range of A
        for (int j = 0; j < k; j++)   // copy without synchronization
           A[offset + j] = stuff[j];
    } 
    // if n is not evenly divided among the threads
    int total_threads = omp_get_num_threads();
    int rest = n % total_threads;
    if(rest != 0){
       int thread_id = omp_get_thread_num();
       if(thread_id >= rest)
       {
          #pragma omp barrier
          #pragma omp barrier
       }
    }
}

The downside of this approach is that you can only use a static scheduler on the #pragma omp for: because we call #pragma omp barrier inside the loop, we must ensure that all threads execute that barrier the same number of times. Hence, after the loop has finished, we also need to figure out whether the iterations of the for loop divided evenly among the threads (i.e., whether all threads executed the barrier the same number of times).

int total_threads = omp_get_num_threads();
int rest = n % total_threads;
if(rest != 0){
   int thread_id = omp_get_thread_num();
   if(thread_id >= rest)
   {
      #pragma omp barrier
      #pragma omp barrier
   }
}

If it does not, we need to ensure that the threads that ran fewer iterations call the barrier the missing number of times. The barrier is called twice because two barriers are executed per loop iteration: the explicit barrier and the implicit one at the end of the single region.

In the code above, I am assuming the default chunk size for the static distribution (the exact split is implementation-defined). For n = 10 and 4 threads, it would mean that:

thread 0 : works with iterations 0, 1, and 2
thread 1 : works with iterations 3, 4, and 5
thread 2 : works with iterations 6, 7
thread 3 : works with iterations 8, 9

So both threads with IDs 2 and 3 need to call the two barriers one more time. With rest = 2, the condition if(thread_id >= rest) selects exactly those threads.

This approach is error-prone, and one might call it hackish. Alternative approaches that you can try (if they fit your requirements):

  1. Have an array A per thread; each thread can then safely resize its own array without synchronizing with the others. After the parallel region, the initial thread merges all those arrays into the single array A, accordingly;

  2. Find a safe upper bound for the size of A and simply pre-allocate that array up front.

IMO, the last option is by far the best one:

  1. It is more performant because you avoid resizing the array and synchronizing among threads;

  2. The code is much more readable and less error-prone.
