[HN Gopher] OpenMP 6.0
___________________________________________________________________
OpenMP 6.0
Author : mshachkov
Score : 67 points
Date : 2024-11-14 19:00 UTC (4 hours ago)
(HTM) web link (www.openmp.org)
(TXT) w3m dump (www.openmp.org)
| phkahler wrote:
| OpenMP is one of the easiest ways to make existing code run
| across CPU cores. In the simplest cases you add a single
| #pragma to C code and it goes N times faster. That works when
| you're running a function in a loop with no side effects. Some
| examples I've done:
|
| 1) ray tracing. Looping over all the pixels in an image using ray
| tracing to determine the color of each pixel. The algorithm and
| data structures are complex but don't change during the
| rendering. N cores is about N times as fast.
|
| 2) in Solvespace we had a small loop that called a tessellation
| function on a bunch of NURBS surfaces. The function was
| appending triangles to a list, so I made a thread-local list
| for each call and combined them afterwards to avoid writes to a
| shared data structure. Again, N times faster with very little
| effort.
|
| The code is also fine to build single threaded without change if
| you don't have OpenMP. Your compiler will just ignore the
| #pragmas.
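|
| Roughly, that second pattern looks like this (a minimal sketch
| with hypothetical Tri/TriList types and a stand-in
| tessellate(), not the actual Solvespace code):
|
|         #include <omp.h>
|         #include <stdio.h>
|
|         #define NSURF  64
|         #define MAXTRI 1024
|
|         typedef struct { float v[9]; } Tri;
|         typedef struct { Tri tri[MAXTRI]; int n; } TriList;
|
|         /* stand-in for the real tessellation routine */
|         static void tessellate(int surf, TriList *out) {
|             out->tri[out->n].v[0] = (float)surf;
|             out->n++;
|         }
|
|         int main(void) {
|             TriList final = { .n = 0 };
|             #pragma omp parallel
|             {
|                 TriList local = { .n = 0 }; /* one list per thread */
|                 #pragma omp for
|                 for (int s = 0; s < NSURF; s++)
|                     tessellate(s, &local);
|                 /* combine once, at the end, under a lock */
|                 #pragma omp critical
|                 for (int i = 0; i < local.n; i++)
|                     final.tri[final.n++] = local.tri[i];
|             }
|             printf("%d triangles\n", final.n);
|             return 0;
|         }
|
| Built without OpenMP the pragmas are ignored, there's only one
| "local" list, and the same code still works single threaded.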
| ddavis wrote:
| OpenMP is great. I've done something similar to your second
| case (thread local objects that are filled in parallel and
| later combined). In the case of "OpenMP off" (pragmas ignored),
| is it possible to avoid the overhead of the thread local object
| essentially getting copied into the final object (since no
| OpenMP means only a single thread local object)? I avoided this
| by implementing a separate code path, but I'm just wondering if
| there are any tricks I missed that would still allow a single
| code path.
| Jtsummers wrote:
| Give one of the threads (thread ID 0, for instance) special
| privileges. Its list is the one everything else is appended
| to; then there's only concatenation or copying if you have
| more than one thread.
|
| Or, pre-allocate the memory and let each thread write to its
| own subset of the final collection and avoid the combine step
| entirely. This works regardless of the number of threads you
| use so long as you know the maximum amount of memory you
| might need to allocate. If it has no calculable upper bound,
| you will need to use other techniques.
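|
| A minimal sketch of the pre-allocation variant (hypothetical
| names; assumes each iteration produces exactly one result, so
| the upper bound is known and no combine step is needed):
|
|         #include <stdlib.h>
|         #include <stdio.h>
|
|         int main(void) {
|             int n = 1000000;
|             /* upper bound known up front: one result per input */
|             double *out = malloc(n * sizeof(double));
|             #pragma omp parallel for
|             for (int i = 0; i < n; i++)
|                 out[i] = i * 0.5;  /* each thread owns slot i */
|             printf("%f\n", out[n - 1]);
|             free(out);
|             return 0;
|         }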
| Conscat wrote:
| OpenMP was pivotal to my last workplace, but because some
| customers required MSVC, we barely had support for OpenMP 2.0.
| dsp_person wrote:
| I was just googling to see if there's any Emscripten/WASM
| implementation of OpenMP. The emscripten github issue [1] has a
| link to this "simpleomp" [2][3] where
|
| > In ncnn project, we implement a minimal openmp runtime for
| webassembly target
|
| > It only works for #pragma omp parallel for num_threads(N)
|
| [1] https://github.com/emscripten-core/emscripten/issues/13892
|
| [2] https://github.com/Tencent/ncnn/blob/master/src/simpleomp.h
|
| [3]
| https://github.com/Tencent/ncnn/blob/master/src/simpleomp.cp...
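|
| For reference, that one supported form looks like this (a
| minimal generic example, not taken from ncnn itself):
|
|         #include <stdio.h>
|
|         int main(void) {
|             float sum[4] = {0};
|             /* the only pattern simpleomp implements: a parallel
|                for with an explicit thread count */
|             #pragma omp parallel for num_threads(4)
|             for (int t = 0; t < 4; t++)
|                 sum[t] = t * 2.0f;
|             printf("%f\n", sum[3]);
|             return 0;
|         }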
| fxj wrote:
| You can now (since OpenMP 5) use it to write GPU programs.
| Intel's oneAPI uses OpenMP 5.x to write programs for the Intel
| Ponte Vecchio GPUs, which are on par with the Nvidia A100.
|
| https://www.intel.com/content/www/us/en/docs/oneapi/optimiza...
|
| GCC also supports offloading to Nvidia and AMD GPUs:
|
| https://gcc.gnu.org/wiki/Offloading
|
| Here is an example of how you can use OpenMP to run a kernel
| on an Nvidia A100:
|
| https://people.montefiore.uliege.be/geuzaine/INFO0939/notes/...
|         #include <stdlib.h>
|         #include <stdio.h>
|         #include <omp.h>
|
|         void saxpy(int n, float a, float *x, float *y)
|         {
|             double elapsed = -1.0 * omp_get_wtime();
|             // We don't need to map the variable a, as scalars
|             // are firstprivate by default
|             #pragma omp target teams distribute parallel for \
|                 map(to:x[0:n]) map(tofrom:y[0:n])
|             for (int i = 0; i < n; i++) {
|                 y[i] = a * x[i] + y[i];
|             }
|             elapsed += omp_get_wtime();
|             printf("saxpy done in %6.3lf seconds.\n", elapsed);
|         }
|
|         int main() {
|             int n = 2000000;
|             float *x = (float*) malloc(n*sizeof(float));
|             float *y = (float*) malloc(n*sizeof(float));
|             float alpha = 2.0;
|             #pragma omp parallel for
|             for (int i = 0; i < n; i++) {
|                 x[i] = 1;
|                 y[i] = i;
|             }
|             saxpy(n, alpha, x, y);
|             free(x);
|             free(y);
|             return 0;
|         }
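|
| For what it's worth, with a GCC built with the nvptx offload
| compiler, something like "gcc -O2 -fopenmp
| -foffload=nvptx-none saxpy.c" should make the target region
| run on the GPU; the exact flags depend on your toolchain (see
| the Offloading wiki page above).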
___________________________________________________________________
(page generated 2024-11-14 23:01 UTC)