[HN Gopher] OpenMP 6.0
___________________________________________________________________
OpenMP 6.0
Author : mshachkov
Score : 67 points
Date : 2024-11-14 19:00 UTC (4 hours ago)
(HTM) web link (www.openmp.org)
(TXT) w3m dump (www.openmp.org)
| phkahler wrote:
| OpenMP is one of the easiest ways to make existing code run
| across CPU cores. In the simplest cases you add a single
| #pragma to C code and it goes N times faster. That works when
| you're running a function in a loop with no side effects. Some
| examples I've done:
|
| 1) ray tracing. Looping over all the pixels in an image using ray
| tracing to determine the color of each pixel. The algorithm and
| data structures are complex but don't change during the
| rendering. N cores is about N times as fast.
|
| 2) in Solvespace we had a small loop that called a tessellation
| function on a bunch of NURBS surfaces. The function was
| appending triangles to a list, so I made a thread-local list
| for each call and combined them afterwards to avoid writes to a
| shared data structure. Again, N times faster with very little
| effort.
|
| The code is also fine to build single threaded without change if
| you don't have OpenMP. Your compiler will just ignore the
| #pragmas.
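|
| Roughly, that second pattern looks like this (a minimal sketch
| with hypothetical Tri/TriList types and a stand-in
| tessellate(), not the actual Solvespace code):
|
|         #include <omp.h>
|         #include <stdio.h>
|
|         #define NSURF  64
|         #define MAXTRI 1024
|
|         typedef struct { float v[9]; } Tri;
|         typedef struct { Tri tri[MAXTRI]; int n; } TriList;
|
|         /* stand-in for the real tessellation routine */
|         static void tessellate(int surf, TriList *out) {
|             out->tri[out->n].v[0] = (float)surf;
|             out->n++;
|         }
|
|         int main(void) {
|             TriList final = { .n = 0 };
|             #pragma omp parallel
|             {
|                 TriList local = { .n = 0 }; /* one list per thread */
|                 #pragma omp for
|                 for (int s = 0; s < NSURF; s++)
|                     tessellate(s, &local);
|                 /* combine once, at the end, under a lock */
|                 #pragma omp critical
|                 for (int i = 0; i < local.n; i++)
|                     final.tri[final.n++] = local.tri[i];
|             }
|             printf("%d triangles\n", final.n);
|             return 0;
|         }
|
| Built without OpenMP the pragmas are ignored, there's only one
| "local" list, and the same code still works single threaded.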
| ddavis wrote:
| OpenMP is great. I've done something similar to your second
| case (thread local objects that are filled in parallel and
| later combined). In the case of "OpenMP off" (pragmas ignored),
| is it possible to avoid the overhead of the thread local object
| essentially getting copied into the final object (since no
| OpenMP means only a single thread local object)? I avoided this
| by implementing a separate code path, but I'm just wondering if
| there are any tricks I missed that would still allow a single
| code path.
| Jtsummers wrote:
| Give one of the threads (thread ID 0, for instance) special
| privileges. Its list is the one everything else is appended
| to; then there's only concatenation or copying if you have
| more than one thread.
|
| Or, pre-allocate the memory and let each thread write to its
| own subset of the final collection and avoid the combine step
| entirely. This works regardless of the number of threads you
| use so long as you know the maximum amount of memory you
| might need to allocate. If it has no calculable upper bound,
| you will need to use other techniques.
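|
| A minimal sketch of the pre-allocation variant (hypothetical
| names; assumes each iteration produces exactly one result, so
| the upper bound is known and no combine step is needed):
|
|         #include <stdlib.h>
|         #include <stdio.h>
|
|         int main(void) {
|             int n = 1000000;
|             /* upper bound known up front: one result per input */
|             double *out = malloc(n * sizeof(double));
|             #pragma omp parallel for
|             for (int i = 0; i < n; i++)
|                 out[i] = i * 0.5;  /* each thread owns slot i */
|             printf("%f\n", out[n - 1]);
|             free(out);
|             return 0;
|         }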
| Conscat wrote:
| OpenMP was pivotal to my last workplace, but because some
| customers required MSVC, we barely had support for OpenMP 2.0.
| dsp_person wrote:
| I was just googling to see if there's any Emscripten/WASM
| implementation of OpenMP. The emscripten github issue [1] has a
| link to this "simpleomp" [2][3] where
|
| > In ncnn project, we implement a minimal openmp runtime for
| webassembly target
|
| > It only works for #pragma omp parallel for num_threads(N)
|
| [1] https://github.com/emscripten-core/emscripten/issues/13892
|
| [2] https://github.com/Tencent/ncnn/blob/master/src/simpleomp.h
|
| [3]
| https://github.com/Tencent/ncnn/blob/master/src/simpleomp.cp...
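|
| For reference, that one supported form looks like this (a
| minimal generic example, not taken from ncnn itself):
|
|         #include <stdio.h>
|
|         int main(void) {
|             float sum[4] = {0};
|             /* the only pattern simpleomp implements: a parallel
|                for with an explicit thread count */
|             #pragma omp parallel for num_threads(4)
|             for (int t = 0; t < 4; t++)
|                 sum[t] = t * 2.0f;
|             printf("%f\n", sum[3]);
|             return 0;
|         }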
| fxj wrote:
| You can now (since OpenMP 5) use it to write GPU programs.
| Intel's oneAPI uses OpenMP 5.x to write programs for the Intel
| Ponte Vecchio GPUs, which are on par with the Nvidia A100.
|
| https://www.intel.com/content/www/us/en/docs/oneapi/optimiza...
|
| GCC also supports offloading to Nvidia and AMD GPUs:
|
| https://gcc.gnu.org/wiki/Offloading
|
| Here is an example of how you can use OpenMP to run a kernel
| on an Nvidia A100:
|
| https://people.montefiore.uliege.be/geuzaine/INFO0939/notes/...
|         #include <stdlib.h>
|         #include <stdio.h>
|         #include <omp.h>
|
|         void saxpy(int n, float a, float *x, float *y)
|         {
|             double elapsed = -1.0 * omp_get_wtime();
|             // We don't need to map the variable a, as scalars
|             // are firstprivate by default
|             #pragma omp target teams distribute parallel for \
|                 map(to:x[0:n]) map(tofrom:y[0:n])
|             for (int i = 0; i < n; i++) {
|                 y[i] = a * x[i] + y[i];
|             }
|             elapsed += omp_get_wtime();
|             printf("saxpy done in %6.3lf seconds.\n", elapsed);
|         }
|
|         int main() {
|             int n = 2000000;
|             float *x = (float*) malloc(n*sizeof(float));
|             float *y = (float*) malloc(n*sizeof(float));
|             float alpha = 2.0;
|             #pragma omp parallel for
|             for (int i = 0; i < n; i++) {
|                 x[i] = 1;
|                 y[i] = i;
|             }
|             saxpy(n, alpha, x, y);
|             free(x);
|             free(y);
|             return 0;
|         }
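|
| For what it's worth, with a GCC built with the nvptx offload
| compiler, something like "gcc -O2 -fopenmp
| -foffload=nvptx-none saxpy.c" should make the target region
| run on the GPU; the exact flags depend on your toolchain (see
| the Offloading wiki page above).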
___________________________________________________________________
(page generated 2024-11-14 23:01 UTC)