[HN Gopher] What is parallelism, anyhow? (2009)
___________________________________________________________________
What is parallelism, anyhow? (2009)
Author : moonchild
Score : 29 points
Date : 2022-05-08 10:39 UTC (12 hours ago)
(HTM) web link (web.archive.org)
(TXT) w3m dump (web.archive.org)
| whistl034 wrote:
| Is there a font missing or something? When I try to read the
| paper, every instruction dependency is displayed as x ? y, with a
| question mark between each object, even when the author implies
| there is some difference between x ? y and x ? y. I'm sure there
| are supposed to be some other characters, because without them
| the paper makes no sense.
|
| Perhaps it's something that only shows up properly on Windows
| computers?
| rzzzt wrote:
| I thought it was going to be mojibake, but it's not that
| variety, just a regular question mark. One candidate is the
| "precedes" symbol: https://codepoints.net/U+227A
| Jtsummers wrote:
| The precedes symbol is < and the parallel symbol is |.
|
| https://www.cprogramming.com/parallelism.html - another copy
| without this problem.
| bruce343434 wrote:
| You skimmed too fast, from that same article:
|
| > We say that an instruction x precedes an instruction y,
| sometimes denoted x ? y, if x must complete before y can begin.
| In a diagram for the dag, x ? y means that there is a positive-
| length path from x to y. If neither x ? y nor y ? x, we say the
| instructions are in parallel, denoted x ? y. The figure below
| illustrates a multithreaded dag:
| Jtsummers wrote:
| That quote presents exactly the confusion for GP. x ? y is
| described to mean _both_ that x precedes y and that x and y
| are in parallel. It suggests that _something_ got lost in
| translation here.
|
| Non-archive link:
| https://www.cprogramming.com/parallelism.html
|
| x < y (x precedes y) and x | y (x is parallel with y).
| erwincoumans wrote:
| A great comment about conflicts between parts of a system
| fighting for resources (parallelism, memory, caches) is hidden at
| the end of the article:
|
| "@Ilya: You assume that the desired result is to have a 9 month
| process. What if the desired result is to have a newborn
| available for filming a commercial for baby food?
|
| A good real-life example, based on a true story: You want data to
| be sorted. You may start out with a standard sort algorithm.
| Suddenly you find, that it is slow. You find out, that you can
| increase the speed 5 times by replacing quicksort with mergesort.
| Then, you find out, that you can increase the speed 10 times more
| by using an implementation, that knows how to use the CPU cache
| well. Then, you realize, that Mergesort can use more CPUs, so you
| use more CPUs. Suddenly, your application is 100 times slower,
| because the writes to memory from one CPU is flushing the cache
| from the other CPU. This didn't happen on your development
| system, because the dual-CPU system was made in a different
| way... so you realize that you need to make the parallel
| programming differently.
|
| Finally, you find that you need to separate the lists, that you
| merge in your mergesort algorithm, better. After a while, your
| algorithm works well on your development system and on the target
| system.
|
| Finally satisfied, you put the parallel mergesort algorithm into
| all places where it makes sense. After a short while, the test
| department comes back, and reports that the application runs
| slower when your multithreaded sort algorithm is used. The reason
| is, that it now uses 2 CPU caches, leaving less cache memory for
| other parts of the application, slowing down these parts more,
| than the speed benefit of using 2 CPUs for sorting.
|
| You are finally asked to remove the parallel processing from your
| mergesort in order to increase the speed of the system.
|
| posted @ Monday, August 11, 2008 7:21 AM by Lars D"
| hinkley wrote:
| Or somewhere in there you discover that 99.9% of the time you
| only need the top ten elements in the list so sorting the whole
| thing is foolish anyway. Or that you're better off sorting the
| list as the data is collected in the first place because then
| the cost per entry is log(n), and reads swamp writes by a wide
| margin.
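|
| A minimal C++ sketch of that "top ten" point (the function name
| and the cutoff of ten are illustrative, not from the comment):
| std::partial_sort does roughly O(n log k) work to order only the
| k largest elements instead of sorting all n.
|
|     #include <algorithm>
|     #include <cstddef>
|     #include <functional>
|     #include <vector>
|
|     // keep only the ten largest values, in descending order
|     std::vector<int> top_ten(std::vector<int> v) {
|         std::size_t k = std::min<std::size_t>(10, v.size());
|         std::partial_sort(v.begin(), v.begin() + k, v.end(),
|                           std::greater<int>());
|         v.resize(k);
|         return v;
|     }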
| eatonphil wrote:
| Title is "What the $#@! is Parallelism, Anyhow?", which was
| probably easy to guess but not completely clear from "$". I
| assume this was over-eager HN autoformatting. :)
| phkahler wrote:
| I think anyone doing C++ and wanting performance should start
| using OpenMP immediately. It's really easy to use without
| changing much code. In some cases it's just adding some pragmas.
|
| Once you wring out the easy parallel parts you'll probably be
| facing the slow algorithms that have poor time complexity. That's
| when the real work begins. Reducing time complexity often
| requires some good data structures and/or fancy algorithms.
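|
| As a sketch of "just adding some pragmas" (this function and its
| names are illustrative, not from the comment): the loop body is
| untouched, and the reduction clause gives each thread a private
| partial sum that OpenMP combines at the end. Build with -fopenmp.
|
|     #include <vector>
|
|     double dot(const std::vector<double>& a,
|                const std::vector<double>& b) {
|         double sum = 0.0;
|         // the only change needed for parallelism is this pragma
|         #pragma omp parallel for reduction(+:sum)
|         for (long i = 0; i < (long)a.size(); ++i)
|             sum += a[i] * b[i];
|         return sum;
|     }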
| maxwell86 wrote:
| lol please no
|
| If you are using C++, and want to parallelize something, just
| add "std::execution::par" to your algorithms.
|
| Instead of writing "std::for_each(...)" just write
| "std::for_each(std::execution::par, ...)".
|
| That's it. It really is that simple. And with the right
| compilers you can just compile the code to run on FPGAs, GPUs,
| or whatever.
|
| For someone who knows C++, doing that is the lowest barrier to
| entry, and it gets you most of the way there without having to
| learn "some other programming language" like OpenMP (or
| anything else).
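|
| A self-contained sketch of that one-argument change (variable
| names are mine; needs a C++17 toolchain with <execution>, which
| on gcc/clang typically means linking against TBB):
|
|     #include <algorithm>
|     #include <execution>
|     #include <vector>
|
|     int main() {
|         std::vector<double> v(1'000'000, 1.0);
|         // the sequential call, plus the execution policy argument
|         std::for_each(std::execution::par, v.begin(), v.end(),
|                       [](double& x) { x = x * x + 1.0; });
|     }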
| gumby wrote:
| > If you are using C++, and want to parallelize something,
| just add "std::execution::par" to your algorithms.
|
| Do any of the shipping standard libraries actually implement
| execution policies? I only use gcc and clang, so I have to
| resort to TBB to get this capability.
| dragontamer wrote:
| OpenMP can parallelize a for loop like:
|
|     #pragma omp parallel for
|     for (int i = 0; i < 1000000; i++)
|         C[i] = A[i] + B[i];
|
| It's normal C++ and the pragma auto-parallelizes the loop.
| It's actually really easy and convenient.
|
| OpenMP is probably the easiest fork/join model of parallelism
| on any C++ system I've used. It doesn't always get the best
| utilization of your CPU cores, but it's really, really
| simple.
|
| It's the best way to start IMO, far easier than std::thread,
| or switching to functional style for other libraries. Just
| write the same code as before but with a few #pragma omp
| statements here and there.
| maxwell86 wrote:
| > It's normal C++
|
| C++ is an ISO standard, so you can use it _everywhere_: in
| space, in automotive, in aviation, in trains, in medical
| devices, _everywhere_.
|
| OpenMP is not C++, it is a different programming language
| than C++.
|
| OpenMP is not an ISO standard, so you can't use it in _most_
| of the domains in which you can use C++.
|
| Your example:
|
|     #pragma omp parallel for
|     for (int i = 0; i < 1000000; i++)
|         C[i] = A[i] + B[i];
|
| shows how bad OpenMP is.
|
| It does not run in parallel on GPUs or on FPGAs (lacking
| target offload directives), and you can't use it in most
| domains in which you can use C++.
|
| The following is ISO standard C++:
|
|     std::for_each_n(std::execution::par,
|                     std::views::iota(0).begin(), 1000000,
|                     [](int i) { C[i] = A[i] + B[i]; });
|
| It runs in _parallel_ EVERYWHERE: GPUs, CPUs, FPGAs, and it
| is certified for all domains for which C++ is (that is: all
| domains).
|
| Show me how to sort an array in parallel on _ANY_ hardware
| (CPUs, GPUs, FPGAs) with OpenMP. With C++ it is as simple as:
|
|     std::sort(std::execution::par, array.begin(), array.end());
|
| If you have a GPU, this offloads to the GPU. If you have an
| FPGA, this offloads to the FPGA. If you have a CPU with 200
| cores, this uses those 200 cores.
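|
| A self-contained sketch of that sort call (the array contents
| are arbitrary): on plain gcc/clang it fans out across CPU
| threads; whether it offloads to a GPU depends on the toolchain,
| e.g. nvc++ with -stdpar.
|
|     #include <algorithm>
|     #include <execution>
|     #include <random>
|     #include <vector>
|
|     int main() {
|         std::vector<int> array(10'000'000);
|         std::mt19937 rng(42);
|         for (auto& x : array) x = (int)(rng() % 1000000);
|         std::sort(std::execution::par, array.begin(), array.end());
|     }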
|
| There is no need to turn your ISO C++ compliant program or
| libraries into OpenMP. That prevents them from being used in
| many domains in which C++ runs. It also adds an external
| dependency for parallelism, for no good reason.
|
| For any problem that OpenMP can solve, OpenMP is _always_ a
| worse solution than just using strictly ISO standard and
| compliant C++.
|
| OpenMP has completely lost its reason to exist. It's not 1990
| anymore.
| oxff wrote:
| People often get parallelism, concurrency, multi-threading and
| async concepts mixed up in weird ways.
| amelius wrote:
| I think that's because parallelism and concurrency are words
| that were originally very close in meaning and both could have
| been slapped on the same concept.
| marcosdumay wrote:
| Do they come from geometry, or do the geometry terms come
| from them?
|
| In manual work, much as in geometry, they are almost
| antonyms. If people can do a few tasks in parallel, that
| means different people can do them at the same time without
| stepping on each other. But if two people's tasks are
| concurrent, that means only one can happen at a time.
|
| It's the mapping of them into computer terms that confused
| everybody, because concurrency on computers is the opposite
| of what it is for people. At least with the modern meaning of
| "concurrency", because it changed when computers became
| multi-tasking.
| jjtheblunt wrote:
| There was certainly no difference in their use in the late
| 1980s, working full-time on early shared memory
| multiprocessors at a big academic/industrial collaboration.
|
| 20 years later the younger kids with PhDs would distinguish
| them, and I realized the terms had specialized in a way I'd
| not noticed.
| CogitoCogito wrote:
| I think it's best to be aware of the differences, but not
| overly pedantic about it. If you legitimately don't know what
| people mean, then ask, but otherwise just accept that people
| are sometimes a bit unclear about it. I know the difference
| between all these things, but I'll often mix up terminology
| as well.
| paskozdilar wrote:
| Concepts such as "multi-threading" and "async/await" are
| particular implementations of concurrency and, potentially,
| parallelism. They are often used by people who are used to a
| particular implementation and don't see a reason to use any
| more abstract word.
|
| But mathematically speaking, "concurrency" and "parallelism"
| are very much related - by definition, in any concurrent
| algorithm, there are branches of computation that can be safely
| parallelized; and every parallel algorithm is, by definition,
| concurrent.
|
| The difference between the two, I think, is that parallelism
| colloquially implies parallel execution (e.g. OpenMP, Numpy),
| while concurrency does not (e.g. NodeJS) - concurrency is just
| independence of computation.
|
| That's the reason functional languages generally excel at
| concurrency - by having explicit dependencies between
| computational objects, the compiler can just topologically sort
| the dependency graph, and trivially find which branches of
| computation can be parallelized.
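|
| A small sketch of that idea (the graph and task names are made
| up): a level-by-level Kahn's algorithm groups the dependency dag
| into batches with no edges inside a batch, so the members of
| each batch are mutually "in parallel".
|
|     #include <cstdio>
|     #include <vector>
|
|     int main() {
|         // edge u -> v means "u must complete before v can begin"
|         const int n = 5;
|         std::vector<std::vector<int>> next =
|             {{2}, {2, 3}, {4}, {4}, {}};
|
|         std::vector<int> indeg(n, 0);
|         for (const auto& vs : next)
|             for (int v : vs) ++indeg[v];
|
|         std::vector<int> frontier;
|         for (int v = 0; v < n; ++v)
|             if (indeg[v] == 0) frontier.push_back(v);
|
|         for (int level = 0; !frontier.empty(); ++level) {
|             std::printf("level %d (parallel):", level);
|             for (int v : frontier) std::printf(" task%d", v);
|             std::printf("\n");
|             std::vector<int> unlocked;
|             for (int u : frontier)
|                 for (int v : next[u])
|                     if (--indeg[v] == 0) unlocked.push_back(v);
|             frontier = std::move(unlocked);
|         }
|     }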
| User23 wrote:
| As I understand it the unifying concept is nondeterminism.
| It's a logical framework that can make sense of parallelism,
| concurrency, and interrupts.
|
| Nondeterministic doesn't mean unpredictable. For example a
| pure functional language without side effects is free to
| evaluate function parameters in any order and still always
| return the same result.
|
| Other times a nondeterministic algorithm can produce
| different results each execution but still produce a result
| that satisfies some predicate. For example iterating over a
| Go hash makes no promises about the order elements will be
| visited in, but it does promise they will all be visited. So
| if your reduce function is commutative and associative you'll
| get the same result regardless of ordering.
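|
| A C++ analogue of that last point, as a sketch (not the Go
| example itself): std::reduce with a parallel policy may group
| and order the combinations however it likes, so the result is
| only guaranteed stable when the operation is associative and
| commutative, as integer addition is here.
|
|     #include <cstdint>
|     #include <execution>
|     #include <numeric>
|     #include <vector>
|
|     int main() {
|         std::vector<std::int64_t> v(1'000'000, 3);
|         // any combination order the runtime picks gives 3'000'000
|         auto sum = std::reduce(std::execution::par_unseq,
|                                v.begin(), v.end(), std::int64_t{0});
|         (void)sum;
|     }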
| bob1029 wrote:
| Whenever I find myself wrapped around a post on these topics, I
| force myself back to 1 fundamental question:
|
| > Will this action/design increase or decrease the amount of
| latency incurred on the hot path, relative to the physical
| layout of my processor?
|
| So far, keeping an eye on this has guided me well. There are no
| silver bullets either. async/await is elegant, but it incurs
| substantial latency overhead compared to busy waiting over
| batches of work (which itself has obvious caveats).
| agumonkey wrote:
| IMO it's a matter of logical/functional dependency. Parallelism
| implies no deps, therefore full overlap is possible.
| Concurrency less so, so you need joins/rendez-vous.
| [deleted]
| dragontamer wrote:
| I think I finally figured it out recently.
|
| Concurrency is an I/O parallelism technique. One core can make
| 10,000 TCP connections at the same time thanks to async, golang
| lightweight threads and whatnot.
|
| This is useful because many I/O operations are high latency. A
| TCP connection, like HTTPS, has high latency, but the I/O
| device supports many operations at once.
|
| -------
|
| If you are I/O latency bound, you use concurrency to increase
| performance. Even with one CPU, concurrency or I/O parallelism
| helps a lot.
|
| But if you are CPU bound, you need to use totally different
| techniques (parallel processing).
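|
| A rough sketch of hiding I/O latency with concurrency (fetch()
| and the 200 ms delay are stand-ins for a real network call, and
| std::async threads stand in for an async runtime or goroutines):
| the ten waits overlap, so wall time is roughly one latency
| rather than ten.
|
|     #include <chrono>
|     #include <future>
|     #include <thread>
|     #include <vector>
|
|     int fetch(int id) {
|         // pretend this is a high-latency HTTPS request
|         std::this_thread::sleep_for(std::chrono::milliseconds(200));
|         return id;
|     }
|
|     int main() {
|         std::vector<std::future<int>> pending;
|         for (int i = 0; i < 10; ++i)
|             pending.push_back(std::async(std::launch::async, fetch, i));
|         int total = 0;
|         for (auto& f : pending)
|             total += f.get();  // waits complete concurrently
|     }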
| pengaru wrote:
| Coroutines are a form of concurrency, and they _may_ be
| executed in parallel if scheduled on separate cores/threads,
| but usually are just cooperatively scheduled serially.
| Concurrency is not an IO-specific concept.
| dragontamer wrote:
| The issue is that splitting up work into different blocks
| wrecks your L1 cache and locality of data.
|
| So if your code is:
|
|     A(data)
|     B(data)
|     C(data)
|
| Doing A(data1), B(data1), C(data1) then A(data2), B(data2),
| C(data2) is much much faster than A(data1), A(data2),
| B(data1), B(data2), C(data1), C(data2).
|
| In most cases anyway (assuming data1 and data2 have a
| portion that fits somewhere in the cache hierarchy).
|
| So even if you are doing parallel execution, it ends up
| being a lot slower than sequential.
|
| --------
|
| The goal of parallel processing is to actually get faster
| in terms of CPU time. Just grabbing concurrency techniques
| can be counterproductive.
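|
| A tiny sketch of that locality effect in per-element form (A, B,
| C are just arithmetic stand-ins): the fused loop applies all
| three steps while the element is still hot in cache, while the
| separate passes stream the whole array from memory three times.
|
|     #include <vector>
|
|     void fused(std::vector<float>& data) {
|         for (auto& x : data) {
|             x = x * 2.0f;  // A
|             x = x + 1.0f;  // B
|             x = x * x;     // C
|         }
|     }
|
|     void separate_passes(std::vector<float>& data) {
|         for (auto& x : data) x = x * 2.0f;  // A over the whole array
|         for (auto& x : data) x = x + 1.0f;  // B re-reads everything
|         for (auto& x : data) x = x * x;     // C re-reads it again
|     }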
| paskozdilar wrote:
| You're on the right track - the most useful application of
| concurrency nowadays is I/O, because of its orders-of-magnitude
| larger latency than the CPU.
|
| However, concurrency is much more than I/O. Concurrency can be
| best defined as a lack of ordering of computation. For
| example, if you want to compute A, B, and C, and their values
| are independent of each other, the order you compute them in
| does not matter - that is concurrency. If your language has a
| means of expressing "compute these things, I don't care about
| the order", then your language supports concurrency.
|
| I suggest reading Communicating Sequential Processes by Tony
| Hoare. It is a very influential work on concurrency - and the
| proof of its influence is the design of the Go programming
| language.
| Jtsummers wrote:
| The current mainstream distinction seems to be:
|
| Parallelism - _actually_ running (or intended to run) in
| parallel. A lot of computational work is like this.
|
| Concurrency - _may_ run in parallel, but is about design as
| much as, if not more than, execution.
|
| Concurrency is a way (in the current mainstream sense) to
| decompose a program into multiple "processes" (generic sense)
| much like we decompose programs into functions, procedures,
| and objects in other languages. I think it was Rob Pike's
| talk "Concurrency is not parallelism" which pushed this
| distinction further into the mainstream.
|
| https://go.dev/blog/waza-talk - Concurrency is not
| parallelism, Rob Pike
|
| Go, Erlang (later Elixir), and Clojure in particular seem to
| have pushed the idea of using concurrency (above sense) in
| the design of programs into the mainstream. Specifically
| bringing concurrency into "the small". Concurrency is an
| innate quality of networked programs, but _within_ a program
| people often write sequential (non-concurrent) code and only
| add concurrency elements to access parallelism for performance
| (rather than adding them because they clarify the code or its
| design). But many problems are neatly decomposed into
| well-performing concurrent designs that _may_ get a
| performance boost if you have multiple cores available, but
| that isn't (strictly) the point of using concurrency.
___________________________________________________________________
(page generated 2022-05-08 23:02 UTC)