https://www.cppstories.com/2021/par-copyif/

# Implementing Parallel copy_if in C++

Last Update: 22 February 2021

Table of Contents

* The Basics
* Benchmark Code
* The Naive Approach
* Compose Algorithms
  * Benchmark
* Blocks
  * Benchmarks
* Approach with std::future
  * Benchmarks
* Move To Other Thread
* Summary

In a blog post about a dozen ways to filter elements, I mentioned only serial versions of the code. But how about leveraging concurrency? Maybe we can throw in some more threads and async tasks and complete the copy faster?

For example, I have 6 cores on my machine, so it would be nice to see something like a 5x speedup over the sequential copy. In C++17 we have parallel algorithms, so let's try calling std::copy_if with std::execution::par.

If we look at the parallel version of std::copy_if in the MSVC libraries, we can see the following:

```cpp
// VS 2019 16.8
// not parallelized at present, parallelism expected to be feasible in a future release
_REQUIRE_PARALLEL_ITERATOR(_FwdIt1);
_REQUIRE_PARALLEL_ITERATOR(_FwdIt2);
return _STD copy_if(_First, _Last, _Dest, _Pass_fn(_Pred));
```

That's why it's time to write my own version :)

Disclaimer: those are only my experiments (mostly to learn something); if you want to use them in your projects, then please measure, measure and measure :)

## The Basics

In their basic form, C++17's parallel algorithms are very simple to enable. Just pass a std::execution::par and you're done! For example:

```cpp
std::sort(std::execution::par, ...);
std::for_each(std::execution::par, ...);
```

The code invokes a bunch of threads (possibly leveraging some existing thread pool) and kicks off smaller tasks in batches on multiple threads.

We should keep in mind that such an invocation will always generate more work than the sequential version! The cost of preparation, setting up the batches, kicking off the thread pool, and synchronisation adds a visible cost to the whole processing.

Ideally, running things in parallel works best when you have lots of objects and the small tasks are independent. A perfect example:

```cpp
std::vector<double> numbers(SOME_BIG_COUNT);
std::for_each(std::execution::par, begin(numbers), end(numbers),
    [](double& d) {
        d = complexComputation(); // no dependency here
    });
```

You can read my previous experiments with parallel algorithms:

* The Amazing Performance of C++17 Parallel Algorithms, is it Possible? - C++ Stories
  * In that article, I showed some "real" use cases with Fresnel and 3D vectors and got a speedup almost linear to the number of cores in my system.
* How to Boost Performance with Intel Parallel STL and C++17 Parallel Algorithms - C++ Stories

On the other hand, with code like:

```cpp
std::sort(std::execution::par, begin(numbers), end(numbers));
```

you'll see some speedup (when you have a large number of objects), but it won't be linear to the number of cores. This is because sort needs to shuffle things around in the container, and to do that safely, the algorithm has to perform some synchronisation so that other threads see the correct results.

## Benchmark Code

For our tests (apart from simple debug output), I'll be using the following code:

```cpp
const size_t VEC_SIZE = argc > 1 ? atoi(argv[1]) : 10;

std::vector<std::pair<double, double>> testVec(VEC_SIZE);
std::ranges::generate(testVec.begin(), testVec.end(), []() mutable {
    return std::pair{ GenRandom(-10.0, 10.0), GenRandom(-10.0, 10.0) };
});

auto test = [](auto& elem) {
    auto sn = sin(elem.first) * cos(elem.second + 10.0);
    return sn > 0.0;
};
```

In general, I'd like to have a bit more computation than elem % 2 == 0. What's more, each element is 16 bytes, so the objects are also not super small.
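The timings below come from a small RunAndMeasure helper and are compared against a sequential FilterCopyIf baseline. Neither is shown in the article text itself, so here are minimal sketches of what they might look like (the exact versions live in the repo linked at the end). The helper is assumed to print a label, the elapsed milliseconds, and the callable's return value, matching the output format you'll see below:

```cpp
#include <chrono>
#include <iostream>

// sketch of a timing helper: runs the callable and prints
// "<title>: <elapsed> ms, ret: <value>"
template <typename TFunc>
void RunAndMeasure(const char* title, TFunc func) {
    const auto start = std::chrono::steady_clock::now();
    auto ret = func();
    const auto end = std::chrono::steady_clock::now();
    std::cout << title << ": "
              << std::chrono::duration<double, std::milli>(end - start).count()
              << " ms, ret: " << ret << '\n';
}
```

And the sequential baseline is presumably just a thin wrapper over std::copy_if, as in the earlier article about filtering:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// sequential baseline: plain std::copy_if into a back_inserter
template <typename T, typename Pred>
auto FilterCopyIf(const std::vector<T>& vec, Pred p) {
    std::vector<T> out;
    std::copy_if(begin(vec), end(vec), std::back_inserter(out), p);
    return out;
}
```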
## The Naive Approach

Similarly to std::sort, our filter/copy_if function is not trivial to parallelise.

We can think about it in the following way:

* we have to run a predicate function on all elements - in most cases it doesn't depend on other elements, so it's best performed on many threads
* but then we have to put the matching elements in the new container. This is a variable step and requires some synchronisation among threads.

For a start, it's good to implement a brute-force approach and learn from it:

```cpp
template <typename T, typename Pred>
auto FilterCopyIfParNaive(const std::vector<T>& vec, Pred p) {
    std::vector<T> out;
    std::mutex mut;
    std::for_each(std::execution::par, begin(vec), end(vec),
        [&out, &mut, p](auto&& elem) {
            if (p(elem)) {
                std::unique_lock lock(mut);
                out.push_back(elem);
            }
        });
    return out;
}
```

How does it work? We run all steps in parallel, thanks to std::for_each and std::execution::par, but then we need to synchronise when we want to put an element in the output container. As you can notice, all operations that modify the state of the container have to be protected.

Let's see the performance:

```
// 4 cores / 8 threads
benchmark vec size: 100000
transform only seq          : 2.5878 ms, ret: 100000
transform only par          : 1.3734 ms, ret: 100000
FilterCopyIf                : 5.3675 ms, ret: 50203
FilterCopyIfParNaive        : 9.1836 ms, ret: 50203
```

And on my 6 core machine:

```
// 6 cores / 12 threads
benchmark vec size: 100000
transform only seq          : 2.223 ms, ret: 100000
transform only par          : 0.5507 ms, ret: 100000
FilterCopyIf                : 3.851 ms, ret: 50203
FilterCopyIfParNaive        : 10.1295 ms, ret: 50203
```

Oops... only ~2 or 3 times slower :) (I compare FilterCopyIf against FilterCopyIfParNaive.)

For comparison, I also included "transform only seq" and "transform only par", which is just a simple transform run over the collection:

```cpp
std::vector<uint8_t> buffer(testVec.size());

RunAndMeasure("transform only seq          ", [&testVec, &buffer, &test]() {
    std::transform(begin(testVec), end(testVec), begin(buffer), test);
    return buffer.size();
});

RunAndMeasure("transform only par          ", [&testVec, &buffer, &test]() {
    std::transform(std::execution::par, begin(testVec), end(testVec), begin(buffer), test);
    return buffer.size();
});
```

Please notice that buffer is created outside the transform lambda, so we don't pay the price for its initialisation. See how nicely it scales with many cores.

## Compose Algorithms

What else can we do? I suggest the composition of several algorithms:

* run std::transform on all input elements to compute the predicate function, storing the boolean result in a temporary container
* then compute the final position of each matching element - this can be done by invoking std::exclusive_scan
* later, create the final result by merging the computed values (a tiny standalone demo of the scan step follows this list).
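Before diving into the full code, here's a small self-contained demo of the std::exclusive_scan step (values picked just for illustration): each matching element learns how many matches precede it, which is exactly its slot in the output.

```cpp
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // 1 means "the predicate matched for this element"
    std::vector<int> flags { 0, 1, 1, 0, 1 };
    std::vector<int> pos(flags.size());
    // pos[i] = sum of flags[0..i-1]: the number of matches before element i,
    // i.e. the index where a matching element i lands in the output
    std::exclusive_scan(flags.begin(), flags.end(), pos.begin(), 0);
    for (int p : pos)
        std::cout << p << ' '; // prints: 0 0 1 2 2
}
```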
See the illustration:

[par_compos]

Here's the code:

```cpp
template <typename T, typename Pred>
auto FilterCopyIfParCompose(const std::vector<T>& vec, Pred p) {
    std::vector<uint8_t> buffer(vec.size());
    std::vector<uint32_t> idx(vec.size());
    std::transform(std::execution::par, begin(vec), end(vec), begin(buffer),
        [&p](const T& elem) { return p(elem); });

    std::exclusive_scan(std::execution::par, begin(buffer), end(buffer), begin(idx), 0);

    // total matches = matches before the last element + the last element's flag
    std::vector<T> out(idx.back() + buffer.back());
    std::vector<size_t> indexes(vec.size());
    std::iota(indexes.begin(), indexes.end(), 0);

    std::for_each(std::execution::par, begin(indexes), end(indexes),
        [&buffer, &vec, &idx, &out](size_t i) {
            if (buffer[i])
                out[idx[i]] = vec[i];
        });

    return out;
}
```

Output from a sample execution:

```
input  : 0, 1, 1, 0, 1, 1, 0, 1, 1, 1
buffer : 0, 1, 1, 0, 1, 1, 0, 1, 1, 1
idx    : 0, 0, 1, 2, 2, 3, 4, 4, 5, 6
out    : 1, 2, 4, 5, 7, 8, 9
```

Woah, woah... but this is so much code now! Can this even work?

So... yes, it works, and in some cases it will be faster than the sequential version. Here are the main caveats:

* the code adds substantially more work
* we use additional buffers and containers, so we need more memory.

### Benchmark

Let's have a test run. Can this be faster than the sequential version?

```
// 4 cores / 8 threads
benchmark vec size: 100000
transform only seq          : 2.5878 ms, ret: 100000
transform only par          : 1.3734 ms, ret: 100000
FilterCopyIf                : 5.3675 ms, ret: 50203
FilterCopyIfParNaive        : 9.1836 ms, ret: 50203
FilterCopyIfParCompose      : 3.03 ms, ret: 50203
FilterCopyIfParComposeSeq   : 2.3454 ms, ret: 50203
FilterCopyIfParTransformPush: 2.5735 ms, ret: 50203
```

And for 6 cores:

```
// 6 cores / 12 threads
benchmark vec size: 100000
transform only seq          : 2.3379 ms, ret: 100000
transform only par          : 0.5979 ms, ret: 100000
FilterCopyIf                : 3.675 ms, ret: 50203
FilterCopyIfParNaive        : 10.0073 ms, ret: 50203
FilterCopyIfParCompose      : 1.2936 ms, ret: 50203
FilterCopyIfParComposeSeq   : 1.0754 ms, ret: 50203
FilterCopyIfParTransformPush: 2.0103 ms, ret: 50203
```

FilterCopyIfParComposeSeq is a version of FilterCopyIfParCompose with a simple loop to copy the results:

```cpp
for (size_t i = 0; i < vec.size(); ++i)
    if (buffer[i])
        out[idx[i]] = vec[i];
```

And FilterCopyIfParTransformPush is another variation where only the std::transform runs in parallel, and then we use a regular push_back:

```cpp
template <typename T, typename Pred>
auto FilterCopyIfParTransformPush(const std::vector<T>& vec, Pred p) {
    std::vector<uint8_t> buffer(vec.size());
    std::transform(std::execution::par, begin(vec), end(vec), begin(buffer),
        [&p](const T& elem) { return p(elem); });

    std::vector<T> out;

    for (size_t i = 0; i < vec.size(); ++i)
        if (buffer[i])
            out.push_back(vec[i]);

    return out;
}
```

We can see that the compose approach is over 2x faster than the sequential version (for 4 cores) and more than 3x faster for 6 cores! So it's a promising approach.
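Since the buffer in the TransformPush variant already knows which elements match, one cheap tweak worth trying (my own suggestion, not one of the article's measured versions, and the function name is hypothetical) is to count the matches first and reserve the output exactly, so push_back never reallocates:

```cpp
#include <algorithm>
#include <cstdint>
#include <execution>
#include <numeric>
#include <vector>

// hypothetical variation (not benchmarked in the article): size the output
// up-front by counting matches, at the cost of one extra parallel pass
template <typename T, typename Pred>
auto FilterCopyIfParTransformReserve(const std::vector<T>& vec, Pred p) {
    std::vector<uint8_t> buffer(vec.size());
    std::transform(std::execution::par, begin(vec), end(vec), begin(buffer),
        [&p](const T& elem) { return p(elem); });

    // count the matching elements so out can be reserved exactly
    const auto matches = std::reduce(std::execution::par,
                                     begin(buffer), end(buffer), std::size_t{ 0 });
    std::vector<T> out;
    out.reserve(matches);

    for (std::size_t i = 0; i < vec.size(); ++i)
        if (buffer[i])
            out.push_back(vec[i]);

    return out;
}
```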
## Blocks

Let's try another approach. This time we'll split the work into smaller chunks and then call copy_if separately on each:

```cpp
template <typename T, typename Pred>
auto FilterCopyIfParChunks(const std::vector<T>& vec, Pred p) {
    const auto chunks = std::thread::hardware_concurrency();
    const auto chunkLen = vec.size() / chunks;
    std::vector<size_t> indexes(chunks);
    std::iota(indexes.begin(), indexes.end(), 0);

    std::vector<std::vector<T>> copiedChunks(chunks);

    std::for_each(std::execution::par, begin(indexes), end(indexes),
        [&](size_t i) {
            auto startIt = std::next(std::begin(vec), i * chunkLen);
            auto endIt = std::next(startIt, chunkLen);
            std::copy_if(startIt, endIt, std::back_inserter(copiedChunks[i]), p);
        });

    std::vector<T> out;

    for (const auto& part : copiedChunks)
        out.insert(out.end(), part.begin(), part.end());

    // copy the remaining elements when the size doesn't divide evenly
    if (vec.size() % chunks != 0) {
        auto startIt = std::next(std::begin(vec), chunks * chunkLen);
        std::copy_if(startIt, end(vec), std::back_inserter(out), p);
    }

    return out;
}
```

### Benchmarks

```
// 4 cores / 8 threads
benchmark vec size: 100000
transform only seq          : 2.5878 ms, ret: 100000
transform only par          : 1.3734 ms, ret: 100000
FilterCopyIf                : 5.3675 ms, ret: 50203
FilterCopyIfParNaive        : 9.1836 ms, ret: 50203
FilterCopyIfParCompose      : 3.03 ms, ret: 50203
FilterCopyIfParComposeSeq   : 2.3454 ms, ret: 50203
FilterCopyIfParTransformPush: 2.5735 ms, ret: 50203
FilterCopyIfParChunks       : 2.9718 ms, ret: 50203
```

6 cores:

```
// 6 cores / 12 threads
benchmark vec size: 100000
transform only seq          : 2.3379 ms, ret: 100000
transform only par          : 0.5979 ms, ret: 100000
FilterCopyIf                : 3.675 ms, ret: 50203
FilterCopyIfParNaive        : 10.0073 ms, ret: 50203
FilterCopyIfParCompose      : 1.2936 ms, ret: 50203
FilterCopyIfParComposeSeq   : 1.0754 ms, ret: 50203
FilterCopyIfParTransformPush: 2.0103 ms, ret: 50203
FilterCopyIfParChunks       : 2.0974 ms, ret: 50203
```

This version is a bit simpler to implement, but it doesn't bring that much speed. Still, it's faster than the sequential version.
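One caveat with the chunked versions (my note, not raised in the article): std::thread::hardware_concurrency() is allowed to return 0 when the value isn't computable, and for inputs smaller than the chunk count, chunkLen becomes 0. A defensive helper could clamp the chunk count before splitting; ComputeChunkCount is a hypothetical name:

```cpp
#include <cstddef>
#include <thread>

// hypothetical guard: hardware_concurrency() may return 0, and tiny inputs
// would produce zero-length chunks, so clamp the chunk count first
inline std::size_t ComputeChunkCount(std::size_t inputSize) {
    const std::size_t hw = std::thread::hardware_concurrency();
    const std::size_t chunks = hw == 0 ? 1 : hw;
    return inputSize < chunks ? 1 : chunks;
}
```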
## Approach with std::future

The previous version was quite promising, but there's also another way to split our tasks. Rather than relying on std::execution::par, we can launch several std::future objects and then wait for them to finish.

I've found a similar idea in the book "C++ High Performance". Disclaimer: I don't have the book, but its GitHub repo seems to be publicly available: Cpp-High-Performance/copy_if_split_into_two_parts.cpp

```cpp
template <typename T, typename Pred>
auto FilterCopyIfParChunksFuture(const std::vector<T>& vec, Pred p) {
    const auto chunks = std::thread::hardware_concurrency();
    const auto chunkLen = vec.size() / chunks;

    std::vector<std::future<std::vector<T>>> tasks(chunks);

    for (size_t i = 0; i < chunks; ++i) {
        auto startIt = std::next(std::begin(vec), i * chunkLen);
        auto endIt = std::next(startIt, chunkLen);
        tasks[i] = std::async(std::launch::async, [=, &p] {
            std::vector<T> chunkOut;
            std::copy_if(startIt, endIt, std::back_inserter(chunkOut), p);
            return chunkOut;
        });
    }

    std::vector<T> out;

    for (auto& ft : tasks) {
        auto part = ft.get();
        out.insert(out.end(), part.begin(), part.end());
    }

    // remaining part:
    if (vec.size() % chunks != 0) {
        auto startIt = std::next(std::begin(vec), chunks * chunkLen);
        std::copy_if(startIt, end(vec), std::back_inserter(out), p);
    }

    return out;
}
```

### Benchmarks

```
// 4 cores / 8 threads
benchmark vec size: 100000
transform only seq          : 2.5878 ms, ret: 100000
transform only par          : 1.3734 ms, ret: 100000
FilterCopyIf                : 5.3675 ms, ret: 50203
FilterCopyIfParNaive        : 9.1836 ms, ret: 50203
FilterCopyIfParCompose      : 3.03 ms, ret: 50203
FilterCopyIfParComposeSeq   : 2.3454 ms, ret: 50203
FilterCopyIfParTransformPush: 2.5735 ms, ret: 50203
FilterCopyIfParChunks       : 2.9718 ms, ret: 50203
FilterCopyIfParChunksFuture : 2.5091 ms, ret: 50203
```

For 6 cores:

```
// 6 cores / 12 threads
benchmark vec size: 100000
transform only seq          : 2.3379 ms, ret: 100000
transform only par          : 0.5979 ms, ret: 100000
FilterCopyIf                : 3.675 ms, ret: 50203
FilterCopyIfParNaive        : 10.0073 ms, ret: 50203
FilterCopyIfParCompose      : 1.2936 ms, ret: 50203
FilterCopyIfParComposeSeq   : 1.0754 ms, ret: 50203
FilterCopyIfParTransformPush: 2.0103 ms, ret: 50203
FilterCopyIfParChunks       : 2.0974 ms, ret: 50203
FilterCopyIfParChunksFuture : 1.9456 ms, ret: 50203
```

And a bit bigger vector, with all techniques compared:

```
// 4 cores / 8 threads
benchmark vec size: 1000000
transform only seq          : 24.7069 ms, ret: 1000000
transform only par          : 5.9799 ms, ret: 1000000
FilterCopyIf                : 45.2647 ms, ret: 499950
FilterCopyIfParNaive        : 84.0741 ms, ret: 499950
FilterCopyIfParCompose      : 17.1237 ms, ret: 499950
FilterCopyIfParComposeSeq   : 16.7736 ms, ret: 499950
FilterCopyIfParTransformPush: 21.2285 ms, ret: 499950
FilterCopyIfParChunks       : 22.1941 ms, ret: 499950
FilterCopyIfParChunksFuture : 22.4486 ms, ret: 499950
```

And 6 cores:

```
// 6 cores / 12 threads
benchmark vec size: 1000000
transform only seq          : 24.7731 ms, ret: 1000000
transform only par          : 2.8692 ms, ret: 1000000
FilterCopyIf                : 35.6397 ms, ret: 499950 // baseline
FilterCopyIfParNaive        : 102.079 ms, ret: 499950
FilterCopyIfParCompose      : 9.3953 ms, ret: 499950
FilterCopyIfParComposeSeq   : 9.9909 ms, ret: 499950
FilterCopyIfParTransformPush: 13.9003 ms, ret: 499950
FilterCopyIfParChunks       : 13.2688 ms, ret: 499950
FilterCopyIfParChunksFuture : 12.6284 ms, ret: 499950
```

## Move To Other Thread

As you can see, we can make the code faster, but you still need a lot of elements to process (I'd guess 50k...100k at least), and you also have to pay the price for the extra setup and even extra memory use.

So maybe we should stick to the sequential version? As usual, it depends on your environment and requirements. But if you work with multiple threads, it might be wise to rely on the sequential copy_if and keep the other threads busy. For example, you could kick off a small "copy" thread and do some other jobs concurrently while waiting for the copy to finish. There are various scenarios and approaches here.
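A minimal fragment sketching that last idea, assuming the testVec and test objects from the benchmark setup; doOtherWork is a hypothetical stand-in for any work that doesn't depend on the copy:

```cpp
// run the sequential filter on a separate task and overlap it with other work,
// instead of parallelising the filter itself (requires <future>)
auto ft = std::async(std::launch::async, [&testVec, &test] {
    std::vector<std::pair<double, double>> out;
    std::copy_if(begin(testVec), end(testVec), std::back_inserter(out), test);
    return out;
});

doOtherWork();            // hypothetical: anything independent of the copy

auto filtered = ft.get(); // join only when the filtered results are needed
```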
## Summary

It was fun!

This article did not aim to create the best parallel algorithm, but to experiment and learn something. I hope the ideas presented here give you some hints for your projects.

I relied entirely on a solid multithreading framework based on the Standard Library. Still, there are even more options when you use Intel TBB or other highly sophisticated computing frameworks.

As you could see, I created the output vector dynamically. This creates some extra overhead, so in your solution you might be able to limit it. Why not allocate the output with the same size as the input container? Perhaps that can work in your solution? Maybe you could shrink the vector later? Or possibly leverage some smart allocators? Lots of questions and lots of ways to experiment :)

I should also mention that a recent issue of ACCU's Overload magazine has an article, A Case Against Blind Use of C++ Parallel Algorithms, so you might want to have a look at it. It lists five problems you might want to consider before jumping in and putting std::execution::par everywhere.

See my code in a separate GitHub repo: https://github.com/fenbf/articles/blob/master/filterElements/filters.cpp

Thanks to Victor Ciura (@ciura_victor) for feedback and hints for this article.

## Back To You

* I showed 4 different techniques, but maybe you have some more? What would you suggest?
* I tested my code on MSVC, but in GCC, parallel algorithms are available through Intel TBB. Have you tried them there? It would be nice to see results from that compiler/library.

Tags: cpp, cpp17, performance, multithreading