Parallel algorithms implementation it’s a new feature of C++17 standard. Which allow to use simple STL functions in multiple threads.
At the moment (17.07.2018) only one compiler supports Parallel Algorithms. It’s Visual Studio 2017 ( versions 15.3 and higher).
To build C++ projects with parallel STL you need to build with a flag:
/std:c++17
In C++17 we have a new term – Execution policy
It’s the first parameter to parallel functions from STL. It allows you to choose, how can be function executed. To use it you need to include
#include <execution>
Types of execution policy:
std::seq – without using threads or vectorization, sequenced instructions. This is a default value.
std::par – parallel execution with using threads. When using parallel execution policy, it is the programmer’s responsibility to avoid deadlocks.
std::par_unseq – parallel execution using threads and vectorization or migrating across threads(such as by parent-stealing scheduler). Functions with this policy are permitted to execute in unordered fashion in unspecified threads and unsequenced with other threads.
When uncaught exception will be thrown, in all 3 cases will be called std::terminate().
For example, we have a method
1 |
std::sort(a.begin(), a.end()); // std:seq is used by default, will be executed in one thread |
and here we want to execute sort in parallel:
1 |
std::sort(std::execution::par, a.begin(), a.end()); |
In case of std::par policy, the code needs to be checked for parallel execution issues:
1 2 3 4 5 6 7 8 9 10 11 12 |
int main() { int num[] = {1, 2, 3}; std::vector a = { 4, 5, 6 }; std::for_each(std::execution::par, std::begin(num),std::end(num), [&](auto v) { a.push_back(v + 1); // Error: data race }); return 0; } |
1 2 3 4 5 6 7 8 9 10 11 12 |
int main() { int num[] = {1, 2, 3}; std::vector a = { 4, 5, 6 }; std::mutex m; std::for_each(std::execution::par, std::begin(num), std::end(num), [&](auto v) { std::lock_guard lock(m); a.push_back(v + 1); // OK }); } |
In case of std::par_unseq you can’t add lock there because of possible changes in sequence commands. Users are not allowed to allocate or deallocate memory, acquire mutexes, use non-lock-free std::atomic specializations, or perform any other vectorization-unsafe operations.
1 2 3 4 5 |
std::for_each(std::execution::par_unseq, std::begin(num), std::end(num), [&](auto v) { std::lock_guard lock(m); a.push_back(v + 1); }); |
What is the difference between parallel execution and vectorized parallel execution?
Vectorization means that instructions in a method will be moved to get a boost in execution.
Example (parallel execution without vectorization):
Thread 1:
- int x = 10;
- int x += 10;
- call <some method>
Thread 2:
- int x = 10;
- int x += 10;
- call <some method>
Thread 3:
- int x = 10;
- x += 10;
- call <some method>
Example (execution with vectorization):
- int x = 10;
- int x = 10;
- int x = 10;
- x+=10;
- x+=10;
- x+=10;
- call <some method>
- call <some method>
- call <some method>
In C++17 algorithms library there are presented 7 new functions:
- std::reduce()
- std::inclusive_scan()
- std::exclusive_scan()
- std::transform_reduce()
- std::transform_inclusive_scan()
- std::transform_exclusive_scan()
- std::for_each_n()
All algorithms by default use std::seq execution policy. To use std::execution::par you need to have ForwardIterator.
reduce() – it’s unordered std::accumulate() which can be run in parallel. Takes the same parameters and execution policy parameter.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
#include <iostream> #include <execution> int main() { std::vector v(10'000'007, 0.5); { auto t1 = std::chrono::high_resolution_clock::now(); double result = std::accumulate(v.begin(), v.end(), 0.0); auto t2 = std::chrono::high_resolution_clock::now(); std::chrono::duration<double, std::milli> rr = t2 - t1; std::cout << std::fixed << "std::accumulate = " << rr.count() << std::endl; std::cout << "result = " << result << std::endl; } { auto t1 = std::chrono::high_resolution_clock::now(); double result = std::reduce(std::execution::par_unseq, v.begin(), v.end()); auto t2 = std::chrono::high_resolution_clock::now(); std::chrono::duration<double, std::milli> rr = t2 - t1; std::cout << std::fixed << "std::reduce = " << rr.count() << std::endl; std::cout << "result = " << result << std::endl; } return 0; } |
inclusive_scan() – it’s unordered std::partial_sum() which can be run in parallel. Takes the same parameters and execution policy parameter.
By default inclusive_scan() uses std::plus<> operator.
The algorithm uses next formula:
vector[i] = i1 + i2 + … + in
For example:
1 2 3 4 5 6 7 8 9 10 11 |
#include <iostream> #include <execution> #include <iterator> int main() { std::vector a = { 1, 2, 3, 4, 5 }; std::inclusive_scan(a.begin(), a.end(), std::ostream_iterator<int>(std::cout, " ")); // output: 1 3 6 10 15 } |
if you need another operator, you can pass it by param:
1 |
std::inclusive_scan(std::execution::par, a.begin(), a.end(), a.begin(),std::minus()); |
or use a lambda:
1 2 3 |
auto my_operator = [](auto a, auto b) {return (a + 1) * (b + 1); }; std::inclusive_scan(std::execution::par, a.begin(), a.end(), a.begin(), my_operator); |
exclusive_scan() – behaviour is almost same as in inclusive_scan() except elements are in (i – 1) range.
vector[i] = (i1 – 1) + (i2 – 1) + … + (in – 1)
1 2 3 4 5 6 7 8 9 10 |
#include <iostream> #include <execution> #include <iterator> int main() { std::vector a = { 1, 2, 3, 4, 5 }; std::exclusive_scan(std::execution::par, a.begin(), a.end(), a.begin(), 1); // output: 0 1 3 6 10 } |
also as with inclusive_scan() you can use another operator or lambda.
transform_reduce() – it has almost behavior like reduce(), except it has one more operation before accumulating values. By default it uses std::multiplies<> operator.
Example:
1 2 3 4 5 6 7 8 9 10 |
#include <iostream> #include <execution> #include <iterator> int main() { std::vector a = { 1, 2, 3 }; int result = std::transform_reduce(std::execution::par, a.begin(), a.end(), a.begin(), 0); // result: 14 = 1*1 + 2*2 + 3*3 } |
or if you need another operation you can pass them using lambda.
1 2 |
int result = std::transform_reduce(std::execution::par, a.begin(), a.end(), a.begin(), 0, [](auto a, auto b) {return a + b; }, [](auto a, auto b) {return a; }); |
transform_inclusive_scan() – same as inclusive_scan() but with one more lambda to modify object after inclusive_scan.
For example:
1 2 3 4 5 6 |
std::vector a = { 1, 2, 3 }; std::transform_inclusive_scan(std::execution::par, a.begin(), a.end(), a.begin(), [](auto a, auto b) {return a + b; }, [](auto a) {return a * 2; }); // 1*2 + 2*2 + 3*2 // output: 2 6 12 |
transform_exclusive_scan() – same as exclusive_scan() but with one more lambda to modify object after exclusive_scan.
1 2 3 4 5 6 |
std::vector a = { 1, 2, 3 }; std::transform_exclusive_scan(std::execution::par, a.begin(), a.end(), a.begin(), 0, [](auto a, auto b) {return a + b; }, [](auto a) {return a + 1; }); // 0 + (1 + 1) + (2 + 1) // output: 0 2 5 |
for_each_n() – apply function to first n elements in array in parallel.
For example:
1 2 3 4 5 6 |
std::vector a = { 1, 2, 3 }; // add for 2 first element +1 std::for_each_n(std::execution::par, a.begin(), 2, [](auto& a) { a += 1; }); // output: 2 3 3 |
Terrific post but I was wondering if you could write a litte more on this subject?
I’d be very grateful if you could elaborate a
little bit more. Cheers!
okay, thanks for comment)
Pretty nice post. I simply stumbled upon your weblog and wanted to mention that I have truly enjoyed surfing around your blog posts.
After all I’ll be subscribing in your rss feed and I am hoping you write once more soon!