Shared Memory Parallelism with HPX
As we have seen in class, HPX provides a library based mechanism for parallelizing a job for shared-memory execution. One of the principles upon which HPX is based is that the program source code does not need to be changed significantly – the library manages all concurrency. At the same time, there are parameters about the running parallel program that we would like the parallel program to be aware of. For example, we might want to specify the number of threads that a program uses to execute. Or, we might want to partition data (or a loop) ourselves.
Now, with the HPX philosophy, the program itself can be both sequential and parallel – and it should largely be “unaware” of its parallelization. (This should be true for any parallel program – concerns should be well-separated.) Rather than being limited to using command-line arguments as a way to pass information to HPX, you can use environment variables as well. Environment variables also have the advantage (compared to command-line arguments) that they can be accessed from any arbitrary part of the program – including from a library. You can find a list of environment variables that HPX uses on its documentation site. The most important environment variable is HPX_NUM_WORKER_THREADS, which sets the (maximum) number of threads that HPX will use at run-time.
Environment variables
It is important to remember that environment variables don’t actually do anything. Setting the value of HPX_NUM_WORKER_THREADS to some value doesn’t automatically parallelize your program or cause HPX to create that many threads. Rather, it is a variable whose value any program can read – it is up to the program what action to take (if any) in response to the value it reads.
When a program is compiled with HPX, it can be run just as with any other program – execute it on the shell command line.
hpx_info
When running a parallel program, there are several pieces of information we would like to have. For example, in the last assignment, we partitioned our work into pieces and let a thread or task execute each piece (in parallel). For that assignment, we specified the number of threads/tasks to use so that we could experiment with different numbers of threads. In general, however, we don’t want the user to just arbitrarily set the number of threads that a job runs with. We might not want to run with more threads than there are cores available, for example. Or, we might want to set some resource limit on how many threads are used (which might be more or less than the number of cores). That is, we want to let the runtime system have some control over how our parallel program runs (and doing this automatically is in the spirit of HPX).
Consider the following program (work/parallelism/hello_hpx/hpx_info.cpp) that inspects the HPX environment:
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>

#include <cstdlib>
#include <print>
#include <thread>

int main() {
  char const* env_name = "HPX_NUM_WORKER_THREADS";
  char const* env_value = std::getenv(env_name);  // may be nullptr if unset
  std::println("{} = {}", env_name, env_value ? env_value : "");
  std::println("std::hardware_concurrency() = {}", std::thread::hardware_concurrency());
  std::println("hpx::get_num_worker_threads() = {}", hpx::get_num_worker_threads());
  return 0;
}
Now, build and run this program. When I run this in my shell I get the following output (the numbers might vary depending on your hardware):
$ ./build/work/parallelism/hello_hpx/hpx_info --hpx:threads=4
HPX_NUM_WORKER_THREADS =
std::hardware_concurrency() = 8
hpx::get_num_worker_threads() = 4
In this case, HPX_NUM_WORKER_THREADS is not set, so HPX falls back to its defaults and to any command line options. However, we have to be careful how we interpret this output. The first line shows that HPX_NUM_WORKER_THREADS is not set. The second line is obtained with a call to the C++ standard library (outside of HPX) and shows that the available hardware concurrency (number of cores) is equal to eight. The third line was obtained with a call to a function from the HPX library and shows that the number of HPX worker threads is equal to four – the value we requested with --hpx:threads=4, not the number of cores.
Recall from lecture that HPX enables using the fork-join model (amongst other things). But, how many threads actually run at one time with HPX? And how can we tell?
Let’s add a simple function to the above:
int hpx_thread_count() {
int n = 0;
hpx::experimental::run_on_all(
hpx::experimental::reduction_plus(n),
[](int& local_n) {
++local_n;
});
return n;
}
You may wonder why the lambda now has to expect one additional argument (the int& local_n). We need this because reduction_plus will pass a reference to a thread-local variable to the lambda, which ensures there will be no race conditions. Before the function run_on_all returns, it will reduce the values of all thread-local variables and store the result in the initial argument the reduction object was created from.
This function will create the default number of threads and each one of those will increment the variable local_n
by one – the final value of n
(as reduced by summing up all local_n
instances) will be the number of threads that executed in parallel. If we add a call to hpx_thread_count()
to the above program we obtain as output:
$ ./build/work/parallelism/hello_hpx/hpx_info --hpx:threads=4
HPX_NUM_WORKER_THREADS =
std::hardware_concurrency() = 8
hpx::get_num_worker_threads() = 4
hpx_thread_count() = 4
While the system this was run on has 8 cores (as reported by std::thread::hardware_concurrency), we restrict HPX to use only 4 of them (by supplying the command line argument --hpx:threads=4). All HPX programs support a set of predefined command line options; you can see a description of all of them by running your executable with --hpx:help.
HPX is Still Threads
One advantage of using HPX as a programmer is that parallelization can be done inline. When we parallelized programs explicitly using threads and tasks in C++, we had to bundle the work we wanted done into a separate function and launch a thread or task to execute that function. With HPX we can parallelize directly inline using a lambda (in the simplest case). For example, we can create a parallel “Hello World” as follows (see work/parallelism/hello_hpx/hello_hpx_parallel.cpp
):
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>

#include <print>
#include <thread>

int main() {
  hpx::experimental::run_on_all(
    []() {
      std::println("Hello! I am thread {} of {}",
        hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
      std::println("My C++ std::thread id is {}", std::this_thread::get_id());
    });
  return 0;
}
There is (almost) no helper function needed.
But, HPX is not a panacea. Underneath the covers, HPX still runs threads. And it doesn’t do anything special to fix the various issues we have seen with threads. This is the output from one run of the above program:
$ ./build/work/parallelism/hello_hpx/hello_hpx_parallel
Hello! I am thread Hello! I am thread 0 of 4
Hello! I am thread 2 of 4
My C++ std::thread id is 140325565663104
My C++ std::thread id is 140325592336192
3 of 4
My C++ std::thread id is 140325561464768
Hello! I am thread 1 of 4
My C++ std::thread id is 140325569861440
The first thing to notice is that we can get the thread id for the current thread with C++ standard library mechanisms. Problematically though, note that the strings that are being printed are interleaved with each other. We have seen this before in multithreading – when we had race conditions.
Just as with the explicit threading case, HPX can have race conditions when shared data is being written. However, as with the mechanisms for parallelization, the HPX mechanisms for protecting data are expressed using the C++ standard synchronization primitives. We can fix this race by using an hpx::mutex as usual.
In the context of HPX you should always use the HPX facilities for synchronizing access to data instead of the C++ standard ones – i.e., use hpx::mutex instead of std::mutex, hpx::condition_variable instead of std::condition_variable, etc. HPX reimplements all of the concurrency and parallelism facilities that the C++ standard mandates. You can still use the standard’s locking facilities, though: std::lock_guard, std::unique_lock, etc. work well with the HPX synchronization primitives.
Examples
We explore a few variants of HPX parallelization below.
The code for the first example is contained in the file work/parallelism/hello_hpx/hello_hpx_parallel.cpp
and can be compiled to hello_hpx_parallel
. The provided CMake build system has the related directives so you should be able to build it.
All of the following examples can be built from the corresponding .cpp
in a similar manner.
hpx::run_on_all
In the first example, we simply tell HPX to execute the following block in parallel:
hpx::experimental::run_on_all(
[]() {
std::println("Hello! I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
std::println("My C++ std::thread id is {}", std::this_thread::get_id());
});
Experiment with this program a little until you get a feel for how HPX manages threads and for how the HPX concept of a thread maps to what you already know about threads.
For example, try setting different values of HPX_NUM_WORKER_THREADS
in your environment (or experiment with different values for the --hpx:threads
command line option). You should see different numbers of “Hello” messages being printed. What happens when you have more threads than hardware
concurrency? Are the thread IDs unique?
hpx::run_on_all with a given number of threads
The code for this example is contained in the file work/parallelism/hello_hpx/hello_hpx_2_threads.cpp
. In this version, we tell HPX to execute the following block in parallel, using 2 threads.
hpx::experimental::run_on_all(
2,
[]() {
std::println("Hello! I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
std::println("My C++ std::thread id is {}", std::this_thread::get_id());
});
What effect does setting the environment variable HPX_NUM_WORKER_THREADS
have on this program (or alternatively, using different values for the --hpx:threads
command line option)?
Protect sequential code using a critical section
The code for this example is contained in the file work/parallelism/hello_hpx/hello_hpx_critical.cpp
hpx::mutex mtx;
{
std::lock_guard l(mtx);
std::println("Hello! I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
std::println("My C++ std::thread id is {}", std::this_thread::get_id());
}
How many threads get run when you execute this example? Does setting HPX_NUM_WORKER_THREADS
have any effect?
Protect parallel code using a critical section
The code for this example is contained in the file work/parallelism/hello_hpx/hello_hpx_parallel_critical.cpp
hpx::mutex mtx;
hpx::experimental::run_on_all(
num_threads,
[&]() {
std::print("Hello! ");
std::println("I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
{
std::lock_guard l(mtx);
std::print("Hello from inside the critical section! ");
std::println("I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
std::println("My C++ std::thread id is {}", std::this_thread::get_id());
}
});
The program has a parallel region that prints the “Hello World” information as before, as well as a protected critical section. How many threads get run when you execute this example? Does setting HPX_NUM_WORKER_THREADS
have any effect? Is there any difference in how the text for the critical and non-critical portions gets printed?
Exercises
Norm
Previously, we explored parallelizing some functions for computing the Euclidean norm of a vector. Here we are going to look at how to use HPX to parallelize that same computation.
The main source files for these exercises are in the norm subdirectory. There is a separate driver program for each exercise. The actual norm functions that we will be parallelizing are in the (single) file norms.hpp in the include subdirectory.
There are five different functions that you will parallelize using HPX: norm_block_reduction(), norm_block_critical(), norm_cyclic_reduction(), norm_cyclic_critical(), and norm_parfor().
The drivers are named according to the function they test and can be made into the corresponding executable.
The drivers all operate in the same way (dispatching to run
, which is contained in include/norm_utils.hpp
). A driver can accept up to three arguments:
$ ./build/work/parallelism/norm/norm_block_critical [lo] [hi] [max_threads]
where lo
is the smallest size vector to be tested, hi
is the upper bound of vector size to be tested, and max_threads
is the upper bound of threads to test.
As with the norm drivers in More Warmup, these drivers will run test problems
from lo
to hi
, increasing in factors of two at each test, and also test with one to max_threads
, again in powers of two. The programs will print measured GFlops for each test.
Look through the code for run()
in norm_utils.hpp
. How are we setting the number of threads for HPX to use?
Parallelizing loops using HPX
We demonstrated in class how hpx::experimental::for_loop can be used to convert any index-based sequential for loop into an equivalent parallelized version. In short, given a sequential loop:

for (size_t i = 0; i != N; ++i) {
  // any code that relies on i
}

it can be parallelized as:

hpx::experimental::for_loop(
  hpx::execution::par, 0, N,
  [&](size_t i) {
    // any code that relies on i
  });

We have also shown that thread-safe reductions can easily be added to this by using one of the HPX reduction operations, e.g.:

int n = 0;
hpx::experimental::for_loop(
  hpx::execution::par, 0, N,
  hpx::experimental::reduction_plus(n),
  [&](size_t i, int& local_n) {
    // any code that relies on i
    // local_n refers to a thread-local instance of n
  });

We have not shown in class that it is also possible to supply the number of threads to use to run the parallel for_loop:

int n = 0;
hpx::experimental::for_loop(
  hpx::execution::experimental::with_processing_units_count(
    hpx::execution::par, num_threads),
  0, N,
  hpx::experimental::reduction_plus(n),
  [&](size_t i, int& local_n) {
    // any code that relies on i
    // local_n refers to a thread-local instance of n
  });
Please feel free to use this scheme to control the number of threads used in your implementation of the norm() functions below.

Another feature of HPX not presented in class may come in handy whenever you need to increment the loop counter by a stride larger than one. In this case, hpx::experimental::for_loop_strided may be useful. This facility additionally accepts a stride argument; it is otherwise identical to for_loop:

hpx::experimental::for_loop_strided(
  hpx::execution::par, 0, N, stride,  // stride is an integer larger than one
  [&](size_t i) {
    // any code that relies on i
  });
Using HPX, parallelize the norm() functions in the file work/include/norms.hpp, as indicated below:

- norm_block_reduction() – the function should be parallelized in block fashion according to the number of available HPX threads. The update to the shared variable sum should be protected using an HPX reduction directive. Some of the scaffolding for this function has been provided. The basic ideas should be familiar: determining the block size, assigning work to each worker. The difference is that rather than specifying how to partition the problem and then determining the number of threads to use, we are specifying how many threads to use and then determining how to partition the problem. Also, we don’t need to create a helper function.
- norm_block_critical() – the function should be parallelized in block fashion according to the number of available HPX threads. The update to the shared variable sum should be protected using an HPX critical section.
- norm_cyclic_reduction() – the function should be parallelized in cyclic fashion according to the number of available HPX threads. The update to the shared variable sum should be protected using an HPX reduction directive.
- norm_cyclic_critical() – the function should be parallelized in cyclic fashion according to the number of available HPX threads. The update to the shared variable sum should be protected using an HPX critical section.

These functions are tested by the program work/parallelism/norm/norms_test.cpp.
Answer the following questions in results/answers.md:

- Which version of norm above provides the best parallel performance? How do the results compare to one of the parallelized versions of norm from More Warmup? Justify why one is faster than the other. You can select any one of the parallelized versions of norm from More Warmup.
- Which version of norm above provides the best parallel performance for larger problems (i.e., problems at the top end of the default sizes in the drivers or larger)? How do the results compare to one of the parallelized versions of norm from More Warmup? Justify why one is faster than the other. You can select any one of the parallelized versions of norm from More Warmup. It can be either a different version or the same version as in the previous question.
- Which version of norm above provides the best parallel performance for small problems (i.e., problems smaller than the low end of the default sizes in the drivers)? How do the results compare to one of the parallelized versions of norm from More Warmup? Justify why one is faster than the other. You can select any one of the parallelized versions of norm from More Warmup. It can be either a different version or the same version as in the previous two questions.
Next up: A Multithreaded Buddhabrot Renderer