Shared Memory Parallelism with HPX
As we have seen in class, HPX provides a library based mechanism for parallelizing a job for shared-memory execution. One of the principles upon which HPX is based is that the program source code does not need to be changed significantly – the library manages all concurrency. At the same time, there are parameters about the running parallel program that we would like the parallel program to be aware of. For example, we might want to specify the number of threads that a program uses to execute. Or, we might want to partition data (or a loop) ourselves.
Now, with the HPX philosophy, the program itself can be both sequential and parallel – and it should largely be “unaware” of its parallelization. (This should be true for any parallel program – concerns should be well-separated.) Rather than being limited to using command-line arguments as a way to pass information to HPX, you can use environment variables as well. Environment variables also have the advantage (compared to command-line arguments) that they can be accessed from any arbitrary part of the program – including from a library. You can find a list of environment variables that HPX uses on its documentation site. The most important environment variable is HPX_NUM_WORKER_THREADS, which sets the (maximum) number of threads that HPX will use at run-time.
Environment variables
It is important to remember that environment variables don’t actually do anything. Setting the value of HPX_NUM_WORKER_THREADS to some value doesn’t automatically parallelize your program or cause HPX to create that many threads. Rather, it is a variable whose value any program can read – it is up to the program what action to take (if any) in response to the value it reads.
When a program is compiled with HPX, it can be run just as with any other program – execute it on the shell command line.
hpx_info
When running a parallel program, there are several pieces of information we would like to have. For example, in the last assignment, we partitioned our work into pieces and let a thread or task execute each piece (in parallel). For that assignment, we specified the number of threads/tasks to use so that we could experiment with different numbers of threads. In general, however, we don’t want the user to just arbitrarily set the number of threads that a job runs with. We might not want to run with more threads than there are cores available, for example. Or, we might want to set some resource limit on how many threads are used (which might be more or less than the number of cores). That is, we want to let the runtime system have some control over how our parallel program runs (and doing this automatically is in the spirit of HPX).
Consider the following program (work/parallelism/hello_hpx/hpx_info.cpp) that inspects the HPX environment:
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>

#include <cstdlib>
#include <print>
#include <thread>

int main() {
  char const* env_name = "HPX_NUM_WORKER_THREADS";
  char const* env_value = std::getenv(env_name);  // may be nullptr if unset
  std::println("{} = {}", env_name, env_value ? env_value : "");
  std::println("std::hardware_concurrency() = {}", std::thread::hardware_concurrency());
  std::println("hpx::get_num_worker_threads() = {}", hpx::get_num_worker_threads());
  return 0;
}
Now, build and run this program. When I run this in my shell I get the following output (the numbers might vary depending on your hardware):
$ ./build/work/parallelism/hello_hpx/hpx_info --hpx:threads=4
HPX_NUM_WORKER_THREADS =
std::hardware_concurrency() = 8
hpx::get_num_worker_threads() = 4
In this case, HPX_NUM_WORKER_THREADS is not set, so HPX falls back to its defaults and to any command line options. However, we have to be careful how we interpret this output. The first line shows that HPX_NUM_WORKER_THREADS is not set. The second line is obtained with a call to the C++ standard library (outside of HPX) and shows that the available hardware concurrency (number of cores) is equal to eight. The third line was obtained with a call to a function from the HPX library and shows that the number of HPX worker threads is equal to four – the value we requested with --hpx:threads=4, not the number of cores.
Recall from lecture that HPX enables using the fork-join model (amongst other things). But, how many threads actually run at one time with HPX? And how can we tell?
Let’s add a simple function to the above:
int hpx_thread_count() {
int n = 0;
hpx::experimental::run_on_all(
hpx::experimental::reduction_plus(n),
[](int& local_n) {
++local_n;
});
return n;
}
You may wonder why the lambda now has to expect one additional argument (the int& local_n). We need this because reduction_plus will pass a reference to a thread-local variable to the lambda, which ensures there will be no race conditions. Before the function run_on_all returns, it will reduce the values of all thread-local variables and store the result in the initial argument the reduction object was created from.
This function will create the default number of threads and each one of those will increment the variable local_n
by one – the final value of n
(as reduced by summing up all local_n
instances) will be the number of threads that executed in parallel. If we add a call to hpx_thread_count()
to the above program we obtain as output:
$ ./build/work/parallelism/hello_hpx/hpx_info --hpx:threads=4
HPX_NUM_WORKER_THREADS =
std::hardware_concurrency() = 8
hpx::get_num_worker_threads() = 4
hpx_thread_count() = 4
While the system this was run on has 8 cores (as reported by std::thread::hardware_concurrency), we restrict HPX to use only 4 of them (by supplying the command line argument --hpx:threads=4). All HPX programs support a set of predefined command line options; you can see a description of all of them by running your executable with --hpx:help.
HPX is Still Threads
One advantage of using HPX as a programmer is that parallelization can be done inline. When we parallelized programs explicitly using threads and tasks in C++, we had to bundle the work we wanted done into a separate function and launch a thread or task to execute that function. With HPX we can parallelize directly inline using a lambda (in the simplest case). For example, we can create a parallel “Hello World” as follows (see work/parallelism/hello_hpx/hello_hpx_parallel.cpp
):
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>

#include <print>
#include <thread>

int main() {
  hpx::experimental::run_on_all(
    []() {
      std::println("Hello! I am thread {} of {}",
        hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
      std::println("My C++ std::thread id is {}", std::this_thread::get_id());
    });
  return 0;
}
There is (almost) no helper function needed.
But, HPX is not a panacea. Underneath the covers, HPX still runs threads. And it doesn’t do anything special to fix the various issues we have seen with threads. This is the output from one run of the above program:
$ ./build/work/parallelism/hello_hpx/hello_hpx_parallel
Hello! I am thread Hello! I am thread 0 of 4
Hello! I am thread 2 of 4
My C++ std::thread id is 140325565663104
My C++ std::thread id is 140325592336192
3 of 4
My C++ std::thread id is 140325561464768
Hello! I am thread 1 of 4
My C++ std::thread id is 140325569861440
The first thing to notice is that we can get the thread id for the current thread with C++ standard library mechanisms. Problematically though, note that the strings that are being printed are interleaved with each other. We have seen this before in multithreading – when we had race conditions.
Just as with the explicit threading case, HPX can have race conditions when shared data is being written. However, as with the mechanisms for parallelization, the HPX mechanisms for protecting data are expressed using the C++ standard synchronization primitives. We can fix this race by using an hpx::mutex as usual.
In the context of HPX you should always use the HPX facilities for synchronizing access to data instead of the C++ standard ones – i.e., use hpx::mutex instead of std::mutex, hpx::condition_variable instead of std::condition_variable, etc. HPX reimplements all of the concurrency and parallelism facilities that the C++ standard mandates. You can still use the standard’s locking facilities, though: std::lock_guard, std::unique_lock, etc. work well with the HPX synchronization primitives.
Examples
We explore a few variants of HPX parallelization below.
The code for the first example is contained in the file work/parallelism/hello_hpx/hello_hpx_parallel.cpp
and can be compiled to hello_hpx_parallel
. The provided CMake build system has the related directives so you should be able to build it.
All of the following examples can be built from the corresponding .cpp
in a similar manner.
hpx::run_on_all
In the first example, we simply tell HPX to execute the following block in parallel:
hpx::experimental::run_on_all(
[]() {
std::println("Hello! I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
std::println("My C++ std::thread id is {}", std::this_thread::get_id());
});
Experiment with this program a little until you get a feel for how HPX manages threads and for how the HPX concept of a thread maps to what you already know about threads.
For example, try setting different values of HPX_NUM_WORKER_THREADS
in your environment (or experiment with different values for the --hpx:threads
command line option). You should see different numbers of “Hello” messages being printed. What happens when you have more threads than hardware
concurrency? Are the thread IDs unique?
hpx::run_on_all with a given number of threads
The code for this example is contained in the file work/parallelism/hello_hpx/hello_hpx_2_threads.cpp
. In this version, we tell HPX to execute the following block in parallel, using 2 threads.
hpx::experimental::run_on_all(
2,
[]() {
std::println("Hello! I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
std::println("My C++ std::thread id is {}", std::this_thread::get_id());
});
What effect does setting the environment variable HPX_NUM_WORKER_THREADS
have on this program (or alternatively, using different values for the --hpx:threads
command line option)?
Protect sequential code using a critical section
The code for this example is contained in the file work/parallelism/hello_hpx/hello_hpx_critical.cpp
hpx::mutex mtx;
{
std::lock_guard l(mtx);
std::println("Hello! I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
std::println("My C++ std::thread id is {}", std::this_thread::get_id());
}
How many threads get run when you execute this example? Does setting HPX_NUM_WORKER_THREADS
have any effect?
Protect parallel code using a critical section
The code for this example is contained in the file work/parallelism/hello_hpx/hello_hpx_parallel_critical.cpp
hpx::mutex mtx;
hpx::experimental::run_on_all(
num_threads,
[&]() {
std::print("Hello! ");
std::println("I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
{
std::lock_guard l(mtx);
std::print("Hello from inside the critical section! ");
std::println("I am thread {} of {}",
hpx::get_worker_thread_num(), hpx::get_num_worker_threads());
std::println("My C++ std::thread id is {}", std::this_thread::get_id());
}
});
The program has a parallel region that prints the “Hello World” information as before, as well as a protected critical section. How many threads get run when you execute this example? Does setting HPX_NUM_WORKER_THREADS
have any effect? Is there any difference in how the text for the critical and non-critical portions gets printed?
Exercises
Norm
Previously, we explored parallelizing some functions for computing the Euclidean norm of a vector. Here we are going to look at how to use HPX to parallelize that same computation.
The main source files for these exercises are in the norm subdirectory. There is a separate driver program for each exercise. The actual norm functions that we will be parallelizing are in the (single) file norms.hpp in the include subdirectory.
There are five different functions that you will parallelize using HPX: norm_block_reduction(), norm_block_critical(), norm_cyclic_reduction(), norm_cyclic_critical(), and norm_parfor().
The drivers are named according to the function they test and can be made into the corresponding executable.
The drivers all operate in the same way (dispatching to run
, which is contained in include/norm_utils.hpp
). A driver can accept up to three arguments:
$ ./build/work/parallelism/norm/norm_block_critical [lo] [hi] [max_threads]
where lo
is the smallest size vector to be tested, hi
is the upper bound of vector size to be tested, and max_threads
is the upper bound of threads to test.
As with the norm drivers in More Warmup, these drivers will run test problems
from lo
to hi
, increasing in factors of two at each test, and also test with one to max_threads
, again in powers of two. The programs will print measured GFlops for each test.
Look through the code for run()
in norm_utils.hpp
. How are we setting the number of threads for HPX to use?
Parallelizing loops using HPX
We demonstrated in class how hpx::experimental::for_loop can be used to convert any index-based sequential for loop into an equivalent parallelized version. In short, given a sequential loop:

for (size_t i = 0; i != N; ++i) {
  // any code that relies on i
}

it can be parallelized as:

hpx::experimental::for_loop(
  hpx::execution::par, 0, N,
  [&](size_t i) {
    // any code that relies on i
  });

We have also shown that thread-safe reductions can easily be added to this by using one of the HPX reduction operations, e.g.:

int n = 0;
hpx::experimental::for_loop(
  hpx::execution::par, 0, N,
  hpx::experimental::reduction_plus(n),
  [&](size_t i, int& local_n) {
    // any code that relies on i
    // local_n refers to a thread-local instance of n
  });

We have not shown in class that it is also possible to supply the number of threads to use to run the parallel for_loop:

int n = 0;
hpx::experimental::for_loop(
  hpx::execution::experimental::with_processing_units_count(
    hpx::execution::par, num_threads),
  0, N,
  hpx::experimental::reduction_plus(n),
  [&](size_t i, int& local_n) {
    // any code that relies on i
    // local_n refers to a thread-local instance of n
  });
Please feel free to use this scheme to control the number of threads used in your implementation of the norm() functions below.

Another feature of HPX not presented in class may come in handy whenever you need to increment the loop counter by a stride larger than one. In this case, hpx::experimental::for_loop_strided may be useful. This facility additionally accepts a stride argument; it is otherwise identical to for_loop:

hpx::experimental::for_loop_strided(
  hpx::execution::par, 0, N, stride,  // stride is an integer larger than one
  [&](size_t i) {
    // any code that relies on i
  });
Using HPX, parallelize the norm() functions in the file work/include/norms.hpp, as indicated below:

- norm_block_reduction() – the function should be parallelized in block fashion according to the number of available HPX threads. The update to the shared variable sum should be protected using an HPX reduction directive. Some of the scaffolding for this function has been provided. The basic ideas should be familiar: determining the block size, assigning work to each worker. The difference is that rather than specifying how to partition the problem and then determining the number of threads to use, we are specifying how many threads to use and then determining how to partition the problem. Also, we don’t need to create a helper function.
- norm_block_critical() – the function should be parallelized in block fashion according to the number of available HPX threads. The update to the shared variable sum should be protected using an HPX critical section.
- norm_cyclic_reduction() – the function should be parallelized in cyclic fashion according to the number of available HPX threads. The update to the shared variable sum should be protected using an HPX reduction directive.
- norm_cyclic_critical() – the function should be parallelized in cyclic fashion according to the number of available HPX threads. The update to the shared variable sum should be protected using an HPX critical section.

These functions are tested by the program work/parallelism/norm/norms_test.cpp.
Answer the following questions in results/answers.md:

- Which version of norm above provides the best parallel performance? How do the results compare to one of the parallelized versions of norm from More Warmup? Justify why one is faster than the other. You can select any one of the parallelized versions of norm from More Warmup.
- Which version of norm above provides the best parallel performance for larger problems (i.e., problems at the top end of the default sizes in the drivers or larger)? How do the results compare to one of the parallelized versions of norm from More Warmup? Justify why one is faster than the other. You can select any one of the parallelized versions of norm from More Warmup. It can be either a different version or the same version as in the previous question.
- Which version of norm above provides the best parallel performance for small problems (i.e., problems smaller than the low end of the default sizes in the drivers)? How do the results compare to one of the parallelized versions of norm from More Warmup? Justify why one is faster than the other. You can select any one of the parallelized versions of norm from More Warmup. It can be either a different version or the same version as in the previous two questions.
Next up: A Multithreaded Buddhabrot Renderer