HPX Users Guide

  • Overview
  • Configuring Your Environment
  • Running HPX-5 Applications
  • Compiling HPX-5 Applications
  • A Minimal HPX-5 Application
  • The HPX-5 API
  • Trouble Shooting
    • hpx-config
    • Debugging
  • Writing HPX-5 Applications Idiomatically

Overview

Components of an HPX-5 Program
Before starting to use HPX-5, it helps to know its main components at a conceptual level and how they interact:

ParalleX

ParalleX is a cross-cutting execution model that defines a set of close-knit components for use in scalable Exascale programming. It is an evolving parallel execution model derived to exploit the opportunities and address the challenges of emerging technology trends. It aims to ensure effective performance gain into the Exascale era of the next decade by addressing the challenges of starvation, latency, overhead, waiting, energy and reliability. At the core of the ParalleX strategy is a new framework to replace static methods with dynamic adaptive techniques. This method exploits runtime information and employs unused resources. It benefits from locality while managing distributed asynchrony. ParalleX is a crosscutting model to facilitate co-design and interoperability among system component layers from hardware architecture to programming interfaces.

ParalleX distinguishes itself by:

  • Dynamic adaptive methods through runtime system software for guiding computing resource management and task scheduling
  • Lightweight threads (Multi-threaded)
  • Parcels (a form of active messages)
  • Global name space and virtual addressing
  • Runtime discovery, synchronization, and control
  • Local Control Objects (LCOs)

Interface and design choices made in HPX-5 attempt to conform to ParalleX wherever possible.

Localities

A locality is a distinct virtual address space in which an instance of the HPX-5 runtime exists. On a clustered system, each physical node usually has a single locality, though this is determined at runtime by the application launcher; oversubscription caused by the launcher may result in one locality per NUMA domain, or one locality per subset of cores. It is quite possible to write HPX-5 programs without explicitly referencing localities, but it sometimes helps to know they exist (for example, when using the “PGAS” model for global memory, or when initializing C/C++ per-process global data; see below).

Global Memory

An HPX-5 application has three distinct regions of memory.

  • Local memory is the memory that is normally returned by an allocator like malloc(). Local virtual addresses can be shared between lightweight threads within a locality, e.g., for parallel for operations; however, local virtual addresses cannot be passed across localities.
  • Registered memory is memory that is allocated through the hpx_malloc_registered interface, as well as lightweight thread stacks. To allocate zeroed registered memory, use the hpx_calloc_registered interface. Registered memory should be used as the local address for hpx_gas_memget() and hpx_gas_memput() operations. Local memory may also be used; however, this may require slow dynamic network registration.
  • Global memory is memory allocated via the HPX-5 runtime and that has an address that can be shared among the entire HPX-5 application. All global virtual addresses have a local virtual alias somewhere in the system, however a global virtual address does not necessarily have a mapping at the current locality. If an action (see below) is invoked at a global address, it will run at the locality at which the address is mapped, and it can pin the global virtual address to get its local translation. It can then interact with that local alias just like normal, local virtual memory. This is the most common way global memory is operated on. Otherwise, global memory must be copied to and from a registered address by the appropriate HPX-5 call, such as a hpx_gas_memget() or hpx_gas_memput().

HPX-5 supports several different memory implementations:

  • Under the SMP implementation, which is only available when HPX-5 is running on a single node, global memory is effectively the same as local memory, except that it still has a global address, just as it would under the other HPX-5 memory models.
  • Under the PGAS (Partitioned Global Address Space) implementation, global memory is allocated and managed similarly to how UPC does so. Memory can be allocated at the current locality or can be allocated cyclically over a number of localities.
  • Under the AGAS (Active Global Address Space) implementation, global memory is allocated similarly to how it is allocated under the PGAS model, but it is not necessarily located at a fixed locality; it may be moved to other localities to balance system memory load or workload.
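
As a rough illustration of the cyclic layout mentioned above, the sketch below maps a block index to the locality that would hold it, assuming plain round-robin placement (block i lands on locality i mod N). The helper name cyclic_owner is hypothetical and is not part of HPX-5; the placement the runtime actually uses should be obtained from the runtime itself.

```sh
# cyclic_owner BLOCK NLOCALITIES
# Prints the locality that would hold block BLOCK under a plain
# round-robin (cyclic) distribution. Illustrative only; not an HPX-5 call.
cyclic_owner() {
    echo $(( $1 % $2 ))
}

# Eight blocks spread over three localities land on 0 1 2 0 1 2 0 1.
i=0
while [ "$i" -lt 8 ]; do
    printf 'block %d -> locality %d\n' "$i" "$(cyclic_owner "$i" 3)"
    i=$(( i + 1 ))
done
```

Under AGAS this mapping would only describe the initial placement, since blocks may later be moved.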

Within a locality applications may specify a soft core affinity for global addresses using the hpx_gas_set_affinity() interface. Threads generated from parcels targeting a global address with soft core affinity will be preferentially scheduled on the requested core.

  1. Soft affinity is of most use for well-balanced applications with long-running threads.
  2. Soft affinity is not a guaranteed affinity and programmers should make no assumptions about where threads will actually run.

The current libhpx implementation ships with three implementations of GAS affinity, which can be selected at runtime using the --hpx-gas-affinity option.

  1. The default --hpx-gas-affinity=none implementation ignores the affinity hint entirely. We intend this default to change in the future.
  2. The --hpx-gas-affinity=urcu implementation uses the userspace read-copy-update library and associated hash table to manage soft affinity bindings.
  3. The --hpx-gas-affinity=cuckoo implementation uses a libcuckoo hash table to manage soft affinity bindings.

The userspace read-copy-update implementation is the preferred option due to its low overhead for readers, but it currently cannot be built on Darwin platforms, which can use libcuckoo instead.

Lightweight Threads & Actions

Actions are globally callable functions and are the thread entry points for HPX-5 lightweight threads (and tasks and interrupts). They are invoked through the use of an active-message parcel or through one of the higher-level remote procedure call interfaces. Actions are executed by HPX-5 lightweight threads and are usually bound to execute local to a specific global address; the HPX-5 runtime uses this binding to schedule the action in the physical location that will be most efficient. HPX-5 threads are scheduled cooperatively and may block waiting for lightweight synchronization objects (LCOs).

LCOs

Synchronization and, often, communication between actions is done via LCOs (Local Control Objects). There are a variety of LCOs, and some have rather different characteristics, but all LCOs may be waited on by threads and parcels until some condition is true or until the LCO has been set with a value (either by another action, or possibly by some other means). Many LCOs have a value that can be retrieved by get actions.

The most commonly used kind of LCO is a future. A future can be waited on by one or more threads until it is set (it may be set with a value, or not). Many asynchronous functions in the HPX-5 API take a future as a parameter, and will signal completion via the future.

Another commonly used LCO is the and LCO. When an and LCO is created, it requires a parameter specifying how many times it may be set. Then multiple actions may set it one or more times, and a thread waiting on it will be released when it has been set as many times as was specified. and LCOs are often used in parallel loops.

Building and Installing

Requirements

Platform

OS                 Architecture
Linux              x86-64
Linux              Intel Knights Landing
Linux              ARMv7
Linux              AArch64 (ARMv8)
Linux              IBM Power 8
Darwin (10.12.4)   x86-64 (SMP only)

The HPX-5 runtime system may compile on other platforms, but it is not guaranteed to do so. We do not test HPX-5 on other platforms.

Software

Compiling HPX-5 requires that you have several software packages installed. The table below lists those required packages. The “Version” column provides “known to work” versions of the package.

Compilers

Build System

Package              Version   Notes
pkg-config
Doxygen (optional)   1.87+     Optional; required to build the documentation
GNU Make             3.81+
GNU Autoconf         2.69      Only required if building from the git version of HPX-5
GNU Automake         1.15      Only required if building from the git version of HPX-5
GNU M4               1.4.17    Only required if building from the git version of HPX-5
Libtool              2.4.6     Only required if building from the git version of HPX-5

Libraries

Package                 Version      Notes
MPI (optional)          1.6.3+       Tested with OpenMPI 1.6.3, 1.6.5, 1.8.1, 1.8.4, 1.10.0, 1.10.2, MPICH 3.0.4, impi 4.1.3049
Photon (optional)       (included)
jemalloc (included)     4.5.0        Source included
hwloc (included)        (included)
uthash (included)       (included)
libffi (included)       (included)
libffi-mic (included)   (included)   Needed for Xeon-Phi
liburcu (optional)      0.9.2        Source included with HPX-5
libcuckoo (optional)    1.0          Source included with HPX-5
tbbmalloc (optional)    2015.1.133   If available, tbbmalloc may be used instead of jemalloc
PAPI (optional)                      Tested with 5.4.1
APEX (optional)         0.5          (included)

Notes:

  • HPX-5 can build and run successfully without any network backend, but at present, MPI or Photon is required for networking by HPX-5.
  • The above versions of Make, Autoconf, Automake, M4, and Libtool are only required when building the repository version of HPX-5. The downloadable release packages do not require these.

Configuration and Environment

Cray users must set export CRAYPE_LINK_TYPE=dynamic to use HPX-5.

Bootstrapping For Developer Builds

HPX-5 releases have already been bootstrapped. Developers using other branches will need to bootstrap before configuration. In the HPX-5 directory, run the bootstrap script. You may need updated autotools and libtool, which can be installed using the scripts/setup_autotools.sh script.

Environment Dependencies

The HPX-5 configuration will pull a number of dependencies from your environment. See ./configure --help for a complete list of the variables that are used.

Variable       Description
DOXYGEN        The doxygen executable for documentation build
PHOTON_CARGS   Additional configuration arguments for the included photon package
TBBROOT        The path to the root of Intel’s TBB installation
TESTS_CMD      A command to launch each integration test during make check

The HPX-5 configuration looks for most dependencies in your current path. These dependencies will be found and included automatically as necessary. For missing dependencies HPX-5 will fall back to looking for pkg-config packages, which will also be found without intervention if they are installed on your system using their default names and their pkg.pc files are in your $PKG_CONFIG_PATH. You can force HPX-5 to use a package over the system-installed version of a dependency, or use an installation with a non-standard name, by specifying --with-package=pkg during configuration. See ./configure --help for further details.

Network Configuration

HPX-5 currently specifies two network interfaces with one transport implementation of each: ISend/IRecv (ISIR) with the MPI transport, and Put-With-Completion (PWC) with the Photon transport. HPX-5 can be built with one, both, or neither of the network transports.

Note that if you are building with Photon, the libraries for the network interconnect you are targeting need to be present on the build system. The two supported interconnects are InfiniBand (libibverbs and librdmacm) and Cray’s GEMINI and ARIES via uGNI (libugni). HPX-5 2.2.0 includes experimental support for sockets and Intel’s PSM networking through the included libfabric package, which may have additional dependencies on your system.

Libfabric is enabled by default, but if Photon detects a “native” backend, like verbs or ugni, it will use that backend by default, falling back to libfabric/sockets only if neither is detected. See the section on network runtime configuration below.

If you build with Photon and/or MPI on a system without networking, you may still use the SMP option to run applications that are not distributed.

Configuring with MPI

By default, HPX-5 will be configured in SMP mode, without a high-speed network. To configure with MPI, use the --enable-mpi option. When MPI is enabled, the configuration will search for the appropriate way to include and link to MPI.

  1. HPX-5 will check whether mpi.h and libmpi.so are available with no additional flags (e.g., for CC=mpicc or on Cray CC=cc).
  2. HPX-5 will test for an mpi.h and -lmpi in the current C_INCLUDE_PATH and {LD_}LIBRARY_PATH respectively.
  3. HPX-5 will look for an ompi pkg-config package.

If you want to specify which MPI should be used, or the build system cannot find the .pc file, you may need to use the --with-mpi=pkg option. The pkg parameter can be either the prefix for the .pc file (ompi, mpich, etc.) or a complete path to the .pc file.

Configuring with Photon

The Photon network library is included in HPX-5 within the contrib directory. To configure HPX-5 with Photon use the option --enable-photon. HPX-5 can also use a system-installed or pkg-config version of Photon. This can be controlled using the --with-photon=system or --with-photon=pkg, respectively. Note that HPX-5 does not provide its own distributed job launcher, so it is necessary to use either the --enable-pmi or --enable-mpi option in addition to --enable-photon in order to build support for mpirun or aprun bootstrapping.

To configure on Cray machines you will need to include the PHOTON_CARGS=--enable-ugni flag during configuration so that Photon builds with ugni support. In addition, the --enable-hugetlbfs option causes the HPX-5 heap to be mapped with huge pages, which is necessary for larger heaps on some Cray Gemini machines. The hugepages modules provide the environment necessary for compilation.

Enabling the Test-suite

To build and run the unit and performance test suite, set TESTS_CMD= as the driver command.

For example:
./configure TESTS_CMD="mpirun -np 2 --map-by node:PE=16" --enable-mpi

will work on many systems using recent versions of OpenMPI. For distributed tests, two processes should be available (-np 2) but it is not necessary to use more than that.

HPX-5 ships with some long-running tests that are disabled by default; the configuration option --enable-lengthy-tests will build these and run them during make check.

Additional optional configure options

Some additional features that HPX-5 supports are the following

Option                      Description
--enable-parallel-config    Speed-up configure [off]
--enable-hpx++              Enable HPX++ bindings (experimental, requires C++11) [off]
--enable-photon             Build with Photon networking [off]
--enable-mpi                Build with MPI networking and mpirun bootstrap [off]
--enable-pmi                Build with PMI bootstrap [off]
--enable-jemalloc           Use jemalloc for memory allocation [on]
--enable-tbbmalloc          Use tbbmalloc for memory allocation (requires C++11) [off]
--enable-dlmalloc           Use Doug Lea’s malloc for GAS allocation [off]
--enable-hugetlbfs          Enable support for explicit huge pages [off]
--enable-agas               Enable AGAS (requires C++11) [off]
--enable-percolation        Enable percolation support [off]
--enable-docs               Build doxygen documentation [off]
--enable-debug              Enable debug checks (--hpx-dbg-{waitat,waitonabort,waitonsegv}) and logging (--hpx-log-{at,level}) [off]
--enable-instrumentation    Enable instrumentation (can affect performance) [off]
--enable-testsuite          Build the testsuite [on]
--enable-lengthy-tests      Enable long running tests in the test suite [off]

Note that enabling the HPX C++ bindings, AGAS or tbbmalloc requires C++11. Note also that even when built with AGAS, HPX-5 defaults to PGAS at present, so --hpx-gas=agas must be used at runtime to specify the use of AGAS. AGAS works only with IS/IR, so the runtime option --hpx-network=isir must be used as well.

The --enable-jemalloc, --enable-dlmalloc and --enable-tbbmalloc options are mutually exclusive; --enable-tbbmalloc requires TBBROOT to be set in the environment.

When either jemalloc or tbbmalloc is enabled, programs compiled with HPX-5 will use jemalloc or tbbmalloc for calls to malloc() and free(). Either jemalloc or tbbmalloc is required for network operation, and they are mutually exclusive. In an SMP configuration, --disable-jemalloc will result in executables linked to the standard libc implementation.

Configuring for building on Xeon Phi MICs

When compiling for a Xeon Phi processor’s MIC units, there are some slight differences from the normal configuration process. First there are a couple caveats: At present, TBBMalloc must be used in place of the default jemalloc, and MPI must be used in place of Photon.

Furthermore, when configuring, the options --host=x86_64-k1om-linux --with-tbbarch=mic must be specified and the following variables must be added to the configure line: CC=mpiicc CXX=mpiicpc CFLAGS=-mmic CXXFLAGS=-mmic CCASFLAGS=-mmic.

Complete the build and install

To complete the build and install use:
$ make
$ make install

If HPX-5 was configured with --enable-testsuite, you can run make check. On clusters, you may have to do this from within a job allocation.

You may need to add the lib subdirectory of the HPX-5 installation directory to your LD_LIBRARY_PATH environment variable, and you should add the lib/pkgconfig subdirectory to your PKG_CONFIG_PATH. Additionally, if you wish to use the hpx-config tool, you should add the bin subdirectory to your PATH.

For example:
export PATH=$HPX_PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$HPX_PREFIX/lib:$LD_LIBRARY_PATH
export PKG_CONFIG_PATH=$HPX_PREFIX/lib/pkgconfig:$PKG_CONFIG_PATH
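
The three exports above can be wrapped in a small helper for bash or other POSIX shells. This is only a sketch: the function name hpx_env_setup is hypothetical and not something HPX-5 provides. It differs from the plain exports in one respect: the separating colon is added only when the variable already has a value, which avoids leaving an empty entry in LD_LIBRARY_PATH (the dynamic loader generally treats an empty entry as the current directory).

```sh
# hpx_env_setup PREFIX
# Prepends the HPX-5 install directories under PREFIX to PATH,
# LD_LIBRARY_PATH and PKG_CONFIG_PATH. The ${VAR:+:$VAR} form appends
# ":$VAR" only when VAR is already set and non-empty.
hpx_env_setup() {
    PATH="$1/bin${PATH:+:$PATH}"
    LD_LIBRARY_PATH="$1/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    PKG_CONFIG_PATH="$1/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
    export PATH LD_LIBRARY_PATH PKG_CONFIG_PATH
}

# Example, with HPX_PREFIX set to wherever HPX-5 was installed:
# hpx_env_setup "$HPX_PREFIX"
```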

Configuring Your Environment

The specifics of building and installing HPX-5 are covered in the previous section. If HPX-5 has already been installed on your system or is available via a module, you won’t have to know most of that information. You may need to know a couple things, first, though:

The installed HPX-5 library path should be in your LD_LIBRARY_PATH, even when building a static library (if you are using a module, this will most likely already be set).

To build any new HPX-5 programs, you may need to know where HPX-5 is installed. The easiest way to build applications using HPX-5 is to use pkg-config. If the path to the HPX-5 pkg-config file (hpx.pc) is already in your PKG_CONFIG_PATH environment variable, then you won’t need this information. If it isn’t in your PKG_CONFIG_PATH, it will need to be added to it if you want to be able to use pkg-config to automatically generate the proper flags for building HPX-5 applications.

For example, if HPX-5 was installed to /path/to/hpx5/ and you are using bash:
export PATH=/path/to/hpx5/bin:$PATH
export LD_LIBRARY_PATH=/path/to/hpx5/lib:$LD_LIBRARY_PATH
export PKG_CONFIG_PATH=/path/to/hpx5/lib/pkgconfig:$PKG_CONFIG_PATH
Or, in tcsh:
setenv PATH /path/to/hpx5/bin:$PATH
setenv LD_LIBRARY_PATH /path/to/hpx5/lib:$LD_LIBRARY_PATH
setenv PKG_CONFIG_PATH /path/to/hpx5/lib/pkgconfig:$PKG_CONFIG_PATH

Furthermore, and discussed in more detail below, you may export any number of the HPX-5 runtime options into your environment to set up defaults.

For example, if you would like to use the Isend/Irecv network backend by default you could:
export HPX_NETWORK=isir

You will also need to know what job launcher your system uses. If HPX-5 was built in “SMP” mode, not intended to be used in a distributed environment, you probably don’t need a job launcher, though you may want to launch with numactl to control locality placement. In all other cases, you will. The job launcher is usually one of mpirun, mpiexec, ibrun, or aprun, but some systems have system-specific job launchers.

Running HPX-5 Applications

By default HPX-5 uses the number of cores specified in the process’ affinity map as the number of system threads to spawn!

Know Your Launcher

The first step to running an HPX-5 application is to understand your system’s launcher infrastructure, particularly with respect to process placement. This is because HPX-5 does its best to conform its locality topology to the one that your launcher sets up. As HPX-5 expects to run multithreaded localities, it is particularly important for you to launch multithreaded processes with the appropriate number of cores.

Launching an HPX-5 job is generally no more complicated than launching an MPI job. Note that HPX-5 jobs should be arranged on nodes like multi-threaded MPI processes. That is, on multi-core machines, single-threaded MPI jobs are usually arranged such that there is one MPI process (rank) per core, but HPX-5 is multi-threaded and should be arranged one HPX-5 process per node (which is how an MPI program using pthreads or OpenMP would usually be arranged). However, it may be necessary to specify to the job launcher that each process should be allowed to use all the available threads.

Note: your launcher is often constrained by your job management software, so you should make sure to request an allocation that corresponds to how you would like to launch your job (e.g., -l nodes=:ppn= or -mppwidth=<np*ppn>). Check with your system administrator for details.

SMP Launching

For SMP execution, you may either run your application directly, in which case HPX-5 will query the operating system for the number of cores to run on (this includes SMT cores), or you can use numactl to restrict the execution.

For example, on a node that has two 6-core processors with hyper-threading enabled (24 logical cores), running the hello example results in all 24 cores being used by HPX-5 scheduler threads.
$ ./hello --hpx-log-level=default
LIBHPX<-1,-1>: (Network.cpp:Create:98) bootstrapped using SMP.
LIBHPX<0,-1>: (hpx.cpp:hpx_init:170) HPX running 16 worker threads on 16 cores
LIBHPX<0,-1>: (Network.cpp:Create:128) SMP network initialized
LIBHPX<0,-1>: (Scheduler.cpp:start:75) hpx started running 0
Hello World from 0.
LIBHPX<0,-1>: (Scheduler.cpp:start:121) hpx stopped running 0
$
To restrict execution to one specific NUMA domain, you can launch using numactl -N, and HPX-5 will restrict itself to that domain.
$ numactl -N 0 ./hello --hpx-log-level=default
LIBHPX<-1,-1>: (Network.cpp:Create:98) bootstrapped using SMP.
LIBHPX<0,-1>: (hpx.cpp:hpx_init:170) HPX running 8 worker threads on 8 cores
LIBHPX<0,-1>: (Network.cpp:Create:128) SMP network initialized
LIBHPX<0,-1>: (Scheduler.cpp:start:75) hpx started running 0
Hello World from 0.
LIBHPX<0,-1>: (Scheduler.cpp:start:121) hpx stopped running 0
$
Even here, HPX-5 is using the SMT cores because they are part of the process' affinity map. If you would like to restrict the execution to specific cores, say one core per SMT pair, you can completely control the layout as follows.
$ numactl -C 0,2,4,6,8,10 ./hello --hpx-log-level=default
LIBHPX<-1,-1>: (Network.cpp:Create:98) bootstrapped using SMP.
LIBHPX<0,-1>: (hpx.cpp:hpx_init:170) HPX running 6 worker threads on 6 cores
LIBHPX<0,-1>: (Network.cpp:Create:128) SMP network initialized
LIBHPX<0,-1>: (Scheduler.cpp:start:75) hpx started running 0
Hello World from 0.
LIBHPX<0,-1>: (Scheduler.cpp:start:121) hpx stopped running 0
$
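
The core list given to numactl -C can be generated rather than typed by hand. The sketch below assumes, as the example above does, that the two hardware threads of each core are numbered consecutively (0 and 1, 2 and 3, and so on). That numbering is system-specific, so check your topology (for instance with lscpu or hwloc's lstopo) before relying on it. The helper name every_other_core is hypothetical.

```sh
# every_other_core FIRST LAST
# Prints FIRST, FIRST+2, FIRST+4, ... up to at most LAST as a
# comma-separated list suitable for numactl -C.
every_other_core() {
    c=$1
    list=$c
    while [ $(( c + 2 )) -le "$2" ]; do
        c=$(( c + 2 ))
        list="$list,$c"
    done
    echo "$list"
}

# Example: one hardware thread per core on the first six cores.
# numactl -C "$(every_other_core 0 10)" ./hello
```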

mpirun Launching

Note: OpenMPI 1.10.0 causes a regression with the --map-by node:pe=1 option; please set --hpx-threads=1 explicitly to work around this issue.

Launching a multithreaded process with mpirun varies based on your version of MPI. You should consult your system’s man pages or administrator for advice if needed.

Using OpenMPI 1.10.2, on a cluster with 16 core nodes, we need to use the --map-by node:PE= syntax to correctly launch a job using cores. Without this launcher flag, we get single-threaded localities, e.g.,
$ mpirun --version
mpirun (Open MPI) 1.10.2
Report bugs to http://www.open-mpi.org/community/help/
$ mpirun -np 2 ./hello --hpx-log-level=default --hpx-log-at=0
LIBHPX<0,-1>: (hpx.cpp:hpx_init:170) HPX running 1 worker threads on 1 cores
LIBHPX<0,-1>: (Network.cpp:Create:128) ISIR network initialized
LIBHPX<0,-1>: (Scheduler.cpp:start:75) hpx started running 0
Hello World from 0.
LIBHPX<0,-1>: (Scheduler.cpp:start:121) hpx stopped running 0
$ mpirun -np 2 --map-by node:PE=16 ./hello --hpx-log-level=default --hpx-log-at=0
LIBHPX<0,-1>: (hpx.cpp:hpx_init:170) HPX running 16 worker threads on 16 cores
LIBHPX<0,-1>: (Network.cpp:Create:128) ISIR network initialized
LIBHPX<0,-1>: (Scheduler.cpp:start:75) hpx started running 0
Hello World from 0.
LIBHPX<0,-1>: (Scheduler.cpp:start:121) hpx stopped running 0
$
With earlier versions of mpirun the flags may be different. For example, using OpenMPI 1.6.5 we use the -cpus-per-proc flag, coupled with one of -bynode, -bysocket, etc.
$ mpirun --version
mpirun (Open MPI) 1.6.5
Report bugs to http://www.open-mpi.org/community/help/
$ mpirun -np 2 ./hello --hpx-log-level=default --hpx-log-at=0
Reported: 1 (out of 1) daemons - 2 (out of 2) procs
LIBHPX<0,-1>: (hpx.c:hpx_init:178) HPX running 1 worker threads on 1 cores
LIBHPX<0,-1>: (network.c:network_new:96) PWC network initialized
Hello World from 0.
$ mpirun -np 2 -bynode -cpus-per-proc 8 ./hello --hpx-log-level=default --hpx-log-at=0
Reported: 2 (out of 2) daemons - 2 (out of 2) procs
LIBHPX<0,-1>: (hpx.c:hpx_init:178) HPX running 8 worker threads on 8 cores
LIBHPX<0,-1>: (network.c:network_new:96) PWC network initialized
Hello World from 0.

slurm Launching

On systems with slurm available, HPX-5 applications can be launched directly from the command line.
$ srun -n 16 -N 8 -c 8 ./hello --hpx-log-level=default --hpx-log-at=0
LIBHPX<0,-1>: (hpx.c:hpx_init:206) HPX running 8 worker threads on 8 cores
LIBHPX<0,-1>: (network.c:network_new:108) PWC network initialized
LIBHPX<0,-1>: (hpx.c:_run:239) hpx started running 0
Hello World from 0.
LIBHPX<0,-1>: (hpx.c:_run:241) hpx stopped running 0

aprun Launching

On Cray systems the aprun job launcher maintains strict control over the binding of PEs and cores within a job. Here, the -d flag indicates the “depth” of the job, i.e., the number of cores per PE.
$ aprun --version
aprun (ALPS) 5.2.1 Revision 9041 Network type Aries
$ aprun -n 2 -N 1 -d 2 ./hello --hpx-log-level=default --hpx-log-at=0
LIBHPX<0,-1>: (cpu.c:system_get_affinity_group_size:71) saw 1 cores from set 1
LIBHPX<0,-1>: (hpx.c:hpx_init:178) HPX running 1 worker threads on 1 cores
LIBHPX<0,-1>: (network.c:network_new:96) PWC network initialized
LIBHPX<0,0>: (worker.c:worker_start:582) worker 0 starting on core 0
Hello World from 0.
Application 395149 resources: utime ~0s, stime ~2s, Rss ~1101744, inblocks ~707, outblocks ~70
$

Controlling HPX-5 Behavior at Run-time

There are a variety of options that a user may want to control not just at build-time but at run-time for the HPX-5 runtime. Such options include, for example, the global memory heap size, which network interconnect to use, and whether to turn on logging. There are two ways to control these options: through environment variables and command line options.

To control the options through command line parameters, add the option somewhere on your command line (after the name of the HPX-5 executable).

For example, the command line option to control the active network is --hpx-network. On a machine using mpirun as the job launcher, setting this option to use the Isend/Irecv network backend instead of the Photon backend (selected by default when built) would look like this:
mpirun -np 2 --map-by node:PE=16 ./hello --hpx-network=isir

To control options through the environment variables, simply set the environment variable that corresponds to the available command line option. The naming convention for environment variables is that they match the name of the command line parameters except that they are all upper-case and all hyphens are converted to underscores (so the command line parameter --hpx-network is named HPX_NETWORK as an environment variable).

If you're using bash as your shell, the above command would instead be:
export HPX_NETWORK=isir
mpirun -np 2 --map-by node:PE=16 ./hello
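
The naming convention described above (drop the leading hyphens, upper-case everything, turn the remaining hyphens into underscores) can be written as a short shell helper. This is a sketch of the stated rule only. The function name hpx_opt_to_env is hypothetical, and dropping a trailing =value is a convenience added here, not something the runtime does.

```sh
# hpx_opt_to_env OPTION
# Prints the environment variable name that corresponds to an HPX-5
# command line option, following the convention described in the text.
hpx_opt_to_env() {
    printf '%s\n' "$1" |
        sed -e 's/^--*//' -e 's/=.*$//' -e 's/-/_/g' |
        tr '[:lower:]' '[:upper:]'
}

hpx_opt_to_env --hpx-network    # prints HPX_NETWORK
```

The same rule applies to any other option; for example, --hpx-log-level maps to HPX_LOG_LEVEL.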

Run-time options

HPX-5 ships with an extensive set of runtime options. All applications that link against HPX-5 can use the --hpx-help option to get a list of these. In this section we will discuss some specific options that are important for most programs.

General runtime options

The parameters users will most often be interested in that need some explanation are --hpx-network=<network> and --hpx-gas=<model>.

HPX-5 supports two different network types that have different performance characteristics. They rely on two separate transport implementations. PWC (“put with completion”) is the preferred network and usually performs better. It uses the Photon transport layer (included as part of the HPX-5 distribution) and communicates via RDMA for better performance. The Isend/Irecv network uses MPI two-sided communication and is included for portability.

To use one of the networks, it must have been enabled when HPX-5 was built. Selecting a network automatically uses the corresponding transport method.

--hpx-gas=<model> can be used to specify the memory model the runtime will use. pgas is the default unless HPX-5 was built in SMP mode or is being run in a non-distributed environment. smp will force HPX-5 to use the SMP model. agas will force HPX-5 to use AGAS. In order to use AGAS, the runtime must have been built with --enable-agas and the Isend/Irecv network must be used (--hpx-network=isir), as AGAS is not compatible with the PWC network at this time.
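
A launch script can check the AGAS constraint before starting a job. The sketch below encodes only the rule stated above (agas needs the isir network). The helper name hpx_check_gas is hypothetical, and it cannot tell whether the runtime was actually built with --enable-agas.

```sh
# hpx_check_gas GAS NETWORK
# Returns 0 when the GAS/network pair is allowed by the rule above,
# and 1 (with a message on stderr) when agas is paired with another network.
hpx_check_gas() {
    if [ "$1" = "agas" ] && [ "$2" != "isir" ]; then
        echo "agas requires --hpx-network=isir (got '$2')" >&2
        return 1
    fi
    return 0
}

# Example use before launching:
# hpx_check_gas agas isir &&
#     mpirun -np 2 --map-by node:PE=16 ./hello --hpx-gas=agas --hpx-network=isir
```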

Scheduler options

The HPX-5 cooperative scheduler typically spawns one “worker thread” per core, as described in Know Your Launcher. This can be overridden using the --hpx-threads=<threads> flag.

HPX-5 lightweight threads run on small, 32KB, stacks. Some applications need larger or smaller stacks. This can be controlled with the --hpx-stacksize=<bytes> option.

The --hpx-sched-wfthreshold=<tasks> option controls how many tasks a worker will spawn in “help-first” mode before switching to “work-first” mode. Help-first mode spawns any tasks created in the current task and continues with the current task. This can become a problem when a task spawns an extreme number of other tasks, causing HPX-5’s internal buffers to grow uncontrollably. In work-first mode a newly spawned task is processed before the continuation of the task that spawned it, preventing the uncontrolled buffer growth. If your application behaves this way, you can use this option to prevent HPX-5 from crashing.

Debugging options

It is important to note that the --hpx-dbg-waitonsig option is implemented using non-portable techniques, and it is not guaranteed to work on all systems.

The --hpx-dbg-mprotectstacks option uses the mprotect() interface to protect memory pages around task stacks. This provides a way to debug stack overflows, but if the incorrect memory access skips the protected page, the problem will not be detected. This does not detect buffer overruns within the stack; your compiler may provide options to catch these errors. Protecting the stack is most useful in deterministic single-threaded execution to detect a deterministic overflow.

Statistics, logging, and tracing options

Log messages are not errors. Instead, they provide an insight into what the HPX-5 runtime is doing “under the hood.” The logging options --hpx-log-at=<locality> and --hpx-log-level=<level> produce logging output for selected subsystems at selected localities. If no locality is selected, all localities will produce log messages.

Statistical profiling and tracing of applications are supported, but require HPX-5 to be configured with instrumentation enabled. The statistics reported can relate to timings, hardware counters, and any user specified values requested using the API. Both the statistical profiling and tracing infrastructure are useful for locating bottlenecks in application performance. The options available are as follows:

  • --hpx-trace-classes=<list of classes as strings> : the “class” of events to trace (possible values are parcel, network, sched, lco, process, memory, trace, gas, collective, all)
  • --hpx-trace-backend=<type> : the type of tracing backend to use (possible values are default, file, console, stats)
  • --hpx-trace-at=localities : filter by locality, -1 for all (default all)
  • --hpx-trace-dir=dir : output directory (file backend)
  • --hpx-trace-buffersize=bytes : size of trace buffers (file backend)
  • --hpx-trace-off : disable tracing at startup (default=off)

ISIR network options

These options are only applicable if using the “ISIR” network (enabled only when using MPI as a transport).

PWC network options

These options are only applicable if using the “PWC” network (enabled only when using Photon as a transport).

The parcelbuffersize is used for point-to-point eager parcel sends, and controls how many parcel bytes may be sent between p2p synchronization. The default value is 65 kb. If there are N localities, then N^2 buffers are allocated.

The parceleagerlimit is used to switch to a rendezvous-send algorithm for parcel transfers. It must be smaller than the parcelbuffersize.
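Because the number of buffers grows with the square of the number of localities, it can be worth estimating the memory cost before raising parcelbuffersize. The following is a minimal sketch of that arithmetic and of the size constraint stated above. It is not part of HPX-5: the function names are hypothetical, and the N^2 figure is treated as the job-wide total, which is our reading of the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Total bytes of eager parcel buffers, assuming N^2 buffers of equal size
 * across the whole job (our interpretation of the text). */
uint64_t eager_buffer_bytes(uint64_t localities, uint64_t parcelbuffersize) {
  assert(localities > 0 && parcelbuffersize > 0);
  return localities * localities * parcelbuffersize;
}

/* The eager limit must be strictly smaller than the parcel buffer size. */
bool eager_limit_is_valid(uint64_t parceleagerlimit, uint64_t parcelbuffersize) {
  return parceleagerlimit < parcelbuffersize;
}
```

For example, with 64 localities and a 65536-byte buffer (one possible reading of the “65 kb” default), the model gives 64 × 64 × 65536 bytes, or 256 MiB, in total.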

Photon transport options

Photon PWC Options

  • --hpx-photon-pwcbufsize=<bytes> : the size of the packed PWC buffer (default 65k)
  • --hpx-photon-smallpwcsize=<bytes> : the maximum message size for transfer in packed mode (default 128)
  • --hpx-photon-ledgersize=<bytes> : control the number of non-packed Photon PWC requests that may be outstanding (default 512)

General Backend Options

Photon implements two “native” backends: Infiniband Verbs and Cray uGNI (Gemini and Aries interconnects). When detected, or enabled via configure flags, these native backends will be selected by default for use with the HPX-5 PWC network. Photon also includes a libfabric backend, which itself implements a number of backends known as providers. These providers include support for TCP sockets, UDP, uGNI, Verbs, Cisco’s USNIC, and Intel’s PSM. If none of the Photon native backends are detected, libfabric/sockets will be used as a fall-back.

Use --hpx-photon-backend={verbs,ugni,fi} to explicitly control which Photon backend is used at runtime. The libfabric backend is abbreviated as fi.

Photon backend details are described below.

Infiniband (verbs)

The Photon default is to use the first detected IB device and active port. This behavior can be overridden with --hpx-photon-ibdev=<dev> and --hpx-photon-ibport=<port>. The ibdev string also acts as a device filter. For example, --hpx-photon-ibdev="mlx4_0:1+qib0:0" will have Photon prefer device mlx4_0 and port 1 but will also use qib0 and port 0 if it is detected.

Device names can be retrieved with ibv_devinfo on systems with IB Verbs support. If --hpx-photon-ibdev= is set to be blank, Photon will try to automatically select the right device.
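To make the shape of the ibdev filter string concrete, here is a small sketch that splits a string of the form dev:port+dev:port into entries in preference order. This is not Photon's parser; parse_ibdev, ib_choice_t, and the limits MAX_DEVS and MAX_NAME are hypothetical, and the sketch assumes every entry carries an explicit port, as in the example string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DEVS 8  /* hypothetical limit for this sketch */
#define MAX_NAME 32 /* hypothetical limit for this sketch */

typedef struct {
  char name[MAX_NAME]; /* device name, e.g. "mlx4_0" */
  int port;            /* port number on that device */
} ib_choice_t;

/* Splits "dev:port+dev:port" into entries, most preferred first.
 * Returns the number of entries parsed, or -1 on malformed input. */
int parse_ibdev(const char *spec, ib_choice_t out[], int max) {
  assert(spec != NULL && out != NULL);
  int n = 0;
  const char *p = spec;
  while (*p != '\0') {
    if (n == max) return -1;
    size_t len = strcspn(p, "+"); /* length of this entry */
    const char *colon = memchr(p, ':', len);
    if (colon == NULL || colon == p) return -1;
    size_t name_len = (size_t)(colon - p);
    if (name_len >= MAX_NAME) return -1;
    memcpy(out[n].name, p, name_len);
    out[n].name[name_len] = '\0';
    char *end = NULL;
    long port = strtol(colon + 1, &end, 10);
    /* the port must be non-empty, non-negative, and fill the entry */
    if (end == colon + 1 || end != p + len || port < 0) return -1;
    out[n].port = (int)port;
    ++n;
    p += len;
    if (*p == '+') {
      ++p;
      if (*p == '\0') return -1; /* trailing '+' */
    }
  }
  return n;
}
```

For the example string above, this yields mlx4_0 with port 1 as the first choice and qib0 with port 0 as the second.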

Cray Gemini and Aries (ugni)

There is currently one flag to control the behavior of the uGNI backend’s use of the block transfer engine (BTE). This threshold is set to 4096 bytes but may be overridden with --hpx-photon-btethresh=<bytes>. Any message whose size equals or exceeds this threshold will be sent using the BTE.

libfabric (fi)

When the Photon libfabric backend is selected (e.g., --hpx-photon-backend=fi), the default provider will be TCP sockets. The libfabric provider may be selected with --hpx-photon-fiprov={sockets,psm,usnic,…}. If a provider requires a specific network interface to be selected (e.g., Linux Ethernet device), this may be specified with --hpx-photon-fidev=<iface>.

Compiling HPX-5 Applications

Assuming pkg-config is available, using it is the easiest way to get the proper compiler and linker flags for building an HPX-5 application.

Merely set:
PKG_CONFIG_PATH=/lib/pkgconfig:$PKG_CONFIG_PATH
and then you can get the proper LIBS and CFLAGS values by doing the following:
pkg-config --libs hpx
pkg-config --cflags hpx
For example, to compile hello.c using HPX-5, you could do:
cc hello.c -o hello `pkg-config --cflags hpx` `pkg-config --libs hpx`
Inside of a Makefile you can get the same values by doing the following:
LIBS = $(shell pkg-config --libs hpx)
CFLAGS = $(shell pkg-config --cflags hpx)

A Minimal HPX-5 Application

In this section we will discuss the minimum elements that are necessary to put together an HPX-5 program, and we will show an example of putting them together into a simple HPX-5 application.

Every HPX-5 program will need to include the HPX-5 header:
#include <hpx/hpx.h>

Just as every C program has its main() function, every HPX-5 program has a main action which will be invoked when the HPX-5 runtime begins executing. An action is simply a function that has been registered with the HPX-5 runtime, so we only have to define our main action and then register it.

We will keep our function very simple:
static HPX_ACTION_DECL(_hello);
static int _hello_action(void) {
  printf("Hello World from %u.\n", hpx_get_my_rank());
  hpx_exit(0, NULL);
}

The main action of an HPX-5 program has one obligation other actions don’t have, which is to call hpx_exit() which ends the HPX-5 runtime’s execution. hpx_exit() takes a single buffer which is returned to the application through the out parameter passed to hpx_run().

Now we just need to register our function. There are several kinds of actions in HPX-5; for now we will declare our main action with a type of default (HPX_DEFAULT) just to keep it simple. HPX-5 contains a macro to make registering actions simpler: HPX_ACTION(). HPX_ACTION() should be placed immediately after the function definition. It takes a type (in our case, HPX_DEFAULT), any attributes modifying the action type (we’ll just use 0 to indicate “none”), an action name (which has to be a valid C variable name), the name of the function to register as that action, and finally some optional parameters specific to certain action types and attributes (which we will ignore for now).

Here is what our use of HPX_ACTION() will look like:
static HPX_ACTION(HPX_DEFAULT, 0, _hello, _hello_action);

The result of using this macro will be that _hello_action() is registered with HPX-5 as an action that takes 0 arguments, and this action may now be referred to via the variable declared as static hpx_action_t _hello;

The rest of what a minimal HPX-5 application needs will be done inside of main(). First, every HPX-5 program will need to initialize the HPX-5 runtime. This is done via the function int hpx_init(int *argc, char ***argv) which should be given a pointer to argc and argv. It returns HPX_SUCCESS (defined by including hpx/hpx.h) on success. This must be done before any other HPX-5 calls.

The next thing that an HPX-5 application needs is a call to hpx_run(hpx_action_t entry, …). hpx_run() causes the HPX-5 runtime to begin execution and take control of the application. Note that hpx_run() is variadic, so we can provide as many arguments as are needed to the main action. In our case we have declared our main action to take no arguments, so we will simply provide a pointer to the registered name of our main action, _hello:
hpx_run(&_hello, NULL);

It is important to note that in contrast to MPI and other SPMD models, this main action is only run once by one lightweight thread at one locality over the entire HPX-5 application. As your launcher is likely an SPMD launcher, the actual main() function executes at all localities and can be used to perform per-locality argument parsing and initialization, but once the HPX-5 runtime begins, the execution model changes to HPX-5’s model. Of course, our main action can spawn as many new actions as we want it to; this will be explored in the following sections.

The last thing an HPX-5 program must have is a call to hpx_finalize(), which will clean up any remaining resources being used by the HPX-5 runtime. It should be called after the hpx_run() has returned.

We will now add one final thing to our first HPX-5 program: error checking. Most HPX-5 functions return HPX_SUCCESS if they do not encounter an error, so we can check the return values of hpx_init() and hpx_run() to make sure no errors were encountered. Remember that the return value of hpx_run() actually comes from the value that we gave it in our main action, though.

Putting this all together, our first, minimal program looks like this:
#include <stdio.h>
#include <hpx/hpx.h>
static HPX_ACTION_DECL(_hello);
static int _hello_action(void) {
  printf("Hello World from %u.\n", hpx_get_my_rank());
  hpx_exit(0, NULL);
}
static HPX_ACTION(HPX_DEFAULT, 0, _hello, _hello_action);

int main(int argc, char *argv[argc]) {
  if (hpx_init(&argc, &argv) != 0)
    return -1;
  int e = hpx_run(&_hello, NULL);
  hpx_finalize();
  return e;
}

As explained in Compiling HPX-5 Applications, assuming $PKG_CONFIG_PATH is set properly, we should be able to compile the above program with a command such as:
gcc `pkg-config --cflags hpx` -o hello hello.c `pkg-config --libs hpx`

If you are using mpirun as your job launcher, you should be able to run the above program on one node with one scheduler thread like this:
mpirun -np 1 --map-by node:PE=1 ./hello

Of course, to do anything interesting we would need many more API calls. The most useful and critical parts of the API are explained in the following section.

HPX re-entrance

Starting with the 2.0 release, HPX-5 lets the user call hpx_run() as many times as the application needs. This allows the application to execute other routines, for example MPI code, between runs. Once all hpx_run() phases are complete, hpx_finalize() may be invoked to clean up the HPX-5 environment.

Following is a simplified code snippet illustrating the re-entrance functionality.
#include <stdio.h>
#include <hpx/hpx.h>
#define RUNS 100
static HPX_ACTION_DECL(_hello);
static int _hello_action(void) {
  printf("Hello World from %u.\n", hpx_get_my_rank());
  hpx_exit(0, NULL);
}
static HPX_ACTION(HPX_DEFAULT, 0, _hello, _hello_action);

int main(int argc, char *argv[argc]) {
  int success;
  if (hpx_init(&argc, &argv) != 0)
    return -1;
  for (int i = 0; i < RUNS; ++i) {
    success = hpx_run(&_hello, NULL);
  }
  hpx_finalize();
  return success;
}

The HPX-5 API

Initializing, starting, and finishing the HPX-5 runtime

The runtime’s system API is defined in hpx/runtime.h.

The HPX-5 runtime must be initialized via hpx_init() before any other HPX-5 functions are called. To start the HPX-5 runtime, use hpx_run(). After all calls to hpx_run(), call hpx_finalize() to clean up the HPX-5 runtime. For more details see the previous section.

Global Memory

The global memory API is defined in hpx/gas.h and hpx/addr.h.

HPX-5 supports several different memory models:

  • Under the “SMP” model (which is only available when HPX-5 is not being used in a distributed manner) global memory is effectively the same as local memory, except that the memory has a global address the same as it would in one of the normal HPX-5 memory models.
  • Under the “PGAS” model, global memory is allocated and managed similarly to how UPC does so. Memory can be allocated at the current locality or can be allocated cyclically over a number of localities.
  • Under the AGAS (Active Global Address Space) model, global memory is allocated similarly to how it is allocated under the PGAS model, but it is not necessarily located at a fixed locality; it may be moved to other localities to balance system memory load or workload. The AGAS distributed in HPX-5 is considered experimental.

All HPX-5 global memory is allocated in blocks. A block is a contiguous chunk of memory that is always located on the same locality, and will always be pinned (see below) together when pinning the block.

Global memory must be pinned before it can be read from or written to directly. This accomplishes two things: It prevents the memory from moving (when using AGAS) and it provides a pointer to local virtual memory that corresponds to the given global memory.

The simplest way to pin memory is to use pinned actions (see below). The other way is to use hpx_gas_try_pin(). It takes as parameters a global address and the address of a pointer, which will be set with the local address of the global memory on success. Note that unless AGAS is being used, this will fail if the memory is not at the same locality as the current thread. Memory pinned this way should be unpinned after use with hpx_gas_unpin().

Global memory can also be written to from local memory using hpx_gas_memput(). hpx_gas_memget() accomplishes the opposite. Note that these functions do not prevent other threads (whether from the same locality or not) from writing to or reading from the same global memory; some other synchronization mechanism must be employed to ensure that this does not happen.

The simplest way to get distributed global memory is with hpx_gas_alloc_cyclic(), which takes as parameters the number of blocks to allocate, the size of each block, and an alignment (0 is an acceptable value for this if you don’t care), and returns a global address. The allocation is initially distributed cyclically by block, as the name implies; i.e., with N ranks, block 0 is at rank 0, block 1 is at rank 1, …, block N is at rank 0, ….
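The initial cyclic placement is just a modulo over the number of ranks. The helper below is a sketch of that mapping only; cyclic_block_rank is a hypothetical name and makes no HPX-5 calls, and under AGAS a block may later move away from its initial rank.

```c
#include <assert.h>
#include <stdint.h>

/* Rank that initially holds block `block` of a cyclic allocation
 * spread over `ranks` localities. */
uint32_t cyclic_block_rank(uint64_t block, uint32_t ranks) {
  assert(ranks > 0);
  return (uint32_t)(block % ranks);
}
```

With four ranks, blocks 0 through 3 land on ranks 0 through 3, and block 4 wraps around to rank 0.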

Non-cyclic memory can be allocated with hpx_gas_alloc_local() which will try to return a memory block local to the calling thread. The memory block is not guaranteed to be local however, and applications should not expect a subsequent hpx_gas_try_pin() to succeed.

Addresses

Global addresses have type hpx_addr_t. hpx_addr_add() can be used to compute an address given a base address, an offset, and the block size used when the memory was allocated. hpx_addr_sub() returns an offset given two addresses and their block size.
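To see why the block size is part of the address arithmetic, consider a simplified model in which a global position is a (block, offset) pair. This is only an illustration and not how HPX-5 represents hpx_addr_t; model_addr_t, model_addr_add(), and model_addr_sub() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a global address: which block, and where in it. */
typedef struct {
  uint64_t block;
  uint32_t offset;
} model_addr_t;

/* Moves `bytes` forward (or backward, if negative) from `a`, carrying into
 * the block index whenever the offset crosses a block boundary. */
model_addr_t model_addr_add(model_addr_t a, int64_t bytes, uint32_t bsize) {
  assert(bsize > 0 && a.offset < bsize);
  int64_t linear = (int64_t)(a.block * bsize + a.offset) + bytes;
  assert(linear >= 0);
  model_addr_t r = { (uint64_t)linear / bsize,
                     (uint32_t)((uint64_t)linear % bsize) };
  return r;
}

/* Distance in bytes from `rhs` to `lhs` for the same block size. */
int64_t model_addr_sub(model_addr_t lhs, model_addr_t rhs, uint32_t bsize) {
  assert(bsize > 0);
  return (int64_t)(lhs.block * bsize + lhs.offset) -
         (int64_t)(rhs.block * bsize + rhs.offset);
}
```

With 64-byte blocks, adding 200 bytes to the start of block 0 lands at offset 8 of block 3, and subtracting the two positions gives back 200.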

There are several special addresses defined:

  • HPX_NULL is like NULL except for global addresses.
  • HPX_HERE represents the current locality. It is often used to invoke actions that do not need a specific address but that the user thinks will perform better if being run on the same locality as the invoking action (due to, for example, the placement of data or LCOs the action might use) or because the current action will perform better if done so (for example, if it needs a result from the invoked action).
  • HPX_THERE(i) is a special address that refers to locality i. Its use cases are similar to HPX_HERE as described above.

Actions

The API for registering actions and action types is defined in hpx/action.h. Actions can be invoked using the parcel interface defined in hpx/parcel.h or through one of the remote-procedure-call-style operations in hpx/rpc.h. Additional functionality related to actions can be found in hpx/par.h and hpx/thread.h.

Registering actions

There are two ways to register actions:

1. Manual registration, by calling HPX_REGISTER_ACTION() before hpx_init():
main() {
  …
  HPX_REGISTER_ACTION(HPX_DEFAULT, 0, foo, foo_handler, HPX_INT, HPX_INT);
  hpx_init(&argc, &argv);
}
2. Automatic registration, by calling HPX_ACTION() right after defining the function to register:
int foo_handler(int a, int b) {
  …
}
HPX_ACTION(HPX_DEFAULT, 0, foo, foo_handler, HPX_INT, HPX_INT);

Automatic registration relies on the object file’s static constructor capabilities. This has been tested and works on ELF and Mach-O platforms. One thing to be aware of when using automatic registration with dynamic libraries is that you must ensure that the library has been loaded prior to calling hpx_init(). This can typically be done by explicitly adding a call to a library function prior to the call to hpx_init() which will force it to be loaded and have its constructors run.

Types of actions

There are four “exclusive” types of actions. The first is HPX_DEFAULT, the most common kind of action, which results in an HPX-5 thread being created. The other three are:

  • HPX_TASK
  • HPX_INTERRUPT
  • HPX_FUNCTION

Tasks are threads that do not block. Interrupts are simple actions that have function call semantics. Functions are simple functions that have uniform ids across localities; they cannot be called with the set of hpx_call() operations or as the action or continuation in a parcel, but can only be called by using the returned value from hpx_action_get_handler().

There are in addition two “non-exclusive” action attributes:

  • HPX_PINNED
  • HPX_MARSHALLED

Parcel coalescing is only applied to parcels targeting actions registered with the HPX_COALESCED attribute. To enable message coalescing, use the --hpx-coalescing-buffersize=N runtime option, where N is the number of parcels that will be coalesced on a per-target-rank basis.

Default actions can only accept base datatypes or struct datatypes already known to HPX-5. In order to pass a buffer of arbitrary size to the handler, marshalled actions should be used.

Marshalled actions must accept as parameters a void* (which will point to the data passed to it) and a size_t (which indicates the size of the data):
// buffer buf of size “n” bytes
int foo_handler(void *buf, size_t n) {

}
HPX_ACTION(HPX_DEFAULT, HPX_MARSHALLED,
foo, foo_handler,
HPX_POINTER, HPX_SIZE_T);
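Because a marshalled handler receives only a raw buffer and its size, it is common to check the size against the payload's own header before reading it. The sketch below shows one way to lay out and validate a variable-length payload. It is deliberately free of HPX-5 calls so that it compiles on its own: samples_t, samples_size(), mean_handler(), and last_mean are hypothetical, registration with HPX_ACTION() is omitted, and the return value 0 merely stands in for HPX_SUCCESS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical payload: a count followed by that many doubles. */
typedef struct {
  uint32_t count;
  double values[];
} samples_t;

/* Where this sketch records its result; a real action would more likely
 * pass the value to its continuation. */
static double last_mean;

/* Number of bytes needed to marshal `count` samples. */
size_t samples_size(uint32_t count) {
  return sizeof(samples_t) + count * sizeof(double);
}

/* Same parameter list as a marshalled handler: a buffer and its byte count.
 * Returns 0 on success and -1 if the byte count does not match the
 * payload's own count field or the payload is empty. */
int mean_handler(void *buf, size_t n) {
  assert(buf != NULL);
  const samples_t *s = buf;
  if (n < sizeof(samples_t) || s->count == 0 || n != samples_size(s->count)) {
    return -1;
  }
  double sum = 0.0;
  for (uint32_t i = 0; i < s->count; ++i) {
    sum += s->values[i];
  }
  last_mean = sum / s->count;
  return 0;
}
```

The sender would allocate samples_size(count) bytes, fill in the count and values, and pass that buffer and size as the two arguments of the marshalled action.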

A pinned action translates the global address that an action is targeted to, and makes the translated local virtual address available to the handler. An action handler of this type must have a pointer as its first parameter (which will point to the local address of the pinned global memory), and when such an action is registered, the first parameter in its type signature must be HPX_POINTER:
int foo_handler(void *translated, int b) {
  …
}
HPX_ACTION(HPX_DEFAULT, HPX_PINNED, foo, foo_handler, HPX_POINTER, HPX_INT);

Invoking actions

Low-level action invocation is done using parcels; however, a higher-level remote-procedure-call-style interface is available in hpx/rpc.h and hpx/par.h.

There are many different ways to invoke actions in HPX-5. The simplest is the variadic function hpx_call(), which takes as parameters a global address at which to run the action (which will also be the address that is pinned when using a pinned action), the action to run, an LCO to set with the result of the action, and any arguments to pass to the action. The address at which to run the action may be HPX_NULL, in which case the runtime will execute the new action at the same locality as the current action.

When the result of an action is needed before work can continue, it can be convenient to use hpx_call_sync(), which will return only once the called action has completed. It has additional parameters for a local buffer and its size that will be written to with the results of the invoked action.

hpx_call_async() is in some sense the opposite of the synchronous version of call: it will not even wait for the arguments passed to the action to be copied before returning. It has an optional parameter for an LCO which will be set when the arguments are free to be reused or freed.

Sometimes an action should not begin until something else has been completed. While an action can be written to wait for an LCO to be set before doing anything else, there is a more elegant and efficient way to accomplish this pattern: hpx_call_when() will run an action only once a “gate” has been set. It takes the same parameters as hpx_call() with an additional first parameter for the gate LCO. There are several variants of hpx_call_when() that combine other call methods described in this section.

Continuations are actions that are run automatically once some other action has been completed. hpx_call_with_continuation() will call an action with a specified continuation that will be run once the first action has been completed. There are several variants of hpx_call_with_continuation().

An action can be invoked in such a way that the new action will use the invoking action’s continuation: hpx_call_cc(). Note that this finishes the current action’s execution; control does not return to the calling thread but passes to the new action.

hpx_bcast() can be used to invoke an action on all localities in parallel. Note that hpx_bcast() is actually asynchronous in the sense of hpx_call_async(). If you want to simply wait until the arguments to the broadcast are available for reuse or freeing, you can use hpx_bcast_lsync(). hpx_bcast_rsync() is the fully synchronous version; it will not return until the broadcast action has completed on all localities.

hpx_par_call() is used to invoke multiple instances of the same action (but all with different arguments) at the current locality. It can be used to make parallel loops.

Controlling and inspecting the current thread

The thread API is defined in hpx/thread.h.

Several functions are provided for getting information about the currently executing thread (i.e. action). hpx_thread_current_target() returns the address to which this action was sent when invoked. hpx_thread_current_action() returns an hpx_action_t that represents what action the current thread is performing. hpx_thread_current_cont_target() and hpx_thread_current_cont_action() provide the continuation address and action of the current thread. hpx_thread_get_tls_id() returns a unique id that may be used to represent the current thread (useful for debugging, for example).

hpx_thread_continue() causes the current thread to end, invoking its continuation, and passes a provided value to the thread’s continuation. hpx_thread_continue_cleanup() is similar but also takes as a parameter a cleanup function (e.g. free()) to be run on the data provided to the continuation after the thread completes.

It sometimes happens that a thread needs to wait for a condition that cannot be effectively described using an LCO. hpx_thread_yield() pauses execution and gives other threads the opportunity to be scheduled. We recommend avoiding hpx_thread_yield() if possible, as it inhibits the runtime’s ability to work-steal and load balance: the runtime has no information with which to prioritize new work over yielded threads, or vice versa.

LCOs

The LCO API is defined in hpx/lco.h.

Types of LCOs

HPX-5 includes several different types of LCOs, including:

  • futures
  • and LCOs
  • “collective” LCOs
    • alltoall
    • allreduce
    • reduce
  • semaphores
  • user LCOs
  • Dataflow LCOs
  • Generation counters

Collective LCOs

The collective LCOs function similarly to how MPI collectives do, except that they are collective over participating threads, not ranks.

  • The alltoall LCO is similar to the allgather LCO, except that now all setting participants set an array of values instead of a single value; each getting participant receives an array of values composed of portions of each of the values with which the LCO was set. For the alltoall LCO there is an equivalent call for each of the allgather LCO calls: hpx_lco_alltoall_new() and hpx_lco_alltoall_setid(). But, additionally, as the index of the getting participant now matters, there is a hpx_lco_alltoall_getid() to get the value of the LCO which replaces hpx_lco_wait() for this LCO.
  • The allreduce LCO receives values from a number of participants like the above collectives, but the LCO automatically performs a specified operation on the values (such as summing them) before the LCO is set. An allreduce LCO is created with a call to hpx_lco_allreduce_new(), which takes as arguments the number of “setters,” the number of “getters,” the size of the value being input and output, and two HPX-5 actions: one to initialize the value being reduced (e.g. in the case of a sum, set the sum to 0) and one to perform the reduction by combining the running value with the new input. These HPX-5 actions must be of specific types. The initialization action takes two arguments: a pointer to the value to be initialized, and the size of the value to be initialized (which may not be needed to actually initialize the value). The reduction action takes a pointer to the accumulated value, a pointer to the new value being added to the reduction, and the size of the values.
  • The reduce LCO is very similar to the allreduce LCO except that there is considered to be only one getter.
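The two handlers that an allreduce or reduce LCO needs can be sketched as ordinary C functions. The example below implements a sum over double values with the parameter lists described above. The names sum_init and sum_op and the use of const on the input pointer are our assumptions, and registering the functions as HPX-5 actions and passing them to hpx_lco_allreduce_new() is omitted so that the sketch compiles without HPX-5.

```c
#include <assert.h>
#include <stddef.h>

/* Initialization handler: set the running value to the identity element
 * of the reduction, which is 0 for a sum. */
void sum_init(void *value, size_t size) {
  assert(size == sizeof(double));
  *(double *)value = 0.0;
}

/* Reduction handler: fold one participant's contribution into the
 * accumulated value. */
void sum_op(void *acc, const void *contribution, size_t size) {
  assert(size == sizeof(double));
  *(double *)acc += *(const double *)contribution;
}
```

Conceptually, the LCO applies the initialization handler once and then the reduction handler once per setter; only after the last setter has contributed do the getters see the result.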

Other LCOs

For some situations where complicated synchronization is required, HPX-5 provides the semaphore LCO, which functions just as any semaphore. The semaphore LCO can be created with hpx_lco_sema_new(), which takes as an argument the initial value the semaphore will have. The semaphore can be waited on by using hpx_lco_sema_p() and signaled with hpx_lco_sema_v().

If none of the existing LCOs supports a pattern that you need, HPX-5 provides “user” LCOs.

Other LCO Operations

In the tutorial section, creating, setting, getting, waiting on, and deleting LCOs were covered. However, there are other operations that can be performed on LCOs that may be of use.

Resetting an LCO

A very common operation on LCOs is reset. Resetting an LCO via hpx_lco_reset() allows the LCO to be reused; after the LCO is reset, it may be set again. If an LCO that has not been reset is set again, the set function will return an error status.
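The set, reset, set-again life cycle can be pictured with a toy future. This is a model of the rule stated above and not HPX-5's implementation: toy_future_t, toy_set(), toy_reset(), and the TOY_ status codes are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TOY_OK = 0, TOY_ERROR = -1 };

/* Toy single-value future: holds a value once it has been set. */
typedef struct {
  bool is_set;
  int value;
} toy_future_t;

/* Sets the future; fails if it is already set and has not been reset. */
int toy_set(toy_future_t *f, int value) {
  assert(f != NULL);
  if (f->is_set) {
    return TOY_ERROR;
  }
  f->value = value;
  f->is_set = true;
  return TOY_OK;
}

/* Makes the future reusable; the next toy_set() will succeed again. */
void toy_reset(toy_future_t *f) {
  assert(f != NULL);
  f->is_set = false;
}
```

In this model a second toy_set() without an intervening toy_reset() fails and leaves the first value in place.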

Waiting on multiple LCOs at once

Sometimes one needs to wait on a set of LCOs. While this can be done with a for loop, it is more efficient to use hpx_lco_wait_all(). (Obviously, if an and LCO can be used, it is to be preferred.) Furthermore, if the values of those LCOs are needed, the function hpx_lco_get_all() is available.

Getting references to LCO values

It is not always desirable to copy the value of an LCO, but calling hpx_lco_get() will copy the value of the LCO’s buffer. There is a way around this with hpx_lco_getref(). hpx_lco_getref() takes as parameters the LCO’s global address, the size of the buffer, and the address of a pointer. The pointer’s value will be set with an address to an external buffer. This is just a reference to existing memory; it should not be freed, as the runtime will automatically manage the memory. (Note that if the LCO is on a different locality than the action that called hpx_lco_getref(), there will still be a copy.)

Global arrays of LCOs

Most LCOs can be allocated not just one at a time but in global memory arrays of LCOs. (Note that, with one exception, these arrays are in global memory but reside at the locality that created them; they are not allocated cyclically.) For example, an entire array of LCOs can be created with hpx_lco_future_local_array_new(), which takes as arguments the number of futures to allocate and the size of the value each future will possess. There are equivalent functions for most LCO types. To get the address of the members of this array, use hpx_lco_array_at(), which takes as arguments the array address, the index of the LCO to get the address for, and the size of the data the LCO contains (i.e. the size given when the LCO is created).

Futures are a special case among LCO arrays: they can be allocated as above, but they can also be allocated block-cyclically in a global array by using hpx_lco_future_array_new(), which takes an additional parameter specifying the block size for the allocation. The address of these futures can be obtained with hpx_lco_future_array_at().

Processes

The process API is defined in hpx/process.h.

HPX-5 has a special construct, the process, that makes it easier to tell when a program or part of a program has completed. (This is especially helpful when there are many small actions that may spawn other actions, where it can be difficult to tell when all of the actions have completed.)

Processes are created with hpx_process_new(), which takes as a parameter an LCO to set when the process has completed, and returns an hpx_addr_t that can be used to refer to the process. Actions that belong to the process are then called via hpx_process_call(), which is just like hpx_call() except it has an additional parameter before the others that is the global address of the process. Processes can be deleted with hpx_process_delete().

Process Level Collectives

HPX-5 provides a programming model for collective communication under the process API. The underlying mechanics of process-level collectives differ from those of the standard LCO collectives and are intended to be more efficient than the LCO implementation. HPX-5 process collectives have two stages: (a) subscription and (b) accumulation. The first stage forms a group for a given collective operation. Users invoke hpx_process_collective_allreduce_new() to create a global descriptor address and then use it with hpx_process_collective_subscribe() to subscribe to the output (if any) of a particular collective operation. Then hpx_process_collective_subscribe_finalize() should be called to finalize all subscribers for that collective. The final stage (accumulation) involves executing collective operations using hpx_process_collective_allreduce_join() and accumulating collective results using HPX-5 futures or similar constructs. An example of collectives can be found in allreduce.c.

Two flavors of process level collectives can be found in HPX-5.

  1. Native parcel based – This implementation is entirely parcel driven: messages pertaining to collective communication are sent entirely through HPX-5 parcel send operations.
  2. Network based – This implementation relies on the underlying network transport to realize part of the collective communication. Network collectives enable HPX-5 to use faster routes for certain communication paths during a collective operation. However, currently only the ISIR transport is supported in this mode. Users need to enable the network collective mode explicitly by specifying the --hpx-coll-network flag when executing the program.

Miscellaneous functionality

Topology

Most of the topology API is defined in hpx/topology.h; hpx_thread_get_tls_id() is in hpx/thread.h.

Sometimes it is helpful to know where in the system a particular action is executing, especially when debugging. You can find which node the current thread is running on with either the HPX_LOCALITY_ID symbol or hpx_get_my_rank(). HPX_LOCALITIES or hpx_get_num_ranks() will tell you how many nodes the current HPX-5 program is running on.

HPX_THREADS or hpx_get_num_threads() will tell you how many operating system threads each locality is using. HPX_THREAD_ID or hpx_get_my_thread_id() will return an id number representing which operating system thread the current action is running in. These are generally not needed by applications. Usually, what you want instead is hpx_thread_get_tls_id(), which returns an id for the action’s thread (which is unique to that thread, unlike HPX_THREAD_ID).

Parcels

Parcels are what underlie HPX-5 actions and threads. Most of the functionality provided by the parcels API is provided at a higher-level elsewhere in the API, so many user applications will not need parcels. The parcel API is defined in hpx/parcel.h.

Timing

HPX-5 provides some simple timers in hpx/time.h.

HPX C++ API

Experimental support for HPX C++ bindings (HPX++) is available and can be enabled with the configure option --enable-hpx++. The C++ bindings provide a smart pointer template, hpx::global_ptr, which acts as a pointer to an address in global memory. The pointer is templated on the type of data it points to. Various methods are provided for address arithmetic on global pointers. Similarly, classes for various types of LCOs are provided. Finally, a new action registration interface makes it easy to create actions and typecheck argument and return types. Users can access the API through the hpx/hpx++.h header.

The hello world program from above can be rewritten with the C++ API as shown below:
#include <iostream>
#include <hpx/hpx++.h>
static int _hello_action(void) {
  std::cout << "Hello World from " << hpx_get_my_rank() << std::endl;
  hpx::exit();
}
auto _hello = hpx::make_action(_hello_action);

int main(int argc, char* argv[]) {
  if (hpx::init(&argc, &argv) != 0) {
    return -1;
  }
  int e = _hello.run();
  hpx::finalize();
  return e;
}

The C++ API is a work in progress and will be modified or added to in later releases.

Troubleshooting

If you receive errors from the HPX-5 runtime such as "Application ran out of global address space," you will need to increase the global memory heap size using the --hpx-heapsize=<bytes> option.

The default thread stack size used by the runtime is rather small, to improve efficiency. However, this makes it much easier to overflow the stack. If you experience memory corruption or segmentation faults, try increasing the stack size using the --hpx-stacksize=<bytes> option, or use the debug option --hpx-dbg-mprotectstacks, which tries to catch stack overflows (but not buffer overflows).

Remember when troubleshooting and debugging that it is much easier to debug issues with less complexity. Always run your application with one locality and one thread to see if you can reproduce your issue there, then move on to one locality and multiple threads, then to multiple (i.e., two) localities with one thread each. HPX-5 1.0 is deterministic with one scheduler thread and one locality.

hpx-config

The hpx-config script provides information about how the HPX-5 library was configured which may be useful when debugging applications or when reporting bugs in the runtime. It also provides information on how to integrate the runtime library into application builds.

hpx-config is installed to the bin subdirectory of the HPX-5 installation path when installing the library.

Its usage is shown below:
$ export PATH=$HPX_PREFIX/bin:$PATH
$ hpx-config --help
Usage: hpx-config
Options:
  --help | -h   : Print usage.
  --version     : Print hpx version.
  --config      : Print configure options used to build hpx.
  --prefix      : Print installation directory prefix.
  --bindir      : Print binary installation directory.
  --includedir  : Print include installation directory.
  --libdir      : Print library installation directory.
  --src-root    : Print the source root HPX was built from.
  --obj-root    : Print the object root used to build HPX.
  --host-cpu    : Print the host cpu used to configure HPX.
  --host-os     : Print the host os used to configure HPX.
  --cc          : Print compiler used to build hpx.
  --cflags      : Print compiler flags used to build hpx.
  --cxx         : Print c++ compiler used to build hpx.
  --cxxflags    : Print cxx flags used to build HPX.
  --cppflags    : Print preprocessor flags used to build hpx.
  --ldflags     : Print the ld flags used to build HPX.
  --ccasflags   : Print the ccas flags used to build HPX.
  --libs        : Print libraries hpx was linked against.

Debugging

Debugging HPX-5 applications can be more complicated than debugging non-distributed applications, or even MPI applications. This section will explain some inherent difficulties in debugging HPX-5 applications, and cover some of the things that can make debugging HPX-5 applications easier.

Why debugging HPX-5 applications is hard

There are some inherent difficulties in debugging HPX-5 applications. The largest is that HPX-5 is a distributed runtime that encourages programs to use an event-driven style to address starvation and latency. As a result there are many transient threads executing all over the system, and finding errors happening in such a context can be very challenging. Future debuggers may make solving this problem easier, but at present the best way to counteract this problem is to tackle the bug at the smallest scale at which it occurs (see the next section).

One of the other difficulties in debugging HPX-5 applications is that a debugger will conflate application state with HPX-5 runtime state. This happens because, while the debugger can interpret some kinds of threads (pthreads, for example), it cannot tell which code in an HPX-5 program is part of the application and which is part of the runtime, so it treats it all the same. Furthermore, the debugger is unaware of the set of pending threads and parcels that are waiting on LCOs or that are runnable but not yet scheduled. This can lead to challenges if the problem is something waiting on an LCO that shouldn't be (this could happen, for example, if the application is waiting on the wrong LCO). Ideally, if you asked for a backtrace you would see hpx_lco_wait() at the end, but in reality hpx_lco_wait() may not show up in the backtrace at all, since inside the runtime the waiting action is not executing at that moment. At present, there is no good workaround for this problem.

Fundamentally there is a mismatch between the commonly available debugging tools which are thread-centric, and HPX-5 applications which are encouraged to be event-driven and data-centric.

How to make debugging HPX-5 applications easier

The key to debugging HPX-5 applications is running at the smallest scale possible. If an error manifests, see if it can be reproduced on a single node instead of multiple ones; distributed debugging is much harder than debugging a single process! It is also helpful to reduce the number of operating system threads that the HPX-5 runtime is using (this can be controlled on the command line with --hpx-threads=<threads>). If an error is reproducible with only 1 thread, it will of course be much easier to debug, but even if the number of threads can merely be reduced, there are still fewer things that need to be filtered out. The simplest errors to debug will happen on one node and with one thread, and while most errors will be much more complicated, it doesn't pay to debug an error the hard way first.

If a memory error is suspected, we recommend using valgrind to check the program. valgrind will work with HPX-5, with a couple of caveats. False positives will be generated, so it may be helpful to create a filter for them. Also, it is even more important to reduce a bug to the minimal scale at which it will happen because, due to the way the runtime functions, even "small" HPX-5 applications can be extremely slow in valgrind.

Also, HPX-5 has an option --hpx-dbg-mprotectstacks that tries to catch errors resulting from overwriting thread stacks.

Attaching a debugger

When configured with --enable-debug, HPX-5 has a feature to make debugging a little easier: it can pause an HPX-5 program so that a debugger can be attached.

HPX-5 programs can be paused at startup via the command line option --hpx-dbg-waitat=<nodes>, where <nodes> is a single node number, a comma-separated list of node numbers, or -1 for all nodes. Once the program has started, a message will be printed that includes the process number of the application so that a debugger can be attached to that process.
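To make the three accepted forms of <nodes> concrete, here is a minimal sketch in C of how such a value could be interpreted. It is an illustration only, not the parser HPX-5 uses; the function name rank_selected is hypothetical, and treating malformed input as "no match" is an assumption.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not part of HPX-5: returns true if `rank` is selected
 * by `spec`, where spec is "-1" (all nodes), a single node number, or a
 * comma-separated list such as "0,3,7". */
static bool rank_selected(const char *spec, int rank) {
  const char *p = spec;
  while (*p != '\0') {
    char *end = NULL;
    long value = strtol(p, &end, 10);
    if (end == p) {
      return false;          /* no digits found: treat the spec as malformed */
    }
    if (value == -1 || value == rank) {
      return true;           /* -1 selects every node */
    }
    if (*end == ',') {
      p = end + 1;           /* skip the separator, parse the next entry */
    } else if (*end == '\0') {
      p = end;               /* reached the end of the list */
    } else {
      return false;          /* unexpected trailing characters */
    }
  }
  return false;
}
```

With this sketch, rank_selected("0,3,7", 3) and rank_selected("-1", 5) are true, while rank_selected("0,3,7", 2) is false.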

Alternatively, the HPX-5 runtime can be set to wait for a debugger only if an error (either an abort or a segmentation fault) occurs. This can be controlled by adding the command line option --hpx-dbg-waitonabort or --hpx-dbg-waitonsegv, or both. Waiting on segmentation faults is unreliable if the segmentation fault is a result of a stack overflow.

For example, waiting on the example application hello at startup on the lowest-numbered node might look like this:
$ mpirun -np 2 /path/to/hello --hpx-dbg-waitat=0
PID 8377 on node001 ready for attach

At this point, on non-Cray systems (see below for more information about debugging on Cray systems), it is merely necessary to attach a debugger to the process in a separate terminal. Using gdb, for example:
gdb /path/to/hello 8377

If the application is running over multiple nodes, you will have to use ssh to connect to the appropriate node before attaching the debugger.

The remainder of the procedure is a little more arcane. There is a two-step sequence of instructions that must be executed before you can continue in the debugger:

First you must go up two stack frames (in gdb this is up 2) and then set variable i to 1 (set variable i = 1):
(gdb) up 2
#2 0x00007ffff7962180 in dbg_wait () at ../../hpx/libhpx/debug.c:42
42 sleep(12);
(gdb) set variable i = 1
(gdb) c
Continuing.

Now you can debug your application normally.
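The reason setting i to 1 releases the program is that the paused process is polling a flag in a sleep loop. The following C sketch shows that general pattern. It is not the libhpx source: the function name wait_for_debugger, the flag passed by pointer (the session above shows a local variable named i instead), and the max_polls cutoff are all assumptions made so the example can run on its own.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical sketch of a "wait for debugger" loop.
 * released:     flag a debugger would set to a non-zero value
 *               (for example with "set variable ... = 1" in gdb)
 * poll_seconds: how long to sleep between checks of the flag
 * max_polls:    give up after this many checks; a negative value
 *               waits indefinitely
 * Returns the number of times the loop slept. */
static int wait_for_debugger(volatile int *released, unsigned poll_seconds,
                             int max_polls) {
  int polls = 0;
  printf("PID %d ready for attach\n", (int)getpid());
  fflush(stdout);
  /* volatile forces the flag to be re-read on every iteration, so a
   * write made by the debugger is seen by the loop. */
  while (!*released && (max_polls < 0 || polls < max_polls)) {
    sleep(poll_seconds);
    polls++;
  }
  return polls;
}
```

Because the loop only checks the flag between sleeps, the process may take up to one polling interval to resume after the debugger changes the variable and continues.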

Attaching a debugger on Cray Systems

Debugging on Cray systems uses lgdb. As lgdb is a parallel debugger, we want our application to pause at the same place on all ranks, so the command line option --hpx-dbg-waitat=-1 should be used.

To attach a debugger to the process, launch the application to wait everywhere and then find the “application id” using apstat:
$ apstat | grep hello
354794 128602 p02135 64 2 4h06m run hello

The first column of output contains the application id; we can use that to attach lgdb by using attach $a:
$ lgdb
dbg all> attach $a 355045

The sequence of commands to unpause the application is similar to, but slightly more complicated than, the one used for gdb, and requires the use of gdbmode.

The sequence is:

  • up 2 to move up on the stack,
  • gdbmode to gain direct control of all gdb instances run by lgdb,
  • p i=1 to set the variable i to 1,
  • and end to end gdbmode we entered earlier.

At this point, the application can be debugged normally.

In practice the above might look something like this:
dbg all> up 2
a{0..15}: #2 0x00002aaaaace0c30 in dbg_wait at ../../hpx/libhpx/debug.c:42
dbg all> gdbmode
Entering gdb pass-thru mode. Type "end" to exit mode...
GNU gdb (GDB) 7.6.2-4.2.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
> p i=1
$1 = 1
> end
Ending gdb pass-thru mode. If program location has changed (i.e. continue) debugger is in an unknown state.
dbg all>

Writing HPX-5 Applications Idiomatically

Sometimes it is not clear how best to convert a program to use the HPX-5 programming model. Other times it may be possible to write applications using the same idioms used in other models, but there may be a performance impact. In this section we will consider some different idioms that work best in HPX-5 programs.

Send work to data, not vice versa

Traditional parallel programming models often rely on parallel threads running in fixed locations, and moving data between those threads.

Although HPX-5 can emulate more traditional parallel models (by using hpx_gas_memget() and hpx_gas_memput() or by packing data into parcels), that is not generally the most efficient way to get "work" to the data it needs. While an action is waiting to receive data from an hpx_gas_memget(), for example, the memory cannot be used and the waiting action can do nothing, but it is still consuming stack space and space in the scheduler queue. More importantly, moving the memory can be quite expensive if a sufficient amount needs to be moved, and can even lead to network congestion. The overall throughput of the system is generally improved by moving the "work" rather than the data whenever possible.

The preferred way to deal with this situation in HPX-5 is to leave the memory in a single location but to spawn a new action at that address to perform the work that needs to be done using that data. It is easy to execute an action at the location of the data it needs, since hpx_call() takes an address at which to run the action as an argument. The main complication in doing this is that sometimes actions will need to be split into shorter actions that can be sent to where they need to be run.

(Sometimes it may not be possible to effectively use this idiom. For example, for an action whose whole purpose is to aggregate data, it may be easier and more efficient to use a series of calls to hpx_gas_memget() or to refactor entirely using an appropriate LCO. But in general, the principle holds.)
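The difference between the two approaches can be illustrated with a small single-process simulation in C. It makes no HPX-5 calls: locality_t, sim_get(), and the bytes_on_wire counter are hypothetical stand-ins, and the request message needed to start remote work is ignored for simplicity.

```c
#include <stddef.h>
#include <string.h>

enum { BLOCK_LEN = 1024 };

/* A simulated locality that owns one block of data. */
typedef struct {
  double block[BLOCK_LEN];
} locality_t;

/* Crude counter of bytes that would cross the network. */
static size_t bytes_on_wire;

/* Stand-in for a remote get: the whole block travels to the caller. */
static void sim_get(double *dst, const locality_t *owner) {
  memcpy(dst, owner->block, sizeof owner->block);
  bytes_on_wire += sizeof owner->block;
}

/* The work itself: sum the elements of a block. */
static double sum_block(const double *block) {
  double total = 0.0;
  for (size_t k = 0; k < BLOCK_LEN; k++) {
    total += block[k];
  }
  return total;
}

/* Data to work: copy the block to the caller, then compute locally. */
static double sum_by_moving_data(const locality_t *owner) {
  double local[BLOCK_LEN];
  sim_get(local, owner);
  return sum_block(local);
}

/* Work to data: compute where the block lives; only the result travels. */
static double sum_by_moving_work(const locality_t *owner) {
  double result = sum_block(owner->block);
  bytes_on_wire += sizeof result;
  return result;
}
```

Both functions return the same sum, but in this model the first moves BLOCK_LEN * sizeof(double) bytes while the second moves only sizeof(double). The gap grows with the size of the block, which is the effect described above.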

Start actions only when they are needed

In some applications, you may want a large number (perhaps thousands, say) of threads to wait on an event, such as an LCO being set, before they run. The obvious approach is to write the action for these threads such that the first thing the action does is wait on the LCO:
static hpx_status_t _myaction(void *args) {
  hpx_lco_wait(ready_lco);
  do_some_stuff();
  return HPX_SUCCESS;
}
static hpx_status_t _someotheraction(void* args) {
  for (int i = 0; i < LARGE_NUMBER; i++)
    hpx_call(target[i], myaction, result[i], arg[i]);
  return HPX_SUCCESS;
}

The problem with this approach is that a large number of threads all waiting to run can consume large amounts of resources, such as stack space. While each thread consumes only a small amount, the sheer number of threads trying to run at once can consume enough resources to degrade performance. The solution to this problem is to use hpx_call_when() or one of its variants.

hpx_call_when() will run an action only once a “gate” has been set. Then the action can be written assuming that the result it was waiting on has already been completed:
static hpx_status_t _myaction(hpx_addr_t ready) {
  do_some_stuff();
  return HPX_SUCCESS;
}
static hpx_status_t _someotheraction(void* args) {
  for (int i = 0; i < LARGE_NUMBER; i++)
    hpx_call_when(ready_lco, target[i], myaction, result[i], arg[i]);
  return HPX_SUCCESS;
}

With this idiom, resources are reserved much closer in time to when they can be utilized.
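The idea behind this idiom can be sketched in plain C without any HPX-5 calls. In the sketch, a gate stores only a small record (a function pointer and an argument) for each deferred task and runs the tasks when the gate is set, so nothing that needs a stack exists while waiting. All names here (gate_t, gate_call_when, gate_set, bump) are hypothetical and single-threaded; this is not how hpx_call_when() is implemented.

```c
#include <stdbool.h>
#include <stddef.h>

enum { GATE_CAPACITY = 64 };

typedef void (*task_fn)(void *arg);

/* A deferred task: just enough information to start it later. */
typedef struct {
  task_fn fn;
  void *arg;
} pending_task;

typedef struct {
  bool is_set;
  size_t count;
  pending_task pending[GATE_CAPACITY];
} gate_t;

static void gate_init(gate_t *g) {
  g->is_set = false;
  g->count = 0;
}

/* Run fn(arg) once the gate is set. If it is already set, run it now.
 * Returns false if the fixed-size queue of deferred tasks is full. */
static bool gate_call_when(gate_t *g, task_fn fn, void *arg) {
  if (g->is_set) {
    fn(arg);
    return true;
  }
  if (g->count == GATE_CAPACITY) {
    return false;
  }
  g->pending[g->count].fn = fn;
  g->pending[g->count].arg = arg;
  g->count++;
  return true;
}

/* Set the gate and start every task that was deferred on it. */
static void gate_set(gate_t *g) {
  g->is_set = true;
  for (size_t k = 0; k < g->count; k++) {
    g->pending[k].fn(g->pending[k].arg);
  }
  g->count = 0;
}

/* Example task for trying the gate out: increments an int counter. */
static void bump(void *arg) {
  ++*(int *)arg;
}
```

Registering ten bump tasks on an unset gate leaves the counter untouched; calling gate_set() then runs all ten. Until that point each deferred task costs only one pending_task record rather than a thread stack.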

Parallelize loops

HPX-5 threads are designed to have rather low overhead, so they can be used in cases where heavier-weight threads, such as pthreads or separate processes, would not be practical. For example, they can be used to parallelize a loop (much as OpenMP might be used). HPX-5 supports this idiom through the functions hpx_par_for() and hpx_par_call(). Care must be taken with this method, though, as some loops are small enough that parallelizing them does not help.
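One common way to parallelize a loop is to split its iteration range into contiguous chunks, one per worker. The C sketch below shows only that partitioning step. It does not use hpx_par_for() or hpx_par_call(), whose signatures are not shown here, and the names chunk_t and chunk_for_worker are hypothetical.

```c
#include <stddef.h>

/* A half-open range of loop iterations: [begin, end). */
typedef struct {
  size_t begin;
  size_t end;
} chunk_t;

/* Split n iterations as evenly as possible among `workers` workers and
 * return the chunk for worker w. Requires workers > 0 and w < workers.
 * The first (n % workers) workers each receive one extra iteration. */
static chunk_t chunk_for_worker(size_t n, size_t workers, size_t w) {
  size_t base = n / workers;
  size_t extra = n % workers;
  chunk_t c;
  c.begin = w * base + (w < extra ? w : extra);
  c.end = c.begin + base + (w < extra ? 1 : 0);
  return c;
}
```

For 10 iterations and 4 workers this yields [0,3), [3,6), [6,8), and [8,10). When there are fewer iterations than workers, some chunks are empty, which is one sign that a loop may be too small to benefit from being parallelized.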