Interoperability
Standard Library (STL algorithms)
The fundamental goal of the library is that its arrays and iterators can be used out-of-the-box with the C++ Standard Library, in particular the Standard Template Library (STL) and the algorithms it provides.
The most dramatic example of this is that std::sort works on arrays, as shown in a previous section.
Along with the STL itself, the library aims to interoperate with other existing high-quality C++ libraries, listed below.
Ranges (C++20)
Standard ranges extend standard algorithms, reducing the need for iterators, in favor of more composability and a less error-prone syntax.
In this example, we replace the values of the first row for which the sum of the elements is odd:
static constexpr auto accumulate = [](auto const& R) { return std::ranges::fold_left(R, 0, std::plus<>{}); };

auto arr = multi::array<int, 2>{
    {2, 0, 2, 2},
    {2, 7, 0, 2},  // this row adds to an odd number
    {2, 2, 0, 4},
};

auto const row = std::ranges::find_if(arr, [](auto const& r) { return accumulate(r) % 2 == 1; });
if(row != arr.end()) { std::ranges::fill(*row, 9); }

assert( arr[1][0] == 9 );
Together with the array constructors, the ranges library enables a more functional programming style; this allows us to work with immutable variables in many cases.
multi::array<double, 2> const A = {{...}};
multi::array<double, 1> const V = {...};
multi::array<double, 1> const R = std::views::zip_transform(std::plus<>{}, A[0], V);
// Alternative imperative mutating code:
// multi::array<double, 1> R(V.size()); // R is created here...
// for(auto i : R.extension()) {R[i] = A[0][i] + V[i];} // ...and then mutated here
The "pipe" (|) notation of standard ranges allows one-line expressions.
In this example, the expression yields the maximum value of the row sums:
std::ranges::max(arr | std::views::transform(accumulate))
As in classic STL, standard range algorithms acting on sequences operate on the first dimension by default;
for example, lexicographical sorting of rows can be performed with the std::ranges::sort algorithm.
auto A = multi::array<char, 2>{
    {'S', 'e', 'a', 'n', ' ', ' '},
    {'A', 'l', 'e', 'x', ' ', ' '},
    {'B', 'j', 'a', 'r', 'n', 'e'},
};
assert(!std::ranges::is_sorted(A));

std::ranges::sort(A);  // sorts the rows (lexicographically)

assert( std::ranges::is_sorted(A));
assert(
    A == multi::array<char, 2>{
        {'A', 'l', 'e', 'x', ' ', ' '},
        {'B', 'j', 'a', 'r', 'n', 'e'},
        {'S', 'e', 'a', 'n', ' ', ' '},
    }
);
To operate on the second dimension (sort by columns), use std::ranges::sort(~A) (or std::ranges::sort(A.transposed())).
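For example, a brief sketch continuing with A from above:

std::ranges::sort(A.transposed());  // sorts the columns lexicographically, mutating A in place
assert( std::ranges::is_sorted(A.transposed()) );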
There are too many range views to enumerate here; they include std::views::reverse (which reverses the order of the elements along the leading dimension).
It is important to understand that most views generated by the ranges library are still multidimensional arrays in the sense of this library.
Polymorphic Memory Resources
In addition to supporting classic allocators (std::allocator by default), the library is compatible with C++17’s polymorphic memory resources (PMR), which allow advanced allocation strategies, including preallocated buffers.
The following example uses a stack buffer as memory for two arrays;
after construction, the buffer will contain the arrays' data (something like "aaaabbbbbbXX").
#include <memory_resource>  // for polymorphic memory resource and monotonic buffer
#include <multi/array.hpp>

namespace multi = boost::multi;

int main() {
    char buffer[13] = "XXXXXXXXXXXX";  // a small buffer on the stack
    std::pmr::monotonic_buffer_resource pool{std::data(buffer), std::size(buffer)};

    multi::pmr::array<char, 2> A({2, 2}, 'a', &pool);
    multi::pmr::array<char, 2> B({3, 2}, 'b', &pool);

    assert( buffer != std::string{"XXXXXXXXXXXX"} );  // overwritten with elements; implementation-dependent (libstdc++ consumes the buffer from the left, libc++ from the right)
}
multi::pmr::array<T, D> is a synonym for multi::array<T, D, std::pmr::polymorphic_allocator<T>>.
In this particular example, the technique can be used to avoid dynamic memory allocations for small local arrays.
The library also supports memory resources from other libraries, including those returning special pointer types (see the "CUDA Thrust" section and the "Boost.Interprocess" section).
Substitutability with standard vector and span
The one-dimensional case multi::array<T, 1> is special and overlaps functionality with other dynamic array implementations, such as std::vector.
Indeed, the two container types are similar and usually substitutable with no or only minor modifications.
For example, both can be constructed from a list of elements (C c = {x0, x1, …};) or from a size (C c(size);), where C is either type.
Both are assignable, have the same element access patterns and iterator interface, and implement all (lexicographical) comparisons.
They differ conceptually in their resizing operations: multi::array<T, 1> doesn’t insert or push elements, and resizing works differently.
The reason is that the library doesn’t implement amortized allocations; these operations would therefore have a higher complexity cost than their std::vector counterparts.
For this reason, resize(new_size) is replaced with reextent({new_size}) in multi::array, whose primary utility is preserving existing elements when necessary.
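A minimal sketch of the replacement:

multi::array<int, 1> arr({3}, 5);  // arr is {5, 5, 5}
arr.reextent({5});                 // like resize(5): the first 3 elements are preserved
                                   // (new elements follow the library's initialization rules, see below)

std::vector<int> vec(3, 5);        // the std::vector counterpart
vec.resize(5);                     // new elements are value-initialized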
In a departure from standard containers, elements are left uninitialized if their type has a trivial constructor.
So, while multi::array<T, 1> A({N}, T{}) is equivalent to std::vector<T> V(N, T{}), multi::array<T, 1> A(N) will leave the elements uninitialized if the type allows it (e.g., built-in types), unlike std::vector<T> V(N), which value-initializes them.
RAII types (e.g., std::string) do not have a trivial default constructor and are therefore not affected by this rule.
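A sketch of the contrast, for a trivial element type:

multi::array<int, 1> A({10}, 0);  // all 10 elements initialized to 0 (like std::vector<int> V(10, 0))
multi::array<int, 1> B(10);       // 10 elements left uninitialized (int has a trivial constructor)
std::vector<int>     V(10);       // 10 elements value-initialized to 0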
With the appropriate specification of the memory allocator, multi::array<T, 1, Alloc> can refer to special memory not supported by std::vector.
Finally, an array A1D can be copied with std::vector<T> v(A1D.begin(), A1D.end()); or v.assign(A1D.begin(), A1D.end());, or vice versa.
Without copying, a reference to the underlying memory can be created with auto&& R1D = multi::array_ref<double, 1>(v.data(), v.size()); or, conversely, with std::span<T>(A1D.data_elements(), A1D.num_elements());.
std::span (C++20) does not have well-defined reference or pointer semantics; in particular, it doesn’t respect const-correctness in generic code.
This behavior is contrary to the goals of this library,
and for this reason there is no single substitute for std::span for all cases.
Depending on the usage, either multi::array_ref<T, 1> [const& | &&] or multi::array_ptr<T [const], 1> may replace the features of std::span.
The former typically works when used as a function argument.
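For example, a sketch of the two flavors as function parameters (the function names are illustrative):

#include <numeric>  // for std::accumulate

// read-only access: constness propagates to the elements, unlike with std::span
auto sum_elements(multi::array_ref<double, 1> const& x) -> double {
    return std::accumulate(x.begin(), x.end(), 0.0);
}

// mutable access
void negate_elements(multi::array_ref<double, 1>&& x) {
    for(auto& e : x) { e = -e; }
}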
Multidimensional arrays can interoperate with C++23’s non-owning mdspan.
Preliminarily, Multi’s subarrays (and arrays) can be automatically converted to (viewed as) mdspan, replicating the behavior of std::vector.
A detailed comparison with other array libraries (mdspan, Boost.MultiArray, Eigen) is given in an Appendix.
Execution policies (parallel algorithms)
Multi’s iterators can exploit parallel algorithms by specifying execution policies.
This code takes every row of a two-dimensional array and sums its elements, putting the results in a one-dimensional array of compatible size.
The selected execution policy (par) is passed as the first argument.
multi::array<double, 2> const A = ...;
multi::array<double, 1> v(size(A));
std::transform(std::execution::par, A.begin(), A.end(), v.begin(), [](auto const& row) { return std::reduce(row.begin(), row.end()); });
For an array of 10000×10000 elements, the execution time decreases to 0.0288 sec, compared with 0.0526 sec for the non-parallel version (i.e., without the par argument).
Note that parallelization is, in this context, inherently one-dimensional; for example, it applies to the transformation operation but not to the summation.
The optimal way to parallelize specific operations strongly depends on the array’s size and shape. Generally, straightforward parallelization that doesn’t exploit the n-dimensional structure of the data has a limited pay-off, and nesting parallelization policies usually doesn’t help either.
Flattening the n-dimensional structure can help for certain algorithms, although such techniques are mostly beyond the scope of this documentation; a small illustration follows.
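A minimal sketch of one such flattening uses the .elements() range of the array (assuming, as a sketch, that the iterators of this range are usable with execution policies):

multi::array<double, 2> A({1000, 1000}, 1.0);

// scale every element in parallel, ignoring the 2D structure
std::for_each(
    std::execution::par,
    A.elements().begin(), A.elements().end(),
    [](auto& e) { e *= 2.0; }
);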
Some member functions internally run algorithms that can benefit from execution policies; in turn, some of these functions accept a policy as an argument. For example, this copy construction can initialize its elements in parallel from the source:
multi::array<double, 2> const A = ...;
multi::array<double, 2> const B(std::execution::par, A);  // copies A into B in parallel; same effect as multi::array<double, 2> const B(A); or ... B = A;
Execution policies are not limited to the STL; Thrust and oneAPI also offer execution policies that can be used with their corresponding algorithms.
Execution policies and ranges can be mixed (x and y can be 1D arrays of any arithmetic element type):
template <class X1D, class Y1D>
auto dot_product(X1D const& x, Y1D const& y) {
    assert(x.size() == y.size());
    auto const& z =
        std::ranges::views::zip(x, y)
        | std::ranges::views::transform([](auto const& ab) {
            auto const [a, b] = ab;
            return a * b;
        })
    ;
    return std::reduce(std::execution::par_unseq, z.begin(), z.end());
}
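A possible use of the function defined above:

multi::array<double, 1> const x = {1.0, 2.0, 3.0};
multi::array<double, 1> const y = {4.0, 5.0, 6.0};

assert( dot_product(x, y) == 32.0 );  // 1*4 + 2*5 + 3*6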
Range-v3
The library works out of the box with Eric Niebler’s Range-v3 library, a precursor to the standard Ranges library (see above).
The library helps to remove explicit iterators (e.g. begin, end) from the code when possible.
Every Multi array object can be regarded as a range, and every subarray reference (and array value) is interpreted as a range view.
For example, a 2D array d2D is itself interpreted as a range of rows.
Each row, in turn, is interpreted as a range of elements.
In this way, d2D.transposed() is interpreted as a range of columns (of the original array), and each column a range of elements (arranged vertically in the original array).
#include <multi/array.hpp>
#include <range/v3/all.hpp>

namespace multi = boost::multi;

int main() {
    multi::array<int, 2> const d2D = {
        { 0,  1,  2,  3},
        { 5,  6,  7,  8},
        {10, 11, 12, 13},
        {15, 16, 17, 18}
    };
    assert( ranges::inner_product(d2D[0], d2D[1], 0.) == 6 + 2*7 + 3*8 );
    assert( ranges::inner_product(d2D[0], rotated(d2D)[0], 0.) == 1*5 + 2*10 + 3*15 );

    static_assert(ranges::RandomAccessIterator<multi::array<double, 1>::iterator>{});
    static_assert(ranges::RandomAccessIterator<multi::array<double, 2>::iterator>{});
}
In this other example, a 2D Multi array (or subarray) is modified so that the mean value of each column is subtracted from that column's elements.
#include <multi/array.hpp>
#include <range/v3/all.hpp>

namespace multi = boost::multi;

template<class MultiArray2D>
void subtract_mean_columnwise(MultiArray2D&& arr) {
    auto&& tarr = arr.transposed();

    auto const column_mean =
        tarr
        | ranges::views::transform([](auto const& col) { return ranges::accumulate(col, 0.0)/col.size(); })
        | ranges::to<multi::array<double, 1>>
    ;

    ranges::transform(
        arr.elements(),
        column_mean | ranges::views::cycle,
        arr.elements().begin(),
        [](auto const elem, auto const mean) { return elem - mean; }
    );
}
Serialization
The ability to serialize arrays is essential for storing data in a persistent medium (files on disk) and communicating values via streams or networks (e.g., MPI). Unfortunately, the C++ language does not provide facilities for serialization, and the standard library doesn’t either.
However, there are a few libraries that offer a certain common protocol for serialization,
such as Boost.Serialization and Cereal.
The Multi library is compatible with both (and doesn’t depend on any of them).
The user can choose one or the other, or none, if serialization is not needed.
The generic protocol is such that variables are serialized with operator<< and deserialized with operator>> applied to an archive; operator& can be used to write a single function for both directions.
Serialization can be binary (efficient) or text-based (human-readable).
Here is a small implementation of save and load functions for an array to JSON format with the Cereal library. The example can easily be adapted to other formats or libraries. (An alternative for XML with Boost.Serialization is shown in the comments on the right.)
#include <multi/array.hpp>  // this library

#include <cereal/archives/json.hpp>  // or #include <cereal/archives/xml.hpp>  // #include <boost/archive/xml_iarchive.hpp>
                                     //                                        // #include <boost/archive/xml_oarchive.hpp>
// for serialization of the array elements (in this case, strings)
#include <cereal/types/string.hpp>  // #include <boost/serialization/string.hpp>

#include <fstream>  // for saving to files in this example

using input_archive  = cereal::JSONInputArchive ;  // or ::XMLInputArchive ;  // or boost::archive::xml_iarchive;
using output_archive = cereal::JSONOutputArchive;  // or ::XMLOutputArchive;  // or boost::archive::xml_oarchive;

using cereal::make_nvp;  // or boost::serialization::make_nvp;

namespace multi = boost::multi;

template<class Element, multi::dimensionality_type D, class IStream>
auto array_load(IStream&& is) {
    multi::array<Element, D> value;
    input_archive{is} >> make_nvp("value", value);
    return value;
}

template<class Element, multi::dimensionality_type D, class OStream>
void array_save(OStream&& os, multi::array<Element, D> const& value) {
    output_archive{os} << make_nvp("value", value);
}

int main() {
    multi::array<std::string, 2> const A = {{"w", "x"}, {"y", "z"}};

    array_save(std::ofstream{"file.string2D.json"}, A);  // use std::cout to print the serialization to the screen

    auto const B = array_load<std::string, 2>(std::ifstream{"file.string2D.json"});
    assert(A == B);
}
These templated functions work for any dimension and element type (as long as the element type is itself serializable; all basic types are serializable by default). However, note that the user must ensure that data is serialized and deserialized into the same type; the underlying serialization libraries only do minimal consistency checks for efficiency reasons and don’t try to second-guess file formats or contained types.
Serialization is a relatively low-level feature for which efficiency and economy of bytes are a priority. Cryptic errors and crashes can occur if serialization libraries, file formats, or C++ types are mixed between writes and reads. Some formats are human-readable but still not particularly pretty for showing as output (see the section on Formatting on how to print to the screen).
References to subarrays (views) can also be serialized; however, size information is not saved in that case. The rationale is that references to subarrays cannot be resized in their number of elements if there is a size mismatch during deserialization. Therefore, array views should be deserialized into other array views of matching sizes.
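A sketch of this, reusing the archive aliases and make_nvp from the example above (the in-memory std::stringstream is just for illustration):

#include <sstream>

multi::array<double, 2> const A({4, 4}, 1.0);
multi::array<double, 2>       B({4, 4}, 0.0);

std::stringstream ss;
{
    output_archive oa{ss};
    oa << make_nvp("block", A({0, 2}, {0, 2}));  // serialize a 2x2 view; sizes are not saved
}
{
    auto&& Bblock = B({0, 2}, {0, 2});  // a view with matching sizes
    input_archive ia{ss};
    ia >> make_nvp("block", Bblock);
}
assert( B[0][0] == 1.0 && B[3][3] == 0.0 );  // only the 2x2 block was overwritten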
The output JSON file created by Cereal in the file example above looks like this:
{
    "value": {
        "cereal_class_version": 0,
        "extensions": {
            "cereal_class_version": 0,
            "extension": {
                "cereal_class_version": 0,
                "first": 0,
                "last": 2
            },
            "extension": {
                "first": 0,
                "last": 2
            }
        },
        "elements": {
            "cereal_class_version": 0,
            "item": "w",
            "item": "x",
            "item": "y",
            "item": "z"
        }
    }
}
(The Cereal XML and Boost XML output would have a similar structure.)
Large datasets tend to serialize slowly with archives that use heavy formatting.
Here is a comparison of speeds when (de)serializing a 134 MB four-dimensional array filled with random doubles.
| Archive format (Library) | File size | Speed (read - write) | Time (read - write) |
|---|---|---|---|
| JSON (Cereal) | 684 MB | 3.9 MB/sec - 8.4 MB/sec | 32.1 sec - 15.1 sec |
| XML (Cereal) | 612 MB | 2.0 MB/sec - 4.0 MB/sec | 56.0 sec - 28.0 sec |
| XML (Boost) | 662 MB | 11.0 MB/sec - 13.0 MB/sec | 11.0 sec - 9.0 sec |
| YAML (custom archive) | 702 MB | 10.0 MB/sec - 4.4 MB/sec | 12.0 sec - 28.0 sec |
| Portable Binary (Cereal) | 134 MB | 130 MB/sec - 121 MB/sec | 9.7 sec - 10.6 sec |
| Text (Boost) | 411 MB | 15.0 MB/sec - 16.0 MB/sec | 8.2 sec - 7.6 sec |
| Binary (Cereal) | 134 MB | 134.4 MB/sec - 126 MB/sec | 0.9 sec - 0.9 sec |
| Binary (Boost) | 134 MB | 5200 MB/sec - 1600 MB/sec | 0.02 sec - 0.1 sec |
| gzip-XML (Cereal) | 191 MB | 2.0 MB/sec - 4.0 MB/sec | 61 sec - 32 sec |
| gzip-XML (Boost) | 207 MB | 8.0 MB/sec - 8.0 MB/sec | 16.1 sec - 15.9 sec |
Boost.Interprocess
Using Interprocess allows for shared memory and for persistent mapped memory.
#include <boost/interprocess/managed_mapped_file.hpp>
#include "multi/array.hpp"
#include <cassert>

namespace bip = boost::interprocess;
using manager = bip::managed_mapped_file;
template<class T> using mallocator = bip::allocator<T, manager::segment_manager>;
auto get_allocator(manager& m) { return m.get_segment_manager(); }

namespace multi = boost::multi;
template<class T, int D> using marray = multi::array<T, D, mallocator<T>>;

int main() {
    {
        manager m{bip::create_only, "bip_mapped_file.bin", 1 << 25};
        auto&& arr2d = *m.construct<marray<double, 2>>("arr2d")(std::tuple{1000, 1000}, 0.0, get_allocator(m));
        arr2d[4][5] = 45.001;
        m.flush();
    }
    {
        manager m{bip::open_only, "bip_mapped_file.bin"};
        auto&& arr2d = *m.find<marray<double, 2>>("arr2d").first;
        assert( arr2d[4][5] == 45.001 );
        m.destroy<marray<double, 2>>("arr2d");
    }
}
CUDA (and HIP, and OMP, and TBB) via Thrust
The library works out-of-the-box in combination with the Thrust library.
#include <multi/array.hpp>            // this library
#include <thrust/copy.h>              // from the CUDA or ROCm distributions
#include <thrust/device_allocator.h>  // also from the CUDA or ROCm distributions

namespace multi = boost::multi;

int main() {
    multi::array<double, 2, thrust::device_allocator<double>> A({10, 10});
    multi::array<double, 2, thrust::device_allocator<double>> B({10, 10});
    A[5][0] = 50.0;

    thrust::copy(A.rotated()[0].begin(), A.rotated()[0].end(), B.rotated()[0].begin());  // copy column 0

    assert( B[5][0] == 50.0 );
}
This uses the default Thrust device backend (i.e., CUDA when compiling with nvcc, HIP/ROCm when compiling with a HIP/ROCm compiler, or OpenMP or TBB in other cases).
Universal memory, which is also accessible from ordinary CPU code, can be used instead with thrust::universal_allocator (from <thrust/universal_allocator.h>).
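For example, a brief sketch:

#include <thrust/universal_allocator.h>

multi::array<double, 2, thrust::universal_allocator<double>> C({10, 10});
C[3][4] = 3.4;             // universal memory is directly accessible from host code
assert( C[3][4] == 3.4 );  // no explicit device-to-host copy is needed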
More specific allocators can be used to ensure a particular CUDA backend, for example, for CUDA managed memory:
#include <thrust/system/cuda/memory.h>
...
multi::array<double, 2, thrust::cuda::universal_allocator<double>> A({10,10});
In the same way, to ensure the HIP backend, replace the cuda namespace with the hip namespace and use the header <thrust/system/hip/memory.h>.
(<thrust/system/hip/memory.h> is provided by rocThrust in the ROCm distribution, in /opt/rocm/include/thrust/system/hip/, and not by the NVIDIA distribution.)
Multi doesn’t have a dependency on Thrust (or vice versa);
they just work well together, both in terms of semantics and efficiency.
Certain "patches" (to improve Thrust behavior) can be applied to Thrust to gain extra efficiency and achieve near native speed by adding the #include<multi/adaptors/thrust.hpp>.
Multi can be used on existing memory in a non-invasive way via (non-owning) reference arrays:
// assumes raw_pointer was allocated with cudaMalloc or hipMalloc
using gpu_ptr = thrust::cuda::pointer<double>; // or thrust::hip::pointer<double>
multi::array_ref<double, 2, gpu_ptr> Aref({n, n}, gpu_ptr{raw_pointer});
Finally, the element type of a device array has to be device-friendly to work correctly;
this includes all built-in types, as well as classes whose basic operations (construction, destruction, assignment) are marked as device functions.
Notably, this excludes std::complex<T>, which can be replaced by the device-friendly thrust::complex<T>.
OpenMP via Thrust
In an analogous way, Thrust can also handle OpenMP (omp) allocations and multithreaded algorithms on arrays.
The OpenMP backend can be enabled globally with the compiler flag -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP, or per case by using the explicit omp system types:
#include <multi/array.hpp>
#include <multi/adaptors/thrust/omp.hpp>

#include <thrust/copy.h>

namespace multi = boost::multi;

int main() {
    auto A = multi::thrust::omp::array<double, 2>({10, 10}, 0.0);  // or multi::array<double, 2, thrust::omp::allocator<double>>
    auto B = multi::thrust::omp::array<double, 2>({10, 10});       // or multi::array<double, 2, thrust::omp::allocator<double>>

    A[5][0] = 50.0;

    // copy column 0
    thrust::copy(
        A.rotated()[0].begin(), A.rotated()[0].end(),
        B.rotated()[0].begin()
    );
    assert( B[5][0] == 50.0 );

    auto C = B;  // uses omp automatically for copying behind the scenes
}
Compilation might need to link to an OpenMP library, e.g., -fopenmp -lgomp.
Without Thrust, OpenMP pragmas also work with this library; however, OpenMP memory allocation would need to be managed manually.
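A minimal sketch of such manual use, parallelizing over the rows of an ordinary CPU-allocated array:

multi::array<double, 2> A({1000, 1000});

auto const n = A.size();  // number of rows
#pragma omp parallel for
for(std::ptrdiff_t i = 0; i < n; ++i) {
    for(auto&& e : A[i]) { e = 0.0; }  // each row is initialized by a separate thread
}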
Thrust memory resources
GPU memory is relatively expensive to allocate; therefore, any application that allocates and deallocates arrays frequently will suffer performance issues. This is where special memory management becomes important, for example, to avoid real allocations by caching and reusing memory blocks when possible.
Thrust implements both polymorphic and non-polymorphic memory resources via thrust::mr::allocator<T, MemoryResource>;
Multi supports both.
auto pool = thrust::mr::disjoint_unsynchronized_pool_resource(
    thrust::mr::get_global_resource<thrust::universal_memory_resource>(),
    thrust::mr::get_global_resource<thrust::mr::new_delete_resource>()
);

// memory is handled by the pool, not by the system allocator
multi::array<int, 2, thrust::mr::allocator<int, decltype(pool)>> arr({1000, 1000}, &pool);
The associated pointer type for the array data is deduced from the upstream resource; in this case, thrust::universal_ptr<int>.
As a quick way to improve performance in many cases, here is a recipe for a caching_allocator that uses a global (one per thread) memory pool and can replace the default Thrust allocator.
The requested memory resides in GPU (managed) memory (thrust::cuda::universal_memory_resource), while the cache bookkeeping is held in CPU memory (new_delete_resource).
template<class T, class Base_ = thrust::mr::allocator<T, thrust::mr::memory_resource<thrust::cuda::universal_pointer<void>>>>
struct caching_allocator : Base_ {
    caching_allocator() :
        Base_{&thrust::mr::tls_disjoint_pool(
            thrust::mr::get_global_resource<thrust::cuda::universal_memory_resource>(),
            thrust::mr::get_global_resource<thrust::mr::new_delete_resource>()
        )} {}

    caching_allocator(caching_allocator const&) : caching_allocator{} {}  // all caching allocators are equal

    template<class U> struct rebind { using other = caching_allocator<U>; };
};
...
int main() {
    ...
    using array2D = multi::array<double, 2, caching_allocator<double>>;

    for(int i = 0; i != 10; ++i) { array2D A({100, 100}); /* ... use A ... */ }
}
In this example, most of the repeated memory requests are serviced by reusing the memory pool, avoiding expensive system allocations. More specific usage patterns may require locally (non-globally) defined memory resources.
CUDA C++
CUDA is a dialect of C++ that allows writing pieces of code for GPU execution, known as "CUDA kernels".
CUDA code is generally "low level" (less abstracted), but it can be used in combination with CUDA Thrust or the CUDA runtime library, especially to implement custom algorithms.
Although code inside kernels has certain restrictions, most Multi features can be used.
(Most functions in Multi, except those involving memory allocation, are marked as device functions to allow this.)
Calling kernels involves a special syntax (<<< … >>>), and kernels cannot take arguments by reference (or by values that are not trivially copyable).
Since arrays are usually passed by reference (e.g., multi::array<double, 2>& or Array&&), a different idiom needs to be used.
(Large arrays are not passed by value to avoid copies, but even if a copy were acceptable, kernel arguments cannot allocate memory themselves.)
Iterators (e.g., .begin()/.end()) and "cursors" (e.g., .home()) are trivially copyable, can be passed by value, and act as a "proxy" to the array, allowing the usual index syntax and other transformations.
Cursors are a generalization of iterators to multiple dimensions.
They are cheap to copy (like iterators), and they allow indexing.
However, they have no associated .size() or .extensions(); this is generally fine for kernels.
(Since cursors carry only the minimal information needed for indexing, they can save stack/register space in individual kernels.)
Here is an example implementation of matrix multiplication, combining Thrust and Multi:
#include <multi/array.hpp>              // from https://gitlab.com/correaa/boost-multi
#include <thrust/system/cuda/memory.h>  // for thrust::cuda::allocator

template<class ACursor, class BCursor, class CCursor>
__global__ void Kernel(ACursor A, BCursor B, CCursor C, int N) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    typename CCursor::element_type value{0.0};
    for (int k = 0; k != N; ++k) { value += A[y][k] * B[k][x]; }
    C[y][x] = value;
}

namespace multi = boost::multi;

int main() {
    int N = 1024;

    // declare 3 square arrays
    multi::array<double, 2, thrust::cuda::allocator<double>> A({N, N}); A[0][0] = ...;
    multi::array<double, 2, thrust::cuda::allocator<double>> B({N, N}); B[0][0] = ...;
    multi::array<double, 2, thrust::cuda::allocator<double>> C({N, N});

    // kernel invocation code
    assert(N % 32 == 0);
    dim3 dimBlock(32, 32);
    dim3 dimGrid(N/32, N/32);
    Kernel<<<dimGrid, dimBlock>>>(A.home(), B.home(), C.home(), N);

    cudaDeviceSynchronize();

    // now C = A x B
}
Expressions such as A.begin() (iterators) can also be passed to kernels, but they may unnecessarily occupy more kernel "stack space" when the size information they carry (e.g., reachable via A.begin()->size()) is not needed.
SYCL
The SYCL library promises to unify CPU, GPU, and FPGA code. At the moment, the array containers can use its Unified Shared Memory (USM) allocators, but other features have not been tested.
sycl::queue q;
sycl::usm_allocator<int, sycl::usm::alloc::shared> q_alloc(q);

multi::array<int, 1, decltype(q_alloc)> data(N, 1, q_alloc);

// offload parallel computation to the device
q.parallel_for(sycl::range<1>(N), [=, ptr = data.base()](sycl::id<1> i) {
    ptr[i] *= 2;
}).wait();
Algorithms are expected to work with oneAPI execution policies as well (not tested):
auto policy = oneapi::dpl::execution::dpcpp_default;
sycl::usm_allocator<int, sycl::usm::alloc::shared> alloc(policy.queue());
multi::array<int, 1, decltype(alloc)> vec(n, alloc);
std::fill(policy, vec.begin(), vec.end(), 42);
Formatting ({fmt} pretty printing)
The library doesn’t have a "pretty" printing facility to display arrays; although not ideal, arrays can be printed and formatted by looping over elements and dimensions, as shown in other examples (using standard streams).
Fortunately, the library automatically works with the external library {fmt}, both for arrays and subarrays. The fmt library is not a dependency of the Multi library; they simply work well together through the "ranges" part of the formatting library. fmt allows a high degree of configurability.
This example prints a 2-dimensional subblock of a larger array:
#include "fmt/ranges.h"
...
multi::array<double, 2> A2D = {
{1.0, 2.0, 3.0},
/*-subblock-**/
{3.0, 4.0, /**/ 5.0},
{6.0, 7.0, /**/ 8.0},
};
fmt::print("A2D subblock = {}", A2({1, 3}, {0, 2})); // second and third row, first and second column
yielding the "flat" output A2D subblock = [[3, 4], [6, 7]].
(A similar effect can be achieved with the experimental C++23 std::print in libc++.)
For 2 or more dimensions, the output can be conveniently structured across lines using the fmt::join facility:

fmt::print("\n[{}]\n", fmt::join(A2D({1, 3}, {0, 2}), ",\n "));  // rows are printed on separate lines

with the output:

[[3, 4],
 [6, 7]]

In a similar way, the sizes of the array can be printed simply by passing them to the output, fmt::print("{}", A2D(…).sizes()), which will print (2, 2).
When saving arrays to files, please consider using serialization (see section) instead of formatting facilities.
Python (cppyy)
The library works out-of-the-box with the automatic cppyy bindings, providing a syntax similar to that of C++.
Additionally, the memory layouts are compatible with NumPy, which makes it possible to interoperate C++ and Python code without memory copies.
Here is a complete Python session:
# We import the library
import cppyy
cppyy.add_include_path('path-to/boost-multi/include')
cppyy.include("boost/multi/array.hpp") # or from single header https://correaa.gitlab.io/boost-multi/boost-multi.hpp
multi = cppyy.gbl.boost.multi
# We can create a one-dimensional array with 4 elements initialized to 0.0, or from a list of numbers.
# We can print the array and change the element values:
a1d = multi.array['double', 1](4, 0.0)
a1d = multi.array['double', 1]([1.0, 2.0, 3.0, 4.0])
print(a1d)
{ 1.0000000, 2.0000000, 3.0000000, 4.0000000 }
# elements are assignable
a1d[2] = 99.9
print(a1d)
{ 1.0000000, 2.0000000, 99.900000, 4.0000000 }
# We can also create a 2x2 array, or directly from a bidimensional list:
a2d = multi.array['double', 2](multi.extensions_t[2](2, 2), 0.0)
a2d = multi.array['double', 2]([[1.0, 2.0], [3.0, 4.0]])
# We can retrieve information from an individual element, or from a row
print(a2d.transposed()[0])
{ 1.0000000, 3.0000000 }
arow = a2d[0]
print(arow)
# rows (or slices in general) are references to the original arrays
arow[0] = 66.6
print(a2d[0])
{ 66.600000, 2.0000000 }
arow_copy = + a2d[0]  # unary + makes an independent value copy
arow_copy = 11111.1   # rebinds the name; the original array is unaffected
print(a2d[0])
{ 66.600000, 2.0000000 }
Take into account that assignment in Python differs semantically from that of C++. This can be particularly confusing when assigning elements from a list.
a2d = multi.array['double', 2](); # empty 2D array
print(type(a2d))
<class cppyy.gbl.boost.multi.array<double,2,std::allocator<double>> at 0xa133120>
a2d = [[1.0, 2.0],[3.0, 4.0]]; # this is now a Python list
print(type(a2d))
<class 'list'>
The last line reassigns the variable to a Python list, changing its type.
To reassign the elements without changing the type, use .assign instead:
a2d.assign([[1.0, 2.0],[3.0, 4.0]])
Pointers to the array data can be extracted normally and reused to access the elements through other Python libraries; for example, an array that uses contiguous memory can be converted into a NumPy array without making copies:

import numpy as np
npa2d = np.frombuffer(a2d.data_elements(), dtype=np.float64, count=a2d.num_elements()).reshape(a2d.sizes().get[0](), a2d.sizes().get[1]())
print(npa2d)
[[1. 2.]
 [4. 5.]]
The other way around,
a2d_ref = multi.array_ref['double', 2](multi.extensions_t[2](*npa2d.shape), npa2d)
print(a2d_ref)
{ {1, 2}, {4, 999.9}, }
Note that the elements of these Python objects (a2d, npa2d, and a2d_ref) all refer to the same memory; modifying one also modifies the rest.
Special care must be taken if the original array is resized or assigned to a different Python object, in which case the memory will no longer be valid.
For this reason alone, it would be an error to give the last variable the original name (a2d) instead of a new name (a2d_ref).
Legacy libraries (C-APIs)
Multidimensional array data structures exist in all languages, whether defined implicitly by their stride structure or explicitly at the language level. Functions written in C tend to receive arrays by pointer arguments (e.g., a pointer to the "first" element) plus the memory layout (sizes and strides).
A C-function taking a 2D array with a concrete type might look like this in the general case:
void fun(double* data, int size1, int size2, int stride1, int stride2);
Such a function can be called from C++ on a Multi array (arr) by extracting its size and layout information:

fun(arr.base(), std::get<0>(arr.sizes()), std::get<1>(arr.sizes()), std::get<0>(arr.strides()), std::get<1>(arr.strides()));
or
auto const [size1, size2] = arr.sizes();
auto const [stride1, stride2] = arr.strides();
fun(arr.base(), size1, size2, stride1, stride2);
Although the recipe can be applied straightforwardly, different libraries make various assumptions about memory layouts (e.g., 2D arrays assume that the second stride is 1), and some might take stride information in a different way (e.g., FFTW doesn’t use strides but stride products). Furthermore, some arguments may need to be permuted if the function expects arrays in column-major (Fortran) ordering.
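For example, if fun (from above) expected its matrix in column-major (Fortran) order, the same data can be presented transposed by permuting the layout arguments or, equivalently, by extracting them from the transposed view; a sketch:

auto const& arrT = arr.transposed();  // no data is copied; only the layout is swapped

auto const [sizeT1, sizeT2] = arrT.sizes();
auto const [strideT1, strideT2] = arrT.strides();
fun(arrT.base(), sizeT1, sizeT2, strideT1, strideT2);  // fun now traverses the array in column-major order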
For these reasons, the library is accompanied by a series of adaptor libraries to popular C-based libraries, which can be found in the include/multi/adaptors/ subdirectory:

- BLAS/cuBLAS
Interface for BLAS-like linear algebra libraries, such as openblas, Apple’s Accelerate, MKL, and hipBLAS/cuBLAS (GPUs).
Simply #include "multi/adaptors/blas.hpp" (and link your program with -lblas, for example).

- Lapack
Interface for Lapack linear solver libraries.
Simply #include "multi/adaptors/lapack.hpp" (and link your program with -llapack, for example).

- FFTW/cuFFT
Interface for FFTW libraries, including FFTW 3, MKL, and cuFFT/hipFFT (for GPUs).
Simply #include "multi/adaptors/fftw.hpp" (and link your program with -lfftw3, for example).

- MPI
Use arrays (and subarrays) as messages for distributed interprocess communication (GPU and CPU) that can be passed to MPI functions through datatypes.
Simply #include "multi/adaptors/mpi.hpp".

- TotalView: visual debugger (commercial)
Popular in HPC environments; it can display arrays in human-readable form (for simple types, like double or std::complex).
Simply #include "multi/adaptors/totalview.hpp", link to the TotalView libraries, and compile and run the code with the TotalView debugger.