https://chapel-lang.org/blog/posts/announcing-chapel-1.32/

Announcing Chapel 1.32!

Posted on September 28, 2023.

Tags: Release Announcements, Chapel 2.0

By Brad Chamberlain

Table of Contents

  * Highlights of Chapel 1.32
      + Chapel 2.0 Release Candidate
      + GPU Improvements
      + Support for Co-Locales
      + IO Serialization Framework
      + Improved ARM64 Support
      + And much more...
  * For More Information

The Chapel developer community is excited to announce the release of
Chapel version 1.32! To obtain a copy, please refer to the
Downloading Chapel page on the Chapel website.

Highlights of Chapel 1.32

Chapel 2.0 Release Candidate

The main highlight of Chapel 1.32 is that it is a release candidate
for our forthcoming Chapel 2.0 release! If you're not familiar with
the concept of Chapel 2.0, it is intended to be a release that
declares a core subset of the language and library features as
'stable'. These features are ones that we intend to support in their
current form going forward, such that code relying on them will not
break across releases. Meanwhile, other features will be considered
'unstable', implying that they are ones where we are still learning
from user experiences and refining interfaces before considering them
to be stabilized. Unstable features may continue evolving after the
2.0 release, either by being improved until they too are stable, or
by being replaced with other, more stable features.

Chapel 1.32 being a 2.0 release candidate means that this is a key
time for Chapel users to give us feedback about aspects of our design
that they would like to see change prior to the 2.0 release. Users
may also want to compile their programs with the --warn-unstable flag
in order to identify any unstable features that they are currently
relying upon. Reliance on such features could motivate you to
advocate for stabilizing those features sooner, or you could simply
view it as an opportunity to be aware that those features may
continue to evolve over time. We are generally interested in hearing
about which unstable features user code is currently relying upon, to
help with our own prioritization efforts.
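
As a rough sketch of how the flag behaves, the following program marks
one of its own routines as unstable and then calls it. The routine
name and message are hypothetical, and the @unstable attribute is
assumed to be available in your Chapel version. Compiling with
'chpl --warn-unstable' should then report the call site, while a
plain 'chpl' compile should stay silent about it.

```chapel
// Hypothetical user routine, marked unstable purely for illustration.
@unstable("experimentalSum() may change in a future release")
proc experimentalSum(A: [] int): int {
  return + reduce A;
}

// With --warn-unstable, this call is expected to trigger a warning.
writeln(experimentalSum([1, 2, 3]));   // prints '6'
```

Uses of unstable features from Chapel's own modules are expected to be
flagged at their call sites in the same way.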

Users with feedback about 2.0 readiness or the stability of current
features are encouraged to share it with us on Chapel's Discourse
user forum or as a GitHub issue.

As part of the team's push to make this a worthy Chapel 2.0 release
candidate, Chapel 1.32 contains a large number of improvements to the
language, compiler, and libraries. Some of these changes include:

  * new warnings to encourage a programming style in which generic
    types are more clearly visible in a program's source code

  * a change in the default intent for arrays and record receivers
    (i.e., this) to const for greater uniformity with other types

  * revised definitions of the compiler's interpretation of const
    intents and default return/yield intents

  * significant improvements to ranges, domains, and distributions,
    including converting distribution types to records, obviating the
    need for the dmap type

  * major improvements to the IO, Math, BigInteger, and Time modules,
    including a new IO serialization framework for specifying how to
    read and write types to files orthogonally from the file's format
    (see below for more detail)
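
To make two of these changes concrete, here is a minimal sketch. It
assumes the record-based block distribution is named blockDist and
offers a createDomain() factory method; the routine names are
hypothetical. An array formal with the default intent is treated as
const, so a routine that modifies its argument asks for 'ref'
explicitly, and the distributed domain is built without mentioning
dmap.

```chapel
use BlockDist;

// Default intent for an array formal is 'const': reading is fine.
proc total(A: [] int): int {
  return + reduce A;
}

// Modifying an array argument calls for an explicit 'ref' intent.
proc doubleAll(ref A: [] int) {
  A *= 2;
}

// A block-distributed domain, created without the 'dmap' type.
const D = blockDist.createDomain({1..8});
var A: [D] int = 1;

doubleAll(A);
writeln(total(A));   // prints '16'
```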

For more information about these changes, and many others not
summarized here, refer to the CHANGES.md file, documentation for
Chapel 1.32, or forthcoming release note slides.

GPU Improvements

Version 1.32 includes significant improvements to Chapel's support
for vendor-neutral GPU programming, both in terms of performance and
capabilities.

Key performance improvements include:

  * compiler optimizations to reduce the number of pointer
    dereferences when accessing arrays within GPU kernels

  * switching the default memory allocation scheme for arrays to
    'array_on_device' mode, in which an array's data is stored
    directly on the GPU rather than in managed memory

  * a reduction in overheads when invoking math routines within GPU
    kernels by eliminating unnecessary boilerplate wrapper code

  * using per-task GPU streams, which can enable
    communication-computation overlap to improve performance

The non-trivial impact of these optimizations can be seen in the
following graphs, which show the improvements that have occurred in a
Chapel port of the SHOC Sort benchmark on both NVIDIA and AMD GPUs.
Note that the second graph includes data transfer times while the
first does not.

[Figure: SHOC Sort benchmark performance on NVIDIA and AMD GPUs]

Chapel's support for AMD effectively reaches feature parity with
NVIDIA in this release, largely due to the addition of a number of
math routines that had not been supported for AMD in Chapel 1.31. In
addition, the Chapel compiler's --savec flag can now be used to
inspect the assembly code generated when targeting AMD GPUs.

Meanwhile, when targeting NVIDIA GPUs, Chapel 1.32 adds support for
generating multi-architecture binaries by setting CHPL_GPU_ARCH to a
comma-separated list of target architectures.

See the latest GPU Programming technical note for additional details
about these changes and Chapel's overall support for GPUs in 1.32.
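
For readers new to Chapel's GPU support, here is a minimal sketch of
the programming style these improvements apply to. It assumes a
GPU-enabled Chapel build with at least one GPU on the current node;
the array name and size are arbitrary.

```chapel
config const n = 8;

on here.gpus[0] {
  // Declared inside the 'on' block, so A belongs to the GPU sublocale.
  var A: [1..n] int;

  // An order-independent loop like this is eligible to run as a
  // GPU kernel.
  foreach i in 1..n do
    A[i] = 2 * i;

  // Printing happens on the host, which reads the array's elements.
  writeln(A);
}
```

Nothing in the loop is vendor-specific. The same source is intended
to work for both NVIDIA and AMD targets, with the choice made through
Chapel's build-time configuration.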

Support for Co-Locales

Since its inception, Chapel has preferred to represent each compute
node as a single top-level locale, using multitasking to implement
any intra-node parallelism. This approach has been beneficial in many
problem domains where running a process per core could result in
larger memory requirements or poor surface-to-volume effects due to
the amount of SPMD parallelism. (SPMD = Single Program, Multiple
Data, a static and coarse-grained style of parallelism in which
multiple copies of the same program are executed, e.g., one per
processor core.)

However, as modern compute nodes have begun to support multiple NICs
(Network Interface Chips, which permit processes to communicate with
remote nodes), this traditional approach has faced
challenges. Specifically, it is unduly complicated to have a single
locale (UNIX process) leverage multiple NICs effectively; yet using
just one NIC leaves potential performance benefits on the floor by
not exercising the network to its full capacity.

To address this, Chapel 1.32 introduces user-facing support for
co-locales, in which multiple locales can be mapped to a single
compute node. Using co-locales can lead to performance improvements
by making better use of the network and/or reducing the number of
memory references that cross between sockets. For example, the
following charts show improvements to a pair of benchmarks when run
using two locales per node on a dual-NIC HPE Cray EX system using
Slingshot 11:

[Figure: benchmark improvements when using two locales per node]

Current support is limited to running a locale per socket on a given
compute node, and is also limited to certain platforms and
configurations:

  * HPE Cray EX platforms with Slingshot 11 when using CHPL_COMM=ofi

  * InfiniBand-based systems when using CHPL_COMM=gasnet with
    CHPL_COMM_SUBSTRATE=ibv

  * Configurations using CHPL_LAUNCHER=slurm-srun or
    pbs-gasnetrun_ibv

To opt in to using co-locales, specify the number of locales for your
Chapel program using a product of nodes and locales per node. For
example, the following invocation:

$ ./myChapelProgram -nl 8x2

says to run the Chapel program on 8 nodes with 2 locales per node,
for a total of 16 locales.
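
One way to check how locales were mapped to nodes is to have each
locale report where it is running. The following is a small sketch of
that idea, not something specific to co-locales. When launched with
'-nl 8x2', one would expect each hostname to be reported by two
different locales.

```chapel
// Run one task on every locale and report the node it landed on.
coforall loc in Locales do on loc {
  writeln("locale ", here.id, " of ", numLocales,
          " is running on ", here.hostname);
}
```

The lines may appear in any order, since the tasks run concurrently.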

For more information on using co-locales with Chapel, please refer to
the online documentation.

IO Serialization Framework

The IO serialization framework that was prototyped in Chapel 1.31 is
now used by default for calls like writeln() and read(), and it is
also available for use with types written by end-users.

As an illustration, consider the following example that prints an
array in a couple of different formats:

1 use IO, JSON;
2 
3 var A = [1, 2, 3, 4];
4 
5 writeln(A);             // prints '1 2 3 4'
6 
7 var jsonWriter = stdout.withSerializer(jsonSerializer);
8 jsonWriter.writeln(A);  // prints '[1, 2, 3, 4]'  

Line 5 uses a normal writeln() to print the array of integers to the
standard console output (stdout) using Chapel's traditional
format--one element at a time, separated by spaces. Then, in line 7,
we create a variant of stdout that uses the JSON serializer for all
write()s called on it. The result is that when we write the array to
this output stream in line 8, it is printed using standard JSON
formatting. Other current serializers support binary, YAML, and
Chapel syntax as alternate formats.

The new serialization framework also includes deserializers, which
support reading values back in from the given format. And most
importantly, users can now define their own methods specifying how
their types should be written or read. This can be done in a
format-neutral manner for simplicity, or in a way that's sensitive to
the output format when needed. For more information on defining these
methods, please refer to their online documentation.
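
As a hedged sketch of what such a user-defined method can look like,
the record below supplies its own serialize() method. The record and
its fields are hypothetical, and the interface name
(writeSerializable) and method signature are assumptions based on the
framework's documentation, so check them against the version you are
using.

```chapel
use IO;

// Hypothetical record that opts in to custom writing.
record point : writeSerializable {
  var x, y: int;

  // Called when a 'point' is written. This version ignores the
  // 'serializer' argument and emits literal punctuation around
  // the two fields.
  proc serialize(writer: fileWriter(?), ref serializer) throws {
    writer.writeLiteral("(");
    writer.write(x);
    writer.writeLiteral(", ");
    writer.write(y);
    writer.writeLiteral(")");
  }
}

var p = new point(3, 4);
writeln(p);   // prints '(3, 4)'
```

A format-sensitive version would instead consult the serializer
argument to decide how to lay out its fields.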

Improved ARM64 Support

Thanks to our colleagues on the Qthreads team at Sandia National
Laboratories, support for ARM64 chips is significantly improved in
Chapel 1.32. Specifically, this release bundles version 1.19 of
Qthreads, in which task creation and switching have been
re-implemented using assembly code for ARM64 chips. This can
dramatically reduce multitasking overheads when using Chapel's
preferred CHPL_TASKS=qthreads mode.

As a simple illustration, the following table shows the impact of
this fast task switching on a 16-node run of Bale Index Gather using
various implementation strategies:

Approach                 without fast tasks  with fast tasks   improvement
ordered                  70.7 MB/s/node      84.7 MB/s/node    1.20x
ordered, oversubscribed  86.3 MB/s/node      140.4 MB/s/node   1.63x
unordered                147.5 MB/s/node     152.3 MB/s/node   1.03x
aggregated               1352.0 MB/s/node    1448.5 MB/s/node  1.07x

In addition, Qthreads 1.19 improved portability for ARM64-based
platforms. This enables the use of CHPL_TASKS=qthreads on a wider
variety of systems, such as M1/M2 Macs, where it is now the default.
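
To get a rough feel for tasking overheads on your own system, a
micro-benchmark along the following lines can be used. This is only
an illustrative sketch, not the Bale Index Gather benchmark measured
above, and the task count is an arbitrary choice.

```chapel
use Time;

config const numTasks = 10000;   // arbitrary; override with --numTasks

var count: atomic int;
var t: stopwatch;

t.start();
// Create one short-lived task per iteration to exercise task
// creation and scheduling.
coforall i in 1..numTasks do
  count.add(1);
t.stop();

writeln(count.read(), " tasks completed in ", t.elapsed(), " seconds");
```

Comparing timings across Chapel versions or CHPL_TASKS settings on
the same machine gives a coarse indication of how much task creation
costs.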

And much more...

Beyond the highlights mentioned here, Chapel 1.32 contains numerous
other improvements to Chapel's features and interfaces, such as:

  * initial support for array allocations that will throw if the
    system is out of memory

  * a more robust set of types and routines for dealing with C
    pointer types, particularly with respect to const-ness

  * initial support for interface declarations, to opt in to special
    methods like the serialization methods mentioned above

  * features for power users to better understand the vectorization
    and transformation of their Chapel programs

  * support for selecting between processor types on chips with
    heterogeneous processing units

For a more complete list of changes in Chapel 1.32, please refer to
its CHANGES.md file.

For More Information

For questions about any of the changes in this release, please reach
out to the developer community on Discourse.

As always, we're interested in feedback on how we can help make the
Chapel language, libraries, implementation, and tools more useful to
you in your work.

And as always, thanks to everyone who contributed to the Chapel 1.32
release!