codemadness.org

       add catb.org page - webdump_tests - Testfiles for webdump
 (HTM) git clone git://git.codemadness.org/webdump_tests
 (DIR) Log
 (DIR) Files
 (DIR) Refs
 (DIR) README
       ---
 (DIR) commit 6d7678a78907c999dc004785d718347ccc01d0b3
 (DIR) parent 667920b3abc123846ebb61af422fdeba4282a43c
 (HTM) Author: Hiltjo Posthuma <hiltjo@codemadness.org>
       Date:   Tue, 19 Sep 2023 20:03:22 +0200
       
       add catb.org page
       
       Diffstat:
         A realworld/catb.org.html             |     639 +++++++++++++++++++++++++++++++
       
       1 file changed, 639 insertions(+), 0 deletions(-)
       ---
 (DIR) diff --git a/realworld/catb.org.html b/realworld/catb.org.html
       @@ -0,0 +1,638 @@
       +<?xml version="1.0" encoding="UTF-8"?>
       +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>The Lost Art of Structure Packing</title><link rel="stylesheet" type="text/css" href="docbook-xsl.css" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /></head><body><div xml:lang="en" class="article" lang="en"><div class="titlepage"><div><div><h2 class="title"><a id="idm1"></a>The Lost Art of Structure Packing</h2></div><div><div class="author"><h3 class="author"><span class="firstname">Eric</span> <span class="othername">S.</span> <span class="surname">Raymond</span></h3><code class="email">&lt;<a class="email" href="mailto:esr@thyrsus.com">esr@thyrsus.com</a>&gt;</code></div></div></div><hr /></div><div class="toc"><p><strong>Table of Contents</strong></p><dl class="toc"><dt><span class="section"><a href="#_who_should_read_this">1. Who should read this</a></span></dt><dt><span class="section"><a href="#_why_i_wrote_it">2. Why I wrote it</a></span></dt><dt><span class="section"><a href="#_alignment_requirements">3. Alignment requirements</a></span></dt><dt><span class="section"><a href="#_padding">4. Padding</a></span></dt><dt><span class="section"><a href="#_structure_alignment_and_padding">5. Structure alignment and padding</a></span></dt><dt><span class="section"><a href="#_bitfields">6. Bitfields</a></span></dt><dt><span class="section"><a href="#_structure_reordering">7. Structure reordering</a></span></dt><dt><span class="section"><a href="#_awkward_scalar_cases">8. Awkward scalar cases</a></span></dt><dt><span class="section"><a href="#_readability_and_cache_locality">9. Readability and cache locality</a></span></dt><dt><span class="section"><a href="#_other_packing_techniques">10. Other packing techniques</a></span></dt><dt><span class="section"><a href="#_overriding_alignment_rules">11. Overriding alignment rules</a></span></dt><dt><span class="section"><a href="#_tools">12. Tools</a></span></dt><dt><span class="section"><a href="#_proof_and_exceptional_cases">13. Proof and exceptional cases</a></span></dt><dt><span class="section"><a href="#C-like">14. Other languages</a></span></dt><dd><dl><dt><span class="section"><a href="#_c">14.1. C++</a></span></dt><dt><span class="section"><a href="#_go">14.2. Go</a></span></dt><dt><span class="section"><a href="#_rust">14.3. Rust</a></span></dt><dt><span class="section"><a href="#_java">14.4. Java</a></span></dt><dt><span class="section"><a href="#_swift">14.5. Swift</a></span></dt><dt><span class="section"><a href="#_c_2">14.6. C#</a></span></dt></dl></dd><dt><span class="section"><a href="#_supporting_this_work">15. Supporting this work</a></span></dt><dt><span class="section"><a href="#_related_reading">16. Related Reading</a></span></dt><dt><span class="section"><a href="#_version_history">17. Version history</a></span></dt></dl></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_who_should_read_this"></a>1. Who should read this</h2></div></div></div><p>This page is about a technique for reducing the memory footprint of
       +programs in compiled languages with C-like structures - manually
       +repacking these declarations for reduced size. To read it, you will
       +require basic knowledge of the C programming language.</p><p>You need to know this technique if you intend to write code for
       +memory-constrained embedded systems, or operating-system kernels.  It
       +is useful if you are working with application data sets so large that
       +your programs routinely hit memory limits.  It is good to know in any
       +application where you really, <span class="emphasis"><em>really</em></span> care about optimizing your use
       +of memory bandwidth and minimizing cache-line misses.</p><p>Finally, knowing this technique is a gateway to other esoteric C
       +topics.  You are not an advanced C programmer until you have grasped
       +these rules. You are not a master of C until you could have written
       +this document yourself and can criticize it intelligently.</p><p>This document originated with "C" in the title, but almost everything
       +in it applies to C++ as well. Many of the techniques discussed here
       +also apply to the Go language, and should generalize to any compiled
       +language with C-like structures.  There is a note discussing C++, Go,
       +Rust, Java, Swift, and C# towards the end.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_why_i_wrote_it"></a>2. Why I wrote it</h2></div></div></div><p>This webpage exists because in late 2013 I found myself heavily
       +applying an optimization technique that I had learned more than two
       +decades previously and not used much since.</p><p>I needed to reduce the memory footprint of a program that used thousands -
       +sometimes hundreds of thousands - of C struct instances. The program
       +was <a class="ulink" href="http://www.catb.org/~esr/cvs-fast-export" target="_top">cvs-fast-export</a> and
       +the problem was that it was dying with out-of-memory errors on large
       +repositories.</p><p>There are ways to reduce memory usage significantly in situations like
       +this, by rearranging the order of structure members in careful ways.
       +This can lead to dramatic gains - in my case I was able to cut the
       +working-set size by around 40%, enabling the program to handle
       +much larger repositories without dying.</p><p>But as I worked, and thought about what I was doing, it began to dawn
       +on me that the technique I was using has been more than half forgotten
       +in these latter days.  A little web research confirmed that
       +programmers don’t seem to talk about it much any more, at least not
       +where a search engine can see them.  A couple of Wikipedia entries
       +touch the topic, but I found nobody who covered it comprehensively.</p><p>There are actually reasons for this that aren’t stupid.  CS courses
       +(rightly) steer people away from micro-optimization towards finding better
       +algorithms. The plunging price of machine resources has made squeezing
       +memory usage less necessary.  And the way hackers used to learn how to
       +do it back in the day was by bumping their noses on strange hardware
       +architectures - a less common experience now.</p><p>But the technique still has value in important situations, and will
       +as long as memory is finite.  This document is intended to save
       +programmers from having to rediscover the technique, so they can
       +concentrate effort on more important things.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_alignment_requirements"></a>3. Alignment requirements</h2></div></div></div><p>The first thing to understand is that, on modern processors, the way
       +your compiler lays out basic datatypes in memory is constrained
       +in order to make memory accesses faster.  Our examples are in C, but
       +any compiled language generates code under the same constraints.</p><p>There is a large class of modern ISAs (Instruction Set Architectures)
       +for which these constraints lead to identical layouts. These ISAs
       +include Intel, ARM, and RISC-V; I will refer to these as "vanilla"
       +ISAs.</p><p>Storage for the basic C datatypes on a vanilla ISA doesn’t
       +normally start at arbitrary byte addresses in memory. Rather, each
       +type except char has an <span class="emphasis"><em>alignment requirement</em></span>; chars can start on
       +any byte address, but 2-byte shorts must start on an even address,
       +4-byte ints or floats must start on an address divisible by 4, and
       +8-byte longs or doubles must start on an address divisible by 8.
       +Signed or unsigned makes no difference.</p><p>The jargon for this is that basic C types on a vanilla ISA are
       +<span class="emphasis"><em>self-aligned</em></span>.  Pointers, whether 32-bit (4-byte) or 64-bit (8-byte)
       +are self-aligned too.</p><p>Self-alignment makes access faster because it facilitates generating
       +single-instruction fetches and puts of the typed data. Without
       +alignment constraints, on the other hand, the code might end up having
       +to do two or more accesses spanning machine-word boundaries.
       +Characters are a special case; they’re equally expensive from anywhere
       +they live inside a single machine word. That’s why they don’t have a
       +preferred alignment.</p><p>I said "on modern processors" because on some older ones forcing your
       +C program to violate alignment rules (say, by casting an odd address
       +into an int pointer and trying to use it) didn’t just slow your code
       +down, it caused an illegal instruction fault. This was the behavior, for
       +example, on Sun SPARC chips. In fact, with sufficient determination
       +and the right (e18) hardware flag set on the processor, you can still
       +trigger this on x86.</p><p>Also, self-alignment is not the only possible rule. Historically, some
       +processors (especially those lacking
       +<a class="ulink" href="https://en.wikipedia.org/wiki/Barrel_shifter" target="_top">barrel shifters</a>) have had
       +more restrictive ones. If you do embedded systems, you might trip over one
       +of these lurking in the underbrush.  Be aware this is possible.</p><p>One curious and illustrative exception is the Motorola 68020 and its
       +successors. These are word-oriented 32-bit machines - that is, the
       +underlying granularity of fast access is 16 bits. Compilers can start
       +structs on 16-bit boundaries without a speed penalty, even if the
       +first member was a 32-bit scalar.  Therefore, only character fields
       +with odd byte lengths can ever cause padding.</p><p>From when it was first written at the beginning of 2014 until late
       +2016, this section ended with other caveats about odd
       +architectures. During that period I’ve learned something rather
       +reassuring from working with the source code for the reference
       +implementation of NTP.  It does packet analysis by reading packets off
       +the wire directly into memory that the rest of the code sees as a
       +struct, relying on the assumption of minimal self-aligned padding - or
       +zero padding in odd cases like 690x0.</p><p>The interesting news is that NTP has apparently being getting away
       +with this for <span class="strong"><strong>decades</strong></span> across a very wide span of hardware, operating
       +systems, and compilers, including not just Unixes but under Windows
       +variants as well. This suggests that platforms with padding rules
       +other than self-alignment are either nonexistent or confined to
       +such specialized niches that they’re never either NTP servers or
       +clients.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_padding"></a>4. Padding</h2></div></div></div><p>Now we’ll look at a simple example of variable layout in
       +memory. Consider the following series of variable declarations
       +in the top level of a C module:</p><pre class="screen">char *p;
       +char c;
       +int x;</pre><p>If you didn’t know anything about data alignment, you might assume
       +that these three variables would occupy a continuous span of bytes in
       +memory.  That is, on a 32-bit machine 4 bytes of pointer would be
       +immediately followed by 1 byte of char and that immediately followed
       +by 4 bytes of int.  And a 64-bit machine would be different only
       +in that the pointer would be 8 bytes.</p><p>In fact, the hidden assumption that the allocated order of static
       +variables is their source order is not necessarily valid; the C
       +standards don’t mandate it.  I’m going to ignore this detail because
       +(a) that hidden assumption is usually correct anyway, and (b) the
       +actual purpose of talking about padding and packing outside structures
       +is to prepare you for what happens inside them.</p><p>Here’s what actually happens (on an x86 or ARM or anything else with
       +self-aligned types). The storage for <code class="literal">p</code> starts on a self-aligned 4-
       +or 8-byte boundary depending on the machine word size.  This is
       +<span class="emphasis"><em>pointer alignment</em></span> - the strictest possible.</p><p>The storage for <code class="literal">c</code> follows immediately.  But the 4-byte alignment
       +requirement of <code class="literal">x</code> forces a gap in the layout; it comes out as though
       +there were a fourth intervening variable, like this:</p><pre class="screen">char *p;      /* 4 or 8 bytes */
       +char c;       /* 1 byte */
       +char pad[3];  /* 3 bytes */
       +int x;        /* 4 bytes */</pre><p>The <code class="literal">pad[3]</code> character array represents the fact that there are three
       +bytes of waste space in the structure.  The old-school term for this
       +was "slop". The value of the padding bits is undefined; in particular
       +it is not guaranteed that they will be zeroed.</p><p>Compare what happens if <code class="literal">x</code> is a 2-byte short:</p><pre class="screen">char *p;
       +char c;
       +short x;</pre><p>In that case, the actual layout will be this:</p><pre class="screen">char *p;      /* 4 or 8 bytes */
       +char c;       /* 1 byte */
       +char pad[1];  /* 1 byte */
       +short x;      /* 2 bytes */</pre><p>On the other hand, if <code class="literal">x</code> is a long on a 64-bit machine</p><pre class="screen">char *p;
       +char c;
       +long x;</pre><p>we end up with this:</p><pre class="screen">char *p;     /* 8 bytes */
       +char c;      /* 1 byte
       +char pad[7]; /* 7 bytes */
       +long x;      /* 8 bytes */</pre><p>If you have been following carefully, you are probably now wondering about
       +the case where the shorter variable declaration comes first:</p><pre class="screen">char c;
       +char *p;
       +int x;</pre><p>If the actual memory layout were written like this</p><pre class="screen">char c;
       +char pad1[M];
       +char *p;
       +char pad2[N];
       +int x;</pre><p>what can we say about <code class="literal">M</code> and <code class="literal">N</code>?</p><p>First, in this case <code class="literal">N</code> will be zero.  The address of <code class="literal">x</code>, coming
       +right after <code class="literal">p</code>, is guaranteed to be pointer-aligned, which is
       +never less strict than int-aligned.</p><p>The value of <code class="literal">M</code> is less predictable.  If the compiler
       +happened to map <code class="literal">c</code> to the last byte of a machine word, the next
       +byte (the first of <code class="literal">p</code>) would be the first byte of the next one
       +and properly pointer-aligned.  M would be zero.</p><p>It is more likely that <code class="literal">c</code> will be mapped to the <span class="emphasis"><em>first</em></span> byte of a
       +machine word.  In that case M will be whatever padding is needed to
       +ensure that <code class="literal">p</code> has pointer alignment - 3 on a 32-bit machine,
       +7 on a 64-bit machine.</p><p>Intermediate cases are possible.  M can be anything  from 0 to 7
       +(0 to 3 on 32-bit) because a char can start on any byte boundary
       +in a machine word.</p><p>If you wanted to make those variables take up less space,
       +you could get that effect by swapping <code class="literal">x</code> with <code class="literal">c</code> in the
       +original sequence.</p><pre class="screen">char *p;     /* 8 bytes */
       +long x;      /* 8 bytes */
       +char c;      /* 1 byte</pre><p>Usually, for the small number of scalar variables in your C programs,
       +bumming out the few bytes you can get by changing the order of
       +declaration won’t save you enough to be significant.  The technique
       +becomes more interesting when applied to nonscalar variables -
       +especially structs.</p><p>Before we get to those, let’s dispose of arrays of scalars.  On a
       +platform with self-aligned types, arrays of
       +char/short/int/long/pointer have no internal padding; each
       +member is automatically self-aligned at the end of the next one.</p><p>All these rules and examples map over to Go structs, and to Rust
       +structs with the "repr(C)" attribute, with only syntactic changes.</p><p>In the next section we will see that the same is <span class="strong"><strong>not</strong></span> necessarily true of
       +structure arrays.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_structure_alignment_and_padding"></a>5. Structure alignment and padding</h2></div></div></div><p>In general, a struct instance will have the alignment of its widest scalar
       +member. Compilers do this as the easiest way to ensure that all the
       +members are self-aligned for fast access.</p><p>Also, in C (and Go, and Rust) the address of a struct is the same as
       +the address of its first member - there is no leading padding. In
       +C++ this may not be true; see <a class="xref" href="#C-like" title="14. Other languages">Section 14, “Other languages”</a>.</p><p>(When you’re in doubt about this sort of thing, ANSI C provides an
       +offsetof() macro which can be used to read out structure member
       +offsets.)</p><p>Consider this struct:</p><pre class="screen">struct foo1 {
       +    char *p;
       +    char c;
       +    long x;
       +};</pre><p>Assuming a 64-bit machine, any instance of <code class="literal">struct foo1</code> will have
       +8-byte alignment. The memory layout of one of these looks
       +unsurprising, like this:</p><pre class="screen">struct foo1 {
       +    char *p;     /* 8 bytes */
       +    char c;      /* 1 byte
       +    char pad[7]; /* 7 bytes */
       +    long x;      /* 8 bytes */
       +};</pre><p>It’s laid out exactly as though variables of these types has
       +been separately declared.  But if we put <code class="literal">c</code> first, that’s
       +no longer true.</p><pre class="screen">struct foo2 {
       +    char c;      /* 1 byte */
       +    char pad[7]; /* 7 bytes */
       +    char *p;     /* 8 bytes */
       +    long x;      /* 8 bytes */
       +};</pre><p>If the members were separate variables, <code class="literal">c</code> could start at any byte
       +boundary and the size of <code class="literal">pad</code> might vary.  Because <code class="literal">struct foo2</code> has
       +the pointer alignment of its widest member, that’s no longer possible.
       +Now <code class="literal">c</code> has to be pointer-aligned, and following padding of 7 bytes is
       +locked in.</p><p>Now let’s talk about trailing padding on structures. To explain this,
       +I need to introduce a basic concept which I’ll call the <span class="emphasis"><em>stride
       +address</em></span> of a structure. It is the first address following the
       +structure data that has the <span class="strong"><strong>same alignment as the structure</strong></span>.</p><p>The general rule of trailing structure padding is this: the compiler
       +will behave as though the structure has trailing padding out to
       +its stride address. This rule controls what <code class="literal">sizeof()</code> will return.</p><p>Consider this example on a 64-bit x86 or ARM machine:</p><pre class="screen">struct foo3 {
       +    char *p;     /* 8 bytes */
       +    char c;      /* 1 byte */
       +};
       +
       +struct foo3 singleton;
       +struct foo3 quad[4];</pre><p>You might think that <code class="literal">sizeof(struct foo3)</code> should be 9, but it’s
       +actually 16.  The stride address is that of <code class="literal">(&amp;p)[2]</code>.  Thus, in the <code class="literal">quad</code>
       +array, each member has 7 bytes of trailing padding, because the first
       +member of each following struct wants to be self-aligned on an 8-byte
       +boundary.  The memory layout is as though the structure had been
       +declared like this:</p><pre class="screen">struct foo3 {
       +    char *p;     /* 8 bytes */
       +    char c;      /* 1 byte */
       +    char pad[7];
       +};</pre><p>For contrast, consider the following example:</p><pre class="screen">struct foo4 {
       +    short s;     /* 2 bytes */
       +    char c;      /* 1 byte */
       +};</pre><p>Because <code class="literal">s</code> only needs to be 2-byte aligned, the stride address is
       +just one byte after <code class="literal">c</code>, and <code class="literal">struct foo4</code> as a whole only needs one
       +byte of trailing padding.  It will be laid out like this:</p><pre class="screen">struct foo4 {
       +    short s;     /* 2 bytes */
       +    char c;      /* 1 byte */
       +    char pad[1];
       +};</pre><p>and <code class="literal">sizeof(struct foo4)</code> will return 4.</p><p>Here’s a last important detail:  If your structure has structure
       +members, the inner structs want to have the alignment of
       +longest scalar too.  Suppose you write this:</p><pre class="screen">struct foo5 {
       +    char c;
       +    struct foo5_inner {
       +        char *p;
       +        short x;
       +    } inner;
       +};</pre><p>The <code class="literal">char *p</code> member in the inner struct forces the outer struct to be
       +pointer-aligned as well as the inner.  Actual layout will be like
       +this on a 64-bit machine:</p><pre class="screen">struct foo5 {
       +    char c;           /* 1 byte*/
       +    char pad1[7];     /* 7 bytes */
       +    struct foo5_inner {
       +        char *p;      /* 8 bytes */
       +        short x;      /* 2 bytes */
       +        char pad2[6]; /* 6 bytes */
       +    } inner;
       +};</pre><p>This structure gives us a hint of the savings that might be possible
       +from repacking structures.  Of 24 bytes, 13 of them are padding.
       +That’s more than 50% waste space!</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_bitfields"></a>6. Bitfields</h2></div></div></div><p>Now let’s consider C bitfields.  What they give you the ability to do is
       +declare structure fields of smaller than character width, down to 1
       +bit, like this:</p><pre class="screen">struct foo6 {
       +    short s;
       +    char c;
       +    int flip:1;
       +    int nybble:4;
       +    int septet:7;
       +};</pre><p>The thing to know about bitfields is that they are implemented with
       +word- and byte-level mask and rotate instructions operating on machine
       +words, and cannot cross word boundaries.  C99 guarentees that
       +bit-fields will be packed as tightly as possible, provided they don’t
       +cross storage unit boundaries (6.7.2.1 #10).</p><p>This restriction is relaxed in C11 (6.7.2.1p11) and C++14
       +([class.bit]p1); these revisions do not actually require <code class="literal">struct foo9</code>
       +to be 64 bits instead of 32; a bit-field can span multiple allocation
       +units instead of starting a new one. It’s up to the implementation to
       +decide; GCC leaves it up to the ABI, which for x64 does prevent them
       +from sharing an allocation unit.</p><p>Assuming we’re on a 32-bit machine, the C99 rules imply that the layout may
       +look like this:</p><pre class="screen">struct foo6 {
       +    short s;       /* 2 bytes */
       +    char c;        /* 1 byte */
       +    int flip:1;    /* total 1 bit */
       +    int nybble:4;  /* total 5 bits */
       +    int pad1:3;    /* pad to an 8-bit boundary */
       +    int septet:7;  /* 7 bits */
       +    int pad2:25;   /* pad to 32 bits */
       +};</pre><p>But this isn’t the only possibility, because the C standard does not
       +specify that bits are allocated low-to-high.  So the layout could look
       +like this:</p><pre class="screen">struct foo6 {
       +    short s;       /* 2 bytes */
       +    char c;        /* 1 byte */
       +    int pad1:3;    /* pad to an 8-bit boundary */
       +    int flip:1;    /* total 1 bit */
       +    int nybble:4;  /* total 5 bits */
       +    int pad2:25;   /* pad to 32 bits */
       +    int septet:7;  /* 7 bits */
       +};</pre><p>That is, the padding could precede rather than following the
       +payload bits.</p><p>Note also that, as with normal structure padding, the padding bits are
       +not guaranteed to be zero; C99 mentions this.</p><p>Note that the base type of a bit field is interpreted for signedness
       +but not necessarily for size.  It is up to implementors whether
       +"short flip:1" or "long flip:1" are supported, and whether those
       +base types change the size of the storage unit the field is packed into.</p><p>Proceed with caution and check with -Wpadded if you have it available
       +(e.g. under clang).  Compilers on exotic hardware might interpret the
       +C99 rules in surprising ways; older compilers might not quite follow
       +them.</p><p>The restriction that bitfields cannot cross machine word boundaries
       +means that, while the first two of the following structures pack into
       +one and two 32-bit words as you’d expect, the third (<code class="literal">struct foo9</code>) takes
       +up <span class="strong"><strong>three</strong></span> 32-bit words in C99, in the last of which only one bit is used.</p><pre class="screen">struct foo7 {
       +    int bigfield:31;      /* 32-bit word 1 begins */
       +    int littlefield:1;
       +};
       +
       +struct foo8 {
       +    int bigfield1:31;     /* 32-bit word 1 begins /*
       +    int littlefield1:1;
       +    int bigfield2:31;     /* 32-bit word 2 begins */
       +    int littlefield2:1;
       +};
       +
       +struct foo9 {
       +    int bigfield1:31;     /* 32-bit word 1 begins */
       +    int bigfield2:31;     /* 32-bit word 2 begins */
       +    int littlefield1:1;
       +    int littlefield2:1;   /* 32-bit word 3 begins */
       +};</pre><p>Again, C11 and C++14 may pack foo9 tighter, but it would perhaps be
       +unwise to count on this.</p><p>On the other hand, <code class="literal">struct foo8</code> would fit into a single 64-bit word
       +if the machine has those.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_structure_reordering"></a>7. Structure reordering</h2></div></div></div><p>Now that you know how and why compilers insert padding in and after your
       +structures we’ll examine what you can do to squeeze out the slop.
       +This is the art of structure packing.</p><p>The first thing to notice is that slop only happens in two places.
       +One is where storage bound to a larger data type (with stricter
       +alignment requirements) follows storage bound to a smaller one. The
       +other is where a struct naturally ends before its stride address,
       +requiring padding so the next one will be properly aligned.</p><p>The simplest way to eliminate slop is to reorder the structure members
       +by decreasing alignment.  That is: make all the pointer-aligned
       +subfields come first, because on a 64-bit machine they will be 8
       +bytes.  Then the 4-byte ints; then the 2-byte shorts; then the
       +character fields.</p><p>So, for example, consider this simple linked-list structure:</p><pre class="screen">struct foo10 {
       +    char c;
       +    struct foo10 *p;
       +    short x;
       +};</pre><p>With the implied slop made explicit, here it is:</p><pre class="screen">struct foo10 {
       +    char c;          /* 1 byte */
       +    char pad1[7];    /* 7 bytes */
       +    struct foo10 *p; /* 8 bytes */
       +    short x;         /* 2 bytes */
       +    char pad2[6];    /* 6 bytes */
       +};</pre><p>That’s 24 bytes.  If we reorder by size, we get this:</p><pre class="screen">struct foo11 {
       +    struct foo11 *p;
       +    short x;
       +    char c;
       +};</pre><p>Considering self-alignment, we see that none of the data fields need
       +padding.  This is because the stride address for a (longer) field with
       +stricter alignment is always a validly-aligned start address for a
       +(shorter) field with less strict requirements.  All the repacked struct
       +actually requires is trailing padding:</p><pre class="screen">struct foo11 {
       +    struct foo11 *p; /* 8 bytes */
       +    short x;         /* 2 bytes */
       +    char c;          /* 1 byte */
       +    char pad[5];     /* 5 bytes */
       +};</pre><p>Our repack transformation drops the size from 24 to 16 bytes.  This
       +might not seem like a lot, but suppose you have a linked list of 200K
       +of these?  The savings add up fast - especially on memory-constrained
       +embedded systems or in the core part of an OS kernel that has to stay
       +resident.</p><p>Note that reordering is not guaranteed to produce savings.  Applying this
       +technique to an earlier example, <code class="literal">struct foo5</code>, we get this:</p><pre class="screen">struct foo12 {
       +    struct foo5 {
       +        char *p;      /* 8 bytes */
       +        short x;      /* 2 bytes */
       +    } inner;
       +    char c;           /* 1 byte*/
       +};</pre><p>With padding written out, this is</p><pre class="screen">struct foo12 {
       +    struct foo5 {
       +        char *p;      /* 8 bytes */
       +        short x;      /* 2 bytes */
       +        char pad[6];  /* 6 bytes */
       +    } inner;
       +    char c;           /* 1 byte*/
       +    char pad[7];      /* 7 bytes */
       +};</pre><p>It’s still 24 bytes because <code class="literal">c</code> cannot back into the inner struct’s
       +trailing padding. To collect that gain you would need to redesign
       +your data structures.</p><p>Curiously, strictly ordering your structure fields by <span class="emphasis"><em>increasing</em></span>
       +size also works to mimimize padding.  You can minimize padding with
       +any order in which (a) all fields of any one size are in a continuous
       +span (completely eliminating padding between them), and (b) the gaps
       +between those spans are such that the sizes on either side have as few
       +doubling steps of difference from each other as possible. Usually
       +this means no padding at all on one side.</p><p>Even more general minimal-padding orders are possible.  Example:</p><pre class="screen">struct foo13 {
       +    int32_t i;
       +    int32_t i2;
       +    char octet[8];
       +    int32_t i3;
       +    int32_t i4;
       +    int64_t l;
       +    int32_t i5;
       +    int32_t i6;
       +};</pre><p>This struct has zero padding under self-alignment rules. Working out
       +why is a useful exercise to develop your understanding.</p><p>Since shipping the first version of this guide I have been asked why,
       +if reordering for minimal slop is so simple, C compilers don’t do it
       +automatically.  The answer: C is a language originally designed for
       +writing operating systems and other code close to the
       +hardware. Automatic reordering would interfere with a systems
       +programmer’s ability to lay out structures that exactly match the
       +byte and bit-level layout of memory-mapped device control blocks.</p><p>Go hews to the C philosophy and does not reorder fields.  Rust makes
       +the opposite choice; by default, its compiler <span class="emphasis"><em>may</em></span> reorder structure
       +fields.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_awkward_scalar_cases"></a>8. Awkward scalar cases</h2></div></div></div><p>Using enumerated types instead of #defines is a good idea, if only
       +because symbolic debuggers have those symbols available and can
       +show them rather than raw integers.  But, while enums are guaranteed
       +to be compatible with an integral type, the C standard does
       +not specify <span class="emphasis"><em>which</em></span> underlying integral type is to be used for them.</p><p>Be aware when repacking your structs that while enumerated-type
       +variables are usually ints, this is compiler-dependent; they could be
       +shorts, longs, or even chars by default. Your compiler may have a
       +pragma or command-line option to force the size.</p><p>The <code class="literal">long double</code> type is a similar trouble spot.  Some C platforms
       +implement this in 80 bits, some in 128, and some of the 80-bit
       +platforms pad it to 96 or 128 bits.</p><p>In both cases it’s best to use <code class="literal">sizeof()</code> to check the storage size.</p><p>Finally, under x86 Linux doubles are sometimes an exception to the
       +self-alignment rule; an 8-byte double may require only 4-byte
       +alignment within a struct even though standalone doubles variables
       +have 8-byte self-alignment. This depends on compiler and options.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_readability_and_cache_locality"></a>9. Readability and cache locality</h2></div></div></div><p>While reordering by size is the simplest way to eliminate slop, it’s
       +not necessarily the right thing.  There are two more issues:
       +readability and cache locality.</p><p>Programs are not just communications to a computer, they are
       +communications to other human beings.  Code readability is important
       +even (or especially!) when the audience of the communication is only
       +your future self.</p><p>A clumsy, mechanical reordering of your structure can harm
       +readability.  When possible, it is better to reorder fields so they
       +remain in coherent groups with semantically related pieces of data
       +kept close together.  Ideally, the design of your structure should
       +communicate the design of your program.</p><p>When your program frequently accesses a structure, or parts of a
       +structure, it is helpful for performance if the accesses tend to fit
       +within a <span class="emphasis"><em>cache line</em></span> - the memory block fetched by your processor
       +when it is told to get any single address within the block.  On 64-bit
       +x86 a cache line is 64 bytes beginning on a self-aligned address; on
       +other platforms it is often 32 bytes.</p><p>The things you should do to preserve readability - grouping related
       +and co-accessed data in adjacent fields - also improve cache-line
       +locality. These are both reasons to reorder intelligently, with
       +awareness of your code’s data-access patterns.</p><p>If your code does concurrent access to a structure from multiple
       +threads, there’s a third issue: cache line bouncing. To minimize
       +expensive bus traffic, you should arrange your data so that reads
       +come from one cache line and writes go to another in your tighter
       +loops.</p><p>And yes, this sometimes contradicts the previous guidance about grouping
       +related data in the same cache-line-sized block. Multithreading is hard.
       +Cache-line bouncing and other multithread optimization issues are very
       +advanced topics which deserve an entire tutorial of their own.  The
       +best I can do here is make you aware that these issues exist.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_other_packing_techniques"></a>10. Other packing techniques</h2></div></div></div><p>Reordering works best when combined with other techniques for slimming
       +your structures. If you have several boolean flags in a struct, for example,
       +consider reducing them to 1-bit bitfields and packing them into a
       +place in the structure that would otherwise be slop.</p><p>You’ll take a small access-time penalty for this - but if it squeezes
       +the working set enough smaller, that penalty will be swamped by your
       +gains from avoided cache misses.</p><p>More generally, look for ways to shorten data field sizes.  In
       +cvs-fast-export, for example, one squeeze I applied was to use the
       +knowledge that RCS and CVS repositories didn’t exist before 1982.  I
       +dropped a 64-bit Unix <code class="literal">time_t</code> (zero date at the beginning of 1970)
       +for a 32-bit time offset from 1982-01-01T00:00:00; this will cover
       +dates to 2118.  (Note: if you pull a trick like this, <span class="emphasis"><em>do a bounds
       +check whenever you set the field</em></span> to prevent nasty bugs!)</p><p>Each such field shortening not only decreases the explicit size
       +of your structure, it may remove slop and/or create additional
       +opportunities for gains from field reordering.  Virtuous cascades
       +of such effects are not very hard to trigger.</p><p>The riskiest form of packing is to use unions.  If you know that
       +certain fields in your structure are never used in combination with
       +certain other fields, consider using a union to make them share
       +storage.  But be extra careful and verify your work with regression
       +testing, because if your lifetime analysis is even slightly wrong you
       +will get bugs ranging from crashes to (much worse) subtle data
       +corruption.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_overriding_alignment_rules"></a>11. Overriding alignment rules</h2></div></div></div><p>Sometimes you can coerce your compiler into not using the processor’s
       +normal alignment rules by using a pragma, usually <code class="literal">#pragma pack</code>.
       +GCC and clang have a "packed" attribute you can attach to
       +individual structure declarations; GCC has an -fpack-struct option
       +for entire compilations.</p><p>Do not do this casually, as it forces the generation of more expensive
       +and slower code. Usually you can save as much memory, or almost as
       +much, with the techniques I describe here.</p><p>The only really compelling reason for <code class="literal">#pragma pack</code> is if you have to
       +exactly match your C data layout to some kind of bit-level hardware or
       +protocol requirement, like a memory-mapped hardware port, and
       +violating normal alignment is required for that to work. If you’re in
       +that situation, and you don’t already know everything else I’m writing
       +about here, you’re in deep trouble and I wish you luck.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_tools"></a>12. Tools</h2></div></div></div><p>The clang compiler has a -Wpadded option that causes it to generate
       +messages about alignment holes and padding.  Some versions also have
       +an undocumented -fdump-record-layouts option that yields
       +<a class="ulink" href="http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-July/037778.html" target="_top">more
       +information</a>.</p><p>If you’re using C11, you can deploy static_assert to check your
       +assumptions about type and structure sizes.  Example:</p><pre class="screen">#include &lt;assert.h&gt;
       +struct foo4 {
       +    short s;     /* 2 bytes */
       +    char c;      /* 1 byte */
       +};
       +static_assert(sizeof(struct foo4) == 4, “Check your assumptions");</pre><p>I have not used it myself, but several respondents speak well of a
       +program called <code class="literal">pahole</code>. This tool cooperates with a compiler to produce
       +reports on your structures that describe padding, alignment, and
       +cache line boundaries.  This was at one time a standalone C program,
       +but that is now unmaintained; a script with the name pahole now ships
       +with gdb and that is what you should use.</p><p>I’ve received a report that a proprietary code auditing tool called
       +"PVS Studio" can detect structure-packing opportunities.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_proof_and_exceptional_cases"></a>13. Proof and exceptional cases</h2></div></div></div><p>You can download sourcecode for a little program that demonstrates the
       +assertions about scalar and structure sizes made above.  It is
       +<a class="ulink" href="packtest.c" target="_top">packtest.c</a>.</p><p>If you look through enough strange combinations of compilers, options,
       +and unusual hardware, you will find exceptions to some of the rules I
       +have described. They get more common as you go back in time to older
       +processor designs.</p><p>The next level beyond knowing these rules is knowing how and when to
       +expect that they will be broken.  In the years when I learned them
       +(the early 1980s) we spoke of people who didn’t get this as victims of
       +"all-the-world’s-a-VAX" syndrome. Remember that not all the world is
       +vanilla.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="C-like"></a>14. Other languages</h2></div></div></div><p>In this section we’ll call a language "C-like" if structure and array
       +members are self-aligned, are not reordered by the compiler, the address
       +of a struct is the address of its first member, and structs have
       +trailing pdding to their stride length.</p><p>If you know the implications of self-aligment in C, you can apply them
       +directly to calculating sizes and offsets in any language that is
       +C-like in this sense, and to space-optimizing in the language’s
       +structures.</p><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_c"></a>14.1. C++</h3></div></div></div><p>C++ is C-like, except that classes that look like structs may ignore
       +the rule that the address of a struct is the address of its first
       +member! Whether they do or not depends on how base classes and virtual
       +member functions are implemented, and varies by compiler.  Otherwise
       +everything we’ve observed about C applies.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_go"></a>14.2. Go</h3></div></div></div><p>The Go language is in many respects similar to C.  It has structures
       +and arrays, though not bitfields or unions.  Go compilers have the
       +same optimization and alignment issues as C compilers. As in C, array
       +elements are padded up to the following stride address.</p><p>There is no Go equivalent of the C #pack pragma.</p><p>Variables and struct fields will normally be self-aligned for the same
       +reasons rgis is a rule in C.  However, there is one peculiat
       +exception; on 32—bit platforms, 64-bit struct fields only require
       +akignment on a machine word boundary, e.g. 32 bits. There has been
       +discussion of
       +<a class="ulink" href="https://go.googlesource.com/proposal/+/master/design/36606-64-bit-field-alignment.md[a" target="_top">https://go.googlesource.com/proposal/+/master/design/36606-64-bit-field-alignment.md[a</a>
       +proposal to change this, but it stands.</p><p>On the other hand, the Go specification makes no guarantees about how
       +structure fields are ordered. Unlike C, it would be legal for a Go
       +compiler to lay out fields in an order different from their
       +specification in the source.  As of 2022 no Go compiler that
       +actually does this has beenbn sughted in the wild.</p><p>Go has one odd really odd quirk.  Since Go 1.5, a zero-length field at
       +the end of a struct (that is, a zero-length array or empty struct) is
       +sized and aligned as though it is one byte.  The reasons for this are
       +discussed in an essay
       +<a class="ulink" href="https://dave.cheney.net/2015/10/09/padding-is-hard" target="_top">Padding is Hard</a> by
       +one of the Go developers.</p><p>There’s a
       +<a class="ulink" href="https://itnext.io/structure-size-optimization-in-golang-alignment-padding-more-effective-memory-layout-linters-fffdcba27c61" target="_top">specific
       +discussion of Go alignment rules</a> that includes pointers tp some tools
       +that can automatically tweak your structures to optimal alignment.
       +Another such tool is
       +<a class="ulink" href="https://github.com/orijtech/structslop" target="_top">strucrslop</a>. I have not used
       +any of these, try at your own risk.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_rust"></a>14.3. Rust</h3></div></div></div><p>Rust follows C-like packing rules if a structure is annotated with
       +"repr(C)". Otherwise (by default) all bets are off: padding rules are
       +(deliberately) unspecified and the compiler may even reorder structure
       +members. It is probably best to let the Rust compiler do space
       +optimization rather than forcing it.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_java"></a>14.4. Java</h3></div></div></div><p>Java’s JNI (Java Native Interface) supports C-like packing rules for
       +structure members so that JNI Java structures can map exactly to C
       +equivalents. There is a pack pragma.</p><p>Packing in the JVM itself, however, is not well defined. The JVM
       +spec simply says "The Java Virtual Machine does not mandate any
       +particular internal structure for objects.", making choices about
       +this implementation-dependent.</p><p>That said, many JVM implementations are word-oriented in an even
       +stricter way than Motorola processors - structure and array members
       +can start at any 32-bit boundary but not at a 16- or 8-bit one. This
       +will create internal padding after char and 16-bit short members where
       +it wouldn’t be expected under C-like rules.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_swift"></a>14.5. Swift</h3></div></div></div><p>Swift is
       +<a class="ulink" href="https://swiftunboxed.com/internals/size-stride-alignment/" target="_top">exactly
       +C-like</a>. There is no equivalent of a pack pragma.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_c_2"></a>14.6. C#</h3></div></div></div><p>C# is C-like with the default structure layout attribute
       +LayoutKind.Sequential.  LayoutKind.Auto allows the compiler to
       +reorder, and LayoutKind.Explicit allows the programmmer to specify
       +field sizes explicitly. There’s a Pack modifier which is equivalent to
       +a C pack pragma.</p></div></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_supporting_this_work"></a>15. Supporting this work</h2></div></div></div><p>If you were educated or entertained by this document, please sign up
       +for my <a class="ulink" href="https://www.patreon.com/esr" target="_top">Patreon feed</a>. The time needed to
       +write and maintain documents like this one is not free, and while
       +I enjoying giving them to the world my bills won’t pay themselves.
       +Even a few dollars a month - from enough of you - helps a lot.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_related_reading"></a>16. Related Reading</h2></div></div></div><p>This section exists to collect pointers to essays which I judge to be
       +good companions to this one.</p><p><a class="ulink" href="http://blog.regehr.org/archives/213" target="_top">A Guide to Undefined Behavior in C and C++</a></p><p><a class="ulink" href="http://www.catb.org/esr/time-programming/" target="_top">Time, Clock, and Calendar Programming In C</a></p><p><a class="ulink" href="http://www.catb.org/esr/faqs/things-every-hacker-once-knew/" target="_top">Things Every Hacker Once Knew</a></p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_version_history"></a>17. Version history</h2></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term">
       +2.5 @ 2020-04-30
       +</span></dt><dd>
       +   Revision and expansion of the section on Go packing/alignment rules.
       +</dd><dt><span class="term">
       +2.4 @ 2020-04-28
       +</span></dt><dd>
       +   Added coverage of Java, Swift, and C#.
       +</dd><dt><span class="term">
       +2.3 @ 2020-04-20
       +</span></dt><dd>
       +   Describe exceptional alignment rules on Motorola 680x0.
       +   Sightly improve coverage of Rust alignment rules.
       +</dd><dt><span class="term">
       +2.2 @ 2019-12-19
       +</span></dt><dd>
       +   Minor markup fix.
       +</dd><dt><span class="term">
       +2.1 @ 2019-02-14
       +</span></dt><dd>
       +   Correct a minor error in one of the examples pointed out by
       +   Luis Emilio Moreno Durán.
       +</dd><dt><span class="term">
       +2.0 @ 2018-08-06
       +</span></dt><dd>
       +    Drop "C" out of the title as these techniques are applicable to Go
       +    and Rust as well. More coverage of Go and Rust. The pahole tool
       +    has been replaced by a script in the gdb distribution.
       +</dd><dt><span class="term">
       +1.19 @ 2018-02-12
       +</span></dt><dd>
       +    Describe usefulness of static_assert() in C11. Added Go and Rust coverage.
       +</dd><dt><span class="term">
       +1.18 @ 2017-06-01
       +</span></dt><dd>
       +    More general zero-padding orders.  C11 and C14 relax a constraint
       +    on bitfield packing.
       +</dd><dt><span class="term">
       +1.17 @ 2016-11-14
       +</span></dt><dd>
       +    Typo fixes.
       +</dd><dt><span class="term">
       +1.16 @ 2016-10-21
       +</span></dt><dd>
       +    Answer an objection about allocation order being unrelated to
       +    source order.
       +</dd><dt><span class="term">
       +1.15 @ 2016-10-20
       +</span></dt><dd>
       +    Note the field evidence from NTP.
       +</dd><dt><span class="term">
       +1.14 @ 2015-12-19
       +</span></dt><dd>
       +    Typo correction: -Wpadding → -Wpadded.
       +</dd><dt><span class="term">
       +1.13 @ 2015-11-23
       +</span></dt><dd>
       +    Be explicit about padding bits being undefined.  More about bitfields.
       +</dd><dt><span class="term">
       +1.12 @ 2015-11-11
       +</span></dt><dd>
       +    Major revision of section on bitfields reflecting C99 rules.
       +</dd><dt><span class="term">
       +1.11 @ 2015-07-23
       +</span></dt><dd>
       +    Mention the clang -fdump-record-layouts option.
       +</dd><dt><span class="term">
       +1.10 @ 2015-02-20
       +</span></dt><dd>
       +    Mention <span class="emphasis"><em>attribute</em></span><a id="idm359" class="indexterm"></a><span class="emphasis"><em>packed</em></span>, -fpack-struct, and PVS Studio.
       +</dd><dt><span class="term">
       +1.9 @ 2014-10-01
       +</span></dt><dd>
       +    Added link to "Time, Clock, and Calendar Programming In C".
       +</dd><dt><span class="term">
       +1.8 @ 2014-05-20
       +</span></dt><dd>
       +    Improved explanation for the bitfield examples,
       +</dd><dt><span class="term">
       +1.7 @ 2014-05-17
       +</span></dt><dd>
       +    Correct a minor error in the description of the layout of <code class="literal">struct foo8</code>.
       +</dd><dt><span class="term">
       +1.6 @ 2014-05-14
       +</span></dt><dd>
       +    Emphasize that bitfields cannot cross word boundaries.  Idea from
       +    Dale Gulledge.
       +</dd><dt><span class="term">
       +1.5 @ 2014-01-13
       +</span></dt><dd>
       +    Explain why structure member reordering is not done automatically.
       +</dd><dt><span class="term">
       +1.4 @ 2014-01-04
       +</span></dt><dd>
       +    A note about double under x86 Linux.
       +</dd><dt><span class="term">
       +1.3 @ 2014-01-03
       +</span></dt><dd>
       +    New sections on awkward scalar cases, readability and cache
       +    locality, and tools.
       +</dd><dt><span class="term">
       +1.2 @ 2014-01-02
       +</span></dt><dd>
       +    Correct an erroneous address calculation.
       +</dd><dt><span class="term">
       +1.1 @ 2014-01-01
       +</span></dt><dd>
       +    Explain why aligned accesses are faster.  Mention offsetof.
       +    Various minor fixes, including the packtest.c download link.
       +</dd><dt><span class="term">
       +1.0 @ 2014-01-01
       +</span></dt><dd>
       +    Initial release.
       +</dd></dl></div></div></div></body></html>
       +\ No newline at end of file