catb.org.html - webdump_tests - Testfiles for webdump
 (HTM) git clone git://git.codemadness.org/webdump_tests
 (DIR) Log
 (DIR) Files
 (DIR) Refs
 (DIR) README
       ---
       catb.org.html (48597B)
       ---
            1 <?xml version="1.0" encoding="UTF-8"?>
            2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>The Lost Art of Structure Packing</title><link rel="stylesheet" type="text/css" href="docbook-xsl.css" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /></head><body><div xml:lang="en" class="article" lang="en"><div class="titlepage"><div><div><h2 class="title"><a id="idm1"></a>The Lost Art of Structure Packing</h2></div><div><div class="author"><h3 class="author"><span class="firstname">Eric</span> <span class="othername">S.</span> <span class="surname">Raymond</span></h3><code class="email">&lt;<a class="email" href="mailto:esr@thyrsus.com">esr@thyrsus.com</a>&gt;</code></div></div></div><hr /></div><div class="toc"><p><strong>Table of Contents</strong></p><dl class="toc"><dt><span class="section"><a href="#_who_should_read_this">1. Who should read this</a></span></dt><dt><span class="section"><a href="#_why_i_wrote_it">2. Why I wrote it</a></span></dt><dt><span class="section"><a href="#_alignment_requirements">3. Alignment requirements</a></span></dt><dt><span class="section"><a href="#_padding">4. Padding</a></span></dt><dt><span class="section"><a href="#_structure_alignment_and_padding">5. Structure alignment and padding</a></span></dt><dt><span class="section"><a href="#_bitfields">6. Bitfields</a></span></dt><dt><span class="section"><a href="#_structure_reordering">7. Structure reordering</a></span></dt><dt><span class="section"><a href="#_awkward_scalar_cases">8. Awkward scalar cases</a></span></dt><dt><span class="section"><a href="#_readability_and_cache_locality">9. Readability and cache locality</a></span></dt><dt><span class="section"><a href="#_other_packing_techniques">10. Other packing techniques</a></span></dt><dt><span class="section"><a href="#_overriding_alignment_rules">11. Overriding alignment rules</a></span></dt><dt><span class="section"><a href="#_tools">12. Tools</a></span></dt><dt><span class="section"><a href="#_proof_and_exceptional_cases">13. Proof and exceptional cases</a></span></dt><dt><span class="section"><a href="#C-like">14. Other languages</a></span></dt><dd><dl><dt><span class="section"><a href="#_c">14.1. C++</a></span></dt><dt><span class="section"><a href="#_go">14.2. Go</a></span></dt><dt><span class="section"><a href="#_rust">14.3. Rust</a></span></dt><dt><span class="section"><a href="#_java">14.4. Java</a></span></dt><dt><span class="section"><a href="#_swift">14.5. Swift</a></span></dt><dt><span class="section"><a href="#_c_2">14.6. C#</a></span></dt></dl></dd><dt><span class="section"><a href="#_supporting_this_work">15. Supporting this work</a></span></dt><dt><span class="section"><a href="#_related_reading">16. Related Reading</a></span></dt><dt><span class="section"><a href="#_version_history">17. Version history</a></span></dt></dl></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_who_should_read_this"></a>1. Who should read this</h2></div></div></div><p>This page is about a technique for reducing the memory footprint of
            3 programs in compiled languages with C-like structures - manually
            4 repacking these declarations for reduced size. To read it, you will
            5 require basic knowledge of the C programming language.</p><p>You need to know this technique if you intend to write code for
            6 memory-constrained embedded systems, or operating-system kernels.  It
            7 is useful if you are working with application data sets so large that
            8 your programs routinely hit memory limits.  It is good to know in any
            9 application where you really, <span class="emphasis"><em>really</em></span> care about optimizing your use
           10 of memory bandwidth and minimizing cache-line misses.</p><p>Finally, knowing this technique is a gateway to other esoteric C
           11 topics.  You are not an advanced C programmer until you have grasped
           12 these rules. You are not a master of C until you could have written
           13 this document yourself and can criticize it intelligently.</p><p>This document originated with "C" in the title, but almost everything
           14 in it applies to C++ as well. Many of the techniques discussed here
           15 also apply to the Go language, and should generalize to any compiled
           16 language with C-like structures.  There is a note discussing C++, Go,
           17 Rust, Java, Swift, and C# towards the end.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_why_i_wrote_it"></a>2. Why I wrote it</h2></div></div></div><p>This webpage exists because in late 2013 I found myself heavily
           18 applying an optimization technique that I had learned more than two
           19 decades previously and not used much since.</p><p>I needed to reduce the memory footprint of a program that used thousands -
           20 sometimes hundreds of thousands - of C struct instances. The program
           21 was <a class="ulink" href="http://www.catb.org/~esr/cvs-fast-export" target="_top">cvs-fast-export</a> and
           22 the problem was that it was dying with out-of-memory errors on large
           23 repositories.</p><p>There are ways to reduce memory usage significantly in situations like
           24 this, by rearranging the order of structure members in careful ways.
           25 This can lead to dramatic gains - in my case I was able to cut the
           26 working-set size by around 40%, enabling the program to handle
           27 much larger repositories without dying.</p><p>But as I worked, and thought about what I was doing, it began to dawn
           28 on me that the technique I was using has been more than half forgotten
           29 in these latter days.  A little web research confirmed that
           30 programmers don’t seem to talk about it much any more, at least not
           31 where a search engine can see them.  A couple of Wikipedia entries
           32 touch the topic, but I found nobody who covered it comprehensively.</p><p>There are actually reasons for this that aren’t stupid.  CS courses
           33 (rightly) steer people away from micro-optimization towards finding better
           34 algorithms. The plunging price of machine resources has made squeezing
           35 memory usage less necessary.  And the way hackers used to learn how to
           36 do it back in the day was by bumping their noses on strange hardware
           37 architectures - a less common experience now.</p><p>But the technique still has value in important situations, and will
           38 as long as memory is finite.  This document is intended to save
           39 programmers from having to rediscover the technique, so they can
           40 concentrate effort on more important things.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_alignment_requirements"></a>3. Alignment requirements</h2></div></div></div><p>The first thing to understand is that, on modern processors, the way
           41 your compiler lays out basic datatypes in memory is constrained
           42 in order to make memory accesses faster.  Our examples are in C, but
           43 any compiled language generates code under the same constraints.</p><p>There is a large class of modern ISAs (Instruction Set Architectures)
           44 for which these constraints lead to identical layouts. These ISAs
           45 include Intel, ARM, and RISC-V; I will refer to these as "vanilla"
           46 ISAs.</p><p>Storage for the basic C datatypes on a vanilla ISA doesn’t
           47 normally start at arbitrary byte addresses in memory. Rather, each
           48 type except char has an <span class="emphasis"><em>alignment requirement</em></span>; chars can start on
           49 any byte address, but 2-byte shorts must start on an even address,
           50 4-byte ints or floats must start on an address divisible by 4, and
           51 8-byte longs or doubles must start on an address divisible by 8.
           52 Signed or unsigned makes no difference.</p><p>The jargon for this is that basic C types on a vanilla ISA are
           53 <span class="emphasis"><em>self-aligned</em></span>.  Pointers, whether 32-bit (4-byte) or 64-bit (8-byte)
           54 are self-aligned too.</p><p>Self-alignment makes access faster because it facilitates generating
           55 single-instruction fetches and puts of the typed data. Without
           56 alignment constraints, on the other hand, the code might end up having
           57 to do two or more accesses spanning machine-word boundaries.
           58 Characters are a special case; they’re equally expensive from anywhere
           59 they live inside a single machine word. That’s why they don’t have a
           60 preferred alignment.</p><p>I said "on modern processors" because on some older ones forcing your
           61 C program to violate alignment rules (say, by casting an odd address
           62 into an int pointer and trying to use it) didn’t just slow your code
           63 down, it caused an illegal instruction fault. This was the behavior, for
           64 example, on Sun SPARC chips. In fact, with sufficient determination
           65 and the right (e18) hardware flag set on the processor, you can still
           66 trigger this on x86.</p><p>Also, self-alignment is not the only possible rule. Historically, some
           67 processors (especially those lacking
           68 <a class="ulink" href="https://en.wikipedia.org/wiki/Barrel_shifter" target="_top">barrel shifters</a>) have had
           69 more restrictive ones. If you do embedded systems, you might trip over one
           70 of these lurking in the underbrush.  Be aware this is possible.</p><p>One curious and illustrative exception is the Motorola 68020 and its
           71 successors. These are word-oriented 32-bit machines - that is, the
           72 underlying granularity of fast access is 16 bits. Compilers can start
           73 structs on 16-bit boundaries without a speed penalty, even if the
           74 first member was a 32-bit scalar.  Therefore, only character fields
           75 with odd byte lengths can ever cause padding.</p><p>From when it was first written at the beginning of 2014 until late
           76 2016, this section ended with other caveats about odd
           77 architectures. During that period I’ve learned something rather
           78 reassuring from working with the source code for the reference
           79 implementation of NTP.  It does packet analysis by reading packets off
           80 the wire directly into memory that the rest of the code sees as a
           81 struct, relying on the assumption of minimal self-aligned padding - or
           82 zero padding in odd cases like 690x0.</p><p>The interesting news is that NTP has apparently being getting away
           83 with this for <span class="strong"><strong>decades</strong></span> across a very wide span of hardware, operating
           84 systems, and compilers, including not just Unixes but under Windows
           85 variants as well. This suggests that platforms with padding rules
           86 other than self-alignment are either nonexistent or confined to
           87 such specialized niches that they’re never either NTP servers or
           88 clients.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_padding"></a>4. Padding</h2></div></div></div><p>Now we’ll look at a simple example of variable layout in
           89 memory. Consider the following series of variable declarations
           90 in the top level of a C module:</p><pre class="screen">char *p;
           91 char c;
           92 int x;</pre><p>If you didn’t know anything about data alignment, you might assume
           93 that these three variables would occupy a continuous span of bytes in
           94 memory.  That is, on a 32-bit machine 4 bytes of pointer would be
           95 immediately followed by 1 byte of char and that immediately followed
           96 by 4 bytes of int.  And a 64-bit machine would be different only
           97 in that the pointer would be 8 bytes.</p><p>In fact, the hidden assumption that the allocated order of static
           98 variables is their source order is not necessarily valid; the C
           99 standards don’t mandate it.  I’m going to ignore this detail because
          100 (a) that hidden assumption is usually correct anyway, and (b) the
          101 actual purpose of talking about padding and packing outside structures
          102 is to prepare you for what happens inside them.</p><p>Here’s what actually happens (on an x86 or ARM or anything else with
          103 self-aligned types). The storage for <code class="literal">p</code> starts on a self-aligned 4-
          104 or 8-byte boundary depending on the machine word size.  This is
          105 <span class="emphasis"><em>pointer alignment</em></span> - the strictest possible.</p><p>The storage for <code class="literal">c</code> follows immediately.  But the 4-byte alignment
          106 requirement of <code class="literal">x</code> forces a gap in the layout; it comes out as though
          107 there were a fourth intervening variable, like this:</p><pre class="screen">char *p;      /* 4 or 8 bytes */
          108 char c;       /* 1 byte */
          109 char pad[3];  /* 3 bytes */
          110 int x;        /* 4 bytes */</pre><p>The <code class="literal">pad[3]</code> character array represents the fact that there are three
          111 bytes of waste space in the structure.  The old-school term for this
          112 was "slop". The value of the padding bits is undefined; in particular
          113 it is not guaranteed that they will be zeroed.</p><p>Compare what happens if <code class="literal">x</code> is a 2-byte short:</p><pre class="screen">char *p;
          114 char c;
          115 short x;</pre><p>In that case, the actual layout will be this:</p><pre class="screen">char *p;      /* 4 or 8 bytes */
          116 char c;       /* 1 byte */
          117 char pad[1];  /* 1 byte */
          118 short x;      /* 2 bytes */</pre><p>On the other hand, if <code class="literal">x</code> is a long on a 64-bit machine</p><pre class="screen">char *p;
          119 char c;
          120 long x;</pre><p>we end up with this:</p><pre class="screen">char *p;     /* 8 bytes */
          121 char c;      /* 1 byte
          122 char pad[7]; /* 7 bytes */
          123 long x;      /* 8 bytes */</pre><p>If you have been following carefully, you are probably now wondering about
          124 the case where the shorter variable declaration comes first:</p><pre class="screen">char c;
          125 char *p;
          126 int x;</pre><p>If the actual memory layout were written like this</p><pre class="screen">char c;
          127 char pad1[M];
          128 char *p;
          129 char pad2[N];
          130 int x;</pre><p>what can we say about <code class="literal">M</code> and <code class="literal">N</code>?</p><p>First, in this case <code class="literal">N</code> will be zero.  The address of <code class="literal">x</code>, coming
          131 right after <code class="literal">p</code>, is guaranteed to be pointer-aligned, which is
          132 never less strict than int-aligned.</p><p>The value of <code class="literal">M</code> is less predictable.  If the compiler
          133 happened to map <code class="literal">c</code> to the last byte of a machine word, the next
          134 byte (the first of <code class="literal">p</code>) would be the first byte of the next one
          135 and properly pointer-aligned.  M would be zero.</p><p>It is more likely that <code class="literal">c</code> will be mapped to the <span class="emphasis"><em>first</em></span> byte of a
          136 machine word.  In that case M will be whatever padding is needed to
          137 ensure that <code class="literal">p</code> has pointer alignment - 3 on a 32-bit machine,
          138 7 on a 64-bit machine.</p><p>Intermediate cases are possible.  M can be anything  from 0 to 7
          139 (0 to 3 on 32-bit) because a char can start on any byte boundary
          140 in a machine word.</p><p>If you wanted to make those variables take up less space,
          141 you could get that effect by swapping <code class="literal">x</code> with <code class="literal">c</code> in the
          142 original sequence.</p><pre class="screen">char *p;     /* 8 bytes */
          143 long x;      /* 8 bytes */
          144 char c;      /* 1 byte</pre><p>Usually, for the small number of scalar variables in your C programs,
          145 bumming out the few bytes you can get by changing the order of
          146 declaration won’t save you enough to be significant.  The technique
          147 becomes more interesting when applied to nonscalar variables -
          148 especially structs.</p><p>Before we get to those, let’s dispose of arrays of scalars.  On a
          149 platform with self-aligned types, arrays of
          150 char/short/int/long/pointer have no internal padding; each
          151 member is automatically self-aligned at the end of the next one.</p><p>All these rules and examples map over to Go structs, and to Rust
          152 structs with the "repr(C)" attribute, with only syntactic changes.</p><p>In the next section we will see that the same is <span class="strong"><strong>not</strong></span> necessarily true of
          153 structure arrays.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_structure_alignment_and_padding"></a>5. Structure alignment and padding</h2></div></div></div><p>In general, a struct instance will have the alignment of its widest scalar
          154 member. Compilers do this as the easiest way to ensure that all the
          155 members are self-aligned for fast access.</p><p>Also, in C (and Go, and Rust) the address of a struct is the same as
          156 the address of its first member - there is no leading padding. In
          157 C++ this may not be true; see <a class="xref" href="#C-like" title="14. Other languages">Section 14, “Other languages”</a>.</p><p>(When you’re in doubt about this sort of thing, ANSI C provides an
          158 offsetof() macro which can be used to read out structure member
          159 offsets.)</p><p>Consider this struct:</p><pre class="screen">struct foo1 {
          160     char *p;
          161     char c;
          162     long x;
          163 };</pre><p>Assuming a 64-bit machine, any instance of <code class="literal">struct foo1</code> will have
          164 8-byte alignment. The memory layout of one of these looks
          165 unsurprising, like this:</p><pre class="screen">struct foo1 {
          166     char *p;     /* 8 bytes */
          167     char c;      /* 1 byte
          168     char pad[7]; /* 7 bytes */
          169     long x;      /* 8 bytes */
          170 };</pre><p>It’s laid out exactly as though variables of these types has
          171 been separately declared.  But if we put <code class="literal">c</code> first, that’s
          172 no longer true.</p><pre class="screen">struct foo2 {
          173     char c;      /* 1 byte */
          174     char pad[7]; /* 7 bytes */
          175     char *p;     /* 8 bytes */
          176     long x;      /* 8 bytes */
          177 };</pre><p>If the members were separate variables, <code class="literal">c</code> could start at any byte
          178 boundary and the size of <code class="literal">pad</code> might vary.  Because <code class="literal">struct foo2</code> has
          179 the pointer alignment of its widest member, that’s no longer possible.
          180 Now <code class="literal">c</code> has to be pointer-aligned, and following padding of 7 bytes is
          181 locked in.</p><p>Now let’s talk about trailing padding on structures. To explain this,
          182 I need to introduce a basic concept which I’ll call the <span class="emphasis"><em>stride
          183 address</em></span> of a structure. It is the first address following the
          184 structure data that has the <span class="strong"><strong>same alignment as the structure</strong></span>.</p><p>The general rule of trailing structure padding is this: the compiler
          185 will behave as though the structure has trailing padding out to
          186 its stride address. This rule controls what <code class="literal">sizeof()</code> will return.</p><p>Consider this example on a 64-bit x86 or ARM machine:</p><pre class="screen">struct foo3 {
          187     char *p;     /* 8 bytes */
          188     char c;      /* 1 byte */
          189 };
          190 
          191 struct foo3 singleton;
          192 struct foo3 quad[4];</pre><p>You might think that <code class="literal">sizeof(struct foo3)</code> should be 9, but it’s
          193 actually 16.  The stride address is that of <code class="literal">(&amp;p)[2]</code>.  Thus, in the <code class="literal">quad</code>
          194 array, each member has 7 bytes of trailing padding, because the first
          195 member of each following struct wants to be self-aligned on an 8-byte
          196 boundary.  The memory layout is as though the structure had been
          197 declared like this:</p><pre class="screen">struct foo3 {
          198     char *p;     /* 8 bytes */
          199     char c;      /* 1 byte */
          200     char pad[7];
          201 };</pre><p>For contrast, consider the following example:</p><pre class="screen">struct foo4 {
          202     short s;     /* 2 bytes */
          203     char c;      /* 1 byte */
          204 };</pre><p>Because <code class="literal">s</code> only needs to be 2-byte aligned, the stride address is
          205 just one byte after <code class="literal">c</code>, and <code class="literal">struct foo4</code> as a whole only needs one
          206 byte of trailing padding.  It will be laid out like this:</p><pre class="screen">struct foo4 {
          207     short s;     /* 2 bytes */
          208     char c;      /* 1 byte */
          209     char pad[1];
          210 };</pre><p>and <code class="literal">sizeof(struct foo4)</code> will return 4.</p><p>Here’s a last important detail:  If your structure has structure
          211 members, the inner structs want to have the alignment of
          212 longest scalar too.  Suppose you write this:</p><pre class="screen">struct foo5 {
          213     char c;
          214     struct foo5_inner {
          215         char *p;
          216         short x;
          217     } inner;
          218 };</pre><p>The <code class="literal">char *p</code> member in the inner struct forces the outer struct to be
          219 pointer-aligned as well as the inner.  Actual layout will be like
          220 this on a 64-bit machine:</p><pre class="screen">struct foo5 {
          221     char c;           /* 1 byte*/
          222     char pad1[7];     /* 7 bytes */
          223     struct foo5_inner {
          224         char *p;      /* 8 bytes */
          225         short x;      /* 2 bytes */
          226         char pad2[6]; /* 6 bytes */
          227     } inner;
          228 };</pre><p>This structure gives us a hint of the savings that might be possible
          229 from repacking structures.  Of 24 bytes, 13 of them are padding.
          230 That’s more than 50% waste space!</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_bitfields"></a>6. Bitfields</h2></div></div></div><p>Now let’s consider C bitfields.  What they give you the ability to do is
          231 declare structure fields of smaller than character width, down to 1
          232 bit, like this:</p><pre class="screen">struct foo6 {
          233     short s;
          234     char c;
          235     int flip:1;
          236     int nybble:4;
          237     int septet:7;
          238 };</pre><p>The thing to know about bitfields is that they are implemented with
          239 word- and byte-level mask and rotate instructions operating on machine
          240 words, and cannot cross word boundaries.  C99 guarentees that
          241 bit-fields will be packed as tightly as possible, provided they don’t
          242 cross storage unit boundaries (6.7.2.1 #10).</p><p>This restriction is relaxed in C11 (6.7.2.1p11) and C++14
          243 ([class.bit]p1); these revisions do not actually require <code class="literal">struct foo9</code>
          244 to be 64 bits instead of 32; a bit-field can span multiple allocation
          245 units instead of starting a new one. It’s up to the implementation to
          246 decide; GCC leaves it up to the ABI, which for x64 does prevent them
          247 from sharing an allocation unit.</p><p>Assuming we’re on a 32-bit machine, the C99 rules imply that the layout may
          248 look like this:</p><pre class="screen">struct foo6 {
          249     short s;       /* 2 bytes */
          250     char c;        /* 1 byte */
          251     int flip:1;    /* total 1 bit */
          252     int nybble:4;  /* total 5 bits */
          253     int pad1:3;    /* pad to an 8-bit boundary */
          254     int septet:7;  /* 7 bits */
          255     int pad2:25;   /* pad to 32 bits */
          256 };</pre><p>But this isn’t the only possibility, because the C standard does not
          257 specify that bits are allocated low-to-high.  So the layout could look
          258 like this:</p><pre class="screen">struct foo6 {
          259     short s;       /* 2 bytes */
          260     char c;        /* 1 byte */
          261     int pad1:3;    /* pad to an 8-bit boundary */
          262     int flip:1;    /* total 1 bit */
          263     int nybble:4;  /* total 5 bits */
          264     int pad2:25;   /* pad to 32 bits */
          265     int septet:7;  /* 7 bits */
          266 };</pre><p>That is, the padding could precede rather than following the
          267 payload bits.</p><p>Note also that, as with normal structure padding, the padding bits are
          268 not guaranteed to be zero; C99 mentions this.</p><p>Note that the base type of a bit field is interpreted for signedness
          269 but not necessarily for size.  It is up to implementors whether
          270 "short flip:1" or "long flip:1" are supported, and whether those
          271 base types change the size of the storage unit the field is packed into.</p><p>Proceed with caution and check with -Wpadded if you have it available
          272 (e.g. under clang).  Compilers on exotic hardware might interpret the
          273 C99 rules in surprising ways; older compilers might not quite follow
          274 them.</p><p>The restriction that bitfields cannot cross machine word boundaries
          275 means that, while the first two of the following structures pack into
          276 one and two 32-bit words as you’d expect, the third (<code class="literal">struct foo9</code>) takes
          277 up <span class="strong"><strong>three</strong></span> 32-bit words in C99, in the last of which only one bit is used.</p><pre class="screen">struct foo7 {
          278     int bigfield:31;      /* 32-bit word 1 begins */
          279     int littlefield:1;
          280 };
          281 
          282 struct foo8 {
          283     int bigfield1:31;     /* 32-bit word 1 begins /*
          284     int littlefield1:1;
          285     int bigfield2:31;     /* 32-bit word 2 begins */
          286     int littlefield2:1;
          287 };
          288 
          289 struct foo9 {
          290     int bigfield1:31;     /* 32-bit word 1 begins */
          291     int bigfield2:31;     /* 32-bit word 2 begins */
          292     int littlefield1:1;
          293     int littlefield2:1;   /* 32-bit word 3 begins */
          294 };</pre><p>Again, C11 and C++14 may pack foo9 tighter, but it would perhaps be
          295 unwise to count on this.</p><p>On the other hand, <code class="literal">struct foo8</code> would fit into a single 64-bit word
          296 if the machine has those.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_structure_reordering"></a>7. Structure reordering</h2></div></div></div><p>Now that you know how and why compilers insert padding in and after your
          297 structures we’ll examine what you can do to squeeze out the slop.
          298 This is the art of structure packing.</p><p>The first thing to notice is that slop only happens in two places.
          299 One is where storage bound to a larger data type (with stricter
          300 alignment requirements) follows storage bound to a smaller one. The
          301 other is where a struct naturally ends before its stride address,
          302 requiring padding so the next one will be properly aligned.</p><p>The simplest way to eliminate slop is to reorder the structure members
          303 by decreasing alignment.  That is: make all the pointer-aligned
          304 subfields come first, because on a 64-bit machine they will be 8
          305 bytes.  Then the 4-byte ints; then the 2-byte shorts; then the
          306 character fields.</p><p>So, for example, consider this simple linked-list structure:</p><pre class="screen">struct foo10 {
          307     char c;
          308     struct foo10 *p;
          309     short x;
          310 };</pre><p>With the implied slop made explicit, here it is:</p><pre class="screen">struct foo10 {
          311     char c;          /* 1 byte */
          312     char pad1[7];    /* 7 bytes */
          313     struct foo10 *p; /* 8 bytes */
          314     short x;         /* 2 bytes */
          315     char pad2[6];    /* 6 bytes */
          316 };</pre><p>That’s 24 bytes.  If we reorder by size, we get this:</p><pre class="screen">struct foo11 {
          317     struct foo11 *p;
          318     short x;
          319     char c;
          320 };</pre><p>Considering self-alignment, we see that none of the data fields need
          321 padding.  This is because the stride address for a (longer) field with
          322 stricter alignment is always a validly-aligned start address for a
          323 (shorter) field with less strict requirements.  All the repacked struct
          324 actually requires is trailing padding:</p><pre class="screen">struct foo11 {
          325     struct foo11 *p; /* 8 bytes */
          326     short x;         /* 2 bytes */
          327     char c;          /* 1 byte */
          328     char pad[5];     /* 5 bytes */
          329 };</pre><p>Our repack transformation drops the size from 24 to 16 bytes.  This
          330 might not seem like a lot, but suppose you have a linked list of 200K
          331 of these?  The savings add up fast - especially on memory-constrained
          332 embedded systems or in the core part of an OS kernel that has to stay
          333 resident.</p><p>Note that reordering is not guaranteed to produce savings.  Applying this
          334 technique to an earlier example, <code class="literal">struct foo5</code>, we get this:</p><pre class="screen">struct foo12 {
          335     struct foo5 {
          336         char *p;      /* 8 bytes */
          337         short x;      /* 2 bytes */
          338     } inner;
          339     char c;           /* 1 byte*/
          340 };</pre><p>With padding written out, this is</p><pre class="screen">struct foo12 {
          341     struct foo5 {
          342         char *p;      /* 8 bytes */
          343         short x;      /* 2 bytes */
          344         char pad[6];  /* 6 bytes */
          345     } inner;
          346     char c;           /* 1 byte*/
          347     char pad[7];      /* 7 bytes */
          348 };</pre><p>It’s still 24 bytes because <code class="literal">c</code> cannot back into the inner struct’s
          349 trailing padding. To collect that gain you would need to redesign
          350 your data structures.</p><p>Curiously, strictly ordering your structure fields by <span class="emphasis"><em>increasing</em></span>
          351 size also works to mimimize padding.  You can minimize padding with
          352 any order in which (a) all fields of any one size are in a continuous
          353 span (completely eliminating padding between them), and (b) the gaps
          354 between those spans are such that the sizes on either side have as few
          355 doubling steps of difference from each other as possible. Usually
          356 this means no padding at all on one side.</p><p>Even more general minimal-padding orders are possible.  Example:</p><pre class="screen">struct foo13 {
          357     int32_t i;
          358     int32_t i2;
          359     char octet[8];
          360     int32_t i3;
          361     int32_t i4;
          362     int64_t l;
          363     int32_t i5;
          364     int32_t i6;
          365 };</pre><p>This struct has zero padding under self-alignment rules. Working out
          366 why is a useful exercise to develop your understanding.</p><p>Since shipping the first version of this guide I have been asked why,
          367 if reordering for minimal slop is so simple, C compilers don’t do it
          368 automatically.  The answer: C is a language originally designed for
          369 writing operating systems and other code close to the
          370 hardware. Automatic reordering would interfere with a systems
          371 programmer’s ability to lay out structures that exactly match the
          372 byte and bit-level layout of memory-mapped device control blocks.</p><p>Go hews to the C philosophy and does not reorder fields.  Rust makes
          373 the opposite choice; by default, its compiler <span class="emphasis"><em>may</em></span> reorder structure
          374 fields.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_awkward_scalar_cases"></a>8. Awkward scalar cases</h2></div></div></div><p>Using enumerated types instead of #defines is a good idea, if only
          375 because symbolic debuggers have those symbols available and can
          376 show them rather than raw integers.  But, while enums are guaranteed
          377 to be compatible with an integral type, the C standard does
          378 not specify <span class="emphasis"><em>which</em></span> underlying integral type is to be used for them.</p><p>Be aware when repacking your structs that while enumerated-type
          379 variables are usually ints, this is compiler-dependent; they could be
          380 shorts, longs, or even chars by default. Your compiler may have a
          381 pragma or command-line option to force the size.</p><p>The <code class="literal">long double</code> type is a similar trouble spot.  Some C platforms
          382 implement this in 80 bits, some in 128, and some of the 80-bit
          383 platforms pad it to 96 or 128 bits.</p><p>In both cases it’s best to use <code class="literal">sizeof()</code> to check the storage size.</p><p>Finally, under x86 Linux doubles are sometimes an exception to the
          384 self-alignment rule; an 8-byte double may require only 4-byte
          385 alignment within a struct even though standalone doubles variables
          386 have 8-byte self-alignment. This depends on compiler and options.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_readability_and_cache_locality"></a>9. Readability and cache locality</h2></div></div></div><p>While reordering by size is the simplest way to eliminate slop, it’s
          387 not necessarily the right thing.  There are two more issues:
          388 readability and cache locality.</p><p>Programs are not just communications to a computer, they are
          389 communications to other human beings.  Code readability is important
          390 even (or especially!) when the audience of the communication is only
          391 your future self.</p><p>A clumsy, mechanical reordering of your structure can harm
          392 readability.  When possible, it is better to reorder fields so they
          393 remain in coherent groups with semantically related pieces of data
          394 kept close together.  Ideally, the design of your structure should
          395 communicate the design of your program.</p><p>When your program frequently accesses a structure, or parts of a
          396 structure, it is helpful for performance if the accesses tend to fit
          397 within a <span class="emphasis"><em>cache line</em></span> - the memory block fetched by your processor
          398 when it is told to get any single address within the block.  On 64-bit
          399 x86 a cache line is 64 bytes beginning on a self-aligned address; on
          400 other platforms it is often 32 bytes.</p><p>The things you should do to preserve readability - grouping related
          401 and co-accessed data in adjacent fields - also improve cache-line
          402 locality. These are both reasons to reorder intelligently, with
          403 awareness of your code’s data-access patterns.</p><p>If your code does concurrent access to a structure from multiple
          404 threads, there’s a third issue: cache line bouncing. To minimize
          405 expensive bus traffic, you should arrange your data so that reads
          406 come from one cache line and writes go to another in your tighter
          407 loops.</p><p>And yes, this sometimes contradicts the previous guidance about grouping
          408 related data in the same cache-line-sized block. Multithreading is hard.
          409 Cache-line bouncing and other multithread optimization issues are very
          410 advanced topics which deserve an entire tutorial of their own.  The
          411 best I can do here is make you aware that these issues exist.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_other_packing_techniques"></a>10. Other packing techniques</h2></div></div></div><p>Reordering works best when combined with other techniques for slimming
          412 your structures. If you have several boolean flags in a struct, for example,
          413 consider reducing them to 1-bit bitfields and packing them into a
          414 place in the structure that would otherwise be slop.</p><p>You’ll take a small access-time penalty for this - but if it squeezes
          415 the working set enough smaller, that penalty will be swamped by your
          416 gains from avoided cache misses.</p><p>More generally, look for ways to shorten data field sizes.  In
          417 cvs-fast-export, for example, one squeeze I applied was to use the
          418 knowledge that RCS and CVS repositories didn’t exist before 1982.  I
          419 dropped a 64-bit Unix <code class="literal">time_t</code> (zero date at the beginning of 1970)
          420 for a 32-bit time offset from 1982-01-01T00:00:00; this will cover
          421 dates to 2118.  (Note: if you pull a trick like this, <span class="emphasis"><em>do a bounds
          422 check whenever you set the field</em></span> to prevent nasty bugs!)</p><p>Each such field shortening not only decreases the explicit size
          423 of your structure, it may remove slop and/or create additional
          424 opportunities for gains from field reordering.  Virtuous cascades
          425 of such effects are not very hard to trigger.</p><p>The riskiest form of packing is to use unions.  If you know that
          426 certain fields in your structure are never used in combination with
          427 certain other fields, consider using a union to make them share
          428 storage.  But be extra careful and verify your work with regression
          429 testing, because if your lifetime analysis is even slightly wrong you
          430 will get bugs ranging from crashes to (much worse) subtle data
          431 corruption.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_overriding_alignment_rules"></a>11. Overriding alignment rules</h2></div></div></div><p>Sometimes you can coerce your compiler into not using the processor’s
          432 normal alignment rules by using a pragma, usually <code class="literal">#pragma pack</code>.
          433 GCC and clang have a "packed" attribute you can attach to
          434 individual structure declarations; GCC has an -fpack-struct option
          435 for entire compilations.</p><p>Do not do this casually, as it forces the generation of more expensive
          436 and slower code. Usually you can save as much memory, or almost as
          437 much, with the techniques I describe here.</p><p>The only really compelling reason for <code class="literal">#pragma pack</code> is if you have to
          438 exactly match your C data layout to some kind of bit-level hardware or
          439 protocol requirement, like a memory-mapped hardware port, and
          440 violating normal alignment is required for that to work. If you’re in
          441 that situation, and you don’t already know everything else I’m writing
          442 about here, you’re in deep trouble and I wish you luck.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_tools"></a>12. Tools</h2></div></div></div><p>The clang compiler has a -Wpadded option that causes it to generate
          443 messages about alignment holes and padding.  Some versions also have
          444 an undocumented -fdump-record-layouts option that yields
          445 <a class="ulink" href="http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-July/037778.html" target="_top">more
          446 information</a>.</p><p>If you’re using C11, you can deploy static_assert to check your
          447 assumptions about type and structure sizes.  Example:</p><pre class="screen">#include &lt;assert.h&gt;
          448 struct foo4 {
          449     short s;     /* 2 bytes */
          450     char c;      /* 1 byte */
          451 };
          452 static_assert(sizeof(struct foo4) == 4, “Check your assumptions");</pre><p>I have not used it myself, but several respondents speak well of a
          453 program called <code class="literal">pahole</code>. This tool cooperates with a compiler to produce
          454 reports on your structures that describe padding, alignment, and
          455 cache line boundaries.  This was at one time a standalone C program,
          456 but that is now unmaintained; a script with the name pahole now ships
          457 with gdb and that is what you should use.</p><p>I’ve received a report that a proprietary code auditing tool called
          458 "PVS Studio" can detect structure-packing opportunities.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_proof_and_exceptional_cases"></a>13. Proof and exceptional cases</h2></div></div></div><p>You can download sourcecode for a little program that demonstrates the
          459 assertions about scalar and structure sizes made above.  It is
          460 <a class="ulink" href="packtest.c" target="_top">packtest.c</a>.</p><p>If you look through enough strange combinations of compilers, options,
          461 and unusual hardware, you will find exceptions to some of the rules I
          462 have described. They get more common as you go back in time to older
          463 processor designs.</p><p>The next level beyond knowing these rules is knowing how and when to
          464 expect that they will be broken.  In the years when I learned them
          465 (the early 1980s) we spoke of people who didn’t get this as victims of
          466 "all-the-world’s-a-VAX" syndrome. Remember that not all the world is
          467 vanilla.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="C-like"></a>14. Other languages</h2></div></div></div><p>In this section we’ll call a language "C-like" if structure and array
          468 members are self-aligned, are not reordered by the compiler, the address
          469 of a struct is the address of its first member, and structs have
          470 trailing pdding to their stride length.</p><p>If you know the implications of self-aligment in C, you can apply them
          471 directly to calculating sizes and offsets in any language that is
          472 C-like in this sense, and to space-optimizing in the language’s
          473 structures.</p><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_c"></a>14.1. C++</h3></div></div></div><p>C++ is C-like, except that classes that look like structs may ignore
          474 the rule that the address of a struct is the address of its first
          475 member! Whether they do or not depends on how base classes and virtual
          476 member functions are implemented, and varies by compiler.  Otherwise
          477 everything we’ve observed about C applies.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_go"></a>14.2. Go</h3></div></div></div><p>The Go language is in many respects similar to C.  It has structures
          478 and arrays, though not bitfields or unions.  Go compilers have the
          479 same optimization and alignment issues as C compilers. As in C, array
          480 elements are padded up to the following stride address.</p><p>There is no Go equivalent of the C #pack pragma.</p><p>Variables and struct fields will normally be self-aligned for the same
          481 reasons rgis is a rule in C.  However, there is one peculiat
          482 exception; on 32—bit platforms, 64-bit struct fields only require
          483 akignment on a machine word boundary, e.g. 32 bits. There has been
          484 discussion of
          485 <a class="ulink" href="https://go.googlesource.com/proposal/+/master/design/36606-64-bit-field-alignment.md[a" target="_top">https://go.googlesource.com/proposal/+/master/design/36606-64-bit-field-alignment.md[a</a>
          486 proposal to change this, but it stands.</p><p>On the other hand, the Go specification makes no guarantees about how
          487 structure fields are ordered. Unlike C, it would be legal for a Go
          488 compiler to lay out fields in an order different from their
          489 specification in the source.  As of 2022 no Go compiler that
          490 actually does this has beenbn sughted in the wild.</p><p>Go has one odd really odd quirk.  Since Go 1.5, a zero-length field at
          491 the end of a struct (that is, a zero-length array or empty struct) is
          492 sized and aligned as though it is one byte.  The reasons for this are
          493 discussed in an essay
          494 <a class="ulink" href="https://dave.cheney.net/2015/10/09/padding-is-hard" target="_top">Padding is Hard</a> by
          495 one of the Go developers.</p><p>There’s a
          496 <a class="ulink" href="https://itnext.io/structure-size-optimization-in-golang-alignment-padding-more-effective-memory-layout-linters-fffdcba27c61" target="_top">specific
          497 discussion of Go alignment rules</a> that includes pointers tp some tools
          498 that can automatically tweak your structures to optimal alignment.
          499 Another such tool is
          500 <a class="ulink" href="https://github.com/orijtech/structslop" target="_top">strucrslop</a>. I have not used
          501 any of these, try at your own risk.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_rust"></a>14.3. Rust</h3></div></div></div><p>Rust follows C-like packing rules if a structure is annotated with
          502 "repr(C)". Otherwise (by default) all bets are off: padding rules are
          503 (deliberately) unspecified and the compiler may even reorder structure
          504 members. It is probably best to let the Rust compiler do space
          505 optimization rather than forcing it.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_java"></a>14.4. Java</h3></div></div></div><p>Java’s JNI (Java Native Interface) supports C-like packing rules for
          506 structure members so that JNI Java structures can map exactly to C
          507 equivalents. There is a pack pragma.</p><p>Packing in the JVM itself, however, is not well defined. The JVM
          508 spec simply says "The Java Virtual Machine does not mandate any
          509 particular internal structure for objects.", making choices about
          510 this implementation-dependent.</p><p>That said, many JVM implementations are word-oriented in an even
          511 stricter way than Motorola processors - structure and array members
          512 can start at any 32-bit boundary but not at a 16- or 8-bit one. This
          513 will create internal padding after char and 16-bit short members where
          514 it wouldn’t be expected under C-like rules.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_swift"></a>14.5. Swift</h3></div></div></div><p>Swift is
          515 <a class="ulink" href="https://swiftunboxed.com/internals/size-stride-alignment/" target="_top">exactly
          516 C-like</a>. There is no equivalent of a pack pragma.</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a id="_c_2"></a>14.6. C#</h3></div></div></div><p>C# is C-like with the default structure layout attribute
          517 LayoutKind.Sequential.  LayoutKind.Auto allows the compiler to
          518 reorder, and LayoutKind.Explicit allows the programmmer to specify
          519 field sizes explicitly. There’s a Pack modifier which is equivalent to
          520 a C pack pragma.</p></div></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_supporting_this_work"></a>15. Supporting this work</h2></div></div></div><p>If you were educated or entertained by this document, please sign up
          521 for my <a class="ulink" href="https://www.patreon.com/esr" target="_top">Patreon feed</a>. The time needed to
          522 write and maintain documents like this one is not free, and while
          523 I enjoying giving them to the world my bills won’t pay themselves.
          524 Even a few dollars a month - from enough of you - helps a lot.</p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_related_reading"></a>16. Related Reading</h2></div></div></div><p>This section exists to collect pointers to essays which I judge to be
          525 good companions to this one.</p><p><a class="ulink" href="http://blog.regehr.org/archives/213" target="_top">A Guide to Undefined Behavior in C and C++</a></p><p><a class="ulink" href="http://www.catb.org/esr/time-programming/" target="_top">Time, Clock, and Calendar Programming In C</a></p><p><a class="ulink" href="http://www.catb.org/esr/faqs/things-every-hacker-once-knew/" target="_top">Things Every Hacker Once Knew</a></p></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="_version_history"></a>17. Version history</h2></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term">
          526 2.5 @ 2020-04-30
          527 </span></dt><dd>
          528    Revision and expansion of the section on Go packing/alignment rules.
          529 </dd><dt><span class="term">
          530 2.4 @ 2020-04-28
          531 </span></dt><dd>
          532    Added coverage of Java, Swift, and C#.
          533 </dd><dt><span class="term">
          534 2.3 @ 2020-04-20
          535 </span></dt><dd>
          536    Describe exceptional alignment rules on Motorola 680x0.
          537    Sightly improve coverage of Rust alignment rules.
          538 </dd><dt><span class="term">
          539 2.2 @ 2019-12-19
          540 </span></dt><dd>
          541    Minor markup fix.
          542 </dd><dt><span class="term">
          543 2.1 @ 2019-02-14
          544 </span></dt><dd>
          545    Correct a minor error in one of the examples pointed out by
          546    Luis Emilio Moreno Durán.
          547 </dd><dt><span class="term">
          548 2.0 @ 2018-08-06
          549 </span></dt><dd>
          550     Drop "C" out of the title as these techniques are applicable to Go
          551     and Rust as well. More coverage of Go and Rust. The pahole tool
          552     has been replaced by a script in the gdb distribution.
          553 </dd><dt><span class="term">
          554 1.19 @ 2018-02-12
          555 </span></dt><dd>
          556     Describe usefulness of static_assert() in C11. Added Go and Rust coverage.
          557 </dd><dt><span class="term">
          558 1.18 @ 2017-06-01
          559 </span></dt><dd>
          560     More general zero-padding orders.  C11 and C14 relax a constraint
          561     on bitfield packing.
          562 </dd><dt><span class="term">
          563 1.17 @ 2016-11-14
          564 </span></dt><dd>
          565     Typo fixes.
          566 </dd><dt><span class="term">
          567 1.16 @ 2016-10-21
          568 </span></dt><dd>
          569     Answer an objection about allocation order being unrelated to
          570     source order.
          571 </dd><dt><span class="term">
          572 1.15 @ 2016-10-20
          573 </span></dt><dd>
          574     Note the field evidence from NTP.
          575 </dd><dt><span class="term">
          576 1.14 @ 2015-12-19
          577 </span></dt><dd>
          578     Typo correction: -Wpadding → -Wpadded.
          579 </dd><dt><span class="term">
          580 1.13 @ 2015-11-23
          581 </span></dt><dd>
          582     Be explicit about padding bits being undefined.  More about bitfields.
          583 </dd><dt><span class="term">
          584 1.12 @ 2015-11-11
          585 </span></dt><dd>
          586     Major revision of section on bitfields reflecting C99 rules.
          587 </dd><dt><span class="term">
          588 1.11 @ 2015-07-23
          589 </span></dt><dd>
          590     Mention the clang -fdump-record-layouts option.
          591 </dd><dt><span class="term">
          592 1.10 @ 2015-02-20
          593 </span></dt><dd>
          594     Mention <span class="emphasis"><em>attribute</em></span><a id="idm359" class="indexterm"></a><span class="emphasis"><em>packed</em></span>, -fpack-struct, and PVS Studio.
          595 </dd><dt><span class="term">
          596 1.9 @ 2014-10-01
          597 </span></dt><dd>
          598     Added link to "Time, Clock, and Calendar Programming In C".
          599 </dd><dt><span class="term">
          600 1.8 @ 2014-05-20
          601 </span></dt><dd>
          602     Improved explanation for the bitfield examples,
          603 </dd><dt><span class="term">
          604 1.7 @ 2014-05-17
          605 </span></dt><dd>
          606     Correct a minor error in the description of the layout of <code class="literal">struct foo8</code>.
          607 </dd><dt><span class="term">
          608 1.6 @ 2014-05-14
          609 </span></dt><dd>
          610     Emphasize that bitfields cannot cross word boundaries.  Idea from
          611     Dale Gulledge.
          612 </dd><dt><span class="term">
          613 1.5 @ 2014-01-13
          614 </span></dt><dd>
          615     Explain why structure member reordering is not done automatically.
          616 </dd><dt><span class="term">
          617 1.4 @ 2014-01-04
          618 </span></dt><dd>
          619     A note about double under x86 Linux.
          620 </dd><dt><span class="term">
          621 1.3 @ 2014-01-03
          622 </span></dt><dd>
          623     New sections on awkward scalar cases, readability and cache
          624     locality, and tools.
          625 </dd><dt><span class="term">
          626 1.2 @ 2014-01-02
          627 </span></dt><dd>
          628     Correct an erroneous address calculation.
          629 </dd><dt><span class="term">
          630 1.1 @ 2014-01-01
          631 </span></dt><dd>
          632     Explain why aligned accesses are faster.  Mention offsetof.
          633     Various minor fixes, including the packtest.c download link.
          634 </dd><dt><span class="term">
          635 1.0 @ 2014-01-01
          636 </span></dt><dd>
          637     Initial release.
          638