[HN Gopher] The time complexity of the libc++ deque push_front i...
       ___________________________________________________________________
        
       The time complexity of the libc++ deque push_front implementation
       is O(log n)
        
       Author : x1f604
       Score  : 69 points
       Date   : 2021-11-28 10:53 UTC (12 hours ago)
        
 (HTM) web link (1f604.blogspot.com)
 (TXT) w3m dump (1f604.blogspot.com)
        
       | beached_whale wrote:
        | std::deque is odd in that there's no standard-mandated size
        | for the blocks it uses. libstdc++ uses about 512 bytes, libc++
        | 4096 bytes, and the MS STL around 64 bytes. This leaves a
        | bunch of room for optimizing for certain cases: smaller is
        | more insert friendly, larger more operator[]/read friendly.
        | But push_front after a pop_front should be O(1) on all of
        | them; otherwise the implementation is moving/copying data to
        | make room. Another nice/similar data structure is Boost's
        | devector, which adds front capacity to a vector via an extra
        | pointer.
        
         | nly wrote:
          | Boost's deque lets you decide the block size.
        
         | brandmeyer wrote:
         | > MS STL at around 64bytes.
         | 
         | Citation? Are you sure it isn't 64 size-independent elements?
        
           | MauranKilom wrote:
            | It's worse, actually. The MSVC STL uses 16-byte chunks
            | (or the size of your element, if greater). Source:
           | 
           | https://github.com/microsoft/STL/blob/main/stl/inc/deque#L56.
           | ..
           | 
           | They have made clear that this won't be changed, for ABI
           | stability reasons.
           | 
            | That makes std::deque basically unusable in a portable
           | context. In virtually any situation, "allocate each element
           | separately" and "allocate elements in 4k chunks" are on
           | opposite ends of the performance spectrum.
        
       | logicchains wrote:
       | Anyone who programs low-latency software knows std::deque is an
        | absolutely horrible-performing deque implementation due to
        | how much it allocates and the pointer chasing involved. In
        | most cases it's better to use a vector-backed deque (ring
        | buffer) for anything performance-sensitive.
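        | 
        | For illustration, a minimal sketch of such a ring buffer
        | (power-of-two capacity so wrap-around is a single mask; names
        | made up, growth and bounds checks omitted):
        | 
        |     #include <cstddef>
        |     #include <vector>
        | 
        |     // Fixed-capacity ring-buffer deque sketch. Capacity must
        |     // be a power of two so index wrap-around is an AND with
        |     // (capacity - 1) instead of a modulo.
        |     template <typename T>
        |     class RingDeque {
        |         std::vector<T> buf_;
        |         std::size_t mask_;      // capacity - 1
        |         std::size_t head_ = 0;  // index of the first element
        |         std::size_t size_ = 0;
        |     public:
        |         explicit RingDeque(std::size_t pow2_cap)
        |             : buf_(pow2_cap), mask_(pow2_cap - 1) {}
        |         void push_front(const T& v) {
        |             head_ = (head_ - 1) & mask_;  // wraps on underflow
        |             buf_[head_] = v;
        |             ++size_;
        |         }
        |         void push_back(const T& v) {
        |             buf_[(head_ + size_) & mask_] = v;
        |             ++size_;
        |         }
        |         void pop_front() { head_ = (head_ + 1) & mask_; --size_; }
        |         void pop_back() { --size_; }
        |         T& operator[](std::size_t i) {
        |             return buf_[(head_ + i) & mask_];
        |         }
        |     };
        | 
        | One flat allocation, no pointer chasing, and every operation
        | is a handful of instructions.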
        
         | 323 wrote:
         | How do you feel about the absl containers? Are they also slow
         | for low-latency?
        
           | ncmncm wrote:
           | They are OK. But "slow for low latency" is not really
           | defined; for many low-latency cases, only static or at-
           | startup allocation is permitted, thus not deque at all. If
           | you might need to build for production with MSVC, definitely
           | do not use the std::deque that comes with it.
        
         | mgaunard wrote:
         | That is not true.
         | 
         | Anyone who programs low-latency C++ knows that the libstdc++
         | implementation is great (which is what 99.9% of people use)
         | while others tend to be less stellar.
         | 
         | It's just a segmented vector. The libstdc++ implementation
         | always allocates one segment even if empty, and while I've seen
         | low-latency guidelines arguing for empty things not allocating,
         | my personal guideline is to always allocate reasonable capacity
         | on construction rather than on first insertion.
         | 
          | A ring buffer is a completely different data structure; it
          | only works for fixed-capacity queues.
        
           | logicchains wrote:
           | >which is what 99.9% of people use
           | 
           | Maybe we have different ideas about what constitutes "low-
           | latency" but in HFT std::deque is rarely used. Much like
           | std::unordered_map, which allocates every insert potentially
           | costing up to a microsecond for each insert.
           | 
           | >It's just a segmented vector.
           | 
           | https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-
           | USER... we can look at the source. This "segmented vector" is
           | `_Tp** _M_map`, essentially a vector of vectors. This means
           | potentially twice the pointer-chasing, as we have to do two
            | lookups. It also means an allocation/deallocation each
            | time a segment is added/removed, which leads to more
            | allocation than when using a vector (first the segment
            | itself, then potentially the elements within it).
           | 
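            | Concretely, element access ends up as two dependent
            | loads, something like this (simplified sketch, not the
            | actual libstdc++ code):
            | 
            |     #include <cstddef>
            | 
            |     // Deque-style two-level indexing: 'map' holds
            |     // pointers to fixed-size blocks, so reaching element
            |     // i costs two loads: map[block], then the element.
            |     template <typename T, std::size_t BlockSize>
            |     struct SegmentedIndex {
            |         T** map;            // pointers to blocks
            |         std::size_t start;  // offset of element 0
            | 
            |         T& at(std::size_t i) {
            |             std::size_t pos = start + i;
            |             return map[pos / BlockSize][pos % BlockSize];
            |         }
            |     };
            | 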
           | >A ring buffer is a completely different data structure, it
           | only works for fixed-capacity queues.
           | 
           | Where possible it's better to use a fixed capacity queue,
           | especially one where the capacity is a power of 2 so wrap-
           | around calculation can be done more efficiently. But the same
           | kind of thing can be done for a dynamic-capacity queue, by
           | resizing the whole backing array when capacity is reached.
        
             | dan-robertson wrote:
             | HFT needn't mean low latency is required but of course the
             | two tend to be related.
             | 
             | If trivial allocations might cost you a microsecond then
             | you can't allocate at all on a latency-sensitive critical
             | path and should probably get a better allocator. Also, be
             | wary of power-of-two sized things because if they are
             | aligned they are likely to have poor cache performance (too
             | many things compete for the same set of cache lines).
        
             | mgaunard wrote:
             | I've worked in many HFT shops and that's not my experience.
              | A lot of those firms have poor C++ code, though;
              | std::deque was typically used in the better ones.
              | 
              | It's not one allocation per element; as said previously,
              | it's a segmented vector. It works similarly to virtual
             | memory: fixed-size segments are allocated for contiguous
             | storage, while another parent array points to the segments.
             | So appending/prepending either needs to append/prepend to
             | the last/first segment or to allocate a new segment if
             | full, which is what makes those operations more efficient
             | than on a vector. Yes there is indirection, that's what
             | segmented means. You also get stability of the memory
             | location which is important for many data structures. The
             | article even tells you how big the segments are in the two
             | implementations being looked at.
             | 
             | Going back to HFT, a similar data structure that's
             | definitely heavily used there is a pool, which has a
             | configurable segment size (potentially variable) and
             | doesn't have order, enabling faster deallocation with a
             | freelist.
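              | 
              | A minimal sketch of that kind of pool (fixed-size slots
              | threaded through an intrusive freelist; names made up,
              | alignment handling omitted):
              | 
              |     #include <cstddef>
              |     #include <new>
              |     #include <vector>
              | 
              |     // Freed slots are threaded into a singly linked
              |     // list, so allocate and deallocate are both O(1)
              |     // pointer swaps.
              |     template <std::size_t SlotSize>
              |     class Pool {
              |         static_assert(SlotSize >= sizeof(void*),
              |                       "slot must hold a pointer");
              |         std::vector<char> storage_;
              |         void* free_ = nullptr;  // freelist head
              |     public:
              |         explicit Pool(std::size_t slots)
              |             : storage_(slots * SlotSize) {
              |             // Thread every slot onto the freelist.
              |             for (std::size_t i = 0; i < slots; ++i)
              |                 deallocate(storage_.data() + i * SlotSize);
              |         }
              |         void* allocate() {
              |             if (!free_) throw std::bad_alloc();
              |             void* p = free_;
              |             free_ = *static_cast<void**>(p);  // pop
              |             return p;
              |         }
              |         void deallocate(void* p) {
              |             *static_cast<void**>(p) = free_;  // push
              |             free_ = p;
              |         }
              |     };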
        
               | monocasa wrote:
               | I thought a resizable ring buffer was a valid
               | implementation, but maybe I'm thinking of Rust's VecDeque
               | and there's something about the semantics of std::deque
               | that doesn't allow that.
        
               | kadoban wrote:
               | I don't think insert at front/back is allowed an
                | amortized bound; it has to be flat constant. But then I
               | don't see how libc++'s is conforming if this article is
               | true, so I'm not entirely sure what's going on there.
        
               | mgaunard wrote:
               | How could you make any operation O(1) with resizing?
        
               | rav wrote:
               | It is possible to deamortize the O(n) resizing step,
               | meaning you spend O(1) extra time over the O(n)
               | operations leading to the resize, so that when the resize
               | needs to happen, you don't need to spend O(n) time. This
               | assumes that allocation/deallocation takes O(1) time
               | regardless of the buffer size.
               | 
               | Deamortized resizing can work as follows: When inserting
               | into a vector that is half full, allocate a new buffer
               | twice the size and copy two elements into the buffer. For
               | the subsequent insertions, copy two elements every time
               | you insert one element. This guarantees that you are done
               | copying all elements into the new buffer once the old
               | buffer is full, and you only performed a constant number
               | of instructions with each operation.
               | 
               | It's possible to deamortize many quite advanced data
               | structures in theory - see e.g. the lecture notes on
               | "Maintaining Order in a List" on this course webpage:
               | https://cs.au.dk/~gerth/aa11/
               | 
               | ... but as with everything in theoretical computer
               | science you are not guaranteed that the implementation
               | works better in practice than the "naive" approach.
               | Deamortization techniques do have a lot of practical
               | relevance in hard realtime applications, but I don't know
               | how much of "real" hard realtime stuff uses academic
               | techniques and how much of it is mostly engineering.
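                | 
                | A sketch of that scheme for a push_back-only vector
                | (names made up; assumes default-constructible T):
                | 
                |     #include <cstddef>
                |     #include <memory>
                | 
                |     // Deamortized doubling: once half full, allocate
                |     // a buffer of twice the size and move two old
                |     // elements per insert, so migration finishes
                |     // before the old buffer would overflow and no
                |     // single push_back ever does O(n) work.
                |     template <typename T>
                |     class SteadyVector {
                |         std::unique_ptr<T[]> old_, cur_;
                |         std::size_t cap_ = 8, size_ = 0;
                |         std::size_t moved_ = 0, old_size_ = 0;
                |     public:
                |         SteadyVector() : cur_(new T[cap_]) {}
                |         void push_back(const T& v) {
                |             if (!old_ && size_ >= cap_ / 2) {
                |                 old_ = std::move(cur_);  // begin move
                |                 old_size_ = size_;
                |                 moved_ = 0;
                |                 cap_ *= 2;
                |                 cur_.reset(new T[cap_]);
                |             }
                |             cur_[size_++] = v;  // final position
                |             for (int k = 0; old_ && k < 2; ++k) {
                |                 cur_[moved_] = old_[moved_];
                |                 if (++moved_ == old_size_)
                |                     old_.reset();  // migration done
                |             }
                |         }
                |         T& operator[](std::size_t i) {
                |             // Unmoved elements still live in old_.
                |             return (old_ && i >= moved_ && i < old_size_)
                |                        ? old_[i] : cur_[i];
                |         }
                |     };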
        
               | monocasa wrote:
               | Amortized O(1)
        
               | monocasa wrote:
               | Ohhhh, I think I found the issue with a ring buffer
               | implementation. Would edit my post if I was in the edit
               | window.
               | 
               | resize() doesn't invalidate existing references in
               | std::deque.
        
             | CoolGuySteve wrote:
             | Most STL data structures take an allocator template
             | argument. If you fight the compiler enough to make your own
             | std::allocator with preallocation then the STL is actually
             | pretty nice.
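              | 
              | For example, a minimal C++17-style allocator handing
              | out memory from a preallocated arena (names made up;
              | deallocate is a no-op, alignment handling omitted, and
              | this sketch just throws when the arena runs out):
              | 
              |     #include <cstddef>
              |     #include <new>
              | 
              |     template <typename T>
              |     struct ArenaAllocator {
              |         using value_type = T;
              |         char* base;         // preallocated buffer
              |         std::size_t cap;    // buffer size in bytes
              |         std::size_t* used;  // shared bump offset
              | 
              |         ArenaAllocator(char* b, std::size_t c,
              |                        std::size_t* u) noexcept
              |             : base(b), cap(c), used(u) {}
              |         template <typename U>
              |         ArenaAllocator(const ArenaAllocator<U>& o)
              |             noexcept
              |             : base(o.base), cap(o.cap), used(o.used) {}
              | 
              |         T* allocate(std::size_t n) {
              |             std::size_t bytes = n * sizeof(T);
              |             if (*used + bytes > cap)
              |                 throw std::bad_alloc();
              |             T* p = reinterpret_cast<T*>(base + *used);
              |             *used += bytes;  // bump the offset
              |             return p;
              |         }
              |         void deallocate(T*, std::size_t) noexcept {}
              |     };
              |     template <typename T, typename U>
              |     bool operator==(const ArenaAllocator<T>& a,
              |                     const ArenaAllocator<U>& b)
              |     { return a.base == b.base; }
              |     template <typename T, typename U>
              |     bool operator!=(const ArenaAllocator<T>& a,
              |                     const ArenaAllocator<U>& b)
              |     { return !(a == b); }
              | 
              | A std::deque<int, ArenaAllocator<int>> then never
              | touches the global heap.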
        
           | ComputerGuru wrote:
           | > my personal guideline is to always allocate reasonable
           | capacity on construction rather than on first insertion.
           | 
            | That works until, probabilistically, there is a decent
            | chance the capacity needed will always be zero.
        
           | IshKebab wrote:
           | I think he might be talking about a segmented ring buffer,
           | which is similar to a deque but each segment is a ring
           | buffer. It gives much better insertion complexity.
           | 
           | Anyway I have to echo his point that I've never found a
           | situation where deque performs better than vector, even when
           | you really might expect it to. I guess the extra allocations
           | are too slow. Maybe it would be more useful with arenas or
           | something.
        
       | im3w1l wrote:
       | So iiuc:
       | 
       | The libc++ implementation is bad. The libstdc++ implementation is
       | fine. The issue with the former is that it doesn't have enough
       | spare capacity so it has to shift elements around too often.
       | 
        | Actually I think the push_front is even worse than stated:
        | O(n). Consider an array with capacity N+1 that contains N
        | elements. Now if you alternate push_front and pop_back, every
        | push_front will cause a memmove of N elements.
       | 
       | Oh and to make like a 4th addition to this comment: It's kind of
       | funny that the code is so unreadable that the author found it
       | more helpful to look at the generated assembler code.
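        | 
        | If that analysis is right, the pathological pattern is easy
        | to write down (whether it actually hits the shifting path
        | will depend on the libc++ version):
        | 
        |     #include <chrono>
        |     #include <cstdio>
        |     #include <deque>
        | 
        |     // Alternate push_front/pop_back on a deque kept near
        |     // its current capacity; under the analysis above,
        |     // libc++ may shift all N elements on each push_front.
        |     int main() {
        |         constexpr int N = 1000000;
        |         std::deque<int> d(N, 0);
        |         auto t0 = std::chrono::steady_clock::now();
        |         for (int i = 0; i < 100000; ++i) {
        |             d.push_front(i);
        |             d.pop_back();
        |         }
        |         auto t1 = std::chrono::steady_clock::now();
        |         std::printf("%lld us\n", (long long)
        |             std::chrono::duration_cast<
        |                 std::chrono::microseconds>(t1 - t0).count());
        |     }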
        
         | rightbyte wrote:
         | Reading those C or C++ standard libraries is like a joke.
         | Almost nothing is done in a direct or clear way and internal
         | methods have cryptic names hidden in macros.
         | 
          | Maybe for a good reason, I dunno. But it would be nice if
          | the code was clearer so you could make sense of it when gdb
          | or callgrind jumps into an inlined chunk ...
        
           | brandmeyer wrote:
           | > internal methods have cryptic names
           | 
           | They choose names like _Capitalized and __lowercase because
            | those identifiers are reserved for the implementation.
            | It's a consequence of the preprocessor's lack of hygiene.
           | 
           | So where you might see a convention of naming members like
           | m_thingish_foo, in the implementation's library headers they
           | would be named _M_thingish_foo or __m_thingish_foo.
        
             | rightbyte wrote:
              | Ye true. Maybe I am just whining because the inlined
              | chunks in e.g. callgrind or gdb look so random. I should
              | use "-Og" more often ...
        
       | blahgeek wrote:
        | While I very much appreciate and respect the author's
        | detailed analysis, I am still not 100% convinced without a
        | corresponding benchmark result. If it's really O(log n) vs
        | O(1), it should be very easy to verify in some micro-
        | benchmark.
       | 
       | Instead of that, the author mentioned that:
       | 
       | > The tests that I did run, initially the libstdc++ and the
       | libc++ versions have roughly the same performance, but the
       | performance diverges after a number of insertions somewhere
       | between 6,400 and 12,800. After that the runtime of the libc++
       | version is roughly 2-3 times that of the libstdc++ version. (...)
       | the outcomes of which depend on the compiler version anyway.
       | 
       | This does not seem right.
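        | 
        | Something like this rough sketch (built once against
        | libstdc++ and once with -stdlib=libc++) would show whether
        | the per-push_front cost grows with n:
        | 
        |     #include <chrono>
        |     #include <cstdio>
        |     #include <deque>
        | 
        |     // Time a batch of push_fronts at increasing sizes: O(1)
        |     // per push gives a flat line, O(log n) grows with n.
        |     int main() {
        |         std::deque<int> d;
        |         for (long n = 1000; n <= 10000000; n *= 10) {
        |             while ((long)d.size() < n) d.push_front(0);
        |             auto t0 = std::chrono::steady_clock::now();
        |             for (int i = 0; i < 100000; ++i) d.push_front(i);
        |             auto t1 = std::chrono::steady_clock::now();
        |             double ns = std::chrono::duration_cast<
        |                 std::chrono::nanoseconds>(t1 - t0).count();
        |             std::printf("n=%-9ld %.1f ns/push\n",
        |                         n, ns / 100000);
        |         }
        |     }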
        
         | x1f604 wrote:
         | Here are the results of the benchmark that I did:
         | https://imgur.com/a/2lCktTO
         | 
         | It didn't look quite right to me so I didn't post it.
         | 
         | Here's another benchmark that I did where you can clearly see
         | that something has gone wrong: https://imgur.com/a/s2rA8qE
        
       | krona wrote:
        | CppReference says the time complexity is constant:
       | https://en.cppreference.com/w/cpp/container/deque/push_front
       | 
       | So does this mean: - We're talking about different complexities
       | (Amortized vs something else) - The libc++ implementation isn't
       | standards conformant - The analysis in the c++ standard is
       | incorrect. - Something else
        
         | mkl wrote:
         | To make bullet points on HN, put a blank line between the
         | paragraphs.
        
         | im3w1l wrote:
         | It's not conformant.
        
         | Kranar wrote:
         | Neither is correct. Generally when complexity is specified in
         | the standard, it's with respect to a specific operation. In
         | this case the standard specifies that it's the number of calls
          | to the constructor of T that must be constant, which means
          | push_back and push_front cannot copy or move the existing
          | elements.
          | 
          | There is no way to guarantee that the actual wall-clock
          | time is bounded by O(1) in the worst case. You could always
          | have pathological scenarios, or a data structure written so
          | that performing a move or copy of a single T requires
          | O(n^n) time, for all anyone knows.
        
           | gpderetta wrote:
           | Memory allocation in particular is very hard to analyze.
        
       | ncmncm wrote:
       | The most important thing to know about std::deque is that the
       | implementation in MSVC is really just _awfully bad_. So, if your
        | code might need to be built there, you should consider using
        | an implementation from somewhere other than std::.
        
       | [deleted]
        
       | colanderman wrote:
       | The best thing I learned in algorithms class is:
       | 
        | For all realizable n, ln(n) is less than 40 (e^40 is about
        | 2.4 * 10^17, more than any realistic container size).
       | 
       | So when amortizing over a large window (say, a block size of
       | 512), it can be the case that, _for any realizable n_ , the
       | logarithmic amortized costs are completely dominated by the
       | constant non-amortized costs (adding an element to a block in
       | this case)... making the operation effectively constant-time.
        
         | tomerv wrote:
          | Practically speaking, it is not good advice to just treat
          | log(n) as if it were constant. While it's true that log(n)
          | is always small in practice, the conclusion should not be
          | to ignore it, but rather to notice the constant factor. And
          | in practice, _usually_ data structures with O(log n)
          | complexity also have a bigger hidden constant. For example,
          | std::unordered_map is much faster in practice than
          | std::map. Of course, this is not strictly correct; it's
          | just a heuristic. Quicksort with its O(log n) [Edit: O(n
          | log n)] complexity is a counter-example to this.
        
           | colanderman wrote:
            | To be clear, that is not the advice I'm giving -- but
            | rather, when your performance looks like _p*log n + q_,
            | if _q_ is much greater than _p/40_ -- that is, the
            | constant term dwarfs the logarithmic term -- then it is
            | safe to consider it constant.
        
             | xdavidliu wrote:
             | > p*log n + q, if q is much greater than p/40 -- that is,
             | the constant term dwarfs the logarithmic term
             | 
             | I think you meant to say "if q is much greater than p TIMES
             | 40".
        
               | colanderman wrote:
               | Ah good catch, yes you are correct.
        
           | kwertyoowiyop wrote:
           | And of course, nothing is important until you've profiled
           | your code and measured that it is.
        
           | kadoban wrote:
            | That seems like a pretty good argument _for_ treating it
            | as constant, though, and just shifting your focus to how
            | large the constants actually are.
        
         | gpderetta wrote:
         | Random fact of the day: The currently best known upper bound
         | for the complexity of the Union-Find algorithm is the reverse
         | Ackermann function, which can be treated as a constant (4) for
         | all remotely practical (and most of the impractical) values of
         | n.
        
           | tylerhou wrote:
            | Tarjan's proof of the inverse* Ackermann bound is hard to
           | understand; it's much easier to understand the proof of the
           | log*(n) bound, which is also an extremely slow growing
           | function: https://people.eecs.berkeley.edu/~vazirani/algorith
           | ms/chap5....
           | 
           | where log* is the iterated log function; i.e. the number of
           | times you have to apply log repeatedly until you reach a
           | value of 1 or smaller. For reference, log_2*(10^10000) = 5.
           | 
           | https://en.wikipedia.org/wiki/Iterated_logarithm
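            | 
            | For the curious, log* is tiny to compute (sketch):
            | 
            |     #include <cmath>
            |     #include <cstdio>
            | 
            |     // Iterated logarithm: count how many times log2 must
            |     // be applied before the value drops to 1 or below.
            |     int log_star(double x) {
            |         int k = 0;
            |         while (x > 1.0) { x = std::log2(x); ++k; }
            |         return k;
            |     }
            | 
            |     int main() {
            |         // 10^10000 overflows a double, so start from
            |         // log2(10^10000) = 10000 * log2(10) and add 1.
            |         std::printf("%d\n",
            |             1 + log_star(10000 * std::log2(10.0)));  // 5
            |     }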
        
       | Labo333 wrote:
       | Interesting!
       | 
       | It reminds me of the hash maps that are said to be amortized O(1)
       | but can be O(n) for some sequences of operations in various
       | languages like Python and JS: https://blog.toit.io/hash-maps-
       | that-dont-hate-you-1a96150b49...
        
       | williamkuszmaul wrote:
       | Very nice post!
       | 
       | Small comment: Ideally, big-O notation is for upper bounds. If
       | you are doing lower bounds, you should ideally use big-Omega
       | notation. But Omegas are harder to format in plain text, so it
       | may be better to abuse notation and use big-O...
        
       ___________________________________________________________________
       (page generated 2021-11-28 23:01 UTC)