[HN Gopher] The time complexity of the libc++ deque push_front i...
___________________________________________________________________
The time complexity of the libc++ deque push_front implementation
is O(log n)
Author : x1f604
Score : 69 points
Date : 2021-11-28 10:53 UTC (12 hours ago)
(HTM) web link (1f604.blogspot.com)
(TXT) w3m dump (1f604.blogspot.com)
| beached_whale wrote:
| std::deque is odd in that there's no standard size for the blocks
| used. libstdc++ uses about 512 bytes, libc++ 4096 bytes, and MS
| STL around 64 bytes. This leaves a bunch of room for optimizing
| for certain cases: smaller blocks are more insert-friendly,
| larger ones more op[]/read-friendly. But push_front after a
| pop_front should be O(1) on all of them; otherwise it is
| moving/copying data to make room. Another nice/similar data
| structure is Boost's devector, which adds a front-capacity
| pointer to a vector.
| nly wrote:
| Boost's deque lets you decide the block size.
| brandmeyer wrote:
| > MS STL at around 64bytes.
|
| Citation? Are you sure it isn't 64 size-independent elements?
| MauranKilom wrote:
| It's worse actually. The MSVC STL uses 16 byte (or however
| large your element is, if greater) chunks. Source:
|
| https://github.com/microsoft/STL/blob/main/stl/inc/deque#L56...
|
| They have made clear that this won't be changed, for ABI
| stability reasons.
|
| That makes std::deque basically unusable in a portable
| context. In virtually any situation, "allocate each element
| separately" and "allocate elements in 4k chunks" are on
| opposite ends of the performance spectrum.
| logicchains wrote:
| Anyone who programs low-latency software knows std::deque is an
| absolutely horrible-performance deque implementation due to how
| much it allocates and the pointer chasing involved. In most cases
| it's better to use a vector-backed deque (ring buffer) for
| anything performance-sensitive.
| 323 wrote:
| How do you feel about the absl containers? Are they also slow
| for low-latency?
| ncmncm wrote:
| They are OK. But "slow for low latency" is not really
| defined; for many low-latency cases, only static or at-
| startup allocation is permitted, thus not deque at all. If
| you might need to build for production with MSVC, definitely
| do not use the std::deque that comes with it.
| mgaunard wrote:
| That is not true.
|
| Anyone who programs low-latency C++ knows that the libstdc++
| implementation is great (which is what 99.9% of people use)
| while others tend to be less stellar.
|
| It's just a segmented vector. The libstdc++ implementation
| always allocates one segment even if empty, and while I've seen
| low-latency guidelines arguing for empty things not allocating,
| my personal guideline is to always allocate reasonable capacity
| on construction rather than on first insertion.
|
| A ring buffer is a completely different data structure, it only
| works for fixed-capacity queues.
| logicchains wrote:
| >which is what 99.9% of people use
|
| Maybe we have different ideas about what constitutes "low-
| latency", but in HFT std::deque is rarely used. Much like
| std::unordered_map, it allocates on insert, potentially
| costing up to a microsecond each time.
|
| >It's just a segmented vector.
|
| https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-
| USER... we can look at the source. This "segmented vector" is
| `_Tp** _M_map`, essentially a vector of vectors. This means
| potentially twice the pointer-chasing, as we have to do two
| lookups. It also means allocation/deallocation each time a
| new segment is added/destroyed, which leads to more
| allocation than when using a vector (first need to allocate
| the segment, then potentially allocate entries within that).
|
| >A ring buffer is a completely different data structure, it
| only works for fixed-capacity queues.
|
| Where possible it's better to use a fixed capacity queue,
| especially one where the capacity is a power of 2 so wrap-
| around calculation can be done more efficiently. But the same
| kind of thing can be done for a dynamic-capacity queue, by
| resizing the whole backing array when capacity is reached.
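| A minimal sketch of such a fixed-capacity ring buffer (hypothetical
| names, not from any library discussed here): with a power-of-two
| capacity, the wrap-around is a single bitwise AND instead of a
| modulo or a branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal fixed-capacity ring buffer sketch. Capacity must be a power
// of two so that wrap-around is a cheap bitwise AND.
template <typename T>
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity_pow2)
        : buf_(capacity_pow2), mask_(capacity_pow2 - 1) {
        // power-of-two check: n & (n - 1) == 0 for powers of two
        assert(capacity_pow2 != 0 && (capacity_pow2 & mask_) == 0);
    }
    bool push_back(const T& v) {
        if (size_ == buf_.size()) return false;   // full
        buf_[(head_ + size_) & mask_] = v;        // wrap via bitmask
        ++size_;
        return true;
    }
    bool pop_front(T* out) {
        if (size_ == 0) return false;             // empty
        *out = buf_[head_];
        head_ = (head_ + 1) & mask_;              // wrap via bitmask
        --size_;
        return true;
    }
    std::size_t size() const { return size_; }

private:
    std::vector<T> buf_;
    std::size_t mask_;
    std::size_t head_ = 0;
    std::size_t size_ = 0;
};
```

| No allocation ever happens after construction, which is the property
| the low-latency argument above is after.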
| dan-robertson wrote:
| HFT needn't mean low latency is required but of course the
| two tend to be related.
|
| If trivial allocations might cost you a microsecond then
| you can't allocate at all on a latency-sensitive critical
| path and should probably get a better allocator. Also, be
| wary of power-of-two sized things because if they are
| aligned they are likely to have poor cache performance (too
| many things compete for the same set of cache lines).
| mgaunard wrote:
| I've worked in many HFT shops and that's not my experience.
| A lot of those firms have poor C++ code though, but
| std::deque was typically used in the better ones.
|
| It's not one allocation per element, as said previously
| it's a segmented vector. It works similarly to virtual
| memory: fixed-size segments are allocated for contiguous
| storage, while another parent array points to the segments.
| So appending/prepending either needs to append/prepend to
| the last/first segment or to allocate a new segment if
| full, which is what makes those operations more efficient
| than on a vector. Yes there is indirection, that's what
| segmented means. You also get stability of the memory
| location which is important for many data structures. The
| article even tells you how big the segments are in the two
| implementations being looked at.
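| The segmented layout described above can be sketched as a toy
| (hypothetical and greatly simplified; real deques also keep spare
| map slots at both ends so push_front rarely reallocates the map): a
| parent "map" array of pointers to fixed-size segments, so element i
| lives at map[i / B][i % B] and needs two lookups.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy segmented storage: a parent array ("map") of pointers to
// fixed-size segments. Growth allocates a new segment; existing
// elements never move, which gives reference stability.
constexpr std::size_t kSegSize = 8;  // hypothetical block size

struct Segmented {
    std::vector<std::unique_ptr<int[]>> map;  // parent array of segments
    std::size_t size = 0;

    void push_back(int v) {
        if (size % kSegSize == 0)  // current segment full: add a new one
            map.push_back(std::make_unique<int[]>(kSegSize));
        map[size / kSegSize][size % kSegSize] = v;  // two-step lookup
        ++size;
    }
    int& operator[](std::size_t i) { return map[i / kSegSize][i % kSegSize]; }
};
```

| Even when the map array itself reallocates, the segments it points
| to stay put, which is the memory-location stability mentioned above.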
|
| Going back to HFT, a similar data structure that's
| definitely heavily used there is a pool, which has a
| configurable segment size (potentially variable) and
| doesn't have order, enabling faster deallocation with a
| freelist.
| monocasa wrote:
| I thought a resizable ring buffer was a valid
| implementation, but maybe I'm thinking of Rust's VecDeque
| and there's something about the semantics of std::deque
| that doesn't allow that.
| kadoban wrote:
| I don't think insert at front/back is allowed an
| amortized bound, it has to be flat constant. But then I
| don't see how libc++'s is conforming if this article is
| true, so I'm not entirely sure what's going on there.
| mgaunard wrote:
| How could you make any operation O(1) with resizing?
| rav wrote:
| It is possible to deamortize the O(n) resizing step,
| meaning you spend O(1) extra time over the O(n)
| operations leading to the resize, so that when the resize
| needs to happen, you don't need to spend O(n) time. This
| assumes that allocation/deallocation takes O(1) time
| regardless of the buffer size.
|
| Deamortized resizing can work as follows: When inserting
| into a vector that is half full, allocate a new buffer
| twice the size and copy two elements into the buffer. For
| the subsequent insertions, copy two elements every time
| you insert one element. This guarantees that you are done
| copying all elements into the new buffer once the old
| buffer is full, and you only performed a constant number
| of instructions with each operation.
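| The incremental-copy scheme described above can be sketched like
| this (hypothetical class, push_back only, int elements for brevity;
| it assumes allocation itself is O(1)):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Deamortized growth: once the current buffer is at least half full,
// reserve a buffer of twice the capacity and migrate two old elements
// per insertion. Migration finishes exactly when the current buffer
// fills up, so no single push copies more than two extra elements.
class DeamortizedVec {
public:
    DeamortizedVec() { cur_.reserve(2); }

    void push_back(int v) {
        cur_.push_back(v);  // never reallocates: size < capacity here
        if (next_.capacity() == 0 && cur_.size() * 2 >= cur_.capacity())
            next_.reserve(cur_.capacity() * 2);  // assumed O(1)
        if (next_.capacity() != 0) {
            // migrate up to two not-yet-copied elements per insertion
            for (int k = 0; k < 2 && next_.size() < cur_.size(); ++k)
                next_.push_back(cur_[next_.size()]);
            if (cur_.size() == cur_.capacity()) {  // migration complete
                cur_.swap(next_);
                std::vector<int>().swap(next_);    // drop the old buffer
            }
        }
    }

    std::size_t size() const { return cur_.size(); }
    int operator[](std::size_t i) const { return cur_[i]; }

private:
    std::vector<int> cur_;   // holds all elements; reads go here
    std::vector<int> next_;  // partially migrated copy, twice the size
};
```

| With capacity C, migration starts once size exceeds C/2, leaving
| C/2 insertions before the buffer is full; two copies each is exactly
| the C elements that need to move.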
|
| It's possible to deamortize many quite advanced data
| structures in theory - see e.g. the lecture notes on
| "Maintaining Order in a List" on this course webpage:
| https://cs.au.dk/~gerth/aa11/
|
| ... but as with everything in theoretical computer
| science you are not guaranteed that the implementation
| works better in practice than the "naive" approach.
| Deamortization techniques do have a lot of practical
| relevance in hard realtime applications, but I don't know
| how much of "real" hard realtime stuff uses academic
| techniques and how much of it is mostly engineering.
| monocasa wrote:
| Amortized O(1)
| monocasa wrote:
| Ohhhh, I think I found the issue with a ring buffer
| implementation. Would edit my post if I was in the edit
| window.
|
| resize() doesn't invalidate existing references in
| std::deque.
| CoolGuySteve wrote:
| Most STL data structures take an allocator template
| argument. If you fight the compiler enough to make your own
| std::allocator with preallocation then the STL is actually
| pretty nice.
| ComputerGuru wrote:
| > my personal guideline is to always allocate reasonable
| capacity on construction rather than on first insertion.
|
| That works until there is a decent chance the container stays
| empty, in which case the up-front allocation is pure waste.
| IshKebab wrote:
| I think he might be talking about a segmented ring buffer,
| which is similar to a deque but each segment is a ring
| buffer. It gives much better insertion complexity.
|
| Anyway I have to echo his point that I've never found a
| situation where deque performs better than vector, even when
| you really might expect it to. I guess the extra allocations
| are too slow. Maybe it would be more useful with arenas or
| something.
| im3w1l wrote:
| So iiuc:
|
| The libc++ implementation is bad. The libstdc++ implementation is
| fine. The issue with the former is that it doesn't have enough
| spare capacity so it has to shift elements around too often.
|
| Actually I think the push_front is even worse than stated: O(n).
| Consider an array with capacity N+1, contains N elements. Now if
| you alternate push_front and pop_back then every push_front will
| cause memmove of N elements.
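| A sketch of a micro-benchmark for this pattern (hypothetical helper;
| the timings depend entirely on which standard library you link
| against): if push_front shifts existing elements once spare front
| capacity runs out, the loop degrades toward O(n) per operation,
| while implementations that keep spare front blocks stay O(1).

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Keep the deque at a fixed size while alternating push_front and
// pop_back, and time the whole loop. Comparing the per-iteration cost
// at small vs. large n shows whether push_front is shifting elements.
double time_alternating(std::size_t n, std::size_t iters) {
    std::deque<int> d(n, 0);  // n elements already present
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        d.push_front(static_cast<int>(i));
        d.pop_back();
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

| Comparing, say, time_alternating(1000, 100000) against
| time_alternating(1000000, 100000) on libc++ vs. libstdc++ would
| test the claim above empirically.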
|
| Oh and to make like a 4th addition to this comment: It's kind of
| funny that the code is so unreadable that the author found it
| more helpful to look at the generated assembler code.
| rightbyte wrote:
| Reading those C or C++ standard libraries is like a joke.
| Almost nothing is done in a direct or clear way and internal
| methods have cryptic names hidden in macros.
|
| Maybe for a good reason I dunno. But it would be nice if the
| code was clearer so you could make sense of it when gdb or
| callgrind jumps into an inlined chunk ...
| brandmeyer wrote:
| > internal methods have cryptic names
|
| They choose names like _Capitalized and __lowercase because
| those identifiers are reserved for the implementation. It's a
| consequence of the preprocessor's lack of hygiene.
|
| So where you might see a convention of naming members like
| m_thingish_foo, in the implementation's library headers they
| would be named _M_thingish_foo or __m_thingish_foo.
| rightbyte wrote:
| Ye true. Maybe I am just whining because the inlined chunks
| in e.g. callgrind or gdb looks so random. I should use
| "-Og" more often ...
| blahgeek wrote:
| While I very much appreciate and respect the author's detailed
| analysis, I am still not 100% convinced without a corresponding
| benchmark result. If it's really O(log n) vs O(1), it should be
| very easy to verify in some micro-benchmark.
|
| Instead of that, the author mentioned that:
|
| > The tests that I did run, initially the libstdc++ and the
| libc++ versions have roughly the same performance, but the
| performance diverges after a number of insertions somewhere
| between 6,400 and 12,800. After that the runtime of the libc++
| version is roughly 2-3 times that of the libstdc++ version. (...)
| the outcomes of which depend on the compiler version anyway.
|
| This does not seem right.
| x1f604 wrote:
| Here are the results of the benchmark that I did:
| https://imgur.com/a/2lCktTO
|
| It didn't look quite right to me so I didn't post it.
|
| Here's another benchmark that I did where you can clearly see
| that something has gone wrong: https://imgur.com/a/s2rA8qE
| krona wrote:
| CppReference defines the time complexity as constant:
| https://en.cppreference.com/w/cpp/container/deque/push_front
|
| So does this mean: - We're talking about different complexities
| (Amortized vs something else) - The libc++ implementation isn't
| standards conformant - The analysis in the c++ standard is
| incorrect. - Something else
| mkl wrote:
| To make bullet points on HN, put a blank line between the
| paragraphs.
| im3w1l wrote:
| It's not conformant.
| Kranar wrote:
| Neither is correct. Generally when complexity is specified in
| the standard, it's with respect to a specific operation. In
| this case the standard specifies that it's the number of calls
| to the constructor of T that must be constant, which means that
| push_back and push_front can not copy or move its elements.
|
| There is never a way to guarantee that the physical amount of
| time is bounded by O(1) in the worst case. You could always
| have pathological scenarios or data structures written so that
| performing a move or copy of a single T could require O(n^n)
| time for all anyone knows.
| gpderetta wrote:
| Memory allocation in particular is very hard to analyze.
| ncmncm wrote:
| The most important thing to know about std::deque is that the
| implementation in MSVC is really just _awfully bad_. So, if your
| code might need to be built there, you should consider using an
| implementation from somewhere other than std::.
| [deleted]
| colanderman wrote:
| The best thing I learned in algorithms class is:
|
| For all realizable n, ln(n) is less than 40.
|
| So when amortizing over a large window (say, a block size of
| 512), it can be the case that, _for any realizable n_ , the
| logarithmic amortized costs are completely dominated by the
| constant non-amortized costs (adding an element to a block in
| this case)... making the operation effectively constant-time.
| tomerv wrote:
| Practically speaking, it is not good advice to just treat
| log(n) as if it were constant. While it's true that log(n) is
| always small in practice, the conclusion should not be to
| ignore it, but rather to notice the constant factor. And in
| practice, _usually_ data structures with O(log n) complexity
| also have a bigger hidden constant. For example,
| std::unordered_map is much faster in practice than std::map. Of
| course, this is not strictly correct, it's just a heuristic.
| Quicksort with its O(log n) [Edit: O(n log n)] complexity is a
| counter-example to this.
| colanderman wrote:
| To be clear, that is not the advice I'm giving -- but rather,
| when your performance looks like _p*log n + q_ , if _q_ is
| much greater than _p /40_ -- that is, the constant term
| dwarfs the logarithmic term -- then it is safe to consider it
| constant.
| xdavidliu wrote:
| > p*log n + q, if q is much greater than p/40 -- that is,
| the constant term dwarfs the logarithmic term
|
| I think you meant to say "if q is much greater than p TIMES
| 40".
| colanderman wrote:
| Ah good catch, yes you are correct.
| kwertyoowiyop wrote:
| And of course, nothing is important until you've profiled
| your code and measured that it is.
| kadoban wrote:
| That seems like a pretty good argument _for_ treating it as
| constant though and just shifting your focus to how large the
| constants actually are.
| gpderetta wrote:
| Random fact of the day: The currently best known upper bound
| for the complexity of the Union-Find algorithm is the reverse
| Ackermann function, which can be treated as a constant (4) for
| all remotely practical (and most of the impractical) values of
| n.
| tylerhou wrote:
| Tarjan's proof of inverse* Ackermann bound is hard to
| understand; it's much easier to understand the proof of the
| log*(n) bound, which is also an extremely slow growing
| function: https://people.eecs.berkeley.edu/~vazirani/algorith
| ms/chap5....
|
| where log* is the iterated log function; i.e. the number of
| times you have to apply log repeatedly until you reach a
| value of 1 or smaller. For reference, log_2*(10^10000) = 5.
|
| https://en.wikipedia.org/wiki/Iterated_logarithm
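| The iterated log is easy to sketch (hypothetical helper name):

```cpp
#include <cassert>
#include <cmath>

// log* (iterated logarithm): the number of times log2 must be applied
// before the value drops to 1 or below. It grows absurdly slowly:
// log2*(65536) = 4, and log2*(2^65536) would still only be 5.
int log_star(double x) {
    int count = 0;
    while (x > 1.0) {
        x = std::log2(x);
        ++count;
    }
    return count;
}
```

| Since 10^10000 overflows a double, one log has to be taken by hand:
| log_star(10000 * log2(10)) + 1 gives the value 5 quoted above.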
| Labo333 wrote:
| Interesting!
|
| It reminds me of the hash maps that are said to be amortized O(1)
| but can be O(n) for some sequences of operations in various
| languages like Python and JS: https://blog.toit.io/hash-maps-
| that-dont-hate-you-1a96150b49...
| williamkuszmaul wrote:
| Very nice post!
|
| Small comment: Ideally, big-O notation is for upper bounds. If
| you are doing lower bounds, you should ideally use big-Omega
| notation. But Omegas are harder to format in plain text, so it
| may be better to abuse notation and use big-O...
___________________________________________________________________
(page generated 2021-11-28 23:01 UTC)