hngopher.com

       [HN Gopher] An efficient way to make hash maps with insertion or...
       ___________________________________________________________________
        
       An efficient way to make hash maps with insertion ordering
        
       Author : eatonphil
       Score  : 83 points
       Date   : 2021-07-01 11:21 UTC (11 hours ago)
        
 (HTM) web link (blog.toit.io)
 (TXT) w3m dump (blog.toit.io)
        
       | usefulcat wrote:
       | First off, WTF is up with disabling copying? That is seriously
       | annoying. Now on to what I came here to say..
       | 
       | "When there are enough of these tombstones, the hash map gets
       | rebuilt, but that's OK. Due to the magic of amortized constant
       | time, this doesn't slow you down on average."
       | 
       | If amortized time is all you care about, that's fine. But there
       | are some applications where worst case time is more important,
       | and unfortunately I've seen some hash map implementations that
       | implicitly assume that the worst case never matters.
        
         | prutschman wrote:
         | In some browsers you can get around this by selecting the text,
         | keeping the mouse button down, hitting the keyboard shortcut
         | for 'copy', then letting go. The selection will still be
         | emptied, but the text is already in your clipboard.
        
       | peterkelly wrote:
       | Rust has HashMap and BTreeMap. You use the latter when you care
       | about ordering. There's also a couple of options for dequeues. A
       | good overview here explaining which one you'd choose depending on
       | what oyu want to do: https://doc.rust-
       | lang.org/std/collections/index.html
       | 
       | Use the right data structure for the job. Maps are an interface,
       | not an implementation.
        
         | edflsafoiewq wrote:
         | The indexmap crate provides a hashmap that iterates in
         | insertion order.
        
         | floitsch wrote:
         | Rust has hash maps that hate you.
         | 
         | The default hash map visits the keys in arbitrary order
         | (https://doc.rust-
         | lang.org/std/collections/struct.HashMap.htm...).
         | 
         | You could use a BTreeMap, but that's not insertion order and
         | requires the user to come up with some sort of ordering.
        
           | IshKebab wrote:
           | They don't hate you. That's how it's supposed to work.
           | Otherwise you get issues like the one discovered in the
           | article.
           | 
           | Would you say Rust has Vecs that hate you because prepend
           | isn't O(1)? Or does JavaScript have Maps that hate you
           | because they aren't sorted?
           | 
           | There's no perfect container. It's all trade-offs.
        
             | ErikCorry wrote:
             | The article also explains how the issue discovered can be
             | fixed.
             | 
             | And JS maps are insertion ordered.
        
       | ecshafer wrote:
       | This is a weird article. It creates the idea that hash maps are
       | uncommon? surprising? I am not really sure. Then makes the claim
       | that you would expect them to be ordered by insertion order, and
       | not just randomly which is the common usage of hash maps. Ordered
       | Hash Maps exist, and isn't a insertion order hash map, basically
       | just a list? But Hash Maps are so common and used absolutely
       | everywhere its almost like its aiming at people who have used
       | them but never any other data structures.
        
         | EamonnMR wrote:
         | When I was first taught Java in college I didn't find out about
         | hash maps for a while because it was considered too advanced
         | for beginners (ohh scary templates!)
        
           | myphs wrote:
           | This is quite interesting. I wonder, why this was decided to
           | be too advanced. In my uni we teach it from the very
           | beginning.
        
         | pydry wrote:
         | Python switching all dicts to being ordered is my favorite new
         | feature from the last five years.
         | 
         | It's useful all over the damn place.
        
         | rob74 wrote:
         | Hash maps are a fairly recent addition to the standard data
         | structures "library", because they are relatively performance-
         | intensive. They are of course very useful and therefore
         | commonplace now, but I think that only came about after the
         | rise of dynamic languages which have built-in hash maps
         | (Javascript objects, PHP "associative arrays" etc.). So
         | nowadays any new programming language which wants to attract
         | web developers (e.g. Go) needs a built-in hash map.
        
           | boardwaalk wrote:
           | Ehh, hash tables were invented in the 50s and were and are
           | used wherever they are useful. I'm pretty sure a running joke
           | is that every decently sized C program contains a half dozen
           | hash table implementations. They're not recent.
        
           | mypalmike wrote:
           | Hash maps have been commonly used in C since just about
           | forever.
           | 
           | "isn't a insertion order hash map, basically just a list?"
           | 
           | No. They are different in the computational complexity of
           | random access operations.
        
             | benibela wrote:
             | But the standard C++ std::unordered_map only came with
             | C++11
        
       | jerf wrote:
       | "The other feature you quickly find yourself needing is the
       | ability to use arbitrary objects as keys in a hash map. In
       | languages that don't support that (Perl, Python, Go, JavaScript
       | until recently)"
       | 
       | I don't think that's right. Python and Go do effectively support
       | that. Each has their own take on the problem of being unable to
       | use mutable keys, but I index by some sort of object all the
       | time. Makes iteration over hashes substantially more useful when
       | you get objects for both hashes and keys.
       | 
       | Back in the Python 2.0 era, before it grew all the other features
       | that were advantages over Perl, before Python even had
       | generators/iterators, this was one of the big reasons I preferred
       | Python over Perl. Only being able to index by strings wasn't
       | necessarily a huge stopper in theory, but in practice it was very
       | dangerous, because you really need to _encode_ your keys. It was
       | extremely common to see                   print($hash{$keyPart1 .
       | "|" . $keyPart2});
       | 
       | when what you need is
       | print($hash{pipeEncode($keyPart1) . "|" .
       | pipeEncode($keyPart2)});
       | 
       | to be safe, where pipeEncode can be as simple as a routine to
       | backslash encode pipe and backslash characters (although no
       | simpler than that; you must do both of them to be safe, that's
       | the minimum). This could result in non-trivial bugs and even
       | security issues when you start encoding things like $object . "|"
       | $permission and hackers start sticking pipes in the names of
       | objects.
       | 
       | But between the inconvenience of that if you know what you're
       | doing, and the pile of code you'll encounter from developers who
       | don't know why that's important (and may even argue in code
       | review), it was definitely the sort of thing that grinds you down
       | over time.
       | 
       | In Python it was safe to                   print(hash[(keyPart1,
       | keyPart2)])
       | 
       | and it was consistent and safe. Or use an object, etc.
        
         | ErikCorry wrote:
         | It looks like your "pipeEncode" is generating a string that
         | reflects your desired key equality function. I mention this
         | approach in the blog post, but I consider it a poor substitute
         | for being able to use arbitrary objects and specify the
         | equality function on the map or set.
         | 
         | The issue of mutable keys is a slightly different one. If you
         | mutate any of the properties of your object that are used by
         | the map you are going to have a bad time, so don't do that. And
         | I guess if your maps are sufficiently simple (eg JS's object-
         | identity maps) then the user can't make that mistake, but at
         | what cost?
         | 
         | If they are generating strings as keys and they mutate the
         | object after creating the string then this will also break so
         | they haven't even really avoided the problem.
        
           | jerf wrote:
           | "I consider it a poor substitute for being able to use
           | arbitrary objects and specify the equality function on the
           | map or set"
           | 
           | My point is that it's an even _worse_ substitute than most
           | programmers realize, because to use it properly you have to
           | understand how to encode parameters. The thing that people
           | usually use, string concatenation with some delimiter, is
           | fundamentally flawed.
           | 
           | (My favorite... and, alas, I've inherited a system that uses
           | this, though fortunately it hasn't surprised me yet... is
           | using "underscore" as a delimiter, for values whose names
           | routinely include underscores! Fortunately, nothing ever
           | tries to extract the original values from the key
           | programmatically, and it's not really in a place hackers are
           | going to attack. But still... yeesh.)
           | 
           | The one exception I've sometimes made is that if you happen
           | to be in an environment where you _know_ a certain value will
           | never be used, you can use that as the delimiter; I 've used
           | ASCII NUL for that a few times. But you have to be sure that
           | it's not just a "weird" value that "nobody would ever use",
           | but something truly _excluded_ by the context, something that
           | regardless of what is input by someone somewhere is
           | completely impossible to ever get to your code. Generally,
           | the characters you can type on a keyboard are not a good
           | idea.
        
         | nemo1618 wrote:
         | Go does not support using some types as map keys, including
         | slices, channels, functions, and other maps. Slices are the
         | most annoying, and I have seen plenty of code that uses
         | fmt.Sprint(s) as a workaround. Fortunately the compiler now
         | recognizes when you convert a []byte to a string for use as a
         | map key, and will not allocate a new string.
        
           | jerf wrote:
           | Python either. I correct my post above to switch to indexing
           | by a tuple, because what I originally had is wrong:
           | >>> d = {}         >>> d[[1,2]] = 10         Traceback (most
           | recent call last):           File "<stdin>", line 1, in
           | <module>         TypeError: unhashable type: 'list'
           | 
           | This is why I qualified my statement with "Each has their own
           | take on the problem of being unable to use mutable keys,".
           | 
           | Go can be consistent in a simple way because of its type
           | system, it can see if any part of a key has something in it
           | that can't be hashed: https://play.golang.org/p/rf8IqPb76Em
           | 
           | Python, as befits Python, has default behavior for instances
           | that I believe is "is" equivalency, but you can override that
           | with various double-underscore methods to do whatever.
        
             | vharuck wrote:
             | >Python, as befits Python, has default behavior for
             | instances that I believe is "is" equivalency
             | 
             | Python dicts consider two keys the same if they have the
             | same hash value and are "=="-equal. So __eq__ and __hash__
             | are the dunder methods to finagle. Python's sets are the
             | same way.
             | 
             | A useful example is with the pathlib library.
             | from pathlib import Path         p = Path('a.txt')
             | q = Path('a.txt').absolute()         p is q # False
             | {p, q} # Only one element
             | 
             | "q" carries some different info than "p", but it refers to
             | the same file location. So not considering them distinct
             | values is a good decision for this package.
        
               | joshuamorton wrote:
               | Note that this is an oversimplification, as
               | p = Path('a.txt')         q = Path('a.txt')
               | p is q  # False
               | 
               | since p and q are different objects and "is" equality
               | checks if the objects are the same (this can be
               | interpreted approximately as "have the same memory
               | address"), so it'll almost never be true. And in the
               | cases where it is ( 2 is 2), you shouldn't rely on it, as
               | most of them are optimizations.
        
           | arp242 wrote:
           | In Go it depends if the type is "comparable"; i.e. if "=="
           | works.
           | 
           | You can use arrays in Go; e.g. map[[2]int]string, and you can
           | also use channels; although I'm not sure what the rules for
           | comparing channels are exactly off-hand (I'm struggling to
           | come up with a scenario when this would be useful off-hand
           | actually).
           | 
           | The big problem with slices and maps is that they can be
           | modified. That is, what happens if you modify a slice after
           | you used it as a map key? In slices this is worse than with
           | maps because the backing array can change if you run out of
           | cap space. And also, do you compare by value or identity? And
           | again, what happens if either changes?
           | 
           | I'm not sure if it's possible to come up with a set of rules
           | that wouldn't take people by surprise in at least some cases.
        
       | patrickthebold wrote:
       | This reminds me of one bug I introduced a long time ago:
       | 
       | This was a node backend, and there was some code that was
       | recreating an object be reinserting all the key value pairs in a
       | precise order so that that's how it would serialize in the json
       | and then the UI depended on that order. But to me, it was just
       | recreating the same object so I deleted that code, causing the
       | bug.
       | 
       | In retrospect, I should have at least asked the committer what
       | they were trying to do.
       | 
       | In my defense, this was before javascript guaranteed the
       | insertion order would be the order kept.
        
       | arsome wrote:
       | Is there a proper way to use custom objects as keys in hashmaps
       | in JS (well, ideally TypeScript compliant)? Coming out of the C++
       | world where this is trivial and something I use quite often this
       | seems pretty infuriating. Maybe a general data structure package
       | that's worth pulling in?
       | 
       | EDIT: After doing a bit of research I found "tstl", looks
       | interesting, their hashmap appears to support custom hashers and
       | equality comparisons:
       | 
       | https://tstl.dev/api/classes/std.hashmap.html
       | 
       | I'm curious about experiences with it if anyone's had the chance?
        
         | ErikCorry wrote:
         | A quick look at the source implies that that iteration order is
         | not predictable.
        
         | upzylon wrote:
         | If the problem is just that you want to use non-primitive keys,
         | that's what Maps are for.
         | 
         | https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
        
           | arsome wrote:
           | I'm aware of them, but the problem with that is you can only
           | compare an object by instance if I recall correctly. There's
           | no way to override the hashing and equality check that I
           | could find, meaning even for simple objects you're basically
           | forced to serialize your keys in some way still, which is
           | disgustingly inefficient and slow for no good reason.
        
             | doteka wrote:
             | So much this. I remember being so excited that Map and Set
             | were added to the language, only to find that they were
             | worse than useless for objects. At least Set can be handy
             | in a pinch, but I have never seen an actual usecase for Map
             | that isn't solved better with plain old JS objects. Curious
             | if anyone else has?
        
               | upzylon wrote:
               | There definitely are valid use cases for a Map, and even
               | more now that we have WeakMaps (e.g. associating data DOM
               | nodes). Or mapping functions to values, e.g. something
               | like this:                 const returnCache = new Map<()
               | => any, any>()            const memoized = <T extends ()
               | => any>(f: T): ReturnType<T> => {         if
               | (returnCache.has(f)) return returnCache.get(f)
               | const v = f()         returnCache.set(f, v)
               | return v       }
        
               | gildas wrote:
               | > I have never seen an actual usecase for Map that isn't
               | solved better with plain old JS objects. Curious if
               | anyone else has?
               | 
               | With a Map, you can easily and efficiently associate data
               | with DOM nodes without changing them.
        
         | mediumdeviation wrote:
         | Create a wrapper class that takes a stringifying function that
         | converts the key objects into a primitive and use a Map<string,
         | [K, V]> as the backing store. fast-json-stable-stringify can be
         | used as the default stringifier, and if the objects are large
         | or have a natural key a custom stringifier can also be
         | supplied. Since we are using TS we can have our custom class
         | implement Map<K, V>.
        
       | adrianmonk wrote:
       | Having hashmaps be ordered is not always desirable. It's not
       | necessarily meaningful for every application[1]. Also, it can
       | have overhead.
       | 
       | If the problem is that the developer doesn't get what they
       | expect, often the solution is to require the developer to be
       | explicit about what they ask for.
       | 
       | I think that could work well here. Instead of having map.keys(),
       | the API could have map.orderedKeys() and map.unorderedKeys(). (Or
       | make it okeys() and ukeys() to be more concise.)
       | 
       | Yes, it's slightly more typing, but to me that's a small price to
       | pay to ensure you get what you want.
       | 
       | ---
       | 
       | [1] For example, maybe the order that's meaningful to me has
       | something to do with the values rather than the keys. If the map
       | represents git commits, maybe the order that's meaningful to me
       | is a topological sort, and the map isn't going to provide that.
        
         | stickfigure wrote:
         | This would be genuinely weird. The in-memory representation
         | will make one or the other approach more efficient. If you want
         | ordered, pick an ordered implementation (eg Java's
         | _LinkedHashMap_ ). If you don't care, pick an implementation
         | that doesn't care (eg _HashMap_ ). If you want some special
         | ordering, pick something that can use a Comparator (eg
         | _TreeMap_ ) or just sort the data when fetching.
         | 
         | A Frankenstein _LinkedRandomTreeMap_ would pay the overhead of
         | all approaches for an extremely exotic use case.
        
           | adrianmonk wrote:
           | I suppose these ideas can be combined. An implementation that
           | doesn't support ordering could be of a type that only has an
           | unorderedKeys() and lacks orderedKeys(). Which is more
           | explicit than having a keys() function that behaves
           | differently than other types' keys() functions.
           | 
           | Also, I guess you could define unorderedKeys() to mean "I'm
           | not picky about the order" (rather than "it must not be
           | ordered"). Then all types could have an unorderedKeys()
           | function.
        
             | stickfigure wrote:
             | If _unrderedKeys()_ isn 't picky about the order, then
             | there's no reason to have a separate method. Just _keys()_
             | will do, and different implementations can have different
             | behavior.
             | 
             | Also note that relaxing the ordering requirement when
             | obtaining keys does not necessarily improve performance; if
             | you've already done the work of maintaining an ordered list
             | (ie _LinkedHashMap_ ), it's actually faster to iterate in
             | order.
             | 
             | Furthermore, what exactly does _orderedKeys()_ mean anyway?
             | Insertion order? Does reinsertion change order? If some
             | sort of natural ordering (eg, topological) then how do you
             | configure that?
             | 
             | I think the "state of the art" is basically correct here:
             | An abstract _Map_ interface that provides basic operations,
             | and a variety of implementations all with different
             | behaviors and performance profiles. Sorry.
        
       | dragontamer wrote:
       | There's all sorts of application-specific assumptions in this
       | post, that it isn't funny. Ex: Use of hash-tables with insertion
       | ordering to handle a insert-once task queue (preventing "double
       | tasking" if a task is ever somehow inserted twice?)
       | 
       | Solvable with a Linked-list, circular buffer queue + hash table
       | combined together (hash-table to prevent double-insertions.
       | Linked-list / circular buffer queue for the FIFO functionality).
       | 
       | If O(log(n)) insert/removes are sufficient and you have
       | priorities associated with each element, a Heap data-structure
       | works (all "repeats" will be next to each other, because max-
       | heaps always keep the maximum element on the top)
       | 
       | The goal isn't supposed to be to make a data-structure that
       | magically works in all situations. The goal is to define standard
       | library "components" that can be combined together in ways that
       | matches the performance demands of a particular application.
       | 
       | ---------------
       | 
       | And this is all still single-threaded land. If you want a highly
       | performant multithreaded data-structure, you're probably going to
       | have to write your own from scratch (instead of using building
       | blocks from the standard library).
       | 
       | There's just too many performance assumptions that goes into a
       | multithreaded data structure. Is it read-mostly / write-rarely?
       | Read/write balanced? Single-producer / multi-consumer? Multi-
       | producer / single-consumer? Multi-producer / multi-consumer? Is
       | the data-structure able to "count the threads" that may access it
       | and use thread-barriers (thread-pools may practically limit a
       | data-structure to say 64-threads, which means a 64-thread barrier
       | / semaphore can be a highly-performant solution to
       | synchronization) ? Or is it a fully dynamic design where an
       | unbounded number of threads may exist (unbounded number of
       | threads means the sempahore method won't work).
       | 
       | --------
       | 
       | I mean... most people don't care about performance. So a standard
       | library data-structure with a mutex is just fine for well over
       | 95% of applications. But if you care about performance, then
       | suddenly things get complicated.
        
       | chris_st wrote:
       | The surprising/interesting thing about this article is the notion
       | that insertion order is the order that you want to iterate
       | through a hash map.
       | 
       | In my experience, I'm doing either (A) using them as associative
       | memory (I want the value for a key) and I don't ever iterate, or
       | (B) I don't care about order when I iterate (I'm summing values,
       | say) or (C) I want the keys in some _sorted_ order, that probably
       | has nothing whatsoever to do with insertion order (counting
       | keywords, then showing them in alphabetical order).
       | 
       | So I'm curious why the insertion order is preferred, in your
       | experience.
        
         | __s wrote:
         | A result of insertion order is that when I print {b:2,a:1} I
         | get "{b:2,a:1}". This can be nice for being able to store
         | something like an HTML element's attributes as a dictionary
         | while still serializing it back to the order it was declared
         | with
         | 
         | Sorted order also requires sortable keys-- not something that's
         | required when hash maps only require hashing & equality
         | 
         | A stable ordering is nice, without the ability to sort
         | insertion order meets this ask.
         | https://mail.python.org/pipermail/python-dev/2012-December/1...
         | shows how in Python this useful property also comes as a side
         | effect of an optimization
        
           | dheera wrote:
           | There are also some use of dictionaries where the ordering
           | matters (e.g. some mongoDB commands) and you can't use JSON
           | or BSON and you have to use goddamn SON.
        
           | uyt wrote:
           | Compact dict which preserves insertion order are now the
           | default implementation of dicts in python 3.7. They gave a
           | talk about it in https://www.youtube.com/watch?v=npw4s1QTmPg
        
           | cabalamat wrote:
           | > A result of insertion order is that when I print {b:2,a:1}
           | I get "{b:2,a:1}".
           | 
           | This implies that {b:2,a:1} is a different data structure
           | than {a:1,b:2}, which disagrees with my intuitions about how
           | associative arrays ought to work.
           | 
           | If I wanted a data structure that cares about order, I could
           | use [(b,2),(a,1)] and [(a,1),(b,2)]
        
             | OskarS wrote:
             | This is for, like, debug purposes. Makes logs and things
             | like JSON requests easier to read because the output
             | matches the source. Logically though, they are still
             | unordered for things like equivalence checks, so those two
             | examples you had there still compare equal.
             | 
             | Since order is unspecified, but when serializing the hash
             | table you have to pick one, you might as well go for
             | insertion order and make it easier for the programmer to
             | read. Especially since this kind of hash table gets that
             | "for free"
        
         | macspoofing wrote:
         | >So I'm curious why the insertion order is preferred, in your
         | experience.
         | 
         | It's a little bit like asking what's the use of a linked list.
         | A map which can iterate through keys in insertion order is just
         | a data structure.
         | 
         | A recent use-case for me was using timestamps as keys that map
         | to captured state at that time.
        
           | chris_st wrote:
           | > * It's a little bit like asking what's the use of a linked
           | list.*
           | 
           | Well, no, it's more like asking why you'd use data structure
           | X instead of Y.
           | 
           | I was asking _why_ you 'd want that property as part of your
           | data structure, and pointing out that (AFAIR) I've never
           | needed it.
           | 
           | For your use case, I guess you wanted to be able to retrieve
           | the data for specific time stamps? Otherwise I'd think you'd
           | just put it into an array.
        
           | recursive wrote:
           | > A recent use-case for me was using timestamps as keys that
           | map to captured state at that time.
           | 
           | That sounds like sorted key order would suffice just fine.
        
         | metalliqaz wrote:
         | When working with data that has been scanned/parsed from a
         | file, it is often very useful to me that I retrieve the data in
         | the order it was found in the file. This gives humans some kind
         | of continuity between the input and the output of the program.
        
         | foobarian wrote:
         | While we can sort the keys before iterating a plain hashmap,
         | getting insertion order is particularly cumbersome because we
         | need to keep some extra information about the original ordering
         | of the elements. So I guess I'm kinda glad that this is the
         | default implementation, and getting other sort orders is easy
         | with just having access to the keys.
        
         | ErikCorry wrote:
         | The most important thing is that there is some defined order.
         | This makes it easier to debug, and reproducibility is just a
         | very nice property. See
         | https://en.wikipedia.org/wiki/Reproducible_builds Simple stuff
         | like making golden output files for your tests becomes easier.
         | 
         | Certainly, having an ordering on the keys themselves (eg.
         | alphabetical order) is an alternative to insertion order. I'm
         | fine with that. But eg. the built-in maps in Go, Perl or JS
         | don't support that.
         | 
         | So often you end up getting the keys, and then immediately
         | sorting them before iterating. That's taking something that
         | should be O(n) and making it O(n log n). I remember doing this
         | all the time in Perl.
         | 
         | And often there's no order, so you have to create one for your
         | key type. That's extra work, and it's just easier to use a data
         | structure that doesn't require it.
         | 
         | I would be interested to see performance comparisons for maps
         | where the keys are ordered without having to sort on each
         | iteration. Often they have tree structures internally, which
         | means cache-unfriendly pointer chasing, but probably you can
         | remove most of this (and the log-n access time) with wider
         | trees like B-trees.
        
           | sgeisenh wrote:
           | > The most important thing is that there is some defined
           | order. This makes it easier to debug, and reproducibility is
           | just a very nice property. See
           | https://en.wikipedia.org/wiki/Reproducible_builds Simple
           | stuff like making golden output files for your tests becomes
           | easier.
           | 
           | Relying on this type of behavior when it isn't required is
           | somewhat of an anti-pattern. And can often lead to mindlessly
           | copying the results of an implementation in tests.
           | 
           | A quick aside: any sort of generic ordered container that
           | supports O(n) iteration must necessarily require O(n log n)
           | work to construct in the average case. So the asymptotic
           | complexity is necessarily the same as sorting the keys of a
           | hash map with undefined iteration order to artificially
           | create a defined order.
           | 
           | It isn't too hard to simulate the behavior of insertion order
           | iteration by maintaining an array alongside the hash map. And
           | surprisingly enough, space and runtime complexity is
           | practically identical to containers that implement insertion
           | order iteration ;-)
        
             | ErikCorry wrote:
             | > any sort of generic ordered container that supports O(n)
             | iteration must necessarily require O(n log n) work to
             | construct in the average case
             | 
             | This doesn't apply if the order you want is insertion
             | order.
             | 
             | These hash maps are constructed in O(n) on average, just
             | like any other hash maps.
             | 
             | Java's LinkedHashMap is also constructed in O(n).
        
           | simcop2387 wrote:
           | For perl yoy can get this but you have to explicitly ask for
           | it outside the perl interpreter. This is dome by setting up
           | the hash seed and tell it not to add turbulence during
           | runtime.
           | 
           | export PERL_HASH_SEED=0; export PERL_PERTURB_KEYS=0;
           | 
           | The article doesn't mention why python and perl made the
           | choice to add more randomness but ti's because there were
           | attacks out there where people could determine which bucket
           | keys would be put in ahead of time and calculate a set of
           | keys to cause pathological behavior in anything that made a
           | hash from user input, this was commonly url parameters or
           | POST requests. This meant that a simple web request with the
           | right information in it could cause your application to
           | continually reallocate memory and expand the hash size to
           | accomadate the inputs. Usually this just burned cpu time and
           | made other requests time out but it could also result in
           | excessive memory usage and the OOM killer ccoming into play.
           | 
           | The above environment variables work to set a known seed and
           | then not mix any new randomness in, this doesn't get you an
           | jnsertion order but it will make the hash itself as
           | reproducible as possible (if you don't put the keys in in the
           | same order your results will vary still). I uwe this in a now
           | defunct (because of second/third/fourth system syndrome)
           | fuzzer I had going to detect problems in the prerelease
           | versions of perl.
        
             | ErikCorry wrote:
             | I don't think compact dictionaries are necessarily
             | vulnerable to this issue. You could randomize the hash seed
             | without affecting the iteration order.
        
         | _old_dude_ wrote:
         | When you want to output values in the same order as the user
         | give it to you e.g. when you read then write JSON or TOML files
         | and those files can be written by a real human. Changing the
         | order feel arbitrary.
         | 
         | When you have a language that support keyword arguments
         | (kwargs) like Ruby or Python more recently.
        
         | arsome wrote:
         | > The surprising/interesting thing about this article is the
         | notion that insertion order is the order that you want to
         | iterate through a hash map.
         | 
         | I agree, this is what java calls a "LinkedHashMap" and what
         | python calls an "OrderedDictionary" - there are times when it's
         | quite useful, but often it's simply not necessary, uses extra
         | memory and shouldn't be the default when you just need a
         | hashmap.
        
           | oweiler wrote:
           | In Kotlin, calling mapOf() gives you a LinkedHashmap. Which
           | is a weird choice IMHO.
        
           | stickfigure wrote:
           | Or you could look at it as "relaxed ordering is a performance
           | optimization" and shouldn't be taken prematurely.
           | 
           | In Java, there is no default; you ask for a HashMap or a
           | LinkedHashMap explicitly. In dynamic languages like Python or
           | JS I prefer default insertion ordered maps because if I
           | wanted tightly optimized code, I wouldn't use those languages
           | in the first place.
        
           | ErikCorry wrote:
           | The nice thing is the compact dictionary doesn't use extra
           | memory, so why not make it the default?
        
             | tialaramex wrote:
             | It may (perhaps, I can't tell) not use "extra memory"
             | compared to the data structure they had before, but it
             | definitely is using memory it wouldn't need if it didn't
             | preserve order.
             | 
             | Otherwise it would be more or less a violation of the
             | pigeonhole principle. The insertion order is information,
             | if you don't need to store that anywhere but you promise to
             | give it back anyway that's a hole in your information
             | theory.
        
               | floitsch wrote:
               | With that reasoning a sorted list would need more memory
               | than an unsorted list.
        
               | tialaramex wrote:
               | How so? Sorting _changes_ the order of the list, to the
               | extent it 's adding information (the sort order) it's
               | also destroying the same amount of information (the
               | previous order).
               | 
               | It's easy enough (including by accident) to deliver
               | insertion ordered iterator behaviour for small hash maps
               | that have never had any items removed, but after that
               | things get sticky very quickly.
        
               | floitsch wrote:
               | You were stating that "[...] it definitely is using
               | memory it wouldn't need if it didn't preserve order.".
               | 
               | There is no reason for that. A sorted list doesn't need
               | more memory than an unsorted list. And an insertion-
               | sorted map doesn't need to use more memory than a map
               | that doesn't keep the order.
               | 
               | Independently, some properties we deem valuable are often
               | side-products of the way the data-structure works. For
               | example, an efficient binary search tree will naturally
               | tend to be balanced. In the case of efficient hash maps,
               | it is now common to add the key/value pair at the end of
               | an internal vector. Maintaining the insertion order this
               | way doesn't add any cost.
        
               | tialaramex wrote:
               | It actually _does_ have a cost. In fact I 'm pretty sure
               | it has two separate types of cost, so let's talk about
               | both of them.
               | 
               | Firstly, and this is orthogonal to my original point,
               | whenever people say "Cost" they actually silently mean
               | "... that I care about" and so we're involuntarily
               | obliged to agree to stop caring about anything else.
               | Because of Hyrum's law even if you _don 't_ promise
               | ordering (as Python's "Compact Dict" developer proposes
               | at one point) your users will rely upon it and you're
               | stuck with whatever you implemented. This is a real cost
               | for a language or standard library. Look at the sad state
               | of C++ standard collections.
               | 
               | But my main point is that the requirement for ordering
               | forces you to dispose of otherwise permissible
               | optimisations which could destroy the order. You actually
               | ran head first into this in the article, when repeatedly
               | removing and adding elements blows up by clogging your
               | data structure with tombstones.
               | 
               | Swiss Tables wouldn't clog up with tombstones. Since they
               | don't care about ordering they only need tombstones at
               | all in uncommon special cases (specifically, if the
               | entire rest of the group is full or tombstones), in most
               | cases Swiss Tables get to just set removed (key, value)
               | pairs as empty. But this optimisation isn't open to you.
        
               | joshuamorton wrote:
               | Sticking tombstones in the backing array doesn't actually
               | cost you anything though. The backing array still needs
               | to be the same size. Like maybe you end up having to re-
               | hash more often?
               | 
               | >Because of Hyrum's law even if you don't promise
               | ordering (as Python's "Compact Dict" developer proposes
               | at one point) your users will rely upon it and you're
               | stuck with whatever you implemented. This is a real cost
               | for a language or standard library. Look at the sad state
               | of C++ standard collections.
               | 
               | Yes but this cost goes away if you commit to it, and the
               | alternative (like what go does) to force the order to
               | change all the time also comes with a complexity cost vs.
               | doing "nothing" and having the order be arbitrary and
               | sometimes, but not actually, reliable.
        
             | deathanatos wrote:
             | It does, though. See the diagram in the article. There's
             | two allocated arrays there: one for the real hash table (on
             | top) and one for the "backing array" which then stores the
             | actual items. The "extra" bit is the indexes in the top
             | allocation: those indexes would be omitted in a normal
             | (unsorted) hash map. (That is, a normal hash map allocates
             | a single array, which is the storage for both the hash
             | table and the items it contains.1)
             | 
             | (There's also some extra _time_ requirements, too, to chase
             | that additional pointer from the top table to the bottom
             | one.)
             | 
             | 1N.b. in some languages those items will be references, but
             | that doesn't matter for the comparison here.
        
               | ErikCorry wrote:
               | Note that the bottom array can be densely packed, whereas
               | the top one must have gaps to avoid excessive probing. If
               | you only had one array you would have to put the gaps in
               | that array, and it is at least 2 words per entry (much
               | worse if you are inlining objects). So it's not so clear.
               | If you want to save space you can also omit the hash bits
               | stored in the index and reduce the width of the index
               | entries to 16 or 8 bits on small maps.
               | 
               | In practice the compact dict does very well on space.
               | Much better than Java's unordered bucket-based HashMap
               | for example.
               | 
               | It's true that there's an extra pointer chase.
        
           | eximius wrote:
           | note: it has been the default on python for some time, then
           | made official (instead of an implementation detail) after
           | some 3.x release. 3.7 maybe?
        
             | chris_st wrote:
             | Turns out it's the default implementation in Ruby as of 2.7
             | or so as well.
        
               | majormajor wrote:
               | As of 1.9 :)
               | https://www.igvita.com/2009/02/04/ruby-19-internals-
               | ordered-...
        
         | [deleted]
        
         | sorokod wrote:
         | Other then having a deterministic order without the need to
         | specify additional ordering constraints, it is usefull when you
         | are populating the map while linearly processing some data.
        
         | mypalmike wrote:
         | It's useful for when you are in a job interview and are asked
         | to build an LRU cache. :-)
        
         | paavohtl wrote:
         | I've relied on the order of keys for URL routing in JavaScript
         | / TypeScript. Each route of the web application is unique, and
         | the order matters - it's surprisingly easy to end up with
         | otherwise ambiguous routes like /products/:ean and
         | /products/all in real world applications. One benefit of using
         | an object and not just an array is that I can validate at
         | compile time with TypeScript that each route has exactly one
         | handler.
         | 
         | An insertion-ordered hashmap works really well for this without
         | requiring any extra syntax or data beyond object literals.
        
         | spuz wrote:
         | Imagine you want to write a JSON parser and prettyfier. Then
         | writing the keys out in the same order as they were inserted
         | would be very important. Yes, it's true that key order doesn't
         | matter to the computers that consume JSON but humans consume
         | JSON too and key order often matters to them.
        
           | chris_st wrote:
           | You may be able to tell from this reply that I've never built
           | a JSON parser and prettyfier :-) but I'd guess that in this
           | case, I'd never have any reason to ask for a bit of the JSON
           | "out of order", so I'd just use arrays...?
        
       | noneeeed wrote:
       | Curious to read this and not see a reference to Ruby.
       | 
       | In Ruby hashes are ordered by insertion, can have any object as
       | keys, but the hash key value can be defined on a class by class
       | basis (by overriding the default `hash` method on the class).
       | 
       | The ordering is one of those features that I need very
       | infrequently, but when I do it significantly simplifies that bit
       | of code.
       | 
       | I've always found the usefulness and flexibilty of Ruby's hashes
       | to be a double-edged sword though: they are extremely useful, but
       | this leads people to use them in situations where defining a
       | proper class might be better for long-term maintenance.
        
         | weatherlight wrote:
         | I've used this feature of Ruby Hashes to create something that
         | behaves like a a LRU cache :)
        
       | AtlasBarfed wrote:
       | Java is treated as the "prior art" like it is some sort of
       | ancient script? That is amusing.
       | 
       | Since Java was the first mainstream GC language with sufficient
       | man-hours of engineering, and it was an opportunity after almost
       | two decades of C/C++ libraries to start anew with the standard
       | library, but its standard library always seemed simply
       | evolutionary/"best practice" (not my favorite phrase, sorry) from
       | what came before it.
       | 
       | Is the hashmap impl in the JDK actually novel for its time?
       | 
       | But
        
         | ErikCorry wrote:
         | As I understood it, there was an implementation of "compact
         | dictionaries" in Java, but it was not the implementation of any
         | of the collections that were built into the language. It is
         | "prior art" in that it was done before compact dictionaries
         | were reinvented for Python.
        
           | abecedarius wrote:
           | IIRC it was for the collections library of the E language,
           | whose first implementation was in Java (erights.org). The E
           | designers had a self-imposed constraint that everything
           | computational had to be deterministic. The standard way of
           | doing iterable hashtables lacked this property.
           | 
           | I tried to check my memory by following the link to the
           | Raymond Hettinger talk, but he doesn't say just what prior
           | work he was talking about, only that it was in Java.
        
       | bullen wrote:
       | I wish JavaScript had linked hash maps as default so that JSON
       | would not require arrays to keep the order of objects!
       | 
       | Hierarchies would not have to look like this:
       | http://root.rupy.se/meta/user/task/563541707719927980
        
         | ajanuary wrote:
         | Javascript already does retain insertion order (except for
         | special casing of things like integer keys)
         | 
         | JSON objects, however, have unordered keys.
         | 
         | JSON != Javascript.
        
       ___________________________________________________________________
       (page generated 2021-07-01 23:01 UTC)