[HN Gopher] An efficient way to make hash maps with insertion or...
___________________________________________________________________
An efficient way to make hash maps with insertion ordering
Author : eatonphil
Score : 83 points
Date : 2021-07-01 11:21 UTC (11 hours ago)
(HTM) web link (blog.toit.io)
(TXT) w3m dump (blog.toit.io)
| usefulcat wrote:
| First off, WTF is up with disabling copying? That is seriously
| annoying. Now on to what I came here to say..
|
| "When there are enough of these tombstones, the hash map gets
| rebuilt, but that's OK. Due to the magic of amortized constant
| time, this doesn't slow you down on average."
|
| If amortized time is all you care about, that's fine. But there
| are some applications where worst case time is more important,
| and unfortunately I've seen some hash map implementations that
| implicitly assume that the worst case never matters.
| prutschman wrote:
| In some browsers you can get around this by selecting the text,
| keeping the mouse button down, hitting the keyboard shortcut
| for 'copy', then letting go. The selection will still be
| emptied, but the text is already in your clipboard.
| peterkelly wrote:
| Rust has HashMap and BTreeMap. You use the latter when you care
| about ordering. There's also a couple of options for dequeues. A
| good overview here explaining which one you'd choose depending on
| what oyu want to do: https://doc.rust-
| lang.org/std/collections/index.html
|
| Use the right data structure for the job. Maps are an interface,
| not an implementation.
| edflsafoiewq wrote:
| The indexmap crate provides a hashmap that iterates in
| insertion order.
| floitsch wrote:
| Rust has hash maps that hate you.
|
| The default hash map visits the keys in arbitrary order
| (https://doc.rust-
| lang.org/std/collections/struct.HashMap.htm...).
|
| You could use a BTreeMap, but that's not insertion order and
| requires the user to come up with some sort of ordering.
| IshKebab wrote:
| They don't hate you. That's how it's supposed to work.
| Otherwise you get issues like the one discovered in the
| article.
|
| Would you say Rust has Vecs that hate you because prepend
| isn't O(1)? Or does JavaScript have Maps that hate you
| because they aren't sorted?
|
| There's no perfect container. It's all trade-offs.
| ErikCorry wrote:
| The article also explains how the issue discovered can be
| fixed.
|
| And JS maps are insertion ordered.
| ecshafer wrote:
| This is a weird article. It creates the idea that hash maps are
| uncommon? surprising? I am not really sure. Then makes the claim
| that you would expect them to be ordered by insertion order, and
| not just randomly which is the common usage of hash maps. Ordered
| Hash Maps exist, and isn't a insertion order hash map, basically
| just a list? But Hash Maps are so common and used absolutely
| everywhere its almost like its aiming at people who have used
| them but never any other data structures.
| EamonnMR wrote:
| When I was first taught Java in college I didn't find out about
| hash maps for a while because it was considered too advanced
| for beginners (ohh scary templates!)
| myphs wrote:
| This is quite interesting. I wonder, why this was decided to
| be too advanced. In my uni we teach it from the very
| beginning.
| pydry wrote:
| Python switching all dicts to being ordered is my favorite new
| feature from the last five years.
|
| It's useful all over the damn place.
| rob74 wrote:
| Hash maps are a fairly recent addition to the standard data
| structures "library", because they are relatively performance-
| intensive. They are of course very useful and therefore
| commonplace now, but I think that only came about after the
| rise of dynamic languages which have built-in hash maps
| (Javascript objects, PHP "associative arrays" etc.). So
| nowadays any new programming language which wants to attract
| web developers (e.g. Go) needs a built-in hash map.
| boardwaalk wrote:
| Ehh, hash tables were invented in the 50s and were and are
| used wherever they are useful. I'm pretty sure a running joke
| is that every decently sized C program contains a half dozen
| hash table implementations. They're not recent.
| mypalmike wrote:
| Hash maps have been commonly used in C since just about
| forever.
|
| "isn't a insertion order hash map, basically just a list?"
|
| No. They are different in the computational complexity of
| random access operations.
| benibela wrote:
| But the standard C++ std::unordered_map only came with
| C++11
| jerf wrote:
| "The other feature you quickly find yourself needing is the
| ability to use arbitrary objects as keys in a hash map. In
| languages that don't support that (Perl, Python, Go, JavaScript
| until recently)"
|
| I don't think that's right. Python and Go do effectively support
| that. Each has their own take on the problem of being unable to
| use mutable keys, but I index by some sort of object all the
| time. Makes iteration over hashes substantially more useful when
| you get objects for both hashes and keys.
|
| Back in the Python 2.0 era, before it grew all the other features
| that were advantages over Perl, before Python even had
| generators/iterators, this was one of the big reasons I preferred
| Python over Perl. Only being able to index by strings wasn't
| necessarily a huge stopper in theory, but in practice it was very
| dangerous, because you really need to _encode_ your keys. It was
| extremely common to see print($hash{$keyPart1 .
| "|" . $keyPart2});
|
| when what you need is
| print($hash{pipeEncode($keyPart1) . "|" .
| pipeEncode($keyPart2)});
|
| to be safe, where pipeEncode can be as simple as a routine to
| backslash encode pipe and backslash characters (although no
| simpler than that; you must do both of them to be safe, that's
| the minimum). This could result in non-trivial bugs and even
| security issues when you start encoding things like $object . "|"
| $permission and hackers start sticking pipes in the names of
| objects.
|
| But between the inconvenience of that if you know what you're
| doing, and the pile of code you'll encounter from developers who
| don't know why that's important (and may even argue in code
| review), it was definitely the sort of thing that grinds you down
| over time.
|
| In Python it was safe to print(hash[(keyPart1,
| keyPart2)])
|
| and it was consistent and safe. Or use an object, etc.
| ErikCorry wrote:
| It looks like your "pipeEncode" is generating a string that
| reflects your desired key equality function. I mention this
| approach in the blog post, but I consider it a poor substitute
| for being able to use arbitrary objects and specify the
| equality function on the map or set.
|
| The issue of mutable keys is a slightly different one. If you
| mutate any of the properties of your object that are used by
| the map you are going to have a bad time, so don't do that. And
| I guess if your maps are sufficiently simple (eg JS's object-
| identity maps) then the user can't make that mistake, but at
| what cost?
|
| If they are generating strings as keys and they mutate the
| object after creating the string then this will also break so
| they haven't even really avoided the problem.
| jerf wrote:
| "I consider it a poor substitute for being able to use
| arbitrary objects and specify the equality function on the
| map or set"
|
| My point is that it's an even _worse_ substitute than most
| programmers realize, because to use it properly you have to
| understand how to encode parameters. The thing that people
| usually use, string concatenation with some delimiter, is
| fundamentally flawed.
|
| (My favorite... and, alas, I've inherited a system that uses
| this, though fortunately it hasn't surprised me yet... is
| using "underscore" as a delimiter, for values whose names
| routinely include underscores! Fortunately, nothing ever
| tries to extract the original values from the key
| programmatically, and it's not really in a place hackers are
| going to attack. But still... yeesh.)
|
| The one exception I've sometimes made is that if you happen
| to be in an environment where you _know_ a certain value will
| never be used, you can use that as the delimiter; I 've used
| ASCII NUL for that a few times. But you have to be sure that
| it's not just a "weird" value that "nobody would ever use",
| but something truly _excluded_ by the context, something that
| regardless of what is input by someone somewhere is
| completely impossible to ever get to your code. Generally,
| the characters you can type on a keyboard are not a good
| idea.
| nemo1618 wrote:
| Go does not support using some types as map keys, including
| slices, channels, functions, and other maps. Slices are the
| most annoying, and I have seen plenty of code that uses
| fmt.Sprint(s) as a workaround. Fortunately the compiler now
| recognizes when you convert a []byte to a string for use as a
| map key, and will not allocate a new string.
| jerf wrote:
| Python either. I correct my post above to switch to indexing
| by a tuple, because what I originally had is wrong:
| >>> d = {} >>> d[[1,2]] = 10 Traceback (most
| recent call last): File "<stdin>", line 1, in
| <module> TypeError: unhashable type: 'list'
|
| This is why I qualified my statement with "Each has their own
| take on the problem of being unable to use mutable keys,".
|
| Go can be consistent in a simple way because of its type
| system, it can see if any part of a key has something in it
| that can't be hashed: https://play.golang.org/p/rf8IqPb76Em
|
| Python, as befits Python, has default behavior for instances
| that I believe is "is" equivalency, but you can override that
| with various double-underscore methods to do whatever.
| vharuck wrote:
| >Python, as befits Python, has default behavior for
| instances that I believe is "is" equivalency
|
| Python dicts consider two keys the same if they have the
| same hash value and are "=="-equal. So __eq__ and __hash__
| are the dunder methods to finagle. Python's sets are the
| same way.
|
| A useful example is with the pathlib library.
| from pathlib import Path p = Path('a.txt')
| q = Path('a.txt').absolute() p is q # False
| {p, q} # Only one element
|
| "q" carries some different info than "p", but it refers to
| the same file location. So not considering them distinct
| values is a good decision for this package.
| joshuamorton wrote:
| Note that this is an oversimplification, as
| p = Path('a.txt') q = Path('a.txt')
| p is q # False
|
| since p and q are different objects and "is" equality
| checks if the objects are the same (this can be
| interpreted approximately as "have the same memory
| address"), so it'll almost never be true. And in the
| cases where it is ( 2 is 2), you shouldn't rely on it, as
| most of them are optimizations.
| arp242 wrote:
| In Go it depends if the type is "comparable"; i.e. if "=="
| works.
|
| You can use arrays in Go; e.g. map[[2]int]string, and you can
| also use channels; although I'm not sure what the rules for
| comparing channels are exactly off-hand (I'm struggling to
| come up with a scenario when this would be useful off-hand
| actually).
|
| The big problem with slices and maps is that they can be
| modified. That is, what happens if you modify a slice after
| you used it as a map key? In slices this is worse than with
| maps because the backing array can change if you run out of
| cap space. And also, do you compare by value or identity? And
| again, what happens if either changes?
|
| I'm not sure if it's possible to come up with a set of rules
| that wouldn't take people by surprise in at least some cases.
| patrickthebold wrote:
| This reminds me of one bug I introduced a long time ago:
|
| This was a node backend, and there was some code that was
| recreating an object be reinserting all the key value pairs in a
| precise order so that that's how it would serialize in the json
| and then the UI depended on that order. But to me, it was just
| recreating the same object so I deleted that code, causing the
| bug.
|
| In retrospect, I should have at least asked the committer what
| they were trying to do.
|
| In my defense, this was before javascript guaranteed the
| insertion order would be the order kept.
| arsome wrote:
| Is there a proper way to use custom objects as keys in hashmaps
| in JS (well, ideally TypeScript compliant)? Coming out of the C++
| world where this is trivial and something I use quite often this
| seems pretty infuriating. Maybe a general data structure package
| that's worth pulling in?
|
| EDIT: After doing a bit of research I found "tstl", looks
| interesting, their hashmap appears to support custom hashers and
| equality comparisons:
|
| https://tstl.dev/api/classes/std.hashmap.html
|
| I'm curious about experiences with it if anyone's had the chance?
| ErikCorry wrote:
| A quick look at the source implies that that iteration order is
| not predictable.
| upzylon wrote:
| If the problem is just that you want to use non-primitive keys,
| that's what Maps are for.
|
| https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
| arsome wrote:
| I'm aware of them, but the problem with that is you can only
| compare an object by instance if I recall correctly. There's
| no way to override the hashing and equality check that I
| could find, meaning even for simple objects you're basically
| forced to serialize your keys in some way still, which is
| disgustingly inefficient and slow for no good reason.
| doteka wrote:
| So much this. I remember being so excited that Map and Set
| were added to the language, only to find that they were
| worse than useless for objects. At least Set can be handy
| in a pinch, but I have never seen an actual usecase for Map
| that isn't solved better with plain old JS objects. Curious
| if anyone else has?
| upzylon wrote:
| There definitely are valid use cases for a Map, and even
| more now that we have WeakMaps (e.g. associating data DOM
| nodes). Or mapping functions to values, e.g. something
| like this: const returnCache = new Map<()
| => any, any>() const memoized = <T extends ()
| => any>(f: T): ReturnType<T> => { if
| (returnCache.has(f)) return returnCache.get(f)
| const v = f() returnCache.set(f, v)
| return v }
| gildas wrote:
| > I have never seen an actual usecase for Map that isn't
| solved better with plain old JS objects. Curious if
| anyone else has?
|
| With a Map, you can easily and efficiently associate data
| with DOM nodes without changing them.
| mediumdeviation wrote:
| Create a wrapper class that takes a stringifying function that
| converts the key objects into a primitive and use a Map<string,
| [K, V]> as the backing store. fast-json-stable-stringify can be
| used as the default stringifier, and if the objects are large
| or have a natural key a custom stringifier can also be
| supplied. Since we are using TS we can have our custom class
| implement Map<K, V>.
| adrianmonk wrote:
| Having hashmaps be ordered is not always desirable. It's not
| necessarily meaningful for every application[1]. Also, it can
| have overhead.
|
| If the problem is that the developer doesn't get what they
| expect, often the solution is to require the developer to be
| explicit about what they ask for.
|
| I think that could work well here. Instead of having map.keys(),
| the API could have map.orderedKeys() and map.unorderedKeys(). (Or
| make it okeys() and ukeys() to be more concise.)
|
| Yes, it's slightly more typing, but to me that's a small price to
| pay to ensure you get what you want.
|
| ---
|
| [1] For example, maybe the order that's meaningful to me has
| something to do with the values rather than the keys. If the map
| represents git commits, maybe the order that's meaningful to me
| is a topological sort, and the map isn't going to provide that.
| stickfigure wrote:
| This would be genuinely weird. The in-memory representation
| will make one or the other approach more efficient. If you want
| ordered, pick an ordered implementation (eg Java's
| _LinkedHashMap_ ). If you don't care, pick an implementation
| that doesn't care (eg _HashMap_ ). If you want some special
| ordering, pick something that can use a Comparator (eg
| _TreeMap_ ) or just sort the data when fetching.
|
| A Frankenstein _LinkedRandomTreeMap_ would pay the overhead of
| all approaches for an extremely exotic use case.
| adrianmonk wrote:
| I suppose these ideas can be combined. An implementation that
| doesn't support ordering could be of a type that only has an
| unorderedKeys() and lacks orderedKeys(). Which is more
| explicit than having a keys() function that behaves
| differently than other types' keys() functions.
|
| Also, I guess you could define unorderedKeys() to mean "I'm
| not picky about the order" (rather than "it must not be
| ordered"). Then all types could have an unorderedKeys()
| function.
| stickfigure wrote:
| If _unrderedKeys()_ isn 't picky about the order, then
| there's no reason to have a separate method. Just _keys()_
| will do, and different implementations can have different
| behavior.
|
| Also note that relaxing the ordering requirement when
| obtaining keys does not necessarily improve performance; if
| you've already done the work of maintaining an ordered list
| (ie _LinkedHashMap_ ), it's actually faster to iterate in
| order.
|
| Furthermore, what exactly does _orderedKeys()_ mean anyway?
| Insertion order? Does reinsertion change order? If some
| sort of natural ordering (eg, topological) then how do you
| configure that?
|
| I think the "state of the art" is basically correct here:
| An abstract _Map_ interface that provides basic operations,
| and a variety of implementations all with different
| behaviors and performance profiles. Sorry.
| dragontamer wrote:
| There's all sorts of application-specific assumptions in this
| post, that it isn't funny. Ex: Use of hash-tables with insertion
| ordering to handle a insert-once task queue (preventing "double
| tasking" if a task is ever somehow inserted twice?)
|
| Solvable with a Linked-list, circular buffer queue + hash table
| combined together (hash-table to prevent double-insertions.
| Linked-list / circular buffer queue for the FIFO functionality).
|
| If O(log(n)) insert/removes are sufficient and you have
| priorities associated with each element, a Heap data-structure
| works (all "repeats" will be next to each other, because max-
| heaps always keep the maximum element on the top)
|
| The goal isn't supposed to be to make a data-structure that
| magically works in all situations. The goal is to define standard
| library "components" that can be combined together in ways that
| matches the performance demands of a particular application.
|
| ---------------
|
| And this is all still single-threaded land. If you want a highly
| performant multithreaded data-structure, you're probably going to
| have to write your own from scratch (instead of using building
| blocks from the standard library).
|
| There's just too many performance assumptions that goes into a
| multithreaded data structure. Is it read-mostly / write-rarely?
| Read/write balanced? Single-producer / multi-consumer? Multi-
| producer / single-consumer? Multi-producer / multi-consumer? Is
| the data-structure able to "count the threads" that may access it
| and use thread-barriers (thread-pools may practically limit a
| data-structure to say 64-threads, which means a 64-thread barrier
| / semaphore can be a highly-performant solution to
| synchronization) ? Or is it a fully dynamic design where an
| unbounded number of threads may exist (unbounded number of
| threads means the sempahore method won't work).
|
| --------
|
| I mean... most people don't care about performance. So a standard
| library data-structure with a mutex is just fine for well over
| 95% of applications. But if you care about performance, then
| suddenly things get complicated.
| chris_st wrote:
| The surprising/interesting thing about this article is the notion
| that insertion order is the order that you want to iterate
| through a hash map.
|
| In my experience, I'm doing either (A) using them as associative
| memory (I want the value for a key) and I don't ever iterate, or
| (B) I don't care about order when I iterate (I'm summing values,
| say) or (C) I want the keys in some _sorted_ order, that probably
| has nothing whatsoever to do with insertion order (counting
| keywords, then showing them in alphabetical order).
|
| So I'm curious why the insertion order is preferred, in your
| experience.
| __s wrote:
| A result of insertion order is that when I print {b:2,a:1} I
| get "{b:2,a:1}". This can be nice for being able to store
| something like an HTML element's attributes as a dictionary
| while still serializing it back to the order it was declared
| with
|
| Sorted order also requires sortable keys-- not something that's
| required when hash maps only require hashing & equality
|
| A stable ordering is nice, without the ability to sort
| insertion order meets this ask.
| https://mail.python.org/pipermail/python-dev/2012-December/1...
| shows how in Python this useful property also comes as a side
| effect of an optimization
| dheera wrote:
| There are also some use of dictionaries where the ordering
| matters (e.g. some mongoDB commands) and you can't use JSON
| or BSON and you have to use goddamn SON.
| uyt wrote:
| Compact dict which preserves insertion order are now the
| default implementation of dicts in python 3.7. They gave a
| talk about it in https://www.youtube.com/watch?v=npw4s1QTmPg
| cabalamat wrote:
| > A result of insertion order is that when I print {b:2,a:1}
| I get "{b:2,a:1}".
|
| This implies that {b:2,a:1} is a different data structure
| than {a:1,b:2}, which disagrees with my intuitions about how
| associative arrays ought to work.
|
| If I wanted a data structure that cares about order, I could
| use [(b,2),(a,1)] and [(a,1),(b,2)]
| OskarS wrote:
| This is for, like, debug purposes. Makes logs and things
| like JSON requests easier to read because the output
| matches the source. Logically though, they are still
| unordered for things like equivalence checks, so those two
| examples you had there still compare equal.
|
| Since order is unspecified, but when serializing the hash
| table you have to pick one, you might as well go for
| insertion order and make it easier for the programmer to
| read. Especially since this kind of hash table gets that
| "for free"
| macspoofing wrote:
| >So I'm curious why the insertion order is preferred, in your
| experience.
|
| It's a little bit like asking what's the use of a linked list.
| A map which can iterate through keys in insertion order is just
| a data structure.
|
| A recent use-case for me was using timestamps as keys that map
| to captured state at that time.
| chris_st wrote:
| > * It's a little bit like asking what's the use of a linked
| list.*
|
| Well, no, it's more like asking why you'd use data structure
| X instead of Y.
|
| I was asking _why_ you 'd want that property as part of your
| data structure, and pointing out that (AFAIR) I've never
| needed it.
|
| For your use case, I guess you wanted to be able to retrieve
| the data for specific time stamps? Otherwise I'd think you'd
| just put it into an array.
| recursive wrote:
| > A recent use-case for me was using timestamps as keys that
| map to captured state at that time.
|
| That sounds like sorted key order would suffice just fine.
| metalliqaz wrote:
| When working with data that has been scanned/parsed from a
| file, it is often very useful to me that I retrieve the data in
| the order it was found in the file. This gives humans some kind
| of continuity between the input and the output of the program.
| foobarian wrote:
| While we can sort the keys before iterating a plain hashmap,
| getting insertion order is particularly cumbersome because we
| need to keep some extra information about the original ordering
| of the elements. So I guess I'm kinda glad that this is the
| default implementation, and getting other sort orders is easy
| with just having access to the keys.
| ErikCorry wrote:
| The most important thing is that there is some defined order.
| This makes it easier to debug, and reproducibility is just a
| very nice property. See
| https://en.wikipedia.org/wiki/Reproducible_builds Simple stuff
| like making golden output files for your tests becomes easier.
|
| Certainly, having an ordering on the keys themselves (eg.
| alphabetical order) is an alternative to insertion order. I'm
| fine with that. But eg. the built-in maps in Go, Perl or JS
| don't support that.
|
| So often you end up getting the keys, and then immediately
| sorting them before iterating. That's taking something that
| should be O(n) and making it O(n log n). I remember doing this
| all the time in Perl.
|
| And often there's no order, so you have to create one for your
| key type. That's extra work, and it's just easier to use a data
| structure that doesn't require it.
|
| I would be interested to see performance comparisons for maps
| where the keys are ordered without having to sort on each
| iteration. Often they have tree structures internally, which
| means cache-unfriendly pointer chasing, but probably you can
| remove most of this (and the log-n access time) with wider
| trees like B-trees.
| sgeisenh wrote:
| > The most important thing is that there is some defined
| order. This makes it easier to debug, and reproducibility is
| just a very nice property. See
| https://en.wikipedia.org/wiki/Reproducible_builds Simple
| stuff like making golden output files for your tests becomes
| easier.
|
| Relying on this type of behavior when it isn't required is
| somewhat of an anti-pattern. And can often lead to mindlessly
| copying the results of an implementation in tests.
|
| A quick aside: any sort of generic ordered container that
| supports O(n) iteration must necessarily require O(n log n)
| work to construct in the average case. So the asymptotic
| complexity is necessarily the same as sorting the keys of a
| hash map with undefined iteration order to artificially
| create a defined order.
|
| It isn't too hard to simulate the behavior of insertion order
| iteration by maintaining an array alongside the hash map. And
| surprisingly enough, space and runtime complexity is
| practically identical to containers that implement insertion
| order iteration ;-)
| ErikCorry wrote:
| > any sort of generic ordered container that supports O(n)
| iteration must necessarily require O(n log n) work to
| construct in the average case
|
| This doesn't apply if the order you want is insertion
| order.
|
| These hash maps are constructed in O(n) on average, just
| like any other hash maps.
|
| Java's LinkedHashMap is also constructed in O(n).
| simcop2387 wrote:
| For perl yoy can get this but you have to explicitly ask for
| it outside the perl interpreter. This is dome by setting up
| the hash seed and tell it not to add turbulence during
| runtime.
|
| export PERL_HASH_SEED=0; export PERL_PERTURB_KEYS=0;
|
| The article doesn't mention why python and perl made the
| choice to add more randomness but ti's because there were
| attacks out there where people could determine which bucket
| keys would be put in ahead of time and calculate a set of
| keys to cause pathological behavior in anything that made a
| hash from user input, this was commonly url parameters or
| POST requests. This meant that a simple web request with the
| right information in it could cause your application to
| continually reallocate memory and expand the hash size to
| accomadate the inputs. Usually this just burned cpu time and
| made other requests time out but it could also result in
| excessive memory usage and the OOM killer ccoming into play.
|
| The above environment variables work to set a known seed and
| then not mix any new randomness in, this doesn't get you an
| jnsertion order but it will make the hash itself as
| reproducible as possible (if you don't put the keys in in the
| same order your results will vary still). I uwe this in a now
| defunct (because of second/third/fourth system syndrome)
| fuzzer I had going to detect problems in the prerelease
| versions of perl.
| ErikCorry wrote:
| I don't think compact dictionaries are necessarily
| vulnerable to this issue. You could randomize the hash seed
| without affecting the iteration order.
| _old_dude_ wrote:
| When you want to output values in the same order as the user
| give it to you e.g. when you read then write JSON or TOML files
| and those files can be written by a real human. Changing the
| order feel arbitrary.
|
| When you have a language that support keyword arguments
| (kwargs) like Ruby or Python more recently.
| arsome wrote:
| > The surprising/interesting thing about this article is the
| notion that insertion order is the order that you want to
| iterate through a hash map.
|
| I agree, this is what java calls a "LinkedHashMap" and what
| python calls an "OrderedDictionary" - there are times when it's
| quite useful, but often it's simply not necessary, uses extra
| memory and shouldn't be the default when you just need a
| hashmap.
| oweiler wrote:
| In Kotlin, calling mapOf() gives you a LinkedHashmap. Which
| is a weird choice IMHO.
| stickfigure wrote:
| Or you could look at it as "relaxed ordering is a performance
| optimization" and shouldn't be taken prematurely.
|
| In Java, there is no default; you ask for a HashMap or a
| LinkedHashMap explicitly. In dynamic languages like Python or
| JS I prefer default insertion ordered maps because if I
| wanted tightly optimized code, I wouldn't use those languages
| in the first place.
| ErikCorry wrote:
| The nice thing is the compact dictionary doesn't use extra
| memory, so why not make it the default?
| tialaramex wrote:
| It may (perhaps, I can't tell) not use "extra memory"
| compared to the data structure they had before, but it
| definitely is using memory it wouldn't need if it didn't
| preserve order.
|
| Otherwise it would be more or less a violation of the
| pigeonhole principle. The insertion order is information,
| if you don't need to store that anywhere but you promise to
| give it back anyway that's a hole in your information
| theory.
| floitsch wrote:
| With that reasoning a sorted list would need more memory
| than an unsorted list.
| tialaramex wrote:
| How so? Sorting _changes_ the order of the list, to the
| extent it 's adding information (the sort order) it's
| also destroying the same amount of information (the
| previous order).
|
| It's easy enough (including by accident) to deliver
| insertion ordered iterator behaviour for small hash maps
| that have never had any items removed, but after that
| things get sticky very quickly.
| floitsch wrote:
| You were stating that "[...] it definitely is using
| memory it wouldn't need if it didn't preserve order.".
|
| There is no reason for that. A sorted list doesn't need
| more memory than an unsorted list. And an insertion-
| sorted map doesn't need to use more memory than a map
| that doesn't keep the order.
|
| Independently, some properties we deem valuable are often
| side-products of the way the data-structure works. For
| example, an efficient binary search tree will naturally
| tend to be balanced. In the case of efficient hash maps,
| it is now common to add the key/value pair at the end of
| an internal vector. Maintaining the insertion order this
| way doesn't add any cost.
| tialaramex wrote:
| It actually _does_ have a cost. In fact I 'm pretty sure
| it has two separate types of cost, so let's talk about
| both of them.
|
| Firstly, and this is orthogonal to my original point,
| whenever people say "Cost" they actually silently mean
| "... that I care about" and so we're involuntarily
| obliged to agree to stop caring about anything else.
| Because of Hyrum's law even if you _don 't_ promise
| ordering (as Python's "Compact Dict" developer proposes
| at one point) your users will rely upon it and you're
| stuck with whatever you implemented. This is a real cost
| for a language or standard library. Look at the sad state
| of C++ standard collections.
|
| But my main point is that the requirement for ordering
| forces you to dispose of otherwise permissible
| optimisations which could destroy the order. You actually
| ran head first into this in the article, when repeatedly
| removing and adding elements blows up by clogging your
| data structure with tombstones.
|
| Swiss Tables wouldn't clog up with tombstones. Since they
| don't care about ordering they only need tombstones at
| all in uncommon special cases (specifically, if the
| entire rest of the group is full or tombstones), in most
| cases Swiss Tables get to just set removed (key, value)
| pairs as empty. But this optimisation isn't open to you.
| joshuamorton wrote:
| Sticking tombstones in the backing array doesn't actually
| cost you anything though. The backing array still needs
| to be the same size. Like maybe you end up having to re-
| hash more often?
|
| >Because of Hyrum's law even if you don't promise
| ordering (as Python's "Compact Dict" developer proposes
| at one point) your users will rely upon it and you're
| stuck with whatever you implemented. This is a real cost
| for a language or standard library. Look at the sad state
| of C++ standard collections.
|
| Yes but this cost goes away if you commit to it, and the
| alternative (like what go does) to force the order to
| change all the time also comes with a complexity cost vs.
| doing "nothing" and having the order be arbitrary and
| sometimes, but not actually, reliable.
| deathanatos wrote:
| It does, though. See the diagram in the article. There's
| two allocated arrays there: one for the real hash table (on
| top) and one for the "backing array" which then stores the
| actual items. The "extra" bit is the indexes in the top
| allocation: those indexes would be omitted in a normal
| (unsorted) hash map. (That is, a normal hash map allocates
| a single array, which is the storage for both the hash
| table and the items it contains.1)
|
| (There's also some extra _time_ requirements, too, to chase
| that additional pointer from the top table to the bottom
| one.)
|
| 1N.b. in some languages those items will be references, but
| that doesn't matter for the comparison here.
| ErikCorry wrote:
| Note that the bottom array can be densely packed, whereas
| the top one must have gaps to avoid excessive probing. If
| you only had one array you would have to put the gaps in
| that array, and it is at least 2 words per entry (much
| worse if you are inlining objects). So it's not so clear.
| If you want to save space you can also omit the hash bits
| stored in the index and reduce the width of the index
| entries to 16 or 8 bits on small maps.
|
| In practice the compact dict does very well on space.
| Much better than Java's unordered bucket-based HashMap
| for example.
|
| It's true that there's an extra pointer chase.
| eximius wrote:
| note: it has been the default on python for some time, then
| made official (instead of an implementation detail) after
| some 3.x release. 3.7 maybe?
| chris_st wrote:
| Turns out it's the default implementation in Ruby as of 2.7
| or so as well.
| majormajor wrote:
| As of 1.9 :)
| https://www.igvita.com/2009/02/04/ruby-19-internals-
| ordered-...
| [deleted]
| sorokod wrote:
| Other then having a deterministic order without the need to
| specify additional ordering constraints, it is usefull when you
| are populating the map while linearly processing some data.
| mypalmike wrote:
| It's useful for when you are in a job interview and are asked
| to build an LRU cache. :-)
| paavohtl wrote:
| I've relied on the order of keys for URL routing in JavaScript
| / TypeScript. Each route of the web application is unique, and
| the order matters - it's surprisingly easy to end up with
| otherwise ambiguous routes like /products/:ean and
| /products/all in real world applications. One benefit of using
| an object and not just an array is that I can validate at
| compile time with TypeScript that each route has exactly one
| handler.
|
| An insertion-ordered hashmap works really well for this without
| requiring any extra syntax or data beyond object literals.
| spuz wrote:
| Imagine you want to write a JSON parser and prettyfier. Then
| writing the keys out in the same order as they were inserted
| would be very important. Yes, it's true that key order doesn't
| matter to the computers that consume JSON but humans consume
| JSON too and key order often matters to them.
| chris_st wrote:
| You may be able to tell from this reply that I've never built
| a JSON parser and prettyfier :-) but I'd guess that in this
| case, I'd never have any reason to ask for a bit of the JSON
| "out of order", so I'd just use arrays...?
| noneeeed wrote:
| Curious to read this and not see a reference to Ruby.
|
| In Ruby hashes are ordered by insertion, can have any object as
| keys, but the hash key value can be defined on a class by class
| basis (by overriding the default `hash` method on the class).
|
| The ordering is one of those features that I need very
| infrequently, but when I do it significantly simplifies that bit
| of code.
|
| I've always found the usefulness and flexibilty of Ruby's hashes
| to be a double-edged sword though: they are extremely useful, but
| this leads people to use them in situations where defining a
| proper class might be better for long-term maintenance.
| weatherlight wrote:
| I've used this feature of Ruby Hashes to create something that
| behaves like a a LRU cache :)
| AtlasBarfed wrote:
| Java is treated as the "prior art" like it is some sort of
| ancient script? That is amusing.
|
| Since Java was the first mainstream GC language with sufficient
| man-hours of engineering, and it was an opportunity after almost
| two decades of C/C++ libraries to start anew with the standard
| library, but its standard library always seemed simply
| evolutionary/"best practice" (not my favorite phrase, sorry) from
| what came before it.
|
| Is the hashmap impl in the JDK actually novel for its time?
|
| But
| ErikCorry wrote:
| As I understood it, there was an implementation of "compact
| dictionaries" in Java, but it was not the implementation of any
| of the collections that were built into the language. It is
| "prior art" in that it was done before compact dictionaries
| were reinvented for Python.
| abecedarius wrote:
| IIRC it was for the collections library of the E language,
| whose first implementation was in Java (erights.org). The E
| designers had a self-imposed constraint that everything
| computational had to be deterministic. The standard way of
| doing iterable hashtables lacked this property.
|
| I tried to check my memory by following the link to the
| Raymond Hettinger talk, but he doesn't say just what prior
| work he was talking about, only that it was in Java.
| bullen wrote:
| I wish JavaScript had linked hash maps as default so that JSON
| would not require arrays to keep the order of objects!
|
| Hierarchies would not have to look like this:
| http://root.rupy.se/meta/user/task/563541707719927980
| ajanuary wrote:
| Javascript already does retain insertion order (except for
| special casing of things like integer keys)
|
| JSON objects, however, have unordered keys.
|
| JSON != Javascript.
___________________________________________________________________
(page generated 2021-07-01 23:01 UTC)