https://blog.toit.io/hash-maps-that-dont-hate-you-1a96150b492a

 
 
The Toit Take
 
The Toit Take

Hash maps that don't hate you

Erik Corry
Erik Corry
Follow
Jul 1 * 7 min read

The map data-structure is incredibly useful when programming. The
first time I used Perl the thing that most impressed me was how easy
it was to use hashes, as they call them. I used them for everything!
Map is a central feature in Go too -- it's one of the built-in object
types with syntax support and generics. I think most people who
program in Python have a similar experience. They call them
dictionaries, and they are everywhere. In JavaScript, too, every
object is a sort of a hash map. When I worked on V8 we had to go to
huge lengths to hide the performance consequences of that decision,
but I'm still a fan of easy-to-use hash maps in languages.

But all these languages initially had hash maps that were unordered.
You can put all your key-value pairs in the hash map, but when you
want to go through the content of your map you get them back in some
strange order. It's not the order you put them in, and it's almost
certainly not the order you wanted. Go and Perl decided to fix this
by making hash ordering more random. JavaScript and Python went the
other way and fixed the ordering to be the same order as insertion.
(There's one exception in JS objects -- numeric property names are
ordered numerically, not by insertion order. We got a lot of hate on
V8's bug tracker for that one!)

The best way to make hash maps efficiently insertion ordered is some
variation of what Python calls the "compact dictionary", popularized/
rediscovered by Raymond Hettinger. (According to Hettinger there is
some prior art, implemented in Java, and it's actually similar to the
internal object implementation of V8, invented by Lars Bak, also one
of Toit's co-founders.) In compact dictionaries, the keys and values
are inserted in a linear array, and there is a more conventional open
addressing hash table that indexes into the linear array. It looks
something like this.

[1]

This is what a hash map looks like in Toit, too. In fact, the hash
numbers in the index table are not needed, but they make lookup
faster. Storing the last few bits of the hash code lets us detect
most of the collisions in the index without having to look at the
keys in the backing array. The other advantage of having hash codes
in the index is that it speeds up rebuilding the index when it
becomes too full.

If the user removes an entry from a hash map of this type we replace
the key with a marker that indicates this entry has been removed.
Perhaps a little morbidly, we call these markers "tombstones". When
there are enough of these tombstones, the hash map gets rebuilt, but
that's OK. Due to the magic of amortized constant time, this doesn't
slow you down on average.

This is all very cool because now there's an efficient way to make
hash maps with insertion ordering. These are hash maps that don't
hate you, and that's why we used them in the Dart project too, now
probably best known as the engine behind Flutter. (Some of the Dart
founders went on to found Toit.)

Other ways to know your hash maps don't hate you

The other feature you quickly find yourself needing is the ability to
use arbitrary objects as keys in a hash map. In languages that don't
support that (Perl, Python, Go, JavaScript until recently) you can
use strings as keys, generating a unique string for each key object.
This isn't very elegant, and your program creates a lot of strings
that didn't really need to exist.

Somewhat more advanced is being able to specify the equality function
for the keys in a map. In Java and Toit, for example, you can specify
a custom equals method for the class of the keys in a collection. In
contrast, the map implementation in JavaScript only uses object
identity. You can't create a new object and then discover whether an
equivalent key is already in the map. Again the inelegant workaround
is for each to generate strings and use those as keys.

Going even further, it is sometimes necessary to specify the equality
function per map instead of per-key type. This is available in C++ in
the form of the Hash and Pred template parameters on unordered_map.
(But as the name unordered_map indicates, the standard C++
collections hate you, so this advanced feature is little comfort.)
Currently, we don't support per-map equality functions in Toit, but
it's coming soon! In Java, you would have to wrap the keys in your
map or use Trove.

The fly in the ointment and how to remove it

Even if your language has compact dictionaries, there's still a tiny
fly in the ointment. Under certain circumstances, your hash maps will
still act as if they hate you, but in a new way! Some people like to
use hash maps (and sets, which have the same implementation) as a
sort of deduplicating work queue. You can add "tasks" to the set, and
it will automatically ignore duplicates because it's a set. A task
queue (also called FIFO) needs an append-to-the-end operation and a
remove-from-the-start operation. The append-to-the-end operation we
already have, and it's easy to make a remove-first operation by
looking at the start of the backing array. If the Toit language
didn't already have this function, you could add it as follows:

/// Function that takes a set, removes the and returns the first
element
/// Function that takes a set, removes the and returns the first
element
[1]
[1]

The problem arises after a while of using a set or map like this.
There will be a lot of tombstones near the start of the backing, and
suddenly the remove-from-the-start operation isn't so fast. It needs
to skip over more and more tombstones, and the run time to
insert-then-remove n items becomes quadratic. This issue can't be
solved by rebuilding the hash map more frequently, because if you do
that, you lose the property of amortized constant time operations.

An ad-hoc solution would be to record for each hash map where the
first entry is in the backing array. This feels like a band-aid
though. What if users want to repeatedly remove the last entry using
a set as a sort of deduplicating stack (or LIFO)? The hateful
quadratic hash map problem would be back. What if they want to
repeatedly remove the second entry in the set? One thing I learned on
V8 is that it's hard to predict what your users will do with a
programming language implementation you have built.

And we built Toit to work on tiny devices. Do we really want to spend
an extra word per hash map to record the number of leading
tombstones? The solution we came up with was tombstones with distance
markers. Each tombstone additionally records the distance to the next
real entry, either on the left or the right. We update this lazily
when scanning through the backing, so it doesn't cost any time on
other operations, but it lets us skip long stretches of tombstones
wherever they appear.

[1]
[1]

For the common case where there are no deletions, or the deletions
are randomly distributed, nothing changes, but for the problematic
cases we retain the desirable property of amortized constant time
insertion and deletion.

Testing languages for this bug

I wrote a tiny program to test whether language implementations have
quadratic slowdowns when using a set as a FIFO. I'm only testing
languages that have an insertion-ordered set. For JavaScript I'm
using the new(ish) Set, rather than abusing ordinary objects as hash
maps. In Java I used LinkedHashMap, which has a completely different
internal representation. I suspect LinkedHashMap of being fairly
profligate in memory use (I'll be blogging about this issue soon),
but it's the fastest insertion ordered set I tested, especially when
the adaptive JIT kicks in.

So here, for your enjoyment, is the graph for various languages.
Currently, all the compact-dictionary-based implementations (except
Toit) have this issue. Hopefully, this blog post will inspire the
implementers to fix it.

[1]
[1]
us / operation for various set sizes
[1]
[1]

The Toit Take

Hear the latest news from the Toit team on continuous firmware
delivery for your MCUs.

 

114

  * Hashmap
  * JavaScript
  * Dart
  * Python3
  * Perl

 

114 claps

 

114

The Toit Take

The Toit Take

Continuously update the code on your microcontrollers even over
cellular connections. Monitor and securely service your devices in
production; all through the Toit API.

Erik Corry

Written by

Erik Corry

Follow

The Toit Take

The Toit Take

Continuously update the code on your microcontrollers even over
cellular connections. Monitor and securely service your devices in
production; all through the Toit API.

More From Medium

Consibio chooses Toit to optimize bioprocesses

Nils Westerlund in The Toit Take
 
[1]
[1]

Add mentions in your Flutter app (don't @ me)

Annsh Singh in Flutter Clan
 
[0]
[0]

5 Easy Techniques That Will Improve Your Skills in Programming

Josef Cruz in JavaScript in Plain English
 
[0]
[0]

How to Pass Arguments to setTimeout and setInterval Callbacks

Cristian Salcescu in Frontend Essentials
 
[1]
[1]

Understanding Apache MultiKeyMap

Ganesh Kumar Marimuthu in Nerd For Tech
 
[1]
[1]

Conditional Statements in Python

Banana Chip Tech
 
[1]
[1]

17 Python Tips and Tricks to Boost your Coding Skills

Haider Imtiaz in Python in Plain English
 
[1]
[1]

Getting To Know Flutter: Google Maps Integration

Enrico Ori in TheOtherDev/s
 
[1]
[1]

Learn more.

Medium is an open platform where 170 million readers come to find
insightful and dynamic thinking. Here, expert and undiscovered voices
alike dive into the heart of any topic and bring new ideas to the
surface. Learn more

Make Medium yours.

Follow the writers, publications, and topics that matter to you, and
you'll see them on your homepage and in your inbox. Explore

Write a story on Medium.

If you have a story to tell, knowledge to share, or a perspective to
offer -- welcome home. It's easy and free to post your thinking on any
topic. Start a blog

 

About

Write

Help

Legal

Get the Medium app

A button that says 'Download on the App Store', and if clicked it
will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will
lead you to the Google Play store