[HN Gopher] A good day to trie-hard: saving compute 1% at a time
       ___________________________________________________________________
        
       A good day to trie-hard: saving compute 1% at a time
        
       Author : eaufavor
       Score  : 368 points
       Date   : 2024-09-10 15:03 UTC (7 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | amluto wrote:
       | If you had asked me to make a wild guess as to how Cloudflare
       | stores internal headers and then removes them, I would have come
       | up with some options:
       | 
       | - An entire separate dictionary or other data structure.
       | 
       | - One single header containing all internal metadata.
       | 
       | - All headers have a prefix, and the internal ones start with I
       | and the external ones start with E.
       | 
       | - All internal headers start with "CFInt".
       | 
       | I would not have come up with a scheme in which headers in a
       | particular list are internal. (What if someone else uses this
       | name? What if something forgets to sanitize? What if different
       | simultaneously running programs disagree in the list? What if the
       | Connection header names a Cloudflare-internal header? What if the
       | set-difference algorithm is annoyingly slow?)
       | 
       | The web is already full of obnoxiously ambiguous in-band
       | signaling and header naming, and I find it bizarre that a company
       | with Cloudflare's scale uses such a tedious and error-prone
       | mechanism internally.
        
         | jgrahamc wrote:
         | Yeah, me too, but systems grew over time and grew and grew and
         | we were using HTTP headers for all sorts of stuff. This
         | optimization makes the situation better, but the solution
         | (which is underway) is to use a different mechanism for IPC and
         | get rid of the use of HTTP headers completely.
        
           | DamonHD wrote:
           | And when that happy day dawns, slightly sadly, this nice
           | optimisation will evaporate!
        
             | jgrahamc wrote:
             | Maybe the real treasure was the optimizations we made and
             | discarded along the way.
        
               | DamonHD wrote:
               | Indeed. I am very happy with some of the tricks that I
               | developed (speeding up, and indeed allowing, Solaris C++
               | linking by prefiltering on MD5 hashes of canonicalised .o
               | files -> nm, anyone?) even though they were lost without
               | trace (yep, Lehman Brothers, where are my Imakefiles
               | now?)...
        
               | jkaptur wrote:
               | They're probably as critical as ever at Barclays :)
        
               | DamonHD wrote:
               | I really hope not, given that that very old Solaris
               | compiler should have been laid to rest many many years
               | ago. Yes, some of my stuff had kinda made it into BarCap
               | (even onto my desktop dev machine!) when I turned up
               | there later, and even some at Nomura when I rocked up
               | there again IIRC!
        
               | Twirrim wrote:
               | My likely overly cynical concern here is that this
               | suggests trie-hard will soon end up abandoned, as you're
               | making it sound like it is a stop-gap solution for you.
        
           | throwaway77385 wrote:
           | For anyone not in the loop, the above poster (JGC) is
           | Cloudflare's CTO himself :)
        
           | amluto wrote:
           | I'm certainly familiar with systems growing over time and
           | outgrowing their original schema.
           | 
           | Is this the same phenomenon that resulted in Cloudflare
           | Tunnels apparently being domain names? Or why domains that
           | are proxied by Cloudflare show up as "CNAME" in the DNS
           | panel? (The latter one seems extra strange -- wasn't the
           | reverse proxy service the very first service that Cloudflare
           | offered? There must be some history here.)
        
             | hinkley wrote:
             | All new products are full of scaffolding that has to be
             | removed when robustness becomes a higher priority than
             | existence. Where we fail is not calling it that. Instead we
             | just call some code "good" while our thicker skinned
             | coworkers call the rest "fine". It's not fine. I don't want
             | to be working overtime the week of thanksgiving because you
             | can't think past the horizon.
        
               | gmueckl wrote:
               | There is a very fine line between "good enough" and "not
               | good enough" in any product beyond a certain complexity.
               | Finding the pieces that cross that line can be hard and
               | improving "good enough" parts is (sadly) mostly a waste
               | of time in commercial settings.
        
               | hinkley wrote:
               | You have to backdate starting a project so it happens
               | before things stop being good enough. And take into
               | account bad estimates and the five other deadlines
               | between then and now.
               | 
               | Often it's better to clean as you go. If you have the
               | time and the inclination, shore up one of the least good
               | enough things.
               | 
               | It's not unheard of to open up new feature sets via
               | refactoring. Something that previously would have taken
               | an "impossible" amount of time now becomes profitable,
               | not just possible.
        
               | didgetmaster wrote:
               | This is why some bugs never get fixed (that is if you
               | define a bug to include inefficient code and not just
               | code that breaks something).
        
           | alex_suzuki wrote:
           | You chiming in on this post makes me slightly bitter at
           | having gone with CloudFront (I have everything on AWS so it
           | seemed the obvious choice) instead of Cloudflare.
        
         | efitz wrote:
          | I've worked at several huge corporations in IT security, where
         | we care about headers a lot, and they all use headers in a
         | manner similar to CloudFlare.
         | 
         | Including using proxies at the edge to strip out internal
         | headers bidirectionally- yes, inbound too.
        
           | sitkack wrote:
           | That doesn't make it better, it makes it worse.
        
             | hinkley wrote:
             | Are you on a corporate network? Do you use a firewall at
             | home?
             | 
             | You're on enclaves _all the time_. This is just a different
             | one. Separate networks per class of traffic used to be de
              | rigueur before Cloud. Now it's all munged together.
        
               | sitkack wrote:
               | I understand that security needs to be pragmatic, but
               | because security does it, doesn't make it right.
               | 
               | It isn't about being in the enclave, it is having to keep
                | track of which headers you set vs. external ones. It is fragile
               | and error prone and it will absolutely break someone else
               | when there is a name collision.
               | 
               | All security exploits are a subset of bugs.
        
               | hinkley wrote:
               | And what would you have them do instead? SSL to every
               | service? Reauthenticate on every single call? What's your
               | clever solution?
               | 
               | There's probably a dozen things you can terminate at a
               | boundary that would cost so much to run through the
               | entire system that it would bankrupt a company.
               | 
               | And then there's tracing, which also uses headers.
        
               | immibis wrote:
               | one of the listed options was to begin all the Cloudflare
               | internal header names with "CFInt"
        
               | hinkley wrote:
               | That's still header filtering. And it should be X-CF- by
               | the way.
        
               | amluto wrote:
               | X-CF- seems to be used by some other software. And
               | Cloudflare has plenty of documented headers that don't
               | start with X.
               | 
               | The whole point is that a genuine user header should
               | never conflict with a Cloudflare internal header, and a
               | browser should never see an internal header.
        
               | hinkley wrote:
               | Please reread the entire thread. We are talking about
               | someone who thinks the header solution is stupid. You are
               | both splitting hairs. Stay on target.
        
               | sitkack wrote:
               | Having re-read the thread, I totally agree. What you have
               | listed is table stakes. I'd also say that internal
                | headers would be encrypted to a narrow protection
               | domain.
               | 
               | If all internal headers were prefixed with X-CF-, you
               | could strip them all via SIMD that had no knowledge of
               | any specific header. Hell, you could probably do it on
               | the NIC.
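
    For illustration, the prefix idea in this subthread might look like the
    sketch below ("x-cf-int-" and the header names are hypothetical, not
    Cloudflare's actual prefix, and this uses a plain byte comparison rather
    than SIMD):

```rust
// Sketch: strip every header whose name carries a reserved internal
// prefix, with no per-name list to maintain. "x-cf-int-" is a
// hypothetical prefix, not Cloudflare's actual one.
const INTERNAL_PREFIX: &str = "x-cf-int-";

fn strip_internal(headers: Vec<(String, String)>) -> Vec<(String, String)> {
    headers
        .into_iter()
        // Header names are case-insensitive, so normalize before testing.
        .filter(|(name, _)| !name.to_ascii_lowercase().starts_with(INTERNAL_PREFIX))
        .collect()
}

fn main() {
    let headers = vec![
        ("Host".to_string(), "example.com".to_string()),
        ("X-CF-Int-Colo".to_string(), "sfo".to_string()),
        ("Accept".to_string(), "*/*".to_string()),
    ];
    let out = strip_internal(headers);
    assert_eq!(out.len(), 2); // the internal header is gone
}
```

    The filter needs no knowledge of individual header names, which is what
    makes the prefix scheme amenable to wide (SIMD-style) scanning.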
        
               | faitswulff wrote:
               | It took me about 10 years to hear about it, but the IETF
               | tried to deprecate the X- prefix back in 2012:
               | https://datatracker.ietf.org/doc/html/rfc6648
        
               | hinkley wrote:
               | Good to know.
               | 
               | (Also sibling is right, I spaced on X-CF- being a header
                | sent to CF customers' servers. I don't use Cloudflare
                | but CloudFront does the exact same thing)
        
               | the8472 wrote:
               | Configure all machines in a way that they can survive on
               | some random untrusted public wifi. This should be the
               | obvious default stance for laptops and phones.
               | 
               | But even for workstations wired to a trusted LAN it still
               | makes sense because you never know which of the various
               | tunnels might assign and expose some IPv6 address to the
               | internet.
               | 
               | For servers you might be able to make an exception if you
               | have a vigilant IT and the people with admin access
               | aren't actively trying to circumvent security, but even
               | then that better not be your only layer of security.
        
           | hinkley wrote:
           | The first large scale piece of software I worked on was for
           | telcos pre smart phone. We used internal headers to offload
           | authentication and terminate SSL. We also had to pressure F5
           | to fix about half a dozen bugs in BIG-IP to do so. Bugs that
           | should in no universe have existed in version 9 of a product.
           | 
           | I used to joke that F5 owed me and my coworker 3 months of
           | salary for all the free QA we did for them.
        
             | tmsite wrote:
             | It helps if you realize that BIG-IP 9.0 was essentially a
             | from-scratch rewrite of BIG-IP 4.5. Among other major
             | redesigns, the data plane was moved from BSD kernel space
             | to Linux user space. Internally, the joke was that it would
             | be two times better when we were done (4.5 * 2 = 9.0). It
             | probably was, but not on day zero.
        
               | hinkley wrote:
               | On day sixty it couldn't do SSL termination and cookie
               | based traffic shaping didn't work.
               | 
               | I was a little bummed it was "just" a Linux box but
               | that's pretty common today. I hadn't discovered dd-wrt
               | yet and wouldn't for a few years.
        
         | 620gelato wrote:
          | Or perhaps even insert yet another header containing just the
          | list of internal headers added to the request, assuming this
          | happens in a single place; otherwise it's a recipe for
          | disaster.
         | 
         | I have a slightly different example of this, where a rpc
         | framework used in my company disallows the service owner from
         | modifying certain headers (say request identifier), and will
         | instead create a duplicate header with a suffix. In that
         | scenario at least, I can see this as a fairly reasonable
         | tradeoff, as the goal is to control certain headers from being
         | modified, not because they are platform internal but because
         | there are certain assumptions associated with those headers
         | that are followed company-wide.
         | 
         | I'll go check what mechanism is used for this matching.
        
           | amluto wrote:
           | This doesn't work if the front end is older than an internal
           | node and a header gets added, at least not without extra
           | logic to patch everything up and handle conflicts.
           | 
           | The HTTP Connection header works like this, and one should
           | generally assume that almost nothing implements it correctly.
        
           | adrianmonk wrote:
           | > _yet another header with just the list of internal headers_
           | 
           | Or the same but with a list of headers which AREN'T internal.
           | 
           | You'll probably have a custom header-adding function that
           | people should always use instead of the regular one. And this
           | way, if someone forgets to use it, their header will get
           | stripped.
           | 
           | You can think of a header escaping the internal network as
           | something that needs to be authorized. This is a deny by
           | default approach.
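
    A minimal sketch of that deny-by-default idea, with hypothetical names
    (a real system would hang this off its header-adding API):

```rust
use std::collections::HashSet;

// Sketch: headers are internal unless explicitly added through
// `add_external`, which records them in an allow set. At the edge,
// anything not in the set is dropped (deny by default), so a header
// added through the wrong function is stripped rather than leaked.
struct Headers {
    entries: Vec<(String, String)>,
    external: HashSet<String>,
}

impl Headers {
    fn new() -> Self {
        Headers { entries: Vec::new(), external: HashSet::new() }
    }
    fn add_internal(&mut self, name: &str, value: &str) {
        self.entries.push((name.to_string(), value.to_string()));
    }
    fn add_external(&mut self, name: &str, value: &str) {
        self.external.insert(name.to_string());
        self.entries.push((name.to_string(), value.to_string()));
    }
    // Called once at the boundary before the response leaves the network.
    fn strip_for_client(&mut self) {
        let external = &self.external;
        self.entries.retain(|(name, _)| external.contains(name));
    }
}

fn main() {
    let mut h = Headers::new();
    h.add_internal("cf-internal-flag", "1"); // hypothetical name
    h.add_external("content-type", "text/html");
    h.strip_for_client();
    assert_eq!(h.entries.len(), 1);
    assert_eq!(h.entries[0].0, "content-type");
}
```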
        
             | 620gelato wrote:
             | Ah yes - perfect.
        
         | taeric wrote:
          | Other ideas:
          | 
          |  - Ensure all headers added internally that are not for export
          |    are at the front of the header list.
          | 
          |  - Preseed the hashtable of all requests with internal headers
          |    you plan to remove.
         | 
         | In fact, if you preseed, you are basically combining these
         | ideas but fixing how many internal headers are on each request.
         | At that point, you can use a linked hash table that preserves
         | creation order and just remove the first N from the final list
         | that you send back to clients.
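
    The preseed-and-skip idea above might be sketched like this, using a
    plain insertion-ordered Vec in place of a linked hash table (all
    header names are hypothetical):

```rust
// Sketch: keep headers in insertion order, with all internal headers
// seeded at the front. Stripping for the client is then just skipping
// a fixed prefix of the list -- no per-header name checks at all.
struct Headers {
    entries: Vec<(String, String)>,
    internal_count: usize, // how many leading entries are internal
}

impl Headers {
    fn with_internal(internal: Vec<(String, String)>) -> Self {
        let internal_count = internal.len();
        Headers { entries: internal, internal_count }
    }
    fn push(&mut self, name: &str, value: &str) {
        self.entries.push((name.to_string(), value.to_string()));
    }
    // Iterate everything (for internal hops between services).
    fn iter_all(&self) -> impl Iterator<Item = &(String, String)> {
        self.entries.iter()
    }
    // Iterate only client-visible headers (skip the seeded prefix).
    fn iter_external(&self) -> impl Iterator<Item = &(String, String)> {
        self.entries.iter().skip(self.internal_count)
    }
}

fn main() {
    let mut h = Headers::with_internal(vec![
        ("cf-internal-a".to_string(), "1".to_string()), // hypothetical names
        ("cf-internal-b".to_string(), "2".to_string()),
    ]);
    h.push("content-type", "text/html");
    assert_eq!(h.iter_all().count(), 3);
    assert_eq!(h.iter_external().count(), 1);
}
```

    This only works if every hop preserves insertion order and nothing
    inserts an internal header after external ones, which is the fragile
    part of the scheme.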
        
           | amluto wrote:
           | > At that point, you can use a linked hash table that
           | preserves creation order and just remove the first N from the
           | final list that you send back to clients.
           | 
           | While Python provides data structures like this out of the
           | box, designing a big system to require a special, nonstandard
           | data structure to identify private metadata seems like a
           | mistake to me.
        
             | taeric wrote:
             | I'm not sure I follow? Does rust not have the equivalent of
             | a LinkedHashMap from Java? All I'm proposing is to start
             | each map of headers off with a sentinel value for all of
             | the headers you don't intend to send back to the client.
             | Then, have a method that iterates over all values, for the
             | places you need to send to internal servers, and another
             | (probably the default?) that starts iteration after all of
             | the internal items, which should just be a different
             | pointer. At this point, there is no need to inspect each
             | header by name, as that was effectively done at compile
             | time when you put the protected headers at the front of the
             | list.
             | 
             | Is this somewhat special? I mean, sure. But it wasn't long
             | ago that a hash table was already in the special territory
             | of data structures.
             | 
             | Edit: I should add that I'm not certain this idea would be
             | better. Just more ideas on what you could do.
        
               | amluto wrote:
               | There's a third party crate for this. C++'s STL also
               | doesn't have it by default.
               | 
               | The creation-order-preserving hash map is basically two
               | data structures in one, it's more complex and slower than
               | a normal hash map, and IMO it's rather bizarre and I have
               | never really understood why anyone wants one.
        
               | Conscat wrote:
               | It's especially unfortunate that the intuitive name,
                | `std::map`, is the ordered and generally least useful
               | option of the standard library's three hash map
               | containers.
               | 
               | My last job did rely on ordered-ness of `QMap`, because
               | the elements are listed in a Qt GUI and it could confuse
               | users if those widgets randomly rearranged themselves.
               | That's the only use-case I've personally encountered for
               | an ordered hash map.
        
               | taeric wrote:
               | I would be surprised if it was dramatically slower, all
               | told. Though, fair that I have not benchmarked in a long
               | time, here.
               | 
               | My main "idea" here is to stop from having to check all
               | of the headers on a "per header" basis. Yes, you can make
                | each check faster, as the trie does. You can also remove
               | the entire reason to check individual headers by starting
               | a traversal after where the internal headers are.
               | 
               | You could also go with something like making sure the
               | first header is "InternalHeaderCount: N" where you then
                | keep the internal headers segregated from the others by a
               | simple count. (This assumes you have solid control over
               | what is adding the headers internally, of course.)
               | 
               | (I also forgot to give kudos on the blog title. Any Die
               | Hard reference is a good reference. :D )
        
         | giancarlostoro wrote:
         | > What if someone else uses this name?
         | 
          | Couldn't you bypass this by prefixing the ones that aren't
         | yours? Or prefixing your internal ones with something unique
         | enough?
        
         | e63f67dd-065b wrote:
         | Former employee here; the interesting (horrifying?) fact is
         | that you can _set_ some of these internal headers (I remember a
         | notable bug with cf-cache-status) in workers (the serverless
         | offering) and cause all kinds of bad things to happen :)
        
         | hn_throwaway_99 wrote:
         | I feel like I can point out lots of similar problems to the
          | other solutions you suggest (heck, I think some of those
         | problems you list even _apply_ to those other solutions).
         | 
         | The list approach has some downsides, but it also has a bunch
         | of upsides. I feel like when people like to point out potential
         | flaws of these approaches, they're ignoring the history and
          | difficulties that come with Cloudflare's scope. An enumerated
         | list is the simplest and most flexible of the approaches, and
         | it also doesn't require any a priori agreement on the structure
         | of a header key - this is probably important when you think
         | about the sheer number of teams at Cloudflare, potential
         | technology acquisitions, etc.
        
           | amluto wrote:
           | Lists, extensible protobufs, etc are indeed great in their
           | extensibility. The issue I would see (if I worked at
            | Cloudflare) isn't using a list -- it's that the list is
           | the _same_ list that's used for externally visible HTTP
           | headers.
        
         | danbruc wrote:
         | This might of course introduce quite a bit of overhead, but the
         | clean solution would arguably be to not mix requests to begin
          | with: the external requests you are processing are the payload
         | of the requests you are sending around internally. This would
         | also allow you to cook up an optimized encapsulation protocol
         | that might be more efficient to process than parsing and
         | modifying HTTP. This admittedly comes from someone with
         | essentially zero knowledge of what exactly Cloudflare is doing
         | to the requests they receive.
        
       | Scaevolus wrote:
       | Did you try a (tiny) bloom filter? Doing a quick convolution of
       | the header key and a test against a bloom filter (SIMD? Using
       | builtin CRC ops? 256-bit bloom filter size?) might avoid walking
       | the trie altogether in most cases at the cost of a few cycles.
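
    As a rough illustration, a 256-bit filter small enough to live in four
    64-bit words could look like the following (FNV-1a stands in here for
    whatever CRC/SIMD mixing one would actually use):

```rust
// Sketch: a 256-bit Bloom filter held in four u64 words, small enough
// to stay in registers. Two cheap hashes (seeded FNV-1a) set/test two
// bits per key. A negative answer is definite; a positive answer still
// requires the real lookup (e.g. walking the trie) due to false positives.
struct TinyBloom([u64; 4]);

fn fnv1a(s: &str, seed: u64) -> u64 {
    let mut h = 0xcbf29ce484222325u64 ^ seed;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

impl TinyBloom {
    fn new(keys: &[&str]) -> Self {
        let mut f = TinyBloom([0; 4]);
        for k in keys {
            for seed in [0u64, 1] {
                let bit = (fnv1a(k, seed) % 256) as usize;
                f.0[bit / 64] |= 1u64 << (bit % 64);
            }
        }
        f
    }
    fn may_contain(&self, key: &str) -> bool {
        [0u64, 1].iter().all(|&seed| {
            let bit = (fnv1a(key, seed) % 256) as usize;
            self.0[bit / 64] & (1u64 << (bit % 64)) != 0
        })
    }
}

fn main() {
    // Hypothetical internal header names; most misses would be rejected
    // here without walking the trie.
    let f = TinyBloom::new(&["cf-internal-a", "cf-internal-b"]);
    assert!(f.may_contain("cf-internal-a"));
    assert!(f.may_contain("cf-internal-b"));
}
```

    Note this still hashes the whole key, which is the cost the article's
    trie was trying to avoid; the win would have to come from the hash
    being cheaper than a trie walk for typical misses.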
        
         | jgrahamc wrote:
         | I do not know if the team did that but Bloom filters aren't a
         | panacea: https://blog.cloudflare.com/when-bloom-filters-dont-
         | bloom/
        
           | Scaevolus wrote:
           | Yeah, for this problem you'd want a bloom filter that can
           | entirely fit in vector registers during the computation,
           | which should work well for a few hundred items.
        
           | vlovich123 wrote:
           | A bloom filter will have the same weaknesses of needing to do
           | a hash. For static filters like this though where you have
           | the set of strings to check against known up front, something
           | like a binary fuse filter could work really well as it's
           | significantly faster than bloom filters when you don't need
           | online construction (you still need to do a hash of course
           | but the evaluation against std::HashMap is flawed since it
           | uses SipHash instead of modern constructions like xxh3 or
           | gxhash).
        
         | jkaptur wrote:
         | I think that would require calculating the hash, which was
         | ruled out as too slow.
        
         | pragma_x wrote:
         | Considering that each node's filter in their trie
          | implementation is bloom-like already, the solution is
          | awfully close. I agree that testing wider swaths of the string at once
         | might be a dramatic accelerator.
         | 
         | A more naive approach might also be warranted. Some frequency
         | analysis of hit/miss for trie nodes might allow them to find
         | specific character positions with a higher miss rate than the
         | first one. Testing such special positions first would speed
         | things up. That assumes, of course, that the header data is
         | fairly regular in nature.
        
       | IncreasePosts wrote:
       | Off topic: how does 60M requests a second stack up with other big
       | players like google, Facebook, TikTok?
        
         | xyst wrote:
         | This 2013 research article from facebook indicates memcached
         | handled "billions of requests per second" [1].
         | 
         | [1] https://research.facebook.com/publications/scaling-
         | memcache-...
        
           | joshuamorton wrote:
           | I will note that http requests and internal requests are
           | distinct.
           | 
            | The Zanzibar white paper from Google in 2019, for example,
            | lists top-line RPCs to Zanzibar at around 10M qps, but
            | internal (in-memory) cache hits at 200M qps and non-cache
            | internal fan-out at >20M qps. Zanzibar of course isn't HTTP,
           | but is a good public example of internal rpc fan-out.
           | 
            | I'd expect Cloudflare similarly has some extremely high qps
           | internal services.
        
       | miso2024 wrote:
       | Finally a blog post with a trie. Those trie leetcodes didn't go
       | in vain ;)
        
         | marcosdumay wrote:
         | You will find them in any fast spelling or grammar verifier.
        
           | koito17 wrote:
           | Routing libraries as well. In Clojure, the reitit library
           | uses a prefix tree implementation when the route table is
           | large enough for linear search to be non-optimal.[1] The Go
           | ecosystem also has httprouter, which uses prefix trees under
           | the hood.
           | 
           | [1] https://github.com/metosin/reitit/blob/master/doc/perform
           | anc...
           | 
           | [2] https://github.com/julienschmidt/httprouter#how-does-it-
           | work
        
         | steve_adams_86 wrote:
         | I've used them in real code! They are pretty intuitive for some
         | use cases. The last time I used a trie was for address lookup.
         | I even used a generator, haha. Double whammy leetcode in real
         | life situation I guess.
         | 
         | It's definitely not a common tool I reach for though.
        
         | ayuhito wrote:
         | I recently got a chance to find a practical use for a modified
         | hybrid trie which was fun to implement! Lots of optimisations
         | to go around.
         | 
         | It's a nifty user-agent parser for Go:
         | https://github.com/medama-io/go-useragent
         | 
         | Typically this sort of problem relies on lots of regex parsing,
         | so it was nice to take a more novel approach.
        
       | cwilby wrote:
       | > Internally, we call the request's destination server its
       | "origin"
       | 
       | Origin: The point at which something comes into existence or from
       | which it derives or is derived.
       | 
       | How can the request's destination server be the origin, if it is
       | the destination server?
        
         | toast0 wrote:
         | The origin is the origin of the content.
        
           | cwilby wrote:
           | Thus cross origin resource sharing. Thanks!
           | 
           | At least it's consistently inconsistent.
        
         | tills13 wrote:
         | On the internet, the origin is the server sending the response
         | to the user. I suppose you can look at it from the perspective
         | of the owner of the server -- from their frame of reference,
         | their journey _starts_ when they receive, process, and respond
         | to the request.
         | 
         | Granted, computer scientists are infamously known for being
         | terrible at naming things.
        
           | hinkley wrote:
           | Had any fun with "Referer" headers lately?
        
         | fowl2 wrote:
         | Depends on your perspective :)
         | 
         | "Origin" is a term from web browsers, from their point of view
         | it refers to where a web page came from.
         | 
         | see https://developer.mozilla.org/en-
         | US/docs/Web/Security/Same-o...
        
           | taeric wrote:
           | This hilariously reminds me of the complaining about X server
           | semantics back in the day.
        
       | evmar wrote:
       | This post uses a fancy data structure to construct a to_remove
       | set, then iterates through it to remove these from the underlying
       | header map.
       | 
       | It appears the "remove_header" call is this one:
       | https://docs.rs/pingora-http/0.3.0/src/pingora_http/lib.rs.h...
       | which calls .remove() on two other data structures, both of which
       | bottom out in this mountain of code:
       | https://docs.rs/http/latest/src/http/header/map.rs.html#1550
        
       | aidenn0 wrote:
        | Given that the set of items to match is static, I wonder if they
       | tried a perfect hash table; that would reduce it to a few
       | arithmetic operations followed by a single string compare. It
       | would be interesting to see how it compares to the trie.
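
    A toy version of the perfect-hash idea, with hypothetical header
    names; in practice a tool such as gperf or the rust-phf crate would
    derive the index function rather than it being hand-picked:

```rust
// Sketch of a perfect-hash-style lookup for a small static set.
// The index function (length + first byte, mod table size) happens to
// be collision-free for these three hypothetical names; a generator
// like gperf or rust-phf finds such functions automatically.
const KEYS: [&str; 3] = ["cf-internal-colo", "cf-ray-id", "cf-worker-tag"];

fn slot(name: &str) -> usize {
    (name.len() + name.as_bytes()[0] as usize) % 8
}

fn build_table() -> [Option<&'static str>; 8] {
    let mut t = [None; 8];
    for k in KEYS {
        let s = slot(k);
        assert!(t[s].is_none(), "index function must be collision-free");
        t[s] = Some(k);
    }
    t
}

fn is_internal(table: &[Option<&str>; 8], name: &str) -> bool {
    // One arithmetic index, then a single string compare to confirm.
    !name.is_empty() && table[slot(name)] == Some(name)
}

fn main() {
    let t = build_table();
    assert!(is_internal(&t, "cf-ray-id"));
    assert!(!is_internal(&t, "host"));
}
```

    Misses cost a couple of arithmetic ops plus at most one compare; hits
    still pay one full string comparison, as a later commenter notes about
    gperf's generated code.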
        
       | sophacles wrote:
       | So this trie uses a custom u256 type that is a wrapper around an
        | array of 4 u64s. Is Rust smart enough to vectorize it into AVX
       | instructions for the bit twiddling?
       | 
       | What about for the other integer sizes that the trie supports?
        
       | mightyham wrote:
       | I'm not very well versed in data structure optimization, but I
       | was surprised they dismissed hash tables so quickly, especially
       | when the table they are searching through is static. I find it
       | somewhat hard to believe that a specially optimized hash table
       | would not be faster than their trie implementation.
        
         | supermatt wrote:
         | They mentioned it is the overhead of the hashing that makes it
         | slow compared to the trie.
        
           | hooli42 wrote:
           | Hash functions can be as simple as a single modulo.
        
             | ironman1478 wrote:
             | As they said though, the hash function on a string type
             | requires looking at every element in the string. Also,
             | modulo is historically very slow. So, they still have to
             | deal with O(n) operations, where n is the length of the
             | input string. If they can improve the memory hopping
             | problem associated with lists / graph structures (which
             | they claim to have done in their library), then a trie
              | would be fast enough, which is what they
             | observed. Combined with the fact that they claim that 90%
             | of the time there is a miss when querying the trie, then
             | you exit early a lot, whereas you always need to compute
             | the whole hash on the input string when doing the hash map
             | strategy.
        
               | hervature wrote:
               | > As they said though, the hash function on a string type
               | requires looking at every element in the string.
               | 
               | Maybe for the default hash function. As another commenter
               | pointed out, your data may make the following hash very
               | effective: s[0] + s[len(s)//2] + s[-1] which would be
                | very fast. The point is that spending a day seeing if
               | such a hash exists is worth it.
        
               | mightyham wrote:
               | The hash does not need to be computed on the whole
               | string. I pointed this out in my other comment but just
                | as an example: a hash function could be as simple as
               | xoring the first 16 and last 16 bits of the string then
               | indexing a 2^16 array. That means hashing is two pointer
               | offsets and an xor (no modulo required). If there are 100
               | strings that need to be removed, then ~99% of rejections
               | will be a very fast O(1).
               | 
               | And in the case of a match, a fast hash + memcmp will be
               | way faster than a trie traversal. In fact, according to
               | the trie-hard github readme, std::HashMap is already much
               | faster than their trie implementation when searching for
               | a match.
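A sketch of that rejection-table idea, with all names and sizes illustrative (a real deployment would also need to check the chosen bits for collisions among the keys themselves):

```rust
// Xor the first and last 16 bits of the key and index a 2^16 table.
// A `false` answer is definitive; a `true` answer still needs a full
// byte comparison to rule out collisions.
fn tiny_hash(s: &[u8]) -> u16 {
    // Assumes keys are at least 2 bytes long.
    let head = u16::from_le_bytes([s[0], s[1]]);
    let tail = u16::from_le_bytes([s[s.len() - 2], s[s.len() - 1]]);
    head ^ tail
}

struct RejectFilter {
    seen: Box<[bool; 1 << 16]>, // 64 KiB bitmap of occupied slots
}

impl RejectFilter {
    fn new(keys: &[&[u8]]) -> Self {
        let mut seen = Box::new([false; 1 << 16]);
        for k in keys {
            seen[tiny_hash(k) as usize] = true;
        }
        RejectFilter { seen }
    }

    // Two pointer offsets and an xor; most lookups end here.
    fn might_contain(&self, key: &[u8]) -> bool {
        self.seen[tiny_hash(key) as usize]
    }
}

fn main() {
    let keys: [&[u8]; 2] = [b"cf-ray", b"cf-cache-status"];
    let filter = RejectFilter::new(&keys);
    assert!(filter.might_contain(b"cf-ray"));
    assert!(!filter.might_contain(b"accept-encoding"));
}
```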
        
           | mightyham wrote:
           | The point I'm making is that a specially optimized hashing
           | function would probably blow away any trie traversal. When
           | you know the data being hashed before hand it is possible to
           | make custom hash functions that are as simple as a few
           | bitwise operations on the first, middle and last character of
           | the string (just an example).
        
             | dividuum wrote:
             | Right. Out of curiosity I looked into what code gperf
             | generates and it seems it would end up O(1) for misses
             | (it's always three lookups into a generated table) and O(L)
             | for hits, as a potential hit has to be confirmed by
             | essentially a strcmp. Not sure how something like
             | https://github.com/rust-phf/rust-phf would work, but
             | probably similar?
        
             | treis wrote:
                | Isn't the sticky wicket collisions with headers defined
                | externally to CF?
        
             | Denvercoder9 wrote:
             | They don't know all the data that's being hashed, though.
             | They know everything that's in the map, but not all the
             | keys they're trying to lookup.
        
           | vlovich123 wrote:
           | If they just used the default, that's SipHash which is
           | optimized for security and DOS resistance, not performance.
           | XXh3 is ~10x faster than that and there's some newer ones
           | that are even faster than Xxh3 like GxHash (like another
           | ~1.5x-2x faster).
           | 
           | You would precompute the hash for the constant string list &
            | then compute the hash only for the set of strings that
           | begin with C/c since "cf-" is a common prefix for those
           | internal headers if I recall correctly, using RawTable to
           | find entries by hash & then comparing against the precomputed
           | value to see if a collision matched before checking a string
           | for equality.
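The precompute-and-compare idea can be sketched without hashbrown's RawTable; FNV-1a stands in for XXH3/GxHash here, and the header names are placeholders:

```rust
// Store (hash, key) pairs for the constant list; on lookup, compare
// hashes first so the byte-level equality check only runs on hash
// collisions or real matches.
fn fnv1a(s: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in s {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

struct PrehashedSet {
    entries: Vec<(u64, &'static [u8])>,
}

impl PrehashedSet {
    fn new(keys: &[&'static [u8]]) -> Self {
        PrehashedSet {
            entries: keys.iter().map(|k| (fnv1a(k), *k)).collect(),
        }
    }

    fn contains(&self, key: &[u8]) -> bool {
        let h = fnv1a(key);
        self.entries.iter().any(|&(eh, ek)| eh == h && ek == key)
    }
}

fn main() {
    let keys: [&'static [u8]; 2] = [b"cf-ray", b"cf-cache-status"];
    let set = PrehashedSet::new(&keys);
    assert!(set.contains(b"cf-ray"));
    assert!(!set.contains(b"accept"));
}
```

The linear scan over ~100 precomputed hashes is itself cheap; the expensive part it avoids is repeated full-string comparison.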
        
         | 5kg wrote:
         | There is FxHashMap (https://github.com/rust-lang/rustc-hash)
         | which is faster than std::collections::HashMap.
         | 
         | With ~100 static entries, o1hash
         | (https://github.com/rurban/smhasher/blob/master/o1hash.h)
         | should also work.
        
           | rapsey wrote:
           | Note FxHashMap is just the std HashMap with a different
           | hashing algorithm.
        
         | samatman wrote:
         | A hash function can't reject a string with a single test of the
         | first byte. It just can't.
         | 
         | For this application, that's a leg up which a hash won't be
         | able to touch. The rest of the art here is shaving down the
         | constant factor of a trie far enough that the initial boost
         | pays off in real-world performance.
        
       | MehdiHK wrote:
       | Slightly off topic: I was a bit surprised nobody ported space-
       | efficient marisa trie to Rust, or any other languages.
       | 
        | Does anybody know why?
       | 
       | https://github.com/s-yata/marisa-trie
        
       | hyperpape wrote:
       | I'm interested to know why the regex crate didn't do better. I
       | believe it should compile a search for multiple literal strings
       | into an Aho-Corasick automaton, which is structured very much
       | like a trie.
        
         | burntsushi wrote:
         | regex crate author here. From a quick look at trie-hard, my
         | guess is that trie-hard can do lookups on short strings much
         | more quickly than the regex crate can do matches on short
         | strings. The regex crate has a fair bit of overhead for
         | selecting the right matching engine and applying other
         | optimizations appropriate to general purpose matching.
         | 
         | But the aho-corasick crate, and especially the specific
         | automatons, would be interesting to try. There's much less
         | overhead there. But, there are some differences:
         | 
         | * The contiguous NFA in aho-corasick has a lot more going on
         | inside its "next transition" routine:
         | https://github.com/BurntSushi/aho-corasick/blob/cd400ad792d6...
         | --- This is primarily because an Aho-Corasick automaton isn't
         | just a trie. It also has failure transitions to support
         | unanchored search. You can search without respecting failure
         | transitions (and thus treat it as a trie), but the code is
         | still there to deal with failure transitions. So that may have
         | deleterious effects.
         | 
         | * The contiguous NFA also specializes its transition function
         | depending on the size of the node. It would be interesting to
         | see whether these techniques would be useful for trie-hard to
         | play with. But there is definitely a benefit to keeping one
         | simple and fast code path for all node types. (Less branching.)
         | 
         | * The aho-corasick crate does also expose a DFA directly:
         | https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct....
         | --- In theory this should do less work per transition than
         | trie-hard, and so it would be very interesting to measure its
         | perf when compared to trie-hard. The main downside is that it's
         | a DFA and can use up a lot more memory and take longer to
         | build. It seems like Cloudflare cares less about that in favor
         | of optimizing read perf though, so it's not clear why they
         | didn't go this route.
         | 
         | * The aho-corasick crate doesn't let you store arbitrary
         | values. You only get back a PatternID when a match occurs. So
         | to get equivalent functionality as trie-hard, you'd need a
         | separate map of PatternID->YourValue. Since that isn't stored
         | with the data in the trie, this alone might penalize you quite
         | a bit with poorer CPU cache performance.
         | 
         | * trie-hard seems to provide some customization options for its
         | representation. i.e., Let's you use a `u8` to store its
         | transitions versus a `u128` or `u256`. This might also help
         | with cache usage for a small enough trie.
        
           | vlovich123 wrote:
           | > can use up a lot more memory and take longer to build
           | 
           | Does it support const construction so that you build it
           | during compilation from a static set of strings?
        
             | burntsushi wrote:
             | Nope. That doesn't usually help in practice because it's
             | not hard to amortize construction with a
             | std::sync::LazyLock.
             | 
             | (The `memchr` crate doesn't even support const construction
              | of a _single_ substring searcher. I haven't spent much
             | time thinking about it because `const` is still pretty
             | limited in what it supports, although progress is being
             | made. For example, you can't use trait methods in `const`.
             | So while you can undoubtedly use `const` Rust to build _a_
             | single substring searcher, whether `const` Rust can be used
              | to build the one in `memchr` specifically isn't clear to
             | me. And if it could be done, it will undoubtedly come with
             | development/code-clarity costs and it's not clear to me
             | that those are worth paying. Because as I said, amortizing
             | construction for a static set of strings is usually very
             | simple to do.)
             | 
             | And in any case, "takes a long time to build and uses a lot
             | of memory" isn't avoided by doing it at compile time.
             | Instead, you get bloated binaries and longer compile times.
             | Which costs matter more? Dunno, especially when you
             | amortize construction at runtime.
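The LazyLock amortization pattern looks roughly like this (placeholder header names; a plain HashSet stands in for whatever searcher is actually being built):

```rust
use std::collections::HashSet;
use std::sync::LazyLock;

// Build the static set once, on first use, instead of at compile time.
// After the first call the cost is a single atomic load plus the lookup.
static INTERNAL_HEADERS: LazyLock<HashSet<&'static str>> =
    LazyLock::new(|| {
        ["cf-ray", "cf-connecting-ip", "cf-cache-status"]
            .into_iter()
            .collect()
    });

fn is_internal(name: &str) -> bool {
    INTERNAL_HEADERS.contains(name)
}

fn main() {
    assert!(is_internal("cf-ray"));
    assert!(!is_internal("accept"));
}
```

(`std::sync::LazyLock` is stable since Rust 1.80; earlier code used the `once_cell` crate for the same pattern.)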
        
               | vlovich123 wrote:
               | Yeah I guess I was mainly just asking "is const powerful
               | enough yet for this to be a trivial addition".
               | 
               | > Instead, you get bloated binaries and longer compile
               | times
               | 
               | Longer compile times probably are solved by having the
               | strings live in a standalone compilation unit which the
               | compiler will end up caching and you can only dirty it by
               | adding a new string. Needing to shuffle that extra data
               | around might still cost extra compile time, but in a real
               | world program I suspect that's likely to not matter (both
               | in terms of binary size & that other things in the
               | program will be more expensive to compile).
               | 
               | That being said I agree that doing a LazyLock is probably
               | sufficient and given the amount of entries being compared
               | against acquiring the lazy lock probably also gets
               | amortized (it's cheap but not free).
        
           | samatman wrote:
           | This is a gratifying read (along with the article) because
           | I've been working on closely related things, and had come to
           | the tentative conclusion that taking Aho-Corasick, anchoring
           | it (so no failure transitions), and laying it out in
           | contiguous memory, basically results in Liang's packed trie
           | from TeX's hyphenation algorithm.
           | 
           | trie-hard seems to be effectively that, but with some clever
           | masking to get mileage out of having a limited byte
           | vocabulary to deal with.
           | 
           | The DFA might pull ahead, but I wouldn't just bet on that.
           | Branch predictors don't like lookup tables much, and the
           | memory locality hit is small but real: more of a packed trie
           | fits in a cache line than would a DFA.
           | 
           | You seem to have landed on similar conclusions here: beating
           | a packed trie for this specific use case is going to be
           | tough.
           | 
           | I also like that they hit on the same popcount-and-shift
           | technique I've used in a data structure for UTF-8 character
           | sets: https://github.com/mnemnion/runeset. I admittedly
           | haven't benchmarked this much, correctness first and all, but
           | seeing that Cloudflare has seen good numbers for something
           | closely related reinforces the few tests I have run.
           | 
           | As a text-matching connoisseur, you may find the runeset
           | interesting. It turns out to be hard to search for papers on
           | "UTF-8 character set", so if you're aware of any prior art
           | here I'd love a heads-up. It's fundamentally a generalization
           | of using u64 bitmasks for sets of ASCII, which is as old as
           | the hills at this point.
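The popcount-and-shift trick for a single node can be sketched like this; the lowercase-ASCII-only alphabet and the node layout are simplifying assumptions, not trie-hard's or RuneSet's actual representation:

```rust
// A u64 bitmask records which children exist, and the popcount of the
// bits below a child's position (its "rank") gives its index into a
// dense child array, so absent children cost no storage.
struct Node64 {
    mask: u64,            // bit i set => child for byte b'a' + i exists
    children: Vec<usize>, // dense array, one slot per set bit
}

impl Node64 {
    fn child(&self, byte: u8) -> Option<usize> {
        // Assumes byte is in b'a'..=b'z'; a real node handles more.
        let bit = (byte - b'a') as u32;
        if self.mask & (1 << bit) == 0 {
            return None; // most misses exit here, on the first byte
        }
        // Rank: count set bits strictly below `bit`.
        let rank = (self.mask & ((1u64 << bit) - 1)).count_ones() as usize;
        Some(self.children[rank])
    }
}

fn main() {
    // Children for 'a', 'c', 'd' stored densely as [10, 20, 30].
    let node = Node64 {
        mask: (1 << 0) | (1 << 2) | (1 << 3),
        children: vec![10, 20, 30],
    };
    assert_eq!(node.child(b'c'), Some(20));
    assert_eq!(node.child(b'b'), None);
}
```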
        
             | burntsushi wrote:
             | RE DFA: possibly, but not completely obvious to me. The DFA
             | in aho-corasick doesn't typically have 256 different
             | transitions per state. It uses "byte classes" to represent
             | a typically smaller number of equivalence classes. Two
             | bytes are in the same equivalence class if a match/non-
             | match cannot be distinguished between them.
             | 
             | To be clear, I'd bet trie-hard is still more space
             | efficient. And using byte classes does require an extra
             | look-up per transition.
             | 
             | But, worth measuring.
             | 
              | As for your RuneSet thing, I'd look at lightgrep and
              | icgrep. I'm not 100% sure they have relevant work for you, but
             | they might. In practice, I just combine sets of codepoints
             | down into UTF-8 automata since that's needed anyway for
             | lazy DFA matching.
        
       | hencoappel wrote:
       | Given it's such a hot part of their code, I'm genuinely surprised
       | they didn't go down the state machine route for as optimal a
       | solution, or even further and made use of vector instructions.
       | Maybe Rust can autovectorise the code though.
        
       | wolf550e wrote:
       | I don't know Rust. Can someone explain how this data structure
       | stores whether a node is itself a valid word or whether it only
       | leads to more nodes? For example the node "do" in their ("and",
       | "ant", "dad", "do", & "dot") example. I guess it's a
       | discriminated union provided by Rust algebraic data types or
       | something like that, but why not show the full bit pattern in
       | memory?
        
         | SkiFire13 wrote:
         | By looking at the source code [1] they have a discriminated
         | union with 3 variants representing the possible combinations of
          | being a valid word and leading to more nodes (excluding
         | the case where it is neither, which is impossible). So the node
         | for the 'o' in "do" should be a `SearchOrLeaf` storing "do",
         | its corresponding value (the `T`, in a set this would be empty)
         | and the `SearchNode` containing the successors, which should
         | only contain a 't' to "dot".
         | 
         | [1]: https://github.com/cloudflare/trie-
         | hard/blob/3d8163ca2d9208c...
        
         | steveklabnik wrote:
         | > why not show the full bit pattern in memory?
         | 
         | Rust doesn't guarantee a particular memory layout except for
         | types explicitly requesting that you get a specific
         | representation, so showing this would at least require a caveat
         | that it's not guaranteed to be accurate.
         | 
         | Furthermore, this specific data structure's internals are
         | generic, so to show the memory layout, it would have to be for
         | a specific instantiation.
         | 
         | (They do also provide a structure that includes instantiations
         | between u8 and u256, mentioning in a doc comment that the 256
         | size dominates overall size, and so if you want to slim it
         | down, use the internals directly.)
        
       | pragma_x wrote:
       | It took me a moment to ponder the effectiveness of mapping utf8
       | characters into a bitmask. At first, it seemed unwise. Then I
       | realized that 32 bits would get you `a-z` plus six special
       | characters, and 64 bits would get you 'A-Z' (uppercase) plus six
       | more. That's plenty of room for HTTP headers, and for a blazing
       | fast matching algorithm since it just masks and compares a couple
       | integers. Even figuring out which bit corresponds to which
       | character is a simple lookup into a 256-word table.
       | 
        | One thing the author leaves out is that this technique is
        | technically a Bloom Filter. I find these kinds of things
        | fascinating since
       | they came about in an era where computing was far more resource
       | bound than today (1970 in this case). But here we are, still
       | finding real-world corners to use the same old optimizations.
       | 
       | https://en.wikipedia.org/wiki/Bloom_filter
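The byte-to-bit mapping described above might look like this; the chosen alphabet (a-z, 0-9, '-', '_') is an assumption about typical header names, not the article's exact table:

```rust
// A 256-entry table assigns each byte that can appear in a header name
// a bit position, so a set of characters fits in a single u64 mask and
// membership is one table lookup plus one AND.
fn build_byte_to_bit() -> [Option<u8>; 256] {
    let mut table = [None; 256];
    let alphabet: Vec<u8> = (b'a'..=b'z')        // bits 0..=25
        .chain(b'0'..=b'9')                      // bits 26..=35
        .chain([b'-', b'_'])                     // bits 36, 37
        .collect();
    for (bit, &byte) in alphabet.iter().enumerate() {
        table[byte as usize] = Some(bit as u8);
    }
    table
}

fn main() {
    let table = build_byte_to_bit();
    assert_eq!(table[b'a' as usize], Some(0));
    assert_eq!(table[b'-' as usize], Some(36));
    assert_eq!(table[b'A' as usize], None); // not in the alphabet
}
```

With 38 symbols the whole alphabet fits in a u64 with room to spare, which is what makes the mask-and-compare matching so fast.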
        
         | samatman wrote:
         | There are significant differences between trie-hard and a Bloom
         | filter. Bloom filters are probabilistic, and use hashing.
         | They're great for when rare false positives are acceptable in
         | exchange for no false negatives, which isn't uncommon, but
         | isn't what's needed here: this is exact, and hashing is the
         | step to beat.
         | 
         | Rather, this is a refinement/variation on Liang's algorithm,
         | used in TeX for storing a hyphenation dictionary. The thesis
         | mentions Bloom's algorithm, which it calls "superimposed
         | coding", and very much hearkens back to a time when memory was
         | often the most precious commodity. I think you'll like it. ^_^
         | 
         | https://tug.org/docs/liang/liang-thesis.pdf
        
       | maciek12 wrote:
       | Am i missing something or is the big O notation in this article
       | wrong? For example in "Sorted sets like BTreeSet use comparisons
       | for searching, and that makes them logarithmic over key length
       | O(log(L)), but they are also logarithmic in size too" how is the
       | search done in logarithmic time? You could have a header that
       | differs from an internal one only on the last character and then
       | you have to read the whole thing. Also space-wise B-trees are
       | linear, not O(nlogn). Additionally, at the end when talking about
       | the trie, how do you achieve O(log(L)) for misses? Tries are not
       | balanced, they do not halve the possible set on comparison (as I
       | think the author states) and even if they did I don't see how
       | that achieves the logarithmic time.
        
         | vlovich123 wrote:
         | > You could have a header that differs from an internal one
         | only on the last character and then you have to read the whole
         | thing
         | 
          | When people talk about big O they usually mean the average
          | case; the worst/best case is explicitly called out when
          | mentioned, so you can't just think in terms of the worst
          | possible case. Regardless, I think they made several mistakes
          | here, as you correctly note.
         | 
         | BTreeSet is log(N) in number of nodes compared. But for strings
         | that comparison is O(L) since you at least have to compare it
         | to the string you find. So I believe it's at best O(L log(N))
         | where L is the average string length and N is the number of
         | nodes which is worse than the hash table which is just O(L).
         | It's tricky though if most strings don't match your string. In
         | that case I believe you end up degrading to an O(L) +
         | O(log(N)).
         | 
         | Similarly, you are correct that they can't be logarithmic in
         | size and must be linear since you have to store the data. Tries
         | can be sublinear depending on the input since it's a form of
         | compression as well.
         | 
         | Similarly you are right about trie complexity not being
         | O(log(L)) for trie misses. I think what they're observing is a
         | massive speedup because mismatches error out on the first
         | character usually. But it wouldn't be logarithmic as there's
         | unlikely to be a logarithmic relation between headers that
         | mismatch and the universe of matching words.
        
           | maciek12 wrote:
           | Thanks for replying, I really appreciate you taking the time
           | to check my concerns! I was afraid I was going crazy haha.
           | 
           | The part about big O being usually about the average case was
           | helpful, I'm still at uni where we mostly talk about worst
           | case performance.
        
       | o11c wrote:
       | That late in the process, does it even need to support efficient
       | lookup? Could it just be a Vec?
        
         | menaerus wrote:
         | They don't explicitly say that in the blog post but I assume
         | that the engineering team behind this code does not run any
          | workload performance experiments themselves but only runs
         | the local (micro)benchmarks.
         | 
         | Because of this I think there's no way for them to answer your
         | question or to say "hey, this could very well be a vector since
         | we could not find any statistically significant impact between
         | a vector, hash-map and trie in this code path for given
         | workload".
         | 
         | This is a common fallacy I've seen in too many places. In other
         | words, investing non-trivial engineering effort into optimizing
         | a code-path that resembles .000001% of your workload is a waste
         | of resources.
         | 
         | And due to the Amdahl's law
         | 
         | > Optimizing operations that are already measured in
         | microseconds may seem a little silly, but these small
         | improvements add up.
         | 
         | this is not how things usually work out in the end. Making
         | small improvements in insignificant code-paths will still
         | remain insignificant in the overall runtime performance.
        
           | vlovich123 wrote:
           | > If we compare the sampled performance of the different
           | versions of clear_internal_headers, we can see that the
           | results from the performance sampling closely match what our
           | benchmarks predicted
           | 
           | Look at the end where they validate that the performance wins
           | show up in production in a similar way that the
           | microbenchmarks predict.
        
       | xyst wrote:
       | It's a bit wild to me that one needs to add so many http headers
       | internally that they need to get stripped off.
       | 
       | The "trie-hard" crate is a solution to the layers of cruft at CF
        
       | mgaunard wrote:
       | If it takes 3.5 us to remove a substring within a packet-sized
       | string you're doing it wrong.
        
       | alexchamberlain wrote:
       | Jumping on the bandwagon of other ideas... I wonder if it would
        | be quicker to filter out the internal headers when you write
       | request to the network (buffer)? ie something like
       | `request_headers.filter(not_is_internal).for_each(...)`; that way
       | you don't have to remove anything from the data structure at all,
       | but does require you to break down a layer of abstraction for
       | performance benefits.
        
         | vlovich123 wrote:
         | No, you'd just be smearing the work at best. The expensive part
         | is determining set containment not updating the data structure.
         | It doesn't matter if you do it in a loop and update the data
         | structure or do that check before writing.
        
       | sixo wrote:
       | this post makes me think I could work at cloudflare, and also
       | that I shouldn't
        
       | kristianp wrote:
       | I would have stopped at the first step, saving almost 1% of cpu
       | time (0.993%). Creating a new fast trie data type for the other
       | 0.287% seems inefficient. Surely there are other parts of the
       | code that take more than 1% of the cpu time elsewhere to be
       | looked at first?
        
       | y04nn wrote:
        | Why add and then remove HTTP headers on an existing request
        | for routing, instead of adding a dedicated (custom) routing
        | protocol on top of the HTTP stack? That would let you just drop
        | the added protocol on the outbound side and be much more space
        | efficient.
        
       ___________________________________________________________________
       (page generated 2024-09-10 23:00 UTC)