[HN Gopher] A good day to trie-hard: saving compute 1% at a time
___________________________________________________________________
A good day to trie-hard: saving compute 1% at a time
Author : eaufavor
Score : 368 points
Date : 2024-09-10 15:03 UTC (7 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| amluto wrote:
| If you had asked me to make a wild guess as to how Cloudflare
| stores internal headers and then removes them, I would have come
| up with some options:
|
| - An entire separate dictionary or other data structure.
|
| - One single header containing all internal metadata.
|
| - All headers have a prefix, and the internal ones start with I
| and the external ones start with E.
|
| - All internal headers start with "CFInt".
|
| I would not have come up with a scheme in which headers in a
| particular list are internal. (What if someone else uses this
| name? What if something forgets to sanitize? What if different
| simultaneously running programs disagree in the list? What if the
| Connection header names a Cloudflare-internal header? What if the
| set-difference algorithm is annoyingly slow?)
|
| The web is already full of obnoxiously ambiguous in-band
| signaling and header naming, and I find it bizarre that a company
| with Cloudflare's scale uses such a tedious and error-prone
| mechanism internally.
| jgrahamc wrote:
| Yeah, me too, but systems grew over time and grew and grew and
| we were using HTTP headers for all sorts of stuff. This
| optimization makes the situation better, but the solution
| (which is underway) is to use a different mechanism for IPC and
| get rid of the use of HTTP headers completely.
| DamonHD wrote:
| And when that happy day dawns, slightly sadly, this nice
| optimisation will evaporate!
| jgrahamc wrote:
| Maybe the real treasure was the optimizations we made and
| discarded along the way.
| DamonHD wrote:
| Indeed. I am very happy with some of the tricks that I
| developed (speeding up, and indeed allowing, Solaris C++
| linking by prefiltering on MD5 hashes of canonicalised .o
| files -> nm, anyone?) even though they were lost without
| trace (yep, Lehman Brothers, where are my Imakefiles
| now?)...
| jkaptur wrote:
| They're probably as critical as ever at Barclays :)
| DamonHD wrote:
| I really hope not, given that that very old Solaris
| compiler should have been laid to rest many many years
| ago. Yes, some of my stuff had kinda made it into BarCap
| (even onto my desktop dev machine!) when I turned up
| there later, and even some at Nomura when I rocked up
| there again IIRC!
| Twirrim wrote:
| My likely overly cynical concern here is that this
| suggests trie-hard will soon end up abandoned, as you're
| making it sound like it is a stop-gap solution for you.
| throwaway77385 wrote:
| For anyone not in the loop, the above poster (JGC) is
| Cloudflare's CTO himself :)
| amluto wrote:
| I'm certainly familiar with systems growing over time and
| outgrowing their original schema.
|
| Is this the same phenomenon that resulted in Cloudflare
| Tunnels apparently being domain names? Or why domains that
| are proxied by Cloudflare show up as "CNAME" in the DNS
| panel? (The latter one seems extra strange -- wasn't the
| reverse proxy service the very first service that Cloudflare
| offered? There must be some history here.)
| hinkley wrote:
| All new products are full of scaffolding that has to be
| removed when robustness becomes a higher priority than
| existence. Where we fail is not calling it that. Instead we
| just call some code "good" while our thicker-skinned
| coworkers call the rest "fine". It's not fine. I don't want
| to be working overtime the week of Thanksgiving because you
| can't think past the horizon.
| gmueckl wrote:
| There is a very fine line between "good enough" and "not
| good enough" in any product beyond a certain complexity.
| Finding the pieces that cross that line can be hard and
| improving "good enough" parts is (sadly) mostly a waste
| of time in commercial settings.
| hinkley wrote:
| You have to backdate starting a project so it happens
| before things stop being good enough. And take into
| account bad estimates and the five other deadlines
| between then and now.
|
| Often it's better to clean as you go. If you have the
| time and the inclination, shore up one of the least good
| enough things.
|
| It's not unheard of to open up new feature sets via
| refactoring. Something that previously would have taken
| an "impossible" amount of time now becomes profitable,
| not just possible.
| didgetmaster wrote:
| This is why some bugs never get fixed (that is, if you
| define a bug to include inefficient code and not just
| code that breaks something).
| alex_suzuki wrote:
| You chiming in on this post makes me slightly bitter at
| having gone with CloudFront (I have everything on AWS so it
| seemed the obvious choice) instead of Cloudflare.
| efitz wrote:
| I've worked at several huge corporations in IT security, where
| we care about headers a lot, and they all use headers in a
| manner similar to CloudFlare.
|
| Including using proxies at the edge to strip out internal
| headers bidirectionally - yes, inbound too.
| sitkack wrote:
| That doesn't make it better, it makes it worse.
| hinkley wrote:
| Are you on a corporate network? Do you use a firewall at
| home?
|
| You're on enclaves _all the time_. This is just a different
| one. Separate networks per class of traffic used to be de
| rigueur before Cloud. Now it's all munged together.
| sitkack wrote:
| I understand that security needs to be pragmatic, but just
| because security teams do it doesn't make it right.
|
| It isn't about being in the enclave, it is about having to
| keep track of which headers you set versus which came from
| outside. It is fragile and error-prone, and it will absolutely
| break someone else when there is a name collision.
|
| All security exploits are a subset of bugs.
| hinkley wrote:
| And what would you have them do instead? SSL to every
| service? Reauthenticate on every single call? What's your
| clever solution?
|
| There's probably a dozen things you can terminate at a
| boundary that would cost so much to run through the
| entire system that it would bankrupt a company.
|
| And then there's tracing, which also uses headers.
| immibis wrote:
| One of the listed options was to begin all the Cloudflare
| internal header names with "CFInt".
| hinkley wrote:
| That's still header filtering. And it should be X-CF- by
| the way.
| amluto wrote:
| X-CF- seems to be used by some other software. And
| Cloudflare has plenty of documented headers that don't
| start with X.
|
| The whole point is that a genuine user header should
| never conflict with a Cloudflare internal header, and a
| browser should never see an internal header.
| hinkley wrote:
| Please reread the entire thread. We are talking about
| someone who thinks the header solution is stupid. You are
| both splitting hairs. Stay on target.
| sitkack wrote:
| Having re-read the thread, I totally agree. What you have
| listed is table stakes. I'd also say that internal headers
| should be encrypted to a narrow protection domain.
|
| If all internal headers were prefixed with X-CF-, you
| could strip them all via SIMD that had no knowledge of
| any specific header. Hell, you could probably do it on
| the NIC.
| faitswulff wrote:
| It took me about 10 years to hear about it, but the IETF
| tried to deprecate the X- prefix back in 2012:
| https://datatracker.ietf.org/doc/html/rfc6648
| hinkley wrote:
| Good to know.
|
| (Also sibling is right, I spaced on X-CF- being a header
| sent to CF customers' servers. I don't use Cloudflare,
| but CloudFront does the exact same thing.)
| the8472 wrote:
| Configure all machines in a way that they can survive on
| some random untrusted public wifi. This should be the
| obvious default stance for laptops and phones.
|
| But even for workstations wired to a trusted LAN it still
| makes sense because you never know which of the various
| tunnels might assign and expose some IPv6 address to the
| internet.
|
| For servers you might be able to make an exception if you
| have a vigilant IT and the people with admin access
| aren't actively trying to circumvent security, but even
| then that better not be your only layer of security.
| hinkley wrote:
| The first large scale piece of software I worked on was for
| telcos, pre-smartphone. We used internal headers to offload
| authentication and terminate SSL. We also had to pressure F5
| to fix about half a dozen bugs in BIG-IP to do so. Bugs that
| should in no universe have existed in version 9 of a product.
|
| I used to joke that F5 owed me and my coworker 3 months of
| salary for all the free QA we did for them.
| tmsite wrote:
| It helps if you realize that BIG-IP 9.0 was essentially a
| from-scratch rewrite of BIG-IP 4.5. Among other major
| redesigns, the data plane was moved from BSD kernel space
| to Linux user space. Internally, the joke was that it would
| be two times better when we were done (4.5 * 2 = 9.0). It
| probably was, but not on day zero.
| hinkley wrote:
| On day sixty it couldn't do SSL termination, and cookie-
| based traffic shaping didn't work.
|
| I was a little bummed it was "just" a Linux box but
| that's pretty common today. I hadn't discovered dd-wrt
| yet and wouldn't for a few years.
| 620gelato wrote:
| Or perhaps even insert yet another header containing just the
| list of internal headers added to the request - assuming this
| happens in a single place, otherwise it's a recipe for disaster.
|
| I have a slightly different example of this, where an RPC
| framework used in my company disallows the service owner from
| modifying certain headers (say request identifier), and will
| instead create a duplicate header with a suffix. In that
| scenario at least, I can see this as a fairly reasonable
| tradeoff, as the goal is to keep certain headers from being
| modified, not because they are platform-internal but because
| there are certain assumptions associated with those headers
| that are followed company-wide.
|
| I'll go check what mechanism is used for this matching.
| amluto wrote:
| This doesn't work if the front end is older than an internal
| node and a header gets added, at least not without extra
| logic to patch everything up and handle conflicts.
|
| The HTTP Connection header works like this, and one should
| generally assume that almost nothing implements it correctly.
| adrianmonk wrote:
| > _yet another header with just the list of internal headers_
|
| Or the same but with a list of headers which AREN'T internal.
|
| You'll probably have a custom header-adding function that
| people should always use instead of the regular one. And this
| way, if someone forgets to use it, their header will get
| stripped.
|
| You can think of a header escaping the internal network as
| something that needs to be authorized. This is a deny by
| default approach.
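|
| A minimal sketch of that deny-by-default shape (hypothetical
| names, not Cloudflare's actual API):
|
|     // The only sanctioned way to add a client-visible header
|     // also records its name on the allow-list.
|     fn add_external_header(
|         headers: &mut Vec<(String, String)>,
|         allow_list: &mut Vec<String>,
|         name: &str,
|         value: &str,
|     ) {
|         headers.push((name.to_string(), value.to_string()));
|         allow_list.push(name.to_ascii_lowercase());
|     }
|
|     // At the edge: anything not explicitly authorized gets
|     // stripped before the response leaves the network.
|     fn strip_unauthorized(
|         headers: &mut Vec<(String, String)>,
|         allow_list: &[String],
|     ) {
|         headers.retain(|(name, _)| {
|             let n = name.to_ascii_lowercase();
|             allow_list.iter().any(|a| *a == n)
|         });
|     }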
| 620gelato wrote:
| Ah yes - perfect.
| taeric wrote:
| Other ideas:
|
| - Ensure all headers added internally that are not for export
| are at the front of the header list.
|
| - Preseed the hash table of all requests with the internal
| headers you plan to remove.
|
| In fact, if you preseed, you are basically combining these
| ideas but fixing how many internal headers are on each request.
| At that point, you can use a linked hash table that preserves
| creation order and just remove the first N from the final list
| that you send back to clients.
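|
| A rough sketch of that preseed-and-skip shape (hypothetical
| types, not pingora's):
|
|     // Header list whose first `internal_count` entries were
|     // seeded at construction time with the internal headers.
|     struct OrderedHeaders {
|         entries: Vec<(String, String)>, // insertion order kept
|         internal_count: usize,
|     }
|
|     impl OrderedHeaders {
|         // Everything, e.g. when proxying to internal services.
|         fn all(&self) -> &[(String, String)] {
|             &self.entries
|         }
|
|         // Only what the client should see: start after the
|         // internal block, no per-header name checks needed.
|         fn external(&self) -> &[(String, String)] {
|             &self.entries[self.internal_count..]
|         }
|     }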
| amluto wrote:
| > At that point, you can use a linked hash table that
| preserves creation order and just remove the first N from the
| final list that you send back to clients.
|
| While Python provides data structures like this out of the
| box, designing a big system to require a special, nonstandard
| data structure to identify private metadata seems like a
| mistake to me.
| taeric wrote:
| I'm not sure I follow? Does rust not have the equivalent of
| a LinkedHashMap from Java? All I'm proposing is to start
| each map of headers off with a sentinel value for all of
| the headers you don't intend to send back to the client.
| Then, have a method that iterates over all values, for the
| places you need to send to internal servers, and another
| (probably the default?) that starts iteration after all of
| the internal items, which should just be a different
| pointer. At this point, there is no need to inspect each
| header by name, as that was effectively done at compile
| time when you put the protected headers at the front of the
| list.
|
| Is this somewhat special? I mean, sure. But it wasn't long
| ago that a hash table was already in the special territory
| of data structures.
|
| Edit: I should add that I'm not certain this idea would be
| better. Just more ideas on what you could do.
| amluto wrote:
| There's a third party crate for this. C++'s STL also
| doesn't have it by default.
|
| The creation-order-preserving hash map is basically two
| data structures in one, it's more complex and slower than
| a normal hash map, and IMO it's rather bizarre and I have
| never really understood why anyone wants one.
| Conscat wrote:
| It's especially unfortunate that the intuitive name,
| `std::map`, is the ordered and generally least useful
| option of the standard library's map containers.
|
| My last job did rely on ordered-ness of `QMap`, because
| the elements are listed in a Qt GUI and it could confuse
| users if those widgets randomly rearranged themselves.
| That's the only use-case I've personally encountered for
| an ordered hash map.
| taeric wrote:
| I would be surprised if it was dramatically slower, all
| told. Though, fair that I have not benchmarked in a long
| time, here.
|
| My main "idea" here is to stop from having to check all
| of the headers on a "per header" basis. Yes, you can make
| each check faster, as the TRIE does. You can also remove
| the entire reason to check individual headers by starting
| a traversal after where the internal headers are.
|
| You could also go with something like making sure the
| first header is "InternalHeaderCount: N" where you then
| keep the internal headers segregated from the others by a
| simple count. (This assumes you have solid control over
| what is adding the headers internally, of course.)
|
| (I also forgot to give kudos on the blog title. Any Die
| Hard reference is a good reference. :D )
| giancarlostoro wrote:
| > What if someone else uses this name?
|
| Couldn't you bypass this by prefixing the ones that aren't
| yours? Or prefixing your internal ones with something unique
| enough?
| e63f67dd-065b wrote:
| Former employee here; the interesting (horrifying?) fact is
| that you can _set_ some of these internal headers (I remember a
| notable bug with cf-cache-status) in workers (the serverless
| offering) and cause all kinds of bad things to happen :)
| hn_throwaway_99 wrote:
| I feel like I can point out lots of similar problems to the
| other solutions you suggest (heck, I think some of those
| problems you list even _apply_ to those other solutions).
|
| The list approach has some downsides, but it also has a bunch
| of upsides. I feel like when people like to point out potential
| flaws of these approaches, they're ignoring the history and
| difficulties that come with Cloudflare's scope. An enumerated
| list is the simplest and most flexible of the approaches, and
| it also doesn't require any a priori agreement on the structure
| of a header key - this is probably important when you think
| about the sheer number of teams at Cloudflare, potential
| technology acquisitions, etc.
| amluto wrote:
| Lists, extensible protobufs, etc are indeed great in their
| extensibility. The issue I would see (if I worked at
| Cloudflare) isn't the use of a list -- it's that the list is
| the _same_ list that's used for externally visible HTTP
| headers.
| danbruc wrote:
| This might of course introduce quite a bit of overhead, but the
| clean solution would arguably be to not mix requests to begin
| with: the external requests you are processing become the payload
| of the requests you are sending around internally. This would
| also allow you to cook up an optimized encapsulation protocol
| that might be more efficient to process than parsing and
| modifying HTTP. This admittedly comes from someone with
| essentially zero knowledge of what exactly Cloudflare is doing
| to the requests they receive.
| Scaevolus wrote:
| Did you try a (tiny) bloom filter? Doing a quick convolution of
| the header key and a test against a bloom filter (SIMD? Using
| builtin CRC ops? 256-bit bloom filter size?) might avoid walking
| the trie altogether in most cases at the cost of a few cycles.
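|
| For illustration, a toy (non-SIMD) version of that shape, with
| arbitrary hash mixing:
|
|     /// 256-bit Bloom-ish filter over the internal header names.
|     struct TinyBloom {
|         bits: [u64; 4],
|     }
|
|     impl TinyBloom {
|         fn hash(name: &[u8], seed: u64) -> u64 {
|             // FNV-style mix; a real one might use CRC32 ops or SIMD.
|             let mut h = 0xcbf29ce484222325u64 ^ seed;
|             for &b in name {
|                 h = (h ^ b as u64).wrapping_mul(0x100000001b3);
|             }
|             h
|         }
|
|         fn positions(name: &[u8]) -> [u64; 2] {
|             [Self::hash(name, 1) % 256, Self::hash(name, 2) % 256]
|         }
|
|         fn insert(&mut self, name: &[u8]) {
|             for p in Self::positions(name) {
|                 self.bits[(p / 64) as usize] |= 1 << (p % 64);
|             }
|         }
|
|         // False positives are possible, false negatives are not,
|         // so a "no" here means the trie walk can be skipped.
|         fn maybe_contains(&self, name: &[u8]) -> bool {
|             Self::positions(name)
|                 .iter()
|                 .all(|&p| (self.bits[(p / 64) as usize] & (1 << (p % 64))) != 0)
|         }
|     }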
| jgrahamc wrote:
| I do not know if the team did that but Bloom filters aren't a
| panacea: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
| Scaevolus wrote:
| Yeah, for this problem you'd want a bloom filter that can
| entirely fit in vector registers during the computation,
| which should work well for a few hundred items.
| vlovich123 wrote:
| A bloom filter will have the same weaknesses of needing to do
| a hash. For static filters like this though where you have
| the set of strings to check against known up front, something
| like a binary fuse filter could work really well as it's
| significantly faster than bloom filters when you don't need
| online construction (you still need to do a hash of course
| but the evaluation against std::HashMap is flawed since it
| uses SipHash instead of modern constructions like xxh3 or
| gxhash).
| jkaptur wrote:
| I think that would require calculating the hash, which was
| ruled out as too slow.
| pragma_x wrote:
| Considering that each node's filter in their trie
| implementation is bloom-like already, the solution is awfully
| close. I agree that testing wider swaths of the string at once
| might be a dramatic accelerator.
|
| A more naive approach might also be warranted. Some frequency
| analysis of hit/miss for trie nodes might allow them to find
| specific character positions with a higher miss rate than the
| first one. Testing such special positions first would speed
| things up. That assumes, of course, that the header data is
| fairly regular in nature.
| IncreasePosts wrote:
| Off topic: how does 60M requests a second stack up against
| other big players like Google, Facebook, and TikTok?
| xyst wrote:
| This 2013 research article from facebook indicates memcached
| handled "billions of requests per second" [1].
|
| [1] https://research.facebook.com/publications/scaling-memcache-...
| joshuamorton wrote:
| I will note that http requests and internal requests are
| distinct.
|
| The Zanzibar white paper from Google in 2019, for example,
| lists top-line RPCs to Zanzibar at around 10m qps, but
| internal (in memory) cache hits at 200m qps and non cache
| internal fan-out at >20m qps. Zanzibar of course isn't http,
| but is a good public example of internal rpc fan-out.
|
| I'd expect Cloudflare similarly has some extremely high-qps
| internal services.
| miso2024 wrote:
| Finally a blog post with a trie. Those trie leetcodes didn't go
| in vain ;)
| marcosdumay wrote:
| You will find them in any fast spelling or grammar verifier.
| koito17 wrote:
| Routing libraries as well. In Clojure, the reitit library
| uses a prefix tree implementation when the route table is
| large enough for linear search to be non-optimal.[1] The Go
| ecosystem also has httprouter, which uses prefix trees under
| the hood.
|
| [1] https://github.com/metosin/reitit/blob/master/doc/performanc...
|
| [2] https://github.com/julienschmidt/httprouter#how-does-it-work
| steve_adams_86 wrote:
| I've used them in real code! They are pretty intuitive for some
| use cases. The last time I used a trie was for address lookup.
| I even used a generator, haha. Double whammy leetcode in real
| life situation I guess.
|
| It's definitely not a common tool I reach for though.
| ayuhito wrote:
| I recently got a chance to find a practical use for a modified
| hybrid trie which was fun to implement! Lots of optimisations
| to go around.
|
| It's a nifty user-agent parser for Go:
| https://github.com/medama-io/go-useragent
|
| Typically this sort of problem relies on lots of regex parsing,
| so it was nice to take a more novel approach.
| cwilby wrote:
| > Internally, we call the request's destination server its
| "origin"
|
| Origin: The point at which something comes into existence or from
| which it derives or is derived.
|
| How can the request's destination server be the origin, if it is
| the destination server?
| toast0 wrote:
| The origin is the origin of the content.
| cwilby wrote:
| Thus cross origin resource sharing. Thanks!
|
| At least it's consistently inconsistent.
| tills13 wrote:
| On the internet, the origin is the server sending the response
| to the user. I suppose you can look at it from the perspective
| of the owner of the server -- from their frame of reference,
| their journey _starts_ when they receive, process, and respond
| to the request.
|
| Granted, computer scientists are infamously known for being
| terrible at naming things.
| hinkley wrote:
| Had any fun with "Referer" headers lately?
| fowl2 wrote:
| Depends on your perspective :)
|
| "Origin" is a term from web browsers, from their point of view
| it refers to where a web page came from.
|
| see https://developer.mozilla.org/en-US/docs/Web/Security/Same-o...
| taeric wrote:
| This hilariously reminds me of the complaining about X server
| semantics back in the day.
| evmar wrote:
| This post uses a fancy data structure to construct a to_remove
| set, then iterates through it to remove these from the underlying
| header map.
|
| It appears the "remove_header" call is this one:
| https://docs.rs/pingora-http/0.3.0/src/pingora_http/lib.rs.h...
| which calls .remove() on two other data structures, both of which
| bottom out in this mountain of code:
| https://docs.rs/http/latest/src/http/header/map.rs.html#1550
| aidenn0 wrote:
| Given that the set of items to match is static, I wonder if they
| tried a perfect hash table; that would reduce it to a few
| arithmetic operations followed by a single string compare. It
| would be interesting to see how it compares to the trie.
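|
| For what it's worth, the rust-phf crate can generate such a
| table at compile time; a sketch (these header names are just
| placeholders):
|
|     // requires the `phf` crate with its "macros" feature
|     use phf::phf_set;
|
|     // Perfect-hash set: lookup is a couple of table reads plus
|     // one final string comparison to confirm the hit.
|     static INTERNAL_HEADERS: phf::Set<&'static str> = phf_set! {
|         "cf-int-example-a",
|         "cf-int-example-b",
|     };
|
|     fn is_internal(name: &str) -> bool {
|         INTERNAL_HEADERS.contains(name)
|     }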
| sophacles wrote:
| So this trie uses a custom u256 type that is a wrapper around an
| array of 4 u64s. Is Rust smart enough to vectorize it into AVX
| instructions for the bit twiddling?
|
| What about for the other integer sizes that the trie supports?
| mightyham wrote:
| I'm not very well versed in data structure optimization, but I
| was surprised they dismissed hash tables so quickly, especially
| when the table they are searching through is static. I find it
| somewhat hard to believe that a specially optimized hash table
| would not be faster than their trie implementation.
| supermatt wrote:
| They mentioned it is the overhead of the hashing that makes it
| slow compared to the trie.
| hooli42 wrote:
| Hash functions can be as simple as a single modulo.
| ironman1478 wrote:
| As they said though, the hash function on a string type
| requires looking at every element in the string. Also,
| modulo is historically very slow. So, they still have to
| deal with O(n) operations, where n is the length of the
| input string. If they can improve the memory hopping
| problem associated with lists / graph structures (which
| they claim to have done in their library), then a trie
| could well be fast enough, which is what they observed.
| And since they claim that 90% of the time there is a miss
| when querying the trie, you exit early a lot, whereas you
| always need to compute the whole hash of the input string
| when doing the hash-map strategy.
| hervature wrote:
| > As they said though, the hash function on a string type
| requires looking at every element in the string.
|
| Maybe for the default hash function. As another commenter
| pointed out, your data may make the following hash very
| effective: s[0] + s[len(s)//2] + s[-1] which would be
| very fast. The point is that spending a day seeing if
| such a hash exists is worth it.
| mightyham wrote:
| The hash does not need to be computed on the whole
| string. I pointed this out in my other comment but just
| as an example: a hash function could be as simple as
| xoring the first 16 and last 16 bits of the string then
| indexing a 2^16 array. That means hashing is two pointer
| offsets and an xor (no modulo required). If there are 100
| strings that need to be removed, then ~99% of rejections
| will be a very fast O(1).
|
| And in the case of a match, a fast hash + memcmp will be
| way faster than a trie traversal. In fact, according to
| the trie-hard github readme, std::HashMap is already much
| faster than their trie implementation when searching for
| a match.
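|
| Something in that spirit (hypothetical; assumes names are at
| least two bytes long and a 64K-entry bucket table is fine):
|
|     // Xor the first two and last two bytes -> 16-bit bucket index.
|     fn cheap_hash(name: &[u8]) -> usize {
|         let n = name.len();
|         let head = u16::from_le_bytes([name[0], name[1]]) as usize;
|         let tail = u16::from_le_bytes([name[n - 2], name[n - 1]]) as usize;
|         head ^ tail
|     }
|
|     struct CheapSet {
|         buckets: Vec<Vec<&'static str>>, // 1 << 16 buckets, most empty
|     }
|
|     impl CheapSet {
|         fn new(keys: &[&'static str]) -> Self {
|             let mut buckets = vec![Vec::new(); 1 << 16];
|             for &k in keys {
|                 buckets[cheap_hash(k.as_bytes())].push(k);
|             }
|             CheapSet { buckets }
|         }
|
|         // Most misses land in an empty bucket and never reach the
|         // string comparison.
|         fn contains(&self, name: &str) -> bool {
|             self.buckets[cheap_hash(name.as_bytes())]
|                 .iter()
|                 .any(|&k| k == name)
|         }
|     }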
| mightyham wrote:
| The point I'm making is that a specially optimized hashing
| function would probably blow away any trie traversal. When
| you know the data being hashed beforehand, it is possible to
| make custom hash functions that are as simple as a few
| bitwise operations on the first, middle and last character of
| the string (just an example).
| dividuum wrote:
| Right. Out of curiosity I looked into what code gperf
| generates and it seems it would end up O(1) for misses
| (it's always three lookups into a generated table) and O(L)
| for hits, as a potential hit has to be confirmed by
| essentially a strcmp. Not sure how something like
| https://github.com/rust-phf/rust-phf would work, but
| probably similar?
| treis wrote:
| Isn't the sticky wicket collisions with headers defined
| external to CF?
| Denvercoder9 wrote:
| They don't know all the data that's being hashed, though.
| They know everything that's in the map, but not all the
| keys they're trying to lookup.
| vlovich123 wrote:
| If they just used the default, that's SipHash which is
| optimized for security and DOS resistance, not performance.
| XXH3 is ~10x faster than that, and there are some newer ones
| that are even faster than XXH3, like GxHash (another
| ~1.5x-2x faster).
|
| You would precompute the hash for the constant string list &
| then compute the hash only for the set of strings that
| begin with C/c since "cf-" is a common prefix for those
| internal headers if I recall correctly, using RawTable to
| find entries by hash & then comparing against the precomputed
| value to see if a collision matched before checking a string
| for equality.
| 5kg wrote:
| There is FxHashMap (https://github.com/rust-lang/rustc-hash)
| which is faster than std::collections::HashMap.
|
| With ~100 static entries, o1hash
| (https://github.com/rurban/smhasher/blob/master/o1hash.h)
| should also work.
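|
| e.g. swapping the hasher is a one-line change (sketch):
|
|     use rustc_hash::FxHashSet;
|
|     // Same API as the std HashSet, FxHash instead of SipHash.
|     fn build_internal_set(names: &[&'static str]) -> FxHashSet<&'static str> {
|         names.iter().copied().collect()
|     }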
| rapsey wrote:
| Note FxHashMap is just the std HashMap with a different
| hashing algorithm.
| samatman wrote:
| A hash function can't reject a string with a single test of the
| first byte. It just can't.
|
| For this application, that's a leg up which a hash won't be
| able to touch. The rest of the art here is shaving down the
| constant factor of a trie far enough that the initial boost
| pays off in real-world performance.
| MehdiHK wrote:
| Slightly off topic: I was a bit surprised nobody ported the
| space-efficient marisa trie to Rust, or any other language.
|
| Anybody know why?
|
| https://github.com/s-yata/marisa-trie
| hyperpape wrote:
| I'm interested to know why the regex crate didn't do better. I
| believe it should compile a search for multiple literal strings
| into an Aho-Corasick automaton, which is structured very much
| like a trie.
| burntsushi wrote:
| regex crate author here. From a quick look at trie-hard, my
| guess is that trie-hard can do lookups on short strings much
| more quickly than the regex crate can do matches on short
| strings. The regex crate has a fair bit of overhead for
| selecting the right matching engine and applying other
| optimizations appropriate to general purpose matching.
|
| But the aho-corasick crate, and especially the specific
| automatons, would be interesting to try. There's much less
| overhead there. But, there are some differences:
|
| * The contiguous NFA in aho-corasick has a lot more going on
| inside its "next transition" routine:
| https://github.com/BurntSushi/aho-corasick/blob/cd400ad792d6...
| --- This is primarily because an Aho-Corasick automaton isn't
| just a trie. It also has failure transitions to support
| unanchored search. You can search without respecting failure
| transitions (and thus treat it as a trie), but the code is
| still there to deal with failure transitions. So that may have
| deleterious effects.
|
| * The contiguous NFA also specializes its transition function
| depending on the size of the node. It would be interesting to
| see whether these techniques would be useful for trie-hard to
| play with. But there is definitely a benefit to keeping one
| simple and fast code path for all node types. (Less branching.)
|
| * The aho-corasick crate does also expose a DFA directly:
| https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct....
| --- In theory this should do less work per transition than
| trie-hard, and so it would be very interesting to measure its
| perf when compared to trie-hard. The main downside is that it's
| a DFA and can use up a lot more memory and take longer to
| build. It seems like Cloudflare cares less about that in favor
| of optimizing read perf though, so it's not clear why they
| didn't go this route.
|
| * The aho-corasick crate doesn't let you store arbitrary
| values. You only get back a PatternID when a match occurs. So
| to get equivalent functionality as trie-hard, you'd need a
| separate map of PatternID->YourValue. Since that isn't stored
| with the data in the trie, this alone might penalize you quite
| a bit with poorer CPU cache performance.
|
| * trie-hard seems to provide some customization options for its
| representation, i.e., it lets you use a `u8` to store its
| transitions versus a `u128` or `u256`. This might also help
| with cache usage for a small enough trie.
| vlovich123 wrote:
| > can use up a lot more memory and take longer to build
|
| Does it support const construction so that you build it
| during compilation from a static set of strings?
| burntsushi wrote:
| Nope. That doesn't usually help in practice because it's
| not hard to amortize construction with a
| std::sync::LazyLock.
|
| (The `memchr` crate doesn't even support const construction
| of a _single_ substring searcher. I haven't spent much
| time thinking about it because `const` is still pretty
| limited in what it supports, although progress is being
| made. For example, you can't use trait methods in `const`.
| So while you can undoubtedly use `const` Rust to build _a_
| single substring searcher, whether `const` Rust can be used
| to build the one in `memchr` specifically isn't clear to
| me. And if it could be done, it will undoubtedly come with
| development/code-clarity costs and it's not clear to me
| that those are worth paying. Because as I said, amortizing
| construction for a static set of strings is usually very
| simple to do.)
|
| And in any case, "takes a long time to build and uses a lot
| of memory" isn't avoided by doing it at compile time.
| Instead, you get bloated binaries and longer compile times.
| Which costs matter more? Dunno, especially when you
| amortize construction at runtime.
| vlovich123 wrote:
| Yeah I guess I was mainly just asking "is const powerful
| enough yet for this to be a trivial addition".
|
| > Instead, you get bloated binaries and longer compile
| times
|
| Longer compile times probably are solved by having the
| strings live in a standalone compilation unit which the
| compiler will end up caching and you can only dirty it by
| adding a new string. Needing to shuffle that extra data
| around might still cost extra compile time, but in a real
| world program I suspect that's likely to not matter (both
| in terms of binary size & that other things in the
| program will be more expensive to compile).
|
| That being said I agree that doing a LazyLock is probably
| sufficient, and given the number of entries being compared
| against acquiring the lazy lock probably also gets
| amortized (it's cheap but not free).
| samatman wrote:
| This is a gratifying read (along with the article) because
| I've been working on closely related things, and had come to
| the tentative conclusion that taking Aho-Corasick, anchoring
| it (so no failure transitions), and laying it out in
| contiguous memory, basically results in Liang's packed trie
| from TeX's hyphenation algorithm.
|
| trie-hard seems to be effectively that, but with some clever
| masking to get mileage out of having a limited byte
| vocabulary to deal with.
|
| The DFA might pull ahead, but I wouldn't just bet on that.
| Branch predictors don't like lookup tables much, and the
| memory locality hit is small but real: more of a packed trie
| fits in a cache line than would a DFA.
|
| You seem to have landed on similar conclusions here: beating
| a packed trie for this specific use case is going to be
| tough.
|
| I also like that they hit on the same popcount-and-shift
| technique I've used in a data structure for UTF-8 character
| sets: https://github.com/mnemnion/runeset. I admittedly
| haven't benchmarked this much, correctness first and all, but
| seeing that Cloudflare has seen good numbers for something
| closely related reinforces the few tests I have run.
|
| As a text-matching connoisseur, you may find the runeset
| interesting. It turns out to be hard to search for papers on
| "UTF-8 character set", so if you're aware of any prior art
| here I'd love a heads-up. It's fundamentally a generalization
| of using u64 bitmasks for sets of ASCII, which is as old as
| the hills at this point.
| burntsushi wrote:
| RE DFA: possibly, but not completely obvious to me. The DFA
| in aho-corasick doesn't typically have 256 different
| transitions per state. It uses "byte classes" to represent
| a typically smaller number of equivalence classes. Two
| bytes are in the same equivalence class if a match/non-
| match cannot be distinguished between them.
|
| To be clear, I'd bet trie-hard is still more space
| efficient. And using byte classes does require an extra
| look-up per transition.
|
| But, worth measuring.
|
| As for your RuneSet thing, I'd look at lightgrep and
| icgrep. I'm not 100% sure they have relevant work for you, but
| they might. In practice, I just combine sets of codepoints
| down into UTF-8 automata since that's needed anyway for
| lazy DFA matching.
| hencoappel wrote:
| Given it's such a hot part of their code, I'm genuinely surprised
| they didn't go down the state machine route for as optimal a
| solution, or even further and make use of vector instructions.
| Maybe Rust can autovectorise the code though.
| wolf550e wrote:
| I don't know Rust. Can someone explain how this data structure
| stores whether a node is itself a valid word or whether it only
| leads to more nodes? For example the node "do" in their ("and",
| "ant", "dad", "do", & "dot") example. I guess it's a
| discriminated union provided by Rust algebraic data types or
| something like that, but why not show the full bit pattern in
| memory?
| SkiFire13 wrote:
| By looking at the source code [1] they have a discriminated
| union with 3 variants representing the possible combinations of
| being a valid word and leading to more nodes (excluding
| the case where it is neither, which is impossible). So the node
| for the 'o' in "do" should be a `SearchOrLeaf` storing "do",
| its corresponding value (the `T`, in a set this would be empty)
| and the `SearchNode` containing the successors, which should
| only contain a 't' to "dot".
|
| [1]: https://github.com/cloudflare/trie-hard/blob/3d8163ca2d9208c...
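|
| Roughly this shape, if it helps to see it spelled out
| (simplified; the real definitions are generic over the value
| type and the mask width):
|
|     enum TrieNode<T> {
|         // a valid word with no children ("dad", the 't' of "dot")
|         Leaf(String, T),
|         // a prefix only, leading to more nodes (the 'a' of "and")
|         Search(SearchNode),
|         // a valid word that also has children (the 'o' of "do"/"dot")
|         SearchOrLeaf(String, T, SearchNode),
|     }
|
|     struct SearchNode {
|         mask: u64,       // which next bytes exist, one bit per symbol
|         children: usize, // where this node's children start
|     }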
| steveklabnik wrote:
| > why not show the full bit pattern in memory?
|
| Rust doesn't guarantee a particular memory layout except for
| types explicitly requesting that you get a specific
| representation, so showing this would at least require a caveat
| that it's not guaranteed to be accurate.
|
| Furthermore, this specific data structure's internals are
| generic, so to show the memory layout, it would have to be for
| a specific instantiation.
|
| (They do also provide a structure that includes instantiations
| between u8 and u256, mentioning in a doc comment that the 256
| size dominates overall size, and so if you want to slim it
| down, use the internals directly.)
| pragma_x wrote:
| It took me a moment to ponder the effectiveness of mapping utf8
| characters into a bitmask. At first, it seemed unwise. Then I
| realized that 32 bits would get you `a-z` plus six special
| characters, and 64 bits would get you 'A-Z' (uppercase) plus six
| more. That's plenty of room for HTTP headers, and for a blazing
| fast matching algorithm since it just masks and compares a couple
| integers. Even figuring out which bit corresponds to which
| character is a simple lookup into a 256-word table.
|
| One thing the author leaves out is that this technique is
| technically a Bloom filter. I find these kinds of things
| fascinating since
| they came about in an era where computing was far more resource
| bound than today (1970 in this case). But here we are, still
| finding real-world corners to use the same old optimizations.
|
| https://en.wikipedia.org/wiki/Bloom_filter
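|
| A toy version of the mask idea (small alphabet; not the crate's
| actual layout):
|
|     // Map a header byte to a bit position in a compact alphabet.
|     fn bit_of(b: u8) -> Option<u32> {
|         match b {
|             b'a'..=b'z' => Some(b as u32 - 'a' as u32),      // 26 bits
|             b'0'..=b'9' => Some(26 + b as u32 - '0' as u32), // 10 more
|             b'-' => Some(36),
|             b'_' => Some(37),
|             _ => None, // byte outside the alphabet: instant miss
|         }
|     }
|
|     // A node's outgoing edges packed into one u64 mask. Returns
|     // the child's rank (number of set bits below it), or None.
|     fn child_rank(edges: u64, b: u8) -> Option<u32> {
|         let bit = bit_of(b)?;
|         if (edges & (1 << bit)) == 0 {
|             return None; // one AND plus a branch rejects the byte
|         }
|         Some((edges & ((1u64 << bit) - 1)).count_ones())
|     }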
| samatman wrote:
| There are significant differences between trie-hard and a Bloom
| filter. Bloom filters are probabilistic, and use hashing.
| They're great for when rare false positives are acceptable in
| exchange for no false negatives, which isn't uncommon, but
| isn't what's needed here: this is exact, and hashing is the
| step to beat.
|
| Rather, this is a refinement/variation on Liang's algorithm,
| used in TeX for storing a hyphenation dictionary. The thesis
| mentions Bloom's algorithm, which it calls "superimposed
| coding", and very much hearkens back to a time when memory was
| often the most precious commodity. I think you'll like it. ^_^
|
| https://tug.org/docs/liang/liang-thesis.pdf
| maciek12 wrote:
| Am I missing something, or is the big O notation in this article
| wrong? For example in "Sorted sets like BTreeSet use comparisons
| for searching, and that makes them logarithmic over key length
| O(log(L)), but they are also logarithmic in size too" how is the
| search done in logarithmic time? You could have a header that
| differs from an internal one only on the last character and then
| you have to read the whole thing. Also space-wise B-trees are
| linear, not O(n log n). Additionally, at the end when talking about
| the trie, how do you achieve O(log(L)) for misses? Tries are not
| balanced, they do not halve the possible set on comparison (as I
| think the author states) and even if they did I don't see how
| that achieves the logarithmic time.
| vlovich123 wrote:
| > You could have a header that differs from an internal one
| only on the last character and then you have to read the whole
| thing
|
| When people talk about big O they usually mean the average
| case; worst/best case is explicitly called out if mentioned,
| so you can't just think in terms of the worst possible case.
| Regardless, I think they made several mistakes here, as you
| correctly note.
|
| BTreeSet is O(log N) in the number of nodes compared. But for strings
| that comparison is O(L) since you at least have to compare it
| to the string you find. So I believe it's at best O(L log(N))
| where L is the average string length and N is the number of
| nodes which is worse than the hash table which is just O(L).
| It's tricky though if most strings don't match your string. In
| that case I believe you end up degrading to an O(L) +
| O(log(N)).
|
| Similarly, you are correct that they can't be logarithmic in
| size and must be linear since you have to store the data. Tries
| can be sublinear depending on the input since it's a form of
| compression as well.
|
| Similarly you are right about trie complexity not being
| O(log(L)) for trie misses. I think what they're observing is a
| massive speedup because mismatches error out on the first
| character usually. But it wouldn't be logarithmic as there's
| unlikely to be a logarithmic relation between headers that
| mismatch and the universe of matching words.
| maciek12 wrote:
| Thanks for replying, I really appreciate you taking the time
| to check my concerns! I was afraid I was going crazy haha.
|
| The part about big O being usually about the average case was
| helpful, I'm still at uni where we mostly talk about worst
| case performance.
| o11c wrote:
| That late in the process, does it even need to support efficient
| lookup? Could it just be a Vec?
| menaerus wrote:
| They don't explicitly say that in the blog post but I assume
| that the engineering team behind this code does not run any
| workload performance experiments themselves but they only run
| the local (micro)benchmarks.
|
| Because of this I think there's no way for them to answer your
| question or to say "hey, this could very well be a vector since
| we could not find any statistically significant impact between
| a vector, hash-map and trie in this code path for the given
| workload".
|
| This is a common fallacy I've seen in too many places. In other
| words, investing non-trivial engineering effort into optimizing
| a code-path that resembles .000001% of your workload is a waste
| of resources.
|
| And due to Amdahl's law
|
| > Optimizing operations that are already measured in
| microseconds may seem a little silly, but these small
| improvements add up.
|
| this is not how things usually work out in the end. Making
| small improvements in insignificant code-paths will still
| remain insignificant in the overall runtime performance.
| vlovich123 wrote:
| > If we compare the sampled performance of the different
| versions of clear_internal_headers, we can see that the
| results from the performance sampling closely match what our
| benchmarks predicted
|
| Look at the end where they validate that the performance wins
| show up in production in a similar way that the
| microbenchmarks predict.
| xyst wrote:
| It's a bit wild to me that one needs to add so many HTTP headers
| internally that they then need to get stripped off.
|
| The "trie-hard" crate is a solution to the layers of cruft at CF.
| mgaunard wrote:
| If it takes 3.5 us to remove a substring within a packet-sized
| string you're doing it wrong.
| alexchamberlain wrote:
| Jumping on the bandwagon of other ideas... I wonder if it would
| be quicker to filter out the internal headers when you write the
| request to the network (buffer)? i.e. something like
| `request_headers.filter(not_is_internal).for_each(...)`; that way
| you don't have to remove anything from the data structure at all,
| but does require you to break down a layer of abstraction for
| performance benefits.
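|
| Something like this at serialization time (a sketch, not
| Pingora's actual API):
|
|     use std::io::Write;
|
|     // Never mutate the header map; just skip internal names on
|     // the way out to the wire.
|     fn write_headers<W: Write>(
|         out: &mut W,
|         headers: &[(String, String)],
|         is_internal: impl Fn(&str) -> bool,
|     ) -> std::io::Result<()> {
|         for (name, value) in headers {
|             if is_internal(name) {
|                 continue;
|             }
|             write!(out, "{}: {}\r\n", name, value)?;
|         }
|         Ok(())
|     }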
| vlovich123 wrote:
| No, you'd just be smearing the work at best. The expensive part
| is determining set containment not updating the data structure.
| It doesn't matter if you do it in a loop and update the data
| structure or do that check before writing.
| sixo wrote:
| this post makes me think I could work at cloudflare, and also
| that I shouldn't
| kristianp wrote:
| I would have stopped at the first step, saving almost 1% of CPU
| time (0.993%). Creating a new fast trie data type for the other
| 0.287% seems inefficient. Surely there are other parts of the
| code that take more than 1% of the CPU time elsewhere to be
| looked at first?
| y04nn wrote:
| Why add and then remove HTTP headers on an existing request
| for routing instead of adding a dedicated (custom) routing
| protocol above the HTTP stack? That would allow you to just
| drop the added protocol on the outbound side and be much more
| space efficient.
___________________________________________________________________
(page generated 2024-09-10 23:00 UTC)