[HN Gopher] Ask HN: What is new in algorithms and data structure...
___________________________________________________________________
Ask HN: What is new in algorithms and data structures these days?
Algs/DS were my first love in CS. Nowadays, all we hear about is
AI/ML. There must be hardware/software improvements coming from or
necessitating fundamental Algs/DS research. Care to share any of
the favorite recent results? Are there big fields still making
gains behind all this computing surge?
Author : jvanderbot
Score : 378 points
Date : 2023-05-10 13:16 UTC (9 hours ago)
| zelphirkalt wrote:
| There is ongoing research into purely functional data structures,
| and many implementations still to be written for many functional
| languages to make use of PFDS. Whenever one sees a traditional
| data structure, one can ask oneself what a persistent functional
| one might look like, or whether a different solution would be
| pursued in FP languages. Traditional data structures are often a
| matter of transcribing from an example language into a target
| language, with various language-specific changes for
| optimization. Finding a good way to implement things without
| mutation or side effects is a challenge.
| revskill wrote:
| Searching.
| chc4 wrote:
| In compilers, there's been a recent uptick in industry research
| into, and adoption of, equivalence classes and graphs (egraphs)
| for doing optimization passes in a way that preserves information
| and solves the phase ordering problem. Egg[0] was one of the big
| libraries that came out of it, but it can only handle term
| rewriting without guards, and so can't express y/x*x -> y (which
| is unsound when x=0), or optimization passes that span control
| flow points where some of the eclasses are only valid in some
| locations. The Cranelift developers adapted it into a
| construction they call aegraphs[1] that handles this, and are
| migrating Cranelift to use these passes for all optimizations,
| which is (hopefully) faster and achieves more optimization
| opportunities; Chris Fallin is presenting their aegraph work
| at PLDI this year.
|
| 0: https://egraphs-good.github.io/
|
| 1:
| https://github.com/cfallin/rfcs/blob/main/accepted/cranelift...
| tester756 wrote:
| thanks, this is cool!
| aseipp wrote:
| The recent work on relational, datalog-inspired egraphs in PLDI
| this year ("Unifying Datalog and Equality Saturation") is
| actually interesting because it can solve cases like the y/x*x
| -> y identity example, by the power of an interval analysis on
| x (among other things.) Sort of like adding a postulate to a
| rewrite term and then proving it. But instead it's by adding
| relations between terms in the graph. See section 6.2 of the
| paper.
|
| https://github.com/egraphs-good/egglog
|
| https://arxiv.org/pdf/2304.04332.pdf
| chc4 wrote:
| The other recent research that's been interesting to watch has
| been in logic programming (Prolog and Datalog), where there have
| been a lot of papers extending Datalog with semilattice types
| that are more efficient for computing dataflow-type queries
| (and even getting lattice types added to Souffle recently),
| including papers like datafun[0] extending it to sound
| incremental queries for more efficient execution - and there's
| even a paper by the Egg authors using it for efficient eclass-
| based Datalog queries using lattices[1]. It also seems like
| there has been more work recently on actually integrating logic
| programming into existing languages and programs, like ascent[2],
| instead of it being off in its own world, as has historically
| been the case.
|
| 0: http://www.rntz.net/files/seminaive-datafun.pdf
|
| 1: https://www.mwillsey.com/papers/egglog
|
| 2: https://s-arash.github.io/ascent/cc22main-p95-seamless-
| deduc...
| ggleason wrote:
| Yeah, it's really unfortunate that these amazing advances in
| datalog don't make it into general purpose languages.
| jorkadeen wrote:
| You might be interested in Flix which has first-class
| Datalog program values:
|
| https://flix.dev/
|
| https://doc.flix.dev/fixpoints.html
|
| (I am one of the developers of Flix)
| thaliaarchi wrote:
| How does Datalog with e-graphs compare to Datalog with BDDs
| in terms of what analyses can be performed? bddbddb is a
| Datalog engine using BDDs that, according to their 2005
| paper[0], can solve problems like context-sensitive pointer
| analysis for large programs; analyses implemented with it are
| faster than hand-tuned implementations, in dramatically fewer
| lines of code. I'm not knowledgeable on what modern
| Datalog engines are based on.
|
| 0: https://people.csail.mit.edu/mcarbin/papers/aplas05.pdf
| chc4 wrote:
| I'm under the impression that BDDs are good for more
| efficient representation of row membership for values, so
| that you can query and saturate more facts faster, but
| eclasses allow for writing rules that "automatically" have
| transitive connectivity of two rules. If you have X -> Y
| and Y -> Z, you only need to store one of X or Y :- Z since
| X and Y have been unioned together and may be treated
| identically, along with a same(X, Z) rule firing instead of
| having to have additional transitive same(X, same(Y, Z))
| rules. I don't profess to be an expert in BDDs, eqsat, or
| Datalog, though! I have one friend that did a thesis on
| eqsat who wasn't that big of a fan of bddbddb but I don't
| remember exactly why.
| aaron695 wrote:
| [dead]
| mad wrote:
| The Cuckoo Trie, an ordered index data structure explicitly
| designed to have memory-level parallelism, which out-of-order
| processors can exploit to execute DRAM accesses in parallel:
|
| https://arxiv.org/pdf/2201.09331.pdf
| barbarr wrote:
| I'm not in the field but it seems that SODA (ACM/SIAM Symposium
| on Discrete Algorithms) is a top conference in DSA and might be a
| good place to check. Here's the abstract book for 2022: [1].
|
| [1]
| https://www.siam.org/Portals/0/Conferences/About_SIAM_Confer...
| vitorsr wrote:
| - Streaming algorithms [1]
|
| - Eventual consistency [2]
|
| - Refinement types [3], [4]
|
| [1] https://www2.cs.uh.edu/~ceick/6340/Streams.pdf
|
| [2] https://www.microsoft.com/en-us/research/wp-
| content/uploads/...
|
| [3] https://ranjitjhala.github.io/static/trust_but_verify.pdf
|
| [4]
| https://ranjitjhala.github.io/static/refinement_types_for_ty...
| NeuroCoder wrote:
| Refinement types are pretty interesting and I've seen mention
| of them recently. I'm curious if there are any practical
| reasons we don't see them implemented in more languages.
| klibertp wrote:
| > Refinement types are pretty interesting
|
| Indeed. To me, they look like a formalized subset of
| precondition contracts that can be checked statically. I
| don't think the idea is new, I seem to recall Ada and Eiffel
| having something like this, but I first learned about them
| via Racket[1]. I still don't know how they work or what's
| needed to implement them.
|
| Interestingly, Raku has a similar concept called `subset`, eg.
|
|     subset OneToTen of Int where * > 0 & * < 10;
|
| but from what I know no static analysis is done on the
| refinement/"where" part, it's a runtime assertion only.
|
| [1] https://blog.racket-lang.org/2017/11/adding-refinement-
| types...
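| To make the runtime-only flavour concrete, here is a minimal
| Python sketch of a `subset`-style refinement (the `refine` helper
| and `OneToTen` name are just illustrative, not from any real
| library; a true refinement type system would discharge the check
| statically):
|
|     def refine(base, predicate):
|         # Return a constructor that accepts a value only if it has
|         # the base type and passes the predicate -- a runtime
|         # assertion, like Raku's `where` clause, not a static proof.
|         def construct(value):
|             if not isinstance(value, base) or not predicate(value):
|                 raise ValueError(f"{value!r} fails the refinement")
|             return value
|         return construct
|
|     OneToTen = refine(int, lambda n: 1 <= n <= 10)
|
|     OneToTen(7)     # ok
|     # OneToTen(42)  # raises ValueError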
| sireat wrote:
| For Maximum Flow (and Minimum Cut) through a flow network we are
| getting close to linear time:
|
| 2022 Paper https://arxiv.org/abs/2203.00671
| cjalmeida wrote:
| It's not particularly new, but recently I've been involved a lot
| with solving large combinatorial optimization problems
|
| https://en.m.wikipedia.org/wiki/Combinatorial_optimization
|
| Algos to solve those, both exact methods and heuristics, are
| super useful in real world applications, and very underappreciated
| by the HN crowd IMO.
| JZL003 wrote:
| Yes! Me too, there's a whole world out there and it's codified
| in bullet-proof programs like CPLEX or gurobi. Sadly they are
| expensive, and it's not obvious whether a given problem's size and
| complexity will lend itself to some of their heuristics.
|
| Have you been doing it by hand or using a big program?
| cjalmeida wrote:
| At $DayJob, we deal with OR problems for large companies and
| typically, when possible, we leverage Gurobi to solve Mixed
| Integer Linear Programming (MIP) formulations. However it's
| not unusual for the problems to be huge to the point even
| Gurobi can't solve them in a reasonable time. And of course,
| there's the case where the problem can't be formulated (or
| approximated) as a linear program at all.
|
| So, in a couple of projects I was involved in, we had to resort
| to manually constructed heuristics such as Simulated Annealing
| (SA) and Large Neighborhood Search (LNS).
|
| A very cool large-scale approach we resorted to in some problems
| (eg. airline scheduling) was Column Generation[1]. Not only did
| it solve in minutes what took hours in the MIP implementation,
| but unlike SA and LNS it's an _exact_ method that provides valid
| bounds on global optimality.
|
| [1]: https://en.wikipedia.org/wiki/Column_generation
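| Re the heuristics mentioned above: a bare-bones simulated
| annealing loop is only a dozen lines. This Python sketch uses a
| toy objective and cooling schedule purely for illustration; real
| OR problems live or die by the domain-specific neighbourhood
| moves:
|
|     import math, random
|
|     def anneal(initial, neighbour, cost, t0=1.0, cooling=0.995,
|                steps=10_000):
|         current = best = initial
|         t = t0
|         for _ in range(steps):
|             candidate = neighbour(current)
|             delta = cost(candidate) - cost(current)
|             # always accept improvements; accept worsenings with
|             # Boltzmann probability exp(-delta/t)
|             if delta <= 0 or random.random() < math.exp(-delta / t):
|                 current = candidate
|             if cost(current) < cost(best):
|                 best = current
|             t *= cooling
|         return best
|
|     # toy usage: minimise (x - 3)^2 over the integers
|     print(anneal(0, lambda x: x + random.choice([-1, 1]),
|                  lambda x: (x - 3) ** 2))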
| kherud wrote:
| Answer Set Programming is an incredibly powerful tool to
| declaratively solve combinatorial problems. Clingo is one of
| the best open source implementations in my opinion:
| https://github.com/potassco/clingo
|
| - Based on conflict driven nogood learning, it has incredible
| performance (including good multithreading)
|
| - It has different optimization and enumeration modes
|
| - It has strong built-in heuristics, but also allows manual
| heuristics
|
| - It allows incremental solving
|
| - It has a nice API to integrate other solving paradigms (like
| difference logic and constraint programming, for which
| implementations already exist)
|
| - There is a huge ecosystem around it
|
| And much more, go check it out!
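| For a feel of what driving it from Python looks like, here is a
| tiny graph-colouring example (assumes the `clingo` Python package
| is installed; the encoding itself is just illustrative):
|
|     import clingo
|
|     program = """
|     node(1..4). edge(1,2). edge(2,3). edge(3,4). edge(4,1).
|     colour(red; green; blue).
|     % pick exactly one colour per node
|     1 { assign(N,C) : colour(C) } 1 :- node(N).
|     % adjacent nodes must get different colours
|     :- edge(N,M), assign(N,C), assign(M,C).
|     """
|
|     ctl = clingo.Control(["0"])      # "0" = enumerate all answer sets
|     ctl.add("base", [], program)
|     ctl.ground([("base", [])])
|     ctl.solve(on_model=print)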
| bjornsing wrote:
| Any advice on how to get to work on this for a living?
| mtsolitary wrote:
| I think it depends a lot on the domain you are in. An example
| of some domains with lots of discrete opt problems:
| transport, scheduling, logistics. And then read stuff by
| Stephen Boyd :)
| mtsolitary wrote:
| What makes you think they're underappreciated? I too have been
| spending a lot of time in this space recently and agree it's
| one of the more fertile "maths meets reality" spaces :)
| Karrot_Kream wrote:
| > Algos to solve those both exact methods and heuristics are
| super useful in real world applications, and very
| underappreciated by the HN crowd IMO.
|
| There just aren't many situations where they're useful. Time
| scheduling, logistics, operations, there are only a handful of
| places where these problems are so important that they need to
| be solved optimally or nearly optimally.
| cjalmeida wrote:
| IMO, the main benefit of those exact algos is that they provide
| optimality bounds, allowing you to make an informed decision to
| continue or to stop and provide a non-optimal solution. This is
| valuable in many practical situations.
|
| >Time scheduling, logistics, operations...
|
| Add marketing (pricing, assortment, ...) and finance
| (portfolio, capital budgeting, ...) to those.
|
| And those are _huge_ domains, moving trillions of USD
| globally each year. Many companies (think most of the Global
| 2000) benefit from a couple % improvement on their operations
| that lead to many $$ in savings.
| adeon wrote:
| Are there some interesting examples of real world applications
| of this stuff? Is it like planning problems or supply chain
| optimization etc.?
|
| I've used SAT solvers recreationally to try to solve some
| mathematical combinatorics-like problems, but interesting,
| more down-to-earth applications have eluded me. I would love to
| dig deeper into this stuff.
| currymj wrote:
| Gurobi (maker of a MIP solver) has a bunch of industry case
| studies as marketing materials on their website:
|
| https://www.gurobi.com/case_studies/
|
| It's mixed-integer programming rather than SAT
| solvers/constraint programming, but similar in spirit.
| jvanderbot wrote:
| I used a MILP solver to optimize my ship loadouts in
| Highfleet. It's rugged-looking, but works great.
| https://hfopt.jodavaho.io
| lqr wrote:
| The field of "algorithms with predictions" studies how to use
| predictions/learning within traditional CS problem settings:
|
| > _We introduce algorithms that use predictions from machine
| learning applied to the input to circumvent worst-case analysis.
| We aim for algorithms that have near optimal performance when
| these predictions are good, but recover the prediction-less worst
| case behavior when the predictions have large errors._
|
| An example is using learning to decide which variable to
| discretize next in a branch-and-bound integer linear programming
| solver. The algorithm might be extremely similar to the classic
| version, but the analysis is new.
|
| https://arxiv.org/abs/2006.09123
|
| More broadly, I think there's a lot of work on applying more
| refined analysis to classic algorithms - online settings,
| smoothed analysis, etc.
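| A hedged toy example of the flavour, in Python: a search where a
| (possibly bad) predicted position is recovered from by exponential
| search, so a good predictor gives near-O(1) lookups and a terrible
| one still degrades gracefully to O(log n). The linear "model" here
| is made up for illustration:
|
|     import bisect
|
|     def predicted_index(keys, key):
|         # toy predictor: assume keys are roughly uniform
|         lo, hi = keys[0], keys[-1]
|         if hi == lo:
|             return 0
|         frac = (key - lo) / (hi - lo)
|         return min(len(keys) - 1, max(0, int(frac * (len(keys) - 1))))
|
|     def search_with_prediction(keys, key):
|         guess = predicted_index(keys, key)
|         lo, hi, step = guess, guess + 1, 1
|         # exponential search outwards from the guess to bracket the key
|         while lo > 0 and keys[lo] > key:
|             lo, step = max(0, lo - step), step * 2
|         step = 1
|         while hi < len(keys) and keys[hi - 1] < key:
|             hi, step = min(len(keys), hi + step), step * 2
|         i = bisect.bisect_left(keys, key, lo, hi)
|         return i if i < len(keys) and keys[i] == key else -1
|
|     keys = list(range(0, 3000, 3))
|     assert search_with_prediction(keys, 999) == keys.index(999)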
| sarojmoh1 wrote:
| First Love?! Haha, jk.
|
| I'm currently re-studying them just bc I'm interview prepping. I
| wonder if whiteboarding for DS/Algo is gonna be a thing of the
| past too though.
|
| I would imagine there are certainly new DS/Algos in progress to
| aid ML training efficiency.
| yieldcrv wrote:
| Merkle trees and state tries were not in my formal education
|
| But I've had to learn them and employ them a lot lately; they're
| very good for resource-constrained environments.
| di4na wrote:
| Parsing, but also casting to string efficiently, is still an open
| problem with regular breakthroughs. Ryu or Dragonbox for float-to-
| string conversion come to mind.
|
| Parsing with errors that are useful for the human is definitely
| an open research topic.
|
| Capabilities have seen interesting work, in particular in
| relationships to effect handlers.
|
| HyperLogLog and other data sketches keep seeing incremental
| progress. I have not had time to read it all, but Otmar Ertl has
| a massive reduction in memory use in UltraLogLog that I expect a
| paper on soon. The code is already up.
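| For anyone who hasn't seen it, the core HyperLogLog idea fits in
| a handful of lines of Python. Register count and constants below
| are illustrative, and the small/large-range corrections are
| omitted; sketches like UltraLogLog then pack the registers far
| more tightly:
|
|     import hashlib
|
|     P = 10                    # 2^10 = 1024 registers
|     M = 1 << P
|     registers = [0] * M
|
|     def add(item: str) -> None:
|         h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
|         idx = h & (M - 1)                 # low P bits pick a register
|         rest = h >> P                     # remaining 64 - P bits
|         rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
|         registers[idx] = max(registers[idx], rank)
|
|     def estimate() -> float:
|         alpha = 0.7213 / (1 + 1.079 / M)  # standard HLL bias constant
|         return alpha * M * M / sum(2.0 ** -r for r in registers)
|
|     for i in range(100_000):
|         add(f"user-{i}")
|     print(round(estimate()))              # ~100k, within a few percent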
| hinkley wrote:
| I've been poking at prometheus internals lately and I am
| starting to wonder if histograms need some sort of hyperloglog
| treatment to improve on buckets. They don't seem like a very
| accurate solution.
| di4na wrote:
| You may appreciate ddsketch
| https://www.datadoghq.com/blog/engineering/computing-
| accurat...
|
| Or tdigest. For the prom use case, i usually prefer ddsketch.
| Easier to integrate. Tdigest is really complicated.
|
| That said, there is definitely interesting work that needs to
| happen on compressing time series histograms for heatmaps.
| IIRC Influx had some interesting work there.
| epberry wrote:
| Unfortunately for you my favorite recent algorithms result uses
| AI as part of improving a very traditional algorithm. The paper
| is "Learned Index Structures"
| https://blog.acolyer.org/2018/01/08/the-case-for-learned-ind...
| morkpek wrote:
| This is a wonderful paper. In a similar vein, "Algorithms with
| Predictions" (https://arxiv.org/pdf/2006.09123.pdf) presents a
| bunch of fun algorithmic problems related to bringing AI into
| traditional algorithms.
| inciampati wrote:
| Yes. There is ongoing work in making succinct full text indexes
| that use substantially less space than the text they model. See
| work on the r-index: https://doi.org/10.1089/cmb.2019.0309
| shae wrote:
| I'd suggest looking up succinct data structures, cache oblivious
| data structures, and probabilistic data structures.
| mamcx wrote:
| Anything specific that could cover what "NDArray-like"
| structures are? Concretely, to make row-oriented/column-
| oriented hybrids?
|
| (I wish to find a way to encode the equivalent of `struct
| Data { col1: Vec<..>, ... }`)
| LukeEF wrote:
| Succinct data structures ftw.
| LukeEF wrote:
| How about some succinct data structures and delta encoding for
| modern databases [1]. Succinct data structures are a family of
| data structures which are close in size to the information
| theoretic minimum representation (while still being queryable).
|
| [1]
| https://github.com/terminusdb/terminusdb/blob/dev/docs/white...
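| A tiny Python sketch of the flavour (block size and layout here
| are for readability, not the information-theoretic optimum; real
| implementations add a second level of counts plus popcount so
| rank is O(1)):
|
|     BLOCK = 64
|
|     class RankBitvector:
|         def __init__(self, bits):
|             self.bits = bits
|             self.block_ranks = []       # number of 1s before each block
|             ones = 0
|             for i, b in enumerate(bits):
|                 if i % BLOCK == 0:
|                     self.block_ranks.append(ones)
|                 ones += b
|
|         def rank1(self, i):
|             """Count of 1s in bits[0:i]."""
|             block = min(i // BLOCK, len(self.block_ranks) - 1)
|             return self.block_ranks[block] + sum(self.bits[block * BLOCK:i])
|
|     bv = RankBitvector([1, 0, 1, 1, 0] * 100)
|     assert bv.rank1(10) == 6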
| TallGuyShort wrote:
| Algs/DS are still pretty fundamental to AI. Hyperparameter
| optimization algorithms like SHA/ASHA, stochastic gradient
| descent, PBT, etc. Structures like transformers, the various
| topologies, convolutions, etc. Not tickling the same itches?
| abeppu wrote:
| For numerical algorithms, probabilistic numerics is an
| interesting research corner. It both provides a new perspective
| on existing algorithms, and provides a framework for tailoring
| new ones to specific problems.
|
| https://www.probabilistic-numerics.org/
| softwaredoug wrote:
| Nobody has mentioned Approximate Nearest Neighbor search (aka
| vector search), which IMO is a fundamental data structures
| advancement.
|
| Basically: given a set of a million (billion, trillion...)
| ~1000-dimensional vectors, and a query ~1000-dimensional vector,
| find the closest vector in the indexed set. This is "nearest
| neighbor" search, and there have been increasingly sophisticated
| approaches:
|
| http://ann-benchmarks.com
|
| https://haystackconf.com/us2022/talk-6/
|
| And has spawned companies like Weaviate, Pinecone, etc etc (a
| half a dozen others)
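| For a sense of the problem being solved, the naive version is a
| brute-force scan (numpy assumed; sizes are illustrative). ANN
| structures like HNSW, IVF and DiskANN exist precisely because
| this stops being viable at billions of vectors:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     index = rng.standard_normal((100_000, 128), dtype=np.float32)
|     index /= np.linalg.norm(index, axis=1, keepdims=True)
|
|     def nearest(query, k=5):
|         q = query / np.linalg.norm(query)
|         scores = index @ q                      # cosine similarity
|         return np.argpartition(-scores, k)[:k]  # k best (unordered)
|
|     print(nearest(rng.standard_normal(128, dtype=np.float32)))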
| marginalia_nu wrote:
| What sort of dataset are you indexing that has a trillion
| entries?
|
| That's 100,000 English Wikipedias or 50 Googles.
| maerF0x0 wrote:
| There are about 14B trades per year on the NYSE, which I'm sure
| could represent 10x that in entities (buyer, seller, broker,
| etc.) and could easily hit 1000x that in log lines. The number of
| shares traded per day is in the billions, so you'd hit 1T if each
| share were represented uniquely.
| marginalia_nu wrote:
| You don't typically use vector search for trade data
| though. It's already ridiculously well structured. Assets
| have identifiers, parties and counterparties have IDs, etc.
| I'm not sure what nearest neighbors in a vector space would
| add.
| l33t233372 wrote:
| Nonetheless it's an example you asked for of a dataset
| with over a trillion entries.
| marginalia_nu wrote:
| I asked which dataset they were indexing that was of this
| size, not whether any such dataset exists in other
| domains.
| yes_man wrote:
| Dumb example, but still an example from the practical world. Your
| body (assuming you are a human) has trillions of cells. Each cell
| is way more complicated than what a 1000-dimensional vector can
| represent, but maybe it could be a loose estimate of some
| properties of each cell. Now the algorithm could be about
| finding the most similar cell. Could be useful for finding
| e.g. other cancer cells based on properties of one cancer
| cell.
|
| Not that this is a practical example, because we technically
| cannot index all the cells in a body. But such an algorithm
| being studied today might be useful one day when we do have the
| capability to collect such data.
| marginalia_nu wrote:
| Where are you going to store this data? It's dozens of
| petabytes.
| tourist2d wrote:
| Where do you think computers store data?
| yes_man wrote:
| Should theoretical research of data structures and
| algorithms have been capped at 1GB in 1980 because that
| was the biggest single hard drive available back then and
| you couldn't store for example a 2GB dataset on a disk?
| marginalia_nu wrote:
| Not at all, I'll still call out fantastic claims when I
| see them though.
| karpierz wrote:
| Google has definitely indexed over a trillion pages.
| marginalia_nu wrote:
| Do you have any sources for this claim?
|
| As far as I am aware Google doesn't publish any
| statistics about the size of its index, which no doubt
| varies.
| sophacles wrote:
| It's only a few racks worth of disk servers.
|
| If I were building it, from my 5 minutes of googling, using
| 15TB NVMe U.2 drives and easily available server chassis,
| I can get 24 drives per 2u of a rack. That's 360 TB + a
| couple server nodes. So ~6u per PB. A full height rack is
| 42u, so 6-7PB per rack once you take up some of the space
| with networking, etc. So dozens is doable in a short
| datacenter row.
|
| Realistically you could fit a lot more storage per U,
| depending on how much compute you need per unit of data.
| The example above assumes all the disks are at the front
| of the server only, if you mount them internally also,
| you can fit a lot more. (see Backblaze's storage pods for
| how they did it with spinning disks).
|
| Dozens of PB is not that much data in 2023.
| de_nied wrote:
| One application is LLMs. I've seen a project use Pinecone to
| enable "infinite context" for GPT-3.
| marginalia_nu wrote:
| This is still several orders of magnitude more items than
| the entire training corpus for all GPT models combined. I
| guess if you were to index individual codepoints in the
| training corpus, we'd start to see those volumes.
| fzliu wrote:
| Most commonly used ANN algorithms today are clever
| optimizations atop "old" data structures. DiskANN is a recent
| favorite: https://milvus.io/blog/2021-09-24-diskann.md
| kokanee wrote:
| "Most commonly used X today are clever optimizations atop old
| Y" is pretty much the story of technology, isn't it?
| RoyGBivCap wrote:
| It's lego all the way down to the clock cycle.
| curiousgibbon wrote:
| Negative weight single source shortest paths in near-linear time:
| https://arxiv.org/abs/2203.03456
|
| Obligatory Quanta link: https://www.quantamagazine.org/finally-a-
| fast-algorithm-for-...
| ww520 wrote:
| Not sure it's still considered new, as it has been out for a
| couple of years, but the adaptive radix tree and its variants are
| pretty promising.
| samwillis wrote:
| Anything CRDT (conflict free replicated datatypes:
| https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...)
| related is fun to read up on and play with.
|
| Papers and references (page maintained by a central academic in
| the world of CRDTs): https://crdt.tech
|
| Group doing research into how they can be used to build
| interesting collaborative (and async) applications:
| https://www.inkandswitch.com
|
| A few of the major open source implementations - mostly for rich
| text editing or JSON like data structures:
|
| - Yjs: https://github.com/yjs/yjs
|
| - Automerge: https://github.com/automerge/automerge
|
| - Peritext: https://www.inkandswitch.com/peritext/
|
| - Diamond types: https://github.com/josephg/diamond-types
|
| People building eventually consistent database syncing with them:
|
| - https://electric-sql.com (Postgres <-> SQLite)
|
| - https://vlcn.io (SQLite <-> SQLite)
|
| Open source collaborative servers (coordination, persistence,
| presence):
|
| - https://github.com/ueberdosis/hocuspocus
|
| - https://github.com/partykit/partykit
|
| - https://github.com/firesync-org/firesync
| beck5 wrote:
| Thanks for mentioning FireSync, we haven't even officially
| launched yet! After 10+ years of building real-time
| collaborative apps based on Operational Transformation
| (ShareLaTeX -> Overleaf.com) we have become quietly very
| excited about CRDTs and Yjs. Now we are focused on building a
| scalable Yjs-compliant backend on top of PG with FireSync.
|
| If anyone has thoughts about this space, feature requests,
| would like a preview of what we are building, or anything else,
| please do reach out directly to me at henry@firesync.live. I'm
| talking to as many people as possible at the moment.
| josephg wrote:
| Diamond types author here. I'm working on another blog post - I
| had a big breakthrough getting collaborative text editing many
| times faster again using some new ideas. I'm in the process of
| writing up our new work now.
|
| And there's another hard problem I'm stuck on: so, CRDTs like
| mine and Yjs use (random agent id, sequence number) as unique
| IDs for operations. But a troublesome user might reuse an
| operation ID, sending different operations to different peers
| and giving them the same (agent, seq) pair. This results in
| really awful, weird bugs. Martin Kleppman wrote a paper about
| BFT in CRDTs. His suggestion is we do what git does and use SHA
| hashes for IDs instead.
|
| But there's two problems:
|
| 1. We don't want to store a hash for every keystroke a user
| makes. That would get big, fast. We want a DAG of hashes like
| git, except a hash is generated with each keystroke. How do we
| build an efficient system that can query by hash, without
| storing all of the hashes?
|
| 2. I want users to be able to permanently delete inserted text.
| If I accidentally paste an AWS key into a document, I don't
| want that key in the document history forever. How do I enable
| that while still getting all the benefits of the hashing system
| above? (Note if your hash includes the typed character, even if
| you delete the character itself from disk, you can trivially
| brute force the hash to recover any typed characters).
|
| Each of these problems in isolation has a solution. But the
| real prize is how do you solve both at the same time? I think
| this problem has a solution but I don't think anyone has solved
| it yet. Want to name an algorithm and get my eternal respect?
| Here's your chance.
| 020921 wrote:
| I know you said you didn't like the idea of hashes, so this
| might be a non-starter for your use case.
|
| But it sounds like a V5 Guid [0] might meet some of the needs
| for your "operation Ids". The existing "random agent id", and
| "sequence number" could be used as part of the namespace /
| salt.
|
| [0] https://www.sohamkamani.com/uuid-versions-
| explained/#v5-non-...
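| For what it's worth, Python's stdlib makes the idea easy to
| sketch (the namespace string and field layout below are invented
| for illustration):
|
|     import uuid
|
|     AGENT_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "example-crdt-agents")
|
|     def op_id(agent_id: str, seq: int) -> uuid.UUID:
|         # deterministic: the same (agent, seq) always maps to the
|         # same ID, but being a hash it no longer compresses like a
|         # plain counter
|         return uuid.uuid5(AGENT_NS, f"{agent_id}:{seq}")
|
|     print(op_id("seph", 42))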
| josephg wrote:
| I don't have a problem with hashes generally. Correct me if
| I'm wrong, but this scheme sounds like hashing with extra
| steps?
|
| I agree this could be done. But diamond types (my library)
| considers every keystroke to be a new change. If we store a
| UUID for every keystroke, that's very inefficient.
| texuf wrote:
| Troublesome user? As in a malicious user? Are you trying to
| lock down your attack surface against script kiddies or is
| there another instance where this situation comes up?
| josephg wrote:
| Yes, a malicious user. I've also seen buggy software try to
| reuse IDs - which causes similar mayhem. It would be better
| if we could remove this entire class of problems.
| lcnPylGDnU4H9OF wrote:
| It could mean that, as in "someone who causes trouble" or
| it could be a result of a bug which only affects a handful
| of users who suddenly "have some troubles".
| KineticLensman wrote:
| > or is there another instance where this situation comes
| up?
|
| Yes, if you are a Byzantine general trying to coordinate an
| attack, and you aren't sure of your peers' true allegiance.
| some_furry wrote:
| Problem 2 can be solved by "cryptographic erasure"
| techniques.
|
| 1. Encrypt the data with an ephemeral key.
|
| 2. Upon delete, just wipe the key. The data becomes
| unrecoverable.
|
| I'm not sure about problem 1. A fast seekable stream cipher
| like ChaCha might help here (don't use e.g. RC4 for this),
| but a faster hash like BLAKE3 might be more appropriate. I'm
| happy to noodle over this some more. These are just my
| unfiltered thoughts.
|
| If you combined the two: Each entry in the graph could have a
| public salt used for key derivation for the actual data. This
| salt would not be part of the DAG, but stored outside of it.
| To delete an entry, zero-out the salt and the underlying key
| becomes unrecoverable, which makes the data unreadable,
| without needing to change the history of the DAG.
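| A minimal Python sketch of the crypto-erasure idea (assumes the
| `cryptography` package; key storage and the DAG wiring are out of
| scope here):
|
|     from cryptography.fernet import Fernet
|
|     key = Fernet.generate_key()           # per-entry ephemeral key
|     token = Fernet(key).encrypt(b"AKIA... accidentally pasted secret")
|
|     # normal reads decrypt with the stored key
|     assert Fernet(key).decrypt(token).startswith(b"AKIA")
|
|     # "delete" = destroy the key; the ciphertext can stay in the
|     # history but is now unrecoverable
|     key = None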
| josephg wrote:
| Yes. But diamond types stores a new entry in the DAG for
| every keystroke each user makes. The current scheme of IDs
| being (agent, sequence number) allows subsequent operations
| by the same user to be trivially run-length encoded. Eg if
| I make 1000 changes in order, we can encode that as
| ("seph",1..1000). It takes much less than 1 byte of
| overhead on disk per character typed.
|
| Salts per character + cryptographic erasure would solve
| problem 2. But how do we do it without a massive increase
| in storage, memory use and network bandwidth? I don't want
| to generate and store a cryptographically unique salt per
| keystroke. (Even though it might run fine on modern
| hardware).
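| For reference, a rough Python sketch of why the (agent, seq)
| scheme compresses so well (field names are illustrative, not
| diamond types' actual encoding): consecutive ops by one agent
| collapse into a single (agent, start, length) run:
|
|     def run_length_encode(op_ids):
|         """op_ids: iterable of (agent, seq) pairs in log order."""
|         runs = []
|         for agent, seq in op_ids:
|             last = runs[-1] if runs else None
|             if last and last[0] == agent and last[1] + last[2] == seq:
|                 last[2] += 1                     # extends the current run
|             else:
|                 runs.append([agent, seq, 1])     # (agent, start_seq, len)
|         return runs
|
|     ops = [("seph", i) for i in range(1, 1001)] + [("alice", 1)]
|     print(run_length_encode(ops))  # [['seph', 1, 1000], ['alice', 1, 1]]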
| benlivengood wrote:
| 1. Can you merge inserts of characters into inserts of
| strings reliably given both the original inserts and the
| current document history? E.g. if the closed set of history
| that only inserts in the middle of the subset of inserts that
| you want to turn into a string those inserts can be
| permanently merged into a string insert.
|
| 2. String merging should also allow null merges, e.g. identify
| the same closed set of history inserts that refer to
| only the inside of the original inserts, and so permanently
| deleting history of characters in the document could be
| replaced with a null insert that can still be referred to by
| other inserts/merges.
| flir wrote:
| Would (1) hit problems because you may store them coalesced
| eventually, but you sent them to the other clients as
| individual events, where they may get interleaved with
| events from that client?
|
| I think it might mess with fine-grained undo, also?
|
| (idk much about CRDTs)
|
| "Reject re-used sequence numbers" might be simpler?
| debatem1 wrote:
| Sounds like inverted hash chains. Generate a hash chain h_i =
| sha(h_(i-1)) for a random h_0 up to h_N. Each message m_i
| becomes (agent_id, state, hmac(h_(N-i+1), state), h_(N-i)).
| Upon receipt of message m_(i+1) the receiver can verify the
| hmac of the preceding state and that the key is the preimage
| of the earlier message. By repeated hashing any gaps in the
| sequence can be fast-forwarded through and still verified.
| This provides for deletion without a loss of verifiability up
| to length N.
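| A literal Python transcription of that scheme, just to make the
| moving parts concrete (stdlib sha256/hmac, N fixed up front; this
| is a sketch of the construction, not a vetted protocol):
|
|     import hashlib, hmac, os
|
|     def sha(b: bytes) -> bytes:
|         return hashlib.sha256(b).digest()
|
|     N = 1000
|     chain = [os.urandom(32)]               # h_0
|     for _ in range(N):
|         chain.append(sha(chain[-1]))        # h_i = sha(h_(i-1))
|
|     def make_message(i: int, agent: str, state: bytes):
|         tag = hmac.new(chain[N - i + 1], state, hashlib.sha256).digest()
|         return (agent, state, tag, chain[N - i])   # reveals h_(N-i)
|
|     def verify(prev_msg, msg):
|         _, prev_state, prev_tag, prev_revealed = prev_msg
|         revealed = msg[3]
|         # chain continuity: new reveal is the preimage of the last one
|         if sha(revealed) != prev_revealed:
|             return False
|         # the previous message's hmac key is one hash further up the chain
|         key = sha(prev_revealed)
|         expected = hmac.new(key, prev_state, hashlib.sha256).digest()
|         return hmac.compare_digest(expected, prev_tag)
|
|     m1 = make_message(1, "agent", b"insert 'h' at 0")
|     m2 = make_message(2, "agent", b"insert 'i' at 1")
|     assert verify(m1, m2)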
| joelg wrote:
| it's definitely not an exact solution, but i've been thinking
| about ways prolly trees could fit into CRDT systems. they
| essentially give you efficient peer-to-peer entry-wise diffs,
| so could be good for identifying conflicts after-the-fact.
| but if you store operations in a prolly tree, you end up
| hashing them all (plus logarithmic additional space
| overhead), so it might be a non-starter.
|
| i have a blog post to shill if you haven't heard of them:
| https://joelgustafson.com/posts/2023-05-04/merklizing-the-
| ke...
| vivegi wrote:
| > 1. We don't want to store a hash for every keystroke a user
| makes.
|
| So, why does it have to be every keystroke? We can batch them
| locally. An approximate rule of thumb to use to guide the
| batching is typing speed. Using 200 keystrokes/minute, we get
| an average of 3.33 keystrokes/second. If we use 3 seconds
| latency for the collaborative editing to _resolve_ concurrent
| edits, we get 10 keystrokes per batch (3 secs). You can also
| have an override where smaller batches are replicated to
| nodes if sufficient time has elapsed since last local edit
| (eg: 10 secs have elapsed and we have a partial batch of 3
| keystrokes only).
|
| > 2. If I accidentally paste an AWS key into a document, I
| don't want that key in the document history forever. How do I
| enable that while still getting all the benefits of the
| hashing system above?
|
| In order for this to work, we should assume that every
| replica's current state is always obtained by starting with
| the empty document and then playing the operations log. We
| should also permit the author of an operation to _rollback_
| the operation. So, I suppose if the collaborative editor
| shows an interface of all operations that originated locally
| and the author can select the operations that inserted the
| AWS key into the replica, they can issue a _rollback
| operations_ op and have it propagated to all replicas.
|
| Since the point-in-time state of the replicas are regenerated
| by replaying the operations log, we would have deleted the
| accidentally inserted key.
|
| In order to address the issue of the AWS key still being in
| the operations log, we can define the semantics of the
| _rollback op_ to also include secure erasure of the previous
| operation from the operations log (i.e., the rollback op
| would update the original operation into a no-op).
|
| So, when all replicas apply the _rollback op_ initiated by
| the author, eventually all replicas will converge to having
| the accidentally inserted AWS key removed from the replica
| and the operations log.
| amelius wrote:
| I think keystrokes was just a metaphor for whatever
| operation the application has to deal with.
| vivegi wrote:
| I was responding to the diamond-types CRDT author's
| question in the parent comment. Their github project page
| [1] mentions text editing a lot:
|
| > This repository contains a high performance rust CRDT
| for text editing. This is a special data type which
| supports concurrent editing of lists or strings (text
| documents) by multiple users in a P2P network without
| needing a centralized server.
|
| > This version of diamond types only supports plain text
| editing. Work is underway to add support for other JSON-
| style data types. See the more_types branch for details.
|
| In any case, I agree with the metaphor and batching
| granular operations can always be done.
|
| [1]: https://github.com/josephg/diamond-types
| gigatexal wrote:
| I like your approach to thinking about this. Seems like a
| plausible set of assumptions to make.
| ramses0 wrote:
| 1. Bloom for existence, re-create via brute-force search (?),
| checkpoint hash every $N seconds(?)
|
| 2. Godbolt => "This rendered text came from these portions of
| the DAG/tree, press CTRL-Del to irrecoverably remove them
| from the DAG and/or replace them with 'REDACTED'".
|
| #1 is "simply" related to efficiency. You can lean on
| probabilistic data structures to see "yeah, probably this was
| HERE in the history tree", and then perhaps replay the "N"
| seconds checkpoints with the original input data to validate
| or match exactly where the hash is/was.
|
| #2 is the same problem GIT has, where you need to rewrite /
| filter / "detached head" in order to obliterate something
| from history, and is somewhat a local/remote UI issue. "A
| request to obliterate history is being built in parallel over
| here => click here to adopt it, and diff/transfer only your
| new work to it"
|
| Related to #2 I've had a thought of a social network
| protocol, and how to "delete malicious nudes" or "oops, I
| accidentally published my SSN and Tax Filing *.pdf to my
| friends". On the "real" open internet, google or malicious
| actors would/could store that info forever. However, in a
| semi-closed group (eg: 100-1000 "friends of friends") that
| are "all" running participating/conforming clients, a
| "request-to-delete" or "request-to-disavow" is particularly
| valuable.
|
| My imagined protocol is: "post001.txt" => "$GOOD_SHA",
| "nudes.jpg" => "$WHOOPS_SHA", "taxes.pdf" => "$YIKES_SHA",
| "bait.txt" => "$BAIT_SHA".
|
| "Of course" everything is content-addressable, ie: anyone can
| fetch "post001.txt" from anyone as "$GOOD_SHA", however, "it
| would be nice" if I as a participant in the network could
| query "how_many( $GOOD_SHA );" and see distribution across
| the network participants. "disavow( $WHOOPS_SHA, $YIKES_SHA )
| && how_many( $WHOOPS_SHA, $YIKES_SHA )" => ie: in a "good"
| network, after a "disavow(...)" then nodes should not respond
| with that availability. Occasionally throw in "$BAIT_SHA w/
| disavow(...)" and make sure that no one is propagating it.
|
| Similar to "key revocation lists", you could eliminate/reject
| any clients which do not respect requests of non-propagation
| for $BAIT_SHA or anything that is "disavow(...)'d". Same
| story with anyone hosting
| https://en.wikipedia.org/wiki/Celebrity_sex_tape ... I don't
| really want them in my network, so query for particular
| content, and if they respond, I can nudge/boot/disengage from
| users that are hosting them (same with eg: "I love my AK-47
| and assault rifle" political memes, disengage / dislike /
| spread apart people sharing/hosting/propagating content that
| I don't want).
|
| While you're still left with malicious / non-conforming nodes
| potentially archiving everything for all eternity, there is a
| relatively immediate ability to fracture / exclude (shun)
| nodes that are detected as non-conforming, so "your stuff" is
| still out there, it's just relatively inaccessible, and you
| have to be "careful" to have your client be up-to-date w.r.t.
| revocation actions otherwise you'll be excluded from
| networks.
|
| Meh. Kindof complicated, but those seem to be in the range of
| characteristics that you're talking about as well.
| samwillis wrote:
| If my understanding of byzantine fault tolerance with CRDTs
| and collaborative editing is correct, it's really only a
| problem in systems where you have an untrusted user, someone
| who is going to actively try and corrupt the system?
|
| It seems to me that for the majority of use cases,
| collaborative editing in a work sense, you would normally
| have trust that a user isn't actively hostile. And so I'm not
| convinced that BFT needs to be a high priority goal in most
| of these implementations.
|
| I think a higher priority issue is how to aid developers
| with the use of generic CRDT structures within a system where
| all the rules are not encoded into the CRDTs themselves. So
| for example when using Yjs with Prosemirror, Prosemirror has
| its own schema that defines how a document can be structured.
| However, not all those rules are encoded into the underlying
| Yjs structure; it's possible to have two conflicting edits
| that result in a Yjs document that doesn't comply with the
| Prosemirror schema. The current system throws away any non-
| conforming data when the document is loaded or updated.
| Because of this you need to load the document into a server
| side instance of Prosemirror in order to apply that schema
| when doing a server side merge, or even if you are just
| rendering the static content after "blindly" merging the
| documents.
|
| My point really is that CRDTs are an encoding of user
| _intent_, and when these generic CRDTs are merged you have a
| structure representing merged user intent, not a final
| conforming state. The solution is either domain-specific CRDTs
| that are tailored exactly to your application, or a schema
| system for these generic types that applies rules when merging.
| Prosemirror/Yjs does (mostly) exactly this but I think it
| needs to be solved in a more generic way that can be run on
| any layer of the system.
| josephg wrote:
| Yes - The BFT problem only matters when you have Byzantine
| actors. But I think users deserve and expect the system to
| be reasonably well behaved and predictable in all
| situations. Anything publically writable, for example,
| needs BFT resilience. Or any video game.
|
| As for the prosemirror problem, I assume you're talking
| about weird merges from users putting markdown in a text
| crdt? You're totally right - this is a problem. Text CRDTs
| treat documents as a simple sequence of characters. And
| that confuses a lot of structured formats. For example, if
| two users concurrently bold the same word, the system
| should see that users agree that it should be bolded. But
| if that "bold" intent is translated into "insert double
| asterisks here and here", you end up with 4 asterisks
| before and after the text, and that will confuse markdown
| parsers. The problem is that a text crdt doesn't understand
| markdown. It only understands inserting items, and holding
| something shouldn't be treated as an insert.
|
| JSON editing has similar problems. I've heard of plenty of
| people over the years putting json text into a text crdt,
| only to find that when concurrent edits happen, the json
| grows parse errors. Eg if two users concurrently insert "a"
| and "b" into an empty list. The result is ["a""b"] which
| can't be parsed.
|
| The answer to both of these problems is to use CRDTs which
| understand the shape of your data structure. Eg, use a json
| OT/crdt system for json data (like sharedb or automerge).
| Likewise, if the user is editing rich text in prosemirror
| then you want a rich text crdt like peritext. Rich text
| CRDTs add the concept of annotations - so if two users bold
| overlapping regions of text, the crdt understands that the
| result should be that the entire region is bolded. And that
| can be translated back to markdown if you want.
|
| Luckily, in over a decade of work in the collaborative
| editing space, I haven't found any other examples of this
| problem other than rich text and json. I think if a
| collaborative editing system supports json data, rich text
| and plain text then it'll be enough to add collaborative
| editing support to 99% of applications.
|
| The ink & switch people did a great write up of how rich
| text CRDTs work here:
| https://www.inkandswitch.com/peritext/
| jsunderland323 wrote:
| Hi Joseph,
|
| This is a hard problem and something that I think has to be
| intrinsic to the serialization method of your CRDT. Unique
| IDs of any kind are really risky in CRDTs because they
| basically break the mathematical links between potential
| partitioned duplicates. Lists make this problem even harder
| because you have to contend with whether you should incorporate
| the ordinal of each element into the hash value.
|
| I'm working on a version control system for visual programs.
| You wrote a response to someone on structured edits a few
| months back about how the UNIX fs not being schema/structure
| aware is the core problem with collaborative edits and
| destroys open interoperability. I think you hit the nail on
| the head. I've been working on something for nearly a year
| now to try to address that problem. Let me know if you want
| to compare notes.
| jsunderland323 wrote:
| To answer your question 2. I don't think you can
| realistically delete a key, contend with network
| partitions, and have a true eventually consistent data
| structure. I think you're sort of running into the uncommit
| problem. Feels solvable but at the expense of another trade
| off you don't want to make. The solution here is really in
| workflow. Git solves it by separating the act of committing
| and pushing. If you push a commit containing a secret, the
| only thing you can do is invalidate the secret or not push
| your offline branch containing the commit in the first
| place.
| hnfong wrote:
| CS 168: The Modern Algorithmic Toolbox, Spring 2023
|
| https://web.stanford.edu/class/cs168/index.html
|
| I didn't have time to go through all of this, but the parts that
| I did read were very interesting to me.
|
| It's not even cutting edge new stuff of 202x, probably most of
| them are new-ish results from ~2000 onwards.
| eternalban wrote:
| This is something I planned (2015) on sharing at some point but
| then years flew by and here we are .. :}
|
| It is a cacheline sized 'container' (CLC) of machine-word length
| records, with one record used to store the record order and
| remaining bits for metadata. So you can project different kinds
| of container semantics, such as FIFO or LRU -- any ordered set
| semantics -- on this meta record. Using arrays of CLCs you can
| create e.g. a segmented LRU, where the overhead per item is
| substantially less than a conventional pointer-based
| data structure, and is naturally suited for concurrent operations
| (for example by assigning a range to distinct worker threads),
| and ops require only a few cache lines to be touched. The LRU
| (or whatever) semantics in aggregate will be probabilistic, as
| the LRU order is deterministic per unit container only. It is
| very fast :)
|
| https://github.com/alphazero/libclc/blob/master/include/libc...
|
| https://github.com/alphazero/libclc/blob/master/include/libc...
|
| As for the non-deterministic aspects: Since container semantics
| e.g. LRU order is only applied at unit level, the overall cache
| is ~LRU. We can strictly quantify the 'ordering error' by
| observing the age of items in FIFO mode as they are removed: for
| a deterministic container the age of the item is equal to the
| total capacity of the queue, for a segmented (array) composed of
| FIFO queues, the age will have a effectively gaussian
| distribution around the capacity (number of units x unit
| capacity). But since these containers can use as few as 9 bits
| per entry vs 24 or more _bytes_ for pointer based solutions
| (which use linked-lists), for the same allocation of memory, the
| capacity of the array of CLCs will be much greater, so, the
| distribution tail of 'short-lived' items will actually be longer
| lived than items in a strict queue for the same exact memory.
| Additional techniques, such as n-array hashing, and low order
| 'clock' bits at container level, can tighten this distribution
| significantly (i.e. ~LRU -> LRU) via better loading factors.
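| To make the 'order record' idea concrete, here is a rough Python
| sketch (the 7-slot layout and 3-bit fields are a simplified
| illustration of the style, not the actual libclc format): one
| metadata word stores the LRU order of the other slots as packed
| 3-bit indices, so a 'touch' rewrites a single machine word
| instead of chasing and relinking list pointers:
|
|     SLOTS = 7
|
|     def read_order(meta):
|         return [(meta >> (3 * i)) & 0b111 for i in range(SLOTS)]
|
|     def write_order(order):
|         meta = 0
|         for i, slot in enumerate(order):
|             meta |= slot << (3 * i)
|         return meta
|
|     def touch(meta, slot):
|         """Move `slot` to the most-recently-used position."""
|         order = read_order(meta)
|         order.remove(slot)
|         return write_order([slot] + order)
|
|     meta = write_order([0, 1, 2, 3, 4, 5, 6])   # 0 = MRU ... 6 = LRU
|     meta = touch(meta, 6)
|     assert read_order(meta) == [6, 0, 1, 2, 3, 4, 5]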
| hidudeb wrote:
| [flagged]
| sanxiyn wrote:
| Theoretical advances are constant (just check the literature),
| but two particular advances in the last decade are of practical
| importance: one for sort and one for hash table. As in: check the
| implementation you are using, if it doesn't incorporate these
| advances you are leaving substantial performance (2x is not out
| of the question) on the table.
|
| How Branch Mispredictions don't affect Quicksort (2016)
| https://arxiv.org/abs/1604.06697
|
| Quicksort is still the algorithm of choice for general purpose
| internal unstable sort. On modern hardware, quicksort spends a
| lot of time recovering from branch misprediction, because
| quicksort's comparison branches are inherently unpredictable. The
| problem is severe to the extent that with branchful quicksort,
| uneven split with more recursion is faster than even split,
| because it makes branches more predictable! Exact penalty is
| hardware microarchitecture specific, but one report is 20:80
| split being optimal.
|
| So... you should use branchless quicksort! The technique was
| pioneered by an implementation called BlockQuicksort, but pdqsort
| (for pattern-defeating quicksort)
| https://arxiv.org/abs/2106.05123 is also popular. pdqsort
| incorporates branchless technique and also includes other
| enhancements.
|
| Designing a Fast, Efficient, Cache-friendly Hash Table, Step by
| Step (2017) https://abseil.io/blog/20180927-swisstables
|
| Hash table is the workhorse of data structures, and while
| improvements are possible for specific usage patterns, open
| addressing with linear probing is still the fastest
| implementation for general purpose. So called "Swiss Table"
| combines cache-friendly metadata layout with SIMD-enhanced
| probing. These two optimizations work well with each other and
| combination of two is particularly effective.
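| To give a feel for the Swiss Table idea, here is a toy Python
| model of just the probe step (constants and layout illustrative;
| the real thing compares 16 control bytes at once with SSE2/NEON):
|
|     def split_hash(h):
|         return h >> 7, h & 0x7F   # H1 picks the group, H2 is a 7-bit tag
|
|     def match_group(ctrl, h2):
|         # simulates the SIMD compare: which of the 16 slots in this
|         # group have a control byte equal to the 7-bit tag?
|         return [i for i, c in enumerate(ctrl) if c == h2]
|
|     EMPTY = 0b1000_0000                  # high bit set = empty slot
|     ctrl = [EMPTY] * 16                  # one control byte per slot
|     h1, h2 = split_hash(hash("some key") & (2**64 - 1))
|     ctrl[3] = h2                         # pretend the key landed in slot 3
|     print(match_group(ctrl, h2))         # [3] -> only slot 3 needs a full compare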
| _dain_ wrote:
| _> pdqsort (for pattern-defeating quicksort)_
|
| I'm sad that it's not "pretty darn quicksort"
| sanxiyn wrote:
| It probably is that, just not officially. You know,
| officially, BFR stands for Big Falcon Rocket.
| anonymoushn wrote:
| For hit-heavy workloads, I have been unable to get Swiss tables
| to compete with tables that hit only ~1 cache line per query.
| hinkley wrote:
| I recall years ago someone suggesting that a 1:2 split was
| faster, and I'm wondering if they were seeing the same thing
| and either didn't follow it through to a conclusion, or the
| goalposts keep moving with each generation of processors.
| sanxiyn wrote:
| It is plausible that they were seeing the same thing.
| xiaodai wrote:
| Cache oblivious algorithms and lock free data structures
| Twisol wrote:
| E-graphs are pretty awesome, and worth keeping in your back
| pocket. They're like union-find structures, except they also
| maintain congruence relations (i.e. if `x` and `y` are in the
| same set, then `f(x)` and `f(y)` must likewise be in the same
| set).
|
| https://egraphs-good.github.io/
|
| (Incidentally, union-find structures are also great to know
| about. But they're not exactly "new".)
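| For reference, a path-compressing union-find is only a dozen
| lines of Python (the congruence part is what an e-graph layers
| on top: whenever x and y are merged, anything of the form f(x)
| and f(y) gets merged too):
|
|     class UnionFind:
|         def __init__(self):
|             self.parent = {}
|
|         def find(self, x):
|             self.parent.setdefault(x, x)
|             while self.parent[x] != x:
|                 self.parent[x] = self.parent[self.parent[x]]  # path halving
|                 x = self.parent[x]
|             return x
|
|         def union(self, a, b):
|             ra, rb = self.find(a), self.find(b)
|             if ra != rb:
|                 self.parent[ra] = rb
|
|     uf = UnionFind()
|     uf.union("x", "y")
|     assert uf.find("x") == uf.find("y")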
| giogadi wrote:
| IMO the hip new thing in algorithms and data structures is to
| Just Use Arrays For Everything
| dymk wrote:
| Surely it's the interesting things built on top of the arrays?
| yamazakiwi wrote:
| Are you talking about Vector DB's?
| masklinn wrote:
| Also ECS / columnar data?
| bckr wrote:
| Got a source?
| CountHackulus wrote:
| I'm not OP, but I'm guessing he could be referring to
| something like this talk?
| https://www.youtube.com/watch?v=LrVi9LHP8Bk
|
| I'm honestly not sure otherwise.
| eternalban wrote:
| When skirt lengths change and we observe no measurable change
| in human anatomy, that's fashion and to follow fashion is
| "hip".
|
| When the computing platform changes and you have to start
| paying attention to L-level caches and counting your bytes (so
| you can do your big compute in memory), and use n cores, and
| then arrays are adopted, that is not fashion; so not "hip",
| rather _modern_.
| transformi wrote:
| Follow-up, anything in graph-theory?
| kzrdude wrote:
| I'm not exactly up to date, but the last few years have seen
| practical improvements to graph isomorphism algorithms.
| bmitc wrote:
| I'm interested in knowing about graph layout algorithms that
| allow specific constraints. If anyone knows about these things
| please let me know. I do not, but I have a use case for them.
|
| For example, the constraints may be:
|
| 1) The graph is read left to right. This informs the layout such
| that those spiral and circular graph layout algorithms that blow
| out the graph do not work.
|
| 2) Try to keep certain types of nodes aligned with each other, so
| that they lie on the same horizontal line.
|
| 3) Minimize edge crosses.
|
| 4) Have the edges be not lines but paths made up of horizontal
| and vertical lines. This probably requires some path finding as
| well.
|
| 5) Be able to live update in the sense that a new node or edge
| entry into the graph does not drastically re-layout the graph for
| reasonable inserts.
| alejohausner wrote:
| Look at graphviz. Their "dot" program satisfies 1 through 4.
| Your fifth wish is hard to satisfy.
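| A minimal sketch with the Python graphviz bindings (node names
| and attributes here are just an illustration; needs the
| `graphviz` package plus a Graphviz install):
|
|     import graphviz
|
|     g = graphviz.Digraph(graph_attr={
|         "rankdir": "LR",     # (1) read left to right
|         "splines": "ortho",  # (4) edges as horizontal/vertical segments
|     })
|     for n in ["a", "b", "c", "d"]:
|         g.node(n)
|     g.edges([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
|
|     # (2) pin b and c to the same rank so they stay aligned
|     with g.subgraph() as s:
|         s.attr(rank="same")
|         s.node("b")
|         s.node("c")
|
|     g.render("layout", format="svg")   # dot also minimises crossings (3)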
___________________________________________________________________
(page generated 2023-05-10 23:00 UTC)