[HN Gopher] Ask HN: What is new in algorithms and data structure...
___________________________________________________________________
Ask HN: What is new in algorithms and data structures these days?
Algs/DS were my first love in CS. Nowadays, all we hear about is
AI/ML. There must be hardware/software improvements coming from or
necessitating fundamental Algs/DS research. Care to share any of
the favorite recent results? Are there big fields still making
gains behind all this computing surge?
Author : jvanderbot
Score : 378 points
Date : 2023-05-10 13:16 UTC (9 hours ago)
| zelphirkalt wrote:
| There is ongoing research into purely functional data structures,
| and many implementations still to be written for many functional
| languages to make use of PFDS. Whenever one sees a traditional
| data structure, one can ask oneself what a persistent functional
| one might look like, or whether a different solution would be
| pursued in FP languages. Traditional data structures are often a
| matter of transcribing from an example language into a target
| language, with various language-specific changes for
| optimization. Finding a good way to implement things without
| mutation or side effects is a challenge.
| revskill wrote:
| Searching.
| chc4 wrote:
| In compilers, there's been a recent uptick in industry research
| into, and adoption of, equivalence classes and graphs (egraphs)
| for doing optimization passes in a way that preserves information
| and solves the phase ordering problem. Egg[0] was one of the big
| libraries that came out of it, but it can only handle term
| rewriting without guards, and so can't express y/x*x -> y (which
| is unsound when x=0), or optimization passes that span control
| flow points where some of the eclasses are only valid in some
| locations. The Cranelift developers adapted it into a
| construction they call aegraphs[1] that handles this, and are
| migrating Cranelift to use these passes for all optimizations,
| which is (hopefully) faster and achieves more optimization
| opportunities; Chris Fallin is presenting their aegraph work
| at PLDI this year.
|
| 0: https://egraphs-good.github.io/
|
| 1:
| https://github.com/cfallin/rfcs/blob/main/accepted/cranelift...
| tester756 wrote:
| thanks, this is cool!
| aseipp wrote:
| The recent work on relational, datalog-inspired egraphs in PLDI
| this year ("Unifying Datalog and Equality Saturation") is
| actually interesting because it can solve cases like the y/x*x
| -> y identity example, by the power of an interval analysis on
| x (among other things.) Sort of like adding a postulate to a
| rewrite term and then proving it. But instead it's by adding
| relations between terms in the graph. See section 6.2 of the
| paper.
|
| https://github.com/egraphs-good/egglog
|
| https://arxiv.org/pdf/2304.04332.pdf
| chc4 wrote:
| The other recent research that's been interesting to watch has
| been in logic programming (Prolog and Datalog), where there have
| been a lot of papers extending Datalog with semilattice types
| that are more efficient for computing dataflow-type queries
| (and even getting lattice types added to Souffle recently),
| including papers like datafun[0] extending it to sound
| incremental queries for more efficient execution - and there's
| even a paper by the Egg authors using it for efficient eclass-
| based Datalog queries using lattices[1]. It also seems like
| there has been more work recently on actually integrating logic
| programming into existing languages and programs, like ascent[2],
| instead of it being off in its own world, as has historically
| been the case.
|
| 0: http://www.rntz.net/files/seminaive-datafun.pdf
|
| 1: https://www.mwillsey.com/papers/egglog
|
| 2: https://s-arash.github.io/ascent/cc22main-p95-seamless-
| deduc...
| ggleason wrote:
| Yeah, it's really unfortunate that these amazing advances in
| datalog don't make it into general purpose languages.
| jorkadeen wrote:
| You might be interested in Flix which has first-class
| Datalog program values:
|
| https://flix.dev/
|
| https://doc.flix.dev/fixpoints.html
|
| (I am one of the developers of Flix)
| thaliaarchi wrote:
| How does Datalog with e-graphs compare to Datalog with BDDs
| in terms of what analyses can be performed? bddbddb is a
| Datalog engine using BDDs that, according to their 2005
| paper[0], can solve problems like context-sensitive pointer
| analysis for large programs; analyses implemented with it are
| faster than hand-tuned implementations, in dramatically fewer
| lines of code. I'm not knowledgeable on what modern
| Datalog engines are based on.
|
| 0: https://people.csail.mit.edu/mcarbin/papers/aplas05.pdf
| chc4 wrote:
| I'm under the impression that BDDs are good for more
| efficient representation of row membership for values, so
| that you can query and saturate more facts faster, but
| eclasses allow for writing rules that "automatically" have
| transitive connectivity of two rules. If you have X -> Y
| and Y -> Z, you only need to store one of X or Y :- Z since
| X and Y have been unioned together and may be treated
| identically, along with a same(X, Z) rule firing instead of
| having to have additional transitive same(X, same(Y, Z))
| rules. I don't profess to be an expert in BDDs, eqsat, or
| Datalog, though! I have one friend that did a thesis on
| eqsat who wasn't that big of a fan of bddbddb but I don't
| remember exactly why.
| aaron695 wrote:
| [dead]
| mad wrote:
| The Cuckoo Trie, an ordered index data structure explicitly
| designed to have memory-level parallelism, which out-of-order
| processors can exploit to execute DRAM accesses in parallel:
|
| https://arxiv.org/pdf/2201.09331.pdf
| barbarr wrote:
| I'm not in the field but it seems that SODA (ACM/SIAM Symposium
| on Discrete Algorithms) is a top conference in DSA and might be a
| good place to check. Here's the abstract book for 2022: [1].
|
| [1]
| https://www.siam.org/Portals/0/Conferences/About_SIAM_Confer...
| vitorsr wrote:
| - Streaming algorithms [1]
|
| - Eventual consistency [2]
|
| - Refinement types [3], [4]
|
| [1] https://www2.cs.uh.edu/~ceick/6340/Streams.pdf
|
| [2] https://www.microsoft.com/en-us/research/wp-
| content/uploads/...
|
| [3] https://ranjitjhala.github.io/static/trust_but_verify.pdf
|
| [4]
| https://ranjitjhala.github.io/static/refinement_types_for_ty...
| NeuroCoder wrote:
| Refinement types are pretty interesting and I've seen mention
| of them recently. I'm curious if there are any practical
| reasons we don't see them implemented in more languages.
| klibertp wrote:
| > Refinement types are pretty interesting
|
| Indeed. To me, they look like a formalized subset of
| precondition contracts that can be checked statically. I
| don't think the idea is new, I seem to recall Ada and Eiffel
| having something like this, but I first learned about them
| via Racket[1]. I still don't know how they work or what's
| needed to implement them.
|
| Interestingly, Raku has a similar concept called `subset`, eg.
|
|     subset OneToTen of Int where * > 0 & * < 10;
|
| but from what I know no static analysis is done on the
| refinement/"where" part, it's a runtime assertion only.
|
| [1] https://blog.racket-lang.org/2017/11/adding-refinement-
| types...
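| To make the runtime-only flavour concrete, here is a minimal
| Python sketch of a `subset`-style refinement (the `refine` helper
| and `OneToTen` name are just illustrative, not from any real
| library; a true refinement type system would discharge the check
| statically):
|
|     def refine(base, predicate):
|         # Return a constructor that accepts a value only if it has
|         # the base type and passes the predicate -- a runtime
|         # assertion, like Raku's `where` clause, not a static proof.
|         def construct(value):
|             if not isinstance(value, base) or not predicate(value):
|                 raise ValueError(f"{value!r} fails the refinement")
|             return value
|         return construct
|
|     OneToTen = refine(int, lambda n: 1 <= n <= 10)
|
|     OneToTen(7)     # ok
|     # OneToTen(42)  # raises ValueError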
| sireat wrote:
| For Maximum Flow (and Minimum Cut) through a flow network we are
| getting close to linear time:
|
| 2022 Paper https://arxiv.org/abs/2203.00671
| cjalmeida wrote:
| It's not particularly new, but recently I've been involved a lot
| with solving large combinatorial optimization problems
|
| https://en.m.wikipedia.org/wiki/Combinatorial_optimization
|
| Algos to solve those, both exact methods and heuristics, are
| super useful in real world applications, and very underappreciated
| by the HN crowd IMO.
| JZL003 wrote:
| Yes! Me too, there's a whole world out there and it's codified
| in bullet-proof programs like CPLEX or gurobi. Sadly they are
| expensive, and it's not obvious whether a given problem's size and
| complexity will lend itself to some of their heuristics.
|
| Have you been doing it by hand or using a big program?
| cjalmeida wrote:
| At $DayJob, we deal with OR problems for large companies and
| typically, when possible, we leverage Gurobi to solve Mixed
| Integer Linear Programming (MIP) formulations. However it's
| not unusual for the problems to be huge to the point even
| Gurobi can't solve them in a reasonable time. And of course,
| there's the case where the problem can't be formulated (or
| approximated) as a linear program at all.
|
| So, in a couple of projects I was involved in, we had to resort
| to manually constructed heuristics such as Simulated Annealing
| (SA) and Large Neighborhood Search (LNS).
|
| A very cool large-scale approach we resorted to in some problems
| (eg. airline scheduling) was Column Generation[1]. Not only did
| it solve in minutes what took hours in the MIP implementation,
| but unlike SA and LNS it's an _exact_ method that provides valid
| bounds on global optimality.
|
| [1]: https://en.wikipedia.org/wiki/Column_generation
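| Re the heuristics mentioned above: a bare-bones simulated
| annealing loop is only a dozen lines. This Python sketch uses a
| toy objective and cooling schedule purely for illustration; real
| OR problems live or die by the domain-specific neighbourhood
| moves:
|
|     import math, random
|
|     def anneal(initial, neighbour, cost, t0=1.0, cooling=0.995,
|                steps=10_000):
|         current = best = initial
|         t = t0
|         for _ in range(steps):
|             candidate = neighbour(current)
|             delta = cost(candidate) - cost(current)
|             # always accept improvements; accept worsenings with
|             # Boltzmann probability exp(-delta/t)
|             if delta <= 0 or random.random() < math.exp(-delta / t):
|                 current = candidate
|             if cost(current) < cost(best):
|                 best = current
|             t *= cooling
|         return best
|
|     # toy usage: minimise (x - 3)^2 over the integers
|     print(anneal(0, lambda x: x + random.choice([-1, 1]),
|                  lambda x: (x - 3) ** 2))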
| kherud wrote:
| Answer Set Programming is an incredibly powerful tool to
| declaratively solve combinatorial problems. Clingo is one of
| the best open source implementations in my opinion:
| https://github.com/potassco/clingo
|
| - Based on conflict driven nogood learning, it has incredible
| performance (including good multithreading)
|
| - It has different optimization and enumeration modes
|
| - It has strong built-in heuristics, but also allows manual
| heuristics
|
| - It allows incremental solving
|
| - It has a nice API to integrate other solving paradigms (like
| difference logic and constraint programming, for which
| implementations already exist)
|
| - There is a huge ecosystem around it
|
| And much more, go check it out!
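| For a feel of what driving it from Python looks like, here is a
| tiny graph-colouring example (assumes the `clingo` Python package
| is installed; the encoding itself is just illustrative):
|
|     import clingo
|
|     program = """
|     node(1..4). edge(1,2). edge(2,3). edge(3,4). edge(4,1).
|     colour(red; green; blue).
|     % pick exactly one colour per node
|     1 { assign(N,C) : colour(C) } 1 :- node(N).
|     % adjacent nodes must get different colours
|     :- edge(N,M), assign(N,C), assign(M,C).
|     """
|
|     ctl = clingo.Control(["0"])      # "0" = enumerate all answer sets
|     ctl.add("base", [], program)
|     ctl.ground([("base", [])])
|     ctl.solve(on_model=print)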
| bjornsing wrote:
| Any advice on how to get to work on this for a living?
| mtsolitary wrote:
| I think it depends a lot on the domain you are in. An example
| of some domains with lots of discrete opt problems:
| transport, scheduling, logistics. And then read stuff by
| Stephen Boyd :)
| mtsolitary wrote:
| What makes you think they're underappreciated? I too have been
| spending a lot of time in this space recently and agree it's
| one of the more fertile "maths meets reality" spaces :)
| Karrot_Kream wrote:
| > Algos to solve those both exact methods and heuristics are
| super useful in real world applications, and very
| underappreciated by the HN crowd IMO.
|
| There just aren't many situations where they're useful. Time
| scheduling, logistics, operations, there are only a handful of
| places where these problems are so important that they need to
| be solved optimally or nearly optimally.
| cjalmeida wrote:
| IMO, the main benefit of those exact algos is that they provide
| optimality bounds, allowing you to make an informed decision to
| continue or to stop and provide a non-optimal solution. This is
| valuable in many practical situations.
|
| >Time scheduling, logistics, operations...
|
| Add marketing (pricing, assortment, ...) and finance
| (portfolio, capital budgeting, ...) to those.
|
| And those are _huge_ domains, moving trillions of USD
| globally each year. Many companies (think most of the Global
| 2000) benefit from a couple % improvement on their operations
| that lead to many $$ in savings.
| adeon wrote:
| Are there some interesting examples of real world applications
| of this stuff? Is it like planning problems or supply chain
| optimization etc.?
|
| I've used SAT solvers recreationally to try to solve some
| mathematical combinatorics-like problems, but interesting,
| more down-to-earth applications have eluded me. I would love to
| dig deeper into this stuff.
| currymj wrote:
| Gurobi (maker of a MIP solver) has a bunch of industry case
| studies as marketing materials on their website:
|
| https://www.gurobi.com/case_studies/
|
| It's mixed-integer programming rather than SAT
| solvers/constraint programming, but similar in spirit.
| jvanderbot wrote:
| I used a MILP solver to optimize my ship loadouts in
| Highfleet. It's rugged-looking, but works great.
| https://hfopt.jodavaho.io
| lqr wrote:
| The field of "algorithms with predictions" studies how to use
| predictions/learning within traditional CS problem settings:
|
| > _We introduce algorithms that use predictions from machine
| learning applied to the input to circumvent worst-case analysis.
| We aim for algorithms that have near optimal performance when
| these predictions are good, but recover the prediction-less worst
| case behavior when the predictions have large errors._
|
| An example is using learning to decide which variable to
| discretize next in a branch-and-bound integer linear programming
| solver. The algorithm might be extremely similar to the classic
| version, but the analysis is new.
|
| https://arxiv.org/abs/2006.09123
|
| More broadly, I think there's a lot of work on applying more
| refined analysis to classic algorithms - online settings,
| smoothed analysis, etc.
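| A hedged toy example of the flavour, in Python: a search where a
| (possibly bad) predicted position is recovered from by exponential
| search, so a good predictor gives near-O(1) lookups and a terrible
| one still degrades gracefully to O(log n). The linear "model" here
| is made up for illustration:
|
|     import bisect
|
|     def predicted_index(keys, key):
|         # toy predictor: assume keys are roughly uniform
|         lo, hi = keys[0], keys[-1]
|         if hi == lo:
|             return 0
|         frac = (key - lo) / (hi - lo)
|         return min(len(keys) - 1, max(0, int(frac * (len(keys) - 1))))
|
|     def search_with_prediction(keys, key):
|         guess = predicted_index(keys, key)
|         lo, hi, step = guess, guess + 1, 1
|         # exponential search outwards from the guess to bracket the key
|         while lo > 0 and keys[lo] > key:
|             lo, step = max(0, lo - step), step * 2
|         step = 1
|         while hi < len(keys) and keys[hi - 1] < key:
|             hi, step = min(len(keys), hi + step), step * 2
|         i = bisect.bisect_left(keys, key, lo, hi)
|         return i if i < len(keys) and keys[i] == key else -1
|
|     keys = list(range(0, 3000, 3))
|     assert search_with_prediction(keys, 999) == keys.index(999)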
| sarojmoh1 wrote:
| First Love?! Haha, jk.
|
| I'm currently re-studying them just bc I'm interview prepping. I
| wonder if whiteboarding for DS/Algo is gonna be a thing of the
| past too though.
|
| I would imagine there are certainly new DS/Algos in progress to
| aid ML training efficiency.
| yieldcrv wrote:
| Merkle trees and state tries were not in my formal education
|
| But I've had to learn them and employ them a lot lately; they're
| very good for resource-constrained environments.
| di4na wrote:
| Parsing, but also casting to string efficiently, is still an open
| problem with regular breakthroughs. Ryu or Dragonbox for float-to-
| string conversion come to mind.
|
| Parsing with errors that are useful for the human is definitely
| an open research topic.
|
| Capabilities have seen interesting work, in particular in
| relationships to effect handlers.
|
| HyperLogLog and other data sketches keep seeing incremental
| progress. I have not had time to read it all, but Otmar Ertl has
| a massive reduction in memory use in UltraLogLog that I expect a
| paper on soon. The code is already up.
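| For anyone who hasn't seen it, the core HyperLogLog idea fits in
| a handful of lines of Python. Register count and constants below
| are illustrative, and the small/large-range corrections are
| omitted; sketches like UltraLogLog then pack the registers far
| more tightly:
|
|     import hashlib
|
|     P = 10                    # 2^10 = 1024 registers
|     M = 1 << P
|     registers = [0] * M
|
|     def add(item: str) -> None:
|         h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
|         idx = h & (M - 1)                 # low P bits pick a register
|         rest = h >> P                     # remaining 64 - P bits
|         rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
|         registers[idx] = max(registers[idx], rank)
|
|     def estimate() -> float:
|         alpha = 0.7213 / (1 + 1.079 / M)  # standard HLL bias constant
|         return alpha * M * M / sum(2.0 ** -r for r in registers)
|
|     for i in range(100_000):
|         add(f"user-{i}")
|     print(round(estimate()))              # ~100k, within a few percent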
| hinkley wrote:
| I've been poking at prometheus internals lately and I am
| starting to wonder if histograms need some sort of hyperloglog
| treatment to improve on buckets. They don't seem like a very
| accurate solution.
| di4na wrote:
| You may appreciate ddsketch
| https://www.datadoghq.com/blog/engineering/computing-
| accurat...
|
| Or tdigest. For the prom use case, i usually prefer ddsketch.
| Easier to integrate. Tdigest is really complicated.
|
| That said, there is definitely interesting work that needs to
| happen on compressing time series histograms for heatmaps.
| IIRC Influx had some interesting work there.
| epberry wrote:
| Unfortunately for you my favorite recent algorithms result uses
| AI as part of improving a very traditional algorithm. The paper
| is "Learned Index Structures"
| https://blog.acolyer.org/2018/01/08/the-case-for-learned-ind...
| morkpek wrote:
| This is a wonderful paper. In a similar vein, "Algorithms with
| Predictions" (https://arxiv.org/pdf/2006.09123.pdf) presents a
| bunch of fun algorithmic problems related to bringing AI into
| traditional algorithms.
| inciampati wrote:
| Yes. There is ongoing work in making succinct full text indexes
| that use substantially less space than the text they model. See
| work on the r-index: https://doi.org/10.1089/cmb.2019.0309
| shae wrote:
| I'd suggest looking up succinct data structures, cache oblivious
| data structures, and probabilistic data structures.
| mamcx wrote:
| Anything specific that could cover what "NDArray-like"
| structures are? Concretely, to make row-oriented/column-
| oriented hybrids?
|
| (I wish to find a way to encode the equivalent of `struct
| Data { col1: Vec<..>, ... }`)
| LukeEF wrote:
| Succinct data structures ftw.
| LukeEF wrote:
| How about some succinct data structures and delta encoding for
| modern databases [1]. Succinct data structures are a family of
| data structures which are close in size to the information
| theoretic minimum representation (while still being queryable).
|
| [1]
| https://github.com/terminusdb/terminusdb/blob/dev/docs/white...
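| A tiny Python sketch of the flavour (block size and layout here
| are for readability, not the information-theoretic optimum; real
| implementations add a second level of counts plus popcount so
| rank is O(1)):
|
|     BLOCK = 64
|
|     class RankBitvector:
|         def __init__(self, bits):
|             self.bits = bits
|             self.block_ranks = []       # number of 1s before each block
|             ones = 0
|             for i, b in enumerate(bits):
|                 if i % BLOCK == 0:
|                     self.block_ranks.append(ones)
|                 ones += b
|
|         def rank1(self, i):
|             """Count of 1s in bits[0:i]."""
|             block = min(i // BLOCK, len(self.block_ranks) - 1)
|             return self.block_ranks[block] + sum(self.bits[block * BLOCK:i])
|
|     bv = RankBitvector([1, 0, 1, 1, 0] * 100)
|     assert bv.rank1(10) == 6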
| TallGuyShort wrote:
| Algs/DS are still pretty fundamental to AI. Hyperparameter
| optimization algorithms like SHA/ASHA, stochastic gradient
| descent, PBT, etc. Structures like transformers, the various
| topologies, convolutions, etc. Not tickling the same itches?
| abeppu wrote:
| For numerical algorithms, probabilistic numerics is an
| interesting research corner. It both provides a new perspective
| on existing algorithms, and provides a framework for tailoring
| new ones to specific problems.
|
| https://www.probabilistic-numerics.org/
| softwaredoug wrote:
| Nobody has mentioned Approximate Nearest Neighbor search (aka
| vector search), which IMO is a fundamental data structures
| advancement.
|
| Basically: given a set of a million (billion, trillion...)
| ~1000-dimensional vectors, and a query ~1000-dimensional vector,
| find the closest vector in the indexed set. This is "nearest
| neighbor" search, and there have been increasingly sophisticated
| approaches:
|
| http://ann-benchmarks.com
|
| https://haystackconf.com/us2022/talk-6/
|
| And has spawned companies like Weaviate, Pinecone, etc etc (a
| half a dozen others)
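| For a sense of the problem being solved, the naive version is a
| brute-force scan (numpy assumed; sizes are illustrative). ANN
| structures like HNSW, IVF and DiskANN exist precisely because
| this stops being viable at billions of vectors:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     index = rng.standard_normal((100_000, 128), dtype=np.float32)
|     index /= np.linalg.norm(index, axis=1, keepdims=True)
|
|     def nearest(query, k=5):
|         q = query / np.linalg.norm(query)
|         scores = index @ q                      # cosine similarity
|         return np.argpartition(-scores, k)[:k]  # k best (unordered)
|
|     print(nearest(rng.standard_normal(128, dtype=np.float32)))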
| marginalia_nu wrote:
| What sort of dataset are you indexing that has a trillion
| entries?
|
| That's 100,000 English Wikipedias or 50 Googles.
| maerF0x0 wrote:
| There are about 14B trades per year on the NYSE, which I'm sure
| could represent 10x that in entities (buyer, seller, broker,
| etc.) and could easily hit 1000x that in log lines. The number of
| shares traded per day is in the billions, so you'd hit 1T if each
| share were represented uniquely.
| marginalia_nu wrote:
| You don't typically use vector search for trade data
| though. It's already ridiculously well structured. Assets
| have identifiers, parties and counterparties have IDs, etc.
| I'm not sure what nearest neighbors in a vector space would
| add.
| l33t233372 wrote:
| Nonetheless it's an example you asked for of a dataset
| with over a trillion entries.
| marginalia_nu wrote:
| I asked which dataset they were indexing that was of this
| size, not whether any such dataset exists in other
| domains.
| yes_man wrote:
| Dumb example, but still an example from the practical world. Your
| body (assuming you are a human) has trillions of cells. Each cell
| is way more complicated than what a 1000-dimensional vector can
| represent, but maybe it could be a loose estimate of some
| properties of each cell. Now the algorithm could be about
| finding the most similar cell. Could be useful for finding
| e.g. other cancer cells based on properties of one cancer
| cell.
|
| Not that this is a practical example, because we technically
| cannot index all the cells in a body. But such an algorithm
| being studied today might be useful one day when we do have the
| capability to collect such data.
| marginalia_nu wrote:
| Where are you going to store this data? It's dozens of
| petabytes.
| tourist2d wrote:
| Where do you think computers store data?
| yes_man wrote:
| Should theoretical research of data structures and
| algorithms have been capped at 1GB in 1980 because that
| was the biggest single hard drive available back then and
| you couldn't store for example a 2GB dataset on a disk?
| marginalia_nu wrote:
| Not at all, I'll still call out fantastic claims when I
| see them though.
| karpierz wrote:
| Google has definitely indexed over a trillion pages.
| marginalia_nu wrote:
| Do you have any sources for this claim?
|
| As far as I am aware Google doesn't publish any
| statistics about the size of its index, which no doubt
| varies.
| sophacles wrote:
| It's only a few racks worth of disk servers.
|
| If I were building it, from my 5 minutes of googling, using
| 15TB NVMe U.2 drives and easily available server chassis,
| I can get 24 drives per 2u of a rack. That's 360 TB + a
| couple server nodes. So ~6u per PB. A full height rack is
| 42u, so 6-7PB per rack once you take up some of the space
| with networking, etc. So dozens is doable in a short
| datacenter row.
|
| Realistically you could fit a lot more storage per U,
| depending on how much compute you need per unit of data.
| The example above assumes all the disks are at the front
| of the server only, if you mount them internally also,
| you can fit a lot more. (see Backblaze's storage pods for
| how they did it with spinning disks).
|
| Dozens of PB is not that much data in 2023.
| de_nied wrote:
| One application is LLMs. I've seen a project use Pinecone to
| enable "infinite context" for GPT-3.
| marginalia_nu wrote:
| This is still several orders of magnitude more items than
| the entire training corpus for all GPT models combined. I
| guess if you were to index individual codepoints in the
| training corpus, we'd start to see those volumes.
| fzliu wrote:
| Most commonly used ANN algorithms today are clever
| optimizations atop "old" data structures. DiskANN is a recent
| favorite: https://milvus.io/blog/2021-09-24-diskann.md
| kokanee wrote:
| "Most commonly used X today are clever optimizations atop old
| Y" is pretty much the story of technology, isn't it?
| RoyGBivCap wrote:
| It's lego all the way down to the clock cycle.
| curiousgibbon wrote:
| Negative weight single source shortest paths in near-linear time:
| https://arxiv.org/abs/2203.03456
|
| Obligatory Quanta link: https://www.quantamagazine.org/finally-a-
| fast-algorithm-for-...
| ww520 wrote:
| Not sure it's still considered new, as it has been out for a
| couple of years, but the adaptive radix tree and its variants are
| pretty promising.
| samwillis wrote:
| Anything CRDT (conflict free replicated datatypes:
| https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...)
| related is fun to read up on and play with.
|
| Papers and references (page maintained by a central academic in
| the world of CRDTs): https://crdt.tech
|
| Group doing research into how they can be used to build
| interesting collaborative (and async) applications:
| https://www.inkandswitch.com
|
| A few of the major open source implementations - mostly for rich
| text editing or JSON like data structures:
|
| - Yjs: https://github.com/yjs/yjs
|
| - Automerge: https://github.com/automerge/automerge
|
| - Peritext: https://www.inkandswitch.com/peritext/
|
| - Diamond types: https://github.com/josephg/diamond-types
|
| People building eventually consistent database syncing with them:
|
| - https://electric-sql.com (Postgres <-> SQLite)
|
| - https://vlcn.io (SQLite <-> SQLite)
|
| Open source collaborative servers (coordination, persistence,
| presence):
|
| - https://github.com/ueberdosis/hocuspocus
|
| - https://github.com/partykit/partykit
|
| - https://github.com/firesync-org/firesync
| beck5 wrote:
| Thanks for mentioning FireSync, we haven't even officially
| launched yet! After 10+ years of building real-time
| collaborative apps based on Operational Transformation
| (ShareLaTeX -> Overleaf.com) we have become quietly very
| excited about CRDTs and Yjs. Now we are focused on building a
| scalable Yjs-compliant backend on top of PG with FireSync.
|
| If anyone has thoughts about this space, feature requests,
| would like a preview of what we are building, or anything else,
| please do reach out directly to me at henry@firesync.live. I'm
| talking to as many people as possible at the moment.
| josephg wrote:
| Diamond types author here. I'm working on another blog post - I
| had a big breakthrough getting collaborative text editing many
| times faster again using some new ideas. I'm in the process of
| writing up our new work now.
|
| And there's another hard problem I'm stuck on: so, CRDTs like
| mine and Yjs use (random agent id, sequence number) as unique
| IDs for operations. But a troublesome user might reuse an
| operation ID, sending different operations to different peers
| and giving them the same (agent, seq) pair. This results in
| really awful, weird bugs. Martin Kleppman wrote a paper about
| BFT in CRDTs. His suggestion is we do what git does and use SHA
| hashes for IDs instead.
|
| But there's two problems:
|
| 1. We don't want to store a hash for every keystroke a user
| makes. That would get big, fast. We want a DAG of hashes like
| git, except a hash is generated with each keystroke. How do we
| build an efficient system that can query by hash, without
| storing all of the hashes?
|
| 2. I want users to be able to permanently delete inserted text.
| If I accidentally paste an AWS key into a document, I don't
| want that key in the document history forever. How do I enable
| that while still getting all the benefits of the hashing system
| above? (Note if your hash includes the typed character, even if
| you delete the character itself from disk, you can trivially
| brute force the hash to recover any typed characters).
|
| Each of these problems in isolation has a solution. But the
| real prize is how do you solve both at the same time? I think
| this problem has a solution but I don't think anyone has solved
| it yet. Want to name an algorithm and get my eternal respect?
| Here's your chance.
| 020921 wrote:
| I know you said you didn't like the idea of hashes, so this
| might be a non-starter for your use case.
|
| But it sounds like a V5 Guid [0] might meet some of the needs
| for your "operation Ids". The existing "random agent id", and
| "sequence number" could be used as part of the namespace /
| salt.
|
| [0] https://www.sohamkamani.com/uuid-versions-
| explained/#v5-non-...
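| For what it's worth, Python's stdlib makes the idea easy to
| sketch (the namespace string and field layout below are invented
| for illustration):
|
|     import uuid
|
|     AGENT_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "example-crdt-agents")
|
|     def op_id(agent_id: str, seq: int) -> uuid.UUID:
|         # deterministic: the same (agent, seq) always maps to the
|         # same ID, but being a hash it no longer compresses like a
|         # plain counter
|         return uuid.uuid5(AGENT_NS, f"{agent_id}:{seq}")
|
|     print(op_id("seph", 42))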
| josephg wrote:
| I don't have a problem with hashes generally. Correct me if
| I'm wrong, but this scheme sounds like hashing with extra
| steps?
|
| I agree this could be done. But diamond types (my library)
| considers every keystroke to be a new change. If we store a
| UUID for every keystroke, that's very inefficient.
| texuf wrote:
| Troublesome user? As in a malicious user? Are you trying to
| lock down your attack surface against script kiddies or is
| there another instance where this situation comes up?
| josephg wrote:
| Yes, a malicious user. I've also seen buggy software try to
| reuse IDs - which causes similar mayhem. It would be better
| if we could remove this entire class of problems.
| lcnPylGDnU4H9OF wrote:
| It could mean that, as in "someone who causes trouble" or
| it could be a result of a bug which only affects a handful
| of users who suddenly "have some troubles".
| KineticLensman wrote:
| > or is there another instance where this situation comes
| up?
|
| Yes, if you are a Byzantine general trying to coordinate an
| attack, and you aren't sure of your peers' true allegiance.
| some_furry wrote:
| Problem 2 can be solved by "cryptographic erasure"
| techniques.
|
| 1. Encrypt the data with an ephemeral key.
|
| 2. Upon delete, just wipe the key. The data becomes
| unrecoverable.
|
| I'm not sure about problem 1. A fast seekable stream cipher
| like ChaCha might help here (don't use e.g. RC4 for this),
| but a faster hash like BLAKE3 might be more appropriate. I'm
| happy to noodle over this some more. These are just my
| unfiltered thoughts.
|
| If you combined the two: Each entry in the graph could have a
| public salt used for key derivation for the actual data. This
| salt would not be part of the DAG, but stored outside of it.
| To delete an entry, zero-out the salt and the underlying key
| becomes unrecoverable, which makes the data unreadable,
| without needing to change the history of the DAG.
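| A minimal Python sketch of the crypto-erasure idea (assumes the
| `cryptography` package; key storage and the DAG wiring are out of
| scope here):
|
|     from cryptography.fernet import Fernet
|
|     key = Fernet.generate_key()           # per-entry ephemeral key
|     token = Fernet(key).encrypt(b"AKIA... accidentally pasted secret")
|
|     # normal reads decrypt with the stored key
|     assert Fernet(key).decrypt(token).startswith(b"AKIA")
|
|     # "delete" = destroy the key; the ciphertext can stay in the
|     # history but is now unrecoverable
|     key = None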
| josephg wrote:
| Yes. But diamond types stores a new entry in the DAG for
| every keystroke each user makes. The current scheme of IDs
| being (agent, sequence number) allows subsequent operations
| by the same user to be trivially run-length encoded. Eg if
| I make 1000 changes in order, we can encode that as
| ("seph",1..1000). It takes much less than 1 byte of
| overhead on disk per character typed.
|
| Salts per character + cryptographic erasure would solve
| problem 2. But how do we do it without a massive increase
| in storage, memory use and network bandwidth? I don't want
| to generate and store a cryptographically unique salt per
| keystroke. (Even though it might run fine on modern
| hardware).
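| For reference, a rough Python sketch of why the (agent, seq)
| scheme compresses so well (field names are illustrative, not
| diamond types' actual encoding): consecutive ops by one agent
| collapse into a single (agent, start, length) run:
|
|     def run_length_encode(op_ids):
|         """op_ids: iterable of (agent, seq) pairs in log order."""
|         runs = []
|         for agent, seq in op_ids:
|             last = runs[-1] if runs else None
|             if last and last[0] == agent and last[1] + last[2] == seq:
|                 last[2] += 1                     # extends the current run
|             else:
|                 runs.append([agent, seq, 1])     # (agent, start_seq, len)
|         return runs
|
|     ops = [("seph", i) for i in range(1, 1001)] + [("alice", 1)]
|     print(run_length_encode(ops))  # [['seph', 1, 1000], ['alice', 1, 1]]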
| benlivengood wrote:
| 1. Can you merge inserts of characters into inserts of
| strings reliably given both the original inserts and the
| current document history? E.g. if the closed set of history
| that only inserts in the middle of the subset of inserts that
| you want to turn into a string those inserts can be
| permanently merged into a string insert.
|
| 2. String merging should also allow null merges, e.g. identify
| the same closed set of history inserts that refer to
| only the inside of the original inserts, and so permanently
| deleting history of characters in the document could be
| replaced with a null insert that can still be referred to by
| other inserts/merges.
| flir wrote:
| Would (1) hit problems because you may store them coalesced
| eventually, but you sent them to the other clients as
| individual events, where they may get interleaved with
| events from that client?
|
| I think it might mess with fine-grained undo, also?
|
| (idk much about CRDTs)
|
| "Reject re-used sequence numbers" might be simpler?
| debatem1 wrote:
| Sounds like inverted hash chains. Generate a hash chain h_i =
| sha(h_(i-1)) for a random h_0 up to h_N. Each message m_i
| becomes (agent_id, state, hmac(h_(N-i+1), state), h_(N-i)).
| Upon receipt of message m_(i+1) the receiver can verify the
| hmac of the preceding state and that the key is the preimage
| of the earlier message. By repeated hashing any gaps in the
| sequence can be fast-forwarded through and still verified.
| This provides for deletion without a loss of verifiability up
| to length N.
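| A literal Python transcription of that scheme, just to make the
| moving parts concrete (stdlib sha256/hmac, N fixed up front; this
| is a sketch of the construction, not a vetted protocol):
|
|     import hashlib, hmac, os
|
|     def sha(b: bytes) -> bytes:
|         return hashlib.sha256(b).digest()
|
|     N = 1000
|     chain = [os.urandom(32)]               # h_0
|     for _ in range(N):
|         chain.append(sha(chain[-1]))        # h_i = sha(h_(i-1))
|
|     def make_message(i: int, agent: str, state: bytes):
|         tag = hmac.new(chain[N - i + 1], state, hashlib.sha256).digest()
|         return (agent, state, tag, chain[N - i])   # reveals h_(N-i)
|
|     def verify(prev_msg, msg):
|         _, prev_state, prev_tag, prev_revealed = prev_msg
|         revealed = msg[3]
|         # chain continuity: new reveal is the preimage of the last one
|         if sha(revealed) != prev_revealed:
|             return False
|         # the previous message's hmac key is one hash further up the chain
|         key = sha(prev_revealed)
|         expected = hmac.new(key, prev_state, hashlib.sha256).digest()
|         return hmac.compare_digest(expected, prev_tag)
|
|     m1 = make_message(1, "agent", b"insert 'h' at 0")
|     m2 = make_message(2, "agent", b"insert 'i' at 1")
|     assert verify(m1, m2)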
| joelg wrote:
| it's definitely not an exact solution, but i've been thinking
| about ways prolly trees could fit into CRDT systems. they
| essentially give you efficient peer-to-peer entry-wise diffs,
| so could be good for identifying conflicts after-the-fact.
| but if you store operations in a prolly tree, you end up
| hashing them all (plus logarithmic additional space
| overhead), so it might be a non-starter.
|
| i have a blog post to shill if you haven't heard of them:
| https://joelgustafson.com/posts/2023-05-04/merklizing-the-
| ke...
| vivegi wrote:
| > 1. We don't want to store a hash for every keystroke a user
| makes.
|
| So, why does it have to be every keystroke? We can batch them
| locally. An approximate rule of thumb to use to guide the
| batching is typing speed. Using 200 keystrokes/minute, we get
| an average of 3.33 keystrokes/second. If we use 3 seconds
| latency for the collaborative editing to _resolve_ concurrent
| edits, we get 10 keystrokes per batch (3 secs). You can also
| have an override where smaller batches are replicated to
| nodes if sufficient time has elapsed since last local edit
| (eg: 10 secs have elapsed and we have a partial batch of 3
| keystrokes only).
|
| > 2. If I accidentally paste an AWS key into a document, I
| don't want that key in the document history forever. How do I
| enable that while still getting all the benefits of the
| hashing system above?
|
| In order for this to work, we should assume that every
| replica's current state is always obtained by starting with
| the empty document and then playing the operations log. We
| should also permit the author of an operation to _rollback_
| the operation. So, I suppose if the collaborative editor
| shows an interface of all operations that originated locally
| and the author can select the operations that inserted the
| AWS key into the replica, they can issue a _rollback
| operations_ op and have it propagated to all replicas.
|
| Since the point-in-time state of the replicas are regenerated
| by replaying the operations log, we would have deleted the
| accidentally inserted key.
|
| In order to address the issue of the AWS key still being in
| the operations log, we can define the semantics of the
| _rollback op_ to also include secure erasure of the previous
| operation from the operations log (i.e., the rollback op
| would update the original operation into a no-op).
|
| So, when all replicas apply the _rollback op_ initiated by
| the author, eventually all replicas will converge to having
| the accidentally inserted AWS key removed from the replica
| and the operations log.
| amelius wrote:
| I think keystrokes was just a metaphor for whatever
| operation the application has to deal with.
| vivegi wrote:
| I was responding to the diamond-types CRDT author's
| question in the parent comment. Their github project page
| [1] mentions text editing a lot:
|
| > This repository contains a high performance rust CRDT
| for text editing. This is a special data type which
| supports concurrent editing of lists or strings (text
| documents) by multiple users in a P2P network without
| needing a centralized server.
|
| > This version of diamond types only supports plain text
| editing. Work is underway to add support for other JSON-
| style data types. See the more_types branch for details.
|
| In any case, I agree with the metaphor and batching
| granular operations can always be done.
|
| [1]: https://github.com/josephg/diamond-types
| gigatexal wrote:
| I like your approach to thinking about this. Seems like a
| plausible set of assumptions to make.
| ramses0 wrote:
| 1. Bloom for existence, re-create via brute-force search (?),
| checkpoint hash every $N seconds(?)
|
| 2. Godbolt => "This rendered text came from these portions of
| the DAG/tree, press CTRL-Del to irrecoverably remove them
| from the DAG and/or replace them with 'REDACTED'".
|
| #1 is "simply" related to efficiency. You can lean on
| probabilistic data structures to see "yeah, probably this was
| HERE in the history tree", and then perhaps replay the "N"
| seconds checkpoints with the original input data to validate
| or match exactly where the hash is/was.
|
| #2 is the same problem GIT has, where you need to rewrite /
| filter / "detached head" in order to obliterate something
| from history, and is somewhat a local/remote UI issue. "A
| request to obliterate history is being built in parallel over
| here => click here to adopt it, and diff/transfer only your
| new work to it"
|
| Related to #2 I've had a thought of a social network
| protocol, and how to "delete malicious nudes" or "oops, I
| accidentally published my SSN and Tax Filing *.pdf to my
| friends". On the "real" open internet, google or malicious
| actors would/could store that info forever. However, in a
| semi-closed group (eg: 100-1000 "friends of friends") that
| are "all" running participating/conforming clients, a
| "request-to-delete" or "request-to-disavow" is particularly
| valuable.
|
| My imagined protocol is: "post001.txt" => "$GOOD_SHA",
| "nudes.jpg" => "$WHOOPS_SHA", "taxes.pdf" => "$YIKES_SHA",
| "bait.txt" => "$BAIT_SHA".
|
| "Of course" everything is content-addressable, ie: anyone can
| fetch "post001.txt" from anyone as "$GOOD_SHA", however, "it
| would be nice" if I as a participant in the network could
| query "how_many( $GOOD_SHA );" and see distribution across
| the network participants. "disavow( $WHOOPS_SHA, $YIKES_SHA )
| && how_many( $WHOOPS_SHA, $YIKES_SHA )" => ie: in a "good"
| network, after a "disavow(...)" then nodes should not respond
| with that availability. Occasionally throw in "$BAIT_SHA w/
| disavow(...)" and make sure that no one is propagating it.
|
| Similar to "key revocation lists", you could eliminate/reject
| any clients which do not respect requests of non-propagation
| for $BAIT_SHA or anything that is "disavow(...)'d". Same
| story with anyone hosting
| https://en.wikipedia.org/wiki/Celebrity_sex_tape ... I don't
| really want them in my network, so query for particular
| content, and if they respond, I can nudge/boot/disengage from
| users that are hosting them (same with eg: "I love my AK-47
| and assault rifle" political memes, disengage / dislike /
| spread apart people sharing/hosting/propagating content that
| I don't want).
|
| While you're still left with malicious / non-conforming nodes
| potentially archiving everything for all eternity, there is a
| relatively immediate ability to fracture / exclude (shun)
| nodes that are detected as non-conforming, so "your stuff" is
| still out there, it's just relatively inaccessible, and you
| have to be "careful" to have your client be up-to-date w.r.t.
| revocation actions otherwise you'll be excluded from
| networks.
|
| Meh. Kindof complicated, but those seem to be in the range of
| characteristics that you're talking about as well.
| samwillis wrote:
| If my understanding of byzantine fault tolerance with CRDTs
| and collaborative editing is correct, it's really only a
| problem in systems where you have an untrusted user, someone
| who is going to actively try and corrupt the system?
|
| It seems to me that for the majority of use cases,
| collaborative editing in a work sense, you would normally
| have trust that a user isn't actively hostile. And so I'm not
| convinced that BFT needs to be a high priority goal in most
| of these implementations.
|
| I think a higher priority issue is how to aid developers
| with the use of generic CRDT structures within a system where
| all the rules are not encoded into the CRDTs themselves. So
| for example when using Yjs with Prosemirror, Prosemirror has
| its own schema that defines how a document can be structured.
| However, not all those rules are encoded into the underlying
| Yjs structure; it's possible to have two conflicting edits
| that result in a Yjs document that doesn't comply with the
| Prosemirror schema. The current system throws away any non-
| conforming data when the document is loaded or updated.
| Because of this you need to load the document into a server
| side instance of Prosemirror in order to apply that schema
| when doing a server side merge, or even if you are just
| rendering the static content after "blindly" merging the
| documents.
|
| My point really is that CRDTs are an encoding of user
| _intent_, and when these generic CRDTs are merged you have a
| structure representing merged user intent, not a final
| conforming state. The solution is either domain-specific CRDTs
| that are tailored exactly to your application, or a schema
| system for these generic types that applies rules when merging.
| Prosemirror/Yjs does (mostly) exactly this but I think it
| needs to be solved in a more generic way that can be run on
| any layer of the system.
| josephg wrote:
| Yes - The BFT problem only matters when you have Byzantine
| actors. But I think users deserve and expect the system to
| be reasonably well behaved and predictable in all
| situations. Anything publically writable, for example,
| needs BFT resilience. Or any video game.
|
| As for the prosemirror problem, I assume you're talking
| about weird merges from users putting markdown in a text
| crdt? You're totally right - this is a problem. Text CRDTs
| treat documents as a simple sequence of characters. And
| that confuses a lot of structured formats. For example, if
| two users concurrently bold the same word, the system
| should see that users agree that it should be bolded. But
| if that "bold" intent is translated into "insert double
| asterisks here and here", you end up with 4 asterisks
| before and after the text, and that will confuse markdown
| parsers. The problem is that a text crdt doesn't understand
| markdown. It only understands inserting items, and holding
| something shouldn't be treated as an insert.
|
| JSON editing has similar problems. I've heard of plenty of
| people over the years putting json text into a text crdt,
| only to find that when concurrent edits happen, the json
| grows parse errors. Eg if two users concurrently insert "a"
| and "b" into an empty list. The result is ["a""b"] which
| can't be parsed.
|
| The answer to both of these problems is to use CRDTs which
| understand the shape of your data structure. Eg, use a json
| OT/crdt system for json data (like sharedb or automerge).
| Likewise, if the user is editing rich text in prosemirror
| then you want a rich text crdt like peritext. Rich text
| CRDTs add the concept of annotations - so if two users bold
| overlapping regions of text, the crdt understands that the
| result should be that the entire region is bolded. And that
| can be translated back to markdown if you want.
|
| Luckily, in over a decade of work in the collaborative
| editing space, I haven't found any other examples of this
| problem other than rich text and json. I think if a
| collaborative editing system supports json data, rich text
| and plain text then it'll be enough to add collaborative
| editing support to 99% of applications.
|
| The ink & switch people did a great write up of how rich
| text CRDTs work here:
| https://www.inkandswitch.com/peritext/
| jsunderland323 wrote:
| Hi Joseph,
|
| This is a hard problem and something that I think has to be
| intrinsic to the serialization method of your CRDT. Unique
| IDs of any kind are really risky in CRDTs because they
| basically break the mathematical links between potential
| partitioned duplicates. Lists make this problem even harder
| because you have to contend with whether you should incorporate
| the ordinal of each element into the hash value.
|
| I'm working on a version control system for visual programs.
| You wrote a response to someone on structured edits a few
| months back about how the UNIX fs not being schema/structure
| aware is the core problem with collaborative edits and
| destroys open interoperability. I think you hit the nail on
| the head. I've been working on something for nearly a year
| now to try to address that problem. Let me know if you want
| to compare notes.
| jsunderland323 wrote:
| To answer your question 2. I don't think you can
| realistically delete a key, contend with network
| partitions, and have a true eventually consistent data
| structure. I think you're sort of running into the uncommit
| problem. Feels solvable but at the expense of another trade
| off you don't want to make. The solution here is really in
| workflow. Git solves it by separating the act of committing
| and pushing. If you push a commit containing a secret, the
| only thing you can do is invalidate the secret or not push
| your offline branch containing the commit in the first
| place.
| hnfong wrote:
| CS 168: The Modern Algorithmic Toolbox, Spring 2023
|
| https://web.stanford.edu/class/cs168/index.html
|
| I didn't have time to go through all of this, but the parts that
| I did read were very interesting to me.
|
| It's not even cutting edge new stuff of 202x, probably most of
| them are new-ish results from ~2000 onwards.
| eternalban wrote:
| This is something I planned (2015) on sharing at some point but
| then years flew by and here we are .. :}
|
| It is a cacheline sized 'container' (CLC) of machine-word length
| records, with one record used to store the record order and
| remaining bits for metadata. So you can project different kinds
| of container semantics, such as FIFO or LRU -- any ordered set
| semantics -- on this meta record. Using arrays of CLCs you can
| create e.g. a segmented LRU, where the overhead per item is
| substantially less than a conventional pointer-based
| data structure, and is naturally suited for concurrent operations
| (for example by assigning a range to distinct worker threads),
| and ops require only a few cache lines to be touched. The LRU
| (or whatever) semantics in aggregate will be probabilistic, as
| the LRU order is deterministic per unit container only. It is
| very fast :)
|
| https://github.com/alphazero/libclc/blob/master/include/libc...
|
| https://github.com/alphazero/libclc/blob/master/include/libc...
|
| As for the non-deterministic aspects: Since container semantics
| e.g. LRU order is only applied at unit level, the overall cache
| is ~LRU. We can strictly quantify the 'ordering error' by
| observing the age of items in FIFO mode as they are removed: for
| a deterministic container the age of the item is equal to the
| total capacity of the queue, for a segmented (array) composed of
| FIFO queues, the age will have a effectively gaussian
| distribution around the capacity (number of units x unit
| capacity). But since these containers can use as few as 9 bits
| per entry vs 24 or more _bytes_ for pointer based solutions
| (which use linked-lists), for the same allocation of memory, the
| capacity of the array of CLCs will be much greater, so, the
| distribution tail of 'short-lived' items will actually be longer
| lived than items in a strict queue for the same exact memory.
| Additional techniques, such as n-array hashing, and low order
| 'clock' bits at container level, can tighten this distribution
| significantly (i.e. ~LRU -> LRU) via better loading factors.
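| To make the 'order record' idea concrete, here is a rough Python
| sketch (the 7-slot layout and 3-bit fields are a simplified
| illustration of the style, not the actual libclc format): one
| metadata word stores the LRU order of the other slots as packed
| 3-bit indices, so a 'touch' rewrites a single machine word
| instead of chasing and relinking list pointers:
|
|     SLOTS = 7
|
|     def read_order(meta):
|         return [(meta >> (3 * i)) & 0b111 for i in range(SLOTS)]
|
|     def write_order(order):
|         meta = 0
|         for i, slot in enumerate(order):
|             meta |= slot << (3 * i)
|         return meta
|
|     def touch(meta, slot):
|         """Move `slot` to the most-recently-used position."""
|         order = read_order(meta)
|         order.remove(slot)
|         return write_order([slot] + order)
|
|     meta = write_order([0, 1, 2, 3, 4, 5, 6])   # 0 = MRU ... 6 = LRU
|     meta = touch(meta, 6)
|     assert read_order(meta) == [6, 0, 1, 2, 3, 4, 5]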
| hidudeb wrote:
| [flagged]
| sanxiyn wrote:
| Theoretical advances are constant (just check the literature),
| but two particular advances in the last decade are of practical
| importance: one for sort and one for hash table. As in: check the
| implementation you are using, if it doesn't incorporate these
| advances you are leaving substantial performance (2x is not out
| of the question) on the table.
|
| How Branch Mispredictions don't affect Quicksort (2016)
| https://arxiv.org/abs/1604.06697
|
| Quicksort is still the algorithm of choice for general purpose
| internal unstable sort. On modern hardware, quicksort spends a
| lot of time recovering from branch misprediction, because
| quicksort's comparison branches are inherently unpredictable. The
| problem is severe to the extent that with branchful quicksort,
| uneven split with more recursion is faster than even split,
| because it makes branches more predictable! Exact penalty is
| hardware microarchitecture specific, but one report is 20:80
| split being optimal.
|
| So... you should use branchless quicksort! The technique was
| pioneered by an implementation called BlockQuicksort, but pdqsort
| (for pattern-defeating quicksort)
| https://arxiv.org/abs/2106.05123 is also popular. pdqsort
| incorporates branchless technique and also includes other
| enhancements.
|
| Designing a Fast, Efficient, Cache-friendly Hash Table, Step by
| Step (2017) https://abseil.io/blog/20180927-swisstables
|
| Hash table is the workhorse of data structures, and while
| improvements are possible for specific usage patterns, open
| addressing with linear probing is still the fastest
| implementation for general purpose. So called "Swiss Table"
| combines cache-friendly metadata layout with SIMD-enhanced
| probing. These two optimizations work well with each other and
| combination of two is particularly effective.
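| To give a feel for the Swiss Table idea, here is a toy Python
| model of just the probe step (constants and layout illustrative;
| the real thing compares 16 control bytes at once with SSE2/NEON):
|
|     def split_hash(h):
|         return h >> 7, h & 0x7F   # H1 picks the group, H2 is a 7-bit tag
|
|     def match_group(ctrl, h2):
|         # simulates the SIMD compare: which of the 16 slots in this
|         # group have a control byte equal to the 7-bit tag?
|         return [i for i, c in enumerate(ctrl) if c == h2]
|
|     EMPTY = 0b1000_0000                  # high bit set = empty slot
|     ctrl = [EMPTY] * 16                  # one control byte per slot
|     h1, h2 = split_hash(hash("some key") & (2**64 - 1))
|     ctrl[3] = h2                         # pretend the key landed in slot 3
|     print(match_group(ctrl, h2))         # [3] -> only slot 3 needs a full compare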
| _dain_ wrote:
| _> pdqsort (for pattern-defeating quicksort)_
|
| I'm sad that it's not "pretty darn quicksort"
| sanxiyn wrote:
| It probably is that, just not officially. You know,
| officially, BFR stands for Big Falcon Rocket.
| anonymoushn wrote:
| For hit-heavy workloads, I have been unable to get Swiss tables
| to compete with tables that hit only ~1 cache line per query.
| hinkley wrote:
| I recall years ago someone suggesting that a 1:2 split was
| faster, and I'm wondering if they were seeing the same thing
| and either didn't follow it through to a conclusion, or the
| goalposts keep moving with each generation of processors.
| sanxiyn wrote:
| It is plausible that they were seeing the same thing.
| xiaodai wrote:
| Cache oblivious algorithms and lock free data structures
| Twisol wrote:
| E-graphs are pretty awesome, and worth keeping in your back
| pocket. They're like union-find structures, except they also
| maintain congruence relations (i.e. if `x` and `y` are in the
| same set, then `f(x)` and `f(y)` must likewise be in the same
| set).
|
| https://egraphs-good.github.io/
|
| (Incidentally, union-find structures are also great to know
| about. But they're not exactly "new".)
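| For reference, a path-compressing union-find is only a dozen
| lines of Python (the congruence part is what an e-graph layers
| on top: whenever x and y are merged, anything of the form f(x)
| and f(y) gets merged too):
|
|     class UnionFind:
|         def __init__(self):
|             self.parent = {}
|
|         def find(self, x):
|             self.parent.setdefault(x, x)
|             while self.parent[x] != x:
|                 self.parent[x] = self.parent[self.parent[x]]  # path halving
|                 x = self.parent[x]
|             return x
|
|         def union(self, a, b):
|             ra, rb = self.find(a), self.find(b)
|             if ra != rb:
|                 self.parent[ra] = rb
|
|     uf = UnionFind()
|     uf.union("x", "y")
|     assert uf.find("x") == uf.find("y")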
| giogadi wrote:
| IMO the hip new thing in algorithms and data structures is to
| Just Use Arrays For Everything
| dymk wrote:
| Surely it's the interesting things built on top of the arrays?
| yamazakiwi wrote:
| Are you talking about Vector DB's?
| masklinn wrote:
| Also ECS / columnar data?
| bckr wrote:
| Got a source?
| CountHackulus wrote:
| I'm not OP, but I'm guessing he could be referring to
| something like this talk?
| https://www.youtube.com/watch?v=LrVi9LHP8Bk
|
| I'm honestly not sure otherwise.
| eternalban wrote:
| When skirt lengths change and we observe no measurable change
| in human anatomy, that's fashion and to follow fashion is
| "hip".
|
| When the computing platform changes and you have to start
| paying attention to L-level caches and counting your bytes (so
| you can do your big compute in memory), and use n cores, and
| then arrays are adopted, that is not fashion; so not "hip",
| rather _modern_.
| transformi wrote:
| Follow-up, anything in graph-theory?
| kzrdude wrote:
| I'm not exactly up to date, but the last few years have seen
| practical improvements to graph isomorphism algorithms.
| bmitc wrote:
| I'm interested in knowing about graph layout algorithms that
| allow specific constraints. If anyone knows about these things
| please let me know. I do not, but I have a use case for them.
|
| For example, the constraints may be:
|
| 1) The graph is read left to right. This informs the layout such
| that those spiral and circular graph layout algorithms that blow
| out the graph do not work.
|
| 2) Try to keep certain types of nodes aligned with each other, so
| that they lie on the same horizontal line.
|
| 3) Minimize edge crosses.
|
| 4) Have the edges be not lines but paths made up of horizontal
| and vertical lines. This probably requires some path finding as
| well.
|
| 5) Be able to live update in the sense that a new node or edge
| entry into the graph does not drastically re-layout the graph for
| reasonable inserts.
| alejohausner wrote:
| Look at graphviz. Their "dot" program satisfies 1 through 4.
| Your fifth wish is hard to satisfy.
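| A minimal sketch with the Python graphviz bindings (node names
| and attributes here are just an illustration; needs the
| `graphviz` package plus a Graphviz install):
|
|     import graphviz
|
|     g = graphviz.Digraph(graph_attr={
|         "rankdir": "LR",     # (1) read left to right
|         "splines": "ortho",  # (4) edges as horizontal/vertical segments
|     })
|     for n in ["a", "b", "c", "d"]:
|         g.node(n)
|     g.edges([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
|
|     # (2) pin b and c to the same rank so they stay aligned
|     with g.subgraph() as s:
|         s.attr(rank="same")
|         s.node("b")
|         s.node("c")
|
|     g.render("layout", format="svg")   # dot also minimises crossings (3)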
___________________________________________________________________
(page generated 2023-05-10 23:00 UTC)