[HN Gopher] Data Branching for Batch Job Systems
       ___________________________________________________________________
        
       Data Branching for Batch Job Systems
        
       Author : thunderbong
       Score  : 36 points
       Date   : 2025-01-22 10:37 UTC (2 days ago)
        
 (HTM) web link (isaacjordan.me)
 (TXT) w3m dump (isaacjordan.me)
        
       | PaulHoule wrote:
       | An interesting pattern that I've thought about in "intelligent
       | systems" is "patching", which could be applied at various
       | stages of a process.
       | 
       | If you look at the average training set used in competitive
       | evaluations, you will find examples in it that are just plain
       | wrong, which puts an upper limit on both the evaluation scores
       | and the real-world performance based on that data.
       | 
       | Very occasionally someone does the hard work to improve the
       | training and eval set and, if the world were efficient, this
       | would be the new data set everyone uses.
       | 
       | In real life you are getting more data all the time and you need
       | a stable way to keep your data set "patched" as new data comes
       | in. Similarly, an AI step later in the process needs to (i) have
       | a human override so it can get things right for the consumer and
       | (ii) remember that override so as not to waste the time of the
       | human or wear them out emotionally.
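       | 
       | A minimal sketch of the "patched" data set idea, assuming every
       | record has a stable id. The class name, the ids and the labels
       | are all hypothetical. Human corrections are stored apart from
       | the raw data and re-applied to each new snapshot, so an override
       | is remembered across refreshes:

```python
from dataclasses import dataclass, field


@dataclass
class PatchedDataset:
    """Keeps human corrections separate from the raw data they correct."""

    # Hypothetical layout: record id -> corrected label.
    overrides: dict[str, str] = field(default_factory=dict)

    def record_override(self, record_id: str, corrected_label: str) -> None:
        """Remember a human correction so nobody has to make it twice."""
        self.overrides[record_id] = corrected_label

    def apply(self, raw_records: dict[str, str]) -> dict[str, str]:
        """Return a patched copy; overrides win over the raw labels."""
        return {
            record_id: self.overrides.get(record_id, label)
            for record_id, label in raw_records.items()
        }


patches = PatchedDataset()
patches.record_override("img-002", "cat")  # a human fixes one wrong label

snapshot_v1 = {"img-001": "dog", "img-002": "dog"}
# A later refresh adds data and delivers the same wrong label again.
snapshot_v2 = {**snapshot_v1, "img-003": "bird"}

patched_v1 = patches.apply(snapshot_v1)
patched_v2 = patches.apply(snapshot_v2)
```

       | The raw snapshots are never modified, so the patch set can be
       | reviewed or shared independently of the data it corrects.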
        
       | larrydavidsdad wrote:
       | Have you seen this? https://www.doltdb.com
        
         | philsnow wrote:
         | I hadn't seen that before and I can't speak to the quality of
         | the project, but I wanted to call out the first section in the
         | readme [0] for being perfectly clear and succinct:
         | 
         | > Git versions files. Dolt versions tables. It's like Git and
         | MySQL had a baby.
         | 
         | > We also built DoltHub, a place to share Dolt databases. We
         | host public data for free. If you want to host your own version
         | of DoltHub, we have DoltLab. If you want us to run a Dolt
         | server for you, we have Hosted Dolt. If you are looking for a
         | Postgres version of Dolt, we built DoltgreSQL. Warning, it's
         | early Alpha. Dolt is production-ready.
         | 
         | [0] https://github.com/dolthub/dolt?tab=readme-ov-file#dolt-
         | is-g...
        
       | prpl wrote:
       | Iceberg has branching, but it doesn't really have great "merge"
       | semantics. Otherwise, its semantics would work well for batch
       | jobs.
       | 
       | What I think I'd like is to say "there are only
       | AppendFilesCommits in these two branches" and merge the two, or
       | otherwise look at the operations to determine whether the two
       | branches can be fast-forwarded.
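       | 
       | A rough sketch of that check, modelled on plain Python data
       | rather than the Iceberg API. The Snapshot class and both
       | functions are hypothetical, and each branch is assumed to be a
       | list of snapshots, oldest first, that share history up to a
       | known common ancestor:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str  # e.g. "append", "overwrite", "delete", "replace"
    added_files: tuple[str, ...] = ()


def commits_since(history: list[Snapshot], ancestor_id: int) -> list[Snapshot]:
    """Return the snapshots that come after the common ancestor."""
    ids = [s.snapshot_id for s in history]
    return history[ids.index(ancestor_id) + 1:]


def merge_append_only(main: list[Snapshot], branch: list[Snapshot],
                      ancestor_id: int) -> tuple[str, list[Snapshot]]:
    """Merge two histories only when doing so is trivially safe."""
    main_new = commits_since(main, ancestor_id)
    branch_new = commits_since(branch, ancestor_id)
    if not main_new:
        # main has not moved since the branch point: fast-forward.
        return "fast-forward", list(branch)
    if all(s.operation == "append" for s in main_new + branch_new):
        # Both sides only added files, so the appends can be replayed.
        return "append-merge", list(main) + branch_new
    raise ValueError("non-append operations found; manual merge needed")


base = Snapshot(1, "append", ("a.parquet",))
main = [base, Snapshot(2, "append", ("b.parquet",))]
branch = [base, Snapshot(3, "append", ("c.parquet",))]

kind, merged = merge_append_only(main, branch, ancestor_id=1)
merged_ids = [s.snapshot_id for s in merged]
```

       | A real table format would write new snapshots when replaying
       | the appends; concatenating the lists is a simplification.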
        
       | ragulpr wrote:
       | This pattern is really meaningful conceptually but the tricky
       | thing is to not create a mess in the process.
       | 
       | If it's too easy to branch, people will do so and the (knowledge)
       | economies of scale disappear (and we'll have a mess).
       | 
       | If the common data definition is too hard to branch from, no
       | experiments will happen (slow).
       | 
       | I think most tech for this seems to make it too easy, and in the
       | process injects a bunch of dependencies that make it slow and
       | harder to access. This may have changed since I last looked.
       | 
       | I found that the simple pattern of versioned paths/table names,
       | such as `s3://mybucket/mystage/version=42/` or `my_table_v42`,
       | puts a high enough evolutionary cost on branching (as consumers
       | need to explicitly adapt) while avoiding the costs associated
       | with using special tech (legacy/lock-in/dependencies).
       | 
       | It's also searchable on GitHub/Slack/etc. if done right.
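       | 
       | A small sketch of the versioned-path convention, assuming the
       | `version=N` layout from the example path. The helper names are
       | hypothetical, and the list of existing prefixes is passed in
       | rather than fetched from S3:

```python
import re

# Matches a trailing "version=<number>/" path segment.
VERSION_RE = re.compile(r"version=(\d+)/?$")


def versioned_path(bucket: str, stage: str, version: int) -> str:
    """Build the explicit path a consumer has to opt into."""
    return f"s3://{bucket}/{stage}/version={version}/"


def latest_version(prefixes: list[str]) -> int:
    """Pick the highest version numerically, not lexically."""
    versions = [int(m.group(1)) for p in prefixes if (m := VERSION_RE.search(p))]
    if not versions:
        raise ValueError("no versioned prefixes found")
    return max(versions)


def next_path(bucket: str, stage: str, prefixes: list[str]) -> str:
    """Path for a new 'branch' of the stage: one version past the latest."""
    return versioned_path(bucket, stage, latest_version(prefixes) + 1)


existing = [
    "s3://mybucket/mystage/version=9/",
    "s3://mybucket/mystage/version=42/",
    "s3://mybucket/mystage/version=10/",
]
current = latest_version(existing)
new_path = next_path("mybucket", "mystage", existing)
```

       | Because the version is part of the path, every consumer that
       | moves to the new output has to change a string that can be
       | searched for.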
        
       | buryat wrote:
       | Apache Iceberg has supported branching and tagging since some
       | early versions:
       | https://iceberg.apache.org/docs/1.4.0/branching/#overview
       | 
       | And the broader name for what the author is describing is the
       | Write-Audit-Publish pattern, where data gets written into a
       | branch first, audited/checked, and then the main branch gets
       | replaced with the new one, effectively publishing the updated
       | dataset using a single command. https://www.tabular.io/apache-
       | iceberg-cookbook/data-engineer...
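       | 
       | The Write-Audit-Publish flow can be sketched in memory as
       | follows. This is an illustration, not the Iceberg API: the
       | BranchedTable class, the audit rule and the "staging" branch
       | name are all hypothetical.

```python
class BranchedTable:
    """Toy table whose named refs each point at a list of rows."""

    def __init__(self, rows):
        self.refs = {"main": list(rows)}

    def create_branch(self, name, from_ref="main"):
        self.refs[name] = list(self.refs[from_ref])

    def append(self, ref, rows):
        self.refs[ref] = self.refs[ref] + list(rows)

    def publish(self, branch):
        # A single reference swap: readers of "main" see all or nothing.
        self.refs["main"] = self.refs.pop(branch)


def audit(rows):
    """Hypothetical check: every row has a non-negative amount."""
    return all(r.get("amount") is not None and r["amount"] >= 0 for r in rows)


def write_audit_publish(table, new_rows, branch="staging"):
    table.create_branch(branch)      # write: stage the batch on a branch
    table.append(branch, new_rows)
    if not audit(table.refs[branch]):
        del table.refs[branch]       # audit failed: main is untouched
        return False
    table.publish(branch)            # publish: swap main to the branch
    return True


table = BranchedTable([{"id": 1, "amount": 10}])
good = write_audit_publish(table, [{"id": 2, "amount": 5}])
bad = write_audit_publish(table, [{"id": 3, "amount": -1}])
main_ids = [r["id"] for r in table.refs["main"]]
```

       | The failed batch never reaches "main", which is the point of
       | auditing on a branch before publishing.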
        
       ___________________________________________________________________
       (page generated 2025-01-24 23:01 UTC)