[HN Gopher] Data Branching for Batch Job Systems
___________________________________________________________________
Data Branching for Batch Job Systems
Author : thunderbong
Score : 36 points
Date : 2025-01-22 10:37 UTC (2 days ago)
(HTM) web link (isaacjordan.me)
(TXT) w3m dump (isaacjordan.me)
| PaulHoule wrote:
| An interesting pattern that I've thought about in "intelligent
| systems" is "patching", which could be applied at various stages
| of a process.
|
| If you look at the average training set used in competitive
| evaluations you will find examples in it that are just plain
| wrong, which puts an upper limit on both the evaluation scores
| and the real-world performance based on that data.
|
| Very occasionally someone does the hard work to improve the
| training and eval set and, if the world was efficient, this would
| be the new data set everyone uses.
|
| In real life you are getting more data all the time and you need
| a stable way to keep your data set "patched" as new data comes
| in. Similarly, an AI step later in the process needs to (i) have
| a human override so it can get things right for the consumer and
| (ii) remember that override so as not to waste the human's time
| or wear them out emotionally.
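A minimal sketch of the "patched" data set idea above, assuming records carry a stable ID. The `PatchedDataset` class and its method names are hypothetical and only illustrate keeping human overrides separate from raw data so they survive later ingests.

```python
from dataclasses import dataclass, field


@dataclass
class PatchedDataset:
    """Keeps human corrections separate from raw records (hypothetical design)."""
    records: dict = field(default_factory=dict)    # record_id -> raw label
    overrides: dict = field(default_factory=dict)  # record_id -> corrected label

    def ingest(self, batch):
        # New raw data may arrive (or re-arrive) at any time; it never
        # touches the stored overrides.
        self.records.update(batch)

    def override(self, record_id, corrected_label):
        # A human fixes a label once; the fix is remembered from then on.
        self.overrides[record_id] = corrected_label

    def view(self):
        # Overrides win over raw values for every known record.
        return {rid: self.overrides.get(rid, label)
                for rid, label in self.records.items()}


ds = PatchedDataset()
ds.ingest({"a": "cat", "b": "dog"})
ds.override("b", "wolf")
ds.ingest({"b": "dog", "c": "bird"})  # re-ingesting "b" does not undo the patch
patched = ds.view()
print(patched)
```

Because the overrides live in their own mapping, the same correction never has to be requested from a human twice, which is the second requirement in the comment.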
| larrydavidsdad wrote:
| Have you seen this? https://www.doltdb.com
| philsnow wrote:
| I hadn't seen that before and I can't speak to the quality of
| the project, but I wanted to call out the first section in the
| readme [0] for being perfectly clear and succinct:
|
| > Git versions files. Dolt versions tables. It's like Git and
| MySQL had a baby.
|
| > We also built DoltHub, a place to share Dolt databases. We
| host public data for free. If you want to host your own version
| of DoltHub, we have DoltLab. If you want us to run a Dolt
| server for you, we have Hosted Dolt. If you are looking for a
| Postgres version of Dolt, we built DoltgreSQL. Warning, it's
| early Alpha. Dolt is production-ready.
|
| [0] https://github.com/dolthub/dolt?tab=readme-ov-file#dolt-
| is-g...
| prpl wrote:
| Iceberg has branching, but it doesn't really have great "merge"
| semantics. Otherwise its semantics would work well for batch
| jobs.
|
| What I think I'd like is to say "there are only
| AppendFilesCommits in these two branches" and merge the two, or
| otherwise look at the operations to determine whether the two
| branches can be fast-forwarded.
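The check described above could look roughly like the following toy model. It does not use Iceberg's API: `Commit`, `can_fast_forward`, and `merge_append_only` are hypothetical names, and a branch is modeled as a plain list of commits, oldest first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    id: str
    op: str  # e.g. "append", "overwrite", "delete" (toy operation labels)


def common_prefix_len(a, b):
    # Number of leading commits the two histories share.
    n = 0
    for x, y in zip(a, b):
        if x.id != y.id:
            break
        n += 1
    return n


def can_fast_forward(target, source):
    # target can be fast-forwarded if its whole history is a prefix of source's.
    return common_prefix_len(target, source) == len(target)


def merge_append_only(target, source):
    if can_fast_forward(target, source):
        return list(source)
    n = common_prefix_len(target, source)
    diverged = list(target[n:]) + list(source[n:])
    if any(c.op != "append" for c in diverged):
        raise ValueError("non-append commits since the branch point; refusing to merge")
    # Appends do not conflict with each other, so replay source's new
    # commits on top of target.
    return list(target) + list(source[n:])


base = [Commit("c1", "append")]
main = base + [Commit("c2", "append")]
branch = base + [Commit("c3", "append")]
merged_ids = [c.id for c in merge_append_only(main, branch)]
print(merged_ids)
```

A real table format would write new snapshots when replaying the appends rather than reuse commit IDs; the sketch only shows the decision logic.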
| ragulpr wrote:
| This pattern is really meaningful conceptually but the tricky
| thing is to not create a mess in the process.
|
| If it's too easy to branch, people will do so and the (knowledge)
| economies of scale disappear (and we'll have a mess).
|
| If the common data definition is too hard to branch from, no
| experiments will happen (slow).
|
| I think most tech for this seems to make it too easy, while
| injecting a bunch of dependencies that make it slow and harder to
| access. This may have changed since I last looked.
|
| I found that the simple pattern of versioned paths/table names as
| `s3://mybucket/mystage/version=42/` or `my_table_v42` puts a high
| enough evolutionary cost on branching (as consumers need to
| explicitly adapt) while it also doesn't have the costs associated
| with using special tech (legacy/lock in/dependencies).
|
| It's also searchable on github/slack/etc if done right.
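A small sketch of the versioned-path convention above. The helper names and the `INPUT_VERSION` constant are assumptions for illustration; the path layout follows the `version=42` example in the comment.

```python
import re

VERSION_RE = re.compile(r"version=(\d+)/?$")


def versioned_path(base, version):
    # Build e.g. "s3://mybucket/mystage/version=42/".
    return f"{base.rstrip('/')}/version={version}/"


def latest_version(paths):
    # Highest version number found among existing paths, or None.
    versions = []
    for p in paths:
        m = VERSION_RE.search(p)
        if m:
            versions.append(int(m.group(1)))
    return max(versions) if versions else None


def next_path(base, existing_paths):
    # Producer side: a "branch" is simply the next version directory.
    latest = latest_version(existing_paths)
    return versioned_path(base, 1 if latest is None else latest + 1)


# Consumer side: the version is pinned explicitly, so moving to a new
# version is a deliberate, greppable code change.
INPUT_VERSION = 42
input_path = versioned_path("s3://mybucket/mystage", INPUT_VERSION)

existing = [versioned_path("s3://mybucket/mystage", v) for v in (41, 42)]
new_path = next_path("s3://mybucket/mystage", existing)
print(input_path, new_path)
```

The consumer never resolves "latest" on its own here, which is what keeps the cost of adopting a new version explicit.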
| buryat wrote:
| Apache Iceberg has supported branching and tagging since some
| early versions:
| https://iceberg.apache.org/docs/1.4.0/branching/#overview
|
| And the broader name for what the author is describing is the
| Write-Audit-Publish pattern, where data gets written into a
| branch first, audited/checked, and then the main branch gets
| replaced with the new one, effectively publishing the updated
| dataset using a single command. https://www.tabular.io/apache-
| iceberg-cookbook/data-engineer...
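The Write-Audit-Publish flow described above can be sketched with a toy in-memory catalog. This is not Iceberg's API: `Catalog`, `audit`, and `write_audit_publish` are hypothetical, and the audit rule (non-negative `amount`) is an invented example check.

```python
class Catalog:
    """Toy in-memory stand-in for a table with named branches."""

    def __init__(self, main_rows):
        self.branches = {"main": list(main_rows)}

    def create_branch(self, name, from_branch="main"):
        # Copy so that writes to the branch never touch main.
        self.branches[name] = list(self.branches[from_branch])

    def append(self, branch, rows):
        self.branches[branch].extend(rows)

    def publish(self, branch):
        # One reference swap: main now points at the audited data.
        self.branches["main"] = self.branches[branch]


def audit(rows):
    # Example check only: every row needs a non-negative amount.
    return all(r.get("amount") is not None and r["amount"] >= 0 for r in rows)


def write_audit_publish(catalog, new_rows, branch="staging"):
    catalog.create_branch(branch)        # write into a branch first
    catalog.append(branch, new_rows)
    ok = audit(catalog.branches[branch])  # audit the branch
    if ok:
        catalog.publish(branch)          # publish only if the audit passed
    del catalog.branches[branch]
    return ok


catalog = Catalog([{"amount": 1}])
good = write_audit_publish(catalog, [{"amount": 2}])
bad = write_audit_publish(catalog, [{"amount": -5}])
print(good, bad, catalog.branches["main"])
```

Readers of `main` only ever see the state before or after a publish, never a half-written or unaudited batch.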
___________________________________________________________________
(page generated 2025-01-24 23:01 UTC)