[HN Gopher] DataChain: DBT for Unstructured Data
___________________________________________________________________
DataChain: DBT for Unstructured Data
Author : shcheklein
Score : 92 points
Date : 2024-11-04 17:34 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| whalesalad wrote:
| > It is made to organize your unstructured data into datasets and
| wrangle it at scale on your local machine.
|
| How does one wrangle terabytes of data on a local machine?
| shcheklein wrote:
| The idea is that it doesn't store binary files locally - just
| pointers plus metadata in a DB (SQLite if you run it locally,
| open source). So it does versioning, structuring of datasets,
| etc. by "reference", if you wish.
|
| (That is different from, say, DVC, which always copies files
| into a local cache.)
| aduffy wrote:
| So in the case from the README, where you're trying to curate
| a sample of your data, the only thing that you're reading is
| the metadata, UNTIL you run `export_files` and that actually
| copies the binary data to your local machine?
| dmpetrov wrote:
| Exactly! DataChain does lazy compute. It will read
| metadata/json while applying filtering and only download a
| sample of data files (jpg) based on the filter.
|
| This way, you might end up downloading just 1% of your
| data, as defined by the metadata filter.
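| To illustrate the lazy-compute idea in plain Python (this is a
| conceptual sketch, not the DataChain API - the records, paths,
| and fetch() helper are all made up): filter on metadata first,
| and only fetch the binaries that survive the filter.

```python
# Hypothetical metadata records, as they might be parsed from JSON sidecars.
records = [
    {"path": "s3://bucket/img_001.jpg", "label": "cat", "score": 0.97},
    {"path": "s3://bucket/img_002.jpg", "label": "dog", "score": 0.41},
    {"path": "s3://bucket/img_003.jpg", "label": "cat", "score": 0.88},
]

def fetch(path):
    """Stand-in for the actual download; only called for matches."""
    return f"<bytes of {path}>"

# Lazy pipeline: nothing is fetched until the generator is consumed.
sample = (fetch(r["path"]) for r in records
          if r["label"] == "cat" and r["score"] > 0.9)

result = list(sample)  # only img_001.jpg is ever "downloaded"
```

| With a selective enough filter, only a small fraction of the
| files behind the metadata ever gets transferred.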
| mpeg wrote:
| It took me a minute to grok what this was for, but I think I like
| it
|
| It doesn't really replace any of the tooling we use to wrangle
| data at scale (like prefect or dagster or temporal) but as a
| local library it seems to be excellent, I think what confused me
| most was the comparison to dbt.
|
| I like the from_* utils and the magic of the Column class
| operator overloading and how chains can be used as datasets. Love
| how easy checkpointing is too. Will give it a go
| dmpetrov wrote:
| Yes, it's not meant to replace data engineering tools like
| Prefect or Temporal. Instead, it serves as a transformation
| engine and ad-hoc analytics layer for image/video/text data.
| It's pretty much the dbt use case for text and images in
| S3/GCS, though every analogy has its limits.
|
| Try it out - looking forward to your feedback!
| dmpetrov wrote:
| Yay! Excited to see DataChain on the front page :)
|
| Maintainer and author here. Happy to answer any questions.
|
| We built DataChain because our DVC couldn't fully handle data
| transformations and versioning directly in S3/GCS/Azure without
| copying the data.
|
| The analogy "dbt for unstructured data" fits DataChain well,
| since it transforms data in place in storage (S3, not a DB)
| using Python rather than SQL. Happy to talk more!
| jerednel wrote:
| Cool! Does this assume the unstructured data already has a
| corresponding metadata file?
|
| My most common use cases involve getting PDFs or HTML files and I
| have to parse the metadata to store along with the embedding.
|
| Would I have to run a process to extract file metadata into JSONs
| for every embedding/chunk? Would keys created based off document
| be title+chunk_no?
|
| Very interested in this because documents from clients are
| subject to random changes and I don't have very robust systems in
| place.
| dmpetrov wrote:
| DataChain has no assumptions about metadata format. However,
| some formats are supported out of the box: WebDataset, json-
| pair, openimage, etc.
|
| Extract metadata as usual, then return the result as JSON or a
| Pydantic object. DataChain will automatically serialize it to
| the internal dataset structure (SQLite), which can be exported
| to CSV/Parquet.
|
| In the case of PDF/HTML, you will likely produce multiple
| documents per file, which is also supported - just `yield
| my_result` multiple times from map().
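| A plain-Python sketch of a mapper that yields several records
| per source file (the split_into_chunks() helper and the key
| scheme are hypothetical; in DataChain you'd register such a
| generator with map(), here it runs standalone):

```python
def split_into_chunks(text, size=20):
    """Hypothetical chunker: fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def process_document(path, text):
    # One input file can produce many output records: yield once per chunk.
    for chunk_no, chunk in enumerate(split_into_chunks(text)):
        yield {"key": f"{path}#chunk{chunk_no}", "text": chunk}

records = list(process_document("report.pdf", "some long extracted text " * 3))
```

| Each yielded dict becomes its own row, keyed by file path plus
| chunk number, so one PDF fans out into many dataset records.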
|
| Check out video: https://www.youtube.com/watch?v=yjzcPCSYKEo
| Blog post: https://datachain.ai/blog/datachain-unstructured-
| pdf-process...
| spott wrote:
| > DataChain has no assumptions about metadata format.
|
| Could your metadata come from something like a Postgres sql
| statement? Or an iceberg view?
| dmpetrov wrote:
| Absolutely, that's a common scenario!
|
| Just connect to the DB from your Python code (like the lambda
| in the example) and extract the necessary data.
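| For instance, a mapper can look metadata up in a SQL database
| as it runs. This sketch uses an in-memory sqlite3 stand-in
| (table name and columns are made up); with Postgres you would
| use a driver like psycopg and a real DSN instead:

```python
import sqlite3

# In-memory DB standing in for an external metadata store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (path TEXT, label TEXT)")
conn.executemany("INSERT INTO meta VALUES (?, ?)",
                 [("a.jpg", "cat"), ("b.jpg", "dog")])

def lookup_label(path):
    """Fetch the label for a file path; None when no row matches."""
    row = conn.execute("SELECT label FROM meta WHERE path = ?",
                       (path,)).fetchone()
    return row[0] if row else None
```

| The lookup function can then be called per record inside the
| transformation, attaching DB-sourced fields to each file.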
| nbbaier wrote:
| > However, some formats are supported out of the box:
| WebDataset, json-pair, openimage, etc.
|
| Forgive my ignorance, but what is "json-pair"?
| dmpetrov wrote:
| It's not a format :)
|
| It's simply about linking metadata from a JSON file to a
| corresponding image or video file, like pairing data003.png
| and data003.json into a single, virtual record. Some formats
| use this approach: the open-image or laion datasets.
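| The pairing described above is just matching files by stem. A
| minimal sketch (file names are illustrative):

```python
from pathlib import Path

# Files as they might sit side by side in a bucket or directory.
files = ["data001.png", "data001.json", "data002.png", "data002.json"]

# Group by stem so each virtual record holds the image and its JSON.
pairs = {}
for name in files:
    p = Path(name)
    pairs.setdefault(p.stem, {})[p.suffix.lstrip(".")] = name
```

| After grouping, pairs["data001"] references both data001.png
| and data001.json as one logical record.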
| Kiro wrote:
| What relevant metadata is there in an HTML file?
| dmpetrov wrote:
| I guess it involves splitting a file into smaller document
| snippets, getting page numbers and such, and calculating
| embeddings for each snippet - that's the usual approach.
| Specific signals vary by use case.
|
| Hopefully, @jerednel can add more details.
| jerednel wrote:
| For HTML it's markup tags - h1's, page title, meta keywords,
| meta descriptions.
|
| My retriever functions will typically use metadata in
| combination with the similarity search to impart some sort of
| influence or for reranking.
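| One way to sketch that kind of metadata-influenced reranking
| (the boost weight, field names, and sample hits are all made
| up for illustration): blend the similarity score with a bonus
| when the query term appears in a metadata field like the title.

```python
def rerank(hits, query, boost=0.2):
    """Reorder similarity-search hits, boosting metadata matches."""
    def score(hit):
        bonus = boost if query.lower() in hit["title"].lower() else 0.0
        return hit["similarity"] + bonus
    return sorted(hits, key=score, reverse=True)

hits = [
    {"title": "Billing FAQ", "similarity": 0.80},
    {"title": "Refund policy", "similarity": 0.75},
]
ranked = rerank(hits, "refund")  # metadata match overtakes the raw top hit
```

| Here the lower-similarity document wins because its title
| matches the query, which is the "influence" idea in miniature.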
___________________________________________________________________
(page generated 2024-11-04 23:00 UTC)