[HN Gopher] DataChain: DBT for Unstructured Data
       ___________________________________________________________________
        
       DataChain: DBT for Unstructured Data
        
       Author : shcheklein
       Score  : 92 points
       Date   : 2024-11-04 17:34 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | whalesalad wrote:
       | > It is made to organize your unstructured data into datasets and
       | wrangle it at scale on your local machine.
       | 
       | How does one wrangle terabytes of data on a local machine?
        
         | shcheklein wrote:
          | The idea is that it doesn't store binary files locally, just
          | pointers in the DB + metadata (SQLite if you run locally, open
          | source). So it's versioning, structuring of datasets, etc., by
          | "references" if you wish.
          | 
          | (that differs from, say, DVC, which always copies files into a
          | local cache)
        
           | aduffy wrote:
           | So in the case from the README, where you're trying to curate
           | a sample of your data, the only thing that you're reading is
           | the metadata, UNTIL you run `export_files` and that actually
           | copies the binary data to your local machine?
        
             | dmpetrov wrote:
             | Exactly! DataChain does lazy compute. It will read
             | metadata/json while applying filtering and only download a
             | sample of data files (jpg) based on the filter.
             | 
             | This way, you might end up downloading just 1% of your
             | data, as defined by the metadata filter.
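The lazy-compute flow described above can be sketched in plain Python. Everything here is illustrative - the record fields and the `download` helper are hypothetical stand-ins, not the DataChain API:

```python
# Filtering runs against lightweight metadata rows; binaries are only
# fetched for the rows that survive the filter (the "lazy compute" idea).
records = [
    {"path": "s3://bucket/img001.jpg", "label": "cat", "score": 0.97},
    {"path": "s3://bucket/img002.jpg", "label": "dog", "score": 0.42},
    {"path": "s3://bucket/img003.jpg", "label": "cat", "score": 0.91},
]

def download(path):
    # Placeholder: a real implementation would fetch the object from S3.
    return f"<bytes of {path}>"

# Step 1: the filter touches metadata only - no file transfer yet.
selected = [r for r in records if r["label"] == "cat" and r["score"] > 0.9]

# Step 2: only the matching files are actually downloaded.
files = [download(r["path"]) for r in selected]
```

Here only 2 of the 3 files would ever be transferred, which is how a metadata filter can cut downloads to a small fraction of the data.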
        
       | mpeg wrote:
       | It took me a minute to grok what this was for, but I think I like
       | it
       | 
        | It doesn't really replace any of the tooling we use to wrangle
        | data at scale (like Prefect, Dagster, or Temporal), but as a
        | local library it seems excellent. I think what confused me most
        | was the comparison to dbt.
       | 
       | I like the from_* utils and the magic of the Column class
       | operator overloading and how chains can be used as datasets. Love
       | how easy checkpointing is too. Will give it a go
        
         | dmpetrov wrote:
          | Yes, it's not meant to replace data engineering tools like
          | Prefect or Temporal. Instead, it serves as a transformation
          | engine and ad-hoc analytics tool for image/video/text data.
          | It's pretty much the DBT use case for text and images in
          | S3/GCS, though every analogy has its limits.
         | 
         | Try it out - looking forward to your feedback!
        
       | dmpetrov wrote:
       | Yay! Excited to see DataChain on the front page :)
       | 
       | Maintainer and author here. Happy to answer any questions.
       | 
       | We built DataChain because our DVC couldn't fully handle data
       | transformations and versioning directly in S3/GCS/Azure without
       | data copying.
       | 
        | The analogy "DBT for unstructured data" applies very well to
        | DataChain since it transforms data (using Python, not SQL)
        | directly in storage (S3, not a DB). Happy to talk more!
        
       | jerednel wrote:
       | Cool! Does this assume the unstructured data already has a
       | corresponding metadata file?
       | 
       | My most common use cases involve getting PDFs or HTML files and I
       | have to parse the metadata to store along with the embedding.
       | 
        | Would I have to run a process to extract file metadata into
        | JSONs for every embedding/chunk? Would keys created based off
        | the document be title+chunk_no?
       | 
       | Very interested in this because documents from clients are
       | subject to random changes and I don't have very robust systems in
       | place.
        
         | dmpetrov wrote:
         | DataChain has no assumptions about metadata format. However,
         | some formats are supported out of the box: WebDataset, json-
         | pair, openimage, etc.
         | 
          | Extract metadata as usual, then return the result as JSON or a
          | Pydantic object. DataChain will automatically serialize it to
          | an internal dataset structure (SQLite), which can be exported
          | to CSV/Parquet.
         | 
          | In the case of PDF/HTML, you will likely produce multiple
          | documents per file, which is also supported - just `yield
          | my_result` multiple times from map().
         | 
         | Check out video: https://www.youtube.com/watch?v=yjzcPCSYKEo
         | Blog post: https://datachain.ai/blog/datachain-unstructured-
         | pdf-process...
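The "yield multiple times from map()" pattern can be illustrated with a plain Python generator - the chunking logic and field names below are hypothetical, not DataChain's:

```python
# One input file can fan out into many output records: here, one record
# per fixed-size text chunk, keyed by path + chunk number.
def split_into_chunks(text, size=20):
    return [text[i:i + size] for i in range(0, len(text), size)]

def process_document(path, text):
    # A map-style step: yield once per extracted chunk.
    for chunk_no, chunk in enumerate(split_into_chunks(text)):
        yield {"path": path, "chunk_no": chunk_no, "text": chunk}

# A single PDF/HTML file produces several records downstream.
records = list(process_document("report.pdf", "some long extracted text " * 3))
```

Each yielded dict becomes its own row, so one source document naturally maps to many embedding-ready chunks.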
        
           | spott wrote:
           | > DataChain has no assumptions about metadata format.
           | 
           | Could your metadata come from something like a Postgres sql
           | statement? Or an iceberg view?
        
             | dmpetrov wrote:
             | Absolutely, that's a common scenario!
             | 
             | Just connect from your Python code (like the lambda in the
             | example) to DB and extract the necessary data.
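As a sketch of that flow - sqlite3 stands in below for Postgres (with Postgres you would use a driver such as psycopg), and the table/column names are made up:

```python
import sqlite3

# A metadata table queried from Python; the rows returned can then be
# attached to the matching files in object storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (path TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO meta VALUES (?, ?)",
    [("s3://bucket/a.jpg", "cat"), ("s3://bucket/b.jpg", "dog")],
)

def fetch_paths(label):
    # Extract only the rows you need; the same query shape works for a
    # real Postgres connection or an Iceberg view exposed via SQL.
    rows = conn.execute("SELECT path FROM meta WHERE label = ?", (label,))
    return [path for (path,) in rows]
```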
        
           | nbbaier wrote:
           | > However, some formats are supported out of the box:
           | WebDataset, json-pair, openimage, etc.
           | 
           | Forgive my ignorance, but what is "json-pair"?
        
             | dmpetrov wrote:
             | It's not a format :)
             | 
              | It's simply about linking metadata from a JSON file to a
              | corresponding image or video file, like pairing
              | data003.png & data003.json into a single virtual record.
              | Some formats use this approach: open-image or LAION
              | datasets.
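The pairing itself can be sketched by grouping a listing on the shared file stem - illustrative code, not how DataChain implements it:

```python
from pathlib import PurePosixPath

# Group files by stem so data003.png and data003.json collapse into one
# virtual record.
listing = ["data001.png", "data001.json", "data003.png", "data003.json"]

by_stem = {}
for name in listing:
    p = PurePosixPath(name)
    by_stem.setdefault(p.stem, {})[p.suffix] = name

# Keep only stems that have both halves of the pair.
pairs = [
    (files[".png"], files[".json"])
    for files in by_stem.values()
    if ".png" in files and ".json" in files
]
```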
        
         | Kiro wrote:
         | What relevant metadata is there in an HTML file?
        
           | dmpetrov wrote:
            | I guess it involves splitting a file into smaller document
            | snippets, getting page numbers and such, and calculating
            | embeddings for each snippet - that's the usual approach.
            | Specific signals vary by use case.
           | 
           | Hopefully, @jerednel can add more details.
        
             | jerednel wrote:
             | For HTML it's markup tags...h1's, page title, meta
             | keywords, meta descriptions.
             | 
              | My retriever functions will typically use metadata in
              | combination with the similarity search to impart some
              | sort of influence or for reranking.
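Those signals (page title, h1, meta keywords/description) can be pulled out with the Python stdlib parser - a minimal sketch, not production-grade HTML handling:

```python
from html.parser import HTMLParser

# Collect title, h1, and meta keywords/description from an HTML page.
class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}
        self._tag = None  # currently open tag whose text we capture

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._tag = tag
        elif tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag:
            self.meta[self._tag] = data.strip()

parser = MetaExtractor()
parser.feed(
    "<html><head><title>Q3 Report</title>"
    '<meta name="keywords" content="sales,eu"></head>'
    "<body><h1>Results</h1></body></html>"
)
```

The resulting `parser.meta` dict can then be stored alongside each chunk's embedding for filtering or reranking.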
        
       ___________________________________________________________________
       (page generated 2024-11-04 23:00 UTC)