https://github.com/iterative/datachain

AI-data warehouse to enrich, transform and analyze data from cloud
storages

docs.datachain.ai

License

Apache-2.0 license
DataChain

DataChain is a modern Pythonic data-frame library designed for
artificial intelligence. It is made to organize your unstructured
data into datasets and wrangle it at scale on your local machine.
DataChain does not abstract or hide the AI models and API calls, but
helps to integrate them into the postmodern data stack.

Key Features

 

  * Storage as a Source of Truth.
      + Process unstructured data without redundant copies from S3,
        GCP, Azure, and local file systems.
      + Multimodal data support: images, video, text, PDFs, JSONs,
        CSVs, parquet.
      + Unite files and metadata together into persistent, versioned,
        columnar datasets.
  * Python-friendly data pipelines.
      + Operate on Python objects and object fields.
      + Built-in parallelization and out-of-memory compute without
        SQL or Spark.
  * Data Enrichment and Processing.
      + Generate metadata using local AI models and LLM APIs.
      + Filter, join, and group by metadata. Search by vector
        embeddings.
      + Pass datasets to PyTorch and TensorFlow, or export them back
        into storage.
  * Efficiency.
      + Parallelization, out-of-memory workloads and data caching.
      + Vectorized operations on Python object fields: sum, count,
        avg, etc.
      + Optimized vector search.

Quick Start

 

$ pip install datachain

Selecting files using JSON metadata

 

A storage consists of images of cats and dogs (dog.1048.jpg,
cat.1009.jpg), annotated with ground truth and model inferences in
the 'json-pairs' format, where each image has a matching JSON file
like cat.1009.json:

{
    "class": "cat", "id": "1009", "num_annotators": 8,
    "inference": {"class": "dog", "confidence": 0.68}
}

The following example uses the JSON metadata to download only the
images that the model inferred to be cats with high confidence:

from datachain import Column, DataChain

meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta")
images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")

images_id = images.map(id=lambda file: file.path.split('.')[-2])
annotated = images_id.merge(meta, on="id", right_on="meta.id")

likely_cats = annotated.filter((Column("meta.inference.confidence") > 0.93) \
                               & (Column("meta.inference.class_") == "cat"))
likely_cats.export_files("high-confidence-cats/", signal="file")
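
The join key and the filter condition above can be checked locally without touching cloud storage. The sketch below is plain Python, not DataChain: it applies the same ID extraction and the same confidence and class test to in-memory records. The helper names (`image_id`, `is_likely_cat`), the second metadata record, and the second path are hypothetical and exist only for illustration.

```python
# Plain-Python illustration of the join key and filter used above.
# Helper names and the "2000" record are hypothetical, not DataChain API.

def image_id(path: str) -> str:
    """Return the numeric ID embedded in a name like 'cat.1009.jpg'."""
    return path.split(".")[-2]

def is_likely_cat(meta: dict, threshold: float = 0.93) -> bool:
    """True when the model inferred 'cat' with confidence above threshold."""
    inference = meta["inference"]
    return inference["confidence"] > threshold and inference["class"] == "cat"

records = {
    # This record is the sample from the text: inferred "dog" at 0.68.
    "1009": {"class": "cat", "id": "1009", "num_annotators": 8,
             "inference": {"class": "dog", "confidence": 0.68}},
    # Hypothetical record that passes the filter.
    "2000": {"class": "cat", "id": "2000", "num_annotators": 3,
             "inference": {"class": "cat", "confidence": 0.97}},
}
paths = ["dogs-and-cats/cat.1009.jpg", "dogs-and-cats/cat.2000.jpg"]

# Keep only the paths whose matching metadata passes the filter.
selected = [p for p in paths if is_likely_cat(records[image_id(p)])]
print(selected)
```

Note that the sample record from the text is excluded: its ground-truth class is "cat", but the filter looks at the inferred class and confidence, not at the annotation.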

Data curation with a local AI model

 

Batch inference with a simple sentiment model using the transformers
library:

pip install transformers

The code below downloads files from the cloud and applies a
user-defined function to each of them. All files with a positive
sentiment detected are then copied to the local directory.

from transformers import pipeline
from datachain import DataChain, Column

classifier = pipeline("sentiment-analysis", device="cpu",
                model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

def is_positive_dialogue_ending(file) -> bool:
    dialogue_ending = file.read()[-512:]
    return classifier(dialogue_ending)[0]["label"] == "POSITIVE"

chain = (
   DataChain.from_storage("gs://datachain-demo/chatbot-KiT/",
                          object_name="file", type="text")
   .settings(parallel=8, cache=True)
   .map(is_positive=is_positive_dialogue_ending)
   .save("file_response")
)

positive_chain = chain.filter(Column("is_positive") == True)
positive_chain.export_files("./output")

print(f"{positive_chain.count()} files were exported")

13 files were exported

$ ls output/datachain-demo/chatbot-KiT/
15.txt 20.txt 24.txt 27.txt 28.txt 29.txt 33.txt 37.txt 38.txt 43.txt ...
$ ls output/datachain-demo/chatbot-KiT/ | wc -l
13
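
The user-defined function in this example only needs an object with a `read()` method and must return a `bool`. The sketch below reproduces that shape with stand-ins so it runs offline: `FakeFile` and `keyword_classifier` are hypothetical, and the keyword rule is a toy substitute for the transformers pipeline, mimicking only its `[{"label": ..., "score": ...}]` output format.

```python
# Sketch of the UDF shape used above, with stand-ins so it runs offline.
# `FakeFile` and `keyword_classifier` are hypothetical; the real example
# uses DataChain's file object and a transformers pipeline.

class FakeFile:
    def __init__(self, text: str):
        self._text = text

    def read(self) -> str:
        return self._text

def keyword_classifier(text: str) -> list:
    """Mimic the pipeline's output format: [{'label': ..., 'score': ...}]."""
    label = "POSITIVE" if "thank" in text.lower() else "NEGATIVE"
    return [{"label": label, "score": 1.0}]

def is_positive_dialogue_ending(file, tail_chars: int = 512) -> bool:
    # Only the last `tail_chars` characters are classified, which keeps the
    # input short regardless of the file's full length.
    dialogue_ending = file.read()[-tail_chars:]
    return keyword_classifier(dialogue_ending)[0]["label"] == "POSITIVE"

# The opening "Thanks" falls outside the last 512 characters, so it is ignored.
long_dialogue = "Thanks for nothing. " + "x" * 600 + " This is useless."
short_dialogue = "That solved it, thank you!"

print(is_positive_dialogue_ending(FakeFile(long_dialogue)))
print(is_positive_dialogue_ending(FakeFile(short_dialogue)))
```

Slicing with `[-512:]` means that only the end of each dialogue influences the result, which is the intent of the function name.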

LLM judging chatbots

 

LLMs can work as universal classifiers. In the example below, we
employ a free API from Mistral to judge the publicly available
chatbot dialogs. Please get a free Mistral API key at
https://console.mistral.ai

$ pip install "mistralai>=1.0.0"
$ export MISTRAL_API_KEY=_your_key_

DataChain can parallelize API calls; the free Mistral tier supports
up to 4 requests at the same time.

from mistralai import Mistral
from datachain import File, DataChain, Column

PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."

def eval_dialogue(file: File) -> bool:
    client = Mistral()
    response = client.chat.complete(
        model="open-mixtral-8x22b",
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": file.read()}])
    result = response.choices[0].message.content
    return result.lower().startswith("success")

chain = (
   DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
   .settings(parallel=4, cache=True)
   .map(is_success=eval_dialogue)
   .save("mistral_files")
)

successful_chain = chain.filter(Column("is_success") == True)
successful_chain.export_files("./output_mistral")

print(f"{successful_chain.count()} files were exported")

With the prompt above, the Mistral model judged 31 of the 50 dialogs
to be successful:

$ ls output_mistral/datachain-demo/chatbot-KiT/
1.txt  15.txt 18.txt 2.txt  22.txt 25.txt 28.txt 33.txt 37.txt 4.txt  41.txt ...
$ ls output_mistral/datachain-demo/chatbot-KiT/ | wc -l
31
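
The `parallel=4` setting in this example caps the number of simultaneous API calls. The sketch below illustrates the general idea of bounded concurrency with the standard library only. It is not how DataChain implements `parallel`, and `fake_api_call` is a hypothetical stand-in for a rate-limited remote request that also records how many calls overlap.

```python
# Illustration of bounded parallelism with the standard library.
# This is NOT DataChain's implementation of `parallel=4`; `fake_api_call`
# is a hypothetical stand-in for a rate-limited remote request.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4
_lock = threading.Lock()
_active = 0
peak_active = 0

def fake_api_call(text: str) -> bool:
    """Pretend to call an API and track how many calls overlap."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)  # simulate network latency
    with _lock:
        _active -= 1
    # Same parsing rule as eval_dialogue: one-word answer, case-insensitive.
    return text.lower().startswith("success")

answers = ["Success", "Failure.", "success!", "Failure"] * 5

# The pool size is the upper bound on requests in flight at any moment.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(fake_api_call, answers))

print(sum(results), "of", len(results), "judged successful")
print("peak concurrent calls:", peak_active)
```

The peak never exceeds the pool size, which is the property that matters when a provider limits concurrent requests.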

Serializing Python objects

 

LLM responses may contain valuable information for analytics, such
as the number of tokens used or the model performance parameters.

Instead of extracting this information from the Mistral response data
structure (class ChatCompletionResponse), DataChain can serialize the
entire LLM response to the internal DB:

from mistralai import Mistral
from mistralai.models import ChatCompletionResponse
from datachain import File, DataChain, Column

PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."

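def eval_dialog(file: File) -> ChatCompletionResponse:
    # mistralai >= 1.0.0: the client class is Mistral and chat
    # completions are requested via client.chat.complete(...)
    client = Mistral()
    return client.chat.complete(
        model="open-mixtral-8x22b",
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": file.read()}])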

chain = (
   DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file")
   .settings(parallel=4, cache=True)
   .map(response=eval_dialog)
   .map(status=lambda response: response.choices[0].message.content.lower()[:7])
   .save("response")
)

chain.select("file.name", "status", "response.usage").show(5)

success_rate = chain.filter(Column("status") == "success").count() / chain.count()
print(f"{100*success_rate:.1f}% dialogs were successful")

Output:

     file   status      response     response          response
     name                  usage        usage             usage
                   prompt_tokens total_tokens completion_tokens
0   1.txt  success           547          548                 1
1  10.txt  failure          3576         3578                 2
2  11.txt  failure           626          628                 2
3  12.txt  failure          1144         1182                38
4  13.txt  success          1100         1101                 1

[Limited by 5 rows]
64.0% dialogs were successful
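
The column headers in the output above (`response.usage.prompt_tokens` and so on) suggest that nested object fields are addressed by dotted paths. The sketch below shows one way such a flattening could work, using standard-library dataclasses. It is only an illustration of the idea, not DataChain's serialization code, and the `Usage` and `Response` classes are hypothetical stand-ins for the Mistral response types.

```python
# Sketch: flatten a nested Python object into dotted column names such as
# "response.usage.prompt_tokens". The dataclasses are hypothetical stand-ins,
# and this is not DataChain's actual serialization code.
from dataclasses import dataclass, fields, is_dataclass

@dataclass
class Usage:
    prompt_tokens: int
    total_tokens: int
    completion_tokens: int

@dataclass
class Response:
    model: str
    usage: Usage

def flatten(obj, prefix: str = "") -> dict:
    """Recursively map dataclass fields to 'parent.child' keys."""
    row = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        key = f"{prefix}{f.name}"
        if is_dataclass(value):
            # Nested object: descend and extend the dotted path.
            row.update(flatten(value, prefix=f"{key}."))
        else:
            row[key] = value
    return row

# Token counts taken from the first row of the output above (1.txt).
response = Response(model="open-mixtral-8x22b", usage=Usage(547, 548, 1))
row = flatten(response, prefix="response.")
print(row)
```

Once every leaf value has its own column, aggregate operations can address a single field without rebuilding the whole object.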

Iterating over Python data structures

 

In the previous examples, datasets were saved in the embedded
database (SQLite in folder .datachain of the working directory).
These datasets were automatically versioned, and can be accessed
using DataChain.from_dataset("dataset_name").

Here is how to retrieve a saved dataset and iterate over the objects:

from mistralai.models import ChatCompletionResponse
from datachain import DataChain

chain = DataChain.from_dataset("response")

# Iterating one-by-one: supports out-of-memory workflows
for file, response in chain.limit(5).collect("file", "response"):
    # verify the collected Python objects
    assert isinstance(response, ChatCompletionResponse)

    status = response.choices[0].message.content[:7]
    tokens = response.usage.total_tokens
    print(f"{file.get_uri()}: {status}, file size: {file.size}, tokens: {tokens}")

Output:

gs://datachain-demo/chatbot-KiT/1.txt: Success, file size: 1776, tokens: 548
gs://datachain-demo/chatbot-KiT/10.txt: Failure, file size: 11576, tokens: 3578
gs://datachain-demo/chatbot-KiT/11.txt: Failure, file size: 2045, tokens: 628
gs://datachain-demo/chatbot-KiT/12.txt: Failure, file size: 3833, tokens: 1207
gs://datachain-demo/chatbot-KiT/13.txt: Success, file size: 3657, tokens: 1101

Vectorized analytics over Python objects

 

Some operations can run inside the DB without deserialization. For
instance, let's calculate the total cost of using the LLM APIs,
assuming the Mixtral call costs $2 per 1M input tokens and $6 per 1M
output tokens:

# "response" is the dataset saved in the serialization example above
chain = DataChain.from_dataset("response")

cost = chain.sum("response.usage.prompt_tokens")*0.000002 \
           + chain.sum("response.usage.completion_tokens")*0.000006
print(f"Spent ${cost:.2f} on {chain.count()} calls")

Output:

Spent $0.08 on 50 calls
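
The pricing formula can be verified by hand on a small sample. The sketch below applies the same assumed rates ($2 per 1M input tokens, $6 per 1M output tokens) to the five rows of token counts printed in the serialization example, not to the full 50-file dataset.

```python
# Worked arithmetic for the pricing assumption above, using only the five
# rows of token counts shown earlier (1.txt, 10.txt, 11.txt, 12.txt, 13.txt).
PRICE_IN = 2 / 1_000_000   # dollars per input (prompt) token
PRICE_OUT = 6 / 1_000_000  # dollars per output (completion) token

prompt_tokens = [547, 3576, 626, 1144, 1100]
completion_tokens = [1, 2, 2, 38, 1]

cost = sum(prompt_tokens) * PRICE_IN + sum(completion_tokens) * PRICE_OUT

print(f"{sum(prompt_tokens)} input + {sum(completion_tokens)} output tokens")
print(f"Cost for 5 calls: ${cost:.6f}")
```

These five calls come to roughly $0.014, almost all of it from input tokens, since the prompt asks for a single-word answer.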

PyTorch data loader

 

Chain results can be exported or passed directly to a PyTorch
DataLoader. For example, to pass an image together with a label
derived from the file name prefix, use the following code:

from torch.utils.data import DataLoader
from transformers import CLIPProcessor

from datachain import C, DataChain

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

chain = (
    DataChain.from_storage("gs://datachain-demo/dogs-and-cats/", type="image")
    .map(label=lambda name: name.split(".")[0], params=["file.name"])
    .select("file", "label").to_pytorch(
        transform=processor.image_processor,
        tokenizer=processor.tokenizer,
    )
)
loader = DataLoader(chain, batch_size=1)
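
The label in this example is the part of the file name before the first dot. The sketch below shows that derivation in plain Python, plus one possible way to turn the text labels into integer class indices for training. The `label_to_index` mapping and the third file name are hypothetical and are not part of the DataChain example.

```python
# Sketch: derive a class label from a file name and encode it as an integer.
# The `label_to_index` mapping and "cat.1.jpg" are hypothetical examples.

def label_from_name(name: str) -> str:
    """'cat.1009.jpg' -> 'cat' (the part before the first dot)."""
    return name.split(".")[0]

names = ["cat.1009.jpg", "dog.1048.jpg", "cat.1.jpg"]
labels = [label_from_name(n) for n in names]

# Stable mapping from label text to class index, sorted alphabetically.
label_to_index = {label: i for i, label in enumerate(sorted(set(labels)))}
targets = [label_to_index[label] for label in labels]

print(labels)
print(targets)
```

Sorting the label set before enumerating keeps the mapping the same across runs, regardless of the order in which files are listed.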

Tutorials

 

  * Getting Started
  * Multimodal (try in Colab)
  * LLM evaluations (try in Colab)
  * Reading JSON metadata (try in Colab)

Contributions

 

Contributions are very welcome. To learn more, see the Contributor
Guide.

Community and Support

 

  * Docs
  * File an issue if you encounter any problems
  * Discord Chat
  * Email
  * Twitter


Topics

ai cv embeddings data-analytics data-wrangling multimodal mlops llm 
llm-eval


Releases 44

 
0.6.5 Latest
Nov 1, 2024
+ 43 releases

Contributors 23

  * @mattseddon
  * @skshetry
  * @rlamy
  * @ilongin
  * @dreadatour
  * @dtulga
  * @pre-commit-ci[bot]
  * @shcheklein
  * @dmpetrov
  * @volkfox
  * @amritghimire
  * @EdwardLi-coder
  * @dependabot[bot]
  * @mnrozhkov

+ 9 contributors

Languages

  * Python 100.0%
