https://github.com/iterative/datachain Skip to content Navigation Menu Toggle navigation Sign in * Product + GitHub Copilot Write better code with AI + Security Find and fix vulnerabilities + Actions Automate any workflow + Codespaces Instant dev environments + Issues Plan and track work + Code Review Manage code changes + Discussions Collaborate outside of code + Code Search Find more, search less Explore + All features + Documentation + GitHub Skills + Blog * Solutions By company size + Enterprises + Small and medium teams + Startups By use case + DevSecOps + DevOps + CI/CD + View all use cases By industry + Healthcare + Financial services + Manufacturing + Government + View all industries View all solutions * Resources Topics + AI + DevOps + Security + Software Development + View all Explore + Learning Pathways + White papers, Ebooks, Webinars + Customer Stories + Partners * Open Source + GitHub Sponsors Fund open source developers + The ReadME Project GitHub community articles Repositories + Topics + Trending + Collections * Enterprise + Enterprise platform AI-powered developer platform Available add-ons + Advanced Security Enterprise-grade security features + GitHub Copilot Enterprise-grade AI features + Premium Support Enterprise-grade 24/7 support * Pricing Search or jump to... Search code, repositories, users, issues, pull requests... Search [ ] Clear Search syntax tips Provide feedback We read every piece of feedback, and take your input very seriously. [ ] [ ] Include my email address so I can be contacted Cancel Submit feedback Saved searches Use saved searches to filter your results more quickly Name [ ] Query [ ] To see all available qualifiers, see our documentation. Cancel Create saved search Sign in Sign up Reseting focus You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session. Dismiss alert {{ message }} iterative / datachain Public * Notifications You must be signed in to change notification settings * Fork 57 * Star 1.1k AI-data warehouse to enrich, transform and analyze data from cloud storages docs.datachain.ai License Apache-2.0 license 1.1k stars 57 forks Branches Tags Activity Star Notifications You must be signed in to change notification settings * Code * Issues 37 * Pull requests 9 * Discussions * Actions * Projects 0 * Security * Insights Additional navigation options * Code * Issues * Pull requests * Discussions * Actions * Projects * Security * Insights iterative/datachain main BranchesTags [ ] Go to file Code Folders and files Last commit Last Name Name message commit date Latest commit History 328 Commits .github .github docs docs examples examples overrides overrides src/datachain src/datachain tests tests .cruft.json .cruft.json .gitattributes .gitattributes .gitignore .gitignore .pre-commit-config.yaml .pre-commit-config.yaml CODE_OF_CONDUCT.rst CODE_OF_CONDUCT.rst CONTRIBUTING.rst CONTRIBUTING.rst LICENSE LICENSE README.rst README.rst mkdocs.yml mkdocs.yml noxfile.py noxfile.py pyproject.toml pyproject.toml View all files Repository files navigation * README * Code of conduct * Apache-2.0 license logo DataChain PyPI Python Version Codecov Tests DataChain is a modern Pythonic data-frame library designed for artificial intelligence. It is made to organize your unstructured data into datasets and wrangle it at scale on your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack. Key Features Storage as a Source of Truth. + Process unstructured data without redundant copies from S3, GCP, Azure, and local file systems. + Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet. + Unite files and metadata together into persistent, versioned, columnar datasets. Python-friendly data pipelines. + Operate on Python objects and object fields. + Built-in parallelization and out-of-memory compute without SQL or Spark. Data Enrichment and Processing. + Generate metadata using local AI models and LLM APIs. + Filter, join, and group by metadata. Search by vector embeddings. + Pass datasets to Pytorch and Tensorflow, or export them back into storage. Efficiency. + Parallelization, out-of-memory workloads and data caching. + Vectorized operations on Python object fields: sum, count, avg, etc. + Optimized vector search. Quick Start $ pip install datachain Selecting files using JSON metadata A storage consists of images of cats and dogs (dog.1048.jpg, cat.1009.jpg), annotated with ground truth and model inferences in the 'json-pairs' format, where each image has a matching JSON file like cat.1009.json: { "class": "cat", "id": "1009", "num_annotators": 8, "inference": {"class": "dog", "confidence": 0.68} } Example of downloading only "high-confidence cat" inferred images using JSON metadata: from datachain import Column, DataChain meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta") images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg") images_id = images.map(id=lambda file: file.path.split('.')[-2]) annotated = images_id.merge(meta, on="id", right_on="meta.id") likely_cats = annotated.filter((Column("meta.inference.confidence") > 0.93) \ & (Column("meta.inference.class_") == "cat")) likely_cats.export_files("high-confidence-cats/", signal="file") Data curation with a local AI model Batch inference with a simple sentiment model using the transformers library: pip install transformers The code below downloads files the cloud, and applies a user-defined function to each one of them. All files with a positive sentiment detected are then copied to the local directory. from transformers import pipeline from datachain import DataChain, Column classifier = pipeline("sentiment-analysis", device="cpu", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english") def is_positive_dialogue_ending(file) -> bool: dialogue_ending = file.read()[-512:] return classifier(dialogue_ending)[0]["label"] == "POSITIVE" chain = ( DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file", type="text") .settings(parallel=8, cache=True) .map(is_positive=is_positive_dialogue_ending) .save("file_response") ) positive_chain = chain.filter(Column("is_positive") == True) positive_chain.export_files("./output") print(f"{positive_chain.count()} files were exported") 13 files were exported $ ls output/datachain-demo/chatbot-KiT/ 15.txt 20.txt 24.txt 27.txt 28.txt 29.txt 33.txt 37.txt 38.txt 43.txt ... $ ls output/datachain-demo/chatbot-KiT/ | wc -l 13 LLM judging chatbots LLMs can work as universal classifiers. In the example below, we employ a free API from Mistral to judge the publicly available chatbot dialogs. Please get a free Mistral API key at https:// console.mistral.ai $ pip install mistralai (Requires version >=1.0.0) $ export MISTRAL_API_KEY=_your_key_ DataChain can parallelize API calls; the free Mistral tier supports up to 4 requests at the same time. from mistralai import Mistral from datachain import File, DataChain, Column PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure." def eval_dialogue(file: File) -> bool: client = Mistral() response = client.chat.complete( model="open-mixtral-8x22b", messages=[{"role": "system", "content": PROMPT}, {"role": "user", "content": file.read()}]) result = response.choices[0].message.content return result.lower().startswith("success") chain = ( DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file") .settings(parallel=4, cache=True) .map(is_success=eval_dialogue) .save("mistral_files") ) successful_chain = chain.filter(Column("is_success") == True) successful_chain.export_files("./output_mistral") print(f"{successful_chain.count()} files were exported") With the instruction above, the Mistral model considers 31/50 files to hold the successful dialogues: $ ls output_mistral/datachain-demo/chatbot-KiT/ 1.txt 15.txt 18.txt 2.txt 22.txt 25.txt 28.txt 33.txt 37.txt 4.txt 41.txt ... $ ls output_mistral/datachain-demo/chatbot-KiT/ | wc -l 31 Serializing Python-objects LLM responses may contain valuable information for analytics - such as the number of tokens used, or the model performance parameters. Instead of extracting this information from the Mistral response data structure (class ChatCompletionResponse), DataChain can serialize the entire LLM response to the internal DB: from mistralai import Mistral from mistralai.models import ChatCompletionResponse from datachain import File, DataChain, Column PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure." def eval_dialog(file: File) -> ChatCompletionResponse: client = MistralClient() return client.chat( model="open-mixtral-8x22b", messages=[{"role": "system", "content": PROMPT}, {"role": "user", "content": file.read()}]) chain = ( DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file") .settings(parallel=4, cache=True) .map(response=eval_dialog) .map(status=lambda response: response.choices[0].message.content.lower()[:7]) .save("response") ) chain.select("file.name", "status", "response.usage").show(5) success_rate = chain.filter(Column("status") == "success").count() / chain.count() print(f"{100*success_rate:.1f}% dialogs were successful") Output: file status response response response name usage usage usage prompt_tokens total_tokens completion_tokens 0 1.txt success 547 548 1 1 10.txt failure 3576 3578 2 2 11.txt failure 626 628 2 3 12.txt failure 1144 1182 38 4 13.txt success 1100 1101 1 [Limited by 5 rows] 64.0% dialogs were successful Iterating over Python data structures In the previous examples, datasets were saved in the embedded database (SQLite in folder .datachain of the working directory). These datasets were automatically versioned, and can be accessed using DataChain.from_dataset("dataset_name"). Here is how to retrieve a saved dataset and iterate over the objects: chain = DataChain.from_dataset("response") # Iterating one-by-one: support out-of-memory workflow for file, response in chain.limit(5).collect("file", "response"): # verify the collected Python objects assert isinstance(response, ChatCompletionResponse) status = response.choices[0].message.content[:7] tokens = response.usage.total_tokens print(f"{file.get_uri()}: {status}, file size: {file.size}, tokens: {tokens}") Output: gs://datachain-demo/chatbot-KiT/1.txt: Success, file size: 1776, tokens: 548 gs://datachain-demo/chatbot-KiT/10.txt: Failure, file size: 11576, tokens: 3578 gs://datachain-demo/chatbot-KiT/11.txt: Failure, file size: 2045, tokens: 628 gs://datachain-demo/chatbot-KiT/12.txt: Failure, file size: 3833, tokens: 1207 gs://datachain-demo/chatbot-KiT/13.txt: Success, file size: 3657, tokens: 1101 Vectorized analytics over Python objects Some operations can run inside the DB without deserialization. For instance, let's calculate the total cost of using the LLM APIs, assuming the Mixtral call costs $2 per 1M input tokens and $6 per 1M output tokens: chain = DataChain.from_dataset("mistral_dataset") cost = chain.sum("response.usage.prompt_tokens")*0.000002 \ + chain.sum("response.usage.completion_tokens")*0.000006 print(f"Spent ${cost:.2f} on {chain.count()} calls") Output: Spent $0.08 on 50 calls PyTorch data loader Chain results can be exported or passed directly to PyTorch dataloader. For example, if we are interested in passing image and a label based on file name suffix, the following code will do it: from torch.utils.data import DataLoader from transformers import CLIPProcessor from datachain import C, DataChain processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") chain = ( DataChain.from_storage("gs://datachain-demo/dogs-and-cats/", type="image") .map(label=lambda name: name.split(".")[0], params=["file.name"]) .select("file", "label").to_pytorch( transform=processor.image_processor, tokenizer=processor.tokenizer, ) ) loader = DataLoader(chain, batch_size=1) Tutorials * Getting Started * Multimodal (try in Colab) * LLM evaluations (try in Colab) * Reading JSON metadata (try in Colab) Contributions Contributions are very welcome. To learn more, see the Contributor Guide. Community and Support * Docs * File an issue if you encounter any problems * Discord Chat * Email * Twitter About AI-data warehouse to enrich, transform and analyze data from cloud storages docs.datachain.ai Topics ai cv embeddings data-analytics data-wrangling multimodal mlops llm llm-eval Resources Readme License Apache-2.0 license Code of conduct Code of conduct Activity Custom properties Stars 1.1k stars Watchers 14 watching Forks 57 forks Report repository Releases 44 0.6.5 Latest Nov 1, 2024 + 43 releases Contributors 23 * @mattseddon * @skshetry * @rlamy * @ilongin * @dreadatour * @dtulga * @pre-commit-ci[bot] * @shcheklein * @dmpetrov * @volkfox * @amritghimire * @EdwardLi-coder * @dependabot[bot] * @mnrozhkov + 9 contributors Languages * Python 100.0% Footer (c) 2024 GitHub, Inc. Footer navigation * Terms * Privacy * Security * Status * Docs * Contact * Manage cookies * Do not share my personal information You can't perform that action at this time.