https://github.com/Nike-Inc/koheesio

Nike-Inc / koheesio (Public)
Stars: 391, Forks: 5

Python framework for building efficient data pipelines.
It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

engineering.nike.com/koheesio/
License: Apache-2.0

Koheesio

Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

The framework is versatile, aiming to support multiple implementations and to work seamlessly with various data processing libraries or frameworks. The goal is for Koheesio to handle data processing tasks regardless of the underlying technology or data scale.
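To make the idea of building pipelines from simple, reusable components concrete, here is a minimal sketch in plain Python. It does not use the Koheesio API: `SimpleStep`, `SimplePipeline`, and the sample records are hypothetical names invented for this illustration, and the real framework's classes may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class SimpleStep:
    """Hypothetical unit of work (not Koheesio's Step): a named function over records."""

    name: str
    func: Callable[[List[Record]], List[Record]]

    def run(self, records: List[Record]) -> List[Record]:
        return self.func(records)


@dataclass
class SimplePipeline:
    """Hypothetical pipeline: runs steps in order, feeding each output to the next step."""

    steps: List[SimpleStep] = field(default_factory=list)

    def run(self, records: List[Record]) -> List[Record]:
        for step in self.steps:
            records = step.run(records)
        return records


# Two small, reusable steps
drop_missing = SimpleStep(
    "drop_missing",
    lambda rows: [r for r in rows if r.get("price") is not None],
)
add_total = SimpleStep(
    "add_total",
    lambda rows: [{**r, "total": r["price"] * r["qty"]} for r in rows],
)

# Compose them into a larger pipeline
pipeline = SimplePipeline([drop_missing, add_total])
result = pipeline.run([
    {"price": 2.0, "qty": 3},
    {"price": None, "qty": 1},
])
print(result)  # [{'price': 2.0, 'qty': 3, 'total': 6.0}]
```

Each step stays small and testable on its own, and the pipeline is just an ordered list of steps, so the same step can be reused in several pipelines.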
Koheesio uses Pydantic for strong typing, data validation, and settings management, ensuring a high level of type safety and structured configurations within pipeline components.

Koheesio's goal is to ensure predictable pipeline execution through a solid foundation of well-tested code and a rich set of features, making it an excellent choice for developers and organizations seeking to build robust and adaptable data pipelines.

What sets Koheesio apart from other libraries?

Koheesio encapsulates years of data engineering expertise, fostering a collaborative and innovative community. While similar libraries exist, Koheesio's focus on data pipelines, integration with PySpark, and specific design for tasks like data transformation, ETL jobs, data validation, and large-scale data processing sets it apart.

Koheesio aims to provide a rich set of features, including readers, writers, and transformations for any type of data processing. Koheesio is not in competition with other libraries. Its aim is to offer wide-ranging support and focus on utility in a multitude of scenarios. Our preference is for integration, not competition...

We invite contributions from all, promoting collaboration and innovation in the data engineering community.

Koheesio Core Components

Here are the key components included in Koheesio:

* Step: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline, taking in inputs and producing outputs.

    +---------+        +------------------+        +----------+
    | Input 1 |------->|                  |------->| Output 1 |
    +---------+        |                  |        +----------+
                       |                  |
    +---------+        |                  |        +----------+
    | Input 2 |------->|       Step       |------->| Output 2 |
    +---------+        |                  |        +----------+
                       |                  |
    +---------+        |                  |        +----------+
    | Input 3 |------->|                  |------->| Output 3 |
    +---------+        +------------------+        +----------+

* Context: This is a configuration class used to set up the environment for a Task.
It can be used to share variables across tasks and adapt the behavior of a Task based on its environment.
* Logger: This is a class for logging messages at different levels.

Installation

You can install Koheesio using pip, Hatch, or Poetry.

Using Pip

To install Koheesio using pip, run the following command in your terminal:

    pip install koheesio

Using Hatch

If you're using Hatch for package management, you can add Koheesio to your project by adding koheesio to your pyproject.toml:

    [dependencies]
    koheesio = ""

Using Poetry

If you're using Poetry for package management, you can add Koheesio to your project with the following command:

    poetry add koheesio

or add the following line to your pyproject.toml (under [tool.poetry.dependencies]), making sure to replace ... with the version you want to have installed:

    koheesio = {version = "..."}

Features

Koheesio also provides some additional features that can be useful in certain scenarios. These include:

* Spark Expectations: Available through the koheesio.steps.integration.spark.dq.spark_expectations module.
  + Installable through the se extra.
  + SE provides data quality checks for Spark DataFrames. For more information, refer to the Spark Expectations docs.
* Box: Available through the koheesio.steps.integration.box module.
  + Installable through the box extra.
  + Box is a cloud content management and file sharing service for businesses.
* SFTP: Available through the koheesio.steps.integration.spark.sftp module.
  + Installable through the sftp extra.
  + SFTP is a network protocol used for secure file transfer over a secure shell.

Note: Some of the steps require extra dependencies. See the Features section for additional info. Extras can be installed by adding features=['name_of_the_extra'] to the toml entry mentioned above.

Contributing

How to Contribute

We welcome contributions to our project! Here's a brief overview of our development process:

* Code Standards: We use pylint, black, and mypy to maintain code standards.
Please ensure your code passes these checks by running make check. No errors or warnings should be reported by the linter before you submit a pull request.
* Testing: We use pytest for testing. Run the tests with make test and ensure all tests pass before submitting a pull request.
* Release Process: We aim for frequent releases. Typically, when we have a new feature or bugfix, a developer with admin rights will create a new release on GitHub and publish the new version to PyPI.

For more detailed information, please refer to our contribution guidelines. We also adhere to Nike's Code of Conduct and Nike's Individual Contributor License Agreement.

Additional Resources

* General GitHub documentation
* GitHub pull request documentation
* Nike OSS

About

* Topics: python, pyspark, data-engineering, pydantic, delta-lake
* Watchers: 14
* Releases: 1 (v0.7.0, Latest, May 29, 2024)
* Contributors: 2 (Danny Meijer, @dannymeijer; Mikita Sakalouski, @mikita-sakalouski)
* Languages: Python 98.8%, Makefile 1.1%, PLSQL 0.1%