https://github.com/fugue-project/fugue

fugue-project / fugue

A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark, Dask and Ray without any rewrites.

fugue-tutorials.readthedocs.io/

License: Apache-2.0
Fugue is a unified interface for distributed computing that lets users execute Python, Pandas, and SQL code on Spark, Dask, and Ray with minimal rewrites.

Fugue is most commonly used for:

* Parallelizing or scaling existing Python and Pandas code by bringing it to Spark, Dask, or Ray with minimal rewrites.
* Using FugueSQL to define end-to-end workflows on top of Pandas, Spark, and Dask DataFrames. FugueSQL is an enhanced SQL interface that can invoke Python code.
## Fugue API

The Fugue API is a collection of functions that are capable of running on Pandas, Spark, Dask, and Ray. The simplest way to use Fugue is the `transform()` function. This lets users parallelize the execution of a single function by bringing it to Spark, Dask, or Ray. In the example below, the `map_letter_to_food()` function takes in a mapping and applies it on a column. This is just Pandas and Python so far (without Fugue).

    import pandas as pd
    from typing import Dict

    input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
    map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

    def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
        df["value"] = df["value"].map(mapping)
        return df

Now, the `map_letter_to_food()` function is brought to the Spark execution engine by invoking the `transform()` function of Fugue. The output schema and params are passed to the `transform()` call. The schema is needed because it's a requirement for distributed frameworks. A schema of `"*"` below means all input columns are in the output.
    from pyspark.sql import SparkSession
    from fugue import transform

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(input_df)

    out = transform(sdf,
                    map_letter_to_food,
                    schema="*",
                    params=dict(mapping=map_dict),
                    )
    # out is a Spark DataFrame
    out.show()

    +---+------+
    | id| value|
    +---+------+
    |  0| Apple|
    |  1|Banana|
    |  2|Carrot|
    +---+------+

PySpark equivalent of Fugue `transform()`:

    from typing import Iterator, Union
    from pyspark.sql.types import StructType
    from pyspark.sql import DataFrame, SparkSession

    spark_session = SparkSession.builder.getOrCreate()

    def mapping_wrapper(dfs: Iterator[pd.DataFrame], mapping):
        # mapInPandas passes an iterator of Pandas chunks; apply the function to each
        for df in dfs:
            yield map_letter_to_food(df, mapping)

    def run_map_letter_to_food(input_df: Union[DataFrame, pd.DataFrame], mapping):
        # conversion
        if isinstance(input_df, pd.DataFrame):
            sdf = spark_session.createDataFrame(input_df.copy())
        else:
            # Spark DataFrames are immutable and have no copy() method
            sdf = input_df

        schema = StructType(list(sdf.schema.fields))
        return sdf.mapInPandas(lambda dfs: mapping_wrapper(dfs, mapping),
                               schema=schema)

    result = run_map_letter_to_food(input_df, map_dict)
    result.show()

The Fugue `transform()` call is simpler, cleaner, and more maintainable than the PySpark equivalent. At the same time, no edits were made to the original Pandas-based function to bring it to Spark. It is still usable on Pandas DataFrames. Fugue `transform()` also supports Dask and Ray as execution engines alongside the default Pandas-based engine.

The Fugue API has a broader collection of functions that are also compatible with Spark, Dask, and Ray. For example, we can use `load()` and `save()` to create an end-to-end workflow compatible with Spark, Dask, and Ray.
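Because `map_letter_to_food()` is plain Pandas, it can also be exercised directly, with no Fugue or Spark involved. The sketch below repeats the function and sample data from the example above so that it is self-contained. It passes a copy of the input because the function assigns to the `value` column of whatever frame it receives; the `.copy()` call is an illustrative precaution, not something the Fugue documentation prescribes.

```python
from typing import Dict

import pandas as pd


def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    # Same function as in the example above, repeated for self-containment.
    df["value"] = df["value"].map(mapping)
    return df


input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

# Pass a copy: the function overwrites the "value" column of its argument.
local_out = map_letter_to_food(input_df.copy(), map_dict)

print(local_out["value"].tolist())  # ['Apple', 'Banana', 'Carrot']
print(input_df["value"].tolist())   # ['A', 'B', 'C'] (original left untouched)
```

Testing the function this way on a small Pandas frame is a cheap check before handing it to a distributed engine.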
For the full list of functions, see the Top Level API. The example below wraps `load()`, `transform()`, and `save()` in an engine context.

    import fugue.api as fa

    def run(engine=None):
        with fa.engine_context(engine):
            df = fa.load("/path/to/file.parquet")
            # map_letter_to_food() requires the mapping argument, passed through params
            out = fa.transform(df, map_letter_to_food, schema="*",
                               params=dict(mapping=map_dict))
            fa.save(out, "/path/to/output_file.parquet")

    run()                 # runs on Pandas
    run(engine="spark")   # runs on Spark
    run(engine="dask")    # runs on Dask

All functions underneath the context will run on the specified backend. This makes it easy to toggle between local execution and distributed execution.

## FugueSQL

FugueSQL is a SQL-based language capable of expressing end-to-end data workflows on top of Pandas, Spark, and Dask. The `map_letter_to_food()` function above is used in the SQL expression below. This is how to use a Python-defined function along with the standard SQL `SELECT` statement.

    from fugue.api import fugue_sql
    import json

    query = """
        SELECT id, value
          FROM input_df
        TRANSFORM USING map_letter_to_food(mapping={{mapping}})
        SCHEMA *
        """
    map_dict_str = json.dumps(map_dict)

    # returns Pandas DataFrame
    fugue_sql(query, mapping=map_dict_str)

    # returns Spark DataFrame
    fugue_sql(query, mapping=map_dict_str, engine="spark")

## Installation

Fugue can be installed through pip or conda. For example:

    pip install fugue

It also has the following installation extras:

* spark: to support Spark as the ExecutionEngine.
* dask: to support Dask as the ExecutionEngine.
* ray: to support Ray as the ExecutionEngine.
* duckdb: to support DuckDB as the ExecutionEngine.
* polars: to support Polars DataFrames and extensions using Polars.
* ibis: to enable Ibis for Fugue workflows.
* cpp_sql_parser: to enable the CPP antlr parser for Fugue SQL. It can be 50+ times faster than the pure Python parser. For the main Python versions and platforms, there are already pre-built binaries; for the rest, a C++ compiler is needed to build it on the fly.
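Each backend extra above relies on an optional third-party package. As a rough way to see which of those packages are already importable in the current environment, the sketch below probes for them with the standard library only. The mapping from extra name to import name (`spark` to `pyspark`, and so on) is an assumption made for illustration, `available_backends()` is a hypothetical helper that is not part of Fugue, and `cpp_sql_parser` is left out because its import name is not given here.

```python
from importlib.util import find_spec
from typing import Dict

# Assumed mapping from Fugue extra name to the top-level module it relies on.
EXTRA_TO_MODULE: Dict[str, str] = {
    "spark": "pyspark",
    "dask": "dask",
    "ray": "ray",
    "duckdb": "duckdb",
    "polars": "polars",
    "ibis": "ibis",
}


def available_backends() -> Dict[str, bool]:
    """Report, per extra, whether its module can be located without importing it."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_TO_MODULE.items()
    }


for extra, found in available_backends().items():
    print(f"{extra}: {'installed' if found else 'missing'}")
```

A `missing` entry only means the package is not importable in this environment; it says nothing about whether Fugue itself is installed.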
For example, a common use case is:

    pip install "fugue[duckdb,spark]"

Note that if you have already installed Spark or DuckDB independently, Fugue is able to use them automatically without installing the extras.

## Getting Started

The best way to get started with Fugue is to work through the 10 minute tutorials:

* Fugue API in 10 minutes
* FugueSQL in 10 minutes

For the top level API, see:

* Fugue Top Level API

The tutorials can also be run in an interactive notebook environment through binder or Docker:

### Using binder

Note that the tutorials run slowly on binder because the binder machine isn't powerful enough for a distributed framework such as Spark. Parallel executions can become sequential, so some of the performance comparison examples will not give you the correct numbers.

### Using Docker

Alternatively, you should get decent performance by running this Docker image on your own machine:

    docker run -p 8888:8888 fugueproject/tutorials:latest

## Jupyter Notebook Extension

There is an accompanying notebook extension for FugueSQL that lets users use the `%%fsql` cell magic. The extension also provides syntax highlighting for FugueSQL cells. It works for both classic notebook and Jupyter Lab. More details can be found in the installation instructions.

## Ecosystem

By being an abstraction layer, Fugue can be used with a lot of other open-source projects seamlessly.
Python backends:

* Pandas
* Polars (DataFrames only)
* Spark
* Dask
* Ray
* Ibis

FugueSQL backends:

* Pandas - FugueSQL can run on Pandas
* DuckDB - in-process SQL OLAP database management
* dask-sql - SQL interface for Dask
* SparkSQL
* BigQuery

Fugue is available as a backend or can integrate with the following projects:

* WhyLogs - data profiling
* PyCaret - low code machine learning
* Nixtla - timeseries modelling
* Prefect - workflow orchestration
* Pandera - data validation

Registered 3rd party extensions (mainly for FugueSQL) include:

* Pandas plot - visualize data using matplotlib or plotly
* Seaborn - visualize data using seaborn
* WhyLogs - visualize data profiling
* Vizzu - visualize data using ipyvizzu

## Community and Contributing

Feel free to message us on Slack. We also have contributing instructions.

### Case Studies

* How LyftLearn Democratizes Distributed Compute through Kubernetes Spark and Fugue
* Clobotics - Large Scale Image Processing with Spark through Fugue

### Mentioned Uses

* Productionizing Data Science at Interos, Inc. (LinkedIn post by Anthony Holten)
* Multiple Time Series Forecasting with Fugue & Nixtla at Bain & Company (LinkedIn post by Fahad Akbar)

## Further Resources

View some of our latest conference presentations and content. For a more complete list, check the Content page in the tutorials.

### Blogs

* Why Pandas-like Interfaces are Sub-optimal for Distributed Computing
* Introducing FugueSQL -- SQL for Pandas, Spark, and Dask DataFrames (Towards Data Science by Khuyen Tran)

### Conferences

* Distributed Machine Learning at Lyft
* Comparing the Different Ways to Scale Python and Pandas Code
* Large Scale Data Validation with Spark and Dask (PyCon US)
* FugueSQL - The Enhanced SQL Interface for Pandas, Spark, and Dask DataFrames (PyData Global)
* Distributed Hybrid Parameter Tuning