https://github.com/fugue-project/fugue

fugue-project / fugue Public

A unified interface for distributed computing. Fugue executes SQL,
Python, and Pandas code on Spark, Dask and Ray without any rewrites.

fugue-tutorials.readthedocs.io/

License

Apache-2.0 license
1.4k stars 76 forks

2 branches 108 tags
Latest commit

goodwanghan: Remove to_local_df and to_local_bounded_df (#448)
1ffa28e, Mar 25, 2023

  * Remove SQLite
  * Remove SQLite
  * Finalize FunctionWrapper refactoring
  * Update docs
  * update
  * Remove to_local_df

Git stats

  * 377 commits

Files

| Name | Latest commit message | Commit time |
|------|-----------------------|-------------|
| .devcontainer | Update devenv and cicd (#326) | May 7, 2022 19:54 |
| .github | Move fugue_sql into fugue (#390) | November 17, 2022 16:49 |
| docs | Finalize FunctionWrapper refactoring (#445) | March 19, 2023 16:46 |
| fugue | Remove to_local_df and to_local_bounded_df (#448) | March 25, 2023 14:08 |
| fugue_contrib | contrib: register the vizzu extension (#442) | March 16, 2023 21:56 |
| fugue_dask | Finalize FunctionWrapper refactoring (#445) | March 19, 2023 16:46 |
| fugue_duckdb | Finalize FunctionWrapper refactoring (#445) | March 19, 2023 16:46 |
| fugue_ibis | Remove to_local_df and to_local_bounded_df (#448) | March 25, 2023 14:08 |
| fugue_notebook | Improve Dataset display (#391) | November 18, 2022 09:41 |
| fugue_polars | Remove to_local_df and to_local_bounded_df (#448) | March 25, 2023 14:08 |
| fugue_ray | Finalize FunctionWrapper refactoring (#445) | March 19, 2023 16:46 |
| fugue_spark | Finalize FunctionWrapper refactoring (#445) | March 19, 2023 16:46 |
| fugue_sql | Improve Dataset display (#391) | November 18, 2022 09:41 |
| fugue_test | Support Polars as a local dataframe in transformer (#439) | March 16, 2023 01:03 |
| fugue_version | Make PartitionCursor take functions, fix Ray perf issue, fix spark ge... | March 8, 2023 22:11 |
| images | update (#173) | February 14, 2021 19:58 |
| scripts | Create DuckDaskExecutionEngine with Dask Improvements (#301) | February 21, 2022 22:21 |
| tests | Remove to_local_df and to_local_bounded_df (#448) | March 25, 2023 14:08 |
| .gitignore | Add fugue API (#396) | December 30, 2022 00:16 |
| .gitpod.yml | Support arbitrary column names & SQL transpiler (#407) | January 4, 2023 23:58 |
| .pre-commit-config.yaml | Multiple breaking changes (#383) | November 16, 2022 21:32 |
| .pylintrc | Make transformation format aware (#433) | March 6, 2023 00:46 |
| CONTRIBUTING.md | Updating README with new integrations (#395) | December 17, 2022 18:38 |
| LICENSE | update doc and license (#34) | June 19, 2020 00:57 |
| Makefile | Support Polars as a local dataframe in transformer (#439) | March 16, 2023 01:03 |
| README.md | Prepare for 0.8.2 release (#447) | March 22, 2023 22:36 |
| RELEASE.md | Finalize FunctionWrapper refactoring (#445) | March 19, 2023 16:46 |
| requirements.txt | small grammar change (#436) | March 8, 2023 18:34 |
| setup.cfg | Support Polars as a local dataframe in transformer (#439) | March 16, 2023 01:03 |
| setup.py | Finalize FunctionWrapper refactoring (#445) | March 19, 2023 16:46 |

README.md


Fugue is a unified interface for distributed computing that lets
users execute Python, Pandas, and SQL code on Spark, Dask, and Ray
with minimal rewrites.

Fugue is most commonly used for:

  * Parallelizing or scaling existing Python and Pandas code by
    bringing it to Spark, Dask, or Ray with minimal rewrites.
  * Using FugueSQL to define end-to-end workflows on top of Pandas,
    Spark, and Dask DataFrames. FugueSQL is an enhanced SQL interface
    that can invoke Python code.

 Fugue API

The Fugue API is a collection of functions that are capable of
running on Pandas, Spark, Dask, and Ray. The simplest way to use
Fugue is the transform() function. This lets users parallelize the
execution of a single function by bringing it to Spark, Dask, or Ray.
In the example below, the map_letter_to_food() function takes in a
mapping and applies it to a column. This is just Pandas and Python so
far (without Fugue).

import pandas as pd
from typing import Dict

input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df

Now, the map_letter_to_food() function is brought to the Spark
execution engine by invoking Fugue's transform() function. The
output schema and params are passed to the transform() call. The
schema is needed because distributed frameworks require the output
schema to be declared explicitly. A schema of "*" below means all
input columns are in the output.

from pyspark.sql import SparkSession
from fugue import transform

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(input_df)

out = transform(sdf,
               map_letter_to_food,
               schema="*",
               params=dict(mapping=map_dict),
               )
# out is a Spark DataFrame
out.show()

+---+------+
| id| value|
+---+------+
|  0| Apple|
|  1|Banana|
|  2|Carrot|
+---+------+

PySpark equivalent of Fugue transform()

import pandas as pd
from typing import Dict, Iterator, Union
from pyspark.sql.types import StructType
from pyspark.sql import DataFrame, SparkSession

spark_session = SparkSession.builder.getOrCreate()

def mapping_wrapper(dfs: Iterator[pd.DataFrame], mapping: Dict[str, str]) -> Iterator[pd.DataFrame]:
    # mapInPandas passes an iterator of Pandas chunks; transform each one
    for df in dfs:
        yield map_letter_to_food(df, mapping)

def run_map_letter_to_food(input_df: Union[DataFrame, pd.DataFrame], mapping: Dict[str, str]) -> DataFrame:
    # conversion: Pandas input becomes a Spark DataFrame; a Spark
    # DataFrame is used as-is (it is immutable and has no copy() method)
    if isinstance(input_df, pd.DataFrame):
        sdf = spark_session.createDataFrame(input_df.copy())
    else:
        sdf = input_df

    # the output schema is the same as the input schema
    schema = StructType(list(sdf.schema.fields))
    return sdf.mapInPandas(lambda dfs: mapping_wrapper(dfs, mapping),
                           schema=schema)

result = run_map_letter_to_food(input_df, map_dict)
result.show()

The Fugue transform() syntax is simpler, cleaner, and more
maintainable than the PySpark equivalent. At the same time, no edits
were made to the original Pandas-based function to bring it to Spark.
It is still usable on Pandas DataFrames. Fugue transform() also
supports Dask and Ray as execution engines alongside the default
Pandas-based engine.
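
The wrapper in the PySpark version relies on a batch-iterator pattern: the engine hands the function an iterator of chunks, and the function yields one transformed chunk per input chunk. Below is a minimal plain-Python sketch of that pattern, with lists of dicts standing in for Pandas DataFrames. The names `map_rows` and `batch_wrapper` are hypothetical and are not part of Fugue or PySpark.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]

def map_rows(rows: List[Row], mapping: Dict[str, str]) -> List[Row]:
    # Per-chunk logic: replace each letter with its food name.
    return [{**row, "value": mapping.get(str(row["value"]))} for row in rows]

def batch_wrapper(chunks: Iterable[List[Row]], mapping: Dict[str, str]) -> Iterator[List[Row]]:
    # Lazily apply the per-chunk function, one chunk at a time.
    for chunk in chunks:
        yield map_rows(chunk, mapping)

# Two chunks stand in for two partitions of the same table.
chunks = [
    [{"id": 0, "value": "A"}, {"id": 1, "value": "B"}],
    [{"id": 2, "value": "C"}],
]
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

result = [row for chunk in batch_wrapper(chunks, map_dict) for row in chunk]
print(result)
```

The per-chunk function never sees the whole dataset, which is what lets an engine run it on many partitions in parallel.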

The Fugue API has a broader collection of functions that are also
compatible with Spark, Dask, and Ray. For example, we can use load()
and save() to create an end-to-end workflow compatible with Spark,
Dask, and Ray. For the full list of functions, see the Top Level API.

import fugue.api as fa

def run(engine=None):
    with fa.engine_context(engine):
        df = fa.load("/path/to/file.parquet")
        out = fa.transform(df, map_letter_to_food, schema="*", params=dict(mapping=map_dict))
        fa.save(out, "/path/to/output_file.parquet")

run()                 # runs on Pandas
run(engine="spark")   # runs on Spark
run(engine="dask")    # runs on Dask

All functions underneath the context will run on the specified
backend. This makes it easy to toggle between local execution and
distributed execution.
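
One way to picture how a context can select the backend for every call inside it is a context variable that functions consult when they run. The sketch below is purely illustrative and makes no claim about how Fugue implements engine_context(). The names `_current_engine`, `engine_context`, and `describe_backend` here are hypothetical stand-ins built from the standard library only.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Iterator, Optional

# Hypothetical stand-in for "the engine chosen by the surrounding context".
_current_engine: ContextVar[Optional[str]] = ContextVar("current_engine", default=None)

@contextmanager
def engine_context(engine: Optional[str] = None) -> Iterator[None]:
    # Set the engine for the duration of the with-block, then restore it.
    token = _current_engine.set(engine)
    try:
        yield
    finally:
        _current_engine.reset(token)

def describe_backend() -> str:
    # Any function called inside the context can look up the active engine.
    engine = _current_engine.get()
    return "pandas (local)" if engine is None else engine

def run(engine: Optional[str] = None) -> str:
    with engine_context(engine):
        return describe_backend()

print(run())                # pandas (local)
print(run(engine="spark"))  # spark
print(describe_backend())   # pandas (local), the context has been restored
```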

 FugueSQL

FugueSQL is a SQL-based language capable of expressing end-to-end
data workflows on top of Pandas, Spark, and Dask. The
map_letter_to_food() function above is used in the SQL expression
below. This is how to use a Python-defined function along with the
standard SQL SELECT statement.

from fugue.api import fugue_sql
import json

query = """
    SELECT id, value
      FROM input_df
    TRANSFORM USING map_letter_to_food(mapping={{mapping}}) SCHEMA *
    """
map_dict_str = json.dumps(map_dict)

# returns Pandas DataFrame
fugue_sql(query, mapping=map_dict_str)

# returns Spark DataFrame
fugue_sql(query, mapping=map_dict_str, engine="spark")
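
The `{{mapping}}` placeholder is filled in before the query runs, which is why the dictionary is serialized with json.dumps() first. The sketch below imitates that substitution with plain string replacement so the resulting query text is visible. It assumes Jinja-style templating and is only an illustration; `render_query` is a hypothetical helper, not a Fugue function.

```python
import json

def render_query(template: str, **values: str) -> str:
    # Replace each {{name}} placeholder with the supplied string value.
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace("{{" + name + "}}", value)
    return rendered

query = """
SELECT id, value
  FROM input_df
TRANSFORM USING map_letter_to_food(mapping={{mapping}}) SCHEMA *
"""
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

rendered = render_query(query, mapping=json.dumps(map_dict))
print(rendered)
```

The printed query contains the dictionary as a JSON literal in place of the placeholder.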

 Installation

Fugue can be installed through pip or conda. For example:

pip install fugue

It also has the following installation extras:

  * spark: to support Spark as the ExecutionEngine.
  * dask: to support Dask as the ExecutionEngine.
  * ray: to support Ray as the ExecutionEngine.
  * duckdb: to support DuckDB as the ExecutionEngine.
  * polars: to support Polars DataFrames and extensions using Polars.
  * ibis: to enable Ibis for Fugue workflows.
  * cpp_sql_parser: to enable the C++ ANTLR parser for FugueSQL. It
    can be 50+ times faster than the pure Python parser. Pre-built
    binaries are available for the main Python versions and
    platforms; on other setups, a C++ compiler is needed to build the
    parser on the fly.

For example, a common use case is:

pip install "fugue[duckdb,spark]"

Note that if you have already installed Spark or DuckDB
independently, Fugue can use them automatically without installing
the extras.
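
To see which optional backends are already importable in the current environment, a quick check with the standard library is enough. This is a minimal sketch: the mapping from extra names to import names is an assumption based on each project's usual package name, and the check says nothing about whether the installed version is one Fugue supports.

```python
from importlib.util import find_spec
from typing import Dict

# Assumed import name for each optional backend.
OPTIONAL_BACKENDS = {
    "spark": "pyspark",
    "dask": "dask",
    "ray": "ray",
    "duckdb": "duckdb",
    "polars": "polars",
    "ibis": "ibis",
}

def available_backends() -> Dict[str, bool]:
    # find_spec returns None when a top-level module cannot be found.
    return {
        extra: find_spec(module) is not None
        for extra, module in OPTIONAL_BACKENDS.items()
    }

for extra, found in available_backends().items():
    print(f"{extra:8s} {'installed' if found else 'missing'}")
```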

 Getting Started

The best way to get started with Fugue is to work through the
10-minute tutorials:

  * Fugue API in 10 minutes
  * FugueSQL in 10 minutes

For the top level API, see:

  * Fugue Top Level API

The tutorials can also be run in an interactive notebook environment
through binder or Docker:

 Using binder


Note that the tutorials run slowly on Binder because its machines are
not powerful enough for a distributed framework such as Spark.
Parallel executions can become sequential, so some of the performance
comparison examples will not give you representative numbers.

 Using Docker

Alternatively, you should get decent performance by running this
Docker image on your own machine:

docker run -p 8888:8888 fugueproject/tutorials:latest

 Jupyter Notebook Extension

There is an accompanying notebook extension for FugueSQL that lets
users use the %%fsql cell magic. The extension also provides syntax
highlighting for FugueSQL cells. It works with both the classic
Jupyter Notebook and JupyterLab. More details can be found in the
installation instructions.


 Ecosystem

As an abstraction layer, Fugue can be used with many other
open-source projects.

Python backends:

  * Pandas
  * Polars (DataFrames only)
  * Spark
  * Dask
  * Ray
  * Ibis

FugueSQL backends:

  * Pandas - FugueSQL can run on Pandas
  * DuckDB - in-process SQL OLAP database management
  * dask-sql - SQL interface for Dask
  * SparkSQL
  * BigQuery

Fugue is available as a backend or can integrate with the following
projects:

  * WhyLogs - data profiling
  * PyCaret - low code machine learning
  * Nixtla - time series modelling
  * Prefect - workflow orchestration
  * Pandera - data validation

Registered third-party extensions (mainly for FugueSQL) include:

  * Pandas plot - visualize data using matplotlib or plotly
  * Seaborn - visualize data using seaborn
  * WhyLogs - visualize data profiling
  * Vizzu - visualize data using ipyvizzu

 Community and Contributing

Feel free to message us on Slack. We also have contributing
instructions.

 Case Studies

  * How LyftLearn Democratizes Distributed Compute through Kubernetes
    Spark and Fugue
  * Clobotics - Large Scale Image Processing with Spark through Fugue

 Mentioned Uses

  * Productionizing Data Science at Interos, Inc. (LinkedIn post by
    Anthony Holten)

  * Multiple Time Series Forecasting with Fugue & Nixtla at Bain &
    Company (LinkedIn post by Fahad Akbar)

 Further Resources

View some of our latest conference presentations and content. For a
more complete list, check the Content page in the tutorials.

 Blogs

  * Why Pandas-like Interfaces are Sub-optimal for Distributed
    Computing
  * Introducing FugueSQL -- SQL for Pandas, Spark, and Dask DataFrames
    (Towards Data Science by Khuyen Tran)

 Conferences

  * Distributed Machine Learning at Lyft
  * Comparing the Different Ways to Scale Python and Pandas Code
  * Large Scale Data Validation with Spark and Dask (PyCon US)
  * FugueSQL - The Enhanced SQL Interface for Pandas, Spark, and Dask
    DataFrames (PyData Global)
  * Distributed Hybrid Parameter Tuning

Topics

distributed-systems machine-learning sql spark distributed-computing 
pandas distributed dask data-practitioners

Watchers

21 watching

Releases 108

 
0.8.2 Latest
Mar 23, 2023
+ 107 releases

Packages 0

No packages published

Used by 134

 

  * @aheldrich14
  * @tombrooks248
  * @o-nik-s
  * @AlexMV12
  * @BrunoCavag
  * @dbfranzosi
  * @anindya-saha
  * @MarcinKamil84

+ 126

Contributors 17

  * @goodwanghan
  * @kvnkho
  * @gityow
  * @rdmolony
  * @WangCHX
  * @synapticarbors
  * @anticorrelator
  * @RBergeron
  * @nils-braun
  * @aholten
  * @mfahadakbar

+ 6 contributors

Languages

  * Python 98.3%
  * Jupyter Notebook 1.2%
  * Other 0.5%
