[HN Gopher] Fugue: A unified interface for distributed computing
___________________________________________________________________
Fugue: A unified interface for distributed computing
Author : duck
Score : 64 points
Date : 2023-03-27 05:46 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| antman wrote:
| What is the difference with ibis which is also one of its
| included backends? Also ibis has other backends e.g. duckdb so
| what is the constraint of accessing duckdb with a python
| interface as in: fugue->ibis->duckdb?
|
| Seems very interesting but I can't tell what its scope is,
| or its comparative advantages with ibis or dbt or others?
| kvnkho wrote:
| Hi antman, thanks for the question. I will type some points on
| differences, but will answer the first question first. The
| Fugue -> Ibis -> DuckDB example is a bit weird. Yes it can be
| done but it's not practical (as you can tell). There may be
| some overlap at times, but I do think the projects differ in
| scope (more below).
|
| The Ibis integration is more about accessing data in various
| data stores already. For example, we use it under the hood also
| for our recently released BigQuery integration:
| https://fugue-tutorials.readthedocs.io/tutorials/integration...
|
| On to differences:
|
| 1. We guarantee consistency between backends. NULL handling can
| be different depending on the backend. For example, Pandas
| joins NULL with NULL while Spark doesn't. So if you prototype
| locally on Pandas, and then scale to Spark, we guarantee the
| same results. Fugue is 100% unit tested and the backends go
| through the same test suite.
|
| 2. Ibis is Pythonic for SQL backends. We embrace SQL, but
| understand its limitations. FugueSQL is an enhanced SQL dialect
| that can invoke Python code. FugueSQL can be the first-class
| grammar instead of being sandwiched by Python code. Fugue's
| Python API and SQL API are 1:1 in capability.
|
| 3. Opinionated here, but we don't want users to learn any new
| language. Ibis is a new way to express things; we just want to
| extend the capabilities of what people already know (SQL,
| native Python, and Pandas). Fugue can also be incrementally
| adopted, meaning it can be used for just one portion of your
| workflow.
|
| 4. Roadmap-wise, we think the optimal solutions will be a mix
| of different tools. A clear one is pre-aggregating data with
| DuckDB, and then using Pandas for further processing.
| Similarly, can we preprocess in Snowflake and do machine
| learning in Spark? Fugue is working on connecting these
| different systems to enable cross-platform workloads.
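|
| The NULL-handling difference in point 1 can be illustrated
| without Fugue. This is a minimal sketch using only the Python
| standard library, not Fugue's own test suite: the helper names
| are hypothetical, the pure-Python join stands in for the
| Pandas behavior described above (a NULL key matches a NULL
| key), and SQLite stands in for standard SQL semantics, where
| NULL = NULL is not true and the row is dropped.

```python
import sqlite3

# Two tiny tables keyed on a column that contains a NULL (None) key.
left = [(1, "a"), (None, "b")]
right = [(1, "x"), (None, "y")]


def null_matching_join(left_rows, right_rows):
    """Inner join where a None key matches another None key.

    Hypothetical stand-in for the Pandas-style behavior mentioned above.
    """
    return [
        (lk, lv, rv)
        for lk, lv in left_rows
        for rk, rv in right_rows
        if lk == rk  # in Python, None == None is True
    ]


def sql_join(left_rows, right_rows):
    """Inner join under standard SQL semantics, using in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE l (k INTEGER, v TEXT)")
    conn.execute("CREATE TABLE r (k INTEGER, v TEXT)")
    conn.executemany("INSERT INTO l VALUES (?, ?)", left_rows)
    conn.executemany("INSERT INTO r VALUES (?, ?)", right_rows)
    rows = conn.execute(
        "SELECT l.k, l.v, r.v FROM l JOIN r ON l.k = r.k ORDER BY l.v"
    ).fetchall()
    conn.close()
    return rows


pandas_like = null_matching_join(left, right)  # keeps the NULL-keyed row
sql_like = sql_join(left, right)               # drops the NULL-keyed row
```

| The two results differ by exactly the NULL-keyed row, which is
| the kind of mismatch a shared test suite across backends is
| meant to catch.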
|
| There may be more information for you here:
| https://fugue-tutorials.readthedocs.io/tutorials/integration...
| cmarschner wrote:
| Fugue is really cool as it makes a complicated thing simple. The
| learning curve is a few minutes (provided that you have an
| installation that works, of course)
| crabbone wrote:
| I want to cry... why distributed programming _in Python_?
|
| I mean, I understand that a lot of people want Pandas to run
| faster, and distributing computation would help... but come on!
| There needs to be a line in the sand that a sane person would not
| cross. It doesn't matter how much effort you put into making
| Python distributed, it's not going to work unless you have the
| power to control the language itself.
|
| You know, I just thought of a metaphor for this. Maybe not 100%,
| but you'll get the idea. Suppose you live in a small country,
| something close to 10M population. And you want to buy pants.
| Well, you want to buy them from an offline store, so you don't
| get all the choices you have by shopping online. You don't care
| about the price, you can afford to buy one pair of pants for a
| price of ten. You just want a specific kind. Let's say, you want
| JNCO jeans.
|
| Tough luck. You go to every store that sells pants, but JNCO is
| just too niche for a small country to make sense for the
| retailers to import. So, every store that wants to make profits
| buys the same exact model of jeans that was advertised in last
| year's fashion catalogue. You can pay a little more or a little
| less for better or worse material, but all the pants on offer
| have the same trashy design. You just cannot stand them.
|
| This is what's happening with Python. It's trash. But it's the
| model from last year's fashion catalogue. So any program you
| write, any problem you solve must be in Python, or else you lose
| the competition before you even enter into the contest.
| [deleted]
| whinvik wrote:
| This looks great but as someone who has been doing lots of Spark
| lately, I feel like this will further worsen the development
| process.
|
| The problem with Spark is that there's a lot of magic that
| happens underneath and being able to debug and figure out issues
| is pretty hairy. If we put this on top then there are 2 levels of
| magic that we will have to go through to figure out what is going
| on.
|
| I wish someone would work on this aspect.
| kvnkho wrote:
| Hi whinvik, we agree that development in Spark is hard, and
| that is part of the motivation of Fugue. Spark code couples the
| distributed orchestration and business logic together.
|
| By keeping your code in native Python or Pandas, it will be
| much easier to develop, debug, and maintain the business logic
| because your tracebacks will be in native Python. Fugue then
| takes it to Spark when you are ready to scale.
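|
| A rough sketch of what "business logic in native Python" could
| look like. The function name, column names, and sample rows
| are hypothetical, and the Fugue call in the trailing comment
| is an assumption about usage to be checked against the Fugue
| documentation. The function itself has no dependency on Spark
| or Fugue and can be run and debugged locally.

```python
from typing import Any, Dict, List


def add_total(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Business logic only: add a 'total' field to each row.

    Nothing here knows about partitions, clusters, or engines, so a
    failure produces an ordinary Python traceback.
    """
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


# Local check on plain Python data (hypothetical sample rows).
sample = [{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 2}]
result = add_total(sample)

# Assumption: with Fugue installed, its transform() function is the entry
# point for running such a function on another engine, along the lines of
#   transform(df, add_total, schema="*,total:double", engine="spark")
# Treat that call as illustrative, not as verified usage.
```

| The point of the split is that the function can be unit tested
| on a handful of rows before any cluster is involved.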
| whinvik wrote:
| I appreciate your response but that is not what I was getting
| at. I understand that with this you only have to write Pandas
| and then not worry about scaling.
|
| First, I think PySpark syntax is much better than the
| insanity that is Pandas but if you really like Pandas then
| you can always use Pandas UDFs, which Spark supports.
|
| But let's say that writing only in Pandas is the preferred
| way. Now comes the magic part. How do I know that it is using
| the best join? Will it optimize for spills? Will there be
| OOMs? These are the things we need to worry about, which
| often lead to us needing to go deep inside Spark magic.
|
| Now if there's another level of magic which is Pandas to
| Spark transpiling as I imagine you do here, then I have even
| less of an idea how to tune it.
|
| Again I appreciate you are solving a specific problem in a
| nice way but I feel like we are actually making the problem
| even more complicated.
| AndrewKemendo wrote:
| This looks really well architected, and your documentation is
| really easy to read: it's structured logically and simply, which
| is refreshing.
|
| It also looks pretty powerful, though I admit I haven't used it
| yet and haven't used Dask or Ray, but it's making me go look into
| popping Spark back up and learning this.
|
| Kudos on what looks to be a really well built product.
| kvnkho wrote:
| Hi AndrewKemendo, Fugue co-author here. Thanks for the kind
| words! We do put a lot of effort into our documentation. Always
| happy to chat about potential use cases if you want. My contact
| info is in my profile.
| dscape wrote:
| Me and the Decipad team have been following fugue for a while -
| would love a chat
| kvnkho wrote:
| Yeah let's chat! My contact info is in my profile.
___________________________________________________________________
(page generated 2023-03-28 23:01 UTC)