hngopher.com

       [HN Gopher] Fugue: A unified interface for distributed computing
       ___________________________________________________________________
        
       Fugue: A unified interface for distributed computing
        
       Author : duck
       Score  : 64 points
       Date   : 2023-03-27 05:46 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | antman wrote:
       | How does this differ from ibis, which is also one of its
       | included backends? Also, ibis has other backends, e.g. duckdb, so
       | what is the constraint on accessing duckdb through a Python
       | interface, as in: fugue->ibis->duckdb?
       | 
       | Seems very interesting, but I can't tell what its scope is or
       | what its comparative advantages are over ibis, dbt, or others.
        
         | kvnkho wrote:
         | Hi antman, thanks for the question. I will list some
         | differences, but will answer the first question first. The
         | Fugue -> Ibis -> DuckDB example is a bit weird. Yes, it can be
         | done, but it's not practical (as you can tell). There may be
         | some overlap at times, but I do think the projects differ in
         | scope (more below).
         | 
         | The Ibis integration is more about accessing data in various
         | data stores already. For example, we use it under the hood also
         | for our recently released BigQuery integration:
         | https://fugue-tutorials.readthedocs.io/tutorials/integration...
         | 
         | On to differences:
         | 
         | 1. We guarantee consistency between backends. NULL handling can
         | be different depending on the backend. For example, Pandas
         | joins NULL with NULL while Spark doesn't. So if you prototype
         | locally on Pandas, and then scale to Spark, we guarantee the same
         | results. Fugue is 100% unit tested and the backends go through
         | the same test suite.
         | 
         | 2. Ibis is Pythonic for SQL backends. We embrace SQL, but
         | understand its limitations. FugueSQL is an enhanced SQL dialect
         | that can invoke Python code. FugueSQL can be the first-class
         | grammar instead of being sandwiched by Python code. Fugue's
         | Python API and SQL API are 1:1 in capability.
         | 
         | 3. Opinionated here, but we don't want users to learn any new
         | language. Ibis is a new way to express things; we just want to
         | extend the capabilities of what people already know (SQL,
         | native Python, and Pandas). Fugue can also be incrementally
         | adopted, meaning it can be used for just one portion of your
         | workflow.
         | 
         | 4. Roadmap-wise, we think the optimal solutions will be a mix
         | of different tools. A clear one is pre-aggregating data with
         | DuckDB, and then using Pandas for further processing.
         | Similarly, can we preprocess in Snowflake and do machine
         | learning in Spark? Fugue is working on connecting these
         | different systems to enable cross-platform workloads.
         | 
         | There may be more information for you here:
         | https://fugue-tutorials.readthedocs.io/tutorials/integration...
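
The NULL-handling difference in point 1 can be illustrated with a small pure-Python sketch. This is not Fugue, Pandas, or Spark code: `inner_join` and its `null_matches_null` flag are hypothetical names used only to contrast the two join semantics described in the comment above.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[str], int]


def inner_join(left: List[Row], right: List[Row], null_matches_null: bool):
    """Inner-join two lists of (key, value) rows on key.

    null_matches_null=True  mimics the Pandas-style behaviour (NULL joins NULL).
    null_matches_null=False mimics the SQL/Spark-style behaviour (NULL never matches).
    """
    out = []
    for left_key, left_val in left:
        for right_key, right_val in right:
            if left_key is None or right_key is None:
                # A null key matches only if both sides are null AND the
                # chosen semantics allow null-to-null matches.
                matched = (
                    null_matches_null and left_key is None and right_key is None
                )
            else:
                matched = left_key == right_key
            if matched:
                out.append((left_key, left_val, right_val))
    return out


left = [("a", 1), (None, 2)]
right = [("a", 10), (None, 20)]

pandas_style = inner_join(left, right, null_matches_null=True)
sql_style = inner_join(left, right, null_matches_null=False)

print(pandas_style)  # [('a', 1, 10), (None, 2, 20)]
print(sql_style)     # [('a', 1, 10)]
```

The same inputs produce a different row count under the two semantics, which is the kind of discrepancy a shared cross-backend test suite is meant to catch.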
        
       | cmarschner wrote:
       | Fugue is really cool as it makes a complicated thing simple. The
       | learning curve is a few minutes (provided that you have an
       | installation that works, of course).
        
       | crabbone wrote:
       | I want to cry... why distributed programming _in Python_?
       | 
       | I mean, I understand that a lot of people want Pandas to run
       | faster, and distributing computation would help... but come on!
       | There needs to be a line in the sand that a sane person would not
       | cross. It doesn't matter how much effort you put into making
       | Python distributed, it's not going to work unless you have the
       | power to control the language itself.
       | 
       | You know, I just thought of a metaphor for this. Maybe not 100%,
       | but you'll get the idea. Suppose you live in a small country,
       | something close to 10M population. And you want to buy pants.
       | Well, you want to buy them from an offline store, so you don't
       | get all the choices you have by shopping online. You don't care
       | about the price, you can afford to buy one pair of pants for the
       | price of ten. You just want a specific kind. Let's say, you want
       | JNCO jeans.
       | 
       | Tough luck. You go to every store that sells pants, but JNCO is
       | just too niche for a small country to make sense for the
       | retailers to import. So, every store that wants to make profits
       | buys the same exact model of jeans that was advertised in the
       | last year's fashion catalogue. You can pay a little more or a
       | little less for better or worse material, but all the pants on
       | offer have the same trashy design. You just cannot stand
       | them.
       | 
       | This is what's happening with Python. It's trash. But it's the
       | model from last year's fashion catalogue. So any program you
       | write, any problem you solve must be in Python, or else you lose
       | the competition before you even enter into the contest.
        
       | [deleted]
        
       | whinvik wrote:
       | This looks great but as someone who has been doing lots of Spark
       | lately, I feel like this will further worsen the development
       | process.
       | 
       | The problem with Spark is that there's a lot of magic that
       | happens underneath and being able to debug and figure out issues
       | is pretty hairy. If we put this on top then there are 2 levels of
       | magic that we will have to go through to figure out what is going
       | on.
       | 
       | I wish someone would work on this aspect.
        
         | kvnkho wrote:
         | Hi whinvik, we agree that development in Spark is hard, and
         | that is part of the motivation of Fugue. Spark code couples the
         | distributed orchestration and business logic together.
         | 
         | By keeping your code in native Python or Pandas, it will be
         | much easier to develop, debug, and maintain the business logic
         | because your tracebacks will be in native Python. Fugue then
         | takes it to Spark when you are ready to scale.
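
The separation described above, business logic in plain Python and execution chosen separately, can be sketched without any distributed framework. This is an illustration under stated assumptions, not Fugue's API: `run_partitioned` and its `engine` argument are hypothetical, and a thread pool stands in for a real cluster backend.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def add_running_total(rows):
    """Business logic: plain Python over one partition, no engine code."""
    total = 0
    out = []
    for row in rows:
        total += row["amount"]
        out.append({**row, "running_total": total})
    return out


def run_partitioned(fn, rows, key, engine="local"):
    """Hypothetical runner: split rows by key and apply fn to each partition."""
    # sorted() is stable, so the original order inside each partition is kept.
    ordered = sorted(rows, key=itemgetter(key))
    partitions = [list(group) for _, group in groupby(ordered, key=itemgetter(key))]
    if engine == "local":
        results = [fn(part) for part in partitions]
    elif engine == "threads":
        # Stand-in for a distributed backend; map() preserves partition order.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(fn, partitions))
    else:
        raise ValueError(f"unknown engine: {engine}")
    return [row for part in results for row in part]


rows = [
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 1},
    {"user": "a", "amount": 2},
]

local_result = run_partitioned(add_running_total, rows, "user", engine="local")
threaded_result = run_partitioned(add_running_total, rows, "user", engine="threads")
print(local_result == threaded_result)
```

Because `add_running_total` never references the runner, it can be debugged on a small in-memory sample, and any traceback points at ordinary Python code.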
        
           | whinvik wrote:
           | I appreciate your response but that is not what I was getting
           | at. I understand that with this you only have to write Pandas
           | and then not worry about scaling.
           | 
           | First, I think PySpark syntax is much better than the
           | insanity that is Pandas, but if you really like Pandas then
           | you can always use Pandas UDFs, which Spark supports.
           | 
           | But let's say that writing only in Pandas is the preferred
           | way. Now comes the magic part. How do I know that it is using
           | the best join? Will it optimize for spills? Will there be
           | OOMs? These are the things we need to worry about, which
           | often lead to us needing to go deep inside Spark magic.
           | 
           | Now if there's another level of magic, which is Pandas-to-
           | Spark transpiling as I imagine you do here, then I have even
           | less of an idea how to tune it.
           | 
           | Again I appreciate you are solving a specific problem in a
           | nice way but I feel like we are actually making the problem
           | even more complicated.
        
       | AndrewKemendo wrote:
       | This looks really well architected, and your documentation is
       | really easy to read; it's structured logically and simply, which
       | is refreshing.
       | 
       | It also looks pretty powerful, though I admit I haven't used it
       | yet and haven't used Dask or Ray, but it's making me go look into
       | popping Spark back up and learning this.
       | 
       | Kudos on what looks to be a really well built product.
        
         | kvnkho wrote:
         | Hi AndrewKemendo, Fugue co-author here. Thanks for the kind
         | words! We do put a lot of effort into our documentation. Always
         | happy to chat about potential use cases if you want. My contact
         | info is in my profile.
        
       | dscape wrote:
       | The Decipad team and I have been following Fugue for a while and
       | would love a chat.
        
         | kvnkho wrote:
         | Yeah let's chat! My contact info is in my profile.
        
       ___________________________________________________________________
       (page generated 2023-03-28 23:01 UTC)