[HN Gopher] Daft: A High-Performance Distributed Dataframe Libra...
       ___________________________________________________________________
        
       Daft: A High-Performance Distributed Dataframe Library for
       Multimodal Data
        
       Author : skagenpilot
       Score  : 108 points
       Date   : 2023-06-07 11:56 UTC (11 hours ago)
        
 (HTM) web link (blog.getdaft.io)
 (TXT) w3m dump (blog.getdaft.io)
        
       | silveraxe93 wrote:
       | From past experience, the hackernews crowd is super hostile to
       | tools/etc with opt-out [telemetry](https://www.getdaft.io/project
       | s/docs/en/latest/telemetry.htm...).
        
         | jaychia wrote:
         | We hear you, and thanks for making this visible!
         | 
         | As a performance-driven project it's important for us to
         | understand which operations and use-cases are slowest/buggiest
         | for our users so that we can focus on them. We tried to be very
         | intentional in scoping the telemetry we collect and take this
         | very seriously (telemetry is top-level on both our docs and
         | README).
         | 
         | Happy to hear any feedback on this - we understand it's an
         | important topic.
         | 
         | [Edit: parent link was fixed, thanks! :)]
        
           | silveraxe93 wrote:
           | tbh it _was_ quite easy to spot. The gold standard would be
            | making it opt-in, but I'd guess barely anyone would enable
           | it.
           | 
            | I don't have strong opinions either way on this. But it's
           | usually something that sparks a big flaming thread.
        
           | nemoniac wrote:
           | Since you ask, some feedback: opt-out on telemetry is in
           | contravention of EU law. It applies when any EU citizen uses
           | your software.
           | 
           | "...consent options structured as an opt-out selected by
           | default is a violation of the GDPR..."
           | 
            | https://en.wikipedia.org/wiki/General_Data_Protection_Regula...
        
           | zekrioca wrote:
           | Instead of having your own opt-out, why don't you simply ask
            | your users? Do a simple survey after a user has been using
            | your product for a while. Offer them a cup of coffee, and
            | they will gladly send you some real, genuine feedback.
        
           | mdaniel wrote:
           | Instead of having your own vanity opt-out, please do consider
           | https://consoledonottrack.com/ which allows the user to
           | express their intentions without their bashrc growing with
           | every new project
        
             | sammysidhu wrote:
             | Hi (one of the maintainers here), that is a good
             | suggestion! I wasn't aware of that project. I went ahead
             | and made an issue to add `export DO_NOT_TRACK=1` as one of
              | the variables we track!
              | https://github.com/Eventual-Inc/Daft/issues/1015
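         
        For context on the suggestion above: the DO_NOT_TRACK convention is
        simply an environment variable that analytics code checks before
        sending anything. A minimal sketch of such a check in Python (the
        project-specific DAFT_ANALYTICS_ENABLED variable is an assumption
        here; both names are illustrative):
         
          import os
         
          def telemetry_enabled() -> bool:
              # Respect the cross-tool convention from consoledonottrack.com.
              if os.environ.get("DO_NOT_TRACK", "").lower() in ("1", "true"):
                  return False
              # Respect a project-specific opt-out as well (name assumed).
              if os.environ.get("DAFT_ANALYTICS_ENABLED", "").lower() in ("0", "false"):
                  return False
              return True
         
          if telemetry_enabled():
              pass  # ...only then initialize and send analytics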
        
       | vvladymyrov wrote:
       | Does Daft support Delta table format? Any plans to support sql
       | queries?
        
         | jaychia wrote:
         | > Does Daft support Delta table format?
         | 
         | Not yet, it's on our todo list to integrate with the ecosystem
         | of data catalogs (Iceberg/Delta/Hudi etc). Join our Slack/get
         | in touch with us if you're keen on this though, we'd love to
         | learn more about your use-case!
         | 
         | > Any plans to support sql queries?
         | 
         | We do eventually want to support SQL as well, but haven't had
         | the bandwidth to build and maintain it. Really we'd just need
         | to compile the SQL down to our logical plan - we could pretty
         | easily integrate UDFs so that they can be registered as SQL
         | functions too!
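         
        To make the "compile the SQL down to our logical plan" idea concrete,
        here is a toy sketch (not Daft's actual implementation) using the
        sqlglot parser to pull out the pieces a dataframe engine would map
        onto its own plan nodes:
         
          import sqlglot
          from sqlglot import expressions as exp
         
          tree = sqlglot.parse_one("SELECT name, age FROM people WHERE age > 30")
         
          table = tree.find(exp.Table).name                      # "people"
          columns = [c.alias_or_name for c in tree.expressions]  # ["name", "age"]
          predicate = tree.find(exp.Where).this.sql()            # "age > 30"
         
          # A dataframe engine would now build something like:
          #   Scan("people") -> Filter(age > 30) -> Project(name, age)
          print(table, columns, predicate)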
        
         | marsupialtail_2 wrote:
         | SQL support is very challenging.
         | 
          | I work on Quokka (https://github.com/marsupialtail/quokka). It
          | supports Iceberg reads. Recently we have been adding SQL
          | support by parsing the DuckDB logical plan, though that is
          | very challenging as well.
         | 
         | The Python world lacks a standard for a plug and play SQL query
         | optimizer. Apache Calcite is good for the JVM world, but not
         | great if you are trying to cut out the JVM.
        
       | liminal wrote:
       | This looks amazing.
       | 
       | * I started poking around the docs and I'm most excited about a
       | Ray backend runner. I'm hoping this allows more ergonomic
       | distributed data frame computation on an existing Ray cluster.
       | 
        | * Is this based on Apache Arrow? I would assume so, but it's
        | important that it be zero-copy when exchanging data with other
        | tools. Would like to see this mentioned prominently somewhere.
       | 
       | * I really like the Polars expression API. I haven't dug into the
       | docs enough to know how it relates. I do see a reference to
       | pinning the embedded Polars version, so fingers crossed there's
       | compatibility. It would be AMAZING to be able to take Polars code
       | and run it in a distributed cluster with minimal changes. Can
       | anyone chime in?
        
         | jaychia wrote:
         | > Ray backend runner
         | 
          | Yes, give it a whirl and let us know what you think! Ray is
          | _amazing_ and has actually gotten a lot better since its 2.0
          | release :)
         | 
         | > Is this based on Apache Arrow?
         | 
         | Indeed it is, and thanks for the feedback. We'll make this a
         | little more visible. We use the arrow2 Rust crate (same one
         | that Polars uses) for our in-memory data representation.
         | 
          | Our data representation makes it such that converting a Daft
          | dataframe into a Ray dataset (`df.to_ray_dataset()`) is
          | actually zero-copy. So you can go from data transformations
          | into downstream ML stuff in Ray really easily.
         | 
         | > It would be AMAZING to be able to take Polars code and run it
         | in a distributed cluster with minimal changes.
         | 
         | Unfortunately we don't have Polars API compatibility. This
         | seems to be a recurring theme in this thread though. The
         | problem is that certain Polars expressions are non-trivial to
         | do in a distributed setting, and Polars itself as a project is
         | so young and moves so quickly it's hard for us to maintain 100%
         | API-compatibility.
         | 
         | That being said, you are correct that a lot of the API is very
         | much inspired by Polars, which should hopefully make it easy to
         | move between the two.
        
           | liminal wrote:
           | Please at least reach out to the Polars folks and see what's
           | possible.
        
       | camgunz wrote:
       | 1. This looks super cool
       | 
       | 2. I admit to not understanding data lakes at all. I thought it
       | was like a failure case for like, "we can't figure out how to get
       | this data into a database", because isn't updating it a huge
       | chore? You have to make sure that if you're updating you're not
       | also generating new analytics, which it seems like you're always
       | doing because it's very slow. Don't databases solve this pretty
       | elegantly? Why are there all these tools for dealing with data in
       | flat files?
        
         | rgrieselhuber wrote:
         | 2. I hate the name data lake but I've always used them as part
         | of a pipeline. It's useful to keep the raw data around when you
         | need to recover / replay the pipeline.
        
           | short_sells_poo wrote:
            | I think most data lakes really turn into a data swamp, as
            | people/orgs are not diligent enough to keep them in good
            | shape. In the absence of costs, people never delete anything
            | voluntarily, and the garbage will grow monotonically until
            | it takes up all space.
        
             | musingsole wrote:
             | That applies to databases as much as data lakes.
             | 
             | I don't know why it's so hard to get across that data
             | expires in the same way that language changes.
        
         | DecoPerson wrote:
          | A database is typically a live, running program that requires
          | maintenance. There is a strong coupling for most "databases"
          | between the on-disk format and the executable code. It's not
          | easy to read from a random Postgres database sitting on disk:
          | you can't just do "import postgres" and then
          | "postgres.open('mydb')" in Python.
         | 
         | I'm no data scientist, and have only worked with data lakes a
         | couple times, but I can see why data science tends to be done
         | with very predictable (if inefficient) data formats such as
         | CSV, JSON, and JSONL.
         | 
         | Edit: SQLite is the best of both worlds. It's a database, but
         | it's also "just a file." It's easy to work with, and many
         | languages & frameworks are getting good support for it.
         | SQLite's reliability-first approach means many of the kinks
         | that arise from involving databases (so much complexity!!) are
         | ironed out and don't arise as issues. (Things like auto-
         | indexing, auto-vacuuming, avoiding & dealing with corruption,
         | backwards & forwards compatibility, ...)
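         
        The "just a file" point is easy to see with the standard library
        alone; no server process is involved (the file name is arbitrary):
         
          import sqlite3
         
          # Opening the database is just opening (or creating) a file on disk.
          con = sqlite3.connect("mydb.sqlite")
          con.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
          con.execute("INSERT INTO t (x) VALUES (1)")
          con.commit()
          rows = con.execute("SELECT x FROM t").fetchall()  # [(1,)]
          con.close()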
        
         | IanCal wrote:
          | One side of data lakes is that they're more of a starting
          | point for where the data is. Your normalised and more
          | processed data may end up in a nice clean database, but the
          | start is not like that.
         | 
         | The other main points are usually
         | 
         | * Data size
         | 
         | * Data access patterns
         | 
         | * Data formats
         | 
         | The more you're looking at "I want to pull 400G of data out of
         | my 30TB set of images from a bunch of machines running a custom
         | python script, then shut it down in twenty minutes and not
         | start anything else until tomorrow" then the more a data lake
         | makes sense vs a database.
         | 
         | > because isn't updating it a huge chore?
         | 
         | Not with the right tools, which can also give you things like a
         | git-like commit experience with branching.
         | 
         | > You have to make sure that if you're updating you're not also
         | generating new analytics, which it seems like you're always
         | doing because it's very slow.
         | 
         | Why would you be generating new analytics? I feel I've missed
         | something there.
        
         | marsupialtail_2 wrote:
         | While we are on this topic, the challenge with data lakes for
         | Python based projects like Daft and Quokka (what I work on) is
         | the poor Python support for data lakes like Delta, Iceberg and
          | Hudi. Delta has the best support, but its Python API is
          | consistently behind the Java one. Iceberg doesn't support
          | Python writes. Hudi doesn't support Python at all.
         | 
         | I have users demanding Iceberg writes and Hudi reads/writes. I
         | don't know what to tell them, since I don't have the resources
         | to add a reader/writer myself for those projects.
         | 
         | Hopefully as DuckDB becomes more popular we will see Python
         | bindings for these popular data lake formats this year.
        
         | jaychia wrote:
         | Hi, I'm one of the maintainers of Daft
         | 
         | 1. Thanks! We think so too :)
         | 
          | 2. Here's my 2c in favor of flat files:
         | 
         | - Ingestion: ingesting things into a data lake is much easier
         | than writing to a database (all you have to do is drop some
         | JSON, CSVs or protobufs into a bucket). This makes integrating
         | with other systems, especially 3rd-party or vendors, much
         | easier since there's an open language-agnostic format to
         | communicate with.
         | 
         | - Multimodal data: Certain datatypes (e.g. images, tensors) may
         | not make sense in a traditional SQL database. In a datalake
         | though, data is usually "schema-on-read", so you can at least
         | ingest it and now the responsibility is on the downstream
         | application to make use of it if it can/wants to - super
         | flexible!
         | 
         | - "Always on": with databases, you pay for uptime which likely
         | scales with the size of your data. If your requirements are
         | infrequent accesses of your data then a datalake could save you
         | a lot of money! A common example of this: once-a-day data
         | cleanups and ETL of an aggregated subset of your data into
         | downstream (clean!) databases for cheaper consumption.
         | 
         | On "isn't updating it a huge chore?": many data lakes are
         | partitioned by ingestion time, and applications usually consume
         | a subset of these partitions (e.g. all data over the past
         | week). In practice this means that you can lifecycle your data
         | and put old data into cold-storage so that it costs you less
         | money.
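         
        One concrete illustration of "consume a subset of these partitions":
        with a hive-partitioned Parquet layout, a reader only has to touch
        the directories that match the filter. A sketch using pyarrow.dataset
        (the bucket path and column name are made up):
         
          import pyarrow as pa
          import pyarrow.dataset as ds
         
          # Layout: s3://my-bucket/events/ingest_date=2023-06-01/part-0.parquet, ...
          part = ds.partitioning(pa.schema([("ingest_date", pa.string())]),
                                 flavor="hive")
          dataset = ds.dataset("s3://my-bucket/events/",
                               format="parquet", partitioning=part)
         
          # Only the partitions for the past week are actually read.
          recent = dataset.to_table(filter=ds.field("ingest_date") >= "2023-06-01")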
        
       | swader999 wrote:
       | Finally a decent name.
        
         | jaychia wrote:
         | Hello! I am one of the maintainers of Daft. Funny enough I just
         | gave a presentation about Daft in London and we all had quite a
         | laugh at the name :D
        
           | IanOzsvald wrote:
           | Indeed I was the one who got confused by the name! Thanks for
           | attending the discussion Jay and I'm happy to see Daft being
           | discussed here
        
           | nhourcard wrote:
           | one of the best bands in the world has Daft in their name!
        
       | Fiahil wrote:
        | So Daft is a distributed Polars?
        | 
        | Setting aside the distributed part, what's the "killer feature"
        | of Daft for trying to compete with Polars (and Pandas)? Are they
        | API-compatible? How's the memory consumption benchmark? (TBH,
        | this is the only interesting metric. Timing and latency are not
        | really important when your most important competitor is Spark.)
        
         | jaychia wrote:
          | > So Daft is a distributed Polars?
         | 
         | We did actually start by using Polars as our underlying
         | execution engine, but eventually transitioned off to our own
         | Rust Table abstraction to better suit our needs (e.g. custom
         | datatypes and kernels). We still share the arrow2 dependency
         | with Polars for in-memory representation of our data.
         | 
         | > what's the "killer feature" of Daft for trying to compete
         | with Polars (and Pandas)
         | 
          | We don't specifically try to compete on a local machine since
          | there is so much good new tooling becoming available recently
          | (DuckDB, Polars etc). That being said, we do try our best to
         | make the local experience as seamless as possible because we've
         | all felt the pain of developing locally in PySpark. Aside from
         | the ability to go distributed, I like using Daft for:
         | 
         | * Working on more "complex" datatypes such as URLs, images etc
         | 
         | * Working with "many [Parquet/CSV/JSON] files in cloud storage
         | (S3)" which we've found to be quite common for many workloads.
          | We already have some intelligent optimizations here (column
          | pruning, predicate pushdowns etc) and are building more to
          | reduce and optimize I/O from the cloud.
         | 
         | As you've pointed out, one of our main responsibilities is to
         | handle memory very very well. This is something we're actively
         | working on and I'm thinking this will be a big reason to use us
         | locally as well!
         | 
         | > Are they API-compatible?
         | 
         | We are not API-compatible with Pandas/Polars, but our API is
         | quite inspired by Polars. We found that building out the core
         | set of dataframe functionality was much more tractable than
         | attempting to go API-compatible from the get-go.
         | 
          | > How's the memory consumption benchmark? (TBH, this is the
          | only interesting metric. Timing and latency are not really
          | important when your most important competitor is Spark.)
         | 
         | We think throughput is still important when comparing against
         | Spark, since this can save a lot of money when running some
         | potentially very expensive queries!
         | 
         | That being said, you're spot-on about memory usage being a key
         | metric here. One of the key advantages of having native types
         | for multimodal data (e.g. tensors and images) is that we can
         | much more tightly account for memory requirements when working
         | with these types, beyond the usual black-box Python UDF which
         | often results in a ton of out-of-memory issues.
         | 
         | Our current mechanism for dealing with this is relying on Ray's
         | excellent object spilling mechanisms for working with out-of-
         | core data. We recognize that there are many situations in which
         | this is insufficient.
         | 
          | The team is working on many advanced features here (e.g.
          | microbatching) that will give Daft a big boost, and we will
          | release benchmarks as soon as we have them!
         | 
         | [Edit: typos!]
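         
        On the "column pruning, predicate pushdowns" point above: because
        the dataframe is lazy, the engine only needs to read the columns and
        rows the query references. A hedged sketch with a Daft-style API
        (the function names and bucket path are assumptions):
         
          import daft
          from daft import col
         
          df = daft.read_parquet("s3://my-bucket/logs/*.parquet")
         
          # With a lazy plan, this filter and projection can be pushed into
          # the Parquet scan, so only two columns and the matching row
          # groups are fetched from S3.
          df = df.where(col("status") == 500).select(col("url"), col("latency_ms"))
          df.show()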
        
           | Fiahil wrote:
            | Thank you for your responses!
           | 
            | I'm going to be very blunt here, because you need to hear
            | this to go forward:
           | 
           | You HAVE TO be at least API-compatible with Polars or Pandas
           | to exist. Being backend-compatible with arrow is not enough.
           | 
            | There is no technical reason why you would not pick one and
            | go with it, apart from it being a very difficult task.
           | 
            | As of today, I have 2 major pains: Pandas being a giant
            | memory hog and Polars not being a drop-in replacement for
            | Pandas. I am pushing Polars as hard as I can in all projects
           | I can touch, and it's a very long way from being the default
           | DataFrame library. Data Scientists will continue to use
           | Pandas for the foreseeable future, and that saddens me
           | greatly because I will also have to work with OOMKilled
           | pipelines for the foreseeable future.
           | 
            | There is no place for a 3rd alternative. So either you
            | become a "distribution bridge for Polars", and that would be
            | absolutely amazing; or you go your own way, and I'll put a
            | small star on GitHub, a "Noice!" in the comment section, and
            | move on and never come back.
           | 
           | It's tough, but sadly real.
        
             | [deleted]
        
             | sammysidhu wrote:
             | Hi! (one of the Daft maintainers here), thanks for the
             | feedback. Ultimately you're right that supporting the full
             | Polars syntax in a distributed fashion is very difficult.
              | There are libraries out there that do "Pandas but
              | distributed", but from what I have seen, they prioritized
              | API coverage rather than performance or memory
              | consumption. So you end up in a similar boat to the
              | situation you mentioned.
             | 
              | We're trying to start with a simpler API that maps well to
              | distributed queries that we can execute well, and then add
              | the features that people ask for.
             | 
             | I would love to know what you would want to see in Daft!
        
         | marsupialtail_2 wrote:
         | Hi -- I am the author of Quokka:
          | https://github.com/marsupialtail/quokka, trying to be a
          | distributed Polars. I am going for API compatibility, or at
          | least support for most of the API.
         | 
         | I am not focused on complex data types though.
        
           | esafak wrote:
           | Same suggestion to you: please integrate it with fugue.
           | 
           | https://github.com/fugue-project/fugue
        
         | sirfz wrote:
         | Dolars
        
       | esafak wrote:
        | Please integrate it with Fugue, a unified interface that lets
        | you swap out execution engines like pandas for Spark.
       | 
       | https://github.com/fugue-project/fugue
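         
        Roughly, Fugue's idea is that you write a plain function against
        pandas and pick the execution engine at the call site. A minimal
        sketch (the data and column names are made up):
         
          import pandas as pd
          from fugue import transform
         
          def add_length(df: pd.DataFrame) -> pd.DataFrame:
              # Plain pandas logic, unaware of any distributed engine.
              df["length"] = df["text"].str.len()
              return df
         
          data = pd.DataFrame({"text": ["daft", "quokka"]})
         
          # Runs locally on pandas by default...
          local = transform(data, add_length, schema="*,length:long")
         
          # ...and on Spark by switching the engine, without touching add_length.
          # spark_df = transform(data, add_length, schema="*,length:long",
          #                      engine="spark")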
        
         | sammysidhu wrote:
          | (one of the Daft maintainers here) Great call-out! I went
          | ahead and made an issue for us to work on this:
          | https://github.com/Eventual-Inc/Daft/issues/1016
        
       ___________________________________________________________________
       (page generated 2023-06-07 23:02 UTC)