[HN Gopher] Daft: A High-Performance Distributed Dataframe Libra...
___________________________________________________________________
Daft: A High-Performance Distributed Dataframe Library for
Multimodal Data
Author : skagenpilot
Score : 108 points
Date : 2023-06-07 11:56 UTC (11 hours ago)
(HTM) web link (blog.getdaft.io)
(TXT) w3m dump (blog.getdaft.io)
| silveraxe93 wrote:
| From past experience, the hackernews crowd is super hostile to
| tools/etc with opt-out [telemetry](https://www.getdaft.io/projects/docs/en/latest/telemetry.htm...).
| jaychia wrote:
| We hear you, and thanks for making this visible!
|
| As a performance-driven project it's important for us to
| understand which operations and use-cases are slowest/buggiest
| for our users so that we can focus on them. We tried to be very
| intentional in scoping the telemetry we collect and take this
| very seriously (telemetry is top-level on both our docs and
| README).
|
| Happy to hear any feedback on this - we understand it's an
| important topic.
|
| [Edit: parent link was fixed, thanks! :)]
| silveraxe93 wrote:
| tbh it _was_ quite easy to spot. The gold standard would be
| making it opt-in, but I'd guess barely anyone would enable it.
|
| I don't have strong opinions either way on this, but it's
| usually something that sparks a big flaming thread.
| nemoniac wrote:
| Since you ask, some feedback: opt-out on telemetry is in
| contravention of EU law. It applies when any EU citizen uses
| your software.
|
| "...consent options structured as an opt-out selected by
| default is a violation of the GDPR..."
|
| https://en.wikipedia.org/wiki/General_Data_Protection_Regula...
| zekrioca wrote:
| Instead of having your own opt-out, why don't you simply ask
| your users? Do a simple survey after a user has been using your
| product for a while. Offer them a cup of coffee, and they will
| gladly send you some real, genuine feedback.
| mdaniel wrote:
| Instead of having your own vanity opt-out, please do consider
| https://consoledonottrack.com/ which allows the user to
| express their intentions without their bashrc growing with
| every new project
| sammysidhu wrote:
| Hi (one of the maintainers here), that is a good
| suggestion! I wasn't aware of that project. I went ahead
| and made an issue to add `export DO_NOT_TRACK=1` as one of
| the environment variables we respect:
| https://github.com/Eventual-Inc/Daft/issues/1015
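|
| For anyone curious, a minimal sketch of what honoring that can
| look like on the library side (the DAFT_ANALYTICS_ENABLED name
| here is just an illustrative placeholder, not necessarily our
| actual variable):
|
|     import os
|
|     def telemetry_enabled() -> bool:
|         # Cross-tool convention from consoledonottrack.com
|         if os.environ.get("DO_NOT_TRACK") == "1":
|             return False
|         # Hypothetical project-specific opt-out variable
|         if os.environ.get("DAFT_ANALYTICS_ENABLED") == "0":
|             return False
|         return True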
| vvladymyrov wrote:
| Does Daft support the Delta table format? Any plans to support
| SQL queries?
| jaychia wrote:
| > Does Daft support Delta table format?
|
| Not yet, it's on our todo list to integrate with the ecosystem
| of data catalogs (Iceberg/Delta/Hudi etc). Join our Slack/get
| in touch with us if you're keen on this though, we'd love to
| learn more about your use-case!
|
| > Any plans to support sql queries?
|
| We do eventually want to support SQL as well, but haven't had
| the bandwidth to build and maintain it. Really we'd just need
| to compile the SQL down to our logical plan - we could pretty
| easily integrate UDFs so that they can be registered as SQL
| functions too!
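|
| For the curious, a rough sketch of what "compile the SQL down to
| our logical plan" could look like using an off-the-shelf parser
| (sqlglot); the df.where/df.select calls below are stand-ins for
| an engine's plan builders, not Daft's actual API:
|
|     import sqlglot
|     from sqlglot import exp
|
|     def compile_select(sql: str, tables: dict):
|         # Lower a trivial SELECT ... FROM ... WHERE onto
|         # dataframe-style plan-building calls.
|         tree = sqlglot.parse_one(sql)
|         df = tables[tree.find(exp.Table).name]  # resolve scan
|         where = tree.find(exp.Where)
|         if where is not None:
|             df = df.where(where.this.sql())     # push predicate
|         cols = [e.alias_or_name for e in tree.expressions]
|         return df.select(*cols)                 # projection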
| marsupialtail_2 wrote:
| SQL support is very challenging.
|
| I work on Quokka (https://github.com/marsupialtail/quokka). It
| supports Iceberg reads. Recently we've been adding SQL support by
| parsing the DuckDB logical plan, though that is very challenging
| as well.
|
| The Python world lacks a standard plug-and-play SQL query
| optimizer. Apache Calcite is good for the JVM world, but not
| great if you are trying to cut out the JVM.
| liminal wrote:
| This looks amazing.
|
| * I started poking around the docs and I'm most excited about a
| Ray backend runner. I'm hoping this allows more ergonomic
| distributed data frame computation on an existing Ray cluster.
|
| * Is this based on Apache Arrow? I would assume so, but it's
| important that it be zero-copy to/from other tools. Would like
| to see this mentioned prominently somewhere.
|
| * I really like the Polars expression API. I haven't dug into the
| docs enough to know how it relates. I do see a reference to
| pinning the embedded Polars version, so fingers crossed there's
| compatibility. It would be AMAZING to be able to take Polars code
| and run it in a distributed cluster with minimal changes. Can
| anyone chime in?
| jaychia wrote:
| > Ray backend runner
|
| Yes, give it a whirl and let us know what you think! Ray is
| _amazing_ and has actually gotten a lot better post their 2.0
| release :)
|
| > Is this based on Apache Arrow?
|
| Indeed it is, and thanks for the feedback. We'll make this a
| little more visible. We use the arrow2 Rust crate (same one
| that Polars uses) for our in-memory data representation.
|
| Our data representation makes it such that converting Daft into
| a Ray dataset (`df.to_ray_dataset()`) is actually zero-copy. So
| you can go from data transformations into downstream ML stuff
| in Ray really easily.
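|
| For anyone wanting to try it, a minimal sketch (the bucket path
| and column names are made up; check the docs for the exact API):
|
|     import daft
|
|     daft.context.set_runner_ray()  # use the Ray backend runner
|
|     df = daft.read_parquet("s3://my-bucket/events/*.parquet")
|     df = df.where(daft.col("status") == "ok")
|     ds = df.to_ray_dataset()       # zero-copy handoff to Ray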
|
| > It would be AMAZING to be able to take Polars code and run it
| in a distributed cluster with minimal changes.
|
| Unfortunately we don't have Polars API compatibility. This
| seems to be a recurring theme in this thread though. The
| problem is that certain Polars expressions are non-trivial to
| do in a distributed setting, and Polars itself as a project is
| so young and moves so quickly it's hard for us to maintain 100%
| API-compatibility.
|
| That being said, you are correct that a lot of the API is very
| much inspired by Polars, which should hopefully make it easy to
| move between the two.
| liminal wrote:
| Please at least reach out to the Polars folks and see what's
| possible.
| camgunz wrote:
| 1. This looks super cool
|
| 2. I admit to not understanding data lakes at all. I thought it
| was like a failure case for like, "we can't figure out how to get
| this data into a database", because isn't updating it a huge
| chore? You have to make sure that if you're updating you're not
| also generating new analytics, which it seems like you're always
| doing because it's very slow. Don't databases solve this pretty
| elegantly? Why are there all these tools for dealing with data in
| flat files?
| rgrieselhuber wrote:
| 2. I hate the name data lake but I've always used them as part
| of a pipeline. It's useful to keep the raw data around when you
| need to recover / replay the pipeline.
| short_sells_poo wrote:
| I think most data lakes really turn into a data swamp as
| people/orgs are not diligent enough to keep them in good shape.
| In the absence of costs, people never delete anything
| voluntarily and the garbage will grow monotonically until it
| takes up all space.
| musingsole wrote:
| That applies to databases as much as data lakes.
|
| I don't know why it's so hard to get across that data
| expires in the same way that language changes.
| DecoPerson wrote:
| A database is typically a live, running program that requires
| maintenance. For most "databases" there is a strong coupling
| between the on-disk format and the executable code. It's not easy
| to read from a random Postgres database sitting on disk: you
| can't just do "import postgres" and then "postgres.open('mydb')"
| in Python.
|
| I'm no data scientist, and have only worked with data lakes a
| couple times, but I can see why data science tends to be done
| with very predictable (if inefficient) data formats such as
| CSV, JSON, and JSONL.
|
| Edit: SQLite is the best of both worlds. It's a database, but
| it's also "just a file." It's easy to work with, and many
| languages & frameworks are getting good support for it.
| SQLite's reliability-first approach means many of the kinks that
| come with running a database (so much complexity!!) are already
| ironed out and don't surface as issues. (Things like auto-
| indexing, auto-vacuuming, avoiding & dealing with corruption,
| backwards & forwards compatibility, ...)
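|
| To make the "just a file" point concrete: with nothing but the
| Python standard library you can open a SQLite database sitting
| on disk (the file name here is made up):
|
|     import sqlite3
|
|     con = sqlite3.connect("mydb.sqlite")   # no server process
|     tables = con.execute(
|         "SELECT name FROM sqlite_master WHERE type='table'"
|     ).fetchall()                            # tables in the file
|     con.close()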
| IanCal wrote:
| One side of data lakes is that they're more of a starting point
| for where the data lands. Your normalised, more processed data
| may end up in a nice clean database, but the start is not like
| that.
|
| The other main points are usually
|
| * Data size
|
| * Data access patterns
|
| * Data formats
|
| The more you're looking at "I want to pull 400G of data out of
| my 30TB set of images from a bunch of machines running a custom
| Python script, then shut it all down in twenty minutes and not
| start anything else until tomorrow", the more a data lake makes
| sense vs a database.
|
| > because isn't updating it a huge chore?
|
| Not with the right tools, which can also give you things like a
| git-like commit experience with branching.
|
| > You have to make sure that if you're updating you're not also
| generating new analytics, which it seems like you're always
| doing because it's very slow.
|
| Why would you be generating new analytics? I feel I've missed
| something there.
| marsupialtail_2 wrote:
| While we are on this topic, the challenge with data lakes for
| Python-based projects like Daft and Quokka (what I work on) is
| the poor Python support for table formats like Delta, Iceberg and
| Hudi. Delta has the best support, but its Python API is
| consistently behind the Java one. Iceberg doesn't support Python
| writes. Hudi doesn't support Python at all.
|
| I have users demanding Iceberg writes and Hudi reads/writes. I
| don't know what to tell them, since I don't have the resources
| to add a reader/writer myself for those projects.
|
| Hopefully as DuckDB becomes more popular we will see Python
| bindings for these popular data lake formats this year.
| jaychia wrote:
| Hi, I'm one of the maintainers of Daft
|
| 1. Thanks! We think so too :)
|
| 2. Here's my 2c in favor of flat files:
|
| - Ingestion: ingesting things into a data lake is much easier
| than writing to a database (all you have to do is drop some
| JSON, CSVs or protobufs into a bucket). This makes integrating
| with other systems, especially 3rd-party or vendors, much
| easier since there's an open language-agnostic format to
| communicate with.
|
| - Multimodal data: Certain datatypes (e.g. images, tensors) may
| not make sense in a traditional SQL database. In a datalake
| though, data is usually "schema-on-read", so you can at least
| ingest it and now the responsibility is on the downstream
| application to make use of it if it can/wants to - super
| flexible!
|
| - "Always on": with databases, you pay for uptime which likely
| scales with the size of your data. If your requirements are
| infrequent accesses of your data then a datalake could save you
| a lot of money! A common example of this: once-a-day data
| cleanups and ETL of an aggregated subset of your data into
| downstream (clean!) databases for cheaper consumption.
|
| On "isn't updating it a huge chore?": many data lakes are
| partitioned by ingestion time, and applications usually consume
| a subset of these partitions (e.g. all data over the past
| week). In practice this means that you can lifecycle your data
| and put old data into cold-storage so that it costs you less
| money.
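|
| As a concrete (made-up) illustration of that pattern: a consumer
| that only wants last week's data can compute just those partition
| paths and never touch the older ones:
|
|     from datetime import date, timedelta
|
|     # Hypothetical layout:
|     #   s3://my-lake/events/date=YYYY-MM-DD/*.parquet
|     days = [date.today() - timedelta(days=d) for d in range(7)]
|     paths = [
|         f"s3://my-lake/events/date={d.isoformat()}/*.parquet"
|         for d in days
|     ]
|     # A scan over `paths` reads ~1 week of data, not the whole
|     # lake; older partitions can be lifecycled to cold storage.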
| swader999 wrote:
| Finally a decent name.
| jaychia wrote:
| Hello! I am one of the maintainers of Daft. Funny enough I just
| gave a presentation about Daft in London and we all had quite a
| laugh at the name :D
| IanOzsvald wrote:
| Indeed I was the one who got confused by the name! Thanks for
| attending the discussion, Jay, and I'm happy to see Daft being
| discussed here.
| nhourcard wrote:
| one of the best bands in the world has Daft in their name!
| Fiahil wrote:
| So Daft is a distributed Polars?
|
| Setting the distributed part aside, what's the "killer feature" of
| Daft for trying to compete with Polars (and Pandas)? Are they
| API-compatible? How's the memory consumption benchmark? (TBH, this
| is the only interesting metric. Timing and latency are not really
| important when your most important competitor is Spark.)
| jaychia wrote:
| > So Daft is a distributed Polars ?
|
| We did actually start by using Polars as our underlying
| execution engine, but eventually transitioned off to our own
| Rust Table abstraction to better suit our needs (e.g. custom
| datatypes and kernels). We still share the arrow2 dependency
| with Polars for in-memory representation of our data.
|
| > what's the "killer feature" of Daft for trying to compete
| with Polars (and Pandas)
|
| We don't specifically try to compete on a local machine, since so
| much good new tooling has become available recently (DuckDB,
| Polars, etc.). That being said, we do try our best to
| make the local experience as seamless as possible because we've
| all felt the pain of developing locally in PySpark. Aside from
| the ability to go distributed, I like using Daft for:
|
| * Working on more "complex" datatypes such as URLs, images etc
|
| * Working with "many [Parquet/CSV/JSON] files in cloud storage
| (S3)", which we've found to be quite common across workloads. We
| already have some intelligent optimizations here (column pruning,
| predicate pushdown, etc.) and are building more to reduce and
| optimize I/O from the cloud. See the sketch below.
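|
| A hedged sketch of both points (the bucket path is made up and
| the expression names are approximate; check the docs for the
| exact API):
|
|     import daft
|     from daft import col
|
|     # Scan many Parquet files at once; column pruning and
|     # predicate pushdown keep the cloud I/O small.
|     df = daft.read_parquet("s3://my-bucket/dataset/*.parquet")
|     df = df.where(col("label") == "cat")
|     df = df.select(col("url"), col("label"))
|
|     # Treat a column of URLs as images: download, then decode.
|     df = df.with_column(
|         "image", col("url").url.download().image.decode()
|     )
|     df.show()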
|
| As you've pointed out, one of our main responsibilities is to
| handle memory very very well. This is something we're actively
| working on and I'm thinking this will be a big reason to use us
| locally as well!
|
| > Are they API-compatible?
|
| We are not API-compatible with Pandas/Polars, but our API is
| quite inspired by Polars. We found that building out the core
| set of dataframe functionality was much more tractable than
| attempting to go API-compatible from the get-go.
|
| > How's the memory consumption benchmark? (TBH, this is the only
| interesting metric. Timing and latency are not really important
| when your most important competitor is Spark.)
|
| We think throughput is still important when comparing against
| Spark, since this can save a lot of money when running some
| potentially very expensive queries!
|
| That being said, you're spot-on about memory usage being a key
| metric here. One of the key advantages of having native types
| for multimodal data (e.g. tensors and images) is that we can
| much more tightly account for memory requirements when working
| with these types, beyond the usual black-box Python UDF which
| often results in a ton of out-of-memory issues.
|
| Our current mechanism for dealing with this is relying on Ray's
| excellent object spilling mechanisms for working with out-of-
| core data. We recognize that there are many situations in which
| this is insufficient.
|
| The team is working on many advanced features here (e.g.
| microbatching) that will give Daft a big boost, and will
| release benchmarks as soon as we have them!
|
| [Edit: typos!]
| Fiahil wrote:
| Thank you for your responses!
|
| I'm going to be very blunt here, because you need to hear
| this to go forward:
|
| You HAVE TO be at least API-compatible with Polars or Pandas
| to exist. Being backend-compatible with arrow is not enough.
|
| There is no technical reason why you would not pick one and
| go with it, apart from it being a very difficult task.
|
| As of today, I have 2 major pains: Pandas being a giant memory
| hog, and Polars not being a drop-in replacement for Pandas. I am
| pushing Polars as hard as I can in all the projects I touch, and
| it's a very long way from being the default DataFrame library.
| Data scientists will continue to use Pandas for the foreseeable
| future, and that saddens me greatly because I will also have to
| work with OOMKilled pipelines for the foreseeable future.
|
| There is no place for a 3rd alternative, so either you become a
| "distribution bridge for Polars", which would be absolutely
| amazing, or you go your own way and I'll put a small star on
| GitHub, a "Noice!" in the comment section, and move on and never
| come back.
|
| It's tough, but sadly real.
| [deleted]
| sammysidhu wrote:
| Hi! (one of the Daft maintainers here), thanks for the
| feedback. Ultimately you're right that supporting the full
| Polars syntax in a distributed fashion is very difficult.
| There are libraries out there that do "Pandas but distributed",
| but from what I have seen they prioritized API coverage over
| performance and memory consumption. So you end up in a similar
| boat to the situation you mentioned.
|
| We're trying to start with a simpler API that maps well onto a
| distributed query plan we can execute well, and then add the
| features that people ask for.
|
| I would love to know what you would want to see in Daft!
| marsupialtail_2 wrote:
| Hi -- I am the author of Quokka
| (https://github.com/marsupialtail/quokka), trying to be a
| distributed Polars. I am going for API compatibility, or at least
| support for most of the API.
|
| I am not focused on complex data types though.
| esafak wrote:
| Same suggestion to you: please integrate it with fugue.
|
| https://github.com/fugue-project/fugue
| sirfz wrote:
| Dolars
| esafak wrote:
| Please integrate it with Fugue; a unified interface that lets you
| swap out execution engines like pandas for Spark.
|
| https://github.com/fugue-project/fugue
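|
| For context, a rough illustration of the Fugue interface (the
| pandas and spark engines exist today; a Daft engine would be the
| new piece):
|
|     import pandas as pd
|     from fugue import transform
|
|     def add_one(df: pd.DataFrame) -> pd.DataFrame:
|         df["x"] = df["x"] + 1
|         return df
|
|     pdf = pd.DataFrame({"x": [1, 2, 3]})
|     # Runs locally on pandas; swap in engine="spark" (or, one
|     # day, a Daft engine) to run the same code distributed.
|     out = transform(pdf, add_one, schema="*")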
| sammysidhu wrote:
| (one of the Daft maintainers here) Great call-out! I went ahead
| and made an issue for us to work on this:
| https://github.com/Eventual-Inc/Daft/issues/1016
___________________________________________________________________
(page generated 2023-06-07 23:02 UTC)