[HN Gopher] Use DuckDB-WASM to query TB of data in browser
___________________________________________________________________
Use DuckDB-WASM to query TB of data in browser
Author : mlissner
Score : 220 points
Date : 2025-10-31 17:37 UTC (1 day ago)
(HTM) web link (lil.law.harvard.edu)
(TXT) w3m dump (lil.law.harvard.edu)
| mlissner wrote:
| OK, this is really neat: - S3 is really cheap static storage for
| files. - DuckDB is a database that uses S3 for its storage. -
| WASM lets you run binary (non-JS) code in your browser. - DuckDB-
| Wasm allows you to run a database in your browser.
|
| Put all of that together, and you get a website that queries S3
| with no backend at all. Amazing.
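|
| As a rough sketch of how the pieces fit together (the bucket
| URL and column names below are invented, and this assumes the
| @duckdb/duckdb-wasm client):
|
|   import * as duckdb from "@duckdb/duckdb-wasm";
|
|   async function queryS3(): Promise<void> {
|     // Load a DuckDB-Wasm bundle from jsDelivr and start its worker.
|     const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
|     const workerUrl = URL.createObjectURL(new Blob(
|       [`importScripts("${bundle.mainWorker!}");`],
|       { type: "text/javascript" }));
|     const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(),
|                                       new Worker(workerUrl));
|     await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
|
|     // DuckDB only fetches the byte ranges of the Parquet file it needs.
|     const conn = await db.connect();
|     const result = await conn.query(`
|       SELECT court, COUNT(*) AS n
|       FROM read_parquet('https://my-bucket.s3.amazonaws.com/opinions.parquet')
|       GROUP BY court ORDER BY n DESC LIMIT 10;
|     `);
|     console.log(result.toArray());
|     await conn.close();
|   }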
| timeflex wrote:
| S3 might be relatively cheap for storing files, but bandwidth
| is another matter: you could easily be paying $230/mo. If you
| make it public facing & want to use their cloud reporting,
| metrics, etc. to prevent people from running up your bandwidth,
| your "really cheap" static hosting could easily cost you more
| than $500/mo.
| theultdev wrote:
| R2 is S3 compatible with no egress fees.
|
| Cloudflare actually has built-in Iceberg support for R2
| buckets. It's quite nice.
|
| Combine that with their Pipelines and it's a simple HTTP
| request to ingest; then just point DuckDB at the Iceberg-
| enabled R2 bucket to analyze.
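|
| Roughly, pointing DuckDB at R2 just means using the S3-
| compatible endpoint; a minimal sketch, assuming an open duckdb-
| wasm connection `conn` (account id, keys and bucket are
| placeholders, and the Iceberg extension is a separate install):
|
|   // Configure the S3-compatible R2 endpoint and credentials.
|   await conn.query(`SET s3_endpoint = '<ACCOUNT_ID>.r2.cloudflarestorage.com';`);
|   await conn.query(`SET s3_access_key_id = '<KEY_ID>';`);
|   await conn.query(`SET s3_secret_access_key = '<SECRET>';`);
|   await conn.query(`SET s3_url_style = 'path';`);
|
|   // Plain Parquet on R2 then reads the same way it would from S3.
|   const rows = await conn.query(
|     `SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet') LIMIT 10;`);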
| greatNespresso wrote:
| Was about to jump in to say the same thing. R2 is a much
| cheaper alternative to S3 that just works. I have used it
| with DuckDB and it works smoothly.
| apwheele wrote:
| For a demo of this (although I'm not sure DuckDB-Wasm works
| with Iceberg):
| https://andrewpwheeler.com/2025/06/29/using-duckdb-wasm-
| clou...
| 8organicbits wrote:
| > R2 is S3 compatible with no egress fees.
|
| There's no egress data transfer fees, but you still pay for
| the GET request operations. Lots of little range requests
| can add up quick.
| zenmac wrote:
| Can't believe that is what the industry has come down to.
| Kind of like clipping coupons to get the best deal according
| to the different pricing overlords.
|
| It is times like this that make self-hosting a lot more
| attractive.
| theultdev wrote:
| Luckily it's just static files. You can use whatever host
| you want.
| 7952 wrote:
| I think this approach makes sense for services with a small
| number of users relative to the data they are searching. That
| just isn't a good fit for a lot of hosted services. Think how
| much those TBs of data would cost on Algolia or similar
| services.
|
| You have to store the data somehow anyway, and you have to
| retrieve some of it to service a query. If egress costs too
| much you could always change later to put the browser code on
| a server. Also it would presumably be possible to quantify
| the trade-off between processing the data client side and on
| the server.
| simonw wrote:
| Stick it behind Cloudflare and it should be effectively free.
| bigiain wrote:
| Until it isn't.
| rubenvanwyk wrote:
| Or use R2 instead. It's even easier.
| thadt wrote:
| S3 is doing quite a lot of sophisticated lifting to qualify as
| _no backend at all._
|
| But yeah - this is pretty neat. It seems like the future of
| static datasets should wind up in something like this: just
| data, with some well-chosen indices.
| theultdev wrote:
| Still qualifies imo. Everything is static and on a CDN.
|
| Lack of server/dynamic code qualifies as no backend.
| simonw wrote:
| I believe all S3 has to do here is respond to HTTP Range
| queries, which are supported by almost every static server
| out there - Apache, Nginx etc should all support the same
| trick.
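|
| For illustration, the "trick" is just an ordinary range
| request; a sketch in TypeScript (the URL is a placeholder):
|
|   // Ask the static server for only a slice of the file. A 206
|   // response means range requests are supported. The Parquet
|   // footer/metadata lives at the end of the file.
|   const res = await fetch("https://example.com/data/opinions.parquet", {
|     headers: { Range: "bytes=-65536" },  // last 64 KiB
|   });
|   console.log(res.status);  // 206 Partial Content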
| thadt wrote:
| 100%. I'm with y'all - this is what I would _also_ call a
| "no-backend" solution and I'm all in on this type of
| approach for static data sets - this is the future, and
| could be served with a very simple web server.
|
| I'm just bemused that we all refer to one of the larger,
| more sophisticated storage systems on the planet, composed
| of dozens of subsystems and thousands of servers, as "no
| backend at all." Kind of a "draw the rest of the owl".
| codedokode wrote:
| Can you replace S3 with a directory and nginx and save a lot
| of money?
| dtech wrote:
| Yes, IIRC it's not S3-specific, just URLs
| mpweiher wrote:
| Yes. Especially if you use Storage Combinators.
|
| They let you easily abstract over storage.
|
| https://2019.splashcon.org/details/splash-2019-Onward-
| papers...
| amazingamazing wrote:
| Neat. Can you use DuckDB backed by another store like RocksDB
| or something? Also, I wonder how one stops DDoS. Put the whole
| thing behind Cloudflare?
| wewewedxfgdf wrote:
| I tried DuckDB - liked it a lot - was ready to go further.
|
| But I found it to be a real hassle to get it to use the right
| number of threads and the right amount of memory.
|
| This led to lots of crashes. If you look at the project's
| GitHub issues you will see many OOM (out of memory) errors.
|
| And then there was some index bug that crashed, seemingly
| unrelated to memory.
|
| Life is too short for crashy database software so I reluctantly
| dropped it. I was disappointed because it was exactly what I was
| looking for.
| lalitmaganti wrote:
| +1, this was my experience trying it out as well. I find that
| for getting started and for simple use cases it works
| amazingly well. But I have quite a lot of concerns about how
| it scales to more complex and esoteric workloads.
|
| Non-deterministic OOMs especially are some of the worst things
| in the sort of tools I'd want to use DuckDB in, and as you say,
| I found them to be more common than I would like.
| tuhgdetzhh wrote:
| I can recommend earlyoom (https://github.com/rfjakob/earlyoom).
| Instead of your system freezing or crashing, this tool kills
| the memory-eating process just in time (in this case DuckDB).
| This lets you repeat with smaller chunks of the dataset until
| it fits into your memory.
| wewewedxfgdf wrote:
| Yeah memory and thread management is the job of the
| application, not me.
| QuantumNomad_ wrote:
| When there is a specific program I want to run with a limit on
| how much memory it is allowed to allocate, I have found
| systemd-run to work well.
|
| It uses cgroups to enforce resource limits.
|
| For example, there's a program I wrote myself which I run on
| one of my Raspberry Pis. I had a problem where my program
| would on rare occasions use up too much memory and I wouldn't
| even be able to ssh into the Raspberry Pi.
|
| I run it like this:
|
|     systemd-run --scope -p MemoryMax=5G --user env FOOBAR=baz ./target/release/myprog
|
| The only difficulty I had was that I struggled to find the
| right name to use in the MemoryMax=... part because they've
| changed the name of it around between versions so different
| Linux systems may or may not use the same name for the limit.
|
| In order to figure out if I had the right name for it, I
| tested different names for it with a super small limit that I
| knew was less than the program needs even in normal
| conditions. And when I found the right name, the program
| would as expected be killed right off the bat and so then I
| could set the limit to 5G (five gigabytes) and be confident
| that if it exceeds that then it will be killed instead of
| making my Raspberry Pi impossible to ssh into again.
| thenaturalist wrote:
| This looks amazing!
|
| Have you used this in conjunction with DuckDB?
| tuhgdetzhh wrote:
| Yes, it works just fine.
| mritchie712 wrote:
| what did you use instead? if you hit OOM with the dataset in
| duckdb, I'd think you'd hit the OOM with most other things on
| the same machine.
| wewewedxfgdf wrote:
| The software should manage its own memory, not require the
| developer to set specific memory thresholds. Sure, it's a good
| thing to be able to say "use no more than X RAM".
| thenaturalist wrote:
| How long ago was this, or can you share more context about the
| data and memory sizes you experienced this with?
|
| DuckDB introduced spilling to disk and some other tweaks a good
| year ago now:
| https://duckdb.org/2024/07/09/memory-management
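|
| For what it's worth, the knobs are plain settings; a minimal
| sketch, assuming an open connection `conn` (the same SET
| statements work in the CLI and native clients):
|
|   // Cap DuckDB's own allocations and trade threads for peak memory.
|   await conn.query(`SET memory_limit = '4GB';`);
|   await conn.query(`SET threads = 2;`);
|   // Where larger-than-memory operators may spill (native builds).
|   await conn.query(`SET temp_directory = '/tmp/duckdb_spill';`);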
| wewewedxfgdf wrote:
| 3 days ago.
|
| The final straw was an index which generated fine on MacOS
| and failed on Linux - exact same code.
|
| Machine had plenty of RAM.
|
| The thing is, it is really the responsibility of the
| application to regulate its behavior based on available
| memory. Crashing out just should not be an option but that's
| the way DuckDB is built.
| alex-korr wrote:
| I had the same experience - everything runs great on an AWS
| Linux EC2 instance with 32GB of memory, while the same workload
| in a Docker container on ECS with 32GB allocated gets an OOM.
| But for smaller workloads, DuckDB is fantastic... however,
| there's a certain point where Spark or Snowflake start to make
| more sense.
| jdnier wrote:
| Yesterday there was a somewhat similar DuckDB post, "Frozen
| DuckLakes for Multi-User, Serverless Data Access".
| https://news.ycombinator.com/item?id=45702831
| 85392_school wrote:
| This also reminded me of an approach using SQLite:
| https://news.ycombinator.com/item?id=45748186
| pacbard wrote:
| I set up something similar at work. But it was before the
| DuckLake format was available, so it just uses manually
| generated Parquet files saved to a bucket and a light DuckDB
| catalog that uses views to expose the parquet files. This lets
| us update the Parquet files using our ETL process and just
| refresh the catalog when there is a schema change.
|
| We didn't find the frozen DuckLake setup useful for our use
| case. Mostly because the frozen catalog kind of doesn't make
| sense with the DuckLake philosophy and the cost-benefit wasn't
| there over a regular duckdb catalog. It also made making
| updates cumbersome because you need to pull the DuckLake
| catalog, commit the changes, and re-upload the catalog (instead
| of just directly updating the Parquet files). I get that we are
| missing the time travel part of the DuckLake, but that's not
| critical for us and if it becomes important, we would just roll
| out a PostgreSQL database to manage the catalog.
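|
| For anyone curious, that "views over Parquet" catalog amounts
| to something like the sketch below (bucket paths and table
| names are made up; `conn` is an open DuckDB connection):
|
|   // The catalog only holds views; the ETL process just rewrites
|   // the Parquet files underneath them.
|   await conn.query(`
|     CREATE OR REPLACE VIEW orders AS
|     SELECT * FROM read_parquet('s3://my-bucket/warehouse/orders/*.parquet');
|   `);
|   await conn.query(`
|     CREATE OR REPLACE VIEW customers AS
|     SELECT * FROM read_parquet('s3://my-bucket/warehouse/customers/*.parquet');
|   `);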
| SteveMoody73 wrote:
| My initial thought is: why query 1TB of data in a browser?
| Maybe I'm the wrong target audience for this, but it seems to
| be pushing everything into the browser rather than using
| appropriate tools.
| cyanydeez wrote:
| Browsers are now the write-once, runs-everywhere target. Where
| Java failed, many hope browsers succeed. WASM is definitely a
| key to that, particularly because it can be produced by
| languages like Rust, so browsers can also be the appropriate
| tool.
| majormajor wrote:
| Why pay for RAM for servers when you can let your users deal
| with it? ;)
|
| (Does not seem like a realistic scenario to me for many uses,
| for RAM among other resource reasons.)
| some_guy_nobel wrote:
| The one word answer is cost.
|
| But, if you'd like to instead read the article, you'll see that
| they qualify the reasoning in the first section of the article,
| titled, "Rethinking the Old Trade-Off: Cost, Complexity, and
| Access".
| simonw wrote:
| What appropriate tool would you use for this instead?
| shawn-butler wrote:
| I doubt they are querying 1 TB of data in the browser. DuckDB-
| WASM issues HTTP range requests on behalf of the client to
| request only the bytes required, which is especially handy with
| Parquet files (a columnar format) since columns you don't need
| are never downloaded.
|
| But the article is a little light on technical details. In some
| cases it might make sense to bring the entire file client-side.
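|
| To make that concrete, a query like the sketch below (URL and
| columns are invented; `conn` is an open duckdb-wasm connection)
| only pulls the file metadata plus the column chunks it touches:
|
|   const hits = await conn.query(`
|     SELECT court, date_filed
|     FROM read_parquet('https://example.com/opinions.parquet')
|     WHERE date_filed >= DATE '2024-01-01';
|   `);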
| fragmede wrote:
| For small databases, SQLite is handy, as there are multiple
| ways to parse the format for clients.
| r3tr0 wrote:
| It's one of the best tricks in the book.
|
| We have been doing it for quite some time in our product to
| bring real-time system observability with eBPF to the browser,
| and we have even found other techniques to really max it out
| beyond what you get off the shelf.
|
| https://yeet.cx
| mrbluecoat wrote:
| That's pretty cool. Any technical blog posts?
| r3tr0 wrote:
| we got a couple blog posts
|
| https://yeet.cx/blog
| leetrout wrote:
| I built something on top of DuckDB last year but it never got
| deployed. They wanted to trust Postgres.
|
| I didn't use the in-browser WASM, but I did expose an API
| endpoint that passed data exploration queries directly to the
| backend, like a knock-off of what New Relic does. I also used
| that same endpoint for all the graphs and metrics in the UI.
|
| DuckDB is phenomenal tech and I love to use it with data ponds
| instead of data lakes although it is very capable of large sets
| as well.
| whalesalad wrote:
| Cool thing about DuckDB is it can be embedded. We have a data
| pipeline that produces a duckdb file and puts it on S3. The app
| periodically checks that asset's ETag and pulls it down when it
| changes. Most of our DB interactions use PSQL, but we have one
| module that leverages DuckDB and this file for reads. So it's
| definitely not all-or-nothing.
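|
| A rough sketch of that refresh loop (the URL, interval and the
| reopen step are all assumptions):
|
|   let lastEtag: string | null = null;
|
|   async function refreshIfChanged(): Promise<void> {
|     const url = "https://my-bucket.s3.amazonaws.com/analytics.duckdb";
|     const head = await fetch(url, { method: "HEAD" });
|     const etag = head.headers.get("etag");
|     if (etag && etag !== lastEtag) {
|       const bytes = await (await fetch(url)).arrayBuffer();
|       // ...write `bytes` to local disk and reopen the read-only
|       // DuckDB handle on the new file...
|       lastEtag = etag;
|     }
|   }
|
|   setInterval(refreshIfChanged, 60_000);  // poll once a minute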
| zenmac wrote:
| Are you using pg_duckdb to embed it inside Postgres and
| access it via psql or other pg clients?
| victor106 wrote:
| > data ponds instead of data lakes
|
| What are data ponds? Never heard the term before
| leetrout wrote:
| Haha, my term. Somewhere between a data lake and warehouse -
| still unstructured but not _everything_ in one place. For
| instance, if I have a multi-tenant app I might choose to have
| a duckdb setup for each customer with pre-filtered data
| living alongside some global unstructured data.
|
| Maybe there's already a term that covers this but I like the
| imagery of the metaphor... "smaller, multiple data but same
| idea as the big one".
| didip wrote:
| How... does it not blow up the browser's memory?
| Copenjin wrote:
| The UI element is a scrollable table with a fixed size viewport
| window, memory shouldn't be a problem since they just have to
| retrieve and cache a reasonable area around that window. Old
| data can just be discarded.
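|
| One simple way to do that windowing (table URL, ordering column
| and page size are invented; `conn` is an open duckdb-wasm
| connection) is to only ever ask for the rows around the
| viewport:
|
|   // Fetch one "page" of rows around the visible scroll position;
|   // rows that scroll far away can be dropped from the cache.
|   async function fetchWindow(offset: number, pageSize = 200) {
|     return conn.query(`
|       SELECT * FROM read_parquet('https://example.com/opinions.parquet')
|       ORDER BY id
|       LIMIT ${pageSize} OFFSET ${offset};
|     `);
|   }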
| barrenko wrote:
| Where do I learn how to set up this sort of stuff? Trial and
| error? I kinda never need it for personal projects (so far),
| which always leads me to forget this stuff in between jobs kinda
| quickly. Is there a decent book?
| vikramkr wrote:
| If you want to learn it the best way is probably to come up
| with a personal project idea that requires it specifically? Idk
| how much you'd get out of a book but you could always do a side
| project with the specific goal of doing it just to learn a
| particular stack or whatever
| dtech wrote:
| My company tried DuckDB-WASM + parquet + S3 a few months ago but
| we ended up stripping it all out and replacing it with a boring
| REST API.
|
| On paper it seemed like a great fit, but it turned out the WASM
| build doesn't have feature-parity with the "normal" variant, so
| the features that made us pick it, like Parquet compression
| support and lazy loading, weren't available. It ended up not
| having great performance while introducing a lot of complexity,
| and it was also terrible for first page load time due to
| needing the large WASM blob. Build pipeline complexity was also
| inherently higher due to the dependency and the data packaging
| needed.
|
| Just something to be aware of if you're thinking of using it. Our
| conclusion was that it wasn't worth it for most use cases, which
| is a shame because it seems like such a cool tech.
| mentalgear wrote:
| > WASM build doesn't have feature-parity with the "normal"
| variant
|
| It's a good point, but the wasm docs state that feature-parity
| isn't there - yet. It could certainly be more detailed, but it
| seems strange that your company would do all this work without
| first checking the feature-coverage / specs.
|
| > WebAssembly is basically an additional platform, and there
| might be platform-specific limitations that make some
| extensions not able to match their native capabilities or to
| perform them in a different way.
|
| https://duckdb.org/docs/stable/clients/wasm/extensions
| dtech wrote:
| Note that those docs specifically mention Parquet as supported,
| but we found out the hard way that some specific features
| turned out not to be supported with WASM + Parquet. I took a
| quick glance at the docs and could not find references to that,
| so I'm not surprised it was missed.
|
| It was a project that exploited a new opportunity, so time-to-
| market was the most important thing. I'm not surprised these
| things were missed, and replacing the data loading mechanism
| was maybe 1 week of work for 1 person, so it wasn't that
| impactful a change later.
| mentalgear wrote:
| Fair point, thx for sharing your experiences! You might want
| to edit the duckdb-wasm docs in that regard to alert others /
| the team to this constraint.
| ludicrousdispla wrote:
| DuckDB-WASM supports Parquet file decompression though, so if
| you have a backend process generating them it's a non-issue.
|
| How large was your WASM build? I'm using the standard duckdb-
| wasm, along with JS functions to form the SQL queries, and not
| seeing onerous load times.
| ngc6677 wrote:
| A similar procedure is used on joblist.today
| (https://github.com/joblisttoday) to fetch hiring companies and
| their jobs, store them in SQLite and DuckDB, and retrieve them
| on the client side with their WASM modules. The databases are
| generated by a daily GitHub workflow and hosted as artifacts on
| a GitHub page.
| bzmrgonz wrote:
| This is brilliant, guys, omg this is brilliant. If you think
| about it, freely available data always suffers from this
| burden... "But we don't make money, all this stuff is public
| data by law, and the government doesn't give us a budget".
| This solves that, the "can't afford it" refrain of public
| agencies.
___________________________________________________________________
(page generated 2025-11-01 23:01 UTC)