[HN Gopher] Use DuckDB-WASM to query TB of data in browser
       ___________________________________________________________________
        
       Use DuckDB-WASM to query TB of data in browser
        
       Author : mlissner
       Score  : 220 points
        Date   : 2025-10-31 17:37 UTC (1 day ago)
        
 (HTM) web link (lil.law.harvard.edu)
 (TXT) w3m dump (lil.law.harvard.edu)
        
       | mlissner wrote:
        | OK, this is really neat:
        | 
        | - S3 is really cheap static storage for files.
        | - DuckDB is a database that can use S3 as its storage.
        | - WASM lets you run binary (non-JS) code in your browser.
        | - DuckDB-Wasm lets you run a database in your browser.
       | 
       | Put all of that together, and you get a website that queries S3
       | with no backend at all. Amazing.
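        | 
        | The browser side is only a few lines. Roughly, following the
        | duckdb-wasm README (untested sketch; the bucket URL is made
        | up):
        | 
        |     import * as duckdb from '@duckdb/duckdb-wasm';
        | 
        |     // Pick a WASM bundle and start DuckDB in a web worker
        |     const bundles = duckdb.getJsDelivrBundles();
        |     const bundle = await duckdb.selectBundle(bundles);
        |     const worker = new Worker(URL.createObjectURL(new Blob(
        |       [`importScripts("${bundle.mainWorker}");`],
        |       {type: 'text/javascript'})));
        |     const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
        |     await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
        | 
        |     // Query Parquet on S3; only the needed byte ranges get fetched
        |     const conn = await db.connect();
        |     const res = await conn.query(`
        |       SELECT count(*)
        |       FROM read_parquet('https://my-bucket.s3.amazonaws.com/cases.parquet')`);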
        
         | timeflex wrote:
          | S3 might be relatively cheap for storing files, but with
          | bandwidth you could easily be paying $230/mo. If you make it
          | public facing & want to try to use their cloud reporting,
          | metrics, etc. to prevent people from running up your
          | bandwidth, your "really cheap" static hosting could easily
          | cost you more than $500/mo.
        
           | theultdev wrote:
           | R2 is S3 compatible with no egress fees.
           | 
           | Cloudflare actually has built in iceberg support for R2
           | buckets. It's quite nice.
           | 
           | Combine that with their pipelines it's a simple http request
           | to ingest, then just point duckdb to the iceberg enabled R2
           | bucket to analyze.
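            | 
            | Something like this (untested sketch; bucket/table names
            | are made up, iceberg is DuckDB's extension):
            | 
            |     await conn.query(`INSTALL iceberg; LOAD iceberg;`);
            |     const res = await conn.query(
            |       `SELECT count(*)
            |          FROM iceberg_scan('s3://my-r2-bucket/my_table')`);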
        
             | greatNespresso wrote:
             | Was about to jump in to say the same thing. R2 is a much
             | cheaper alternative to S3 that just works and I have used
             | it with DuckDB, works smoothly
        
             | apwheele wrote:
              | For a demo of this (although I'm not sure DuckDB-WASM
              | works with Iceberg):
              | https://andrewpwheeler.com/2025/06/29/using-duckdb-wasm-clou...
        
             | 8organicbits wrote:
             | > R2 is S3 compatible with no egress fees.
             | 
             | There's no egress data transfer fees, but you still pay for
             | the GET request operations. Lots of little range requests
             | can add up quick.
        
               | zenmac wrote:
               | Can't believe that is what the industry has come down to.
               | Kind like clipping coupon to get the best deal according
               | different pricing overlords.
               | 
               | It is time like this that makes self-hosting a lot more
               | attractive.
        
               | theultdev wrote:
               | Luckily it's just static files. You can use whatever host
               | you want.
        
           | 7952 wrote:
            | I think this approach makes sense for services with a small
            | number of users relative to the data they are searching.
            | That just isn't a good fit for a lot of hosted services.
            | Think how much those TBs of data would cost on Algolia or
            | similar services.
           | 
            | You have to store the data somehow anyway, and you have to
            | retrieve some of it to service a query. If egress costs too
            | much, you could always move the querying code to a server
            | later. It would also presumably be possible to quantify the
            | trade-off between processing the data client-side and on
            | the server.
        
           | simonw wrote:
           | Stick it behind Cloudflare and it should be effectively free.
        
             | bigiain wrote:
             | Until it isn't.
        
         | rubenvanwyk wrote:
         | Or use R2 instead. It's even easier.
        
         | thadt wrote:
         | S3 is doing quite a lot of sophisticated lifting to qualify as
         | _no backend at all._
         | 
          | But yeah - this is pretty neat. It seems like the future of
          | static datasets should wind up in something like this: just
          | data, with some well-chosen indices.
        
           | theultdev wrote:
           | Still qualifies imo. Everything is static and on a CDN.
           | 
           | Lack of server/dynamic code qualifies as no backend.
        
           | simonw wrote:
           | I believe all S3 has to do here is respond to HTTP Range
           | queries, which are supported by almost every static server
           | out there - Apache, Nginx etc should all support the same
           | trick.
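            | 
            | e.g. all the client does is (sketch, hypothetical URL):
            | 
            |     // Ask a range-capable server for just a slice of the file
            |     const res = await fetch('https://example.com/data.parquet', {
            |       headers: {Range: 'bytes=0-16383'},
            |     });
            |     // Expect 206 Partial Content and only those bytes back
            |     console.log(res.status, res.headers.get('Content-Range'));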
        
             | thadt wrote:
              | 100%. I'm with y'all - this is what I would _also_ call a
              | "no-backend" solution and I'm all in on this type of
              | approach for static data sets - this is the future, and
              | could be served with a very simple web server.
              | 
              | I'm just bemused that we all refer to one of the larger,
              | more sophisticated storage systems on the planet,
              | composed of dozens of subsystems and thousands of
              | servers, as "no backend at all." Kind of a "draw the
              | rest of the owl".
        
         | codedokode wrote:
          | Can you replace S3 with a directory and nginx and save a lot
          | of money?
        
           | dtech wrote:
            | Yes, IIRC it's not S3-specific - it works against plain
            | URLs.
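            | 
            | e.g. the same trick against any static server that honors
            | range requests (sketch, hypothetical URL):
            | 
            |     const res = await conn.query(
            |       `SELECT *
            |          FROM read_parquet('https://example.org/data.parquet')
            |         LIMIT 10`);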
        
           | mpweiher wrote:
            | Yes. Especially if you use Storage Combinators.
            | 
            | They let you easily abstract over storage.
            | 
            | https://2019.splashcon.org/details/splash-2019-Onward-papers...
        
       | amazingamazing wrote:
        | Neat. Can you use DuckDB backed by another store like RocksDB
        | or something? Also, I wonder how one stops DDoS. Put the whole
        | thing behind Cloudflare?
        
       | wewewedxfgdf wrote:
       | I tried DuckDB - liked it a lot - was ready to go further.
       | 
        | But I found it to be a real hassle to help it understand the
        | right number of threads and the amount of memory to use.
        | 
        | This led to lots of crashes. If you look at the project's
        | GitHub issues you will see many OOM (out-of-memory) errors.
        | 
        | And then there was some index bug that crashed seemingly
        | unrelated to memory.
       | 
       | Life is too short for crashy database software so I reluctantly
       | dropped it. I was disappointed because it was exactly what I was
       | looking for.
        
         | lalitmaganti wrote:
          | +1, this was my experience trying it out as well. I find
          | that for getting started and for simple use cases it works
          | amazingly well. But I have quite a lot of concerns about how
          | it scales to more complex and esoteric workloads.
          | 
          | Non-deterministic OOMs especially are some of the worst
          | things in the sort of tools I'd want to use DuckDB in, and
          | as you say, I found them to be more common than I would
          | like.
        
         | tuhgdetzhh wrote:
          | I can recommend earlyoom (https://github.com/rfjakob/earlyoom).
          | Instead of freezing or crashing your system, this tool kills
          | the memory-eating process just in time (in this case
          | DuckDB). This lets you retry with smaller chunks of the
          | dataset until it fits into your memory.
        
           | wewewedxfgdf wrote:
            | Yeah, memory and thread management is the job of the
            | application, not mine.
        
           | QuantumNomad_ wrote:
            | When there is a specific program I want to run with a
            | limit on how much memory it is allowed to allocate, I have
            | found systemd-run to work well.
           | 
           | It uses cgroups to enforce resource limits.
           | 
            | For example, there's a program I wrote myself which I run
            | on one of my Raspberry Pis. I had a problem where my
            | program would on rare occasions use up too much memory and
            | I wouldn't even be able to ssh into the Raspberry Pi.
           | 
            | I run it like this:
            | 
            |     systemd-run --scope -p MemoryMax=5G --user \
            |       env FOOBAR=baz ./target/release/myprog
           | 
            | The only difficulty I had was that I struggled to find the
            | right name to use in the MemoryMax=... part, because
            | they've changed the name around between versions, so
            | different Linux systems may or may not use the same name
            | for the limit.
           | 
           | In order to figure out if I had the right name for it, I
           | tested different names for it with a super small limit that I
           | knew was less than the program needs even in normal
           | conditions. And when I found the right name, the program
           | would as expected be killed right off the bat and so then I
           | could set the limit to 5G (five gigabytes) and be confident
           | that if it exceeds that then it will be killed instead of
           | making my Raspberry Pi impossible to ssh into again.
        
           | thenaturalist wrote:
           | This looks amazing!
           | 
           | Have you used this in conjunction with DuckDB?
        
             | tuhgdetzhh wrote:
             | Yes, it works just fine.
        
         | mritchie712 wrote:
          | What did you use instead? If you hit OOM with the dataset in
          | DuckDB, I'd think you'd hit OOM with most other things on
          | the same machine.
        
           | wewewedxfgdf wrote:
            | The software should manage its own memory, not require the
            | developer to set specific memory thresholds. Sure, it's a
            | good thing to be able to say "use no more than X RAM".
        
         | thenaturalist wrote:
          | How long ago was this, or can you share more context about
          | the data and memory size you experienced this with?
          | 
          | DuckDB introduced spilling to disk and some other tweaks a
          | good year ago now:
          | https://duckdb.org/2024/07/09/memory-management
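          | 
          | You can also cap it explicitly; something like (illustrative
          | values; the setting names are from the DuckDB docs):
          | 
          |     await conn.query(`SET memory_limit = '2GB'`);
          |     await conn.query(`SET threads = 4`);
          |     // spill-to-disk needs somewhere to spill
          |     await conn.query(`SET temp_directory = '/tmp/duck_spill'`);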
        
           | wewewedxfgdf wrote:
           | 3 days ago.
           | 
           | The final straw was an index which generated fine on MacOS
           | and failed on Linux - exact same code.
           | 
           | Machine had plenty of RAM.
           | 
           | The thing is, it is really the responsibility of the
           | application to regulate its behavior based on available
           | memory. Crashing out just should not be an option but that's
           | the way DuckDB is built.
        
             | alex-korr wrote:
              | I had the same experience - everything runs great on an
              | AWS Linux EC2 instance with 32GB of memory, while the
              | same workload in a Docker container on ECS with 32GB
              | allocated gets an OOM. But for smaller workloads, DuckDB
              | is fantastic... however, there's a certain point where
              | Spark or Snowflake starts to make more sense.
        
       | jdnier wrote:
       | Yesterday there was a somewhat similar DuckDB post, "Frozen
       | DuckLakes for Multi-User, Serverless Data Access".
       | https://news.ycombinator.com/item?id=45702831
        
         | 85392_school wrote:
         | This also reminded me of an approach using SQLite:
         | https://news.ycombinator.com/item?id=45748186
        
         | pacbard wrote:
         | I set up something similar at work. But it was before the
         | DuckLake format was available, so it just uses manually
         | generated Parquet files saved to a bucket and a light DuckDB
         | catalog that uses views to expose the parquet files. This lets
         | us update the Parquet files using our ETL process and just
         | refresh the catalog when there is a schema change.
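          | 
          | The catalog is roughly just views like this (sketch,
          | hypothetical table/bucket names):
          | 
          |     await conn.query(`
          |       CREATE OR REPLACE VIEW orders AS
          |       SELECT * FROM read_parquet('s3://my-bucket/orders/*.parquet')`);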
         | 
         | We didn't find the frozen DuckLake setup useful for our use
         | case. Mostly because the frozen catalog kind of doesn't make
         | sense with the DuckLake philosophy and the cost-benefit wasn't
         | there over a regular duckdb catalog. It also made making
         | updates cumbersome because you need to pull the DuckLake
         | catalog, commit the changes, and re-upload the catalog (instead
         | of just directly updating the Parquet files). I get that we are
         | missing the time travel part of the DuckLake, but that's not
         | critical for us and if it becomes important, we would just roll
         | out a PostgreSQL database to manage the catalog.
        
       | SteveMoody73 wrote:
        | My initial thought is: why query 1TB of data in a browser?
        | Maybe I'm the wrong target audience for this, but it seems to
        | be pushing the idea that everything has to be in a browser
        | rather than using appropriate tools.
        
         | cyanydeez wrote:
          | Browsers are now the write-once, works-everywhere target.
          | Where Java failed, many hope browsers succeed. WASM is
          | definitely a key to that, particularly because it can be
          | output by languages like Rust, so they can also be the
          | appropriate tools.
        
         | majormajor wrote:
         | Why pay for RAM for servers when you can let your users deal
         | with it? ;)
         | 
         | (Does not seem like a realistic scenario to me for many uses,
         | for RAM among other resource reasons.)
        
         | some_guy_nobel wrote:
          | The one-word answer is cost.
         | 
         | But, if you'd like to instead read the article, you'll see that
         | they qualify the reasoning in the first section of the article,
         | titled, "Rethinking the Old Trade-Off: Cost, Complexity, and
         | Access".
        
         | simonw wrote:
         | What appropriate tool would you use for this instead?
        
         | shawn-butler wrote:
          | I doubt they are querying 1 TB of data in the browser.
          | DuckDB-WASM issues HTTP range requests on behalf of the
          | client to fetch only the bytes required, which is especially
          | handy with Parquet files (a columnar format), since columns
          | you don't need are never downloaded.
          | 
          | But the article is a little light on technical details. In
          | some cases it might make sense to bring the entire file
          | client-side.
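          | 
          | e.g. a query like this (sketch, hypothetical file) should
          | only pull the byte ranges of the one column it touches:
          | 
          |     const res = await conn.query(
          |       `SELECT avg(price)
          |          FROM read_parquet('https://bucket.example/sales.parquet')`);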
        
           | fragmede wrote:
           | For small databases, SQLite is handy, as there are multiple
           | ways to parse the format for clients.
        
       | r3tr0 wrote:
       | It's one of the best tricks in the book.
       | 
        | We have been doing it for quite some time in our product to
        | bring real-time system observability with eBPF to the browser,
        | and have even found other techniques to really max it out
        | beyond what you get off the shelf.
       | 
       | https://yeet.cx
        
         | mrbluecoat wrote:
         | That's pretty cool. Any technical blog posts?
        
           | r3tr0 wrote:
           | we got a couple blog posts
           | 
           | https://yeet.cx/blog
        
       | leetrout wrote:
        | I built something on top of DuckDB last year but it never got
        | deployed. They wanted to trust Postgres.
        | 
        | I didn't use the in-browser WASM, but I did expose an API
        | endpoint that passed data exploration queries directly to the
        | backend, like a knock-off of what New Relic does. I also used
        | that same endpoint for all the graphs and metrics in the UI.
       | 
       | DuckDB is phenomenal tech and I love to use it with data ponds
       | instead of data lakes although it is very capable of large sets
       | as well.
        
         | whalesalad wrote:
          | A cool thing about DuckDB is that it can be embedded. We
          | have a data pipeline that produces a .duckdb file and puts
          | it on S3. The app periodically checks that asset's ETag and
          | pulls the file down when it changes. Most of our DB
          | interactions use PSQL, but we have one module that leverages
          | DuckDB and this file for reads. So it's definitely not
          | all-or-nothing.
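          | 
          | The refresh loop is roughly (sketch; names are hypothetical):
          | 
          |     let lastEtag: string | null = null;
          |     async function refreshIfChanged(url: string) {
          |       // HEAD request: just the metadata, not the whole file
          |       const head = await fetch(url, {method: 'HEAD'});
          |       const etag = head.headers.get('ETag');
          |       if (etag && etag !== lastEtag) {
          |         lastEtag = etag;
          |         const bytes = await (await fetch(url)).arrayBuffer();
          |         // ...hand the fresh .duckdb file to the reader module
          |       }
          |     }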
        
           | zenmac wrote:
            | Are you using pg_duckdb to embed it inside Postgres and
            | access it via psql or other PG clients?
        
         | victor106 wrote:
         | > data ponds instead of data lakes
         | 
         | What are data ponds? Never heard the term before
        
           | leetrout wrote:
           | Haha, my term. Somewhere between a data lake and warehouse -
           | still unstructured but not _everything_ in one place. For
           | instance, if I have a multi-tenant app I might choose to have
           | a duckdb setup for each customer with pre-filtered data
           | living alongside some global unstructured data.
           | 
           | Maybe there's already a term that covers this but I like the
           | imagery of the metaphor... "smaller, multiple data but same
           | idea as the big one".
        
       | didip wrote:
        | How... does it not blow up the browser's memory?
        
         | Copenjin wrote:
          | The UI element is a scrollable table with a fixed-size
          | viewport window, so memory shouldn't be a problem: they just
          | have to retrieve and cache a reasonable area around that
          | window, and old data can simply be discarded.
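          | 
          | i.e. something like (sketch; table/column are hypothetical):
          | 
          |     // Only materialize the rows around the visible window
          |     async function fetchWindow(first: number, count: number) {
          |       return conn.query(
          |         `SELECT * FROM cases
          |           ORDER BY id
          |           LIMIT ${count} OFFSET ${first}`);
          |     }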
        
       | barrenko wrote:
       | Where do I learn how to set up this sort of stuff? Trial and
       | error? I kinda never need it for personal projects (so far),
       | which always leads me to forget this stuff in between jobs kinda
       | quickly. Is there a decent book?
        
         | vikramkr wrote:
          | If you want to learn it, the best way is probably to come up
          | with a personal project idea that requires it specifically.
          | Idk how much you'd get out of a book, but you could always
          | do a side project with the specific goal of learning a
          | particular stack.
        
       | dtech wrote:
       | My company tried DuckDB-WASM + parquet + S3 a few months ago but
       | we ended up stripping it all out and replacing it with a boring
       | REST API.
       | 
        | On paper it seemed like a great fit, but it turned out the
        | WASM build doesn't have feature parity with the "normal"
        | variant, so the features that made us pick it, like support
        | for Parquet compression and lazy loading, weren't there. So it
        | ended up not having great performance while introducing a lot
        | of complexity, and it was also terrible for first-page-load
        | time due to needing the large WASM blob. Build pipeline
        | complexity was also inherently higher due to the dependency
        | and the data packaging needed.
       | 
       | Just something to be aware of if you're thinking of using it. Our
       | conclusion was that it wasn't worth it for most use cases, which
       | is a shame because it seems like such a cool tech.
        
         | mentalgear wrote:
         | > WASM build doesn't have feature-parity with the "normal"
         | variant
         | 
         | It's a good point, but the wasm docs state that feature-parity
         | isn't there - yet. It could certainly be more detailed, but it
         | seems strange that your company would do all this work without
         | first checking the feature-coverage / specs.
         | 
         | > WebAssembly is basically an additional platform, and there
         | might be platform-specific limitations that make some
         | extensions not able to match their native capabilities or to
         | perform them in a different way.
         | 
         | https://duckdb.org/docs/stable/clients/wasm/extensions
        
           | dtech wrote:
            | Note that the docs specifically mention parquet is
            | supported, but we found out the hard way that some
            | specific features turned out not to be supported with WASM
            | + parquet. I did a quick glance at the docs and could not
            | find references to that, so I'm not surprised it was
            | missed.
            | 
            | It was a project that exploited a new opportunity, so
            | time-to-market was the most important thing. I'm not
            | surprised these things were missed, and replacing the data
            | loading mechanism was maybe 1 week of work for 1 person,
            | so it wasn't that impactful a change later.
        
             | mentalgear wrote:
              | Fair point, thx for sharing your experience! You might
              | want to edit the duckdb-wasm docs in that regard to
              | alert others/the team of this constraint.
        
         | ludicrousdispla wrote:
          | DuckDB-WASM supports Parquet file decompression though, so
          | if you have a backend process generating the files it's a
          | non-issue.
          | 
          | How large was your WASM build? I'm using the standard
          | duckdb-wasm, along with JS functions to form the SQL
          | queries, and am not seeing onerous load times.
        
       | ngc6677 wrote:
        | A similar procedure is used on joblist.today
        | (https://github.com/joblisttoday) to fetch hiring companies
        | and their jobs, store them into SQLite and DuckDB, and
        | retrieve them on the client side with their WASM modules. The
        | databases are generated with a daily GitHub workflow and
        | hosted as artifacts on GitHub Pages.
        
       | bzmrgonz wrote:
        | This is brilliant, guys, omg this is brilliant. If you think
        | about it, freely available data always suffers from this
        | burden: "But but we don't make money, all this stuff is public
        | data by law, and the government doesn't give us a budget."
        | This solves the "can't afford it" problem of public agencies.
        
       ___________________________________________________________________
       (page generated 2025-11-01 23:01 UTC)