[HN Gopher] Show HN: Open-source, browser-local data exploration...
___________________________________________________________________
Show HN: Open-source, browser-local data exploration using DuckDB-
WASM and PRQL
Hey HN! We've built Pretzel, an open-source data exploration and
visualization tool that runs fully in the browser and can handle
large files (a 200 MB CSV is snappy on my 8 GB MacBook Air). It's
also reactive - so if, for example, you change a filter, all the
data transform blocks after it re-evaluate automatically.

You can try it here: https://pretzelai.github.io/ (a statically
hosted webpage) or see a demo video here:
https://www.youtube.com/watch?v=73wNEun_L7w You can play with the
demo CSV that's pre-loaded (GitHub data on text-editor-adjacent
projects) or upload your own CSV/XLSX file. The tool runs fully
in-browser--you can disconnect from the internet once the website
loads--so feel free to use sensitive data if you like.

Here's how it works: You upload a CSV file and then explore your
data as a series of successive data transforms and plots. For
example, you might: (1) Remove some columns; (2) Apply some filters
(remove nulls, remove outliers, restrict the time range, etc.); (3)
Do a pivot (i.e., a group-by, but fancier); (4) Plot a chart; (5)
Download the chart and the transformed data. See screenshot:
https://imgur.com/a/qO4yURI

In the UI, each transform step appears as a "Block". You can always
see the result of the full transform in a table on the right. The
transform blocks are editable - for instance, in the example above,
you can go to step 2, change some filters, and the reactivity will
take care of re-computing all the cells that follow, including the
charts.
We wanted Pretzel to run locally in the browser _and_ be extremely
performant on large files. So, we parse CSVs with the fastest CSV
parser we could find (uDSV: https://github.com/leeoniya/uDSV) and
use DuckDB-Wasm (https://github.com/duckdb/duckdb-wasm) to do all
the heavy lifting of processing the data. We also wanted to allow
for chained data transformations, where each new block operates on
the result of the previous block. For this, we're using PRQL
(https://prql-lang.org/) since it maps 1-1 to chained data
transform blocks - each block maps to a chunk of PRQL which, when
combined, describes the full data transform chain. (PRQL doesn't
support DuckDB's PIVOT statement though, so we had to make some
CTE-based hacks.)
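To make the block-to-PRQL idea concrete, here's a minimal sketch in
TypeScript. The block contents and variable names are invented for
illustration (this is simplified, not our exact code), but `compile`
is the real prql-js API and the resulting SQL runs on a DuckDB-Wasm
connection:

    import { compile } from "prql-js";

    // Each UI block contributes one PRQL fragment; joining them yields a
    // single pipeline that compiles to one SQL query for DuckDB-Wasm.
    const blocks = [
      `from uploaded_csv`,                    // the uploaded table
      `select { repo, stars, created_at }`,   // (1) keep some columns
      `filter stars > 100`,                   // (2) a filter
      `group { created_at } (aggregate { total = sum stars })`, // (3) a group-by
    ];

    const sql = compile(blocks.join("\n"));   // PRQL -> DuckDB SQL
    // const conn = await db.connect();       // db: an AsyncDuckDB instance
    // const result = await conn.query(sql!); // Arrow table of results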
There's also an AI block. This is the only (optional) feature that
requires an internet connection, but we're working on adding local
model support via Ollama. For now, you can use your own OpenAI API
key or use an AI server we provide (a GPT-4 proxy; it's loaded with
a few credits), specify a transform in plain English, and get back
the SQL for the transform, which you can edit.
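As a rough sketch of the shape of that step (the prompt wording and
helper function are illustrative, not exactly what we ship; the
endpoint and request shape are OpenAI's documented chat completions
API):

    // Turn a plain-English request plus the table schema into editable SQL.
    async function englishToSql(request: string, schema: string, apiKey: string) {
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
          model: "gpt-4",
          messages: [
            {
              role: "system",
              content: `You write DuckDB SQL. Table schema:\n${schema}\nReply with SQL only.`,
            },
            { role: "user", content: request },
          ],
        }),
      });
      const data = await res.json();
      return data.choices[0].message.content; // shown in the block for editing
    }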
Our roadmap includes allowing API calls to create new columns, a
SQL block with nice autocomplete features, and a Python block
(using Pyodide to run Python in the browser) that operates on the
results of the data transforms, much like a Jupyter notebook.
There are two of us and we've only spent about a week coding this
and fixing major bugs, so there are still some bugs to iron out.
We'd _love_ for you to try this and to get your feedback!
Author : prasoonds
Score : 114 points
Date : 2024-03-15 16:02 UTC (6 hours ago)
| dns_snek wrote:
| That looks great, I love all these new use cases for WASM that
| are starting to pop up.
|
| Are you planning on supporting parquet files any time soon? I'd
| love an easy way to just drop a parquet file on there and easily
| visualize time-series data with it (e.g. sensor readings).
|
| Small issue I noticed: When you chart time-series data, the
| timestamp axis is just represented as an integer rather than a
| meaningfully formatted timestamp.
| prasoonds wrote:
| Thanks for the feedback!
|
| > Are you planning on supporting parquet files any time soon?
|
| Yes. This is an oversight on our part - duckdb natively
| supports Parquet and Arrow files but we just haven't gotten
| around to adding upload support for those yet. It's a small
| change - landing in main tomorrow!
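|
| For reference, a minimal sketch of what that support might look
| like (the function and table names are invented; registerFileBuffer
| and read_parquet are real DuckDB-Wasm/DuckDB APIs):
|
|     import { AsyncDuckDB } from "@duckdb/duckdb-wasm";
|
|     // Register the dropped file's bytes in DuckDB's virtual filesystem,
|     // then query it directly with read_parquet.
|     async function loadParquet(db: AsyncDuckDB, file: File) {
|       const buf = new Uint8Array(await file.arrayBuffer());
|       await db.registerFileBuffer(file.name, buf);
|       const conn = await db.connect();
|       await conn.query(
|         `CREATE TABLE uploaded AS SELECT * FROM read_parquet('${file.name}')`
|       );
|       await conn.close();
|     }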
|
| On the bug, yes, we've been trying to parse timestamps but it's
| just been very finicky. I'll try to fix this! Thanks for the
| report.
| crowcroft wrote:
| Oh man this is cool. I don't really have much feedback tbh, I
| just think this is awesome. I can see myself using this as a
| handy little utility when data sets get a bit too big for Google
| Sheets.
|
| In terms of it being a product, here are things I can think of
| that would be useful to me:
|
| - Importing/exporting data from Google Sheets/Google Drive.
|
| - Scheduling queries to run and export etc.
|
| Not sure if that really aligns with this specific project, given
| that it runs locally, though.
|
| Edit: Having a few standard dimension tables available to join
| would be great too. I often want to join time from a date
| dimension table when I'm making time series data for example.
| prasoonds wrote:
| That's great feedback, thanks!
|
| This tool definitely comes from a place of personal need -
| beyond just handling large files, I've also never really gelled
| well with the Excel/Google Sheet model of changing data in
| place as if you were editing text. I'm a data scientist and have
| always preferred the chained data transforms you see in things
| like dplyr (https://dplyr.tidyverse.org/) or Polars
| (https://pola.rs/) and I feel this tool maps very closely to
| the chained model.
|
| Also, thank you for the feature requests! Those would all be
| very useful - we'll put them on the roadmap.
| xnx wrote:
| Very cool. Would love to be able to save/share my work (could
| probably even be in the url string if the CSV were online).
| ramonverse wrote:
| Ramon from Pretzel here, thank you for your feedback! We
| definitely have this on the roadmap:
|
| 1. Allowing you to download/upload workflows as files (maybe
| JSON, or PRQL)
|
| 2. Allowing you to share private URLs by attaching the workflow
| as a URL parameter
|
| We are building a roadmap upvoting site; we'll add it to the
| README as soon as it's ready.
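|
| A sketch of how (2) might work (purely illustrative, not our
| actual implementation; assumes the serialized workflow is small
| and ASCII-safe):
|
|     // Encode a workflow object into a shareable URL fragment.
|     function shareUrl(workflow: object): string {
|       const encoded = encodeURIComponent(btoa(JSON.stringify(workflow)));
|       return `${location.origin}${location.pathname}#w=${encoded}`;
|     }
|
|     // On page load, the inverse recreates the workflow.
|     function loadWorkflow(): object | null {
|       const w = new URLSearchParams(location.hash.slice(1)).get("w");
|       return w ? JSON.parse(atob(w)) : null;
|     }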
| DannyPage wrote:
| I was recently searching for something like this - the ability to
| write PRQL queries on mobile is my ideal use case. As you note,
| PRQL is perfect for putting together queries with drop-downs, so
| I'm a huge fan of the UI that you've put together. Feature
| request: Exposing the PRQL would be great, and using tabs to
| switch between the query and results would make it possible to
| work on mobile. Lastly, bookmarking those queries for later use
| would be great too!
| prasoonds wrote:
| Yes! I'm heartened to see people realize the benefits of
| PRQL and chained transforms over vanilla SQL (with some
| caveats!).
|
| We'll definitely add a PRQL/SQL block to the UI soon (you will
| be able to toggle to select which one you'd like).
|
| On that point - PRQL doesn't natively support PIVOT statements.
| As a hack, we made our own flavor of PRQL with a PIVOT
| statement and we parse that to SQL CTEs to make it work for
| now. I'll be submitting a pull request to PRQL to add PIVOT
| support so we can drop this workaround.
|
| We're working on saving queries to local storage as well as
| sharing via downloadable config files (we're thinking a PRQL
| file with some sugar that when uploaded, recreates the entire
| transform)
| maxloh wrote:
| It would be good to have EdgeQL too.
| prasoonds wrote:
| Oh this looks really interesting! I hadn't heard of
| EdgeDB/EdgeQL - I'll have to do a deeper read, but at first
| blush, it looks like EdgeQL can be compiled down to SQL
| based on this HN comment:
| https://news.ycombinator.com/item?id=30296669
|
| I'll take a look, and if it's simply a matter of swapping the
| PRQL compiler for the EdgeQL compiler, then we should be able
| to land it in main in short order. It's not the highest
| priority thing right now though.
| d1sxeyes wrote:
| For folks looking for a workaround in the meantime, you can
| add an AI block and it will give you the returned SQL to
| edit.
|
| Not ideal because it's just wasting credits, but it worked
| for me.
| maximilianroos wrote:
| [PRQL dev here]
|
| Love what you've done and thanks for building on PRQL!
|
| We're very happy to add something like a PIVOT statement. In
| the long term, we'd have to think about how to make it work
| deeply in PRQL, since the column names are suddenly runtime
| dependent. In the short term, we should definitely make PRQL
| work for your case -- it's important that there's an escape
| hatch for things that aren't natively supported by PRQL yet.
| (We have s-strings but I'm guessing they don't cover this
| specific case?)
| prasoonds wrote:
| Hey Max, love what you all are doing with PRQL! It's a
| wonderful tool.
|
| That makes sense. I ran into some old issues about PIVOT, but
| it seems it hasn't been implemented yet. And yes,
| s-strings wouldn't work - the DuckDB pivot statement looks
| like this:
|
|     PIVOT <dataset> ON <columns> USING <values> GROUP BY <rows>
|
| So, we'd have to pass the `dataset` _so far_ into the pivot
| statement. This is where CTEs come in handy, but happy to
| hear if there's a better solution :)
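|
| For concreteness, here's a sketch of that workaround (table and
| column names are invented; a temp view plays the role of the
| CTE here, and the PIVOT syntax is DuckDB's):
|
|     import { AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";
|
|     // Materialize the SQL compiled from all blocks before the pivot
|     // under a name, then run DuckDB's PIVOT against that name.
|     async function pivotChain(conn: AsyncDuckDBConnection, chainSql: string) {
|       await conn.query(`CREATE OR REPLACE TEMP VIEW chain_so_far AS ${chainSql}`);
|       return conn.query(
|         `PIVOT chain_so_far ON month USING sum(sales) GROUP BY region`
|       );
|     }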
| fillipvt wrote:
| this is great, thanks for sharing
|
| I'm currently using a self-hosted instance of Lightdash
| connected to a dbt project, and I can see this being really
| efficient for data exploration for business users.
|
| quite interesting!
| prasoonds wrote:
| Thank you. We're definitely looking to get to a point where you
| don't need to jump to a Jupyter notebook for simple analyses,
| especially things that are hard to do in SQL (a basic linear
| regression model, for example).
|
| Happy to chat at prasoon [at] withpretzel [dot] com if you need
| any integration help!
| timhh wrote:
| Neat. I did a similar thing for analysing the output of commands
| that can produce JSON (SLURM in this case). Worked pretty well
| but I immediately ran into PRQL's lack of support for DuckDB's
| struct types.
| prasoonds wrote:
| Yes, there are definitely some downsides (which honestly, we
| didn't realize when we started building this). For us, it was
| the lack of a PIVOT statement, so I had to make a modified
| dialect of PRQL that supports PIVOT, then split the final
| PRQL into PIVOT and non-PIVOT chunks, and then convert it to
| chained SQL CTEs. Annoying work for sure.
|
| As a result, I'll be adding support for PIVOTs very soon,
| either in the main repo or just in a temporary fork.
| gnishemcknight wrote:
| Nice to see DuckDB-Wasm get some love. Curious, from a pure
| performance point of view, whether you have performed any
| benchmarking against any of the alternatives for this sort of
| in-browser data work?
| prasoonds wrote:
| That's an interesting point! No, we haven't really - for now, I
| asked a bunch of friends to send me large CSVs. I load them
| into Google Sheets to see how they perform (usually terribly -
| it crashes my browser tab or is unbearably slow). Then I try
| the CSV in Pretzel with some filters and pivots, and if I don't
| feel any annoyance with the speed, it's a success in my book
| for now.
|
| We'll definitely need to figure out (a) the right thing to
| measure, and (b) automated testing/CI to catch regressions and
| track performance.
| MarcoConverta wrote:
| Just tried it out - thanks for sharing.
|
| We're a sales tech startup and I've had to look through large
| CSVs with prospecting information in the past - anything over 10
| MB crashes my browser. I use a Mac so I don't have Excel. This
| looks great for simple data manipulations. I tried out a large
| CSV I had, and it loaded without a problem.
|
| Quick bug report: The filter interface seems to be slow for me
| with the large file, though. Also, one feature that would be
| really helpful would be connecting this to a database and also
| some way to share my workflow/analysis.
| prasoonds wrote:
| Hey Marco, I'll take a look - filters should theoretically be
| fast; when you create a new filter, it simply runs a
| `select * from table limit 1` to get the column names.
|
| I wasn't sure whether you could query DBs directly from the
| browser, but it looks like you can!
| (https://github.com/alexanderguy/pgress) - will add it to
| roadmap!
| prasoonds wrote:
| Replying as I couldn't edit:
|
| Alright, did a bit of digging and the flamegraph points to a
| problem with the Table component. I'd hoped that BlueprintJS
| tables would be performant enough for our use case, but
| apparently not!
|
| We'll try to move to canvas-based rendering ASAP - that
| should fix any lags in the filter UI
| antman wrote:
| Can we bookmark or localStorage our ETL?
| prasoonds wrote:
| We're going to build this over the next week - it's a high
| priority item for us. Once done, you should be able to do two
| things:
|
| - Save existing workflows to localStorage (a little drawer on
| the left side of the screen)
|
| - Export workflows as a JSON file so you can share workflows
| with others by simply sending them the JSON file however you
| like
| tobilg wrote:
| Have a look at https://sql-workbench.com, it supports query
| sharing via URL, as well as sharing visualizations. Let me know
| if you have questions!
| thawab wrote:
| This is really promising - please look into embedded analytics.
| We are struggling with building our embedded analytics for SaaS.
| Right now we are doing it with Cube and Superset. All the
| alternatives are expensive or hard to implement. Being able to
| offload the analytics to the browser is a plus.
| mritchie712 wrote:
| What database are you using?
| thawab wrote:
| We use Postgres now; testing DuckDB/MotherDuck for analytics.
| prasoonds wrote:
| Thanks! This is interesting - just to make sure I'm
| understanding you correctly - you're trying to build analytics
| into your SaaS tool to offer it to your users, correct? And
| you're using Superset to build the analytics inside your SaaS
| app?
|
| I'll definitely look into it. For the moment, I know a team
| trying to solve this exact problem (they're not open-
| source/free but I think they're pretty inexpensive) - happy to
| connect you with them if you like :)
| dndn1 wrote:
| Looks great and thanks for sharing!
|
| You mentioned that you went to some length to implement pivots.
| How far do you think you will take the pivot feature and UI?
|
| For financial-type use cases it's pretty much a requirement to
| be able to map many fields on X/Y, be able to collapse them in
| the browser, and also show/control aggregations.
|
| Asking because many apps stop at the simplest level of pivot
| features and go to great lengths in other areas like more
| advanced visualization. For mapping many fields though, I think a
| good pivot table UI is like a secret weapon.
|
| I get that this isn't what you would focus on so much in week 1.
|
| Nice work + love the local-first and direct UI.
| prasoonds wrote:
| Right - I used to work at GS and they had this really great
| internal table display tool where you could simply drag a
| column from the top to the left side and it would Pivot +
| Collapse all fields on it (plus allow multi-level pivots).
| Then, you could look at the pivot table OR do a drill-down to
| see the root cause of why a value was so high.
|
| I really liked that interface and haven't really found anything
| near as usable yet. Perspective JS (a free library by JP
| Morgan, you can try it here:
| https://perspective.finos.org/block/) has some really cool
| functionality in this direction, but it has its own data
| processing engine, query language, rendering engine, etc., so
| we couldn't have used it for this project.
|
| But, this is an interesting thought - I definitely would want
| to see this functionality in Pretzel - the only question is how
| to prioritize this. If you know of any performant table
| libraries that support collapsible pivots out-of-the-box, I'd
| love to integrate that. Alternatively, we'll write our own!
| dndn1 wrote:
| One that I'm watching is the Table mark in Graphic Walker
| [1].
|
| There is a license thing about logos to note if you do use it
| in Pretzel [2].
|
| On GW [1], go to Create Dataset -> Public Datasets -> Student
| Performance, then change the Mark Type to Table and you can
| play around. It hits the things that I mentioned pretty well!
| You
| might have similar issues as you have with Perspective
| though.
|
| I make calculang [3] and I'm getting ready to plug into a lot
| of things. A good pivot table generalizes well for my needs
| (mainly quick feedback during DX).
|
| Perspective is on the list but so is GW and Pretzel.
|
| Perspective might suit my needs perfectly.
|
| But I like DuckDB-WASM approach anyway so I hope you continue
| and I hope you nail it one way or another! :)
|
| [1] https://graphic-walker.kanaries.net
|
| [2] https://github.com/Kanaries/graphic-walker/issues/330
|
| [3] https://calculang.dev
| antonycourtney wrote:
| Very impressive project and vision! Love the demo!
|
| I am also ex-GS and worked on what I am fairly sure is the
| table display tool you're describing. I tried to carry the
| essential aspects of that work (multi-level pivots, with
| drill-down to the leaf level, and all interactive events and
| analytics supported by db queries) to Tad
| (https://www.tadviewer.com/,
| https://github.com/antonycourtney/tad), another open source
| project powered by DuckDb.
|
| An embeddable version of Tad, powered by DuckDb WASM, is used
| as the results viewer in the MotherDuck Web UI
| (https://app.motherduck.com/).
|
| If you're interested in embedding Tad in Pretzel, or
| leveraging pieces of it in your work, or collaborating on
| other aspects of DuckDb WASM powered UIs, please get in
| touch!
| tobilg wrote:
| The DuckDB WASM space is really heating up! I released
| https://sql-workbench.com a few weeks ago, which can be used to
| query and visualize Parquet, CSV, JSON and Arrow data.
|
| There's also an accompanying tutorial blog post at
| https://tobilg.com/using-duckdb-wasm-for-in-browser-data-eng...
| chrisjc wrote:
| I'd just like to thank you (again) for making a lot of what's
| required to get DuckDB up and running (esp. on AWS) much
| easier.
|
| Interesting to watch the similarities (duckdb wasm) and
| differences (python vs js) between your SQL IDEs in the
| browser. Exciting stuff.
| tobilg wrote:
| Thank you! It's really great that the in-browser data
| analysis ecosystem is firing up... Excited to see what
| others build
| prasoonds wrote:
| Oh hey! I remember seeing your HN post several weeks ago, I
| think - it's a great tool, thanks for building this! It was
| definitely an inspiration for us :)
| tobilg wrote:
| Thanks and congrats on the launch!
| logical_person wrote:
| 200 MB of data is not a large file, and Chromium tabs have a
| memory limit of something ridiculously low, so actual large
| 20-100 GB datasets render this useless.
| yawnxyz wrote:
| Sometimes you need a scooter, sometimes you need a truck.
|
| I think a snappy interface for <1 GB datasets is really neat
| and super useful for certain kinds of data.
| prasoonds wrote:
| This echoes my thoughts exactly. Right now, we're actually
| more limited by the JS UI, so a couple hundred MB is the most
| you can do in a browser before the UI becomes really slow.
| There's a lot of room for improvement - we're using React and
| that's causing a bunch of unneeded re-renders right now. We
| probably need to create our own DAG-based task management
| system and use Canvas to render everything - with all that,
| workflows on much larger files will hopefully become usable.
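|
| As a minimal illustration of that DAG idea (a linear chain is
| the simplest case; this is a sketch, not our actual code):
|
|     // When block i is edited, every downstream block re-evaluates in
|     // order, each one re-running the chain compiled up to that block.
|     type Block = { prql: string; result?: unknown };
|
|     async function onBlockEdited(
|       blocks: Block[],
|       edited: number,
|       run: (prqlSoFar: string) => Promise<unknown>,
|     ) {
|       for (let i = edited; i < blocks.length; i++) {
|         const prqlSoFar = blocks.slice(0, i + 1).map(b => b.prql).join("\n");
|         blocks[i].result = await run(prqlSoFar);
|       }
|     }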
| slaymaker1907 wrote:
| I looked up the limit, and as of 2021, tabs seem to have been
| limited to 16 GB, which is moderate in size for an in-memory
| dataset. However, I know WASM has a hard limit of 4 GB without
| Memory64. Data size is all relative.
| prasoonds wrote:
| This is certainly true - I'm not saying "large file" in the
| colloquial sense of "big data", but rather as in: a file you
| might want to open in Excel/Google Sheets. I've worked with
| actual large datasets - upwards of 500 GB - pretty often
| before, and I really wouldn't think about using my laptop for
| such a thing!
|
| We are thinking of making data connectors to major DBs though
| so you should be able to do a similar style visual analysis
| while keeping the compute on your DB.
| ThreeMoonsAreUp wrote:
| I enjoyed your demo video and played around with a CSV of my own.
| This is a cool project, but I have a (perhaps stupid) question:
|
| Why should I use this tool over Excel or Google Sheets?
|
| I don't mean this question to be a challenge. Just trying to
| understand which sorts of tasks I'd be better off using
| Pretzel for, as opposed to a traditional spreadsheet tool.
| prasoonds wrote:
| Hey, absolutely - that's a perfectly fair question. I can give
| you my personal reasons:
|
| - I don't have Excel on my Mac so I need to use Google Sheets.
| Google Sheets crashes on any file over 30-40 MB (at least on my
| MacBook Air).
|
| - I think this type of chained data transform is a much better
| (more code-like?) way of doing transforms. You can see exactly
| how you got to your end result. By contrast, in Google Sheets,
| you're modifying data in place and your transform history is
| only accessible via Undo and Redo. Short of using a Python
| script, this is far more reproducible.
|
| (I've worked as a data scientist for a while and I suppose that
| influences how I think about chained data transforms vs the in-
| place Excel way too!)
|
| But this is just for now - the long-term vision is being able
| to easily switch between visual no-code blocks, SQL, and Python
| in the same browser-local "notebook"/"workbook" that's easily
| shareable. I think it doesn't take much convincing that such a
| tool would be more powerful than Excel for certain workflows!
| auntienomen wrote:
| This is quite nice. Any plans to add simpler ML tools? A sklearn
| plugin could be as valuable as LLM access.
| prasoonds wrote:
| Thank you! Yes, one of the items on the roadmap is support for
| Pyodide (https://github.com/pyodide/pyodide) for running in-
| browser Python on the results of each of the code blocks! This
| should allow most ML libs to be usable in-browser! This is
| pretty high up on our priority list.
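|
| As a sketch of that direction (loadPyodide, loadPackage, and
| runPythonAsync are real Pyodide APIs; the wiring to a block's
| output columns is invented for illustration):
|
|     import { loadPyodide } from "pyodide";
|
|     // Fit a scikit-learn model in the browser on two numeric columns
|     // taken from a block's result.
|     async function fitLinearModel(xs: number[], ys: number[]) {
|       const pyodide = await loadPyodide();
|       await pyodide.loadPackage(["numpy", "scikit-learn"]);
|       pyodide.globals.set("xs", pyodide.toPy(xs));
|       pyodide.globals.set("ys", pyodide.toPy(ys));
|       const py = [
|         "import numpy as np",
|         "from sklearn.linear_model import LinearRegression",
|         "X = np.array(xs).reshape(-1, 1)",
|         "LinearRegression().fit(X, np.array(ys)).coef_.tolist()",
|       ].join("\n");
|       return await pyodide.runPythonAsync(py); // the fitted coefficients
|     }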
| hitradostava wrote:
| Hey congrats on the Show HN. Local, browser based data
| exploration works for a lot of use cases and is so much faster
| than cloud-based tools. We've implemented something similar at
| https://addmaple.com/ - but with a graphical interface designed
| for rapid exploratory data analysis of large datasets.
|
| Memory per tab can be an issue for really big files (1 GB+),
| but we're exploring a transform to CBOR which allows us to free
| up JS memory, i.e. when parsing CBOR we can leave row-level data
| as Uint8Array and it doesn't increase the JS memory overhead.
| prasoonds wrote:
| Thanks! Maple looks really cool - really interesting demo
| video, too!
|
| This is quite interesting - we've not explored really large
| files so far and, being honest, we haven't thought that far
| either. Didn't know about CBOR! I'll have to look deeper into
| how it can save on memory. I was wondering though: since WASM
| memory is limited to 4 GB, if I have sufficiently large memory
| on my device, at least one tab should be able to handle 1 GB+
| files too, correct?
| hitradostava wrote:
| Thanks :-)
|
| I've not done much profiling on DuckDB and what the overhead
| is - i.e. after the data is parsed how much memory is used.
| Would be really interesting to push it to the limit - or to
| explore not loading the entire file in, but only reading the
| relevant parts, but again that probably requires a conversion
| first, e.g. to Parquet or some other column-based storage
| format.
| prasoonds wrote:
| that sounds like a fun project :) maybe I'll take a stab at
| it!
| bbkane wrote:
| Wow! Is there any way this could support SQLite databases?
| chrisjc wrote:
| I haven't taken a close enough look at it yet, but DuckDB
| supports SQLite databases, so in theory, yes.
|
| However, using SQLite requires a DuckDB extension and I'm not
| sure if it works for WASM, and if so, whether it's enabled in
| this project in particular.
|
| https://duckdb.org/docs/guides/import/query_sqlite
|
| But I believe that most of these DuckDB web-IDEs will revolve
| around a couple of generalized use cases:
|
| 1. OLAP/analytical-type workloads/queries. SQLite is really
| more for OLTP/transactional workloads.
|
| 2. Querying datasets that are available on, or targeted for,
| blob/object storage like S3: Parquet, CSV, line-delimited JSON,
| etc.
| prasoonds wrote:
| DuckDB supports SQLite through extensions
| (https://duckdb.org/docs/extensions/sqlite.html)
|
| So, it should be pretty straightforward to let folks drop a
| SQLite file instead of a CSV file! I haven't been able to get
| extensions to work on WASM so far, but I'll definitely take a
| look to see if we can make this work!
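|
| If the extension does load under WASM, the query side is
| documented DuckDB SQL - sqlite_scan takes a file and a table
| name ('my_table' and the glue code here are invented):
|
|     import { AsyncDuckDB } from "@duckdb/duckdb-wasm";
|
|     // Register the dropped SQLite file, load the extension, scan a table.
|     async function querySqlite(db: AsyncDuckDB, file: File) {
|       const buf = new Uint8Array(await file.arrayBuffer());
|       await db.registerFileBuffer(file.name, buf);
|       const conn = await db.connect();
|       await conn.query(`INSTALL sqlite`);
|       await conn.query(`LOAD sqlite`);
|       return conn.query(`SELECT * FROM sqlite_scan('${file.name}', 'my_table')`);
|     }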
___________________________________________________________________
(page generated 2024-03-15 23:00 UTC)