[HN Gopher] Smallpond - A lightweight data processing framework ...
       ___________________________________________________________________
        
       Smallpond - A lightweight data processing framework built on DuckDB
       and 3FS
        
       Author : overflowcat
       Score  : 131 points
       Date   : 2025-02-28 01:56 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | fastasucan wrote:
       | What does this do - what is the benefit over DuckDB, Polars etc?
        
         | articsputnik wrote:
         | Mehdi just wrote about this. Mainly it adds DAG parallelism
         | using Ray (core) and their filesystem, 3FS. See
         | https://mehdio.substack.com/p/duckdb-goes-distributed-
         | deepse....
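
A minimal sketch of what "DAG parallelism" means in the comment above:
tasks whose dependencies are satisfied run concurrently, and a task
starts only after everything it depends on has finished. This is not
smallpond's or Ray's API. The function name run_dag and its arguments
are hypothetical, and a standard-library thread pool stands in for a
Ray cluster.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_workers=4):
    """Run tasks in dependency order, independent tasks in parallel.

    tasks: name -> callable taking {dependency name: result}
    deps:  name -> set of task names that must finish first
    (Hypothetical interface for illustration only.)
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()
    results = {}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies are all done.
            for name in sorter.get_ready():
                inputs = {d: results[d] for d in deps.get(name, set())}
                running[pool.submit(tasks[name], inputs)] = name
            # Block until at least one running task completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock downstream tasks
    return results


if __name__ == "__main__":
    # Two partitions are sorted independently, then merged.
    tasks = {
        "part_a": lambda _: [3, 1, 2],
        "part_b": lambda _: [6, 5, 4],
        "sort_a": lambda inp: sorted(inp["part_a"]),
        "sort_b": lambda inp: sorted(inp["part_b"]),
        "merge": lambda inp: sorted(inp["sort_a"] + inp["sort_b"]),
    }
    deps = {
        "sort_a": {"part_a"},
        "sort_b": {"part_b"},
        "merge": {"sort_a", "sort_b"},
    }
    print(run_dag(tasks, deps)["merge"])
```

In a distributed setting the pool would be replaced by remote workers
and the intermediate results would live on shared storage rather than
in a local dict.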
        
         | mritchie712 wrote:
         | I don't think you get any real benefits over DuckDB unless
         | your data is 10 TB+ or you spin up 3FS (which seems challenging).
        
       | HackerThemAll wrote:
       | DuckDB itself is cool enough, especially when combined with
       | SQLite and/or PostgreSQL, and now this. Thanks DeepSeek!
        
       | orlp wrote:
       | One thing I found peculiar is that for the GraySort benchmark it
       | dispatches to Polars by default to do the actual sorting, not
       | DuckDB: https://github.com/deepseek-
       | ai/smallpond/blob/ed112db42af4d0....
        
       | shipp02 wrote:
       | Is the code written by the deepseek model?
       | 
       | I should probably give up on being a software engineer if it is.
        
         | breadwinner wrote:
         | Give up and become what? Most white collar jobs will be
         | automated in the coming years. You think doctors' jobs are
         | safe?
        
           | ezst wrote:
           | Not OP, but, anything that actually physically affects the
           | real world for the better? For instance, large infrastructure
           | engineering and construction projects are not going to run
           | themselves any time soon. The world doesn't revolve around ad
           | and fin tech.
        
           | rscho wrote:
           | Yes, doctors are safe. Because they do things. With their
           | hands. That no one else does.
        
             | delfinom wrote:
             | Nope.
             | 
             | Healthcare megacorps are buying up independent practices
             | like crazy. All because doctors can't keep up with the
             | bullshit IT required for insurance, state mandates, etc and
             | that's in addition to the insanity of even renting
             | commercial real estate for an office these days.
             | 
             | These megacorps set quotas and push doctors to nickel and
             | dime like crazy. They sure as shit will spend the money to
             | find robots that can give you a prostate exam with a robot
             | dildo.
        
               | mdaniel wrote:
               | Sounds good; if all these pro-AI folks could get it to
               | complete the insurance paperwork that'd be swell.
               | Actually, come to think of it, do that for the paperwork
               | from both sides, doctor and patient, and eliminate an
               | entire class of leeches upon humanity
               | 
               | I'm going to laugh if DOGE eliminates the IRS, but also
               | might be thankful
        
               | rscho wrote:
               | Don't laugh too quickly, because what you describe is
               | already happening: models are used to design processes
               | allowing insurance corps to deny claims optimally, while
               | on the other side models write your claims. If I were
               | you, I wouldn't be laughing. If you are laughing, then
               | you don't see where this is going to take us.
        
               | tyre wrote:
               | Join us at https://www.camber.health/ if you want to help
               | fix this.
               | 
               | We build software that automates insurance billing for
               | clinics.
               | 
               | And yes, the sentiment is correct that the burden of
               | insurance encourages consolidation in healthcare.
               | Wrapping that away (i.e. Stripe for healthcare financial
               | infra) lowers the barrier to entrepreneurship.
        
               | rscho wrote:
               | Except the tech to do that is not there, and we're quite
               | far from it. It's one thing to have a robot write text,
               | it's a whole other thing to have a robot perform at human
               | level in medical procedures. Not happening tomorrow.
        
             | aragonite wrote:
             | > Because they do things. With their hands. That no one
             | else does
             | 
             | That's only true of surgeons :) What if your specialty is
             | nonsurgical (internal medicine, pediatrics, psychiatry,
             | etc)?
        
               | rscho wrote:
               | Almost all specialties do various technical procedures
               | that only they really know how to do. The extreme is
               | psychoanalytic psychiatry, whose practitioners are the
               | only ones really doing nothing with their hands (yes,
               | interventional psychiatry is a thing...). Now, you could
               | argue that 'yes, but most of the time it's done by
               | techs/nurses'. Well, no. When things go south, and in all
               | places where there is no one else to do the stuff (of
               | which there are many), docs are on their own.
               | 
               | Regarding surgery, I expect it to be one of the easiest
               | procedures to automate, actually (still quite hard,
               | obviously). Because surgery is the only case where
               | there's always advanced imaging available beforehand, and
               | the environment is relatively fixed (OR).
        
               | ghc wrote:
               | Uh, pediatricians do a lot with their hands. I don't
               | think my kids (or future grandkids) will be seeing an
               | AI/robot doctor.
        
               | downrightmike wrote:
               | Not even true of all surgeons; the ones that make the
               | most money use machines to work on things their hands
               | couldn't do
        
               | rscho wrote:
               | Haha. Have you actually ever seen a surgical robot
               | yourself? Your claim is laughable. There is no automation
               | whatsoever in any robot on the market currently.
        
               | downrightmike wrote:
               | not automation, yet
        
             | mdaniel wrote:
             | Also, a hallucination for 'SELECT mising_field FROM
             | borgus_tuble' is one thing, hallucinating that taking a
             | dose of Cl Na O along with CH3 CO2 H will cure covid is
             | another thing entirely
        
               | sramam wrote:
               | This is so funny!
               | 
               | However it can't even be called hallucinating. Imagine
               | the incident "postmortem":
               | 
               |     But the AI was trained on White House press briefings
               | 
               | Made my day...
        
           | didntknowyou wrote:
           | You can already google the information; the majority of a
           | doctor's value is not in their information but in their
           | people and technical skills.
        
             | rscho wrote:
             | Well, googling the info is one thing. But today, medicine
             | is still mostly a know-how profession. Residency is there
             | mostly to transmit know-how.
        
         | cavisne wrote:
         | There is a Chinese blog post from 2019 about 3FS, so it
         | predates DeepSeek [1]. It will be interesting to see the
         | benchmarks, but I suspect that without 3FS smallpond is not
         | that useful (the bottleneck would move to the networked file
         | system).
         | 
         | None of the big US clouds support InfiniBand broadly (Azure &
         | Oracle have some support), so 3FS itself is not very useful to
         | US companies who want to use public clouds.
         | 
         | [1] https://www.high-flyer.cn/blog/3fs/
        
       | rubenvanwyk wrote:
       | May Data Engineering content keep on hitting front page HN!
        
       | RyanHamilton wrote:
       | If you want to check out DuckDB, try QStudio. It's a free SQL
       | client with DuckDB integrated:
       | https://www.timestored.com/qstudio/help/duckdb-sql-editor.
       | Disclaimer: I'm the main author.
        
         | maximilianroos wrote:
         | Big fan of QStudio! Thanks for building it!
        
       | lvl155 wrote:
       | Looking forward to next few years when we can finally abstract
       | away all the back-end techs.
        
         | BobbyJo wrote:
         | We ain't even solved garbage collection yet, and you think
         | "back end systems" are going to be abstracted away in the next
         | few years?
        
           | tarruda wrote:
           | > We ain't even solved garbage collection yet
           | 
           | Can you elaborate on that?
        
             | BobbyJo wrote:
             | People still write in languages that force you to manage
             | your own memory.
             | 
             | Once performance starts to matter (either due to scale or
             | time requirements) abstractions always have tradeoffs you
             | can't accept.
        
               | pyrolistical wrote:
               | So then, how can garbage collection ever be solved if
               | it's a trade-off?
        
               | BobbyJo wrote:
               | And how can backends be abstracted away if there is a
               | trade off?
               | 
               | As long as compute is a meaningful percentage of spend,
               | the trade off will matter.
        
               | pyrolistical wrote:
               | Right. So what does it look like for garbage collection
               | to be solved? You're saying it's not ever possible
        
           | purplerabbit wrote:
           | Maybe they just mean for the type of projects they care about
        
             | BobbyJo wrote:
             | Can't you already just use FaaS and managed persistence?
        
         | threeseed wrote:
         | We've had this for at least a decade now.
         | 
         | If you use a cloud provider there are managed solutions for
         | data engineering pipelines.
        
       | dang wrote:
       | Related ongoing thread:
       | 
       |  _Understanding Smallpond and 3FS_ -
       | https://news.ycombinator.com/item?id=43232410
       | 
       | also:
       | 
       |  _DuckDB goes distributed? DeepSeek's smallpond takes on Big
       | Data_ - https://news.ycombinator.com/item?id=43206964 (no
       | comments there, but some people have been recommending that
       | article)
        
       | jamesblonde wrote:
       | We are seeing more and more specialized query engines. This is a
       | query engine specialized for training pipelines. It is not
       | general purpose - it is for providing batches of training data
       | to workers. It uses Ray for parallelization. The kinds of
       | queries you need are random reads (to implement shuffling across
       | epochs), Arrow support (zero copy to pandas DataFrames), and
       | efficient checkpointing.
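
A rough sketch of why shuffling across epochs turns into random reads,
and how a deterministic order makes checkpointing cheap. This is not
smallpond's API. The function name epoch_batches and its parameters are
hypothetical, and rows are represented only by their integer indices.

```python
import random


def epoch_batches(num_rows, batch_size, epoch, seed=0, start_batch=0):
    """Yield batches of row indices in an epoch-specific shuffled order.

    Every row appears exactly once per epoch. The order depends only on
    (seed, epoch), so a job can resume from a checkpoint by storing
    (epoch, batch number) and passing the batch number as start_batch.
    (Hypothetical interface for illustration only.)
    """
    order = list(range(num_rows))
    # Derive a per-epoch seed so each epoch gets its own permutation.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    starts = range(0, num_rows, batch_size)
    for start in starts[start_batch:]:
        # Each batch is a scattered set of row ids: the storage layer
        # must serve these as random reads, not one sequential scan.
        yield order[start:start + batch_size]


if __name__ == "__main__":
    for epoch in range(2):
        for batch in epoch_batches(num_rows=10, batch_size=4, epoch=epoch):
            print(epoch, batch)
```

Because the permutation is recomputed from the seed, the checkpoint
only needs two integers rather than the full shuffled order.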
        
       ___________________________________________________________________
       (page generated 2025-03-02 23:00 UTC)