[HN Gopher] Smallpond - A lightweight data processing framework ...
___________________________________________________________________
Smallpond - A lightweight data processing framework built on DuckDB
and 3FS
Author : overflowcat
Score : 131 points
Date : 2025-02-28 01:56 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| fastasucan wrote:
| What does this do - what is the benefit over DuckDB, Polars etc?
| articsputnik wrote:
| Mehdi just wrote about this. Mainly it runs DAGs in parallel
| using Ray (core) and their filesystem, 3FS. See
| https://mehdio.substack.com/p/duckdb-goes-distributed-
| deepse....
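
A minimal sketch of the DAG-parallelism idea mentioned above: split the data into partitions, run an independent task per partition, then merge in a final node that depends on all of them. This is not smallpond's API and does not use Ray; a standard-library thread pool stands in for the scheduler, and the names `partition`, `partial_sum`, and `run_dag` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n):
    """Fan-out stage: split rows into n round-robin partitions."""
    return [rows[i::n] for i in range(n)]


def partial_sum(part):
    """Leaf task: aggregate one partition independently of the others."""
    return sum(part)


def run_dag(rows, n_partitions=4):
    """Run a two-level DAG: parallel leaf tasks, then a single merge node."""
    parts = partition(rows, n_partitions)
    # Leaf tasks have no dependencies on each other, so they can run in parallel.
    # (Assumption: a local thread pool stands in for a distributed scheduler.)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum, parts))
    # The merge node depends on every leaf task and combines their results.
    return sum(partials)


if __name__ == "__main__":
    print(run_dag(list(range(1000))))
```

In a real deployment the leaf tasks would presumably be SQL queries over file partitions rather than in-memory sums; the shape of the graph is the point here.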
| mritchie712 wrote:
| I don't think you get any real benefits over DuckDB unless
| your data is 10 TB+ or you spin up 3FS (which seems challenging).
| HackerThemAll wrote:
| DuckDB itself is cool enough, especially when combined with
| SQLite and/or PostgreSQL, and now this. Thanks DeepSeek!
| orlp wrote:
| One thing I found peculiar is that for the GraySort benchmark it
| dispatches to Polars by default to do the actual sorting, not
| DuckDB: https://github.com/deepseek-
| ai/smallpond/blob/ed112db42af4d0....
| shipp02 wrote:
| Is the code written by the DeepSeek model?
|
| I should probably give up on being a software engineer if it is.
| breadwinner wrote:
| Give up and become what? Most white collar jobs will be
| automated in the coming years. You think doctors' jobs are
| safe?
| ezst wrote:
| Not OP, but, anything that actually physically affects the
| real world for the better? For instance, large infrastructure
| engineering and construction projects are not going to run
| themselves any time soon. The world doesn't revolve around ad
| and fin tech.
| rscho wrote:
| Yes, doctors are safe. Because they do things. With their
| hands. That no one else does.
| delfinom wrote:
| Nope.
|
| Healthcare megacorps are buying up independent practices
| like crazy. All because doctors can't keep up with the
| bullshit IT required for insurance, state mandates, etc and
| that's in addition to the insanity of even renting
| commercial real estate for an office these days.
|
| These megacorps set quotas and push doctors to nickel and
| dime like crazy. They sure as shit will spend the money to
| find robots that can give you a prostate exam with a robot
| dildo.
| mdaniel wrote:
| Sounds good; if all these pro-AI folks could get it to
| complete the insurance paperwork that'd be swell.
| Actually, come to think of it, do that for the paperwork
| from both sides, doctor and patient, and eliminate an
| entire class of leeches upon humanity
|
| I'm going to laugh if DOGE eliminates the IRS, but also
| might be thankful
| rscho wrote:
| Don't laugh too quickly, because what you describe is
| already happening: models are used to design processes
| allowing insurance corps to deny claims optimally, while
| on the other side models write your claims. If I were
| you, I wouldn't be laughing. If you are laughing, then
| you don't see where this is going to take us.
| tyre wrote:
| Join us at https://www.camber.health/ if you want to help
| fix this.
|
| We build software that automates insurance billing for
| clinics.
|
| And yes, the sentiment is correct that the burden of
| insurance encourages consolidation in healthcare.
| Wrapping that away (i.e. Stripe for healthcare financial
| infra) lowers the barrier to entrepreneurship.
| rscho wrote:
| Except the tech to do that is not there, and we're quite
| far from it. It's one thing to have a robot write text,
| it's a whole other thing to have a robot perform at human
| level in medical procedures. Not happening tomorrow.
| aragonite wrote:
| > Because they do things. With their hands. That no one
| else does
|
| That's only true of surgeons :) What if your specialty is
| nonsurgical (internal medicine, pediatrics, psychiatry,
| etc)?
| rscho wrote:
| Almost all specialties do various technical procedures
| that only they really know how to do. The extreme is
| psychoanalytic psychiatry, whose practitioners are the only
| ones really doing nothing with their hands (yes, interventional
| psychiatry is a thing...). Now, you could argue that
| 'yes, but most of the time it's done by techs/nurses'.
| Well, no. When things go south, and in all places where
| there is no one else to do the stuff (of which there are
| many), docs are on their own.
|
| Regarding surgery, I expect it to be one of the easiest
| procedures to automate, actually (still quite hard,
| obviously). Because surgery is the only case where
| there's always advanced imaging available beforehand, and
| the environment is relatively fixed (OR).
| ghc wrote:
| Uh, pediatricians do a lot with their hands. I don't
| think my kids (or future grandkids) will be seeing an
| AI/robot doctor.
| downrightmike wrote:
| Not even true of all surgeons, the ones that make the
| most money use machines to work on things their hands
| couldn't do
| rscho wrote:
| Haha. Have you actually ever seen a surgical robot
| yourself? Your claim is laughable. There is no automation
| whatsoever in any robot on the market currently.
| downrightmike wrote:
| not automation, yet
| mdaniel wrote:
| Also, a hallucination for 'SELECT mising_field FROM
| borgus_tuble' is one thing, hallucinating that taking a
| dose of Cl Na O along with CH3 CO2 H will cure covid is
| another thing entirely
| sramam wrote:
| This is so funny!
|
| However it can't even be called hallucinating. Imagine
| the incident "postmortem": But the AI was trained on
| White House press briefings
|
| Made my day...
| didntknowyou wrote:
| You can already google the information; the majority of a
| doctor's value is not in their information but in their people
| and technical skills.
| rscho wrote:
| Well, googling the info is one thing. But today, medicine
| is still mostly a know-how profession. Residency is there
| mostly to transmit know-how.
| cavisne wrote:
| There is a Chinese blog post from 2019 about 3FS so it predates
| DeepSeek [1]. It will be interesting to see the benchmarks but
| I suspect without 3FS smallpond is not that useful (the
| bottleneck would move to the networked file system).
|
| None of the big US clouds support InfiniBand broadly (Azure &
| Oracle have some support) so 3FS itself is not very useful to
| US companies who want to use public clouds.
|
| [1] https://www.high-flyer.cn/blog/3fs/
| rubenvanwyk wrote:
| May Data Engineering content keep on hitting front page HN!
| RyanHamilton wrote:
| If you want to check out DuckDB, try QStudio. It's a free SQL
| client with DuckDB integrated:
| https://www.timestored.com/qstudio/help/duckdb-sql-editor.
| Disclaimer: I'm the main author.
| maximilianroos wrote:
| Big fan of QStudio! Thanks for building it!
| lvl155 wrote:
| Looking forward to next few years when we can finally abstract
| away all the back-end techs.
| BobbyJo wrote:
| We ain't even solved garbage collection yet, and you think
| "back end systems" are going to be abstracted away in the next few
| years?
| tarruda wrote:
| > We ain't even solved garbage collection yet
|
| Can you elaborate on that?
| BobbyJo wrote:
| People still write in languages that force you to manage
| your own memory.
|
| Once performance starts to matter (either due to scale or
| time requirements) abstractions always have tradeoffs you
| can't accept.
| pyrolistical wrote:
| So then, how can garbage collection ever be solved if
| it's a trade-off?
| BobbyJo wrote:
| And how can backends be abstracted away if there is a
| trade off?
|
| As long as compute is a meaningful percentage of spend,
| the trade off will matter.
| pyrolistical wrote:
| Right. So what does it look like for garbage collection
| to be solved? You're saying it's not ever possible
| purplerabbit wrote:
| Maybe they just mean for the type of projects they care about
| BobbyJo wrote:
| Can't you already just use FaaS and managed persistence?
| threeseed wrote:
| We've had this for at least a decade now.
|
| If you use a cloud provider there are managed solutions for
| data engineering pipelines.
| dang wrote:
| Related ongoing thread:
|
| _Understanding Smallpond and 3FS_ -
| https://news.ycombinator.com/item?id=43232410
|
| also:
|
| _DuckDB goes distributed? DeepSeek's smallpond takes on Big
| Data_ - https://news.ycombinator.com/item?id=43206964 (no
| comments there, but some people have been recommending that
| article)
| jamesblonde wrote:
| We are seeing more and more specialized query engines. This is a
| query engine specialized for training pipelines. It is not
| general purpose - it is for providing batches of training data to
| workers. It uses Ray for parallelization. The kinds of queries you
| need are random reads (to implement shuffling across epochs),
| Arrow support (zero copy to Pandas DataFrames), and efficient
| checkpointing.
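
A rough sketch of why random reads and checkpointing matter for the training-pipeline use case described above. It assumes an indexable in-memory `dataset` as a stand-in for storage, and the helpers `epoch_order` and `batches` are hypothetical, not part of smallpond or Ray. Each epoch gets its own seeded permutation of row indices, every lookup is a random read, and a batch counter is enough state to resume from.

```python
import random


def epoch_order(n_rows, epoch, seed=0):
    """Return a deterministic permutation of row indices for one epoch."""
    # Derive a per-epoch seed so each epoch shuffles differently but reproducibly.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(n_rows))
    rng.shuffle(order)
    return order


def batches(dataset, epoch, batch_size, seed=0, start_batch=0):
    """Yield shuffled batches; start_batch resumes from a saved checkpoint."""
    order = epoch_order(len(dataset), epoch, seed)
    n_batches = (len(order) + batch_size - 1) // batch_size
    for b in range(start_batch, n_batches):
        idx = order[b * batch_size:(b + 1) * batch_size]
        # Each index lookup is a random read into the underlying storage.
        yield [dataset[i] for i in idx]


if __name__ == "__main__":
    data = list(range(10))
    for batch in batches(data, epoch=0, batch_size=3):
        print(batch)
```

Because the order is a pure function of `(seed, epoch)`, a checkpoint only needs to record the epoch and the next batch number; the cost is that reads are scattered, which is where a fast random-read filesystem would help.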
___________________________________________________________________
(page generated 2025-03-02 23:00 UTC)