[HN Gopher] A lost decade chasing distributed architectures for ...
___________________________________________________________________
A lost decade chasing distributed architectures for data analytics?
Author : andreasha
Score : 201 points
Date : 2025-05-19 08:39 UTC (3 days ago)
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
| drewm1980 wrote:
| I mean, not everyone spent their decade on distributed computing.
| Some devs with a retrogrouch inclination kept writing single-
| threaded code in native languages on a single node. Single core
| clock speed stagnated, but it was still worth buying new CPUs
| with more cores because they also had more cache, and all the
| extra cores are useful for running ~other people's bloated code.
| HPsquared wrote:
| High-frequency trading, gaming, audio/DSP, embedded, etc.
| There's a lot of room for that kind of developer.
| nyanpasu64 wrote:
| I find that good multithreading can speed up parallelizable
| workloads by 5-10 times depending on CPU core count, if you
| don't have tight latency constraints (and even games with
| millisecond-level latency deadlines are multithreaded these
| days, though real-time code may look different than general
| code).
| fulafel wrote:
| Related in the big-data-benchmarks-on-old-laptop department:
| https://www.frankmcsherry.org/graph/scalability/cost/2015/01...
| willvarfar wrote:
| I only retired my 2014 MBP ... last week! It started transiently
| not booting and then, after just a few weeks, it switched to
| only transiently booting. Figured it was time. My new laptop is
| actually a very budget buy, and not a mac, and in many things a
| bit slower than the old MBP.
|
| Anyway, the old laptop is about par with the 'big' VMs that I use
| for work to analyse really big BQ datasets. My current flow is to
| run the 0.001% of queries that don't fit on a box in BigQuery,
| massaging things with just enough prepping to make the
| intermediate result fit on a box. Then I extract that to parquet
| stored on the VM and do the analysis on the VM using DuckDB from
| python notebooks.
|
| DuckDB has revolutionised not what I can do but how I can do it.
| All the ingredients were around before, but DuckDB brings it
| together and makes the ergonomics completely different. Life is
| so much easier with joins and things than trying to do the same
| in, say, pandas.
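|
| For a sense of what that last step looks like, a rough sketch in
| Python (the parquet file name and column names are illustrative,
| not from the actual workflow):
|
|     import duckdb
|
|     con = duckdb.connect()  # in-memory database; pass a path to persist
|
|     # Query the extracted parquet files directly, no load step needed.
|     top_users = con.sql("""
|         SELECT user_id, count(*) AS n_events
|         FROM 'events_*.parquet'
|         WHERE event_date >= DATE '2024-01-01'
|         GROUP BY user_id
|         ORDER BY n_events DESC
|         LIMIT 10
|     """).df()  # back to a pandas dataframe for the notebook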
| Cthulhu_ wrote:
| I still have mine, but it's languishing. I don't know what to
| do with it or how to get rid of it; it doesn't feel like trash.
| The Apple stores do returns, but for this one you get nothing,
| they're just like "yeah, we'll take care of it".
|
| The screen started to delaminate on the edges, and its follow-
| up (an MBP with the touch bar) has a completely broken screen
| (probably just the connector cable).
|
| I don't have a use for it, but it feels wasteful just to throw
| it away.
| HPsquared wrote:
| eBay is pretty active for that kind of thing. Spares/repair.
| compiler-devel wrote:
| I have the same machine and installed Fedora 41 on it.
| Everything works out of the box, including WiFi and sound.
| mediumsmart wrote:
| I am on the late 2015 version and I have an ebay body stashed for
| when the time comes to refurbish that small data machine.
| selimthegrim wrote:
| Any good keywords to search?
| zkmon wrote:
| A database is not only about disk size and query performance. A
| database reflects the company's culture, processes, workflows,
| collaboration etc. It has an entire ecosystem around it - master
| data, business processes, transactions, distributed applications,
| regulatory requirements, resiliency, Ops, reports, tooling etc.
|
| The role of a database is not just to deliver query performance.
| It needs to fit into the ecosystem, serve the overall role on
| multiple facets, deliver on a wide range of expectations - tech
| and non-tech.
|
| While the useful dataset itself may not outpace the hardware
| advancements, the ecosystem complexity will definitely outpace
| any hardware or AI advancements. Overall adaptation to the
| ecosystem will dictate the database choice, not query
| performance. Technologies will not operate in isolation.
| zwnow wrote:
| No, a database reflects what you make out of it. Reports are
| just queries after all. I don't know what all the other stuff
| you named has to do with the database directly. The only
| purpose of databases is to store and read data, that's what it
| comes down to. So query performance IS one of the most
| important metrics.
| willvarfar wrote:
| And it's very much the tech culture at large that influences the
| company's tech choices. Those techies chasing shiny things and
| trying to shoehorn them into their jobs - perhaps cynically to
| pad their CVs or perhaps generously thinking it will actually be
| the right thing to do - have an outsized say in how tech teams
| think about tech and what they imagine their job is.
|
| Back in 2012 we were just recovering from the everything-is-XML
| craze and in the middle of the NoSQL craze, and everything was
| web-scale and distribute-first micro-services etc.
|
| And now, after all that mess, we have learned to love what came
| before: namely, please please please just give me sql! :D
| threeseed wrote:
| Why don't you just quietly use SQL instead of condescendingly
| lecturing others about how compromised their tech choices
| are?
|
| NoSQL (e.g. Cassandra, MongoDB) and microservices were invented
| to solve real-world problems, which is why they are still so
| heavily used today. And the criticism of them is exactly the
| same as was levelled at SQL back in the day.
|
| It's all just tools at the end of the day and there isn't one
| that works for all use cases.
| kukkeliskuu wrote:
| Around 20 years ago I was working for a database company.
| During that time, I attended SIGMOD, which is the top
| conference for databases.
|
| The keynote speaker for the conference was Stonebraker, who
| started Postgres, among other things. He talked about the
| history of relational databases.
|
| At that time, XML databases were all the rage -- now nobody
| remembers them. Stonebraker explained that there is nothing
| new in hierarchical databases. There was a significant
| battle in SIGMOD, I think somewhere in the 1980s (I forget
| the exact time frame), between network databases and
| relational databases.
|
| The relational databases won that battle, as they have won
| against each competing hierarchical database technology
| since.
|
| The reason is that relational databases are based on
| relational algebra. This has very practical consequences,
| for example you can query the data more flexibly.
|
| When you use JSON storage such as MongoDB, once you decide on
| your root entities you are stuck with that decision. I see
| very often in practice that new requirements you did not
| foresee will always come up, which you then need to work
| around.
|
| I don't care what other people use, however.
| threeseed wrote:
| MongoDB is a $2b/year revenue company growing at 20% y/y.
| JSON stores are not going anywhere; they are an essential
| tool for dealing in data where you have no control over
| the schema or you want to handle it in the application layer.
|
| And the only "battle" is one you've invented in your
| head. People who deal in data for a living just pick the
| right data store for the right data schema.
| lazide wrote:
| Sensitive much?
| pragmatic wrote:
| And SQL Server alone is like 5 billion/yr.
| threeseed wrote:
| Almost like there is room in the market for more than
| just SQL databases.
| znpy wrote:
| Ah yes MongoDB, it's web-scale!
| hobs wrote:
| Every person I know who has ever used Cassandra in prod has
| cursed its name. Mongo lost data for close to a decade, and
| microservices mostly are NOT used to solve real-world
| problems but instead used either as an organizational or
| technical hammer for which everything is a nail. Hell,
| there are entire books written about how you should cut people
| off from each other so they can "naturally" write microservices
| and hyperscale your company!!
| threeseed wrote:
| So all of this is just meaningless anecdote.
|
| Whereas the _fact_ is that Datastax and MongoDB are
| highly successful companies, indicating that in fact those
| databases are solving a real-world problem.
| DonHopkins wrote:
| You can always make your data bigger without increasing disk
| space or decreasing performance by making the font size larger!
| querez wrote:
| > The geometric mean of the timings improved from 218 to 12, a
| ca. 20x improvement.
|
| Why do they use the geometric mean to average execution times?
| willvarfar wrote:
| Squaring is a really good way to make the common-but-small
| numbers have bigger representation than the outlying-but-large
| numbers.
|
| I just did a quick google and the first real result was this blog
| post with a good explanation and some good illustrations:
| https://jlmc.medium.com/understanding-three-simple-statistic...
|
| It's the very first illustration at the top of that blog post
| that 'clicks' for me. Hope it helps!
|
| The inverse is also good: mean-square-error is a good way of
| comparing how similar two datasets (e.g. two images) are.
| yorwba wrote:
| The geometric mean of _n_ numbers is the _n_-th root of the
| product of all numbers. The mean square error is the sum of
| the squares of all numbers, divided by _n_ (i.e. the
| arithmetic mean of the squares). They're not the same.
| willvarfar wrote:
| I'm not gonna edit what I wrote but you are interpreting it
| way too literally. I was not describing the
| implementation of anything, I was just giving a link that
| explains why thinking about things in terms of area
| (geometry) is popular in stats. It's a bit like the epiphany
| that histograms don't need to be bars of equal width.
| ayhanfuat wrote:
| It's a way of saying twice as fast and twice as slow have equal
| effect on opposite sides. If your baseline is 10 seconds, one
| benchmark takes 5 seconds, and another one takes 20 seconds
| then the geometric mean gives you 10 seconds as the result
| because they cancel each other. The arithmetic mean would treat
| it differently because in absolute terms a 10-second slowdown
| is bigger than a 5-second speedup. But that is not fair to
| speedups, because the absolute speedup you can reach is at most
| 10 seconds while a slowdown has no limit.
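|
| A rough numerical check of that in Python (using the 10 s
| baseline and the 5 s / 20 s timings from above):
|
|     import math
|
|     baseline = 10.0
|     timings = [5.0, 20.0]  # one 2x speedup, one 2x slowdown
|     ratios = [t / baseline for t in timings]  # [0.5, 2.0]
|
|     # Arithmetic mean: 12.5 s, the slowdown dominates.
|     arithmetic = sum(timings) / len(timings)
|
|     # Geometric mean: 10.0 s, the speedup and slowdown cancel.
|     geometric = baseline * math.prod(ratios) ** (1 / len(ratios))
|
|     print(arithmetic, geometric)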
| tbillington wrote:
| This is the best explain-like-im-5 I've heard for geo mean
| and helped it click in my head, thank you :)
| zmgsabst wrote:
| But reality doesn't care:
|
| If half your requests are 2x as long and half are 2x as fast,
| you don't take the same wall time to run -- you take longer.
|
| Let's say you have 20 requests, 10 of type A and 10 of type
| B. They originally both take 10 seconds, for 200 seconds
| total. You halve A and double B. Now it takes 50 + 200 = 250
| seconds, or 12.5 seconds on average.
|
| This is a case where geometric mean deceives you - because
| the two really are asymmetric and "twice as fast" is worth
| less than "twice as slow".
| ayhanfuat wrote:
| There is definitely no single magical number that can
| perfectly represent an entire set of numbers. There will
| always be some cases where it is not representative enough. In
| the request example you are mostly interested in the total
| processing time, so it makes sense to use a metric
| based on addition. But you could also frame a similar
| scenario where halving the processing time lets you handle
| twice as many items in the same duration. In that case a
| ratio-based or multiplicative view might be more
| appropriate.
| zmgsabst wrote:
| Sure -- but the arithmetic mean also captures that case:
| if you only halve the time, it also will report that
| change accurately.
|
| What we're handling is the case where you have _split_
| outcomes -- and there the arithmetic and geometric mean
| disagree, so we can ask which better reflects reality.
|
| I'm not saying the geometric mean is always wrong -- but
| it is in this case.
|
| A case where it makes sense: what happens when your
| stock halves in value and then doubles in value?
|
| In general, the geometric mean is appropriate where effects
| are compounding (e.g., two price changes to the same stock)
| but not when we're combining (requests are handled
| differently). Two benchmarks are more like combining (do task A
| then task B) rather than compounding.
| rr808 wrote:
| Ugh, I have joined a big data team. 99% of the feeds are less than
| a few GB yet we have to use Scala and Spark. It's so slow to
| develop and slow to run.
| threeseed wrote:
| a) Scala, being a JVM language, is one of the fastest around.
| Much faster than, say, Python.
|
| b) How large are the other 1% of feeds, and what is the size of
| the total joined datasets? Because ultimately that is what you
| build platforms for. Not the simple use cases.
| rr808 wrote:
| 1) Yes, Scala and the JVM are fast. If we could just use that to
| clean up a feed on a single box that would be great. The
| problem is that calling the Spark API creates a lot of complexity
| for developers and a runtime platform which is super slow. 2)
| Yes, for the few feeds that are a TB we need Spark. The
| platform really just loads from Hadoop, transforms, then saves
| back again.
| threeseed wrote:
| a) You can easily run Spark jobs on a single box. Just set
| executors = 1.
|
| b) The reason centralised clusters exist is because you
| can't have dozens/hundreds of data engineers/scientists all
| copying company data onto their laptop, causing support
| headaches because they can't install X library and making
| productionising impossible. There are bigger concerns than
| your personal productivity.
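|
| For what it's worth, a minimal sketch of (a) in PySpark, using
| local mode (one way to keep everything on a single machine; the
| parquet path is made up):
|
|     from pyspark.sql import SparkSession
|
|     # Run the whole job in a single JVM on one box;
|     # local[*] uses all available cores.
|     spark = (
|         SparkSession.builder
|         .master("local[*]")
|         .appName("single-box-job")
|         .getOrCreate()
|     )
|
|     df = spark.read.parquet("/data/feed.parquet")  # illustrative path
|     df.groupBy("customer_id").count().show()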
| rr808 wrote:
| > a) You can easily run Spark jobs on a single box. Just
| set executors = 1.
|
| Sure but why would you do this? Just using pandas or
| duckdb or even bash scripts makes your life much
| easier than having to deal with Spark.
| cgio wrote:
| For when you need more executors without rewriting your
| logic.
| this_user wrote:
| Using a Python solution like Dask might actually be
| better, because you can work with all of the Python data
| frameworks and tools, but you can also easily scale it if
| you need it without having to step into the Spark world.
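|
| A rough sketch of what that looks like (paths and column names
| are illustrative); the same code runs on a laptop or, by pointing
| the Client at a scheduler address, on a cluster:
|
|     import dask.dataframe as dd
|     from dask.distributed import Client
|
|     # Starts a local cluster by default; pass a scheduler
|     # address here to scale out to real workers.
|     client = Client()
|
|     df = dd.read_parquet("data/*.parquet")
|     totals = df.groupby("customer_id")["amount"].sum().compute()
|     print(totals.head())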
| rpier001 wrote:
| Re: b. This is a place where remote standard dev
| environments are a boon. I'm not going to give each dev a
| terabyte of RAM, but a terabyte to share with a
| reservation mechanism understanding that contention for
| the full resource is low? Yes, please.
| Larrikin wrote:
| But can you justify Scala existing at all in 2025? I think it
| pushed boundaries but ultimately failed as a language worth
| adopting anymore.
| threeseed wrote:
| Absolutely.
|
| a) It is one of the only languages in which you can write your
| entire app, i.e. it supports compiling to JavaScript, the JVM and
| LLVM.
|
| b) It has the only formally proven type system of any
| language.
|
| c) It is the innovation language. Many of the concepts that
| are now standard in other languages had their
| implementation borrowed from Scala. And it is continuing to
| innovate with libraries like Gears
| (https://github.com/lampepfl/gears) which does async
| without colouring and compiler additions like resource
| capabilities.
| tomrod wrote:
| PySpark is a wrapper, so Scala is unnecessary and boggy.
| spark1377485 wrote:
| PySpark is great, except for UDF performance. This gap
| means that Scala is helpful for some Spark edge cases like
| column-level encryption/decryption with UDFs.
| Mortiffer wrote:
| The R community has been hard at work on small data. I still
| highly prefer working on in-memory data in R; dplyr and
| data.table are elegant and fast.
|
| The CRAN packages are all high quality: if the maintainer stops
| responding to emails for 2 months your package is automatically
| removed. Most packages come from university profs who have been
| doing this their whole career.
| wodenokoto wrote:
| A really big part of an in-memory, dataframe-centric workflow is
| how easy it is to do one step at a time and inspect the result.
|
| With a database it is difficult to run a query, look at the
| result and then run a query on the result. To me, that is what
| is missing in replacing pandas/dplyr/polars with DuckDB.
| IanCal wrote:
| I'm not sure I really follow; you can create new tables for
| any step if you want to do it entirely within the db, but you
| can also just run duckdb against your dataframes in memory.
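|
| Roughly like this, for example (DuckDB picks up pandas dataframes
| in scope by name; the data here is made up):
|
|     import duckdb
|     import pandas as pd
|
|     orders = pd.DataFrame({
|         "customer": ["a", "b", "a", "c"],
|         "amount": [10.0, 5.0, 7.5, 3.0],
|     })
|
|     # Step 1: query the in-memory dataframe and inspect the result...
|     per_customer = duckdb.sql(
|         "SELECT customer, sum(amount) AS total FROM orders GROUP BY customer"
|     ).df()
|     print(per_customer)
|
|     # ...step 2: query the intermediate result, itself just a dataframe.
|     top = duckdb.sql(
|         "SELECT * FROM per_customer ORDER BY total DESC LIMIT 1"
|     ).df()
|     print(top)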
| jgalt212 wrote:
| In R, data sources, intermediate results, and final results
| are all dataframes (slight simplification). With DuckDB, to
| have the same consistency you need every layer and step to
| be a database table, not a data frame, which is awkward for
| the standard R user and use case.
| datadrivenangel wrote:
| You can also use duckplyr as a drop-in replacement for
| dplyr. It automatically fails over to dplyr for unsupported
| behavior, and for most operations is notably faster.
|
| data.table is competitive with DuckDB in many cases,
| though as a DuckDB enthusiast I hate to admit this. :)
| wodenokoto wrote:
| You can, but then every step starts with a drop table if
| exists; insert into ...
| cess11 wrote:
| Or you nest your queries:
|
|     select second from (select 42 as first, (select 69) as second);
|
| Intermediate steps won't be stored, but as long as queries don't
| take a while to execute it's a nice way to do step-wise
| extension of an analysis.
|
| Edit: It's a rather neat and underestimated property of
| query results that you can query them in the next scope.
| jcheng wrote:
| Or better yet, use CTEs:
| https://duckdb.org/docs/stable/sql/query_syntax/with.html
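|
| i.e. something along these lines, instead of materialising each
| step as its own table (table and column names are made up):
|
|     import duckdb
|
|     result = duckdb.sql("""
|         WITH per_customer AS (
|             SELECT customer, sum(amount) AS total
|             FROM 'orders.parquet'
|             GROUP BY customer
|         ),
|         big_customers AS (
|             SELECT * FROM per_customer WHERE total > 100
|         )
|         SELECT count(*) AS n FROM big_customers
|     """).df()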
| PotatoNinja wrote:
| Krazam did a brilliant video on Small Data:
| https://youtu.be/eDr6_cMtfdA?si=izuCAgk_YeWBqfqN
| culebron21 wrote:
| A tangential story. I remember, back in 2010, contemplating the
| idea of completely distributed DBs inspired by then popular
| torrent technology. In this one, a client would not be different
| from a server, except by the amount of data it holds. And it
| would probably receive the data in torrents manner.
|
| What puzzled me was that a client would want others to execute
| its queries, but not want to load all the data and make queries
| for the others. And how to prevent conflicting update queries
| sent to different seeds.
|
| I also thought that Crockford's distributed web idea (where every
| page is hosted like on torrents) was a good one, even though I
| didn't think deeply about this one.
|
| Until I saw the discussion on web3, where someone pointed out
| that uploading any data to one server would make a lot of hosts
| do the job of hosting a part of it, and every small change
| would cause tremendous amounts of work for the entire web.
| mangecoeur wrote:
| Did my PhD around that time and did a project "scaling" my work
| on a Spark cluster. Huge PITA and no better than my local setup,
| which was an MBP15 with pandas and postgres (actually I
| wrote+contributed a big chunk of pandas read_sql at that time to
| make it postgres compatible using SQLAlchemy).
| jononor wrote:
| Thank you for read_sql with SQLalchemy/postgres! We use it all
| the time at our company:)
| simlevesque wrote:
| I'm working on a big research project that uses duckdb, I need a
| lot of compute resources to develop my idea but I don't have a
| lot of money.
|
| I'm throwing a bottle into the ocean: if anyone has spare compute
| with good specs they could lend me for a non-commercial project
| it would help me a lot.
|
| My email is in my profile. Thank you.
| hobs wrote:
| I have worked for a half dozen companies all swearing up and down
| they had big data, and meaningfully one customer had 100TB of logs
| and another 10TB of stuff; everyone else, when actually thought
| about properly and with the utter trash removed, was really under
| 10TB.
|
| Also - sqlite would have been totally fine for these queries a
| decade ago or more (just slower) - I messed with 10GB+ datasets
| with it more than 10 years ago.
| roenxi wrote:
| > As recently shown, the median scan in Amazon Redshift and
| Snowflake reads a doable 100 MB of data, and the 99.9-percentile
| reads less than 300 GB. So the singularity might be closer than
| we think.
|
| This isn't really saying much. It is a bit like saying the 1:1000-
| year storm levee is overbuilt for 99.9% of storms. They aren't the
| storms the levee was built for, y'know. It wasn't set up with them
| close to the top of mind. The database might do 1,000 queries in
| a day.
|
| The focus for design purposes is really the queries that live out
| on the tail - can they be done on a smaller database? How much
| value do they add? What capabilities does the database need to
| handle them? Etc. That is what should justify a Redshift
| database. Or you can provision one to hold your 1TB of data
| because red things go fast and we all know it :/
| benterix wrote:
| > This isn't really saying much.
|
| On the contrary, it's saying a lot about sheer data size,
| that's all. The things you mention may be crucial to why Redshift
| and co. have been chosen (or not - in my org Redshift was used
| as the standard, so even small datasets were put into it as
| management wanted to standardize, for better or worse), but the
| fact remains that if you deal with smaller datasets all of the
| time, you may want to reconsider the solutions you use.
| PaulHoule wrote:
| You can take a different approach to the 1-in-1000 jobs. Like
| don't do them, or approximate them. I remember the time I wrote
| a program that would have taken a century to finish and then
| developed an approximation that got it done in about 20
| minutes.
| capitol_ wrote:
| If you only have 1TB of data then you can have it in RAM on a
| modern server.
| steveBK123 wrote:
| AND even if you have 10TB of data, NVMe storage is
| ridiculously fast compared to what disk used to look like (or
| s3...)
| xyzzy_plugh wrote:
| In the last few years, sure, but certainly not in 2012.
| steveBK123 wrote:
| 1TB memory servers weren't THAT exotic even in say
| 2014~2018 era either, I know as I had a few at work.
|
| Not cheap, but these were at companies with 100s of SWEs /
| billions in revenue / would eventually have multi-million
| dollar cloud bills for what little they migrated there.
| twic wrote:
| This feels like a companion to the classic 2015 paper
| "Scalability! But at what COST?":
|
| https://www.usenix.org/system/files/conference/hotos15/hotos...
| mehulashah wrote:
| For those of you from the AI world, this is the equivalent of the
| bitter lesson and DeWitt's argument about database machines from
| the early 80s. That is, if you wait a bit, with the exponential
| pace of Moore's law (or modern equivalents), improvements in
| "general purpose" hardware will obviate DB-specific improvements.
| The problem is that back in 2012, we had customers that wanted to
| query terabytes of logs for observability, or analyze adtech
| streams, etc. So, I feel like this is a pointless argument. If
| your data fit on an old MacBook Pro, sure, you should've built for
| that.
| szarnyasg wrote:
| AWS started offering local SSD storage up to 2 TB in 2012 (HI1
| instance type) and in late 2013 this went up to 6.4 TB (I2
| instance type). While these amounts don't cover all customers,
| plenty of data fits on these machines. But the software stack
| to analyze it efficiently was lacking, especially in the open-
| source space.
| mehulashah wrote:
| AWS also had customers that had petabytes of data in Redshift
| for analysis. The conversation is missing a key point: DuckDB
| is optimizing for a different class of use cases. They're
| optimizing for data science and not traditional data
| warehousing use cases. That difference masquerades as one of
| size. Even for small sizes, there are other considerations:
| access control, concurrency control, reliability, availability,
| and so on. The requirements are different for those different
| use cases. Data science tends to be single user, local, and to
| have lower availability requirements than warehouses that serve
| production pipelines, data sharing, and so on. I also think
| that DuckDB can be used for those, but it is not optimized for
| those.
|
| Data size is a red herring in the conversation.
| braza wrote:
| This has the same energy of this article named "Command-line
| Tools can be 235x Faster than your Hadoop Cluster" [1]
|
| [1] - https://adamdrake.com/command-line-tools-can-
| be-235x-faster-...
| braza wrote:
| > History is full of "what if"s, what if something like DuckDB
| had existed in 2012? The main ingredients were there, vectorized
| query processing had already been invented in 2005. Would the now
| somewhat-silly-looking move to distributed systems for data
| analysis have ever happened?
|
| I like the gist of the article, but the conclusion sounds like
| 20/20 hindsight.
|
| All the elements were there, and the author nails it, but maybe
| the right incentive structure wasn't there to create the
| conditions to make it happen.
|
| Between 2010 and 2015, there was a genuine feeling across almost
| all of industry that we would converge on massive amounts of data,
| because until that time the industry had never faced such an
| abundance of data in terms of data capture and the ease of
| placing sensors everywhere.
|
| The natural step in that scenario is not, most of the time,
| something like "let's find efficient ways to do it with the same
| capacity" but instead "let's invest to be able to process this in
| a distributed manner independent of the volume that we can have."
|
| It's the same thing between OpenAI/ChatGPT and DeepSeek, where
| one can say that the math was always there, but the first runner
| was OpenAI with something less efficient but with a different set
| of incentive structures.
| mamcx wrote:
| It would not have happened. The problem is that people believe
| _their_ app will be web-scale pretty soon, so they need to solve
| the problem ASAP.
|
| It's only after being burned many, many times that the need for
| simplicity arises.
|
| It's the same with NoSQL. Only after suffering it do you
| appreciate going back.
|
| i.e.: Tools like this circle back only after the pain of a
| bubble. It can't be done inside one.
| gopher_space wrote:
| > The problem is that people believe their app will be web-scale
| pretty soon, so they need to solve the problem ASAP.
|
| Investors really wanted to hear about your scaling
| capabilities, even when it didn't make sense. But the burn
| rate at places that didn't let a spreadsheet determine scale
| was insane.
|
| Years working on microservices, and now I start
| planning/discovery with "why isn't this running on a box in
| the closet" and only accept numerical explanations. Putting a
| dollar value on excess capacity and labeling it "ad spend"
| changes perspectives.
| steveBK123 wrote:
| Maybe it was all VC funded solutions looking for problems?
|
| It's a lot easier to monetize data analytics solutions if users'
| code & data are captive in your hosted infra/cloud environment
| than it is to sell people a binary they can run on their own
| kit...
|
| All the better if it's an entire ecosystem of .. stuff.. living in
| "the cloud", leaving end users writing checks to 6 different
| portfolio companies.
| braza wrote:
| > Maybe it was all VC funded solutions looking for problems?
|
| Remember, from 2020-2023 we had an entire movement to push a
| thing called "Modern data stack (MDS)" with big actors like
| a16z lecturing the market about it [1].
|
| I am originally from Data. I have never worked with anything
| outside of Data: DS, MLE, DE, MLOps and so on. One thing that I
| envy in other developer careers is having bosses/leaders with
| battle-tested knowledge around delivering things using
| pragmatic technologies.
|
| Most of the "AI/Data Leaders" have at maximum 15-17 years of
| career dealing with those tools (and I am talking about some
| dinosaurs in a good sense that saw the DWH or Data Mining).
|
| After 2018 we had an explosion of people working in PoCs or
| small projects at best, trying to mimic what the latest blog
| post from some big tech company pushed.
|
| A lot of those guys are the bosses/leaders today, and worse,
| they were formed during a 0% interest environment, tons of hype
| around the technology, little to no scrutiny or business
| necessity for impact, upper management that did not understand
| really what those guys were doing, and in a space that wasn't
| easy for guys from other parts of tech to join easily and call
| it out (e.g., SRE, Backend, Design, Front-end, Systems
| Engineering, etc.).
|
| In other words, it's quite simple to sell complexity or obscure
| technology to most of these people, and the current moment in
| tech is great because we have more guys from other disciplines
| chiming in and sharing their knowledge on how to assess and
| implement technology.
|
| [1] - https://a16z.com/emerging-architectures-for-modern-data-
| infr...
| steveBK123 wrote:
| Right.. shove your data in our data platform.
|
| OK now you need PortCo1's company analytics platform,
| PortCo2's orchestration platform, PortCo3's SRE platform,
| PortCo4's Auth platform, PortCo5's IaC platform, PortCo6's
| Secrets Mgmt Platform, PortCo7's infosec platform, etc.
|
| I am sure I forgot another 10 things. Even if some of these
| things were open source or "open source", there was the
| upsell to the managed/supported/business license/etc version
| for many of these tools.
| beardedwizard wrote:
| This is the primary failure of data platforms from my
| perspective. You need too many 3rd parties/partners to
| actually get anything done with your data and costs become
| unbearable.
| znpy wrote:
| > and in a space that wasn't easy for guys from other parts
| of tech to join easily and call it out (e.g., SRE, Backend,
| Design, Front-end, Systems Engineering, etc.).
|
| As an SRE/SysEng/Devops/SysAdmin (depending on the company
| that hires me): most people in the same job as me could
| easily call it out.
|
| You don't have to be such a big nerd to know that you can
| fit 6TB of memory in a single (physical) server. That's been
| true for a few years. Heck, AWS has had 1TB+ memory instances
| for a few years now.
|
| The thing is... Upper management _wanted_ "big data" and the
| marketing people _wanted_ to put the fancy buzzword on the
| company website and on linkedin. The data people _wanted_ to
| be able to put the fancy buzzword on their CV (and on their
| Linkedin profile -- and command higher salaries due to that -
| can you blame them?).
|
| > In other words, it's quite simple to sell complexity or
| obscure technology for most of these people
|
| The unspoken secret is that this kind of BS wasn't/isn't only
| going on in the data fields (in my opinion).
| steveBK123 wrote:
| > The unspoken secret is that this kind of BS wasn't/isn't
| only going on in the data fields (in my opinion).
|
| Yes, once you see it in one area you notice it everywhere.
|
| A lot of IT spend is CEOs chasing something they half
| heard/misunderstood a competitor doing, or a CTO taking
| Gartner a little too seriously, or engineering leads doing
| resume-driven architecture. My last shop did a lot of this
| kind of stuff: "we need a head of
| [observability|AI|$buzzword]".
|
| The ONE thing that gives me the most pause about DuckDB is
| that some people in my industry who are guilty of the above
| are VERY interested in DuckDB. I like to wait for the
| serial tech evangelists to calm down a bit and see where
| the dust settles.
| kwillets wrote:
| Cloud and SaaS were good for a while because they took away
| the old sales-CTO pipeline that often saw a whole org
| suffering from one person's signature. But they also took
| away the benefits of a more formal evaluation process, and
| nowadays nobody knows how to do one.
| bobchadwick wrote:
| It's not the point of the blog post, but I love the fact that the
| author's 2012 MacBook Pro is still useable. I can't imagine there
| are too many Dell laptops from that era still alive and kicking.
| tetromino_ wrote:
| The machine from the article - a 2012 MBP Retina with 16 GB
| memory and 2.6 GHz i7 - had cost $2999 in the US (and
| significantly more in most of the rest of the world) at
| release. That's around $4200 today adjusting for inflation. You
| won't see many Dell laptops with that sort of price tag.
| godber wrote:
| This makes a completely valid point when you constrain the
| meaning of Big Data to "the largest dataset one can fit on a
| single computer".
| dagw wrote:
| At companies I've worked at "Big Data" was often used to mean
| "too big to open in Excel" or in the extreme case "too big to
| fit in RAM on my laptop"
| datadrivenangel wrote:
| Annoyingly medium data is my term for this.
|
| Around 0.5 to 50 GB is such an annoying area, because Excel
| starts falling over on the lower end and even nicer computers
| will start seriously struggling on the larger end if you're
| not being extremely efficient.
| bhouston wrote:
| I have a large analytics dataset in BigQuery and I wrote an
| interactive exploratory UI on top of it and any query I did
| generally finished in 2s or less. This led to a very simple app
| with infinite analytics refinement that was also fast.
|
| I would definitely not trade that for a pre-computed analytics
| approach. The freedom to explore in real time is enlightening and
| freeing.
|
| I think you have restricted yourself to precomputed, fixed
| analytics, but real-time interactive analytics is also an
| interesting area.
| tonyhart7 wrote:
| Is there an open source analytics project that builds on top of
| DuckDB yet?
|
| I mostly see ClickHouse, Postgres, etc.
| carlineng wrote:
| This is really a question of economics. The biggest organizations
| with the most ability to hire engineers have need for
| technologies that can solve their existing problems in
| incremental ways, and thus we end up with horrible technologies
| like Hadoop and Iceberg. They end up hiring talented engineers to
| work on niche problems, and a lot of the technical discourse ends
| up revolving around technologies that don't apply to the majority
| of organizations, but still cause FOMO amongst them. I, for one,
| am extremely happy to see technologies like DuckDB come along to
| serve the long tail.
| jandrewrogers wrote:
| > As recently shown, the median scan in Amazon Redshift and
| Snowflake reads a doable 100 MB of data, and the 99.9-percentile
| reads less than 300 GB. So the singularity might be closer than
| we think.
|
| There is some circular reasoning embedded here. I've seen many,
| many cases of people finding ways to cut up their workloads into
| small chunks because the performance and efficiency of these
| platforms is far from optimal if you actually tried to run your
| workload at its native scale. To some extent, these "small reads"
| reflect the inadequacy of the platform, not the desire of a user
| to run a particular workload.
|
| A better interpretation may be that the existing distributed
| architectures for data analytics don't scale well except for
| relatively trivial workloads. There has been an awareness of this
| for over a decade but a dearth of platform architectures that
| address it.
| hodgesrm wrote:
| > If we look at the time a bit closer, we see the queries take
| anywhere between a minute and half an hour. Those are not
| unreasonable waiting times for analytical queries on that sort of
| data in any way.
|
| I'm really skeptical of arguments that say it's OK to be slow.
| Even on the modern laptop example, queries still take up to 47
| seconds.
|
| Granted, I'm not looking at the queries but the fact is that
| there are _a lot_ of applications where users need results back
| in less than a second.[0] If the results are feeding automated
| processes like page rendering, they need it back in 10s of
| milliseconds at most. That takes hardware to accomplish
| consistently. Especially if the datasets are large.
|
| The small data argument becomes even weaker when you consider
| that analytic databases don't just do queries on static datasets.
| Large datasets got that way by absorbing a lot of data very
| quickly. They therefore do ingest, compaction, and
| transformations. These require resources, especially if they run
| in parallel with query on the same data. Scaling them
| independently requires distributed systems. There isn't another
| solution.
|
| [0] SIEM, log management, trace management, monitoring
| dashboards, ... All potentially large datasets where people sift
| through data very quickly and repeatedly. Nobody wants to wait
| more than a couple seconds for results to come back.
| npalli wrote:
| DuckDB works well if
|
| * you have a small dataset (total, not just what a single user
| is scanning)
|
| * no real-time updates, just a static dataset that you can
| analyze at leisure
|
| * only a few users and only one doing any writes
|
| * several seconds is an OK response time; gets worse if you have
| to load your scanned segment into the DuckDB node
|
| * generally read-only workloads
|
| So yeah, not convinced we lost a decade.
___________________________________________________________________
(page generated 2025-05-22 23:01 UTC)