[HN Gopher] A lost decade chasing distributed architectures for ...
       ___________________________________________________________________
        
       A lost decade chasing distributed architectures for data analytics?
        
       Author : andreasha
       Score  : 201 points
       Date   : 2025-05-19 08:39 UTC (3 days ago)
        
 (HTM) web link (duckdb.org)
 (TXT) w3m dump (duckdb.org)
        
       | drewm1980 wrote:
       | I mean, not everyone spent their decade on distributed computing.
       | Some devs with a retrogrouch inclination kept writing single
       | threaded code in native languages on a single node. Single core
        | clock speed stagnated, but it was still worth buying new CPUs
        | with more cores because they also had more cache, and all the
        | extra cores are useful for running ~other people's bloated code.
        
         | HPsquared wrote:
         | High-frequency trading, gaming, audio/DSP, embedded, etc.
         | There's a lot of room for that kind of developer.
        
         | nyanpasu64 wrote:
         | I find that good multithreading can speed up parallelizable
         | workloads by 5-10 times depending on CPU core count, if you
         | don't have tight latency constraints (and even games with
         | millisecond-level latency deadlines are multithreaded these
         | days, though real-time code may look different than general
         | code).
        
       | fulafel wrote:
       | Related in the big-data-benchmarks-on-old-laptop department:
       | https://www.frankmcsherry.org/graph/scalability/cost/2015/01...
        
       | willvarfar wrote:
       | I only retired my 2014 MBP ... last week! It started transiently
        | not booting and then, after just a few weeks, it switched to
       | only transiently booting. Figured it was time. My new laptop is
       | actually a very budget buy, and not a mac, and in many things a
       | bit slower than the old MBP.
       | 
        | Anyway, the old laptop is about on par with the 'big' VMs that I
        | use for work to analyse really big BQ datasets. My current flow
        | is to run the 0.001% of queries that don't fit on a box in
        | BigQuery and massage things with just enough prepping to make the
        | intermediate result fit on a box. Then I extract that to parquet
       | stored on the VM and do the analysis on the VM using DuckDB from
       | python notebooks.
       | 
       | DuckDB has revolutionised not what I can do but how I can do it.
       | All the ingredients were around before, but DuckDB brings it
       | together and makes the ergonomics completely different. Life is
       | so much easier with joins and things than trying to do the same
       | in, say, pandas.
        
         | Cthulhu_ wrote:
          | I still have mine, but it's languishing; I don't know what to
          | do with it or how to get rid of it, and it doesn't feel like
          | trash. The Apple stores do returns, but for this one you get
          | nothing; they're just like "yeah, we'll take care of it".
         | 
          | The screen started to delaminate at the edges, and the screen
          | of its follow-up (an MBP with the touch bar) is completely
          | broken (probably just the connector cable).
         | 
         | I don't have a use for it, but it feels wasteful just to throw
         | it away.
        
           | HPsquared wrote:
           | eBay is pretty active for that kind of thing. Spares/repair.
        
           | compiler-devel wrote:
           | I have the same machine and installed Fedora 41 on it.
           | Everything works out of the box, including WiFi and sound.
        
       | mediumsmart wrote:
        | I am on the late 2015 version and I have an eBay body stashed for
       | when the time comes to refurbish that small data machine.
        
         | selimthegrim wrote:
         | Any good keywords to search?
        
       | zkmon wrote:
        | A database is not only about disk size and query performance. A
        | database reflects the company's culture, processes, workflows,
        | collaboration, etc. It has an entire ecosystem around it - master
        | data, business processes, transactions, distributed applications,
        | regulatory requirements, resiliency, Ops, reports, tooling, etc.
       | 
       | The role of a database is not just to deliver query performance.
       | It needs to fit into the ecosystem, serve the overall role on
       | multiple facets, deliver on a wide range of expectations - tech
       | and non-tech.
       | 
       | While the useful dataset itself may not outpace the hardware
       | advancements, the ecosystem complexity will definitely outpace
       | any hardware or AI advancements. Overall adaptation to the
       | ecosystem will dictate the database choice, not query
       | performance. Technologies will not operate in isolation.
        
         | zwnow wrote:
          | No, a database reflects what you make out of it. Reports are
          | just queries after all. I don't know what all the other stuff
          | you named has to do with the database directly. The only
          | purpose of databases is to store and read data; that's what it
          | comes down to. So query performance IS one of the most
          | important metrics.
        
         | willvarfar wrote:
          | And it's very much the tech culture at large that influences
          | the company's tech choices. Those techies chasing shiny things
          | and trying to shoehorn them into their job - perhaps cynically
          | to pad their CVs or perhaps generously thinking it will
          | actually be the right thing to do - have an outsized say in how
          | tech teams think about tech and what they imagine their job is.
         | 
         | Back in 2012 we were just recovering from the everything-is-xml
         | craze and in the middle of the no-sql craze and everything was
         | web-scale and distribute-first micro-services etc.
         | 
         | And now, after all that mess, we have learned to love what came
         | before: namely, please please please just give me sql! :D
        
           | threeseed wrote:
            | Why don't you just quietly use SQL instead of condescendingly
            | lecturing others about how compromised their tech choices
            | are?
           | 
            | NoSQL (e.g. Cassandra, MongoDB) and microservices were
            | invented to solve real-world problems, which is why they are
            | still so heavily used today. And the criticism of them is
            | exactly the same as was levelled at SQL back in the day.
           | 
           | It's all just tools at the end of the day and there isn't one
           | that works for all use cases.
        
             | kukkeliskuu wrote:
             | Around 20 years ago I was working for a database company.
             | During that time, I attended SIGMOD, which is the top
             | conference for databases.
             | 
              | The keynote speaker for the conference was Stonebraker, who
              | started Postgres, among other things. He talked about the
              | history of relational databases.
             | 
             | At that time, XML databases were all the rage -- now nobody
              | remembers them. Stonebraker explained that there is nothing
              | new about hierarchical databases. There was a significant
              | battle in SIGMOD, I think somewhere in the 1980s (I forget
              | the exact time frame), between network databases and
              | relational databases.
             | 
             | The relational databases won that battle, as they have won
             | against each competing hierarchical database technology
             | since.
             | 
             | The reason is that relational databases are based on
             | relational algebra. This has very practical consequences,
             | for example you can query the data more flexibly.
             | 
              | When you use JSON storage such as MongoDB, once you decide
              | on your root entities you are stuck with that decision. I
              | see very often in practice that new requirements come along
              | that you did not foresee and that you then need to work
              | around.
             | 
             | I don't care what other people use, however.
        
               | threeseed wrote:
                | MongoDB is a $2b/year revenue company growing at 20% y/y.
                | JSON stores are not going anywhere, and they are an
                | essential tool for dealing with data where you have no
                | control over the schema or you want to handle it in the
                | application layer.
               | 
               | And the only "battle" is one you've invented in your
               | head. People who deal in data for a living just pick the
               | right data store for the right data schema.
        
               | lazide wrote:
               | Sensitive much?
        
               | pragmatic wrote:
                | And SQL Server alone is like $5 billion/yr.
        
               | threeseed wrote:
               | Almost like there is room in the market for more than
               | just SQL databases.
        
               | znpy wrote:
               | Ah yes MongoDB, it's web-scale!
        
             | hobs wrote:
             | Every person I know who has ever used Cassandra in prod has
              | cursed its name. Mongo lost data for close to a decade, and
              | microservices are mostly NOT used to solve real-world
              | problems but are instead used as an organizational or
              | technical hammer for which everything is a nail. Hell,
              | there are entire books written on how you should cut people
              | off from each other so they can "naturally" write
              | microservices and hyperscale your company!!
        
               | threeseed wrote:
                | So all of this is just meaningless anecdote.
                | 
                | Whereas the _fact_ is that DataStax and MongoDB are
                | highly successful companies, indicating that those
                | databases are in fact solving real-world problems.
        
         | DonHopkins wrote:
         | You can always make your data bigger without increasing disk
         | space or decreasing performance by making the font size larger!
        
       | querez wrote:
       | > The geometric mean of the timings improved from 218 to 12, a
       | ca. 20x improvement.
       | 
       | Why do they use the geometric mean to average execution times?
        
         | willvarfar wrote:
         | Squaring is a really good way to make the common-but-small
         | numbers have bigger representation than the outlying-but-large
         | numbers.
         | 
          | I just did a quick google and the first real result was this
          | blog post with a good explanation and some good illustrations:
          | https://jlmc.medium.com/understanding-three-simple-statistic...
         | 
          | It's the very first illustration at the top of that blog post
          | that 'clicks' for me. Hope it helps!
          | 
          | The inverse is also good: mean-square-error is a good way of
          | comparing how similar two datasets (e.g. two images) are.
        
           | yorwba wrote:
            | The geometric mean of _n_ numbers is the _n_-th root of the
            | product of all numbers. The mean square error is the sum of
            | the squares of all numbers, divided by _n_ (i.e. the
            | arithmetic mean of the squares). They're not the same.
        
             | willvarfar wrote:
              | I'm not gonna edit what I wrote but you are interpreting it
              | way too literally. I was not describing the implementation
              | of anything, I was just giving a link that explains why
              | thinking about things in terms of area (geometry) is
              | popular in stats. It's a bit like the epiphany that
              | histograms don't need to be bars of equal width.
        
         | ayhanfuat wrote:
         | It's a way of saying twice as fast and twice as slow have equal
         | effect on opposite sides. If your baseline is 10 seconds, one
         | benchmark takes 5 seconds, and another one takes 20 seconds
         | then the geometric mean gives you 10 seconds as the result
          | because they cancel each other out. The arithmetic mean would
          | treat it differently because in absolute terms a 10-second
          | slowdown is bigger than a 5-second speedup. But that is not
          | fair to speedups, because the absolute speedup you can reach is
          | at most 10 seconds while the slowdown has no limit.
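          | 
          | A minimal sketch of the difference in Python, using the same
          | made-up timings (5 s and 20 s against a 10 s baseline):
          | 
          |     from statistics import geometric_mean, mean
          | 
          |     # hypothetical benchmark timings: one query twice as fast
          |     # as the 10 s baseline, one twice as slow
          |     timings = [5.0, 20.0]
          | 
          |     print(mean(timings))            # 12.5, slowdown dominates
          |     print(geometric_mean(timings))  # 10.0, the effects cancel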
        
           | tbillington wrote:
           | This is the best explain-like-im-5 I've heard for geo mean
           | and helped it click in my head, thank you :)
        
           | zmgsabst wrote:
           | But reality doesn't care:
           | 
           | If half your requests are 2x as long and half are 2x as fast,
           | you don't take the same wall time to run -- you take longer.
           | 
           | Let's say you have 20 requests, 10 of type A and 10 of type
           | B. They originally both take 10 seconds, for 200 seconds
           | total. You halve A and double B. Now it takes 50 + 200 = 250
           | seconds, or 12.5 on average.
           | 
           | This is a case where geometric mean deceives you - because
           | the two really are asymmetric and "twice as fast" is worth
           | less than "twice as slow".
        
             | ayhanfuat wrote:
             | There is definitely no single magical number that can
             | perfectly represent an entire set of numbers. There will
              | always be some cases where they are not representative
              | enough. In the request example you are mostly interested in
              | the total processing times, so it makes sense to use a
              | metric based on addition. But you could also frame a similar
             | scenario where halving the processing time lets you handle
             | twice as many items in the same duration. In that case a
             | ratio-based or multiplicative view might be more
             | appropriate.
        
               | zmgsabst wrote:
               | Sure -- but the arithmetic mean also captures that case:
               | if you only halve the time, it also will report that
               | change accurately.
               | 
               | What we're handling is the case where you have _split_
               | outcomes -- and there the arithmetic and geometric mean
               | disagree, so we can ask which better reflects reality.
               | 
               | I'm not saying the geometric mean is always wrong -- but
               | it is in this case.
               | 
                | A case where it makes sense: what happens when your
                | stock halves in value and then doubles in value?
                | 
                | In general, the geometric mean is appropriate where
                | effects are compounding (eg, two price changes to the
                | same stock) but not when we're combining (requests that
                | are handled differently). Two benchmarks are more like
                | combining (do task A then task B) than compounding.
        
       | rr808 wrote:
       | Ugh I have joined a big data team. 99% of the feeds are less than
        | a few GB yet we have to use Scala and Spark. It's so slow to
       | develop and slow to run.
        
         | threeseed wrote:
          | a) Scala, being a JVM language, is one of the fastest around.
          | Much faster than, say, Python.
          | 
          | b) How large are the other 1% of feeds, and what is the size of
          | the total joined datasets? Because ultimately that is what you
          | build platforms for. Not the simple use cases.
        
           | rr808 wrote:
            | 1) Yes, Scala and the JVM are fast. If we could just use that
            | to clean up a feed on a single box, that would be great. The
            | problem is that calling the Spark API creates a lot of
            | complexity for developers and for the runtime platform, which
            | is super slow.
            | 
            | 2) Yes, for the few feeds that are a TB we need Spark. The
            | platform really just loads from Hadoop, transforms, then
            | saves back again.
        
             | threeseed wrote:
              | a) You can easily run Spark jobs on a single box. Just set
              | executors = 1 (see the sketch at the end of this comment).
             | 
             | b) The reason centralised clusters exist is because you
             | can't have dozens/hundreds of data engineers/scientists all
             | copying company data onto their laptop, causing support
             | headaches because they can't install X library and making
             | productionising impossible. There are bigger concerns than
             | your personal productivity.
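              | 
              | For a), a minimal sketch using PySpark's local mode (one
              | way to run on a single box; the path and column name are
              | hypothetical):
              | 
              |     from pyspark.sql import SparkSession
              | 
              |     # run the same Spark code on one machine, no cluster
              |     spark = (SparkSession.builder
              |              .master("local[*]")   # use all local cores
              |              .appName("single-box-job")
              |              .getOrCreate())
              | 
              |     df = spark.read.parquet("/data/feed.parquet")
              |     df.groupBy("customer_id").count().show()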
        
               | rr808 wrote:
               | > a) You can easily run Spark jobs on a single box. Just
               | set executors = 1.
               | 
                | Sure, but why would you do this? Just using pandas or
                | duckdb or even bash scripts makes your life much easier
                | than having to deal with Spark.
        
               | cgio wrote:
               | For when you need more executors without rewriting your
               | logic.
        
               | this_user wrote:
               | Using a Python solution like Dask might actually be
               | better, because you can work with all of the Python data
               | frameworks and tools, but you can also easily scale it if
               | you need it without having to step into the Spark world.
        
               | rpier001 wrote:
               | Re: b. This is a place where remote standard dev
               | environments are a boon. I'm not going to give each dev a
               | terabyte of RAM, but a terabyte to share with a
               | reservation mechanism understanding that contention for
               | the full resource is low? Yes, please.
        
           | Larrikin wrote:
            | But can you justify Scala existing at all in 2025? I think it
            | pushed boundaries but ultimately failed as a language worth
            | adopting anymore.
        
             | threeseed wrote:
             | Absolutely.
             | 
              | a) It is one of the only languages in which you can write
              | your entire app, i.e. it supports compiling to JavaScript,
              | the JVM and LLVM.
             | 
             | b) It has the only formally proven type system of any
             | language.
             | 
             | c) It is the innovation language. Many of the concepts that
             | are now standard in other languages had their
             | implementation borrowed from Scala. And it is continuing to
             | innovate with libraries like Gears
             | (https://github.com/lampepfl/gears) which does async
             | without colouring and compiler additions like resource
             | capabilities.
        
           | tomrod wrote:
           | PySpark is a wrapper, so Scala is unnecessary and boggy.
        
             | spark1377485 wrote:
              | PySpark is great, except for UDF performance. This gap
              | means that Scala is helpful for some Spark edge cases, like
              | column-level encryption/decryption with UDFs.
        
       | Mortiffer wrote:
        | The R community has been hard at work on small data. I still
        | highly prefer working on in-memory data in R; dplyr and
        | data.table are elegant and fast.
        | 
        | The CRAN packages are all high quality: if the maintainer stops
        | responding to emails for 2 months, your package is automatically
        | removed. Most packages come from university profs who have been
        | doing this their whole career.
        
         | wodenokoto wrote:
          | A really big part of an in-memory, dataframe-centric workflow
          | is how easy it is to do one step at a time and inspect the
          | result.
         | 
         | With a database it is difficult to run a query, look at the
         | result and then run a query on the result. To me, that is what
         | is missing in replacing pandas/dplyr/polars with DuckDB.
        
           | IanCal wrote:
            | I'm not sure I really follow; you can create new tables for
            | any step if you want to do it entirely within the db, but you
            | can also just run duckdb against your dataframes in memory.
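            | 
            | A minimal sketch of the latter, using the duckdb Python
            | package and a made-up dataframe:
            | 
            |     import duckdb
            |     import pandas as pd
            | 
            |     df = pd.DataFrame({"city": ["NYC", "SF", "NYC"],
            |                        "amount": [10, 20, 30]})
            | 
            |     # duckdb can scan the in-memory dataframe by name...
            |     step1 = duckdb.sql("SELECT city, sum(amount) AS total"
            |                        " FROM df GROUP BY city").df()
            | 
            |     # ...and each intermediate result is itself a dataframe
            |     # you can inspect and query again
            |     step2 = duckdb.sql(
            |         "SELECT * FROM step1 WHERE total > 15").df()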
        
             | jgalt212 wrote:
             | In R, data sources, intermediate results, and final results
             | are all dataframes (slight simplification). With DuckDB, to
             | have the same consistency you need every layer and step to
             | be a database table, not a data frame, which is awkward for
             | the standard R user and use case.
        
               | datadrivenangel wrote:
                | You can also use duckplyr as a drop-in replacement for
                | dplyr. It automatically falls back to dplyr for
                | unsupported behavior, and for most operations it is
                | notably faster.
                | 
                | data.table is competitive with DuckDB in many cases,
                | though as a DuckDB enthusiast I hate to admit this. :)
        
             | wodenokoto wrote:
             | You can, but then every step starts with a drop table if
             | exists; insert into ...
        
               | cess11 wrote:
                | Or you nest your queries:
                | 
                |     select second
                |     from (select 42 as first, (select 69) as second);
               | 
               | Intermediate steps won't be stored but until queries take
               | a while to execute it's a nice way to do step-wise
               | extension of an analysis.
               | 
               | Edit: It's a rather neat and underestimated property of
               | query results that you can query them in the next scope.
        
               | jcheng wrote:
               | Or better yet, use CTEs:
               | https://duckdb.org/docs/stable/sql/query_syntax/with.html
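                | 
                | A minimal sketch of the same toy query as a CTE, run via
                | the duckdb Python package:
                | 
                |     import duckdb
                | 
                |     duckdb.sql("""
                |         WITH step1 AS (SELECT 42 AS first, 69 AS second),
                |              step2 AS (SELECT second FROM step1)
                |         SELECT * FROM step2
                |     """).show()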
        
       | PotatoNinja wrote:
       | Krazam did a brilliant video on Small Data:
       | https://youtu.be/eDr6_cMtfdA?si=izuCAgk_YeWBqfqN
        
       | culebron21 wrote:
        | A tangential story. I remember, back in 2010, contemplating the
        | idea of completely distributed DBs inspired by the then-popular
        | torrent technology. In this one, a client would not be different
        | from a server, except by the amount of data it holds. And it
        | would probably receive the data in a torrent-like manner.
       | 
       | What puzzled me was that a client would want others to execute
       | its queries, but not want to load all the data and make queries
       | for the others. And how to prevent conflicting update queries
       | sent to different seeds.
       | 
        | I also thought that Crockford's distributed web idea (where every
        | page is hosted like on torrents) was a good one, even though I
        | didn't think deeply about this one.
       | 
        | That was until I saw the discussion on web3, where someone
        | pointed out that uploading any data to one server would make a
        | lot of hosts do the job of hosting a part of it, and every small
        | movement would cause tremendous amounts of work for the entire
        | web.
        
       | mangecoeur wrote:
        | Did my PhD around that time and did a project "scaling" my work
        | on a Spark cluster. Huge PITA and no better than my local setup,
        | which was an MBP15 with pandas and postgres (actually I
        | wrote/contributed a big chunk of pandas read_sql at that time to
        | make it postgres compatible using SQLAlchemy).
        
         | jononor wrote:
          | Thank you for read_sql with SQLAlchemy/postgres! We use it all
          | the time at our company :)
        
       | simlevesque wrote:
       | I'm working on a big research project that uses duckdb, I need a
       | lot of compute resources to develop my idea but I don't have a
       | lot of money.
       | 
       | I'm throwing a bottle into the ocean: if anyone has spare compute
       | with good specs they could lend me for a non-commercial project
       | it would help me a lot.
       | 
       | My email is in my profile. Thank you.
        
       | hobs wrote:
        | I have worked for a half dozen companies all swearing up and down
        | they had big data. Meaningfully, one customer had 100TB of logs
        | and another 10TB of stuff; everyone else, when you actually
        | thought about it properly and removed the utter trash, was really
        | under 10TB.
        | 
        | Also - SQLite would have been totally fine for these queries a
        | decade ago or more (just slower) - I messed with 10GB+ datasets
        | with it more than 10 years ago.
        
       | roenxi wrote:
       | > As recently shown, the median scan in Amazon Redshift and
       | Snowflake reads a doable 100 MB of data, and the 99.9-percentile
       | reads less than 300 GB. So the singularity might be closer than
       | we think.
       | 
        | This isn't really saying much. It is a bit like saying the
        | 1-in-1000-year storm levee is overbuilt for 99.9% of storms. They
        | aren't the storms the levee was built for, y'know. It wasn't set
        | up with them close to the top of mind. The database might do
        | 1,000 queries in a day.
       | 
        | The focus for design purposes is really on the queries that live
        | out on the tail - can they be done on a smaller database? How
        | much value do they add? What capabilities does the database need
        | to handle them? Etc. That is what should justify a Redshift
        | database. Or you can provision one to hold your 1 TB of data
        | because red things go fast and we all know it :/
        
         | benterix wrote:
         | > This isn't really saying much.
         | 
          | On the contrary, it's saying a lot about sheer data size,
          | that's all. The things you mention may be crucial to why
          | Redshift and co. have been chosen (or not - in my org Redshift
          | was used as the standard, so even small datasets were put into
          | it because management wanted to standardize, for better or
          | worse), but the fact remains that if you deal with smaller
          | datasets all of the time, you may want to reconsider the
          | solutions you use.
        
         | PaulHoule wrote:
         | You can take a different approach to the 1-in-1000 jobs. Like
         | don't do them, or approximate them. I remember the time I wrote
         | a program that would have taken a century to finish and then
         | developed an approximation that got it done in about 20
         | minutes.
        
         | capitol_ wrote:
          | If you only have 1 TB of data then you can have it in RAM on a
          | modern server.
        
           | steveBK123 wrote:
           | AND even if you have 10TB of data, NVMe storage is
           | ridiculously fast compared to what disk used to look like (or
           | s3...)
        
           | xyzzy_plugh wrote:
           | In the last few years, sure, but certainly not in 2012.
        
             | steveBK123 wrote:
              | 1TB memory servers weren't THAT exotic even in, say, the
              | 2014~2018 era either. I know as I had a few at work.
             | 
             | Not cheap, but these were at companies with 100s of SWEs /
             | billions in revenue / would eventually have multi-million
             | dollar cloud bills for what little they migrated there.
        
       | twic wrote:
        | This feels like a companion to the classic 2015 paper
        | "Scalability! But at what COST?":
       | 
       | https://www.usenix.org/system/files/conference/hotos15/hotos...
        
       | mehulashah wrote:
        | For those of you from the AI world, this is the equivalent of the
        | bitter lesson and DeWitt's argument about database machines from
        | the early 80s. That is, if you wait a bit, with the exponential
        | pace of Moore's law (or modern equivalents), improvements in
        | "general purpose" hardware will obviate DB-specific improvements.
        | The problem is that back in 2012, we had customers that wanted to
        | query terabytes of logs for observability, or analyze adtech
        | streams, etc. So, I feel like this is a pointless argument. If
        | your data fit on an old MacBook Pro, sure, you should've built
        | for that.
        
         | szarnyasg wrote:
         | AWS started offering local SSD storage up to 2 TB in 2012 (HI1
         | instance type) and in late 2013 this went up to 6.4 TB (I2
         | instance type). While these amounts don't cover all customers,
         | plenty of data fits on these machines. But the software stack
         | to analyze it efficiently was lacking, especially in the open-
         | source space.
        
           | mehulashah wrote:
           | AWS also had customers that had petabytes of data in Redshift
           | for analysis. The conversation is missing a key point: DuckDB
           | is optimizing for a different class of use cases. They're
           | optimizing for data science and not traditional data
            | warehousing use cases. It's masquerading as a question of
            | size. Even for small sizes, there are other considerations:
            | access control, concurrency control, reliability,
            | availability, and so on. The requirements are different for
            | those different use cases. Data science tends to be single
            | user and local, and to have lower availability requirements
            | than warehouses that serve production pipelines, data
            | sharing, and so on. I also think that DuckDB can be used for
            | those, but it is not optimized for those.
           | 
           | Data size is a red herring in the conversation.
        
       | braza wrote:
        | This has the same energy as the article named "Command-line
        | Tools can be 235x Faster than your Hadoop Cluster" [1]
       | 
       | [1] - https://adamdrake.com/command-line-tools-can-
       | be-235x-faster-...
        
       | braza wrote:
       | > History is full of "what if"s, what if something like DuckDB
       | had existed in 2012? The main ingredients were there, vectorized
       | query processing had already been invented in 2005. Would the now
       | somewhat-silly-looking move to distributed systems for data
       | analysis have ever happened?
       | 
       | I like the gist of the article, but the conclusion sounds like
       | 20/20 hindsight.
       | 
        | All the elements were there, and the author nails it, but maybe
        | the right incentive structure wasn't there to create the
        | conditions to make it happen.
        | 
        | Between 2010 and 2015, there was a genuine feeling across almost
        | all of industry that we would converge on massive amounts of
        | data, because until then the industry had never faced such an
        | abundance of data in terms of data capture and the ease of
        | placing sensors everywhere.
        | 
        | The natural step in this scenario isn't, most of the time,
        | something like "let's find efficient ways to do it with the same
        | capacity" but instead "let's invest to be able to process this in
        | a distributed manner independent of the volume that we can have."
       | 
       | It's the same thing between OpenAI/ChatGPT and DeepSeek, where
       | one can say that the math was always there, but the first runner
       | was OpenAI with something less efficient but with a different set
       | of incentive structures.
        
         | mamcx wrote:
          | It would not have happened. The problem is that people believe
          | _their_ app will be web-scale pretty soon, so they need to
          | solve the problem ASAP.
          | 
          | It is only after being burned many, many times that the need
          | for simplicity arises.
          | 
          | It is the same with NoSQL. Only after suffering through it do
          | you appreciate going back.
          | 
          | ie: Tools like this circle back only after the pain of a
          | bubble. It can't be done inside it.
        
           | gopher_space wrote:
            | > The problem is that people believe their app will be web-
            | scale pretty soon, so they need to solve the problem ASAP.
           | 
           | Investors really wanted to hear about your scaling
           | capabilities, even when it didn't make sense. But the burn
           | rate at places that didn't let a spreadsheet determine scale
           | was insane.
           | 
            | After years working on microservices, I now start
            | planning/discovery with "why isn't this running on a box in
            | the closet" and only accept numerical explanations. Putting a
           | dollar value on excess capacity and labeling it "ad spend"
           | changes perspectives.
        
       | steveBK123 wrote:
       | Maybe it was all VC funded solutions looking for problems?
       | 
        | It's a lot easier to monetize data analytics solutions if users'
        | code & data are captive in your hosted infra/cloud environment
       | than it is to sell people a binary they can run on their own
       | kit...
       | 
        | All the better if it's an entire ecosystem of... stuff... living
        | in "the cloud", leaving end users writing checks to 6 different
        | portfolio companies.
        
         | braza wrote:
         | > Maybe it was all VC funded solutions looking for problems?
         | 
         | Remember, from 2020-2023 we had an entire movement to push a
         | thing called "Modern data stack (MDS)" with big actors like
         | a16z lecturing the market about it [1].
         | 
          | I am originally from data; I have never worked with anything
          | outside of data: DS, MLE, DE, MLOps and so on. One thing that I
          | envy in other developer careers is having bosses/leaders with
          | battle-tested knowledge of delivering things using pragmatic
          | technologies.
         | 
         | Most of the "AI/Data Leaders" have at maximum 15-17 years of
         | career dealing with those tools (and I am talking about some
         | dinosaurs in a good sense that saw the DWH or Data Mining).
         | 
         | After 2018 we had an explosion of people working in PoCs or
         | small projects at best, trying to mimic what the latest blog
         | post from some big tech company pushed.
         | 
         | A lot of those guys are the bosses/leaders today, and worse,
         | they were formed during a 0% interest environment, tons of hype
         | around the technology, little to no scrutiny or business
         | necessity for impact, upper management that did not understand
         | really what those guys were doing, and in a space that wasn't
         | easy for guys from other parts of tech to join easily and call
         | it out (e.g., SRE, Backend, Design, Front-end, Systems
         | Engineering, etc.).
         | 
          | In other words, it's quite simple to sell complexity or obscure
          | technology to most of these people, and the current moment in
          | tech is great because we have more guys from other disciplines
          | chiming in and sharing their knowledge on how to assess and
          | implement technology.
         | 
         | [1] - https://a16z.com/emerging-architectures-for-modern-data-
         | infr...
        
           | steveBK123 wrote:
           | Right.. shove your data in our data platform.
           | 
           | OK now you need PortCo1's company analytics platform,
           | PortCo2's orchestration platform, PortCo3's SRE platform,
           | PortCo4's Auth platform, PortCo5's IaC platform, PortCo6's
           | Secrets Mgmt Platform, PortoCo7's infosec platform, etc.
           | 
           | I am sure I forgot another 10 things. Even if some of these
           | things were open source or "open source", there was the
           | upsell to the managed/supported/business license/etc version
           | for many of these tools.
        
             | beardedwizard wrote:
             | This is the primary failure of data platforms from my
             | perspective. You need too many 3rd parties/partners to
             | actually get anything done with your data and costs become
             | unbearable.
        
           | znpy wrote:
           | > and in a space that wasn't easy for guys from other parts
           | of tech to join easily and call it out (e.g., SRE, Backend,
           | Design, Front-end, Systems Engineering, etc.).
           | 
           | As an SRE/SysEng/Devops/SysAdmin (depending on the company
           | that hires me): most people in the same job as me could
           | easily call it out.
           | 
            | You don't have to be such a big nerd to know that you can
            | fit 6TB of memory in a single (physical) server. That's been
            | true for a few years. Heck, AWS has had 1TB+ memory instances
            | for a few years now.
           | 
           | The thing is... Upper management _wanted_ "big data" and the
           | marketing people _wanted_ to put the fancy buzzword on the
           | company website and on linkedin. The data people _wanted_ to
           | be able to put the fancy buzzword on their CV (and on their
           | Linkedin profile -- and command higher salaries due to that -
           | can you blame them?).
           | 
           | > In other words, it's quite simple to sell complexity or
           | obscure technology for most of these people
           | 
           | The unspoken secret is that this kind of BS wasn't/isn't only
           | going on in the data fields (in my opinion).
        
             | steveBK123 wrote:
             | > The unspoken secret is that this kind of BS wasn't/isn't
             | only going on in the data fields (in my opinion).
             | 
              | Yes, once you see it in one area you notice it everywhere.
              | 
              | A lot of IT spend is CEOs chasing something they half
              | heard/misunderstood a competitor doing, or a CTO taking
              | Gartner a little too seriously, or engineering leads doing
              | resume-driven architecture. My last shop did a lot of this
              | kind of stuff: "we need a head of
              | [observability|AI|$buzzword]".
             | 
             | The ONE thing that gives me the most pause about DuckDB is
             | that some people in my industry who are guilty of the above
             | are VERY interested in DuckDB. I like to wait for the
             | serial tech evangelists to calm down a bit and see where
             | the dust settles.
        
           | kwillets wrote:
           | Cloud and SaaS were good for a while because they took away
           | the old sales-CTO pipeline that often saw a whole org
           | suffering from one person's signature. But they also took
           | away the benefits of a more formal evaluation process, and
           | nowadays nobody knows how to do one.
        
       | bobchadwick wrote:
       | It's not the point of the blog post, but I love the fact that the
       | author's 2012 MacBook Pro is still useable. I can't imagine there
       | are too many Dell laptops from that era still alive and kicking.
        
         | tetromino_ wrote:
         | The machine from the article - a 2012 MBP Retina with 16 GB
         | memory and 2.6 GHz i7 - had cost $2999 in the US (and
         | significantly more in most of the rest of the world) at
         | release. That's around $4200 today adjusting for inflation. You
         | won't see many Dell laptops with that sort of price tag.
        
       | godber wrote:
       | This makes a completely valid point when you constrain the
       | meaning of Big Data to "the largest dataset one can fit on a
       | single computer".
        
         | dagw wrote:
         | At companies I've worked at "Big Data" was often used to mean
         | "too big to open in Excel" or in the extreme case "too big to
         | fit in RAM on my laptop"
        
           | datadrivenangel wrote:
           | Annoyingly medium data is my term for this.
           | 
           | Around 0.5 to 50 GB is such an annoying area, because Excel
           | starts falling over on the lower end and even nicer computers
           | will start seriously struggling on the larger end if you're
           | not being extremely efficient.
        
       | bhouston wrote:
       | I have a large analytics dataset in BigQuery and I wrote an
       | interactive exploratory UI on top of it and any query I did
       | generally finished in 2s or less. This led to a very simple app
       | with infinite analytics refinement that was also fast.
       | 
       | I would definitely not trade that for a pre-computed analytics
       | approach. The freedom to explore in real time is enlightening and
       | freeing.
       | 
        | I think you have restricted yourself to precomputed, fixed
        | analytics, but real-time interactive analytics is also an
        | interesting area.
        
       | tonyhart7 wrote:
        | Is there an open source analytics project that builds on top of
        | DuckDB yet?
        | 
        | I mostly see ClickHouse, Postgres, etc.
        
       | carlineng wrote:
       | This is really a question of economics. The biggest organizations
       | with the most ability to hire engineers have need for
       | technologies that can solve their existing problems in
       | incremental ways, and thus we end up with horrible technologies
       | like Hadoop and Iceberg. They end up hiring talented engineers to
       | work on niche problems, and a lot of the technical discourse ends
       | up revolving around technologies that don't apply to the majority
       | of organizations, but still cause FOMO amongst them. I, for one,
       | am extremely happy to see technologies like DuckDB come along to
       | serve the long tail.
        
       | jandrewrogers wrote:
       | > As recently shown, the median scan in Amazon Redshift and
       | Snowflake reads a doable 100 MB of data, and the 99.9-percentile
       | reads less than 300 GB. So the singularity might be closer than
       | we think.
       | 
       | There is some circular reasoning embedded here. I've seen many,
       | many cases of people finding ways to cut up their workloads into
       | small chunks because the performance and efficiency of these
       | platforms is far from optimal if you actually tried to run your
       | workload at its native scale. To some extent, these "small reads"
       | reflect the inadequacy of the platform, not the desire of a user
       | to run a particular workload.
       | 
       | A better interpretation may be that the existing distributed
       | architectures for data analytics don't scale well except for
       | relatively trivial workloads. There has been an awareness of this
       | for over a decade but a dearth of platform architectures that
       | address it.
        
       | hodgesrm wrote:
       | > If we look at the time a bit closer, we see the queries take
       | anywhere between a minute and half an hour. Those are not
       | unreasonable waiting times for analytical queries on that sort of
       | data in any way.
       | 
        | I'm really skeptical of arguments that say it's OK to be slow.
        | Even in the modern laptop example, queries still take up to 47
        | seconds.
       | 
       | Granted, I'm not looking at the queries but the fact is that
       | there are _a lot_ of applications where users need results back
       | in less than a second.[0] If the results are feeding automated
        | processes like page rendering, they need it back in 10s of
        | milliseconds at most. That takes hardware to accomplish
       | consistently. Especially if the datasets are large.
       | 
       | The small data argument becomes even weaker when you consider
       | that analytic databases don't just do queries on static datasets.
       | Large datasets got that way by absorbing a lot of data very
       | quickly. They therefore do ingest, compaction, and
       | transformations. These require resources, especially if they run
        | in parallel with queries on the same data. Scaling them
       | independently requires distributed systems. There isn't another
       | solution.
       | 
       | [0] SIEM, log management, trace management, monitoring
       | dashboards, ... All potentially large datasets where people sift
       | through data very quickly and repeatedly. Nobody wants to wait
       | more than a couple seconds for results to come back.
        
       | npalli wrote:
       | DuckDB works well if
       | 
        | * you have a small dataset (total, not just what a single user
        | is scanning)
        | 
        | * no real-time updates, just a static dataset that you can
        | analyze at leisure
        | 
        | * only a few users, and only one doing any writes
        | 
        | * several seconds is an OK response time; it gets worse if you
        | have to load your scanned segment into the DuckDB node
       | 
       | * generally read-only workloads
       | 
       | So yeah, not convinced we lost a decade.
        
       ___________________________________________________________________
       (page generated 2025-05-22 23:01 UTC)