[HN Gopher] DeWitt and Stonebraker's "MapReduce: A major step ba...
       ___________________________________________________________________
        
       DeWitt and Stonebraker's "MapReduce: A major step backwards" (2009)
        
       Author : mooreds
       Score  : 89 points
       Date   : 2024-03-30 14:57 UTC (8 hours ago)
        
 (HTM) web link (craig-henderson.blogspot.com)
 (TXT) w3m dump (craig-henderson.blogspot.com)
        
       | mrkeen wrote:
       | > Not novel at all -- it represents a specific implementation of
       | well known techniques developed nearly 25 years ago
       | 
       | Your old stuff is bad.
       | 
       | > We, however, assert that they should not overlook the lessons
       | of more than 40 years of database technology --
       | 
       | Our old stuff is good.
        
         | mucle6 wrote:
         | I like your perspective
        
         | sinkasapa wrote:
         | I think that in academic work this is reasonable. People
         | sometimes claim to have done something novel, when it is
         | clearly a repetition of something older. Perhaps the citations
         | of the works being criticized didn't make it clear that they
         | were repeating an old idea. It allows the reader to see that
         | there were relevant previous works that may have been subjected
         | to criticism in the past and that those previous criticisms
         | remain important for the more recent work. The question then is
         | whether the recent work resolves those criticisms. The authors
         | in this case seem not to think so.
         | 
         | A better paraphrase is the following: The new idea is not so
          | new as some claim. It is nearly 25 years old and was previously shown
         | to be flawed in comparison to other approaches. Looking over 40
         | years of literature, one can see the flaws in the approach and
         | how the flaws were subsequently resolved. Proponents of the
         | "new idea" are either unaware of these advancements or are
         | ignoring them.
         | 
          | I'd say that ignoring previous literature and work is a big
          | problem in CS-adjacent fields. My own experience is as a
          | theoretical linguist who interfaces with computational
          | linguists. I have had colleagues receive criticism for citing
         | work that is 10 years old or more, even if such a work
         | represents the earliest example of a particular idea they are
         | making use of. It is suggested that a more modern work should
         | be cited. There is kind of an "anti-memory" culture that
         | results from trying to make work seem cutting edge, even if a
         | work is clearly an extension or reinvention of very old ideas.
        
       | lewis1028282 wrote:
       | Some of these issues were solved by Spark. Do agree with the
       | overall point, people shouldn't be reaching for Hadoop when
       | Postgres would suffice. Indexes are fast. Use them.
        
         | ysofunny wrote:
         | > _people shouldn't be reaching for Hadoop when Postgres would
         | suffice. Indexes are fast. Use them._
         | 
         | let me try a riskier one: "people should not be reaching for
         | kubernetes when a system administrator would suffice. sysadmins
         | are cheap(er?) than clouds, use them"
         | 
          | it did not come out very well... I got stuck trying to find
          | what to contrast kube with; all I got in the minutes allotted
          | to comment posting was 'system administrator'. meh.
        
           | rijx wrote:
           | Nowadays K3s is worth the small learning curve with a big
           | payoff as you get a lot of automation included / by
           | installing an operator :-)
        
             | jt2190 wrote:
             | > ... K3s...
             | 
             | I'm assuming this isn't a typo and you mean Kubes?
             | 
             | https://kubes.guru/getting-started/
        
               | maxcoder4 wrote:
               | I don't know what OP meant, but https://k3s.io/ also fits
               | the context.
        
               | Areading314 wrote:
                | K8s is an abbreviation of Kubernetes. K3s is an
                | abbreviation of K8s.
        
               | eichin wrote:
               | ... wouldn't that be k1s?
        
           | dunk010 wrote:
           | Everyone wants to use k8s and like 1% need it in any shape or
           | form. It basically acts as a conspiracy between otherwise-
            | redundant ops-level people and those running Kubernetes
           | against the companies who have to pay for all this. Just use
           | Fargate, and be done.
        
           | nurettin wrote:
           | I thought you still need a sysadmin or whoever is needed to
           | set up and maintain kubernetes.
        
         | kevindamm wrote:
         | Indexes are fast when they're built well and used often.
         | Indexes are expensive (and paid for in triplicate via backup
         | costs) when they are seldom or never used. Sometimes you just
         | need to materialize a table temporarily, which of course you
         | can do in the RDBMS as well, but sometimes the data sources are
         | so scattered (or also ephemeral) that keeping all processing
         | inside the DB system is a stretch.
         | 
          | But perhaps the most compelling justification is the team's
          | familiarity with the DB system. Not everyone has the same
          | level of SQL expertise, and some of the visualization tools
         | added to MapReduce systems and the source language itself are
         | more familiar to them than the output of an EXPLAIN statement.
         | Especially if the same pipeline is effectively hundreds of
         | lines in SQL.
        
         | woooooo wrote:
         | If you're doing analytics that require full table scans,
         | indices are pure overhead.
         | 
         | No database will beat just piping all the records through some
         | process for full scans.
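Not from the thread, but the tradeoff is easy to see with stdlib sqlite3 (table and column names invented for illustration): a point lookup uses the index, while a whole-table aggregate scans every row regardless, so for scan-only analytics the index is pure write/storage overhead.

```python
# Sketch: when an index helps and when it is ignored.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user_id INTEGER, bytes INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [(i % 100, i) for i in range(10_000)])
conn.execute("CREATE INDEX idx_user ON logs (user_id)")

# Point lookup: the planner picks the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM logs WHERE user_id = 42").fetchall()
print(plan[0][3])   # e.g. "SEARCH logs USING INDEX idx_user (user_id=?)"

# Full-table aggregate: the index is ignored; every row is read anyway.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT sum(bytes) FROM logs").fetchall()
print(plan[0][3])   # e.g. "SCAN logs"
```

The exact EXPLAIN QUERY PLAN wording varies between SQLite versions, but the SEARCH-vs-SCAN distinction is stable.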
        
       | kevindamm wrote:
       | MapReduce is a good example of choosing tools suitable to the
        | task. I'll agree with the authors that there are certainly
        | large workloads for which a traditional RDBMS is better suited
        | or more optimal.
       | 
       | But the designers and avid users of MapReduce did not use it
       | because it seemed a more optimal DB query engine. The
       | utility|cost being optimized was not compute cycles or disk seeks
        | or log appends; it was the developer time needed to construct
       | another large-scale OLAP definition and an overall tolerance to
       | hardware and system failure.
        
       | nopurpose wrote:
        | The paper evaluates MapReduce in the context of RDBMSes. Was
        | it ever advocated as a replacement for them?
       | 
       | Closest MR product to RDBMS I know is Hive, but it was just a
       | thin SQL layer mostly for convenience, not a killer product by
       | any means.
       | 
       | MR was cool and I can't deny that managing Hadoop/Spark clusters
       | was quite fun.
        
       | BoiledCabbage wrote:
       | I also agree we should stop using MapReduce because of its lack
       | of support for Crystal Reports. /s
       | 
       | Kidding aside it's easy to bash a paper like this with the
       | benefit of hindsight and move on. But instead look at what we can
       | take from it.
       | 
        | The benefit of new tech isn't always in being better than old
        | tech at its strengths; it can also come from being better at
        | the old tech's weaknesses, finding a way to change the problem
        | domain so that those strengths don't matter much.
        
       | edejong wrote:
        | The 'no schema needed' paradigm that persisted within DE is
        | finally being replaced by the hard lessons learned in the '70s.
       | 
       | Systems die, data stays.
        
       | dunk010 wrote:
       | I remember this paper, and was at a company at the time that was
       | one of the first to use MapReduce, so saw this all play out first
       | hand. I appreciated the paper. Then and now developers rush to
       | grab new technologies, especially those that stroke their ego-
       | driven fantasies of "working at scale" without considering their
       | underlying constraints or applicability. At the time this was
       | published every company and startup under the sun was rushing to
       | use MapReduce, most often in places where it wasn't warranted.
       | I'm glad someone surfaced this paper again; people still need to
       | learn the lessons that it outlines. Microservices and k8s: I'm
       | looking straight at you.
        
         | jeffbee wrote:
         | Which problem is more serious? 1) your small company has an
         | over-complex system that could have been postgres; 2) your
         | medium-sized company has a postgres that's on fire at the
         | bottom of the ocean every day despite the forty people you
         | hired to stabilize postgres, and your scalable replacement
         | system is still six months away?
        
           | davidw wrote:
           | I have very rarely seen the second scenario, but the first
           | seems more common.
        
             | growse wrote:
             | Isn't the second example representative of all tech debt /
             | neglect ever? If so, it's _very_ common.
        
               | natebc wrote:
               | I generally classify tech debt more as a long todo/wish
               | list that we'll never get a chance to work on rather than
               | a server or service being on fire.
        
               | voakbasda wrote:
               | I have found that these fires become uncontrollable
                | because of tech debt. While rarely the spark, it's a
               | latent fuel source.
               | 
               | It's like our modern forests; unless something clears out
               | the brush, we see wildfires start from the smallest
               | spark. Once it starts, it's almost impossible to do
               | anything but try to limit the extent of the disaster.
        
               | sitkack wrote:
               | In the second scenario, they can't do math. They could
               | have bought themselves 6-18 months by getting the most
               | powerful machine available using probably at most 1-2
               | salaries worth of those 40 people.
               | 
                | Less than a single-digit percentage of workloads needs
                | massive, hard-to-use horizontal scale-out; the rest can
                | be solved on a single machine, or a single database.
               | 
                | MR _is_ useful as an ad-hoc scheduler over data. Need
                | to OCR 10k files? MR it.
               | 
               | Hadoop was the worst possible implementation of MR,
               | wasted so much of everything. That was its primary
               | strength.
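The "OCR 10k files" pattern above is just a parallel map with a gather at the end. A toy sketch (the file names and the fake_ocr stub are invented; a real job would use a process pool and an actual OCR library):

```python
# Sketch: fan independent files out to workers, collect results.
from concurrent.futures import ThreadPoolExecutor

def fake_ocr(path):
    # Stand-in for a real per-file OCR call.
    return path, f"text extracted from {path}"

files = [f"scan_{i:05d}.png" for i in range(10_000)]

with ThreadPoolExecutor(max_workers=16) as pool:
    results = dict(pool.map(fake_ocr, files))

print(len(results))  # 10000
```

MapReduce's contribution was running this same shape across machines, with retries when a worker dies mid-batch.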
        
               | PaulHoule wrote:
               | Reminds me when I had a 3-machine Hadoop cluster in my
                | home lab and 2 nodes were turned off, but I was
                | submitting jobs to it and getting results just fine.
               | 
               | I remember all the people pushing erasure code based
               | distributed file systems pointing out how crazy it is to
               | have three copies of something but Hadoop could run in a
               | degraded condition without degraded performance.
        
               | sitkack wrote:
               | I agree. I used Disco MR to do amazing things. Trivial to
               | use, like anyone could be productive in under an hour.
               | 
               | Erasure codes are awesome, but so is just having 3
               | copies. When you have skin in the game, simplicity is the
               | most important driver of good outcomes. Look at the
               | dimensions that Netezza optimized, they saw a
               | technological window and they took it. Right now we have
                | workstations that can push 100GB/s from flash. We are
                | talking about being able to sort 1TB of data in 20
                | seconds (from flash); the same machine could do it from
                | RAM in 10.
               | 
               | https://github.com/discoproject/disco
               | 
               | I need to give Ray and Dask a try.
               | 
               | I don't know where to put this comment so I'll put it
               | here. DeWitt and Stonebraker are right, but also wrong.
               | Everyone is talking past each other there. Both are
               | geniuses, this essay wasn't super strong.
               | 
               | If I was their editor, I would say, reframe it as
               | MapReduce is an implementation detail, we also need these
               | other things for this to be usable by the masses. Their
               | point about indexes proves my point about talking past
               | each other. If you are scanning the data basically once,
               | building an index is a waste.
        
               | hinkley wrote:
               | Very early on in my enterprise career, in a continuance
               | of a discussion where it was mentioned that our customer
               | was contemplating a terabyte disk array (that would fill
               | an entire server rack, so very fucking early) I learned
               | about the great grandfather of NVME drives: battery
               | backed RAM disks that cost $40k inflation adjusted.
               | 
               | "Why on earth would you spend the cost of a brand new
               | sedan on a drive like this?" I asked. Answer: to put the
               | Oracle or DB2 WAL data on so you could vertically scale
               | your database just that much higher while you tried to
               | solve the throughput problems you were having another
                | way. It was either the bargaining phase of grief or a Hail
               | Mary you could throw in to help a behind-schedule
               | rearchitecture. Last resort vertical scaling.
        
               | nfw2 wrote:
                | No, plenty of tech debt is caused by over-engineering
                | or prematurely optimizing for the wrong thing.
               | 
                | I'm not sure if the second outcome is meant to blame
                | Postgres specifically or under-engineering in general,
               | but neither seems to me like it should be a concern for
               | an early-stage startup.
        
           | alanbernstein wrote:
           | Probably 3) the system you overengineered too early solved
           | the wrong problem, and your replacement is six months away,
           | but you've paid for it twice.
        
           | feoren wrote:
           | #1 is more serious. #2 limits the growth of your already
           | successful company. #1 sinks your struggling small business.
           | You have to be successful to be a victim of your own success,
           | after all. Not to mention the fact that #1 is way more
           | common. Do you know how far Postgres scales? Because it's way
              | past almost any medium-scale business.
        
             | jvans wrote:
              | Exactly. A lot of us work at #2, so we wish our
              | predecessors had saved us our current pain. But if they
              | had gone that route we wouldn't be employed at that
              | company, because it wouldn't exist.
        
             | nfw2 wrote:
             | Exactly, if a medium-sized company is struggling with
             | Postgres, either they have very niche requirements or the
             | scalability problems are in their own code.
        
           | asah wrote:
           | This was true in 2009. Since then, multiple PostgreSQL-
           | compatible databases have launched.
        
         | merb wrote:
         | Kubernetes is not like mapreduce. It does not need
         | microservices at all. It is a scheduling and deployment
         | framework, which you will implement yourself anyway (hopefully
          | you do) or you use a PaaS. It's not even that hard to work with
         | it. Of course it is complex, but a lot of these tools are, even
         | the lower level ones like terraform.
        
           | hinkley wrote:
           | Between monoliths and microservices you have services and
           | sidecars. If you don't at least have sidecars I really don't
           | see the point of kubernetes, because most of the rest of the
           | services will follow Conway's Law and can reasonably do their
           | own thing for less than 125% of the cost of full bin packing.
        
         | hinkley wrote:
         | Map reduce came to my radar around the same time the Trough of
         | Disillusionment hit for some other things, including design
         | patterns. We still believed in the 8 Fallacies of Distributed
         | Computing back then, before cloud providers came along and
         | started selling Fallacies as a Service.
         | 
          | I can't wait for that hangover to hit us. It's likely to be the
         | best one of my career.
        
         | jauntywundrkind wrote:
         | Generally I appreciate this post, as yeah the bandwagon effect
         | is real.
         | 
         | I'd characterize mapreduce as a very very specific narrow
         | architectural pattern. Trying to apply it contorts the code you
         | write. I don't see anything remotely like that that's true
         | about Kubernetes or containers (microservices much more so in
         | creating constraints).
         | 
          | We had to reset the Days Since Kubernetes Whinge counter again
         | yesterday: https://news.ycombinator.com/item?id=39868586 . And
         | a couple of people spoke to how you might not need containers,
         | but I still haven't heard anyone say what not having containers
         | could win you. What types of code can you only write _without_
         | containers? We can convince ourselves that Kubernetes is hard,
         | but also lots of people also say it 's easy/not bad, so there's
         | some difficulty-factor that's unknown/variable. But I strongly
         | struggle to see a parallel between a strong code architecture
         | choice like MapReduce and a generic platform like Kubernetes or
         | containers.
         | 
         | The platform seems pleasantly neutral in shaping what you do
         | with it, in my view; if that wasn't true it would never have
         | been a success.
        
           | dwattttt wrote:
           | > I still haven't heard anyone say what not having containers
           | could win you. What types of code can you only write without
           | containers?
           | 
           | Code with one less layer of abstraction. If that layer is
           | buying you something you need, it's great. But abstraction
           | isn't a positive in & of itself, it's why we get upset about
           | GenericAbstractFactoryBeanSingleton(s).
        
         | BiteCode_dev wrote:
         | Yep, also GraphQL and all those data lakes.
         | 
         | That reminds me "XML is the future":
         | https://www.bitecode.dev/p/hype-cycles
        
       | fiddlerwoaroof wrote:
       | The objection I have to this paper is that I keep seeing SQL
       | forced onto streaming workflows when the functional programming
       | operations provided by the non-SQL APIs of systems like Flink and
       | Spark are a lot easier to think about in this context: SQL stream
       | joins often have very surprising performance properties and it
       | can take a long time to figure out the tunables needed to make
       | them perform at scale.
        
       | mccanne wrote:
       | Necessity is the mother of invention. MapReduce-based systems
       | were developed because the state-of-the-art RDBMS systems of that
       | age could not scale to the needs of the Googles/Yahoos/Facebooks
       | during the phenomenal growth spurt of the early Web. The novelty
       | here was the tradeoffs they made to scale out and up using the
       | compute and storage footprints available at the time.
       | 
       | "We thought of that" vs "we built it and made it work".
        
         | dekhn wrote:
         | MapReduce was never built to compete with RDBMS systems. It was
          | built to compete with batch-scheduled distributed data
         | processing, typically where there was no index. It was also
         | built to build indices (the search index), not really use them
         | during any of the three phases. It was also built to be
         | reliable in the face of cheap machines (bad RAM and disk).
         | 
         | Google built MR because it was in an existential crisis: they
         | couldn't build a new index for the search engine, and freshness
         | and size of the index was important for early search engines.
         | The previous tools would crash part-way through due to the
         | cheap hardware that Google bought. If Google had based search
         | indexing on RDBMS, they would not exist today.
         | 
          | Now Google _did_ use RDBMSes -- they used MySQL at scale. It
          | wasn't unheard-of for mapreduces to run against MySQL (typically
         | doing a query to get a bunch of records, and then mapping over
         | those records).
         | 
         | I worked on later mapreduce (long after it was mature) which
         | used all sorts of tricks to extend the MapReduce paradigm as
          | far as possible, but ultimately nearly everything got replaced
         | with Flume, which is effectively a computational superset of
         | what MR can do.
         | 
          | I think the paper must have been pulled because Stonebraker
          | got huge pushback for attacking MR for something
         | it wasn't good at. See the original paper for what they
          | proposed as good use cases: counting word occurrences in a large
         | corpus (far larger than the storage limits of postgres and
         | others at the time), distributed grep (without an index),
         | counting unique items (where the number of items is larger than
         | the capacity of a database at the time), reversing a graph
          | (converting (source, target) pairs to (target, [source, source,
          | source])), term vectors, inverted index (the original use case
         | for building the index) and distributed sort. None of the RDBMS
         | of that day could handle the scale of the web. That's all.
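The word-count use case from the original paper fits in a few lines of single-process Python: map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums each group. The real system runs each phase across thousands of machines; this only shows the shape of the computation.

```python
# Sketch of MapReduce word count, single-process.
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 3
```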
        
           | makmanalp wrote:
            | > Stonebraker must have gotten huge pushback for attacking
           | MR for something it wasn't good at
           | 
           | I like this comment because it gets to the heart of a
           | misunderstanding. I'd further correct it to say "for
           | something it wasn't trying to be good at". DeWitt and
           | Stonebraker just didn't understand why anyone would want
           | this, and I can see why: change was coming faster than it
           | ever did, from many angles. Let's travel back in time to see
           | why:
           | 
           | The decade after mapreduce appeared - when I came of age as a
           | programmer - was a fascinating time of change:
           | 
           | The backdrop is the post-dotcom bubble when the hype cycle
           | came to a close, and the web market somewhat consolidated in
           | a smaller set of winners who now were more proven and ready
           | to go all in on a new business model that elevates doing
           | business on the web above all else, in a way that would truly
           | threaten brick and mortar.
           | 
           | Alongside that we have CPU manufacturers struggling with
           | escalating clock speeds and jamming more transistors into a
           | single die to keep up with Moore's law and consumer demand,
           | which leads to the first commodity dual and multi core CPUs.
           | 
           | But I remember that most non-scientific software just
           | couldn't make use of multiple CPUs or cores effectively yet.
           | So we were ripe for a programming model that engineers who've
           | never heard of lamport before can actually understand and
           | work with: threads and locks and socket programming in C and
           | C++ were a rough proposition, and MPI was certainly a thing
           | but the scientific computing people who were working on
           | supercomputers, grids, and Beowulf clusters were not the same
           | people as the dotcom engineers using commodity hardware.
           | 
           | Companies pushing these boundaries were wanting to do things
           | that traditional DBMSes could not offer at a certain scale,
           | at least for cheap enough. The RDBMS vendors and priesthood
           | were defending that it's hard to offer that while also
           | offering ACID and everything else a database offers, which
           | was certainly not wrong: it's hard to support an OLAP use
           | case with the OLTP-style System-R-ish design that dominated
           | the market in those days. This was some of the most
           | complicated and sophisticated software ever made, imbued with
           | magiclike qualities from decades of academic research
           | hardened by years of industrial use.
           | 
           | Then there was data warehouse style solutions that were
           | "appliances" that were locked into a specific and expensive
           | combination of hardware and software optimized to work well
           | together and also to extract millions and billions of dollars
           | from the fortune 500s that could afford them.
           | 
           | So the ethos at the booming post-dotcoms definitely was "do
           | we really need all this crap that's getting in our way?", and
           | we would soon find out. Couching it in formalism and calling
           | it "mapreduce" made it sound fancier than what it really was:
           | some glue that made it easy for engineers to declaratively
           | define how to split work into chunks, shuffle them around and
           | assemble them again across many computers, without having to
           | worry about the pedestrian details of the glue in between. A
           | corporate drone didn't have to understand /how/ it worked,
           | just how to fill in the blanks for each step properly: a much
           | more viable proposition than thousands of engineers writing
            | software together that involves finicky locks and
           | semaphores.
           | 
           | The DBMS crowd thumbed their noses at this because it was
           | truly SO primitive and wasteful compared to the sophisticated
           | mechanisms built to preserve efficiency that dated back to
           | the 70s: indexes, access patterns, query optimizers,
           | optimized storage layouts. What they didn't get was that
           | every million dollar you didn't waste on what was essentially
           | the space shuttle of computer software - fabulously expensive
           | and complicated - could now buy a /lot/ more cheapo computing
           | power duct taped together. The question was how to leverage
           | that. Plus, with things changing at the pace that they did
           | back then, last year's CPU could be obsolete by next year, so
           | how well could the vendors building custom hardware even keep
           | up with that, after you paid them their hefty fees? The value
           | proposition was "it's so basic that it will run on anything,
           | and it's future proof" - the democratization aspect could be
           | hard to understand for an observer at that point, because the
           | tidal wave hadn't hit yet.
           | 
            | What came was the start of a transition from datacenters to rack
           | mounts in colos and dedicated hosts to virtualization and
           | very soon after the first programmable commodity clouds: why
           | settle for an administered unixlike timesharing environment
           | when you can manage everything yourself and don't have to ask
           | for permission? Why deal with buying and maintaining
           | hardware? This lowered the barrier for smaller companies and
           | startups who previously couldn't afford access to such things
           | nor markets that required them, which unleashed what can only
           | be described as a hunger for anything that could leverage
           | that model.
           | 
           | So it's not so much that worse was better, but that worse was
           | briefly more appropriate for the times. "Do we really need
           | all this crap that's getting in our way?" really took hold
           | for a moment, and programmers were willing to dump anything
           | and everything that was previously sacred if they thought
            | it'd buy them scalability: schemas and complex queries, to
            | start.
           | 
           | Soon after, people started figuring out how to maintain all
           | the benefits they'd gained (democratized massively parallel
           | commodity computing) while bringing back some of the good
           | stuff from the past. Only 2 years later, Google itself
           | published the BigTable paper where it described a more
           | sophisticated storage mechanism which optimized accesses
           | better, and admittedly was tailored for a different use case,
           | but could work in conjunction with mapreduce. Academia and
           | the VLDB / CIDR crowd was more interested now.
           | 
           | Some years after that came out the papers for F1 and Spanner,
           | which added back in a SQL-like query engine, transactions,
           | secondary indexes etc on top of a similar distributed model
           | in the context of WAN-distributed datacenters. Everyone
           | preached the end of nosql and document databases, whitepapers
           | were written about "newsql", frustrated veterans complained
           | about yet another fad cycle where what was old was new again.
           | 
           | Of course that's not what happened: the story here was how a
           | software paradigm failed to adapt to the changing hardware
           | climate and business needs, so capitalism had its guts ripped
           | apart and slowly reassembled in a more context-applicable
           | way. Instead of storage engines we got so many things it's
           | hard to keep up with, but leveldb comes to mind as an
            | ancestor. With locks we got Chubby and ZooKeeper. With
           | log structures we got kafka and its ilk. With query optimizer
           | engines we got presto. With in-memory storage we got arrow.
           | We got a cambrian explosion of all kinds of variations and
           | combinations of these, but eventually the market started to
           | settle again and now we're in a new generation of "no,
           | really, our product can do it all". It's the lifecycle of
           | unbundling and rebundling. It will happen again. Very curious
           | what will come next.
        
         | loeg wrote:
         | My recollection of the time is that lots of people thought they
         | needed to use MapReduce for their "big data" but their data was
          | like 100GB of logs they wanted to run an O(N) analysis on.
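
          The kind of job described above rarely needs a cluster; a
          single machine streaming through the file is already O(N). A
          minimal stdlib-only sketch (the space-separated log format
          here is an assumption, not any real system's):

          ```python
          from collections import Counter

          def count_status_codes(lines):
              """One O(N) pass over log lines, counting the status field.

              Assumes a hypothetical space-separated format where the
              status code is the second field, e.g. "GET 200 /index.html".
              """
              counts = Counter()
              for line in lines:
                  fields = line.split()
                  if len(fields) >= 2:
                      counts[fields[1]] += 1
              return counts

          # Streaming from a file keeps memory flat regardless of size:
          # with open("access.log") as f:
          #     print(count_status_codes(f))
          ```

          Even at 100GB, a scan like this is bounded by disk throughput,
          not by anything a MapReduce cluster would fix.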
        
       | jey wrote:
       | I wonder whether they pulled their article because they changed
       | their mind. Maybe over time they got to understand the difference
       | in use-cases and tradeoffs. Though it's definitely true that
       | plain MapReduce can be shockingly inefficient, systems like Spark
       | which add rich data structures and in-memory caching make a big
       | improvement towards closing that gap. But it's still not a
        | panacea, and I've definitely seen use cases where SQLite could
       | replace and far outperform Spark, but the general programming
       | model behind Spark is still a powerful and widely applicable
       | abstraction.
       | 
       | My main complaint with Spark is that it's pretty hard for non-
       | experts to debug crashes and failures. While the programming
       | model beautifully abstracts away all the complexities of parallel
       | and distributed programming with its headaches of managing
       | clusters and dealing with synchronization and communication, the
       | entire abstraction bursts instantly when debugging is needed.
       | Suddenly you need to understand the entire Spark programming
       | model from top-to-bottom, from DataFrames to RDDs and partitions,
       | along with implementation details like shuffle files,
       | DAGScheduler, block caching, Python-JVM interop, OOM killer, and
       | on and on.
       | 
       | And there's absolutely no incentive for the commercial vendors to
       | improve this situation in the open source project. (This isn't
       | specific to Spark though, the entire Hadoop ecosystem seems to
       | operate on hidden complexity in the open source version with
       | vendors who make money by providing support and more usable
       | distributions.)
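
        The SQLite point above is easy to make concrete: many "big
        data" aggregations fit comfortably in a local embedded
        database. A minimal sketch (the table and column names are made
        up for illustration):

        ```python
        import sqlite3

        # An in-memory database stands in for a local file like "events.db".
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
        conn.executemany(
            "INSERT INTO events VALUES (?, ?)",
            [(1, 10.0), (1, 5.0), (2, 7.5)],
        )

        # The whole "job": a grouped aggregation, no cluster required.
        rows = conn.execute(
            "SELECT user_id, SUM(amount) FROM events"
            " GROUP BY user_id ORDER BY user_id"
        ).fetchall()
        print(rows)  # → [(1, 15.0), (2, 7.5)]
        ```

        When the data fits on one disk, this avoids the entire
        cluster-management and shuffle-debugging surface described
        above.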
        
         | sitkack wrote:
          | Spark exists because Python is slow and the Java GC
          | absolutely sucked for large heaps; they were unknowingly
          | optimizing around the wrong things.
         | 
         | A single Rust program on a large machine can handle 95% of what
         | any org on earth would need.
         | 
         | https://web.archive.org/web/20080509185243/http://www.databa...
         | 
         | The article died from link rot and website "refreshes".
        
           | VHRanger wrote:
           | I'll be honest, the performance issue is almost never the
           | language used.
           | 
           | You can write Python code that's very fast for almost any
           | usecase (using numpy/pandas/vaex/polars/etc). Similarly for
           | Java.
           | 
            | The main thing is that the way people use Python/Java is
            | generally extremely inefficient - the code is full of
            | cache misses because of how the idiomatic way to use those
            | languages conflicts with how CPU caches work.
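
            A stdlib-only sketch of that point: the idiomatic
            list-of-dicts layout scatters a million tiny objects across
            the heap, while a columnar layout keeps each field in one
            sequence (NumPy and friends push this much further with
            contiguous typed arrays). The numbers and fields here are
            arbitrary:

            ```python
            import time

            N = 1_000_000

            # Idiomatic but cache-unfriendly: a million tiny heap objects.
            rows = [{"price": 1.0, "qty": 2} for _ in range(N)]

            # Columnar: each field is one contiguous sequence.
            prices = [1.0] * N
            qtys = [2] * N

            t0 = time.perf_counter()
            total_rows = sum(r["price"] * r["qty"] for r in rows)
            t1 = time.perf_counter()
            total_cols = sum(p * q for p, q in zip(prices, qtys))
            t2 = time.perf_counter()

            assert total_rows == total_cols
            print(f"row-of-dicts: {t1 - t0:.3f}s  columnar: {t2 - t1:.3f}s")
            ```

            On CPython the columnar pass is typically faster, since it
            skips a dict lookup per field; the gap widens dramatically
            once the columns become typed arrays.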
        
             | sitkack wrote:
             | I agree, we trade horses for the wrong reasons.
             | 
             | For big companies, the cloud is about downsizing their tech
             | workforce and speeding up delivery.
             | 
             | For tech focused companies, they could get all the
             | advantages of the cloud by adopting cloud dev practices on
             | prem.
        
         | bornfreddy wrote:
         | Is Spark still used anywhere? I haven't heard about it in a
         | loooong time. At the time it seemed like a very nice
         | abstraction, but while I played around with it, I never had a
         | problem that needed such a complex solution.
        
           | twoodfin wrote:
           | Well, Databricks is likely to either go public or be bought
           | by Microsoft for $40B or so in the next 12 months, so ...
           | yes?
        
           | Kamq wrote:
           | PySpark instead of spark, but I had a job a couple years back
           | using it in glue to generate financial reports. No longer on
           | the project, but I'm pretty sure they're using it.
           | 
           | Honestly wasn't that bad of a model. But, then again, the job
           | didn't actually need spark, someone just sold it that way
           | before I was on the project. Fun to work with though
        
         | gregw134 wrote:
          | Isn't the Hadoop story also playing out with Spark? The open
          | source version is kinda buggy, while the Spark committers at
          | Databricks have retained all the fixes for themselves.
        
       | advisedwang wrote:
        | Interesting that all of the successors to MapReduce (I'm
        | thinking BigQuery, Snowflake, Spark) fix all the major
        | criticisms DeWitt and Stonebraker raised. So the lessons were
        | re-learned and features re-introduced, but the massively
        | parallel execution has persisted.
       | 
       | Perhaps MapReduce was non-novel and flawed, but it certainly
       | seems to have led to a flowering of rich, large scale data
       | querying systems.
        
       | nmca wrote:
       | A different criticism of mapreduce-like technologies, and one of
       | my all time favourite papers in any field, is:
       | 
       | Scalability, but at what COST?
       | 
        | By Frank McSherry. If you enjoy dist-systems bashing, or
        | honest engineering in general, it's a must-read.
        
       | marcinzm wrote:
       | I was at Yahoo during the time of Hadoop/MapReduce. I'd summarize
       | the value of MapReduce in one sentence as: a solution you can use
       | is infinitely better than a solution you can't. An optimized DB
        | might be much better for a specific workload. But to get an
        | optimized DB would take 6 months of conversations with a half
        | dozen teams and getting approval for hardware. Then you'd use
        | it for a few days before moving to your next workload. Then
        | get yelled at for the waste of resources. To use MapReduce you
        | logged into a
       | cluster and ran your code the same day. MapReduce wasn't
       | replacing databases. It was replacing local scripts running on
       | database dumps and dozen machine clusters cobbled together to
       | improve throughput. It's clear from the backgrounds of the people
       | who wrote that piece that they never had to be in the shoes of
       | the people actually using MapReduce on a daily basis and getting
       | value from it.
       | 
       | edit: This is a roughly the same reason cloud took off. Cloud
        | costs more but waiting 6 months for IT to deploy a half-baked
       | compromise solution is significantly more expensive for the
       | business in lost opportunity and productivity.
        
         | mistrial9 wrote:
         | read between the lines -- management giving orders that happen
         | right away, is worth money to management -- full stop.
        
       | indepnd wrote:
       | A case of "worse is better"
        
       | jdlyga wrote:
       | Is Apache Spark really a MapReduce replacement? All the recent
       | technical books I've read look to MapReduce as a thing of the
       | past.
        
         | gregw134 wrote:
         | Short answer: yes
        
       | wwarner wrote:
       | this seems to be an accurate capture of the original
       | "databasecolumn" blog post
       | https://dsf.berkeley.edu/cs286/papers/backwards-vertica2008....
        
       | ignoreusernames wrote:
        | 100% agree. The MapReduce hype always seemed strange to me
        | because it's basically the Volcano paper from the 90s but with
        | custom user-defined operators instead of pre-baked ones in a
        | more traditional engine. To make everything worse, Hadoop came
        | along, ignoring every industry advance of the past 40 years
        | with its "one tuple at a time" iterator-based model on a
        | garbage-collected language. I realize it's very easy for me to
        | say those things in hindsight, but it's not like vectorized
        | execution was a weird obscure secret by the time these things
        | came out.
       | 
        | On a side note, it finally looks like the industry is moving
        | towards saner tools that implement a lot of the things this
        | article says MapReduce was missing.
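
        The Volcano-style "one tuple at a time" iterator model vs.
        batched execution can be sketched in a few lines. This is a toy
        filter operator, not any engine's real code; production
        vectorized engines operate over typed columnar batches:

        ```python
        def scan(rows):
            # Volcano style: every operator is an iterator yielding one
            # tuple per next() call, so each tuple pays per-call overhead.
            for row in rows:
                yield row

        def filter_gt(child, threshold):
            for row in child:
                if row > threshold:
                    yield row

        def scan_batched(rows, batch_size=1024):
            # Vectorized style: operators exchange whole batches,
            # amortizing the per-call overhead across many tuples.
            for i in range(0, len(rows), batch_size):
                yield rows[i:i + batch_size]

        def filter_gt_batched(child, threshold):
            for batch in child:
                yield [row for row in batch if row > threshold]

        data = list(range(10))
        tuple_at_a_time = list(filter_gt(scan(data), 5))
        batched = [row for batch in filter_gt_batched(scan_batched(data), 5)
                   for row in batch]
        assert tuple_at_a_time == batched == [6, 7, 8, 9]
        ```

        Both pipelines compute the same result; the difference is how
        often control bounces between operators, which is exactly the
        overhead the vectorized-execution literature attacks.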
        
       | fifilura wrote:
       | I am sure this was true 2009.
       | 
       | But I feel like the technologies that started with MapReduce have
       | now matured into BigQuery and Presto/Trino.
       | 
       | And those have incorporated some of the criticism in the article
       | such as schemas.
       | 
        | Not indexes though; they instead optimize on partitions. But I
        | think that's for the better, since they are more suited for
        | the job and easier to work with.
        
       | ein0p wrote:
       | Distributed SQL backends such as BigQuery and Spanner basically
       | use MapReduce underneath anyway. As do most other distributed
       | backends, SQL or not, capable of aggregation between shards.
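
        The claim above amounts to: sharded aggregation is map/reduce.
        A minimal sketch of the pattern (an illustration, not any
        engine's actual code):

        ```python
        from collections import Counter

        def map_shard(shard):
            # "Map" phase: each shard computes a partial aggregate
            # locally, so only small summaries cross the network.
            partial = Counter()
            for key, value in shard:
                partial[key] += value
            return partial

        def reduce_partials(partials):
            # "Reduce" phase: merge per-shard partials into the result.
            total = Counter()
            for p in partials:
                total.update(p)
            return total

        shards = [
            [("a", 1), ("b", 2)],   # rows living on shard 1
            [("a", 3), ("c", 4)],   # rows living on shard 2
        ]
        result = reduce_partials(map_shard(s) for s in shards)
        print(dict(result))  # → {'a': 4, 'b': 2, 'c': 4}
        ```

        A distributed GROUP BY ... SUM compiles to essentially this
        shape, whatever the engine calls the phases.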
        
       | random3 wrote:
        | Having been an early adopter (2007) and ridden the whole wave,
        | I think the entire movement ignited by MapReduce, BigTable,
        | etc. was probably one of the best things that happened in the
        | industry.
        | 
        | For me personally, it allowed me to break things down to first
        | principles in a way that "industry coding" wouldn't have been
        | able to. It was the practical side of theory that had been
        | confined to school.
       | 
       | There were, however, two types of big data adopters. I was in the
        | bottom up camp, where passion for learning and finding the
       | solution to the problem was the driver. The top-down camp that
       | eventually filled the Hadoop conferences by the time they got
       | large (>1000 people) I suspect didn't get much out of it, neither
       | for their organizations, nor personally.
       | 
        | So back to Stonebraker: back then, same as now, it looks like
        | frustration more than anything. I do understand where it comes
        | from, but it is still frustration more than anything.
        | Relational algebra is nice, but classical databases and SQL
        | never nailed either theory or practice. NoSQL for me was more
        | NoOracle, NoMSSQL, etc., and an ability to learn by doing from
        | the ground up.
        
       | dtjohnnymonkey wrote:
       | I am responsible for a MapReduce-based system built about 13
       | years ago and it is the bane of my team's existence right now. It
       | was the hotness in its day, but did not age well at all. We are
       | working on a replacement using ClickHouse.
        
       ___________________________________________________________________
       (page generated 2024-03-30 23:01 UTC)