[HN Gopher] DeWitt and Stonebraker's "MapReduce: A major step ba...
___________________________________________________________________
DeWitt and Stonebraker's "MapReduce: A major step backwards" (2009)
Author : mooreds
Score : 89 points
Date : 2024-03-30 14:57 UTC (8 hours ago)
(HTM) web link (craig-henderson.blogspot.com)
(TXT) w3m dump (craig-henderson.blogspot.com)
| mrkeen wrote:
| > Not novel at all -- it represents a specific implementation of
| well known techniques developed nearly 25 years ago
|
| Your old stuff is bad.
|
| > We, however, assert that they should not overlook the lessons
| of more than 40 years of database technology --
|
| Our old stuff is good.
| mucle6 wrote:
| I like your perspective
| sinkasapa wrote:
| I think that in academic work this is reasonable. People
| sometimes claim to have done something novel, when it is
| clearly a repetition of something older. Perhaps the citations
| of the works being criticized didn't make it clear that they
| were repeating an old idea. It allows the reader to see that
| there were relevant previous works that may have been subjected
| to criticism in the past and that those previous criticisms
| remain important for the more recent work. The question then is
| whether the recent work resolves those criticisms. The authors
| in this case seem not to think so.
|
| A better paraphrase is the following: The new idea is not so
| new as some claim. It is 20 years old and was previously shown
| to be flawed in comparison to other approaches. Looking over 40
| years of literature, one can see the flaws in the approach and
| how the flaws were subsequently resolved. Proponents of the
| "new idea" are either unaware of these advancements or are
| ignoring them.
|
| I'd say that ignoring previous literature and work is a big
| problem in CS-adjacent studies. My experience comes from being a
| theoretical linguist who interfaces with computational
| linguists. I have had colleagues receive criticism for citing
| work that is 10 years old or more, even if such a work
| represents the earliest example of a particular idea they are
| making use of. It is suggested that a more modern work should
| be cited. There is kind of an "anti-memory" culture that
| results from trying to make work seem cutting edge, even if a
| work is clearly an extension or reinvention of very old ideas.
| lewis1028282 wrote:
| Some of these issues were solved by Spark. Do agree with the
| overall point, people shouldn't be reaching for Hadoop when
| Postgres would suffice. Indexes are fast. Use them.
| ysofunny wrote:
| > _people shouldn't be reaching for Hadoop when Postgres would
| suffice. Indexes are fast. Use them._
|
| let me try a riskier one: "people should not be reaching for
| kubernetes when a system administrator would suffice. sysadmins
| are cheap(er?) than clouds, use them"
|
| it did not come out very well.... I got stuck trying to find
| what to contrast kube with; all I got in the minutes allotted
| to comment posting was 'system administrator'. meh.
| rijx wrote:
| Nowadays K3s is worth the small learning curve with a big
| payoff as you get a lot of automation included / by
| installing an operator :-)
| jt2190 wrote:
| > ... K3s...
|
| I'm assuming this isn't a typo and you mean Kubes?
|
| https://kubes.guru/getting-started/
| maxcoder4 wrote:
| I don't know what OP meant, but https://k3s.io/ also fits
| the context.
| Areading314 wrote:
| K8s is an abbreviation of Kubernetes. K3s is an
| abbreviation of K8s.
| eichin wrote:
| ... wouldn't that be k1s?
| dunk010 wrote:
| Everyone wants to use k8s and like 1% need it in any shape or
| form. It basically acts as a conspiracy between otherwise-
| redundant ops level people and those running Kubernetes
| against the companies who have to pay for all this. Just use
| Fargate, and be done.
| nurettin wrote:
| I thought you still need a sysadmin or whoever is needed to
| set up and maintain kubernetes.
| kevindamm wrote:
| Indexes are fast when they're built well and used often.
| Indexes are expensive (and paid for in triplicate via backup
| costs) when they are seldom or never used. Sometimes you just
| need to materialize a table temporarily, which of course you
| can do in the RDBMS as well, but sometimes the data sources are
| so scattered (or also ephemeral) that keeping all processing
| inside the DB system is a stretch.
|
| But perhaps the most compelling justification is the team's
| familiarity with the DB system. Not everyone has the same
| level of SQL expertise and some of the visualization tools
| added to MapReduce systems and the source language itself are
| more familiar to them than the output of an EXPLAIN statement.
| Especially if the same pipeline is effectively hundreds of
| lines in SQL.
| woooooo wrote:
| If you're doing analytics that require full table scans,
| indices are pure overhead.
|
| No database will beat just piping all the records through some
| process for full scans.
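[Editor's note] The full-scan point above can be illustrated with a minimal Python sketch: stream every record once and aggregate, with no index built or consulted. The comma-separated log layout here is invented purely for illustration.

```python
import io

def scan_sum(lines):
    # Full scan: touch every record exactly once; no index is
    # consulted or maintained, so there is zero index overhead.
    return sum(int(line.rstrip().split(",")[1]) for line in lines)

log = io.StringIO("a,1\nb,2\nc,3\n")
print(scan_sum(log))  # 6
```

For a query that reads every row anyway, this is the whole cost model: bytes streamed divided by throughput, with nothing paid up front.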
| kevindamm wrote:
| MapReduce is a good example of choosing tools suitable to the
| task. I'll agree with the authors that there are certainly large
| data workloads for which a traditional RDBMS is better suited or
| more optimal.
|
| But the designers and avid users of MapReduce did not use it
| because it seemed a more optimal DB query engine. The
| utility/cost being optimized was not compute cycles or disk seeks
| or log appends, it was the developer time needed to construct
| another large-scale OLAP definition and an overall tolerance to
| hardware and system failure.
| nopurpose wrote:
| The paper evaluates MapReduce in the context of RDBMSes. Was it
| ever advocated as a replacement for them?
|
| The closest MR product to an RDBMS I know of is Hive, but it was
| just a thin SQL layer, mostly for convenience, not a killer
| product by any means.
|
| MR was cool and I can't deny that managing Hadoop/Spark clusters
| was quite fun.
| BoiledCabbage wrote:
| I also agree we should stop using MapReduce because of its lack
| of support for Crystal Reports. /s
|
| Kidding aside it's easy to bash a paper like this with the
| benefit of hindsight and move on. But instead look at what we can
| take from it.
|
| The benefit of new tech isn't always being better than old
| tech at its strengths; it can also be being better than old
| tech at its weaknesses, finding a way to change the problem
| domain so that its strengths don't matter much.
| edejong wrote:
| The 'no schema needed' paradigm that persisted within DE is
| finally being replaced by the hard lessons learned in the '70s.
|
| Systems die, data stays.
| dunk010 wrote:
| I remember this paper, and was at a company at the time that was
| one of the first to use MapReduce, so saw this all play out first
| hand. I appreciated the paper. Then and now developers rush to
| grab new technologies, especially those that stroke their ego-
| driven fantasies of "working at scale" without considering their
| underlying constraints or applicability. At the time this was
| published every company and startup under the sun was rushing to
| use MapReduce, most often in places where it wasn't warranted.
| I'm glad someone surfaced this paper again; people still need to
| learn the lessons that it outlines. Microservices and k8s: I'm
| looking straight at you.
| jeffbee wrote:
| Which problem is more serious? 1) your small company has an
| over-complex system that could have been postgres; 2) your
| medium-sized company has a postgres that's on fire at the
| bottom of the ocean every day despite the forty people you
| hired to stabilize postgres, and your scalable replacement
| system is still six months away?
| davidw wrote:
| I have very rarely seen the second scenario, but the first
| seems more common.
| growse wrote:
| Isn't the second example representative of all tech debt /
| neglect ever? If so, it's _very_ common.
| natebc wrote:
| I generally classify tech debt more as a long todo/wish
| list that we'll never get a chance to work on rather than
| a server or service being on fire.
| voakbasda wrote:
| I have found that these fires become uncontrollable
| because of tech debt. While rarely the spark, it's a
| latent fuel source.
|
| It's like our modern forests; unless something clears out
| the brush, we see wildfires start from the smallest
| spark. Once it starts, it's almost impossible to do
| anything but try to limit the extent of the disaster.
| sitkack wrote:
| In the second scenario, they can't do math. They could
| have bought themselves 6-18 months by getting the most
| powerful machine available using probably at most 1-2
| salaries worth of those 40 people.
|
| Less than a single-digit percentage of workloads needs
| massive, hard-to-use horizontal scale-out; the rest can
| be solved on a single machine, or a single database.
|
| MR _is_ useful as an ad-hoc scheduler over data. Need to
| OCR 10k files, MR it.
|
| Hadoop was the worst possible implementation of MR,
| wasted so much of everything. That was its primary
| strength.
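[Editor's note] The "need to OCR 10k files, MR it" idea above is just an embarrassingly parallel map with no reduce step. A minimal sketch using Python's standard library; `process_file` is a hypothetical stand-in for the per-file work such as OCR.

```python
from concurrent.futures import ProcessPoolExecutor

def process_file(path):
    # Hypothetical stand-in for per-file work such as OCR: any pure
    # function of one input, so a failed task can be retried alone.
    return (path, len(path))

def run_batch(paths):
    # Fan the work out over a pool of worker processes, much as a
    # map-only MapReduce job fans it out over a cluster.
    with ProcessPoolExecutor() as pool:
        return dict(pool.map(process_file, paths))

if __name__ == "__main__":
    print(run_batch([f"scan_{i}.png" for i in range(3)]))
```

The scheduling, retry, and result-collection glue is exactly what the MR framework provided at cluster scale.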
| PaulHoule wrote:
| Reminds me when I had a 3-machine Hadoop cluster in my
| home lab and 2 nodes were turned off but I was submitting
| jobs and getting results just fine.
|
| I remember all the people pushing erasure code based
| distributed file systems pointing out how crazy it is to
| have three copies of something but Hadoop could run in a
| degraded condition without degraded performance.
| sitkack wrote:
| I agree. I used Disco MR to do amazing things. Trivial to
| use, like anyone could be productive in under an hour.
|
| Erasure codes are awesome, but so is just having 3
| copies. When you have skin in the game, simplicity is the
| most important driver of good outcomes. Look at the
| dimensions that Netezza optimized, they saw a
| technological window and they took it. Right now we have
| workstations that can push 100GB/s from flash. We
| are talking about being able to sort 1TB of data in 20
| seconds (from flash); the same machine could do it from
| RAM in 10.
|
| https://github.com/discoproject/disco
|
| I need to give Ray and Dask a try.
|
| I don't know where to put this comment so I'll put it
| here. DeWitt and Stonebraker are right, but also wrong.
| Everyone is talking past each other there. Both are
| geniuses, this essay wasn't super strong.
|
| If I was their editor, I would say, reframe it as
| MapReduce is an implementation detail, we also need these
| other things for this to be usable by the masses. Their
| point about indexes proves my point about talking past
| each other. If you are scanning the data basically once,
| building an index is a waste.
| hinkley wrote:
| Very early on in my enterprise career, in a continuance
| of a discussion where it was mentioned that our customer
| was contemplating a terabyte disk array (that would fill
| an entire server rack, so very fucking early) I learned
| about the great grandfather of NVME drives: battery
| backed RAM disks that cost $40k inflation adjusted.
|
| "Why on earth would you spend the cost of a brand new
| sedan on a drive like this?" I asked. Answer: to put the
| Oracle or DB2 WAL data on so you could vertically scale
| your database just that much higher while you tried to
| solve the throughput problems you were having another
| way. It was either the bargaining phase of loss or a Hail
| Mary you could throw in to help a behind-schedule
| rearchitecture. Last resort vertical scaling.
| nfw2 wrote:
| No, plenty of tech debt is caused by over-engineering or
| prematurely optimizing for the wrong thing.
|
| I'm not sure if the second outcome is meant to blame
| Postgres specifically or under-engineering in general,
| but neither seems to me like it should be a concern for
| an early-stage startup.
| alanbernstein wrote:
| Probably 3) the system you overengineered too early solved
| the wrong problem, and your replacement is six months away,
| but you've paid for it twice.
| feoren wrote:
| #1 is more serious. #2 limits the growth of your already
| successful company. #1 sinks your struggling small business.
| You have to be successful to be a victim of your own success,
| after all. Not to mention the fact that #1 is way more
| common. Do you know how far Postgres scales? Because it's way
| past almost any medium-scale business.
| jvans wrote:
| Exactly. A lot of us work at #2, so we wish our predecessors
| had saved us our current pain. But if they had gone that
| route we wouldn't be employed at that company, because it
| wouldn't exist.
| nfw2 wrote:
| Exactly, if a medium-sized company is struggling with
| Postgres, either they have very niche requirements or the
| scalability problems are in their own code.
| asah wrote:
| This was true in 2009. Since then, multiple PostgreSQL-
| compatible databases have launched.
| merb wrote:
| Kubernetes is not like mapreduce. It does not need
| microservices at all. It is a scheduling and deployment
| framework, which you will implement yourself anyway (hopefully
| you do) or you use a PaaS. It's not even that hard to work with
| it. Of course it is complex, but a lot of these tools are, even
| the lower level ones like terraform.
| hinkley wrote:
| Between monoliths and microservices you have services and
| sidecars. If you don't at least have sidecars I really don't
| see the point of kubernetes, because most of the rest of the
| services will follow Conway's Law and can reasonably do their
| own thing for less than 125% of the cost of full bin packing.
| hinkley wrote:
| Map reduce came to my radar around the same time the Trough of
| Disillusionment hit for some other things, including design
| patterns. We still believed in the 8 Fallacies of Distributed
| Computing back then, before cloud providers came along and
| started selling Fallacies as a Service.
|
| I can't wait for that hangover to hit us. It's likely to be the
| best one of my career.
| jauntywundrkind wrote:
| Generally I appreciate this post, as yeah the bandwagon effect
| is real.
|
| I'd characterize mapreduce as a very very specific narrow
| architectural pattern. Trying to apply it contorts the code you
| write. I don't see anything remotely like that that's true
| about Kubernetes or containers (microservices much more so in
| creating constraints).
|
| We had to reset the Days Since Kubernetes Whinge counter again
| yesterday: https://news.ycombinator.com/item?id=39868586 . And
| a couple of people spoke to how you might not need containers,
| but I still haven't heard anyone say what not having containers
| could win you. What types of code can you only write _without_
| containers? We can convince ourselves that Kubernetes is hard,
| but lots of people also say it's easy/not bad, so there's
| some difficulty-factor that's unknown/variable. But I strongly
| struggle to see a parallel between a strong code architecture
| choice like MapReduce and a generic platform like Kubernetes or
| containers.
|
| The platform seems pleasantly neutral in shaping what you do
| with it, in my view; if that wasn't true it would never have
| been a success.
| dwattttt wrote:
| > I still haven't heard anyone say what not having containers
| could win you. What types of code can you only write without
| containers?
|
| Code with one less layer of abstraction. If that layer is
| buying you something you need, it's great. But abstraction
| isn't a positive in & of itself, it's why we get upset about
| GenericAbstractFactoryBeanSingleton(s).
| BiteCode_dev wrote:
| Yep, also GraphQL and all those data lakes.
|
| That reminds me "XML is the future":
| https://www.bitecode.dev/p/hype-cycles
| fiddlerwoaroof wrote:
| The objection I have to this paper is that I keep seeing SQL
| forced onto streaming workflows when the functional programming
| operations provided by the non-SQL APIs of systems like Flink and
| Spark are a lot easier to think about in this context: SQL stream
| joins often have very surprising performance properties and it
| can take a long time to figure out the tunables needed to make
| them perform at scale.
| mccanne wrote:
| Necessity is the mother of invention. MapReduce-based systems
| were developed because the state-of-the-art RDBMS systems of that
| age could not scale to the needs of the Googles/Yahoos/Facebooks
| during the phenomenal growth spurt of the early Web. The novelty
| here was the tradeoffs they made to scale out and up using the
| compute and storage footprints available at the time.
|
| "We thought of that" vs "we built it and made it work".
| dekhn wrote:
| MapReduce was never built to compete with RDBMS systems. It was
| built to compete with batch-scheduled distributed data
| processing, typically where there was no index. It was also
| built to build indices (the search index), not really use them
| during any of the three phases. It was also built to be
| reliable in the face of cheap machines (bad RAM and disk).
|
| Google built MR because it was in an existential crisis: they
| couldn't build a new index for the search engine, and freshness
| and size of the index were important for early search engines.
| The previous tools would crash part-way through due to the
| cheap hardware that Google bought. If Google had based search
| indexing on RDBMS, they would not exist today.
|
| Now Google _did_ use RDBMS- they used MySQL at scale. It
| wasn't unheard-of for mapreduces to run against MySQL (typically
| doing a query to get a bunch of records, and then mapping over
| those records).
|
| I worked on later mapreduce (long after it was mature) which
| used all sorts of tricks to extend the MapReduce paradigm as
| far as possible but ultimately nearly everything got replaced
| with Flume, which is effectively a computational superset of
| what MR can do.
|
| I think the paper must have been pulled because Stonebraker
| must have gotten huge pushback for attacking MR for something
| it wasn't good at. See the original paper for what they
| proposed as good use cases: counting word occurrences in a large
| corpus (far larger than the storage limits of postgres and
| others at the time), distributed grep (without an index),
| counting unique items (where the number of items is larger than
| the capacity of a database at the time), reversing a graph
| (converting (source, target) pairs to (target, [source, source,
| source])), term vectors, inverted index (the original use case
| for building the index) and distributed sort. None of the RDBMS
| of that day could handle the scale of the web. That's all.
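[Editor's note] The word-count use case from the original paper, mentioned above, can be sketched in a few lines of single-process Python that mimic the three phases. This is a toy illustration of the programming model, not Google's implementation.

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the partial counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cat sat", "the dog sat"]
pairs = [p for doc in docs for p in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

The framework's job was to run `map_phase` and `reduce_phase` across thousands of machines and make the shuffle survive hardware failures; the user only filled in the two functions.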
| makmanalp wrote:
| > Stonebraker must have gotten huge pushback for attacking
| MR for something it wasn't good at
|
| I like this comment because it gets to the heart of a
| misunderstanding. I'd further correct it to say "for
| something it wasn't trying to be good at". DeWitt and
| Stonebraker just didn't understand why anyone would want
| this, and I can see why: change was coming faster than it
| ever did, from many angles. Let's travel back in time to see
| why:
|
| The decade after mapreduce appeared - when I came of age as a
| programmer - was a fascinating time of change:
|
| The backdrop is the post-dotcom bubble when the hype cycle
| came to a close, and the web market somewhat consolidated in
| a smaller set of winners who now were more proven and ready
| to go all in on a new business model that elevates doing
| business on the web above all else, in a way that would truly
| threaten brick and mortar.
|
| Alongside that we have CPU manufacturers struggling with
| escalating clock speeds and jamming more transistors into a
| single die to keep up with Moore's law and consumer demand,
| which leads to the first commodity dual and multi core CPUs.
|
| But I remember that most non-scientific software just
| couldn't make use of multiple CPUs or cores effectively yet.
| So we were ripe for a programming model that engineers who've
| never heard of Lamport before can actually understand and
| work with: threads and locks and socket programming in C and
| C++ were a rough proposition, and MPI was certainly a thing
| but the scientific computing people who were working on
| supercomputers, grids, and Beowulf clusters were not the same
| people as the dotcom engineers using commodity hardware.
|
| Companies pushing these boundaries were wanting to do things
| that traditional DBMSes could not offer at a certain scale,
| at least for cheap enough. The RDBMS vendors and priesthood
| were defending that it's hard to offer that while also
| offering ACID and everything else a database offers, which
| was certainly not wrong: it's hard to support an OLAP use
| case with the OLTP-style System-R-ish design that dominated
| the market in those days. This was some of the most
| complicated and sophisticated software ever made, imbued with
| magiclike qualities from decades of academic research
| hardened by years of industrial use.
|
| Then there were data-warehouse-style solutions that were
| "appliances", locked into a specific and expensive
| combination of hardware and software optimized to work well
| together and also to extract millions and billions of dollars
| from the fortune 500s that could afford them.
|
| So the ethos at the booming post-dotcoms definitely was "do
| we really need all this crap that's getting in our way?", and
| we would soon find out. Couching it in formalism and calling
| it "mapreduce" made it sound fancier than what it really was:
| some glue that made it easy for engineers to declaratively
| define how to split work into chunks, shuffle them around and
| assemble them again across many computers, without having to
| worry about the pedestrian details of the glue in between. A
| corporate drone didn't have to understand /how/ it worked,
| just how to fill in the blanks for each step properly: a much
| more viable proposition than thousands of engineers writing
| software together that involves finnicky locks and
| semaphores.
|
| The DBMS crowd thumbed their noses at this because it was
| truly SO primitive and wasteful compared to the sophisticated
| mechanisms built to preserve efficiency that dated back to
| the 70s: indexes, access patterns, query optimizers,
| optimized storage layouts. What they didn't get was that
| every million dollar you didn't waste on what was essentially
| the space shuttle of computer software - fabulously expensive
| and complicated - could now buy a /lot/ more cheapo computing
| power duct taped together. The question was how to leverage
| that. Plus, with things changing at the pace that they did
| back then, last year's CPU could be obsolete by next year, so
| how well could the vendors building custom hardware even keep
| up with that, after you paid them their hefty fees? The value
| proposition was "it's so basic that it will run on anything,
| and it's future proof" - the democratization aspect could be
| hard to understand for an observer at that point, because the
| tidal wave hadn't hit yet.
|
| What came was the start of a transition from datacenters to rack
| mounts in colos and dedicated hosts to virtualization and
| very soon after the first programmable commodity clouds: why
| settle for an administered unixlike timesharing environment
| when you can manage everything yourself and don't have to ask
| for permission? Why deal with buying and maintaining
| hardware? This lowered the barrier for smaller companies and
| startups who previously couldn't afford access to such things
| nor markets that required them, which unleashed what can only
| be described as a hunger for anything that could leverage
| that model.
|
| So it's not so much that worse was better, but that worse was
| briefly more appropriate for the times. "Do we really need
| all this crap that's getting in our way?" really took hold
| for a moment, and programmers were willing to dump anything
| and everything that was previously sacred if they thought
| it'd buy them scalability, schemas and complex queries to
| start.
|
| Soon after, people started figuring out how to maintain all
| the benefits they'd gained (democratized massively parallel
| commodity computing) while bringing back some of the good
| stuff from the past. Only 2 years later, Google itself
| published the BigTable paper where it described a more
| sophisticated storage mechanism which optimized accesses
| better, and admittedly was tailored for a different use case,
| but could work in conjunction with mapreduce. Academia and
| the VLDB / CIDR crowd were more interested now.
|
| Some years after that came out the papers for F1 and Spanner,
| which added back in a SQL-like query engine, transactions,
| secondary indexes etc on top of a similar distributed model
| in the context of WAN-distributed datacenters. Everyone
| preached the end of nosql and document databases, whitepapers
| were written about "newsql", frustrated veterans complained
| about yet another fad cycle where what was old was new again.
|
| Of course that's not what happened: the story here was how a
| software paradigm failed to adapt to the changing hardware
| climate and business needs, so capitalism ripped its guts
| apart and slowly reassembled them in a more context-applicable
| way. Instead of storage engines we got so many things it's
| hard to keep up with, but leveldb comes to mind as an
| ancestor. With locks we got chubby and zookeeper. With
| log structures we got kafka and its ilk. With query optimizer
| engines we got presto. With in-memory storage we got arrow.
| We got a cambrian explosion of all kinds of variations and
| combinations of these, but eventually the market started to
| settle again and now we're in a new generation of "no,
| really, our product can do it all". It's the lifecycle of
| unbundling and rebundling. It will happen again. Very curious
| what will come next.
| loeg wrote:
| My recollection of the time is that lots of people thought they
| needed to use MapReduce for their "big data" but their data was
| like 100GB of logs they wanted to run an O(N) analysis on.
| jey wrote:
| I wonder whether they pulled their article because they changed
| their mind. Maybe over time they got to understand the difference
| in use-cases and tradeoffs. Though it's definitely true that
| plain MapReduce can be shockingly inefficient, systems like Spark
| which add rich data structures and in-memory caching make a big
| improvement towards closing that gap. Still, it's not a
| panacea: I've definitely seen use cases where SQLite could
| replace and far outperform Spark, yet the general programming
| model behind Spark remains a powerful and widely applicable
| abstraction.
|
| My main complaint with Spark is that it's pretty hard for non-
| experts to debug crashes and failures. While the programming
| model beautifully abstracts away all the complexities of parallel
| and distributed programming with its headaches of managing
| clusters and dealing with synchronization and communication, the
| entire abstraction bursts instantly when debugging is needed.
| Suddenly you need to understand the entire Spark programming
| model from top-to-bottom, from DataFrames to RDDs and partitions,
| along with implementation details like shuffle files,
| DAGScheduler, block caching, Python-JVM interop, OOM killer, and
| on and on.
|
| And there's absolutely no incentive for the commercial vendors to
| improve this situation in the open source project. (This isn't
| specific to Spark though, the entire Hadoop ecosystem seems to
| operate on hidden complexity in the open source version with
| vendors who make money by providing support and more usable
| distributions.)
| sitkack wrote:
| Spark exists because Python is slow and the Java GC absolutely
| sucked for large heaps, they were unknowingly optimizing around
| the wrong things.
|
| A single Rust program on a large machine can handle 95% of what
| any org on earth would need.
|
| https://web.archive.org/web/20080509185243/http://www.databa...
|
| The article died from link rot and website "refreshes".
| VHRanger wrote:
| I'll be honest, the performance issue is almost never the
| language used.
|
| You can write Python code that's very fast for almost any
| usecase (using numpy/pandas/vaex/polars/etc). Similarly for
| Java.
|
| The main thing is the way people use Python/Java is generally
| extremely inefficient - the code is full of cache misses
| because of how the idiomatic way to use those languages
| conflicts with how CPU caches work.
| sitkack wrote:
| I agree, we trade horses for the wrong reasons.
|
| For big companies, the cloud is about downsizing their tech
| workforce and speeding up delivery.
|
| For tech focused companies, they could get all the
| advantages of the cloud by adopting cloud dev practices on
| prem.
| bornfreddy wrote:
| Is Spark still used anywhere? I haven't heard about it in a
| loooong time. At the time it seemed like a very nice
| abstraction, but while I played around with it, I never had a
| problem that needed such a complex solution.
| twoodfin wrote:
| Well, Databricks is likely to either go public or be bought
| by Microsoft for $40B or so in the next 12 months, so ...
| yes?
| Kamq wrote:
| PySpark instead of spark, but I had a job a couple years back
| using it in glue to generate financial reports. No longer on
| the project, but I'm pretty sure they're using it.
|
| Honestly wasn't that bad of a model. But, then again, the job
| didn't actually need spark, someone just sold it that way
| before I was on the project. Fun to work with though
| gregw134 wrote:
| Isn't the Hadoop story also playing out with spark--the open
| source version is kinda buggy, while the spark committers at
| Databricks have retained all the fixes for themselves.
| advisedwang wrote:
| Interesting that all of the successors to MapReduce fix all the
| major criticisms DeWitt and Stonebraker had. (I'm thinking
| BigQuery, Snowflake, Spark) So the lessons were re-learned and
| features re-introduced, but the massively parallel execution has
| persisted.
|
| Perhaps MapReduce was non-novel and flawed, but it certainly
| seems to have led to a flowering of rich, large scale data
| querying systems.
| nmca wrote:
| A different criticism of mapreduce-like technologies, and one of
| my all time favourite papers in any field, is:
|
| Scalability, but at what COST?
|
| By Frank McSherry. If you enjoy dist-systems bashing, or honest
| engineering in general, it's a must-read.
| marcinzm wrote:
| I was at Yahoo during the time of Hadoop/MapReduce. I'd summarize
| the value of MapReduce in one sentence as: a solution you can use
| is infinitely better than a solution you can't. An optimized DB
| might be much better for a specific workload. To get an optimized
| DB would take 6 months of conversations with a half dozen teams
| and getting approval for hardware. Then you'd use it for a few
| days before moving to your new workload. Then get yelled at for
| the waste of resources. To use MapReduce you logged into a
| cluster and ran your code the same day. MapReduce wasn't
| replacing databases. It was replacing local scripts running on
| database dumps and dozen machine clusters cobbled together to
| improve throughput. It's clear from the backgrounds of the people
| who wrote that piece that they never had to be in the shoes of
| the people actually using MapReduce on a daily basis and getting
| value from it.
|
| edit: This is roughly the same reason cloud took off. Cloud
| costs more, but waiting 6 months for IT to deploy a half-baked
| compromise solution is significantly more expensive for the
| business in lost opportunity and productivity.
| mistrial9 wrote:
| read between the lines -- management giving orders that happen
| right away, is worth money to management -- full stop.
| indepnd wrote:
| A case of "worse is better"
| jdlyga wrote:
| Is Apache Spark really a MapReduce replacement? All the recent
| technical books I've read look to MapReduce as a thing of the
| past.
| gregw134 wrote:
| Short answer: yes
| wwarner wrote:
| this seems to be an accurate capture of the original
| "databasecolumn" blog post
| https://dsf.berkeley.edu/cs286/papers/backwards-vertica2008....
| ignoreusernames wrote:
| 100% agree. mapReduce hype always seemed strange to me because
| it's basically the volcano paper from the 90s but with custom
| user defined operators instead of pre baked ones in a more
| traditional engine. To make everything worse, hadoop came along,
| ignoring every industry advance of the past 40 years with its
| "one tuple at a time" iterator based model on a garbage collected
| language. I realize it's very easy for me to say those things in
| hindsight, but it's not like vectorized execution was a weird
| obscure secret by the time these things came out.
|
| On a side note, it finally looks like the industry is moving
| towards saner tools that implement a lot of things that this
| article mentions mapReduce was missing
| fifilura wrote:
| I am sure this was true in 2009.
|
| But I feel like the technologies that started with MapReduce have
| now matured into BigQuery and Presto/Trino.
|
| And those have incorporated some of the criticism in the article
| such as schemas.
|
| Not indexes though; they instead optimize on partitions. But I
| think that's for the better, since they are more suited for the
| job and easier to work with.
| ein0p wrote:
| Distributed SQL backends such as BigQuery and Spanner basically
| use MapReduce underneath anyway. As do most other distributed
| backends, SQL or not, capable of aggregation between shards.
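[Editor's note] The cross-shard aggregation described above is essentially a combine/reduce step: each shard computes a partial aggregate over its local rows and a coordinator merges the partials. A minimal sketch; the shard layout and row format are invented for illustration.

```python
def shard_partial(rows):
    # Each shard scans only its local rows and returns a partial
    # aggregate; (count, sum) is enough to derive count, sum, avg.
    values = [row["amount"] for row in rows]
    return (len(values), sum(values))

def coordinator_merge(partials):
    # The coordinator reduces the per-shard partials, exactly like
    # a reduce phase over shard-local map outputs.
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return {"count": count, "sum": total, "avg": total / count}

shards = [
    [{"amount": 10}, {"amount": 20}],  # shard 0
    [{"amount": 30}],                  # shard 1
]
print(coordinator_merge([shard_partial(s) for s in shards]))
# {'count': 3, 'sum': 60, 'avg': 20.0}
```

This is why MapReduce-shaped execution survives inside distributed SQL engines: only the small partials cross the network, never the raw rows.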
| random3 wrote:
| Having been an early adopter (2007) and riding the whole wave I
| think the entire movement ignited with MapReduce, BigTable, etc.
| was probably one of the best things that happened in the
| industry.
|
| For me personally, it allowed me to break things down to first
| principles in a way that "industry coding" wouldn't have been
| able to. It was the practical side of theory that was confined in
| school.
|
| There were, however, two types of big data adopters. I was in the
| bottom up camp, where passion for learning and finding the best
| solution to the problem was the driver. The top-down camp that
| eventually filled the Hadoop conferences by the time they got
| large (>1000 people) I suspect didn't get much out of it, neither
| for their organizations, nor personally.
|
| So back to Stonebraker: back then, same as now, it looks like
| frustration more than anything. I do understand where it comes
| from, but it is still frustration more than anything. Relational
| algebra is nice, but classical databases and SQL nailed
| neither theory nor practice. NoSQL for me was more NoOracle,
| NoMSSQL, etc. and an ability to learn by doing from the ground
| up.
| dtjohnnymonkey wrote:
| I am responsible for a MapReduce-based system built about 13
| years ago and it is the bane of my team's existence right now. It
| was the hotness in its day, but did not age well at all. We are
| working on a replacement using ClickHouse.
___________________________________________________________________
(page generated 2024-03-30 23:01 UTC)