[HN Gopher] MapReduce, TensorFlow, Vertex: Google's bet to avoid...
___________________________________________________________________
MapReduce, TensorFlow, Vertex: Google's bet to avoid repeating
history in AI
Author : coloneltcb
Score : 62 points
Date : 2023-08-29 19:08 UTC (3 hours ago)
(HTM) web link (www.supervised.news)
(TXT) w3m dump (www.supervised.news)
| jeffbee wrote:
| The part about Vertex might be right but the establishing story
| about mapreduce is totally wrong. By the time Hadoop took off,
| mapreduce at Google already had one foot in the grave. If you are
| using Hadoop today you have adopted a technology stack that
| Google recognized as obsolete 15 years ago. It is difficult to
| see how Google lost that battle. They effectively disabled an
| entire industry by suggesting an obsolete stack, while
| simultaneously moving on to cheaper, better sequels.
| opportune wrote:
| You're right and wrong. MapReduce is two things: a pattern for
| massively parallel computation, and the name of the initial
| implementation of the pattern at Google.
|
| While the initial implementation at Google quickly got replaced
| with better things, the MapReduce pattern is everywhere in the
| data space, and almost taken for granted now. Hadoop is
| basically the same: a shitty initial implementation of the
| pattern (I think HDFS is still pretty good; it's the compute
| part that wasn't) that was quickly iterated on and improved.
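|
| To make the pattern itself concrete, here's a minimal
| word-count sketch in plain Python (a toy illustration of the
| map/shuffle/reduce shape, not any real implementation):
|
|     from collections import defaultdict
|
|     def map_phase(records):
|         # map: emit (key, value) pairs from each input record
|         for line in records:
|             for word in line.split():
|                 yield (word, 1)
|
|     def shuffle(pairs):
|         # shuffle: group all emitted values by key
|         groups = defaultdict(list)
|         for key, value in pairs:
|             groups[key].append(value)
|         return groups.items()
|
|     def reduce_phase(grouped):
|         # reduce: fold each key's values down to one result
|         return {key: sum(values) for key, values in grouped}
|
|     counts = reduce_phase(shuffle(map_phase(
|         ["the quick brown fox", "the quick dog"])))
|     # {'the': 2, 'quick': 2, 'brown': 1, 'fox': 1, 'dog': 1}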
|
| Also, a big reason people stopped having to think about e.g.
| rack-local operations is that most people operating on huge
| amounts of data now aren't doing it on traditional generic
| servers; they're using something like S3 on VMs in public
| cloud datacenters if they're doing something relatively "low
| level", or more likely just using something like Snowflake,
| Spark/Databricks (pretty close to OG mapreduce...), etc.
| dekhn wrote:
| It wasn't obsolete 15 years ago. There were production
| mapreduces making double-digit improvements in key metrics
| (watch time, apps purchased) much more recently than that. The
| system I worked on, Sibyl, isn't well known outside of Google,
| but used MR as an engine to do extremely large-scale machine
| learning. We added a number of features and used MR in ways
| that would have been extremely challenging to reimplement while
| maintaining performance targets.
|
| I'm not even sure the mapreduce code has been deleted from
| google3 yet.
|
| To be fair, MR was definitely dated by the time I joined (in
| 2007), and I'm surprised it lasted as long as it did. But it was
| really well-tuned and reliable.
|
| Also, the MR paper was never intended to stake out Google's
| position as a data processing provider (that came far, far
| later). The
| MR, Bigtable, and GFS papers were written to attract software
| engineers to work on infra at google, to share some useful
| ideas with the world (specifically, the distributed shuffle in
| mapreduce, the bloom filter in bigtable, and the single-master-
| index-in-ram of GFS), and finally, to "show off".
| ithkuil wrote:
| I remember the joke CL that deleted mapreduce from google3; it
| even got an approval from Urs, IIRC.
| choppaface wrote:
| > They effectively disabled an entire industry by suggesting an
| obsolete stack
|
| Interesting opinion but not supported at all by evidence. Most
| non-Google datasets are small and stored on off-the-shelf
| heterogeneous hardware, so HDFS / MapReduce for streaming OLAP
| is a great fit. Cassandra (BigTable) and Parquet (Dremel) plus
| Cloudera's Impala had much quicker time-to-market when large-
| scale BI became more relevant.
|
| "Obsolete" for Google problems sure, but Google problems
| largely only happen at Google. Stuff like ad targeting and ML
| look a lot different for products outside the Chocolate
| Factory.
| jeffbee wrote:
| I can't imagine HDFS being a "good fit" for anything. Its
| mere existence was undoubtedly a disabling force at two large
| companies where I worked.
| moandcompany wrote:
| I recommend reading the GFS paper and considering that there
| were/are use cases for horizontally scalable, fault-tolerant
| object storage, with the nuance that some of your storage/data
| nodes may also be compute nodes, and that there can be a
| benefit or preference for assigning applications to run where
| the underlying data is stored.
|
| In the HDFS case with Hadoop's ecosystem, consider Hive,
| HBase, Drill, and even Spark when running on YARN.
|
| In the peak days of Hadoop, many organizations were
| primarily on-prem, and S3 or S3 compatible object stores
| were mostly reserved for people using AWS.
| opportune wrote:
| The problem with using technology X "because Google does
| it" or "this is the best open source version of what Google
| uses, so let's use it because Google does it" is that
| companies neglect that Google _does not just use the
| technology out of the box_. They:
|
| 1. created the software for their own needs
|
| 2. maintain a developer team to improve it and address
| requirements/pain points/integration
|
| 3. have an internal pool of experts in the form of the
| developer team and "customers"/early adopters
|
| 4. most likely have other proprietary systems like Borg or
| Colossus which integrate with the software very well, which
| OSS like Hadoop may not (another example: OSS Bazel vs
| Blaze+Forge+Piper+Monorepo structure).
|
| Something like HDFS was hugely painful for many teams
| because they had no idea how it worked or how to debug it,
| had no idea how to fix it or extend it, and didn't have any
| good tooling to understand why something was slow. All they
| could do was try to configure it, integrate with it, and
| find answers for their problems online. That's because HDFS
| was "free" but a team capable of properly maintaining,
| supporting/operations, and developing HDFS was extremely
| expensive.
| sberens wrote:
| [Genuinely curious,] what have people moved on to?
| choppaface wrote:
| Dremel / Parquet, which facilitates much more efficient
| joins and filters versus Sawzall on MapReduce.
|
| For streaming there are Flume and Beam, or you can just load
| important data into Spanner.
| moandcompany wrote:
| Google itself moved on to "Flume" and later created
| "Dataflow," the precursor to Apache Beam. While Dataflow/Beam
| aren't data processing execution engines themselves, they
| decouple the language used to express a computation from the
| engine that runs it. At Google, for example, a data processing
| job might be expressed using Beam on top of Flume, with Flume
| doing the actual processing.
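|
| A minimal sketch with the Beam Python SDK (toy example; the
| runner is whatever you configure, e.g. the DirectRunner when
| run locally):
|
|     import apache_beam as beam
|
|     # The pipeline is expressed once; the engine that
|     # executes it (Dataflow, Flink, Spark, ...) is chosen
|     # via pipeline options.
|     with beam.Pipeline() as p:
|         lines = p | beam.Create(["the quick brown fox",
|                                  "the lazy dog"])
|         words = lines | beam.FlatMap(str.split)
|         counts = words | beam.combiners.Count.PerElement()
|         counts | beam.Map(print)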
|
| Outside of Google, most organizations with large distributed
| data processing problems moved on to Hadoop 2
| (YARN/MapReduce2) and, in the present day, to Apache Spark.
| When organizations say they are using "Databricks," they are
| using Apache Spark provided as a service by Databricks, a
| company started by the creators of Apache Spark.
|
| Apache Beam is also used outside of Google on top of other
| data processing "engines" or runners for these jobs, such as
| Google's Cloud Dataflow service, Apache Flink, Apache Spark,
| etc.
| gravypod wrote:
| Opinions are my own.
|
| Some info on flume: https://research.google/pubs/pub35650/
|
| To quote from there: "MapReduce and similar systems
| significantly ease the task of writing data-parallel code.
| However, many real-world computations require a pipeline of
| MapReduces, and programming and managing such pipelines can
| be difficult."
| summerlight wrote:
| In addition to Flume/Dataflow, there's a significant push
| toward SQL engines. In general, SQL (or similar query
| engines written in more declarative languages/APIs) has
| some performance benefits over typical Flume code, thanks to
| vectorized execution and other optimizations.
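|
| For instance, a word-count-style job expressed declaratively
| (a hypothetical PySpark SQL snippet; inside Google think of
| the internal SQL engines rather than Spark):
|
|     from pyspark.sql import SparkSession
|
|     spark = SparkSession.builder.getOrCreate()
|     df = spark.createDataFrame(
|         [("the quick brown fox",), ("the lazy dog",)],
|         ["line"])
|     df.createOrReplaceTempView("docs")
|
|     # The engine can pick a vectorized physical plan for
|     # this, instead of running opaque user code per record.
|     spark.sql("""
|         SELECT word, COUNT(*) AS n
|         FROM (SELECT explode(split(line, ' ')) AS word
|               FROM docs)
|         GROUP BY word
|     """).show()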
| chaxor wrote:
| So if I read this right, if you're not a big company
| (perhaps just a standard dev with maybe a tiny cluster of
| computers or just one beefy one), you just make a Docker
| container with pyspark and put your scripts in there, and
| everyone can reproduce your work easily on any type of
| machine or cluster? It seems like a reasonable approach,
| though it would be nice not to need the OS
| dependencies/Docker for Spark.
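|
| Something like the following is what I have in mind (assuming
| pyspark installed via pip, in which case a JVM is the only
| real OS dependency):
|
|     from pyspark.sql import SparkSession
|
|     # local[*] runs Spark inside this single process using
|     # all cores; pointing --master at a real cluster runs
|     # the same script unchanged.
|     spark = (SparkSession.builder
|              .master("local[*]")
|              .appName("repro")
|              .getOrCreate())
|
|     # "events.jsonl" and "user" are stand-ins for whatever
|     # data ships alongside the scripts.
|     df = spark.read.json("events.jsonl")
|     df.groupBy("user").count().show()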
| moandcompany wrote:
| Regarding "obsolete" and missed opportunities: this idea of
| missed opportunities for Google has been touched upon several
| times in the last decade, including the launch of Google Cloud
| Platform itself.
|
| Urs Holzle, the former head of Google TI (Technical
| Infrastructure), discussed in public some of the challenges and
| reasons for creating Google Cloud Platform as a platform, and
| backing projects like Kubernetes.
|
| Over time, Google has become a proprietary tech "island" in
| several ways, and is arguably more fragmented than other large
| tech companies, such as Microsoft and Amazon, which happen to
| both have commercial cloud offerings, and Meta/Facebook. While
| all of these companies certainly have challenges with not-
| invented-here ("NIH") syndrome, and lots of internal,
| proprietary tools, as a software engineer at one of these
| three, odds are you will use and touch more commercial and
| open-source technologies than you would at Google. Google
| itself still struggles with having teams and projects use GCP
| for internal work versus Borg/etc; and there are plenty of
| valid reasons why Google teams don't use GCP.
|
| The proprietary tech "island" issue is a non-trivial concern
| when you need to hire new software engineers from
| industry/outside, and ramp-up time with some of these systems
| may be six months or even greater; today Alphabet/Google is at
| around 200k+ FTE, and you aren't going to be able to find many
| engineers outside that have experience with
| Borg/Flume/Spanner/Monarch/etc. Likewise when you are an
| experienced Google software engineer looking to work elsewhere,
| you need a translation map to figure out what tools outside are
| similar to the ones from inside.
|
| Google's proprietary tech island has its legitimate reasons for
| existing, and when people say 'xyz' commercial/open-source
| thing is "better," they often mean it is better for their
| problem at hand.
|
| A decade-plus ago, many of the problems Google had to solve
| were problems that few other organizations had, such as
| large-scale data processing (to be made cost-efficient on
| commodity hardware), and it needed to create a number of
| tools/platforms as solutions, such as MapReduce and GFS.
|
| Many of these tools and platforms were discussed via papers,
| and inspired open-source work. In the Map Reduce case, it
| changed how Apache Hadoop itself took shape, and the lessons
| from all of these later led to things like Apache Spark.
|
| The idea of losing a battle can only be applied with the
| benefit of hindsight. Many of the Google examples given were
| created at a time when there were no peers, and when Google
| had no interest in selling these things as commercial
| products (i.e. GCP vs AWS vs Azure); it built them according
| to its unique internal needs, which few other organizations
| could relate to. I acknowledge that I am intentionally leaving
| out organizational politics and culture (e.g. PERF) as
| non-trivial contributors to this result.
| ilaksh wrote:
| They supposedly just improved Codey by "up to" 25%. I wonder how
| it compares to GPT-4 now.
| flakiness wrote:
| This is over-indexing on the event.
|
| XLA (backing JAX) has supported NVIDIA GPUs for a long time.
| (Otherwise TensorFlow would be TPU-only, which obviously isn't
| the case.)
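|
| For instance, any jitted JAX function is compiled through XLA
| for whichever backend is present (CPU, GPU, or TPU):
|
|     import jax
|     import jax.numpy as jnp
|
|     @jax.jit  # traced once, then compiled by XLA
|     def f(x):
|         return jnp.dot(x, x.T)
|
|     x = jnp.ones((512, 512))
|     print(f(x).shape)     # (512, 512)
|     print(jax.devices())  # e.g. [CudaDevice(id=0)] on a GPU box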
|
| I think the announcement is more ceremonial than technical, maybe
| expecting this kind of superficial reaction.
| shmerl wrote:
| Google didn't release a MapReduce implementation? It took off with
| Hadoop. If anything, it should have been open source from the
| start.
| FrustratedMonky wrote:
| Nice recap.
| JohnMakin wrote:
| > The final piece of Google's strategy today came in the form of
| a subtle, and very vague, announcement from Nvidia CEO Jensen
| Huang on stage in a brief appearance of only a few minutes. Huang
| announced that Google and Nvidia had been collaborating with a
| new language model development framework, PaxML, built on top of
| Google's cutting-edge machine learning framework JAX and its
| Accelerated Linear Algebra framework (or XLA).
|
| My only thought is, I wonder how well the nvidia/google
| partnership will do against Azure/Intel (I believe Azure invested
| heavily in FPGAs for their ML use cases).
| ipsum2 wrote:
| Azure doesn't use FPGA for ML, they use Nvidia like everyone
| else. Azure used FPGAs a few years ago for network switches
| though.
| JohnMakin wrote:
| Ah, I see, looks like that might have been a few years ago,
| when I last worked in that space.
| brucethemoose2 wrote:
| Does Microsoft use Gaudi 2 or Ponte Vecchio much? If Azure
| offers them, I never hear about projects using them out in the
| wild.
|
| And Microsoft is seemingly shooting for their own AI hardware:
| https://www.nextplatform.com/2023/07/12/microsofts-chiplet-c...
| choppaface wrote:
| Wow, XLA has historically been absolute crap outside of TPUs
| and even for TPUs the error messages are incredibly poor. If
| nvidia actually wants to support XLA now, perhaps that means the
| TPU 5 is the last TPU, and/or future TPUs might be targeted at
| just inference and efficiency (like TPU 5) and then nvidia owns
| the training game.
|
| After all, if you compare Nvidia's success with H100 sales
| versus GCloud TPU sales, it would be easy for Sundar to say "if
| you can't beat 'em, join 'em" and just maintain the TPU team for
| inference, which is more closely tied to Wall Street margins.
| rhelz wrote:
| Happens at every big company. The reason we don't hear about it
| more often isn't that big companies don't know this is a problem.
| It's just that the only solution they seem to be able to come up
| with is to enforce keeping unused breakthroughs secret.
|
| We can be happy that at Google, at least, these things can seep
| out and be of some benefit to the rest of us.
| Upvoter33 wrote:
| Google didn't miss on MapReduce; it missed on Cloud. Amazon was
| light years behind in datacenter technology, but made it all
| available via AWS, while Google kept everything to themselves. It
| was a colossal failure.
|
| LLMs are shaping up to be the second such failure.
| wrs wrote:
| In both cases Google regarded the technology as a competitive
| advantage in the business they were in (web search), so
| naturally wanted to keep it internal. Maybe almost as
| important, they were so far ahead on those technologies that
| making a viable product out of them would have been a huge
| effort with no benefit to search. Google tech has always been
| an "island". Even when they did release GCP, the offerings like
| AppEngine and transparent networking were incomprehensible to
| customers who just wanted to lift-and-shift their existing
| datacenter, not adopt Google practices.
|
| Amazon, on the other hand, has no qualms about converting their
| internal expertise into products ("turn every major cost into a
| source of revenue" [0]) and giving customers what they ask for.
|
| [0] https://twitter.com/BrianFeroldi/status/1284795114187919362
___________________________________________________________________
(page generated 2023-08-29 23:00 UTC)