[HN Gopher] MapReduce, TensorFlow, Vertex: Google's bet to avoid...
       ___________________________________________________________________
        
       MapReduce, TensorFlow, Vertex: Google's bet to avoid repeating
       history in AI
        
       Author : coloneltcb
       Score  : 62 points
       Date   : 2023-08-29 19:08 UTC (3 hours ago)
        
 (HTM) web link (www.supervised.news)
 (TXT) w3m dump (www.supervised.news)
        
       | jeffbee wrote:
       | The part about Vertex might be right but the establishing story
       | about mapreduce is totally wrong. By the time Hadoop took off,
       | mapreduce at Google already had one foot in the grave. If you are
       | using Hadoop today you have adopted a technology stack that
       | Google recognized as obsolete 15 years ago. It is difficult to
       | see how Google lost that battle. They effectively disabled an
       | entire industry by suggesting an obsolete stack, while
       | simultaneously moving on to cheaper, better sequels.
        
         | opportune wrote:
         | You're right and wrong. MapReduce is two things: a pattern for
         | massively parallel computation, and the name of the initial
         | implementation of the pattern at Google.
         | 
         | While the initial implementation at Google quickly got replaced
         | with better things, the MapReduce pattern is everywhere in the
         | data space, and almost taken for granted now. Hadoop is
          | basically the same: a shitty initial implementation of the
          | pattern (I think HDFS is still pretty good, just not the
          | compute part) that was quickly iterated on and improved.
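          | 
          | The pattern itself fits in a few lines. Here's a minimal
          | single-process sketch in Python (word count), just to show
          | the map/shuffle/reduce phases; obviously not Google's
          | implementation:
          | 
          |     from collections import defaultdict
          |     from itertools import chain
          | 
          |     def map_phase(doc):
          |         # Map: emit (key, value) pairs, one count per word.
          |         for word in doc.split():
          |             yield (word, 1)
          | 
          |     def shuffle(pairs):
          |         # Shuffle: group values by key, as the framework
          |         # would do across the network between map and reduce.
          |         groups = defaultdict(list)
          |         for key, value in pairs:
          |             groups[key].append(value)
          |         return groups
          | 
          |     def reduce_phase(key, values):
          |         # Reduce: fold the grouped values for one key.
          |         return key, sum(values)
          | 
          |     docs = ["the quick brown fox", "the lazy dog"]
          |     pairs = chain.from_iterable(map_phase(d) for d in docs)
          |     print(dict(reduce_phase(k, v)
          |                for k, v in shuffle(pairs).items()))
          |     # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, ...}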
         | 
          | Also, a big reason people stopped having to think about,
          | e.g., rack-local operations is that most people operating on
          | huge amounts of data now aren't doing it on traditional
          | generic servers. They're using something like S3 on VMs in
          | public cloud datacenters if they're doing something
          | relatively "low level", or more likely just using something
          | like Snowflake or Spark/Databricks (pretty close to OG
          | mapreduce...), etc.
        
         | dekhn wrote:
          | It wasn't obsolete 15 years ago. There were production
          | mapreduces driving double-digit improvements in key metrics
          | (watch time, apps purchased) much more recently than that. The
          | system I worked on, Sibyl, isn't well known outside of Google,
          | but it used MR as an engine to do extremely large-scale
          | machine learning. We added a number of features and used MR in
          | ways that would have been extremely challenging to reimplement
          | while maintaining performance targets.
         | 
         | I'm not even sure the mapreduce code has been deleted from
         | google3 yet.
         | 
          | To be fair, MR was definitely dated by the time I joined
          | (2007), and I'm surprised it lasted as long as it did. But it
          | was really well-tuned and reliable.
         | 
         | Also the MR paper was never intended to stake Google's position
         | as a data processing provider (that came far, far later). The
         | MR, Bigtable, and GFS papers were written to attract software
         | engineers to work on infra at google, to share some useful
         | ideas with the world (specifically, the distributed shuffle in
         | mapreduce, the bloom filter in bigtable, and the single-master-
         | index-in-ram of GFS), and finally, to "show off".
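          | 
          | As an aside, the Bigtable trick is easy to sketch: a Bloom
          | filter answers "is this key definitely absent?" so a tablet
          | server can skip a disk read. A toy Python version (my own
          | sketch, not Bigtable's code):
          | 
          |     import hashlib
          | 
          |     class BloomFilter:
          |         # k hash positions over an m-bit array. False means
          |         # definitely absent; True means maybe present.
          |         def __init__(self, m=1024, k=3):
          |             self.m, self.k = m, k
          |             self.bits = bytearray(m)
          | 
          |         def _positions(self, key):
          |             for i in range(self.k):
          |                 h = hashlib.sha256(
          |                     f"{i}:{key}".encode()).digest()
          |                 yield int.from_bytes(h[:8], "big") % self.m
          | 
          |         def add(self, key):
          |             for pos in self._positions(key):
          |                 self.bits[pos] = 1
          | 
          |         def might_contain(self, key):
          |             return all(self.bits[p]
          |                        for p in self._positions(key))
          | 
          |     bf = BloomFilter()
          |     bf.add("row:user123")
          |     assert bf.might_contain("row:user123")  # no false negatives
          |     print(bf.might_contain("row:nope"))     # almost always False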
        
           | ithkuil wrote:
            | I remember the joke CL that deleted mapreduce from google3;
            | it even got an approval from Urs, IIRC.
        
         | choppaface wrote:
         | > They effectively disabled an entire industry by suggesting an
         | obsolete stack
         | 
         | Interesting opinion but not supported at all by evidence. Most
         | non-Google datasets are small and stored on off-the-shelf
          | heterogeneous hardware, so HDFS/MapReduce for streaming OLAP
         | is a great fit. Cassandra (BigTable) and Parquet (Dremel) plus
         | Cloudera's Impala had much quicker time-to-market when large-
         | scale BI became more relevant.
         | 
         | "Obsolete" for Google problems sure, but Google problems
         | largely only happen at Google. Stuff like ad targeting and ML
         | look a lot different for products outside the Chocolate
         | Factory.
        
           | jeffbee wrote:
            | I can't imagine HDFS being a "good fit" for anything. Its
           | mere existence was undoubtedly a disabling force at two large
           | companies where I worked.
        
             | moandcompany wrote:
              | I recommend reading the GFS paper and considering that
              | there were/are use cases for horizontally scalable, fault-
              | tolerant object storage, with the nuance that some of your
              | storage/data nodes may also be compute nodes, and that
              | there can be a benefit or preference for assigning
              | applications to run where the underlying data is stored.
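              | 
              | A toy sketch of that placement preference (hypothetical
              | names, not the actual HDFS/YARN scheduler), in the spirit
              | of node-local > rack-local > off-rack:
              | 
              |     COST = {"node_local": 0, "rack_local": 1,
              |             "off_rack": 2}
              | 
              |     def locality(task_host, data_host, rack_of):
              |         if task_host == data_host:
              |             return "node_local"
              |         if rack_of[task_host] == rack_of[data_host]:
              |             return "rack_local"
              |         return "off_rack"
              | 
              |     def pick_replica(task_host, replicas, rack_of):
              |         # Read from the replica with the cheapest path.
              |         def cost(h):
              |             return COST[locality(task_host, h, rack_of)]
              |         return min(replicas, key=cost)
              | 
              |     rack_of = {"a1": "rackA", "a2": "rackA",
              |                "b1": "rackB"}
              |     print(pick_replica("a1", ["a2", "b1"], rack_of))
              |     # a2 (rack-local beats off-rack)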
             | 
              | In the HDFS case with Hadoop's ecosystem, consider Hive,
              | HBase, Drill, and even Spark when running on YARN.
             | 
             | In the peak days of Hadoop, many organizations were
             | primarily on-prem, and S3 or S3 compatible object stores
             | were mostly reserved for people using AWS.
        
             | opportune wrote:
             | The problem with using technology X "because Google does
             | it" or "this is the best open source version of what Google
             | uses, so let's use it because Google does it" is that
              | companies neglect that Google _does not just use the
              | technology out of the box_: they
              | 
              | 1. created the software for their own needs;
              | 
              | 2. maintain a developer team to improve it and address
              | requirements/pain points/integration;
              | 
              | 3. have an internal pool of experts in the form of the
              | developer team and "customers"/early adopters;
              | 
              | 4. most likely have other proprietary systems like Borg or
              | Colossus which integrate with the software very well,
              | which OSS like Hadoop may not (another example: OSS Bazel
              | vs Blaze+Forge+Piper+Monorepo structure).
             | 
             | Something like HDFS was hugely painful for many teams
             | because they had no idea how it worked or how to debug it,
             | had no idea how to fix it or extend it, and didn't have any
             | good tooling to understand why something was slow. All they
             | could do was try to configure it, integrate with it, and
              | find answers to their problems online. That's because HDFS
              | was "free", but a team capable of properly maintaining,
              | supporting, operating, and developing HDFS was extremely
              | expensive.
        
         | sberens wrote:
         | [Genuinely curious,] what have people moved on to?
        
           | choppaface wrote:
            | Dremel / Parquet, which facilitate much more efficient
            | joins and filters versus Sawzall on MapReduce.
            | 
            | For streaming there are Flume and Beam, or you can just load
            | important data into Spanner.
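            | 
            | The win is columnar: a reader touches only the columns it
            | needs and can skip row groups using their statistics. A
            | small sketch with pyarrow (the file name is hypothetical):
            | 
            |     import pyarrow.parquet as pq
            | 
            |     # Read two columns only, and let the reader use
            |     # row-group stats to skip data that can't match.
            |     table = pq.read_table(
            |         "events.parquet",  # hypothetical file
            |         columns=["user_id", "latency_ms"],
            |         filters=[("latency_ms", ">", 500)],
            |     )
            |     print(table.num_rows)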
        
           | moandcompany wrote:
            | Google itself moved on to "Flume" and later created
            | "Dataflow", the precursor to Apache Beam. While
            | Dataflow/Beam aren't execution engines themselves, they
            | separate the language for expressing a data computation
            | from the engine that executes it. At Google, for example, a
            | data processing job might be expressed using Beam and run
            | on top of Flume.
           | 
           | Outside of Google, most organizations with large distributed
           | data processing problems moved on to Hadoop2
           | (YARN/MapReduce2) and later in present day to Apache Spark.
            | When organizations say they are using "Databricks" they are
            | using Apache Spark provided as a service by Databricks, a
            | company started by the creators of Apache Spark.
           | 
           | Apache Beam is also used outside of Google on top of other
           | data processing "engines" or runners for these jobs, such as
           | Google's Cloud Dataflow service, Apache Flink, Apache Spark,
           | etc.
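            | 
            | A minimal Beam word count in Python; the same pipeline runs
            | locally on the DirectRunner or on Dataflow/Flink/Spark by
            | changing runner options, not the code (a sketch, assuming
            | the apache-beam package is installed):
            | 
            |     import apache_beam as beam
            | 
            |     # The pipeline is expressed once; the runner is a
            |     # deployment choice, not a code change.
            |     with beam.Pipeline() as p:  # DirectRunner by default
            |         (p
            |          | beam.Create(["the quick fox", "the dog"])
            |          | beam.FlatMap(str.split)
            |          | beam.Map(lambda w: (w, 1))
            |          | beam.CombinePerKey(sum)
            |          | beam.Map(print))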
        
             | gravypod wrote:
             | Opinions are my own.
             | 
             | Some info on flume: https://research.google/pubs/pub35650/
             | 
             | To quote from there: "MapReduce and similar systems
             | significantly ease the task of writing data-parallel code.
             | However, many real-world computations require a pipeline of
             | MapReduces, and programming and managing such pipelines can
             | be difficult."
        
             | summerlight wrote:
             | In addition to Flume/Dataflow, there's a significant push
             | toward SQL engines. In general, SQL (or similar query
             | engines written in more declarative languages/APIs) has
              | some performance benefits over typical Flume code thanks to
             | vectorized execution and other optimizations.
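              | 
              | The principle is easy to show with NumPy (not what these
              | engines actually run, but the same idea: one operation
              | over a whole column instead of a per-row loop):
              | 
              |     import numpy as np
              | 
              |     prices = np.random.rand(1_000_000)
              | 
              |     # Row-at-a-time, interpreter-bound:
              |     slow = sum(p for p in prices if p > 0.5)
              | 
              |     # Column-at-a-time, vectorized: one comparison
              |     # and one sum over contiguous memory.
              |     fast = prices[prices > 0.5].sum()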
        
             | chaxor wrote:
             | So if I read this right, if you're not a big company
             | (perhaps just a standard dev with maybe a tiny cluster of
             | computers or just one beefy one), you just make a Docker
             | container with pyspark and put your scripts in there, and
             | everyone can reproduce your work easily on any type of
             | machine or cluster? It seems like a reasonable approach,
              | though it would be nice not to need the OS
              | dependencies/Docker for Spark.
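              | 
              | Something like this is all the script needs; a minimal
              | local-mode PySpark sketch (paths are hypothetical), where
              | the container just needs a JVM plus pip-installed
              | pyspark:
              | 
              |     from pyspark.sql import SparkSession
              | 
              |     # local[*] uses all local cores; pointing the same
              |     # script at a real cluster is a config change.
              |     spark = (SparkSession.builder
              |              .master("local[*]")
              |              .appName("demo")
              |              .getOrCreate())
              | 
              |     df = spark.read.parquet("data.parquet")
              |     df.groupBy("user_id").count().show()
              |     spark.stop()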
        
         | moandcompany wrote:
         | Regarding "obsolete" and missed opportunities: this idea of
         | missed opportunities for Google has been touched upon several
         | times in the last decade, including the launch of Google Cloud
         | Platform itself.
         | 
         | Urs Holzle, the former head of Google TI (Technical
         | Infrastructure), discussed in public some of the challenges and
         | reasons for creating Google Cloud Platform as a platform, and
         | backing projects like Kubernetes.
         | 
         | Over time, Google has become a proprietary tech "island" in
          | several ways, and is arguably more fragmented than other large
         | tech companies, such as Microsoft and Amazon, which happen to
         | both have commercial cloud offerings, and Meta/Facebook. While
         | all of these companies certainly have challenges with not-
         | invented-here ("NIH") syndrome, and lots of internal,
         | proprietary tools, as a software engineer at one of these
         | three, odds are you will use and touch more commercial and
         | open-source technologies than you would at Google. Google
         | itself still struggles with having teams and projects use GCP
         | for internal work versus Borg/etc; and there are plenty of
         | valid reasons why Google teams don't use GCP.
         | 
          | The proprietary tech "island" issue is a non-trivial concern
          | when you need to hire new software engineers from
          | industry/outside: ramp-up time with some of these systems can
          | be six months or even greater. Today Alphabet/Google is at
          | around 200k+ FTE, and you aren't going to be able to find many
          | engineers outside who have experience with
          | Borg/Flume/Spanner/Monarch/etc.
         | experienced Google software engineer looking to work elsewhere,
         | you need a translation map to figure out what tools outside are
         | similar to the ones from inside.
         | 
         | Google's proprietary tech island has its legitimate reasons for
         | existing, and when people say 'xyz' commercial/open-source
         | thing is "better," they often mean it is better for their
         | problem at hand.
         | 
          | A decade-plus ago, many of the problems Google had to solve
          | were problems that few other organizations had, such as
          | large-scale data processing (made cost-efficient on commodity
          | hardware), and it needed to create a number of tools/platforms
          | as solutions, such as MapReduce and GFS.
         | 
         | Many of these tools and platforms were discussed via papers,
          | and inspired open-source work. In the MapReduce case, it
         | changed how Apache Hadoop itself took shape, and the lessons
         | from all of these later led to things like Apache Spark.
         | 
          | The idea of losing a battle can only be applied with the
          | benefit of hindsight, and many of the Google examples given
          | were created at a time when there were no peers, and when
          | Google was not interested in selling these things as
          | commercial products (i.e. GCP vs AWS vs Azure); it built them
          | according to its unique internal needs, which few other
          | organizations could relate to. I acknowledge that I am
          | intentionally leaving out organizational politics and culture
          | (e.g. PERF) as non-trivial contributors to this result.
        
       | ilaksh wrote:
       | They supposedly just improved Cody by "up to" 25%. I wonder how
       | it compares to GPT-4 now.
        
       | flakiness wrote:
        | This is over-indexing on the event.
        | 
        | XLA (which backs JAX) has supported NVIDIA GPUs for a long
        | time. (Otherwise TensorFlow would have been TPU-only, which is
        | clearly not the case.)
       | 
       | I think the announcement is more ceremonial than technical, maybe
       | expecting this kind of superficial reaction.
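        | 
        | E.g., the same jitted function compiles via XLA for whatever
        | backend is present (a minimal sketch):
        | 
        |     import jax
        |     import jax.numpy as jnp
        | 
        |     @jax.jit  # traced once, compiled by XLA for the backend
        |     def predict(w, x):
        |         return jnp.tanh(x @ w)
        | 
        |     print(jax.devices())  # CPU, GPU, or TPU devices
        |     w = jnp.ones((8, 2))
        |     x = jnp.ones((4, 8))
        |     print(predict(w, x).shape)  # (4, 2)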
        
       | shmerl wrote:
        | Google didn't release a MapReduce implementation? The idea took
        | off with Hadoop. If anything, it should have been open source
        | from the start.
        
       | FrustratedMonky wrote:
       | Nice recap.
        
       | JohnMakin wrote:
       | > The final piece of Google's strategy today came in the form of
       | a subtle, and very vague, announcement from Nvidia CEO Jensen
       | Huang on stage in a brief appearance of only a few minutes. Huang
       | announced that Google and Nvidia had been collaborating with a
       | new language model development framework, PaxML, built on top of
       | Google's cutting-edge machine learning framework JAX and its
       | Accelerated Linear Algebra framework (or XLA).
       | 
        | My only thought is, I wonder how well the Nvidia/Google
        | partnership will do against Azure/Intel (I believe Azure
        | invested heavily in FPGAs for their ML use cases).
        
         | ipsum2 wrote:
         | Azure doesn't use FPGA for ML, they use Nvidia like everyone
         | else. Azure used FPGAs a few years ago for network switches
         | though.
        
           | JohnMakin wrote:
           | Ah, I see, looks like that might have been a few years ago,
           | when I last worked in that space.
        
         | brucethemoose2 wrote:
          | Does Microsoft use Gaudi 2 or Ponte Vecchio much? If Azure
         | offers them, I never hear about projects using them out in the
         | wild.
         | 
         | And Microsoft is seemingly shooting for their own AI hardware:
         | https://www.nextplatform.com/2023/07/12/microsofts-chiplet-c...
        
         | choppaface wrote:
          | Wow, XLA has historically been absolute crap outside of TPUs,
          | and even for TPUs the error messages are incredibly poor. If
          | Nvidia actually wants to support XLA now, perhaps that means
          | the TPU 5 is the last TPU, and/or future TPUs might be
          | targeted at just inference and efficiency (like TPU 5), and
          | then Nvidia owns the training game.
         | 
          | After all, if you compare Nvidia's success with H100 sales
          | versus GCloud TPU sales, it would be easy for Sundar to say
          | "if you can't beat 'em, join 'em" and just maintain the TPU
          | team for inference, which is more closely tied to Wall Street
          | margins.
        
       | rhelz wrote:
       | Happens at every big company. The reason we don't hear about it
       | more often isn't that big companies don't know this is a problem.
        | It's just that the only solution they seem to be able to come up
       | with is to enforce keeping unused breakthroughs secret.
       | 
       | We can be happy that at Google, at least, these things can seep
       | out and be of some benefit to the rest of us.
        
       | Upvoter33 wrote:
       | Google didn't miss on MapReduce; it missed on Cloud. Amazon was
       | light years behind in datacenter technology, but made it all
        | available via AWS, while Google kept everything to itself. It
       | was a colossal failure.
       | 
       | LLMs are shaping up to be the second such failure.
        
         | wrs wrote:
         | In both cases Google regarded the technology as a competitive
         | advantage in the business they were in (web search), so
         | naturally wanted to keep it internal. Maybe almost as
         | important, they were so far ahead on those technologies that
         | making a viable product out of them would have been a huge
         | effort with no benefit to search. Google tech has always been
         | an "island". Even when they did release GCP, the offerings like
         | AppEngine and transparent networking were incomprehensible to
         | customers who just wanted to lift-and-shift their existing
         | datacenter, not adopt Google practices.
         | 
         | Amazon, on the other hand, has no qualms about converting their
         | internal expertise into products ("turn every major cost into a
         | source of revenue" [0]) and giving customers what they ask for.
         | 
         | [0] https://twitter.com/BrianFeroldi/status/1284795114187919362
        
       ___________________________________________________________________
       (page generated 2023-08-29 23:00 UTC)