[HN Gopher] Why Databricks Is Winning
       ___________________________________________________________________
        
       Why Databricks Is Winning
        
       Author : benjaminwootton
       Score  : 46 points
       Date   : 2021-02-14 19:08 UTC (3 hours ago)
        
 (HTM) web link (cloudnativeenterprise.substack.com)
 (TXT) w3m dump (cloudnativeenterprise.substack.com)
        
       | mobileexpert wrote:
        | Databricks seems to be on a convergent evolution towards
        | Snowflake; between the two of them, I'd rather be starting from
        | the position Snowflake is in than from Databricks'.
        
         | benjaminwootton wrote:
          | They are both going towards the "Data Lakehouse" endpoint,
         | which is driving some of the convergence. Silly term, but
         | basically providing analytics and a database type experience
         | over a data lake.
         | 
         | That said, Databricks is a much broader platform, with all of
          | its collaboration environments and is generally much more
         | programmable than Snowflake.
        
           | nchammas wrote:
           | > providing analytics and a database type experience over a
           | data lake.
           | 
           | It's interesting how the modern data lake is developing in
           | this way, recreating many patterns from the traditional
           | database for distributed systems and massive scale: SQL and
           | query optimization, transactions and time travel, schema
           | evolution and data constraints...
           | 
           | Having started out as a database developer / DBA many years
           | ago, working with data lakes today reminds me in many ways of
           | that early part of my career.
           | 
           | I wrote a post tracing a common interface from the typical
           | relational database to the modern data lake.
           | 
           | https://nchammas.com/writing/modern-data-lake-database
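            | 
            | To make the "transactions and time travel" bit concrete,
            | here's a rough PySpark sketch (assuming a session with Delta
            | Lake configured; the path is hypothetical):
            | 
            |     from pyspark.sql import SparkSession
            | 
            |     # Assumes the Delta Lake package is on the classpath.
            |     spark = SparkSession.builder.getOrCreate()
            | 
            |     # Write version 0 of a table, then overwrite it
            |     # (creating version 1).
            |     spark.range(10).write.format("delta").save("/tmp/events")
            |     spark.range(20).write.format("delta").mode(
            |         "overwrite").save("/tmp/events")
            | 
            |     # Time travel: read the table as of version 0.
            |     v0 = (spark.read.format("delta")
            |           .option("versionAsOf", 0).load("/tmp/events"))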
        
       | 00117 wrote:
       | Why does nobody discuss Apache Pulsar as a viable alternative?
        
         | peterthehacker wrote:
          | Apache Pulsar is a streaming platform more comparable to Kafka.
          | It doesn't have built-in parallel computation APIs like Spark.
          | You can hook Spark Streaming up with Pulsar as a data source,
          | though.
         | 
         | https://pulsar.apache.org/docs/en/adaptors-spark/
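          | 
          | With the separate pulsar-spark structured streaming connector
          | it looks roughly like this (a sketch; the option names and
          | endpoints are from memory and may differ by version):
          | 
          |     from pyspark.sql import SparkSession
          | 
          |     spark = SparkSession.builder.getOrCreate()
          | 
          |     # Read a Pulsar topic as a structured stream.
          |     stream = (spark.readStream.format("pulsar")
          |               .option("service.url", "pulsar://localhost:6650")
          |               .option("admin.url", "http://localhost:8080")
          |               .option("topic", "my-topic")
          |               .load())
          | 
          |     # Dump the stream to the console for a quick look.
          |     stream.writeStream.format("console").start()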
        
       | fs111 wrote:
       | DBFS is a complete joke though. A filesystem that has no
        | timestamps? Really?
        
         | cfeduke wrote:
         | It's an abstraction over other file systems, like S3 or Azure
         | BLOB storage. More a convenience than anything else, and helps
         | ease porting code between cloud providers.
         | 
          | Not a traditional file system, or even an HDFS clone.
        
       | flowerlad wrote:
       | "Fully managed Spark" sounded awesome a few years ago. But now,
       | Spark can run on Kubernetes clusters [1]. If your infrastructure
       | is already running on Kubernetes then you already have a cluster
       | capable of running Spark. And because of the magic of Kubernetes
       | you don't even have to dedicate nodes to Spark.
       | 
       | [1] https://spark.apache.org/docs/latest/running-on-
       | kubernetes.h...
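        | 
        | A minimal PySpark sketch of pointing a job at a Kubernetes
        | cluster (the hostname, image and sizing are hypothetical):
        | 
        |     from pyspark.sql import SparkSession
        | 
        |     # Executors are scheduled as pods on the existing cluster,
        |     # so no nodes need to be dedicated to Spark.
        |     spark = (SparkSession.builder
        |              .master("k8s://https://k8s.example.com:6443")
        |              .config("spark.kubernetes.container.image",
        |                      "registry.example.com/spark:3.0.1")
        |              .config("spark.executor.instances", "4")
        |              .getOrCreate())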
        
         | bpodgursky wrote:
         | 100%.
         | 
          | The Databricks notebooks have a lot of value (for now) but ever
         | since running Spark on Kube... I have had literally 0 cluster
         | issues. It's absolutely shocking, coming from a YARN-based
         | environment, where I was constantly plagued with
         | ResourceManager issues, autoscaling issues, preemption issues,
         | network disconnect issues...
        
       | fmajid wrote:
       | I'd say the difference in valuation between Snowflake and
       | Databricks clearly shows the latter is _not_ "winning".
        
         | MrPowers wrote:
         | Snowflake and Databricks are different, sometimes complementary
         | technologies. You can store data in Snowflake & query it with
         | Databricks for example: https://github.com/snowflakedb/spark-
         | snowflake
         | 
         | Snowflake predicate pushdown filtering seems quite promising:
         | https://www.snowflake.com/blog/snowflake-spark-part-2-pushin...
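          | 
          | Roughly what that looks like from PySpark (a sketch; the
          | credentials and table names are placeholders, and the session
          | needs the spark-snowflake connector on the classpath):
          | 
          |     # Assumes an existing SparkSession named `spark`.
          |     opts = {
          |         "sfURL": "myaccount.snowflakecomputing.com",
          |         "sfUser": "user",
          |         "sfPassword": "password",
          |         "sfDatabase": "db",
          |         "sfSchema": "public",
          |         "sfWarehouse": "wh",
          |     }
          | 
          |     df = (spark.read.format("net.snowflake.spark.snowflake")
          |           .options(**opts)
          |           .option("dbtable", "events")
          |           .load()
          |           # A filter like this is a candidate for pushdown
          |           # into Snowflake rather than running in Spark.
          |           .where("EVENT_DATE >= '2021-01-01'"))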
         | 
         | Think both these companies can win.
        
           | jayparth wrote:
            | There's definitely a place in the industry for a No. 1 and a
            | No. 2, but no. I don't think they're complementary, I think
           | they're competitive. Both are trying to own the whole storage
           | + compute layer eventually. I guess you can see Snowflake as
           | a storage layer but they have attached scalable compute to
           | it. Databricks started out as scalable compute and is now
           | becoming more focused on storage.
           | 
           | The idea of storing your data in Snowflake and querying it in
           | Databricks is pretty silly. Why would you want to do that?
           | Why not just use Snowflake's compute? Sure you could argue
           | Spark has some transformations that are hard to express in
           | SQL, but that is why Snowflake introduced Snowpark.
        
       | lykr0n wrote:
        | The one thing I see in my current company, and a growing trend
        | with SaaS apps, is that companies are forgetting how to actually
        | engineer. Like Boeing: the more you outsource, the less able you
        | are to react to changing market forces and fix issues.
       | 
       | We run Hadoop & Spark internally, but the team is underfunded and
       | stuck in a constant cycle of fighting fires. And the result (and
        | part of a larger push of the company due to the same cycle of
        | under-funding and culture issues) is that we're moving our
        | petabytes of data into cloud providers' systems. Not only does
        | the cost of doing this dwarf what it would take to actually fix
        | our issues, but we're going to lose the people who know how to
        | design and manage petabyte-scale Hadoop clusters.
       | 
        | We wind up in a situation where we've locked up data fundamental
        | to our company and our position in the market with a 3rd party,
        | while losing the talent that would allow us to maintain full
        | control over the data. If the service increases prices, changes
        | its offering, or we get to a point where the offering doesn't
        | meet our needs, we're fucked.
       | 
       | It's nice that Databricks has a nice "offramp" that you can take
       | to go somewhere else, but the general idea is the same.
        
         | lumost wrote:
         | It's incredibly difficult to fund internal platform teams
          | appropriately. Usually one of three failure patterns emerges:
         | 
          | 1) The team is competent but picks up migration work to
          | arbitrary technologies and approaches with no clear ROI. These
          | migrations block feature development and never seem to end,
          | e.g. teams ceaselessly migrating from GCP to AWS, to
          | Kubernetes, to Podman, from MySQL to PostgreSQL, etc.
         | 
         | 2) The team is operationally heavy and generates arbitrary
         | requirements for everyone else to follow. The toolchain seems
         | to get worse over time, and the number of hoops to jump through
          | to get anything done endlessly grows, e.g. wait 2 weeks and get
         | three business approvals for a server which you aren't allowed
         | to have root access to.
         | 
          | 3) The team has big ideas, but the business constantly under-
          | invests. The team is called to fight every fire but is unable
          | to stop the fires through any meaningful project. The company
          | ends up on a platform that's constantly on fire.
         | 
          | When weighing these execution risks, building an internal
          | platform for just about anything looks incredibly expensive.
          | I've only been at 1 company out of 6 which nailed the internal
          | platform tooling requirements. The only thing I can attribute
          | their success to was quarterly NPS surveys on the developer
          | experience for every major piece of the company's toolchain,
          | plus hard-to-meet SLAs for uptime.
        
         | hodgesrm wrote:
         | > we're going to lose the people who know how to design and
         | manage petabyte scale hadoop clusters.
         | 
         | Why is that different from "lose the people who know how to
         | design and write accounting systems from scratch?" That was
         | what happened when packaged accounting systems showed up. I'm
         | not sure why you would want to preserve knowledge of Hadoop if
         | other technologies are more efficient.
        
         | david38 wrote:
          | Using third-party tools doesn't lock you out of them. Nothing
          | stops you from collecting that data before sending it off.
         | 
         | Every company outsources something fundamental. Does Google
         | mine its own metal? Generate its own electricity?
         | 
         | Even if you did it on prem, that doesn't save you from license
         | renewal costs or upgrades. You can write the software yourself,
         | but that's not cheap either.
        
           | paulryanrogers wrote:
            | Are metal and electricity Google's core competencies, though?
            | 
            | I don't think anyone is arguing one should maintain their own
            | silica, atoms, independent universe, etc.
        
         | foobiekr wrote:
          | As you say, this de-skilling problem is broader than SaaS; it
          | is tied to outsourcing of critical competencies or, in younger
          | companies, never having them in the first place.
         | 
         | If you want to see de-skilling in action, hard core, go look
         | into the service providers, wireless and fixed. They are
         | running on fumes and attrition-victorious teams that last had a
         | new technology in the late 70s/early 80s because everyone good
         | at networking went to the FANG predecessors and then FANG
         | proper.
        
         | random314 wrote:
         | This happens with every technology stack from steam engines to
         | software. There was a time when programmers could solder
         | together an ALU using transistor gates.
        
       | MrPowers wrote:
       | Some additional background:
       | 
       | * AWS has a managed Spark offering called EMR
       | 
       | * EMR pricing (https://aws.amazon.com/emr/pricing/) is lower than
       | Databricks pricing (https://databricks.com/product/aws-pricing)
       | 
        | * The Databricks notebook development experience is better than
        | EMR's (but still really basic compared to IntelliJ / PyCharm
        | text editing)
       | 
       | * Both Databricks & EMR have proprietary Spark runtimes
       | 
       | * Databricks is building a Spark runtime in C++ that might be
       | faster (Delta Engine)
       | 
       | * Spark lets you process massive datasets easily, with small
       | teams. 2-3 person teams can build data ingestion pipelines to
       | clean & process terabytes of data a day. It's an incredible
       | technology.
       | 
        | * The difference between the PySpark & Scala APIs confuses the
        | hell out of people (see the sketch at the end of this comment)
       | 
        | * Whether or not people can run Python machine learning models
        | on Spark clusters also confuses people
       | 
       | * Overreliance on notebooks causes big issues (no version
       | control, tests, deployment process, dependency management)
       | 
        | The big data ecosystem is constantly evolving and you need to
        | keep studying to keep up.
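        | 
        | For the PySpark vs Scala point above, a small sketch of the same
        | DataFrame transformation in both (Scala in comments; column
        | names are made up):
        | 
        |     from pyspark.sql import SparkSession, functions as F
        | 
        |     spark = SparkSession.builder.getOrCreate()
        |     df = spark.createDataFrame([(1, "a"), (2, "b")],
        |                                ["id", "label"])
        | 
        |     # PySpark builds column expressions with F.col():
        |     out = df.withColumn("id_doubled", F.col("id") * 2)
        | 
        |     # The Scala API is nearly identical, just with different
        |     # column syntax:
        |     #   df.withColumn("id_doubled", $"id" * 2)
        |     out.show()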
        
         | nknealk wrote:
         | > The difference between the PySpark & Scala APIs confuse the
         | hell out of people
         | 
          | I personally had this experience when first wanting to learn
          | Spark, and it really turned me off to the whole Spark
         | ecosystem. Curious if you have any suggested resources that do
         | a good job on this?
        
           | MrPowers wrote:
           | Spark offers Scala, Python, Java, and R APIs. Scala & Python
           | are the most viable options (R lacks a lot of features and
            | Java is only good for people who love Java).
           | 
           | Scala & PySpark are both great options. Lots of devs are
           | terrified of Scala, so PySpark is more popular now. I'd say
           | Scala has a slight technical advantage, see here for more
           | details: https://mungingdata.com/apache-spark/python-pyspark-
           | scala-wh.... Both are great overall.
           | 
           | I wrote a book that's a practical introduction to Spark:
           | https://leanpub.com/beautiful-spark/
           | 
           | Most of the training materials are theoretical, which makes
           | Spark seem really intimidating. You can learn some basic
           | Spark principles and get up-and-running with production
           | workflows quickly.
        
         | dragonwriter wrote:
         | > Overreliance on notebooks causes big issues (no version
         | control, tests, deployment process, dependency management)
         | 
          | The last three of those things might be valid issues, but since
          | when are notebooks not just as subject as any other source code
          | format to version control? (I get that the difference between
          | the UI and the on-disk structure may make typical diff tools
          | less than ideal, but VC itself is unaffected.)
        
           | HuwFulcher wrote:
           | In my experience the use of notebooks (exclusively) goes hand
           | in hand with not knowing things such as version control,
           | tests, deployment process or dependency management exist.
           | 
            | I don't mean to sound harsh about other Data Scientists from
            | a non-software-engineering background, but the standard
            | workflow is to fiddle around with a notebook until you can
            | get a result. That's as far as it goes; no real robustness
            | to it.
           | 
            | That's a pretty big generalisation, but in organisations
            | that "home grow" their Data Science capability, the online
            | courses people learn from often don't cover production-level
            | Data Science.
        
             | MrPowers wrote:
             | Your experience aligns with what I've seen.
             | 
             | All the notebooks are in one place. Some are for important
              | production jobs, others are for data exploration.
             | 
             | It's easy to make a little edit in a notebook and
             | accidentally break production jobs.
             | 
             | Comparatively harder to make an edit in a git repo and do a
             | deploy that'll break production jobs (e.g. if the JAR
              | doesn't compile or the CI errors out because the tests
              | don't pass).
             | 
              | Notebook-based production jobs get even more dangerous when
             | NotebookA depends on NotebookB and so on.
        
               | alexott wrote:
                | Treat notebooks like other code: separate them into
                | staging and production, with defined promotions between
                | them - it's possible. You can run tests in CI/CD
                | pipelines, etc. You can set permissions so nobody can
                | update production notebooks manually, ...
        
         | alexott wrote:
          | The last point isn't so dramatic. There is version control for
          | notebooks, and a better version is coming (right now it's in
          | preview, code-named Projects; it's in the official docs). Tests
          | are possible - either Nutter from Microsoft, or home-grown
          | (about 20-30 lines on top of the built-in unittest). Deployment
          | is also not so complicated - there are tools (a provider for
          | Terraform, the cicd-templates project, databricks-cli, etc.),
          | and you can refer to MS Learn for a short course about CI/CD
          | for Databricks and Azure DevOps, etc.
        
           | MrPowers wrote:
           | The last point was for teams that only rely on notebooks,
           | sorry if I didn't make that clear.
           | 
           | You're right that all those issues can be sidestepped if you
           | build projects in version controlled Git repos, test the
           | code, and deploy JAR / Wheel files.
           | 
           | Speaking of testing, can you let me know if this PySpark
           | testing fix worked for you ;)
           | https://github.com/MrPowers/chispa/issues/6
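            | 
            | For context, a chispa test looks roughly like this (a
            | sketch, assuming a pytest fixture that provides a
            | SparkSession; names are made up):
            | 
            |     from chispa import assert_df_equality
            | 
            |     def test_doubles_id(spark):
            |         source = spark.createDataFrame([(1,), (2,)], ["id"])
            |         expected = spark.createDataFrame(
            |             [(1, 2), (2, 4)], ["id", "id_doubled"])
            |         actual = source.withColumn(
            |             "id_doubled", source.id * 2)
            |         # Fails with a readable diff when rows differ.
            |         assert_df_equality(actual, expected)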
        
             | Someone wrote:
             | As _alexott_ says, you _can_ link notebooks to a git
              | repository and commit and roll them back from the Databricks
             | UI (https://docs.databricks.com/notebooks/github-version-
             | control...)
             | 
             | Problem is that it's limited. You can't, for example,
             | commit multiple files in one go (that improves a little bit
             | in https://docs.databricks.com/projects.html), or merge
             | changes a colleague made with your changes. You also have
             | to use the UI Databricks provides. You can't use a git CLI
             | or whatever GUI you prefer. (all AFAIK, but I'm fairly
             | certain about it)
        
             | alexott wrote:
              | I'm sorry for the delay, will fix ASAP...
             | 
             | My point is that you can do that even without jars/wheels -
             | you can do VC and tests of notebooks. For example,
             | https://github.com/alexott/databricks-nutter-projects-demo
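              | 
              | From memory, a Nutter test is a fixture class with run_ /
              | assertion_ method pairs, something like this (a sketch;
              | the names and notebook path are hypothetical):
              | 
              |     from runtime.nutterfixture import NutterFixture
              | 
              |     class ETLFixture(NutterFixture):
              |         def run_etl(self):
              |             # Run the notebook under test. dbutils and
              |             # spark are Databricks notebook globals.
              |             dbutils.notebook.run("./etl_notebook", 600)
              | 
              |         def assertion_etl(self):
              |             assert spark.table("output").count() > 0
              | 
              |     result = ETLFixture().execute_tests()
              |     print(result.to_string())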
        
         | nchammas wrote:
         | > * AWS has a managed Spark offering called EMR
         | 
         | There is also my rinky-dink open source project, Flintrock [0],
         | that will launch open source Spark clusters on AWS for you.
         | 
         | It's probably not the right tool for production use (and you
         | would be right to wonder why Flintrock exists when we have EMR
         | [1]), but I know of several companies that have used Flintrock
          | at one point or another in production at large scale (like, 400+
         | node clusters).
         | 
         | [0]: https://github.com/nchammas/flintrock
         | 
         | [1]: https://github.com/nchammas/flintrock#why-build-flintrock-
         | wh...
        
         | throwaway556179 wrote:
         | We recently ran a large clustering job over billions of records
         | (and a few TBs of data on spinning disks) on a single machine
         | with minimal command line tooling in a few hours. Not really
         | optimized yet. People forget how fast modern hardware is and
         | overestimate how much useful data they have (or need).
         | 
         | I think I should start a company around minimalistic data
         | tooling or the like - the amount of waste seems large across
         | the industry.
         | 
          | I saw the de-skilling a few years ago, when a guy stitched
          | together a complete application from a couple of SaaS APIs.
          | Cool, but it somehow does not impress me.
        
           | paulryanrogers wrote:
           | Bare metal can be incredibly fast, if you can get access for
           | a reasonable price. Virtualization is becoming a continuum
           | but the overhead is always there.
        
           | MrPowers wrote:
           | Great point, r5.metal instances have 96 CPUs and 768 GB of
           | RAM. Lots of "big data problems" can actually be solved with
           | a single big EC2 instance. Cluster computing should always be
           | avoided when a single node will do.
        
             | throwaway556179 wrote:
             | We had something like 24G of RAM, but we have 500GB RAM
             | machines as well (we own the hardware) and the job would
             | have been even more of a breeze there.
        
       | spicyramen wrote:
        | For reasons I don't understand, our company ended up choosing
        | Google Cloud, and it's a complete mess. You have BigQuery,
        | Dataproc, AI Platform Training, Colab, AI Platform Notebooks, and
        | many tools that do the same thing but are not well integrated. My
        | personal favourite is AI Platform Notebooks, but I need
        | additional plugins to interact with BigQuery, S3 and GCS. It
        | requires a lot of customization. We used Azure and Databricks
        | before, where we had a one-stop shop. I heard Google is
        | integrating some of their products, but that will take a while.
        | In the meantime we lose hours of productivity figuring out how to
        | use their products (whose stability is horrible; see for example
        | AI Platform Jobs and the permission model).
        
       | simo7 wrote:
       | A (big?) part of Databricks' success is the complicated mess that
       | Spark is to run.
        
         | HuwFulcher wrote:
          | Considering the makers of Databricks are also the maintainers
          | of Spark, it makes you wonder whether this is deliberate.
        
       | peterthehacker wrote:
        | I've had a lot of success with Dask lately. It's comparable to
        | Spark in some ways [0]. Being written in Python and built on top
        | of pandas/NumPy, it allows much more flexibility. It's also
        | easier to get adoption from data scientists, who are usually more
        | comfortable with Python than Scala. It also has great tools built
        | on top of Kubernetes, making deployment quick and easy [1].
       | 
       | [0]https://docs.dask.org/en/latest/spark.html
       | 
       | [1]https://github.com/dask/dask-gateway
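        | 
        | For a flavour of the API, a minimal sketch (the paths and
        | columns are made up):
        | 
        |     import dask.dataframe as dd
        | 
        |     # Lazily read a directory of CSVs; nothing runs until
        |     # .compute(), which splits the work across the scheduler's
        |     # workers (local threads or a cluster).
        |     df = dd.read_csv("data/*.csv")
        |     result = df.groupby("user_id")["amount"].mean().compute()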
        
         | pletnes wrote:
          | Just for clarification, Dask lets you run any Python function
          | on any Python datatype. NumPy/pandas is faster, but anything is
          | doable with Dask. This makes it eminently flexible, unlike many
          | "big data" tools that only work on tables or have other
          | arbitrary limitations.
        
           | peterthehacker wrote:
            | Yes, you can use Dask's lower-level APIs, like futures [0] or
            | dask.delayed, and it'll just pickle the Python objects. Dask
            | provides a group of collections APIs, like dataframes [1] and
            | arrays, that use more efficient serialization methods.
           | 
           | [0]https://docs.dask.org/en/latest/futures.html
           | 
           | [1]https://docs.dask.org/en/latest/dataframe.html
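            | 
            | e.g. with dask.delayed you can turn plain Python functions
            | into a lazy task graph (a toy example):
            | 
            |     from dask import delayed
            | 
            |     @delayed
            |     def inc(x):
            |         return x + 1
            | 
            |     @delayed
            |     def total(xs):
            |         return sum(xs)
            | 
            |     # Builds a graph of 11 tasks, then executes it.
            |     result = total([inc(i) for i in range(10)]).compute()
            |     print(result)  # 55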
        
         | MrPowers wrote:
         | My feeling is that Spark is better for huge ETL jobs and Dask
         | is better for certain types of model building (because of easy
          | access to Python libraries), but I don't have any benchmarks to
          | back this up. Would love some benchmark results comparing
          | processing tens of terabytes on equal-sized i3.xlarge EC2
          | clusters to get a better idea.
         | 
         | Have you seen any good Spark vs. Dask benchmarks?
        
           | peterthehacker wrote:
            | I would be interested in seeing some Spark v Dask benchmarks
           | too. Haven't seen any yet though.
           | 
           | In my experience, Dask really shines when you implement
            | custom NumPy computations that could only be done in Spark
            | UDFs. We saw a decent performance difference there, but for
            | common built-in computations I'd imagine that Spark has
           | better performance.
           | 
           | Edit: after some googling I found this paper with benchmarks.
           | 
           | https://arxiv.org/pdf/1907.13030.pdf
           | 
           | > Results show that despite slight differences between Spark
           | and Dask, both engines perform comparably. However, Dask
           | pipelines risk being limited by Python's GIL depending on
           | task type and cluster configuration. In all cases, the major
           | limiting factor was data transfer.
        
             | kornish wrote:
             | It's also worth noting that with Spark, you can perform
             | arbitrary computation using the Dataset API and operating
             | on case classes.
        
       | fractionalhare wrote:
       | Interestingly, my team is actually moving off Databricks. I have
        | anecdotally heard the same from other teams in the industry
        | (buy-side finance).
       | 
       | We found that notebook-based development is actually an
       | antipattern for software engineering. It was ostensibly helpful
       | for the narrower "data science" use case, but we have a much more
       | robust ETL and research platform we built on our own using
       | Pandas, Dask, Prefect and AWS.
       | 
       | And personally I hated writing code in notebooks. If you're
       | attached to that, you can basically get the same thing by using
       | PyCharm in scientific mode with cell execution.
        
       | ineedasername wrote:
       | _logging into the same system and interacting with the same
       | datasets through the same Notebook based UI_
       | 
        | I don't think there's a good one-size-fits-all UI that can be
        | applied to the different types of work that take place with data
        | and its consumption by users. This is evidenced in Databricks'
        | own feature set, which includes integrations with RStudio and
        | Tableau that treat Databricks as a data source rather than a work
        | environment.
        
       | dpq wrote:
        | We've had a mixed experience with Databricks, to be honest. The
        | quality and responsiveness of the support we were getting were
        | not worth the number of issues we were seeing plus the
        | considerable cost, so we decided to take the "exit strategy to
        | DIY Spark" (in the original post's terms) instead, and we're
        | pretty happy so far.
        
         | mathattack wrote:
         | Their sales team is somewhere between Oracle and a pro
         | wrestling villain in terms of sleaziness.
        
         | benjaminwootton wrote:
         | Cost is one thing we didn't really get to grips with. They use
          | a fairly abstract Databricks Unit (DBU) -
         | https://databricks.com/product/aws-pricing - and the costs felt
         | disjointed compared to the workloads we were running.
         | 
          | That said, cost didn't really spiral or become an issue for us,
          | especially when you take into account the avoided cost of
         | administering the Spark cluster and all of the tools. However,
         | the pricing model did feel a little opaque.
        
       ___________________________________________________________________
       (page generated 2021-02-14 23:01 UTC)