[HN Gopher] Why Databricks Is Winning
___________________________________________________________________
Why Databricks Is Winning
Author : benjaminwootton
Score : 46 points
Date : 2021-02-14 19:08 UTC (3 hours ago)
(HTM) web link (cloudnativeenterprise.substack.com)
(TXT) w3m dump (cloudnativeenterprise.substack.com)
| mobileexpert wrote:
| Databricks seems to be on a convergent evolution towards
| Snowflake. Between the two of them, I'd rather be starting
| from the position Snowflake is in versus Databricks.
| benjaminwootton wrote:
| They are both going towards the "Data Lakehouse" endpoint,
| which is driving some of the convergence. Silly term, but
| basically providing analytics and a database-type experience
| over a data lake.
|
| That said, Databricks is a much broader platform, with all of
| the collaboration environments and is generally much more
| programmable than Snowflake.
| nchammas wrote:
| > providing analytics and a database-type experience over a
| data lake.
|
| It's interesting how the modern data lake is developing in
| this way, recreating many patterns from the traditional
| database for distributed systems and massive scale: SQL and
| query optimization, transactions and time travel, schema
| evolution and data constraints...
|
| Having started out as a database developer / DBA many years
| ago, working with data lakes today reminds me in many ways of
| that early part of my career.
|
| I wrote a post tracing a common interface from the typical
| relational database to the modern data lake.
|
| https://nchammas.com/writing/modern-data-lake-database
| 00117 wrote:
| Why does nobody discuss Apache Pulsar as a viable alternative?
| peterthehacker wrote:
| Apache Pulsar is a streaming platform more comparable to Kafka.
| It doesn't have built-in parallel computation APIs like Spark.
| You can hook Spark Streaming up with Pulsar as a data source,
| though.
|
| https://pulsar.apache.org/docs/en/adaptors-spark/
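|
| For example, with a Pulsar Spark connector on the classpath, a
| structured streaming read might look roughly like this (the
| service URL and topic are placeholders):
|
|   from pyspark.sql import SparkSession
|
|   spark = SparkSession.builder.appName("pulsar-demo").getOrCreate()
|
|   # Read a Pulsar topic as a streaming DataFrame; assumes the
|   # pulsar-spark connector JAR is available to the cluster.
|   df = (spark.readStream
|         .format("pulsar")
|         .option("service.url", "pulsar://localhost:6650")
|         .option("topics", "my-topic")
|         .load())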
| fs111 wrote:
| DBFS is a complete joke though. A filesystem that has no
| timestamps? Really?
| cfeduke wrote:
| It's an abstraction over other file systems, like S3 or Azure
| BLOB storage. More a convenience than anything else, and helps
| ease porting code between cloud providers.
|
| Not a traditional file system, or even an HDFS clone.
| flowerlad wrote:
| "Fully managed Spark" sounded awesome a few years ago. But now,
| Spark can run on Kubernetes clusters [1]. If your infrastructure
| is already running on Kubernetes then you already have a cluster
| capable of running Spark. And because of the magic of Kubernetes
| you don't even have to dedicate nodes to Spark.
|
| [1] https://spark.apache.org/docs/latest/running-on-
| kubernetes.h...
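|
| A minimal sketch of what that looks like from PySpark in client
| mode (the API server URL and image name are placeholders):
|
|   from pyspark.sql import SparkSession
|
|   # Executors are scheduled as pods on the cluster; no dedicated
|   # Spark nodes needed.
|   spark = (SparkSession.builder
|            .master("k8s://https://kube-apiserver:6443")
|            .config("spark.kubernetes.container.image",
|                    "my-registry/spark-py:3.0.1")
|            .config("spark.executor.instances", "4")
|            .getOrCreate())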
| bpodgursky wrote:
| 100%.
|
| The Databricks notebooks have a lot of value (for now) but ever
| since running Spark on Kube... I have had literally 0 cluster
| issues. It's absolutely shocking, coming from a YARN-based
| environment, where I was constantly plagued with
| ResourceManager issues, autoscaling issues, preemption issues,
| network disconnect issues...
| fmajid wrote:
| I'd say the difference in valuation between Snowflake and
| Databricks clearly shows the latter is _not_ "winning".
| MrPowers wrote:
| Snowflake and Databricks are different, sometimes complementary
| technologies. You can store data in Snowflake & query it with
| Databricks for example: https://github.com/snowflakedb/spark-
| snowflake
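|
| A rough sketch of that pattern, given an existing SparkSession
| `spark` (connection values are placeholders; option names are
| per the connector docs):
|
|   sf_options = {
|       "sfURL": "myaccount.snowflakecomputing.com",
|       "sfUser": "user",
|       "sfPassword": "****",
|       "sfDatabase": "db",
|       "sfSchema": "public",
|       "sfWarehouse": "wh",
|   }
|
|   # Reads a Snowflake table into a Spark DataFrame.
|   df = (spark.read
|         .format("net.snowflake.spark.snowflake")
|         .options(**sf_options)
|         .option("dbtable", "orders")
|         .load())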
|
| Snowflake predicate pushdown filtering seems quite promising:
| https://www.snowflake.com/blog/snowflake-spark-part-2-pushin...
|
| Think both these companies can win.
| jayparth wrote:
| There's definitely a place in the industry for a No. 1 and a No.
| 2, but no. I don't think they're complementary, I think
| they're competitive. Both are trying to own the whole storage
| + compute layer eventually. I guess you can see Snowflake as
| a storage layer but they have attached scalable compute to
| it. Databricks started out as scalable compute and is now
| becoming more focused on storage.
|
| The idea of storing your data in Snowflake and querying it in
| Databricks is pretty silly. Why would you want to do that?
| Why not just use Snowflake's compute? Sure you could argue
| Spark has some transformations that are hard to express in
| SQL, but that is why Snowflake introduced Snowpark.
| lykr0n wrote:
| The one thing I see in my current company, and a growing trend
| with SaaS apps, is that companies are forgetting how to actually
| engineer. Like Boeing: the more you outsource, the less you're
| able to react to changing market forces and fix issues.
|
| We run Hadoop & Spark internally, but the team is underfunded and
| stuck in a constant cycle of fighting fires. And the result (and
| part of a larger push by the company, due to the same cycle of
| under-funding and culture issues) is that we're moving our
| petabytes of data into cloud providers' systems. Not only does
| the cost of doing this dwarf what it would take to actually fix
| our issues, but we're going to lose the people who know how to
| design and manage petabyte-scale Hadoop clusters.
|
| We wind up in a situation where we've locked up data fundamental
| to our company and our position in the market with a 3rd party,
| while losing the talent that would allow us to maintain full
| control over the data. If the service increases prices, changes
| its offering, or we get to a point where the offering doesn't
| meet our needs, we're fucked.
|
| It's nice that Databricks has a nice "offramp" that you can take
| to go somewhere else, but the general idea is the same.
| lumost wrote:
| It's incredibly difficult to fund internal platform teams
| appropriately. Usually one of three failure patterns emerges
|
| 1) The team is competent but picks up migration work to arbitrary
| technologies and approaches with no clear ROI. These migrations
| block feature development and never seem to end, e.g. teams
| ceaselessly migrating from GCP to AWS, to Kubernetes, to Podman,
| from MySQL to PostgreSQL, etc.
|
| 2) The team is operationally heavy and generates arbitrary
| requirements for everyone else to follow. The toolchain seems
| to get worse over time, and the number of hoops to jump through
| to get anything done endlessly grows, e.g. wait 2 weeks and get
| three business approvals for a server which you aren't allowed
| to have root access to.
|
| 3) The team has big ideas, but the business constantly under-
| invests. The team is called to fight every fire but unable to
| stop the fires through any meaningful project. The company ends
| up on a platform that's constantly on fire.
|
| When weighing these execution risks, building an internal
| platform for just about anything looks incredibly expensive.
| I've only been at 1 company out of 6 which nailed the internal
| platform tooling requirements. The only thing I can attribute
| their success to was quarterly NPS surveys on the developer
| experience for every major piece of the company's toolchain,
| plus hard-to-meet SLAs for uptime.
| hodgesrm wrote:
| > we're going to lose the people who know how to design and
| manage petabyte scale hadoop clusters.
|
| Why is that different from "lose the people who know how to
| design and write accounting systems from scratch?" That was
| what happened when packaged accounting systems showed up. I'm
| not sure why you would want to preserve knowledge of Hadoop if
| other technologies are more efficient.
| david38 wrote:
| Using third-party tools doesn't lock you out of them. Nothing
| stops you from collecting that data before sending it off.
|
| Every company outsources something fundamental. Does Google
| mine its own metal? Generate its own electricity?
|
| Even if you did it on prem, that doesn't save you from license
| renewal costs or upgrades. You can write the software yourself,
| but that's not cheap either.
| paulryanrogers wrote:
| Are metal and electricity Google's core competencies, though?
|
| I don't think anyone is arguing one should maintain their own
| silica, atoms, independent universe, etc.
| foobiekr wrote:
| As you say, this de-skilling problem is broader than SaaS; it is
| tied to outsourcing of critical competencies or, in younger
| companies, never having them in the first place.
|
| If you want to see de-skilling in action, hard core, go look
| into the service providers, wireless and fixed. They are
| running on fumes and attrition-victorious teams that last had a
| new technology in the late 70s/early 80s because everyone good
| at networking went to the FANG predecessors and then FANG
| proper.
| random314 wrote:
| This happens with every technology stack from steam engines to
| software. There was a time when programmers could solder
| together an ALU using transistor gates.
| MrPowers wrote:
| Some additional background:
|
| * AWS has a managed Spark offering called EMR
|
| * EMR pricing (https://aws.amazon.com/emr/pricing/) is lower than
| Databricks pricing (https://databricks.com/product/aws-pricing)
|
| * The Databricks notebook development experience is better than
| EMR's (but still really basic compared to IntelliJ / PyCharm
| text editing)
|
| * Both Databricks & EMR have proprietary Spark runtimes
|
| * Databricks is building a Spark runtime in C++ that might be
| faster (Delta Engine)
|
| * Spark lets you process massive datasets easily, with small
| teams. 2-3 person teams can build data ingestion pipelines to
| clean & process terabytes of data a day (see the sketch at the
| end of this comment). It's an incredible technology.
|
| * The difference between the PySpark & Scala APIs confuses the
| hell out of people
|
| * Whether or not people can run Python machine learning models
| on Spark clusters also confuses people
|
| * Overreliance on notebooks causes big issues (no version
| control, tests, deployment process, dependency management)
|
| The big data ecosystem is constantly evolving and you need to
| study constantly to keep up.
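|
| To make the pipeline point concrete, a toy PySpark ingestion
| job is only a few lines (paths and columns are placeholders):
|
|   from pyspark.sql import SparkSession, functions as F
|
|   spark = SparkSession.builder.appName("ingest").getOrCreate()
|
|   # Read raw JSON, drop malformed rows, write partitioned Parquet.
|   (spark.read.json("s3://raw-bucket/events/")
|       .filter(F.col("event_type").isNotNull())
|       .withColumn("event_date", F.to_date("event_ts"))
|       .write.partitionBy("event_date")
|       .mode("append")
|       .parquet("s3://clean-bucket/events/"))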
| nknealk wrote:
| > The difference between the PySpark & Scala APIs confuses the
| hell out of people
|
| I personally had this experience when first wanting to learn
| Spark, and it really turned me off to the whole Spark
| ecosystem. Curious if you have any suggested resources that do
| a good job on this?
| MrPowers wrote:
| Spark offers Scala, Python, Java, and R APIs. Scala & Python
| are the most viable options (R lacks a lot of features and
| Java is only good for people that love Java).
|
| Scala & PySpark are both great options. Lots of devs are
| terrified of Scala, so PySpark is more popular now. I'd say
| Scala has a slight technical advantage, see here for more
| details: https://mungingdata.com/apache-spark/python-pyspark-
| scala-wh.... Both are great overall.
|
| I wrote a book that's a practical introduction to Spark:
| https://leanpub.com/beautiful-spark/
|
| Most of the training materials are theoretical, which makes
| Spark seem really intimidating. You can learn some basic
| Spark principles and get up-and-running with production
| workflows quickly.
| dragonwriter wrote:
| > Overreliance on notebooks causes big issues (no version
| control, tests, deployment process, dependency management)
|
| The last three of those things might be valid issues, but since
| when are notebooks not just as subject as any other source code
| format to version control? (I get that the difference between
| the UI and the on-disk structure may make typical diff tools
| less-than-ideal, but VC itself is unaffected.)
| HuwFulcher wrote:
| In my experience the use of notebooks (exclusively) goes hand
| in hand with not knowing that things such as version control,
| tests, deployment processes or dependency management exist.
|
| I don't mean to sound harsh about other Data Scientists from
| a non software engineering background but the standard
| workflow is to fiddle around with a notebook until you can
| get a result. That's as far as it goes, no real robustness to
| it.
|
| That's a pretty big generalisation, but in organisations that
| "home grow" their Data Science capability, many of the online
| courses people learn from don't cover production-level Data
| Science.
| MrPowers wrote:
| Your experience aligns with what I've seen.
|
| All the notebooks are in one place. Some are for important
| production jobs, others are for data exploration.
|
| It's easy to make a little edit in a notebook and
| accidentally break production jobs.
|
| It's comparatively harder to make an edit in a git repo and do
| a deploy that'll break production jobs (e.g. if the JAR doesn't
| compile or the CI errors out because the tests don't pass).
|
| Notebook based production jobs get even more dangerous when
| NotebookA depends on NotebookB and so on.
| alexott wrote:
| Treat notebooks like other code: separate them into staging
| and production, with defined promotions between them - it's
| possible. You can run tests in CI/CD pipelines, etc. You can
| set permissions so nobody can update production notebooks
| manually, ...
| alexott wrote:
| The last point isn't so dramatic. There is version control for
| notebooks, and a better version is coming (right now it's in
| preview under the code name Projects; it's in the official
| docs). Tests are possible - either Nutter from Microsoft, or
| home-grown (about 20-30 lines on top of the built-in unittest).
| Deployment is also not so complicated - there are tools (a
| provider for Terraform, the cicd-templates project,
| databricks-cli, etc.); you can refer to MS Learn for a short
| course about CI/CD for Databricks and Azure DevOps, etc...
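|
| A home-grown test along those lines can be a plain unittest
| that spins up a local SparkSession and exercises a function
| factored out of a notebook (the module and function here are
| hypothetical):
|
|   import unittest
|   from pyspark.sql import SparkSession
|
|   # Hypothetical transformation factored out of a notebook so
|   # it can be imported and tested.
|   from jobs.transforms import add_event_date
|
|   class AddEventDateTest(unittest.TestCase):
|       @classmethod
|       def setUpClass(cls):
|           cls.spark = (SparkSession.builder
|                        .master("local[2]")
|                        .getOrCreate())
|
|       def test_adds_column(self):
|           df = self.spark.createDataFrame(
|               [("a", "2021-02-14")], ["id", "event_ts"])
|           result = add_event_date(df)
|           self.assertIn("event_date", result.columns)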
| MrPowers wrote:
| The last point was for teams that only rely on notebooks,
| sorry if I didn't make that clear.
|
| You're right that all those issues can be sidestepped if you
| build projects in version controlled Git repos, test the
| code, and deploy JAR / Wheel files.
|
| Speaking of testing, can you let me know if this PySpark
| testing fix worked for you ;)
| https://github.com/MrPowers/chispa/issues/6
| Someone wrote:
| As _alexott_ says, you _can_ link notebooks to a git
| repository and commit and roll them back from the Databricks
| UI (https://docs.databricks.com/notebooks/github-version-
| control...)
|
| Problem is that it's limited. You can't, for example,
| commit multiple files in one go (that improves a little bit
| in https://docs.databricks.com/projects.html), or merge
| changes a colleague made with your changes. You also have
| to use the UI Databricks provides. You can't use a git CLI
| or whatever GUI you prefer. (all AFAIK, but I'm fairly
| certain about it)
| alexott wrote:
| I'm sorry for the delay, will fix ASAP...
|
| My point is that you can do that even without jars/wheels -
| you can do VC and tests of notebooks. For example,
| https://github.com/alexott/databricks-nutter-projects-demo
| nchammas wrote:
| > * AWS has a managed Spark offering called EMR
|
| There is also my rinky-dink open source project, Flintrock [0],
| that will launch open source Spark clusters on AWS for you.
|
| It's probably not the right tool for production use (and you
| would be right to wonder why Flintrock exists when we have EMR
| [1]), but I know of several companies that have used Flintrock
| at one point or other in production at large scale (like, 400+
| node clusters).
|
| [0]: https://github.com/nchammas/flintrock
|
| [1]: https://github.com/nchammas/flintrock#why-build-flintrock-
| wh...
| throwaway556179 wrote:
| We recently ran a large clustering job over billions of records
| (and a few TBs of data on spinning disks) on a single machine
| with minimal command line tooling in a few hours. Not really
| optimized yet. People forget how fast modern hardware is and
| overestimate how much useful data they have (or need).
|
| I think I should start a company around minimalistic data
| tooling or the like - the amount of waste seems large across
| the industry.
|
| I saw the de-skilling a few years ago, when a guy stitched
| together a complete application from a couple of SaaS APIs.
| Cool, but it somehow does not impress me.
| paulryanrogers wrote:
| Bare metal can be incredibly fast, if you can get access for
| a reasonable price. Virtualization is becoming a continuum
| but the overhead is always there.
| MrPowers wrote:
| Great point, r5.metal instances have 96 CPUs and 768 GB of
| RAM. Lots of "big data problems" can actually be solved with
| a single big EC2 instance. Cluster computing should always be
| avoided when a single node will do.
| throwaway556179 wrote:
| We had something like 24G of RAM, but we have 500GB RAM
| machines as well (we own the hardware) and the job would
| have been even more of a breeze there.
| spicyramen wrote:
| For reasons I don't understand, our company ended up choosing
| Google Cloud and it's a complete mess. You have BigQuery,
| Dataproc, AI Platform Training, Colab, AI Platform Notebooks,
| and many tools that do the same thing but are not well
| integrated. My personal favourite is AI Notebooks, but I need
| additional plugins to interact with BigQuery, S3 and GCS. It
| requires a lot of customization; we used Azure and Databricks
| before, where we had a one-stop shop. I heard Google is
| integrating some of their products, but that will take a while.
| In the meantime we lose hours of productivity figuring out how
| to use their products (whose stability is horrible, e.g. AI
| Platform Jobs and the permission model).
| simo7 wrote:
| A (big?) part of Databricks' success is the complicated mess that
| Spark is to run.
| HuwFulcher wrote:
| Considering the makers of Databricks are also the maintainers
| of Spark, it makes you wonder whether this is deliberate.
| peterthehacker wrote:
| I've had a lot of success with Dask lately. It's comparable to
| Spark in some ways [0]. Being written in Python and built on
| top of pandas/numpy, it allows much more flexibility. It's also
| easier to get adoption from data scientists, who are usually
| more comfortable with Python than Scala. It also has great
| tools built on top of Kubernetes, making deployment quick and
| easy [1].
|
| [0]https://docs.dask.org/en/latest/spark.html
|
| [1]https://github.com/dask/dask-gateway
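|
| The dataframe API mirrors pandas closely; a minimal sketch
| (the path and columns are placeholders):
|
|   import dask.dataframe as dd
|
|   # Looks like pandas, but partitions the work lazily across
|   # however many workers you have.
|   df = dd.read_parquet("s3://bucket/events/")
|   daily = df.groupby("event_date")["value"].mean()
|   print(daily.compute())  # triggers the actual computation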
| pletnes wrote:
| Just for clarification, Dask lets you run any Python function
| on any Python datatype. Numpy/pandas is faster, but anything is
| doable with Dask. This makes it eminently flexible, unlike many
| "big data" tools that only work on tables or have other
| arbitrary limitations.
| peterthehacker wrote:
| Yes, you can use Dask's lower-level APIs, like futures [0] or
| dask.delayed, and it'll just pickle the Python objects. Dask
| also provides a group of collections APIs, like dataframes [1]
| and arrays, that use more efficient serialization methods.
|
| [0]https://docs.dask.org/en/latest/futures.html
|
| [1]https://docs.dask.org/en/latest/dataframe.html
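|
| For instance, dask.delayed turns ordinary Python functions into
| a lazy task graph (the file names are placeholders):
|
|   import dask
|
|   @dask.delayed
|   def load(path):
|       return open(path).read()
|
|   @dask.delayed
|   def count_words(text):
|       return len(text.split())
|
|   # Build a graph over plain Python objects, then execute it.
|   totals = [count_words(load(p)) for p in ["a.txt", "b.txt"]]
|   print(sum(dask.compute(*totals)))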
| MrPowers wrote:
| My feeling is that Spark is better for huge ETL jobs and Dask
| is better for certain types of model building (because of easy
| access to Python libraries), but I don't have any benchmarks to
| back this up. Would love some results comparing processing tens
| of terabytes on equal-sized i3.xlarge EC2 clusters to get a
| better idea.
|
| Have you seen any good Spark vs. Dask benchmarks?
| peterthehacker wrote:
| I would be interested in seeing some Spark vs. Dask benchmarks
| too. Haven't seen any yet though.
|
| In my experience, Dask really shines when you implement custom
| numpy computations that could otherwise only be done in Spark
| UDFs. We saw a decent performance difference there, but for
| common built-in computations I'd imagine that Spark has better
| performance.
|
| Edit: after some googling I found this paper with benchmarks.
|
| https://arxiv.org/pdf/1907.13030.pdf
|
| > Results show that despite slight differences between Spark
| and Dask, both engines perform comparably. However, Dask
| pipelines risk being limited by Python's GIL depending on
| task type and cluster configuration. In all cases, the major
| limiting factor was data transfer.
| kornish wrote:
| It's also worth noting that with Spark, you can perform
| arbitrary computation using the Dataset API, operating on
| case classes.
| fractionalhare wrote:
| Interestingly, my team is actually moving off Databricks. I have
| anecdotally heard the same from other teams in the industry (buy
| side finance).
|
| We found that notebook-based development is actually an
| antipattern for software engineering. It was ostensibly helpful
| for the narrower "data science" use case, but we have a much more
| robust ETL and research platform we built on our own using
| Pandas, Dask, Prefect and AWS.
|
| And personally I hated writing code in notebooks. If you're
| attached to that, you can basically get the same thing by using
| PyCharm in scientific mode with cell execution.
| ineedasername wrote:
| _logging into the same system and interacting with the same
| datasets through the same Notebook based UI_
|
| I don't think there's a good one-size-fits-all UI that can be
| applied to the different types of work that take place with
| data and its consumption by users. This is evidenced in
| Databricks' own feature set, which includes integrations with
| RStudio and Tableau that treat Databricks as a data source
| rather than a work environment.
| dpq wrote:
| We've had mixed experience with Databricks, to be honest. The
| quality and responsiveness of support we were getting was not
| worth the amount of issues we were seeing + the considerable
| cost, so we decided to use "exit strategy to DIY Spark" (using
| the original post's terms) instead and we're pretty happy so far.
| mathattack wrote:
| Their sales team is somewhere between Oracle and a pro
| wrestling villain in terms of sleaziness.
| benjaminwootton wrote:
| Cost is one thing we didn't really get to grips with. They use
| a fairly abstract Databricks Unit (DBU) -
| https://databricks.com/product/aws-pricing - and the costs felt
| disjointed compared to the workloads we were running.
|
| This said, cost didn't really spiral or become an issue for us,
| especially when you take into account the cost avoidance of
| administering the Spark cluster and all of the tools. However,
| the pricing model did feel a little opaque.
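|
| For illustration only (every rate below is made up, not an
| actual Databricks or AWS price), the bill composes roughly
| like this:
|
|   # Illustrative placeholders, not real prices.
|   dbu_rate = 0.15           # $ per DBU, varies by tier/workload
|   dbus_per_node_hour = 1.0  # varies by instance type
|   ec2_rate = 0.312          # $ per node-hour for the VM itself
|
|   nodes, hours = 10, 6
|   cost = nodes * hours * (ec2_rate + dbu_rate * dbus_per_node_hour)
|   print(f"${cost:.2f}")  # the Databricks fee sits on top of EC2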
___________________________________________________________________
(page generated 2021-02-14 23:01 UTC)