[HN Gopher] Building and scaling Notion's data lake
___________________________________________________________________
Building and scaling Notion's data lake
Author : alexzeitler
Score : 140 points
Date : 2024-07-14 09:02 UTC (13 hours ago)
(HTM) web link (www.notion.so)
(TXT) w3m dump (www.notion.so)
| mritchie712 wrote:
| > Iceberg and Delta Lake, on the other hand, weren't optimized
| for our update-heavy workload when we considered them in 2022
|
| "when we considered them in 2022" is significant here because
| both Iceberg and Delta Lake have made rapid progress since then.
| I talk to a lot of companies making this decision and the
| consensus is swinging towards Iceberg. If they're already heavy
| Databricks users, then Delta is the obvious choice.
|
| For anyone that missed it, Databricks acquired Tabular[0] (which
| was founded by the creators of Iceberg). The public facing story
| is that both projects will continue independently and I really
| hope that's true.
|
| Shameless plug: this is the same infrastructure we're using at
| Definite[1], and we're betting a lot of companies want a setup
| like this but can't afford to build it themselves. It's
| radically cheaper than the standard Snowflake + Fivetran +
| Looker stack and works from day one. A lot of companies just
| want dashboards, and it's pretty ridiculous the hoops you have
| to jump through to get them running.
|
| We use Iceberg for storage, DuckDB as a query engine, and a few
| open source projects for ETL, and we built a frontend to manage
| it all and create dashboards.
|
| 0 - https://www.definite.app/blog/databricks-tabular-acquisition
|
| 1 - https://www.youtube.com/watch?v=7FAJLc3k2Fo
| whinvik wrote:
| Is there any advantage to having both a data lake setup and
| Snowflake? Why would one still want Snowflake after doing such
| an extensive data lake setup?
| luizfwolf wrote:
| Saving money, 100%. Also lower latency on distributed access.
| Accessing partitioned files in S3 doesn't require spinning up a
| warehouse and waiting for your query to sit in a queue, so if
| every job runs in something like k8s you don't have to manage
| resources, and auto-scaling in Snowflake is a "paid feature".
|
| I believe just not having to deal with a query queue system is
| already worth it.
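|
| A toy stdlib sketch of why the partitioned layout skips the
| queue: readers prune by key prefix before fetching anything.
| The Hive-style key layout here is hypothetical:

```python
# Hive-style partitioned keys, as they might appear in an S3
# listing (bucket and paths are made up for illustration).
keys = [
    "lake/events/date=2024-07-12/part-0.parquet",
    "lake/events/date=2024-07-13/part-0.parquet",
    "lake/events/date=2024-07-14/part-0.parquet",
    "lake/events/date=2024-07-14/part-1.parquet",
]

def prune(keys, date):
    """Keep only files in the partition the query touches."""
    prefix = f"lake/events/date={date}/"
    return [k for k in keys if k.startswith(prefix)]

# Only 2 of 4 files are even considered for the fetch.
print(prune(keys, "2024-07-14"))
```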
| mritchie712 wrote:
| Many BI / analytics tools don't have great support for Data
| Lakes, so part of the reason could be supporting those tools
| (e.g. they still load some of their data to snowflake to power
| BI / dashboards)
| Lucasoato wrote:
| We've solved that issue with Trino. Superset and a lot of
| other BI tools support connection to it and it's a very cost
| efficient engine (compared to DWH solutions). Another way to
| go even cheaper is using Athena, if you're on AWS.
| adolph wrote:
| They seem to be doing lots of work but I don't understand what
| customer value this creates.
|
| What does a backing data lake afford a Notion user that can't be
| done in a similar product, like Obsidian?
| jpalomaki wrote:
| From the article: "Unlock AI, Search, and other product use
| cases that require denormalized data"
| SOLAR_FIELDS wrote:
| Beyond the features that the sibling comment mentioned, this
| kind of data isn't really for end users. It's a way that you
| can package it up, "anonymize" it, and sell the data to
| interested parties.
| sghiassy wrote:
| Is that why they're putting images in Postgres? I don't
| understand that design decision yet.
| bastawhiz wrote:
| I... Don't think they are? If you look at the URL for
| images in notion, you can see the S3 hostname.
| benaubin wrote:
| Notion employee here. We don't put images themselves in
| Postgres- we use s3 to store them. The article is referring
| to image blocks, which are effectively pointers to the
| image.
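|
| A hypothetical sketch of what such a pointer row might look
| like (field names invented, not Notion's actual schema):

```python
# The block row in Postgres holds metadata plus an S3 pointer;
# the image bytes themselves live only in S3.
from dataclasses import dataclass

@dataclass
class ImageBlock:
    block_id: str
    parent_id: str
    s3_url: str  # pointer to the actual bytes in S3

block = ImageBlock(
    block_id="b1",
    parent_id="page-42",
    s3_url="https://s3.amazonaws.com/some-bucket/b1.png",
)
print(block.s3_url)
```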
| Ozzie_osman wrote:
| For someone like Notion, they probably aren't selling this
| data. The primary use case is internally for analysis (eg
| product usage, business analysis, etc).
|
| It can also be used to train AI models, of course.
| Cthulhu_ wrote:
| That "probably" is doing a lot of heavy lifting. That said,
| whether they sell it or not, it's all that data that is
| their primary value store at the moment. They will either
| go public or sell, eventually. If they go public, it'll
| likely be similar to Dropbox; a single fairly successful
| product, but failing attempts to diversify.
| TeMPOraL wrote:
| "Selling" is a load-bearing word, too. They're probably
| not literally selling SQL dumps for hard cash. But there
| are many ways of indirectly selling data, that are almost
| equivalent to trading database dumps, but indirect enough
| that the company can say they're not selling data, and be
| technically correct.
| ambicapter wrote:
| 1st paragraph: "Managing this rapid growth while meeting the
| ever-increasing data demands of critical product and analytics
| use cases, especially our recent Notion AI features, meant
| building and scaling Notion's data lake."
| bastawhiz wrote:
| The whole point of a data warehouse is that you can rapidly
| query a huge amount of data with ad hoc queries.
|
| When your data is in Postgres, running an arbitrary query might
| take hours or days (or longer). Postgres does very poorly for
| queries that read huge amounts of data when there's no
| preexisting index (and you're not going to be building one-off
| indexes for ad hoc queries--that defeats the point). A data
| warehouse is slower for basic queries but substantially faster
| for queries that run against terabytes or petabytes of data.
|
| I can imagine some use cases at Notion:
|
| - You want to know the most popular syntax highlighting
| languages
|
| - You're searching for data corruption, where blocks form a
| cycle
|
| - You're looking for users who are committing fraud or abuse
| (like using bots in violation of your tos)
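|
| The cycle-hunting case above can be sketched in a few lines of
| Python (block ids and the parent-pointer shape are
| hypothetical):

```python
# Find blocks from which a parent-pointer cycle is reachable --
# the kind of corruption an ad-hoc lake scan could surface.
def find_cycles(parent_of):
    """Return ids of blocks whose parent chain hits a cycle."""
    bad = set()
    for start in parent_of:
        seen = set()
        node = start
        # Walk parent pointers until we fall off the tree (None)
        # or revisit a node (cycle).
        while node is not None and node not in seen:
            seen.add(node)
            node = parent_of.get(node)
        if node is not None:  # stopped on a repeat: cycle
            bad.add(start)
    return bad

blocks = {"a": "b", "b": "c", "c": "a", "d": "a", "e": None}
print(find_cycles(blocks))
```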
| SOLAR_FIELDS wrote:
| They didn't say the quiet part out loud, which is almost
| certainly that the Fivetran and Snowflake bills for what they
| were doing were probably enormous and those were undoubtedly what
| got management's attention about fixing this.
| rorymalcolm wrote:
| Found this comment (from Fivetran's CEO, so, with that in mind)
| regarding this article enlightening regarding the costs they
| were facing here
| https://twitter.com/frasergeorgew/status/1808326803796512865
| sneak wrote:
| I thought the quiet part was that they are data mining their
| customer data (and disclosing it to multiple third parties)
| because it's not E2EE and they can read everyone's private and
| proprietary notes.
|
| Otherwise, this is the perfect app for sharding/horizontal
| scalability. Your notes don't need to be queried or joined with
| anyone else's notes.
| redpoint wrote:
| This ^. This switch from managed to in house is a good
| example of only building when necessary.
| altdataseller wrote:
| There's also the question of whether this data lake is worth
| the cost/effort. How does it add value to the user experience?
| What is this "AI" stuff that the data lake enables?
|
| For example, they mention search. But I imagine that is just
| searching within your own docs, which I presume should be fast
| and efficient if everything is sharded by user in Postgres.
|
| The tech stuff is all fine and good, but if it adds no value,
| it's just playing with technology for technology's sake.
| mritchie712 wrote:
| They weren't that quiet about it:
|
| > Moving several large, crucial Postgres datasets (some of them
| tens of TB large) to data lake gave us a net savings of over a
| million dollars for 2022 and proportionally higher savings in
| 2023 and 2024.
| patrickmay wrote:
| I'd like to see more details. 10s of TB isn't that large --
| why so expensive?
| riku_iki wrote:
| Maybe cloud hosted
| methou wrote:
| > Data lake > Data warehouse
|
| Those aren't terms I want to hear if I'm still using Notion.
| It's very bold to publish something like this on their own
| website.
| bnj wrote:
| Could you expand on this?
| lopkeny12ko wrote:
| What's there to expand on? Do you not realize how bad a look
| it is for a company to publicly admit, _on their own blog_,
| the amount of time and engineering effort they spent to
| package up, move, analyze, and sell all their customers'
| private data?
|
| This is why laws like CCPA "do not sell my personal
| information" exist, which I certainly hope Notion is abiding
| by, otherwise they'll have lawyers knocking on their door
| soon.
| Cthulhu_ wrote:
| Where do they say they sell it? Citation needed; that's a
| legal and reputational minefield that I don't think they
| would admit to, like you said.
| lopkeny12ko wrote:
| I would challenge you to find any broker who sells data
| (like the T-Mobile location data scandal) who says
| plainly and clearly they sell user data.
| quest88 wrote:
| This is not answering the question.
| bnj wrote:
| Right, yes, tone aside that's very helpful: at first I
| didn't understand the implication of the blog post for
| implementing customer-hostile solutions, but you've helped
| me understand it now.
| bastawhiz wrote:
| Those are just different words for "database". What do you care
| what kind of database your Notion data is sitting in?
| TeMPOraL wrote:
| A "data lake" strongly suggests there's lot of information
| the company needs to aggregate and process globally, which
| should very much _not_ be the case with a semi-private rich
| notebook product.
| bastawhiz wrote:
| They literally explained in the article why they have a
| data lake instead of just a data warehouse: their data
| model means it's slow and expensive to ingest that data
| into the warehouse from Postgres. The data lake is serving
| the same functions that the data warehouse did, but now
| that the volume of data has exceeded what the warehouse can
| handle, the data lake fills that gap.
|
| I wrote another comment about why you'd need this in the
| first place:
|
| https://news.ycombinator.com/item?id=40961622
|
| Frankly the argument "they shouldn't need to query the data
| in their system" is kind of silly. If you don't want your
| data processed for the features and services the company
| offers, don't use them.
| anoncareer0212 wrote:
| > Frankly the argument "they shouldn't need to query the
| data in their system" is kind of silly.
|
| Neutral party here: that's not what they said.
|
| A) Quotes shouldn't be there.
|
| B) A heuristic I've started applying to my comments: if I'm
| tempted to "quote" something that _isn't a quote_, it means
| I don't fully understand what they mean and should ask a
| question. This dovetails nicely with the spirit of HN's
| "come with curiosity".
|
| It is disquieting because:
|
| A) These are very much ill-defined terms (what, exactly,
| is a data lake, vs. a data warehouse, vs. a database?), and
| as far as I've had to understand this stuff, and as a quick
| spot check of Google shows, it's about accumulating more
| data in one place.
|
| B) This is antithetical to a consumer's desired approach
| to data, which would be described, parodically, as: stored
| individually, on one computer, behind 3 locked doors and
| 20 layers of encryption.
| nojvek wrote:
| At the scale of Notion, with millions of users, they'd have
| that much data.
|
| I've seen 100TB+ workloads at smaller companies. Not
| unusual.
| iLoveOncall wrote:
| The concern isn't the scale, it's the use. What is there
| to _process_ when they're supposed to only store and
| retrieve to show to users?
| zarmin wrote:
| Given how infuriating their in-app database implementation
| is, perhaps it's not that surprising.
| DataDaemon wrote:
| OK, thanks. When E2EE?
| j45 wrote:
| This was a nice read, interesting to see how far Postgres
| (largely alone) can get you.
|
| Also, we see how self-hosting within a startup can make perfect
| sense. :)
|
| DevOps that abstracts things away to the cloud might, in some
| cases, just add to architectural and technical debt later,
| without the history of learning that comes from working through
| the challenges.
|
| Still, it might have been a great opportunity to figure out
| offline-first use of Notion.
|
| I have been forced to use Anytype instead of Notion for the
| offline-first reason. Time to check out the source code to
| learn how they handle storage.
| wejick wrote:
| I'm not familiar with data lake setups on S3. When replicating
| a db table to S3, what format is used?
|
| And I'm wondering: is it possible to update the S3 files to
| reflect the latest incoming changes on the db table?
| mritchie712 wrote:
| The file format is often Parquet. The "table format" depends on
| what data lake you're using (e.g. Iceberg, Delta, etc.).
|
| If you know Python, here's[0] a practical example of how
| Iceberg works.
|
| 0 - https://www.definite.app/blog/iceberg-query-engine
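|
| On the update question upthread: table formats don't rewrite S3
| files in place; conceptually, incoming change records get
| merged with the current snapshot by primary key. A stdlib
| sketch of that merge step (row shape invented; real Iceberg
| does this over Parquet data and delete files):

```python
# Upsert semantics a table format provides when a CDC batch
# lands: last write per primary key wins, deletes drop rows.
def apply_changes(snapshot, changes):
    table = {row["id"]: row for row in snapshot}
    for change in changes:
        if change.get("deleted"):
            table.pop(change["id"], None)  # tombstone: drop row
        else:
            # Upsert, stripping the CDC-only "deleted" flag.
            table[change["id"]] = {
                k: v for k, v in change.items() if k != "deleted"
            }
    return sorted(table.values(), key=lambda r: r["id"])

snapshot = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]
changes = [
    {"id": 2, "title": "b2"},      # update
    {"id": 3, "title": "c"},       # insert
    {"id": 1, "deleted": True},    # delete
]
print(apply_changes(snapshot, changes))
```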
___________________________________________________________________
(page generated 2024-07-14 23:00 UTC)