[HN Gopher] Building and scaling Notion's data lake
       ___________________________________________________________________
        
       Building and scaling Notion's data lake
        
       Author : alexzeitler
       Score  : 140 points
       Date   : 2024-07-14 09:02 UTC (13 hours ago)
        
 (HTM) web link (www.notion.so)
 (TXT) w3m dump (www.notion.so)
        
       | mritchie712 wrote:
       | > Iceberg and Delta Lake, on the other hand, weren't optimized
       | for our update-heavy workload when we considered them in 2022
       | 
       | "when we considered them in 2022" is significant here because
       | both Iceberg and Delta Lake have made rapid progress since then.
       | I talk to a lot of companies making this decision and the
       | consensus is swinging towards Iceberg. If they're already heavy
       | Databricks users, then Delta is the obvious choice.
       | 
       | For anyone that missed it, Databricks acquired Tabular[0] (which
       | was founded by the creators of Iceberg). The public facing story
       | is that both projects will continue independently and I really
       | hope that's true.
       | 
       | Shameless plug: this is the same infrastructure we're using at
       | Definite[1] and we're betting a lot of companies want a setup
       | like this, but can't afford to build it themselves. It's
       | radically cheaper than the standard Snowflake + Fivetran + Looker
       | stack and works day one. A lot of companies just want dashboards
       | and it's pretty ridiculous the hoops you need to jump thru to get
       | them running.
       | 
       | We use iceberg for storage, duckdb as a query engine, a few open
       | source projects for ETL and built a frontend to manage it all and
       | create dashboards.
       | 
       | 0 - https://www.definite.app/blog/databricks-tabular-acquisition
       | 
       | 1 - https://www.youtube.com/watch?v=7FAJLc3k2Fo
        
       | whinvik wrote:
       | Is there any advantage to having both a Data Lake setup as well
       | as Snowflake? Why would one also want Snowflake after doing such
       | an extensive data lake setup?
        
         | luizfwolf wrote:
         | Saving money, 100%, and also lower latency on distributed
         | access. Reading partitioned files on S3 doesn't require you to
         | spin up a warehouse and wait for your query to sit in a queue,
         | so if every job runs in something like k8s you don't have to
         | manage resources. Also, auto scaling in Snowflake is a "paid
         | feature".
         | 
         | I believe just not having to handle a query queue system is
         | already a win.
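A minimal sketch of why partitioned files on object storage can be read without a warehouse: if the keys follow the common Hive-style `name=value` layout (an assumption here, not something the thread confirms about Notion's setup), a job can pick the files it needs from the key names alone. The bucket layout and the helper names `parse_partitions` and `prune` are hypothetical.

```python
from typing import Dict, List


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract Hive-style partition values (name=value) from an object key."""
    parts: Dict[str, str] = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune(keys: List[str], **wanted: str) -> List[str]:
    """Keep only the keys whose partition values match every requested filter."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(name) == value
               for name, value in wanted.items())
    ]


# Hypothetical object keys; a real job would list them from the bucket.
keys = [
    "events/dt=2024-07-13/part-0.parquet",
    "events/dt=2024-07-14/part-0.parquet",
    "events/dt=2024-07-14/part-1.parquet",
]
print(prune(keys, dt="2024-07-14"))
```

Only the matching files would then be opened, so no shared query queue is involved.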
        
         | mritchie712 wrote:
         | Many BI / analytics tools don't have great support for Data
         | Lakes, so part of the reason could be supporting those tools
         | (e.g. they still load some of their data to snowflake to power
         | BI / dashboards)
        
           | Lucasoato wrote:
           | We've solved that issue with Trino. Superset and a lot of
           | other BI tools support connection to it and it's a very cost
           | efficient engine (compared to DWH solutions). Another way to
           | go even cheaper is using Athena, if you're on AWS.
        
       | adolph wrote:
       | They seem to be doing lots of work but I don't understand what
       | customer value this creates.
       | 
       | What does a backing data lake afford a Notion user that can't be
       | done in a similar product, like Obsidian?
        
         | jpalomaki wrote:
         | From the article: "Unlock AI, Search, and other product use
         | cases that require denormalized data"
        
         | SOLAR_FIELDS wrote:
         | Beyond the features that the sibling comment mentioned, this
         | kind of data isn't really for end users. It's a way that you
         | can package it up, "anonymize" it, and sell the data to
         | interested parties.
        
           | sghiassy wrote:
           | Is that why they're putting images in Postgres? I don't
           | understand that design decision yet.
        
             | bastawhiz wrote:
             | I... Don't think they are? If you look at the URL for
             | images in notion, you can see the S3 hostname.
        
             | benaubin wrote:
             | Notion employee here. We don't put images themselves in
             | Postgres; we use S3 to store them. The article is referring
             | to image blocks, which are effectively pointers to the
             | image.
        
           | Ozzie_osman wrote:
           | For someone like Notion, they probably aren't selling this
           | data. The primary use case is internally for analysis (eg
           | product usage, business analysis, etc).
           | 
           | It can also be used to train AI models, of course.
        
             | Cthulhu_ wrote:
             | That "probably" is doing a lot of heavy lifting. That said,
             | whether they sell it or not, it's all that data that is
             | their primary value store at the moment. They will either
             | go public or sell, eventually. If they go public, it'll
             | likely be similar to Dropbox; a single fairly successful
             | product, but failing attempts to diversify.
        
               | TeMPOraL wrote:
               | "Selling" is a load-bearing word, too. They're probably
               | not literally selling SQL dumps for hard cash. But there
               | are many ways of indirectly selling data, that are almost
               | equivalent to trading database dumps, but indirect enough
               | that the company can say they're not selling data, and be
               | technically correct.
        
         | ambicapter wrote:
         | 1st paragraph: "Managing this rapid growth while meeting the
         | ever-increasing data demands of critical product and analytics
         | use cases, especially our recent Notion AI features, meant
         | building and scaling Notion's data lake."
        
         | bastawhiz wrote:
         | The whole point of a data warehouse is that you can rapidly
         | query a huge amount of data with ad hoc queries.
         | 
         | When your data is in Postgres, running an arbitrary query might
         | take hours or days (or longer). Postgres does very poorly for
         | queries that read huge amounts of data when there's no
         | preexisting index (and you're not going to be building one-off
         | indexes for ad hoc queries--that defeats the point). A data
         | warehouse is slower for basic queries but substantially faster
         | for queries that run against terabytes or petabytes of data.
         | 
         | I can imagine some use cases at Notion:
         | 
         | - You want to know the most popular syntax highlighting
         | languages
         | 
         | - You're searching for data corruption, where blocks form a
         | cycle
         | 
         | - You're looking for users who are committing fraud or abuse
         | (like using bots in violation of your tos)
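The data-corruption use case above (blocks that form a cycle) can be sketched as a walk over parent pointers. This is a toy, in-memory illustration, assuming each block stores the ID of a single parent; the `parent_of` mapping and the block IDs are hypothetical, and a real job over a data lake would run as a distributed query rather than a Python loop.

```python
from typing import Dict, List, Optional, Set


def find_blocks_in_cycles(parent_of: Dict[str, Optional[str]]) -> Set[str]:
    """Return the IDs of all blocks that sit on a parent-pointer cycle."""
    in_cycle: Set[str] = set()
    finished: Set[str] = set()  # blocks whose ancestor chain was already checked

    for start in parent_of:
        path: List[str] = []            # blocks visited on this walk, in order
        position: Dict[str, int] = {}   # block ID -> index in path
        node: Optional[str] = start
        while node is not None and node not in finished and node not in position:
            position[node] = len(path)
            path.append(node)
            node = parent_of.get(node)
        if node is not None and node in position:
            # The walk came back to a block seen on this same walk: a cycle.
            in_cycle.update(path[position[node]:])
        finished.update(path)

    return in_cycle


blocks = {
    "page": None,    # root block
    "para": "page",
    "a": "c",        # a -> c -> b -> a is corrupt
    "b": "a",
    "c": "b",
    "d": "a",        # hangs off the cycle but is not on it
}
print(sorted(find_blocks_in_cycles(blocks)))
```

Each block is visited once, so the walk stays linear in the number of blocks.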
        
       | SOLAR_FIELDS wrote:
       | They didn't say the quiet part out loud, which is almost
       | certainly that the Fivetran and Snowflake bills for what they
       | were doing were probably enormous and those were undoubtedly what
       | got management's attention about fixing this.
        
         | rorymalcolm wrote:
         | Found this comment (from Fivetran's CEO, so, with that in mind)
         | regarding this article enlightening regarding the costs they
         | were facing here
         | https://twitter.com/frasergeorgew/status/1808326803796512865
        
         | sneak wrote:
         | I thought the quiet part was that they are data mining their
         | customer data (and disclosing it to multiple third parties)
         | because it's not E2EE and they can read everyone's private and
         | proprietary notes.
         | 
         | Otherwise, this is the perfect app for sharding/horizontal
         | scalability. Your notes don't need to be queried or joined with
         | anyone else's notes.
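A minimal sketch of the sharding idea in this comment: if notes never need to be joined across owners, each owner's data can be routed to one shard by hashing an ID. The `workspace_id` key and the shard count are hypothetical and not taken from the article; this is not a description of Notion's actual scheme.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical value, not a figure from the article


def shard_for(workspace_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a workspace ID to a shard number using a stable hash."""
    # hashlib is used instead of the built-in hash(), whose string hashes
    # are randomized per process and so are not stable across runs.
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same ID always lands on the same shard.
print(shard_for("workspace-123") == shard_for("workspace-123"))
```

Plain modulo routing is the simplest option; changing `num_shards` later would remap most IDs, which is one reason real systems often add a level of indirection.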
        
           | redpoint wrote:
           | This ^. This switch from managed to in house is a good
           | example of only building when necessary.
        
           | altdataseller wrote:
           | Also whether this data lake is worth the costs/effort. How
           | does this data lake add value to the user experience? What is
           | this "AI" stuff that this data lake enables?
           | 
           | For example, they mention search. But I imagine it is just
           | searching only within your own docs. Which I presume should
           | be fast and efficient if everything is sharded by user in
           | Postgres.
           | 
           | The tech stuff is all fine and good, but if it adds no value,
           | it's just playing with technology for technology's sake.
        
         | mritchie712 wrote:
         | They weren't that quiet about it:
         | 
         | > Moving several large, crucial Postgres datasets (some of them
         | tens of TB large) to data lake gave us a net savings of over a
         | million dollars for 2022 and proportionally higher savings in
         | 2023 and 2024.
        
           | patrickmay wrote:
           | I'd like to see more details. 10s of TB isn't that large --
           | why so expensive?
        
             | riku_iki wrote:
             | Maybe cloud hosted
        
       | methou wrote:
       | > Data lake > Data warehouse
       | 
       | These aren't something I would like to hear if I'm still using
       | Notion. It's very bold to publish something like this on their
       | own website.
        
         | bnj wrote:
         | Could you expand on this?
        
           | lopkeny12ko wrote:
           | What's there to expand on? Do you not realize how bad of a
           | look it is for a company to publicly admit, _on their own
           | blog_ , the amount of time and engineering effort they spent
           | to package up, move, analyze, and sell all their customers'
           | private data?
           | 
           | This is why laws like CCPA "do not sell my personal
           | information" exist, which I certainly hope Notion is abiding
           | by, otherwise they'll have lawyers knocking on their door
           | soon.
        
             | Cthulhu_ wrote:
             | Where do they say they sell it? Citation needed; that's a
             | legal and reputational minefield that I don't think they
             | would admit to, like you said.
        
               | lopkeny12ko wrote:
               | I would challenge you to find any broker who sells data
               | (like the T-Mobile location data scandal) who says
               | plainly and clearly they sell user data.
        
               | quest88 wrote:
               | This is not answering the question.
        
             | bnj wrote:
             | Right, yes, tone aside that's very helpful- at first I
             | didn't understand the implication of the blog post for
             | implementing customer hostile solutions, but you've helped
             | me understand it now.
        
         | bastawhiz wrote:
         | Those are just different words for "database". What do you care
         | what kind of database your Notion data is sitting in?
        
           | TeMPOraL wrote:
           | A "data lake" strongly suggests there's a lot of information
           | the company needs to aggregate and process globally, which
           | should very much _not_ be the case with a semi-private rich
           | notebook product.
        
             | bastawhiz wrote:
             | They literally explained in the article why they have a
             | data lake instead of just a data warehouse: their data
             | model means it's slow and expensive to ingest that data
             | into the warehouse from Postgres. The data lake is serving
             | the same functions that the data warehouse did, but now
             | that the volume of data has exceeded what the warehouse can
             | handle, the data lake fills that gap.
             | 
             | I wrote another comment about why you'd need this in the
             | first place:
             | 
             | https://news.ycombinator.com/item?id=40961622
             | 
             | Frankly the argument "they shouldn't need to query the data
             | in their system" is kind of silly. If you don't want your
             | data processed for the features and services the company
             | offers, don't use them.
        
               | anoncareer0212 wrote:
               | > Frankly the argument "they shouldn't need to query the
               | data in their system" is kind of silly.
               | 
               | Neutral party here: that's not what they said.
               | 
               | A) Quotes shouldn't be there.
               | 
               | B) Heuristic I've started applying to my comments: if I'm
               | tempted to "quote" something that _isn't a quote_, it
               | means I don't fully understand what they mean and should
               | ask a question. This dovetails nicely with the spirit of
               | HN's "come with curiosity"
               | 
               | It is disquieting because:
               | 
               | A) These are very much ill-defined terms (what, exactly,
               | is a data lake, vs. data warehouse, vs. database?), and as
               | far as I've had to understand this stuff, and a quick
               | spot check of Google shows, it's about making it so
               | you're accumulating more data in one place.
               | 
               | B) This is antithetical to a consumer's desired approach
               | to data, which I'll describe parodically as: stored
               | individually, on one computer, behind 3 locked doors and
               | 20 layers of encryption.
        
             | nojvek wrote:
             | At the scale of Notion, with millions of users, they'd have
             | that much data.
             | 
             | I've seen 100TB+ workloads at smaller companies. Not
             | unusual.
        
               | iLoveOncall wrote:
               | The concern isn't the scale, it's the use. What is there
               | to _process_ when they're supposed to only store and
               | retrieve to show to users?
        
         | zarmin wrote:
         | Given how infuriating their implementation is of an in-app
         | database, perhaps it's not that surprising.
        
       | DataDaemon wrote:
       | OK, thanks, when E2EE ?
        
       | j45 wrote:
       | This was a nice read, interesting to see how far Postgres
       | (largely alone) can get you.
       | 
       | Also we see how self hosting within a startup can make perfect
       | sense. :)
       | 
       | Devops that abstract away things in some cases to the cloud might
       | just add to architectural and technical debt later, without the
       | history of learning from working through the challenges.
       | 
       | Still, it might have been a great opportunity to figure out
       | offline first use of notion.
       | 
       | I have been forced to use anytype instead of notion for the
       | offline first reason. Time to check out the source code to learn
       | how they handle storage.
        
       | wejick wrote:
       | I'm not familiar with S3 on datalake setup. When replicating a db
       | table to S3, what format will be used?
       | 
       | And I'm wondering if it's possible to update the S3 files to
       | reflect latest incoming changes on the db table?
        
         | mritchie712 wrote:
         | The file format is often Parquet. The "table format" depends on
         | what data lake you're using (e.g. Iceberg, Delta, etc.).
         | 
         | If you know Python, here's[0] a practical example of how
         | Iceberg works.
         | 
         | 0 - https://www.definite.app/blog/iceberg-query-engine
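On the second question (keeping the S3 copy in sync with the table): objects in S3 cannot be edited in place, so table formats generally write new files and commit a new snapshot rather than modify existing ones. The sketch below is a toy, in-memory model of that copy-on-write idea, not the API of Iceberg, Delta Lake, or any other library; the `Change` tuple shape and the `apply_changes` helper are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
# A change event: (operation, primary key, new row or None for deletes).
Change = Tuple[str, str, Optional[Row]]


def apply_changes(snapshot: Dict[str, Row], changes: List[Change]) -> Dict[str, Row]:
    """Return a new snapshot; the old one is left untouched (copy-on-write)."""
    new_snapshot = dict(snapshot)
    for op, key, row in changes:
        if op == "delete":
            new_snapshot.pop(key, None)
        elif op == "upsert":
            if row is None:
                raise ValueError(f"upsert for {key!r} is missing a row")
            new_snapshot[key] = row  # later changes for the same key win
        else:
            raise ValueError(f"unsupported operation: {op!r}")
    return new_snapshot


v1 = {
    "b1": {"id": "b1", "text": "hello"},
    "b2": {"id": "b2", "text": "draft"},
}
changes: List[Change] = [
    ("upsert", "b1", {"id": "b1", "text": "hello world"}),
    ("delete", "b2", None),
    ("upsert", "b3", {"id": "b3", "text": "new block"}),
]
v2 = apply_changes(v1, changes)
print(sorted(v1), sorted(v2))
```

Because `v1` is never mutated, readers of the old snapshot are unaffected while the new one is being written, which is the property the snapshot approach is meant to provide.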
        
       ___________________________________________________________________
       (page generated 2024-07-14 23:00 UTC)