[HN Gopher] Introducing dbt + Materialize
       ___________________________________________________________________
        
       Introducing dbt + Materialize
        
       Author : jldlaughlin
       Score  : 58 points
       Date   : 2021-03-01 19:13 UTC (3 hours ago)
        
 (HTM) web link (materialize.com)
 (TXT) w3m dump (materialize.com)
        
       | andrenotgiant wrote:
       | Relevant related post: A data pipeline is just a materialized
       | view: https://nchammas.com/writing/data-pipeline-materialized-
       | view
        
         | jldlaughlin wrote:
         | Neat! We're totally on the same page--incremental view
         | maintenance not only makes materialized views a useful building
         | block for data pipelines, it can make them much simpler, too!
        
           | touisteur wrote:
            | Mmmh, I've been thinking a lot about generated/computed
            | fields, as I wanted to use them in pg. They were introduced in
            | pg12, but only in materialized (stored) form.
        
       | efangs wrote:
       | Is it possible to backfill a materialized view? I'm maybe
       | confused, but does this potentially replace a traditional data
       | lake or is it specifically for streaming applications?
        
       | Nican wrote:
        | Materialize is a really interesting solution, and I love what it
        | stands for.
        | 
        | But the documentation is missing details on the architecture
        | overview. A single update could cause many gigabytes of data to
        | shift in the materialized view, and I do not understand how
        | Materialize would handle that scale.
        
         | benesch wrote:
         | What you're asking about is the magic at the heart of
         | Materialize. We're built atop an open-source incremental
         | compute framework called Differential Dataflow [0] that one of
         | our co-founders has been working on for ten years or so.
         | 
         | The basic insight is that for many computations, when an update
         | arrives, the amount of incremental compute that must be
         | performed is tiny. If you're computing `SELECT count(1) FROM
         | relation`, a new row arriving just increments the count by one.
         | If you're computing a `WHERE` clause, you just need to check
         | whether the update satisfies the predicate or not. Of course,
         | things get more complicated with operators like `JOIN`, and
         | that's where Differential Dataflow's incremental join
         | algorithms really shine.
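          | 
          | To make that concrete, here's a minimal sketch (the table and
          | view names are made up, not taken from Materialize's docs):
          | 
          |     -- each new row in `orders` costs one predicate check and,
          |     -- at most, a +1 to the maintained count
          |     CREATE MATERIALIZED VIEW big_order_count AS
          |         SELECT count(1) FROM orders WHERE amount > 100;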
         | 
         | It's true that there are some computations that are very
         | expensive to maintain incrementally. For example, maintaining
          | an ordered query like
          | 
          |     SELECT * FROM relation ORDER BY col
         | 
         | would be quite expensive, because the arrival of a new value
         | will change the ordering of all values that sort greater than
         | the new value.
         | 
         | Materialize can still be quite a useful tool here, though! You
         | can use Materialize to incrementally-maintain the parts of your
         | queries that are cheap to incrementally maintain, and execute
         | the other parts of your query ad hoc. This is in fact how
         | `ORDER BY` already works in Materialize. A materialized view
         | never maintains ordering, but you can request a sort when you
         | fetch the contents of that view by using an `ORDER BY` clause
         | in your `SELECT` statement. For example:
          | 
          |     -- incrementally maintained:
          |     CREATE MATERIALIZED VIEW v AS
          |         SELECT complicated FROM t1, t2, ...
          | 
          |     -- order and limit computed ad hoc, but still fast:
          |     SELECT * FROM v ORDER BY col LIMIT 5
         | 
         | [0]: https://github.com/TimelyDataflow/differential-dataflow
        
           | Nican wrote:
           | Thank you for taking your time to write that up. To
           | illustrate my point a little more:
           | 
            | Suppose a customer has several resources, and each resource
            | has several metrics. From my understanding, Materialize could
            | be used to maintain an aggregated view of metrics per
            | customer.
           | 
            | The problem is that resources can also be migrated between
            | customers. When a resource migrates between customers, the
            | whole history of the customer changes. This could cause huge
            | updates depending on how many resources are moved, or how many
            | metrics per resource are being collected.
           | 
            | The conundrum is between doing the "customer-resource join"
            | late, which incurs a huge CPU cost when running queries, or
            | aggregating early, which incurs a huge disk cost when
            | migrating resources. At the moment, we just have daily jobs
            | that aggregate the TBs of customer data, because there is no
            | way to do the joins in real time.
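            | 
            | For concreteness, the kind of view I have in mind is roughly
            | the following (table and column names are made up):
            | 
            |     CREATE MATERIALIZED VIEW metrics_per_customer AS
            |         SELECT r.customer_id, sum(m.value) AS total_metric
            |         FROM resources r
            |         JOIN metrics m ON m.resource_id = r.resource_id
            |         GROUP BY r.customer_id;
            | 
            | Reassigning a resource's customer_id would then touch every
            | metric row that joins against it.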
           | 
           | Is Materialize designed to be able to handle something like
           | this?
        
         | glogla wrote:
          | At the moment, it handles that by not allowing updates at all.
         | 
         | It also forbids some parts of SQL that could get you update-
         | like functionality, like appending new versions of records to a
          | stream and then running
          | 
          |     select * from (
          |       select *,
          |              row_number() over (partition by pk
          |                                 order by updated_at desc)
          |       from stream)
          |     where row_number = 1
         | 
         | or something.
         | 
          | (Boy, I wish there was a less awkward way to do this.)
         | 
          | EDIT: I'd love to ship a bunch of data from a few CDC sources to
          | Materialize and then run real-time reporting on it, but without
          | updates or window functions, Materialize can't do that just yet.
         | 
         | EDIT2: The part about updates isn't true, see below.
        
           | jldlaughlin wrote:
           | We (I work at Materialize) actually do support updates! Our
           | CDC sources support them, as well as any source using an
           | UPSERT envelope (more info here:
           | https://materialize.com/docs/sql/create-source/text-
           | kafka/#u...).
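            | 
            | Roughly, defining an upsert source looks like this (the broker
            | and topic names are made up; see the linked docs for the exact
            | syntax):
            | 
            |     CREATE SOURCE customers
            |     FROM KAFKA BROKER 'kafka:9092' TOPIC 'customers'
            |     FORMAT TEXT
            |     ENVELOPE UPSERT;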
           | 
            | As for your second point, I have a less awkward way for you!
            | Materialize supports a top-k idiom
            | (https://materialize.com/docs/sql/idioms/#top-k-by-group)
            | that is hopefully a bit clearer.
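            | 
            | If memory serves, the idiom is a DISTINCT-plus-LATERAL join;
            | roughly, reusing the stream/pk/updated_at names from above:
            | 
            |     SELECT latest.*
            |     FROM (SELECT DISTINCT pk FROM stream) keys,
            |          LATERAL (
            |              SELECT * FROM stream
            |              WHERE stream.pk = keys.pk
            |              ORDER BY updated_at DESC
            |              LIMIT 1
            |          ) latest;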
        
             | glogla wrote:
              | Right, I didn't see UPDATE in the SQL keyword list, and saw
              | a (I guess now outdated) talk about Materialize where the
              | speaker mentioned no updates, sorry!
        
               | jldlaughlin wrote:
               | No apologies necessary, we're an active work in progress!
               | (And, hopefully, moving quickly!)
        
       | d_watt wrote:
       | Materialize seems really, really interesting. To the point where
       | my crotchety old man senses are telling me that I should be
       | careful about getting too excited about new technology.
       | 
       | Are there any interesting case studies of people pushing the edge
       | of what's possible? For my use case, I have max throughput of 10k
       | updates/second potentially materialized into tens of thousands of
       | different views (with some good partition keys available, if
       | needed).
        
         | albertwang wrote:
          | Nothing we can share publicly just yet, but if you reach out and
          | chat with us, we're more than happy to share some numbers that I
          | think will address what you're looking for!
        
       ___________________________________________________________________
       (page generated 2021-03-01 23:01 UTC)