[HN Gopher] Introducing dbt + Materialize
___________________________________________________________________
Introducing dbt + Materialize
Author : jldlaughlin
Score : 58 points
Date : 2021-03-01 19:13 UTC (3 hours ago)
(HTM) web link (materialize.com)
(TXT) w3m dump (materialize.com)
| andrenotgiant wrote:
| Relevant related post: A data pipeline is just a materialized
| view: https://nchammas.com/writing/data-pipeline-materialized-view
| jldlaughlin wrote:
| Neat! We're totally on the same page--incremental view
| maintenance not only makes materialized views a useful building
| block for data pipelines, it can make them much simpler, too!
| touisteur wrote:
| Mmmh, I've been thinking a lot about generated/computed columns,
| as I wanted to use them in pg. They were introduced in pg12,
| but only as stored (materialized) columns.
| efangs wrote:
| Is it possible to backfill a materialized view? Maybe I'm
| confused, but does this potentially replace a traditional data
| lake, or is it specifically for streaming applications?
| Nican wrote:
| Materialize is a really interesting solution, and I love what
| it stands for.
|
| But the documentation is missing details about the
| architecture. A single update could shift many gigabytes of
| data in the materialized view, and I do not understand how
| Materialize would handle that scale.
| benesch wrote:
| What you're asking about is the magic at the heart of
| Materialize. We're built atop an open-source incremental
| compute framework called Differential Dataflow [0] that one of
| our co-founders has been working on for ten years or so.
|
| The basic insight is that for many computations, when an update
| arrives, the amount of incremental compute that must be
| performed is tiny. If you're computing `SELECT count(1) FROM
| relation`, a new row arriving just increments the count by one.
| If you're computing a `WHERE` clause, you just need to check
| whether the update satisfies the predicate or not. Of course,
| things get more complicated with operators like `JOIN`, and
| that's where Differential Dataflow's incremental join
| algorithms really shine.
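|
To make the insight concrete, here is a minimal Python sketch of incremental maintenance for the two cases mentioned above, a `count(1)` and a `WHERE` clause. This is purely illustrative and not Differential Dataflow's actual implementation; the class names and the `(row, diff)` update convention are assumptions for the example.

```python
# Hypothetical sketch of incremental view maintenance (not Materialize's
# actual machinery). Updates arrive as (row, diff) pairs, where diff is
# +1 for an insert and -1 for a retraction.

class IncrementalCount:
    """Maintains SELECT count(1) FROM relation with O(1) work per update."""
    def __init__(self):
        self.count = 0

    def apply(self, row, diff):
        self.count += diff  # a new row just increments; a retraction decrements

class IncrementalFilter:
    """Maintains SELECT * FROM relation WHERE pred(row) incrementally."""
    def __init__(self, pred):
        self.pred = pred
        self.rows = {}  # row -> multiplicity

    def apply(self, row, diff):
        # Only check the predicate against the single arriving update.
        if self.pred(row):
            self.rows[row] = self.rows.get(row, 0) + diff
            if self.rows[row] == 0:
                del self.rows[row]

count = IncrementalCount()
evens = IncrementalFilter(lambda r: r % 2 == 0)
for r in [1, 2, 3, 4]:
    count.apply(r, +1)
    evens.apply(r, +1)
evens.apply(2, -1)  # retract a previously inserted row
```

The per-update work here is constant regardless of how large the underlying relation has grown, which is the point of the paragraph above.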
|
| It's true that there are some computations that are very
| expensive to maintain incrementally. For example, maintaining
| an ordered query like
|
|     SELECT * FROM relation ORDER BY col
|
| would be quite expensive, because the arrival of a new value
| will change the ordering of all values that sort greater than
| the new value.
|
| Materialize can still be quite a useful tool here, though! You
| can use Materialize to incrementally maintain the parts of your
| queries that are cheap to incrementally maintain, and execute
| the other parts of your query ad hoc. This is in fact how
| `ORDER BY` already works in Materialize. A materialized view
| never maintains ordering, but you can request a sort when you
| fetch the contents of that view by using an `ORDER BY` clause
| in your `SELECT` statement. For example:
|     CREATE MATERIALIZED VIEW v AS
|       SELECT complicated FROM t1, t2, ...   -- incrementally maintained
|
|     SELECT * FROM v ORDER BY col LIMIT 5    -- order and limit computed
|                                             -- ad hoc, but still fast
|
| [0]: https://github.com/TimelyDataflow/differential-dataflow
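|
A small Python sketch of the split described above, as I understand it: the view body is maintained incrementally as an unordered multiset, and ordering is computed only at read time. The function names and representation here are illustrative assumptions, not Materialize internals.

```python
# Illustrative sketch: maintain the view unordered, sort only on read.
import heapq

view = {}  # row -> multiplicity; incrementally maintained, never sorted

def apply_update(row, diff):
    """Cheap incremental maintenance: touch only the arriving row."""
    view[row] = view.get(row, 0) + diff
    if view[row] == 0:
        del view[row]

def read_top_k(k):
    """SELECT * FROM v ORDER BY col LIMIT k, computed ad hoc at read time."""
    expanded = [row for row, mult in view.items() for _ in range(mult)]
    return heapq.nsmallest(k, expanded)

for row in [9, 3, 7, 1, 5]:
    apply_update(row, +1)
```

The read-side sort costs something each time, but the expensive part (maintaining a total order under every update) is never paid.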
| Nican wrote:
| Thank you for taking your time to write that up. To
| illustrate my point a little more:
|
| Suppose a customer has several resources, and each resource
| has several metrics. From my understanding, Materialize
| could be used to maintain an aggregated view of metrics per
| customer.
|
| The problem is that resources can also be migrated between
| customers. When a resource migrates between customers, the
| whole history of the customer changes. This could cause huge
| updates, depending on how many resources are moved and how many
| metrics per resource are being collected.
|
| I have a conundrum between doing the "customer-resource join"
| late, incurring huge CPU cost when running queries, or
| aggregating early, then incurring huge disk cost when
| migrating resources. At the moment, we just have daily jobs
| that aggregate TBs of customer data, because there is no way
| to do the joins in real time.
|
| Is Materialize designed to be able to handle something like
| this?
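|
For what it's worth, one way an incremental system can express this scenario (an illustrative sketch, not a claim about how Materialize actually executes it): keep per-resource subtotals so that a migration retracts one resource's contribution from the old customer and adds it to the new one, costing work proportional to that resource, not the customer's whole history. All names below are hypothetical.

```python
# Illustrative sketch: per-customer metric totals under resource migration.
# A migration touches only the moved resource's contribution.

owner = {}            # resource -> customer
resource_total = {}   # resource -> sum of that resource's metrics
customer_total = {}   # customer -> aggregated metrics (the "view")

def add_metric(resource, value):
    """A new metric updates one resource subtotal and one customer total."""
    resource_total[resource] = resource_total.get(resource, 0) + value
    cust = owner[resource]
    customer_total[cust] = customer_total.get(cust, 0) + value

def migrate(resource, new_customer):
    """One retraction plus one addition, sized by the resource's subtotal."""
    contribution = resource_total.get(resource, 0)
    old = owner[resource]
    customer_total[old] -= contribution
    customer_total[new_customer] = (
        customer_total.get(new_customer, 0) + contribution
    )
    owner[resource] = new_customer

owner["r1"] = "alice"
owner["r2"] = "alice"
add_metric("r1", 10)
add_metric("r2", 5)
migrate("r1", "bob")
```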
| glogla wrote:
| At the moment, it handles that by not allowing updates at all.
|
| It also forbids some parts of SQL that could get you update-
| like functionality, like appending new versions of records to a
| stream and then running
|
|     select * from (
|       select *, row_number() over
|         (partition by pk order by updated_at desc) as rn
|       from stream
|     ) where rn = 1
|
| or something.
|
| (Boy, I wish there were a less awkward way to do this.)
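|
The pattern that SQL is reaching for, keeping only the newest version of each record by primary key, amounts to upsert semantics; a quick Python sketch (names here are illustrative, not any Materialize API):

```python
# Sketch of "latest version per key": each incoming record either
# replaces an older version with the same primary key or is ignored
# if a newer version has already arrived.

latest = {}  # pk -> (updated_at, record)

def apply(pk, updated_at, record):
    current = latest.get(pk)
    if current is None or updated_at > current[0]:
        latest[pk] = (updated_at, record)

apply("a", 1, {"v": "old"})
apply("a", 2, {"v": "new"})
apply("a", 1, {"v": "stale, arrives late"})  # ignored: older timestamp
apply("b", 1, {"v": "only"})
```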
|
| EDIT: I'd love to ship a bunch of data from a few CDC sources
| to Materialize and then run real-time reporting on it, but
| without updates or window functions, Materialize can't do that
| just yet.
|
| EDIT2: The part about updates isn't true, see below.
| jldlaughlin wrote:
| We (I work at Materialize) actually do support updates! Our
| CDC sources support them, as well as any source using an
| UPSERT envelope (more info here:
| https://materialize.com/docs/sql/create-source/text-kafka/#u...).
|
| As for your second point, I have a less awkward way for you!
| Materialize supports a top-k idiom
| (https://materialize.com/docs/sql/idioms/#top-k-by-group)
| that is hopefully a bit clearer.
| glogla wrote:
| Right, I didn't see UPDATE in the SQL keyword list, and I saw
| a (now old, I guess) talk about Materialize where the speaker
| mentioned no updates. Sorry!
| jldlaughlin wrote:
| No apologies necessary, we're an active work in progress!
| (And, hopefully, moving quickly!)
| d_watt wrote:
| Materialize seems really, really interesting. To the point where
| my crotchety old man senses are telling me that I should be
| careful about getting too excited about new technology.
|
| Are there any interesting case studies of people pushing the edge
| of what's possible? For my use case, I have max throughput of 10k
| updates/second potentially materialized into tens of thousands of
| different views (with some good partition keys available, if
| needed).
| albertwang wrote:
| Nothing we can share publicly just yet, but if you reach out
| and chat with us, we're more than happy to give you some
| numbers that I think will address what you're looking for!
___________________________________________________________________
(page generated 2021-03-01 23:01 UTC)