[HN Gopher] PostgreSQL views and materialized views and how they...
___________________________________________________________________
PostgreSQL views and materialized views and how they influenced
TimescaleDB
Author : od0
Score : 127 points
Date : 2022-07-14 15:52 UTC (7 hours ago)
(HTM) web link (www.timescale.com)
(TXT) w3m dump (www.timescale.com)
| jbylund wrote:
| Did you consider a type of continuous aggregate that works over
| pre-aggregated partitions for time buckets that have no
| modifications, and swaps those buckets in the continuous aggregate
| view for a live view when data is modified? I guess it would mean
| that performance would sort of continually degrade as
| inserts/updates/deletes happened in more and more time buckets, but
| as soon as the aggregation event fired performance would recover.
| It seems like one could provide a more strongly consistent view of
| aggregates with this sort of approach.
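|
|   A rough sketch of the idea (purely illustrative; the table, view,
|   and column names are made up, and this is not TimescaleDB's actual
|   implementation):
|
|     -- Serve materialized buckets where they are still valid, and
|     -- fall back to aggregating raw rows for buckets touched since
|     -- the last materialization.
|     CREATE VIEW hourly_agg AS
|     SELECT bucket, total
|     FROM hourly_agg_materialized m
|     WHERE NOT EXISTS (
|         SELECT 1 FROM invalidated_buckets i WHERE i.bucket = m.bucket)
|     UNION ALL
|     SELECT time_bucket('1 hour', r.time) AS bucket,
|            sum(r.value) AS total
|     FROM raw_data r
|     WHERE EXISTS (
|         SELECT 1 FROM invalidated_buckets i
|         WHERE i.bucket = time_bucket('1 hour', r.time))
|     GROUP BY 1;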
| djk447 wrote:
| NB: Post author
|
| I'm not 100% sure I understand what you're asking, but
| essentially something that would look for data modifications
| and at query time run the query over the older regions as well?
|
| If that's what you're asking the answer is yes, we did consider
| it, but basically decided that it was something that relatively
| few people needed and the complexity and performance tradeoffs
| were unlikely to be worth it for most folks.
|
| Essentially, we could do something like this now by looking at
| our invalidation log and running a join against it to get to a
| more strongly consistent state (I haven't thought through the
| full implications and whether it's truly strong consistency; I
| think it might be, but it'd require a proof and some thinking
| through of all of our locking logic to really get there). It's
| interesting to consider, though.
| gfody wrote:
| > If I'd written out the query, I might have seen that I didn't
| need the JOIN (or never written it in the first place). Whereas
| the view hides that complexity. So they can make things easier,
| but that can lead to performance pitfalls if we're not careful.
|
| You can avoid this particular pitfall by using left joins for
| views like this (ones that join stuff in for convenience that you
| might not select). Postgres will eliminate a left join but not an
| inner join, since the inner join could filter rows (e.g. symbols
| in stocks_real_time that don't exist in company). Commercial
| engines will use the presence of FK constraints to know that's
| impossible and go ahead and eliminate inner joins as well.
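|
|   A minimal sketch of that, using the stocks_real_time / company
|   tables mentioned above (the column names are guesses):
|
|     -- Join the company table in "for convenience"; queries that
|     -- never touch c.name can have the join eliminated, provided
|     -- company.symbol is unique (primary key or unique index).
|     CREATE VIEW stock_prices AS
|     SELECT s.time, s.symbol, s.price, c.name
|     FROM stocks_real_time s
|     LEFT JOIN company c ON c.symbol = s.symbol;
|
|     -- The plan scans only stocks_real_time, no join:
|     EXPLAIN SELECT symbol, price
|     FROM stock_prices WHERE symbol = 'AAPL';
|
|     -- The same view written with an INNER JOIN keeps the join in
|     -- the plan, since it could filter out unmatched symbols.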
| gopalv wrote:
| > commercial engines will use the presence of fk constraints
|
| What, postgres doesn't do FK join removals?
|
| I tried it just now and it didn't remove the hash join:
|
| http://sqlfiddle.com/#!17/073747/2
|
| Should've been just a count off the PK Employees table.
|
| Apache Calcite in Hive had to bake in this specific
| optimization because there were a ton of deep join views
| (coming off Teradata) where someone would run something like a
| "select count(*) + where" or just project 2-3 columns out of a
| 60+ join view from a couple of tables.
|
| And those just ran forever without the PK-FK removals.
| michael1999 wrote:
| How would that work? Your schema allows an employee to be
| assigned to multiple departments, and the query must count
| them.
| doctor_eval wrote:
| The schema only allows one employee per department, so
| there is no need to look up the employee table.
| jmalicki wrote:
| Because of the foreign key, the query should reduce to
| "explain select count(*) from Department d;"
| [deleted]
| djk447 wrote:
| NB: post author here!
|
| Thanks, yes! Totally true. I was thinking about including some of
| that, but it felt like it opened a can of worms about join types
| and why certain things would be included and others not (i.e. the
| inner join needs to see that the row is there on both sides,
| whereas the left join doesn't), and the post was already getting
| pretty long.
| garyclarke27 wrote:
| Postgres is an amazing database. Its only significant weakness
| now is materialized views, with their lack of incremental
| refresh. It was disappointing to see no progress towards this in
| v15.
| gunnarmorling wrote:
| That work towards incrementally updated views is happening and
| progressing. For now, it's a separate extension, though:
| https://github.com/sraoss/pg_ivm.
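|
|   Usage is roughly like this, as I understand the extension's
|   README (the view name and query are placeholders):
|
|     CREATE EXTENSION pg_ivm;
|     -- create_immv() builds an "incrementally maintained
|     -- materialized view" that is kept up to date by triggers on
|     -- the base tables, rather than requiring a full
|     -- REFRESH MATERIALIZED VIEW.
|     SELECT create_immv('sales_by_customer',
|         'SELECT customer_id, sum(amount) FROM sales
|          GROUP BY customer_id');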
| edmundsauto wrote:
| I also wish this were in core Postgres, but if you use a build
| tool like DBT you can enable this type of thing.
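|
|   For example, a dbt incremental model gets a similar effect: a
|   table that's only rebuilt for new rows on each run (the model
|   and column names here are made up):
|
|     {{ config(materialized='incremental', unique_key='day') }}
|
|     select
|         date_trunc('day', sold_at) as day,
|         sum(amount) as total
|     from {{ ref('sales') }}
|     {% if is_incremental() %}
|     -- only re-aggregate days at or after the latest day already
|     -- in the target table
|     where sold_at >= (select max(day) from {{ this }})
|     {% endif %}
|     group by 1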
| snicker7 wrote:
| Interestingly, DBT does not support creating materialized
| views.
| mritchie712 wrote:
| yeah, this issue has been open for years. We need them for
| Clickhouse for our product.
|
| https://github.com/dbt-labs/dbt-core/issues/1162
| WXLCKNO wrote:
| Enjoyed this post! djk447, I'd love to ask a question as well.
|
| We've started working with Timescale to process historical time
| series data. However, there is so much of it that we chose which
| parts to process.
|
| It's possible that in the future we may need to go back and
| reprocess the dataset to add something we decided we want after
| all.
|
| In your post it seems like this is handled automatically on a
| smaller time scale, meaning you could insert into the past
| without doing anything special.
|
| What happens if you need to insert data at numerous points across
| two years' worth of data instead? Do you have to use backfilling
| as described in your documentation? Or is it better to maybe
| rebuild the entire hypertable?
| djk447 wrote:
| NB: Post author
|
| Yes, this is generally handled automatically. There may be times,
| though, where you want to essentially pause refreshing the view
| for a while while you do some backfilling and then eventually let
| it catch up, especially if you're overwriting the same time period
| multiple times in a row. If you can insert in time order, it just
| breaks the re-calculation up into smaller segments, which can be
| quite useful compared to having to process the whole data set
| again.
|
| This can be a little bit different if you're doing compression,
| but with continuous aggregates I think it should work fine. I'm
| not 100% sure that was what you were looking for; let me know if
| it's not.
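|
|   For the "pause and catch up" case, a rough sketch with
|   TimescaleDB's job APIs (the aggregate name, job id, and dates
|   are placeholders):
|
|     -- Pause the continuous aggregate's refresh policy (the job id
|     -- can be found in timescaledb_information.jobs).
|     SELECT alter_job(1015, scheduled => false);
|
|     -- ... backfill the historical data ...
|
|     -- Manually refresh the affected window, then resume the policy.
|     CALL refresh_continuous_aggregate('daily_agg',
|          '2020-01-01', '2022-01-01');
|     SELECT alter_job(1015, scheduled => true);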
| hodgesrm wrote:
| This is an excellent article. I like the way the author builds up
| in steps to eventual consistency between source tables and their
| materialized views. It was fun to guess the next step.
|
| I do have one question: how does the algorithm described in the
| article work when the source table is spread across multiple
| servers, say in multiple shards? Can TimescaleDB maintain
| materialized views on each shard and then run a query that
| reconciles them?
|
| Edited: clarification
| djk447 wrote:
| NB: Post author
|
| So we thought about doing something like that with multinode,
| where each of the nodes would maintain its own materialization,
| but abandoned it for that very reason: it's very, very difficult
| to maintain any sort of consistency guarantees in that case, or
| even to reason about it.
|
| Instead we use the access nodes as coordinators to do the
| materialization. Right now the materialization only exists on
| the access node, but there's no reason we couldn't send it back
| out to the data nodes; you just need a coordination point to
| start a distributed transaction to have some semblance of a
| guarantee.
| djk447 wrote:
| And glad you liked the article!
| gigatexal wrote:
| I wonder how well these perform compared to the aggregate tables
| of Clickhouse fame.
___________________________________________________________________
(page generated 2022-07-14 23:00 UTC)