[HN Gopher] Crunchy Bridge for Analytics: Your Data Lake in Post...
___________________________________________________________________
Crunchy Bridge for Analytics: Your Data Lake in PostgreSQL
Author : tosh
Score : 55 points
Date : 2024-04-30 16:26 UTC (6 hours ago)
(HTM) web link (www.crunchydata.com)
(TXT) w3m dump (www.crunchydata.com)
| koolba wrote:
| > You can also use a wildcard in the path (e.g.
| s3://mybucket/hits/*.parquet) to query a list of files.
|
| So is this an FDW that provides a similar function to Amazon
| Athena?
|
| Does every query fetch everything again or does it somehow cache
| the downloaded data files?
|
| Running something like this on EC2 is interesting as you wouldn't
| pay the intermediate transfer costs. Only the final results would
| egress.
| mslot wrote:
| It caches the files on locally attached NVMe drives in the
| background, and continues to use range requests to S3 for
| queries until the full download completes.
| mistrial9 wrote:
| tablescan implied on auto-column types with CSV and JSON inputs?
| mslot wrote:
| It will do a sample, so usually much less than a table scan,
| but it can definitely take longer than for Parquet.
| memset wrote:
| This is pretty cool. Which extension are you using for this? (Or
| have you written a custom one?)
| craigkerstiens wrote:
| It's a custom extension and actually a number of custom
| extensions, with quite a few more planned to further enhance
| the product experience. All extensions work together as a
| single unit to compose Crunchy Bridge for Analytics, but under
| the covers lots of building blocks that work together.
|
| Marco and team worked were the architects behind the Citus
| extension for Postgres and have quite a bit of experience
| building advanced Postgres extensions. Marco gave a talk at
| PGConf EU on all the mistakes you can make when building
| extensions and best practices to follow-so in short quite a bit
| gone into the quality of this vs. a quick one off. Even in the
| standup with the team today it was remarked "we haven't even
| been able to make it segfault yet, which we could pull off
| quite quickly and commonly with Citus".
| koolba wrote:
| How do AWS credentials get loaded? Is it a static set
| populated in the CREATE SERVER or can it pull from the usual
| suspects of AWS credential sources like instance profiles?
|
| Is the code for the extension itself available?
| mslot wrote:
| The credentials are currently managed via the platform, so
| you enter them in the dashboard. We wanted to avoid
| specifying credentials via a SQL interface, because they
| can easily leak into logs and such. We'll add more
| authentication options over time.
| koolba wrote:
| How does the extension get access to them? Is there some
| other "master" token for the Crunchy PG server itself
| that is used to fetch the real token?
|
| The extension is not FOSS right?
| craigkerstiens wrote:
| There is coordination from the Crunchy Bridge control
| plane to the data plane, that the extension is then aware
| of.
|
| At this time it's not FOSS, we are going to consider
| opening some of the building blocks in time, but at the
| moment they have a pretty tight coupling on both the
| other extensions and on how Crunchy Bridge operates.
| riku_iki wrote:
| > Crunchy Bridge for Analytics
|
| what exactly is "Crunchy Bridge for Analytics"? It is some
| hosted cloud infra? I can't install it locally as extension?
| mslot wrote:
| Crunchy Bridge is a managed PostgreSQL service by Crunchy
| Data available on AWS, Azure, and GCP.
|
| Bridge for Analytics is a special instance/cluster type in
| Crunchy Bridge with additional extensions and
| infrastructure for querying data lakes. Currently AWS only.
| ccleve wrote:
| Do you have a link to the slides or the video of that
| presentation? I found this, but no links: https://www.postgre
| sql.eu/events/pgconfeu2023/schedule/sessi...
| wking wrote:
| That has been a fun one to play a small role in. Great work by
| Marco and the team.
| omg2864 wrote:
| A shame that I can't use any of this I am on AWS RDS, so I can't
| install any of these interesting extensions.
|
| My problem I would like to be better at OLAP queries in addition
| to OLTP queries for my Postgres db. Currently, my OLAP queries
| clear my whole RAM with on disk reads and when these queries run
| I also have high CPU peaks, which makes me have a bigger instance
| then I have during regular hours. We use Postgres already so I
| would like to stick as close as possible to that.
|
| The best thing - operationally (running something like starrocks
| seems new and complex) and cost-wise (red shift seems expensive)
| seems to be Clockhouse with the materialized database and
| Postgres wire compatible extension - or at I missing something?
| However, these are still marked experimental.
|
| The option of additional read replica and tuning the parameters
| for OLAP seems to be quite expensive too, the RDS costs are high.
| And I could not have a significantly smaller replica to avoid
| replication sync? An additional wish for this change is
| decreasing the cost.
| takeda wrote:
| So I was able to set up a machine with configuration management
| (saltstack) + barman[1] + repmgr[2] and the maintenance was
| comparable with RDS.
|
| This was in 2017 and I would expect better tools available.
| This was on premises though, there were also Wal-E and later
| Wal-G that were more cloud oriented, although looks like barman
| also can now use S3 and equivalent.
|
| [1] https://pgbarman.org/
|
| [2] https://www.repmgr.org/
| mslot wrote:
| Crunchy Bridge is similar to RDS. It runs on EC2 and is a
| mature managed service with features such as VPC peering, and
| now analytics, so you could consider it as an alternative.
| tootie wrote:
| If you're on AWS you can use something like Athena or you can
| load data from csv/parquet in S3 into Redshift. But yeah, not
| directly within RDS Postgres to my knowledge.
| chatmasta wrote:
| What is powering the vectorized query execution? Are you
| embedding something like DataFusion or is it entirely custom?
| (Also will any of this be open sourced?)
| mritchie712 wrote:
| not sure what's powering crunchy, but you might be interested
| in Parade:
| https://blog.paradedb.com/pages/introducing_analytics
| clintboxe wrote:
| Looks pretty similar to Redshift Spectrum.
___________________________________________________________________
(page generated 2024-04-30 23:01 UTC)