[HN Gopher] Crunchy Bridge for Analytics: Your Data Lake in Post...
       ___________________________________________________________________
        
       Crunchy Bridge for Analytics: Your Data Lake in PostgreSQL
        
       Author : tosh
       Score  : 55 points
       Date   : 2024-04-30 16:26 UTC (6 hours ago)
        
 (HTM) web link (www.crunchydata.com)
 (TXT) w3m dump (www.crunchydata.com)
        
       | koolba wrote:
       | > You can also use a wildcard in the path (e.g.
       | s3://mybucket/hits/*.parquet) to query a list of files.
       | 
       | So is this an FDW that provides a similar function to Amazon
       | Athena?
       | 
       | Does every query fetch everything again or does it somehow cache
       | the downloaded data files?
       | 
       | Running something like this on EC2 is interesting as you wouldn't
       | pay the intermediate transfer costs. Only the final results would
       | egress.
        
         | mslot wrote:
         | It caches the files on locally attached NVMe drives in the
         | background, and continues to use range requests to S3 for
         | queries until the full download completes.
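The wildcard syntax quoted above suggests an interface along these lines. This is a hypothetical sketch only: the server name, table name, and OPTIONS keys are assumptions, not syntax confirmed anywhere in the thread.

```sql
-- Hypothetical sketch: expose Parquet files in S3 as a foreign table.
-- Server/table names and the OPTIONS key are illustrative only.
CREATE FOREIGN TABLE hits ()
    SERVER analytics_server
    OPTIONS (path 's3://mybucket/hits/*.parquet');

-- Queries read the matched files, served via S3 range requests at
-- first and from the local NVMe cache once the download completes.
SELECT count(*) FROM hits;
```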
        
       | mistrial9 wrote:
        | Is a table scan implied for auto-detecting column types with
        | CSV and JSON inputs?
        
         | mslot wrote:
         | It will do a sample, so usually much less than a table scan,
         | but it can definitely take longer than for Parquet.
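The sampling tradeoff above comes down to metadata: Parquet files carry their column types, while CSV/JSON types must be inferred from the data. A hypothetical way to skip the inference pass is to declare the columns explicitly; the names and options below are illustrative, not the product's confirmed syntax.

```sql
-- Hypothetical: an explicit column list avoids type inference.
CREATE FOREIGN TABLE events (
    event_time timestamptz,
    user_id    bigint,
    payload    jsonb
) SERVER analytics_server
  OPTIONS (path 's3://mybucket/events/*.csv');
```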
        
       | memset wrote:
       | This is pretty cool. Which extension are you using for this? (Or
       | have you written a custom one?)
        
         | craigkerstiens wrote:
          | It's a custom extension, or rather a number of custom
          | extensions, with quite a few more planned to further enhance
          | the product experience. The extensions work together as a
          | single unit to compose Crunchy Bridge for Analytics, but
          | under the covers it's a set of building blocks working
          | together.
         | 
          | Marco and team were the architects behind the Citus
          | extension for Postgres and have quite a bit of experience
          | building advanced Postgres extensions. Marco gave a talk at
          | PGConf EU on all the mistakes you can make when building
          | extensions and the best practices to follow, so in short,
          | quite a bit has gone into the quality of this vs. a quick
          | one-off. Even in the standup with the team today it was
          | remarked "we haven't even been able to make it segfault yet,
          | which we could pull off quite quickly and commonly with
          | Citus".
        
           | koolba wrote:
           | How do AWS credentials get loaded? Is it a static set
           | populated in the CREATE SERVER or can it pull from the usual
           | suspects of AWS credential sources like instance profiles?
           | 
           | Is the code for the extension itself available?
        
             | mslot wrote:
             | The credentials are currently managed via the platform, so
             | you enter them in the dashboard. We wanted to avoid
             | specifying credentials via a SQL interface, because they
             | can easily leak into logs and such. We'll add more
             | authentication options over time.
        
               | koolba wrote:
               | How does the extension get access to them? Is there some
               | other "master" token for the Crunchy PG server itself
               | that is used to fetch the real token?
               | 
               | The extension is not FOSS right?
        
               | craigkerstiens wrote:
               | There is coordination from the Crunchy Bridge control
               | plane to the data plane, that the extension is then aware
               | of.
               | 
               | At this time it's not FOSS, we are going to consider
               | opening some of the building blocks in time, but at the
               | moment they have a pretty tight coupling on both the
               | other extensions and on how Crunchy Bridge operates.
        
           | riku_iki wrote:
           | > Crunchy Bridge for Analytics
           | 
            | What exactly is "Crunchy Bridge for Analytics"? Is it
            | some hosted cloud infra? Can't I install it locally as an
            | extension?
        
             | mslot wrote:
             | Crunchy Bridge is a managed PostgreSQL service by Crunchy
             | Data available on AWS, Azure, and GCP.
             | 
             | Bridge for Analytics is a special instance/cluster type in
             | Crunchy Bridge with additional extensions and
             | infrastructure for querying data lakes. Currently AWS only.
        
           | ccleve wrote:
           | Do you have a link to the slides or the video of that
           | presentation? I found this, but no links: https://www.postgre
           | sql.eu/events/pgconfeu2023/schedule/sessi...
        
       | wking wrote:
       | That has been a fun one to play a small role in. Great work by
       | Marco and the team.
        
       | omg2864 wrote:
        | A shame that I can't use any of this: I'm on AWS RDS, so I
        | can't install any of these interesting extensions.
        | 
        | My problem: I would like to be better at OLAP queries in
        | addition to OLTP queries for my Postgres db. Currently my OLAP
        | queries clear my whole RAM with on-disk reads, and while these
        | queries run I also have high CPU peaks, which forces me onto a
        | bigger instance than I need during regular hours. We already
        | use Postgres, so I would like to stick as close as possible
        | to that.
        | 
        | The best option, operationally (running something like
        | StarRocks seems new and complex) and cost-wise (Redshift
        | seems expensive), seems to be ClickHouse with the materialized
        | database and the Postgres wire-compatible extension, or am I
        | missing something? However, these are still marked
        | experimental.
        | 
        | The option of an additional read replica with parameters tuned
        | for OLAP seems quite expensive too; the RDS costs are high.
        | And I couldn't run a significantly smaller replica without it
        | falling behind on replication sync? An additional wish for
        | this change is decreasing the cost.
        
         | takeda wrote:
         | So I was able to set up a machine with configuration management
         | (saltstack) + barman[1] + repmgr[2] and the maintenance was
         | comparable with RDS.
         | 
          | This was in 2017, and I would expect better tools to be
          | available now. It was on-premises, though; there were also
          | WAL-E and later WAL-G, which were more cloud-oriented,
          | although it looks like barman can now use S3 and equivalents
          | too.
         | 
         | [1] https://pgbarman.org/
         | 
         | [2] https://www.repmgr.org/
        
         | mslot wrote:
         | Crunchy Bridge is similar to RDS. It runs on EC2 and is a
         | mature managed service with features such as VPC peering, and
         | now analytics, so you could consider it as an alternative.
        
         | tootie wrote:
         | If you're on AWS you can use something like Athena or you can
         | load data from csv/parquet in S3 into Redshift. But yeah, not
         | directly within RDS Postgres to my knowledge.
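For reference, the Redshift route mentioned above looks roughly like this; Redshift's COPY can load Parquet directly from S3. The table name, bucket path, and IAM role ARN are placeholders.

```sql
-- Load Parquet files from an S3 prefix into an existing Redshift table.
COPY hits
FROM 's3://mybucket/hits/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```

For ad-hoc queries without loading, Redshift Spectrum (mentioned downthread) or Athena can query the same S3 files in place via external tables.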
        
       | chatmasta wrote:
       | What is powering the vectorized query execution? Are you
       | embedding something like DataFusion or is it entirely custom?
       | (Also will any of this be open sourced?)
        
         | mritchie712 wrote:
         | not sure what's powering crunchy, but you might be interested
         | in Parade:
         | https://blog.paradedb.com/pages/introducing_analytics
        
       | clintboxe wrote:
       | Looks pretty similar to Redshift Spectrum.
        
       ___________________________________________________________________
       (page generated 2024-04-30 23:01 UTC)