[HN Gopher] Moving a billion Postgres rows on a $100 budget
       ___________________________________________________________________
        
       Moving a billion Postgres rows on a $100 budget
        
       Author : samaysharma
       Score  : 47 points
       Date   : 2024-02-21 19:54 UTC (3 hours ago)
        
 (HTM) web link (blog.peerdb.io)
 (TXT) w3m dump (blog.peerdb.io)
        
       | _boffin_ wrote:
        | How about 11B rows and a horrible Python script over to Parquet
        | on your wee little NAS for your homelab?
        
         | sss111 wrote:
         | now we're talking about innovation
        
       | twelfthnight wrote:
       | If someone is price conscious, why move from postgres to
       | snowflake?
        
         | happytiger wrote:
          | That was my first thought as well -- who is the audience for
          | this?
          | 
          | My next thought was that you could stay on Postgres _and save
          | that $100_ by using the secret power of Open Source.
          | 
          | Well said.
        
           | __s wrote:
            | Seems like the title would be more accurate if it specified
            | Postgres to Snowflake.
            | 
            | In the end, there are companies paying to use Snowflake, and
            | despite what one may believe they aren't price oblivious.
            | Keeping their application in Postgres is a cost optimization,
            | while still relying on Snowflake for data warehouse
            | integrations.
        
           | saisrirampur wrote:
            | Based on my experience working with Postgres users for the
            | past decade (ex-Citus/Microsoft), I don't think Postgres is
            | there yet to support every possible workload - e.g. medium to
            | larger scale (1TB+) workloads in real-time analytics, data
            | warehousing, search, etc. Sure, at smaller scales it is
            | versatile enough to support almost any workload. That is why
            | it is super common for companies to complement Postgres with
            | other data stores. I don't mean to say that Postgres will
            | never get there - I think it will! But I see that as more of
            | a long-term thing.
        
             | xenator wrote:
              | We operate with 80 TB of data at the moment. It lives
              | across several nodes and meta nodes (this is our own
              | terminology). All Postgres.
              | 
              | Recently we needed to move data from one DB to another,
              | about 600M records. It is not the biggest chunk of the
              | data, but we need it on a different server because we use
              | FTS a lot and don't want to interrupt other operations on
              | the previous server. It took 3 days and cost $0.
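              | 
              | A minimal sketch of one way such a single-table move can be
              | done with built-in logical replication (the table, host and
              | role names below are hypothetical, not from this comment):
              | 
              |     -- On the source server: publish only the table to move
              |     CREATE PUBLICATION fts_docs_pub FOR TABLE documents;
              | 
              |     -- On the destination server (schema created beforehand):
              |     -- the initial sync copies existing rows, then changes
              |     -- keep streaming until you are ready to cut over
              |     CREATE SUBSCRIPTION fts_docs_sub
              |         CONNECTION 'host=old-server dbname=appdb user=replicator'
              |         PUBLICATION fts_docs_pub;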
        
               | saisrirampur wrote:
                | Thanks for the context! Totally understand where you are
                | coming from. Postgres can be moulded to work for many
                | use-cases. However, it can take a good amount of effort
                | to make that happen. For example, in your case, building
                | and managing a sharded Postgres environment isn't
                | straightforward - it requires quite a lot of time and
                | expertise. Citus automated exactly this (sharding).
                | However, it wasn't a fit for every workload.
                | https://docs.citusdata.com/en/v12.1/get_started/what_is_citu...
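                | 
                | For illustration, the core Citus call that turns a
                | regular table into a sharded one looks roughly like this
                | (the table and distribution column names are made up):
                | 
                |     -- On the coordinator node, with the Citus extension installed
                |     CREATE EXTENSION IF NOT EXISTS citus;
                | 
                |     -- Shard the table across worker nodes by a distribution key
                |     SELECT create_distributed_table('events', 'tenant_id');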
        
         | quadrature wrote:
          | It's not about being price conscious, it's about finding an
          | efficient way to replicate data into your data warehouse.
        
           | saisrirampur wrote:
            | OP here, thanks for chiming in. Yep, the blog talks about how
            | efficient you can be while replicating data from Postgres to
            | data warehouses. Also, just for some context: the inspiration
            | behind writing this blog was a common concern from customers
            | that data movement is expensive, plus the insane margins in
            | this space.
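            | 
            | For a rough sense of the mechanism this kind of replication
            | builds on, Postgres logical decoding exposes row changes
            | through a replication slot (the slot name and the use of the
            | built-in test_decoding plugin below are just for
            | illustration):
            | 
            |     -- Create a logical replication slot with an output plugin
            |     SELECT pg_create_logical_replication_slot(
            |         'demo_slot', 'test_decoding');
            | 
            |     -- Read (and consume) the row changes accumulated on the slot
            |     SELECT * FROM pg_logical_slot_get_changes(
            |         'demo_slot', NULL, NULL);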
        
         | saisrirampur wrote:
          | Thanks for chiming in. The concepts/ideas in this blog can be
          | extended to other target data stores too, say ClickHouse.
        
       | rplnt wrote:
       | (edit: mostly offtopic observation follows)
       | 
        | I only knew Snowflake as the ID generation algorithm, so I was a
        | bit confused, but googling "snowflake db" showed me this blurb
        | and now I'm even more confused.
       | 
       | > Snowflake enables organizations to learn, build, and connect
       | with their data-driven peers. Collaborate, build data apps &
       | power diverse workloads in the ...
        
         | quickslowdown wrote:
         | Snowflake & Azure Synapse are competing products, if that
         | helps.
        
       | pryelluw wrote:
        | Does the Elastic License fall under open source as defined by
        | the OSI?
        
       | ralusek wrote:
       | If someone asked me how much it would cost to move a billion
       | Postgres rows, I would say "probably under $100."
       | 
       | I just had to move 500 million "rows" into S3, and it came in at
       | about $100. I would expect S3 to be more expensive.
        
       | hipadev23 wrote:
       | > Moving 1 billion rows is no easy task
       | 
       | This isn't an accurate premise. Modern OLAP databases make
       | dealing with billions to trillions of rows manageable, including
       | on a single server. Exporting "select * from table" from an OLTP
       | such as Postgres or MySQL into an OLAP is trivial and quite fast,
       | and if 100M rows/sec on commodity servers isn't fast enough,
       | there's always performance tuning [1].
       | 
        | [1] https://altinity.com/blog/loading-100b-rows-in-minutes-in-al...
        
         | saisrirampur wrote:
          | That sentence is meant more in the context of the blog -
          | moving a billion rows across data stores. We will edit it to
          | make that clearer. Thanks for the feedback.
        
           | hipadev23 wrote:
            | I can't speak for Snowflake, but on ClickHouse it's quite
            | literally [1]:
            | 
            |     insert into new_table
            |     select * from postgresql('postgres:5432', 'db', 'table',
            |                              'user', 'pass');
            | 
            | I assume it's similarly easy on Snowflake, Databricks,
            | SingleStore, and the rest.
            | 
            | [1] https://clickhouse.com/docs/en/sql-reference/table-functions...
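            | 
            | For what it's worth, on Snowflake the closest everyday
            | equivalent is a staged bulk load rather than a direct pull
            | from Postgres (the stage and table names below are
            | hypothetical):
            | 
            |     -- Load files previously exported from Postgres (e.g. CSV
            |     -- in an external stage) into the target table
            |     COPY INTO new_table
            |     FROM @my_s3_stage/export/
            |     FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');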
        
       | tharakam wrote:
       | I'm confused. Everything sounds very expensive to me.
       | 
        | The last table, which compares it with the other vendors, is
        | surprising. Even Stitch Data (the cheapest) costs $1 to move
        | 240K records (1B / 4,166.67 dollars = 240K records per dollar).
        | Is this real?
        | 
        | So their solution costs $1 to process 13.6M records. Sounds like
        | this is not very share-worthy.
        | 
        | What am I missing here?
        
       | jmholla wrote:
        | This was a disappointing article. I was expecting it to explore
        | validating the integrity and consistency of the data, but that
        | just seems to be handwaved away by a short section saying PeerDB
        | handles it. This is especially disappointing since the article
        | calls that out as one of the cruxes that make this so difficult.
        
       ___________________________________________________________________
       (page generated 2024-02-21 23:00 UTC)