[HN Gopher] Cleaning Up Your Postgres Database
       ___________________________________________________________________
        
       Cleaning Up Your Postgres Database
        
       Author : plaur782
       Score  : 110 points
       Date   : 2021-03-06 13:06 UTC (9 hours ago)
        
 (HTM) web link (blog.crunchydata.com)
 (TXT) w3m dump (blog.crunchydata.com)
        
       | ahachete wrote:
       | Postgres checkup (https://postgres.ai/products/postgres-checkup)
       | produces very detailed and complete reports easily. This is my
       | recommendation.
        
       | koeng wrote:
       | "if you have a table called messages, logs, events in your
       | Postgres database there is a good chance it's consuming the bulk
       | of your storage, is your largest table, and could benefit from
       | being moved outside of your main Postgres database. "
       | 
       | Any recommendations on a good logging database? I like keeping
       | full logs of what the application does (important for
       | traceability in my case) and love how much effort has gone into
       | postgres backups and querying and such.
        
         | scottshamus wrote:
          | There are lots of good solutions out there, but if you really like
         | using Postgres, you could move your logs to a secondary
         | Postgres DB or Postgres server so it doesn't affect your
         | application DB's performance. I've scaled this to 10s of
         | millions of log messages without much issue.
        
           | OldHand2018 wrote:
           | I like this approach and have successfully used both table
           | partitioning [1] and (if necessary) foreign data wrappers [2]
           | to keep my main database small and fast. There is a pretty
           | old blog post that clued me into it [3]. It works really
           | nicely, allowing for such things as old data stored on
           | spinning disks and newer data on NVMe drives. You can query
           | the main table directly with query parameters that limit
           | which partitions you hit, or you can query a partition table
           | if you know that's where all the data resides. It's pretty
           | awesome!
           | 
           | [1] https://www.postgresql.org/docs/current/ddl-
           | partitioning.htm...
           | 
           | [2] https://www.postgresql.org/docs/current/ddl-foreign-
           | data.htm...
           | 
           | [3] https://pgdash.io/blog/postgres-11-sharding.html
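            | 
            | A rough sketch of the partitioning side, in case it helps
            | (table and tablespace names are made up, and it assumes the
            | tablespaces already exist):
            | 
            |     CREATE TABLE events (
            |         id         bigint GENERATED ALWAYS AS IDENTITY,
            |         created_at timestamptz NOT NULL,
            |         payload    jsonb
            |     ) PARTITION BY RANGE (created_at);
            | 
            |     -- old data on a tablespace backed by spinning disks
            |     CREATE TABLE events_2020 PARTITION OF events
            |         FOR VALUES FROM ('2020-01-01') TO ('2021-01-01')
            |         TABLESPACE slow_disks;
            | 
            |     -- recent data on the NVMe-backed tablespace
            |     CREATE TABLE events_2021 PARTITION OF events
            |         FOR VALUES FROM ('2021-01-01') TO ('2022-01-01')
            |         TABLESPACE fast_nvme;
            | 
            |     -- a predicate on the partition key lets the planner
            |     -- prune down to the partitions that can match
            |     SELECT * FROM events
            |     WHERE created_at >= '2021-03-01'
            |       AND created_at < '2021-03-07';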
        
         | dzikimarian wrote:
          | We're using the ELK stack + Grafana. There's some learning
          | curve and it is a bit more taxing on the hardware; however, it
          | can easily collect logs from multiple applications, search
          | them, and aggregate them in various ways.
        
         | lr4444lr wrote:
          | Unless you need joins, type constraints, or referential
          | integrity, the ELK stack offers superb query-ability.
        
         | DasIch wrote:
         | As you can probably tolerate reads being a bit slow for logs,
         | S3 is a good option. It's cheap, trivial to operate and you can
         | query it easily with Athena.
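          | 
          | For example, if the logs land in S3 as newline-delimited JSON,
          | the Athena side is roughly this (bucket and column names made
          | up):
          | 
          |     CREATE EXTERNAL TABLE app_logs (
          |         logged_at string,
          |         level     string,
          |         message   string
          |     )
          |     ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
          |     LOCATION 's3://my-log-bucket/app-logs/';
          | 
          |     -- after that it's plain SQL
          |     SELECT level, count(*) AS n
          |     FROM app_logs
          |     WHERE logged_at >= '2021-03-01'
          |     GROUP BY level;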
        
         | zlynx wrote:
         | Postgres has some features that are excellent for logging.
         | 
         | For example, instead of inserting everything into a log table
         | and deleting old entries and needing to vacuum things, use a
         | partitioned table and just DROP entire partitions when they age
         | out. You can summarize them into some kind of summary table
         | before that, of course.
         | 
         | The partitions are good for the summaries too. Instead of doing
         | index scans on the range of dates across a terabyte of logs, it
         | is a full table scan on a partition of a single day or hour.
         | 
         | You can use a separate database for logging but the one time we
         | did that it got weird. We ended up doing reporting scripts that
         | had to first copy tables from the active DB to the logging DB
         | so we could use them to JOIN on some log columns. They weren't
         | huge but if everything had been in the same DB it wouldn't have
         | been needed.
         | 
         | All of the scripts and procedures I know of were custom written
         | for this. I don't know of any good prepackaged Postgres logging
         | configs, but they may be out there.
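          | 
          | The age-out step looks roughly like this (names are
          | illustrative, assuming one partition per day):
          | 
          |     -- roll the day up into a summary table first
          |     INSERT INTO log_summary (day, level, entries)
          |     SELECT logged_at::date, level, count(*)
          |     FROM logs_2021_03_01
          |     GROUP BY 1, 2;
          | 
          |     -- then drop the whole partition: no DELETE, no dead
          |     -- tuples, nothing for VACUUM to clean up
          |     ALTER TABLE logs DETACH PARTITION logs_2021_03_01;
          |     DROP TABLE logs_2021_03_01;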
        
         | stickfigure wrote:
         | This is a pretty good use case for BigQuery. It's pretty easy
         | to make fairly sophisticated queries across enormous datasets.
         | It fits BigQuery's model of write-once, time-indexed
         | information.
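          | 
          | A minimal sketch of that model (dataset and column names made
          | up): a date-partitioned table you only ever append to.
          | 
          |     CREATE TABLE mydataset.app_logs (
          |         logged_at TIMESTAMP,
          |         level     STRING,
          |         message   STRING
          |     )
          |     PARTITION BY DATE(logged_at);
          | 
          |     -- filtering on the partitioning column means a query
          |     -- only scans the days it needs
          |     SELECT level, COUNT(*) AS n
          |     FROM mydataset.app_logs
          |     WHERE DATE(logged_at) = DATE '2021-03-06'
          |     GROUP BY level;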
        
         | gunnarmorling wrote:
         | You could keep using Postgres, but in a special way for such
         | log tables: insert and delete the log records right away. Then
         | use a change data capture tool like Debezium to capture changes
         | from the transaction log and stream them to Kafka, or any other
         | destination.
         | 
         | Your log table will always be empty (so not taking up disk
         | space), but the CDC tool will be able to get all the
          | information from the TX logs (which themselves can be discarded
          | after all replication slots have advanced past a given offset). The
         | advantage is that your log entries are transactionally written
         | with the actual application data itself.
         | 
         | We use this technique for instance in the Debezium Quarkus
         | extension for implementing the outbox pattern [1]. Postgres
          | even allows you to write records directly to the TX log via
          | pg_logical_emit_message(), with no table involved at all, but
          | we don't support this in Debezium yet.
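          | 
          | In SQL the idea is roughly this (table and column names made
          | up):
          | 
          |     BEGIN;
          |     -- the actual application change
          |     UPDATE customers SET email = 'a@b.com' WHERE id = 42;
          |     -- the log record is written in the same transaction...
          |     INSERT INTO app_log (customer_id, event, payload)
          |     VALUES (42, 'email_changed', '{"new": "a@b.com"}');
          |     -- ...and removed right away; the table stays empty, but
          |     -- the CDC tool still sees the INSERT in the WAL
          |     DELETE FROM app_log;
          |     COMMIT;
          | 
          |     -- or, skipping the table entirely (not yet supported by
          |     -- Debezium, as noted above):
          |     SELECT pg_logical_emit_message(true, 'app_log',
          |         '{"customer_id": 42, "event": "email_changed"}');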
         | 
         | Disclaimer: I work on Debezium.
         | 
         | [1]
         | https://debezium.io/documentation/reference/integrations/out...
        
           | jaytaylor wrote:
           | This looks interesting, looking forward to reading up on it.
           | Would be really cool if it worked with SQLite ;)
           | 
           | Any chance JDK11 support will come for the Debezium Java core
           | [1]?
           | 
           | [1] https://github.com/debezium/debezium
        
             | gunnarmorling wrote:
             | Funny you should ask, we were just talking about raising
             | the minimum baseline to JDK 11 the other day (currently it
             | still is compatible with Java 1.8). That said, you
             | definitely can use Debezium with Java 11 already; so
              | perhaps you can detail a bit what exactly you mean by JDK 11
             | support here?
             | 
             | Re SQLite, is there any CDC interface which could be used
             | to implement a connector?
        
               | jaytaylor wrote:
               | Got you, that sounds great!
               | 
                | When the README listed Java 1.8 as a requirement, I
                | assumed (perhaps incorrectly) that it implied it works
                | with _only_ Java 1.8.
               | 
               | At my work, some critical libraries only work with Java 8
               | or 11, so the transition from 8 was challenging on the
               | dependency versioning front. It's a gigantic project and
               | codebase, though.
               | 
               | Re: SQLite, not that I know of, but I don't have deep
               | knowledge on the matter. HN may go crazy (in a good way!)
               | if there were a way to pull this off though!!!
        
         | craigkerstiens wrote:
          | As another commenter mentioned, there are a lot of options out
          | there, probably too many to list off. Postgres can totally
          | work, but the option of using a separate database is a good
          | one. We actually got a few questions on the original post about
          | auditing changes, so while not directly "logs", it falls at
          | least partially in that category. A colleague wrote a follow-on
          | post on how you can use Debezium with Postgres for change data
          | capture to send all your changes to Kafka and then to some
          | other downstream system.
        
         | azophy_2 wrote:
          | There was an HN thread a couple of days ago discussing
          | ClickHouse. Haven't tried it myself, but it may be useful.
         | 
         | https://news.ycombinator.com/item?id=26316401
        
         | mpitt wrote:
          | One of the newest kids on the block is Loki (by Grafana) [1]
         | 
         | We're starting to explore it in production and it's looking
         | very promising. Much, much simpler (especially to integrate)
         | than ELK.
         | 
         | [1] https://grafana.com/oss/loki/
        
         | cldellow wrote:
         | Not sure if you're only looking for self-hosted things, but
         | I've been trying out Apex Logs by TJ Holowaychuk:
         | https://apex.sh/logs/
         | 
         | So far, I like it. You get a minimalist UI for doing
         | interactive queries, and your data is ultimately persisted in
         | your own Google BigQuery datasets so you can write arbitrary
         | SQL against them.
         | 
         | I have yet to run significant volume through it, so perhaps my
         | view will change if it performs worse at scale.
        
         | St-Clock wrote:
         | One strategy we are currently implementing is using a second
         | Postgres database synchronized with logical replication.
         | 
         | 1. We insert the logs in the main database and periodically
         | delete old logs. We can thus create the log in the same
         | transaction as a record is created/updated/deleted. Yay
         | atomicity!
         | 
         | 2. The second database uses logical replication to receive new
         | logs, but it only replicates INSERT operations so old logs are
          | never deleted (see the sketch below).
         | 
         | 3. The application can show recent logs quickly (who updated
         | the customer profile this week?) and if the user wants to dig
          | deeper, we can query the slower log database, but users kind of
          | expect that pulling the entire history of a record will take
         | a bit longer than other operations.
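          | 
          | The insert-only part of step 2 is just a publication that only
          | publishes INSERTs, roughly (names made up):
          | 
          |     -- on the main database
          |     CREATE PUBLICATION logs_pub FOR TABLE app_log
          |         WITH (publish = 'insert');
          | 
          |     -- on the log database
          |     CREATE SUBSCRIPTION logs_sub
          |         CONNECTION 'host=main-db dbname=app user=replicator'
          |         PUBLICATION logs_pub;
          | 
          | Because UPDATEs and DELETEs are not published, the periodic
          | cleanup on the main database never removes rows on the log
          | side.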
        
       ___________________________________________________________________
       (page generated 2021-03-06 23:02 UTC)