[HN Gopher] Lesser-known Postgres features
___________________________________________________________________
Lesser-known Postgres features
Author : CRConrad
Score : 436 points
Date : 2021-11-09 15:50 UTC (7 hours ago)
(HTM) web link (hakibenita.com)
(TXT) w3m dump (hakibenita.com)
| tbrock wrote:
| I'm surprised there is no mention of foreign data wrappers, easily
| one of the best but lesser-known features.
|
| Would I use them in production? No. Are they fun to play around
| with? Yes!
| laurent92 wrote:
| Atlassian gives us sales figures as a large CSV. Nothing better
| than loading the CSV as a Postgres view with a foreign data
| wrapper, and performing pivots/joins/anything on it. Replace
| the file with a more recent one, and you even have versioning!
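|
| For illustration, a minimal file_fdw sketch (the file path and
| columns here are made-up examples):
        CREATE EXTENSION IF NOT EXISTS file_fdw;
        CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
        CREATE FOREIGN TABLE sales_csv (
            sku      text,
            quantity int,
            amount   numeric
        ) SERVER csv_files
          OPTIONS (filename '/data/sales.csv', format 'csv', header 'true');
        -- query (join, pivot, aggregate) the file like any other table
        SELECT sku, sum(amount) FROM sales_csv GROUP BY sku;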
| iaabtpbtpnn wrote:
| We use them in production, they work fine. You probably want to
| enable use_remote_estimate for postgres_fdw though.
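|
| A sketch of turning that on (server and table names are
| hypothetical):
        ALTER SERVER remote_pg OPTIONS (ADD use_remote_estimate 'true');
        -- or per foreign table:
        ALTER FOREIGN TABLE remote_orders OPTIONS (ADD use_remote_estimate 'true');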
| hyzyla wrote:
| Could you describe for what use case are you using foreign
| data wrapper in production?
| jeffbee wrote:
| My favorite relatively obscure pg feature is you can write stored
| procedures in perl, python, and tcl.
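|
| As a rough sketch, assuming the plpython3u extension is available
| on your install:
        CREATE EXTENSION IF NOT EXISTS plpython3u;
        CREATE FUNCTION normalize_email(addr text) RETURNS text AS $$
            # plain Python, running inside the server process
            return addr.strip().lower()
        $$ LANGUAGE plpython3u;
        SELECT normalize_email('  Foo@Example.COM ');  -- foo@example.com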
| pageandrew wrote:
| Wow, I've been writing a lot of PL/pgSQL recently and did not
| know this.
|
| Would these be drop in replacements for PL/pgSQL? Are there any
| performance tradeoffs to using one over another? Any changes in
| functionality (e.g. which functions you can call)?
| jeffbee wrote:
| I personally have never run into whatever the limitations
| might be of pl/python, but I suppose you could make a
| disaster by trying to dynamically load conflicting symbols or
| something like that. But I have used numpy in stored
| procedures, so at least that works.
|
| pl/sh also works wonderfully if you want to run a complex
| procedure in an isolated subprocess, but pl/sh is not part of
| the main postgresql distribution.
| mindcrime wrote:
| Not to mention Java, R, Ruby, PHP, Scheme, and sh.
|
| https://www.postgresql.org/docs/9.5/external-pl.html
|
| Heck, there's even a pl/prolog out there, but it looks pretty
| old and I'm skeptical how useful it would be.
|
| https://github.com/salva/plswipl
| brasetvik wrote:
| They're cool, but keep in mind that this often means you cannot
| run them on managed Postgres services.
| thearrow wrote:
| For the last point about "Find Overlapping Ranges", consider
| PostgreSQL range types [0] and their corresponding range
| operators [1].
|
| [0]: https://www.postgresql.org/docs/current/rangetypes.html
|
| [1]: https://www.postgresql.org/docs/current/functions-range.html
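|
| For example (a sketch assuming a meetings table with a tstzrange
| column named "during"):
        SELECT *
        FROM meetings
        WHERE during && tstzrange('2021-11-10 14:00+00', '2021-11-10 15:00+00');
        -- an exclusion constraint can also reject overlapping rows outright
        ALTER TABLE meetings
            ADD CONSTRAINT no_overlap EXCLUDE USING gist (during WITH &&);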
| luhn wrote:
| That COPY trick is really neat. I've always used SELECT INTO.
        SELECT * INTO TEMP copy FROM foobar;
        \copy "copy" to 'foobar.csv' with csv header
| haki wrote:
| I'm always looking for new copy tricks. Is "copy" a temporary
| table? Or a variable...?
| luhn wrote:
| Temporary table. Although looking at the docs it seems CREATE
| TABLE AS is the recommended syntax.
| https://www.postgresql.org/docs/13/sql-createtableas.html
| CREATE TEMP TABLE "copy" AS SELECT * FROM foobar;
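|
| You can also skip the temp table and \copy a query result directly
| (a sketch):
        \copy (SELECT * FROM foobar ORDER BY 1) to 'foobar.csv' with csv header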
| sharadov wrote:
| The comment one is pretty neat, I'm going to start using it.
|
| db=# COMMENT ON TABLE sale IS 'Sales made in the system';
| COMMENT
| ccakes wrote:
| In-database comments combined with something like
| https://github.com/k1LoW/tbls make for very cheap database
| documentation.
|
| No affiliation with tbls except that I'm a big fan
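|
| A rough sketch of writing comments and reading them back (the
| column name is just an example):
        COMMENT ON TABLE sale IS 'Sales made in the system';
        COMMENT ON COLUMN sale.sold_at IS 'UTC timestamp of the sale';
        SELECT obj_description('sale'::regclass, 'pg_class');  -- table comment
        SELECT col_description('sale'::regclass, 1);           -- comment on the table's first column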
| londons_explore wrote:
| Avoid using lesser known features...
|
| They're the ones that will be hardest to migrate to a different
| database, most likely to be deprecated, and least likely to be
| understood by the next engineer to fill your shoes.
|
| While many of these are neat, good engineering practice is to
| build the simplest thing that gets the job done.
| sodapopcan wrote:
| How often does migrating databases ever actually happen? I'm
| not saying it doesn't happen, but I've never experienced it or
| known anyone who has (I've asked!). I certainly wouldn't shy away
| from using features that make my app code cleaner and improve
| the all-around performance of the app just because of the
| slim-to-none chance that I might one day have to migrate off of
| postgres.
| handrous wrote:
| 20-year career so far. Never seen a database swapped out.
| I've seen "apps" replaced on top of databases, or more
| programs added to access the same database. I've seen the app
| _and_ database both get thrown out and replaced, because they
| were tightly coupled[0], as the parent advocates (Rails +
| ActiveRecord seems to be prime for this kind of "gotta throw
| it all out" situation). I've never seen the program stay the
| same while the DB is swapped out from under it.
|
| [0] yes, that's actually _tightly_ coupling them, because now
| your DB is too unsafe to use without the "app" you built on
| top, and doesn't provide enough functionality to make it
| worth trying to retrofit that safety onto it.
| williamdclt wrote:
| I've seen it firsthand, but we're talking about an early-
| stage startup moving from MySQL to Postgres. It's
| definitely not enough to convince me not to use database-
| specific features.
| gmfawcett wrote:
| I'm not sure why you are being downvoted. Your argument is
| reasonable and clearly presented, and "do the simplest thing
| that can work" is a tried and true design philosophy.
| nicoburns wrote:
| Using a built-in database feature is often a lot simpler.
| gmfawcett wrote:
| Fair... and that's a good counter-argument, but not a good
| reason to downvote someone who disagrees. (Not suggesting
| that you downvoted him yourself.)
| handrous wrote:
| Counterpoint:
|
| Databases often outlive applications. Database features usually
| do what they do very quickly and reliably. Used well, a full-
| featured database can make things like in-place refactors or
| re-writes of your application layer _far_ easier, make everyday
| operation much safer, as well as making it much safer & more useful
| to allow multiple applications access to the same database,
| which can be very handy for all kinds of reasons.
|
| > While many of these are neat, good engineering practice is to
| make the simplest thing to get the job done.
|
| That, or, use the right tool for the job.
| tengbretson wrote:
| Even the most obscure Postgres features are documented better
| than whatever code a dev would write in order to avoid using
| them.
| pyentropy wrote:
| I'd rather have a non-portable SQL feature than an app-level
| "portable" implementation that almost certainly is slower and
| non-transactional.
|
| Migrating to another DB will always take work, no matter how
| much you try to ignore flavor-specific features. Query
| planning, encoding and charsets, locking abilities tend to be
| very different. A query can run fine in MySQL and cause a
| deadlock in Postgres even though it's syntactically valid in
| both.
| gavinray wrote:
| I want to stress the importance of not using:
        id SERIAL
|
| If you are on a somewhat recent version of postgres, please do
| yourself a favor and use:
        id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
|
| An "identity column", the part here:
|
| https://hakibenita.com/postgresql-unknown-features#prevent-s...
|
| You might think this is trivial -- but SERIAL creates an "owned"
| (by a certain user) sequence behind the scenes, and so you run
| into massive headaches if you try to move things around.
|
| Identity columns don't, and avoid the issue altogether.
|
| https://www.2ndquadrant.com/en/blog/postgresql-10-identity-c...
| cormacrelf wrote:
| That syntax seems a bit verbose for a column that will most
| likely appear on every table you ever create. How about:
        id int SIMILAR TO SERIAL BUT WITHOUT PROBLEMS MOVING THINGS AROUND
|
| And the other variant for when you aren't sure that worked:
        id int SIMILAR TO SERIAL BUT WITH EVEN FEWER PROBLEMS MOVING THINGS AROUND
| iaabtpbtpnn wrote:
| Complain to the SQL standard authors, not the Postgres
| developers :)
| uhoh-itsmaciek wrote:
| Another favor you should do yourself is use bigint keys (unless
| using uuid as suggested elsewhere, or you're very, very sure
| you'll never chew through 2 billion ids).
| mixmastamyk wrote:
| From TFA: Instead of using GENERATED BY
| DEFAULT, use GENERATED ALWAYS.
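|
| A quick sketch of the difference (table name is arbitrary):
        CREATE TABLE sale (
            id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            amount numeric
        );
        INSERT INTO sale (id, amount) VALUES (123, 9.99);  -- rejected
        INSERT INTO sale (id, amount) OVERRIDING SYSTEM VALUE
            VALUES (123, 9.99);                            -- explicit opt-in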
| cnity wrote:
| I often move ID generation into the application layer (this
| also helps avoid things like enumeration attacks), and actually
| quite a lot of cool Postgres features blur that line a little
| bit. It's interesting to think about sequences and other
| computational mechanisms in a DB, and whether they make
| architecting applications easier or harder. I don't have a
| strong opinion either way, but I'm interested in HN's opinion.
| superice wrote:
| One often hears the counterargument 'but using DB-specific
| features makes your application less portable!' to which I
| like to argue: When was the last time you moved an
| application from SQL-db to SQL-db engine? Follow up question:
| When was it ever 'easy' if you did?
|
| If you start from the basic premise that the database engine
| and the application are intertwined and are not loosely
| coupled, using Postgres-specific features feels much less
| icky from an architectural point of view. They are
| essentially part of your application given that you use
| something like Flyway for migrations and you don't manually
| run SQL against your production to install functions and
| triggers and such.
| skrtskrt wrote:
| Done it multiple times, most recently we got an MVP off the
| ground using only ElasticSearch as a backend, since the
| primary use case we wanted to validate and demonstrate was
| discoverability.
|
| As we added more features, the lack of transactions and
| relational structures started to slow us down, so we
| dropped in Postgres as a backend, and having application-
| generated UUID4s as primary keys was a big part in making
| that move fairly painless
| gizdan wrote:
| We're currently in the process of moving from MongoDB to
| CockroachDB. It's not been easy. Certain persistence needs
| to be duplicated for both of them. Certain queries need to
| be compared between the two, and we need to make a
| conscious decision when we pick from which DB.
|
| Having said that, moving logic for id generation to the
| application because it's less portable otherwise is an odd
| reason.
| toast0 wrote:
| Adding on, even when you are trying to make a portable
| application, you tend to want to make it work on either X
| or Y, and you can still end up excluding people from switching
| their existing data from one to the other.
|
| I've used software that at least tried to be portable, so
| you could install it with whatever database you had
| available.
| ziml77 wrote:
| I've given up on believing that there is such a thing as
| portable SQL. If you use the lowest common denominator,
| you're likely to have queries that perform suboptimally.
| Hell, even temp tables aren't portable because of the
| different syntax between databases. I've worked with tons
| of queries that would take minutes to run without
| temp tables while running in milliseconds with them. People
| may as well take advantage of all the nice things their
| database offers instead of fearing a situation that is
| unlikely to happen and which will be a tough change anyway.
| fiddlerwoaroof wrote:
| So, I did an easy migration of an analytics app from
| elasticsearch to ClickHouse: what made it easy was (a) we
| could replay an event stream to regenerate the database and
| (b) we had interfaces around the database defined in terms
| of the questions we needed the database to answer, and not
| in terms of ES-specific concepts.
|
| But, what makes this sort of design so nice is that you can
| use DB-specific stuff behind the interface because you're
| not trying to write all your queries in the minimally
| supported subset of SQL or something.
| CRConrad wrote:
| >>> [DB Sequences]
|
| > One often hears the counterargument 'but using DB-
| specific features makes your application less portable!'
|
| OK, sorry for the probably stupid question: Isn't it just a
| matter of, for each sequence, selecting its current value
| and then creating the new one in the target database to
| start from there? Should be, if perhaps not easily, still
| reasonably scriptable... Or what am I missing?
| netcraft wrote:
| I'll stop short of giving a recommendation or using the word
| "should", but ill give encouragement to consider using uuid's
| for keys. I have used them in several systems and have never
| had any issues with them, and they solve so many issues. The
| ability to generate a key on the client or on the server or in
| the database is great for one. And the fact that keys are
| unique not only in a table but in the system (or many systems)
| is a huge advantage.
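|
| A minimal sketch (gen_random_uuid() is built in from Postgres 13;
| older versions can use the pgcrypto or uuid-ossp extensions):
        CREATE TABLE orders (
            id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
            created_at timestamptz NOT NULL DEFAULT now()
        );
        -- or generate the key on the client and send it along
        INSERT INTO orders (id) VALUES ('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11');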
| hackbinary wrote:
| >> I'll stop short of giving a recommendation or using the
| word "should" ...
|
| I believe the word you are looking for is "ought". :)
| evandwight wrote:
| Do you write logic to handle collisions?
| spullara wrote:
| "A collision is possible but the total number of unique
| keys generated is so large that the possibility of a
| collision is almost zero. As per Wikipedia, the number of
| UUIDs generated to have at least 1 collision is 2.71
| quintillion. This is equivalent to generating around 1
| billion UUIDs per second for about 85 years."
| fwip wrote:
| For clarification, that's the number of UUIDs at which
| you have a 50% chance of at least one collision.
|
| Wikipedia also adds: "the probability to find a duplicate
| within 103 trillion version-4 UUIDs is one in a billion."
| Ginden wrote:
| Unless you are writing a nuclear warhead management system,
| writing UUID collision handling is a waste of time. The client
| can retry at the top level if the request fails.
| afiori wrote:
| Expecting a collision to happen is probably a waste of
| time, but it still sounds like a good idea to bound how
| much damage a collision might cause.
| williamdclt wrote:
| Part of the point of UUIDs is that you don't have to. Unless you
| generate an absolutely absurd amount of data, a collision is
| unlikely enough that your time would be more wisely spent worrying
| about the sun exploding.
| gavinray wrote:
| Yeah I recognize the arguments for UUID keys:
| - Avoids people being able to just iterate through records,
| or to discover roughly how many records of a thing you have
| - Allows you to generate the key before the row is saved
|
| I think I default to auto-increment IDs due to:
| - Familiarity bias
| - They have a temporal aspect to them (i.e. I know the row with ID
| 225 was created before the row with ID 392, and approximately when
| they might have been created)
| - Easier to read (when you have fewer than ~1,000,000 rows in a
| table)
|
| I agree and think you're right in that UUIDs are probably a
| better default.
|
| Though you can never find a "definitive" guide/rule online or
| in the docs unfortunately.
| spiffytech wrote:
| > They have a temporal aspect to them (IE, I know row with
| ID 225 was created before row with ID 392, and
| approximately when they might be created)
|
| UUIDv7 (currently a draft spec[0]) is an ID format that can be
| sorted in the chronological order the IDs were created.
|
| In the meantime, ulid[1] and ksuid[2] are popular time-
| sortable ID schemes, both previously discussed on HN[3]
|
| [0] https://datatracker.ietf.org/doc/html/draft-peabody-
| dispatch...
|
| [1] https://github.com/ulid/spec
|
| [2] https://github.com/segmentio/ksuid
|
| [3] ulid discussion:
| https://news.ycombinator.com/item?id=18768909
|
| UUIDv7 discussion:
| https://news.ycombinator.com/item?id=28088213
| gavinray wrote:
| Wow, neat. TIL
| codetrotter wrote:
| Yup this. Ulid is great, and implemented in many
| languages. For example, there's a crate named Ulid that
| you can use to get ULIDs in Rust.
| stjohnswarts wrote:
| Hmmm that's really useful :) . thanks!
| yardstick wrote:
| I prefer UUIDs as well but one other benefit of an integer
| type key is in the index. UUIDs, especially v4, make for
| bloated indexes. If there's a very specific performance
| reason I'll use int, otherwise uuid.
| fdr wrote:
| This is what I do as well. Most schema structure is on
| tables that are not very large, and it's nice especially
| when dealing with multiple environments to prevent
| errors, or to produce schemas with some polymorphism or
| joins between several tables without a chain of id-
| resolving joins in-between.
|
| There are UUID variants that can work well with indices,
| which shrinks the case for big-integers yet further, to
| micro-optimizing cases that are situational.
| jrockway wrote:
| Something I always liked but have never done -- start auto-
| increment IDs at some number other than 1. I worked on a
| system where account IDs started with 1000000, invoice IDs
| started with 2000000, etc. That way, if you saw a random
| number laying around like "1001234", you knew it was an
| account ID. "2000123", an invoice! I don't remember how
| often it helped things, but it was more than 0 times, and
| it always "felt" good. I never loved the implicit upper
| bound, but it was also never a problem.
|
| (And remember, the values are in different tables, so the
| only disadvantage is that your millionth user has ID
| 2000000. To the trained eye they look like an invoice; but
| there is no way for the system itself to treat that as an
| invoice, so it's only confusing to humans. If you use auto-
| increment keys that start at 1, you have the same problem.
| Account 1 and Invoice 1 are obviously different things.)
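|
| A sketch of setting that up with identity columns (the offsets are
| the example values from above):
        CREATE TABLE account (
            id bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 1000000) PRIMARY KEY
        );
        CREATE TABLE invoice (
            id bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 2000000) PRIMARY KEY
        );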
| 9dev wrote:
| I use this for internal items, as in: real customer IDs
| start at 10,000, so we have 9,999 possible
| users/invoices/whatever for our own usage in the
| database. That makes it easy to filter demo accounts of
| your sales employees, or production test accounts, in
| queries: just add an offset.
|
| Re: the upper bound: if you reach a million customers,
| you have lots of other nice problems, like how to spend
| all the money they earn you :-)
| Twisol wrote:
| > but there is no way for the system itself to treat that
| as an invoice, so it's only confusing to humans.
|
| I agree in principle, but then you have Windows skipping
| version 9 because of all of the `version.hasPrefix("9")`
| out in the world that were trying to cleverly handle both
| 95 and 98 at once. If a feature of data is exploitable,
| there's a strong chance it will be exploited.
| mjgp2 wrote:
| UUIDv4 keys have worse performance when inserting into the
| B-tree for the primary key.
| orf wrote:
| Why would you use a btree for them? Wouldn't a hash index
| be ideal?
| masklinn wrote:
| > Why would you use a btree for them?
|
| 1. because PRIMARY KEY is its own constraint, and the
| underlying index is not under your control
|
| 2. because PRIMARY KEY further restricts UNIQUE, and as
| of postgres 14 "only B-tree indexes can be declared
| unique"
| tomnipotent wrote:
| Some databases don't give you a choice, and your table
| data is actually stored in a clustered index sorted by
| primary key. This means random inserts when using a UUID
| (without embedded time), and more page splitting.
| williamdclt wrote:
| That is true, but that has never been a bottleneck for me.
| We have a table with multiple hundreds of millions of rows,
| 20ish UUIDs indexes (yes it did get out of hand) and insert
| time isn't an issue.
|
| Ofc, your app performance requirements might vary, and it
| is objectively true that UUIDs aren't ideal for btree
| indexes.
| akra wrote:
| Then you are lucky. I once found an insert performance
| drop of sometimes 5x once a table reaches 500,000
| elements or so, with B-trees and UUIDs in particular. The
| problem is: UUIDs often scale well on the app side,
| especially with multiple writers, so the app wants to use
| them.
|
| Unless you are using a time-sorted UUID and you only do
| inserts into the table (never updates), avoid any feature
| that creates a B-tree on those fields, IMO. Given the MVCC
| architecture of Postgres, time-sorted UUIDs are often not
| enough if you do a lot of updates, as these are really
| just inserts, which again create randomness in the index.
| I've been in a project where to avoid a refactor (and
| given Postgres usage was convenient) they just decided to
| remove constraints and anything that creates a B-Tree
| index implicitly or explicitly.
|
| It makes me wish Hash indexes could be used to create
| constraints. They often use less memory these days in my
| previous testing under new versions of Postgres, and
| scale a lot better despite less engineering effort in
| them. In other databases where updates happen in-place so
| as not to change the order of rows (not Postgres MVCC), a
| BRIN-like index on a time-ordered UUID would often be
| fantastic for memory usage. ZHeap seems to have died.
|
| Sadly this is something people should be aware of in
| advance. Otherwise it will probably bite you later when
| you have large write volumes, and therefore are most
| unable to enact changes to the DB when performance
| drastically goes down (e.g. large customer traffic). This
| is amplified because writes don't scale in Postgres/most
| SQL databases.
| christophilus wrote:
| A friend of mine recently recommended CUIDs, which I'd never
| heard of before. Any thoughts on those? Neither usecuid.org
| nor https://www.npmjs.com/package/cuid give me much
| confidence, though (the former is not https, and the latter
| has not been updated in 2 years... which might mean it's
| done, or might mean it's abandoned).
| polskibus wrote:
| Be wary when using guids for PK in clustered indices, in
| databases that support them, see for example
| https://stackoverflow.com/questions/11938044/what-are-the-
| be... . Sadly, PostgreSQL doesn't maintain them on the fly
| like MS SQL does, so it is a less-used feature.
| barrkel wrote:
| Random keys have the positive aspect of not creating any
| hotspots, and they have the negative aspect of not creating
| any hotspots.
|
| So if you have concurrent updates which are correlated with
| row insertion time, random keys can be a win. On the other
| hand, if your lookups are correlated with row insertion time,
| then the relevant key index pages are less likely to be hot
| in memory, and depending on how large the table is, you may
| have thrashing of index pages (this problem would be worse
| with MySQL, where rows are stored in the PK index).
|
| (UUID doesn't need to mean random any more though as pointed
| out below. My commentary is specific to random keys and is
| actually scar tissue from using MD5 as a lookup into multi-
| billion row tables.)
| tomnipotent wrote:
| I generally use both - a serial for internal RDBMS storage to
| coerce writes to be in order, and a secondary GUID that is
| app-specific and generated. The app will never care about the
| serial, and the RDBMS doesn't care about the GUID other than
| secondary index look-ups (since they're rarely used for range
| scans).
| lr4444lr wrote:
| Technically there are many advantages, but operationally, I
| find it extremely useful in systems where an int PK is auto-
| generated to use it as a pretty good proxy for a relative
| created time with relation to other records, with built-in
| indexing that avoids having to index on an actual created
| datetime column for a lot of less precise lookup purposes.
| The backend frameworks I use around data access make the
| security benefits of UUIDs pretty moot.
| Robin_Message wrote:
| As someone below suggests, ULID is a good alternative which
| has a temporal aspect in the high bits so it is still a
| proxy for relative creation time.
| netcraft wrote:
| Great article, always learn a lot from this author.
|
| Here is another way I have used to do pivot tables / crosstab in
| postgres where you have a variable number of columns in the
| output:
|
| https://gist.github.com/ryanguill/101a19fb6ae6dfb26a01396c53...
|
| You can try it out here:
| https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed...
| leephillips wrote:
| Yes, just the per-DB history was worth the price of admission.
| Too bad there is no RSS feed.
| gobblegobble2 wrote:
| The feed is here: https://hakibenita.com/feeds/all.atom.xml.
| leephillips wrote:
| Thanks!
| mikkelenzo wrote:
| Great article. Read through some of the other articles as well -
| some high quality content there!
| smackay wrote:
| The author's posts on Django are always worth reading. They're
| way more detailed than the run-of-the-mill rehash of the Django
| documentation that most blogs contain and there's always a
| nugget of "I never knew that" in them.
| mixmastamyk wrote:
| This example didn't bring anything new to the table, only adds
| redundant extra chars:
        SELECT * FROM users WHERE email ~ ANY('{@gmail\.com$|@yahoo\.com$}')
|
| Perhaps they were intending something similar to the following
| example instead. This one works but has several potential
| lurking issues:
        with connection.cursor() as cursor:
            cursor.execute('''
                SELECT *
                FROM users
                WHERE email ~ ANY(ARRAY%(patterns)s)
            ''' % {
                'patterns': [
                    '@gmail\.com$',
                    '@yahoo\.com$',
                ],
            })
|
| The dictionary-style interpolation is unnecessary, the pattern
| strings should be raw strings (the escape is ignored only due to
| being a period), and this could be a SQL injection site if any of
| this is ever changed. I don't recommend this form as given, but
| it could be improved.
| jrmiii wrote:
| > This example didn't bring anything new to the table, only
| adds redundant extra chars:
|
| OP indicated as much saying:
|
| > This approach is easier to work with from a host language
| such as Python
|
| I'm with you on the injection - have to be sure your host
| language driver properly escapes things.
| brasetvik wrote:
| Regarding "Match Against Multiple Patterns", the examples are
| about finding the _suffixes_ of something, email domains in the
| example.
|
| An attempt to find a suffix like that will not be able to use an
| index, whereas creating a functional index on the reverse and
| looking for the reversed suffix as a prefix will be able to:
        # create table users (id int primary key, email text);
        CREATE TABLE
        # create unique index on users(lower(email));
        CREATE INDEX
        # set enable_seqscan to false;
        SET
        # insert into users values (1, 'foo@gmail.com'), (2, 'bar@gmail.com'), (3, 'foo@yahoo.com');
        INSERT 0 3
        # explain select * from users where email ~* '@(gmail.com|yahoo.com)$';
                                    QUERY PLAN
        --------------------------------------------------------------------------
         Seq Scan on users  (cost=10000000000.00..10000000025.88 rows=1 width=36)
           Filter: (email ~* '@(gmail.com|yahoo.com)$'::text)

        # create index on users(reverse(lower(email)) collate "C");  -- collate C explicitly to enable prefix lookups
        CREATE INDEX
        # explain select * from users where reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)';
                                    QUERY PLAN
        ------------------------------------------------------------------------------------------------------
         Bitmap Heap Scan on users  (cost=4.21..13.71 rows=1 width=36)
           Filter: (reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)'::text)
           ->  Bitmap Index Scan on users_reverse_idx  (cost=0.00..4.21 rows=6 width=0)
                 Index Cond: ((reverse(lower(email)) >= 'm'::text) AND (reverse(lower(email)) < 'n'::text))
|
| (Another approach could of course be to tokenize the email, but
| the point here is pattern matching in particular.)
| brasetvik wrote:
| Two things that aren't exactly lesser-known, but that I wish more
| used continuously as part of development:
|
| - generate_series(): While not the best to make _realistic_ test
| data for proper load testing, at least it's easy to make a lot of
| data. If you don't have a few million rows in your tables when
| you're developing, you probably don't know how things behave,
| because a full table/seq scan will be fast anyway - and you won't
| spot the missing indexes (e.g. on reverse foreign keys, which I see
| missing often enough). There's a quick sketch at the end of this
| comment.
|
| - `EXPLAIN` and `EXPLAIN ANALYZE`. Don't skip the minutes of looking
| at your query plans during development only to spend hours fixing
| performance problems in production. EXPLAIN all the things.
|
| A significant percentage of production issues I've seen (and
| caused) are easily mitigated by those two.
|
| By learning how to read and understand execution plans and how
| and why they change over time, you'll learn a lot more about
| databases too.
|
| (CTEs/WITH-expressions are life changing too)
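|
| To sketch both points (the table and column names are made up):
        INSERT INTO users (email)
        SELECT 'user' || i || '@example.com'
        FROM generate_series(1, 1000000) AS i;

        EXPLAIN (ANALYZE)
        SELECT * FROM users WHERE email = 'user500000@example.com';
        -- a Seq Scan here tells you the index you assumed exists doesn't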
| dylanz wrote:
| It's worth noting that earlier versions of PostgreSQL didn't
| include the "AS NOT MATERIALIZED" option when specifying CTEs.
| In our setup, this had _huge_ hits to performance. If we were
| on a more recent version of PostgreSQL (I think 11 in this
| case), or if the query writer just used a sub-query instead of
| a CTE, we would have been fine.
| brasetvik wrote:
| Yep! A lot of older posts about CTEs largely advise against
| them for this reason.
|
| Postgres 12 introduced controllable materialization
| behaviour: https://paquier.xyz/postgresql-2/postgres-12-with-
| materializ...
|
| By default, it'll _not_ materialise unless it's recursive or
| referenced more than once.
|
| When not materializing, filters may push through the CTEs.
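|
| For example (a sketch; the table is hypothetical):
        WITH recent AS NOT MATERIALIZED (
            SELECT * FROM events
            WHERE created_at > now() - interval '1 day'
        )
        SELECT * FROM recent WHERE user_id = 42;
        -- the user_id filter can now be pushed down into the CTE's scan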
| skunkworker wrote:
| Also explain(analyze,buffers) is by far my favorite. It shows
| you the number of pages loaded from disk or cache.
|
| Also note: EXPLAIN just plans the query; EXPLAIN (ANALYZE) plans
| and runs the query, which can take a while in production.
| bajsejohannes wrote:
| Another great hidden feature is the LISTEN/NOTIFY system. You can
| use it to watch for changes on a table, and then use that to e.g.
| update a view in the UI.
|
| For example, create a table with a trigger on insert that does
| "NOTIFY new_data". Then, when querying, do:
        LISTEN new_data;
        SELECT ...;
|
| Now you'll get the results and any future updates.
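|
| A rough sketch of such a trigger (names are arbitrary; EXECUTE
| FUNCTION needs Postgres 11+, older versions use EXECUTE PROCEDURE):
        CREATE FUNCTION notify_new_data() RETURNS trigger AS $$
        BEGIN
            PERFORM pg_notify('new_data', NEW.id::text);
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER data_inserted AFTER INSERT ON data
            FOR EACH ROW EXECUTE FUNCTION notify_new_data();

        LISTEN new_data;  -- this session now gets a notification per insert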
| boarnoah wrote:
| The upsert strategy mentioned (using ON CONFLICT to update
| instead of insert) has a catch: if your PK is backed by a sequence,
| the sequence still increments even when the insert falls through to
| the update.
|
| It's not a huge deal in most cases, but a bigger issue if you are
| running any sort of scheduled ETL to update (or insert new)
| things.
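|
| A small sketch that shows the gap (the table is hypothetical):
        CREATE TABLE kv (id serial PRIMARY KEY, k text UNIQUE, v text);
        INSERT INTO kv (k, v) VALUES ('a', '1')
            ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
        INSERT INTO kv (k, v) VALUES ('a', '2')
            ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
        SELECT currval('kv_id_seq');  -- 2, even though only one row exists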
| orangepanda wrote:
| Does it matter? In case of rollbacks, the sequence would
| increment too. You have to handle gaps anyway.
| crorella wrote:
| Great article, a lot of useful commands for those who do their
| programming in the DB.
| freeest-will wrote:
| Careful with DISTINCT ON, it can be extremely slow, like 100x
| slower than a verbose and less readable alternative. But it might
| not matter in your use case.
| CabSauce wrote:
| I don't believe this is correct for postgres. It's true for
| other RDBMS, I believe.
| <https://medium.com/@smitagudale712/distinct-vs-group-by-
| in-s...>
| freeest-will wrote:
| I encountered this problem in Postgres 13.
|
| https://blog.timescale.com/blog/how-we-made-distinct-
| queries...
| ForHackernews wrote:
| Surprised to not see any mention of Postgres native pub/sub
| messaging: https://vijayg.com/postgresql-pubsub/
|
| Is that feature considered well-known? Or is it so obscure this
| author didn't know about it?
| cldellow wrote:
| Expanding on the regex trick: I often use [.] instead of \\. to
| match a period.
|
| They achieve the same thing, but [.] avoids the problem of your
| host language "helpfully" interpolating \\. into . before sending
| the query to Postgres.
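|
| E.g. (a sketch):
        SELECT * FROM users WHERE email ~ '@gmail[.]com$';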
| micw wrote:
| I may be wrong, but isn't "overlapping" just
| "meeting_a.starts_at<meeting_b.ends_at &&
| meeting_a.ends_at>meeting_b.starts_at"?
|
| Or in words: There is an overlap if meeting A starts before
| meeting B ends and meeting A ends after meeting B starts.
|
| So the scenario described is made to look way more complex than
| it actually is.
| jpitz wrote:
| Well, for this feature, it means that meeting A and meeting B
| share some point in time (except their boundaries).
| micw wrote:
| I have tried this with the condition I described and don't see
| any case where it fails.
| jayd16 wrote:
| You'd fail on the case where the times are exactly the same.
| KarlKemp wrote:
| I believe it's about the common case of two events sharing a
| single (infinitely small) point of time: meeting A from 3 to 4,
| followed by meeting B from 4 to 5.
|
| Those two time periods do overlap mathematically, but not
| physically: they share the common point of "4", even if that
| point has duration 0.
|
| I know the iCal standard goes into this for a bit.
| micw wrote:
| I get the exactly same result with that query as the "overlap"
| query:
|
| SELECT * FROM meetings, new_meetings WHERE
| new_meetings.starts_at<meetings.ends_at AND
| new_meetings.ends_at>meetings.starts_at;
| jrumbut wrote:
| I've seen similarly titled articles and they are usually full of
| the features that aren't used for a reason (MONEY column type,
| etc).
|
| It's nice to see one that's full of good advice.
| gmac wrote:
| On this first point -- 'Get the Number of Updated and Inserted
| Rows in an Upsert' -- unfortunately this `xmax` method is not
| always reliable and there does not seem to be any reliable
| alternative.
|
| I posted to pgsql-hackers about this problem here:
| https://www.postgresql.org/message-id/DE57F14C-DB96-4F17-925...
| jakearmitage wrote:
| I love those kinds of posts. Always something that you keep for
| life.
| picardo wrote:
| > Grant Permissions on Specific Columns
|
| This feature might be great for OLAP systems, but it's awful for
| OLTP systems. The last thing I want is to debug permission errors
| bubbling up from my database into the application layer. It may not
| be as efficient, but you should always manage your ACL rules in
| the application layer if you're building an OLTP app.
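|
| For reference, a sketch of the syntax in question (role and column
| names are made up):
        GRANT SELECT (id, email) ON users TO reporting_role;
        GRANT UPDATE (email) ON users TO support_role;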
| blowski wrote:
| The word "always" should always be avoided in architecture
| discussions.
| picardo wrote:
| I qualified it by saying "if you're building an OLTP app"
| though.
|
| I speak from personal experience. If you know a good usecase
| for ACL in database for OLTP workflows, I'm all ears.
| blowski wrote:
| Even then, it would be better phrased as a tradeoff. What's
| the downside of putting it in the database? I'd guess it makes
| horizontally scaling harder, and it's less portable to
| other database vendors. Is there an upside, like you've got
| stronger data security guarantees?
| picardo wrote:
| Good point. Phrasing it as a trade-off is the better
| approach. Agree with the downsides you listed, as well.
|
| Better security guarantees, for sure, but if an
| unauthorized user gains direct access to your database,
| you might have bigger problems from a security
| perspective.
| pfarrell wrote:
| Agree with all points here.
|
| An upside to using db to handle data permissions is the
| data has the same protections if you are accessing with
| direct SQL or using an app. Also, those permissions would
| persist with the data in restored backups.
|
| I'm not advocating this and think for OLTP, building RBAC
| in the app layer is almost always a better idea, but
| these would be some benefits.
| mcherm wrote:
| > If you know a good usecase for ACL in database for OLTP
| workflows, I'm all ears.
|
| Here's one that's meaningful to me. We have a single
| database with five different applications that access it,
| each of them managed by a separate team. By enforcing
| access constraints in the database we guarantee the access
| constraints will be applied in all cases. It is difficult
| to ensure that in application code managed by separate
| teams.
|
| (Just to be clear: I don't think your advice is poor, I
| just wanted to give an example of a case where it isn't
| universally applicable.)
| picardo wrote:
| I see, thanks for your input. Most of the workflows I
| worked on involved provisioning one database per app, so
| I hadn't entertained this angle.
|
| Are there any risks to changing ACL rules on a production
| database server from a stability perspective?
| NightMKoder wrote:
| One of the hardest types of queries in a lot of DBs is the simple
| `min-by` or `max-by` queries - e.g. "find the most recent post
| for each user." Seems like Postgres has a solution - `DISTINCT
| ON` - though personally I've always been a fan of how BigQuery
| does it: `ARRAY_AGG`. E.g.:
        SELECT user_id,
               ARRAY_AGG(STRUCT(post_id, text, timestamp)
                         ORDER BY timestamp DESC LIMIT 1)[SAFE_OFFSET(0)].*
        FROM posts
        GROUP BY user_id
|
| `DISTINCT ON` feels like a hack and doesn't cleanly fit into the
| SQL execution framework (e.g. you can't run a window function on
| the result without using subqueries). This feels cleaner, but I'm
| not actually aware of any other DBMS that supports `ARRAY_AGG`
| with `LIMIT`.
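|
| For comparison, the Postgres `DISTINCT ON` version of the same query
| (a sketch over the same hypothetical posts table):
        SELECT DISTINCT ON (user_id) user_id, post_id, text, timestamp
        FROM posts
        ORDER BY user_id, timestamp DESC;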
| bbirk wrote:
| I would use a lateral join. Alternatively you can use a window
| function instead of a lateral join, but from my experience lateral
| joins are usually faster for this kind of top-N query in postgres.
|
|     select user_table.user_id, tmp.post_id, tmp.text, tmp.timestamp
|     from user_table
|     left outer join lateral (
|         select post_id, text, timestamp
|         from post
|         where post.user_id = user_table.user_id
|         order by timestamp desc
|         limit 1
|     ) tmp on true
|     order by user_id
|     limit 30;
| mnahkies wrote:
| I've generally used `PARTITION BY` to achieve the same effect
| in big query (similar to what was suggested in the article for
| postgres).
|
| Does the `ARRAY_AGG` approach offer any advantages?
| NightMKoder wrote:
| I assume you mean `ROW_NUMBER() OVER (PARTITION BY ... ORDER
| BY ...) as row_number` with a `WHERE row_number = 1`. We ran
| into issues with memory usage (i.e. queries erroring because
| of too high memory usage).
|
| The `ARRAY_AGG` approach uses a simple heap, so it's as
| efficient as MIN/MAX because it can be computed in parallel.
| ROW_NUMBER however needs all the rows on one machine to
| number them properly. ARRAY_AGG combination is associative
| whereas ROW_NUMBER combination isn't.
| mnahkies wrote:
| That makes sense, I might have a play with the array agg
| approach. I'm especially curious if it has any impact on
| slot time consumed
___________________________________________________________________
(page generated 2021-11-09 23:00 UTC)