[HN Gopher] Lesser-known Postgres features
___________________________________________________________________
Lesser-known Postgres features
Author : CRConrad
Score : 436 points
Date : 2021-11-09 15:50 UTC (7 hours ago)
(HTM) web link (hakibenita.com)
(TXT) w3m dump (hakibenita.com)
| tbrock wrote:
| I'm surprised there is no mention of foreign data wrappers, easily
| one of the best but lesser-known features.
|
| Would I use them in production? No. Are they fun to play around
| with? Yes!
| laurent92 wrote:
| Atlassian gives us sales figures as a large CSV. Nothing better
| than loading the CSV as a Postgres view with a foreign data
| wrapper, and performing pivots/joins/anything on it. Replace
| the file with a more recent one, and you even have versioning!
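|
| For illustration, a minimal file_fdw sketch (the file path and
| columns here are made-up examples):
        CREATE EXTENSION IF NOT EXISTS file_fdw;
        CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
        CREATE FOREIGN TABLE sales_csv (
            sku      text,
            quantity int,
            amount   numeric
        ) SERVER csv_files
          OPTIONS (filename '/data/sales.csv', format 'csv', header 'true');
        -- query (join, pivot, aggregate) the file like any other table
        SELECT sku, sum(amount) FROM sales_csv GROUP BY sku;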
| iaabtpbtpnn wrote:
| We use them in production, they work fine. You probably want to
| enable use_remote_estimate for postgres_fdw though.
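|
| A sketch of turning that on (server and table names are
| hypothetical):
        ALTER SERVER remote_pg OPTIONS (ADD use_remote_estimate 'true');
        -- or per foreign table:
        ALTER FOREIGN TABLE remote_orders OPTIONS (ADD use_remote_estimate 'true');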
| hyzyla wrote:
| Could you describe for what use case are you using foreign
| data wrapper in production?
| jeffbee wrote:
| My favorite relatively obscure pg feature is you can write stored
| procedures in perl, python, and tcl.
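|
| As a rough sketch, assuming the plpython3u extension is available
| on your install:
        CREATE EXTENSION IF NOT EXISTS plpython3u;
        CREATE FUNCTION normalize_email(addr text) RETURNS text AS $$
            # plain Python, running inside the server process
            return addr.strip().lower()
        $$ LANGUAGE plpython3u;
        SELECT normalize_email('  Foo@Example.COM ');  -- foo@example.com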
| pageandrew wrote:
| Wow, I've been writing a lot of PL/pgSQL recently and did not
| know this.
|
| Would these be drop in replacements for PL/pgSQL? Are there any
| performance tradeoffs to using one over another? Any changes in
| functionality (e.g. which functions you can call)?
| jeffbee wrote:
| I personally have never run into whatever the limitations
| might be of pl/python, but I suppose you could make a
| disaster by trying to dynamically load conflicting symbols or
| something like that. But I have used numpy in stored
| procedures, so at least that works.
|
| pl/sh also works wonderfully if you want to run a complex
| procedure in an isolated subprocess, but pl/sh is not part of
| the main postgresql distribution.
| mindcrime wrote:
| Not to mention Java, R, Ruby, PHP, Scheme, and sh.
|
| https://www.postgresql.org/docs/9.5/external-pl.html
|
| Heck, there's even a pl/prolog out there, but it looks pretty
| old and I'm skeptical how useful it would be.
|
| https://github.com/salva/plswipl
| brasetvik wrote:
| They're cool, but keep in mind that this often means you cannot
| run them on managed Postgres services.
| thearrow wrote:
| For the last point about "Find Overlapping Ranges", consider
| PostgreSQL range types [0] and their corresponding range
| operators [1].
|
| [0]: https://www.postgresql.org/docs/current/rangetypes.html
|
| [1]: https://www.postgresql.org/docs/current/functions-range.html
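|
| For example (a sketch assuming a meetings table with a tstzrange
| column named "during"):
        SELECT *
        FROM meetings
        WHERE during && tstzrange('2021-11-10 14:00+00', '2021-11-10 15:00+00');
        -- an exclusion constraint can also reject overlapping rows outright
        ALTER TABLE meetings
            ADD CONSTRAINT no_overlap EXCLUDE USING gist (during WITH &&);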
| luhn wrote:
| That COPY trick is really neat. I've always used SELECT INTO.
        SELECT * INTO TEMP copy FROM foobar;
        \copy "copy" to 'foobar.csv' with csv header
| haki wrote:
| I'm always looking for new copy tricks. Is "copy" a temporary
| table? Or a variable...?
| luhn wrote:
| Temporary table. Although looking at the docs it seems CREATE
| TABLE AS is the recommended syntax.
| https://www.postgresql.org/docs/13/sql-createtableas.html
| CREATE TEMP TABLE "copy" AS SELECT * FROM foobar;
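|
| You can also skip the temp table and \copy a query result directly
| (a sketch):
        \copy (SELECT * FROM foobar ORDER BY 1) to 'foobar.csv' with csv header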
| sharadov wrote:
| The comment one is pretty neat, I'm going to start using it.
|
| db=# COMMENT ON TABLE sale IS 'Sales made in the system';
| COMMENT
| ccakes wrote:
| In-database comments combined with something like
| https://github.com/k1LoW/tbls make for very cheap database
| documentation.
|
| No affiliation with tbls except that I'm a big fan
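|
| A rough sketch of writing comments and reading them back (the
| column name is just an example):
        COMMENT ON TABLE sale IS 'Sales made in the system';
        COMMENT ON COLUMN sale.sold_at IS 'UTC timestamp of the sale';
        SELECT obj_description('sale'::regclass, 'pg_class');  -- table comment
        SELECT col_description('sale'::regclass, 1);           -- comment on the table's first column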
| londons_explore wrote:
| Avoid using lesser known features...
|
| They're the ones that will be hardest to migrate to a different
| database, most likely to be deprecated, and least likely to be
| understood by the next engineer to fill your shoes.
|
| While many of these are neat, good engineering practice is to
| build the simplest thing that gets the job done.
| sodapopcan wrote:
| How often does migrating databases ever actually happen? I'm
| not saying it doesn't happen, but I've never experienced it or
| known anyone who has (I've asked!). I certainly wouldn't shy away
| from using features that make my app code cleaner and improve
| the all-around performance of the app just because of the
| slim-to-none chance that I might one day have to migrate off of
| postgres.
| handrous wrote:
| 20-year career so far. Never seen a database swapped out.
| I've seen "apps" replaced on top of databases, or more
| programs added to access the same database. I've seen the app
| _and_ database both get thrown out and replaced, because they
| were tightly coupled[0], as the parent advocates (Rails +
| ActiveRecord seems to be prime for this kind of "gotta throw
| it all out" situation). I've never seen the program stay the
| same while the DB is swapped out from under it.
|
| [0] yes, that's actually _tightly_ coupling them, because now
| your DB is too unsafe to use without the "app" you built on
| top, and doesn't provide enough functionality to make it
| worth trying to retrofit that safety onto it.
| williamdclt wrote:
| I've seen it firsthand, but we're talking about an early-
| stage startup moving from MySQL to Postgres. It's
| definitely not enough to convince me not to use database-
| specific features.
| gmfawcett wrote:
| I'm not sure why you are being downvoted. Your argument is
| reasonable and clearly presented, and "do the simplest thing
| that can work" is a tried and true design philosophy.
| nicoburns wrote:
| Using a built-in database feature is often a lot simpler.
| gmfawcett wrote:
| Fair... and that's a good counter-argument, but not a good
| reason to downvote someone who disagrees. (Not suggesting
| that you downvoted him yourself.)
| handrous wrote:
| Counterpoint:
|
| Databases often outlive applications. Database features usually
| do what they do very quickly and reliably. Used well, a full-
| featured database can make things like in-place refactors or
| re-writes of your application layer _far_ easier, make everyday
| operation much safer, as well as making it much safer & more useful
| to allow multiple applications access to the same database,
| which can be very handy for all kinds of reasons.
|
| > While many of these are neat, good engineering practice is to
| make the simplest thing to get the job done.
|
| That, or, use the right tool for the job.
| tengbretson wrote:
| Even the most obscure Postgres features are documented better
| than whatever code a dev would write in order to avoid using
| them.
| pyentropy wrote:
| I'd rather have a non-portable SQL feature than an app-level
| "portable" implementation that almost certainly is slower and
| non-transactional.
|
| Migrating to another DB will always take work, no matter how
| much you try to ignore flavor-specific features. Query
| planning, encoding and charsets, locking abilities tend to be
| very different. A query can run fine in MySQL and cause a
| deadlock in Postgres even though it's syntactically valid in
| both.
| gavinray wrote:
| I want to stress the importance of not using:
        id SERIAL
|
| If you are on a somewhat recent version of postgres, please do
| yourself a favor and use:
        id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
|
| An "identity column", the part here:
|
| https://hakibenita.com/postgresql-unknown-features#prevent-s...
|
| You might think this is trivial -- but SERIAL creates an "owned"
| (by a certain user) sequence behind the scenes, and so you run
| into massive headaches if you try to move things around.
|
| Identity columns don't, and avoid the issue altogether.
|
| https://www.2ndquadrant.com/en/blog/postgresql-10-identity-c...
| cormacrelf wrote:
| That syntax seems a bit verbose for a column that will most
| likely appear on every table you ever create. How about:
        id int SIMILAR TO SERIAL BUT WITHOUT PROBLEMS MOVING THINGS AROUND
|
| And the other variant for when you aren't sure that worked:
        id int SIMILAR TO SERIAL BUT WITH EVEN FEWER PROBLEMS MOVING THINGS AROUND
| iaabtpbtpnn wrote:
| Complain to the SQL standard authors, not the Postgres
| developers :)
| uhoh-itsmaciek wrote:
| Another favor you should do yourself is use bigint keys (unless
| using uuid as suggested elsewhere, or you're very, very sure
| you'll never chew through 2 billion ids).
| mixmastamyk wrote:
| From TFA: Instead of using GENERATED BY
| DEFAULT, use GENERATED ALWAYS.
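|
| A quick sketch of the difference (table name is arbitrary):
        CREATE TABLE sale (
            id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            amount numeric
        );
        INSERT INTO sale (id, amount) VALUES (123, 9.99);  -- rejected
        INSERT INTO sale (id, amount) OVERRIDING SYSTEM VALUE
            VALUES (123, 9.99);                            -- explicit opt-in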
| cnity wrote:
| I often move ID generation into the application layer (this
| also helps avoid things like enumeration attacks), and actually
| quite a lot of cool Postgres features blur that line a little
| bit. It's interesting to think about sequences and other
| computational mechanisms in a DB, and whether they make
| architecting applications easier or harder. I don't have a
| strong opinion either way, but I'm interested in HN's opinion.
| superice wrote:
| One often hears the counterargument 'but using DB-specific
| features makes your application less portable!' to which I
| like to argue: When was the last time you moved an
| application from SQL-db to SQL-db engine? Follow up question:
| When was it ever 'easy' if you did?
|
| If you start from the basic premise that the database engine
| and the application are intertwined and are not loosely
| coupled, using Postgres-specific features feels much less
| icky from an architectural point of view. They are
| essentially part of your application given that you use
| something like Flyway for migrations and you don't manually
| run SQL against your production to install functions and
| triggers and such.
| skrtskrt wrote:
| Done it multiple times, most recently we got an MVP off the
| ground using only ElasticSearch as a backend, since the
| primary use case we wanted to validate and demonstrate was
| discoverability.
|
| As we added more features, the lack of transactions and
| relational structures started to slow us down, so we
| dropped in Postgres as a backend, and having application-
| generated UUID4s as primary keys was a big part in making
| that move fairly painless
| gizdan wrote:
| We're currently in the process of moving from MongoDB to
| CockroachDB. It's not been easy. Certain persistence needs
| to be duplicated for both of them. Certain queries need to
| be compared between the two, and we need to make a
| conscious decision when we pick from which DB.
|
| Having said that, moving logic for id generation to the
| application because it's less portable otherwise is an odd
| reason.
| toast0 wrote:
| Adding on, even when you are trying to make a portable
| application, you tend to want to make it work on either X
| or Y, and you can still end up excluding people from switching
| their existing data from one to the other.
|
| I've used software that at least tried to be portable, so
| you could install it with whatever database you had
| available.
| ziml77 wrote:
| I've given up on believing that there is such a thing as
| portable SQL. If you use the lowest common denominator,
| you're likely to have queries that perform suboptimally.
| Hell, even temp tables aren't portable because of the
| different syntax between databases. I've worked with tons
| of queries that would take minutes to run without
| temp tables while running in milliseconds with them. People
| may as well take advantage of all the nice things their
| database offers instead of fearing a situation that is
| unlikely to happen and which will be a tough change anyway.
| fiddlerwoaroof wrote:
| So, I did an easy migration of an analytics app from
| elasticsearch to ClickHouse: what made it easy was (a) we
| could replay an event stream to regenerate the database and
| (b) we had interfaces around the database defined in terms
| of the questions we needed the database to answer, and not
| in terms of ES-specific concepts.
|
| But, what makes this sort of design so nice is that you can
| use DB-specific stuff behind the interface because you're
| not trying to write all your queries in the minimally
| supported subset of SQL or something.
| CRConrad wrote:
| >>> [DB Sequences]
|
| > One often hears the counterargument 'but using DB-
| specific features makes your application less portable!'
|
| OK, sorry for the probably stupid question: Isn't it just a
| matter of, for each sequence, selecting its current value
| and then creating the new one in the target database to
| start from there? Should be, if perhaps not easily, still
| reasonably scriptable... Or what am I missing?
| netcraft wrote:
| I'll stop short of giving a recommendation or using the word
| "should", but ill give encouragement to consider using uuid's
| for keys. I have used them in several systems and have never
| had any issues with them, and they solve so many issues. The
| ability to generate a key on the client or on the server or in
| the database is great for one. And the fact that keys are
| unique not only in a table but in the system (or many systems)
| is a huge advantage.
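|
| A minimal sketch (gen_random_uuid() is built in from Postgres 13;
| older versions can use the pgcrypto or uuid-ossp extensions):
        CREATE TABLE orders (
            id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
            created_at timestamptz NOT NULL DEFAULT now()
        );
        -- or generate the key on the client and send it along
        INSERT INTO orders (id) VALUES ('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11');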
| hackbinary wrote:
| >> I'll stop short of giving a recommendation or using the
| word "should" ...
|
| I believe the word you are looking for is "ought". :)
| evandwight wrote:
| Do you write logic to handle collisions?
| spullara wrote:
| "A collision is possible but the total number of unique
| keys generated is so large that the possibility of a
| collision is almost zero. As per Wikipedia, the number of
| UUIDs generated to have at least 1 collision is 2.71
| quintillion. This is equivalent to generating around 1
| billion UUIDs per second for about 85 years."
| fwip wrote:
| For clarification, that's the number of UUIDs at which
| you have a 50% chance of at least one collision.
|
| Wikipedia also adds: "the probability to find a duplicate
| within 103 trillion version-4 UUIDs is one in a billion."
| Ginden wrote:
| Unless you are writing a nuclear warhead management system,
| writing UUID collision handling is a waste of time. The client
| can retry at the top level if the request fails.
| afiori wrote:
| Expecting a collision to happen is probably a waste of
| time, but it still sounds like a good idea to bound how
| much damage a collision might cause.
| williamdclt wrote:
| Part of the point of UUIDs is that you don't have to. Unless you
| generate an absolutely absurd amount of data, a collision is
| unlikely enough that your time would be more wisely spent worrying
| about the sun exploding.
| gavinray wrote:
| Yeah I recognize the arguments for UUID keys:
| - Avoids people being able to just iterate through records,
| or to discover roughly how many records of a thing you have
| - Allows you to generate the key before the row is saved
|
| I think I default to auto-increment IDs due to:
| - Familiarity bias
| - They have a temporal aspect to them (i.e. I know the row with ID
| 225 was created before the row with ID 392, and approximately when
| they might have been created)
| - Easier to read (when you have fewer than ~1,000,000 rows in a
| table)
|
| I agree and think you're right in that UUIDs are probably a
| better default.
|
| Though you can never find a "definitive" guide/rule online or
| in the docs unfortunately.
| spiffytech wrote:
| > They have a temporal aspect to them (IE, I know row with
| ID 225 was created before row with ID 392, and
| approximately when they might be created)
|
| UUIDv7 (currently a draft spec[0]) is an ID format that can be
| sorted in the chronological order the IDs were created.
|
| In the meantime, ulid[1] and ksuid[2] are popular time-
| sortable ID schemes, both previously discussed on HN[3]
|
| [0] https://datatracker.ietf.org/doc/html/draft-peabody-
| dispatch...
|
| [1] https://github.com/ulid/spec
|
| [2] https://github.com/segmentio/ksuid
|
| [3] ulid discussion:
| https://news.ycombinator.com/item?id=18768909
|
| UUIDv7 discussion:
| https://news.ycombinator.com/item?id=28088213
| gavinray wrote:
| Wow, neat. TIL
| codetrotter wrote:
| Yup this. Ulid is great, and implemented in many
| languages. For example, there's a crate named Ulid that
| you can use to get ULIDs in Rust.
| stjohnswarts wrote:
| Hmmm that's really useful :) . thanks!
| yardstick wrote:
| I prefer UUIDs as well but one other benefit of an integer
| type key is in the index. UUIDs, especially v4, make for
| bloated indexes. If there's a very specific performance
| reason I'll use int, otherwise uuid.
| fdr wrote:
| This is what I do as well. Most schema structure is on
| tables that are not very large, and it's nice especially
| when dealing with multiple environments to prevent
| errors, or to produce schemas with some polymorphism or
| joins between several tables without a chain of id-
| resolving joins in-between.
|
| There are UUID variants that can work well with indices,
| which shrinks the case for big-integers yet further, to
| micro-optimizing cases that are situational.
| jrockway wrote:
| Something I always liked but have never done -- start auto-
| increment IDs at some number other than 1. I worked on a
| system where account IDs started with 1000000, invoice IDs
| started with 2000000, etc. That way, if you saw a random
| number laying around like "1001234", you knew it was an
| account ID. "2000123", an invoice! I don't remember how
| often it helped things, but it was more than 0 times, and
| it always "felt" good. I never loved the implicit upper
| bound, but it was also never a problem.
|
| (And remember, the values are in different tables, so the
| only disadvantage is that your millionth user has ID
| 2000000. To the trained eye they look like an invoice; but
| there is no way for the system itself to treat that as an
| invoice, so it's only confusing to humans. If you use auto-
| increment keys that start at 1, you have the same problem.
| Account 1 and Invoice 1 are obviously different things.)
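|
| A sketch of setting that up with identity columns (the offsets are
| the example values from above):
        CREATE TABLE account (
            id bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 1000000) PRIMARY KEY
        );
        CREATE TABLE invoice (
            id bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 2000000) PRIMARY KEY
        );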
| 9dev wrote:
| I use this for internal items, as in: real customer IDs
| start at 10,000, so we have 9,999 possible
| users/invoices/whatever for our own usage in the
| database. That makes it easy to filter demo accounts of
| your sales employees, or production test accounts, in
| queries: just add an offset.
|
| Re: the upper bound: if you reach a million customers,
| you have lots of other nice problems, like how to spend
| all the money they earn you :-)
| Twisol wrote:
| > but there is no way for the system itself to treat that
| as an invoice, so it's only confusing to humans.
|
| I agree in principle, but then you have Windows skipping
| version 9 because of all of the `version.hasPrefix("9")`
| out in the world that were trying to cleverly handle both
| 95 and 98 at once. If a feature of data is exploitable,
| there's a strong chance it will be exploited.
| mjgp2 wrote:
| UUIDv4 keys have worse performance when inserting into the
| B-tree for the primary key.
| orf wrote:
| Why would you use a btree for them? Wouldn't a hash index
| be ideal?
| masklinn wrote:
| > Why would you use a btree for them?
|
| 1. because PRIMARY KEY is its own constraint, and the
| underlying index is not under your control
|
| 2. because PRIMARY KEY further restricts UNIQUE, and as
| of postgres 14 "only B-tree indexes can be declared
| unique"
| tomnipotent wrote:
| Some databases don't give you a choice, and your table
| data is actually stored in a clustered index sorted by
| primary key. This means random inserts when using a UUID
| (without embedded time), and more page splitting.
| williamdclt wrote:
| That is true, but that has never been a bottleneck for me.
| We have a table with multiple hundreds of millions of rows,
| 20ish UUIDs indexes (yes it did get out of hand) and insert
| time isn't an issue.
|
| Ofc, your app performance requirements might vary, and it
| is objectively true that UUIDs aren't ideal for btree
| indexes.
| akra wrote:
| Then you are lucky. I once found an insert performance
| drop of sometimes 5x once a table reaches 500,000
| elements or so, with B-trees and UUIDs in particular. The
| problem is: UUIDs often scale well on the app side,
| especially with multiple writers, so the app wants to use
| them.
|
| Unless you are using a time-sorted UUID and you only do
| inserts into the table (never updates), avoid any feature
| that creates a B-tree on those fields, IMO. Given the MVCC
| architecture of Postgres, time-sorted UUIDs are often not
| enough if you do a lot of updates, as these are really
| just inserts, which again create randomness in the index.
| I've been in a project where to avoid a refactor (and
| given Postgres usage was convenient) they just decided to
| remove constraints and anything that creates a B-Tree
| index implicitly or explicitly.
|
| It makes me wish Hash indexes could be used to create
| constraints. They often use less memory these days in my
| previous testing under new versions of Postgres, and
| scale a lot better despite less engineering effort in
| them. In other databases where updates happen in-place so
| as not to change the order of rows (not Postgres MVCC), a
| BRIN-like index on a time-ordered UUID would often be
| fantastic for memory usage. ZHeap seems to have died.
|
| Sadly this is something people should be aware of in
| advance. Otherwise it will probably bite you later when
| you have large write volumes, and therefore are most
| unable to enact changes to the DB when performance
| drastically goes down (e.g. large customer traffic). This
| is amplified because writes don't scale in Postgres/most
| SQL databases.
| christophilus wrote:
| A friend of mine recently recommended CUIDs, which I'd never
| heard of before. Any thoughts on those? Neither usecuid.org
| nor https://www.npmjs.com/package/cuid give me much
| confidence, though (the former is not https, and the latter
| has not been updated in 2 years... which might mean it's
| done, or might mean it's abandoned).
| polskibus wrote:
| Be wary when using guids for PK in clustered indices, in
| databases that support them, see for example
| https://stackoverflow.com/questions/11938044/what-are-the-
| be... . Sadly, PostgreSQL doesn't maintain them on the fly
| like MS SQL does, so it is a less-used feature.
| barrkel wrote:
| Random keys have the positive aspect of not creating any
| hotspots, and they have the negative aspect of not creating
| any hotspots.
|
| So if you have concurrent updates which are correlated with
| row insertion time, random keys can be a win. On the other
| hand, if your lookups are correlated with row insertion time,
| then the relevant key index pages are less likely to be hot
| in memory, and depending on how large the table is, you may
| have thrashing of index pages (this problem would be worse
| with MySQL, where rows are stored in the PK index).
|
| (UUID doesn't need to mean random any more though as pointed
| out below. My commentary is specific to random keys and is
| actually scar tissue from using MD5 as a lookup into multi-
| billion row tables.)
| tomnipotent wrote:
| I generally use both - a serial for internal RDBMS storage to
| coerce writes to be in order, and a secondary GUID that is
| app-specific and generated. The app will never care about the
| serial, and the RDBMS doesn't care about the GUID other than
| secondary index look-ups (since they're rarely used for range
| scans).
| lr4444lr wrote:
| Technically there are many advantages, but operationally, I
| find it extremely useful in systems where an int PK is auto-
| generated to use it as a pretty good proxy for a relative
| created time with relation to other records, with built-in
| indexing that avoids having to index on an actual created
| datetime column for a lot of less precise lookup purposes.
| The backend frameworks I use around data access make the
| security benefits of UUIDs pretty moot.
| Robin_Message wrote:
| As someone below suggests, ULID is a good alternative which
| has a temporal aspect in the high bits so it is still a
| proxy for relative creation time.
| netcraft wrote:
| Great article, always learn a lot from this author.
|
| Here is another way I have used to do pivot tables / crosstab in
| postgres where you have a variable number of columns in the
| output:
|
| https://gist.github.com/ryanguill/101a19fb6ae6dfb26a01396c53...
|
| You can try it out here:
| https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed...
| leephillips wrote:
| Yes, just the per-DB history was worth the price of admission.
| Too bad there is no RSS feed.
| gobblegobble2 wrote:
| The feed is here: https://hakibenita.com/feeds/all.atom.xml.
| leephillips wrote:
| Thanks!
| mikkelenzo wrote:
| Great article. Read through some of the other articles as well -
| some high quality content there!
| smackay wrote:
| The author's posts on Django are always worth reading. They're
| way more detailed than the run-of-the-mill rehash of the Django
| documentation that most blogs contain and there's always a
| nugget of "I never knew that" in them.
| mixmastamyk wrote:
| This example didn't bring anything new to the table, only adds
| redundant extra chars:
        SELECT * FROM users WHERE email ~ ANY('{@gmail\.com$|@yahoo\.com$}')
|
| Perhaps they were intending something similar to the following
| example instead. This one works but has several potential
| lurking issues:
        with connection.cursor() as cursor:
            cursor.execute('''
                SELECT *
                FROM users
                WHERE email ~ ANY(ARRAY%(patterns)s)
            ''' % {
                'patterns': [
                    '@gmail\.com$',
                    '@yahoo\.com$',
                ],
            })
|
| The dictionary-style interpolation is unnecessary, the pattern
| strings should be raw strings (the escape is ignored only due to
| being a period), and this could be a SQL injection site if any of
| this is ever changed. I don't recommend this form as given, but
| it could be improved.
| jrmiii wrote:
| > This example didn't bring anything new to the table, only
| adds redundant extra chars:
|
| OP indicated as much saying:
|
| > This approach is easier to work with from a host language
| such as Python
|
| I'm with you on the injection - have to be sure your host
| language driver properly escapes things.
| brasetvik wrote:
| Regarding "Match Against Multiple Patterns", the examples are
| about finding the _suffixes_ of something, email domains in the
| example.
|
| An attempt to find a suffix like that will not be able to use an
| index, whereas creating a functional index on the reverse and
| looking for the reversed suffix as a prefix will be able to:
        # create table users (id int primary key, email text);
        CREATE TABLE
        # create unique index on users(lower(email));
        CREATE INDEX
        # set enable_seqscan to false;
        SET
        # insert into users values (1, 'foo@gmail.com'), (2, 'bar@gmail.com'), (3, 'foo@yahoo.com');
        INSERT 0 3
        # explain select * from users where email ~* '@(gmail.com|yahoo.com)$';
                                    QUERY PLAN
        --------------------------------------------------------------------------
         Seq Scan on users  (cost=10000000000.00..10000000025.88 rows=1 width=36)
           Filter: (email ~* '@(gmail.com|yahoo.com)$'::text)

        # create index on users(reverse(lower(email)) collate "C");  -- collate C explicitly to enable prefix lookups
        CREATE INDEX
        # explain select * from users where reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)';
                                    QUERY PLAN
        ------------------------------------------------------------------------------------------------------
         Bitmap Heap Scan on users  (cost=4.21..13.71 rows=1 width=36)
           Filter: (reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)'::text)
           ->  Bitmap Index Scan on users_reverse_idx  (cost=0.00..4.21 rows=6 width=0)
                 Index Cond: ((reverse(lower(email)) >= 'm'::text) AND (reverse(lower(email)) < 'n'::text))
|
| (Another approach could of course be to tokenize the email, but
| the point here is pattern matching in particular.)
| brasetvik wrote:
| Two things that aren't exactly lesser-known, but that I wish more
| used continuously as part of development:
|
| - generate_series(): While not the best to make _realistic_ test
| data for proper load testing, at least it's easy to make a lot of
| data. If you don't have a few million rows in your tables when
| you're developing, you probably don't know how things behave,
| because a full table/seq scan will be fast anyway - and you won't
| spot the missing indexes (e.g. on reverse foreign keys, which I see
| missing often enough). There's a quick sketch at the end of this
| comment.
|
| - `EXPLAIN` and `EXPLAIN ANALYZE`. Don't skip the minutes of looking
| at your query plans during development only to spend hours fixing
| performance problems in production. EXPLAIN all the things.
|
| A significant percentage of production issues I've seen (and
| caused) are easily mitigated by those two.
|
| By learning how to read and understand execution plans and how
| and why they change over time, you'll learn a lot more about
| databases too.
|
| (CTEs/WITH-expressions are life changing too)
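|
| To sketch both points (the table and column names are made up):
        INSERT INTO users (email)
        SELECT 'user' || i || '@example.com'
        FROM generate_series(1, 1000000) AS i;

        EXPLAIN (ANALYZE)
        SELECT * FROM users WHERE email = 'user500000@example.com';
        -- a Seq Scan here tells you the index you assumed exists doesn't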
| dylanz wrote:
| It's worth noting that earlier versions of PostgreSQL didn't
| include the "AS NOT MATERIALIZED" option when specifying CTEs.
| In our setup, this had _huge_ hits to performance. If we were
| on a more recent version of PostgreSQL (I think 11 in this
| case), or if the query writer just used a sub-query instead of
| a CTE, we would have been fine.
| brasetvik wrote:
| Yep! A lot of older posts about CTEs largely advise against
| them for this reason.
|
| Postgres 12 introduced controllable materialization
| behaviour: https://paquier.xyz/postgresql-2/postgres-12-with-
| materializ...
|
| By default, it'll _not_ materialise unless it's recursive or
| referenced more than once.
|
| When not materializing, filters may push through the CTEs.
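|
| For example (a sketch; the table is hypothetical):
        WITH recent AS NOT MATERIALIZED (
            SELECT * FROM events
            WHERE created_at > now() - interval '1 day'
        )
        SELECT * FROM recent WHERE user_id = 42;
        -- the user_id filter can now be pushed down into the CTE's scan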
| skunkworker wrote:
| Also explain(analyze,buffers) is by far my favorite. It shows
| you the number of pages loaded from disk or cache.
|
| Also note: EXPLAIN just plans the query; EXPLAIN (ANALYZE) plans
| and runs the query, which can take a while in production.
| bajsejohannes wrote:
| Another great hidden feature is the LISTEN/NOTIFY system. You can
| use it to watch for changes on a table, and then use that to e.g.
| update a view in the UI.
|
| For example, create a table with a trigger on insert that does
| "NOTIFY new_data". Then, when querying, do:
        LISTEN new_data;
        SELECT ...;
|
| Now you'll get the results and any future updates.
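|
| A rough sketch of such a trigger (names are arbitrary; EXECUTE
| FUNCTION needs Postgres 11+, older versions use EXECUTE PROCEDURE):
        CREATE FUNCTION notify_new_data() RETURNS trigger AS $$
        BEGIN
            PERFORM pg_notify('new_data', NEW.id::text);
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER data_inserted AFTER INSERT ON data
            FOR EACH ROW EXECUTE FUNCTION notify_new_data();

        LISTEN new_data;  -- this session now gets a notification per insert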
| boarnoah wrote:
| The upsert strategy mentioned (using ON CONFLICT to update
| instead of insert) has a catch: if your PK is backed by a sequence,
| the sequence still increments even when the insert falls through to
| the update.
|
| It's not a huge deal in most cases, but a bigger issue if you are
| running any sort of scheduled ETL to update (or insert new)
| things.
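|
| A small sketch that shows the gap (the table is hypothetical):
        CREATE TABLE kv (id serial PRIMARY KEY, k text UNIQUE, v text);
        INSERT INTO kv (k, v) VALUES ('a', '1')
            ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
        INSERT INTO kv (k, v) VALUES ('a', '2')
            ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
        SELECT currval('kv_id_seq');  -- 2, even though only one row exists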
| orangepanda wrote:
| Does it matter? In case of rollbacks, the sequence would
| increment too. You have to handle gaps anyway.
| crorella wrote:
| Great article, a lot of useful commands for those who do their
| programming in the DB.
| freeest-will wrote:
| Careful with DISTINCT ON, it can be extremely slow, like 100x
| slower than a verbose and less readable alternative. But it might
| not matter in your use case.
| CabSauce wrote:
| I don't believe this is correct for postgres. It's true for
| other RDBMS, I believe.
| <https://medium.com/@smitagudale712/distinct-vs-group-by-
| in-s...>
| freeest-will wrote:
| I encountered this problem in Postgres 13.
|
| https://blog.timescale.com/blog/how-we-made-distinct-
| queries...
| ForHackernews wrote:
| Surprised to not see any mention of Postgres native pub/sub
| messaging: https://vijayg.com/postgresql-pubsub/
|
| Is that feature considered well-known? Or is it so obscure this
| author didn't know about it?
| cldellow wrote:
| Expanding on the regex trick: I often use [.] instead of \\. to
| match a period.
|
| They achieve the same thing, but [.] avoids the problem of your
| host language "helpfully" interpolating \\. into . before sending
| the query to Postgres.
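|
| E.g. (a sketch):
        SELECT * FROM users WHERE email ~ '@gmail[.]com$';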
| micw wrote:
| I may be wrong, but isn't "overlapping" just
| "meeting_a.starts_at<meeting_b.ends_at &&
| meeting_a.ends_at>meeting_b.starts_at"?
|
| Or in words: There is an overlap if meeting A starts before
| meeting B ends and meeting A ends after meeting B starts.
|
| So the scenario described is made to look way more complex than
| it actually is.
| jpitz wrote:
| Well, for this feature, it means that meeting A and meeting B
| share some point in time (except their boundaries).
| micw wrote:
| I have tried this with the condition I described and don't see
| any case where it fails.
| jayd16 wrote:
| You'd fail on the case where the times are exactly the same.
| KarlKemp wrote:
| I believe it's about the common case of two events sharing a
| single (infinitely small) point of time: meeting A from 3 to 4,
| followed by meeting B from 4 to 5.
|
| Those two time periods do overlap mathematically, but not
| physically: they share the common point of "4", even if that
| point has duration 0.
|
| I know the iCal standard goes into this for a bit.
| micw wrote:
| I get the exactly same result with that query as the "overlap"
| query:
|
| SELECT * FROM meetings, new_meetings WHERE
| new_meetings.starts_at<meetings.ends_at AND
| new_meetings.ends_at>meetings.starts_at;
| jrumbut wrote:
| I've seen similarly titled articles and they are usually full of
| the features that aren't used for a reason (MONEY column type,
| etc).
|
| It's nice to see one that's full of good advice.
| gmac wrote:
| On this first point -- 'Get the Number of Updated and Inserted
| Rows in an Upsert' -- unfortunately this `xmax` method is not
| always reliable and there does not seem to be any reliable
| alternative.
|
| I posted to pgsql-hackers about this problem here:
| https://www.postgresql.org/message-id/DE57F14C-DB96-4F17-925...
| jakearmitage wrote:
| I love those kinds of posts. Always something that you keep for
| life.
| picardo wrote:
| > Grant Permissions on Specific Columns
|
| This feature might be great for OLAP systems, but it's awful for
| OLTP systems. The last thing I want is to debug permission errors
| bubbling up from my database into the application layer. It may not
| be as efficient, but you should always manage your ACL rules in
| the application layer if you're building an OLTP app.
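|
| For reference, a sketch of the syntax in question (role and column
| names are made up):
        GRANT SELECT (id, email) ON users TO reporting_role;
        GRANT UPDATE (email) ON users TO support_role;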
| blowski wrote:
| The word "always" should always be avoided in architecture
| discussions.
| picardo wrote:
| I qualified it by saying "if you're building an OLTP app"
| though.
|
| I speak from personal experience. If you know a good usecase
| for ACL in database for OLTP workflows, I'm all ears.
| blowski wrote:
| Even then, it would be better phrased as a tradeoff. What's
| the downside of putting it in the database? I'd guess it makes
| horizontally scaling harder, and it's less portable to
| other database vendors. Is there an upside, like you've got
| stronger data security guarantees?
| picardo wrote:
| Good point. Phrasing it as a trade-off is the better
| approach. Agree with the downsides you listed, as well.
|
| Better security guarantees, for sure, but if an
| unauthorized user gains direct access to your database,
| you might have bigger problems from a security
| perspective.
| pfarrell wrote:
| Agree with all points here.
|
| An upside to using db to handle data permissions is the
| data has the same protections if you are accessing with
| direct SQL or using an app. Also, those permissions would
| persist with the data in restored backups.
|
| I'm not advocating this and think for OLTP, building RBAC
| in the app layer is almost always a better idea, but
| these would be some benefits.
| mcherm wrote:
| > If you know a good usecase for ACL in database for OLTP
| workflows, I'm all ears.
|
| Here's one that's meaningful to me. We have a single
| database with five different applications that access it,
| each of them managed by a separate team. By enforcing
| access constraints in the database we guarantee the access
| constraints will be applied in all cases. It is difficult
| to ensure that in application code managed by separate
| teams.
|
| (Just to be clear: I don't think your advice is poor, I
| just wanted to give an example of a case where it isn't
| universally applicable.)
| picardo wrote:
| I see, thanks for your input. Most of the workflows I
| worked on involved provisioning one database per app, so
| I hadn't entertained this angle.
|
| Are there any risks to changing ACL rules on a production
| database server from a stability perspective?
| NightMKoder wrote:
| One of the hardest types of queries in a lot of DBs is the simple
| `min-by` or `max-by` queries - e.g. "find the most recent post
| for each user." Seems like Postgres has a solution - `DISTINCT
| ON` - though personally I've always been a fan of how BigQuery
| does it: `ARRAY_AGG`. E.g.:
        SELECT user_id,
               ARRAY_AGG(STRUCT(post_id, text, timestamp)
                         ORDER BY timestamp DESC LIMIT 1)[SAFE_OFFSET(0)].*
        FROM posts
        GROUP BY user_id
|
| `DISTINCT ON` feels like a hack and doesn't cleanly fit into the
| SQL execution framework (e.g. you can't run a window function on
| the result without using subqueries). This feels cleaner, but I'm
| not actually aware of any other DBMS that supports `ARRAY_AGG`
| with `LIMIT`.
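|
| For comparison, the Postgres `DISTINCT ON` version of the same query
| (a sketch over the same hypothetical posts table):
        SELECT DISTINCT ON (user_id) user_id, post_id, text, timestamp
        FROM posts
        ORDER BY user_id, timestamp DESC;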
| bbirk wrote:
| I would use a lateral join. Alternatively you can use a window
| function instead of a lateral join, but from my experience lateral
| joins are usually faster for this kind of top-N query in postgres.
|
|     select user_table.user_id, tmp.post_id, tmp.text, tmp.timestamp
|     from user_table
|     left outer join lateral (
|         select post_id, text, timestamp
|         from post
|         where post.user_id = user_table.user_id
|         order by timestamp desc
|         limit 1
|     ) tmp on true
|     order by user_id
|     limit 30;
| mnahkies wrote:
| I've generally used `PARTITION BY` to achieve the same effect
| in big query (similar to what was suggested in the article for
| postgres).
|
| Does the `ARRAY_AGG` approach offer any advantages?
| NightMKoder wrote:
| I assume you mean `ROW_NUMBER() OVER (PARTITION BY ... ORDER
| BY ...) as row_number` with a `WHERE row_number = 1`. We ran
| into issues with memory usage (i.e. queries erroring because
| of too high memory usage).
|
| The `ARRAY_AGG` approach uses a simple heap, so it's as
| efficient as MIN/MAX because it can be computed in parallel.
| ROW_NUMBER however needs all the rows on one machine to
| number them properly. ARRAY_AGG combination is associative
| whereas ROW_NUMBER combination isn't.
| mnahkies wrote:
| That makes sense, I might have a play with the array agg
| approach. I'm especially curious if it has any impact on
| slot time consumed
___________________________________________________________________
(page generated 2021-11-09 23:00 UTC)