[HN Gopher] The many faces of DISTINCT in Postgres (2017)
___________________________________________________________________
The many faces of DISTINCT in Postgres (2017)
Author : rsecora
Score : 153 points
Date : 2023-05-22 09:46 UTC (13 hours ago)
(HTM) web link (hakibenita.com)
(TXT) w3m dump (hakibenita.com)
| kubota wrote:
| Oracle listaggs do support distinct, since 19c.
| cpursley wrote:
| I only learned about the ranked query approach last week thanks
| to ChatGTP. Helped me solve a hairy query that rolled up activity
| events grouped by time periods. Before that I was struggling with
| distinct (and it was slow).
|
| I've avoided ChatGTP until recently and at least for SQL
| refactoring, it's great. The interesting part is the ranked
| example ChatCTP gave me was almost identical to the one in this
| post. I wonder if they're (ChatGTP) is training up on technical
| blog posts.
| swader999 wrote:
| I throw it lines of logs of sql from Paper trail and ask it to
| extract the SQL statements. Then I give it the analyse explain
| output. Sometimes ChatGPT will give me great advice on
| optimizing.
| cpursley wrote:
| Dang, that's a good idea (as long as not production data).
|
| It seems like eventually database automation is going be
| self-optimizing (more than the query planner already is).
| xupybd wrote:
| MSSQL does have something closer to array agg.
|
| https://www.mssqltips.com/sqlservertip/5542/using-for-xml-pa...
| veddan wrote:
| There's also JSON_ARRAYAGG.
|
| https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions....
| pophenat wrote:
| Microsoft SQL Server now also has IS [NOT] DISTINCT FROM.
| https://learn.microsoft.com/en-us/sql/t-sql/queries/is-disti...
| [deleted]
| codeflo wrote:
| > A classic job interview question is finding the employee with
| the highest salary in each department.
|
| Here's a cheat code, in case you need to write a database query
| and don't remember all the fancy join tricks: Problems like this
| often have a very straightforward solution with subselects. In
| this case, the main select gets the departments, and a subselect
| (with limit 1) fetches the top employee for each department.
|
| That's a very natural, compositional way of thinking. Granted,
| it's not the most optimized way to do it, but more often then
| not, the resulting query plan is perfectly fine.
| paol wrote:
| > It's not the most optimized way to do it, but more often then
| not, the query plan is perfectly fine.
|
| Depending on the details, the query plan might be exactly the
| same. I remember once (in SQL Server 2000, so long time ago)
| writing the same query in 3 completely different ways, only to
| see the DB generate the same plan for each one.
|
| It would surprise most people how sophisticated the query
| transformations databases can do.
| [deleted]
| dunham wrote:
| I had one case, with the same SQL server, where rewriting to
| a join made the query much faster. That was a data changing
| statement though, and I think the semantics demanded it rerun
| the subselect.
|
| More recently, I've found that mysql sometimes behaves poorly
| with subselects, so I tend to avoid them out of habit.
| emptyfile wrote:
| [dead]
| paulddraper wrote:
| > Granted, it's not the most optimized way to do it, but more
| often then not, the resulting query plan is perfectly fine.
|
| It usually is the exactly same. (MySQL has a potato for a query
| optimizer tho.)
|
| The biggest case it is suboptimal is when you need to produce
| multiple fields from the sub-select. Because then you need
| multiple sub-selects.
| foldr wrote:
| Also easy to do this with a lateral join (and a limit 1 in the
| subquery).
| giraffe_lady wrote:
| If you can do a lateral join off the top of your head this
| isn't the kind of question you need to have a strategy for
| anyway.
| foldr wrote:
| I think everyone tends to learn their own subset of SQL. I
| was unaware of some of the more 'obvious' solutions in the
| article (e.g. I had only a very vague idea of how window
| functions work), but I use lateral joins all the time.
|
| The nice thing about lateral joins is that they're very
| easy to understand conceptually once you get over the weird
| syntax.
| andrenotgiant wrote:
| How would you explain them to someone who knows the
| syntax but still finds them counterintuitive?
| foldr wrote:
| I just think of it as performing a subquery for every row
| of the main query.
| dagss wrote:
| To me a lateral join is the simplest construct you could
| have, like a for loop in a programming language.
|
| For each row in my query so far, do this other thing.
|
| Since I learned them, everything went from being a puzzle
| on its own, "I know what I want to do but how to express
| it in SQL" -- to simply being writing things out
| naturally..
| mwexler wrote:
| Fwir, BigQuery doesn't yet support lateral joins, though they
| do use the same concept in their UNNEST and array expansion
| features.
|
| It would be nice to have this feature; its a handy hammer for
| the right type of nail.
| neallindsay wrote:
| The author mentions having to get over the lack of upsert when
| moving from Oracle. But readers might like to know this isn't a
| problem anymore since Postgres got "INSERT ... ON CONFLICT UPDATE
| ...".
| paulryanrogers wrote:
| And MERGE
| inferiorhuman wrote:
| Oh I'll add to the pile: the Postgres ON CONFLICT/MERGE stuff
| doesn't capture the " _select or insert if row doesn 't exist_"
| behavior some folks might want to use an upsert type thing for.
|
| Why? You cannot access the row id or any values within the row
| as a return value unless the row changes. The top stackexchange
| answer goes into some detail about why you really don't want to
| simply make a dummy update in Postgres.
|
| You can access these old row values in some scope for the no-
| update case, and this allows you to leverage CTEs instead of
| dummy updates. However... you can't use CTEs with prepared
| statements (and it's a non-obvious fail for the uninitiated).
| So yeah there are still lots of little sharp edges.
| jadbox wrote:
| > "select or insert if row doesn't exist" behavior
|
| I've hit this many, many times. It's such a common need, and
| yet I keep having to write it out as multiple queries to
| "select row" and if no result, do then insert query.
| paulddraper wrote:
| It's true.
|
| Fortunately the records are locked, so a second query will
| pick them up. Still sad tho.
|
| Prepared statements work fine with CTEs.
| cdogl wrote:
| I am a huge Postgres fan, but "ON CONFLICT UPDATE" does not
| always cut the mustard. You can't target conflicts against
| multiple constraints, so if there are multiple constraints that
| could cause conflicts you're stuck either figuring out whether
| you can drop a constraint or doing something clever in your
| client.
|
| This is often a smell, but it's a little inflexible.
| paulddraper wrote:
| > You can't target conflicts against multiple constraints,
|
| I've never needed that, and it's not clear why I would.
|
| Like, my new record is a duplicate of record A according to
| one constraint and record B according to another constraint,
| so it updates both A and B?
|
| One "upsert" but two resulting records?
|
| Can someone fill me in on when this would crop up?
| richbell wrote:
| In my experience this tends to be an annoyance when you
| want to upsert against a table that has multiple unique
| constraints.
|
| For example, let's say you're tracking GitHub repositories
| and have a table `repository(id, gh_id, gh_node_id, owner,
| name, description)` where `gh_id` and `gh_node_id` are both
| unique.
|
| If you want to insert or update a repository you might want
| to do something like below, however, this is not a valid
| syntax and as you need to define a separate `DO UPDATE` for
| `gh_id` and `gh_node_id`: INSERT INTO
| repository (gh_id, gh_node_id, owner, name, description)
| VALUES (:id, :node_id, :owner, :name, :description)
| ON CONFLICT DO UPDATE SET name = excluded.name,
| description = excluded.description;
|
| ------
|
| To my knowledge there's no way to define a single
| constraint `UNIQUE(gh_id || gh_node_id)` instead of
| `UNIQUE(gh_id && gh_node_id)`.
| paulddraper wrote:
| So....what is gh_id and gh_node_id?
|
| Like, you have gh_id | gh_node_id
| ------------------ 1 | 2 2
| | 1
|
| And then you want to "upsert" (1, 1).
|
| You want both records to be updated?
|
| I'm still struggling to see why you would want to do a
| upsert relative to multiple unique constraints at once.
| richbell wrote:
| In addition to this, `ON CONFLICT` is still susceptible to
| race-conditions if you have a highly concurrent application.
| This may be obvious to some, however, a lot of people expect
| upsert to be atomic and get bit by this.
|
| I also dislike that `ON CONFLICT ... DO NOTHING` increments
| numerical primary keys. I understand _why_ it happens but it
| seems counter-intuitive given the name "DO NOTHING". (And,
| yes, relying on the values of primary keys to never change is
| an anti-pattern. However, if you have a table of 10 values
| and the primary keys have massive gaps like 1, 500_211,
| 2_521_241, 15_631_121, etc., it _feels weird_ nonetheless.)
| jack_squat wrote:
| From the Postgres docs,
|
| ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE
| outcome; provided there is no independent error, one of
| those two outcomes is guaranteed, even under high
| concurrency. This is also known as UPSERT -- "UPDATE or
| INSERT".
|
| https://www.postgresql.org/docs/current/sql-insert.html
|
| What are you referring to?
| neallindsay wrote:
| It increments the sequence associated with with a
| "serial" or "bigserial" field--usually used for primary
| keys. People are often surprised by this because they
| expect their primary keys to be sequential and sequences
| are designed to leave gaps in order to avoid a lot of
| locking.
| hn_throwaway_99 wrote:
| That just seems like a fundamental misunderstanding of
| sequences to me. They are guaranteed to be increasing but
| not necessarily sequential.
| richbell wrote:
| I think it's more confusion of how `ON CONFLICT DO
| NOTHING` works. Incrementing sequences is not _" doing
| nothing"_, even if it's valid and sensible behaviour,
| which leads to confusion.
| richbell wrote:
| If I recall correctly (and it has been a while, so I'm
| not saying I am), the issue was with concurrent
| transactions inserting a record into tableA and another
| into tableB which has a foreign key constraint to tableA.
| The issue was likely specific to `ON CONFLICT DO NOTHING`
| and not `ON CONFLICT DO UPDATE`.
|
| For example, let's saying you're building an index of
| open source packages and have two tables:
| package_type(id, name) and package(id, type_id,
| namespace, name).
|
| If you receive two concurrent requests for
| `maven://log4j:log4j` and `maven://io.quarkus:quarkus`, a
| naive implementation to insert both "maven" and the
| packages if they don't exist might look something like
| this: WITH type_id AS ( INSERT
| INTO package_type(name) VALUES (:type)
| RETURNING id ON CONFLICT DO NOTHING )
| INSERT INTO package (type_id, namespace, name)
| SELECT type_id, :namespace, :name FROM type
| ON CONFLICT DO NOTHING;
|
| However, one or both inserts can fail intermittently
| because the primary key for `package_type` will be auto-
| incremented and thus the foreign key won't be valid.
| Also, as mentioned in another comment[0] this won't work
| if `maven` already exists in the `package_type` table.
|
| [0]: https://news.ycombinator.com/item?id=36031790
| paulddraper wrote:
| I don't think this has to with concurrency; this query is
| fundamentally broken (besides typos), in that the CTE
| won't return anything if the package_type already exists.
|
| You have two options:
|
| (1) ON CONFLICT DO UPDATE, with dummy update:
| WITH type AS ( INSERT INTO
| package_type (name) VALUES ($1)
| ON CONFLICT (name) DO UPDATE SET name = excluded.name
| RETURNING id ) INSERT INTO package
| (type_id, namespace, name) SELECT id, $2, $3
| FROM type ON CONFLICT (type_id, namespace, name)
| DO UPDATE SET name = excluded.name RETURNING id;
|
| (2) Separate statements with ON CONFLICT DO NOTHING
| (could be in a UDF if desired): INSERT
| INTO package_type (name) VALUES ($1) ON
| CONFLICT DO NOTHING; INSERT INTO package
| (type_id, namespace, name) SELECT type_id, $2, $3
| FROM package_type WHERE name = $1 ON
| CONFLICT DO NOTHING; SELECT id FROM
| package WHERE (type_id, namespace, name) = ($1,
| $2, $3);
| jack_squat wrote:
| Atomicity doesn't mean "doesn't fail", it means "either
| fails or succeeds, but does not succeed halfway".
|
| There is nothing about what you are describing that is
| different from the behavior you'd get from a regular
| insert or update. If two transactions conflict, a
| rollback will occur. That isn't violating atomicity. In
| fact, it is the way by which atomicity is guaranteed.
|
| The behavior of sequence values getting incremented and
| not committed, resulting in gaps in the sequence, is a
| separate matter, not specific to Postgres or to upsert.
| hn_throwaway_99 wrote:
| I think that actually more to the root of the problem, as
| other folks have noted, is that `ON CONFLICT DO NOTHING`
| means the RETURNING clause returns no rows if there is a
| conflict, which in my experience is rarely what people
| want. So instead people do `ON CONFLICT DO UPDATE` with a
| no-op update which has performance/locking implications,
| otherwise they need to do a complicated query (search
| stack overflow).
|
| I wish that postgres would add some sort of backwards
| compatible option like `ON CONFLICT DO NOOP` or `ON
| CONFLICT DO RETURN` so that you got the semantics of `DO
| NOTHING` except that the conflicted rows are returned.
| paulddraper wrote:
| It's as resilient to race conditions as it could be. Does
| exactly what you'd expect, no?
|
| That is annoying about sequences, I agree.
| richbell wrote:
| > It's as resilient to race conditions *as it could be*.
| Does exactly what you'd expect, no?
|
| While that that may be true, based on the hours I've
| spent scouring Stack Overflow and GitHub issues it seems
| that many people don't realize that it's only 99.999%
| resilient.
| vore wrote:
| It is 100% resilient short of your hard drive going
| through a microwave. It is almost definitely an
| application logic error if you are observing non-
| atomicity. What would non-atomicity even look like here?
| Only some columns being written?
| paulddraper wrote:
| Could you describe what you mean? What phenomenon? For
| the default transaction isolation level, or a different
| one?
| mattashii wrote:
| Yes, that's been a thing since 9.5, which was released in early
| 2016.
| justinclift wrote:
| 2017
| cpursley wrote:
| And still helpful as ever.
| ayhanfuat wrote:
| I was thinking to myself "there is no way Haki Benita has just
| discovered `distinct on`".
| zzzeek wrote:
| I've never gotten into DISTINCT ON and it's always confused me,
| I'd rather see the query with MAX and GROUP BY if I'm looking for
| "the highest X in groups of Y". For the same reason I don't
| prefer RSA-in-three-lines.
| maweki wrote:
| "(easy CS students, I know it's not normalized...)"
|
| Sure it is. As long as you're not storing any other information
| on the department in the employee table.
| daigoba66 wrote:
| And as long as you never rename the departments.
| maweki wrote:
| That has no bearing on normalization.
| Izkata wrote:
| It's an almost verbatim example of getting to 1NF, the
| first and most basic normalization. The value (department
| name) is repeating and should be extracted and given its
| own ID.
| jack_squat wrote:
| 1NF bans relation-valued attributes, not repetition of
| attribute value across tuples in a relation. Mainstream
| SQL databases don't support relation-valued attributes,
| so any table you make in a relational database is 1NF.
|
| You can push back on this a little - for instance maybe
| you consider an array-valued attribute to be enough like
| a relation to argue array-valued attributes violate 1NF.
| But if you do that you must also explain what makes
| arrays different from strings, since strings are pretty
| similar to arrays of characters and can be treated the
| same way in most respects (for instance characters in a
| string can be addressed by index or split into
| substrings).
| dymk wrote:
| Only if there's a separate Departments table, which for
| this very simple example, there isn't
| maweki wrote:
| Even if there were a separate Departments table, who is
| to say that the Department Name is not its primary key?
| The Department name certainly is A key.
| [deleted]
| maweki wrote:
| > The value (department name) is repeating and should be
| extracted
|
| Then the id would be repeating. Furthermore, the
| department name would make a fine primary or alternate
| key for the new relation you're proposing.
|
| Also, that's not what 1NF is. 1NF means there should be
| no table-valued attributes. And neither is any column
| list-valued nor does any subset of columns form a
| subtable.
|
| The other normal forms talk about functional dependencies
| and there aren't any.
|
| The only possible violation of 1NF could be not splitting
| the name in given name and family name. Other than that,
| the table is normalized.
| Izkata wrote:
| > Then the id would be repeating.
|
| Yes, but it's an ID, not a value. No problems there.
|
| > Furthermore, the department name would make a fine
| primary or alternate key for the new relation you're
| proposing.
|
| They're called "natural keys" and there's a lot of
| problems with them not actually mapping to identity. Like
| for one example, if a department's name is changed, it
| isn't just a change to the database column, you have to
| update all associated code as well - which is why you
| should have an ID and use that in the code anyway.
|
| > Also, that's not what 1NF is. 1NF means there should be
| no table-valued attributes. And neither is any column
| list-valued nor does any subset of columns form a
| subtable.
|
| If there's only one such column and you switch
| perspective to the inner table, the transformation is the
| same. That's why I said "almost" - it's not exactly the
| same, there are additional conditions, but if you hit
| them it's the same thing and the result is extracting the
| duplicate values into their own table, linking them with
| an ID instead.
| maweki wrote:
| > Yes, but it's an ID, not a value.
|
| In normalization theory and relational algebra an ID is
| just a value. DBMSs make no difference between this
| column and any other primary key column(s) and they make
| no difference between PK indexes over strings or numbers
| or any other supported data type.
|
| You're just cargo culting here instead of applying
| database theory or actually looking at implementations.
___________________________________________________________________
(page generated 2023-05-22 23:00 UTC)