[HN Gopher] How to check if two SQL tables are the same
___________________________________________________________________
How to check if two SQL tables are the same
Author : remywang
Score : 238 points
Date : 2023-07-27 06:22 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| frhack wrote:
| Quickly Compare Data in Two Tables Using CHECKSUM and
| CHECKSUM_AGG
|
| https://sqlundercover.com/2018/12/18/quickly-compare-data-in...
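|
| A minimal sketch of the idea on SQL Server (assuming two
| tables t1 and t2 with comparable column types):
|
|     -- equal aggregates suggest, but don't prove, equal
|     -- tables; CHECKSUM_AGG is order-insensitive, and hash
|     -- collisions are possible
|     SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM t1;
|     SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM t2;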
| amtamt wrote:
| CHECKSUM and CHECKSUM_AGG do not work for all data items; at
| least on SQL Server, "noncomparable data types" for CHECKSUM
| and NULLs for CHECKSUM_AGG are show-stoppers.
|
| Also, CHECKSUM/CHECKSUM_AGG do not appear to be standard SQL
| functions; see
| https://www.postgresql.org/docs/current/features.html and
| https://en.wikipedia.org/wiki/SQL:2023#New_features.
| LeonB wrote:
| I remember casting some types to varbinary and then using
| fn_repl_hash_binary() to generate the hashes.
|
| And by "remember" I mean I wrote it down here so that I
| wouldn't have to - https://til.secretgeek.net/sql_server/bulk
| _comparison_with_h...
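|
| Roughly this shape, assuming hypothetical columns col1 and
| col2 on table t1:
|
|     -- hash each row by casting a delimited concatenation of
|     -- its columns to varbinary
|     SELECT master.sys.fn_repl_hash_binary(
|                CAST(CONCAT(col1, '|', col2) AS varbinary(max)))
|     FROM t1;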
| poulpi wrote:
| If the issue comes up a lot, there is also:
| https://github.com/datafold/data-diff
|
| It's a nice tool for doing this across databases as well.
|
| I think it's based on the checksum method.
| Pxtl wrote:
| Honestly, if the resultsets are small enough, I just dump them
| to JSON and diff the files. But it has to be fully
| deterministically sorted for that (in a sane world "order by *"
| would be valid ANSI SQL).
| hichkaker wrote:
| Thank you for mentioning Data Diff! Founder of Datafold here.
| We built Data Diff to solve a variety of problems that we
| encountered as data engineers: (A) Testing SQL code changes by
| diffing the output of production/dev versions of a SQL query. (B)
| Validating that data is consistent when replicating data
| between databases.
|
| Data Diff has two algorithms implemented for diffing in the
| same database and across databases. The former is based on
| JOIN, and the latter utilizes checksumming with binary search,
| which has minimal network IO and database workload overhead.
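|
| The cross-database algorithm is roughly "checksum key ranges,
| then recurse into the ranges that differ". A hand-wavy sketch
| of the bucketing step (not Data Diff's actual implementation),
| in Postgres syntax, assuming a numeric key column id:
|
|     -- compare these per-bucket checksums across databases,
|     -- then subdivide only the buckets that mismatch
|     SELECT width_bucket(id, 1, 1000000, 16) AS bucket,
|            sum(hashtext(t::text)) AS chk
|     FROM t1 t
|     GROUP BY bucket
|     ORDER BY bucket;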
| sinuhe69 wrote:
| In any normal/half-way well-designed application, no row should
| be duplicated within a table, so a comparison with EXCEPT is a
| valid solution. And it's often used in practice.
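|
| A minimal sketch, ignoring duplicate rows:
|
|     -- both queries returning zero rows means the tables are
|     -- equal as sets
|     SELECT * FROM t1 EXCEPT SELECT * FROM t2;
|     SELECT * FROM t2 EXCEPT SELECT * FROM t1;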
| remus wrote:
| > In any normal/half-way well-designed application, no row
| should be duplicated within a table...
|
| Sure, but there are plenty of poorly designed databases out
| there!
| jhoechtl wrote:
| I think since NULL <> NULL, you can never be sure.
| tremon wrote:
| But (select null intersect select null) still gives you a 1-row
| result set, so there are ways to be sure.
| Pxtl wrote:
| But violating the reflexive property of equality is a good
| thing because math!
| t1234s wrote:
| Navicat is very useful for this
| ibejoeb wrote:
| If you have a total ordering and can get away with pulling the
| entire contents, just hash it. You don't have to contend with any
| of the sets stuff.
|
| I suppose there are some other edges, like if you're storing
| floats, but that's not so different no matter what technique you
| wind up with.
| photon_lines wrote:
| Yeah - I was going to say the same thing. Hashing would be the
| way to go if you're going for runtime and efficiency. You
| select the elements in an ordered manner, though, which makes
| it O(n*log(n)), and then produce a hash value for each table
| (or each row in the table) -- I'm trying to think if there is
| a better way of doing it in less time, but I'm not seeing it
| yet. You could do this in O(n) or even constant time, but that
| adds a bit of complexity to the solution, I think.
| Someone wrote:
| If you hash each row in isolation and XOR the hashes of all
| rows, you don't need the ordering. I also think XOR-ing
| doesn't lose you 'resolving power' for this problem, compared
| to hashing the hashes or hashing everything in one go.
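|
| In MySQL syntax that might look like this sketch, assuming
| hypothetical columns a and b (but see the caveats below):
|
|     -- XOR of per-row hashes; order-insensitive, so just
|     -- compare the two scalar results
|     SELECT BIT_XOR(CRC32(CONCAT_WS('|', a, b))) FROM t1;
|     SELECT BIT_XOR(CRC32(CONCAT_WS('|', a, b))) FROM t2;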
| sibrahim wrote:
| XOR is not a great choice here. Consider that 2 copies of a
| row give the same result as 0 (or 4, 6, etc). And even
| without multiple copies of rows, you can force any hash
| you'd like by observing what happens when you insert more
| random rows and finding a subcollection that flips exactly
| the bits you want.
|
| What you probably want to look at is homomorphic hashing.
| This is usually implemented by hashing each row to an
| element of an appropriate abelian group and then using the
| group operation to combine them.
|
| With suitable choice of group, this hash can have
| cryptographic strength. Some interesting choices here are
| lattices (LtHash), elliptic curves (ECMH), multiplicative
| groups (MuHash).
| Someone wrote:
| > XOR is not a great choice here. Consider that 2 copies
| of a row give the same result as 0 (or 4, 6, etc).
|
| That indeed is a major flaw. You have to use another
| commutative operation that doesn't destroy entropy.
| Addition seems a good choice to me.
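|
| E.g. in Postgres, as a sketch (not collision-proof):
|
|     -- sum() is commutative, so row order doesn't matter,
|     -- and, unlike XOR, duplicate rows change the result
|     SELECT sum(hashtext(t::text)) FROM t1 t;
|     SELECT sum(hashtext(t::text)) FROM t2 t;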
|
| > And even without multiple copies of rows, you can force
| any hash you'd like
|
| I don't see how that matters for this problem.
| sibrahim wrote:
| It depends on whether you are doing something security
| critical with the result.
|
| Maybe you have a trusted table hash but only a user-
| supplied version of the table. Before you use that data
| for security sensitive queries, you should verify it
| hasn't been modified.
|
| Basically, if you ever have to contend with a malicious
| adversary, things are, as usual, more interesting. If not,
| addition is likely fine (though 2^k copies of a row now
| leave the k lowest bits unchanged).
| photon_lines wrote:
| Ahhh nice - yup that's one solution and really elegant as
| well. Great suggestion :)
| [deleted]
| umanwizard wrote:
| Not sure I understand why implementing EXCEPT ALL is hard. You
| just do +1 for every row in the left side and -1 for every row
| in the right, and then consolidate (sketch below). Am I missing
| something?
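|
| Something like this sketch, assuming columns x and y:
|
|     -- rows with a nonzero net count differ in multiplicity
|     -- between the two tables
|     SELECT x, y, SUM(cnt) AS net
|     FROM (SELECT x, y, +1 AS cnt FROM t1
|           UNION ALL
|           SELECT x, y, -1 AS cnt FROM t2) s
|     GROUP BY x, y
|     HAVING SUM(cnt) <> 0;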
| nothrowaways wrote:
| The most modern way is to tokenize the database, or the two
| databases in case the tables belong to different databases, then
| you can pretrain a large language model. Fine tune it with the
| two tables, then apply prompt engineering to ask the ai model in
| natural language. No need to learn SQL!
| BrentBrewington wrote:
| In case you haven't tried dbt (www.getdbt.com / "Data Build
| Tool") - there's a whole package ecosystem that solves for things
| like this. The one that came to mind is "dbt-audit-helper":
| https://github.com/dbt-labs/dbt-audit-helper#compare_relatio...
|
| It's kind of like PyPI/DockerHub for SQL. Lots of cool stuff in
| there...here's link to the package hub: https://hub.getdbt.com/
| SOLAR_FIELDS wrote:
| _What I expected:_
|
| A neat pithy SQL trick
|
| _What I got:_
|
| A lesson in the Dark Arts from a wizard
| vertica12498 wrote:
| remywang you're a genius
| fastaguy88 wrote:
| I guess this is an interesting question, but it seems incredibly
| contrary to the whole point of SQL. Two tables should never be
| the same; if they were, why would you have two tables? I can
| understand why one might want to check if two datasets were the
| same, which is easy to do with sort and diff, but two tables?
| They shouldn't even have the same fields.
| code_runner wrote:
| Migrating data or related processes and running a couple of
| things in parallel to make sure nothing was lost in
| translation... some sort of data replication hacking maybe.
|
| Sometimes you want them to be equal, sometimes you want to know
| they aren't... but it's not out of the ordinary at all to want
| to check
|
| I don't think this article is expressing any opinions on
| normalization
| fluoridation wrote:
| It could be the case that the two tables represent two
| independent instantiations of a single or very similar concept.
| One might be a list of chefs and their DOBs, and another a list
| of robots and their manufacture date, and for whatever reason
| you want to see if both of them are identical. There's no
| reason to unify both into a single table if your data model
| doesn't presuppose any correlation between chefs and robots,
| but someone might want to see if all their chefs are robots and
| if all their robots work as chefs. Obviously it's a contrived
| example, but my point is that there exist cases where having
| two tables with the same columns makes sense, if they store
| unrelated data. That the data is unrelated doesn't mean that
| the tables _cannot_ contain identical data, it just means that
| they don't _necessarily_ contain identical data.
| RyanHamilton wrote:
| In the q language, it really is t1~t2 (match) for a full match,
| or t1=t2 for item-wise comparison. When you have a fully
| coherent programming language, it makes sense that this and
| many other things just work. E.g. in q there are no CTEs; you
| assign and use variables as you would in most languages.
|
| https://www.timestored.com/jq/online/?qcode=t1%3A(%5B%5D%20s...
| orf wrote:
| Slightly related: I recently needed to compare several large
| databases for schema differences. The output from `pg_dump` isn't
| very diff-friendly at all, so I built this[1]. It outputs schemas
| in JSON or text, and ensures a stable sort ordering.
|
| It also optionally masks index and constraint names, which might
| be auto-generated and not relevant to the comparison.
|
| 1. https://github.com/orf/diffable-sql
| wfriesen wrote:
| I've always used this. Any rows returned means there are
| differences:
|
|     select * from (
|         (select * from table1 minus select * from table2)
|         union all
|         (select * from table2 minus select * from table1)
|     )
| [deleted]
| globular-toast wrote:
| This is symmetric difference, but still has the problem that
| it's a set operation whereas in general a table is a bag
| (multiset).
| EddieJLSH wrote:
| Realistically which production DB tables don't have a unique
| id? Genuine question, never used one in my life.
| hobs wrote:
| PostTags in the published Stack Overflow schema -
| https://data.stackexchange.com/stackoverflow/query/edit/1772...
|
| It happens a lot when people are implementing something
| quick and often happens in linking tables.
| valenterry wrote:
| For example, tables that store huge amounts of log or sensor
| data, where IDs are not very useful and just increase space
| usage and decrease performance.
| dspillett wrote:
| Log analytics or warehouse tables often have no simple
| useful key for this sort of comparison.
|
| Also in a more general case you might be comparing tables
| that may contain the same data but have been constructed
| from different sources. Or perhaps a distributed dataset
| became disconnected and may have seen updates in both
| partitions, and you have brought them together to compare
| to try to decide which to keep or if it is worth trying to
| merge. In those and other circumstances there may be a key
| but if it is a surrogate key it will be meaningless for
| comparing data from two sets of updates, so you would have
| to disregard it and compare on the other data (which might
| not include useful candidate keys).
| erinnh wrote:
| It happens. I'm currently working on a project where the
| CRM tool I need to access for data actually does not have
| a unique id in its db. I have no idea if I will be able to
| successfully complete the project yet.
| justinclift wrote:
| Is there any chance that the rows actually do have a
| unique id, but it's not being displayed without some
| magic incantation?
|
| Asking because I've seen that before in some software,
| where it tries to "keep things simple" by default. But
| that behaviour can be toggled off so it shows the full
| schema (and data) for those with the need. :)
| erinnh wrote:
| Sadly, no.
|
| The manufacturer is just really incompetent.
|
| I was told their reason, when asked, was "it was easier
| (for us)".
| theodpHN wrote:
| Also, database tables where unique key constraints aren't
| enforced. Programming and operational mistakes happen. :-)
|
| https://stackoverflow.com/questions/62735776/what-is-the-
| poi...
| wodenokoto wrote:
| Don't things like BigQuery always allow duplicates?
| EddieJLSH wrote:
| Good point, have not used it before but looks like you
| have to add a unique ID if you want one
| DougBTX wrote:
| In theory, yes, however the vast majority of tables will have
| some form of unique ID in each record... so in practice,
| there's usually no difference. But if it must work for all
| tables...
| OskarS wrote:
| Does this work with the bag/multiset distinction that the
| author uses? Like, if table1 has two copies of some row and
| table2 has a single copy of that row, won't this query return
| that they're the same? But they're not: table1 has two copies
| of the same row, whereas table2 just has one?
| gregw2 wrote:
| I spent time researching this a while back.
|
| The duplicate row issue is part of why I don't use MINUS for
| table value comparisons, nor RECURSIVE like the original
| article suggests (which is not supported in all databases and
| is scarier for junior developers)... You can accomplish the same
| thing and handle that dupes scenario too, with just GROUP
| BY/UNION ALL/HAVING, using the following technique:
|
| https://github.com/gregw2hn/handy_sql_queries/blob/main/sql_...
|
| It will catch both if you have 1 row for a set of values in
| one table and 0 in another... or vice-versa... or 1 row for a
| set of values in one table and 2+ (dupes) in another.
|
| I have compared every row + every column value of billion-row
| tables in under a minute on a columnar database with this
| technique.
|
| Pseudocode summary explanation: create (via GROUP BY) a
| rowcount for every single set of column values you give it
| from table A, create that same rowcount for every single set
| of column values from table B, then compare whether those
| rowcounts match for all rows; it lets you know if they don't
| (sorted to make it easier to read when you do have
| differences). A nice, fast set-based operation.
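|
| A condensed sketch of the technique, assuming columns x and y
| (the linked version has more bells and whistles):
|
|     -- per-table rowcounts for every distinct set of values;
|     -- any output row is a difference, including dupes
|     SELECT x, y,
|            SUM(CASE WHEN src = 'A' THEN cnt ELSE 0 END) AS cnt_a,
|            SUM(CASE WHEN src = 'B' THEN cnt ELSE 0 END) AS cnt_b
|     FROM (SELECT x, y, 'A' AS src, COUNT(*) AS cnt
|             FROM table_a GROUP BY x, y
|           UNION ALL
|           SELECT x, y, 'B' AS src, COUNT(*) AS cnt
|             FROM table_b GROUP BY x, y) s
|     GROUP BY x, y
|     HAVING SUM(CASE WHEN src = 'A' THEN cnt ELSE 0 END)
|         <> SUM(CASE WHEN src = 'B' THEN cnt ELSE 0 END)
|     ORDER BY x, y;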
| LeonB wrote:
| I found that a weird edge case for the original author to
| fixate on. In mathematics or academia, sure, but in "real" SQL
| tables that serve any kind of purpose, duplicate rows are
| not something you need to support, let alone go to _twice_
| the engineering effort to support. Duplicates are more likely
| to be something you deliberately eradicate (before the
| comparison) than preserve and respect.
| deely3 wrote:
| > duplicate rows are not something you need to support
|
| I can imagine that you'd want to have duplicate rows in a
| log. If some event happens twice, you definitely want to
| log it twice.
| LeonB wrote:
| I have had it happen in real data from real systems - but
| it's not good -- "true duplicates" are a sign that
| something is missing from your schema or something has
| gone wrong earlier in the pipeline.
|
| Trying to delete duplicates (but leave 1 behind) is
| tricky in itself. I recorded notes on it one time here,
| using row_number() to act as the uniquifier:
| https://til.secretgeek.net/sql_server/delete_duplicate_rows....
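|
| The gist, as a SQL Server-style sketch assuming columns a
| and b define a duplicate:
|
|     -- number the copies within each duplicate group, then
|     -- delete everything after the first
|     WITH numbered AS (
|         SELECT *,
|                ROW_NUMBER() OVER (PARTITION BY a, b
|                                   ORDER BY (SELECT NULL)) AS rn
|         FROM t
|     )
|     DELETE FROM numbered WHERE rn > 1;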
| kmeisthax wrote:
| Your log tables don't have timestamps or line numbers on
| them?
|
| More generally (and formally) speaking, multisets violate
| normalization. Either you add information to the primary
| key to identify the copies or you roll the copies up into
| a quantity field. I can't think of any kind of data where
| neither of these would be good options.
| deely3 wrote:
| A primary key or quantity field could be not a perfect
| solution from a performance point of view, and timestamps
| do not guarantee uniqueness.
|
| Look, I'm not trying to win the argument. In most cases
| you're definitely right; my point is that sometimes you
| have to work with a working legacy code/system, and
| sometimes that system has some unique features.
| SoftTalker wrote:
| Logs usually have a timestamp that would differentiate
| the two events.
| vikingerik wrote:
| Not necessarily - the clock source for logging is often
| at millisecond resolution, but at the speed of modern
| systems you could pile up quite a few log entries in a
| millisecond.
|
| I handle this by having a guid field for a primary key on
| such tables where there isn't a naturally unique index in
| the shape of the data. So _something_ is unique, and you
| can delete or ignore other rows relative to that. (Just
| don't make your guid PK clustered; I use create-date or
| log-date for that.)
| SoftTalker wrote:
| Exactly. If you have two absolutely duplicate rows in a
| table, you're going to have problems with a lot of your
| queries. It's usually an indication that you are lacking a
| unique constraint or have a bug somewhere.
| mrweasel wrote:
| 15 years ago we just used Red Gate's SQL Compare. I think it's
| still SQL Server only, but this really seems like something
| that should have been solved for most mature databases by now.
| Pxtl wrote:
| The fact that Red Gate is still a necessary QOL add-on for a
| product that costs _that much_ should be shameful.
| Tostino wrote:
| I was sad to not see them move their tooling to multi-database.
| I had a couple calls with them ~5-6 years ago to talk about our
| needs and how they would be solved by a Postgres version of
| their tools (like SQL Compare), but apparently it never went
| anywhere.
| zerkten wrote:
| Having been a Red Gate customer in 2005, I decided to take a
| look at their current offerings. Based on
| https://www.red-gate.com/products/ it seems like they have
| started to look at other databases. The newest products seem
| to have support for at least 3-4 databases. They are maybe
| caught between rewriting a legacy product and building out
| the features in their successor products.
| theodpHN wrote:
| Comparing for exactness is one challenge, but just a start -
| identifying what has changed is a bigger challenge. And that's
| where the world of file/database comparators comes in. A somewhat
| niche but easy-to-use example of one with lots of options that
| others have sought to mimic in open source is SAS's PROC COMPARE,
| which can compare SAS datasets or other vendor's database tables
| (which SAS can effectively treat as if they're SAS datasets).
|
| SAS COMPARE Procedure Example 1: Producing a Complete Report
| of the Differences:
|
|     proc compare base=proclib.one compare=proclib.two printall;
|
| https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/...
|
| _____
|
| DataComPy (open-source python software developed by Capital One)
|
| DataComPy is a package to compare two Pandas DataFrames.
| Originally started to be something of a replacement for SAS's
| PROC COMPARE for Pandas DataFrames with some more functionality
| than just Pandas.DataFrame.equals(Pandas.DataFrame) (in that it
| prints out some stats, and lets you tweak how accurate matches
| have to be).
|
|     from io import StringIO
|     import pandas as pd
|     import datacompy
|
|     compare = datacompy.Compare(
|         df1, df2,
|         join_columns='acct_id',  # You can also specify a list of columns
|         abs_tol=0,               # Optional, defaults to 0
|         rel_tol=0,               # Optional, defaults to 0
|         df1_name='Original',     # Optional, defaults to 'df1'
|         df2_name='New'           # Optional, defaults to 'df2'
|     )
|     compare.matches(ignore_extra_columns=False)  # False
|
|     # This method prints out a human-readable report summarizing
|     # and sampling differences
|     print(compare.report())
|
| https://capitalone.github.io/datacompy/
| _ZeD_ wrote:
| https://stackoverflow.com/questions/5727882/check-if-two-sel...
| baal80spam wrote:
| I've worked with huge tables doing ETL and always used the
| MINUS clause for the purpose of comparing tables. I was using
| Teradata though, not sure if it's available in all DB engines.
| Multrex wrote:
| Oracle has the MINUS clause. Saved me many times.
| kohv wrote:
| 1. Write an SQL query to retrieve entities. Join the multi-row
| relational tables to the primary entity using a JSON aggregation
| function, so that all related data stays on a single row (see
| the sketch after this list).
|
| 2. Export the query results to a .csv file(s).
|
| 3. Utilize Go along with the encoding/csv package to process each
| CSV row. Construct an in-memory index mapping each entity ID to
| its byte offset within the CSV file.
|
| 4. Traverse the CSV again, using the index to quickly locate and
| read corresponding lines into in-memory structures. Convert the
| aggregated JSON columns to standard arrays or objects.
|
| 5. After comparing individual CSV rows, save the outcomes to a
| map. This map associates each entityID with metadata about
| columns that don't match.
|
| 6. Convert the mismatch map into JSON format for further
| processing or segmentation.
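|
| A sketch of step 1 in Postgres syntax, assuming hypothetical
| tables "orders" and "order_items":
|
|     -- one row per entity; related rows folded into a
|     -- deterministically ordered JSON array
|     SELECT o.id,
|            o.status,
|            json_agg(i ORDER BY i.id) AS items
|     FROM orders o
|     LEFT JOIN order_items i ON i.order_id = o.id
|     GROUP BY o.id, o.status;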
| kbenson wrote:
| Hmm, CONCAT all fields in each row into a string, hash that
| string, use that as a derived table where you order the rows
| definitively, and concat all rows from that derived table into a
| string and hash that string. Repeat for second table. Compare
| hashes. You can do it in a single query with two derived tables.
|
| You could do it without the inner hashing, but that seems more
| likely to exhaust memory if your rows are big enough, since
| you're basically hashing an entire table at that point and
| that's a large string in memory.
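|
| In MySQL syntax, a sketch of the two-level hashing, assuming
| columns a, b, c (and glossing over the delimiter problem
| discussed below):
|
|     -- hash each row, order the row hashes, hash the result;
|     -- watch out for group_concat_max_len truncation
|     SELECT MD5(GROUP_CONCAT(row_hash ORDER BY row_hash SEPARATOR ''))
|     FROM (SELECT MD5(CONCAT_WS('|', a, b, c)) AS row_hash
|           FROM t1) x;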
| feoren wrote:
| Description | Notes
|
| --------------------------
|
| Fruit | Flies Fast
|
| Fruit Flies | Fast
| dragonwriter wrote:
| Note that if you have non-fixed-length fields, this can in
| principle produce false positives.
| kbenson wrote:
| Yeah, I was just thinking of that, and how you'd probably
| want to concat with some level of structure to eliminate
| this. e.g. by adding a delimiter between fields and doing a
| find and replace on that delimiter (carefully, e.g. CSV rules
| with quotes and escapes). Whatever it takes to definitively
| differentiate foo/barfoo from foobar/foo. Maybe just
| prefixing every field with its length is sufficient.
|
| I'm sure there are some relevant papers on the simplest way to
| achieve this that I can and should look up. Hopefully they don't
| summarize as it being a much harder problem to do right. ;)
| dragonwriter wrote:
| Instead of CONCAT you could use something like JSON
| serialization, which preserves structure (and avoids
| reinventing the wheel on quoting, etc.) for the rows; it
| may not be maximally efficient, but it's probably the path
| of least resistance.
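|
| E.g. in Postgres, as a sketch:
|
|     -- serialize each row as JSON, hash it, then hash the
|     -- sorted row hashes into one table-level digest
|     SELECT md5(string_agg(h, '' ORDER BY h)) AS table_hash
|     FROM (SELECT md5(row_to_json(t)::text) AS h
|           FROM t1 t) s;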
| kbenson wrote:
| Yeah, as long as every RDBMS has a JSON serialization
| method now (do they? I'm not up on the SQL standards or
| where each RDBMS's implementation of them stands).
| Otherwise, if we're shipping our own serialization as a
| procedure built from core SQL functions, something simpler
| might be sufficient: replace backslash with double
| backslash, replace pipe with backslash-pipe, concatenate
| with pipes.
| LeonB wrote:
| I've worked on tools for this a bunch of times and one particular
| instance was my favorite. Checksums were used for equivalence -
| but the real fun was in displaying the differences in a
| meaningful way. We'd show:
|
| - which columns were only in the left, or only in the right
|
| - and then show which rows are only in the left, or only on the
| right
|
| - and then for rows that are in both but have some cell
| differences show a set that shows those differences. (Hard to
| explain how this was done... it was concise but rich with
| details.)
|
| This kind of "thorough" comparison was very useful for
| understanding the differences all at once.
| austin-cheney wrote:
| I have this old language aware diff tool I wrote that already
| supports CSV format, so what I would do is:
|
| 1) Check the record length of the two tables given your RDBMS
| API.
|
| 2) Check the column headers and data types of the tables are the
| same given your RDBMS API.
|
| 3) Only then would I export all triggers associated with the
| given tables using the RDBMS API and then compare that code using
| an external SQL language aware diff tool that excludes from
| comparison white space, comments, and other artifacts of your
| choosing.
|
| 4) Then finally if all else is the same I would export each table
| as CSV format and then compare that output in a CSV language
| aware diff tool.
|
| This can all be done with shell scripts.
| justinclift wrote:
| On the more practical side, SQLite has the "sqldiff" utility:
|
| https://www.sqlite.org/sqldiff.html
|
| By default it compares whole databases, but it can be told to
| only compare a specific table in each:
|
|     sqldiff -t mytable database1.sqlite database2.sqlite
| RajT88 wrote:
| Indeed.
|
| > why isn't it a standard feature of SQL to just compare two
| tables?
|
| Not enough people have complained about needing it (it doesn't
| hit the desired ROI for a PM to prioritize the feature
| request).
|
| SQLite's SQL Diff I first came across years ago - super useful.
| It's the perfect example of the industries which pop up to fill
| in the gaps that huge software vendors like Microsoft leave
| open. I used to work at a company valued at hundreds of
| millions of dollars, which made nothing but such products
| filling gaps.
| TheRealPomax wrote:
| SQL isn't any specific product for a PM to greenlight work on
| or not, it's a programming language (the L in SQL =) that a
| structured (the S in SQL =) data application can support as
| either "the" or "one of the" programming interfaces to query
| (the Q in SQL =) that data.
|
| So it's really a question of why SQL, the language, doesn't
| come with a table comparison command so that you can write
| statements akin to "SELECT * FROM (COMPARE table1 table2) AS
| diff WHERE ..."
| tuwtuwtuwtuw wrote:
| Wouldn't the process be that some PM wants a feature to be
| added and sends it off to their representative on the
| appropriate committee, who then pushes for it to be added?
|
| I mean, it's not the SQL language itself which sits on the
| committee specifying new features; it would be
| representatives from Oracle, Microsoft, IBM and similar...
| Right?
| tremon wrote:
| If some PM wants to add a debugger to C, do they write a
| language extension proposal or do they focus on putting
| their requirements in a separate tool?
|
| This is roughly the same problem. The feature exists, but
| it's rare enough that the only people that need it are
| programmers/DBAs that need to deep-dive into a specific
| issue. Regular people/applications will never need this
| feature, so why should it be baked in the core language?
| The specialists already have specialist tooling, so it
| makes much more sense to implement this as a toolkit
| feature than a language feature.
| murki wrote:
| isn't that what MS SQL Server's tablediff does?
| https://learn.microsoft.com/en-us/sql/tools/tablediff-utilit...
| hatware wrote:
| [dead]
| ttfkam wrote:
| I'm surprised FULL OUTER JOINs weren't discussed. Then I realized
| SQLite doesn't support them. Perhaps it should have been titled:
|
| "How to Check 2 SQLite Tables Are the Same"
|
| I think SQLite's great, but "fully featured SQL engine" is not
| one of its strengths. More like "perfectly adequate SQL engine"
| in an astoundingly compact operating envelope.
| cryptonector wrote:
| FULL OUTER JOINs were added in SQLite3 3.39.0. I've not checked
| if that includes NATURAL FULL OUTER JOIN.
|
| https://sqlite.org/releaselog/3_39_0.html
| go_prodev wrote:
| Normally EXCEPT or MINUS does a good enough job.
|
| For full comparisons of MSSQL tables, I often use Visual Studio
| which has a Data Comparison tool that shows identical rows and
| differences between source and target.
| wheelerof4te wrote:
| ""Fine," you say, "just generate the query and get back to work".
| Problem is, I don't feel like working today..."
|
| And after that, it naturally excalates quickly :)
|
| Never though I'd ever see SQL code golf, but here we are.
| cryptonector wrote:
| Using row values in PG:
|
|     SELECT count(q.*)
|     FROM (SELECT a, b
|           FROM table_a a
|           NATURAL FULL OUTER JOIN table_b b
|           WHERE a IS NOT DISTINCT FROM NULL
|              OR b IS NOT DISTINCT FROM NULL) q;
|
| This looks for rows in `a` that are not in `b` and vice-versa and
| produces a count of those.
|
| The key for this in SQL is `NATURAL FULL OUTER JOIN` (and row
| values).
| k2xl wrote:
| [flagged]
| jagged-chisel wrote:
| Did you test it?
| LargeDiggerNick wrote:
| I don't know why you're being downvoted. This is interesting.
| Did you test it yet?
| simonw wrote:
| Posting answers from GPT-4 etc on their own isn't interesting
| - they become interesting if you do the extra work to verify
| them first.
| Someone wrote:
| Because it's a (worse) variant of one of the first queries
| the article gives:
|
|     SELECT *, COUNT(*)
|     FROM t1
|     GROUP BY x, y, z, ...  -- all attributes of t1
|     EXCEPT
|     SELECT *, COUNT(*)
|     FROM t2
|     GROUP BY x, y, z, ...  -- all attributes of t2
|
| and we aren't discussing ChatGPT here.
| [deleted]
| abujazar wrote:
| How do you end up with two identical SQL tables in the first
| place?
| theodpHN wrote:
| Regression testing.
| remywang wrote:
| grading homework
| hypercube33 wrote:
| I immediately considered it for comparison of backups.
| amtamt wrote:
| or a host migration.
| abujazar wrote:
| Good point
| j16sdiz wrote:
| The SQL in the article is more like brain gymnastics.
|
| It is interesting, but not something you should use. It
| scales horribly with the number of columns.
| nxx788 wrote:
| [dead]
| crabbone wrote:
| Are tables the same if the schema differs in names only but the
| content is the same? Are tables different if one table has
| triggers defined on it where the other one doesn't? Same for
| various constraints? What if tables have infinite reference
| loops? Or if different foreign keys point to the "equal"
| rows? What if types
| in the schema don't match but values do (if interpreted as some
| third type)? Are currently active transactions taken into
| account?
|
| I find the idea of comparing two SQL tables weird / pointless in
| general. Maybe OK, if implemented with some very restrictive and
| well-defined semantics, but I wouldn't rely on a third-party tool
| to do something like this.
| remywang wrote:
| Author here! This was mostly intended for entertainment, and
| partly to demonstrate the idea of "relations as vectors" and
| "queries as polynomials". But I guess I'll indulge myself a bit
| more:
|
| > Just use sqldiff
|
| sqldiff is sensitive to ordering, e.g., it'll say the relation
| [1, 2] is different from [2, 1] (I consider them to be the same
| because they are the same multiset). You'd need to sort with
| ORDER BY first, but that also requires listing all attributes
| explicitly like the GROUP BY solution (ORDER BY * doesn't work).
|
| > What about CHECKSUM
|
| It's also sensitive to ordering, and I was told different tables
| can have the same CHECKSUM (hash collisions?).
|
| > Are the tables the same if they only differ by schema?
|
| I'd say no. Perhaps a better definition of "the same" is that all
| SQL queries (using "textbook" SQL features) return the same
| result over the tables, if you just replace t1 with t2 in the
| query. Wait, but how do you know if the results are the same...
| :)
|
| > There are better ways to compare tables
|
| Absolutely, my recursive query runs in time O(N^N), so I'm sure
| you can do a little better than that.
| tshaddox wrote:
| > Perhaps a better definition of "the same" is that all SQL
| queries (using "textbook" SQL features) return the same result
| over the tables, if you just replace t1 with t2 in the query.
| Wait, but how do you know if the results are the same... :)
|
| It's actually still a useful definition! (Assuming we're
| talking about all _deterministic_ SQL queries and can define
| precisely what we mean by that!)
|
| It's a useful definition because it includes all possible
| _aggregations_ as well, including very onerous ones like
| ridiculous STRING_AGG stuff. Those would almost certainly not
| be candidates for reasonable queries to solve your original
| problem, but they are useful in benchmarking whether a proposed
| solution is accurate.
| Pxtl wrote:
| tl;dr: SQL hates you and it's your fault it hates you.
|
| to elaborate: I'm rapidly appreciating that SQL is a "bondage
| and discipline" language where it is impossible for SQL to fail
| at any given task; it can only be failed by its
| users/developers.
|
| edit: further, it occurs to me also that SQL hates you because
| in a sane language you'd be able to write up some generic
| method that says "compare every element of these collections to
| each other" that was reusable for all possible collections. But
| try defining a scalar user-defined function that takes two
| arbitrary resultsets as parameters. But not only can I not do
| that, I'm a bad person for wanting that because it violates
| Relational Theory.
| tremon wrote:
| _because it violates Relational Theory_
|
| It doesn't violate relational theory at all -- it merely
| violates the query compiler's requirements about what must be
| known at the time of query compilation.
|
| You can absolutely write a query that does what you want (you
| need a stored procedure that generates ad-hoc functions based
| on the involved tables' signatures), but stored SQL objects
| must have defined inputs and outputs. What you're asking for
| is a meta-sql function that can translate to different query
| plans based on its inputs, and that's not allowed because its
| performance cannot be predetermined.
| BrentBrewington wrote:
| You should definitely check out dbt :)
|
| Some links in my comment:
| https://news.ycombinator.com/item?id=36911937
| bigbonch wrote:
| When you say "bag semantics" are you just referring to Set
| Theory? Is there a formal distinction between bags and sets?
| pkkim wrote:
| With bag (multiset) semantics, {1, 2, 2} is different from
| {1, 2}. With set semantics, they're the same.
| justinclift wrote:
| Not sure if Go code is your kind of thing. But if it is, my
| colleague wrote a SQLite "diff" capability for our online
| SQLite hosting service:
|
| https://github.com/sqlitebrowser/dbhub.io/blob/5c9e1ab1cfe0f...
|
| The code there can also output a "merge" object out of the
| differences too, in order to merge the differences from one
| database object into another.
| Pxtl wrote:
| I'm just sure there's some important relational theory reason why
| you're a bad person for wanting to do this and it's not a flaw in
| the language at all that "check the equivalence of two result-
| sets" is so difficult.
| SOLAR_FIELDS wrote:
| Ah yes, the Apple support approach: "Why would you want to do
| that?"
| remywang wrote:
| My guess is that t1 = t2 seems cheap but is actually expensive,
| so you might accidentally write very slow queries if it were
| built into the language.
| Pxtl wrote:
| Uh, that ship has sailed. Sooooo sailed. It is incredibly
| trivially easy to write unsargable queries in SQL.
| water9 wrote:
| That would be -1NF normalization
___________________________________________________________________
(page generated 2023-07-28 23:01 UTC)