[HN Gopher] How to check if two SQL tables are the same
       ___________________________________________________________________
        
       How to check if two SQL tables are the same
        
       Author : remywang
       Score  : 238 points
       Date   : 2023-07-27 06:22 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | frhack wrote:
       | Quickly Compare Data in Two Tables Using CHECKSUM and
       | CHECKSUM_AGG
       | 
       | https://sqlundercover.com/2018/12/18/quickly-compare-data-in...
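        | 
        | Presumably along these lines (a hedged T-SQL sketch; CHECKSUM
        | and CHECKSUM_AGG are SQL Server built-ins):
        | 
        |     -- If the aggregate checksums differ, the tables differ;
        |     -- equal checksums can still collide, as noted downthread.
        |     SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM table1;
        |     SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM table2;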
        
         | amtamt wrote:
          | CHECKSUM and CHECKSUM_AGG do not work for all data items; at
          | least on SQL Server, "noncomparable data types" for CHECKSUM
          | and NULLs for CHECKSUM_AGG are show-stoppers.
          | 
          | Also, CHECKSUM/CHECKSUM_AGG do not seem to be standard SQL
          | functions; see
          | https://www.postgresql.org/docs/current/features.html and
          | https://en.wikipedia.org/wiki/SQL:2023#New_features.
        
           | LeonB wrote:
           | I remember casting some types to varbinary and then using
           | fn_repl_hash_binary() to generate the hashes.
           | 
           | And by "remember" I mean I wrote it down here so that I
           | wouldn't have to - https://til.secretgeek.net/sql_server/bulk
           | _comparison_with_h...
        
       | poulpi wrote:
        | If the issue happens a lot, there is also:
        | https://github.com/datafold/data-diff
        | 
        | It's a nice tool for doing this across databases as well.
        | 
        | I think it's based on the checksum method.
        
         | Pxtl wrote:
          | Honestly, if the resultsets are small enough, I just dump them
          | to JSON and diff the files. But they have to be fully
          | deterministically sorted for that (in a sane world "order by *"
         | would be valid ANSI SQL).
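          | 
          | E.g. (a sketch, assuming PostgreSQL): dump each table with
          | something like
          | 
          |     SELECT row_to_json(t) FROM t1 AS t ORDER BY t::text;
          | 
          | redirect the two outputs to files, and diff those.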
        
         | hichkaker wrote:
         | Thank you for mentioning Data Diff! Founder of Datafold here.
         | We built Data Diff to solve a variety of problems that we
          | encountered as data engineers: (A) testing SQL code changes by
          | diffing the output of the production and dev versions of a SQL
          | query, and (B) validating that data is consistent when
          | replicating data between databases.
         | 
         | Data Diff has two algorithms implemented for diffing in the
         | same database and across databases. The former is based on
         | JOIN, and the latter utilizes checksumming with binary search,
         | which has minimal network IO and database workload overhead.
        
       | sinuhe69 wrote:
        | In any normal or halfway-well-designed application, no row
        | should be duplicated within a table, so a comparison with EXCEPT
        | is a valid solution. And it's often used in practice.
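        | 
        | For example (a minimal sketch; EXCEPT is standard SQL, and this
        | assumes both tables have the same columns):
        | 
        |     -- Empty results in both directions => same set of rows.
        |     SELECT * FROM t1 EXCEPT SELECT * FROM t2;
        |     SELECT * FROM t2 EXCEPT SELECT * FROM t1;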
        
         | remus wrote:
         | > In any normal/half-way good designed application, no row
         | should be duplicated within a table...
         | 
         | Sure, but there are plenty of poorly designed databases out
         | there!
        
       | jhoechtl wrote:
        | I think since NULL <> NULL you can never be sure.
        
         | tremon wrote:
         | But (select null intersect select null) still gives you a 1-row
         | result set, so there are ways to be sure.
        
         | Pxtl wrote:
         | But violating the reflexive property of equality is a good
         | thing because math!
        
       | t1234s wrote:
       | Navicat is very useful for this
        
       | ibejoeb wrote:
        | If you have a total ordering and can get away with pulling the
        | entire contents, just hash it. You don't have to contend with any
        | of the set stuff.
        | 
        | I suppose there are some other edge cases, like if you're storing
        | floats, but that's not so different no matter what technique you
        | wind up with.
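        | 
        | A minimal sketch of the order-then-hash idea, assuming
        | PostgreSQL (md5 and string_agg are built-ins; other engines
        | have equivalents):
        | 
        |     -- One digest per table; run the same query against t2 and
        |     -- compare. The ORDER BY supplies the total ordering.
        |     SELECT md5(string_agg(t::text, '|' ORDER BY t::text))
        |     FROM t1 AS t;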
        
         | photon_lines wrote:
         | Yeah - I was going to say the same thing. Hashing would be the
         | way to go if you're going for runtime and efficiency. You
          | select the elements in an ordered manner, though, which makes
          | it O(n log n), and then produce a hash value for each table (or
          | each row in the table) -- I'm trying to think if there is a
          | better way of doing it in less time right now, but I'm not
          | seeing it yet. You could do this in O(n) or even constant time,
          | but it adds a bit of complexity to the solution, I think.
        
           | Someone wrote:
           | If you hash each row in isolation and XOR the hashes of all
           | rows, you don't need the ordering. I also think XOR-ing
           | doesn't lose you 'resolving power' for this problem, compared
           | to hashing the hashes or hashing everything in one go.
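            | 
            | E.g. (a sketch, assuming PostgreSQL 14+, which added a
            | bit_xor aggregate; the casts turn an md5 prefix into a
            | bigint):
            | 
            |     SELECT bit_xor(
            |              ('x' || substr(md5(t::text), 1, 16))::bit(64)::bigint)
            |     FROM t1 AS t;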
        
             | sibrahim wrote:
             | XOR is not a great choice here. Consider that 2 copies of a
             | row give the same result as 0 (or 4, 6, etc). And even
             | without multiple copies of rows, you can force any hash
             | you'd like by observing what happens when you insert more
             | random rows and finding a subcollection that flips exactly
             | the bits you want.
             | 
             | What you probably want to look at is homomorphic hashing.
             | This is usually implemented by hashing each row to an
             | element of an appropriate abelian group and then using the
             | group operation to combine them.
             | 
             | With suitable choice of group, this hash can have
             | cryptographic strength. Some interesting choices here are
             | lattices (LtHash), elliptic curves (ECMH), multiplicative
             | groups (MuHash).
        
               | Someone wrote:
               | > XOR is not a great choice here. Consider that 2 copies
               | of a row give the same result as 0 (or 4, 6, etc).
               | 
               | That indeed is a major flaw. You have to use another
               | commutative operation that doesn't destroy entropy.
               | Addition seems a good choice to me.
               | 
               | > And even without multiple copies of rows, you can force
               | any hash you'd like
               | 
               | I don't see how that matters for this problem.
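                | 
                | Swapping addition in for XOR in the earlier sketch
                | (PostgreSQL-flavored; sum over bigint returns numeric
                | there, so overflow isn't a concern):
                | 
                |     SELECT sum(
                |              ('x' || substr(md5(t::text), 1, 16))::bit(64)::bigint)
                |     FROM t1 AS t;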
        
               | sibrahim wrote:
               | It depends on whether you are doing something security
               | critical with the result.
               | 
               | Maybe you have a trusted table hash but only a user-
               | supplied version of the table. Before you use that data
               | for security sensitive queries, you should verify it
               | hasn't been modified.
               | 
                | Basically, if you ever have to contend with a malicious
                | adversary, things are, as usual, more interesting. If not,
               | addition is likely fine (though 2^k copies of a row now
               | leave the k lowest bits unchanged).
        
             | photon_lines wrote:
             | Ahhh nice - yup that's one solution and really elegant as
             | well. Great suggestion :)
        
         | [deleted]
        
       | umanwizard wrote:
       | Not sure I understand why implementing EXCEPT ALL is hard. You
        | just do +1 for every row on the left side and -1 for every row on
       | the right, and then consolidate. Am I missing something?
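        | 
        | For concreteness, a sketch of what I mean (portable SQL; x and
        | y stand in for the shared column list):
        | 
        |     -- +1 per left row, -1 per right row, consolidated with
        |     -- GROUP BY; positive net counts are the EXCEPT ALL result.
        |     SELECT x, y, sum(cnt) AS net
        |     FROM (SELECT x, y, 1 AS cnt FROM t1
        |           UNION ALL
        |           SELECT x, y, -1 FROM t2) AS d
        |     GROUP BY x, y
        |     HAVING sum(cnt) <> 0;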
        
       | nothrowaways wrote:
       | The most modern way is to tokenize the database, or the two
       | databases in case the tables belong to different databases, then
       | you can pretrain a large language model. Fine tune it with the
       | two tables, then apply prompt engineering to ask the ai model in
       | natural language. No need to learn SQL!
        
       | BrentBrewington wrote:
       | In case you haven't tried dbt (www.getdbt.com / "Data Build
       | Tool") - there's a whole package ecosystem that solves for things
       | like this. The one that came to mind is "dbt-audit-helper":
       | https://github.com/dbt-labs/dbt-audit-helper#compare_relatio...
       | 
       | It's kind of like PyPI/DockerHub for SQL. Lots of cool stuff in
       | there...here's link to the package hub: https://hub.getdbt.com/
        
       | SOLAR_FIELDS wrote:
       | _What I expected:_
       | 
       | A neat pithy SQL trick
       | 
       |  _What I got:_
       | 
       | A lesson in the Dark Arts from a wizard
        
       | vertica12498 wrote:
       | remywang you're a genius
        
       | fastaguy88 wrote:
        | I guess this is an interesting question, but it seems incredibly
        | contrary to the whole point of SQL. Two tables should never be
        | the same; if they were, why would you have two tables? I can
        | understand why one might want to check whether two datasets were
        | the same, which is easy to do with sort and diff, but two tables?
        | They shouldn't even have the same fields.
        
         | code_runner wrote:
          | Migrating data or related processes and running a couple of
          | things in parallel to make sure nothing was lost in
          | translation... some sort of data replication hacking, maybe.
         | 
         | Sometimes you want them to be equal, sometimes you want to know
         | they aren't... but it's not out of the ordinary at all to want
         | to check
         | 
         | I don't think this article is expressing any opinions on
         | normalization
        
         | fluoridation wrote:
         | It could be the case that the two tables represent two
         | independent instantiations of a single or very similar concept.
         | One might be a list of chefs and their DOBs, and another a list
         | of robots and their manufacture date, and for whatever reason
         | you want to see if both of them are identical. There's no
         | reason to unify both into a single table if your data model
         | doesn't presuppose any correlation between chefs and robots,
         | but someone might want to see if all their chefs are robots and
         | if all their robots work as chefs. Obviously it's a contrived
         | example, but my point is that there exist cases where having
         | two tables with the same columns makes sense, if they store
         | unrelated data. That the data is unrelated doesn't mean that
         | the tables _cannot_ contain identical data, it just means that
        | they don't _necessarily_ contain identical data.
        
       | RyanHamilton wrote:
        | In the q language, it really is just t1~t2 for a full match or
        | t1=t2 for item-wise comparison. When you have a fully coherent
        | programming language, it makes sense that this and many other
        | things just work. E.g., in q there are no CTEs; it's just
        | assigning and using variables as you would in most languages.
       | 
       | https://www.timestored.com/jq/online/?qcode=t1%3A(%5B%5D%20s...
        
       | orf wrote:
       | Slightly related: I recently needed to compare several large
       | databases for schema differences. The output from `pg_dump` isn't
       | very diff-friendly at all, so I built this[1]. It outputs schemas
       | in JSON or text, and ensures a stable sort ordering.
       | 
       | It also optionally masks index and constraint names, which might
       | be auto-generated and not relevant to the comparison.
       | 
       | 1. https://github.com/orf/diffable-sql
        
       | wfriesen wrote:
        | I've always used this. Any rows returned means there are
        | differences:
        | 
        |     select * from (
        |       (select * from table1
        |        minus
        |        select * from table2)
        |       union all
        |       (select * from table2
        |        minus
        |        select * from table1)
        |     )
        
         | [deleted]
        
         | globular-toast wrote:
         | This is symmetric difference, but still has the problem that
         | it's a set operation whereas in general a table is a bag
         | (multiset).
        
           | EddieJLSH wrote:
           | Realistically which production DB tables don't have a unique
           | id? Genuine question, never used one in my life.
        
             | hobs wrote:
             | PostTags in the published Stack Overflow schema - https://d
             | ata.stackexchange.com/stackoverflow/query/edit/1772...
             | 
             | It happens a lot when people are implementing something
             | quick and often happens in linking tables.
        
             | valenterry wrote:
              | For example, tables that store huge amounts of logs or
              | sensor data, where IDs are not very useful and just
              | increase space usage and decrease performance.
        
             | dspillett wrote:
             | Log analytics or warehouse tables often have no simple
             | useful key for this sort of comparison.
             | 
             | Also in a more general case you might be comparing tables
             | that may contain the same data but have been constructed
             | from different sources. Or perhaps a distributed dataset
             | became disconnected and may have seen updates in both
              | partitions, and you have brought them together to compare,
              | to try to decide which to keep or whether it is worth
              | trying to merge. In those and other circumstances there may
              | be a key
             | but if it is a surrogate key it will be meaningless for
             | comparing data from two sets of updates, so you would have
             | to disregard it and compare on the other data (which might
             | not include useful candidate keys).
        
             | erinnh wrote:
             | It happens. I'm currently working on a project where the
              | CRM tool I need to access for data actually does not have
             | a unique id in its db. I have no idea if I will be able to
             | successfully complete the project yet.
        
               | justinclift wrote:
               | Is there any chance that the rows actually do have a
               | unique id, but it's not being displayed without some
               | magic incantation?
               | 
               | Asking because I've seen that before in some software,
               | where it tries to "keep things simple" by default. But
               | that behaviour can be toggled off so it shows the full
               | schema (and data) for those with the need. :)
        
               | erinnh wrote:
               | Sadly, no.
               | 
               | The manufacturer is just really incompetent.
               | 
                | I was told their reason, when asked, was "it was easier
                | (for us)".
        
             | theodpHN wrote:
             | Also, database tables where unique key constraints aren't
             | enforced. Programming and operational mistakes happen. :-)
             | 
             | https://stackoverflow.com/questions/62735776/what-is-the-
             | poi...
        
             | wodenokoto wrote:
             | Don't things like BigQuery always allow duplicates?
        
               | EddieJLSH wrote:
               | Good point, have not used it before but looks like you
               | have to add a unique ID if you want one
        
           | DougBTX wrote:
           | In theory, yes, however the vast majority of tables will have
           | some form of unique ID in each record... so in practice,
           | there's usually no difference. But if it must work for all
           | tables...
        
         | OskarS wrote:
         | Does this work with the bag/multiset distinction that the
         | author uses? Like, if table1 has two copies of some row and
          | table2 has a single copy of that row, won't this query return
          | that they're the same? But they're not: table1 has two copies
          | of the same row, whereas table2 just has one.
        
           | gregw2 wrote:
           | I spent time researching this a while back.
           | 
            | The duplicate row issue is part of why I don't use MINUS for
            | table value comparisons, nor RECURSIVE like the original
            | article suggests (it's not supported in all databases and is
            | scarier for junior developers)... You can accomplish the same
            | thing, and handle that dupes scenario too, with just GROUP
            | BY/UNION ALL/HAVING, using the following technique:
           | 
           | https://github.com/gregw2hn/handy_sql_queries/blob/main/sql_.
           | ..
           | 
           | It will catch both if you have 1 row for a set of values in
           | one table and 0 in another... or vice-versa... or 1 row for a
           | set of values in one table and 2+ (dupes) in another.
           | 
           | I have compared every row + every column value of billion-row
           | tables in under a minute on a columnar database with this
           | technique.
           | 
            | Pseudocode summary: create (via GROUP BY) a rowcount for
            | every single set of column values you give it from table A,
            | create the same rowcounts from table B, then compare whether
            | those rowcounts match for all rows, reporting the ones that
            | don't (sorted, to make the output easier to read when you do
            | have differences). A nice, fast, set-based operation.
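            | 
            | In sketch form (a hedged rendition, not necessarily the
            | linked query; x and y stand in for the full column list):
            | 
            |     -- Count rows per distinct set of values in each table,
            |     -- then keep the groups whose counts differ.
            |     SELECT x, y,
            |            sum(CASE WHEN src = 't1' THEN 1 ELSE 0 END) AS n1,
            |            sum(CASE WHEN src = 't2' THEN 1 ELSE 0 END) AS n2
            |     FROM (SELECT x, y, 't1' AS src FROM t1
            |           UNION ALL
            |           SELECT x, y, 't2' FROM t2) AS u
            |     GROUP BY x, y
            |     HAVING sum(CASE WHEN src = 't1' THEN 1 ELSE 0 END)
            |         <> sum(CASE WHEN src = 't2' THEN 1 ELSE 0 END)
            |     ORDER BY x, y;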
        
           | LeonB wrote:
           | I found that a weird edge case for the original author to
           | fixate on. In mathematics or academia sure, but in "real" sql
           | tables, that serve any kind of purpose, duplicate rows are
           | not something you need to support, let alone go to _twice_
           | the engineering effort to support. Duplicates are more likely
           | to be something you deliberately eradicate (before the
           | comparison) than preserve and respect.
        
             | deely3 wrote:
             | > duplicate rows are not something you need to support
             | 
              | I can imagine that you'd want to have duplicate rows in
              | logging. If some event happens twice, you definitely want
              | to log it twice.
        
               | LeonB wrote:
               | I have had it happen in real data from real systems - but
               | it's not good -- "true duplicates" are a sign that
               | something is missing from your schema or something has
               | gone wrong earlier in the pipeline.
               | 
               | Trying to delete duplicates (but leave 1 behind) is
               | tricky in itself. I recorded notes on it one time here --
                | using "row_number()" to act as the uniquifier: https://til
               | .secretgeek.net/sql_server/delete_duplicate_rows....
        
               | kmeisthax wrote:
               | Your log tables don't have timestamps or line numbers on
               | them?
               | 
               | More generally (and formally) speaking, multisets violate
               | normalization. Either you add information to the primary
               | key to identify the copies or you roll the copies up into
               | a quantity field. I can't think of any kind of data where
               | neither of these would be good options.
        
               | deely3 wrote:
                | A primary key and quantity field might not be a perfect
                | solution from a performance point of view. Timestamps do
                | not guarantee uniqueness.
                | 
                | Look, I'm not trying to win the argument. In most cases
                | you're definitely right; my point is that sometimes you
                | have to work with a working legacy code/system, and
                | sometimes that system can have some unique features.
        
               | SoftTalker wrote:
               | Logs usually have a timestamp that would differentiate
               | the two events.
        
               | vikingerik wrote:
               | Not necessarily - the clock source for logging is often
               | at millisecond resolution, but at the speed of modern
               | systems you could pile up quite a few log entries in a
               | millisecond.
               | 
               | I handle this by having a guid field for a primary key on
               | such tables where there isn't a naturally unique index in
               | the shape of the data. So _something_ is unique, and you
               | can delete or ignore other rows relative to that. (Just
                | don't make your guid PK clustered; I use create-date or
               | log-date for that.)
        
             | SoftTalker wrote:
             | Exactly. If you have two absolutely duplicate rows in a
             | table, you're going to have problems with a lot of your
             | queries. It's usually an indication that you are lacking a
             | unique constraint or have a bug somewhere.
        
       | mrweasel wrote:
        | 15 years ago we just used Red Gate's SQL Compare. I think it's
        | still SQL Server only, but this really seems like something that
        | should have been solved for most mature databases by now.
        
         | Pxtl wrote:
         | The fact that Red Gate is still a necessary QOL add-on for a
         | product that costs _that much_ should be shameful.
        
         | Tostino wrote:
         | I was sad to not see them move their tooling to multi-database.
         | I had a couple calls with them ~5-6 years ago to talk about our
         | needs and how they would be solved by a Postgres version of
         | their tools (like SQL Compare), but apparently it never went
         | anywhere.
        
           | zerkten wrote:
           | Having been a Red Gate customer in 2005, I decided to take a
           | look at their current offerings. Based on https://www.red-
           | gate.com/products/ it seems like they have started to look at
            | other databases. The newest products seem to have support for
           | at least 3-4 databases. They are maybe caught between
           | rewriting a legacy product and building out the features in
           | their successor products.
        
       | theodpHN wrote:
       | Comparing for exactness is one challenge, but just a start -
       | identifying what has changed is a bigger challenge. And that's
        | where the world of file/database comparators comes in. A somewhat
       | niche but easy-to-use example of one with lots of options that
       | others have sought to mimic in open source is SAS's PROC COMPARE,
       | which can compare SAS datasets or other vendor's database tables
       | (which SAS can effectively treat as if they're SAS datasets).
       | 
        | SAS COMPARE Procedure Example 1: Producing a Complete Report of
        | the Differences
        | 
        |     proc compare base=proclib.one compare=proclib.two printall;
       | 
       | https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/...
       | 
       | _____
       | 
       | DataComPy (open-source python software developed by Capital One)
       | 
       | DataComPy is a package to compare two Pandas DataFrames.
       | Originally started to be something of a replacement for SAS's
       | PROC COMPARE for Pandas DataFrames with some more functionality
       | than just Pandas.DataFrame.equals(Pandas.DataFrame) (in that it
       | prints out some stats, and lets you tweak how accurate matches
        | have to be).
        | 
        |     from io import StringIO
        |     import pandas as pd
        |     import datacompy
        | 
        |     compare = datacompy.Compare(
        |         df1,
        |         df2,
        |         join_columns='acct_id',  # You can also specify a list of columns
        |         abs_tol=0,  # Optional, defaults to 0
        |         rel_tol=0,  # Optional, defaults to 0
        |         df1_name='Original',  # Optional, defaults to 'df1'
        |         df2_name='New'  # Optional, defaults to 'df2'
        |     )
        |     compare.matches(ignore_extra_columns=False)
        |     # False
        | 
        |     # This method prints out a human-readable report summarizing
        |     # and sampling differences
        |     print(compare.report())
       | 
       | https://capitalone.github.io/datacompy/
        
       | _ZeD_ wrote:
       | https://stackoverflow.com/questions/5727882/check-if-two-sel...
        
       | baal80spam wrote:
        | I've worked with huge tables doing ETL and always used the MINUS
        | clause for the purpose of comparing tables. I was using Teradata
        | though; not sure if it's available in all DB engines.
        
         | Multrex wrote:
          | Oracle has the MINUS clause. Saved me many times.
        
       | kohv wrote:
       | 1. Write an SQL query to retrieve entities. Join the multi-row
       | relational tables to the primary entity using a JSON aggregation
       | function. This ensures all related data remains on a single row.
       | 
       | 2. Export the query results to a .csv file(s).
       | 
       | 3. Utilize Go along with the encoding/csv package to process each
       | CSV row. Construct an in-memory index mapping each entity ID to
       | its byte offset within the CSV file.
       | 
       | 4. Traverse the CSV again, using the index to quickly locate and
       | read corresponding lines into in-memory structures. Convert the
       | aggregated JSON columns to standard arrays or objects.
       | 
       | 5. After comparing individual CSV rows, save the outcomes to a
       | map. This map associates each entityID with metadata about
       | columns that don't match.
       | 
       | 6. Convert the mismatch map into JSON format for further
       | processing or segmentation.
        
       | kbenson wrote:
       | Hmm, CONCAT all fields in each row into a string, hash that
       | string, use that as a derived table where you order the rows
       | definitively, and concat all rows from that derived table into a
       | string and hash that string. Repeat for second table. Compare
        | hashes. You can do it in a single query with two derived tables.
       | 
       | You could do it without the inner hashing, but that seems more
       | likely to exhaust memory if your rows are big enough, since
        | you're basically hashing an entire table at that point, and
        | that's a large string in memory.
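        | 
        | Roughly (a hedged sketch, PostgreSQL-flavored, using a
        | row-to-text cast as a stand-in for the CONCAT):
        | 
        |     -- Inner: hash each row; outer: hash the ordered row hashes.
        |     SELECT md5(string_agg(h, '' ORDER BY h)) AS table_hash
        |     FROM (SELECT md5(t::text) AS h FROM t1 AS t) AS r;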
        
         | feoren wrote:
         | Description | Notes
         | 
         | --------------------------
         | 
         | Fruit | Flies Fast
         | 
         | Fruit Flies | Fast
        
         | dragonwriter wrote:
          | Note that if you have non-fixed-length fields, this can in
          | principle produce false positives.
        
           | kbenson wrote:
           | Yeah, I was just thinking of that, and how you'd probably
           | want to concat with some level of structure to eliminate
           | this. e.g. by adding a delimiter between fields and doing a
           | find and replace on that delimiter (carefully, e.g. CSV rules
           | with quotes and escapes). Whatever it takes to definitively
           | differentiate foo/barfoo from foobar/foo. Maybe just
           | prefixing every field with its length is sufficient.
           | 
            | I'm sure there are some relevant papers on the simplest way to
           | achieve this I can and should look up. Hopefully they don't
           | summarize as it being a much harder problem to do right. ;)
        
             | dragonwriter wrote:
              | Instead of CONCAT you could use something like JSON
              | serialization, which preserves structure (and avoids
              | reinventing the wheel on quoting, etc.) for the rows. It
              | may not be maximally efficient, but it is probably the path
              | of least resistance.
        
               | kbenson wrote:
                | Yeah, as long as every RDBMS has a JSON serialization
                | method now (do they? I'm not up on the SQL standards or
                | where each RDBMS's implementation of them stands).
                | Otherwise, if we're shipping our own serialization as a
                | procedure built from core SQL functions, something
                | simpler might be sufficient: replace backslash with
                | double backslash, replace pipe with backslash pipe,
                | concatenate with pipes.
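                | 
                | For what it's worth, a PostgreSQL-flavored sketch of the
                | JSON route (row_to_json is a Postgres built-in; other
                | engines spell it differently):
                | 
                |     -- Structure-preserving serialization, then hash:
                |     -- no hand-rolled escaping needed.
                |     SELECT md5(row_to_json(t)::text) FROM t1 AS t;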
        
       | LeonB wrote:
       | I've worked on tools for this a bunch of times and one particular
       | instance was my favorite. Checksums were used for equivalence -
       | but the real fun was in displaying the differences in a
       | meaningful way. We'd show:
       | 
       | - which columns were only in the left, or only in the right
       | 
       | - and then show which rows are only in the left, or only on the
       | right
       | 
       | - and then for rows that are in both but have some cell
       | differences show a set that shows those differences. (Hard to
       | explain how this was done... it was concise but rich with
       | details.)
       | 
       | This kind of "thorough" comparison was very useful for
       | understanding the differences all at once.
        
       | austin-cheney wrote:
       | I have this old language aware diff tool I wrote that already
       | supports CSV format, so what I would do is:
       | 
       | 1) Check the record length of the two tables given your RDBMS
       | API.
       | 
       | 2) Check the column headers and data types of the tables are the
       | same given your RDBMS API.
       | 
       | 3) Only then would I export all triggers associated with the
       | given tables using the RDBMS API and then compare that code using
       | an external SQL language aware diff tool that excludes from
       | comparison white space, comments, and other artifacts of your
       | choosing.
       | 
       | 4) Then finally if all else is the same I would export each table
       | as CSV format and then compare that output in a CSV language
       | aware diff tool.
       | 
       | This can all be done with shell scripts.
        
       | justinclift wrote:
       | On the more practical side, SQLite has the "sqldiff" utility:
       | 
       | https://www.sqlite.org/sqldiff.html
       | 
       | By default it compares whole databases, but it can be told to
        | only compare a specific table in each:
        | 
        |     sqldiff -t mytable database1.sqlite database2.sqlite
        
         | RajT88 wrote:
         | Indeed.
         | 
         | > why isn't it a standard feature of SQL to just compare two
         | tables?
         | 
         | Not enough people have complained about needing it (it doesn't
         | hit the desired ROI for a PM to prioritize the feature
         | request).
         | 
          | I first came across SQLite's sqldiff years ago - super useful.
          | It's the perfect example of the industries which pop up to fill
          | in the gaps that huge software vendors like Microsoft leave
          | open. I used to work at a company valued at hundreds of
          | millions of dollars that made nothing but such gap-filling
          | products.
        
           | TheRealPomax wrote:
           | SQL isn't any specific product for a PM to greenlight work on
            | or not; it's a programming language (the L in SQL =) that a
            | structured (the S in SQL =) data application can support as
           | either "the" or "one of the" programming interfaces to query
           | (the Q in SQL =) that data.
           | 
           | So it's really a question of why SQL, the language, doesn't
           | come with a table comparison command so that you can write
           | statements akin to "SELECT * FROM (COMPARE table1 table2) AS
           | diff WHERE ..."
        
             | tuwtuwtuwtuw wrote:
              | Wouldn't the process be that some PM wants a feature to be
              | added and sends it off to the representative on the
              | appropriate committee, who then pushes for it to be added?
              | 
              | I mean, it's not the SQL language itself which sits on the
              | committee specifying new features; it would be
              | representatives from Oracle, Microsoft, IBM and similar,
              | right?
        
               | tremon wrote:
               | If some PM wants to add a debugger to C, do they write a
               | language extension proposal or do they focus on putting
               | their requirements in a separate tool?
               | 
               | This is roughly the same problem. The feature exists, but
               | it's rare enough that the only people that need it are
               | programmers/DBAs that need to deep-dive into a specific
               | issue. Regular people/applications will never need this
               | feature, so why should it be baked in the core language?
               | The specialists already have specialist tooling, so it
               | makes much more sense to implement this as a toolkit
               | feature than a language feature.
        
           | murki wrote:
           | isn't that what MS SQL Server's tablediff does?
           | https://learn.microsoft.com/en-us/sql/tools/tablediff-
           | utilit...
        
             | hatware wrote:
             | [dead]
        
       | ttfkam wrote:
       | I'm surprised FULL OUTER JOINs weren't discussed. Then I realized
       | SQLite doesn't support them. Perhaps it should have been titled:
       | 
       | "How to Check 2 SQLite Tables Are the Same"
       | 
        | I think SQLite's great, but "fully featured SQL engine" is not
        | one of its strengths. More like "perfectly adequate SQL engine"
        | in an astoundingly compact operating envelope.
        
         | cryptonector wrote:
         | FULL OUTER JOINs were added in SQLite3 3.39.0. I've not checked
         | if that includes NATURAL FULL OUTER JOIN.
         | 
         | https://sqlite.org/releaselog/3_39_0.html
        
       | go_prodev wrote:
       | Normally EXCEPT or MINUS does a good enough job.
       | 
       | For full comparisons of MSSQL tables, I often use Visual Studio
       | which has a Data Comparison tool that shows identical rows and
       | differences between source and target.
        
       | wheelerof4te wrote:
       | ""Fine," you say, "just generate the query and get back to work".
       | Problem is, I don't feel like working today..."
       | 
        | And after that, it naturally escalates quickly :)
        | 
        | Never thought I'd ever see SQL code golf, but here we are.
        
       | cryptonector wrote:
        | Using row values in PG:
        | 
        |     SELECT count(q.*)
        |     FROM (SELECT a, b FROM table_a a
        |           NATURAL FULL OUTER JOIN table_b b
        |           WHERE a IS NOT DISTINCT FROM NULL
        |              OR b IS NOT DISTINCT FROM NULL) q;
       | 
       | This looks for rows in `a` that are not in `b` and vice-versa and
       | produces a count of those.
       | 
       | The key for this in SQL is `NATURAL FULL OUTER JOIN` (and row
       | values).
        
       | k2xl wrote:
       | [flagged]
        
         | jagged-chisel wrote:
         | Did you test it?
        
         | LargeDiggerNick wrote:
         | I don't know why you're being downvoted. This is interesting.
         | Did you test it yet?
        
           | simonw wrote:
           | Posting answers from GPT-4 etc on their own isn't interesting
           | - they become interesting if you do the extra work to verify
           | them first.
        
           | Someone wrote:
            | Because it's a (worse) variant of one of the first queries
            | the article gives:
            | 
            |     SELECT *, COUNT(*)
            |     FROM t1
            |     GROUP BY x, y, z, ... -- all attributes of t1
            |     EXCEPT
            |     SELECT *, COUNT(*)
            |     FROM t2
            |     GROUP BY x, y, z, ... -- all attributes of t2
            | 
            | and we aren't discussing ChatGPT here.
        
         | [deleted]
        
       | abujazar wrote:
       | How do you end up with two identical SQL tables in the first
       | place?
        
         | theodpHN wrote:
         | Regression testing.
        
         | remywang wrote:
         | grading homework
        
         | hypercube33 wrote:
         | I immediately considered it for comparison of backups.
        
           | amtamt wrote:
           | or a host migration.
        
           | abujazar wrote:
           | Good point
        
           | j16sdiz wrote:
            | The SQL in the article is more like brain gymnastics.
            | 
            | It is interesting, but not something you should use. It
            | scales horribly with the number of columns.
        
         | nxx788 wrote:
         | [dead]
        
       | crabbone wrote:
       | Are tables the same if the schema differs in names only but the
       | content is the same? Are tables different if one table has
        | triggers defined on it where the other one doesn't? Same for
        | various constraints? What if tables have infinite reference
        | loops? Or if different foreign keys point to the "equal" rows?
        | What if types
       | in the schema don't match but values do (if interpreted as some
       | third type)? Are currently active transactions taken into
       | account?
       | 
       | I find the idea of comparing two SQL tables weird / pointless in
       | general. Maybe OK, if implemented with some very restrictive and
       | well-defined semantics, but I wouldn't rely on a third-party tool
       | to do something like this.
        
       | remywang wrote:
       | Author here! This was mostly intended for entertainment, and
       | partly to demonstrate the idea of "relations as vectors" and
       | "queries as polynomials". But I guess I'll indulge myself a bit
       | more:
       | 
       | > Just use sqldiff
       | 
       | sqldiff is sensitive to ordering, e.g., it'll say the relation
       | [1, 2] is different from [2, 1] (I consider them to be the same
       | because they are the same multiset). You'd need to sort with
       | ORDER BY first, but that also requires listing all attributes
       | explicitly like the GROUP BY solution (ORDER BY * doesn't work).
       | 
       | > What about CHECKSUM
       | 
       | It's also sensitive to ordering, and I was told different tables
       | can have the same CHECKSUM (hash collisions?).
       | 
       | > Are the tables the same if they only differ by schema?
       | 
       | I'd say no. Perhaps a better definition of "the same" is that all
       | SQL queries (using "textbook" SQL features) return the same
       | result over the tables, if you just replace t1 with t2 in the
       | query. Wait, but how do you know if the results are the same...
       | :)
       | 
       | > There are better ways to compare tables
       | 
        | Absolutely, my recursive query runs in time O(N^N), so I'm sure
        | you can do a little better than that.
        
         | tshaddox wrote:
         | > Perhaps a better definition of "the same" is that all SQL
         | queries (using "textbook" SQL features) return the same result
         | over the tables, if you just replace t1 with t2 in the query.
         | Wait, but how do you know if the results are the same... :)
         | 
         | It's actually still a useful definition! (Assuming we're
         | talking about all _deterministic_ SQL queries and can define
         | precisely what we mean by that!)
         | 
         | It's a useful definition because it includes all possible
         | _aggregations_ as well, including very onerous ones like
          | ridiculous STRING_AGG stuff. Those would almost certainly not be
          | candidates for reasonable queries to solve your original
          | problem, but they are useful for benchmarking whether a proposed
         | solution is accurate.
        
         | Pxtl wrote:
         | tl;dr: SQL hates you and it's your fault it hates you.
         | 
         | to elaborate: I'm rapidly appreciating that SQL is a "bondage
         | and discipline" language where it is impossible for SQL to fail
         | at any given task, it can only be failed by its
         | users/developers.
         | 
         | edit: further, it occurs to me also that SQL hates you because
         | in a sane language you'd be able to write up some generic
         | method that says "compare every element of these collections to
         | each other" that was reusable for all possible collections. But
         | try defining a scalar user-defined function that takes two
          | arbitrary resultsets as parameters. Not only can I not do
          | that, I'm a bad person for wanting that, because it violates
         | Relational Theory.
        
           | tremon wrote:
           | _because it violates Relational Theory_
           | 
           | It doesn't violate relational theory at all -- it merely
           | violates the query compiler's requirements about what must be
           | known at the time of query compilation.
           | 
            | You can absolutely write a query that does what you want (you
            | need a stored procedure that generates ad-hoc functions based
            | on the involved tables' signatures), but stored SQL objects
           | must have defined inputs and outputs. What you're asking for
           | is a meta-sql function that can translate to different query
           | plans based on its inputs, and that's not allowed because its
           | performance cannot be predetermined.
        
           | BrentBrewington wrote:
           | You should definitely check out dbt :)
           | 
           | Some links in my comment:
           | https://news.ycombinator.com/item?id=36911937
        
         | bigbonch wrote:
         | When you say "bag semantics" are you just referring to Set
         | Theory? Is there a formal distinction between bags and sets?
        
           | pkkim wrote:
           | With bag (multiset) semantics, {1, 2, 2} is different from
           | {1, 2}. With set semantics, they're the same.
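            | 
            | A two-line illustration in SQL (UNION follows set semantics
            | and deduplicates; UNION ALL follows bag semantics):
            | 
            |     SELECT 1 UNION SELECT 1;      -- one row
            |     SELECT 1 UNION ALL SELECT 1;  -- two rows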
        
         | justinclift wrote:
          | Not sure if Go code is your kind of thing, but if it is, my
          | colleague wrote a SQLite "diff" capability for our online
          | SQLite hosting service:
         | 
         | https://github.com/sqlitebrowser/dbhub.io/blob/5c9e1ab1cfe0f...
         | 
          | The code there can also output a "merge" object out of the
          | differences, in order to merge the differences from one
          | database object into another.
        
       | Pxtl wrote:
       | I'm just sure there's some important relational theory reason why
       | you're a bad person for wanting to do this and it's not a flaw in
       | the language at all that "check the equivalence of two result-
       | sets" is so difficult.
        
         | SOLAR_FIELDS wrote:
         | Ah yes, the Apple support approach: "Why would you want to do
         | that?"
        
         | remywang wrote:
         | My guess is that t1 = t2 seems cheap but is actually expensive,
         | so you might accidentally write very slow queries if it were
         | built into the language.
        
           | Pxtl wrote:
           | Uh, that ship has sailed. Sooooo sailed. It is incredibly
           | trivially easy to write unsargable queries in SQL.
        
       | water9 wrote:
       | That would be -1NF normalization
        
       ___________________________________________________________________
       (page generated 2023-07-28 23:01 UTC)