[HN Gopher] SQL, Homomorphisms and Constraint Satisfaction Problems
       ___________________________________________________________________
        
       SQL, Homomorphisms and Constraint Satisfaction Problems
        
       Author : xlinux
       Score  : 99 points
       Date   : 2024-11-20 17:07 UTC (5 hours ago)
        
 (HTM) web link (www.philipzucker.com)
 (TXT) w3m dump (www.philipzucker.com)
        
       | babel_ wrote:
       | For anyone curious: the performance difference between Clang and
       | GCC on the example C solution for verbal arithmetic comes down to
       | Clang's auto-vectorisation (deducing SIMD) whilst GCC here sticks
       | with scalar, which is why the counter brings Clang closer in line
       | to GCC (https://godbolt.org/z/xfdxGvMYP), and it's actually a
       | pretty nice example of auto-vectorisation (and its limitations)
       | in action, which is a fun tangent from this article (given its
       | relevance to high-performance SMT/SAT solving for CSP)
        
       | mbid wrote:
       | The post mentions the idea that querying a database D can be
       | understood algebraically as enumerating all morphisms Q -> D,
       | where Q is the "classifying" database of the query, i.e. a
       | minimal database instance that admits a single "generic" result
       | of the query. You can use this to give a neat formulation of
       | Datalog evaluation. A Datalog rule then corresponds a morphism P
       | -> H, where P is the classifying database instance of the rule
       | body and H is the classifying database instance for matches of
       | both body and head. For example, for the the transitivity rule
       | edge(x, z) :- edge(x, y), edge(y, z).
       | 
       | you'd take for P the database instance containing two rows (a_1,
       | a_2) and (a_2, a_3), and the database instance H contains
       | additionally (a_1, a_3). Now saying that a Database D satisfies
       | this rule means that every morphism P -> D (i.e., every match of
       | the premise of the rule) can be completed to a commuting diagram
       | P --> D       |    ^       |   /       [?]  /       Q
       | 
       | where the additional map is the arrow Q -> D, which corresponds
       | to a match of both body and head.
       | 
       | This kind of phenomenon is known in category theory as a "lifting
       | property", and there's rich theory around it. For example, you
       | can show in great generality that there's always a "free" way to
       | add data to a database D so that it satisfies the lifting
       | property (the orthogonal reflection construction/the small object
       | argument). Those are the theoretical underpinnings of the Datalog
       | engine I'm sometimes working on [1], and there they allow you to
       | prove that Datalog evaluation is also well-defined if you allow
       | adjoining new elements during evaluation in a controlled way. I
       | believe the author of this post is involved in the egglog project
       | [2], which might have similar features as well.
       | 
       | [1] https://github.com/eqlog/eqlog [2]
       | https://github.com/egraphs-good/egglog
        
       | lovasoa wrote:
       | The topic of huge queries on tiny databases makes me think of
       | this recent discussion on the SQLite forum:
       | https://sqlite.org/forum/forumpost/0d18320369
       | 
       | Someone had an issue because SQLite failed to optimize the
       | following query                   select * from t where x = 'x'
       | or '' = 'x'
       | 
       | Someone said that SQLite could not optimize out the "or '' = 'x'"
       | because it would be too expensive to compute. Which is obviously
       | true only for huge queries on tiny datasets.
        
         | hinkley wrote:
         | Why would it be too expensive to optimize out static
         | subexpressions?
        
           | jjice wrote:
           | My guess is that the expense can be tricky to calculate since
           | the additional optimization prior to executing the query may
           | take longer than if the query was just able to run (depending
           | on the dataset, of course). I wonder if it's too expensive to
           | calculate a heuristic as well, so it just allows it to
           | execute.
           | 
           | Just a guess.
        
         | recursive wrote:
         | It's not obviously true at all. Optimizing out `'' = 'x'` can
         | be done for a fixed cost regardless of record count.
        
           | lovasoa wrote:
           | Optimizing out static expressions can be done in linear time
           | at best. So if the number of clauses in WHERE is huge and the
           | size of the underlying table is tiny (such as in the examples
           | shown in the article we are commenting on), it will be better
           | not to run the optimization.
           | 
           | But of course, in normal life, outside of the world of people
           | having fun with Homomorphisms, queries are much smaller than
           | databases.
        
             | recursive wrote:
             | Parsing the expression in the first place is already linear
             | time.
        
         | jiggawatts wrote:
         | > SQLite
         | 
         | Well... there's your problem. SQLite is _not_ a general-purpose
         | RDBMS, it is marketed as a replacement for  "fopen()", a
         | purpose for which it excels.
         | 
         | A similar product is the Microsoft Jet database engine, used in
         | products such as Microsoft Exchange and Active Directory.
         | Queries have to be more-or-less manually optimised by the
         | developer, but they run faster and more consistently than they
         | would with a general-purpose query engine designed for ad-hoc
         | queries.
        
       ___________________________________________________________________
       (page generated 2024-11-20 23:00 UTC)