hngopher.com

       [HN Gopher] Joins 13 Ways
       ___________________________________________________________________
        
       Joins 13 Ways
        
       Author : foldU
       Score  : 296 points
       Date   : 2023-07-03 17:01 UTC (5 hours ago)
        
 (HTM) web link (justinjaffray.com)
 (TXT) w3m dump (justinjaffray.com)
        
       | ttfkam wrote:
       | Excellent! We need more articles like this that demonstrate the
       | subtleties of the relational model to primarily app-level
       | developers. The explanations and explorations in terms of
       | functional programming are both concise and compelling.
        
       | charles_f wrote:
       | Very nice explanation!
       | 
       | > The correct way to do this is to normalize the table
       | 
       | This is true for transactional dbs, but in data warehouses it's
       | widely accepted that some degree of denormalization is the way to
       | go
        
       | kadenwolff wrote:
       | The title of this post sounds like an advertisement for a cult
        
       | bot12345 wrote:
       | This is a very insightful discussion. Thanks for sharing!
        
       | [deleted]
        
       | BoppreH wrote:
       | Fantastic post, I always enjoy reading about different
       | computation mental models.
       | 
       | And under "A join is a...join", there's a typo in the partially
       | ordered set properties. It currently reads:                   1.
       | Reflexivity: a<=b,
       | 
       | And I'm pretty sure it should be                   1.
       | Reflexivity: a<=a,
       | 
       | instead (i.e., every element is <= to itself).
        
         | foldU wrote:
         | You are correct! I will fix it when I get home, thank you for
         | the correction!
        
       | nickpeterson wrote:
       | I imagine the title 'thirteen ways of looking at a join' was
       | taken?
        
         | lbrindze wrote:
         | Dunno why this was downvoted, I came here to make a similar
         | comment about the possible Wallace Steven's reference.
        
           | lcnPylGDnU4H9OF wrote:
           | Obviously I'd only be able to say for sure if I was the
           | downvoter but I sometimes observe lighter text on comments
           | which make a reference without calling out the reference.
           | It's similar to using an acronym without defining it, though
           | possibly more confounding if it's a niche enough reference.
           | In this case, I did not get the reference and would have
           | wondered why the parent commenter expected that title to have
           | already been used.
           | 
           | At best, such a comment is referencing something topical
           | which most readers will get (and ostensibly be entertained
           | by); at worst, it's a distracting non sequitur. It generally
           | ties back to a community preference that comments are
           | curiosity-satisfying before entertaining.
        
             | lbrindze wrote:
             | Wouldn't an allusion to 20th century American poetry fall
             | into the category of "curiosity satisfying"? Given they
             | were not the only person who got the reference it feels
             | kind of arbitrary to say this is frivolous entertainment
             | when another person in the community (in this case, me)
             | found it curious and also wondered if there was an allusion
             | there.
             | 
             | If it was an intentional allusion, then it may actually add
             | depth/meaning to the conversation but we may not know since
             | it was already downvoted...
        
               | lcnPylGDnU4H9OF wrote:
               | It could be if it was called out as such. Without the
               | explicit callout, one runs the risk of it "going over the
               | readers' heads" so to speak. Anyway, I'm not intending to
               | justify any behavior; just offering my interpretation
               | based on past observations.
        
       | captaintobs wrote:
       | Very cool work!
        
       | smif wrote:
       | An inner join is a Cartesian product with conditionals added.
        
         | RobinL wrote:
         | There's a big performance difference between creating the
         | Cartesian product and then filtering for the conditional, and
         | creating the conditionals directly. An inner join with equi
         | join conditions creates the conditions directly; any non equi
         | join conditions actually have to be evaluated
        
           | smif wrote:
           | That's true, but that is an implementation detail. In
           | abstract terms, you can think of an inner join like that.
           | 
           | Also, while what you say is true in general for modern DB's,
           | there are some implementations like old Oracle versions where
           | the only way to create the effect of an inner join was in
           | terms of a Cartesian product.
        
       | lopatin wrote:
       | Very useful. Something like this but for Flink's fancy temporal,
       | lateral, interval, and window joins would be great too.
        
       | bob1029 wrote:
       | Once I started thinking about joins in terms of spatial
       | dimensions, things got a lot easier to reason with. I like to
       | think of the inner join like the scene from Stargate where they
       | were describing how gate addressing works. Assume you have a
       | contrived example with 3 tables describing the position of an off
       | world probe in each dimension - X, Y and Z. Each table looks
       | like:                 CREATE TABLE Dim_X (         int EntityId
       | float Value       )
       | 
       | Establishing a point in space for a given entity (regardless of
       | how you derive that ID) is then a matter of:
       | SELECT Dim_X.Value AS X, Dim_Y.Value as Y, Dim_Z.Value as Z
       | FROM Dim_X, Dim_Y, Dim_Z       WHERE Dim_X.EntityId =
       | Dim_Y.EntityId --Join 1       AND Dim_Y.EntityId = Dim_Z.EntityId
       | --Join 2       AND Dim_X.EntityId = @MyEntityId --The entity we
       | want to find the 3D location of
       | 
       | You will note that there are 2 inner joins used in this example.
       | That is the bare minimum needed to construct a 3 dimensional
       | space. Think about taking 3 cards and taping them together with 2
       | pieces of rigid metal tape. You can make a good corner of a cube,
       | even if it's a bit floppy on one edge. Gravity doesn't apply
       | inside the RDBMS, so this works out.
       | 
       | This same reasoning can be extrapolated to higher, non-spatial
       | dimensions. Think about adding time into that mix. In order to
       | find an entity, you also now need to perform an inner join on
       | that dimension and constrain on a specific time value. If you
       | join but fail to constrain on a specific time, then you get a
       | happy, comprehensive report of all the places an entity was over
       | time.
       | 
       | The other join types are really minor variations on these themes
       | once you have a deep conceptual grasp (i.e. can "rotate" the
       | schema in your mind).
       | 
       | Playing around with some toy examples can do wonders for
       | understanding. I sometimes find myself going back to the
       | cartesian coordinate example when stuck trying to weld together
       | 10+ dimensions in a real-world business situation.
        
         | michaelmior wrote:
         | This reminds me a lot of HyperDex[0], which hashes values into
         | a multidimensional hyperspace based on their attributes for
         | indexing purposes.
         | 
         | [0] https://dbdb.io/db/hyperdex
        
           | 1970-01-01 wrote:
           | OLAP and MOLAP
           | 
           | https://www.ibm.com/topics/olap
           | 
           | https://en.wikipedia.org/wiki/Online_analytical_processing#M.
           | ..
        
         | rjbwork wrote:
         | This is like, a hypernormalization of the data though, isn't
         | it? I think in standard BCNF you'd just leave your table as
         | CREATE TABLE EntityPosition (         int EntityId,
         | float X,         float y,         float z       )
         | 
         | It does remind me of data warehouse stuff though, given we're
         | working with aggregates and piecing together bits of various
         | dimensions.
        
       | boredemployee wrote:
       | I always thought venn diagram was a good representation but I
       | think I was wrong.
        
       | ss892714028 wrote:
       | [dead]
        
       | waynecochran wrote:
       | Can join also be thought of as unification? Similar to type
       | checking.
       | 
       | https://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/le...
        
       | danbruc wrote:
       | 1. cross join        2. natural join        3. equi join
       | 4. theta join        5. inner join        6. left outer join
       | 7. right outer join        8. full outer join        9. left semi
       | join       10. right semi join       11. left anti semi join
       | 12. right anti semi join       13. ???
        
         | iaabtpbtpnn wrote:
         | 13. lateral join :)
        
       | maxdemarzi wrote:
       | The 14th way is "multi way joins" also called "worst case optimal
       | joins" which is a terrible name.
       | 
       | It means instead of joining tables two at a time and dealing with
       | the temporary results along the way (eating memory), you join 3
       | or more tables together without the temporary results.
       | 
       | There is a blog post and short video of this on
       | https://relational.ai/blog/dovetail-join and the original paper
       | is on https://dl.acm.org/doi/pdf/10.1145/3180143
       | 
       | I work for RelationalAI, we and about 4 other new database
       | companies are bringing these new join algorithms to market after
       | ten years in academia.
        
         | namibj wrote:
         | Negating inputs (set complement) turns the join's `AND` into a
         | `NOR`, as _Tetris_ exploits.
         | 
         | The worst case bounds don't tighten over (stateless/streaming)
         | WCOJ's, but much real world data has far smaller box
         | certificates.
         | 
         | One thing I didn't see is whether Dovetail join allows
         | recursive queries (i.e., arbitrary datalog with a designated
         | output relation, and the user having no concern about what the
         | engine does with all the intermediate relations mentioned in
         | the bundle of horn clauses that make up this datalog query).
         | 
         | Do you happen to know if it supports such queries?
        
       | srcreigh wrote:
       | Another missed chance to educate on the N+1 area. Join on
       | unclustered index is still N+1, it's just N+1 on disk instead of
       | N+1 over the network and disk.
        
         | tourist2d wrote:
         | "Another missed chance to talk about X problem I find interest
         | in and would bloat the article"
        
       | Anon4Now wrote:
       | A couple things:
       | 
       | This is admittedly a bit pedantic, but in E.F. Codd's original
       | paper, he defined "relation" as the relation between a tuple
       | (table row) and an attribute (table column) in a single table -
       | https://en.wikipedia.org/wiki/Relation_(database). I'm not sure
       | of the author's intent, but the user table example (user,
       | country_id ) might imply the relationship between the user table
       | and the country table. It's a common misconception about what
       | "relational" means, but tbh I'm fine with that since it makes
       | more sense to the average developer.
       | 
       | If you ever need to join sets of data in code, don't use nested
       | loops - O(n^2). Use a map / dictionary. It's one of the few
       | things I picked up doing Leetcode problems that I've actually
       | needed to apply irl.
        
       | roywiggins wrote:
       | The 0th would be "it's an operator in relational algebra."
       | 
       | https://en.m.wikipedia.org/wiki/Relational_algebra
       | 
       | ("The result of the natural join [R [?] S] is the set of all
       | combinations of tuples in R and S that are equal on their common
       | attribute names... it is the relational counterpart of the
       | logical AND operator.")
       | 
       | [?] amounts to a Cartesian product with a predicate that throws
       | out the rows that shouldn't be in the result. Lots of SQL makes
       | sense if you think of joins this way.
        
         | gpderetta wrote:
         | The cross-product plus predicate is referenced in "A join is a
         | nested loop over rows".
        
           | roywiggins wrote:
           | That's true, but I think the benefit of treating a Cartesian
           | product as fundamental is that it lets you stop thinking
           | about loops at all, or which one is the inner or outer loop,
           | or of it as an _iterative_ process at all.
           | 
           | Boxing all that up into the Cartesian product is a really
           | useful concept, and the whole idea of relational algebra is
           | to find convenient formalisms for relational operations, so
           | it seems like it deserves a separate mention.
        
         | jimwhite42 wrote:
         | In the functional interpretation of relational theory, a join
         | is function composition, surprised that one was left out.
        
       | benjiweber wrote:
       | Joins as a relational AND.
       | 
       | https://benjiweber.co.uk/blog/2021/03/21/thinking-in-questio....
        
       | mirekrusin wrote:
       | For several days I'm having trouble finding good resources on
       | _implementation_ for query execution/planning (predicates,
       | existing indices <<especially composite ones - how to locate them
       | during planning etc>>, joins etc).
       | 
       | Google is spammed with _usage_.
       | 
       | Anybody has some recommendations at hand?
       | 
       | ps. the only one I found was CMU's Database Group resources,
       | which are great
        
         | gavinray wrote:
         | There's an entire 700-page book about this you can access for
         | free here:
         | 
         | "Building Query Compilers"
         | 
         | https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.p...
         | 
         | EDIT: You also might get use out of TUM's course "Database
         | Systems on Modern CPU Architectures"
         | 
         | The year that contains all the video lectures is 2020:
         | 
         | https://db.in.tum.de/teaching/ss20/moderndbs/?lang=en
        
         | malfist wrote:
         | It seems content farms, farming keywords with just fluff has
         | taken over google. Can't find how to do anything anymore, just
         | pages and pages of people talking about doing something.
         | 
         | You could try your query on a different search engine. I've had
         | good luck with kagi.
        
         | skywhopper wrote:
         | It's out of date and probably has less detail than I remember
         | but I got a lot out of "Inside Microsoft SQL Server 7.0" which
         | does deep dives into storage architecture, indices, query
         | planning etc from an MSSQL specific POV. The book was updated
         | for SQL Server 2003 and 2008. Obviously the book also has a ton
         | of stuff about MS specific features and admin functionality
         | that's not really relevant, and I'm sure there are better
         | resources out there, but I've found the background from that
         | book has helped me understand the innards of Oracle, MySQL, and
         | Postgres in the years since.
        
         | dattl wrote:
         | I find the Advanced Databases Course from CMU an excellent
         | resource.
         | https://15721.courses.cs.cmu.edu/spring2023/schedule.html
         | 
         | You might want to look into academic papers, e.g., T. Neumann,
         | Efficiently Compiling Efficient Query Plans for Modern
         | Hardware, in VLDB, 2011
         | https://www.vldb.org/pvldb/vol4/p539-neumann.pdf
        
         | mrkeen wrote:
         | I just tried adding 'relational algebra' in front of my 'query
         | planning' query, and at a glance it skews more towards
         | implementation.
        
         | sobellian wrote:
         | I don't know if this has the level of detail you're looking
         | for, but you might want to check
         | https://www.sqlite.org/optoverview.html and
         | https://www.sqlite.org/queryplanner.html.
        
         | aidos wrote:
         | My go to pointer for this is to read the Postgres docs (and /
         | or source - which is also super readable).
         | 
         | https://www.postgresql.org/docs/current/planner-optimizer.ht...
        
       ___________________________________________________________________
       (page generated 2023-07-03 23:00 UTC)