[HN Gopher] Joins 13 Ways
___________________________________________________________________
Joins 13 Ways
Author : foldU
Score : 296 points
Date : 2023-07-03 17:01 UTC (5 hours ago)
(HTM) web link (justinjaffray.com)
(TXT) w3m dump (justinjaffray.com)
| ttfkam wrote:
| Excellent! We need more articles like this that demonstrate the
| subtleties of the relational model to primarily app-level
| developers. The explanations and explorations in terms of
| functional programming are both concise and compelling.
| charles_f wrote:
| Very nice explanation!
|
| > The correct way to do this is to normalize the table
|
| This is true for transactional dbs, but in data warehouses it's
| widely accepted that some degree of denormalization is the way to
| go
| kadenwolff wrote:
| The title of this post sounds like an advertisement for a cult
| bot12345 wrote:
| This is a very insightful discussion. Thanks for sharing!
| [deleted]
| BoppreH wrote:
| Fantastic post, I always enjoy reading about different
| computation mental models.
|
| And under "A join is a...join", there's a typo in the partially
| ordered set properties. It currently reads: 1.
| Reflexivity: a<=b,
|
| And I'm pretty sure it should be 1.
| Reflexivity: a<=a,
|
| instead (i.e., every element is <= to itself).
| foldU wrote:
| You are correct! I will fix it when I get home, thank you for
| the correction!
| nickpeterson wrote:
| I imagine the title 'thirteen ways of looking at a join' was
| taken?
| lbrindze wrote:
| Dunno why this was downvoted, I came here to make a similar
| comment about the possible Wallace Steven's reference.
| lcnPylGDnU4H9OF wrote:
| Obviously I'd only be able to say for sure if I was the
| downvoter but I sometimes observe lighter text on comments
| which make a reference without calling out the reference.
| It's similar to using an acronym without defining it, though
| possibly more confounding if it's a niche enough reference.
| In this case, I did not get the reference and would have
| wondered why the parent commenter expected that title to have
| already been used.
|
| At best, such a comment is referencing something topical
| which most readers will get (and ostensibly be entertained
| by); at worst, it's a distracting non sequitur. It generally
| ties back to a community preference that comments are
| curiosity-satisfying before entertaining.
| lbrindze wrote:
| Wouldn't an allusion to 20th century American poetry fall
| into the category of "curiosity satisfying"? Given they
| were not the only person who got the reference it feels
| kind of arbitrary to say this is frivolous entertainment
| when another person in the community (in this case, me)
| found it curious and also wondered if there was an allusion
| there.
|
| If it was an intentional allusion, then it may actually add
| depth/meaning to the conversation but we may not know since
| it was already downvoted...
| lcnPylGDnU4H9OF wrote:
| It could be if it was called out as such. Without the
| explicit callout, one runs the risk of it "going over the
| readers' heads" so to speak. Anyway, I'm not intending to
| justify any behavior; just offering my interpretation
| based on past observations.
| captaintobs wrote:
| Very cool work!
| smif wrote:
| An inner join is a Cartesian product with conditionals added.
| RobinL wrote:
| There's a big performance difference between creating the
| Cartesian product and then filtering for the conditional, and
| creating the conditionals directly. An inner join with equi
| join conditions creates the conditions directly; any non equi
| join conditions actually have to be evaluated
| smif wrote:
| That's true, but that is an implementation detail. In
| abstract terms, you can think of an inner join like that.
|
| Also, while what you say is true in general for modern DB's,
| there are some implementations like old Oracle versions where
| the only way to create the effect of an inner join was in
| terms of a Cartesian product.
| lopatin wrote:
| Very useful. Something like this but for Flink's fancy temporal,
| lateral, interval, and window joins would be great too.
| bob1029 wrote:
| Once I started thinking about joins in terms of spatial
| dimensions, things got a lot easier to reason with. I like to
| think of the inner join like the scene from Stargate where they
| were describing how gate addressing works. Assume you have a
| contrived example with 3 tables describing the position of an off
| world probe in each dimension - X, Y and Z. Each table looks
| like: CREATE TABLE Dim_X ( int EntityId
| float Value )
|
| Establishing a point in space for a given entity (regardless of
| how you derive that ID) is then a matter of:
| SELECT Dim_X.Value AS X, Dim_Y.Value as Y, Dim_Z.Value as Z
| FROM Dim_X, Dim_Y, Dim_Z WHERE Dim_X.EntityId =
| Dim_Y.EntityId --Join 1 AND Dim_Y.EntityId = Dim_Z.EntityId
| --Join 2 AND Dim_X.EntityId = @MyEntityId --The entity we
| want to find the 3D location of
|
| You will note that there are 2 inner joins used in this example.
| That is the bare minimum needed to construct a 3 dimensional
| space. Think about taking 3 cards and taping them together with 2
| pieces of rigid metal tape. You can make a good corner of a cube,
| even if it's a bit floppy on one edge. Gravity doesn't apply
| inside the RDBMS, so this works out.
|
| This same reasoning can be extrapolated to higher, non-spatial
| dimensions. Think about adding time into that mix. In order to
| find an entity, you also now need to perform an inner join on
| that dimension and constrain on a specific time value. If you
| join but fail to constrain on a specific time, then you get a
| happy, comprehensive report of all the places an entity was over
| time.
|
| The other join types are really minor variations on these themes
| once you have a deep conceptual grasp (i.e. can "rotate" the
| schema in your mind).
|
| Playing around with some toy examples can do wonders for
| understanding. I sometimes find myself going back to the
| cartesian coordinate example when stuck trying to weld together
| 10+ dimensions in a real-world business situation.
| michaelmior wrote:
| This reminds me a lot of HyperDex[0], which hashes values into
| a multidimensional hyperspace based on their attributes for
| indexing purposes.
|
| [0] https://dbdb.io/db/hyperdex
| 1970-01-01 wrote:
| OLAP and MOLAP
|
| https://www.ibm.com/topics/olap
|
| https://en.wikipedia.org/wiki/Online_analytical_processing#M.
| ..
| rjbwork wrote:
| This is like, a hypernormalization of the data though, isn't
| it? I think in standard BCNF you'd just leave your table as
| CREATE TABLE EntityPosition ( int EntityId,
| float X, float y, float z )
|
| It does remind me of data warehouse stuff though, given we're
| working with aggregates and piecing together bits of various
| dimensions.
| boredemployee wrote:
| I always thought venn diagram was a good representation but I
| think I was wrong.
| ss892714028 wrote:
| [dead]
| waynecochran wrote:
| Can join also be thought of as unification? Similar to type
| checking.
|
| https://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/le...
| danbruc wrote:
| 1. cross join 2. natural join 3. equi join
| 4. theta join 5. inner join 6. left outer join
| 7. right outer join 8. full outer join 9. left semi
| join 10. right semi join 11. left anti semi join
| 12. right anti semi join 13. ???
| iaabtpbtpnn wrote:
| 13. lateral join :)
| maxdemarzi wrote:
| The 14th way is "multi way joins" also called "worst case optimal
| joins" which is a terrible name.
|
| It means instead of joining tables two at a time and dealing with
| the temporary results along the way (eating memory), you join 3
| or more tables together without the temporary results.
|
| There is a blog post and short video of this on
| https://relational.ai/blog/dovetail-join and the original paper
| is on https://dl.acm.org/doi/pdf/10.1145/3180143
|
| I work for RelationalAI, we and about 4 other new database
| companies are bringing these new join algorithms to market after
| ten years in academia.
| namibj wrote:
| Negating inputs (set complement) turns the join's `AND` into a
| `NOR`, as _Tetris_ exploits.
|
| The worst case bounds don't tighten over (stateless/streaming)
| WCOJ's, but much real world data has far smaller box
| certificates.
|
| One thing I didn't see is whether Dovetail join allows
| recursive queries (i.e., arbitrary datalog with a designated
| output relation, and the user having no concern about what the
| engine does with all the intermediate relations mentioned in
| the bundle of horn clauses that make up this datalog query).
|
| Do you happen to know if it supports such queries?
| srcreigh wrote:
| Another missed chance to educate on the N+1 area. Join on
| unclustered index is still N+1, it's just N+1 on disk instead of
| N+1 over the network and disk.
| tourist2d wrote:
| "Another missed chance to talk about X problem I find interest
| in and would bloat the article"
| Anon4Now wrote:
| A couple things:
|
| This is admittedly a bit pedantic, but in E.F. Codd's original
| paper, he defined "relation" as the relation between a tuple
| (table row) and an attribute (table column) in a single table -
| https://en.wikipedia.org/wiki/Relation_(database). I'm not sure
| of the author's intent, but the user table example (user,
| country_id ) might imply the relationship between the user table
| and the country table. It's a common misconception about what
| "relational" means, but tbh I'm fine with that since it makes
| more sense to the average developer.
|
| If you ever need to join sets of data in code, don't use nested
| loops - O(n^2). Use a map / dictionary. It's one of the few
| things I picked up doing Leetcode problems that I've actually
| needed to apply irl.
| roywiggins wrote:
| The 0th would be "it's an operator in relational algebra."
|
| https://en.m.wikipedia.org/wiki/Relational_algebra
|
| ("The result of the natural join [R [?] S] is the set of all
| combinations of tuples in R and S that are equal on their common
| attribute names... it is the relational counterpart of the
| logical AND operator.")
|
| [?] amounts to a Cartesian product with a predicate that throws
| out the rows that shouldn't be in the result. Lots of SQL makes
| sense if you think of joins this way.
| gpderetta wrote:
| The cross-product plus predicate is referenced in "A join is a
| nested loop over rows".
| roywiggins wrote:
| That's true, but I think the benefit of treating a Cartesian
| product as fundamental is that it lets you stop thinking
| about loops at all, or which one is the inner or outer loop,
| or of it as an _iterative_ process at all.
|
| Boxing all that up into the Cartesian product is a really
| useful concept, and the whole idea of relational algebra is
| to find convenient formalisms for relational operations, so
| it seems like it deserves a separate mention.
| jimwhite42 wrote:
| In the functional interpretation of relational theory, a join
| is function composition, surprised that one was left out.
| benjiweber wrote:
| Joins as a relational AND.
|
| https://benjiweber.co.uk/blog/2021/03/21/thinking-in-questio....
| mirekrusin wrote:
| For several days I'm having trouble finding good resources on
| _implementation_ for query execution/planning (predicates,
| existing indices <<especially composite ones - how to locate them
| during planning etc>>, joins etc).
|
| Google is spammed with _usage_.
|
| Anybody has some recommendations at hand?
|
| ps. the only one I found was CMU's Database Group resources,
| which are great
| gavinray wrote:
| There's an entire 700-page book about this you can access for
| free here:
|
| "Building Query Compilers"
|
| https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.p...
|
| EDIT: You also might get use out of TUM's course "Database
| Systems on Modern CPU Architectures"
|
| The year that contains all the video lectures is 2020:
|
| https://db.in.tum.de/teaching/ss20/moderndbs/?lang=en
| malfist wrote:
| It seems content farms, farming keywords with just fluff has
| taken over google. Can't find how to do anything anymore, just
| pages and pages of people talking about doing something.
|
| You could try your query on a different search engine. I've had
| good luck with kagi.
| skywhopper wrote:
| It's out of date and probably has less detail than I remember
| but I got a lot out of "Inside Microsoft SQL Server 7.0" which
| does deep dives into storage architecture, indices, query
| planning etc from an MSSQL specific POV. The book was updated
| for SQL Server 2003 and 2008. Obviously the book also has a ton
| of stuff about MS specific features and admin functionality
| that's not really relevant, and I'm sure there are better
| resources out there, but I've found the background from that
| book has helped me understand the innards of Oracle, MySQL, and
| Postgres in the years since.
| dattl wrote:
| I find the Advanced Databases Course from CMU an excellent
| resource.
| https://15721.courses.cs.cmu.edu/spring2023/schedule.html
|
| You might want to look into academic papers, e.g., T. Neumann,
| Efficiently Compiling Efficient Query Plans for Modern
| Hardware, in VLDB, 2011
| https://www.vldb.org/pvldb/vol4/p539-neumann.pdf
| mrkeen wrote:
| I just tried adding 'relational algebra' in front of my 'query
| planning' query, and at a glance it skews more towards
| implementation.
| sobellian wrote:
| I don't know if this has the level of detail you're looking
| for, but you might want to check
| https://www.sqlite.org/optoverview.html and
| https://www.sqlite.org/queryplanner.html.
| aidos wrote:
| My go to pointer for this is to read the Postgres docs (and /
| or source - which is also super readable).
|
| https://www.postgresql.org/docs/current/planner-optimizer.ht...
___________________________________________________________________
(page generated 2023-07-03 23:00 UTC)