[HN Gopher] Let's talk about joins
___________________________________________________________________
Let's talk about joins
Author : r4um
Score : 105 points
Date : 2024-01-20 06:52 UTC (16 hours ago)
(HTM) web link (cghlewis.com)
(TXT) w3m dump (cghlewis.com)
| adrianmonk wrote:
| From the opening of the "Vertical Joins" section:
|
| > _Similar to horizontal joins, there are many use cases for
| joining data horizontally, also called appending data._
|
| Shouldn't this read "joining data vertically"? This seems like a
| typo.
| cghlewis wrote:
| This is definitely a typo! Thanks for catching this!
| jethkl wrote:
| The article omits the "natural join". A natural join
| automatically joins tables on columns having the same name, and
| the columns in the resulting output appear exactly once. Natural
| joins often simplify queries, and they automatically adapt to
| schema changes in source tables. Natural joins also encourage
| standardized naming conventions, which has longer-term benefits.
| youerbt wrote:
| I feel like natural joins simplify writing queries, but not
| exactly reading them (especially if you are not familiar with
| the database). IMO a good compromise is the USING clause, which
| acts like a natural join, but columns have to be named
| explicitly.
| amluto wrote:
| Maybe some SQL server should require a token like
| "USING_WILD_GUESS" when doing a natural join :)
| chasil wrote:
| The problem with USING is that it isn't supported everywhere.
| hipadev23 wrote:
| That's not a join type, that's syntactic sugar.
| bradleybuda wrote:
| A right join is also syntactic sugar for a left join
| bakuninsbart wrote:
| Exactly, that's why they shouldn't be used. If you feel the
| need to use a right join, swap the direction around. And
| only use inner joins if it is the only type of join you
| use, otherwise specify conditions in the where clause. Both
| serve to significantly lower mental overhead when dealing
| with queries.
|
| Natural joins are naturally more implicit, and while SQL
| tends to be a little bit verbose, given the significance of
| data integrity and the dififculty of testing SQL, the
| trade-off goes very clearly towards being explicit.
| cldellow wrote:
| Natural joins also automatically break your queries when two
| columns happen to share a name but not the same meaning.
|
| Step 1: use natural join. Life is great. Step 2: someone adds a
| `comment` field on table A. Life is great. Step 3: someone adds
| a `comment` field on table B. Ruh roh.
|
| I'll use them in short-lived personal projects, but not on
| something where I'm collaborating with other people on software
| that evolves over several years.
| jethkl wrote:
| a defense against collisions like this is through CTEs that
| select a minimal set of columns, with column names suitably
| selected and standardized: CTE_A AS (SELECT
| ... comment as comment_a from A...), CTE_B AS (SELECT
| ... comment as comment_b from B...)
| Cyberdog wrote:
| Isn't this a lot more work both for the user and the RDBMS
| than just using a relatively simple left join?
| closeparen wrote:
| It seems like the database should be able to figure this out
| when a foreign key constraint is explicitly declared in the
| DDL.
| tqi wrote:
| Should, but in practice I've rarely seen fk contraints used
| in analytics data warehouses (mostly for etl performance
| reasons)
| civilized wrote:
| > Natural joins also encourage standardized naming conventions,
| which has longer-term benefits.
|
| This is a very positive spin on "you have to manage column
| names very rigorously for this strategy to be sustainable".
| mgaunard wrote:
| There are lots of other joins not mentioned there.
|
| A popular one with time series is the asof join.
|
| There are also literal joins, which are generalizations of
| adjacent difference.
| mgaunard wrote:
| meant lAteral sorry
| HermitX wrote:
| Excellent learning material, thanks for sharing. I've noticed an
| interesting trend: JOINS are crucial for data analytics, yet many
| new open-source data analytics products, or open-source OLAP
| products, offer limited support for JOINS. Examples include
| ClickHouse, Apache Druid, and Apache Pinot. It seems that
| currently, only Trino and StarRocks provide robust support for
| JOINS. Commercial products tend to have better support in this
| area. I hope the various open-source projects will also enhance
| their JOIN capabilities.
| minitoar wrote:
| The reason those tools have more limited support for joins is
| mainly because they are making intentional trade offs in favor
| of other features, eg performance in a particular domain.
| totalhack wrote:
| I've also been frustrated when testing out tools that kinda
| keep you locked into one predetermined view, table, or set of
| tables at a time. I made a semantic data modeling library that
| puts together queries (and of course joins) for you as it uses
| a drill-across querying technique, and can also join data
| across different data sources in a secondary execution layer.
|
| https://github.com/totalhack/zillion
|
| Disclaimer: this project is currently a one man show, though I
| use it in production at my own company.
| closeparen wrote:
| If you want arbitrarily powerful adhoc query support, you need
| to wait for data to land in an offline warehouse or lake
| environment where you have access to e.g. Presto/Trino and
| Spark. If you want a near-real-time view then you're going to
| need to design the data layout around your query pattern - do a
| streaming join or other enrichment prior to OLAP ingestion.
| necubi wrote:
| Yep. There are always going to be constraints about how well
| a system like clickhouse can support arbitrary joins. Queries
| in clickhouse are fast because the data is laid out in such a
| way that it can minimize how much it needs to read.
|
| Part of this is the columnar layout that means it can avoid
| reading columns that are not involved in the query. However
| it's also able to push query predicates into the table scan,
| using metadata (like bloom filters) that tell it what values
| are in each chunk of data.
|
| But for joins, you typically end up needing to read all of
| the data and materialize it in memory.
|
| For realtime joins the best option is to do it in a steaming
| fashion on ingestion, for example in a system like Flink or
| Arroyo [0], which I work on.
|
| [0] https://github.com/ArroyoSystems/arroyo
| closeparen wrote:
| Something I have found pretty annoying is that Flink works
| great for joining a stream against another stream where
| messages to be joined are expected to arrive within a few
| minutes of each other, but there is actually ~no platform
| solution for joining a small, unchanging or slowly changing
| table against a stream. We end up needing a service to
| consume the messages, make RPC calls, and re-emit them with
| new fields.
| teunispeters wrote:
| A related note, PostgreSQL is very good at joins, but MySQL -
| with at the time much larger share - was never very good at
| them (at the time). (I last explored this 2016 and before). But
| a lot of web interfaces to data exploration (then) were based
| on MySQL and its quirks, and that colours perspectives a lot.
| tosh wrote:
| asof join:
|
| https://code.kx.com/q/ref/asof/
|
| https://code.kx.com/q/learn/brief-introduction/#time-joins
|
| https://clickhouse.com/docs/en/sql-reference/statements/sele...
|
| https://duckdb.org/docs/guides/sql_features/asof_join.html
|
| > Do you have time series data that you want to join, but the
| timestamps don't quite match? Or do you want to look up a value
| that changes over time using the times in another table?
|
| DuckDB blog post on temporal joins:
|
| https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-look...
|
| note: this idea is very useful for timestamps but can also be
| used for other types of values
| jamesblonde wrote:
| The ASOF LEFT JOIN is the main join used to create training
| data, where the labels are in the left table, and features are
| in the tables on the RHS. Note, we use DuckDB for its left asof
| join capabilities, but most vendors use Spark which is
| hideously slow for left asof joins.
| andy81 wrote:
| If I was going to add anything -
|
| "Vertical join" is normally called union.
|
| "Right join" is just (bad) syntax sugar and should be avoided.
| Left join with the tables reversed is the usual convention.
|
| The join condition for inner is optional- if it's always true
| then you get a "cross join". Can be useful to show all the
| possible combinations of two fields.
| wood_spirit wrote:
| All the other non-left joins are just syntactic sugar and can
| be expressed using only left join...?
| tzot wrote:
| > All the other non-left joins are just syntactic sugar and
| can be expressed using only left join...?
|
| A "vertical join" (SQL UNION) is one of the "non-left joins".
| How can you transform a "vertical join" to a "left join"?
| cbreezyyall wrote:
| This feels like an interesting interview question. I think
| you could simulate this with a full outer join on the
| entire select list and coalesce? For a UNION ALL you could
| put some literal column in the selects from both tables
| that you set to different values and include that in the
| join so you'd get a result set that will have all nulls in
| the right table columns for the rows in the left table and
| vice versa. Something like WITH top_t AS (
| SELECT a ,b ,c ,'top' as
| nonexistent_col FROM table_1 ), bottom_t AS (
| SELECT a ,b ,c ,'bottom'
| as nonexistent_col FROM table_2 ) SELECT
| COALESCE(top_t.a, bottom_t.a) AS a
| ,COALESCE(top_t.b, bottom_t.b) AS b
| ,COALESCE(top_t.c, bottom_t.c) AS c FROM top_t FULL
| OUTER JOIN bottom_t ON top_t.a = bottom_t.a
| AND top_t.b = bottom_t.b AND top_t.c = bottom_t.c
| AND top_t.nonexistent_col = bottom_t.nonexistent_col --
| remove this for a normal UNION
| wood_spirit wrote:
| Bravo!!
| Little_Kitty wrote:
| Of the tens of thousands of queries I've written I've needed
| right join the exactly once. It's a feature which is neat in
| that it exists, but the prevalence in teaching materials is
| entirely unjustified. Cross joins are massively more practical
| and enable some efficient transformations, but are usually
| taught only as all to all without a clear position on why they
| are useful.
| nikhilsimha wrote:
| Never heard a union be called a "vertical join" before.
| petalmind wrote:
| It seems that lots of people independently coin this
| expression. I did it too a couple of years ago. Just checked
| Google, and there are lots of hits for this.
| airstrike wrote:
| _> Let's try this again using R._
|
| This should actually be "Let's try this again using dplyr, one of
| the most elegant pieces of software ever written".
|
| Hadley Wickham is a treasure to humanity.
| jarym wrote:
| > "Here we typically expect that the combined dataset will have
| the same number of rows as our original left side dataset."
|
| For left join this isn't entirely true. If there are more
| matching cases on the right joined table then you'll get
| additional rows for each match. That is unless you take steps to
| ensure only at most one row is matched per row on the left (eg
| using something like DISTINCT ON in Postgres)
| wood_spirit wrote:
| Yes I picked that up and was tempted to comment. But then half
| way down it addresses this with the section "Many
| relationships":
|
| > Until now we have discussed scenarios that are considered
| one-to-one merges. In these cases, we only expect one
| participant in a dataset to be joined to one instance of that
| same participant in the other dataset.
|
| > However, there are scenarios where this will not be the
| case...
| jarym wrote:
| Got it, in my opinion the structure is likely to mislead
| people because it doesn't make clear that it refers to 'one-
| to-one' relationships at the outset (in fact not even, it
| deals with the cases of one-to-one and one-to-none) - it only
| refers to 'left join'.
| wood_spirit wrote:
| Yeap, the whole article is confusing to us who are already
| very familiar with eg SQL joins and things. It's not using
| mainstream terminology. But I guess we are not the
| audience.
| petalmind wrote:
| This is my pet peeve. Top google search results and ChatGPT
| suggest this "same number of rows" mental model which is
| incomplete and breaks.
|
| I wrote about this:
| https://minimalmodeling.substack.com/p/many-explanations-of-...
| zzzeek wrote:
| when I saw the term "horizontal joins", I immediately went to,
| what? what's a "vertical join?" must be a UNION, scrolled down
| and sure enough.
|
| Is there some reason to use non-standard terminology for posts
| that are trying to be in-depth, authoritative tutorials ?
| genman wrote:
| Good material.
|
| Joins can be also categorized by the used join algorithm.
|
| The simplest join is a nested loop join. In this case a larger
| dataset is iterated and another small dataset is combined using a
| simple key lookup.
|
| Then there is a merge join. In this case two larger datasets are
| first sorted using merge sort and then aligned linearly.
|
| Then there is a hash join. For the hash join a hash table is
| generated first based on the smaller dataset and the larger
| dataset is iterated and the join is made by making a hash lookup
| to the generated hash table.
|
| The difference between nested loop join and hash join might be
| confusing.
|
| In case of a nested loop join the second table is not loaded from
| the storage medium first, instead an index is used to lookup the
| location of the records. This has O(log n) complexity for each
| lookup. For hash join the table is loaded and hash table is
| generated. In this case each lookup has O(1) complexity but
| creation of hash table is expensive (it has O(n) complexity) and
| is only worth the cost when the dataset is relatively large.
| tmoertel wrote:
| In distributed systems, there's even a _broadcast_ hash join,
| in which the hash table is not distributed across shards and
| assigned to workers but copied in full to every worker. That
| way, the other side of the join need not be sharded and
| shuffled to align join keys across workers. This strategy can
| save a lot of time and network bandwidth when one side of the
| join is small enough to fit into a worker 's RAM and the other
| side is staggeringly massive (e.g., logs). It lets the massive
| side be processed in place, as it's streamed from storage.
| lakomen wrote:
| Beginners talking about SQL joins. What is this newbienews or
| hackernews? "But but ChatGPT says"...
|
| Disgusting
___________________________________________________________________
(page generated 2024-01-20 23:00 UTC)