[HN Gopher] Accidentally exponential behavior in Spark
___________________________________________________________________
Accidentally exponential behavior in Spark
Author : drob
Score : 33 points
Date : 2021-06-21 18:54 UTC (4 hours ago)
(HTM) web link (heap.io)
(TXT) w3m dump (heap.io)
| gopalv wrote:
| I've hit almost exactly the same issue with Hive, and used a
| similarly temporary workaround to the one in this post: read
| the expression tree into a list and rebuild a balanced binary
| tree out of it.
|
| But we ended up implementing a single-level Multi-AND [1], so
| that pure AND expressions are no longer a tree and can be
| vectorized more neatly than the nested structure with a
| function call per node (this looks more like a tail call than
| a recursive function).
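The multi-AND idea can be illustrated with another toy sketch (hypothetical names, not the actual Hive implementation): collapse nested binary ANDs into one n-ary node whose children are evaluated in a flat loop instead of one call per operator.

```python
class And:
    """Binary AND node (same toy shape as above)."""
    def __init__(self, left, right):
        self.left, self.right = left, right

class MultiAnd:
    """Single-level n-ary AND: children evaluated in a flat loop."""
    def __init__(self, children):
        self.children = children

def to_multi_and(node):
    """Flatten a nested binary AND tree into a single MultiAnd."""
    if not isinstance(node, And):
        return node
    children, stack = [], [node]
    while stack:
        n = stack.pop()
        if isinstance(n, And):
            stack.append(n.right)  # pushed second, popped after left
            stack.append(n.left)
        else:
            children.append(n)
    return MultiAnd(children)

def evaluate(node, row):
    """Evaluate a predicate tree; leaves are callables row -> bool."""
    if isinstance(node, MultiAnd):
        return all(evaluate(c, row) for c in node.children)
    if isinstance(node, And):
        return evaluate(node.left, row) and evaluate(node.right, row)
    return node(row)
```

The flat `all(...)` loop over children is the part that maps onto vectorized execution more naturally than a recursive call per binary operator.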
|
| The ORC CNF conversion has a similarly exponential step
| inside, which is guarded by a check for 256 items or fewer
| [2].
|
| [1] - https://issues.apache.org/jira/browse/HIVE-13084
|
| [2] - https://github.com/apache/hive/blob/master/storage-
| api/src/j...
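The CNF blowup behind that 256-item guard is easy to demonstrate. A hedged sketch (not the ORC code): converting an OR of AND-groups to CNF distributes OR over AND, producing one clause per choice of a single conjunct from each disjunct, so the clause count is the product of the group sizes.

```python
from itertools import product

def to_cnf_guarded(disjuncts, max_clauses=256):
    """Convert OR(AND(...), AND(...), ...) to CNF by distribution:
    one clause per choice of one conjunct from each disjunct.
    Returns None (bail out, keep the original form) if the clause
    count would exceed the guard."""
    size = 1
    for d in disjuncts:
        size *= len(d)
        if size > max_clauses:
            return None
    return [tuple(choice) for choice in product(*disjuncts)]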
| dreyfan wrote:
| Spark is this weird ecosystem of people who take absolutely
| trivial concepts in SQL, bury their heads in the sand, ignore
| the past 50 years of RDBMS evolution, and then write extremely
| complicated (or broken) and expensive-to-run code. But whatever
| it takes to get Databricks to IPO! Afterwards the hype will die
| down and everyone will collectively abandon it, just like
| MongoDB, except for the unfortunate companies with so much
| technical debt they can't extricate themselves from it.
| scottmcdot wrote:
| Is it best to just use spark.sql?
| hrkfmud50k wrote:
| spark is far more testable and composable than sql! and you
| even get static type checking. plus i can read data from
| anywhere - local fs, s3, rdbms, json, parquet, csv... rdbms
| could not compete
| hobs wrote:
| Good read - fwiw if this is your blog, some of your links are
| broken and seem to be treated as local -
| https://heap.io/blog/%E2%80%9Dhttps://github.com/apache/spar...
| drob wrote:
| Fixed, thank you for flagging!
| IvanVergiliev wrote:
| Post author here. Let me know if you have any questions!
| [deleted]
| [deleted]
| mlyle wrote:
| Why not just...
|     val transformedLeftTemp = transform(tree.left)
|     val transformedLeft = if (transformedLeftTemp.isDefined) {
|       transformedLeftTemp
|     } else None
| eska wrote:
| It boggles my mind that the author wrote an entire long article
| based on this.
|
| The rhetorical question implying that that weird refactor of
| two different functions into one, followed by calling the new,
| non-trivial function twice for no reason, surely shouldn't
| affect performance... he already lost me at the premise of the
| article.
| renewiltord wrote:
| What is so hard to understand here? There is some library
| code you can't immediately change because it belongs to
| upstream Spark. To illustrate the problem, ne simplifies the
| code.
|
| Then, ne writes some code that works around the library bug
| by modifying the input losslessly into something that's more
| easily processed by the library.
|
| Finally, ne patches the library bug and shares the patch.
|
| All of this is also kinda fucking obvious to not just me, but
| a lot of people, so I'm having a really hard time grasping if
| you've mixed up the illustrative simplification with the
| actual code, or if you think that the best engineering
| approach is to always patch your environment bugs instead of
| modifying your input, or if you just don't have a Github
| account or for some other reason can't read the patch.
|
| Between that patch and
| https://github.com/apache/spark/pull/24910 you can see why
| the code is what it is.
| lanna wrote:
| It looks like transform(tree.left) returns an Option[Tree]
| already (otherwise the code would not type check), so the
| entire if-else in the original code seems redundant and could
| be replaced with:
|     val transformedLeft = transform(tree.left)
| IvanVergiliev wrote:
| Just responded to the parent comment as well - there's an
| additional mutable argument to the real `transform` method so
| it's unsafe to invoke it directly without first checking if
| the tree is convertible.
| IvanVergiliev wrote:
| Good question, the simplified example doesn't make this clear.
|
| The real implementation has a mutable `builder` argument used
| to gradually build the converted filter. If we perform the
| `transform().isDefined` call directly on the "main" builder,
| but the subtree turns out to not be convertible, we can mess up
| the state of the builder.
|
| The second example from the post would look roughly like this:
|     val transformedLeft =
|       if (transform(tree.left, new Builder()).isDefined) {
|         transform(tree.left, mainBuilder)
|       } else None
|
| Since the two `transform` invocations are different, we can't
| cache the result this way.
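The constraint Ivan describes - a probe call that can corrupt the shared mutable builder - can be modeled with a scratch-builder dry run. This is a hypothetical sketch (`Builder`, `transform_checked` are made-up names, not Spark's API, and not what the actual patch does): probe against a throwaway builder, and replay into the main builder only on success.

```python
class Builder:
    """Toy stand-in for the mutable filter builder."""
    def __init__(self):
        self.parts = []
    def add(self, part):
        self.parts.append(part)

def transform(pred, builder):
    """Appends to the builder as it recurses; returns None if the
    predicate isn't convertible - possibly after partial mutation."""
    if pred.startswith("unsupported"):
        return None
    builder.add(pred)
    return pred

def transform_checked(pred, main_builder):
    """Dry-run against a scratch builder; commit to the main builder
    only on success, so a failed subtree can't corrupt its state."""
    scratch = Builder()
    if transform(pred, scratch) is None:
        return None  # main builder left untouched
    for part in scratch.parts:
        main_builder.add(part)
    return pred
```

Note this shape still does the conversion work twice per subtree unless the scratch results are reused, which is exactly the trade-off the thread is discussing.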
|
| There's a more detailed explanation in the old comment to the
| method:
| https://github.com/apache/spark/pull/24068/files#diff-5de773...
___________________________________________________________________
(page generated 2021-06-21 23:00 UTC)