[HN Gopher] Accidentally exponential behavior in Spark
       ___________________________________________________________________
        
       Accidentally exponential behavior in Spark
        
       Author : drob
       Score  : 33 points
       Date   : 2021-06-21 18:54 UTC (4 hours ago)
        
 (HTM) web link (heap.io)
 (TXT) w3m dump (heap.io)
        
       | gopalv wrote:
        | I've hit almost the exact same issue with Hive, with a similar
        | temporary workaround to this post's: read the expressions into
        | a list and rebuild a balanced binary tree out of them.
       | 
        | But we ended up implementing a single-level Multi-AND [1] so
        | that this is no longer a tree for AND expressions & can be
        | vectorized more neatly than the nested structure with a
        | function call for each node (this looks more like a tail call
        | rather than a recursive function).
       | 
        | The ORC CNF conversion has a similarly exponential step
        | inside, which is protected by a check for 256 items or
        | fewer [2].
       | 
       | [1] - https://issues.apache.org/jira/browse/HIVE-13084
       | 
       | [2] - https://github.com/apache/hive/blob/master/storage-
       | api/src/j...
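The flattening and rebalancing described in this comment can be sketched as follows. This is a minimal illustration over a made-up `Expr` ADT, not Hive's actual classes: `conjuncts` reads a nested AND chain into a flat list, which can then back either a single-level multi-AND node or a rebuilt balanced binary tree.

```scala
// Hypothetical expression ADT for illustration -- not Hive's classes.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class MultiAnd(children: Vector[Expr]) extends Expr

// Collect every conjunct of a nested AND chain into a flat list.
def conjuncts(e: Expr): Vector[Expr] = e match {
  case And(l, r) => conjuncts(l) ++ conjuncts(r)
  case other     => Vector(other)
}

// Option 1: a single-level multi-AND, evaluated as one loop rather
// than a function call per nested node.
def toMultiAnd(e: Expr): Expr = e match {
  case a: And => MultiAnd(conjuncts(a))
  case other  => other
}

// Option 2: rebuild a balanced binary tree, turning depth n into
// depth log n.
def rebalance(cs: Vector[Expr]): Expr =
  if (cs.length == 1) cs.head
  else {
    val (l, r) = cs.splitAt(cs.length / 2)
    And(rebalance(l), rebalance(r))
  }

val chain = And(And(And(Leaf("a"), Leaf("b")), Leaf("c")), Leaf("d"))
val flat  = toMultiAnd(chain)
val bal   = rebalance(conjuncts(chain))
```

Here `flat` is one `MultiAnd` node over four leaves, and `bal` is a depth-2 tree instead of the original left-deep depth-3 chain.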
        
       | dreyfan wrote:
       | Spark is this weird ecosystem of people who take absolutely
       | trivial concepts in SQL, bury their heads in the sand and ignore
       | the past 50 years of RDBMS evolution, and then write extremely
       | complicated (or broken) and expensive to run code. But whatever
       | it takes to get Databricks to IPO! Afterwards the hype will die
       | down and everyone will collectively abandon it just like MongoDB
       | except for the unfortunate companies with so much technical debt
       | they can't extricate themselves from it.
        
         | scottmcdot wrote:
         | Is it best to just use spark.sql?
        
         | hrkfmud50k wrote:
          | spark is far more testable and composable than sql! and you
          | even get static type checking. plus i can read data from
         | anywhere - local fs, s3, rdbms, json, parquet, csv... rdbms
         | could not compete
        
       | hobs wrote:
        | Good read - fwiw, if this is your blog, some of your links are
        | broken and I think they are local -
        | https://heap.io/blog/%E2%80%9Dhttps://github.com/apache/spar...
        
         | drob wrote:
         | Fixed, thank you for flagging!
        
       | IvanVergiliev wrote:
       | Post author here. Let me know if you have any questions!
        
       | [deleted]
        
       | [deleted]
        
       | mlyle wrote:
        | Why not just...
        | 
        |     val transformedLeftTemp = transform(tree.left)
        |     val transformedLeft = if (transformedLeftTemp.isDefined) {
        |       transformedLeftTemp
        |     } else None
        
         | eska wrote:
         | It boggles my mind that the author wrote an entire long article
         | based on this.
         | 
          | The rhetorical question claiming that that weird refactor of
          | two different functions into one, followed by calling the
          | new, non-trivial function twice for no reason, surely
          | shouldn't affect performance... he already lost me at the
          | premise of the article.
        
           | renewiltord wrote:
           | What is so hard to understand here? There is some library
           | code you can't immediately change because it belongs to
            | upstream Spark. To illustrate the problem, ne simplifies
            | the code to show what the problem is.
           | 
           | Then, ne writes some code that works around the library bug
           | by modifying the input losslessly into something that's more
           | easily processed by the library.
           | 
           | Finally, ne patches the library bug and shares the patch.
           | 
           | All of this is also kinda fucking obvious to not just me, but
           | a lot of people, so I'm having a really hard time grasping if
           | you've mixed up the illustrative simplification with the
           | actual code, or if you think that the best engineering
           | approach is to always patch your environment bugs instead of
           | modifying your input, or if you just don't have a Github
           | account or for some other reason can't read the patch.
           | 
           | Between that patch and
           | https://github.com/apache/spark/pull/24910 you can see why
           | the code is what it is.
        
         | lanna wrote:
         | It looks like transform(tree.left) returns an Option[Tree]
         | already (otherwise the code would not type check) so the entire
         | if-else in the original code seems redundant and could be
         | replaced with:                   val transformedLeft =
         | transform(tree.left)
        
           | IvanVergiliev wrote:
           | Just responded to the parent comment as well - there's an
           | additional mutable argument to the real `transform` method so
           | it's unsafe to invoke it directly without first checking if
           | the tree is convertible.
        
         | IvanVergiliev wrote:
         | Good question, the simplified example doesn't make this clear.
         | 
         | The real implementation has a mutable `builder` argument used
         | to gradually build the converted filter. If we perform the
         | `transform().isDefined` call directly on the "main" builder,
         | but the subtree turns out to not be convertible, we can mess up
         | the state of the builder.
         | 
          | The second example from the post would look roughly like
          | this:
          | 
          |     val transformedLeft =
          |       if (transform(tree.left, new Builder()).isDefined) {
          |         transform(tree.left, mainBuilder)
          |       } else None
         | 
         | Since the two `transform` invocations are different, we can't
         | cache the result this way.
         | 
         | There's a more detailed explanation in the old comment to the
         | method:
         | https://github.com/apache/spark/pull/24068/files#diff-5de773...
         | .
        
       ___________________________________________________________________
       (page generated 2021-06-21 23:00 UTC)