Binary Trees are optimal... except when they're not.

The best-case depth for a search tree is O(\log_b n), if b is the arity (or branching) of the tree. Intuitively, we know that if we increase b, the depth goes down, but is that all there is to it? And if not, how do we choose the optimal branching b?

While it's true that an optimal search tree will have depth O(\log_b n), we can't just increase b indefinitely, because we also increase the number of comparisons we must do in each node. A b-ary tree node will have (in the worst case) b-1 keys, and for each we must do comparisons for lower-than and equal (the right-most child will be searched without comparison; it is the "else" of all the other comparisons).

We must first understand how these comparisons affect average search time. If we use a naive comparison scheme, each key asks for two (potentially expensive) comparisons: one to test if the sought value v is smaller than a key k, one to test if they're equal. If neither test is true, we move on to the next key. If none of the tests are true, it is an "else" case and we go down the rightmost branch. That scheme does way more work than necessary and would ask for 2(b-1) comparisons per node (because there are b-1 keys per node).

The main problem with this straightforward approach is that comparisons can be very expensive. While comparing two integral types (int and the like) is often thought of as "free", comparing strings is expensive. So you clearly do not want to test two strings twice, once for < and once for =.

The much better approach is to use a three-way comparison, also known as the "spaceship operator". The spaceship operator, <=>, is C++20's version of C's old-timey qsort comparison function (it is in fact much better, because it also automagically provides all comparison operators). Basically, a<=>b returns a result that is <0 if a<b, ==0 if a==b, and >0 if a>b. We can therefore store the result of the expensive comparison, then do inexpensive comparisons on that result. That reduces the number of expensive comparisons to b-1 per node.

The search complexity, counting the number of comparisons in the worst case for the best-case tree, is

\displaystyle (b-1)\log_b n=(b-1)\frac{\ln n}{\ln b}=\frac{b-1}{\ln b}\ln n,

a strictly increasing function in b>1 (as we can treat \ln n as a constant, since we're not optimizing for n):

[Figure: the comparison cost (b-1)/\ln b as a function of b]

We therefore conclude that b=2 is the optimal solution.

Except when it isn't

We have neglected, or rather abstracted, something important: the cost of accessing the node. In main memory, it's very convenient to suppose that reading a node is free, that is, incurs no cost. That's of course false, because even reading from cache incurs a delay, fortunately a very small one. It is even more false if we read the node from disk (yes, even NVMe). A classical spinning-rust disk reading a block will incur a delay of a few milliseconds, which may be really, really expensive relative to the comparison of two keys (once they're in memory).

The new cost function to optimize will be

\displaystyle ((b-1)c_1+c_2)\log_b n=\frac{(b-1)c_1+c_2}{\ln b}\ln n,

with c_1 the comparison cost and c_2 the cost of accessing the node. We can set c_1 to 1 and make c_2 relative to c_1; let's say something like c_2=100.
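Before optimizing this cost function, it's worth making the per-node comparison count concrete. Here is a minimal sketch of the within-node search described earlier, using the spaceship operator. The node layout (a sorted vector of std::string keys plus child pointers) and the helper name find are assumptions made for illustration, not the post's actual code:

```cpp
#include <compare>
#include <string>
#include <vector>

// One node of a b-ary search tree: up to b-1 sorted keys and b children.
// (Hypothetical layout, for illustration only.)
struct node
{
    std::vector<std::string> keys;     // at most b-1 keys, kept sorted
    std::vector<node*>       children; // keys.size()+1 children, empty if leaf
};

// Search for v, doing only one expensive (string) comparison per key:
// the result of <=> is stored, then tested cheaply against 0.
const node* find(const node* t, const std::string& v)
{
    while (t)
    {
        std::size_t i = 0;
        for (; i < t->keys.size(); ++i)
        {
            const auto c = v <=> t->keys[i]; // the one expensive comparison
            if (c == 0) return t;            // v found in this node
            if (c < 0) break;                // v belongs in subtree i
        }
        // follow child i; reaching i == keys.size() is the "else" branch
        t = t->children.empty() ? nullptr : t->children[i];
    }
    return nullptr;
}
```

Each key costs at most one string comparison, and the two cheap tests reuse the stored result; that is where the b-1 expensive comparisons per node in the analysis come from.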
Now, luckily for us, the cost function above is convex in b, so we can solve for the optimal b given c_1 and c_2. To do so, we take the derivative and find where it is zero, that is, solve

\displaystyle \frac{\partial}{\partial b}\frac{(b-1)c_1+c_2}{\ln b}=\frac{b(\ln b-1)c_1+c_1-c_2}{b(\ln b)^2}=0

for b. Since the denominator is strictly positive for b>1, we must solve for a zero numerator. However, there's a slight complication:

\displaystyle b(\ln b-1)=\frac{c_2}{c_1}-1

isn't algebraic. We must resort to a numerical evaluation of the Lambert W function. It gives us, with u=\frac{c_2}{c_1}-1,

\displaystyle b^*=\frac{u}{\mathrm{LambertW}\left(\frac{u}{e}\right)}.

The following graph shows the function's surface with c_1=5 and c_2=200, one axis being b, the branching (or block size), and the other being n, the number of items in the tree. The green plane shows the optimal b found with the Lambert W function.

[Figure: search cost surface as a function of b and n; the green plane marks the optimal b]

* * *

To conclude, binary trees are optimal when we ignore the cost of accessing a node, but they aren't when accessing a node becomes costly. When we access the nodes on disk, with a high access cost, it becomes interesting to bundle many keys in a node, and we gave the optimal solution above. However, the problem is often solved much more directly: we just fit as many keys as possible in an I/O block. Disks operate on small blocks of (typically) 512 bytes, but file systems operate on somewhat larger, fixed-size blocks, something like 4 or 8 KB. A good strategy is therefore to fit as many keys as possible in such a block, since even a large number of in-memory comparisons is still much cheaper than bringing another block from disk into main memory.
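As a back-of-the-envelope illustration of that last strategy, here is a small sketch that computes the largest branching factor fitting in a block. The sizes (a 4 KB block, 8-byte keys, 8-byte child pointers) are assumptions chosen for the example, not numbers from the post:

```cpp
#include <cstdio>

// Largest b such that a node with b-1 keys and b child pointers fits in
// one block: (b-1)*key_size + b*ptr_size <= block_size.
unsigned max_branching(unsigned block_size, unsigned key_size, unsigned ptr_size)
{
    return (block_size + key_size) / (key_size + ptr_size);
}

int main()
{
    // hypothetical sizes: 4 KB file-system block, 8-byte keys,
    // 8-byte child pointers (e.g., disk offsets)
    std::printf("b = %u\n", max_branching(4096, 8, 8)); // prints b = 256
}
```

With these made-up sizes, a node holds 255 keys and 256 children, so a tree of height three already indexes roughly 256^3, about 16.7 million keys, at a cost of only three block reads per search.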
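Finally, circling back to the closed form b^*=u/\mathrm{LambertW}(u/e): here is a minimal numerical sketch, assuming the illustrative costs c_1=1 and c_2=100 used earlier and a plain Newton iteration for the Lambert W function (not a production-quality routine):

```cpp
#include <cmath>
#include <cstdio>

// Lambert W (principal branch) by Newton iteration: solves w*exp(w) = x
// for x >= 0. Good enough for this illustration.
double lambert_w(double x)
{
    double w = (x < 2) ? x : std::log(x);   // crude starting point
    for (int i = 0; i < 50; ++i)
    {
        double e = std::exp(w);
        w -= (w * e - x) / (e * (w + 1));   // Newton step on f(w) = w e^w - x
    }
    return w;
}

int main()
{
    const double c1 = 1, c2 = 100;          // illustrative costs from the text
    const double u = c2 / c1 - 1;
    const double b_star = u / lambert_w(u / std::exp(1.0));
    std::printf("optimal branching b* = %.1f\n", b_star); // about 37.7
}
```

For these costs it reports an optimal branching of about 37.7, which is indeed nowhere near b=2.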