From peter@dataloss.nl  Sat Apr 19 00:47:24 2003
Return-Path: <peter@dataloss.nl>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id B3E0C37B401
	for <FreeBSD-gnats-submit@freebsd.org>; Sat, 19 Apr 2003 00:47:24 -0700 (PDT)
Received: from useful.dataloss.nl (useful.dataloss.nl [81.17.40.64])
	by mx1.FreeBSD.org (Postfix) with SMTP id 9F2F843FA3
	for <FreeBSD-gnats-submit@freebsd.org>; Sat, 19 Apr 2003 00:47:23 -0700 (PDT)
	(envelope-from peter@dataloss.nl)
Received: (qmail 29699 invoked by uid 1001); 19 Apr 2003 07:47:21 -0000
Message-Id: <20030419074721.29698.qmail@useful.dataloss.nl>
Date: 19 Apr 2003 07:47:21 -0000
From: Peter van Dijk <peter@dataloss.nl>
Reply-To: Peter van Dijk <peter@dataloss.nl>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: du hardlinkmatching is slow - fix included
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         51151
>Category:       bin
>Synopsis:       du hardlinkmatching is slow - fix included
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    kientzle
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Sat Apr 19 00:50:10 PDT 2003
>Closed-Date:    Mon May 17 11:47:54 PDT 2004
>Last-Modified:  Mon May 17 11:47:54 PDT 2004
>Originator:     Peter van Dijk
>Release:        FreeBSD 4.7-RELEASE-p6 i386
>Organization:
-
>Environment:
System: FreeBSD useful.dataloss.nl 4.7-RELEASE-p6 FreeBSD 4.7-RELEASE-p6 #3: Fri Feb 28 21:41:55 CET 2003 root@useful.home.dataloss.nl:/home2/usr2/obj/home2/usr2/src/sys/USEFUL i386


>Description:

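Running du -hs on a set of directories with many hard links between them
is extremely slow. du records the device and inode number of every
hard-linked file it has counted in a flat array and searches that array
linearly for each further file, so the run time grows roughly quadratically
with the number of hard-linked files.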

>How-To-Repeat:

Create lots of hardlinks and run du on your directory.

>Fix:

This patch modifies du to use the balanced-tree routines provided by libisc.
On my test set it reduced du's run time from around 4 hours to just under 2
minutes.

I'm sorry that it depends on libisc, but I could not find any suitable
data structures in libc. I am open to suggestions.

--- /usr/src/usr.bin/du/du.c	Wed Sep 19 21:19:48 2001
+++ du.c	Thu Apr 17 15:16:17 2003
@@ -63,6 +63,7 @@
 #include <string.h>
 #include <sysexits.h>
 #include <unistd.h>
+#include <isc/tree.h>
 
 #define	KILO_SZ(n) (n)
 #define	MEGA_SZ(n) ((n) * (n))
@@ -104,6 +105,8 @@
 void		ignoreclean __P((void));
 int		ignorep __P((FTSENT *));
 
+tree *linktree;
+
 int
 main(argc, argv)
 	int argc;
@@ -232,6 +235,8 @@
 	blocksize /= 512;
 
 	rval = 0;
+
+	tree_init(&linktree);
 	
 	if ((fts = fts_open(argv, ftsoptions, NULL)) == NULL)
 		err(1, "fts_open");
@@ -308,36 +313,33 @@
 	exit(rval);
 }
 
-
-typedef struct _ID {
-	dev_t	dev;
-	ino_t	inode;
-} ID;
-
-
 int
 linkchk(p)
 	FTSENT *p;
 {
-	static ID *files;
-	static int maxfiles, nfiles;
-	ID *fp, *start;
 	ino_t ino;
 	dev_t dev;
+	char *s;
+
+	if((s=malloc(32))==0)
+		err(1, "malloc");
 
 	ino = p->fts_statp->st_ino;
 	dev = p->fts_statp->st_dev;
-	if ((start = files) != NULL)
-		for (fp = start + nfiles - 1; fp >= start; --fp)
-			if (ino == fp->inode && dev == fp->dev)
-				return (1);
-
-	if (nfiles == maxfiles && (files = realloc((char *)files,
-	    (u_int)(sizeof(ID) * (maxfiles += 128)))) == NULL)
-		errx(1, "can't allocate memory");
-	files[nfiles].inode = ino;
-	files[nfiles].dev = dev;
-	++nfiles;
+
+	snprintf(s, 32, "%u:%u", dev, ino);
+
+	if(tree_srch(&linktree, strcmp, s)==0)
+	{
+		tree_add(&linktree, strcmp, s, 0);
+		return(0);
+	}
+	else
+	{
+		free(s);
+		return(1);
+	}
+
 	return (0);
 }
 

>Release-Note:
>Audit-Trail:

From: David Schultz <das@FreeBSD.ORG>
To: Peter van Dijk <peter@dataloss.nl>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: bin/51151: du hardlinkmatching is slow - fix included
Date: Sun, 20 Apr 2003 01:27:24 -0700

 On Sat, Apr 19, 2003, Peter van Dijk wrote:
 > +	snprintf(s, 32, "%u:%u", dev, ino);
 > +
 > +	if(tree_srch(&linktree, strcmp, s)==0)
 > +	{
 > +		tree_add(&linktree, strcmp, s, 0);
 > +		return(0);
 > +	}
 > +	else
 > +	{
 > +		free(s);
 > +		return(1);
 > +	}
 
 A hash table would be more appropriate here.  You could even
 preserve the original behavior of making infrequent calls to
 malloc() and most of the original code by using open addressing.
 In any case, pulling in another library is probably not desirable.
 
 If you would like to revise this patch, I would be happy to help
 you get it committed.  You might also want to take a look at
 style(9).

From: Peter van Dijk <peter@dataloss.nl>
To: David Schultz <das@FreeBSD.ORG>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: bin/51151: du hardlinkmatching is slow - fix included
Date: Sun, 20 Apr 2003 11:18:35 +0200

 On Sun, Apr 20, 2003 at 01:27:24AM -0700, David Schultz wrote:
 > A hash table would be more appropriate here.  You could even
 > preserve the original behavior of making infrequent calls to
 > malloc() and most of the original code by using open addressing.
 
 Yup, I'd prefer a hash too, but libc doesn't provide a suitable one,
 so I figured giving this a shot was a viable option :)
 
 > In any case, pulling in another library is probably not desirable.
 
 I feared as much.
 
 > If you would like to revise this patch, I would be happy to help
 > you get it committed.  You might also want to take a look at
 > style(9).
 
 I indeed care about this patch (or at least, a patch that speeds up
 du) getting committed.
 
 I glanced over style(9) before I submitted the patch - any specific
 nits with the current patch, regarding style?
 
 [resent with gnats in Cc:]
 
 Greetz, Peter
 -- 
 peter@dataloss.nl | ~ we care a lot: about the war we're fighting
 www.dataloss.nl   |  - gee that looks like fun! (Faith no more)
 UnderNet/#clue    | 
                   |       iraqbodycount.net: min 1878, max 2325

From: David Schultz <das@FreeBSD.org>
To: Peter van Dijk <peter@dataloss.nl>
Cc: FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: bin/51151: du hardlinkmatching is slow - fix included
Date: Sun, 20 Apr 2003 02:51:49 -0700

 On Sun, Apr 20, 2003, Peter van Dijk wrote:
 > On Sun, Apr 20, 2003 at 01:27:24AM -0700, David Schultz wrote:
 > > A hash table would be more appropriate here.  You could even
 > > preserve the original behavior of making infrequent calls to
 > > malloc() and most of the original code by using open addressing.
 > 
 > Yup, I'd prefer a hash too, but libc doesn't provide a suitable one,
 > so I figured giving this a shot was a viable option :)
 
 In libc, there are hcreate(3) and friends, which work nicely except
 for their limitation of one hash table per module.  That shouldn't
 be an issue here.  Alternatively, you could roll your own easily
 enough.  Here's some pseudocode using chaining:
 
 	hval = hash(ino, dev);
 	for (p = table[hval]; p != NULL; p = p->next)
 		if (p->ino == ino && p->dev == dev)
 			return (1);
 	p = malloc(sizeof(hashent));
 	p->next = table[hval];
 	p->ino = ino;
 	p->dev = dev;
 	table[hval] = p;
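
The pseudocode above can be filled out into a compilable sketch. The bucket
count, the hash function, the struct layout and the name linkchk_chained are
assumptions made for illustration; they are not taken from du.c or from any
committed fix.

```c
#include <stdlib.h>
#include <sys/types.h>

#define	NBUCKETS	1024	/* assumed fixed size; must be a power of two */

/* One remembered hard-linked file, chained per bucket. */
struct hashent {
	struct hashent	*next;
	dev_t		 dev;
	ino_t		 ino;
};

static struct hashent *table[NBUCKETS];

/* Hypothetical hash: mix both IDs and reduce to a bucket index. */
static size_t
hash(ino_t ino, dev_t dev)
{
	return (((size_t)ino * 31 + (size_t)dev) & (NBUCKETS - 1));
}

/* Return 1 if (dev, ino) was seen before, 0 if newly recorded, -1 if malloc fails. */
static int
linkchk_chained(dev_t dev, ino_t ino)
{
	struct hashent *p;
	size_t hval;

	hval = hash(ino, dev);
	for (p = table[hval]; p != NULL; p = p->next)
		if (p->ino == ino && p->dev == dev)
			return (1);
	if ((p = malloc(sizeof(*p))) == NULL)
		return (-1);
	/* Push the new entry onto the front of its bucket's chain. */
	p->next = table[hval];
	p->ino = ino;
	p->dev = dev;
	table[hval] = p;
	return (0);
}
```

With a fixed bucket count the chains still grow with the number of files, so
lookups only stay fast while the file count is of the same order as NBUCKETS.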
 
 > > If you would like to revise this patch, I would be happy to help
 > > you get it committed.  You might also want to take a look at
 > > style(9).
 
 At a glance, just the whitespace around = and ==.

From: Peter van Dijk <peter@dataloss.nl>
To: David Schultz <das@FreeBSD.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: bin/51151: du hardlinkmatching is slow - fix included
Date: Sun, 20 Apr 2003 12:35:12 +0200

 On Sun, Apr 20, 2003 at 02:51:49AM -0700, David Schultz wrote:
 [snip]
 > In libc, there are hcreate(3) and friends, which work nicely except
 > for their limitation of one hash table per module.  That shouldn't
 > be an issue here.  Alternatively, you could roll your own easily
 > enough.  Here's some pseudocode using chaining:
 
 I tried hcreate but it failed miserably. I may investigate that some
 more (it seemed to misbehave on my input) and send-pr about it.
 
 I intend to roll my own with open addressing indeed.
 
 > 	hval = hash(ino, dev);
 > 	for (p = table[hval]; p != NULL; p = p->next)
 > 		if (p->ino == ino && p->dev == dev)
 > 			return (1);
 > 	p = malloc(sizeof(hashent));
 > 	p->next = table[hval];
 > 	p->ino = ino;
 > 	p->dev = dev;
 > 	table[hval] = p;
 
 Indeed, it shouldn't be hard :)
 
 Greetz, Peter

From: David Schultz <das@FreeBSD.org>
To: Peter van Dijk <peter@dataloss.nl>
Cc: FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: bin/51151: du hardlinkmatching is slow - fix included
Date: Sun, 20 Apr 2003 04:01:09 -0700

 On Sun, Apr 20, 2003, Peter van Dijk wrote:
 > On Sun, Apr 20, 2003 at 02:51:49AM -0700, David Schultz wrote:
 > [snip]
 > > In libc, there are hcreate(3) and friends, which work nicely except
 > > for their limitation of one hash table per module.  That shouldn't
 > > be an issue here.  Alternatively, you could roll your own easily
 > > enough.  Here's some pseudocode using chaining:
 > 
 > I tried hcreate but it failed miserably. I may investigate that some
 > more (it seemed to misbehave on my input) and send-pr about it.
 > 
 > I intend to roll my own with open addressing indeed.
 
 Cool.  One caveat I forgot about: you have to deal with expansion
 in that case.  Might not be worth it.

From: Peter van Dijk <peter@dataloss.nl>
To: David Schultz <das@FreeBSD.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: bin/51151: du hardlinkmatching is slow - fix included
Date: Sun, 20 Apr 2003 17:55:04 +0200

 On Sun, Apr 20, 2003 at 04:01:09AM -0700, David Schultz wrote:
 [snip]
 > > I intend to roll my own with open addressing indeed.
 > 
 > Cool.  One caveat I forgot about: you have to deal with expansion
 > in that case.  Might not be worth it.
 
 Just a matter of building a new empty table and re-filling it.. should
 be faster than the current state of affairs anyway.
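
The rebuild-and-refill approach might look like the following sketch, which
uses open addressing with linear probing. The initial size of 128 slots, the
load-factor limit of one half, the hash function and all names are
assumptions for illustration, not the code that was eventually committed.

```c
#include <stdlib.h>
#include <sys/types.h>

struct slot {
	dev_t	dev;
	ino_t	ino;
	int	used;
};

static struct slot *slots;	/* one contiguous array, no per-entry malloc */
static size_t nslots, nused;	/* nslots is zero or a power of two */

/* Hypothetical hash; the caller masks it down to a slot index. */
static size_t
hash(dev_t dev, ino_t ino)
{
	return ((size_t)ino * 31 + (size_t)dev);
}

/*
 * Linear probing: return the slot holding (dev, ino), or the first free
 * slot on its probe path.  Requires at least one free slot in tab.
 */
static struct slot *
probe(struct slot *tab, size_t n, dev_t dev, ino_t ino)
{
	size_t i;

	for (i = hash(dev, ino) & (n - 1); tab[i].used; i = (i + 1) & (n - 1))
		if (tab[i].ino == ino && tab[i].dev == dev)
			break;
	return (&tab[i]);
}

/* Build a new empty table of twice the size and re-fill it. */
static int
grow(void)
{
	struct slot *ntab;
	size_t i, n;

	n = nslots ? nslots * 2 : 128;
	if ((ntab = calloc(n, sizeof(*ntab))) == NULL)
		return (-1);
	for (i = 0; i < nslots; i++)
		if (slots[i].used)
			*probe(ntab, n, slots[i].dev, slots[i].ino) = slots[i];
	free(slots);
	slots = ntab;
	nslots = n;
	return (0);
}

/* Return 1 if (dev, ino) was seen before, 0 if newly recorded, -1 on error. */
static int
linkchk_open(dev_t dev, ino_t ino)
{
	struct slot *sp;

	/* Keep the table at most half full so probing always terminates. */
	if (nused * 2 >= nslots && grow() != 0)
		return (-1);
	sp = probe(slots, nslots, dev, ino);
	if (sp->used)
		return (1);
	sp->dev = dev;
	sp->ino = ino;
	sp->used = 1;
	nused++;
	return (0);
}
```

Because the table doubles each time, the total cost of all re-fills stays
proportional to the number of entries, and allocation calls remain as
infrequent as in the original array-based code.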
 
 I'll dig into it later this week :)
 
 Thanks for the support so far!
 
 Greetz, Peter

From: Tim Kientzle <tim@kientzle.com>
To: freebsd-gnats-submit@FreeBSD.org, peter@dataloss.nl
Cc:  
Subject: Re: bin/51151: du hardlinkmatching is slow - fix included
Date: Thu, 29 Apr 2004 22:30:16 -0700

 What's the status of this?
 
 I was thinking of copying the self-tuning hash
 code from bsdtar (which handles exactly this problem,
 and very efficiently), but if you have something better...
 
 Tim Kientzle
 

From: Peter van Dijk <peter@dataloss.nl>
To: Tim Kientzle <tim@kientzle.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: bin/51151: du hardlinkmatching is slow - fix included
Date: Fri, 30 Apr 2004 11:27:42 +0200

 On Thu, Apr 29, 2004 at 10:30:16PM -0700, Tim Kientzle wrote:
 > What's the status of this?
 
 Status is that I'm using the patch myself on a few machines, and that
 I haven't looked into it any further..
 
 > I was thinking of copying the self-tuning hash
 > code from bsdtar (which handles exactly this problem,
 > and very efficiently), but if you have something better...
 
 I have nothing better right now. Please go ahead, your idea sounds
 good :)
 
 Greetz, Peter
 -- 
 peter@dataloss.nl        | ~ tonight tonight, what is this potion
 http://blog.dataloss.nl/ | ~ that makes a fool of me
 UnderNet/#clue           |     Wayfinder, fr-025 soundtrack

State-Changed-From-To: open->patched
State-Changed-By: kientzle 
State-Changed-When: Fri Apr 30 11:34:11 PDT 2004 
State-Changed-Why:  
Fixed in revision 1.30 of du.c 


Responsible-Changed-From-To: freebsd-bugs->kientzle 
Responsible-Changed-By: kientzle 
Responsible-Changed-When: Fri Apr 30 11:34:11 PDT 2004 
Responsible-Changed-Why:  
Fixed in revision 1.30 of du.c 

http://www.freebsd.org/cgi/query-pr.cgi?pr=51151 
State-Changed-From-To: patched->closed 
State-Changed-By: kientzle 
State-Changed-When: Mon May 17 11:47:38 PDT 2004 
State-Changed-Why:  
Fixed 

http://www.freebsd.org/cgi/query-pr.cgi?pr=51151 
>Unformatted:
