From nobody@FreeBSD.org  Thu Aug 29 09:19:44 2002
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id BD71E37B400
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 29 Aug 2002 09:19:44 -0700 (PDT)
Received: from www.freebsd.org (www.FreeBSD.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8054743E42
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 29 Aug 2002 09:19:44 -0700 (PDT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.12.4/8.12.4) with ESMTP id g7TGJiOT008161
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 29 Aug 2002 09:19:44 -0700 (PDT)
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.12.4/8.12.4/Submit) id g7TGJikm008160;
	Thu, 29 Aug 2002 09:19:44 -0700 (PDT)
Message-Id: <200208291619.g7TGJikm008160@www.freebsd.org>
Date: Thu, 29 Aug 2002 09:19:44 -0700 (PDT)
From: Mike Harding <mvh@ix.netcom.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: du uses linear search for duplicate inodes - very slow!
X-Send-Pr-Version: www-1.0

>Number:         42167
>Category:       misc
>Synopsis:       du uses linear search for duplicate inodes - very slow!
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    kientzle
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Thu Aug 29 09:20:01 PDT 2002
>Closed-Date:    Mon May 17 11:47:28 PDT 2004
>Last-Modified:  Mon May 17 11:47:28 PDT 2004
>Originator:     Mike Harding
>Release:        4.6-STABLE
>Organization:
Welkyn Technologies, LLC
>Environment:
FreeBSD netcom1.netcom.com 4.6-STABLE FreeBSD 4.6-STABLE #0: Wed Aug 28 09:57:56 PDT 2002     mvh@netcom1.netcom.com:/usr/obj/usr/src/sys/MIKEIPF  i386

>Description:
'du' uses an unordered linear search to detect duplicate inodes.  This causes du to run _extremely_ slowly if you have a lot of hard links.  I have a local news spool created by 'leafnode+', and it takes hours at 100% CPU to run du on it.  Each lookup scans every inode recorded so far, so the total work is order N^2 in the number of inodes recorded, which is very inefficient.
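
To make the quadratic growth concrete, here is a minimal sketch that mimics the shape of an unordered "seen before?" list and counts its comparisons. It is not the du.c code; `seen_list`, `seen_check` and `count_comparisons` are hypothetical names, and plain `unsigned long` keys stand in for inode numbers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an unordered list of inode numbers. */
struct seen_list {
	unsigned long *keys;		/* keys recorded so far */
	size_t n, cap;			/* used and allocated slots */
	unsigned long long comparisons;	/* instrumentation only */
};

/* Returns 1 if key was already present, 0 if newly added, -1 on failure. */
int
seen_check(struct seen_list *s, unsigned long key)
{
	size_t i;

	/* Linear scan over everything recorded so far. */
	for (i = 0; i < s->n; i++) {
		s->comparisons++;
		if (s->keys[i] == key)
			return (1);
	}
	/* Not found: grow in fixed steps and append. */
	if (s->n == s->cap) {
		size_t ncap = s->cap + 128;
		unsigned long *nk = realloc(s->keys, ncap * sizeof(*nk));

		if (nk == NULL)
			return (-1);
		s->keys = nk;
		s->cap = ncap;
	}
	s->keys[s->n++] = key;
	return (0);
}

/* Feed n distinct keys and report how many comparisons were made. */
unsigned long long
count_comparisons(size_t n)
{
	struct seen_list s = { NULL, 0, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++)
		if (seen_check(&s, (unsigned long)i) < 0)
			break;
	free(s.keys);
	return (s.comparisons);
}
```

With n distinct keys the scan performs 0 + 1 + ... + (n - 1) = n(n - 1)/2 comparisons, so a million recorded inodes would cost roughly 5 x 10^11 comparisons in this model.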
>How-To-Repeat:
Run du on a directory with a _lot_ of files (my leafnode directory has about 1 million files).
>Fix:
See the function linkchk() in du.c - this function does a linear search and insert.  It could be enhanced to use a hash table, a btree, an ordered list, or an ordered list of a fixed size plus a sorted overflow list.  Another possibility is to use a set of hash buckets with the same underlying algorithm in each bucket - I will be trying to code up this approach in the next few days.
>Release-Note:
>Audit-Trail:

From: Mike Harding <mvh@netcom.com>
To: freebsd-gnats-submit@FreeBSD.org, mvh@ix.netcom.com
Cc:  
Subject: Re: misc/42167: du uses linear search for duplicate inodes - very
 slow!
Date: Fri, 30 Aug 2002 17:29:18 -0700

 Here's a patch file which uses 256 lists rather than one - it
 doesn't slow down nearly as fast...
 
 begin 644 du.c
 M26YD97@Z(&1U+F,*/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]
 M/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/0I20U,@9FEL93H@
 M+W5S<B]L;V-A;"]C=G-R;V]T+V1U+V1U+F,L=@IR971R:65V:6YG(')E=FES
 M:6]N(#$N,0IR971R:65V:6YG(')E=FES:6]N(#$N,PID:69F("UC("UR,2XQ
 M("UR,2XS"BHJ*B!D=2YC"3,P($%U9R`R,#`R(#$T.C0R.C0Q("TP,#`P"3$N
 M,0HM+2T@9'4N8PDS,2!!=6<@,C`P,B`P,#HR,#HS.2`M,#`P,`DQ+C,**BHJ
 M*BHJ*BHJ*BHJ*BHJ"BHJ*B`S,30L,S0S("HJ*BH*("`):6YO7W0):6YO9&4[
 M"B`@?2!)1#L*("`*("`*("!I;G0*("!L:6YK8VAK*'`I"B`@"4944T5.5"`J
 M<#L*("!["B$@"7-T871I8R!)1"`J9FEL97,["B$@"7-T871I8R!I;G0@;6%X
 M9FEL97,L(&YF:6QE<SL*("`)240@*F9P+"`J<W1A<G0["B`@"6EN;U]T(&EN
 M;SL*("`)9&5V7W0@9&5V.PH@(`H@(`EI;F\@/2!P+3YF='-?<W1A='`M/G-T
 M7VEN;SL*("`)9&5V(#T@<"T^9G1S7W-T871P+3YS=%]D978["B$@"6EF("@H
 M<W1A<G0@/2!F:6QE<RD@(3T@3E5,3"D*(2`)"69O<B`H9G`@/2!S=&%R="`K
 M(&YF:6QE<R`M(#$[(&9P(#X]('-T87)T.R`M+69P*0H@(`D)"6EF("AI;F\@
 M/3T@9G`M/FEN;V1E("8F(&1E=B`]/2!F<"T^9&5V*0H@(`D)"0ER971U<FX@
 M*#$I.PH@(`HA(`EI9B`H;F9I;&5S(#T](&UA>&9I;&5S("8F("AF:6QE<R`]
 M(')E86QL;V,H*&-H87(@*BEF:6QE<RP*(2`)("`@("AU7VEN="DH<VEZ96]F
 M*$E$*2`J("AM87AF:6QE<R`K/2`Q,C@I*2DI(#T]($Y53$PI"B`@"0EE<G)X
 M*#$L(")C86XG="!A;&QO8V%T92!M96UO<GDB*3L*(2`)9FEL97-;;F9I;&5S
 M72YI;F]D92`](&EN;SL*(2`)9FEL97-;;F9I;&5S72YD978@/2!D978["B$@
 M"2LK;F9I;&5S.PH@(`ER971U<FX@*#`I.PH@('T*("`*+2TM(#,Q-"PS-3(@
 M+2TM+0H@(`EI;F]?=`EI;F]D93L*("!]($E$.PH@(`HK("-D969I;F4@2$%3
 M2%-)6D4@,C4V"B`@"B`@:6YT"B`@;&EN:V-H:RAP*0H@(`E&5%-%3E0@*G`[
 M"B`@>PHA(`ES=&%T:6,@240@*BIF:6QE<SL*(2`)<W1A=&EC(&EN="`J;6%X
 M9FEL97,L("IN9FEL97,["B$@"7-T871I8R!)1"`J9FEL97-P:%M(05-(4TE:
 M15T["B$@"7-T871I8R!I;G0@;6%X9FEL97-H6TA!4TA325I%72P@;F9I;&5S
 M:%M(05-(4TE:15T["B`@"4E$("IF<"P@*G-T87)T.PH@(`EI;F]?="!I;F\[
 M"B`@"61E=E]T(&1E=CL**R`):6YT(&EN9&5X.PH@(`H@(`EI;F\@/2!P+3YF
 M='-?<W1A='`M/G-T7VEN;SL**R`):6YD97@@/2!A8G,H(&EN;R`E($A!4TA3
 M25I%*3L**R`)9FEL97,@/2`F9FEL97-P:%MI;F1E>%T["BL@"6UA>&9I;&5S
 M(#T@)FUA>&9I;&5S:%MI;F1E>%T["BL@"6YF:6QE<R`]("9N9FEL97-H6VEN
 M9&5X73L**R`*("`)9&5V(#T@<"T^9G1S7W-T871P+3YS=%]D978["B$@"6EF
 M("@H<W1A<G0@/2`J9FEL97,I("$]($Y53$PI"B$@"0EF;W(@*&9P(#T@<W1A
 M<G0@*R`J;F9I;&5S("T@,3L@9G`@/CT@<W1A<G0[("TM9G`I"B`@"0D):68@
 M*&EN;R`]/2!F<"T^:6YO9&4@)B8@9&5V(#T](&9P+3YD978I"B`@"0D)"7)E
 M='5R;B`H,2D["B`@"B$@"6EF("@J;F9I;&5S(#T]("IM87AF:6QE<R`F)B`H
 M*F9I;&5S(#T@<F5A;&QO8R@H8VAA<B`J*2IF:6QE<RP*(2`)("`@("AU7VEN
 M="DH<VEZ96]F*$E$*2`J("@J;6%X9FEL97,@*ST@,3(X*2DI*2`]/2!.54Q,
 M*0H@(`D)97)R>"@Q+"`B8V%N)W0@86QL;V-A=&4@;65M;W)Y(BD["B$@"2@J
 M9FEL97,I6RIN9FEL97-=+FEN;V1E(#T@:6YO.PHA(`DH*F9I;&5S*5LJ;F9I
 M;&5S72YD978@/2!D978["B$@"2LK*"IN9FEL97,I.PH@(`ER971U<FX@*#`I
 ).PH@('T*("`*
 `
 end
 
 

From: Mike Harding <mvh@netcom.com>
To: freebsd-gnats-submit@FreeBSD.org, mvh@ix.netcom.com
Cc:  
Subject: Re: misc/42167: du uses linear search for duplicate inodes - very
 slow!
Date: Fri, 30 Aug 2002 17:34:10 -0700

 Sorry, I am new to patches - here's a proper patch file, I think...
 
 begin 644 du.c.patch
 M26YD97@Z(&1U+F,*/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]
 M/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/3T]/0I20U,@9FEL93H@
 M+W5S<B]L;V-A;"]C=G-R;V]T+V1U+V1U+F,L=@IR971R:65V:6YG(')E=FES
 M:6]N(#$N,0IR971R:65V:6YG(')E=FES:6]N(#$N,PID:69F("UC("UR,2XQ
 M("UR,2XS"BHJ*B!D=2YC"3,P($%U9R`R,#`R(#$T.C0R.C0Q("TP,#`P"3$N
 M,0HM+2T@9'4N8PDS,2!!=6<@,C`P,B`P,#HR,#HS.2`M,#`P,`DQ+C,**BHJ
 M*BHJ*BHJ*BHJ*BHJ"BHJ*B`S,30L,S0S("HJ*BH*("`):6YO7W0):6YO9&4[
 M"B`@?2!)1#L*("`*("`*("!I;G0*("!L:6YK8VAK*'`I"B`@"4944T5.5"`J
 M<#L*("!["B$@"7-T871I8R!)1"`J9FEL97,["B$@"7-T871I8R!I;G0@;6%X
 M9FEL97,L(&YF:6QE<SL*("`)240@*F9P+"`J<W1A<G0["B`@"6EN;U]T(&EN
 M;SL*("`)9&5V7W0@9&5V.PH@(`H@(`EI;F\@/2!P+3YF='-?<W1A='`M/G-T
 M7VEN;SL*("`)9&5V(#T@<"T^9G1S7W-T871P+3YS=%]D978["B$@"6EF("@H
 M<W1A<G0@/2!F:6QE<RD@(3T@3E5,3"D*(2`)"69O<B`H9G`@/2!S=&%R="`K
 M(&YF:6QE<R`M(#$[(&9P(#X]('-T87)T.R`M+69P*0H@(`D)"6EF("AI;F\@
 M/3T@9G`M/FEN;V1E("8F(&1E=B`]/2!F<"T^9&5V*0H@(`D)"0ER971U<FX@
 M*#$I.PH@(`HA(`EI9B`H;F9I;&5S(#T](&UA>&9I;&5S("8F("AF:6QE<R`]
 M(')E86QL;V,H*&-H87(@*BEF:6QE<RP*(2`)("`@("AU7VEN="DH<VEZ96]F
 M*$E$*2`J("AM87AF:6QE<R`K/2`Q,C@I*2DI(#T]($Y53$PI"B`@"0EE<G)X
 M*#$L(")C86XG="!A;&QO8V%T92!M96UO<GDB*3L*(2`)9FEL97-;;F9I;&5S
 M72YI;F]D92`](&EN;SL*(2`)9FEL97-;;F9I;&5S72YD978@/2!D978["B$@
 M"2LK;F9I;&5S.PH@(`ER971U<FX@*#`I.PH@('T*("`*+2TM(#,Q-"PS-3(@
 M+2TM+0H@(`EI;F]?=`EI;F]D93L*("!]($E$.PH@(`HK("-D969I;F4@2$%3
 M2%-)6D4@,C4V"B`@"B`@:6YT"B`@;&EN:V-H:RAP*0H@(`E&5%-%3E0@*G`[
 M"B`@>PHA(`ES=&%T:6,@240@*BIF:6QE<SL*(2`)<W1A=&EC(&EN="`J;6%X
 M9FEL97,L("IN9FEL97,["B$@"7-T871I8R!)1"`J9FEL97-P:%M(05-(4TE:
 M15T["B$@"7-T871I8R!I;G0@;6%X9FEL97-H6TA!4TA325I%72P@;F9I;&5S
 M:%M(05-(4TE:15T["B`@"4E$("IF<"P@*G-T87)T.PH@(`EI;F]?="!I;F\[
 M"B`@"61E=E]T(&1E=CL**R`):6YT(&EN9&5X.PH@(`H@(`EI;F\@/2!P+3YF
 M='-?<W1A='`M/G-T7VEN;SL**R`):6YD97@@/2!A8G,H(&EN;R`E($A!4TA3
 M25I%*3L**R`)9FEL97,@/2`F9FEL97-P:%MI;F1E>%T["BL@"6UA>&9I;&5S
 M(#T@)FUA>&9I;&5S:%MI;F1E>%T["BL@"6YF:6QE<R`]("9N9FEL97-H6VEN
 M9&5X73L**R`*("`)9&5V(#T@<"T^9G1S7W-T871P+3YS=%]D978["B$@"6EF
 M("@H<W1A<G0@/2`J9FEL97,I("$]($Y53$PI"B$@"0EF;W(@*&9P(#T@<W1A
 M<G0@*R`J;F9I;&5S("T@,3L@9G`@/CT@<W1A<G0[("TM9G`I"B`@"0D):68@
 M*&EN;R`]/2!F<"T^:6YO9&4@)B8@9&5V(#T](&9P+3YD978I"B`@"0D)"7)E
 M='5R;B`H,2D["B`@"B$@"6EF("@J;F9I;&5S(#T]("IM87AF:6QE<R`F)B`H
 M*F9I;&5S(#T@<F5A;&QO8R@H8VAA<B`J*2IF:6QE<RP*(2`)("`@("AU7VEN
 M="DH<VEZ96]F*$E$*2`J("@J;6%X9FEL97,@*ST@,3(X*2DI*2`]/2!.54Q,
 M*0H@(`D)97)R>"@Q+"`B8V%N)W0@86QL;V-A=&4@;65M;W)Y(BD["B$@"2@J
 M9FEL97,I6RIN9FEL97-=+FEN;V1E(#T@:6YO.PHA(`DH*F9I;&5S*5LJ;F9I
 M;&5S72YD978@/2!D978["B$@"2LK*"IN9FEL97,I.PH@(`ER971U<FX@*#`I
 ).PH@('T*("`*
 `
 end
 
 

From: Mike Harding <mvh@netcom.com>
To: freebsd-gnats-submit@FreeBSD.org, mvh@ix.netcom.com
Cc:  
Subject: Re: misc/42167: du uses linear search for duplicate inodes - very
 slow!
Date: Fri, 30 Aug 2002 19:52:57 -0700

 Here's a unified diff, for perusal by the merely curious.  The general
 approach is to replace the single unordered list with HASHSIZE unordered
 lists.  The hashing function is a simple remainder (ino % HASHSIZE),
 which should be sufficient as inodes should be very evenly distributed
 (sequential, no?).  The use of 'abs' below may be unnecessary if ino_t
 is guaranteed to be unsigned; I am unaware of the convention here...
 
 - Mike H.
 
 bash-2.05b$ cvs diff -u -r 1.1 -r 1.3 du.c
 Index: du.c
 ===================================================================
 RCS file: /usr/local/cvsroot/du/du.c,v
 retrieving revision 1.1
 retrieving revision 1.3
 diff -u -r1.1 -r1.3
 --- du.c    30 Aug 2002 14:42:41 -0000    1.1
 +++ du.c    31 Aug 2002 00:20:39 -0000    1.3
 @@ -314,30 +314,39 @@
      ino_t    inode;
  } ID;
  
 +#define HASHSIZE 256
  
  int
  linkchk(p)
      FTSENT *p;
  {
 -    static ID *files;
 -    static int maxfiles, nfiles;
 +    static ID **files;
 +    static int *maxfiles, *nfiles;
 +    static ID *filesph[HASHSIZE];
 +    static int maxfilesh[HASHSIZE], nfilesh[HASHSIZE];
      ID *fp, *start;
      ino_t ino;
      dev_t dev;
 +    int index;
  
      ino = p->fts_statp->st_ino;
 +    index = abs( ino % HASHSIZE);
 +    files = &filesph[index];
 +    maxfiles = &maxfilesh[index];
 +    nfiles = &nfilesh[index];
 +
      dev = p->fts_statp->st_dev;
 -    if ((start = files) != NULL)
 -        for (fp = start + nfiles - 1; fp >= start; --fp)
 +    if ((start = *files) != NULL)
 +        for (fp = start + *nfiles - 1; fp >= start; --fp)
              if (ino == fp->inode && dev == fp->dev)
                  return (1);
  
 -    if (nfiles == maxfiles && (files = realloc((char *)files,
 -        (u_int)(sizeof(ID) * (maxfiles += 128)))) == NULL)
 +    if (*nfiles == *maxfiles && (*files = realloc((char *)*files,
 +        (u_int)(sizeof(ID) * (*maxfiles += 128)))) == NULL)
          errx(1, "can't allocate memory");
 -    files[nfiles].inode = ino;
 -    files[nfiles].dev = dev;
 -    ++nfiles;
 +    (*files)[*nfiles].inode = ino;
 +    (*files)[*nfiles].dev = dev;
 +    ++(*nfiles);
      return (0);
  }
  
 
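
For readers who want to try the idea outside du, the bucketed approach in the diff above can be sketched as a standalone function. This is an illustration under stated assumptions, not the patch itself: `seen_table`, `seen_before` and `seen_free` are hypothetical names, `unsigned long` stands in for dev_t and ino_t, and allocation failure is reported to the caller instead of calling errx().

```c
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 256	/* same value as HASHSIZE in the diff */

struct id {
	unsigned long dev, ino;
};

/* One growable unordered list per bucket. */
struct bucket {
	struct id *ids;
	size_t n, cap;
};

/* Must start zero-filled (e.g. from calloc or "= { 0 }"). */
struct seen_table {
	struct bucket b[NBUCKETS];
};

/* Returns 1 if (dev, ino) was seen before, 0 if newly added, -1 on failure. */
int
seen_before(struct seen_table *t, unsigned long dev, unsigned long ino)
{
	/* Unsigned remainder, so no abs() is needed in this sketch. */
	struct bucket *b = &t->b[ino % NBUCKETS];
	size_t i;

	/* Only the entries in this one bucket are scanned. */
	for (i = 0; i < b->n; i++)
		if (b->ids[i].ino == ino && b->ids[i].dev == dev)
			return (1);

	if (b->n == b->cap) {
		size_t ncap = b->cap + 128;
		struct id *p = realloc(b->ids, ncap * sizeof(*p));

		if (p == NULL)
			return (-1);
		b->ids = p;
		b->cap = ncap;
	}
	b->ids[b->n].dev = dev;
	b->ids[b->n].ino = ino;
	b->n++;
	return (0);
}

/* Release every bucket's list. */
void
seen_free(struct seen_table *t)
{
	size_t i;

	for (i = 0; i < NBUCKETS; i++)
		free(t->b[i].ids);
}
```

Because the number of buckets is fixed, each list still grows with the number of inodes recorded. With evenly spread inode numbers the scans are about 256 times shorter, but the total work remains quadratic.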

From: Tim Kientzle <tim@kientzle.com>
To: freebsd-gnats-submit@FreeBSD.org, mvh@ix.netcom.com
Cc:  
Subject: Re: misc/42167: du uses linear search for duplicate inodes - very
 slow!
Date: Thu, 29 Apr 2004 22:27:02 -0700

 I just ran across this problem myself.
 
 As it happens, I recently had to address this
 exact problem in bsdtar, and therefore have
 some very efficient code (a self-tuning
 hash table) for handling this problem.
 
 I'll look into adapting it for du.
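
As a rough illustration of what a hash table that adjusts its own size might look like, here is a minimal sketch. It is not the bsdtar code, nor the code committed to du.c; the names `link_table`, `link_seen` and `link_free`, the starting size of 64, the doubling policy and the hash function are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdlib.h>

struct link_entry {
	struct link_entry *next;	/* chain within one bucket */
	unsigned long dev, ino;		/* stand-ins for dev_t and ino_t */
};

/* Start with all fields zero; the table allocates itself on first use. */
struct link_table {
	struct link_entry **buckets;
	size_t nbuckets;		/* power of two, or 0 before first use */
	size_t nentries;
};

static size_t
link_hash(unsigned long dev, unsigned long ino, size_t nbuckets)
{
	return ((size_t)(ino ^ (dev * 31UL)) & (nbuckets - 1));
}

/* Double the bucket array (or create it) and relink existing entries. */
static int
link_grow(struct link_table *t)
{
	size_t i, nn = t->nbuckets ? t->nbuckets * 2 : 64;
	struct link_entry **nb = calloc(nn, sizeof(*nb));

	if (nb == NULL)
		return (-1);
	for (i = 0; i < t->nbuckets; i++) {
		struct link_entry *e = t->buckets[i], *next;

		for (; e != NULL; e = next) {
			size_t h = link_hash(e->dev, e->ino, nn);

			next = e->next;
			e->next = nb[h];
			nb[h] = e;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->nbuckets = nn;
	return (0);
}

/* Returns 1 if seen before, 0 if newly recorded, -1 on allocation failure. */
int
link_seen(struct link_table *t, unsigned long dev, unsigned long ino)
{
	struct link_entry *e;
	size_t h;

	if (t->nbuckets != 0) {
		h = link_hash(dev, ino, t->nbuckets);
		for (e = t->buckets[h]; e != NULL; e = e->next)
			if (e->ino == ino && e->dev == dev)
				return (1);
	}
	/* Grow before inserting so the average chain length stays <= 1. */
	if (t->nentries >= t->nbuckets && link_grow(t) != 0)
		return (-1);
	if ((e = malloc(sizeof(*e))) == NULL)
		return (-1);
	h = link_hash(dev, ino, t->nbuckets);
	e->dev = dev;
	e->ino = ino;
	e->next = t->buckets[h];
	t->buckets[h] = e;
	t->nentries++;
	return (0);
}

/* Free all entries and the bucket array, leaving an empty table. */
void
link_free(struct link_table *t)
{
	size_t i;

	for (i = 0; i < t->nbuckets; i++) {
		struct link_entry *e = t->buckets[i], *next;

		for (; e != NULL; e = next) {
			next = e->next;
			free(e);
		}
	}
	free(t->buckets);
	t->buckets = NULL;
	t->nbuckets = t->nentries = 0;
}
```

Since the bucket count grows with the number of entries, chains stay short on average and each lookup is expected to take roughly constant time, unlike a fixed set of buckets.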
 
 
State-Changed-From-To: open->patched 
State-Changed-By: kientzle 
State-Changed-When: Fri Apr 30 11:31:12 PDT 2004 
State-Changed-Why:  
Fixed in revision 1.30 of du.c 


Responsible-Changed-From-To: freebsd-bugs->kientzle 
Responsible-Changed-By: kientzle 
Responsible-Changed-When: Fri Apr 30 11:31:12 PDT 2004 
Responsible-Changed-Why:  
Fixed in revision 1.30 of du.c 

http://www.freebsd.org/cgi/query-pr.cgi?pr=42167 

From: Mike Harding <mvh@ix.netcom.com>
To: Tim Kientzle <tim@kientzle.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: misc/42167: du uses linear search for duplicate inodes - very
	slow!
Date: Fri, 30 Apr 2004 14:08:52 -0700

 The patch I provided was certainly not optimal - I made minimal changes
 to get du to speed up somewhat, in a way that made it clear the change
 was correct.  Thanks for getting your optimization into the system; it
 will clearly save people a lot of time.  Too bad it missed 4.10, where
 I will be for a while...
 
 On Thu, 2004-04-29 at 22:27, Tim Kientzle wrote:
 > I just ran across this problem myself.
 > 
 > As it happens, I recently had to address this
 > exact problem in bsdtar, and therefore have
 > some very efficient code (a self-tuning
 > hash table) for handling this problem.
 > 
 > I'll look into adapting it for du.
 > 
 > 
 
State-Changed-From-To: patched->closed 
State-Changed-By: kientzle 
State-Changed-When: Mon May 17 11:46:58 PDT 2004 
State-Changed-Why:  
Fixed 

>Unformatted:
