From kees@BTC-Twente.nl  Fri Apr  7 03:43:07 2000
Return-Path: <kees@BTC-Twente.nl>
Received: from Proxy.BTC-Twente.nl (proxy.btc-twente.nl [195.193.46.131])
	by hub.freebsd.org (Postfix) with ESMTP id 9D47537B7D9
	for <FreeBSD-gnats-submit@freebsd.org>; Fri,  7 Apr 2000 03:43:01 -0700 (PDT)
	(envelope-from kees@BTC-Twente.nl)
Received: from kees by Proxy.BTC-Twente.nl with local (Exim 1.92 #1)
	for FreeBSD-gnats-submit@freebsd.org
	id 12dWE7-0000Bg-00; Fri, 7 Apr 2000 12:42:59 +0200
Message-Id: <E12dWE7-0000Bg-00@Proxy.BTC-Twente.nl>
Date: Fri, 7 Apr 2000 12:42:59 +0200
From: kees@echelon.nl (Kees Hendrikse)
Sender: Operator <kees@BTC-Twente.nl>
Reply-To: kees@echelon.nl
To: FreeBSD-gnats-submit@freebsd.org
Subject: Erratic user time reports for long running processes
X-Send-Pr-Version: 3.2

>Number:         17842
>Category:       kern
>Synopsis:       Erratic user time reports for long running processes
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Apr  7 03:50:01 PDT 2000
>Closed-Date:    Wed Feb 02 07:59:46 GMT 2005
>Last-Modified:  Wed Feb  2 10:00:40 GMT 2005
>Originator:     Kees Hendrikse
>Release:        FreeBSD 3.4-RELEASE i386
>Organization:
Echelon Consultancy bv, Enschede, Netherlands
>Environment:

	Tested on a stock 3.4-RELEASE, also tested on a stock 2.2.6-RELEASE.
	Probably a bug present in all releases of FreeBSD.

>Description:

	After a process has consumed more than (about) 379400 seconds of user
	time, ru.ru_utime.tv_sec rolls over to 0. After that, the counter is
	updated with twice the amount of time actually spent in user mode.
	The small program below shows this behaviour, after running for a
	couple of days:

	ce=379950 cu=378871
	ce=380079 cu=378999
	ce=380207 cu=379127
	ce=380336 cu=379255
	ce=380464 cu=379384
	ANOMALY: st=954617438 pe=380464 ce=380592 pu=379384 cu=237
	ANOMALY: st=954617438 pe=380592 ce=380721 pu=237 cu=493
	ANOMALY: st=954617438 pe=380721 ce=380849 pu=493 cu=749
	ANOMALY: st=954617438 pe=380849 ce=380978 pu=749 cu=1005
	ANOMALY: st=954617438 pe=380978 ce=381106 pu=1005 cu=1260
	ANOMALY: st=954617438 pe=381106 ce=381234 pu=1260 cu=1516
	ANOMALY: st=954617438 pe=381234 ce=381363 pu=1516 cu=1772
	ANOMALY: st=954617438 pe=381363 ce=381491 pu=1772 cu=2027

 	The same erratic numbers are reported via /proc/PID/status.

	This bug effects long running programs that consume lots of CPU time
	and use ru.ru_utime for internal statistics. For example, it causes
	the setiathome client to return results with negative amounts of
	CPU time spent, causing the results to be rejected.

>How-To-Repeat:

	Run this program for more than 105 hours (with nice -20):

/* pointless.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

main()
{
  struct rusage ru;
  struct timeval tv;
  struct timezone tz;
  clock_t start, pelap, celap, puser, cuser;
  long i;

  srandomdev();
  gettimeofday(&tv, &tz);
  getrusage(0, &ru);
  start = tv.tv_sec;
  celap = 0;
  cuser = ru.ru_utime.tv_sec;

  while (1) {
    for (i=0; i < 100000000; i++) random();
    gettimeofday(&tv, &tz);
    getrusage(0, &ru);
    pelap = celap;
    celap = tv.tv_sec - start;
    puser = cuser;
    cuser = ru.ru_utime.tv_sec;

    if (celap < pelap || cuser < puser || cuser - puser > celap - pelap + 1)
      fprintf(stderr, "ANOMALY: st=%ld pe=%ld ce=%ld pu=%ld cu=%ld\n",
              start, pelap, celap, puser, cuser);
    else fprintf(stderr, "ce=%ld cu=%ld\n", celap, cuser);
  }
}


>Fix:
	


>Release-Note:
>Audit-Trail:

From: Sheldon Hearn <sheldonh@uunet.co.za>
To: kees@echelon.nl
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: kern/17842: Erratic user time reports for long running processes 
Date: Wed, 12 Apr 2000 13:29:11 +0200

 On Fri, 07 Apr 2000 12:42:59 +0200, Kees Hendrikse wrote:
 
 > 	Tested on a stock 3.4-RELEASE, also tested on a stock 2.2.6-RELEASE.
 > 	Probably a bug present in all releases of FreeBSD.
 
 I've been running the program for more than 48 hours without an ANOMALY
 lines being printed.  I'm using 5.0-CURRENT.
 
 Does this happen when you _don't_ run it with a high nice scheduling
 parameter?  If so, then I suspect this may be fixed in 4.0-RELEASE as
 well.
 
 Ciao,
 Sheldon.
 

From: Kees Hendrikse <kees@echelon.nl>
To: sheldonh@uunet.co.za (Sheldon Hearn)
Cc: ech@echelon.nl (mailarchief Echelon)
Subject: Re: kern/17842: Erratic user time reports for long running processes
Date: Mon, 24 Apr 2000 02:12:02 +0200 (METDST)

 Hi Sheldon,
 
 > Excellent!  Let me know what you find.
 
 FreeBSD 4.0 doesn't trigger the "ANOMALY" output lines, but it exhibits
 an interesting variation of the same problem: it just stops increasing
 ru.ru_utime.tv_sec:
 
 (...)
 ce=459822 cu=378572
 ce=459938 cu=378688
 ce=460054 cu=378803
 ce=460170 cu=378919
 ce=460285 cu=379034
 ce=460401 cu=379150
 ce=460517 cu=379265 <--- that's all folks....
 ce=460633 cu=379265
 ce=460748 cu=379265
 ce=460864 cu=379265
 (...)
 ce=545783 cu=379265
 ce=545899 cu=379265
 ce=546015 cu=379265
 ce=546130 cu=379265
 ce=546246 cu=379265
 
 The program is still running...
 
 
 top shows:
 
 last pid:  9456;  load averages:  1.00,  1.00,  1.00    up 6+07:57:54  01:42:24
 15 processes:  2 running, 13 sleeping
 CPU states:  0.4% user, 98.1% nice,  0.4% system,  1.2% interrupt,  0.0% idle
 Mem: 4156K Active, 4072K Inact, 3500K Wired, 1576K Cache, 1023K Buf, 1052K Free
 Swap: 64M Total, 60K Used, 64M Free
 
   PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
   162 root     103  20   804K   236K RUN    129.5H 97.22% 97.22% pointless
 (...)
 
 
 /proc/162/status confirms the output of "pointless":
 
 # cd /proc/162
 # for i in 1 2 3 4 5
 > do
 > cat status
 > sleep 2
 > done
 pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16922,998963 nochan 0 0 0,0,0,2,3,4,5,20,31 -
 pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16923,228881 nochan 0 0 0,0,0,2,3,4,5,20,31 -
 pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16923,619831 nochan 0 0 0,0,0,2,3,4,5,20,31 -
 pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16924,10718 nochan 0 0 0,0,0,2,3,4,5,20,31 -
 pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16924,401572 nochan 0 0 0,0,0,2,3,4,5,20,31 -
                                                    ^^^^^^^^^^^^^
 
 
 What baffles me most: why stop counting at 379265,687860 seconds? 
 
 -- 
 Kees Hendrikse                               | email:     kees@echelon.nl
                                              | web:        www.echelon.nl
 ECHELON consultancy and software development | phone: +31 (0)53 48 36 585
 PO Box 545, 7500AM Enschede, The Netherlands | fax:   +31 (0)53 43 36 222
 
 
State-Changed-From-To: open->feedback 
State-Changed-By: mike 
State-Changed-When: Sat Jul 21 18:08:17 PDT 2001 
State-Changed-Why:  

Does this problem still occur in newer versions of FreeBSD, 
such as 4.3-RELEASE? 

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=17842 
State-Changed-From-To: feedback->closed 
State-Changed-By: sheldonh 
State-Changed-When: Fri Jan 18 08:14:39 PST 2002 
State-Changed-Why:  
Automatic feedback timeout.  If additional feedback that warrants 
the re-opening of this PR is available but not included in the 
audit trail, please include the feedback in a reply to this message 
(preserving the Subject line) and ask that the PR be re-opened. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=17842 
State-Changed-From-To: closed->open 
State-Changed-By: bde 
State-Changed-When: Wed Feb 2 07:38:29 GMT 2005 
State-Changed-Why:  
Closed in error. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=17842 
State-Changed-From-To: open->closed 
State-Changed-By: bde 
State-Changed-When: Wed Feb 2 07:48:07 GMT 2005 
State-Changed-Why:  
Superseded by kern/76972, which has a detailed analysis and a partial 
fix (a multiplication overflows after about 105 hours of runtime if 
stathz has its normal value of 128...). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=17842 

From: Bruce Evans <bde@zeta.org.au>
To: Kees Hendrikse <kees@echelon.nl>
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/17842: Erratic user time reports for long running processes
Date: Wed, 2 Feb 2005 20:56:50 +1100 (EST)

 On Wed, 26 Apr 2000, Bruce Evans meant to Cc this to freebsd-gnats-submit
 but only Cced it to freebsd-bugs:
 
 [The submitter of the PR wrote:]
 > >  What baffles me most: why stop counting at 379265,687860 seconds?
 >
 > The multiplication in "uu = (tu * ut) / tt;" in kern_resource.c
 > overflows near there.  Here tu is the total time in usec, ut is the
 > user tick count and tt is the total tick count.  If all ticks are user
 > ticks, then overflow occurs for tu = sqrt(2^64 / 10^6 / hz) = 429496
 > seconds when hz = 100.  Normally there are some interrupt and system
 > ticks, so overflow occurs for a somewhat larger tu and a somewhat
 > smaller ut -- apparently about 379 / 429 times smaller in your case.
 >
 > 4.0 enforces monotonicity of the resource times, so truncated values
 > resulting from the overflow don't cause the resource times to stick
                               ^^^^^ oops, delete "don't" here
 > at the first value where overflow occurs.
 >
 > Bruce
 
 The factor of 379 / 429 was because I got hz and stathz mixed up.
 sqrt(2^64 / 10^6 / 128) is 379265.06...
 
 See PR 76972 for more details, except about the stickyness of the value.
 The value might not stick at the first overflowed value -- overflow just
 gives garbage values, and monotonicity limits the jitter caused by the
 garbage.
 
 Bruce
>Unformatted:
