From enache@rdslink.ro  Fri Dec  6 12:18:40 2002
Return-Path: <enache@rdslink.ro>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 2DD4537B401
	for <FreeBSD-gnats-submit@freebsd.org>; Fri,  6 Dec 2002 12:18:40 -0800 (PST)
Received: from ratsnest.org (213.157.186.184 [213.157.186.184])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 01BEA43EBE
	for <FreeBSD-gnats-submit@freebsd.org>; Fri,  6 Dec 2002 12:18:32 -0800 (PST)
	(envelope-from enache@rdslink.ro)
Received: (from adi@localhost)
	by ratsnest.org (8.12.1/8.12.1) id gB6KIm8W001891;
	Fri, 6 Dec 2002 22:18:48 +0200
Message-Id: <200212062018.gB6KIm8W001891@ratsnest.org>
Date: Fri, 6 Dec 2002 22:18:48 +0200
From: Enache Adrian <enache@rdslink.ro>
Reply-To: Enache Adrian <enache@rdslink.ro>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: inaccurate timeouts in select(),nanosleep() + fix
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         46036
>Category:       kern
>Synopsis:       [RFE] select is unsuitable for implementing a periodic timeout + fix
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          suspended
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Dec 06 12:20:01 PST 2002
>Closed-Date:    
>Last-Modified:  Fri Nov 16 20:33:04 UTC 2007
>Originator:     Enache Adrian
>Release:        FreeBSD 5.0-CURRENT i386
>Organization:
>Environment:
System: FreeBSD ratsnest.hole 5.0-CURRENT FreeBSD 5.0-CURRENT #1: Fri Dec 6 01:46:22 EET 2002 root@ratsnest.hole:/opt/hack/sys/i386/compile/CUBATAO i386


	
>Description:
	select(),nanosleep(),poll() sleep one tick longer, _except_
	when previous similar system call was interrupted by a signal
	or by i/o conditions on the polled file descriptors.

	I find this rather buggy and inconsequent.
	It badly breaks any code which tries to hook both timers
	and input polling on a select() loop and defeats benchmark
	programs which test for the accuracy of timeouts.

	This is due to tvtohz (sys/kern/kern_clock.c) which always
	adds one to its result to allow for the current tick to
	expire. Instead of adding one it should be better testing
	for a non-zero result.

	I modified like this my 4.4 kernel long time ago and have
	seen so far no trouble at all.Same thing more recently with 5.0.

>How-To-Repeat:
	This little program prints the timeout accuracy of the 
	select() system call as "[+-] sec usec".It catches SIGQUIT
	and polls for console input.
	It use to give:

   unpatched (regular) kernel:
	+   0 s       9938 us

		--input
	+   0 s       9959 us
	+   0 s       9972 us
	+   0 s       9930 us
	^\      --interrupt
	+   0 s       9954 us
	+   0 s       9954 us

   patched kernel:
	-   0 s         44 us
	^\      --interrupt
	-   0 s          3 us
	-   0 s        100 us

		--input
	-   0 s         44 us
	-   0 s         47 us

-------------------------------8x-----------------------------------
#include <sys/types.h>
#include <sys/time.h>
#include <sys/sysctl.h>
#include <unistd.h>
#include <errno.h>
#include <signal.h>
#include <time.h>
#include <err.h>
#include <stdio.h>

fd_set rdfs;
int ms;

void sig(int unused) {}

int main(int argc,char **argv)
{
	int k,mib[2];
	struct timeval tv,savtv,otv;
	struct clockinfo clock;
	char buf[256];
	signal(SIGQUIT,sig);
	if (argc > 1) ms = atoi(argv[1]);
	if (!ms) ms = 500;
	savtv.tv_usec = ms % 1000 * 1000; savtv.tv_sec = ms / 1000;
	mib[0] = CTL_KERN; mib[1] = KERN_CLOCKRATE;
	k = sizeof clock;
	if (sysctl(mib,2,&clock,&k,NULL,0)) err(1,"sysctl");
	savtv.tv_usec -= savtv.tv_usec % clock.tick;
redo:
	tv = savtv; gettimeofday(&otv,NULL);
	/* a quick fix is to substract one tick from tv here
	   but this will break things when compiled on other
	   OS ( linux for instance ) */
	for(;;) {
		FD_SET(0,&rdfs);
		k = select(1,&rdfs,NULL,NULL,&tv);
		gettimeofday(&tv,NULL);
		timersub(&tv,&otv,&tv);
		switch(k) {
		case 0:
			if (timercmp(&tv,&savtv,<))
				{ timersub(&savtv,&tv,&tv);k = '-'; }
			else
				{ timersub(&tv,&savtv,&tv);k = '+'; }
			printf("%c%4d s %10d us\n",k,tv.tv_sec,tv.tv_usec);
			goto redo;
		case -1:
			if (errno != EINTR) err(1,"select");
			printf("\t--interrupt\n");
			break;
		default:
			read(0,buf,256);
			printf("\t--input\n");
			break;
		}
		timersub(&savtv,&tv,&tv);
		if (tv.tv_sec < 0) timerclear(&tv);
	}
}
-------------------------------8x-----------------------------------
>Fix:
	patch to kern_clock.c and other places which assume that 
	tvtohz adds one follows.

	notes:
	tvtohz has been renamed to hzto in NetBSD but is the same.

	of course one can write
	ticks = (sec * hz) + (((unsigned long)usec + (tick - 1)) / tick) ?: 1;
	but this is a GCC extension.

-------------------------------8x-----------------------------------

diff -Naur osrc/sys/kern/kern_clock.c src/sys/kern/kern_clock.c
--- osrc/sys/kern/kern_clock.c	Tue Nov 19 23:58:52 2002
+++ src/sys/kern/kern_clock.c	Mon Dec  2 23:47:36 2002
@@ -257,8 +257,7 @@
 	 * If the number of usecs in the whole seconds part of the time
 	 * difference fits in a long, then the total number of usecs will
 	 * fit in an unsigned long.  Compute the total and convert it to
-	 * ticks, rounding up and adding 1 to allow for the current tick
-	 * to expire.  Rounding also depends on unsigned long arithmetic
+	 * ticks.  Rounding also depends on unsigned long arithmetic
 	 * to avoid overflow.
 	 *
 	 * Otherwise, if the number of ticks in the whole seconds part of
@@ -291,14 +290,15 @@
 		ticks = 1;
 	} else if (sec <= LONG_MAX / 1000000)
 		ticks = (sec * 1000000 + (unsigned long)usec + (tick - 1))
-			/ tick + 1;
+			/ tick;
 	else if (sec <= LONG_MAX / hz)
-		ticks = sec * hz
-			+ ((unsigned long)usec + (tick - 1)) / tick + 1;
+		ticks = sec * hz + ((unsigned long)usec + (tick - 1)) / tick;
 	else
 		ticks = LONG_MAX;
 	if (ticks > INT_MAX)
 		ticks = INT_MAX;
+	else if (ticks == 0)
+		ticks = 1;
 	return ((int)ticks);
 }
 
diff -Naur osrc/sys/kern/kern_time.c src/sys/kern/kern_time.c
--- osrc/sys/kern/kern_time.c	Tue Dec  3 00:14:05 2002
+++ src/sys/kern/kern_time.c	Mon Dec  2 23:33:04 2002
@@ -527,10 +527,6 @@
  * Else compute next time timer should go off which is > current time.
  * This is where delay in processing this timeout causes multiple
  * SIGALRM calls to be compressed into one.
- * tvtohz() always adds 1 to allow for the time until the next clock
- * interrupt being strictly less than 1 clock tick, but we don't want
- * that here since we want to appear to be in sync with the clock
- * interrupt even when we're delayed.
  */
 void
 realitexpire(void *arg)
@@ -555,7 +551,7 @@
 		if (timevalcmp(&p->p_realtimer.it_value, &ctv, >)) {
 			ntv = p->p_realtimer.it_value;
 			timevalsub(&ntv, &ctv);
-			callout_reset(&p->p_itcallout, tvtohz(&ntv) - 1,
+			callout_reset(&p->p_itcallout, tvtohz(&ntv),
 			    realitexpire, p);
 			splx(s);
 			PROC_UNLOCK(p);
diff -Naur osrc/sys/kern/kern_timeout.c src/sys/kern/kern_timeout.c
--- osrc/sys/kern/kern_timeout.c	Thu Sep  5 14:42:03 2002
+++ src/sys/kern/kern_timeout.c	Mon Dec  2 23:56:45 2002
@@ -399,15 +399,17 @@
 		return;
 	else if (time_change->tv_sec <= LONG_MAX / 1000000)
 		delta_ticks = (time_change->tv_sec * 1000000 +
-			       time_change->tv_usec + (tick - 1)) / tick + 1;
+			       time_change->tv_usec + (tick - 1)) / tick;
 	else if (time_change->tv_sec <= LONG_MAX / hz)
 		delta_ticks = time_change->tv_sec * hz +
-			      (time_change->tv_usec + (tick - 1)) / tick + 1;
+			      (time_change->tv_usec + (tick - 1)) / tick;
 	else
 		delta_ticks = LONG_MAX;
 
 	if (delta_ticks > INT_MAX)
 		delta_ticks = INT_MAX;
+	else if (delta_ticks == 0)
+		delta_ticks = 1;
 
 	/* 
 	 * Now rip through the timer calltodo list looking for timers
diff -Naur osrc/sys/net/bpf.c src/sys/net/bpf.c
--- osrc/sys/net/bpf.c	Tue Nov 19 23:58:58 2002
+++ src/sys/net/bpf.c	Mon Dec  2 23:58:24 2002
@@ -774,12 +774,8 @@
 		{
 			struct timeval *tv = (struct timeval *)addr;
 
-			/*
-			 * Subtract 1 tick from tvtohz() since this isn't
-			 * a one-shot timer.
-			 */
 			if ((error = itimerfix(tv)) == 0)
-				d->bd_rtout = tvtohz(tv) - 1;
+				d->bd_rtout = tvtohz(tv);
 			break;
 		}
 
-------------------------------8x-----------------------------------
>Release-Note:
>Audit-Trail:

From: Bruce Evans <bde@zeta.org.au>
To: Enache Adrian <enache@rdslink.ro>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: kern/46036: inaccurate timeouts in select(),nanosleep() + fix
Date: Sat, 7 Dec 2002 23:57:40 +1100 (EST)

 On Fri, 6 Dec 2002, Enache Adrian wrote:
 
 > >Description:
 > 	select(),nanosleep(),poll() sleep one tick longer, _except_
 > 	when previous similar system call was interrupted by a signal
 > 	or by i/o conditions on the polled file descriptors.
 >
 > 	I find this rather buggy and inconsequent.
 > 	It badly breaks any code which tries to hook both timers
 > 	and input polling on a select() loop and defeats benchmark
 > 	programs which test for the accuracy of timeouts.
 >
 > 	This is due to tvtohz (sys/kern/kern_clock.c) which always
 > 	adds one to its result to allow for the current tick to
 > 	expire. Instead of adding one it should be better testing
 > 	for a non-zero result.
 
 tvtohz() actually adds 1 to ensure that the sleep time is at least the
 specified time.  This gives a minimum sleep time of 0.0 ticks longer
 than the specified sleep time (as required by standards) and a maximum
 sleep time of about 1.0 tick longer than the specified sleep time,
 with an average sleep time of about 0.5 ticks longer.  Not adding 1
 would give a {min,max,avg} sleep time of about {-1.0,0.0,-0.5} ticks
 longer than specified; i.e., it would be shorter than specified (broken)
 in most cases.
 
 Individual syscalls could handle this by sleeping for the shorter time
 and checking whether the timeout has actually expired when they wake up,
 but most don't, and this would be a pessimization in most cases.  Of
 the above syscalls, only nanosleep() checks that the timeout has actually
 expired.  It does this mainly for the technical reason that very large
 timeouts (ones greater than INT_MAX ticks) are required to work and may
 even be useful (especially if HZ is large).  Checking is a pessimization
 in at least the following cases, because we will just have to wait again
 after wasting time waking up to check:
 - most short timeouts that are a multiple of `tick'.
 - half or the cases for random timeouts of 0.5 ticks (since the current
   tick will expire before the timeout half the time), and proportionally
   for all random timeouts of 1 tick or shorter.
 
 > >How-To-Repeat:
 > 	This little program prints the timeout accuracy of the
 > 	select() system call as "[+-] sec usec".It catches SIGQUIT
 > 	and polls for console input.
 > 	It use to give:
 >
 >    unpatched (regular) kernel:
 > 	+   0 s       9938 us
 >
 > 		--input
 > 	+   0 s       9959 us
 > 	+   0 s       9972 us
 > 	+   0 s       9930 us
 > 	^\      --interrupt
 > 	+   0 s       9954 us
 > 	+   0 s       9954 us
 
 This is the correct behaviour.  We are mostly in sync with the clock
 interrupt, and the default timeout is a multiple of clock.tick, so we
 must sleep for an extra tick (less epsilon) for the timeout to expire.
 select() would have to go back to sleep for an extra tick if tvtohz()
 didn't add 1.
 
 >    patched kernel:
 > 	-   0 s         44 us
 > 	^\      --interrupt
 > 	-   0 s          3 us
 > 	-   0 s        100 us
 >
 > 		--input
 > 	-   0 s         44 us
 > 	-   0 s         47 us
 
 This is incorrect.  The times are the values
 ((actual timeout) - (specified timeout)).  These should be >= 0.
 The errors are small here because the timeout is a multiple of clock.tick
 and we are mostly in sync with clock interrupts.  In general the error
 from not adding 1 is up to 1 tick (see above).
 
 > -------------------------------8x-----------------------------------
 > >Fix:
 > 	patch to kern_clock.c and other places which assume that
 > 	tvtohz adds one follows.
 
 This mainly undoes some FreeBSD fixes.
 
 > 	notes:
 > 	tvtohz has been renamed to hzto in NetBSD but is the same.
 
 Actually, FreeBSD renamed hzto to tvtohz.
 
 > redo:
 > 	tv = savtv; gettimeofday(&otv,NULL);
 > 	/* a quick fix is to substract one tick from tv here
 > 	   but this will break things when compiled on other
 > 	   OS ( linux for instance ) */
 > 	for(;;) {
 > 		FD_SET(0,&rdfs);
 > 		k = select(1,&rdfs,NULL,NULL,&tv);
 > 		gettimeofday(&tv,NULL);
 
 I thought that Linux gets timeouts right -- at least sleep() used to
 work except in very early versions of Linux-0.x.  Perhaps it uses
 clock interrupts to get to get close to the timeout and then
 udelay() to get closer if the residual timeout is small.  In general,
 the residual could be up to `tick' usec, but in cases that are in sync
 with the clock like your test program, the residual would be small
 (about 44 usec) and udelay() could reasonably handle it.
 
 Bruce
 

From: Bruce Evans <bde@zeta.org.au>
To: Enache Adrian <enache@rdslink.ro>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: kern/46036: inaccurate timeouts in select(),nanosleep() + fix
Date: Sun, 8 Dec 2002 19:06:45 +1100 (EST)

 On Sun, 8 Dec 2002, Enache Adrian wrote:
 
 > On Sat, Dec 07, 2002 at 11:57:40PM +1100, Bruce Evans wrote:
 > > tvtohz() actually adds 1 to ensure that the sleep time is at least the
 > > specified time.  This gives a minimum sleep time of 0.0 ticks longer
 > > than the specified sleep time (as required by standards) and a maximum
 > > sleep time of about 1.0 tick longer than the specified sleep time,
 > > with an average sleep time of about 0.5 ticks longer.  Not adding 1
 > > would give a {min,max,avg} sleep time of about {-1.0,0.0,-0.5} ticks
 > > longer than specified; i.e., it would be shorter than specified (broken)
 > > in most cases.
 > ...
 > > and we are mostly in sync with clock interrupts.  In general the error
 > > from not adding 1 is up to 1 tick (see above).
 > The average sleep time seems to be with ~ 0.995 longer on an idle
 
 I think you mean 0.009950.
 
 > machine - much longer on a loaded one where ticks are lost.
 
 Hopefully ticks won't be really lost (in the kernel), although they can
 be and are lost due to bugs.  Ticks man be lost as far as applications
 can see of course.
 
 > For the error to be down to -1 tick with an average of -0.5 ticks
 > (when not adding 1) the tick value must be very small - this doesn't
 > seem to happen with the standard 100hz. Under high load the error
 > tends to be positive.
 
 It does (would) happen for random timeouts.  If you look at the clock
 in a loop then you normally see timeouts of ((1 tick) - epsilon)
 (when not adding 1) or ((2 ticks) - epsilon) (when adding 1) because
 the first one syncs with the clock.  E.g., nanotime({1 nsec}) normally
 returns at a time epsilon after a clock tick occurs, so a subsequent
 nanotime({1 nsec}) will wait a full clock tick or two (less epsilon)
 (if nanotome() is implemented using clock ticks even for short intervals).
 
 > When the user has asked for something not multiple of tick the
 > value will be correctly rounded up by tvtohz() and the error will
 > be positive too.
 
 No.  E.g., if the interval is 9999 usec then it will be rounded up
 to 1 tick (10000 usec).  The next clock tick may occur after 1 usec,
 so if 1 tick is not added then the timeout will return after 1 usec
 (plus epsilon) instead of after the requested 9999 usec.
 
 > > I thought that Linux gets timeouts right -- at least sleep() used to
 > > work except in very early versions of Linux-0.x.  Perhaps it uses
 > ...
 > Only nanosleep() works like in FreeBSD ( one tick longer by just
 > adding one - this is why many recommend to use select() instead
 > of usleep() ...)
 
 This is a bug in Linux :-).
 
 > Does it exist a standard specification for this case ?
 > A quick browsing through susv3 shows that only nanosleep()
 > is required to do so:
 > functions/select.html:
 > 	"If the timeout parameter is not a  null pointer,
 > 	 it specifies a maximum interval to wait for the
 >           selection to complete."
 > functions/nanosleep.html
 > 	"But, except for the case of being interrupted by
 > 	a signal, the suspension time shall not be less than
 > 	the time specified by rqtp, as measured by the
 > 	system clock CLOCK_REALTIME"
 
 select() and its timeouts are now specified in POSIX, not quite
 as clearly as this.  From draft7 of POSIX.1-200x:
 
 %%%
 31187              limitations on the granularity of timeout intervals. If the requested timeout interval requires a
 31188              finer granularity than the implementation supports, the actual timeout interval shall be rounded
 31189              up to the next supported value.
 %%%
 
 Bruce
 

From: Enache Adrian <enache@rdslink.ro>
To: FreeBSD-gnats-submit@FreeBSD.ORG
Cc:  
Subject: Re: kern/46036: inaccurate timeouts in select(),nanosleep() + fix
Date: Sun, 8 Dec 2002 22:33:41 +0200

 On Sat, Dec 07, 2002 at 11:57:40PM +1100, Bruce Evans wrote:
 > tvtohz() actually adds 1 to ensure that the sleep time is at least the
 > specified time.  This gives a minimum sleep time of 0.0 ticks longer
 > than the specified sleep time (as required by standards) and a maximum
 > sleep time of about 1.0 tick longer than the specified sleep time,
 > with an average sleep time of about 0.5 ticks longer.  Not adding 1
 > would give a {min,max,avg} sleep time of about {-1.0,0.0,-0.5} ticks
 > longer than specified; i.e., it would be shorter than specified (broken)
 > in most cases.
 ...
 > and we are mostly in sync with clock interrupts.  In general the error
 > from not adding 1 is up to 1 tick (see above).
 The average sleep time seems to be with ~ 0.995 longer on an idle
 machine - much longer on a loaded one where ticks are lost.
 
 For the error to be down to -1 tick with an average of -0.5 ticks
 (when not adding 1) the tick value must be very small - this doesn't
 seem to happen with the standard 100hz. Under high load the error
 tends to be positive.
 
 When the user has asked for something not multiple of tick the
 value will be correctly rounded up by tvtohz() and the error will
 be positive too.
 
 > I thought that Linux gets timeouts right -- at least sleep() used to
 > work except in very early versions of Linux-0.x.  Perhaps it uses
 > clock interrupts to get to get close to the timeout and then
 > udelay() to get closer if the residual timeout is small.  In general,
 > the residual could be up to `tick' usec, but in cases that are in sync
 > with the clock like your test program, the residual would be small
 > (about 44 usec) and udelay() could reasonably handle it.
 Oh no.
 Linux doesn't seem to call udelay() to handle the residual timeout.
 
 Only nanosleep() works like in FreeBSD ( one tick longer by just
 adding one - this is why many recommend to use select() instead
 of usleep() ...)
 
 from kernel/timer.c
 	expire = timespec_to_jiffies(&t) + (t.tv_sec || t.tv_nsec);
 
 The silly test program, when run - with some minor modifications
 -----------------------
 +#include <sys/param.h>
 -       struct clockinfo clock;
 -       mib[0] = CTL_KERN; mib[1] = KERN_CLOCKRATE;
 -       k = sizeof clock;
 -       if (sysctl(mib,2,&clock,&k,NULL,0)) err(1,"sysctl");
 -       savtv.tv_usec -= savtv.tv_usec % clock.tick;
 +       savtv.tv_usec -= savtv.tv_usec % (1000000 / HZ);
 _______________________
 on linux 2.4.18 says:
 -   0 s        171 us
 -   0 s        181 us
 -   0 s        184 us
 
         --input
 -   0 s        169 us
 -   0 s        180 us
 
 According to you it is broken to sleep less than specified.
 
 Does it exist a standard specification for this case ?
 A quick browsing through susv3 shows that only nanosleep()
 is required to do so:
 functions/select.html:
 	"If the timeout parameter is not a  null pointer,
 	 it specifies a maximum interval to wait for the
           selection to complete."
 functions/nanosleep.html
 	"But, except for the case of being interrupted by
 	a signal, the suspension time shall not be less than
 	the time specified by rqtp, as measured by the
 	system clock CLOCK_REALTIME"
 
 but I'll keep reading on ...
 
 thank you & best regards
 Adi
 
State-Changed-From-To: open->feedback 
State-Changed-By: kmacy 
State-Changed-When: Fri Nov 16 07:03:16 UTC 2007 
State-Changed-Why:  

Please comment on whether or not this is in fact a bug and then throw back in to the pool. 
Thanks. 


Responsible-Changed-From-To: freebsd-bugs->bde 
Responsible-Changed-By: kmacy 
Responsible-Changed-When: Fri Nov 16 07:03:16 UTC 2007 
Responsible-Changed-Why:  

Please comment on whether or not this is in fact a bug and then throw back in to the pool.  
Thanks.  


http://www.freebsd.org/cgi/query-pr.cgi?pr=46036 
State-Changed-From-To: feedback->suspended 
State-Changed-By: kmacy 
State-Changed-When: Fri Nov 16 20:31:16 UTC 2007 
State-Changed-Why:  

This constitutes a RFE and no one is available to work on it right now. 


Responsible-Changed-From-To: bde->freebsd-bugs 
Responsible-Changed-By: kmacy 
Responsible-Changed-When: Fri Nov 16 20:31:16 UTC 2007 
Responsible-Changed-Why:  

This constitutes a RFE and no one is available to work on it right now.  

http://www.freebsd.org/cgi/query-pr.cgi?pr=46036 
>Unformatted:
