From anarcat@anarcat.dyndns.org  Thu Oct  4 01:23:43 2001
Return-Path: <anarcat@anarcat.dyndns.org>
Received: from tomts11-srv.bellnexxia.net (tomts11.bellnexxia.net [209.226.175.55])
	by hub.freebsd.org (Postfix) with ESMTP id EB8E437B405
	for <FreeBSD-gnats-submit@freebsd.org>; Thu,  4 Oct 2001 01:23:42 -0700 (PDT)
Received: from khan.anarcat.dyndns.org ([65.92.161.107])
          by tomts11-srv.bellnexxia.net
          (InterMail vM.4.01.03.16 201-229-121-116-20010115) with ESMTP
          id <20011004082341.PBFG19737.tomts11-srv.bellnexxia.net@khan.anarcat.dyndns.org>
          for <FreeBSD-gnats-submit@freebsd.org>;
          Thu, 4 Oct 2001 04:23:41 -0400
Received: from shall.anarcat.dyndns.org (shall.anarcat.dyndns.org [192.168.0.1])
	by khan.anarcat.dyndns.org (Postfix) with ESMTP id 6768A1AF8
	for <FreeBSD-gnats-submit@freebsd.org>; Thu,  4 Oct 2001 04:23:37 -0400 (EDT)
Received: by shall.anarcat.dyndns.org (Postfix, from userid 1000)
	id A6EBC20BE1; Thu,  4 Oct 2001 04:23:42 -0400 (EDT)
Message-Id: <20011004082342.A6EBC20BE1@shall.anarcat.dyndns.org>
Date: Thu,  4 Oct 2001 04:23:42 -0400 (EDT)
From: The Anarcat <anarcat@anarcat.dyndns.org>
Reply-To: The Anarcat <anarcat@anarcat.dyndns.org>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: syslogd remote logging back down
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         31029
>Category:       bin
>Synopsis:       syslogd remote logging back down
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    cjc
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Thu Oct 04 01:30:01 PDT 2001
>Closed-Date:    Thu Sep 05 10:55:31 PDT 2002
>Last-Modified:  Thu Sep 05 10:55:31 PDT 2002
>Originator:     The Anarcat
>Release:        FreeBSD 4.4-STABLE i386
>Organization:
Nada, inc.
>Environment:
System: FreeBSD shall.anarcat.dyndns.org 4.4-STABLE FreeBSD 4.4-STABLE #7: Sat Sep 15 00:41:38 EDT 2001 anarcat@shall.anarcat.dyndns.org:/usr/obj/usr/src/sys/SHALL i386

>Description:

From -questions:

On Tue, Oct 02, 2001 at 11:57:08AM -0400, The Anarcat wrote:
> Hi.
>  
> I think I noticed what seems to me undesirable (and undocumented?)
> behavior in syslogd. When a remote logging host (@host) is
> unreachable:
> 
> syslogd: sendto: Host is down
> 
> syslogd *never* tries to reach it again, unless it receives a HUP.
> Shouldn't it try to reach it again, from time to time?
> 
> The @host was indeed down, but when it was brought back up, remote
> logging wasn't resumed.

>How-To-Repeat:

*.*			@host

where host is down or unreachable.

>Fix:

This is a draft of what I would call "approximate exponential backoff
algorithm". :)

There's a lot of debugging code that can be removed, but they help
seeing what's going on.

There's probably a better way to do this too. :)

--- syslogd.c.orig	Wed Oct  3 15:56:32 2001
+++ syslogd.c	Thu Oct  4 00:06:49 2001
@@ -142,6 +142,9 @@
 #define MARK		0x008	/* this message is a mark */
 #define ISKERNEL	0x010	/* kernel generated message */
 
+#define DELAY_MUL	2       /* delay multiplier */
+#define DELAY_INIT	30	/* initial delay in seconds */
+
 /*
  * This structure represents the files that will have log
  * copies printed.
@@ -159,6 +162,9 @@
 #define PRI_EQ	0x2
 #define PRI_GT	0x4
 	char	*f_program;		/* program this applies to */
+	/* should this be part of the union? */
+	time_t  f_unreach;	      /* time since last unreach */
+	time_t  f_delay;		/* backoff time */
 	union {
 		char	f_uname[MAXUNAMES][UT_NAMESIZE+1];
 		struct {
@@ -999,6 +1005,15 @@
 			l = MAXLINE;
 
 		if (finet) {
+			dprintf("FORW: now: %d f_unreach: %d f_delay: %d\n", (int) now, (int) f->f_unreach, (int) f->f_delay);
+			/* XXX: must make sure this is initialized to 0 */
+			if (f->f_unreach) { /* there was a failure last time */
+				dprintf("another try at host\n");
+				if ( (now - f->f_unreach) < f->f_delay) {
+					dprintf("skipping: now: %d, f_unreach: %d f_delay: %d\n", (int) now, (int) f->f_unreach, (int) f->f_delay);
+					break; /* do not send */
+				}
+			}
 			for (r = f->f_un.f_forw.f_addr; r; r = r->ai_next) {
 				for (i = 0; i < *finet; i++) {
 #if 0 
@@ -1017,12 +1032,38 @@
 				if (lsent == l && !send_to_all) 
 					break;
 			}
+			dprintf("lsent: %d\n", lsent);
 			if (lsent != l) {
 				int e = errno;
-				(void)close(f->f_file);
-				errno = e;
-				f->f_type = F_UNUSED;
+				dprintf("sendto: f_unreach: %d f_delay: %d\n", (int) f->f_unreach, (int) f->f_delay);
 				logerror("sendto");
+				errno = e;
+				switch (errno) {
+				case EHOSTUNREACH:
+				case EHOSTDOWN:
+					if (f->f_unreach)
+						f->f_delay *= DELAY_MUL;
+					else {
+						f->f_unreach = now;
+						f->f_delay = DELAY_INIT;
+					}
+					dprintf("setting: f_unreach: %d f_delay: %d\n", (int) f->f_unreach, (int) f->f_delay);
+					break;
+				/* case EBADF: */
+				/* case EACCES: */
+				/* case ENOTSOCK: */
+				/* case EFAULT: */
+				/* case EMSGSIZE: */
+				/* case EAGAIN: */
+				/* case ENOBUFS: */
+				/* case ECONNREFUSED: */
+				default:
+					dprintf("removing entry\n", e);
+					(void)close(f->f_file);
+					errno = e;
+					f->f_type = F_UNUSED;
+					break;
+				}
 			}
 		}
 		break;
@@ -2301,3 +2342,7 @@
 
 	return(socks);
 }
+
+/* Local Variables: *** */
+/* c-basic-offset:8 *** */
+/* End: *** */
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->cjc 
Responsible-Changed-By: cjc 
Responsible-Changed-When: Thu Nov 29 02:04:07 PST 2001 
Responsible-Changed-Why:  
I have been testing these patches. 

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=31029 

From: The Anarcat <anarcat@anarcat.dyndns.org>
To: gnats-admin@FreeBSD.org, freebsd-bugs@FreeBSD.org
Cc: "Crist J. Clark" <cristjc@earthlink.net>,
	freebsd-hackers@FreeBSD.org
Subject: Re: bin/31029: syslogd remote logging back down
Date: Fri, 5 Oct 2001 15:41:27 -0400

 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1
 
 [cc'ed to -hackers since it is what I think is the relevant list, please
 correct me if I'm wrong]
 
 [[note that I welcome style and function comments]]
 
 Note that this fix "works" in the sense that syslogd will eventually be
 able to reach to remote server again, but the delays of the checks are
 dependent on the frequency of the outgoing log lines. 
 
 For example, say the initial delay is 30 seconds. If the algorithm was
 "clean", there would be some kind of check made on the remote server at
 30, 30+60=90, 90+120=210, etc, seconds after the failure. But the check
 is only made when a log has to be sent out. So if no log is sent out for
 (say) 5 minutes, the delay (f_delay) won't be changed to take into
 account the delay taken to have a new log to send. 
 
 This might not be an undesirable behavior, BTW. :) That way, if a remote
 host doesn't usually receive much logs, it has more chances of being
 checked each time a new log is sent. On the other hand, if a host
 receives much logs, most will be discarded, and the host will be checked
 at regular, exponentially expanding intervals, until it becomes
 available again.
 
 Also, note that the patch inserts 2 more time_t fields in the filed
 struct.
 
 I don't know exactly how the struct filed* pointers are initialized in
 the fprintlog function. Actually, one must assume that f_unreach is
 initialized to 0. If the field is filled with arbitrary value, the
 algorithm leads to undefined behavior. The worst that can happen is that
 syslogd will not *try* to send the log to this host, even if it is
 reachable, as the f_unreach field is also a "flag".
 
 Also, all this won't work very well if something like this isn't applied
 after:
 
 (warning: cut'n'paste)
 
 - --- /usr/src/usr.sbin/syslogd/syslogd.c Thu Oct  4 00:06:49 2001
 +++ syslogd.c   Fri Oct  5 15:31:28 2001
 @@ -1064,6 +1064,8 @@
                                         f->f_type = F_UNUSED;
                                         break;
                                 }
 +                       } else { /* no error from sendto */
 +                               f->f_unreach = 0; /* clear unreach error
 flag */
                         }
                 }
                 break;
 
 I could swear I saw my router's patched syslogd recover from "host down"
 even without that extra patch (using only the orinigal pr's patch), but
 anyways..
 
 For convenience, I also put here a full patch that has the dprintf's
 removed:
 
 - --- /usr/src/usr.sbin/syslogd/syslogd.c.orig	Wed Oct  3 15:56:32 2001
 +++ syslogd.c	Fri Oct  5 15:37:57 2001
 @@ -142,6 +142,9 @@
  #define MARK		0x008	/* this message is a mark */
  #define ISKERNEL	0x010	/* kernel generated message */
  
 +#define DELAY_MUL	2       /* delay multiplier */
 +#define DELAY_INIT	30	/* initial delay in seconds */
 +
  /*
   * This structure represents the files that will have log
   * copies printed.
 @@ -159,6 +162,9 @@
  #define PRI_EQ	0x2
  #define PRI_GT	0x4
  	char	*f_program;		/* program this applies to */
 +	/* should this be part of the union? */
 +	time_t  f_unreach;	      /* time since last unreach */
 +	time_t  f_delay;		/* backoff time */
  	union {
  		char	f_uname[MAXUNAMES][UT_NAMESIZE+1];
  		struct {
 @@ -999,6 +1005,11 @@
  			l = MAXLINE;
  
  		if (finet) {
 +			/* XXX: must make sure this is initialized to 0 */
 +			if ((f->f_unreach) && 
 +			   ((now - f->f_unreach) < f->f_delay)) {
 +				break; /* do not send */
 +			}
  			for (r = f->f_un.f_forw.f_addr; r; r = r->ai_next) {
  				for (i = 0; i < *finet; i++) {
  #if 0 
 @@ -1019,10 +1030,34 @@
  			}
  			if (lsent != l) {
  				int e = errno;
 - -				(void)close(f->f_file);
 - -				errno = e;
 - -				f->f_type = F_UNUSED;
  				logerror("sendto");
 +				errno = e;
 +				switch (errno) {
 +				case EHOSTUNREACH:
 +				case EHOSTDOWN:
 +					if (f->f_unreach)
 +						f->f_delay *= DELAY_MUL;
 +					else {
 +						f->f_unreach = now;
 +						f->f_delay = DELAY_INIT;
 +					}
 +					break;
 +				/* case EBADF: */
 +				/* case EACCES: */
 +				/* case ENOTSOCK: */
 +				/* case EFAULT: */
 +				/* case EMSGSIZE: */
 +				/* case EAGAIN: */
 +				/* case ENOBUFS: */
 +				/* case ECONNREFUSED: */
 +				default:
 +					(void)close(f->f_file);
 +					errno = e;
 +					f->f_type = F_UNUSED;
 +					break;
 +				}
 +			} else { /* no error from sendto */
 +				f->f_unreach = 0; /* clear unreach error flag */
  			}
  		}
  		break;
 
 A.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.0.6 (FreeBSD)
 Comment: Pour information voir http://www.gnupg.org
 
 iEYEARECAAYFAju+DOUACgkQttcWHAnWiGcsVgCeI7L2H3xD5GRN65mDdW4ZLvwe
 sfkAnAjESC9zhuC7wlobnXpm14MZ00Ik
 =h9I5
 -----END PGP SIGNATURE-----
 
 To Unsubscribe: send mail to majordomo@FreeBSD.org
 with "unsubscribe freebsd-bugs" in the body of the message
State-Changed-From-To: open->patched 
State-Changed-By: cjc 
State-Changed-When: Sat Aug 24 23:05:38 PDT 2002 
State-Changed-Why:  
I hesitated doing this because the back-off method seemed too 
complicated. So I simplifed. Don't back-off at all. Just keep trying 
and ignore (but log) EHOSTUNREACH and EHOSTDOWN. The failures really 
don't consume more resources than a success would, so why not just 
keep trying? The fact that we can keep getting "sendto" errors in the 
logs helps to point out the failure and are throttled by the usual 
methods. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=31029 
State-Changed-From-To: patched->closed 
State-Changed-By: cjc 
State-Changed-When: Thu Sep 5 10:54:45 PDT 2002 
State-Changed-Why:  
MFC to -STABLE. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=31029 
>Unformatted:
