From wollman@xyz.csail.mit.edu  Fri Dec 23 19:30:37 2005
Return-Path: <wollman@xyz.csail.mit.edu>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 465F916A41F
	for <FreeBSD-gnats-submit@freebsd.org>; Fri, 23 Dec 2005 19:30:37 +0000 (GMT)
	(envelope-from wollman@xyz.csail.mit.edu)
Received: from khavrinen.csail.mit.edu (khavrinen.csail.mit.edu [128.30.28.20])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3917343D7C
	for <FreeBSD-gnats-submit@freebsd.org>; Fri, 23 Dec 2005 19:30:16 +0000 (GMT)
	(envelope-from wollman@xyz.csail.mit.edu)
Received: from xyz.csail.mit.edu (xyz.csail.mit.edu [128.31.0.28])
	by khavrinen.csail.mit.edu (8.13.1/8.13.4) with ESMTP id jBNJUDbD086323
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256
	verify=NO CN= issuer=)
	for <FreeBSD-gnats-submit@freebsd.org>; Fri, 23 Dec 2005 14:30:15 -0500 (EST)
	(envelope-from wollman@xyz.csail.mit.edu)
Received: (from wollman@localhost)
	by xyz.csail.mit.edu (8.13.4/8.13.4/Submit) id jBNJUCjn000912;
	Fri, 23 Dec 2005 14:30:12 -0500 (EST)
	(envelope-from wollman)
Message-Id: <200512231930.jBNJUCjn000912@xyz.csail.mit.edu>
Date: Fri, 23 Dec 2005 14:30:12 -0500 (EST)
From: Garrett Wollman <wollman@xyz.csail.mit.edu>
Reply-To: Garrett Wollman <wollman@xyz.csail.mit.edu>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: 6.0 boot: name resolution broken for daemon startup
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         90863
>Category:       conf
>Synopsis:       [patch] 6.0 boot: name resolution broken for daemon startup
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    dougb
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Dec 23 19:40:03 GMT 2005
>Closed-Date:    Mon Aug 03 20:14:53 UTC 2009
>Last-Modified:  Mon Aug 03 20:14:53 UTC 2009
>Originator:     Garrett Wollman
>Release:        FreeBSD 6.0-RELEASE-p1 i386
>Organization:
>Environment:
System: FreeBSD xyz.csail.mit.edu 6.0-RELEASE-p1 FreeBSD 6.0-RELEASE-p1 #1: Fri Dec 23 13:27:54 EST 2005 wollman@xyz.csail.mit.edu:/usr/obj/usr/src/sys/XYZ i386

This is a fairly generic 6.0 install from source.

>Description:

This machine uses a local caching nameserver.  When booting, named starts
successfully but does not answer queries for about half a minute.  Meanwhile,
many other daemons (such as ntpd, sendmail, and apache) are started by the
rc scripts and fail because they depend on name resolution working immediately
on startup.

This is very difficult to debug as named is working by the time a console
login is possible.

There is a second name server referenced in /etc/resolv.conf, but libc will
not fall back to it because named is running and seems to be giving some
sort of answer (SERVFAIL?).
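
One way to confirm which response code the local named returns during this
window is to query it directly with dig(1) and read the status field in the
reply header.  The following is a minimal sketch and not part of the original
report: the function name dig_status is hypothetical, and it assumes dig's
usual ";; ->>HEADER<<-" line format.

```shell
#!/bin/sh
# Hypothetical helper: extract the response code (NOERROR, SERVFAIL, ...)
# from dig(1) output supplied on standard input.
dig_status() {
	sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Example use early in boot (commented out; needs a running named):
#   dig @127.0.0.1 localhost +time=1 +tries=1 | dig_status
```

If this prints SERVFAIL while the link is still down, that would match the
behavior suspected above.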

>How-To-Repeat:
	
>Fix:

I am using the following workaround:

--- /usr/src/etc/rc.d/named	Mon May 23 08:25:33 2005
+++ /etc/rc.d/named	Fri Dec 23 14:16:26 2005
@@ -115,3 +115,8 @@
 pidfile="${named_pidfile:-/var/run/${name}/pid}"
 
 run_rc_command "$1"
+case "$1" in
+*start*) while ! host localhost 2>/dev/null >/dev/null; do
+		sleep 1
+	done;;
+esac
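
The loop in this workaround waits indefinitely if the resolver never comes
up.  A bounded variant could look like the sketch below.  The helper name
wait_for_cmd and its arguments are hypothetical (they are not part of
rc.subr), and the limit of 30 attempts in the usage comment is an arbitrary
assumption.

```shell
#!/bin/sh
# Hypothetical helper: retry a probe command until it succeeds, giving up
# after `limit` failed attempts, sleeping `interval` seconds between tries.
# usage: wait_for_cmd limit interval command [args...]
wait_for_cmd() {
	_limit=$1
	_interval=$2
	shift 2
	_tries=0
	until "$@" >/dev/null 2>&1; do
		_tries=$((_tries + 1))
		if [ "$_tries" -ge "$_limit" ]; then
			echo "wait_for_cmd: gave up after $_tries attempts: $*" >&2
			return 1
		fi
		sleep "$_interval"
	done
	return 0
}

# In rc.d/named this might stand in for the unbounded loop:
#   wait_for_cmd 30 1 host localhost || echo "named is not answering" >&2
```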
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-rc 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Tue Dec 27 20:52:56 UTC 2005 
Responsible-Changed-Why:  
Patch addresses a possible problem in rc. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=90863 

From: JoaoBR <joao@matik.com.br>
To: bug-followup@freebsd.org, wollman@xyz.csail.mit.edu
Cc:  
Subject: Re: conf/90863: [patch] 6.0 boot: name resolution broken for daemon startup
Date: Tue, 27 Dec 2005 20:29:05 -0200

 I think that named is not starting first, so I guess the rc start order is
 wrong, rather than named failing to answer queries.
 
 I reported a similar problem here:
 http://www.freebsd.org/cgi/query-pr.cgi?pr=86668
 
 I also believe this touches on the same problem, that the timeout is too
 long; see this PR:
 
 http://www.freebsd.org/cgi/query-pr.cgi?pr=bin/62139
 (even if this is apparently ssh-related, the problem is the sshd DNS resolve
 timeout, which is too long)
 
 In my opinion the rc order needs to be corrected, the resolver timeout needs
 to be shorter, and a proper error message on startup would help to understand
 the problem, because hanging on
 
  starting sendmail .
 
 makes one believe the problem is in the sendmail configuration.
 
 João
 

From: Garrett Wollman <wollman@csail.mit.edu>
To: JoaoBR <joao@matik.com.br>
Cc: bug-followup@freebsd.org
Subject: Re: conf/90863: [patch] 6.0 boot: name resolution broken for daemon startup
Date: Tue, 27 Dec 2005 17:51:07 -0500

 <<On Tue, 27 Dec 2005 20:29:05 -0200, JoaoBR <joao@matik.com.br> said:
 
 > I think that named is not starting first and so I guess the rc start order is 
 > wrong and not that named do not answer queries, 
 
 No, on my system named definitely is started in the correct order:
 
 wollman@xyz(4)$ echo `rcorder *` | fold -s
 rcconf.sh dumpon initrandom geli gbde encswap ccd swap1 ramdisk early.sh fsck 
 root mountcritlocal var cleanvar random adjkerntz atm1 hostname ipfilter ipnat 
 ipfs kldxref sppp addswap sysctl serial pccard netif isdnd ppp-user ipfw 
 nsswitch ip6addrctl atm2 pfsync pflog pf routing ip6fw network_ipv6 mroute6d 
 route6d mrouted routed dhclient NETWORKING devd mountcritremote devfs ipmon 
 ramdisk-own newsyslog syslogd savecore SERVERS named ntpdate rpcbind nisdomain 
 [...]
 
 The problem seems to be related to the fact that the bge(4) network
 interface in this machine takes a long time to bring the link up.  When
 named starts, it attempts to validate the root zone cache before the
 network link comes up, forks, and returns SERVFAIL (?) to all requests
 until it is finally able to validate.  Older versions of named did not
 daemonize until the root zone cache was validated.
 
 This would not be a problem (that's why a server should always have
 another server after itself in /etc/resolv.conf) except that the stub
 resolver considers any reply (even "no I can't do that now") to be
 authoritative.  If named simply failed to respond to these queries,
 the resolver would fail over to the other server.
 
 -GAWollman
 
State-Changed-From-To: open->feedback 
State-Changed-By: dougb 
State-Changed-When: Sat Dec 31 01:50:46 UTC 2005 
State-Changed-Why:  

This is an interesting problem, and I have several responses. :) 

First, if you're sure that the problem is with the bge interface, 
I would prefer to see the problem fixed generically there, rather 
than in rc.d/named. However, I can see some value for having some 
sort of watchdog timer, similar to how it's done in /etc/rc.shutdown, 
that ensures named is working, or barks loudly if it's not. I will 
give some thought as to how to make that a more generic interface 
so that not just rc.shutdown and named can use it. Also, if we do 
this I think it should be behind a knob that is off by default. 

As for the boot order of named, Garrett is right, it starts as 
soon as it's possible for it to start. If Greg wants to change 
rc.d/sendmail to REQUIRE: named, that's up to him.  
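
A generic helper in the spirit of the rc.shutdown watchdog might look like
the sketch below.  All names here (run_with_watchdog and its variables) are
hypothetical, and nothing like it is claimed to exist in rc.subr.  It runs a
command in the background, starts a timer beside it, and terminates the
command with a loud message if the timer expires first.

```shell
#!/bin/sh
# Hypothetical helper: run a command, but terminate it and complain if it
# has not finished within `timeout` seconds.
# usage: run_with_watchdog timeout command [args...]
run_with_watchdog() {
	_timeout=$1
	shift
	"$@" &
	_cmd_pid=$!
	(
		# If told to stop early, also stop the pending sleep.
		trap 'kill $! 2>/dev/null; exit 0' TERM
		sleep "$_timeout" &
		wait $!
		echo "watchdog: '$*' still running after ${_timeout}s, killing it" >&2
		kill -TERM "$_cmd_pid" 2>/dev/null
	) &
	_dog_pid=$!
	wait "$_cmd_pid"
	_status=$?
	# The command finished or was killed; cancel the watchdog if pending.
	kill -TERM "$_dog_pid" 2>/dev/null
	wait "$_dog_pid" 2>/dev/null
	return "$_status"
}

# Possible use from a boot script (commented out):
#   run_with_watchdog 60 sh -c 'until host localhost; do sleep 1; done'
```

Unlike rc.shutdown, which kills the whole script when its timer fires, this
sketch only terminates the one command it was asked to supervise.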


Responsible-Changed-From-To: freebsd-rc->dougb 
Responsible-Changed-By: dougb 
Responsible-Changed-When: Sat Dec 31 01:50:46 UTC 2005 
Responsible-Changed-Why:  

I will take responsibility for looking at the issue of a more 
generic watchdog timer that boot scripts can use. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=90863 

From: Garrett Wollman <wollman@csail.mit.edu>
To: Doug Barton <dougb@FreeBSD.org>
Cc: freebsd-rc@FreeBSD.org, bug-followup@FreeBSD.org
Subject: Re: conf/90863: [patch] 6.0 boot: name resolution broken for daemon startup
Date: Fri, 30 Dec 2005 21:22:44 -0500

 <<On Sat, 31 Dec 2005 01:57:56 GMT, Doug Barton <dougb@FreeBSD.org> said:
 
 > First, if you're sure that the problem is with the bge interface,
 > I would prefer to see the problem fixed generically there, rather
 > than in rc.d/named.
 
 It's not a problem with bge(4), it's a general problem with network
 interfaces that take a long time to bring the link up after it is
 initialized.  (I expect to have the same problem with ti(4) on a
 machine I'm upgrading right now.)  In this particular case I'm willing
 to wait forever, since the machine can't do anything useful until it
 has network, but that would be unacceptable for the general case.
 Ordinary workstations using DHCP don't see this, because you obviously
 can't get a lease until you can communicate with the DHCP server.
 
 What I'd like would be to have a "don't fork until you're really
 ready" option for named (or even better, for that to be restored as
 the default behavior); servers without a local resolver don't have
 this problem, because the stub resolver will retry requests that don't
 elicit a response.  I think that's a superior solution to anything
 that requires explicit configuration on the part of the sysadmin.
 
 > As for the boot order of named, Garrett is right, it starts as
 > soon as it's possible for it to start. If Greg wants to change
 > rc.d/sendmail to REQUIRE: named, that's up to him. 
 
 sendmail already REQUIRE:s LOGIN, so there's no issue there.
 
 -GAWollman
 

From: Brooks Davis <brooks@one-eyed-alien.net>
To: Garrett Wollman <wollman@csail.mit.edu>
Cc: Doug Barton <dougb@freebsd.org>, freebsd-rc@freebsd.org,
        bug-followup@freebsd.org
Subject: Re: conf/90863: [patch] 6.0 boot: name resolution broken for daemon startup
Date: Fri, 6 Jan 2006 10:54:44 -0800

 On Fri, Dec 30, 2005 at 09:22:44PM -0500, Garrett Wollman wrote:
 > <<On Sat, 31 Dec 2005 01:57:56 GMT, Doug Barton <dougb@FreeBSD.org> said:
 >
 > > First, if you're sure that the problem is with the bge interface,
 > > I would prefer to see the problem fixed generically there, rather
 > > than in rc.d/named.
 >
 > It's not a problem with bge(4), it's a general problem with network
 > interfaces that take a long time to bring the link up after it is
 > initialized.  (I expect to have the same problem with ti(4) on a
 > machine I'm upgrading right now.)  In this particular case I'm willing
 > to wait forever, since the machine can't do anything useful until it
 > has network, but that would be unacceptable for the general case.
 > Ordinary workstations using DHCP don't see this, because you obviously
 > can't get a lease until you can communicate with the DHCP server.
 >
 > What I'd like would be to have a "don't fork until you're really
 > ready" option for named (or even better, for that to be restored as
 > the default behavior); servers without a local resolver don't have
 > this problem, because the stub resolver will retry requests that don't
 > elicit a response.  I think that's a superior solution to anything
 > that requires explicit configuration on the part of the sysadmin.
 
 On the whole, daemons should operate on the assumption that the network
 will take an arbitrarily long time to come up and that it may come
 and go at any time.  A user should be able to boot their laptop while
 on an airplane, suspend to disk for landing, boot up again and acquire
 a network connection, and have all their daemons work correctly.
 Likewise, a copy of FreeBSD running on a virtual server should support
 being suspended, copied to a different datacenter, and coming back up
 with new addresses.  Obviously we're not there yet in a number of
 areas, but this is where we should be heading and we can work on
 server/libc behavior in advance of the kernel actually working.
 
 -- Brooks
 
 -- 
 Any statement of the form "X is the one, true Y" is FALSE.
 PGP fingerprint 655D 519C 26A7 82E7 2529  9BF0 5D8E 8BE9 F238 1AD4
 
State-Changed-From-To: feedback->suspended 
State-Changed-By: linimon 
State-Changed-When: Sun Mar 2 02:36:05 UTC 2008 
State-Changed-Why:  
Feedback was received quite some time ago.  It does not look like this 
issue is being actively worked on. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=90863 
State-Changed-From-To: suspended->feedback 
State-Changed-By: dougb 
State-Changed-When: Mon Jun 1 04:57:49 UTC 2009 
State-Changed-Why:  

I've added the named_wait feature to HEAD, have you had a chance 
to use it yet? 
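
For reference, enabling the feature should only require setting the knob in
/etc/rc.conf.  The fragment below is a sketch that assumes the variable names
named_wait and named_wait_host as they appear in /etc/defaults/rc.conf on
systems that carry the change; the host name shown is only an example.

```shell
# /etc/rc.conf fragment (assumed knob names; check /etc/defaults/rc.conf)
named_enable="YES"
named_wait="YES"                # wait in rc.d/named until lookups succeed
named_wait_host="localhost"     # name to resolve while waiting
```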

http://www.freebsd.org/cgi/query-pr.cgi?pr=90863 
State-Changed-From-To: feedback->closed 
State-Changed-By: dougb 
State-Changed-When: Mon Aug 3 20:14:23 UTC 2009 
State-Changed-Why:  

The named_wait option has now been MFC'ed 

http://www.freebsd.org/cgi/query-pr.cgi?pr=90863 
>Unformatted:
