From gad@freefour.acs.rpi.edu  Sat Sep  2 21:51:42 2000
Return-Path: <gad@freefour.acs.rpi.edu>
Received: from freefour.acs.rpi.edu (freefour.acs.rpi.edu [128.113.24.91])
	by hub.freebsd.org (Postfix) with ESMTP id D4CD337B422
	for <FreeBSD-gnats-submit@freebsd.org>; Sat,  2 Sep 2000 21:51:41 -0700 (PDT)
Received: (from gad@localhost)
	by freefour.acs.rpi.edu (8.9.3/8.9.3) id AAA73984;
	Sun, 3 Sep 2000 00:51:39 -0400 (EDT)
	(envelope-from gad)
Message-Id: <200009030451.AAA73984@freefour.acs.rpi.edu>
Date: Sun, 3 Sep 2000 00:51:39 -0400 (EDT)
From: Garance A Drosehn <gad@freefour.acs.rpi.edu>
Reply-To: gad@eclipse.acs.rpi.edu
To: FreeBSD-gnats-submit@freebsd.org
Cc: gad@eclipse.acs.rpi.edu
Subject: Fix for lpr's handling of lots of jobs in a queue
X-Send-Pr-Version: 3.2

>Number:         21008
>Category:       bin
>Synopsis:       lpr(1) Fix for lpr's handling of lots of jobs in a queue
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    gad
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 02 22:00:00 PDT 2000
>Closed-Date:    
>Last-Modified:  Wed May 21 20:47:46 UTC 2008
>Originator:     Garance A Drosehn
>Release:        FreeBSD 4.-stable and 5.-current i386
>Organization:
RPI ; Troy, NY
>Environment:

	Using freebsd's lpr/lpd on print servers in a busy setting.

>Description:

	Lpr uses a counter from 0 to 999 when spooling jobs in a queue
	which cycles back to 0.  The assumption is that by the time the
	counter cycles around, the earlier jobs are long gone.  If that
	assumption is wrong, then the error is handled in a way which
	is pretty painful.  (Note that I have only checked how lpd's
	recvjob routine handles it when accepting jobs from some other
	host; I haven't looked at what lpr does if a client is "full up".)
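
	As a rough illustration of why the wrap-around matters, here is a
	small C sketch.  It assumes the conventional lpd naming layout
	("df" or "cf", the letter A, a three-digit job number, then the
	sending host's name).  spool_name() and names_collide() are
	hypothetical helpers written for this note, not routines from lpr.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define JOB_RANGE 1000		/* counter runs 0..999, then wraps to 0 */

/*
 * Build a spool file name such as "dfA007client".
 * Hypothetical helper; the name layout is an assumption.
 */
static int
spool_name(char *buf, size_t len, const char *prefix, long seq,
    const char *host)
{
	assert(seq >= 0);
	return (snprintf(buf, len, "%sA%03ld%s", prefix, seq % JOB_RANGE,
	    host));
}

/* Return 1 if two submissions from one host map to the same datafile name. */
static int
names_collide(long seq_a, long seq_b, const char *host)
{
	char a[128], b[128];

	spool_name(a, sizeof(a), "df", seq_a, host);
	spool_name(b, sizeof(b), "df", seq_b, host);
	return (strcmp(a, b) == 0);
}
```

	Under that assumption, the 1007th submission from a host gets the
	same datafile name as the 7th, so if the earlier job is still
	queued, the server sees a file that already exists.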

	What happens is that recvjob (readfile) notices that the datafile
	already exists, so it calls frecverr.  That goes to cleanup, and
	the cleanup routine assumes that the INCOMING file was a problem,
	and thus removes it.  This means you have removed the datafile for
	an EARLIER job, and told the sending host an error occurred.  The
	sending host may respond to this error by waiting a bit, and then
	resending the same file.  Now the datafile will not already exist,
	because it was destroyed, so the datafile will transfer successfully.
	However, the control file for the earlier job still exists, and when
	the control file for the incoming job arrives, IT errors.  The
	receiving host again sends an error to the sending host, and it seems
	that the sending host decides this is a good reason to just remove
	the job on its side.
	So, you've gone from two (or more) datafiles and two control files
	to one control file and no datafiles.
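
	The first step of that sequence can be sketched in isolation.
	This is not the code from recvjob.c; naive_receive() is a
	hypothetical stand-in that only mimics the behaviour described
	above (an exclusive open followed by a cleanup that removes the
	name unconditionally), and the 0660 mode is a placeholder.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the receive path described above.
 * Returns 0 if the file was created, -1 on any open() failure.
 */
static int
naive_receive(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0660);
	if (fd < 0) {
		/*
		 * The error path assumes `path` is a partial INCOMING
		 * file and removes it.  When the failure was EEXIST,
		 * the name actually belongs to an EARLIER job.
		 */
		(void)unlink(path);
		return (-1);
	}
	(void)close(fd);
	return (0);
}
```

	Calling naive_receive() on a name that already exists reports an
	error and also leaves the earlier file gone, so a retry with the
	same name then succeeds.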

	We should (optionally) allow a larger range for the counter, but
	I'm not ready to write that right now.  In any case, there needs
	to be a fix for how recvjob behaves when an overflow of the counter's
	range does occur.

>How-To-Repeat:

	You could send over a thousand jobs to a queue, but that's a bit
	unwieldy.  Instead, I suggest:
	    On "server":
		lpc stop <printer>
	    On "client":
		lpc stop <printer>
	        lpr -P<printer> somefile
	    go to spool directory, and save a copy of the cf and df files.
	        lpc start <printer>
	    (the files for the job go from client to server, and sit there)
	    recreate cf and df files, with the exact same name, from copies.
	        lpc start <printer>
	    then watch what happens...

>Fix:
	
	The real fix, in my opinion, would be a pretty significant rewrite
	of recvjob.c.

	The interim fix is to change recvjob such that the receiving host
	will tell the sending host that it is "out of space" if a file is
	being sent which already exists on the server.  Assuming datafiles
	are being correctly removed as each job finishes printing, this
	works well.  NOTE: there do seem to be some situations where a
	datafile is left behind (not-removed) even though the job has in
	fact printed.  I do not know if those are due to other changes I
	have made in my lpr, or if everyone sees them.  In any case, those
	leftover data files could now cause queues to "stall" with this
	update.  Still, no data is lost, and both the server and the client
	will have some information as to why the stall has happened.  So,
	I still think this is a reasonable fix, even if it isn't foolproof.

	Here is the update:

--- recvjob.c.orig	Sat Sep  2 23:39:25 2000
+++ recvjob.c	Sat Sep  2 23:35:45 2000
@@ -58,6 +58,7 @@
 #include <signal.h>
 #include <fcntl.h>
 #include <dirent.h>
+#include <errno.h>
 #include <syslog.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -239,8 +240,18 @@
 	int fd, err;
 
 	fd = open(file, O_CREAT|O_EXCL|O_WRONLY, FILMOD);
-	if (fd < 0)
-		frecverr("readfile: %s: illegal path name: %m", file);
+	if (fd < 0) {
+		if (errno != EEXIST)
+			frecverr("readfile: %s: illegal path name: %m", file);
+		/* the open() failed because the file already exists.  This
+		 * may just mean that there already are 1000 jobs in the queue
+		 * from the sending host.  Treat this as if we are temporarily
+		 * out-of-space for new jobs */
+		syslog(LOG_INFO, "returning 'no-space' to %s because %s already exists", fromb, file);
+		sleep(2);
+		(void) write(1, "\2", 1);
+		return(0);
+	}
 	ack();
 	err = 0;
 	for (i = 0; i < size; i += BUFSIZ) {
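
	For anyone who wants to try the idea outside of lpd, the same
	decision can be pulled into a standalone sketch.  The enum and
	open_incoming() are hypothetical names used only here, and the
	0660 mode is a placeholder for FILMOD.  The point is just that
	EEXIST is separated from every other open() failure, and that the
	existing file is left alone.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum recv_status {
	RECV_OK,	/* file created; caller may ack and read the data */
	RECV_BUSY,	/* name already in use; caller replies "no space" */
	RECV_FATAL	/* any other failure; caller takes the error path */
};

/*
 * Try to create the incoming file exclusively.  On RECV_OK, *fdp holds
 * the open descriptor.  An existing file is never touched.
 */
static enum recv_status
open_incoming(const char *path, int *fdp)
{
	assert(path != NULL && fdp != NULL);
	*fdp = open(path, O_CREAT | O_EXCL | O_WRONLY, 0660);
	if (*fdp >= 0)
		return (RECV_OK);
	return (errno == EEXIST ? RECV_BUSY : RECV_FATAL);
}
```

	In the patch above, the RECV_BUSY case corresponds to logging the
	collision, sleeping briefly, and writing "\2" back to the sender
	instead of calling frecverr.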


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->sheldonh 
Responsible-Changed-By: sheldonh 
Responsible-Changed-When: Tue Sep 5 03:29:31 PDT 2000 
Responsible-Changed-Why:  
I'll take this one. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=21008 

From: Sheldon Hearn <sheldonh@uunet.co.za>
To: gad@eclipse.acs.rpi.edu
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: bin/21008: Fix for lpr's handling of lots of jobs in a queue 
Date: Tue, 05 Sep 2000 12:48:55 +0200

 On Sun, 03 Sep 2000 00:51:39 -0400, Garance A Drosehn wrote:
 
 >  	fd = open(file, O_CREAT|O_EXCL|O_WRONLY, FILMOD);
 > -	if (fd < 0)
 > -		frecverr("readfile: %s: illegal path name: %m", file);
 > +	if (fd < 0) {
 > +		if (errno != EEXIST)
 > +			frecverr("readfile: %s: illegal path name: %m", file);
 		[...]
 > +		(void) write(1, "\2", 1);
 > +		return(0);
 > +	}
 
 Hi Garance,
 
 Doesn't this introduce a file descriptor leak?
 
 Ciao,
 Sheldon.
 

From: Garance A Drosehn <gad@eclipse.acs.rpi.edu>
To: Sheldon Hearn <sheldonh@uunet.co.za>
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: bin/21008: Fix for lpr's handling of lots of jobs in a queue 
Date: Tue,  5 Sep 2000 17:21:11 -0400

 > Sheldon Hearn <sheldonh@uunet.co.za> wrote:
 > On Sun, 03 Sep 2000, Garance A Drosehn wrote:
 >
 > >  	fd = open(file, O_CREAT|O_EXCL|O_WRONLY, FILMOD);
 > > -	if (fd < 0)
 > > -		frecverr("readfile: %s: illegal path name: %m", file);
 > > +	if (fd < 0) {
 > > +		if (errno != EEXIST)
 > > +			frecverr("readfile: %s: illegal path name: %m", file);
 > 		[...]
 > > +		(void) write(1, "\2", 1);
 > > +		return(0);
 > > +	}
 >
 > Hi Garance,
 >
 > Doesn't this introduce a file descriptor leak?
 
 I would not expect it to.  This whole 'if' clause will only happen
 if that call to open() has failed.  That call to open() is the
 first subroutine call in the readfile() routine, so if the open
 has failed then the routine has not done anything by the time of
 that 'return(0)' call.
 
 Hrm.  I guess style-conventions would suggest that 'return(0);'
 should be 'return (0);' (with the blank before the parenthesis),
 but other than that I don't see any problem with the suggested
 change.  And even that style faux-pas I picked up from the other
 two return statements in the same readfile routine.
 
 What file-descriptor leak would you expect this to cause?
 
 ---
 Garance Alistair Drosehn     =     gad@eclipse.acs.rpi.edu
 Senior Systems Programmer        (MIME & NeXTmail capable)
 Rensselaer Polytechnic Institute;           Troy NY    USA
 

From: Sheldon Hearn <sheldonh@uunet.co.za>
To: Garance A Drosehn <gad@eclipse.acs.rpi.edu>
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: bin/21008: Fix for lpr's handling of lots of jobs in a queue 
Date: Wed, 06 Sep 2000 10:02:43 +0200

 On Tue, 05 Sep 2000 17:21:11 -0400, Garance A Drosehn wrote:
 
 > I would not expect it to.  This whole 'if' clause will only happen
 > if that call to open() has failed.
 [...]
 > What file-descriptor leak would you expect this to cause?
 
 Never mind.  I was reading too fast. :-)
 
 Ciao,
 Sheldon.
 

From: Garance A Drosihn <drosih@rpi.edu>
To: freebsd-gnats-submit@FreeBSD.org
Cc:  
Subject: Re: bin/21008: Fix for lpr's handling of lots of jobs in a queue
Date: Thu, 12 Oct 2000 10:05:25 -0400

 Minor note:
 
 The update I included in this original PR still has a problem
 in some obscure cases.  Namely, if we hit the 1,000 job mark
 AND we have a print server which accepts jobs from multiple
 hosts, AND:
 
 1) if the printing client has multiple network interfaces,
     and the job is sent from an interface whose IP address
     does not match the result from 'hostname' on the client.
     Hmm.
     I guess this isn't limited to clients with multiple
     interfaces.  It would also happen for clients with one
     interface (say, a ppp connection) that has a different
     hostname than the client itself uses for 'hostname'.
 
 or
 
 2) if a print server receives a job from a printing client,
     and turns around and sends that job on to a second print
     server.  The problem happens on the second print server.
 
 In both cases, the incoming print job might be destroyed.
 (this problem doesn't come up at RPI, because of other
 changes I have made...)
 
 Note that even in this problem situation, we're better off
 than we are without this update.  Without the original
 update, BOTH the incoming job AND the matching job (the
 one already in the print-server's queue) will be destroyed.
 
 So the update is still worth applying, although it needs
 to be improved upon in the future.
 
 ---
 Garance Alistair Drosehn           =   gad@eclipse.acs.rpi.edu
 Senior Systems Programmer          or  drosih@rpi.edu
 Rensselaer Polytechnic Institute
 

From: Garance A Drosihn <drosih@rpi.edu>
To: freebsd-gnats-submit@FreeBSD.org
Cc:  
Subject: Re: bin/21008: Fix for lpr's handling of lots of jobs in a queue
Date: Thu, 12 Oct 2000 21:00:19 -0400

 Earlier today, I wrote:
 >   So the update is still worth applying, although it
 >   needs to be improved upon in the future.
 
 I have an improved version of the patch now available
 in
 
 ftp://freefour.acs.rpi.edu/pub/bsdlpr/lpr-recv.diff
 
 but this version assumes the trstat2.diff patch (in the
 same directory) has already been applied, so this PR
 can wait until that patch is applied.
 
 The newer patch still isn't perfect, but it relegates
 the problem to even more obscure situations, and I think
 it is about the best that can be done short of rewriting
 assorted areas of lpr.
 
 ---
 Garance Alistair Drosehn           =   gad@eclipse.acs.rpi.edu
 Senior Systems Programmer          or  drosih@rpi.edu
 Rensselaer Polytechnic Institute
 

Responsible-Changed-From-To: sheldonh->gad 
Responsible-Changed-By: gad 
Responsible-Changed-When: Wed Nov 1 18:25:15 PST 2000 
Responsible-Changed-Why:  
Now that I have committer access, this is another of my PR's that I 
should take over.  I've already asked Sheldon about this, and he 
seemed more than happy to let me have it.  Er, or something like 
that. 
I will be committing a patch to fix this soon. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=21008 

From: Garance A Drosehn <gad@FreeBSD.org>
To: freebsd-gnats-submit@FreeBSD.org, gad@eclipse.acs.rpi.edu
Cc:  
Subject: Re: bin/21008: Fix for lpr's handling of lots of jobs in a queue
Date: Mon, 21 Apr 2003 16:07:26 -0400

 About time that I wrote a followup to this, just so people don't think 
 I completely forgot about it.
 
 I do have a patch for this problem installed at RPI, and what I found 
 out is that it had some minor side-effects.  The patch saves you from 
 the problem of jobs disappearing (*iff* you print 1000 jobs before the 
 first one prints), but it introduced a new problem where a print queue 
 can 'mysteriously hang'.  Actually, the problem is not all that 
 mysterious to me, but every time it has happened here at RPI, it has 
 totally baffled everyone else.
 
 The other thing about this is that the 'hung queue' problem can happen 
 much more frequently on a busy print server than the deleting-jobs 
 problem.  So, I need to write a different solution for the problem.  I 
 have an idea of how I want that to work, but I haven't written it yet.
 
 -- 
 Garance Alistair Drosehn     =      gad@gilead.netel.rpi.edu
 Senior Systems Programmer               or   gad@FreeBSD.org
 Rensselaer Polytechnic Institute;             Troy, NY;  USA
 
>Unformatted:
