From nobody@FreeBSD.org  Fri Feb 22 09:51:00 2002
Return-Path: <nobody@FreeBSD.org>
Received: from freefall.freebsd.org (freefall.FreeBSD.org [216.136.204.21])
	by hub.freebsd.org (Postfix) with ESMTP id CC2DF37B405
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 22 Feb 2002 09:50:59 -0800 (PST)
Received: (from nobody@localhost)
	by freefall.freebsd.org (8.11.6/8.11.6) id g1MHoxh45595;
	Fri, 22 Feb 2002 09:50:59 -0800 (PST)
	(envelope-from nobody)
Message-Id: <200202221750.g1MHoxh45595@freefall.freebsd.org>
Date: Fri, 22 Feb 2002 09:50:59 -0800 (PST)
From: Sandeep Kumar <skumar@juniper.net>
To: freebsd-gnats-submit@FreeBSD.org
Subject: dump program hangs while exiting
X-Send-Pr-Version: www-1.0

>Number:         35214
>Category:       bin
>Synopsis:       dump(8) program hangs while exiting
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    obrien
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Feb 22 10:00:03 PST 2002
>Closed-Date:    
>Last-Modified:  Sat Dec 22 20:50:00 UTC 2012
>Originator:     Sandeep Kumar
>Release:        4.2
>Organization:
Juniper
>Environment:
>Description:
Backtrace of the hung dump process:
0x88055efc in nanosleep () from /usr/libexec/ld-elf.so.1
(gdb) bt
#0  0x88055efc in nanosleep () from /usr/libexec/ld-elf.so.1
#1  0x88054bd9 in wlock_acquire () from /usr/libexec/ld-elf.so.1
#2  0x880539fa in rtld_exit () from /usr/libexec/ld-elf.so.1
#3  0x880de0c8 in exit () from /usr/lib/libc.so.4
#4  0x804c9d3 in Exit ()
#5  0x804cb20 in enslave ()
#6  0x804c8f3 in startnewtape ()
#7  0x804a622 in main ()
#8  0x8049601 in _start ()

wlock_acquire waits forever for the lock to be released by a reader. Since
dump is a non-threaded application, the reader has to be this process itself.
The lock and unlock invocations of this lock appear to be paired. The lock
structure did not look corrupted either, and its fields were consistent.
So the only remaining possibility was that the process was interrupted by a
signal while it held the read lock. The SIGUSR2 handler of dump makes this
plausible: the handler calls longjmp, which can leave the read lock locked
if the jump happens during the _rtld_bind operation. It remained to confirm
that this is what had happened.
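
The suspected failure mode can be sketched in isolation. This is only an
illustration, not rtld code: fake_readers is a hypothetical stand-in for the
rtld read lock, and on_usr2 and leaked_readers are made-up names. The signal
is raised synchronously inside the "critical section" to make the effect
deterministic; in dump it would arrive asynchronously.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for the rtld read lock: a plain reader count. */
static volatile sig_atomic_t fake_readers;
static sigjmp_buf env;

static void
on_usr2(int signo)
{
	(void)signo;
	/* Abandons whatever the interrupted code was in the middle of. */
	siglongjmp(env, 1);
}

/* Returns the reader count left behind after the handler jumps out. */
int
leaked_readers(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_usr2;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR2, &sa, NULL) == -1)
		return (-1);

	fake_readers = 0;
	if (sigsetjmp(env, 1) == 0) {
		fake_readers++;		/* "acquire" the read lock */
		raise(SIGUSR2);		/* signal lands inside the critical section */
		fake_readers--;		/* "release": skipped by the jump */
	}
	return (fake_readers);		/* the count is never dropped */
}
```

Calling leaked_readers() returns 1: the release is never executed, so a later
writer that waits for the reader count to reach zero would wait forever.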

By taking a symbolic dump of the stack page, I was able to locate the sigframe
structure on the stack. The kernel copies this structure onto the user
stack; it contains the register state at the time of the trap, when the
signal was delivered. Some of its signature items are the signal number and
the saved return pointer to the signal trampoline code at the base of the
user stack. The structure looked good, and the saved eip was symlook_list+27.
This function ends up getting called after a call to _rtld_bind, which does
acquire the read lock. So we did do a longjmp while holding the read lock.

The dump application makes extensive use of signals to communicate between
the different children of the dump program. SIGUSR2 is delivered 3-4 times a
second, so it is possible that once in a while it gets delivered while we
are doing a _rtld_bind. An obvious solution would be to block signals
while the read lock is held. In fact, the same signal-blocking fix
was made for acquiring the writer version of this lock, by jdp@polstra.com,
in BSD. When I suggested making the same change for the reader lock, I
received the following reply:
"It would hurt performance too much.  The rtld would have to do two
system calls for every symbol it resolved lazily."

 
>How-To-Repeat:
Run "dump -f - FS1 | restore -f - FS2" in an infinite loop
>Fix:
May involve redesigning dump so that it does not use longjmp, as it is not safe to call longjmp from the signal handler.
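
One possible shape for such a redesign, as a minimal sketch only: the handler
just sets a flag, and the waiter blocks SIGUSR2, tests the flag, and sleeps in
sigsuspend with the old mask, so the signal cannot slip in between the test
and the sleep. The names proceed and caught follow tape.c; install_proceed and
wait_for_proceed are hypothetical helpers, and the sketch assumes SIGUSR2 is
not already blocked by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t caught;	/* set by the SIGUSR2 handler */

static void
proceed(int signo)
{
	(void)signo;
	caught = 1;		/* no longjmp: just record the signal */
}

/* Hypothetical helper: install the SIGUSR2 handler. Returns 0 or -1. */
int
install_proceed(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = proceed;
	sigemptyset(&sa.sa_mask);
	return (sigaction(SIGUSR2, &sa, NULL));
}

/* Hypothetical helper: return once SIGUSR2 has been caught. */
int
wait_for_proceed(void)
{
	sigset_t usr2, omask;

	sigemptyset(&usr2);
	sigaddset(&usr2, SIGUSR2);

	/* Block SIGUSR2 so the test of caught cannot race with delivery. */
	if (sigprocmask(SIG_BLOCK, &usr2, &omask) == -1)
		return (-1);
	while (!caught)
		(void)sigsuspend(&omask);	/* atomically unblock and sleep */
	caught = 0;
	return (sigprocmask(SIG_SETMASK, &omask, NULL));
}
```

The loop around sigsuspend is a choice made for this sketch, so that an
unrelated signal waking the process does not end the wait early.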
>Release-Note:
>Audit-Trail:

From: Sandeep Kumar <skumar@juniper.net>
To: freebsd-gnats-submit@FreeBSD.org, skumar@juniper.net
Cc:  
Subject: Re: bin/35214: dump program hangs while exiting
Date: Mon, 04 Mar 2002 09:30:39 -0800

 Replacing setjmp/pause/longjmp with sigprocmask/sigsuspend fixes the 
 problem.
 Here is the diff for tape.c
 
 Index: tape.c
 ===================================================================
 RCS file: /cvs/junos-2001/src/sbin/dump/tape.c,v
 retrieving revision 1.1.1.4
 retrieving revision 1.2
 diff -r1.1.1.4 -r1.2
 60d59
 < #include <setjmp.h>
 124,127d122
 < static int ready;     /* have we reached the lock point without having */
 <                       /* received the SIGUSR2 signal from the prev slave? */
 < static jmp_buf jmpbuf;        /* where to jump to if we are ready when the */
 <                       /* SIGUSR2 arrives from the previous slave */
 682,684d676
 <
 <       if (ready)
 <               longjmp(jmpbuf, 1);
 757a750,755
  >         sigset_t sigusr2_mask, omask;
  >
  >       /* Create a mask with SIGUSR2 */
  >         sigemptyset(&sigusr2_mask);
  >         sigaddset(&sigusr2_mask, SIGUSR2);
  >
 792,797c790,792
 <               if (setjmp(jmpbuf) == 0) {
 <                       ready = 1;
 <                       if (!caught)
 <                               (void) pause();
 <               }
 <               ready = 0;
 ---
  >               sigprocmask(SIG_BLOCK, &sigusr2_mask, &omask); /*Mask SIGUSR2*/;
  >               if (!caught)
  >                       (void) sigsuspend(&omask); /* wait for SIGUSR2 */
 798a794
  >               sigprocmask(SIG_SETMASK, &omask, NULL); /* Set the old mask */
 
 

From: Sandeep Kumar <skumar@juniper.net>
To: freebsd-gnats-submit@FreeBSD.org, skumar@juniper.net
Cc:  
Subject: Re: bin/35214: dump program hangs while exiting
Date: Mon, 04 Mar 2002 11:32:41 -0800

 Submitting unified diff for the fix.
 
 Index: tape.c
 ===================================================================
 RCS file: /cvs/junos-2001/src/sbin/dump/tape.c,v
 retrieving revision 1.1.1.4
 retrieving revision 1.2
 diff -u -r1.1.1.4 -r1.2
 --- tape.c      2001/03/31 04:38:41     1.1.1.4
 +++ tape.c      2002/03/01 17:53:24     1.2
 @@ -57,7 +57,6 @@
  
  #include <errno.h>
  #include <fcntl.h>
 -#include <setjmp.h>
  #include <signal.h>
  #include <stdio.h>
  #ifdef __STDC__
 @@ -121,10 +120,6 @@
  int master;            /* pid of master, for sending error signals */
  int tenths;            /* length of tape used per block written */
  static int caught;     /* have we caught the signal to proceed? */
 -static int ready;      /* have we reached the lock point without having */
 -                       /* received the SIGUSR2 signal from the prev slave? */
 -static jmp_buf jmpbuf; /* where to jump to if we are ready when the */
 -                       /* SIGUSR2 arrives from the previous slave */
  
  int
  alloctape()
 @@ -679,9 +674,6 @@
  proceed(signo)
         int signo;
  {
 -
 -       if (ready)
 -               longjmp(jmpbuf, 1);
         caught++;
  }
  
 @@ -755,7 +747,13 @@
  {
         register int nread;
         int nextslave, size, wrote, eot_count;
 +        sigset_t sigusr2_mask, omask;
 +
 +       /* Create a mask with SIGUSR2 */
 +        sigemptyset(&sigusr2_mask);
 +        sigaddset(&sigusr2_mask, SIGUSR2);
  
 +
         /*
          * Need our own seek pointer.
          */
 @@ -788,14 +786,12 @@
                                     TP_BSIZE) != TP_BSIZE)
                                        quit("master/slave protocol botched.\n");
                         }
 -               }
 -               if (setjmp(jmpbuf) == 0) {
 -                       ready = 1;
 -                       if (!caught)
 -                               (void) pause();
                 }
 -               ready = 0;
 +               sigprocmask(SIG_BLOCK, &sigusr2_mask, &omask); /*Mask SIGUSR2*/;
 +               if (!caught)
 +                       (void) sigsuspend(&omask); /* wait for SIGUSR2 */
                 caught = 0;
 +               sigprocmask(SIG_SETMASK, &omask, NULL); /* Set the old mask */
  
                 /* Try to write the data... */
                 eot_count = 0;
 
Responsible-Changed-From-To: freebsd-bugs->obrien 
Responsible-Changed-By: obrien 
Responsible-Changed-When: Thu Jun 6 09:38:44 PDT 2002 
Responsible-Changed-Why:  
I'll commit this as soon as dumps work again in -current. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=35214 

From: Niclas Zeising <niclas.zeising@gmail.com>
To: bug-followup@FreeBSD.org, skumar@juniper.net
Cc: obrien@freebsd.org
Subject: Re: bin/35214: dump(8) program hangs while exiting
Date: Tue, 17 May 2011 15:43:07 +0200

 As far as I can tell, this is not committed, and possibly still an issue,
 as the setjmp/longjmp calls are still there. Obrien, what's the status
 on this? Is it simply forgotten?
 Regards!
 -- 
 Niclas

From: Chris Rees <utisoft@gmail.com>
To: "bug-followup@freebsd.org" <bug-followup@freebsd.org>
Cc:  
Subject: Re: ports/35214
Date: Sat, 22 Dec 2012 20:43:29 +0000

 Any update?
 
 Chris
>Unformatted:
