From james@jrv.org  Thu Sep 10 18:18:02 2009
Return-Path: <james@jrv.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 23CE31065672
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 10 Sep 2009 18:18:02 +0000 (UTC)
	(envelope-from james@jrv.org)
Received: from mail.jrv.org (adsl-70-243-84-13.dsl.austtx.swbell.net [70.243.84.13])
	by mx1.freebsd.org (Postfix) with ESMTP id B6A778FC17
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 10 Sep 2009 18:18:01 +0000 (UTC)
Received: from bigtex.housenet.jrv (localhost [127.0.0.1])
	by mail.jrv.org (8.14.3/8.14.3) with ESMTP id n8AHkOGt005602
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 10 Sep 2009 12:46:24 -0500 (CDT)
	(envelope-from james@bigtex.housenet.jrv)
Received: (from root@localhost)
	by bigtex.housenet.jrv (8.14.3/8.14.3/Submit) id n8AHkOxV005601;
	Thu, 10 Sep 2009 12:46:24 -0500 (CDT)
	(envelope-from james)
Message-Id: <200909101746.n8AHkOxV005601@bigtex.housenet.jrv>
Date: Thu, 10 Sep 2009 12:46:24 -0500 (CDT)
From: "James R. Van Artsdalen" <james-freebsd-current@jrv.org>
Reply-To: "James R. Van Artsdalen" <james-freebsd-current@jrv.org>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: [patch] FreeBSD/amd64 can't see all system memory
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         138709
>Category:       kern
>Synopsis:       [zfs] zfs recv hangs, pool accesses hang in rrl->rr_cv state
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    mm
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Sep 10 18:20:01 UTC 2009
>Closed-Date:    Mon Jun 06 11:45:22 UTC 2011
>Last-Modified:  Mon Jun 06 11:45:22 UTC 2011
>Originator:     James R. Van Artsdalen
>Release:        FreeBSD 9.0-CURRENT amd64
>Organization:
>Environment:
System: FreeBSD pygmy.housenet.jrv 9.0-CURRENT FreeBSD 9.0-CURRENT #4 r197044M: Wed Sep  9 18:58:08 CDT 2009     james@pygmy.housenet.jrv:/usr/obj/usr/src/sys/GENERIC  amd64


>Description:

ZFS recv hung last night during the daily periodic script.  Most of the
pool can be read, but if one area is touched the process hangs, with ^T
reporting:

$ find /bigtex
...
/bigtex/usr/home/james
^T
load: 0.00  cmd: find 2794 [rrl->rr_cv] 5861.45r 0.28u 2.02s 0% 1704k

That's about where the ZFS recv hung:

receiving incremental stream of bigtex/usr/home/james/News@syssnap-1246856401 into bigtex/usr/home/james/News@syssnap-1246856401
received 15.8MB stream in 59 seconds (275KB/sec)
receiving incremental stream of bigtex/usr/home/james/News@syssnap-1246942803 into bigtex/usr/home/james/News@syssnap-1246942803
received 5.91MB stream in 50 seconds (121KB/sec)
receiving incremental stream of bigtex/usr/home/james/News@syssnap-1247029203 into bigtex/usr/home/james/News@syssnap-1247029203

There should have been 25 or so more snapshots in that filesystem.

/var/log/messages has no messages.

The replication stream has many filesystem deletes and renames.



>How-To-Repeat:
Repeatability is not yet clear.

>Fix:

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-amd64->freebsd-fs 
Responsible-Changed-By: gavin 
Responsible-Changed-When: Thu Sep 10 19:14:04 UTC 2009 
Responsible-Changed-Why:  
Over to maintainer(s) 

http://www.freebsd.org/cgi/query-pr.cgi?pr=138709 
State-Changed-From-To: open->closed 
State-Changed-By: linimon 
State-Changed-When: Sat Sep 12 03:13:27 UTC 2009 
State-Changed-Why:  
See kern/138220. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=138709 
State-Changed-From-To: closed->feedback 
State-Changed-By: pjd 
State-Changed-When: Sun Sep 13 16:26:24 UTC 2009 
State-Changed-Why:  
I don't think it has anything to do with kern/138220. 

It would be great if you could provide a way to reproduce this. 
It might be related to one of my recent commits. 


Responsible-Changed-From-To: freebsd-fs->pjd 
Responsible-Changed-By: pjd 
Responsible-Changed-When: Sun Sep 13 16:26:24 UTC 2009 
Responsible-Changed-Why:  
I'll take this one. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=138709 

From: Borja Marcos <borjam@sarenet.es>
To: bug-followup@FreeBSD.org,
 james-freebsd-current@jrv.org
Cc:  
Subject: Re: kern/138709: [zfs] zfs recv hangs, pool accesses hang in rrl->rr_cv state
Date: Tue, 29 Sep 2009 15:52:29 +0200

 I think I've observed this on 8.0RC1. Message sent to FreeBSD-stable,  
 subject "8.0RC1 ZFS deadlock" follows.
 
 I have observed a deadlock condition when using ZFS. We make heavy use
 of zfs send/zfs receive to keep a replica of a dataset on a remote
 machine, at intervals as short as one minute. Ours may be a somewhat
 atypical use of ZFS, but it seems to be a great solution for keeping
 filesystem replicas once this is sorted out.
 
 
 How to reproduce:
 
 Set up two systems. A dataset with heavy I/O activity is replicated
 from the first to the second one. I used a dataset containing /usr/obj
 while running a make buildworld.
 
 Replicate the dataset from the first machine to the second one using
 an incremental send:
 
 zfs send -i pool/dataset@Nminus1 pool/dataset@N | ssh destination zfs receive -d pool
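 
 One full replication cycle of the kind described above can be sketched
 as a small script. This is only an illustration of the technique, not
 the reporter's actual setup; POOL, DATASET, and DEST are hypothetical
 placeholders:
 
 ```shell
 #!/bin/sh
 # Sketch of one incremental zfs send/receive cycle.
 # POOL, DATASET, and DEST are placeholders; adjust for your setup.
 POOL=pool
 DATASET=$POOL/dataset
 DEST=destination
 
 # The most recent existing snapshot becomes the incremental base.
 PREV=$(zfs list -H -t snapshot -o name -s creation -r "$DATASET" | tail -1)
 NOW="$DATASET@syssnap-$(date +%s)"
 
 zfs snapshot "$NOW"
 # Send only the changes since $PREV; "receive -d" recreates the
 # dataset path under the destination pool.
 zfs send -i "$PREV" "$NOW" | ssh "$DEST" zfs receive -d "$POOL"
 ```
 
 Running this from cron at short intervals reproduces the kind of
 continuous receive traffic the report describes.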
 
 A deadlock can occur when there is read activity on the second system,
 that is, when something reads the replicated dataset while zfs receive
 is updating it. We discovered this during a test on a server we hope
 to put into production soon, with 8 GB RAM: a Bacula backup agent was
 running and ZFS deadlocked.
 
 I have set up a couple of VMware Fusion virtual machines to test this,
 and it has deadlocked there as well. The virtual machines have little
 memory, 512 MB, but I don't believe that is the actual problem; there
 is no complaint about lack of memory.
 
 A running top shows processes stuck on "zfsvfs"
 
 last pid:  2051;  load averages:  0.00,  0.07,  0.55   up 0+01:18:25  12:05:48
 37 processes:  1 running, 36 sleeping
 CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
 Mem: 18M Active, 20M Inact, 114M Wired, 40K Cache, 59M Buf, 327M Free
 Swap: 1024M Total, 1024M Free
 
   PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
  1914 root        1  62    0 11932K  2564K zfsvfs  0   0:51  0.00% bsdtar
  1093 borjam      1  44    0  8304K  2464K CPU1    1   0:32  0.00% top
  1913 root        1  54    0 11932K  2600K rrl->r  0   0:19  0.00% bsdtar
  1019 root        1  44    0 25108K  4812K select  0   0:05  0.00% sshd
  2008 root        1  76    0 13600K  1904K tx->tx  0   0:04  0.00% zfs
  1089 borjam      1  44    0 37040K  5216K select  1   0:04  0.00% sshd
   995 root        1  76    0  8252K  2652K pause   0   0:02  0.00% csh
   840 root        1  44    0 11044K  3828K select  1   0:02  0.00% sendmail
  1086 root        1  76    0 37040K  5156K sbwait  1   0:01  0.00% sshd
   850 root        1  44    0  6920K  1612K nanslp  0   0:01  0.00% cron
   607 root        1  44    0  5992K  1540K select  1   0:01  0.00% syslogd
  1090 borjam      1  76    0  8252K  2636K pause   1   0:01  0.00% csh
   990 borjam      1  44    0 37040K  5220K select  0   0:00  0.00% sshd
   985 root        1  48    0 37040K  5160K sbwait  1   0:00  0.00% sshd
   911 root        1  44    0  8252K  2608K ttyin   0   0:00  0.00% csh
   991 borjam      1  56    0  8252K  2636K pause   0   0:00  0.00% csh
   844 smmsp       1  46    0 11044K  3852K pause   0   0:00  0.00% sendmail
 
 Interestingly, this has blocked access to all the filesystems. I
 cannot, for instance, ssh into the machine anymore, even though all
 the system-important filesystems are on UFS; I was just using ZFS for
 a test.
 
 Any ideas on what information might be useful to collect? I still have
 the VMware machine. I've made a couple of VMware snapshots of it: the
 first before breaking into DDB, with the deadlock just started, and
 the second while in DDB (I broke into DDB with sysctl).
 
 Also, a copy of the VMware virtual machine with snapshots is available
 on request. Your choice ;)
 
 Borja.
 
 Sorry, I forgot to explain what was happening on the second system
 (the one receiving the incremental snapshots) for the deadlock to
 occur.
 
 It was just running an endless loop, copying the contents of /usr/obj
 to another dataset, to keep the read activity going.
 
 That's how it has deadlocked. On the original test system an rsync did
 the same trick.
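 
 The read-load side was a loop of roughly this shape. The destination
 path /pool/scratch is a hypothetical placeholder, not the exact
 dataset used in the test:
 
 ```shell
 #!/bin/sh
 # Endless copy loop to keep read activity on the replicated dataset
 # while zfs receive is updating it; /pool/scratch is a placeholder.
 while true; do
     cp -R /usr/obj /pool/scratch/
     rm -rf /pool/scratch/obj
 done
 ```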
 
 Borja
 

From: Martin Matuska <mm@FreeBSD.org>
To: bug-followup@FreeBSD.org, james-freebsd-current@jrv.org
Cc:  
Subject: Re: kern/138709: [zfs] zfs recv hangs, pool accesses hang in rrl->rr_cv
 state
Date: Wed, 05 May 2010 09:27:02 +0200

 This PR might be related to the recently fixed kern/146296.
 Please try the patch from there or use an up-to-date 9-CURRENT.
 The patch applies against 8-STABLE as well.
 
 http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/146296
Responsible-Changed-From-To: pjd->mm 
Responsible-Changed-By: mm 
Responsible-Changed-When: Wed May 5 07:30:16 UTC 2010 
Responsible-Changed-Why:  
Taking over with pjd's approval. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=138709 
State-Changed-From-To: feedback->closed 
State-Changed-By: mm 
State-Changed-When: Mon Jun 6 11:45:21 UTC 2011 
State-Changed-Why:  
Feedback timeout; probably fixed in kern/146296 

http://www.freebsd.org/cgi/query-pr.cgi?pr=138709 
>Unformatted:
