From tobez@plab.ku.dk  Mon Jul  3 07:16:35 2000
Return-Path: <tobez@plab.ku.dk>
Received: from plab.ku.dk (plab.ku.dk [130.225.105.65])
	by hub.freebsd.org (Postfix) with ESMTP id 30DB837B822
	for <FreeBSD-gnats-submit@freebsd.org>; Mon,  3 Jul 2000 07:16:33 -0700 (PDT)
	(envelope-from tobez@plab.ku.dk)
Received: from lion.plab.ku.dk (lion.plab.ku.dk [130.225.105.49])
	by plab.ku.dk (8.9.3/8.9.3) with ESMTP id QAA44028
	for <FreeBSD-gnats-submit@freebsd.org>; Mon, 3 Jul 2000 16:17:58 +0200 (CEST)
	(envelope-from tobez@plab.ku.dk)
Received: (from tobez@localhost)
	by lion.plab.ku.dk (8.9.3/8.9.3) id QAA93991;
	Mon, 3 Jul 2000 16:16:34 +0200 (CEST)
	(envelope-from tobez)
Message-Id: <200007031416.QAA93991@lion.plab.ku.dk>
Date: Mon, 3 Jul 2000 16:16:34 +0200 (CEST)
From: tobez@tobez.org
Sender: tobez@plab.ku.dk
Reply-To: tobez@tobez.org
To: FreeBSD-gnats-submit@freebsd.org
Subject: contigmalloc1() oddity for large alignments (race condition)
X-Send-Pr-Version: 3.2

>Number:         19672
>Category:       kern
>Synopsis:       contigmalloc1() oddity for large alignments (race condition)
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    andre
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jul 03 07:20:01 PDT 2000
>Closed-Date:    Sat Dec 27 07:45:55 PST 2003
>Last-Modified:  Sat Dec 27 07:50:15 PST 2003
>Originator:     Anton Berezin
>Release:        FreeBSD 5.0-CURRENT i386
>Organization:
tobez.org
>Environment:

Most versions of FreeBSD, as far as I can tell.
File:           src/sys/vm/vm_page.c
Function:       contigmalloc1()

>Description:

If an object is requested with a large alignment, say 1<<24, such that
contigmalloc1() cannot find even a single PQ_FREE or PQ_CACHE page with
that alignment, it proceeds to free inactive pages one by one, and then
immediately active pages as well, also one by one.
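
To make the scarcity of candidates concrete, here is a minimal sketch in
user-space C, not kernel code.  It counts the physical addresses in a
range that are multiples of a given alignment, since only those can
start an aligned run.  The name aligned_candidates() and the 128 MB
range in the note below are assumptions chosen for illustration; they do
not appear in vm_page.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the addresses in [low, high) that are multiples of `alignment`,
 * i.e. the only addresses at which an aligned contiguous run could
 * start.  `alignment` must be a nonzero power of two.
 */
static uint64_t
aligned_candidates(uint64_t low, uint64_t high, uint64_t alignment)
{
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	if (high <= low)
		return 0;

	/* First multiple of alignment that is >= low. */
	uint64_t first = (low + alignment - 1) & ~(alignment - 1);
	if (first >= high)
		return 0;

	/* Number of multiples from first up to the last one below high. */
	return (high - 1 - first) / alignment + 1;
}
```

Under these assumptions, aligned_candidates(0, 128ULL << 20, 1ULL << 24)
counts 8 possible starting addresses, against 32768 for
aligned_candidates(0, 128ULL << 20, 4096).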

The problem is that after freeing a page (in most cases the routine
pages it out; I inserted some sysctl counters to debug this), it
starts over, rescanning the same queue (either PQ_INACTIVE or
PQ_ACTIVE) from its head.

To me, it looks bad enough even for inactive pages, but for an active
queue it's a disaster, unless the box is idle.  The point is that, in a
nutshell, the following sequence gets executed when contigmalloc1()
tries to free the page:

   vm_pageout_flush(page)  which calls
        vm_pager_put_pages(page)  which calls
                swap_pager_putpages(page)
                        which sleeps (swwrt).

When the box is not idle, other processes run while this one is blocked
in the swwrt state.  They may add more inactive pages and will almost
certainly add more active pages, and then contigmalloc1() starts
scanning the queue again!
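
The effect can be illustrated with a toy model in user-space C.  This is
a sketch under stated assumptions, not the contigmalloc1() code: the
queue is a plain array of dirty flags, a "flush" just clears a flag, and
the sleep is modelled by appending `arrivals` new dirty pages before the
scan restarts from the head.  The names toy_scan() and TOY_QUEUE_MAX are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 4096	/* capacity of the toy queue */

/*
 * Flush the first dirty page found, then restart the scan from the
 * head.  During each flush, `arrivals` new dirty pages are appended,
 * standing in for other processes running while the caller sleeps.
 * Clean pages are simply skipped.
 * Returns the number of restarts, or -1 once `max_restarts` is exceeded.
 */
static long
toy_scan(size_t initial_dirty, size_t arrivals, long max_restarts)
{
	bool dirty[TOY_QUEUE_MAX] = { false };
	size_t len = initial_dirty;
	long restarts = 0;

	assert(initial_dirty <= TOY_QUEUE_MAX);
	for (size_t i = 0; i < len; i++)
		dirty[i] = true;

again:
	for (size_t i = 0; i < len; i++) {
		if (!dirty[i])
			continue;
		dirty[i] = false;	/* flushed; the real code sleeps here */
		for (size_t k = 0; k < arrivals && len < TOY_QUEUE_MAX; k++)
			dirty[len++] = true;
		if (++restarts > max_restarts)
			return -1;	/* give up: no progress in sight */
		goto again;		/* rescan from the head */
	}
	return restarts;
}
```

In this model toy_scan(100, 0, 1000) finishes after 100 restarts, the
idle case, while toy_scan(100, 1, 1000) hits the limit and returns -1,
because every flush is matched by a newly arrived dirty page.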

>How-To-Repeat:

A program that issues the METEORSETGEO ioctl to the bktr driver with a
relatively large number of frames (in my tests I used 14 frames ==
14*768*576*4/4096 == 6048 pages).  The bktr driver did not have
sufficient space preallocated.
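
The page count follows from frames * columns * rows * bytes per pixel,
divided by the page size.  A minimal sketch of that arithmetic in C is
given here; the helper name frame_buffer_pages() is hypothetical, and
rounding up to a whole page is an assumption, not something taken from
the driver.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pages needed to hold `frames` frames of cols x rows pixels at
 * `bytes_per_pixel` bytes each, rounded up to a whole page.
 */
static size_t
frame_buffer_pages(size_t frames, size_t cols, size_t rows,
    size_t bytes_per_pixel, size_t page_size)
{
	assert(page_size > 0);

	size_t bytes = frames * cols * rows * bytes_per_pixel;
	return (bytes + page_size - 1) / page_size;	/* round up */
}
```

Calling frame_buffer_pages(14, 768, 576, 4, 4096) evaluates the
expression used in the test above.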

For some reason, the bktr driver's get_bktr_mem() function
(dev/bktr/bktr_os.c) first tries vm_page_alloc_contig() with an
alignment of 1<<24 and then, if this fails, retries with PAGE_SIZE.
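
The try-then-fall-back pattern can be sketched in user-space C.  This is
only an analogy under stated assumptions: posix_memalign() stands in for
vm_page_alloc_contig(), which has a different signature and allocates
physically contiguous memory, and the name alloc_with_fallback() is
hypothetical.

```c
#define _POSIX_C_SOURCE 200112L	/* for posix_memalign() */
#include <assert.h>
#include <stdlib.h>

/*
 * Try `preferred_align` first and fall back to `fallback_align` if that
 * fails.  Both alignments must be powers of two and multiples of
 * sizeof(void *), as posix_memalign() requires.
 * Returns NULL if both attempts fail; release the result with free().
 */
static void *
alloc_with_fallback(size_t size, size_t preferred_align,
    size_t fallback_align)
{
	void *p = NULL;

	assert(size > 0);
	if (posix_memalign(&p, preferred_align, size) == 0)
		return p;	/* got the large alignment */
	if (posix_memalign(&p, fallback_align, size) == 0)
		return p;	/* settled for the smaller one */
	return NULL;
}
```

The point of the pattern is that the first attempt should fail quickly.
In the case reported here the first attempt is the one that stalls, so
the fallback is reached late or not at all.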

[As a side note, I have no idea what the reason is for using such a
large alignment in the bktr driver.  Apparently, this piece of code was
copied as is from the meteor driver.]

On a practically idle box the allocation fails after 4 to 8 seconds.
The number of jumps from the vm_pageout_flush() call site in the
inactive scan code to the PQ_INACTIVE rescan is about 110.  The number
of jumps from the vm_pageout_flush() call site in the active scan code
to the PQ_INACTIVE rescan is about 4400.

On a busy box (nice -20 perl -e 'for(;;){}') this takes forever, or at
least I was not patient enough to wait for completion.  The number of
jumps increases at a steady rate, and most of them come from the
`active' piece.  In top(1), I observed things like this (please pay
attention to the Ks and Ms here):

Mem: 348K Active, 180K Inact, 21M Wired, 38M Cache, 9899K Buf, 64M Free
Swap: 525M Total, 21M Used, 504M Free, 3% Inuse, 1552K Out

>Fix:

A first obvious thing to do is to remove the 1<<24 alignment allocation
from the bktr (and meteor) code.

This helps in my particular case.

However, I think that the internal workings of contigmalloc1() are
seriously broken for large alignments.  My understanding is that the
page freeing code is something of a last resort for the routine, and it
probably should not be used in this case.  The assumption
contigmalloc1() makes is that if the very first loop could not find
even the starting page, then there must be a severe memory shortage or
something similar.  That is not necessarily so.

To me, the code simply `does not look right'.

And I have no idea what the proper fix might look like.

Cheers,
Anton.


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->dillon 
Responsible-Changed-By: sheldonh 
Responsible-Changed-When: Mon Jul 3 07:38:55 PDT 2000 
Responsible-Changed-Why:  
The VM system is Matt's area. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=19672 
Responsible-Changed-From-To: dillon->freebsd-bugs 
Responsible-Changed-By: keramida 
Responsible-Changed-When: Sat Feb 22 18:15:14 PST 2003 
Responsible-Changed-Why:  
Back to the free pool. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=19672 
State-Changed-From-To: open->feedback 
State-Changed-By: andre 
State-Changed-When: Sat Dec 27 06:55:05 PST 2003 
State-Changed-Why:  
Check with Originator if problem persists. 


Responsible-Changed-From-To: freebsd-bugs->andre 
Responsible-Changed-By: andre 
Responsible-Changed-When: Sat Dec 27 06:55:05 PST 2003 
Responsible-Changed-Why:  
Check with Originator if problem persists. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=19672 

From: Andre Oppermann <andre@freebsd.org>
To: freebsd-gnats-submit@FreeBSD.org, tobez@tobez.org
Cc:  
Subject: Re: kern/19672: contigmalloc1() oddity for large alignments (race
 condition)
Date: Sat, 27 Dec 2003 15:53:11 +0100

 Anton,
 
 do you still have the problem with contigmalloc() described in
 the problem report?
 
 -- 
 Andre
 
 

From: Anton Berezin <tobez@tobez.org>
To: Andre Oppermann <andre@freebsd.org>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/19672: contigmalloc1() oddity for large alignments (race condition)
Date: Sat, 27 Dec 2003 16:35:14 +0100

 Andre,
 
 On Sat, Dec 27, 2003 at 03:53:11PM +0100, Andre Oppermann wrote:
 
 > do you still have the problem with contigmalloc() described in
 > the problem report?
 
 No idea, I have not been using bktr/meteor drivers for ages now.  If the
 code did not change substantially, I would expect the problem to still
 be there, though.
 
 \Anton.
 -- 
 Civilization is a fractal patchwork of old and new and dangerously new.
 -- Vernor Vinge
State-Changed-From-To: feedback->closed 
State-Changed-By: andre 
State-Changed-When: Sat Dec 27 07:45:12 PST 2003 
State-Changed-Why:  
See description in last message. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=19672 

From: Andre Oppermann <andre@freebsd.org>
To: Anton Berezin <tobez@tobez.org>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/19672: contigmalloc1() oddity for large alignments (race 
 condition)
Date: Sat, 27 Dec 2003 16:41:30 +0100

 Anton Berezin wrote:
 > 
 > Andre,
 > 
 > On Sat, Dec 27, 2003 at 03:53:11PM +0100, Andre Oppermann wrote:
 > 
 > > do you still have the problem with contigmalloc() described in
 > > the problem report?
 > 
 > No idea, I have not been using bktr/meteor drivers for ages now.  If the
 > code did not change substantially, I would expect the problem to still
 > be there, though.
 
 Ok, the code has been redone and reorganized.  The redo was by phk
 in sys/vm/vm_page.c rev 1.154 and the reorg by dillon in rev 1.167.
 With that, all the contigmalloc() code has been moved to
 sys/vm/vm_contig.c, which has since been reworked further.
 
 I'd say it's safe to close this PR, as it is no longer relevant for
 today's codebase.
 
 Thanks for your feedback.
 -- 
 Andre
>Unformatted:
