From eugen@delikates-nk.ru  Thu Mar  6 16:55:26 2008
Return-Path: <eugen@delikates-nk.ru>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5F81F1065672
	for <FreeBSD-gnats-submit@freebsd.org>; Thu,  6 Mar 2008 16:55:26 +0000 (UTC)
	(envelope-from eugen@delikates-nk.ru)
Received: from delikates-nk.ru (delikates-nk.ru [81.26.177.74])
	by mx1.freebsd.org (Postfix) with ESMTP id CA0438FC12
	for <FreeBSD-gnats-submit@freebsd.org>; Thu,  6 Mar 2008 16:55:25 +0000 (UTC)
	(envelope-from eugen@delikates-nk.ru)
Received: from delikates-nk.ru (localhost [127.0.0.1])
	by delikates-nk.ru (8.14.2/8.14.2) with ESMTP id m26GhW18005479
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 6 Mar 2008 23:43:32 +0700 (KRAT)
	(envelope-from eugen@delikates-nk.ru)
Received: (from eugen@localhost)
	by delikates-nk.ru (8.14.2/8.14.2/Submit) id m26GhVBU005478;
	Thu, 6 Mar 2008 23:43:31 +0700 (KRAT)
	(envelope-from eugen)
Message-Id: <200803061643.m26GhVBU005478@delikates-nk.ru>
Date: Thu, 6 Mar 2008 23:43:31 +0700 (KRAT)
From: Eugene Grosbein <eugen@kuzbass.ru>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: [cpufreq] kern_cpu.c's logic error leads to spontaneous disabling of passive cooling
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         121433
>Category:       kern
>Synopsis:       [cpufreq] kern_cpu.c's logic error leads to spontaneous disabling of passive cooling
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    jhb
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Mar 06 17:00:03 UTC 2008
>Closed-Date:    Thu Jul 17 19:25:26 UTC 2008
>Last-Modified:  Thu Jul 17 19:25:26 UTC 2008
>Originator:     Eugene Grosbein
>Release:        FreeBSD 6.3-PRERELEASE i386
>Organization:
Svyaz-Service JSC
>Environment:
System: FreeBSD 6.3-PRERELEASE, Pentium-4 2.0GHz

>Description:
	I have a 1U uniprocessor FreeBSD 6.3-PRERELEASE server with inadequate
	active cooling, which leads to CPU overheating. The server is remote, and
	while better cooling is being prepared I decided to use the passive cooling
	feature of acpi_thermal(4). It uses p4tcc here and really does help
	keep the CPU temperature in bounds, but there is an annoying bug:
	very often (many times per hour) acpi_thermal(4)
	disables passive cooling with the message:

failed to set new freq, disabling passive cooling

	So I need to use cron to re-enable passive cooling once a minute
	to keep it running.

	I've tracked this down to src/sys/kern/kern_cpu.c,
	function cf_get_method():

	1) src/sys/dev/acpica/acpi_thermal.c, function acpi_tz_cooling_thread()
	calls acpi_tz_cpufreq_update() from the same file;

	2) acpi_tz_cpufreq_update() calls CPUFREQ_GET(), which takes us to
	src/sys/kern/kern_cpu.c, cf_get_method();

	3) cf_get_method() has the following code:

        /*
         * Reacquire the lock and search for the given level.
         *
         * XXX Note: this is not quite right since we really need to go
         * through each level and compare both absolute and relative
         * settings for each driver in the system before making a match.
         * The estimation code below catches this case though.
         */
        CF_MTX_LOCK(&sc->lock);
        for (n = 0; n < numdevs && curr_set->freq == CPUFREQ_VAL_UNKNOWN; n++) {
                if (!device_is_attached(devs[n]))
                        continue;
                error = CPUFREQ_DRV_GET(devs[n], &set);
                if (error)
                        continue;
                for (i = 0; i < count; i++) {
                        if (CPUFREQ_CMP(set.freq, levels[i].total_set.freq)) {
                                sc->curr_level = levels[i];
                                break;
                        }
                }
        }

	Note that the error value is not cleared after this loop.
	In my case it happens to be ENXIO once the loop finishes.
	The later code then successfully reports:

CF_DEBUG("get estimated freq %d\n", curr_set->freq);

	(curr_set->freq always happens to be the maximum CPU frequency here)

	It then does 'return (error);' with the ENXIO value propagated
	from the loop shown above.

	4) acpi_tz_cpufreq_update() propagates ENXIO
	to acpi_tz_cooling_thread(), which then disables passive cooling.
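
	The failure mode traced in steps 1) to 4) can be sketched outside the
	kernel as follows. This is only a simplified userland model, not the
	kern_cpu.c code: struct drv, FREQ_UNKNOWN and get_freq_stale() are
	hypothetical names, and the level matching is reduced to taking the
	first driver that answers.

```c
#include <errno.h>
#include <stddef.h>

#define FREQ_UNKNOWN	(-1)	/* hypothetical stand-in for CPUFREQ_VAL_UNKNOWN */

/* Hypothetical stand-in for one cpufreq child driver. */
struct drv {
	int err;	/* result of the driver's "get" call: 0 or an errno */
	int freq;	/* frequency reported on success, in MHz */
};

/*
 * Model of the pre-patch flow: ask each driver, then fall back to an
 * estimate. The same 'error' variable is used by the loop and by the
 * final return, and nothing resets it in between.
 */
int
get_freq_stale(const struct drv *devs, size_t numdevs, int est, int *freq)
{
	int error = 0;
	size_t n;

	*freq = FREQ_UNKNOWN;
	for (n = 0; n < numdevs && *freq == FREQ_UNKNOWN; n++) {
		error = devs[n].err;
		if (error)
			continue;	/* 'error' keeps the driver's errno */
		*freq = devs[n].freq;
	}
	if (*freq == FREQ_UNKNOWN)
		*freq = est;		/* the estimate succeeds ... */
	return (error);			/* ... but the stale errno is returned */
}
```

	With a single driver that fails with ENXIO and an estimate of 2000,
	this model stores 2000 in *freq and still returns ENXIO, which
	mirrors what the thermal thread sees in this report.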

>How-To-Repeat:

	Just use a uniprocessor Pentium-4 system under heavy constant CPU load
	with acpi_thermal/cpufreq/p4tcc, and tune acpi_thermal so that passive
	cooling gets used. Here is my /etc/sysctl.conf:

debug.cpufreq.lowest=1246
#debug.cpufreq.verbose=1
hw.acpi.thermal.user_override=1
hw.acpi.thermal.tz0.passive_cooling=1
hw.acpi.thermal.tz0._PSV=70C
hw.acpi.thermal.tz0._CRT=75C


>Fix:

	Unknown. Perhaps just clear the 'error' variable after the loop cited
	above? As a workaround, I've patched acpi_thermal(4) so that it does not
	disable passive cooling when acpi_tz_cpufreq_update() returns ENXIO;
	that works for me.
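
	The idea of the workaround can be sketched as a small predicate. This
	is an illustration only, not the patch actually applied to
	acpi_thermal.c: keep_passive_cooling() is a hypothetical helper name,
	and the real code makes this decision inline in the cooling thread.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether passive cooling should stay enabled
 * given the result of the frequency update. ENXIO is tolerated because, in
 * this report, it is a stale value and not a real failure.
 */
bool
keep_passive_cooling(int error)
{
	return (error == 0 || error == ENXIO);
}
```

	Any other errno still disables passive cooling, so the workaround
	only hides the one spurious error described above.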

Eugene Grosbein
>Release-Note:
>Audit-Trail:

From: John Baldwin <jhb@FreeBSD.org>
To: bug-followup@FreeBSD.org, eugen@kuzbass.ru
Cc:  
Subject: Re: kern/121433: [cpufreq] kern_cpu.c's logic error leads to spontaneous disabling of passive cooling
Date: Fri, 2 May 2008 16:16:20 -0400

 Try this patch:
 
 Index: kern_cpu.c
 ===================================================================
 RCS file: /usr/cvs/src/sys/kern/kern_cpu.c,v
 retrieving revision 1.29
 diff -u -r1.29 kern_cpu.c
 --- kern_cpu.c	16 Jan 2008 01:05:21 -0000	1.29
 +++ kern_cpu.c	2 May 2008 20:13:54 -0000
 @@ -452,8 +452,7 @@
  	for (n = 0; n < numdevs && curr_set->freq == CPUFREQ_VAL_UNKNOWN; n++) {
  		if (!device_is_attached(devs[n]))
  			continue;
 -		error = CPUFREQ_DRV_GET(devs[n], &set);
 -		if (error)
 +		if (CPUFREQ_DRV_GET(devs[n], &set) != 0)
  			continue;
  		for (i = 0; i < count; i++) {
  			if (CPUFREQ_CMP(set.freq, levels[i].total_set.freq)) {
 @@ -483,9 +482,10 @@
  		if (CPUFREQ_CMP(rate, levels[i].total_set.freq)) {
  			sc->curr_level = levels[i];
  			CF_DEBUG("get estimated freq %d\n", curr_set->freq);
 -			break;
 +			goto out;
  		}
  	}
 +	error = ENXIO;
  
  out:
  	if (error == 0)
 
 -- 
 John Baldwin
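
 The effect of the two hunks can be sketched in a simplified userland model.
 This is not the kern_cpu.c code: struct drv, FREQ_UNKNOWN and
 get_freq_patched() are hypothetical names, and level matching is reduced
 to taking the first driver that answers. The point is only that a driver
 error is no longer stored, and ENXIO is returned only when the estimate
 fails as well.

```c
#include <errno.h>
#include <stddef.h>

#define FREQ_UNKNOWN	(-1)	/* hypothetical stand-in for CPUFREQ_VAL_UNKNOWN */

/* Hypothetical stand-in for one cpufreq child driver. */
struct drv {
	int err;	/* result of the driver's "get" call: 0 or an errno */
	int freq;	/* frequency reported on success, in MHz */
};

/*
 * Model of the patched flow: a failing driver is skipped without
 * recording its errno, and the function fails only if neither a driver
 * nor the estimate yields a frequency.
 */
int
get_freq_patched(const struct drv *devs, size_t numdevs, int est, int *freq)
{
	size_t n;

	*freq = FREQ_UNKNOWN;
	for (n = 0; n < numdevs && *freq == FREQ_UNKNOWN; n++) {
		if (devs[n].err != 0)
			continue;	/* driver error is not remembered */
		*freq = devs[n].freq;
	}
	if (*freq != FREQ_UNKNOWN)
		return (0);
	if (est != FREQ_UNKNOWN) {
		*freq = est;		/* estimate matched: success */
		return (0);
	}
	return (ENXIO);			/* nothing matched at all */
}
```

 In this model a failing driver plus a usable estimate now yields 0, while
 a failing driver plus a failed estimate yields ENXIO and leaves the
 frequency unknown.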

From: Eugene Grosbein <eugen@kuzbass.ru>
To: John Baldwin <jhb@FreeBSD.org>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/121433: [cpufreq] kern_cpu.c's logic error leads to spontaneous disabling of passive cooling
Date: Sat, 3 May 2008 13:34:28 +0800

 On Fri, May 02, 2008 at 04:16:20PM -0400, John Baldwin wrote:
 
 > Try this patch:
 
 Thanks!
 
 I'm testing it now with the CPU fully loaded. I'll keep it running for several hours.
 
 Eugene Grosbein

From: Eugene Grosbein <eugen@kuzbass.ru>
To: John Baldwin <jhb@FreeBSD.org>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/121433: [cpufreq] kern_cpu.c's logic error leads to spontaneous disabling of passive cooling
Date: Sat, 3 May 2008 16:36:04 +0800

 On Fri, May 02, 2008 at 04:16:20PM -0400, John Baldwin wrote:
 
 > Try this patch:
 
 With this patch, passive cooling works reliably. Thanks!
 Please commit.
 
 Eugene Grosbein
State-Changed-From-To: open->analyzed 
State-Changed-By: linimon 
State-Changed-When: Sat May 3 08:55:11 UTC 2008 
State-Changed-Why:  
Submitter confirms that the patch fixes the problem. 


Responsible-Changed-From-To: freebsd-bugs->jhb 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Sat May 3 08:55:11 UTC 2008 
Responsible-Changed-Why:  

http://www.freebsd.org/cgi/query-pr.cgi?pr=121433 
State-Changed-From-To: analyzed->patched 
State-Changed-By: jhb 
State-Changed-When: Mon May 5 19:14:01 UTC 2008 
State-Changed-Why:  
Fixed in HEAD, MFC pending. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=121433 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/121433: commit references a PR
Date: Mon,  5 May 2008 19:14:01 +0000 (UTC)

 jhb         2008-05-05 19:13:52 UTC
 
   FreeBSD src repository
 
   Modified files:
     sys/kern             kern_cpu.c 
   Log:
   Fix a few edge cases with error handling in cpufreq(4)'s CPUFREQ_GET()
   method:
   - If the last of the child cpufreq drivers returns an error while trying to
     fetch its list of supported frequencies but an earlier driver found the
     requested frequency, don't return an error to the caller.
   - If all of the child cpufreq drivers fail and the attempt to match the
     frequency based on 'cpu_est_clockrate()' fails, return ENXIO rather than
     returning success and returning a frequency of CPUFREQ_VAL_UNKNOWN.
   
   MFC after:      3 days
   PR:             kern/121433
   Reported by:    Eugene Grosbein  eugen ! kuzbass dot ru
   
   Revision  Changes    Path
   1.30      +3 -3      src/sys/kern/kern_cpu.c
 _______________________________________________
 cvs-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/cvs-all
 To unsubscribe, send any mail to "cvs-all-unsubscribe@freebsd.org"
 

From: Eugene Grosbein <eugen@kuzbass.ru>
To: bug-followup@freebsd.org
Cc: jhb@freebsd.org
Subject: Re: kern/121433: [cpufreq] kern_cpu.c's logic error leads to spontaneous disabling of passive cooling
Date: Mon, 14 Jul 2008 23:45:52 +0800

 On Mon, Jul 14, 2008 at 11:28:13PM +0800, Eugene Grosbein wrote:
 
 > I have been running the patch committed to HEAD since May 3; it works perfectly.
 > Please MFC it to RELENG_6/7; it applies cleanly.
 
 Oops, I had not noticed that the MFC was already performed in May.
 This PR should be closed, thanks!
 
 Eugene Grosbein

From: Eugene Grosbein <eugen@kuzbass.ru>
To: bug-followup@freebsd.org
Cc: jhb@freebsd.org
Subject: Re: kern/121433: [cpufreq] kern_cpu.c's logic error leads to spontaneous disabling of passive cooling
Date: Mon, 14 Jul 2008 23:28:13 +0800

 Hi!
 
 I have been running the patch committed to HEAD since May 3; it works perfectly.
 Please MFC it to RELENG_6/7; it applies cleanly.
 
 Eugene Grosbein
State-Changed-From-To: patched->closed 
State-Changed-By: jhb 
State-Changed-When: Thu Jul 17 19:24:41 UTC 2008 
State-Changed-Why:  
Fix MFC'd to RELENG_[67]. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=121433 
>Unformatted:
