Discussion:
2.6.21-rc5: Thinkpad X60 gets critical thermal shutdowns
Andi Kleen
2007-03-31 09:40:07 UTC
When I run 2.6.21-rc5 + Andi's x86 patches + paravirt_ops patches, I've
Hmm, I don't think there's anything in x86 either that would touch this code.
But can you double check with plain rc5?
Mar 30 23:19:03 localhost kernel: ACPI: Critical trip point
Mar 30 23:19:03 localhost kernel: Critical temperature reached (128 C), shutting down.
Mar 30 23:19:03 localhost kernel: Critical temperature reached (128 C), shutting down.
Mar 30 23:19:03 localhost shutdown[19417]: shutting down for system halt
and the machine does feel pretty hot.
Pavel has been complaining about higher power consumption on his laptop versus
.20 too.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Jeremy Fitzhardinge
2007-04-01 06:30:15 UTC
Could you try to unload or disable hardware sensors and check if it
helps?
CONFIG_I2C=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m
CONFIG_I2C_I810=m
CONFIG_I2C_PIIX4=m
CONFIG_SENSORS_DS1337=m
CONFIG_SENSORS_DS1374=m
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_SENSORS_PCA9539=m
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
That seems to have helped. If I watch
/proc/acpi/thermal_zone/THM?/temperature, it seems stable even under
load. I didn't try watching the thermal_zones when these options were
enabled, but I presume the temperature was not controlled for it to hit
128 degC.
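
A minimal sketch of how the thermal zones mentioned above could be watched from a script. It assumes the legacy procfs line format "temperature: NN C"; the function names are made up for illustration and are not part of any kernel or lm_sensors tool.

```python
import glob
import re

# Assumed line format of the legacy procfs file, e.g. "temperature:   72 C".
_TEMP_RE = re.compile(r"temperature:\s*(-?\d+)\s*C")


def parse_zone_temperature(text):
    """Return the temperature in degrees C from one procfs line, or None."""
    match = _TEMP_RE.search(text)
    return int(match.group(1)) if match else None


def read_all_zones(pattern="/proc/acpi/thermal_zone/THM?/temperature"):
    """Map each matching zone file to its parsed temperature."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            readings[path] = parse_zone_temperature(handle.read())
    return readings


if __name__ == "__main__":
    # On a machine without these files the glob matches nothing: prints {}.
    print(read_all_zones())
```

Calling read_all_zones() in a loop with a short sleep would give a rough log of how each zone behaves under load.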

What's going on here? Does reading an i2c sensor from the kernel
prevent something else from doing it?

J
Matthew Garrett
2007-04-01 15:00:22 UTC
Post by Jeremy Fitzhardinge
That seems to have helped. If I watch
/proc/acpi/thermal_zone/THM?/temperature, it seems stable even under
load. I didn't try watching the thermal_zones when these options were
enabled, but I presume the temperature was not controlled for it to hit
128 degC.
What's going on here? Does reading an i2c sensor from the kernel
prevent something else from doing it?
The i2c drivers access the same hardware as the ACPI methods, and
there's no locking.
--
Matthew Garrett | ***@srcf.ucam.org
Pavel Machek
2007-04-01 16:40:11 UTC
Hi!
Post by Jeremy Fitzhardinge
CONFIG_I2C=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m
CONFIG_I2C_I810=m
CONFIG_I2C_PIIX4=m
CONFIG_SENSORS_DS1337=m
CONFIG_SENSORS_DS1374=m
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_SENSORS_PCA9539=m
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
That seems to have helped. If I watch
/proc/acpi/thermal_zone/THM?/temperature, it seems stable even under
load. I didn't try watching the thermal_zones when these options were
enabled, but I presume the temperature was not controlled for it to hit
128 degC.
What's going on here? Does reading an i2c sensor from the kernel
prevent something else from doing it?
ACPI is misdesigned, and lm_sensors can't cope with that.

One idea was to add 'big acpi lock' and make lm_sensors take it, too.
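
To make the locking concern concrete, here is a toy model, not the ThinkPad EC or any real driver interface. It assumes a device that is read in two steps (select a register, then read the data port); the class, the register numbers and the values are all hypothetical. It first shows by hand how an unlocked interleaving returns the wrong register, then shows that one lock shared by both clients keeps each two-step transaction intact.

```python
import threading


class IndexedDevice:
    """Toy device: write an index register, then read a data register."""

    def __init__(self, registers):
        self._registers = dict(registers)
        self._selected = None

    def select(self, index):
        self._selected = index

    def read_data(self):
        return self._registers[self._selected]


def locked_read(device, lock, index):
    """Perform the select+read pair as one transaction under the shared lock."""
    with lock:
        device.select(index)
        return device.read_data()


device = IndexedDevice({0x10: 72, 0x20: 55})

# Unlocked interleaving, written out by hand so it is deterministic:
device.select(0x10)         # client A wants register 0x10
device.select(0x20)         # client B cuts in and selects 0x20
stale = device.read_data()  # client A now gets B's register: 55, not 72

# With one lock shared by both clients, each transaction is atomic.
shared_lock = threading.Lock()
results = []


def worker(index, expected):
    for _ in range(1000):
        results.append(locked_read(device, shared_lock, index) == expected)


threads = [threading.Thread(target=worker, args=(0x10, 72)),
           threading.Thread(target=worker, args=(0x20, 55))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of a single "big" lock in this sketch is only that both clients must take the same one; two separate locks would not prevent the interleaving shown in the unlocked part.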
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Henrique de Moraes Holschuh
2007-04-01 22:50:08 UTC
Post by Pavel Machek
ACPI is misdesigned, and lm_sensors can't cope with that.
Err, HOW exactly are you accessing the ThinkPad i2c buses directly? Or did
Lenovo completely change the ThinkPad hardware design in the X60?

Or did anyone add an lm-sensors driver that attaches to the ACPI EC ports now?
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
Jeremy Fitzhardinge
2007-04-01 23:20:05 UTC
Post by Jeremy Fitzhardinge
Could you try to unload or disable hardware sensors and check if it
helps?
CONFIG_I2C=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m
CONFIG_I2C_I810=m
CONFIG_I2C_PIIX4=m
CONFIG_SENSORS_DS1337=m
CONFIG_SENSORS_DS1374=m
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_SENSORS_PCA9539=m
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
That seems to have helped. If I watch
/proc/acpi/thermal_zone/THM?/temperature, it seems stable even under
load. I didn't try watching the thermal_zones when these options were
enabled, but I presume the temperature was not controlled for it to hit
128 degC.
Hm, perhaps I was too optimistic. I have lm_sensors disabled, and all
i2c options unconfigured in my kernel, but it still has temperature
control problems. Perhaps the ambient temperature was lower when I
reported success.

When I do a big compile, the temperature reported in
/proc/acpi/thermal_zone/THM0/temperature rapidly approaches 100C, and
when it goes over 100 it triggers the critical shutdown. When it shuts
down, it (mis-?)reports the temperature as 128C.

This seems to be real, and not a kernel artifact. If I reboot the same
kernel immediately, it boots up to the message "ACPI: Core revision
20070126" and then hangs. If I boot Windows immediately afterwards, it
reboots a short way into the boot process.

I've noticed one behavioral change with this kernel. On the older
kernels, the CPU frequency would sometimes drop to lowest speed,
apparently because of an ACPI thermal limiting event. This kernel
doesn't seem to drop speed. I seem to remember Ingo had a patch to
ignore the ACPI thermal limits in cpufreq; did that get merged?

J
Henrique de Moraes Holschuh
2007-04-02 02:40:07 UTC
Post by Jeremy Fitzhardinge
control problems. Perhaps the ambient temperature was lower when I
reported success.
You can use ibm-acpi to properly track your thinkpad thermal sensors, load
it with the "experimental=1" parameter, and look at what gets exported at
/proc/acpi/ibm/thermal.

You can also use /proc/acpi/ibm/fan to check the fan's state. And use the
"level 7" /proc/acpi/ibm/fan command to set the emergency cooling level, and
"level disengaged" command to set the really badass fan cooling level (might
damage your hardware, we don't know if it is safe and IBM/Lenovo isn't
talking).
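
A small sketch of how the fan commands above could be issued from a script. The path and the "level 7" and "level disengaged" strings come from the message; the numeric range 0-7 and the "auto" level are assumptions, and the helper names are invented for illustration. Writing to the file needs root and a driver build that accepts fan writes.

```python
FAN_PROC_PATH = "/proc/acpi/ibm/fan"  # path named in the message above


def fan_level_command(level):
    """Build the text written to the fan file, e.g. 'level 7'."""
    if level in ("auto", "disengaged"):
        return "level " + level
    # Assumed numeric range 0-7; bool is excluded because it is an int subtype.
    if isinstance(level, int) and not isinstance(level, bool) and 0 <= level <= 7:
        return "level %d" % level
    raise ValueError("unsupported fan level: %r" % (level,))


def set_fan_level(level, path=FAN_PROC_PATH):
    """Write the command to the fan file (not called in this sketch)."""
    with open(path, "w") as handle:
        handle.write(fan_level_command(level))
```

For example, set_fan_level(7) before a long compile and set_fan_level("auto") afterwards would mirror the advice given in the message.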
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
Jeremy Fitzhardinge
2007-04-02 05:00:11 UTC
Post by Henrique de Moraes Holschuh
Post by Jeremy Fitzhardinge
control problems. Perhaps the ambient temperature was lower when I
reported success.
You can use ibm-acpi to properly track your thinkpad thermal sensors, load
it with the "experimental=1" parameter, and look at what gets exported at
/proc/acpi/ibm/thermal.
Interesting. The first number corresponds with the ACPI THM0
temperature, but I can't see anything corresponding to THM1. Is there
something that documents what all the temperatures are measuring in an
X60? Thinkwiki doesn't seem to have any info.

ezr:pts/1; cat /proc/acpi/ibm/thermal
temperatures: 72 55 -128 65 40 -128 35 -128 51 53 -128 -128 -128 -128
-128 -128
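
A minimal sketch for picking the usable readings out of that output. It assumes -128 marks a sensor slot that returned no reading, which is how the battery slots are interpreted later in the thread; the function name is hypothetical.

```python
ABSENT = -128  # assumed marker for "no sensor in this slot"


def parse_ibm_thermal(text):
    """Parse 'temperatures: t0 t1 ...' into a list; absent slots become None."""
    label, _, values = text.partition(":")
    if label.strip() != "temperatures":
        raise ValueError("unexpected label: %r" % label)
    # split() with no argument also absorbs the line wrap added by the mailer.
    return [None if int(v) == ABSENT else int(v) for v in values.split()]


sample = """temperatures: 72 55 -128 65 40 -128 35 -128 51 53 -128 -128 -128 -128
-128 -128"""
readings = parse_ibm_thermal(sample)

# Keep only slots that reported a value, keyed by their position in the line.
present = {i: t for i, t in enumerate(readings) if t is not None}
```

Keeping the slot index matters here because the discussion identifies sensors by position (for instance "the third position").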
Post by Henrique de Moraes Holschuh
You can also use /proc/acpi/ibm/fan to check the fan's state. And use the
"level 7" /proc/acpi/ibm/fan command to set the emergency cooling level, and
"level disengaged" command to set the really badass fan cooling level (might
damage your hardware, we don't know if it is safe and IBM/Lenovo isn't
talking).
It's set to auto. Presumably that means it's tied into the temperature
sensors and will be able to keep the temp under control...

J

Henrique de Moraes Holschuh
2007-04-03 12:41:31 UTC
Post by Jeremy Fitzhardinge
Post by Henrique de Moraes Holschuh
You can use ibm-acpi to properly track your thinkpad thermal sensors, load
it with the "experimental=1" parameter, and look at what gets exported at
/proc/acpi/ibm/thermal.
Interesting. The first number corresponds with the ACPI THM0
temperature, but I can't see anything corresponding to THM1. Is there
something that documents what all the temperatures are measuring in an
X60? Thinkwiki doesn't seem to have any info.
Well, send me the DSDT and dmidecode output (mask off the UUID and serial
numbers), and I will be able to say more.
Post by Jeremy Fitzhardinge
ezr:pts/1; cat /proc/acpi/ibm/thermal
temperatures: 72 55 -128 65 40 -128 35 -128 51 53 -128 -128 -128 -128
-128 -128
This is a highly unusual output for thinkpads, but might be the expected one
for your X60; the X-series has always been a bit weird. I'd highly suggest
asking for X60 thermal data from other X60 owners on the linux-thinkpad ML.
Make sure to state your X60 model number, and to request that everyone does
the same.
Post by Jeremy Fitzhardinge
Post by Henrique de Moraes Holschuh
You can also use /proc/acpi/ibm/fan to check the fan's state. And use the
It's set to auto. Presumably that means it's tied into the temperature
sensors and will be able to keep the temp under control...
Yes, if all sensors are working fine. That said, people override the EC fan
control all the time, because it seems not to be doing what people want.
Thinkwiki has more on this, and you want to set your fan to level 7 when
doing CPU-intensive work for now, since you are experiencing some sort of
trouble anyway...
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
RusH
2007-04-03 21:30:10 UTC
Attached. Is there some tool for decoding the DSDT?
iasl
http://www.intel.com/technology/iapc/acpi/downloads.htm
http://www.intel.com/technology/iapc/acpi/license2.htm
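
A short sketch of driving the disassembler from a script, assuming iasl is installed and the DSDT has already been saved to a file. The helper names and the file name dsdt.dat are illustrative only; -d is iasl's disassemble option.

```python
import subprocess


def iasl_disassemble_argv(aml_path):
    """Return the argv for disassembling a binary ACPI table with iasl."""
    return ["iasl", "-d", aml_path]


def disassemble(aml_path):
    """Run iasl on the table; it emits ASL source (a .dsl file)."""
    subprocess.run(iasl_disassemble_argv(aml_path), check=True)
```

A call such as disassemble("dsdt.dat") would then leave readable ASL source to inspect for the thermal zone methods.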

--
Who logs in to gdm? Not I, said the duck.
Henrique de Moraes Holschuh
2007-04-04 03:50:06 UTC
Attached. Is there some tool for decoding the DSDT?
iasl. The documentation is the ACPI Specification.
Post by Henrique de Moraes Holschuh
Post by Jeremy Fitzhardinge
ezr:pts/1; cat /proc/acpi/ibm/thermal
temperatures: 72 55 -128 65 40 -128 35 -128 51 53 -128 -128 -128 -128
-128 -128
This is a highly unusual output for thinkpads, but might be the expected one
for your X60; the X-series has always been a bit weird. I'd highly suggest
How would you expect it to look?
I would not expect a -128 on the third position. The other two -128 are
expected, as they are the thermal sensors for the secondary battery. There
is also one less sensor than I'd expect.
I did some non-conclusive tests under Windows, and I'm beginning to get the
feeling that there is actually a cooling problem with the hardware.
This must be at least the third complaint I have come across of an X60 which
boils the CPU. The standard fix from Lenovo is a planar card swap (motherboard
swap). Since this *does* mean they replace the thermal compound and fully
reassemble the heat pipes, it might be that just fixing the thermal
coupling between the cooling assembly and the CPU would do it.
It doesn't seem to help. When it's failing to control cooling (temp
creeps towards 100C while under load), it's going at ~3700RPM, which is
about what level 7 does.
Well, at least the EC is not misbehaving, then.
What's a typical max RPM? I'm getting the impression that there's
More than 4000rpm, in disengaged mode.
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
Pavel Machek
2007-04-01 16:40:06 UTC
Hi!
Post by Andi Kleen
When I run 2.6.21-rc5 + Andi's x86 patches + paravirt_ops patches, I've
Hmm, I don't think there's anything in x86 either that would touch this code.
But can you double check with plain rc5?
Mar 30 23:19:03 localhost kernel: ACPI: Critical trip point
Mar 30 23:19:03 localhost kernel: Critical temperature reached (128 C), shutting down.
Mar 30 23:19:03 localhost kernel: Critical temperature reached (128 C), shutting down.
Mar 30 23:19:03 localhost shutdown[19417]: shutting down for system halt
and the machine does feel pretty hot.
Pavel has been complaining about higher power consumption on his laptop versus
.20 too.
Yep, sometimes it takes 30W instead of 12W... Anyway, this seems to
be measurement error. Notice how acpi claims 128C. I do not think cpu
can work at 128C and hardware should kill us before cpu is that hot.

Are you running lm_sensors?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Henrique de Moraes Holschuh
2007-04-01 22:50:09 UTC
Post by Pavel Machek
Are you running lm_sensors?
lm-sensors can't confuse any recent thinkpad's thermal management. The i2c
buses that matter are all behind the EC, you have to ask the EC for data.
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
Kyle Moffett
2007-04-01 19:10:21 UTC
When I run 2.6.21-rc5 + Andi's x86 patches + paravirt_ops patches,
I've been getting my machine shut down with critical thermal
Mar 30 23:19:03 localhost kernel: ACPI: Critical trip point
Mar 30 23:19:03 localhost kernel: Critical temperature reached (128
C), shutting down.
Mar 30 23:19:03 localhost kernel: Critical temperature reached (128
C), shutting down.
Mar 30 23:19:03 localhost shutdown[19417]: shutting down for system
halt
and the machine does feel pretty hot. Interestingly, when the
machine reboots, the fan spins up to a noticeably higher speed, so
it seems that maybe something is getting fan speed control wrong.
Well, 128C is more than hot enough to boil water and well above the
thermal tolerances of most CPUs, so I would imagine that were your
CPU actually that hot it wouldn't be capable of printing the
"Critical temperature reached" messages, let alone properly rebooting.

Cheers,
Kyle Moffett

Jeremy Fitzhardinge
2007-04-01 21:20:11 UTC
Post by Kyle Moffett
Well, 128C is more than hot enough to boil water and well above the
thermal tolerances of most CPUs, so I would imagine that were your CPU
actually that hot it wouldn't be capable of printing the "Critical
temperature reached" messages, let alone properly rebooting.
Yes, it's probably a bad reading, but it's not completely absurd - chips can
operate up to ~100C, but they're definitely unhappy at that point. In
fact, I typically get 85-95 degrees from those sensors in normal
operation, but I have no idea whether that's a real measurement or not.

J
Rene Rebe
2007-04-02 09:10:08 UTC
Post by Kyle Moffett
When I run 2.6.21-rc5 + Andi's x86 patches + paravirt_ops patches,
I've been getting my machine shut down with critical thermal
Mar 30 23:19:03 localhost kernel: ACPI: Critical trip point
Mar 30 23:19:03 localhost kernel: Critical temperature reached (128
C), shutting down.
Mar 30 23:19:03 localhost kernel: Critical temperature reached (128
C), shutting down.
Mar 30 23:19:03 localhost shutdown[19417]: shutting down for system
halt
and the machine does feel pretty hot. Interestingly, when the
machine reboots, the fan spins up to a noticeably higher speed, so
it seems that maybe something is getting fan speed control wrong.
Well, 128C is more than hot enough to boil water and well above the
thermal tolerances of most CPUs, so I would imagine that were your
CPU actually that hot it wouldn't be capable of printing the
"Critical temperature reached" messages, let alone properly rebooting.
IIRC the BIOS of an MSI Megabook S270 (which I formerly owned) reports this
"Critical temperature reached (128C)" when the battery runs empty
and the OS has taken no action on the low-battery indications. I guess
the BIOS people thought this was a good last resort to make the OS
really shut down before the box just turns off.

Yours,
--
René Rebe - ExactCODE GmbH - Europe, Germany, Berlin
http://exactcode.de | http://t2-project.org | http://rene.rebe.name
+49 (0)30 / 255 897 45
V***@vt.edu
2007-04-08 19:20:08 UTC
On Mon, 02 Apr 2007 10:35:40 +0200, Rene Rebe said:

(Sorry for the late reply..)
Post by Rene Rebe
IIRC the BIOS of an MSI Megabook S270 (which I formerly owned) reports this
"Critical temperature reached (128C)" when the battery runs empty
and the OS has taken no action on the low-battery indications. I guess
the BIOS people thought this was a good last resort to make the OS
really shut down before the box just turns off.
It's not just MSI - I recently managed to put a Dell Latitude D820 into its bag
while still running, where it babbled to itself running on the warm side for
several hours. When I finally did get it out, it *was* quite hot to the touch,
but I was amazed that it managed to run the battery down to somewhere under 4%
(which took some 4 or 5 hours) and then throw the thermal check that made it
shut down - quite the coincidence indeed.

However, "ran warm but tolerable and then used the thermal to shut down when
the battery failed" matches the symptoms much better....
