Discussion:
2.6.29 regression: ATA bus errors on resume
(too old to reply)
Jeff Garzik
2009-03-25 06:10:11 UTC
Permalink
Hi,
After upgrading to 2.6.29 I get the below errors after resuming from
hibernating with s2disk. I ran fsck and tried doing the same thing again
in 2.6.28.9-pae, but do not get any errors there.
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
- Ted
Thanks for the info Theodore, this is definitely looks like some type of
regression in 2.6.29, as the problem is not evident when I s2disk using
2.6.28.9, even after multiple suspend/resume cycles.
I found some 'ATA bus errors' and 'SError' messages in
/var/log/messages, so I've attached the messages from both 2.6.29 and
2.6.28 for comparison.
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
...through the 14th command (tag 13).
Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
SATA link is reset, and ACPI is re-run.
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Mar 24 21:29:14 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Mar 24 21:29:14 linux-7vph kernel: 0c 10 b3 07
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455
The SCSI subsystem aborts each of the queued commands.
Mar 24 21:29:14 linux-7vph kernel: ata1: EH complete
SATA error handling completes
Mar 24 21:29:14 linux-7vph kernel: EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2330178, block=9338883
ext4 pukes, because of SATA/SCSI errors
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
SCSI finished re-initialization.

So from this, a few observations and open questions:

Why didn't these SCSI commands get retried?

Were they left over from prior to resume?

Did SCSI error out the commands too soon? The probing is not complete
until AFTER the sector errors, I note.

Did the system resume before ACPI resume and SCSI resume completed? It
sure looks that way, from the log.

Does the asynchronous probing play a role here?

Jeff



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
James Bottomley
2009-03-25 22:20:11 UTC
Permalink
Post by Jeff Garzik
Hi,
After upgrading to 2.6.29 I get the below errors after resuming from
hibernating with s2disk. I ran fsck and tried doing the same thing again
in 2.6.28.9-pae, but do not get any errors there.
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
- Ted
Thanks for the info Theodore, this is definitely looks like some type of
regression in 2.6.29, as the problem is not evident when I s2disk using
2.6.28.9, even after multiple suspend/resume cycles.
I found some 'ATA bus errors' and 'SError' messages in
/var/log/messages, so I've attached the messages from both 2.6.29 and
2.6.28 for comparison.
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
...through the 14th command (tag 13).
Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
SATA link is reset, and ACPI is re-run.
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Mar 24 21:29:14 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Mar 24 21:29:14 linux-7vph kernel: 0c 10 b3 07
Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455
The SCSI subsystem aborts each of the queued commands.
No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os

The only place these are generated is in ata_sense_to_error() which only
occurs if there's some type of ata error.

If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.

In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-03-25 22:50:08 UTC
Permalink
After upgrading to 2.6.29 I get the below errors after resuming from
hibernating with s2disk. I ran fsck and tried doing the same thing again
in 2.6.28.9-pae, but do not get any errors there.
what kind of disk/what kind of controller is that?
Hi Pavel,

It is a Lenovo W500 Thinkpad.

lspci shows:

00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI
Controller (rev 03)

and for the disk, hdparm shows:
/dev/sda:

Model=HITACHI HTS722020K9SA00 FDE , FwRev=DC4LC75A,
SerialNo=080726DP0470DTGMUNPC
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=DualPortCache, BuffSize=15203kB, MaxMultSect=16, MultSect=?16?
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=390721968
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7

* signifies the current active mode


Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Arjan van de Ven
2009-03-27 22:30:18 UTC
Permalink
On Fri, 27 Mar 2009 21:10:52 +0200
I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
although with all the 2.6.28 variants I used s2disk/resume has always
worked. I was wondering if these "errors" perhaps play more of a role
in 2.6.29, perhaps due to the async. changes that was mentioned?
unless you actively enabled this via a kernel command line option there
are no async changes in 2.6.29 in terms of behavior.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-03-28 10:30:16 UTC
Permalink
Post by Arjan van de Ven
On Fri, 27 Mar 2009 21:10:52 +0200
I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
although with all the 2.6.28 variants I used s2disk/resume has always
worked. I was wondering if these "errors" perhaps play more of a role
in 2.6.29, perhaps due to the async. changes that was mentioned?
unless you actively enabled this via a kernel command line option there
are no async changes in 2.6.29 in terms of behavior.
The only non-default option I had was 'modeset=1'. From Jeff's earlier
comment I understood the probing behaviour changed.

The fundamental difference is that in 2.6.29 everything initially seems
okay, but then there is a
ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
ata1.00: irq_stat 0x00400008, PHY RDY changed

There's nothing frozen it 2.6.28.

Should I log a kernel bug, what's the best way forward and is there
anything more I can do to help?

cheers
Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Rafael J. Wysocki
2009-03-28 14:10:12 UTC
Permalink
Post by Niel Lambrechts
Post by Arjan van de Ven
On Fri, 27 Mar 2009 21:10:52 +0200
I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
although with all the 2.6.28 variants I used s2disk/resume has always
worked. I was wondering if these "errors" perhaps play more of a role
in 2.6.29, perhaps due to the async. changes that was mentioned?
unless you actively enabled this via a kernel command line option there
are no async changes in 2.6.29 in terms of behavior.
The only non-default option I had was 'modeset=1'. From Jeff's earlier
comment I understood the probing behaviour changed.
The fundamental difference is that in 2.6.29 everything initially seems
okay, but then there is a
ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
ata1.00: irq_stat 0x00400008, PHY RDY changed
There's nothing frozen it 2.6.28.
Should I log a kernel bug, what's the best way forward and is there
anything more I can do to help?
Let Tejun have a look a this, perhaps?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-03-30 08:50:16 UTC
Permalink
Hello,
Post by Rafael J. Wysocki
Post by Niel Lambrechts
Post by Arjan van de Ven
On Fri, 27 Mar 2009 21:10:52 +0200
I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
although with all the 2.6.28 variants I used s2disk/resume has always
worked. I was wondering if these "errors" perhaps play more of a role
in 2.6.29, perhaps due to the async. changes that was mentioned?
unless you actively enabled this via a kernel command line option there
are no async changes in 2.6.29 in terms of behavior.
The only non-default option I had was 'modeset=1'. From Jeff's earlier
comment I understood the probing behaviour changed.
The fundamental difference is that in 2.6.29 everything initially seems
okay, but then there is a
ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
ata1.00: irq_stat 0x00400008, PHY RDY changed
There's nothing frozen it 2.6.28.
Should I log a kernel bug, what's the best way forward and is there
anything more I can do to help?
Let Tejun have a look a this, perhaps?
What Niel is seeing is probably caused by libata EH somehow moving
forward too fast and receiving the second PHY changed event after the
initial reset is complete. That or the thaw routine is broken and
doesn't clear hotplug event properly. Actually, this double reset
seems to happen quite often, so it might be about time to drill it
down and find out what's really going on. But, generally, it isn't a
serious problem, all that happens is EH doing another round. The
original one looks quite serious tho. I'll reply separately.

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-03-30 09:00:15 UTC
Permalink
Hello,

For some reason, I can't find the original thread, so replying here.
Post by James Bottomley
Post by Jeff Garzik
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
Yeap, ext4 is just the victim here.
Post by James Bottomley
Post by Jeff Garzik
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
...
Post by James Bottomley
Post by Jeff Garzik
The SCSI subsystem aborts each of the queued commands.
No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os
The only place these are generated is in ata_sense_to_error() which only
occurs if there's some type of ata error.
If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.
In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?
When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().

Whether a command is to be retried or not is determined with
ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
command. Immediate-failure criteria is pretty strict - only driver
software errors (AC_ERR_INVALID) and PC or other special commands
which failed which got aborted by the device get the immediate pink
slip. In this case, the commands are from FS and failed with
AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
Strange.

How reproducible is the problem? Are you interested in trying out
some debug patches?

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-03-30 14:40:13 UTC
Permalink
Post by Tejun Heo
Hello,
For some reason, I can't find the original thread, so replying here.
Post by James Bottomley
Post by Jeff Garzik
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
Yeap, ext4 is just the victim here.
Post by James Bottomley
Post by Jeff Garzik
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
...
Post by James Bottomley
Post by Jeff Garzik
The SCSI subsystem aborts each of the queued commands.
No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os
The only place these are generated is in ata_sense_to_error() which only
occurs if there's some type of ata error.
If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.
In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?
When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().
Whether a command is to be retried or not is determined with
ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
command. Immediate-failure criteria is pretty strict - only driver
software errors (AC_ERR_INVALID) and PC or other special commands
which failed which got aborted by the device get the immediate pink
slip. In this case, the commands are from FS and failed with
AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
Strange.
How reproducible is the problem? Are you interested in trying out
some debug patches?
Hi Tejun,

I think I should be able to reproduce when actively using X with 2.6.29,
and I have an external disk where I could backup to / boot from if the
corruption became a problem.

These issues are keeping me from 2.6.29 so I'll gladly help where I can,
if you can please provide me the patches and the .config settings that
may be required?

Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Jeff Garzik
2009-03-30 14:50:14 UTC
Permalink
Post by Niel Lambrechts
Post by Tejun Heo
Hello,
For some reason, I can't find the original thread, so replying here.
Post by James Bottomley
Post by Jeff Garzik
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
Yeap, ext4 is just the victim here.
Post by James Bottomley
Post by Jeff Garzik
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
...
Post by James Bottomley
Post by Jeff Garzik
The SCSI subsystem aborts each of the queued commands.
No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os
The only place these are generated is in ata_sense_to_error() which only
occurs if there's some type of ata error.
If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.
In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?
When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().
Whether a command is to be retried or not is determined with
ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
command. Immediate-failure criteria is pretty strict - only driver
software errors (AC_ERR_INVALID) and PC or other special commands
which failed which got aborted by the device get the immediate pink
slip. In this case, the commands are from FS and failed with
AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
Strange.
How reproducible is the problem? Are you interested in trying out
some debug patches?
Hi Tejun,
I think I should be able to reproduce when actively using X with 2.6.29,
and I have an external disk where I could backup to / boot from if the
corruption became a problem.
These issues are keeping me from 2.6.29 so I'll gladly help where I can,
if you can please provide me the patches and the .config settings that
may be required?
Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Any chance you could use bisect to narrow down the problem commit?

http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt

This should identify which patch caused your problems, if you have a
known good starting point (such as 2.6.28).

Jeff



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-04-01 19:50:12 UTC
Permalink
Post by Jeff Garzik
Post by Niel Lambrechts
Post by Tejun Heo
Hello,
For some reason, I can't find the original thread, so replying here.
Post by James Bottomley
Post by Jeff Garzik
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
Yeap, ext4 is just the victim here.
Post by James Bottomley
Post by Jeff Garzik
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
...
Post by James Bottomley
Post by Jeff Garzik
The SCSI subsystem aborts each of the queued commands.
No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os
The only place these are generated is in ata_sense_to_error() which only
occurs if there's some type of ata error.
If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.
In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?
When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().
Whether a command is to be retried or not is determined with
ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
command. Immediate-failure criteria is pretty strict - only driver
software errors (AC_ERR_INVALID) and PC or other special commands
which failed which got aborted by the device get the immediate pink
slip. In this case, the commands are from FS and failed with
AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
Strange.
How reproducible is the problem? Are you interested in trying out
some debug patches?
Hi Tejun,
I think I should be able to reproduce when actively using X with 2.6.29,
and I have an external disk where I could backup to / boot from if the
corruption became a problem.
These issues are keeping me from 2.6.29 so I'll gladly help where I can,
if you can please provide me the patches and the .config settings that
may be required?
Niel
--
To unsubscribe from this list: send the line "unsubscribe
linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Any chance you could use bisect to narrow down the problem commit?
http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt
This should identify which patch caused your problems, if you have a
known good starting point (such as 2.6.28).
I'm struggling with this - my good kernel is 2.6.28.9 and as far as I
can tell the closest thing good kernel I can tell git to use is 2.6.28
base itself. So now what happens is that resume entirely fails during
some of the bisects due to entirely other regressions that are present
in older and newer kernels than mine, so I can't test the real issue! :(

cheers
Niel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Jeff Garzik
2009-04-03 20:10:10 UTC
Permalink
Post by Niel Lambrechts
Post by Jeff Garzik
Post by Niel Lambrechts
Post by Tejun Heo
Hello,
For some reason, I can't find the original thread, so replying here.
Post by James Bottomley
Post by Jeff Garzik
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
Yeap, ext4 is just the victim here.
Post by James Bottomley
Post by Jeff Garzik
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
...
Post by James Bottomley
Post by Jeff Garzik
The SCSI subsystem aborts each of the queued commands.
No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os
The only place these are generated is in ata_sense_to_error() which only
occurs if there's some type of ata error.
If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.
In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?
When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().
Whether a command is to be retried or not is determined with
ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
command. Immediate-failure criteria is pretty strict - only driver
software errors (AC_ERR_INVALID) and PC or other special commands
which failed which got aborted by the device get the immediate pink
slip. In this case, the commands are from FS and failed with
AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
Strange.
How reproducible is the problem? Are you interested in trying out
some debug patches?
Hi Tejun,
I think I should be able to reproduce when actively using X with 2.6.29,
and I have an external disk where I could backup to / boot from if the
corruption became a problem.
These issues are keeping me from 2.6.29 so I'll gladly help where I can,
if you can please provide me the patches and the .config settings that
may be required?
Niel
--
To unsubscribe from this list: send the line "unsubscribe
linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Any chance you could use bisect to narrow down the problem commit?
http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt
This should identify which patch caused your problems, if you have a
known good starting point (such as 2.6.28).
I'm struggling with this - my good kernel is 2.6.28.9 and as far as I
can tell the closest thing good kernel I can tell git to use is 2.6.28
base itself. So now what happens is that resume entirely fails during
some of the bisects due to entirely other regressions that are present
in older and newer kernels than mine, so I can't test the real issue! :(
"git help bisect" or "man git-bisect" has a wealth of information.

Most notably, you can use "git bisect skip" if the current commit cannot
be tested, and thus cannot be declared good or bad.

Jeff



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-04-03 21:00:21 UTC
Permalink
Post by Jeff Garzik
Post by Niel Lambrechts
Post by Jeff Garzik
Post by Niel Lambrechts
Post by Tejun Heo
Hello,
For some reason, I can't find the original thread, so replying here.
Post by James Bottomley
Post by Jeff Garzik
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
Yeap, ext4 is just the victim here.
Post by James Bottomley
Post by Jeff Garzik
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
...
Post by James Bottomley
Post by Jeff Garzik
The SCSI subsystem aborts each of the queued commands.
No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os
The only place these are generated is in ata_sense_to_error() which only
occurs if there's some type of ata error.
If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.
In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?
When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().
Whether a command is to be retried or not is determined with
ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
command. Immediate-failure criteria is pretty strict - only driver
software errors (AC_ERR_INVALID) and PC or other special commands
which failed which got aborted by the device get the immediate pink
slip. In this case, the commands are from FS and failed with
AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
Strange.
How reproducible is the problem? Are you interested in trying out
some debug patches?
Hi Tejun,
I think I should be able to reproduce when actively using X with 2.6.29,
and I have an external disk where I could backup to / boot from if the
corruption became a problem.
These issues are keeping me from 2.6.29 so I'll gladly help where I can,
if you can please provide me the patches and the .config settings that
may be required?
Niel
--
To unsubscribe from this list: send the line "unsubscribe
linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Any chance you could use bisect to narrow down the problem commit?
http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt
This should identify which patch caused your problems, if you have a
known good starting point (such as 2.6.28).
I'm struggling with this - my good kernel is 2.6.28.9 and as far as I
can tell the closest thing good kernel I can tell git to use is 2.6.28
base itself. So now what happens is that resume entirely fails during
some of the bisects due to entirely other regressions that are present
in older and newer kernels than mine, so I can't test the real issue! :(
"git help bisect" or "man git-bisect" has a wealth of information.
Most notably, you can use "git bisect skip" if the current commit
cannot be tested, and thus cannot be declared good or bad.
Thanks Jeff, still finding my way around git, it was not that clear to
me if there is a way to tell git that 2.6.28.9 was good, to me it looked
like the main kernel only consists of the RC and the final release, not
any subsequent patches?

Anyhow, Tejun's patch is holding up well apart from the glitch I
reported on yesterday - I guess it could have simply been part of the
debugging output Tejun included in the patch and not a real issue.

I haven't heard from him (yet), but I wonder if you might know if the
results give enough of a clue for a patch that could be included in say
2.6.29.1?

Thanks!
Niel


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-04-02 02:00:16 UTC
Permalink
Hello,

Sorry about the delay.
Post by Niel Lambrechts
I think I should be able to reproduce when actively using X with 2.6.29,
and I have an external disk where I could backup to / boot from if the
corruption became a problem.
These issues are keeping me from 2.6.29 so I'll gladly help where I can,
if you can please provide me the patches and the .config settings that
may be required?
Attached is the debug patch. Please reproduce the problem with the
patch applied and post full kernel log.

Thanks.
--
tejun
Tejun Heo
2009-04-02 07:00:19 UTC
Permalink
Post by Tejun Heo
Attached is the debug patch. Please reproduce the problem with the
patch applied and post full kernel log.
Hi there,
Here is the output, it took 2 attempts to reproduce the EXT4 corruption
(clean with errors).
Strange. Maybe IO commands are getting through while the sdev is
still in quiesce state? Can you please repeat the test with the
attached patch?

Thanks.
--
tejun
Tejun Heo
2009-04-04 16:50:09 UTC
Permalink
Hello, Neil.
Thanks for your help, I've done at least 5 hibernate cycles without the
problem recurring, I'll keep at it for a while... :)
Hmmm..
For the sake of being thorough, I'd like to mention some of the
remaining issues/messages, but to be honest some of them were there
1) Can you perhaps confirm if the remaining ATA messages are harmless
enough to ignore?
ata2: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x1e frozen
ata2: irq_stat 0x00400040, connection status changed
ata2: SError: { PHYRdyChg CommWake DevExch }
Clocksource tsc unstable (delta = -412838835 ns)
Yeah, that happens on some machines during resume. It doesn't on some
machines (including mine) and dunno what's the difference yet. It's a
bit annoying and probably adds a bit to resume time but libata handles
things like this just fine, so no need to be alarmed too much.
Apr 2 12:31:44 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
2) The screen remains blank on resume, right until I both press a key
_and_ touch the touchpad. Weird, but this happens in 2.6.28.9 as well,
perhaps this is i915 related.
Heh... yeah, black magic of display suspend/resume. Hopefully it will
get better with KMS.
Just arrived back home, before suspending I enabled all the powertop
laptop mode/SATA link management etc and I removed a CD-ROM from the
ata2.00: XXX setting retry on qc0
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: irq_stat 0x40000001
ata2.00: XXX terminating qc0 (SENSE), retries=0
XXX scsi_eh_flush_done_q: online=1(2) noretry=0 retries=3 allowed=3
scsi_eh_1: flush finish cmd: f5dea740
ata2.00: XXX setting retry on qc0
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: irq_stat 0x40000001
ata2.00: XXX terminating qc0 (SENSE), retries=0
XXX scsi_eh_flush_done_q: online=1(2) noretry=0 retries=2 allowed=2
.. loop ..
I hope it's simply something like "drive not ready" debugging ouput,
since ata2 seems associated with the CD-Rom drive... :)
This continued, until I again in powertop enabled SATA link management.
That's probably hal polling for media presence. It uses
TEST_UNIT_READY and if there's no media in the drive, it will raise
CHECK_SENSE condition which is handled by the libata EH which is
usually quiet about it as it's not an error but with the debugging
patch, it got more whiny. If you're annoyed by it just put a media in
the drive.

Thanks.
--
tejun

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-04-06 05:10:13 UTC
Permalink
Post by Tejun Heo
Hello, Neil.
Thanks for your help, I've done at least 5 hibernate cycles without the
problem recurring, I'll keep at it for a while... :)
Hmmm..
Hi Tejun,

I have tested your patch some more over the weekend as well as on
2.6.29.1, and suspend-to-disk and ram still works without a glitch for
me apart from the extra debugging being active.

Will the fix naturally make its way into the mainline kernel, or is
there any extra debugging/testing I can help with?

Regards,
Niel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-04-06 10:10:17 UTC
Permalink
Post by Niel Lambrechts
Post by Tejun Heo
Hello, Neil.
Thanks for your help, I've done at least 5 hibernate cycles without the
problem recurring, I'll keep at it for a while... :)
Hmmm..
Hi Tejun,
I have tested your patch some more over the weekend as well as on
2.6.29.1, and suspend-to-disk and ram still works without a glitch for
me apart from the extra debugging being active.
Will the fix naturally make its way into the mainline kernel, or is
there any extra debugging/testing I can help with?
Well, the problem is the debug patch doesn't actually do anything
other than printing out messages. It could be that the problem is
timing dependent (which is likely anyway). You still can reporduce
the problem with the patch, right?

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-04-06 18:30:13 UTC
Permalink
Post by Tejun Heo
Post by Niel Lambrechts
Will the fix naturally make its way into the mainline kernel, or is
there any extra debugging/testing I can help with?
Well, the problem is the debug patch doesn't actually do anything
other than printing out messages. It could be that the problem is
timing dependent (which is likely anyway). You still can reporduce
the problem with the patch, right?
Strange. Maybe IO commands are getting through while the sdev is
still in quiesce state? Can you please repeat the test with the
attached patch?
With the latter, I have not encountered the original problem i.e. any
severe EXT4 corruption again, not in 2.6.29 and not in 2.6.29.1.

Do I also need to try the last patch without any debugging messages?

cheers
Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-04-06 19:40:31 UTC
Permalink
Hello,
Post by Niel Lambrechts
Post by Tejun Heo
Post by Niel Lambrechts
Will the fix naturally make its way into the mainline kernel, or is
there any extra debugging/testing I can help with?
Well, the problem is the debug patch doesn't actually do anything
other than printing out messages. It could be that the problem is
timing dependent (which is likely anyway). You still can reporduce
the problem with the patch, right?
Yeah, the second one actually only added printks to see whether that's
the case. No behavior change.
Post by Niel Lambrechts
Post by Tejun Heo
Strange. Maybe IO commands are getting through while the sdev is
still in quiesce state? Can you please repeat the test with the
attached patch?
With the latter, I have not encountered the original problem i.e. any
severe EXT4 corruption again, not in 2.6.29 and not in 2.6.29.1.
Eh... so, we're definitely seeing something which is dependent on
timing.
Post by Niel Lambrechts
Do I also need to try the last patch without any debugging messages?
Then there will be nothing left. :-) Can you please try the attached
patch? It's still only debug messages but lighter; hopefully, it
won't mask the problem.

Thanks.
--
tejun
Tejun Heo
2009-04-09 18:20:14 UTC
Permalink
Hello,
Sorry, my bad - I assumed the 2nd patch actually made a functional
difference... :)
Here is the output on 2.6.29.1 with your new patch - still nothing
serious happening, resume still seems okay!
Hmm... none of the debugging messages got triggered, so there should
be no noticeable timing change. It seems like you're being lucky for
the time being. Can you please keep testing and report back?

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-05-23 09:20:09 UTC
Permalink
Post by Tejun Heo
Hmm... none of the debugging messages got triggered, so there should
be no noticeable timing change. It seems like you're being lucky for
the time being. Can you please keep testing and report back?
Hi Tejun,

Just to keep you up informed - this issue is still happening, I've been
testing the latest 2.6.30-git kernels the last couple of weeks and just
checked my logs after having an unexpected fsck upon system boot - the
freeze seems to trigger very infrequently though, since I use s2disk at
least twice a day and the error only seems to have happened twice since
May 2.

I had a look at your 2.6.29 debug patch, it does "Hunk #6 succeeded at
3368" so it should still be good (I hope), I'll try and run with it for
a while.

The fsck output also seems a bit odd, on boot it displays a bunch of
newlines and the partition name without a *real* message (the ^H being
newlines):
/dev/disk/by-id/ata-HITACHI_HTS722020K9SA00_FDE_080726DP0470DTGMUNPC-part6:
|| 0.9%
<progress bar skipped>
^H- 56.5%
/dev/disk/by-id/ata-HITACHI_HTS722020K9SA00_FDE_080726DP0470DTGMUNPC-part6:
|
^H\ 56.7%
/dev/disk/by-id/ata-HITACHI_HTS722020K9SA00_FDE_080726DP0470DTGMUNPC-part6:
|
^H| 57.1%
/dev/disk/by-id/ata-HITACHI_HTS722020K9SA00_FDE_080726DP0470DTGMUNPC-part6:
|


Thanks,
Niel


linux-7vph:~ # uname -r
2.6.30-rc6-pae

linux-7vph:~# rpm -qf `which fsck`
e2fsprogs-1.41.4-2.1

linux-7vph:~# lspci -vvv|grep -i sata
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI
Controller (rev 03) (prog-if 01 [AHCI 1.0])

/var/log/warn:
<snip>
May 22 09:28:55 linux-7vph kernel: Syncing filesystems ... done.
May 22 09:28:55 linux-7vph kernel: Freezing user space processes ...
(elapsed 0.00 seconds) done.
May 22 09:28:55 linux-7vph kernel: Freezing remaining freezable tasks
... (elapsed 0.00 seconds) done.
May 22 09:28:55 linux-7vph kernel: Suspending console(s) (use
no_console_suspend to debug)
May 22 09:28:57 linux-7vph kernel: Disabling non-boot CPUs ...
May 22 09:28:57 linux-7vph kernel: CPU1 is down
May 22 09:28:57 linux-7vph kernel: Enabling non-boot CPUs ...
May 22 09:28:57 linux-7vph kernel: microcode: failed to init CPU1
May 22 09:28:57 linux-7vph kernel: CPU1 is up
May 22 09:28:57 linux-7vph kernel: e1000e 0000:00:19.0:
pci_enable_pcie_error_reporting failed 0xfffffffb
May 22 09:28:58 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 22 09:28:58 linux-7vph kernel: ata1: irq_stat 0x00400040, connection
status changed
May 22 09:28:58 linux-7vph kernel: Restarting tasks ... done.
May 22 09:28:58 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 22 09:28:58 linux-7vph kernel: ata2: irq_stat 0x40000001
May 22 09:28:58 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct
0x1ff SErr 0x50000 action 0xe frozen
May 22 09:28:58 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY
changed
May 22 09:28:58 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
May 22 09:28:58 linux-7vph kernel: ata1.00: cmd
60/40:00:91:ca:b5/00:00:09:00:00/40 tag 0 ncq 32768 in
May 22 09:28:58 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:08:91:d4:b5/00:00:09:00:00/40 tag 1 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:10:51:d2:b5/00:00:09:00:00/40 tag 2 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:18:11:cd:b5/00:00:09:00:00/40 tag 3 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:20:11:0a:b7/00:00:09:00:00/40 tag 4 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:28:d1:2b:b6/00:00:09:00:00/40 tag 5 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:30:91:c3:b5/00:00:09:00:00/40 tag 6 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/08:38:5f:88:2e/00:00:0b:00:00/40 tag 7 ncq 4096 in
May 22 09:29:01 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:29:01 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:29:02 linux-7vph kernel: ata1.00: cmd
60/e8:40:5f:8e:2b/00:00:0b:00:00/40 tag 8 ncq 118784 in
May 22 09:29:02 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:29:02 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:29:03 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 22 09:29:03 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187403871
May 22 09:29:03 linux-7vph kernel: EXT4-fs error (device sda6):
__ext4_get_inode_loc: unable to read inode block - inode=641187,
block=2556122
May 22 09:29:03 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
<snip>

and an older one, could have been 2.6.30-rc5, I'm not sure:
May 14 07:47:17 linux-7vph kernel: Syncing filesystems ... done.
May 14 07:47:18 linux-7vph kernel: Freezing user space processes ...
(elapsed 0.00 seconds) done.
May 14 07:47:18 linux-7vph kernel: Freezing remaining freezable tasks
... (elapsed 0.00 seconds) done.
May 14 07:47:18 linux-7vph kernel: Suspending console(s) (use
no_console_suspend to debug)
May 14 07:47:18 linux-7vph kernel: Disabling non-boot CPUs ...
May 14 07:47:18 linux-7vph kernel: CPU1 is down
May 14 07:47:18 linux-7vph kernel: Enabling non-boot CPUs ...
May 14 07:47:18 linux-7vph kernel: microcode: failed to init CPU1
May 14 07:47:18 linux-7vph kernel: CPU1 is up
May 14 07:47:19 linux-7vph kernel: e1000e 0000:00:19.0:
pci_enable_pcie_error_reporting failed 0xfffffffb
May 14 07:47:19 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 14 07:47:19 linux-7vph kernel: ata1: irq_stat 0x00400040, connection
status changed
May 14 07:47:19 linux-7vph kernel: mac80211-phy0: failed to set key (0,
00:1d:92:1d:1e:8e) to hardware (-22)
May 14 07:47:19 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 14 07:47:19 linux-7vph kernel: ata2: irq_stat 0x40000001
May 14 07:47:19 linux-7vph kernel: Restarting tasks ... done.
May 14 07:47:19 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct
0x3fffff SErr 0x50000 action 0xe frozen
May 14 07:47:19 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY
changed
May 14 07:47:19 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:00:17:70:14/00:00:0d:00:00/40 tag 0 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:08:11:8e:b5/00:00:09:00:00/40 tag 1 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:10:91:a4:b5/00:00:09:00:00/40 tag 2 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:18:5f:08:c8/00:00:0c:00:00/40 tag 3 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:20:6f:88:6f/01:00:0e:00:00/40 tag 4 ncq 135168 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:28:9f:98:6d/00:00:0e:00:00/40 tag 5 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:30:11:91:b5/00:00:09:00:00/40 tag 6 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:38:91:9d:b5/00:00:09:00:00/40 tag 7 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:40:51:7f:b5/00:00:09:00:00/40 tag 8 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:48:91:7d:b5/00:00:09:00:00/40 tag 9 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/a8:50:5f:89:67/00:00:0e:00:00/40 tag 10 ncq 86016 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:58:0f:8a:67/00:00:0e:00:00/40 tag 11 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/b0:60:5f:8e:2b/00:00:0b:00:00/40 tag 12 ncq 90112 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/18:68:17:8f:2b/00:00:0b:00:00/40 tag 13 ncq 12288 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:70:0f:8f:2b/00:00:0b:00:00/40 tag 14 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/18:78:2f:8f:2b/00:00:0b:00:00/40 tag 15 ncq 12288 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/08:80:07:8a:67/00:00:0e:00:00/40 tag 16 ncq 4096 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/50:88:17:8a:67/00:00:0e:00:00/40 tag 17 ncq 40960 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/58:90:6f:06:44/00:00:0c:00:00/40 tag 18 ncq 45056 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/08:98:d7:06:44/00:00:0c:00:00/40 tag 19 ncq 4096 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/18:a0:17:07:44/00:00:0c:00:00/40 tag 20 ncq 12288 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/08:a8:ef:4a:39/00:00:0b:00:00/40 tag 21 ncq 4096 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 242190447
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 241666399
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 241666575
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187403871
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187404055
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187404079
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 241666583
May 14 07:47:20 linux-7vph kernel: EXT4-fs error (device sda6):
__ext4_get_inode_loc: unable to read inode block - inode=2346519,
block=9404418
May 14 07:47:20 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-05-23 10:30:12 UTC
Permalink
Post by Niel Lambrechts
Post by Tejun Heo
Hmm... none of the debugging messages got triggered, so there should
be no noticeable timing change. It seems like you're being lucky for
the time being. Can you please keep testing and report back?
Hi Tejun,
Just to keep you up informed - this issue is still happening, I've
been testing the latest 2.6.30-git kernels the last couple of weeks
and just checked my logs after having an unexpected fsck upon system
boot - the freeze seems to trigger very infrequently though, since I
use s2disk at least twice a day and the error only seems to have
happened twice since May 2.
I had a look at your 2.6.29 debug patch, it does "Hunk #6 succeeded at
3368" so it should still be good (I hope), I'll try and run with it
for a while.
Hi Tejun

Bug triggered with your patch! I played audio while suspending to try
and increase activity (I also removed a CD on boot), and the filesystem
came up dirty! This was on attempt nr. 3 or 4.

Here is the /var/log/messages output - i hope this is sufficient to
trace down the cause:
May 23 12:15:08 linux-7vph kernel: PM: Shrinking memory... done (98443
pages freed)
May 23 12:15:07 linux-7vph pulseaudio[3781]: alsa-util.c: period_time
: 185759
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: tstamp_mode
: ENABLE
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: period_step : 1
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: avail_min
: 15503
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: period_event : 0
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c:
start_threshold : -1
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c:
stop_threshold : 1073741824
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c:
silence_threshold: 0
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: silence_size : 0
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: boundary
: 1073741824
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: Slave:
Hardware PCM card 0 'HDA Intel' device 0 subdevice 0
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: Its setup is:
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: stream
: PLAYBACK
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: access
: MMAP_INTERLEAVED
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: format
: S16_LE
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: subformat
: STD
May 23 12:15:08 linux-7vph kernel: PM: Freed 393772 kbytes in 3.23
seconds (121.91 MB/s)
May 23 12:15:09 linux-7vph kernel: Suspending console(s) (use
no_console_suspend to debug)
May 23 12:15:09 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI
cache
May 23 12:15:09 linux-7vph kernel: ACPI handle has no context!
May 23 12:15:09 linux-7vph kernel: ata1: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x0
May 23 12:15:09 linux-7vph kernel: ata2: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x0
May 23 12:15:09 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
May 23 12:15:09 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
May 23 12:15:09 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
May 23 12:15:09 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
May 23 12:15:09 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
May 23 12:15:09 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B
disabled
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: channels : 2
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: rate
: 44100
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: exact rate
: 44100 (44100/1)
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: msbits
: 16
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: buffer_size
: 16384
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: period_size
: 8192
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: period_time
: 185759
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: tstamp_mode
: ENABLE
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: period_step : 1
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: avail_min
: 15503
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: period_event : 0
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c:
start_threshold : -1
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: wake-up
capability enabled by ACPI
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c:
stop_threshold : 1073741824
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c:
silence_threshold: 0
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: silence_size : 0
May 23 12:15:10 linux-7vph kernel: ACPI: Preparing to enter system sleep
state S4
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: boundary
: 1073741824
May 23 12:15:10 linux-7vph kernel: PM: Saving platform NVS memory
May 23 12:15:10 linux-7vph kernel: Disabling non-boot CPUs ...
May 23 12:15:10 linux-7vph kernel: CPU 1 is now offline
May 23 12:15:10 linux-7vph kernel: SMP alternatives: switching to UP code
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-sink.c: Increasing
wakeup watermark to 40.00 ms
May 23 12:15:10 linux-7vph kernel: CPU0 attaching NULL sched-domain.
May 23 12:15:10 linux-7vph kernel: CPU1 attaching NULL sched-domain.
May 23 12:15:10 linux-7vph kernel: CPU0 attaching NULL sched-domain.
May 23 12:15:10 linux-7vph kernel: CPU1 is down
May 23 12:15:10 linux-7vph kernel: Extended CMOS year: 2000
May 23 12:15:10 linux-7vph kernel: PM: Creating hibernation image:
May 23 12:15:10 linux-7vph kernel: PM: Need to copy 124750 pages
May 23 12:15:10 linux-7vph kernel: x86 PAT enabled: cpu 0, old
0x7040600070406, new 0x7010600070106
May 23 12:15:10 linux-7vph kernel: Intel machine check architecture
supported.
May 23 12:15:10 linux-7vph kernel: Intel machine check reporting enabled
on CPU#0.
May 23 12:15:10 linux-7vph kernel: PM: Restoring platform NVS memory
May 23 12:15:10 linux-7vph kernel: Extended CMOS year: 2000
May 23 12:15:10 linux-7vph kernel: Enabling non-boot CPUs ...
May 23 12:15:10 linux-7vph kernel: SMP alternatives: switching to SMP code
May 23 12:15:10 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
May 23 12:15:10 linux-7vph kernel: Initializing CPU#1
May 23 12:15:10 linux-7vph kernel: Calibrating delay using timer
specific routine.. 5054.05 BogoMIPS (lpj=10108117)
May 23 12:15:10 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
May 23 12:15:10 linux-7vph kernel: CPU: L2 cache: 6144K
May 23 12:15:10 linux-7vph kernel: CPU: Physical Processor ID: 0
May 23 12:15:10 linux-7vph kernel: CPU: Processor Core ID: 1
May 23 12:15:10 linux-7vph kernel: Intel machine check architecture
supported.
May 23 12:15:10 linux-7vph kernel: Intel machine check reporting enabled
on CPU#1.
May 23 12:15:10 linux-7vph kernel: x86 PAT enabled: cpu 1, old
0x7040600070406, new 0x7010600070106
May 23 12:15:10 linux-7vph kernel: CPU1: Intel(R) Core(TM)2 Duo CPU
T9400 @ 2.53GHz stepping 06
May 23 12:15:10 linux-7vph kernel: CPU0 attaching NULL sched-domain.
May 23 12:15:10 linux-7vph kernel: Switched to high resolution mode on CPU 1
May 23 12:15:10 linux-7vph kernel: CPU0 attaching sched-domain:
May 23 12:15:10 linux-7vph kernel: domain 0: span 0-1 level MC
May 23 12:15:10 linux-7vph kernel: groups: 0 1
May 23 12:15:10 linux-7vph kernel: domain 1: span 0-1 level CPU
May 23 12:15:10 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
May 23 12:15:10 linux-7vph kernel: CPU1 attaching sched-domain:
May 23 12:15:10 linux-7vph kernel: domain 0: span 0-1 level MC
May 23 12:15:10 linux-7vph kernel: groups: 1 0
May 23 12:15:10 linux-7vph kernel: domain 1: span 0-1 level CPU
May 23 12:15:10 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
May 23 12:15:10 linux-7vph kernel: microcode: failed to init CPU1
May 23 12:15:10 linux-7vph kernel: CPU1 is up
May 23 12:15:10 linux-7vph kernel: ACPI: Waking up from system sleep
state S4
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: restoring config
space at offset 0x1 (was 0x900007, writing 0x900403)
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state
changed by ACPI to D0
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state
changed by ACPI to D0
May 23 12:15:10 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring
config space at offset 0x1 (was 0x100106, writing 0x100102)
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state
changed by ACPI to D0
May 23 12:15:10 linux-7vph kernel: ahci 0000:00:1f.2: restoring config
space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
May 23 12:15:10 linux-7vph kernel: iwlagn 0000:03:00.0: restoring config
space at offset 0x1 (was 0x100106, writing 0x100506)
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.1: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:00:03.0: PME# disabled
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI
20 (level, low) -> IRQ 20
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0:
pci_enable_pcie_error_reporting failed 0xfffffffb
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: wake-up
capability disabled by ACPI
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: wake-up
capability disabled by ACPI
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A ->
GSI 20 (level, low) -> IRQ 20
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B ->
GSI 21 (level, low) -> IRQ 21
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C ->
GSI 22 (level, low) -> IRQ 22
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D ->
GSI 23 (level, low) -> IRQ 23
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B ->
GSI 17 (level, low) -> IRQ 17
May 23 12:15:10 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A ->
GSI 16 (level, low) -> IRQ 16
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B ->
GSI 17 (level, low) -> IRQ 17
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C ->
GSI 18 (level, low) -> IRQ 18
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D ->
GSI 19 (level, low) -> IRQ 19
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:00:1e.0: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: ahci 0000:00:1f.2: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: ata1: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x16 frozen
May 23 12:15:10 linux-7vph kernel: ata2: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x16 frozen
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16
(level, low) -> IRQ 16
May 23 12:15:10 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1
(PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT
contexts=[4/4]
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.2: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.3: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.4: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.5: PME# disabled
May 23 12:15:10 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
May 23 12:15:10 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus
113 SControl 300)
May 23 12:15:10 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus
113 SControl 300)
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/02:00:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
f5/00:00:00:00:00:a0 filtered out
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/5f:00:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/10:03:00:00:00:a0 filtered out
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/02:00:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
f5/00:00:00:00:00:a0 filtered out
May 23 12:15:10 linux-7vph kernel: ata2.00: ACPI cmd
e3/00:1f:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/5f:00:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/10:03:00:00:00:a0 filtered out
May 23 12:15:10 linux-7vph kernel: ata2.00: ACPI cmd
e3/00:02:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 12:15:10 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 23 12:15:10 linux-7vph kernel: ata1: irq_stat 0x00400040, connection
status changed
May 23 12:15:10 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 12:15:10 linux-7vph kernel: ata1: EH complete
May 23 12:15:10 linux-7vph kernel: ata2.00: ACPI cmd
e3/00:1f:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata2.00: ACPI cmd
e3/00:02:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata2.00: configured for UDMA/133
May 23 12:15:10 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 23 12:15:10 linux-7vph kernel: ata2: irq_stat 0x40000001
May 23 12:15:10 linux-7vph kernel: ata2.00: configured for UDMA/133
May 23 12:15:10 linux-7vph kernel: ata2: EH complete
May 23 12:15:10 linux-7vph kernel: Registered led device: iwl-phy0::radio
May 23 12:15:10 linux-7vph kernel: Registered led device: iwl-phy0::assoc
May 23 12:15:10 linux-7vph kernel: Registered led device: iwl-phy0::RX
May 23 12:15:10 linux-7vph kernel: Registered led device: iwl-phy0::TX
May 23 12:15:10 linux-7vph kernel: mac80211-phy0: failed to set key (0,
00:1d:92:1d:1e:8e) to hardware (-22)
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: restoring config
space at offset 0x1 (was 0x900407, writing 0x900403)
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: Restarting tasks ... <3>ata1.00:
exception Emask 0x10 SAct 0x1f SErr 0x50000 action 0xe frozen
May 23 12:15:10 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY
changed
May 23 12:15:10 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
May 23 12:15:10 linux-7vph kernel: ata1.00: cmd
60/08:00:5f:89:67/01:00:0e:00:00/40 tag 0 ncq 135168 in
May 23 12:15:10 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:10 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:10 linux-7vph kernel: ata1.00: cmd
60/08:08:6f:88:6f/01:00:0e:00:00/40 tag 1 ncq 135168 in
May 23 12:15:10 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:10 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:10 linux-7vph kernel: ata1.00: cmd
60/08:10:5f:08:c8/00:00:0c:00:00/40 tag 2 ncq 4096 in
May 23 12:15:10 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:10 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:10 linux-7vph kernel: ata1.00: cmd
60/08:18:9f:98:6d/00:00:0e:00:00/40 tag 3 ncq 4096 in
May 23 12:15:10 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:11 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:11 linux-7vph kernel: ata1.00: cmd
60/e8:20:5f:8e:2b/00:00:0b:00:00/40 tag 4 ncq 118784 in
May 23 12:15:11 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:11 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:11 linux-7vph kernel: ata1: hard resetting link
May 23 12:15:11 linux-7vph kernel: done.
May 23 12:15:11 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus
113 SControl 300)
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/02:00:00:00:00:a0 succeeded
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
f5/00:00:00:00:00:a0 filtered out
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/5f:00:00:00:00:a0 succeeded
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/10:03:00:00:00:a0 filtered out
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/02:00:00:00:00:a0 succeeded
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
f5/00:00:00:00:00:a0 filtered out
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/5f:00:00:00:00:a0 succeeded
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/10:03:00:00:00:a0 filtered out
May 23 12:15:11 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 12:15:11 linux-7vph kernel: ata1: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x0 t4
May 23 12:15:11 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2)
noretry=2 retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush finish cmd: f6838ec0
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted
Command [current] [descriptor]
May 23 12:15:11 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 23 12:15:11 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a
80 00 00 00 00 00
May 23 12:15:11 linux-7vph kernel: 0c c8 08 5f
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No
additional sense information
May 23 12:15:11 linux-7vph kernel: end_request: I/O error, dev sda,
sector 241666399
May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2)
noretry=2 retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush finish cmd: f6838680
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted
Command <2>EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
read inode block - inode=2330983, block=9338933
May 23 12:15:11 linux-7vph kernel: [current] [descriptor]
May 23 12:15:11 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 23 12:15:11 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a
80 00 00 00 00 00
May 23 12:15:11 linux-7vph kernel: 0c c8 08 5f
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No
additional sense information
May 23 12:15:11 linux-7vph kernel: end_request: I/O error, dev sda,
sector 242190447
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6):
__ext4_get_inode_loc: XXX scsi_eh_flush_done_q: online=1(2) noretry=0
retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: unable to read inode block -
inode=2346524, block=9404418
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush retry cmd: f6838800
May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2)
noretry=0 retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush retry cmd: f403de00
May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2)
noretry=2 retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush finish cmd: f403d200
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted
Command [current] [descriptor]
May 23 12:15:11 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 23 12:15:11 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a
80 00 00 00 00 00
May 23 12:15:11 linux-7vph kernel: 0c c8 08 5f
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No
additional sense information
May 23 12:15:11 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187403871
May 23 12:15:11 linux-7vph kernel: ata1: EH complete
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6):
__ext4_get_inode_loc: unable to read inode block - inode=641187,
block=2556122
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
May 23 12:16:50 linux-7vph kernel: ata1: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x10
May 23 12:16:50 linux-7vph kernel: CPU0 attaching NULL sched-domain.
May 23 12:16:50 linux-7vph kernel: CPU1 attaching NULL sched-domain.
May 23 12:16:50 linux-7vph kernel: CPU0 attaching sched-domain:
May 23 12:16:50 linux-7vph kernel: domain 0: span 0-1 level MC
May 23 12:16:50 linux-7vph kernel: groups: 0 1
May 23 12:16:50 linux-7vph kernel: domain 1: span 0-1 level CPU
May 23 12:16:50 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
May 23 12:16:50 linux-7vph kernel: CPU1 attaching sched-domain:
May 23 12:16:50 linux-7vph kernel: domain 0: span 0-1 level MC
May 23 12:16:50 linux-7vph kernel: groups: 1 0
May 23 12:16:50 linux-7vph kernel: domain 1: span 0-1 level CPU
May 23 12:16:50 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)


Regards,
Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-05-25 00:40:08 UTC
Permalink
Hello,
Post by Niel Lambrechts
Bug triggered with your patch! I played audio while suspending to try
and increase activity (I also removed a CD on boot), and the filesystem
came up dirty! This was on attempt nr. 3 or 4.
Great.

Here's the problem.

May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2) noretry=2 retries=0 allowed=5

scsi_noretry_cmd() is returning non-zero indicating that the request
shouldn't be retried and failed immediagely. Looks like the return
value 2 is from blk_failfast_dev() which tests REQ_FAILFAST_DEV. It's
most likely to be set in init_request_from_bio() while translating bio
flags.

cc'ing Theodore Tso. Hello, Neil is reporting ext4 checking out after
resuming.

http://thread.gmane.org/gmane.linux.kernel/814466/focus=817937

The origin of the problem is ATA device triggering a PHY event after
resume sequence is complete. I still don't know why this happens but
it does on certain machines. This in itself shouldn't be a big
problem as the device works fine after one more pass of ATA EH and the
in-flight requests would be retried. However, for some reason, the
aborted commands seem to have REQ_FAILFAST_DEV set thus failing
immediately which, in turn, triggers ext4 errors. Does anything ring
a bell?

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-03-30 18:40:28 UTC
Permalink
Post by Jeff Garzik
Post by Niel Lambrechts
Post by Tejun Heo
Hello,
For some reason, I can't find the original thread, so replying here.
Post by James Bottomley
Post by Jeff Garzik
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.
EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.
Yeap, ext4 is just the victim here.
Post by James Bottomley
Post by Jeff Garzik
ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }
Your SATA hardware flags a connect-or-disconnect event ("PHY
ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA
bus error)
[...]
...
Post by James Bottomley
Post by Jeff Garzik
The SCSI subsystem aborts each of the queued commands.
No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os
The only place these are generated is in ata_sense_to_error() which
only
occurs if there's some type of ata error.
If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and
returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.
In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the
command?
When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().
Whether a command is to be retried or not is determined with
ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
command. Immediate-failure criteria is pretty strict - only driver
software errors (AC_ERR_INVALID) and PC or other special commands
which failed which got aborted by the device get the immediate pink
slip. In this case, the commands are from FS and failed with
AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
Strange.
How reproducible is the problem? Are you interested in trying out
some debug patches?
Hi Tejun,
I think I should be able to reproduce when actively using X with 2.6.29,
and I have an external disk where I could backup to / boot from if the
corruption became a problem.
These issues are keeping me from 2.6.29 so I'll gladly help where I can,
if you can please provide me the patches and the .config settings that
may be required?
Niel
--
To unsubscribe from this list: send the line "unsubscribe
linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Any chance you could use bisect to narrow down the problem commit?
http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt
This should identify which patch caused your problems, if you have a
known good starting point (such as 2.6.28).
Jeff
Any idea of the volume of data would I need to download, git repository
wise? I currently only have the 2.6.27 source and patches on top... and
bandwidth is quite expensive in SA...

Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Jeff Garzik
2009-03-30 19:20:10 UTC
Permalink
Post by Niel Lambrechts
Any idea of the volume of data would I need to download, git repository
wise? I currently only have the 2.6.27 source and patches on top... and
bandwidth is quite expensive in SA...
Around 300MB for the repo, I think?

Jeff



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-05-23 09:40:11 UTC
Permalink
Post by Niel Lambrechts
Post by Tejun Heo
Hmm... none of the debugging messages got triggered, so there should
be no noticeable timing change. It seems like you're being lucky for
the time being. Can you please keep testing and report back?
Hi Tejun,
Just to keep you up informed - this issue is still happening, I've been
testing the latest 2.6.30-git kernels the last couple of weeks and just
checked my logs after having an unexpected fsck upon system boot - the
freeze seems to trigger very infrequently though, since I use s2disk at
least twice a day and the error only seems to have happened twice since
May 2.
I notice that if I hibernate with a CD in the drive and then remove it as soon
as I boot, I get a hard reset on the ata1 bus.

I'm not sure if this maybe plays a role in triggering the problem at times,
perhaps it is entirely expected - but vim highlights it in red and to a "normal"
user that is bad. :)

Maybe after applying your patch I should play around with this?


/var/log/warn:
May 23 11:28:44 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
May 23 11:28:44 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
May 23 11:28:44 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0x9 t4
May 23 11:28:44 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status
changed
May 23 11:28:44 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata1: EH complete
May 23 11:28:44 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata2.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: pci 0000:00:02.0: restoring config space at
offset 0x1 (was 0x900407, writing 0x900403)
May 23 11:28:44 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
May 23 11:28:44 linux-7vph kernel: Restarting tasks ... done.
May 23 11:28:44 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0x9 t4
May 23 11:28:44 linux-7vph kernel: ata2: irq_stat 0x40000001
May 23 11:28:44 linux-7vph kernel: ata2.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata2: EH complete
May 23 11:28:44 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct 0x1 SErr
0x10000 action 0xe frozen
May 23 11:28:44 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY changed
May 23 11:28:44 linux-7vph kernel: ata1: SError: { PHYRdyChg }
May 23 11:28:44 linux-7vph kernel: ata1.00: cmd
61/08:00:37:5a:f4/00:00:09:00:00/40 tag 0 ncq 4096 out
May 23 11:28:44 linux-7vph kernel: res
50/00:08:37:5a:f4/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
May 23 11:28:44 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 11:28:44 linux-7vph kernel: ata1: hard resetting link
May 23 11:28:44 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata1: EH complete

Regards,
Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-05-25 01:20:09 UTC
Permalink
Post by Niel Lambrechts
I notice that if I hibernate with a CD in the drive and then remove it
as soon as I boot, I get a hard reset on the ata1 bus.
I'm not sure if this maybe plays a role in triggering the problem at
times, perhaps it is entirely expected - but vim highlights it in red
and to a "normal" user that is bad. :)
Maybe after applying your patch I should play around with this?
That isn't normal. You're getting PHY event on ata1 if you do
something to the cdrom which is attached to ata2. Something very
fishy is going on there. Sounds like an electric or some sort of
interference problem but I'm not sure. :-(

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Alan Cox
2009-05-25 08:20:15 UTC
Permalink
Post by Tejun Heo
something to the cdrom which is attached to ata2. Something very
fishy is going on there. Sounds like an electric or some sort of
interference problem but I'm not sure. :-(
I'm just wondering what happens on a resume if a phy change event is seen
by the phy before or during the time registers (such as the port mappings
on an ICH) are being restored (or not restored even). That seems to be one
way you could get an event on the wrong port.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-05-26 05:00:14 UTC
Permalink
Hello, Niel.
I've tested all of the kernels I have again since 2.6.29.4 also came out
just recently. I did a hibernate/resume for each in the console, then
repeated the same in X, then continued to the next kernel.
The 2.6.29.4 log is much larger, since some other badness happened there
- there is a large kernel trace in there as my first X hibernation
attempt failed and came back to X after a few seconds. The system seemed
functional, it did not keep generating kernel messages - when I then
retried a hibernate it worked, along with the resume. Another unrelated
bug perhaps?
As for "hard resetting link" messages, they seemed to always happen
under X the times I tried it.
Kernel EXT4-errors? Console:ata1 reset? Console:ata2-reset? X:ata1 reset? X:ata2 reset?
2.6.28.10 No no yes yes no
2.6.29.4* No no no no no
2.6.29.4** No - - yes no
2.6.30-rc6 Yes - - yes no
2.6.30-rc6 No no no yes no
* Xorg hibernation attempt failed.
* Xorg Second hibernation attempt (no extra reboot)
I also did a side by side comparison of the messages I have for
2.6.30-rc6, the one with EXT4 errors I reported on yesterday, and
another one that worked just fine tonight. I stripped all time-stamps
and some pulseaudio messages from the bad one and attached them here,
and also saved the full messages for each kernel to
http://bugzilla.kernel.org/show_bug.cgi?id=13017 .
Since analysing the code-path is still a bit beyond me, I'll leave you
with a little summary of the differences I notice.
A = 2.6.30-rc6 (EXT4 clean)
B = 2.6.30-rc6 (EXT4 errors triggered)
Duplicate PHY events are likely to be dependent on timing and
non-deterministic. The ext4 corrupting or not depends on whether a
request with failfast set was in-flight at the time of the second PHY
event, which again is dependent on timing. At any rate, this looks
like a problem of ext4 (or something between ext4 and the driver). It
either shouldn't issue failfast command or should take appropriate
recovery action if it does. It would be really nice if you can give a
shot at ext3.

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-05-26 05:50:08 UTC
Permalink
Post by Tejun Heo
Hello, Niel.
I've tested all of the kernels I have again since 2.6.29.4 also came out
just recently. I did a hibernate/resume for each in the console, then
repeated the same in X, then continued to the next kernel.
The 2.6.29.4 log is much larger, since some other badness happened there
- there is a large kernel trace in there as my first X hibernation
attempt failed and came back to X after a few seconds. The system seemed
functional, it did not keep generating kernel messages - when I then
retried a hibernate it worked, along with the resume. Another unrelated
bug perhaps?
As for "hard resetting link" messages, they seemed to always happen
under X the times I tried it.
Kernel EXT4-errors? Console:ata1 reset? Console:ata2-reset? X:ata1 reset? X:ata2 reset?
2.6.28.10 No no yes yes no
2.6.29.4* No no no no no
2.6.29.4** No - - yes no
2.6.30-rc6 Yes - - yes no
2.6.30-rc6 No no no yes no
* Xorg hibernation attempt failed.
* Xorg Second hibernation attempt (no extra reboot)
I also did a side by side comparison of the messages I have for
2.6.30-rc6, the one with EXT4 errors I reported on yesterday, and
another one that worked just fine tonight. I stripped all time-stamps
and some pulseaudio messages from the bad one and attached them here,
and also saved the full messages for each kernel to
http://bugzilla.kernel.org/show_bug.cgi?id=13017 .
Since analysing the code-path is still a bit beyond me, I'll leave you
with a little summary of the differences I notice.
A = 2.6.30-rc6 (EXT4 clean)
B = 2.6.30-rc6 (EXT4 errors triggered)
Duplicate PHY events are likely to be dependent on timing and
non-deterministic. The ext4 corrupting or not depends on whether a
request with failfast set was in-flight at the time of the second PHY
event, which again is dependent on timing. At any rate, this looks
like a problem of ext4 (or something between ext4 and the driver). It
either shouldn't issue failfast command or should take appropriate
recovery action if it does. It would be really nice if you can give a
shot at ext3.
Urgh. My root file-system is mounted with extents on, I would have to
re-install entirely.

I'm wondering why no one else is complaining, or whether the problem is
limited to ICH9M/M-E controllers with EXT4 or a certain type of
hard-drive. The laptop is a Lenovo W500 (fairly similar to T500), so
maybe not a lot of people with this type of controller is using EXT4 yet.

Anyhow, I think Theodore may have ruled this out as a EXT4 problem
already (I first copied him) so I'm not sure what to do now, it will
take some strong will (and even more time) for me to re-install EXT3. I
just shouldn't have to, dammit. :-p

Regards,
Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-05-26 06:00:28 UTC
Permalink
Post by Niel Lambrechts
Urgh. My root file-system is mounted with extents on, I would have to
re-install entirely.
I'm wondering why no one else is complaining, or whether the problem is
limited to ICH9M/M-E controllers with EXT4 or a certain type of
hard-drive. The laptop is a Lenovo W500 (fairly similar to T500), so
maybe not a lot of people with this type of controller is using EXT4 yet.
The double PHY events are not too rare but it's not too frequent
either. The number of people using ext4 on machines which have such
issues and are regularly suspending and resuming might not be too
high.
Post by Niel Lambrechts
Anyhow, I think Theodore may have ruled this out as a EXT4 problem
already (I first copied him) so I'm not sure what to do now, it will
take some strong will (and even more time) for me to re-install EXT3. I
just shouldn't have to, dammit. :-p
Well, that is a pretty easy way to rule out many possibilities from my
point of view. :-p If it's difficult, we can try to find out where the
flag is being set with debug patches.

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Niel Lambrechts
2009-05-26 06:20:12 UTC
Permalink
Post by Tejun Heo
Post by Niel Lambrechts
Urgh. My root file-system is mounted with extents on, I would have to
re-install entirely.
I'm wondering why no one else is complaining, or whether the problem is
limited to ICH9M/M-E controllers with EXT4 or a certain type of
hard-drive. The laptop is a Lenovo W500 (fairly similar to T500), so
maybe not a lot of people with this type of controller is using EXT4 yet.
The double PHY events are not too rare but it's not too frequent
either. The number of people using ext4 on machines which have such
issues and are regularly suspending and resuming might not be too
high.
Post by Niel Lambrechts
Anyhow, I think Theodore may have ruled this out as a EXT4 problem
already (I first copied him) so I'm not sure what to do now, it will
take some strong will (and even more time) for me to re-install EXT3. I
just shouldn't have to, dammit. :-p
Well, that is a pretty easy way to rule out many possibilities from my
point of view. :-p If it's difficult, we can try to find out where the
flag is being set with debug patches.
If you send some patches I'll make every effort to test, it beats having
to re-install, my installation is just too customized. :)

Regards,
Niel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-05-26 13:40:23 UTC
Permalink
Post by Niel Lambrechts
If you send some patches I'll make every effort to test, it beats having
to re-install, my installation is just too customized. :)
First, let's make sure we aren't balking up the wrong tree. Can you
please apply the attached patch and report the kernel log?

Thanks.
--
tejun
Tejun Heo
2009-05-27 00:10:08 UTC
Permalink
Post by Niel Lambrechts
Post by Tejun Heo
Post by Niel Lambrechts
If you send some patches I'll make every effort to test, it beats having
to re-install, my installation is just too customized. :)
First, let's make sure we aren't balking up the wrong tree. Can you
please apply the attached patch and report the kernel log?
Hi Tejun,
Okay it took 5 attempts, some of during which I played audio, did 'find
/' etc. but I still do not have a clue whether the extra activity
helped trigger it or not.
Thanks for testing.

XXX scmd->result=0x8000002 ff_t=4 ff_dev=2 ff_drv=8
XXX DID_OK
XXX CHECK_CONDITION, returning ff_dev
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
09 b8 71 d1
sd 0:0:0:0: [sda] Add. Sense: No additional sense information
end_request: I/O error, dev sda, sector 242190447

The above is the offending failure and all three failfast bits are
set. This corresponds to the following ATA exception.

ata1.00: cmd 60/08:18:6f:88:6f/01:00:0e:00:00/40 tag 3 ncq 135168 in
res 50/00:40:d1:71:b8/00:00:09:00:00/40 Emask 0x10 (ATA bus error)

It's 33 page long read command. Looking at the code the only way all
three fastfail bits can be set seems to be if the request is readahead
- the first part of block/blk-core.c::init_request_from_bio(). Now,
the failure of a readahead request isn't supposed to cause any
problem. If it fails, well, it fails and things should go on as if
nothing happened.

Can you please try the attached patch? It takes suspend/resume cycle
out of the equation and simply induces artificial failure to readahead
requests. It's currently set to fail every 40th readahead. Feel free
to adjust the frequency as you see fit. catting files into /dev/null
would trigger readahead to kick in. Can you reproduce filesystem
failure with this alone?

Thanks.
--
tejun
Niel Lambrechts
2009-06-01 19:00:22 UTC
Permalink
Post by Tejun Heo
The above is the offending failure and all three failfast bits are
set. This corresponds to the following ATA exception.
<snip>
Post by Tejun Heo
Can you please try the attached patch? It takes suspend/resume cycle
out of the equation and simply induces artificial failure to readahead
requests. It's currently set to fail every 40th readahead. Feel free
to adjust the frequency as you see fit. catting files into /dev/null
would trigger readahead to kick in. Can you reproduce filesystem
failure with this alone?
No corruption, I tried cat on an entire directory of mp3s, then also
started X, but there the messages debug output got fairly drastic so I
didn't care to continue for an entire day.
It did trigger the "XXX ...failing readahead" message nearly 300
times, and I also did a s2disk and resume cycle in there - so I hope
this is enough for us to conclude that it is not the cause.
Hi Tejun,

Did you perhaps have any time to look into my feedback around the
readahead patch?

From my side, I tried on Saturday to bisect this problem again, doing
5-8 hibernates per each bisect from 2.6.28. I stopped at 2.6.30-rc2 due
time (or fatigue), and did not manage to replicate the problem at all
which is strange since I was playing audio, doing finds and even doing
an entire dd of the root partition. I saved the bisect logs so perhaps I
can continue to see if the problem becomes more prevalent in later
versions - the first time it ever happened to me was somewhere in 2.6.29
originally.

The other interesting thing was to see that "hard resetting link"
messages seem to first start appearing at v2.6.29-rc7 or perhaps rc8. Is
it worth trying to track down the commit that lead to this?

Do you have any other debug patches to try, or should I try to delve
deeper into finding commits that can be reverted? I'm running out of
ideas, I even tried to find later firmware for my drive, but I seem to
be on the latest level.

Thanks for the help so far!

Niel


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-06-03 03:20:15 UTC
Permalink
Hello, Niel.
Post by Niel Lambrechts
Did you perhaps have any time to look into my feedback around the
readahead patch?
Yeah, I've been thinking about it and am a bit out of ideas, so no
immediate follow-up.
Post by Niel Lambrechts
Post by Niel Lambrechts
From my side, I tried on Saturday to bisect this problem again, doing
5-8 hibernates per each bisect from 2.6.28. I stopped at 2.6.30-rc2 due
time (or fatigue), and did not manage to replicate the problem at all
which is strange since I was playing audio, doing finds and even doing
an entire dd of the root partition. I saved the bisect logs so perhaps I
can continue to see if the problem becomes more prevalent in later
versions - the first time it ever happened to me was somewhere in 2.6.29
originally.
The other interesting thing was to see that "hard resetting link"
messages seem to first start appearing at v2.6.29-rc7 or perhaps rc8. Is
it worth trying to track down the commit that lead to this?
Do you have any other debug patches to try, or should I try to delve
deeper into finding commits that can be reverted? I'm running out of
ideas, I even tried to find later firmware for my drive, but I seem to
be on the latest level.
Given the non-deterministic nature of the failure, I think bisection
would be quite difficult. I think the best way to diagnose the
problem is to track down the owner of the failed request. It has
FAILFAST set and yet the issuer fails to deal with an error condition
which is expected when FAILFAST is set. I'll prep a patch to track
request / bio failure.

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-06-03 04:30:15 UTC
Permalink
Hello,

Can you please do the followings?

1. Apply the attached patch, build & boot

2. Trigger the problem and record dmesg

3. On failed IO, the kernel will print the address of bi_endio. Run
"nm -n" on the vmlinux in the kernel build root and look up which
function it is and post the dmesg and function name.

Thanks.
--
tejun
Pavel Machek
2009-06-19 15:10:07 UTC
Permalink
Morning Tejun,
Post by Tejun Heo
Hello,
Can you please do the followings?
1. Apply the attached patch, build & boot
I chose 2.6.30-rc7...
Post by Tejun Heo
2. Trigger the problem and record dmesg
It took 3 days and quite a few hibernate attempts ... :-)
Post by Tejun Heo
3. On failed IO, the kernel will print the address of bi_endio. Run
"nm -n" on the vmlinux in the kernel build root and look up which
function it is and post the dmesg and function name.
I did not have that specific vmlinux.o file any more, but
c01a49fd t end_bio_bh_io_sync
Hope this is sufficient to help you. Sorry if this is silly - being so
inexperienced with the kernel - but I wondered if or why a dump_stack()
in that debug patch would not be helpful?
I guess dump_stack would be very helpful :-).
Regards,
Niel
Jun 5 09:26:12 linux-7vph -- MARK --
Jun 5 09:26:12 linux-7vph syslog-ng[24615]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=3777', processed='center(received)=3142', processed='destination(newsnotice)=0', processed='destination(acpid)=13', processed='destination(firewall)=18', processed='destination(null)=208', processed='destination(mail)=0', processed='destination(mailinfo)=0', processed='destination(console)=200', processed='destination(newserr)=0', processed='destination(newscrit)=0', processed='destination(messages)=1909', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(netmgm)=994', processed='destination(mailerr)=0', processed='destination(xconsole)=200', processed='destination(warn)=235', processed='source(src)=3142'
Jun 5 09:26:11 linux-7vph kernel: Syncing filesystems ... done.
Jun 5 09:26:12 linux-7vph kernel: Freezing user space processes ... (elapsed 0.00 seconds) done.
Jun 5 09:26:12 linux-7vph kernel: Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Jun 5 09:26:12 linux-7vph kernel: PM: Shrinking memory... done (149507 pages freed)
Jun 5 09:26:12 linux-7vph kernel: PM: Freed 598028 kbytes in 6.49 seconds (92.14 MB/s)
Jun 5 09:26:12 linux-7vph kernel: Suspending console(s) (use no_console_suspend to debug)
Jun 5 09:26:12 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jun 5 09:26:12 linux-7vph kernel: ACPI handle has no context!
Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
Jun 5 09:26:12 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B disabled
Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
Jun 5 09:26:13 linux-7vph kernel: ACPI: Preparing to enter system sleep state S4
Jun 5 09:26:13 linux-7vph kernel: PM: Saving platform NVS memory
Jun 5 09:26:13 linux-7vph kernel: Disabling non-boot CPUs ...
Jun 5 09:26:13 linux-7vph kernel: CPU 1 is now offline
Jun 5 09:26:13 linux-7vph kernel: SMP alternatives: switching to UP code
Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Jun 5 09:26:13 linux-7vph kernel: CPU1 attaching NULL sched-domain.
Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Jun 5 09:26:13 linux-7vph kernel: CPU1 is down
Jun 5 09:26:13 linux-7vph kernel: Extended CMOS year: 2000
Jun 5 09:26:13 linux-7vph kernel: PM: Need to copy 223773 pages
Jun 5 09:26:13 linux-7vph kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Jun 5 09:26:13 linux-7vph kernel: Intel machine check architecture supported.
Jun 5 09:26:13 linux-7vph kernel: Intel machine check reporting enabled on CPU#0.
Jun 5 09:26:13 linux-7vph kernel: PM: Restoring platform NVS memory
Jun 5 09:26:13 linux-7vph kernel: Extended CMOS year: 2000
Jun 5 09:26:13 linux-7vph kernel: Enabling non-boot CPUs ...
Jun 5 09:26:13 linux-7vph kernel: SMP alternatives: switching to SMP code
Jun 5 09:26:13 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
Jun 5 09:26:13 linux-7vph kernel: Initializing CPU#1
Jun 5 09:26:13 linux-7vph kernel: Calibrating delay using timer specific routine.. 5134.67 BogoMIPS (lpj=10269343)
Jun 5 09:26:13 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Jun 5 09:26:13 linux-7vph kernel: CPU: L2 cache: 6144K
Jun 5 09:26:13 linux-7vph kernel: CPU: Physical Processor ID: 0
Jun 5 09:26:13 linux-7vph kernel: CPU: Processor Core ID: 1
Jun 5 09:26:13 linux-7vph kernel: Intel machine check architecture supported.
Jun 5 09:26:13 linux-7vph kernel: Intel machine check reporting enabled on CPU#1.
Jun 5 09:26:13 linux-7vph kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
Jun 5 09:26:13 linux-7vph kernel: Switched to high resolution mode on CPU 1
Jun 5 09:26:13 linux-7vph kernel: domain 0: span 0-1 level MC
Jun 5 09:26:13 linux-7vph kernel: groups: 0 1
Jun 5 09:26:13 linux-7vph kernel: domain 1: span 0-1 level CPU
Jun 5 09:26:13 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
Jun 5 09:26:13 linux-7vph kernel: domain 0: span 0-1 level MC
Jun 5 09:26:13 linux-7vph kernel: groups: 1 0
Jun 5 09:26:13 linux-7vph kernel: domain 1: span 0-1 level CPU
Jun 5 09:26:13 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
Jun 5 09:26:13 linux-7vph kernel: microcode: failed to init CPU1
Jun 5 09:26:13 linux-7vph kernel: CPU1 is up
Jun 5 09:26:13 linux-7vph kernel: ACPI: Waking up from system sleep state S4
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900007, writing 0x900403)
Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state changed by ACPI to D0
Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state changed by ACPI to D0
Jun 5 09:26:13 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100102)
Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
Jun 5 09:26:13 linux-7vph kernel: ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
Jun 5 09:26:13 linux-7vph kernel: iwlagn 0000:03:00.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100506)
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: PME# disabled
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.1: PME# disabled
Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:03.0: PME# disabled
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: pci_enable_pcie_error_reporting failed 0xfffffffb
Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D -> GSI 23 (level, low) -> IRQ 23
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Jun 5 09:26:14 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 19 (level, low) -> IRQ 19
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:1e.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ahci 0000:00:1f.2: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Jun 5 09:26:14 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT contexts=[4/4]
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.2: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.3: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.4: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.5: PME# disabled
Jun 5 09:26:14 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
Jun 5 09:26:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 5 09:26:14 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Jun 5 09:26:14 linux-7vph kernel: ata1.00: configured for UDMA/133
Jun 5 09:26:14 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Jun 5 09:26:14 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status changed
Jun 5 09:26:14 linux-7vph kernel: ata1.00: configured for UDMA/133
Jun 5 09:26:14 linux-7vph kernel: ata1: EH complete
Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900407, writing 0x900403)
Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
Jun 5 09:26:14 linux-7vph kernel: ata2.00: configured for UDMA/133
Jun 5 09:26:14 linux-7vph kernel: Restarting tasks ... <3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
Jun 5 09:26:14 linux-7vph kernel: ata2: irq_stat 0x40000001
Jun 5 09:26:14 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct 0x1ff SErr 0x50000 action 0xe frozen
Jun 5 09:26:14 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY changed
Jun 5 09:26:14 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:00:5f:06:44/00:00:0c:00:00/40 tag 0 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/58:08:6f:06:44/00:00:0c:00:00/40 tag 1 ncq 45056 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:10:d7:06:44/00:00:0c:00:00/40 tag 2 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:18:07:07:44/00:00:0c:00:00/40 tag 3 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/18:20:17:07:44/00:00:0c:00:00/40 tag 4 ncq 12288 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:28:67:88:70/00:00:0e:00:00/40 tag 5 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:30:17:70:14/00:00:0d:00:00/40 tag 6 ncq 4096 in
Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/d8:38:5f:89:6f/00:00:0e:00:00/40 tag 7 ncq 110592 in
Jun 5 09:26:15 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:15 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:15 linux-7vph kernel: ata1.00: cmd 60/28:40:3f:8a:6f/00:00:0e:00:00/40 tag 8 ncq 20480 in
Jun 5 09:26:15 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
Jun 5 09:26:15 linux-7vph kernel: ata1.00: status: { DRDY }
Jun 5 09:26:15 linux-7vph kernel: ata1: hard resetting link
Jun 5 09:26:15 linux-7vph kernel: ata2.00: configured for UDMA/133
Jun 5 09:26:15 linux-7vph kernel: ata2: EH complete
Jun 5 09:26:15 linux-7vph kernel: done.
Jun 5 09:26:15 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
Jun 5 09:26:16 linux-7vph kernel: ata1.00: configured for UDMA/133
Jun 5 09:26:16 linux-7vph kernel: ata1.00: configured for UDMA/133
Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Jun 5 09:26:16 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 5 09:26:16 linux-7vph kernel: 0e 6f 8a 3f
Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Jun 5 09:26:16 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190687
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio e1223140 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=e1223140 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa0c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa0c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa140 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa140 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa1c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa1c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa240 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa240 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa2c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa2c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa340 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa340 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa3c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa3c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa440 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa440 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa4c0 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa4c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa540 with -5
Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa540 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa5c0 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa5c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa640 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa640 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa6c0 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa6c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa740 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa740 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa7c0 with -5
Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa7c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa840 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa840 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa8c0 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa8c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa940 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa940 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa9c0 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa9c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caaa40 with -5
Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caaa40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caaac0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaac0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caab40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caab40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caabc0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caabc0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caac40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caac40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caacc0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caacc0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caad40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caad40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Jun 5 09:26:22 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 5 09:26:22 linux-7vph kernel: 0e 6f 8a 3f
Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
Jun 5 09:26:22 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190911
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caadc0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caadc0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caae40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caae40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caaec0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaec0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caaf40 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaf40 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio e12230c0 with -5
Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=e12230c0 error=-5 bi_end_io=c01a49fd
Jun 5 09:26:22 linux-7vph kernel: ata1: EH complete
Jun 5 09:26:22 linux-7vph kernel: EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346994, block=9404448
Jun 5 09:26:22 linux-7vph kernel: EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-05-26 05:00:15 UTC
Permalink
Post by Alan Cox
Post by Tejun Heo
something to the cdrom which is attached to ata2. Something very
fishy is going on there. Sounds like an electric or some sort of
interference problem but I'm not sure. :-(
I'm just wondering what happens on a resume if a phy change event is seen
by the phy before or during the time registers (such as the port mappings
on an ICH) are being restored (or not restored even). That seems to be one
way you could get an event on the wrong port.
It doesn't really matter. Before the ports are resumed, all such
events are ignored.

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Berthold Gunreben
2009-09-18 21:11:10 UTC
Permalink
Eh... if my analysis is correct the bug will be dependent on some
non-deterministic things including timing and layout of blocks of the
file which the filesystem touchs during suspend/resume. I'm fairly
sure the problem is still there. Oh well, it seems I'll have to
construct a synthetic case to test it. Anyways, please let me know if
you encounter the problem again. I'll keep you in loop with further
patches.
I am not quite sure if I am at the right place here, however, I get very
similar problems with a totally different setup. What I do is the following:

I added a fourth disk to a software raid5 array and did setup the raid
completely from scratch (the same disks have been running for about
1.5 years without any problems before, the difference is that previously
one of the disks was setup as hot spare).

After the software raid was in sync, I started to copy my data back to
the raid, and after a less than 5 minutes time, one of the disks failed
(from /var/log/warn):

Sep 18 22:04:02 Bacchus kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Sep 18 22:04:02 Bacchus kernel: ata3.00: irq_stat 0x00400000, PHY RDY changed
Sep 18 22:04:02 Bacchus kernel: ata3: SError: { PHYRdyChg }
Sep 18 22:04:02 Bacchus kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 18 22:04:02 Bacchus kernel: res 40/00:0c:3f:6c:d5/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
Sep 18 22:04:02 Bacchus kernel: ata3.00: status: { DRDY }
Sep 18 22:04:02 Bacchus kernel: ata3: hard resetting link
Sep 18 22:04:07 Bacchus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 18 22:04:12 Bacchus kernel: ata3.00: qc timeout (cmd 0xec)
Sep 18 22:04:12 Bacchus kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Sep 18 22:04:12 Bacchus kernel: ata3.00: revalidation failed (errno=-5)
Sep 18 22:04:12 Bacchus kernel: ata3: hard resetting link
Sep 18 22:04:12 Bacchus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 18 22:04:12 Bacchus kernel: ata3.00: configured for UDMA/133
Sep 18 22:04:12 Bacchus kernel: ata3: EH complete
Sep 18 22:04:12 Bacchus kernel: end_request: I/O error, dev sdc, sector 1953519853
Sep 18 22:04:12 Bacchus kernel: md: super_written gets error=-5, uptodate=0
Sep 18 22:04:12 Bacchus kernel: raid5: Disk failure on sdc1, disabling device.
Sep 18 22:04:12 Bacchus kernel: raid5: Operation continuing on 3 devices.
Sep 18 22:04:13 Bacchus kernel: RAID5 conf printout:
Sep 18 22:04:13 Bacchus kernel: --- rd:4 wd:3
Sep 18 22:04:13 Bacchus kernel: disk 0, o:1, dev:sda1
Sep 18 22:04:13 Bacchus kernel: disk 1, o:1, dev:sdb1
Sep 18 22:04:13 Bacchus kernel: disk 2, o:0, dev:sdc1
Sep 18 22:04:13 Bacchus kernel: disk 3, o:1, dev:sdd1
Sep 18 22:04:13 Bacchus kernel: RAID5 conf printout:
Sep 18 22:04:13 Bacchus kernel: --- rd:4 wd:3
Sep 18 22:04:13 Bacchus kernel: disk 0, o:1, dev:sda1
Sep 18 22:04:13 Bacchus kernel: disk 1, o:1, dev:sdb1
Sep 18 22:04:13 Bacchus kernel: disk 3, o:1, dev:sdd1

The filesystem on /dev/md0 was XFS and there was no suspend/resume applied.
The raid is built with four 1TB disks, attached to a Intel ICH7 Controller:
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI Controller (rev 01)

The system is a openSUSE11.1 with all patches applied. When I first
encountered this problem I had the original openSUSE kernel (2.6.27.29)
running, but I thought it was a good idea to go with a current upstream kernel
instead. The difference was, that with the openSUSE kernel, I could copy
several 100 GB before I lost the disk, while with the upstream kernel
(2.6.31 from ftp.kernel.org) I lost the disk after less than 4 GB. The result
however is the same.

smart does not tell me any problems with the disks, and I did not encounter
any problems during the build of the raid5 which took about 6 hours on
this machine. Thus I don't believe that there is a hardware problem.

Maybe you have an idea what I can do to provide more information.

Thanks in advance

Berthold


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-09-25 04:20:04 UTC
Permalink
Hello, Berthold.
Post by Berthold Gunreben
I am not quite sure if I am at the right place here, however, I get very
I added a fourth disk to a software raid5 array and did setup the raid
completely from scratch (the same disks have been running for about
1.5 years without any problems before, the difference is that previously
one of the disks was setup as hot spare).
After the software raid was in sync, I started to copy my data back to
the raid, and after a less than 5 minutes time, one of the disks failed
Sep 18 22:04:02 Bacchus kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Sep 18 22:04:02 Bacchus kernel: ata3.00: irq_stat 0x00400000, PHY RDY changed
Sep 18 22:04:02 Bacchus kernel: ata3: SError: { PHYRdyChg }
Sep 18 22:04:02 Bacchus kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 18 22:04:02 Bacchus kernel: res 40/00:0c:3f:6c:d5/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
Sep 18 22:04:02 Bacchus kernel: ata3.00: status: { DRDY }
The disk is most likely losing power briefly. After boot, run
"smartctl -a" on the device and record the output. After triggering
the problem, do it again. See if Start_Stop_Count, Power_Cycle_Count
or Power-Off_Retract_Count has increased. If so, take out your PSU,
bury it half-deep in your backyard, apply some gasoline, light it up
and enjoy the sight of perishing evil with a can of beer.

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Berthold Gunreben
2009-09-30 10:00:19 UTC
Permalink
Hi Tejun,

thanks a lot for your reply.
Post by Tejun Heo
Hello, Berthold.
The disk is most likely losing power briefly. After boot, run
"smartctl -a" on the device and record the output. After triggering
the problem, do it again. See if Start_Stop_Count, Power_Cycle_Count
or Power-Off_Retract_Count has increased. If so, take out your PSU,
bury it half-deep in your backyard, apply some gasoline, light it up
and enjoy the sight of perishing evil with a can of beer.
You might be right. However, I cannot reproduce the problem anymore, since I
switched to the totally unsupported JFS as filesystem.

In the meantime, I was able to copy 1.5TB of data back to the array, and the
system also survived artificially generated high load. If the problem is a
race (which I do not know), it might still be there. Obviously, it does not
show up as often again.

It could still be the power supply of course, but I don't understand why a new
kernel would trigger power outages so often (current kernels triggered the
problem latest after 5 minutes). Maybe it has something to do with the
chipset (ICH7R) which is capable of hot remove/add disks. Or it is related to
the hotswap harddisk slots in the case
(http://www.chenbro.eu/corporatesite/products_detail.php?sku=79 ). I have no
idea....

Thanks

Berthold

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Tejun Heo
2009-09-30 10:40:05 UTC
Permalink
Hello,
Post by Berthold Gunreben
It could still be the power supply of course, but I don't understand
why a new kernel would trigger power outages so often (current
kernels triggered the problem latest after 5 minutes).
Probably because older kernels didn't handle barrier properly for
raid. Changing fs can definitely affect things but if I were you I
wouldn't be trusting any valuable data on that configuration.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Continue reading on narkive:
Loading...