Topic: ata4: softreset failed. ata4.00 disabled (SATA devices not working sometimes)

FlyingBlackbird

  • Full Member
  • ***
  • Posts: 102
  • Karma: +3/-0
    • View Profile
From time to time during the boot phase my SATA devices (HDD and BD reader) are disabled due to initialisation problems:

Code:
[    0.990587] ata4: SATA max UDMA/133 abar m2048@0x600c100000000 port 0x600c100000280 irq 30
[    1.487797] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.491980] ata4.00: ATAPI: ASUS    BW-16D1HT, 3.10, max UDMA/133
[    1.499775] ata4.00: configured for UDMA/133
[   97.577022] ata4: softreset failed (1st FIS failed)
[  147.576656] ata4: reset failed, giving up
[  147.576658] ata4.00: disabled

The NVMe SSD is still working (always).

Rebooting does not help, and sometimes even a cold start does not help (so the problem smells like a boot-time Linux problem).

After this error, petitboot also no longer recognizes the SATA devices (even when I choose the menu item "rescan devices").

I could reproduce this problem with Ubuntu Server 19.10 (kernel 5.3.x) as well as with Fedora 31 with a newer kernel (5.4.x).

What is the reason for that and how can I fix this?

BTW: There is a wiki entry at voidlinux, but it does not explain the background (reasons + impact):

https://wiki.voidlinux.org/Frequently_Asked_Questions#How_to_get_rid_of_.22ataN:_softreset_failed_.28device_not_ready.29.22_.3F
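
Since the problem is intermittent, it may help to scan saved dmesg output mechanically after each boot. The following is a minimal sketch, not an established tool: the function name `find_ata_failures` and the list of failure markers are assumptions derived only from the excerpt in this post, and other kernels may word these messages differently.

```python
import re
from collections import defaultdict

# Matches lines such as "[   97.577022] ata4: softreset failed (1st FIS failed)"
# and "[  147.576658] ata4.00: disabled".
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<id>ata\d+(?:\.\d+)?):\s+(?P<msg>.*)$"
)

# Assumption: message fragments copied from the excerpt in this thread.
FAILURE_MARKERS = ("softreset failed", "reset failed, giving up", "disabled")


def find_ata_failures(dmesg_text):
    """Return {ata_id: [(timestamp, message), ...]} for failure-looking lines."""
    failures = defaultdict(list)
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        msg = m.group("msg")
        if any(marker in msg for marker in FAILURE_MARKERS):
            failures[m.group("id")].append((float(m.group("ts")), msg))
    return dict(failures)


# Lines taken from the dmesg excerpt in this post.
SAMPLE = """\
[    1.487797] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.499775] ata4.00: configured for UDMA/133
[   97.577022] ata4: softreset failed (1st FIS failed)
[  147.576656] ata4: reset failed, giving up
[  147.576658] ata4.00: disabled
"""

if __name__ == "__main__":
    for ata_id, events in find_ata_failures(SAMPLE).items():
        for ts, msg in events:
            print(f"{ata_id} @ {ts:.6f}s: {msg}")
```

Running this over the output of every boot (for example, saved with `dmesg > boot.log`) would give a simple record of which boots lose the port and how long after link-up the reset fails.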

q66

  • Guest
there is some issue with the SATA controller doing this when an optical drive is connected to SATA together with other devices (it will negotiate lower and lower speeds across kexecs until you get nothing). It requires a full reboot (i.e. either shut down and boot again, or disable fast-reboot, which will mitigate the problem by reinitializing everything on every boot). Or disconnect the optical drive.
« Last Edit: February 01, 2020, 03:57:13 pm by q66 »
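
One way to watch for the link speed dropping across reboots is to decode the SStatus value that libata prints in the "SATA link up" line (hexadecimal, e.g. `SStatus 113`). The sketch below assumes the standard SStatus field layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8); the function name `decode_sstatus` is made up for illustration.

```python
# Field meanings follow the SATA SStatus register layout:
# DET = bits 3:0, SPD = bits 7:4, IPM = bits 11:8.
DET_NAMES = {
    0: "no device, no PHY link",
    1: "device present, PHY link not established",
    3: "device present, PHY link established",
    4: "PHY offline",
}
SPD_NAMES = {
    0: "no speed negotiated",
    1: "Gen1 (1.5 Gbps)",
    2: "Gen2 (3.0 Gbps)",
    3: "Gen3 (6.0 Gbps)",
}
IPM_NAMES = {
    0: "not present / no link",
    1: "active",
    2: "partial",
    6: "slumber",
}


def decode_sstatus(hex_text):
    """Split an SStatus value, given as the hex string from dmesg, into fields."""
    value = int(hex_text, 16)
    det = value & 0xF
    spd = (value >> 4) & 0xF
    ipm = (value >> 8) & 0xF
    return {
        "det": det,
        "spd": spd,
        "ipm": ipm,
        "det_name": DET_NAMES.get(det, "reserved"),
        "spd_name": SPD_NAMES.get(spd, "reserved"),
        "ipm_name": IPM_NAMES.get(ipm, "reserved"),
    }


if __name__ == "__main__":
    # "113" is the value from the dmesg excerpt in the first post.
    print(decode_sstatus("113"))
```

For the value in the first post, the SPD field is 1, which matches the "1.5 Gbps" that the kernel prints on the same line.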

FlyingBlackbird

THX for the background :-)

So does this mean that if I add libata.force=norst to the kernel boot parameters, the SATA controller will negotiate correctly (or does this only suppress the dmesg output ;-) ?

q66

yeah, i don't think it will help.

running:

nvram -p ibm,skiboot --update-config fast-reset=0

as root on your OS (need powerpc-utils installed) will make your reboots slower, but it should mitigate the problem.

FlyingBlackbird

> nvram -p ibm,skiboot --update-config fast-reset=0
> as root on your OS (need powerpc-utils installed) will make your reboots slower, but it should mitigate the problem.

I have already tried this but with no reliable success - even cold starts don't help sometimes.

I suspect some kind of non-deterministic "race condition" (very difficult to diagnose from the logs; ideas welcome, since I can reproduce the problem quite often ;-)

MPC7500

  • Hero Member
  • *****
  • Posts: 588
  • Karma: +41/-1
    • View Profile
    • Twitter
Have you tried disconnecting the BD drive to see whether the problem disappears or persists?

FlyingBlackbird

> Have you tried disconnecting the BD drive to see whether the problem disappears or persists?

Not yet; I am trying to diagnose the problem by changing only one small thing at a time (step by step). But I am fairly sure it works if no NVMe SSD is attached, so it looks like a hardware incompatibility.

I have now applied the libata.force=norst kernel parameter to disable hard and soft resets, and the optical drive is currently working (but the problem is non-deterministic, so I have to wait until the error occurs again).

I have already unplugged the attached Seagate SATA HDD so I would exclude this device from the suspects.

Next step would be to change the cable, then disconnect (no more SATA devices)...

BTW The kernel.org doc for kernel parameters explains the options for libata.force quite well:

https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html

Quote
libata.force=   [LIBATA] Force configurations.  The format is comma separated list of "[ID:]VAL" where ID is
                        PORT[.DEVICE].  PORT and DEVICE are decimal numbers matching port, link or device.  Basically, it matches
                        the ATA ID string printed on console by libata.  If the whole ID part is omitted, the last PORT and DEVICE
                        values are used.  If ID hasn't been specified yet, the configuration applies to all ports, links and devices.

                        If only DEVICE is omitted, the parameter applies to the port and all links and devices behind it.  DEVICE
                        number of 0 either selects the first device or the first fan-out link behind PMP device.  It does not
                        select the host link.  DEVICE number of 15 selects the host link and device attached to it.

                        The VAL specifies the configuration to force.  As long as there's no ambiguity shortcut notation is allowed.
                        For example, both 1.5 and 1.5G would work for 1.5Gbps. The following configurations can be forced.

                        * Cable type: 40c, 80c, short40c, unk, ign or sata.  Any ID with matching PORT is used.

                        * SATA link speed limit: 1.5Gbps or 3.0Gbps.

                        * Transfer mode: pio[0-7], mwdma[0-4] and udma[0-7]. udma[/][16,25,33,44,66,100,133] notation is also allowed.

                        * [no]ncq: Turn on or off NCQ.

                        * [no]ncqtrim: Turn off queued DSM TRIM.

                        * nohrst, nosrst, norst: suppress hard, soft and both resets.

                        * rstonce: only attempt one reset during hot-unplug link recovery

                        * dump_id: dump IDENTIFY data.

                        * atapi_dmadir: Enable ATAPI DMADIR bridge support

                        * disable: Disable this device.

                        If there are multiple matching configurations changing the same attribute, the last one is used.
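
To make the "[ID:]VAL" format quoted above more concrete, here is a minimal sketch of how such a parameter string could be split into entries. It only follows the rules stated in the quoted documentation (comma-separated items, optional PORT[.DEVICE] prefix, an omitted ID inherits the last one, no ID yet means "all"); it is not the kernel's actual parser, and the names `ForceEntry` and `parse_libata_force` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class ForceEntry(NamedTuple):
    port: Optional[int]    # None means "all ports"
    device: Optional[int]  # None means "everything behind the port"
    value: str             # e.g. "norst" or "1.5Gbps"


def parse_libata_force(param: str) -> List[ForceEntry]:
    """Split a libata.force string into (port, device, value) entries."""
    entries = []
    port = None
    device = None
    for item in param.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" in item:
            # An explicit ID replaces the remembered PORT and DEVICE.
            id_part, value = item.split(":", 1)
            port_text, _, device_text = id_part.partition(".")
            port = int(port_text)
            device = int(device_text) if device_text else None
        else:
            # No ID: reuse the last PORT/DEVICE (or "all" if none was given yet).
            value = item
        entries.append(ForceEntry(port, device, value))
    return entries


if __name__ == "__main__":
    print(parse_libata_force("norst"))
    print(parse_libata_force("4.00:norst,1.5Gbps"))
```

Under these rules, a bare `libata.force=norst` applies to every port, while `libata.force=4.00:norst` would restrict it to the device that dmesg reports as ata4.00.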

FlyingBlackbird

> Next step would be to change the cable, then disconnect (no more SATA devices)...

OK, short update: libata.force=norst does not solve the problem; it only changes the error message:

Code:
[6.397240] ata4.00 failed to IDENTIFY (I/O error, err_mask=0x4)
err_mask=0x4 should mean "timeout", AFAIR.
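
The err_mask value is a bit field, so it can be decoded mechanically. The sketch below uses the AC_ERR_* bit assignments as I recall them from the kernel's include/linux/libata.h; treat the table as an assumption and check it against the header of the kernel version in use. The function name `decode_err_mask` is made up for illustration.

```python
# Assumption: bit positions as defined by the AC_ERR_* enum in
# include/linux/libata.h; verify against your kernel source.
AC_ERR_FLAGS = [
    (1 << 0, "AC_ERR_DEV"),         # device reported error
    (1 << 1, "AC_ERR_HSM"),         # host state machine violation
    (1 << 2, "AC_ERR_TIMEOUT"),     # timeout
    (1 << 3, "AC_ERR_MEDIA"),       # media error
    (1 << 4, "AC_ERR_ATA_BUS"),     # ATA bus error
    (1 << 5, "AC_ERR_HOST_BUS"),    # host bus error
    (1 << 6, "AC_ERR_SYSTEM"),      # system error
    (1 << 7, "AC_ERR_INVALID"),     # invalid argument
    (1 << 8, "AC_ERR_OTHER"),       # unknown
    (1 << 9, "AC_ERR_NODEV_HINT"),  # polling device detection hint
    (1 << 10, "AC_ERR_NCQ"),        # marker for offending NCQ command
]


def decode_err_mask(mask):
    """Return the names of all AC_ERR_* bits set in mask."""
    return [name for bit, name in AC_ERR_FLAGS if mask & bit]


if __name__ == "__main__":
    # 0x4 is the value from the IDENTIFY failure in this post.
    print(decode_err_mask(0x4))
```

With this table, 0x4 has only bit 2 set, which is consistent with reading it as a timeout.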

madscientist159

  • Raptor Staff
  • *****
  • Posts: 47
  • Karma: +11/-0
    • View Profile
Interesting thread and issue!

I'd suspect one of the following:
  • Linux driver problem with Marvell controller (would probably be a regression since these used to work quite reliably)
  • Hardware problem -- e.g. PSU sagging slightly under extra load from NVMe drive, causing controller or drive to malfunction

Have you tried (carefully) removing power to the SATA drive when it's in timeout status, and reapplying it, to see if the link comes back up?  Or moving the cable to a different port to see if it's the entire chip that's locked up or just the one port?

FlyingBlackbird

> I'd suspect one of the following:

> Hardware problem -- e.g. PSU sagging slightly under extra load from NVMe drive, causing controller or drive to malfunction

My power supply unit is quite oversized (650 W) and from a well-known brand.
With a SATA HDD alone it seems to work. I will try out the optical drive in another x86 Linux computer with a similar kernel (but without an NVMe SSD available) to see what happens.
Changing the PSU would be my "last resort".

> Have you tried (carefully) removing power to the SATA drive when it's in timeout status, and reapplying it, to see if the link comes back up?

How safe is this when the system is powered on and running (is there a risk of damaging the mainboard)?
And how do I force a new PCI bus scan (does reapplying power to the optical SATA drive trigger a new scan)?

>  Or moving the cable to a different port to see if it's the entire chip that's locked up or just the one port?

That will be my next try; currently I am running with a SATA HDD only and have observed no problems so far.

BTW: If you happen to have hints or a link on how to debug libata/Marvell during Linux boot, I would try that (I know gdb quite well), but I am sure your time is scarce, so it is OK to ignore my wish ;-)


madscientist159

> My power supply unit is quite oversized (650 W) and from a well-known brand.
> With a SATA HDD alone it seems to work. I will try out the optical drive in another x86 Linux computer with a similar kernel (but without an NVMe SSD available) to see what happens.
> Changing the PSU would be my "last resort".

Yes, agreed, but was listing it for completeness.

> How safe is this when the system is powered on and running (is there a risk of damaging the mainboard)?
> And how do I force a new PCI bus scan (does reapplying power to the optical SATA drive trigger a new scan)?

Perfectly safe.  SATA power and data cables are hotpluggable by design (you'll note the long ground pins on each connector, this is specifically for hotplug support).  You don't need to manually rescan anything; the kernel will see the hotplug event automatically and either add the device or fail trying (i.e. you'll probably see something in dmesg regardless of whether this works or not).

> That will be my next try; currently I am running with a SATA HDD only and have observed no problems so far.
>
> BTW: If you happen to have hints or a link on how to debug libata/Marvell during Linux boot, I would try that (I know gdb quite well), but I am sure your time is scarce, so it is OK to ignore my wish ;-)

To be honest, that kind of hardware incompatibility would be very strange indeed.  Any chance you can try a different card in the same PCIe slot to see if the same problem is triggered?

FlyingBlackbird

> To be honest, that kind of hardware incompatibility would be very strange indeed.  Any chance you can try a different card in the same PCIe slot to see if the same problem is triggered?

At the moment I am unable to boot into petitboot or even see it (even with nothing attached except a network cable at net3):

https://forums.raptorcs.com/index.php/topic,49.0.html

So do you mean I should try any type of PCIe card (not necessarily an NVMe adapter card, since I have already used another adapter card, though in the same slot)?