Author Topic: Power9 8 core (v2 with DD2.3): Two of 8 cores are unavailable (offline)

FlyingBlackbird

  • Full Member
  • ***
  • Posts: 102
  • Karma: +3/-0
Two of the 8 cores of my brand-new IBM Power9 8-core CPU appear to be dead. How can I get more diagnostic information? (The cause could be my Blackbird, a defective CPU, or maybe a firmware issue.)

Also: is there a way to re-enable the cores?

Initially it ran for a few days with all 8 cores, but then two cores suddenly disappeared (I did not notice immediately).

Both Ubuntu Server and petitboot show only 6 working cores (according to cat /proc/cpuinfo).

The funny part of this little tragedy: the CPU still works with 6 cores ;-)

Excerpt from my pb-sos msglog file:

Code: [Select]
[   56.324724294,6] CORE[0]: HW_PROC_ID=0 PROC_CHIP_ID=0 EC=0x23 OK
[   56.324726725,6] CORE[0]: PIR=00000004 OK (4 threads)
[   56.324729109,6]     Cache: I=32 D=32/512/10240/0
[   56.324757258,6] CORE[1]: HW_PROC_ID=1 PROC_CHIP_ID=0 EC=0x23 OK
[   56.324759514,6] CORE[1]: PIR=0000000c OK (4 threads)
[   56.324761974,6]     Cache: I=32 D=32/512/10240/0
[   56.324790184,6] CORE[2]: HW_PROC_ID=2 PROC_CHIP_ID=0 EC=0x23 OK
[   56.324792498,6] CORE[2]: PIR=00000014 OK (4 threads)
[   56.324794826,6]     Cache: I=32 D=32/512/10240/0
[   56.324824587,4] CORE[3]: HW_PROC_ID=3 PROC_CHIP_ID=0 EC=0x23 UNAVAILABLE
[   56.324912952,6] CORE[3]: PIR=0000001c UNUSABLE (4 threads)
[   56.324915586,6]     Cache: I=32 D=32/512/10240/0
[   56.324945482,6] CORE[4]: HW_PROC_ID=4 PROC_CHIP_ID=0 EC=0x23 OK
[   56.324947787,6] CORE[4]: PIR=00000024 OK (4 threads)
[   56.324950086,6]     Cache: I=32 D=32/512/10240/0
[   56.324980722,6] CORE[5]: HW_PROC_ID=5 PROC_CHIP_ID=0 EC=0x23 OK
[   56.324983113,6] CORE[5]: PIR=00000028 OK (4 threads)
[   56.324985419,6]     Cache: I=32 D=32/512/10240/0
[   56.325017618,4] CORE[6]: HW_PROC_ID=6 PROC_CHIP_ID=0 EC=0x23 UNAVAILABLE
[   56.325099991,6] CORE[6]: PIR=00000034 UNUSABLE (4 threads)
[   56.325102540,6]     Cache: I=32 D=32/512/10240/0
[   56.325134927,6] CORE[7]: HW_PROC_ID=7 PROC_CHIP_ID=0 EC=0x23 OK
[   56.325137083,6] CORE[7]: PIR=0000003c OK [boot] (4 threads)
[   56.325139792,6]     Cache: I=32 D=32/512/10240/0
[   56.325175402,6] IPLPARAMS: v0x70 Platform family/type: ibm,p9-openbmc/rcs,blackbird
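For longer logs it can help to filter the core status lines mechanically. The following is a minimal sketch, assuming the skiboot message format shown in the excerpt above; the function names and the `SAMPLE` string are illustrative only, and the sample lines are copied from that excerpt.

```python
import re

# Matches skiboot status lines such as:
# [   56.324824587,4] CORE[3]: HW_PROC_ID=3 PROC_CHIP_ID=0 EC=0x23 UNAVAILABLE
# The "PIR=..." lines are deliberately not matched.
CORE_LINE = re.compile(
    r"CORE\[(\d+)\]: HW_PROC_ID=\d+ PROC_CHIP_ID=\d+ EC=0x[0-9a-fA-F]+ (\w+)"
)


def core_states(msglog_text):
    """Return a dict mapping core index -> status word (e.g. 'OK')."""
    states = {}
    for line in msglog_text.splitlines():
        match = CORE_LINE.search(line)
        if match:
            states[int(match.group(1))] = match.group(2)
    return states


def unavailable_cores(msglog_text):
    """Return the sorted indices of all cores whose status is not 'OK'."""
    return sorted(c for c, s in core_states(msglog_text).items() if s != "OK")


# Sample lines taken from the msglog excerpt above.
SAMPLE = """\
[   56.324790184,6] CORE[2]: HW_PROC_ID=2 PROC_CHIP_ID=0 EC=0x23 OK
[   56.324792498,6] CORE[2]: PIR=00000014 OK (4 threads)
[   56.324824587,4] CORE[3]: HW_PROC_ID=3 PROC_CHIP_ID=0 EC=0x23 UNAVAILABLE
[   56.324912952,6] CORE[3]: PIR=0000001c UNUSABLE (4 threads)
[   56.325017618,4] CORE[6]: HW_PROC_ID=6 PROC_CHIP_ID=0 EC=0x23 UNAVAILABLE
[   56.325134927,6] CORE[7]: HW_PROC_ID=7 PROC_CHIP_ID=0 EC=0x23 OK
"""

print(unavailable_cores(SAMPLE))  # → [3, 6]
```

In the excerpt, cores 3 and 6 are the ones reported as UNAVAILABLE.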

Code: [Select]
# cat /proc/cpuinfo
processor : 0
cpu : POWER9, altivec supported
clock : 2154.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 1
cpu : POWER9, altivec supported
clock : 2154.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 2
cpu : POWER9, altivec supported
clock : 2154.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 3
cpu : POWER9, altivec supported
clock : 2154.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 4
cpu : POWER9, altivec supported
clock : 2154.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 5
cpu : POWER9, altivec supported
clock : 2154.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 6
cpu : POWER9, altivec supported
clock : 2154.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 7
cpu : POWER9, altivec supported
clock : 2154.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 8
cpu : POWER9, altivec supported
clock : 2220.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 9
cpu : POWER9, altivec supported
clock : 2220.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 10
cpu : POWER9, altivec supported
clock : 2220.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 11
cpu : POWER9, altivec supported
clock : 2220.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 16
cpu : POWER9, altivec supported
clock : 2204.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 17
cpu : POWER9, altivec supported
clock : 2204.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 18
cpu : POWER9, altivec supported
clock : 2204.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 19
cpu : POWER9, altivec supported
clock : 2204.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 20
cpu : POWER9, altivec supported
clock : 2204.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 21
cpu : POWER9, altivec supported
clock : 2204.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 22
cpu : POWER9, altivec supported
clock : 2204.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 23
cpu : POWER9, altivec supported
clock : 2204.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 28
cpu : POWER9, altivec supported
clock : 2303.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 29
cpu : POWER9, altivec supported
clock : 2170.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 30
cpu : POWER9, altivec supported
clock : 2170.000000MHz
revision : 2.3 (pvr 004e 1203)

processor : 31
cpu : POWER9, altivec supported
clock : 2170.000000MHz
revision : 2.3 (pvr 004e 1203)

timebase : 512000000
platform : PowerNV
model : C1P9S01 REV 1.01
machine : PowerNV C1P9S01 REV 1.01
firmware : OPAL
MMU : Radix
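The gaps in the processor numbering above (12 to 15 and 24 to 27 are absent) line up with the two unavailable cores in the msglog. A minimal sketch of that arithmetic, assuming 4 hardware threads per core (as the msglog reports) and that Linux numbers the logical CPUs contiguously per core; the function names are illustrative only:

```python
import re

THREADS_PER_CORE = 4  # assumption: SMT4, matching "(4 threads)" in the msglog


def present_cpus(cpuinfo_text):
    """Extract the logical CPU numbers from /proc/cpuinfo-style text."""
    found = re.findall(r"^processor[ \t]*:[ \t]*(\d+)[ \t]*$", cpuinfo_text, re.M)
    return sorted(int(n) for n in found)


def missing_cores(cpus, total_cores=8, threads_per_core=THREADS_PER_CORE):
    """Return core indices that have no logical CPU present at all."""
    present = {cpu // threads_per_core for cpu in cpus}
    return [core for core in range(total_cores) if core not in present]


# Logical CPU numbers as listed in the /proc/cpuinfo output above.
CPUS = list(range(0, 12)) + list(range(16, 24)) + list(range(28, 32))

print(missing_cores(CPUS))  # → [3, 6]
```

On a live system, `present_cpus(open("/proc/cpuinfo").read())` would supply the list instead of the hard-coded `CPUS`.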

q66

  • Guest
Re: Power9 8 core (v2 with DD2.3): Two of 8 cores are unavailable (offline)
« Reply #1 on: January 27, 2020, 08:44:16 pm »
Run

pflash -P GUARD -c

in the BMC shell, then reboot.

madscientist159

  • Raptor Staff
  • *****
  • Posts: 47
  • Karma: +11/-0
Re: Power9 8 core (v2 with DD2.3): Two of 8 cores are unavailable (offline)
« Reply #2 on: January 28, 2020, 02:33:31 am »
This issue looks very familiar... What kernel version are you on? Do you have an AMD GPU installed?

The reason I ask is that I tracked down a bug on an older kernel (could have been early 5.x series or mid to late 4.x series, can't recall offhand) that manifested with cores dropping out like this. Basically, the driver was doing an invalid coherent access; the firmware didn't like that one bit and assumed the CPU was faulty, offlining more and more of the cores until there finally weren't enough left to even IPL.

I do know RHEL 7 was great at provoking it.

FlyingBlackbird

Re: Power9 8 core (v2 with DD2.3): Two of 8 cores are unavailable (offline)
« Reply #3 on: January 28, 2020, 02:31:53 pm »
Thanks for the background information!

Quote
Do you have an AMD GPU installed?

I do not use a GPU and have never installed one. I am only using the AST2500 at Full HD resolution.

Quote
what kernel version are you on?

I am not quite sure what to blame, but most probably it is Ubuntu Server 19.10 with the original kernel from the installation ISO image (boot DVD).
I have also done many tests with an installed Ubuntu Server 19.10 and kernel version 5.3.0-26-generic.

Furthermore, I started an installation from a Fedora Server 31 ISO image (so its bundled kernel could also have been the problem),
but the installation failed initially, so I aborted it.


Fedora Server 31 is now up and running from the SSD without any problems so far (kernel 5.4.13-201.fc31.ppc64le), but the cores had already gone offline before I used this updated kernel version.
Update: I spoke too soon. Fedora Server 31 now also has boot problems with SATA drop-outs from time to time, though much less frequently than Ubuntu 19.10 (and this happens without Ubuntu Server 19.10 having been booted since a cold start).

What is striking when booting Ubuntu Server 19.10 is:

The combination of
  • a Samsung EVO Plus 970 TB NVMe SSD in the PCIe x8 slot
  • together with a Seagate IronWolf Pro 8 TB (ST8000NE0004) SATA III HDD in SATA-2
  • (and an Asus BW-16D1HT Retail BluRay Writer in SATA-1)
  • and Ubuntu Server 19.10 installed on the SSD
 
caused many drop-outs of the SATA devices (HDD and BluRay) during the boot phase of Ubuntu Server, so this may also be related (bad driver?).

If I unplug the SSD and use only the Ubuntu Server 19.10 installed on the HDD, it works very reliably (no boot problems or SATA device drop-outs),
so I suspect the NVMe SSD is causing the problem. To rule out an adapter incompatibility, I also tried a second PCIe to M.2 NVMe adapter, with the same problems (RaidSonic ICY BOX IB-PCI214M2-HSL M.2 to PCIe adapter and Delock M.2 PCI Express x4 card).

petitboot always recognized the SATA devices until booting Ubuntu 19.10 caused a SATA drop-out (after that, petitboot no longer showed the SATA devices either, until a power-off and restart).

During the Ubuntu Server 19.10 boot, dmesg had many entries like these, ending with the SATA devices being disabled:

Code: [Select]
[   10.463160] ata3.00: qc timeout (cmd 0x47)
[   10.463165] ata3.00: READ LOG DMA EXT failed, trying PIO
[   10.463166] ata3.00: failed to get NCQ Send/Recv Log Emask 0x40
[   10.463171] ata3.00: configured for UDMA/133
...
[   65.726083] ata3: softreset failed (1st FIS failed)
[   65.726090] ata3: limiting SATA link speed to 3.0 Gbps
[   70.732850] ata3: softreset failed (1st FIS failed)
[   70.732858] ata3: reset failed, giving up
[   70.732861] ata3.00: disabled
...
[   75.720781] ata4: softreset failed (1st FIS failed)
[   80.722510] ata4: softreset failed (1st FIS failed)
[   80.722515] ata4: reset failed, giving up
[   80.722519] ata4.00: disabled
...

After adding the boot parameter
Code: [Select]
libata.force=noncq
dmesg showed:
Code: [Select]
[   37.887253] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[   37.887702] ata3.00: failed command: READ DMA
[   37.888147] ata3.00: cmd c8/00:80:00:00:00/00:00:00:00:00/e0 tag 28 dma 65536 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   37.889049] ata3.00: status: { DRDY }
[   37.889496] ata3: hard resetting link
[   47.391014] ata4: softreset failed (1st FIS failed)
[   47.886705] ata3: softreset failed (1st FIS failed)
[   47.887151] ata3: hard resetting link
[   57.887577] ata3: softreset failed (1st FIS failed)
[   57.888017] ata3: hard resetting link
[   82.391257] ata4: softreset failed (1st FIS failed)
[   87.390760] ata4: softreset failed (1st FIS failed)
[   87.391183] ata4: reset failed, giving up
[   87.391586] ata4.00: disabled
[   87.391990] scsi 3:0:0:0: scsi scan: 96 byte inquiry failed.  Consider BLIST_INQUIRY_36 for this device
[   92.886631] ata3: softreset failed (1st FIS failed)
[   92.887041] ata3: limiting SATA link speed to 3.0 Gbps
[   92.887445] ata3: hard resetting link
[   97.887325] ata3: softreset failed (1st FIS failed)
[   97.887754] ata3: reset failed, giving up
[   97.888172] ata3.00: disabled
[   97.888592] ata3: EH complete
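To compare boots, it may be useful to condense such dmesg output into one line per port. This is a minimal sketch, assuming the libata message wording seen in the excerpts above ("softreset failed", "disabled"); the function name and the `SAMPLE` string are illustrative only, and the sample lines are copied from the first excerpt.

```python
import re
from collections import Counter

# Matches lines such as "[   70.732861] ata3.00: disabled" and captures
# the timestamp, the port name (without the ".00" device suffix) and the message.
ATA_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata(dmesg_text):
    """Return (softreset failure count per port, time each port was disabled)."""
    resets = Counter()
    disabled_at = {}
    for line in dmesg_text.splitlines():
        match = ATA_LINE.match(line.strip())
        if not match:
            continue  # continuation lines and unrelated messages are skipped
        timestamp = float(match.group(1))
        port, message = match.group(2), match.group(3)
        if message.startswith("softreset failed"):
            resets[port] += 1
        elif message == "disabled":
            disabled_at[port] = timestamp
    return resets, disabled_at


# Sample lines taken from the first dmesg excerpt above.
SAMPLE = """\
[   65.726083] ata3: softreset failed (1st FIS failed)
[   65.726090] ata3: limiting SATA link speed to 3.0 Gbps
[   70.732850] ata3: softreset failed (1st FIS failed)
[   70.732858] ata3: reset failed, giving up
[   70.732861] ata3.00: disabled
[   75.720781] ata4: softreset failed (1st FIS failed)
[   80.722510] ata4: softreset failed (1st FIS failed)
[   80.722515] ata4: reset failed, giving up
[   80.722519] ata4.00: disabled
"""

resets, disabled_at = summarize_ata(SAMPLE)
for port in sorted(disabled_at):
    print(port, "softreset failures:", resets[port], "disabled at:", disabled_at[port])
```

Feeding it the output of `dmesg` from an affected boot would show at a glance which ports were given up on and when.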

https://ata.wiki.kernel.org/index.php/Libata_error_messages
says:

Quote
Timeout: Most often this is due to an unrelated interrupt subsystem bug (try booting with 'pci=nomsi' or 'acpi=off' or 'noapic'), which failed to deliver an interrupt when we were expecting one from the hardware.

Fedora Server 31 log with the SATA drop-outs shown in dmesg:

Code: [Select]
[    0.990585] ata3: SATA max UDMA/133 abar m2048@0x600c100000000 port 0x600c100000200 irq 30
[    1.487812] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.507342] ata3.00: ATA-10: ST8000NE0004-1ZF11G, EN01, max UDMA/133
[    1.507345] ata3.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    6.557731] ata3.00: qc timeout (cmd 0xec)
[    6.557737] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[    6.557738] ata3.00: revalidation failed (errno=-5)
[   16.557020] ata3: softreset failed (1st FIS failed)
[   26.557020] ata3: softreset failed (1st FIS failed)
[   61.556654] ata3: softreset failed (1st FIS failed)
[   61.556656] ata3: limiting SATA link speed to 3.0 Gbps
[   66.556654] ata3: softreset failed (1st FIS failed)
[   66.556656] ata3: reset failed, giving up
[   66.556658] ata3.00: disabled

[    0.990587] ata4: SATA max UDMA/133 abar m2048@0x600c100000000 port 0x600c100000280 irq 30
[    1.487797] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.491980] ata4.00: ATAPI: ASUS    BW-16D1HT, 3.10, max UDMA/133
[    1.499775] ata4.00: configured for UDMA/133
[   97.577022] ata4: softreset failed (1st FIS failed)
[  107.577021] ata4: softreset failed (1st FIS failed)
[  142.576654] ata4: softreset failed (1st FIS failed)
[  147.576654] ata4: softreset failed (1st FIS failed)
[  147.576656] ata4: reset failed, giving up
[  147.576658] ata4.00: disabled
« Last Edit: January 29, 2020, 02:35:58 pm by FlyingBlackbird »

FlyingBlackbird

Re: Power9 8 core (v2 with DD2.3): Two of 8 cores are unavailable (offline)
« Reply #4 on: January 28, 2020, 03:45:24 pm »
Quote
run
pflash -P GUARD -c
in BMC shell, then reboot

Damn, my cores are back again, thanks a lot!  :)  :D

For other readers with similar issues, this is the output:

Code: [Select]
root@blackbird:~# pflash -P GUARD -c
About to erase and set ECC bits in region 0x0002c000 to 0x00031000
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
About to erase 0x0002c000..0x00031000 !
Erasing...
[==================================================] 100%
Programming ECC bits...
[==================================================] 100%
root@blackbird:~# reboot
« Last Edit: January 30, 2020, 06:28:11 pm by FlyingBlackbird »