Author Topic: Samsung PM9A1 IcyBox IB-PCI208-HS not working in one Talos II, OK in the other

pocock


For my first workstation, I purchased two Samsung PM9A1 (1TB) SSDs and two IcyBox IB-PCI208-HS cards to put them into PCIe slots. They worked immediately and I've had them for 11 months without any problem.

Now I have purchased two PM9A1 (2TB) SSDs and two more identical IB-PCI208-HS cards. These show up only intermittently, or not at all, when the machine boots.

In petitboot there is a lot of kernel error logging like this:

Code:
[  937.395593] EEH: Recovering PHB#1-PE#fd
[  937.395607] EEH: PE location: UOPWR.A100029-Node0-CPU1 Slot1 (8x), PHB location: N/A
[  937.395609] EEH: This PCI device has failed 1 times in the last hour and will  be permanently disabled after 5 failures.
[  937.395611] EEH: Notify device drivers to shutdown
[  937.395614] EEH: Beginning: 'error_detected(IO frozen)'
[  937.395620] PCI 0001:01:00.0#00fd: EEH: Invoking nvme->error_detected(IO frozen)
[  937.395627] nvme nvme0: frozen state error detected, reset controller
[  937.578659] PCI 0001:01:00.0#00fd: EEH: nvme driver reports: 'need reset'
[  937.578662] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[  937.578667] EEH: Collect temporary log
[  937.578698] EEH: of node=0001:01:00.0
[  937.578702] EEH: PCI device/vendor: ffffffff
[  937.578706] EEH: PCI cmd/status register: ffffffff
[  937.578707] EEH: PCI-E capabilities and status follow:
[  937.578722] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
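For anyone comparing logs like the one above across boots or slots, here is a minimal sketch in Python that summarises each EEH recovery event. The regular expressions are based only on the lines quoted in this post, so other kernel versions may format these messages differently. The function name summarise_eeh and the SAMPLE text are made up for illustration. The all-ones value (ffffffff) in the device/vendor register usually means the device was not answering config reads when the log was collected, so the sketch flags it.

```python
import re

# Patterns derived only from the log lines quoted in this thread (assumption:
# other kernels may word these messages differently).
RECOVER_RE = re.compile(r"EEH: Recovering PHB#(?P<phb>[0-9a-fA-F]+)-PE#(?P<pe>[0-9a-fA-F]+)")
FAILED_RE = re.compile(r"EEH: This PCI device has failed (?P<count>\d+) times")
DRIVER_RE = re.compile(
    r"PCI (?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])#\w+: "
    r"EEH: (?P<driver>\w+) driver reports: '(?P<state>[^']+)'"
)
ALL_ONES_RE = re.compile(r"EEH: PCI device/vendor: ffffffff")

# A few lines copied from the log above, used as example input.
SAMPLE = """\
[  937.395593] EEH: Recovering PHB#1-PE#fd
[  937.395609] EEH: This PCI device has failed 1 times in the last hour and will  be permanently disabled after 5 failures.
[  937.578659] PCI 0001:01:00.0#00fd: EEH: nvme driver reports: 'need reset'
[  937.578702] EEH: PCI device/vendor: ffffffff
"""


def summarise_eeh(log_text):
    """Return one dict per 'EEH: Recovering' event found in log_text."""
    events = []
    current = None
    for line in log_text.splitlines():
        m = RECOVER_RE.search(line)
        if m:
            # A new recovery event starts here; later lines attach to it.
            current = {
                "phb": m["phb"],
                "pe": m["pe"],
                "failures": None,
                "device": None,
                "driver": None,
                "state": None,
                "config_all_ones": False,
            }
            events.append(current)
            continue
        if current is None:
            continue  # ignore lines that appear before the first event
        m = FAILED_RE.search(line)
        if m:
            current["failures"] = int(m["count"])
            continue
        m = DRIVER_RE.search(line)
        if m:
            current["device"] = m["bdf"]
            current["driver"] = m["driver"]
            current["state"] = m["state"]
            continue
        if ALL_ONES_RE.search(line):
            # All-ones config reads: the device did not respond.
            current["config_all_ones"] = True
    return events


if __name__ == "__main__":
    for event in summarise_eeh(SAMPLE):
        print(event)
```

To use it on a real machine, save the output of dmesg to a file and pass the file contents to summarise_eeh instead of SAMPLE.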

I tried the following, with no luck:

- removing the cards and re-inserting them in different slots

- removing the SSDs and putting them back into the cards

- upgrading the FPGA on the second workstation from 0xa to 0xc (v1.08) so that it matches the first workstation

- upgrading the PNOR on the second workstation to the v2.01 beta so that it matches the first workstation

- putting a new and bigger PSU on the second workstation (I needed to do this anyway for a bigger GPU and more RAM)

- removing the GPU and everything else from the system and trying one SSD at a time

Debian Developer
https://danielpocock.com

pocock


I put them both into an HP Z6 G4 workstation and they worked immediately there, so I don't think they are faulty. It looks more like a compatibility issue.

Is there any other data I can collect from the Talos II to help troubleshoot?

I can try to purchase an alternative PCIe card for these SSDs. I don't mind swapping the cards and testing other models.

ClassicHasClass

Do the new cards work in the first workstation? I couldn't tell.

pocock

Unfortunately, I cannot shut it down right now to test them in it.