For my first workstation, I purchased two Samsung PM9A1 (1TB) SSDs and two IcyBox IB-PCI208-HS adapter cards to mount them in PCIe slots. They worked immediately, and I've been running them for 11 months without any problems.
Now I have purchased two PM9A1 (2TB) SSDs and two more identical IB-PCI208-HS cards for a second workstation. These only show up intermittently, or not at all, when the machine boots.
In petitboot there is a lot of kernel error logging like this:
[ 937.395593] EEH: Recovering PHB#1-PE#fd
[ 937.395607] EEH: PE location: UOPWR.A100029-Node0-CPU1 Slot1 (8x), PHB location: N/A
[ 937.395609] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 937.395611] EEH: Notify device drivers to shutdown
[ 937.395614] EEH: Beginning: 'error_detected(IO frozen)'
[ 937.395620] PCI 0001:01:00.0#00fd: EEH: Invoking nvme->error_detected(IO frozen)
[ 937.395627] nvme nvme0: frozen state error detected, reset controller
[ 937.578659] PCI 0001:01:00.0#00fd: EEH: nvme driver reports: 'need reset'
[ 937.578662] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 937.578667] EEH: Collect temporary log
[ 937.578698] EEH: of node=0001:01:00.0
[ 937.578702] EEH: PCI device/vendor: ffffffff
[ 937.578706] EEH: PCI cmd/status register: ffffffff
[ 937.578707] EEH: PCI-E capabilities and status follow:
[ 937.578722] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
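For reference, this is roughly how I check from a Linux shell whether the controllers enumerated at all and whether more EEH events have occurred (the address 0001:01:00.0 is taken from the log above; nothing here is specific to my setup):

```shell
# List any NVMe controllers the kernel can currently see;
# print a notice instead if none appear (or lspci is unavailable)
lspci -nn | grep -i 'non-volatile memory' || echo "no NVMe controllers visible"

# Show the most recent EEH messages from the kernel log, if any
dmesg | grep -i 'EEH' | tail -n 20
```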
I tried the following, all with no luck:
- removing the cards and re-inserting them in different slots
- removing the SSDs and putting them back into the cards
- upgrading the FPGA on the second workstation from 0xa to 0xc (v1.08) so that it matches the first workstation
- upgrading the PNOR on the second workstation to the v2.01 beta so that it matches the first workstation
- fitting a new, higher-wattage PSU in the second workstation (I needed to do this anyway for a bigger GPU and more RAM)
- removing the GPU and everything else from the system and trying one SSD at a time
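One more check that may be relevant, since the EEH log reports the slot as 8x: inspecting whether the link actually trained, and at what width and speed. Something like this (the address comes from the EEH log above; substitute the real one, and the `|| true` just keeps the command from erroring out when the device is absent):

```shell
# Compare the link the slot/device advertises (LnkCap) with what was
# actually negotiated (LnkSta) for the NVMe controller from the log
lspci -vv -s 0001:01:00.0 2>/dev/null | grep -E 'LnkCap:|LnkSta:' || true
```

A drive that negotiates a narrower or slower link than expected, or drops the link under load, could plausibly trigger exactly this kind of intermittent EEH freeze.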