My Blackbird system suddenly shut down, and I'd appreciate any help diagnosing the issue. The BMC logged 3 high-priority events within 2 minutes of the shutdown:

xyz.openbmc_project.Sensor.Device.Error.ReadFailure
CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/00:00:00:06/sbefifo1-dev0/occ-hwmon.1
CALLOUT_ERRNO=108
_PID=4791

org.open_power.Proc.FSI.Error.MasterDetectionFailure
CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/raw
CALLOUT_ERRNO=0
_PID=26801

org.open_power.Proc.FSI.Error.MasterDetectionFailure
CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/raw
CALLOUT_ERRNO=0
_PID=26873

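In case it helps anyone reading these entries: as far as I can tell, CALLOUT_ERRNO is a plain Linux errno value from the BMC's kernel, which would make 108 ESHUTDOWN. That interpretation is my assumption, not something the log states. Below is a rough Python sketch that splits the pasted entries into records and decodes the errno. The names parse_events, describe_errno and LINUX_ERRNO are made up for this example, and the table is deliberately partial and hard-coded so the result doesn't depend on the machine running it.

```python
# Partial table of Linux errno values (asm-generic numbering).
# Assumption: CALLOUT_ERRNO in the BMC log uses these numbers.
LINUX_ERRNO = {
    0: ("OK", "no errno recorded"),
    5: ("EIO", "Input/output error"),
    19: ("ENODEV", "No such device"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# The three entries exactly as pasted from the BMC event log.
SAMPLE = """\
xyz.openbmc_project.Sensor.Device.Error.ReadFailure
CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/00:00:00:06/sbefifo1-dev0/occ-hwmon.1
CALLOUT_ERRNO=108
_PID=4791
org.open_power.Proc.FSI.Error.MasterDetectionFailure
CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/raw
CALLOUT_ERRNO=0
_PID=26801
org.open_power.Proc.FSI.Error.MasterDetectionFailure
CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/raw
CALLOUT_ERRNO=0
_PID=26873
"""


def parse_events(text):
    """Group lines into events: a line without '=' starts a new event,
    and each following KEY=VALUE line becomes a field of that event."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            if not events:
                continue  # field with no preceding event name; skip it
            key, _, value = line.partition("=")
            events[-1]["fields"][key] = value
        else:
            events.append({"name": line, "fields": {}})
    return events


def describe_errno(code):
    """Return a readable description for an errno number."""
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"{code} ({name}: {text})"


if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        fields = event["fields"]
        code = int(fields.get("CALLOUT_ERRNO", "0"))
        print(event["name"].rsplit(".", 1)[-1])
        print("  device:", fields.get("CALLOUT_DEVICE_PATH", "?"))
        print("  errno: ", describe_errno(code))
```

If the errno assumption holds, the ReadFailure would mean the occ-hwmon read was attempted after its transport had already gone down, so it may be a symptom of the shutdown rather than its cause.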
The BMC event log contains many earlier "MasterDetectionFailure" messages that don't seem to have caused issues before, but no other "ReadFailure" messages.
After the shutdown, the "Server power operations" section of the BMC still showed the system as running. A "Warm reboot" failed, but a subsequent "Power on" started the system again.
The system journal contains no messages from the 5 minutes before the shutdown, and the last messages before that gap were not errors and appear to be unrelated to the issue.
The system is connected to a UPS, and other computers connected to the same UPS had no problems.