Raptor Computing Systems Community Forums (BETA)
Raptor Computing Systems Hardware => Blackbird => Topic started by: r34per on April 28, 2023, 04:46:46 pm
-
I checked the OpenBMC web interface and found it reporting critical health, with 200 high-priority errors logged yesterday over the course of about 3 hours. As far as I can tell, it's the same error for all of them.
The error is org.open_power.Proc.FSI.Error.MasterDetectionFailure
When I expand the entry, this is what it reads: CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/raw CALLOUT_ERRNO=0 _PID=8606
Like I said, they all appear to be the same error with that same message, although the _PID= number is different. My Blackbird also appears to have powered off at some point, though I don't know when (can I check that somewhere in OpenBMC?). I stepped away from my PC for the evening before the first error was logged and forgot to shut it down for the night, and when I went to use it this morning it was not running.
I'm running Void Linux as the OS, and I couldn't find any logs that would shed any light on it. It seems Void doesn't have a syslog daemon by default and I never bothered installing one, oops.
Is this a cause for concern, and should I put a ticket in with RCS about it? It happened once before, a few weeks or so ago, but I chalked it up to a fluke; I cleared the logs and it seemed to be fine.
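For anyone wanting to confirm that a big batch of entries really is the same error apart from the PID, here is a rough Python sketch. It assumes the entry details have been copied out of the web interface as plain text, one entry per string; the helper names (parse_additional_data, signature) are made up for illustration and are not part of OpenBMC. The two sample strings use the entry text and PIDs quoted in this thread.

```python
from collections import Counter


def parse_additional_data(text):
    """Split a space-separated KEY=VALUE string into a dict."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that contain no "="
            fields[key] = value
    return fields


def signature(text, ignore=("_PID",)):
    """Return the entry's fields as a hashable tuple, minus ignored keys."""
    fields = parse_additional_data(text)
    return tuple(sorted((k, v) for k, v in fields.items() if k not in ignore))


# Sample entries as pasted from the log view (an assumption about the format).
entries = [
    "CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/raw "
    "CALLOUT_ERRNO=0 _PID=8606",
    "CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/raw "
    "CALLOUT_ERRNO=0 _PID=4612",
]

# Entries that differ only in _PID collapse to the same signature.
counts = Counter(signature(e) for e in entries)
for sig, n in counts.items():
    print(n, dict(sig))
```

If every entry collapses to a single signature, the log is one repeated event rather than several different faults.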
-
You may be interested to know that Raptor weighed in on their Twitter after it was pointed out to them, with a first suggestion to try reseating the CPU.
The link:
https://nitter.poast.org/RaptorCompSys/status/1652858741635665920#m
-
I'll give that a try, thanks for the heads up!
-
I got this error in my Blackbird's BMC log when there was a brief (like one second) power outage (brownout maybe?) last week.
-
Then I would try this:
https://wiki.raptorcs.com/wiki/Troubleshooting/Guard_Partition
Otherwise, if that doesn't help, I would re-flash the BMC and OpenPOWER firmware:
https://wiki.raptorcs.com/wiki/Updating_Firmware
-
I got this error in my Blackbird's BMC log when there was a brief (like one second) power outage (brownout maybe?) last week.
That could be what's happening to mine too, when I think about it. Brief brownouts aren't uncommon when it's windy or stormy, which it has been when it happened. I should probably invest in a UPS for it and see if it still gives me any trouble. I'll try MPC7500's suggestions too, thanks for the help!
-
These events have also been showing in the 'Server Health' section of my BMC web panel. Just four marked from 2020, and two more recent from December 2023.
org.open_power.Proc.FSI.Error.MasterDetectionFailure
CALLOUT_DEVICE_PATH=/sys/devices/platform/gpio-fsi/fsi0/slave@00:00/raw CALLOUT_ERRNO=0 _PID=4612
My system has been fine, other than occasionally booting without properly setting RTC time to the host (same issue that ClassicHasClass has been seeing? (https://www.talospace.com/2023/12/fedora-39-mini-review-on-blackbird-and.html))
If this is power-outage related, that would make sense: through 2020, I unknowingly had my Blackbird on a bad uninterruptible power supply, and later I had it in a location prone to outages before replacing the UPS.
If these entries are nothing critical, they should be safe to clear from the log?
-
If you believe you've determined the cause, I don't see a reason to keep the error messages around for further investigation. Good work troubleshooting it.
-
My system has been fine, other than occasionally booting without properly setting RTC time to the host (same issue that ClassicHasClass has been seeing? (https://www.talospace.com/2023/12/fedora-39-mini-review-on-blackbird-and.html))
Yeah, I'm trying to do more research on that. I'm assuming the BMC settings just got whacked, given that the password was also scrambled, but it still seems an odd failure mode.