Our production Blackbird went down unexpectedly when the BMC requested an IPMI shutdown. We were able to reboot, and the normal boot sequence completed successfully, only for the exact same thing to happen 10 minutes later.
It has kept recurring ever since.
I looked into the BMC dmesg log and found this:
[ 294.400092] aspeed-g5-pinctrl 1e6e2000.syscon:pinctrl: request pin 26 (F20) for 1e780000.gpio:306
[ 294.400136] Want SCU90[0x00000002]=0x1, got 0x0 from 0x063F0000
[ 294.400158] Want SCU8C[0x00000200]=0x1, got 0x0 from 0x00000001
[ 294.400171] Want SCU70[0x00200000]=0x1, got 0x0 from 0xF1105206
[ 648.795152] aspeed-i2c-bus 1e78a440.i2c-bus: irq handled != irq. expected 0x00001010, but was 0x00000010
[ 648.805898] aspeed-i2c-bus 1e78a440.i2c-bus: irq handled != irq. expected 0x00001001, but was 0x00000001
[ 648.817794] aspeed-i2c-bus 1e78a440.i2c-bus: irq handled != irq. expected 0x00001010, but was 0x00000010
[ 649.661265] aspeed-i2c-bus 1e78a440.i2c-bus: irq handled != irq. expected 0x00001001, but was 0x00000001
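To make the i2c messages easier to compare, here is a minimal Python sketch that pulls the two hex values out of each "irq handled != irq" line and prints the bits present in the first but not the second. It assumes the first value is the raw interrupt status and the second is what the driver actually handled; that is only my reading of the message text, and the function name `unhandled_bits` is made up for illustration.

```python
import re

# Matches the aspeed-i2c-bus mismatch lines quoted above.
PATTERN = re.compile(
    r"irq handled != irq\. expected (0x[0-9a-fA-F]+), but was (0x[0-9a-fA-F]+)"
)

def unhandled_bits(line):
    """Return bits set in the 'expected' value but not in the 'was' value.

    Returns None for lines that are not an irq mismatch message.
    Assumption: 'expected' is the raw status, 'was' is the handled mask.
    """
    m = PATTERN.search(line)
    if m is None:
        return None
    expected = int(m.group(1), 16)
    handled = int(m.group(2), 16)
    return expected & ~handled

# The four i2c lines from the dmesg excerpt, timestamps trimmed.
LOG = """\
aspeed-i2c-bus 1e78a440.i2c-bus: irq handled != irq. expected 0x00001010, but was 0x00000010
aspeed-i2c-bus 1e78a440.i2c-bus: irq handled != irq. expected 0x00001001, but was 0x00000001
aspeed-i2c-bus 1e78a440.i2c-bus: irq handled != irq. expected 0x00001010, but was 0x00000010
aspeed-i2c-bus 1e78a440.i2c-bus: irq handled != irq. expected 0x00001001, but was 0x00000001
"""

# Collect the distinct leftover masks across all matching lines.
masks = {unhandled_bits(line) for line in LOG.splitlines()} - {None}
for mask in sorted(masks):
    print(f"0x{mask:08x}")  # prints 0x00001000
```

In all four lines the leftover mask is the same, 0x00001000, so whatever that bit means on this controller, it is the one being raised on bus 1e78a440 and left unhandled each time.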
This issue looks similar to
https://forums.raptorcs.com/index.php?topic=377.0
However, in our case the system is plugged into a UPS, and there haven't been any power failures of any kind. To be sure, we swapped out the chassis PSU for a brand-new one we had lying around, and bypassed the UPS completely. Unfortunately, it did absolutely nothing. We have also tried re-flashing both our self-compiled PNOR and the latest release, and manually upgrading the BMC firmware; none of that had any effect either. Note: we did upgrade to the latest-release PNOR a few months back, although I think it probably has nothing to do with the issue at hand.
Should we proceed with dumping & re-flashing the FPGA now? I hate that it looks just like the issue linked above, but the remedies from that thread aren't working.
Best
P.S. Is a jerry-rigged serprog-based programmer good enough for at least dumping the FPGA flash for inspection?