Author Topic: Did my Blackbird just die on me?  (Read 6268 times)

kth5

  • Newbie
  • *
  • Posts: 11
  • Karma: +2/-0
Did my Blackbird just die on me?
« on: July 14, 2022, 12:32:56 pm »
The other day I was logged in remotely and the box just went down. I could still reach the BMC and tried to power the machine up, but to no avail. There was no Hostboot output on serial (via the BMC) and nothing in the event logs on the BMC. Just plain nothing.

Once I got home I switched the box on manually via the switch. The fans started running at full tilt as usual, but after pretty much exactly 30s it switched off again, without leaving a trace as to why in the event log on the BMC.

Then I removed all hardware except the CPU, one piece at a time, trying to power on after each removal. Same effect.

The only thing that looks obviously weird is a set of dmesg entries that repeats every three seconds on the BMC:

Code:
[ 1367.988668] aspeed-g5-pinctrl 1e6e2000.syscon:pinctrl: request pin 26 (F20) for 1e780000.gpio:306
[ 1367.988711] Want SCU90[0x00000002]=0x1, got 0x0 from 0x063F0000
[ 1367.988731] Want SCU8C[0x00000200]=0x1, got 0x0 from 0x00000001
[ 1367.988746] Want SCU70[0x00200000]=0x1, got 0x0 from 0xF1105206
[ 1370.989477] aspeed-g5-pinctrl 1e6e2000.syscon:pinctrl: request pin 26 (F20) for 1e780000.gpio:306
[ 1370.989520] Want SCU90[0x00000002]=0x1, got 0x0 from 0x063F0000
[ 1370.989538] Want SCU8C[0x00000200]=0x1, got 0x0 from 0x00000001
[ 1370.989548] Want SCU70[0x00200000]=0x1, got 0x0 from 0xF1105206
[ 1373.990267] aspeed-g5-pinctrl 1e6e2000.syscon:pinctrl: request pin 26 (F20) for 1e780000.gpio:306
[ 1373.990311] Want SCU90[0x00000002]=0x1, got 0x0 from 0x063F0000
[ 1373.990330] Want SCU8C[0x00000200]=0x1, got 0x0 from 0x00000001
[ 1373.990342] Want SCU70[0x00200000]=0x1, got 0x0 from 0xF1105206

Do these mean anything or are we just talking verbosity?

I can update the PNOR etc. from the BMC without failure and read it back, so that's not it either.


Did my CPU just die and if so, how the hell can I confirm this before I commit to another investment of hundreds of dollars? :(

atomicdog

  • Newbie
  • *
  • Posts: 39
  • Karma: +4/-0
Re: Did my Blackbird just die on me?
« Reply #1 on: July 14, 2022, 04:04:41 pm »
Looks like the BMC is failing to set up an IO pin.
Just a guess, but maybe there's a short at a button, a connector, or the pins on the BMC IC.
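
For anyone who wants to sanity-check those "Want ... got ..." lines, here is a minimal Python sketch. It assumes the fields are, in order: register name, bit mask, wanted field value, observed field value, and raw register value. That reading is an assumption based on the shape of the log lines, and `decode_scu_line` is a made-up helper name, not anything from the kernel or OpenBMC. It only checks that the observed value is what you get by masking the raw register value, and whether it equals the wanted value.

```python
import re

# Assumed layout: Want <REG>[<mask>]=<want>, got <got> from <raw>
PATTERN = re.compile(
    r"Want (?P<reg>SCU[0-9A-Fa-f]+)\[0x(?P<mask>[0-9A-Fa-f]+)\]"
    r"=0x(?P<want>[0-9A-Fa-f]+), "
    r"got 0x(?P<got>[0-9A-Fa-f]+) from 0x(?P<raw>[0-9A-Fa-f]+)"
)


def decode_scu_line(line):
    """Parse one 'Want ...' dmesg line; return None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    mask = int(m.group("mask"), 16)
    if mask == 0:
        return None  # no bits selected, nothing to decode
    want = int(m.group("want"), 16)
    got = int(m.group("got"), 16)
    raw = int(m.group("raw"), 16)
    # Shift the masked bits down by the position of the lowest set mask bit.
    shift = (mask & -mask).bit_length() - 1
    field = (raw & mask) >> shift
    return {
        "reg": m.group("reg"),
        "mask": mask,
        "want": want,
        "got": got,
        "raw": raw,
        "field": field,                # value recomputed from raw and mask
        "consistent": field == got,    # does the log agree with itself?
        "matches_want": got == want,   # is the field at the wanted value?
    }


if __name__ == "__main__":
    sample = "[ 1367.988746] Want SCU70[0x00200000]=0x1, got 0x0 from 0xF1105206"
    print(decode_scu_line(sample))
```

Feeding it the three distinct lines from the first post shows each masked field reading 0 while 1 is wanted. By itself that does not say whether this is a fault or just debug verbosity.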

MPC7500

  • Hero Member
  • *****
  • Posts: 592
  • Karma: +41/-1
Re: Did my Blackbird just die on me?
« Reply #2 on: July 14, 2022, 05:08:54 pm »
In September I had a similar error message.
It happened after a thunderstorm, when lightning struck (far away) and caused a surge.
Our heating control system also had to be replaced at the same time because of it.

Long story short: I had to reflash the BMC and OpenPOWER firmware.
https://wiki.raptorcs.com/wiki/Updating_Firmware

Now I always turn off the power strip when a storm is coming. BTW, Awilfox had the same issue a long time ago.

Borley

  • Full Member
  • ***
  • Posts: 177
  • Karma: +16/-0
Re: Did my Blackbird just die on me?
« Reply #3 on: July 14, 2022, 07:19:56 pm »
"Pin 26" is only referenced on the PCIe port and on J10117, which I think is the FlexVer port? I'm not sure how exacting that error message might be.

Quote from: MPC7500
In September I had a similar error message.
It happened after a thunderstorm, when lightning struck (far away) and caused a surge.
Our heating control system also had to be replaced at the same time because of it.

Long story short: I had to reflash the BMC and OpenPOWER firmware.
https://wiki.raptorcs.com/wiki/Updating_Firmware

Now I always turn off the power strip when a storm is coming. BTW, Awilfox had the same issue a long time ago.

It might also be worth putting it behind a surge suppressor. Normally I wouldn't care so much but seeing as these parts cost what they do...

kth5

  • Newbie
  • *
  • Posts: 11
  • Karma: +2/-0
Re: Did my Blackbird just die on me?
« Reply #4 on: July 15, 2022, 02:02:29 am »
Quote from: MPC7500
Long story short: I had to reflash the BMC and OpenPOWER firmware.
https://wiki.raptorcs.com/wiki/Updating_Firmware

That was the only thing I had not tried. So once I was at my desk at work I did it remotely, only to find that the BMC did not come back within 30 minutes of the reboot. I switched it off after approx. 35 minutes via the (remotely accessible) power strip and back on, to no avail.

Seems I may have bricked it fully now. :(

Once I get home it's time to hook up the serial again and see if there's any life still visible.

kth5

  • Newbie
  • *
  • Posts: 11
  • Karma: +2/-0
Re: Did my Blackbird just die on me?
« Reply #5 on: July 15, 2022, 02:05:09 am »
Quote from: Borley
It might also be worth putting it behind a surge suppressor. Normally I wouldn't care so much but seeing as these parts cost what they do...

Well, too late now.  :P

kth5

  • Newbie
  • *
  • Posts: 11
  • Karma: +2/-0
Re: Did my Blackbird just die on me?
« Reply #6 on: July 15, 2022, 10:41:13 am »
I probably found the culprit: the PSU seems to be bad. Another, otherwise working x86 machine is also unstable with it, even at loads below 250W (it's a 550W unit). I have another PSU coming and will report whether that solves my issue.

MPC7500

  • Hero Member
  • *****
  • Posts: 592
  • Karma: +41/-1
Re: Did my Blackbird just die on me?
« Reply #7 on: July 16, 2022, 12:58:33 pm »
Surprisingly, I have also heard of this problem more than once.
The main thing is that it works again.

Quote from: Borley
"Pin 26" is only referenced on the PCIe port and on J10117, which I think is the FlexVer port? I'm not sure how exacting that error message might be.

I was not aware of this. Does this mean that this error is always displayed?

atomicdog

  • Newbie
  • *
  • Posts: 39
  • Karma: +4/-0
    • View Profile
Re: Did my Blackbird just die on me?
« Reply #8 on: July 16, 2022, 05:34:16 pm »
The Talos II schematic shows pin 26 (F20) of the BMC to be an input, SYS_PWROK_BUF, from the FPGA.

kth5

  • Newbie
  • *
  • Posts: 11
  • Karma: +2/-0
Re: Did my Blackbird just die on me?
« Reply #9 on: July 19, 2022, 12:03:34 pm »
The new PSU arrived (same model that was in there before) and the Blackbird came back to life. A few boots failed with PNOR checksum failures, which probably stem from my attempts to update with flaky power... Anyway, here's to another 3 years of 24/7. :)

Thanks everyone and I hope this thread may help anyone else who runs into this.

ClassicHasClass

  • Sr. Member
  • ****
  • Posts: 468
  • Karma: +36/-0
  • Talospace Earth Orbit
Re: Did my Blackbird just die on me?
« Reply #10 on: July 19, 2022, 06:53:11 pm »
Whew!