Author Topic: Updating Talos II firmware to IBM PNOR V2.18?  (Read 1643 times)

JeremyRand

  • Newbie
  • *
  • Posts: 24
  • Karma: +8/-0
    • View Profile
Updating Talos II firmware to IBM PNOR V2.18?
« on: February 09, 2023, 07:07:59 pm »
According to this IBM documentation (thanks awilfox for the pointer), PNOR v2.18 contains the following change:

Quote
HIPER/Pervasive:   A problem was fixed for a processor core checkstop with SRC BC70E540 logged with Signature Description " ex(n0p1c4) (NCUFIR[11]) NCU no response to snooped TLBIE".  This problem is intermittent and random but occurs with relatively high frequency for certain workloads.  The trigger for the failure is one core of a fused core pair going into a stopped state while the other core of the pair continues running.

Is it possible to get the Talos II firmware bumped to upstream IBM v2.18? I believe I may be running into this bug.
« Last Edit: February 10, 2023, 09:33:17 am by JeremyRand »

tle

  • Sr. Member
  • ****
  • Posts: 425
  • Karma: +47/-0
    • View Profile
    • Trung's Personal Website
Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #1 on: March 01, 2023, 04:38:04 am »
AFAIK no; Talos firmware is not synced with upstream. However, that does not stop people from using a modified version of the upstream kernel. Perhaps you should have a read of https://www.flamingspork.com/blog/2020/05/25/op-build-v2-5-firmware-for-the-raptor-blackbird/ and see if you could adapt anything from his work.
Faithful Linux enthusiast

My Raptor Blackbird

tle

Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #2 on: March 01, 2023, 04:40:26 am »
Having said that, I am unsure whether using custom firmware would void the warranty. I have to leave that answer to an official response from RaptorCS.

ejfluhr

  • Newbie
  • *
  • Posts: 44
  • Karma: +3/-0
    • View Profile
Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #3 on: March 14, 2023, 06:05:09 pm »
I believe that all OpenPOWER/OPAL-based systems use SMT4 cores, not SMT8 (i.e. "fused") cores.    So that may mean you aren't seeing the same problem.

Do you get the same fault callout?

JeremyRand

Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #4 on: April 10, 2023, 01:21:43 am »
Quote
I believe that all OpenPOWER/OPAL-based systems use SMT4 cores, not SMT8 (i.e. "fused") cores. So that may mean you aren't seeing the same problem.

Do you get the same fault callout?

I get the same error, "NCU no response to snooped TLBIE". I asked Timothy on IRC, and he said that the 18/22-core Raptor CPUs use fused cores. That tracks with the fact that this started happening to me when I upgraded from 4-core CPUs to 22-core CPUs.

Looks like Raptor only has minimal changes to the IBM repo that contains the fix, and there are no rebase conflicts. So, I'll try building a modified firmware that incorporates IBM's fix and see if it helps here.
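
For comparing a fault callout against the one in IBM's fix description, here is a minimal Python sketch. The SRC code and signature string are copied from the IBM text quoted at the top of this thread; the one-entry-per-line log layout, the function name `match_checkstop`, and the reading of `n0p1c4` as three numeric indices are assumptions for illustration, not a documented log format.

```python
import re

# Strings copied from the IBM fix description quoted in this thread.
SRC_CODE = "BC70E540"
SIGNATURE = "NCU no response to snooped TLBIE"

# Assumed shape of the unit name in the signature, e.g. "ex(n0p1c4)".
UNIT_RE = re.compile(r"ex\(n(\d+)p(\d+)c(\d+)\)")


def match_checkstop(log_text):
    """Return one summary dict per line that carries the TLBIE signature."""
    hits = []
    for line in log_text.splitlines():
        if SIGNATURE not in line:
            continue
        unit = UNIT_RE.search(line)
        hits.append({
            # True if the same line also names the SRC from the IBM text.
            "has_src": SRC_CODE in line,
            # The three numbers from "n..p..c..", or None if absent.
            "npc": tuple(int(g) for g in unit.groups()) if unit else None,
            "line": line.strip(),
        })
    return hits
```

An empty result only means the text did not contain that signature; it says nothing about other checkstop causes.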

ejfluhr

Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #5 on: April 17, 2023, 05:44:57 pm »
The 18c & 22c parts are "paired" meaning 2 SMT4 cores share the same L2 & L3, unlike the 4c and 8c which are "unpaired" meaning each core gets the full L2 and L3 to itself.    This is not the same as "fused" (i.e. SMT8 cores) but it is quite likely that the fix will also work for "paired" cores as presumably the issue is sharing cacheable/non-cacheable pathways.   Good luck!
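
The paired/unpaired distinction above can be written down as a small Python sketch. It encodes only what the post states (SMT4 cores; 18c and 22c parts paired, 4c and 8c unpaired); the class and field names are hypothetical, and nothing here models fused SMT8 cores.

```python
from dataclasses import dataclass

THREADS_PER_CORE = 4  # SMT4, as described in the post above


@dataclass(frozen=True)
class ChipLayout:
    cores: int    # number of SMT4 cores on the part
    paired: bool  # True if two cores share one L2/L3

    @property
    def threads(self) -> int:
        """Total hardware threads on the part."""
        return self.cores * THREADS_PER_CORE

    @property
    def cores_per_cache(self) -> int:
        """How many cores sit behind one L2/L3."""
        return 2 if self.paired else 1

    @property
    def threads_per_cache(self) -> int:
        """Hardware threads that share one L2/L3."""
        return self.cores_per_cache * THREADS_PER_CORE


# Pairing per the post: 18c and 22c are paired, 4c and 8c are not.
LAYOUTS = {n: ChipLayout(n, paired=(n in (18, 22))) for n in (4, 8, 18, 22)}
```

On a paired part, eight hardware threads end up behind each L2/L3 even though every core is still SMT4, which is why the symptom can resemble the fused-core case.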

JeremyRand

Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #6 on: April 21, 2023, 10:15:14 pm »
Quote
The 18c & 22c parts are "paired" meaning 2 SMT4 cores share the same L2 & L3, unlike the 4c and 8c which are "unpaired" meaning each core gets the full L2 and L3 to itself. This is not the same as "fused" (i.e. SMT8 cores) but it is quite likely that the fix will also work for "paired" cores as presumably the issue is sharing cacheable/non-cacheable pathways. Good luck!

I think you're right about the terminology, but yeah, I doubt the IBM docs draw a distinction since those docs are exclusively written for SMT8 users.

I've successfully built a PNOR with IBM's patch; it installed without major issues (I accidentally hosed things the first time I installed it due to the BMC running out of RAM -- protip, don't put multiple PNOR images in OpenBMC /tmp/ -- but power-cycling the BMC fixed that). So far it seems stable; I'll be running it for the next month or so to see if any checkstops happen.
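
The /tmp/ pitfall above can be guarded against with a pre-flight check. This is a minimal Python sketch, assuming Python is available wherever the staging directory is visible; the `.pnor` suffix, the 32 MiB reserve, and all function names are hypothetical choices, not anything OpenBMC defines.

```python
import os
import shutil

RESERVE_BYTES = 32 * 1024 * 1024  # hypothetical safety margin to keep free


def leftover_images(staging_dir, suffix=".pnor"):
    """List files in staging_dir that look like previously staged images."""
    return sorted(n for n in os.listdir(staging_dir) if n.endswith(suffix))


def can_stage(image_size, free_bytes, leftovers, reserve=RESERVE_BYTES):
    """Allow staging only if no old image remains and the reserve stays free."""
    return not leftovers and image_size + reserve <= free_bytes


def check_staging(image_path, staging_dir="/tmp"):
    """Gather the inputs for can_stage() from the real filesystem."""
    size = os.path.getsize(image_path)
    free = shutil.disk_usage(staging_dir).free
    leftovers = leftover_images(staging_dir)
    return can_stage(size, free, leftovers), leftovers
```

Keeping `can_stage` free of filesystem calls makes the rule easy to adjust; the right reserve depends on how much RAM the BMC actually has, which this thread does not state.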

tle

Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #7 on: May 28, 2023, 08:21:48 am »
Would you be able to provide more details on which patch? Many thanks

JeremyRand

Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #8 on: June 16, 2023, 02:32:16 am »
This is the hcode branch I used: https://github.com/JeremyRand/hcode/tree/talos-2019-07-25-master-rebased

As you can see, it's simply a copy of Raptor's hcode, rebased against current upstream IBM hcode (there were no rebase conflicts). The specific bugfix commit is this one: https://github.com/JeremyRand/hcode/commit/ca06a0c996e3b48c02cfb3912dddf7ca23ec4202

I've been running it on my DD2.3 2x18-core Talos II for 2 months, and my DD2.2 2x22-core Talos II for 1 month, and I have had no checkstops on either since applying the patch, nor any new issues. At this point I can recommend that Raptor integrate the bugfix into a new PNOR release. Since the rebase had no conflicts, it should be trivially easy for Raptor to do this.
« Last Edit: June 16, 2023, 02:49:29 am by JeremyRand »

tle

Re: Updating Talos II firmware to IBM PNOR V2.18?
« Reply #9 on: March 21, 2024, 09:21:33 pm »
Thanks for the information