Recent Posts

22
Rejoice! Chromium is now also available for Fedora 40! It took years to get to this stage. Now we can all relax :D
23
Firmware / Re: Firmware 2.10 for Talos-II and Blackbird available
« Last post by Borley on March 23, 2024, 09:52:03 pm »
What is the Linux kernel version in firmware 2.10? Some cards are known to be buggy with old kernels; please refer to https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility_List/PCIe_Devices for compatibility.

6.6, I think. The new firmware is up and running on my evaluation system, but I'm not loading firmware files the way r34per is trying to.
24
GPU Compute / Accelerators / Re: Radeon RDNA3 support?
« Last post by Borley on March 23, 2024, 09:45:45 pm »
I've heard the same too, that it's had it since Vega -- but that it needs PSP or ME on the CPU side too to be effective.

Do you know where I might find literature on this? In what way does it attempt to link to the CPU PSP?
25
GPU Compute / Accelerators / Re: AMD flirting with open sourcing GPU firmware
« Last post by Borley on March 23, 2024, 09:41:52 pm »
It seems to have fallen through with no change brought about by any of the dialogue.
26
Firmware / Re: Firmware 2.10 for Talos-II and Blackbird available
« Last post by atomicdog on March 21, 2024, 10:17:00 pm »
Their GitLab repo has more recent changes, so I'm guessing that's where the 2.10 version is.
27
Firmware / Re: Updating Talos II firmware to IBM PNOR V2.18?
« Last post by tle on March 21, 2024, 09:21:33 pm »
The 18c & 22c parts are "paired", meaning 2 SMT4 cores share the same L2 & L3, unlike the 4c and 8c, which are "unpaired", meaning each core gets the full L2 and L3 to itself. This is not the same as "fused" (i.e. SMT8 cores), but it is quite likely that the fix will also work for "paired" cores, as presumably the issue is sharing cacheable/non-cacheable pathways. Good luck!

I think you're right about the terminology, but yeah, I doubt the IBM docs draw a distinction since those docs are exclusively written for SMT8 users.

I've successfully built a PNOR with IBM's patch; it installed without major issues (I accidentally hosed things the first time I installed it due to the BMC running out of RAM -- protip, don't put multiple PNOR images in OpenBMC /tmp/ -- but power-cycling the BMC fixed that). So far it seems stable, I'll be running it for the next month or so to see if any checkstops happen.
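The out-of-RAM failure described above can be guarded against with a quick free-space check before staging an image. This is only a minimal sketch, assuming /tmp on the BMC is RAM-backed and that Python is available there (it may not be on every OpenBMC build). The function names and the 16 MiB reserve are hypothetical choices for illustration, not anything from Raptor's or OpenBMC's tooling.

```python
import os
import shutil

# Hypothetical safety margin to leave free after the copy (an assumption).
DEFAULT_RESERVE = 16 * 1024 * 1024


def fits_with_headroom(image_bytes, free_bytes, reserve_bytes=DEFAULT_RESERVE):
    """Return True if the image fits and still leaves reserve_bytes free."""
    return image_bytes + reserve_bytes <= free_bytes


def safe_to_copy(image_path, target_dir="/tmp"):
    """Check the real filesystem: is there room for image_path in target_dir?"""
    free = shutil.disk_usage(target_dir).free
    return fits_with_headroom(os.path.getsize(image_path), free)
```

Calling `safe_to_copy("/path/to/image.pnor")` before each upload, and deleting the previous image first, would avoid having several images sitting in /tmp at once.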

Would you be able to provide more details on which patch? Many thanks

This is the hcode branch I used: https://github.com/JeremyRand/hcode/tree/talos-2019-07-25-master-rebased

As you can see, it's simply a copy of Raptor's hcode, rebased against current upstream IBM hcode (there were no rebase conflicts). The specific bugfix commit is this one: https://github.com/JeremyRand/hcode/commit/ca06a0c996e3b48c02cfb3912dddf7ca23ec4202

I've been running it on my DD2.3 2x18-core Talos II for 2 months, and my DD2.2 2x22-core Talos II for 1 month, and I have had no checkstops on either since applying the patch, nor any new issues. At this point I can recommend that Raptor integrate the bugfix into a new PNOR release. Since the rebase had no conflicts, it should be trivially easy for Raptor to do this.

Thanks for the information
28
Firmware / Re: Firmware 2.10 for Talos-II and Blackbird available
« Last post by tle on March 21, 2024, 09:16:29 pm »
Folks, any idea where I could find the 2.10 changes in the git repo?

I was looking into https://git.raptorcs.com/git/ but was unable to find any related changes in blackbird-*.
29
GPU Compute / Accelerators / Re: AMD flirting with open sourcing GPU firmware
« Last post by lepidotos on March 20, 2024, 06:06:13 am »
I dunno about their motives, but even just from a performance standpoint, I imagine they have a point. AMD's own software is really bad, and I can't help but imagine their firmware could be improved upon too. Plus, being able to actually see what goes on in there is nice.
30
Firmware / Re: Firmware 2.10 for Talos-II and Blackbird available
« Last post by tle on March 17, 2024, 05:30:54 pm »
I may have spoken too soon :( I get no video when I boot into the OS, and the boot console spits this out:
Code:
SIGTERM received, booting...
[   99.149386402,3] PHB#0000[0:0]:                  brdgCtl = 00000002
[   99.149481878,3] PHB#0000[0:0]:             deviceStatus = 00000020
[   99.149523997,3] PHB#0000[0:0]:               slotStatus = 00402000
[   99.149618190,3] PHB#0000[0:0]:               linkStatus = a0840008
[   99.149660035,3] PHB#0000[0:0]:             devCmdStatus = 00100107
[   99.149727651,3] PHB#0000[0:0]:             devSecStatus = 00002000
[   99.149774208,3] PHB#0000[0:0]:          rootErrorStatus = 00000000
[   99.149829970,3] PHB#0000[0:0]:          corrErrorStatus = 00000000
[   99.149869722,3] PHB#0000[0:0]:        uncorrErrorStatus = 00000000
[   99.149918442,3] PHB#0000[0:0]:                   devctl = 00000020
[   99.149955380,3] PHB#0000[0:0]:                  devStat = 00000000
[   99.149996897,3] PHB#0000[0:0]:                  tlpHdr1 = 00000000
[   99.150043352,3] PHB#0000[0:0]:                  tlpHdr2 = 00000000
[   99.150096694,3] PHB#0000[0:0]:                  tlpHdr3 = 00000000
[   99.150143000,3] PHB#0000[0:0]:                  tlpHdr4 = 00000000
[   99.150189643,3] PHB#0000[0:0]:                 sourceId = 00000000
[   99.150231444,3] PHB#0000[0:0]:                     nFir = 0000000000000000
[   99.150275820,3] PHB#0000[0:0]:                 nFirMask = 0030001c00000000
[   99.150319837,3] PHB#0000[0:0]:                  nFirWOF = 0000000000000000
[   99.150378022,3] PHB#0000[0:0]:                 phbPlssr = 0000001c00000000
[   99.150433559,3] PHB#0000[0:0]:                   phbCsr = 0000001c00000000
[   99.150489148,3] PHB#0000[0:0]:                   lemFir = 0000000100280000
[   99.150533384,3] PHB#0000[0:0]:             lemErrorMask = 0000000000000000
[   99.150577353,3] PHB#0000[0:0]:                   lemWOF = 0000000100000000
[   99.150621318,3] PHB#0000[0:0]:           phbErrorStatus = 0000088000000000
[   99.150672497,3] PHB#0000[0:0]:      phbFirstErrorStatus = 0000008000000000
[   99.150728026,3] PHB#0000[0:0]:             phbErrorLog0 = 2148000098000240
[   99.150774762,3] PHB#0000[0:0]:             phbErrorLog1 = a008400000000000
[   99.150823696,3] PHB#0000[0:0]:        phbTxeErrorStatus = 0000000000000000
[   99.150872357,3] PHB#0000[0:0]:   phbTxeFirstErrorStatus = 0000000000000000
[   99.150916641,3] PHB#0000[0:0]:          phbTxeErrorLog0 = 0000000000000000
[   99.150965287,3] PHB#0000[0:0]:          phbTxeErrorLog1 = 0000000000000000
[   99.151018775,3] PHB#0000[0:0]:     phbRxeArbErrorStatus = 4000200000000000
[   99.151074489,3] PHB#0000[0:0]: phbRxeArbFrstErrorStatus = 0000200000000000
[   99.151127737,3] PHB#0000[0:0]:       phbRxeArbErrorLog0 = 02409fde30000000
[   99.151171863,3] PHB#0000[0:0]:       phbRxeArbErrorLog1 = 0000000000000000
[   99.151215896,3] PHB#0000[0:0]:     phbRxeMrgErrorStatus = 0000000000000000
[   99.151260084,3] PHB#0000[0:0]: phbRxeMrgFrstErrorStatus = 0000000000000000
[   99.151315450,3] PHB#0000[0:0]:       phbRxeMrgErrorLog0 = 0000000000000000
[   99.151369016,3] PHB#0000[0:0]:       phbRxeMrgErrorLog1 = 0000000000000000
[   99.151424438,3] PHB#0000[0:0]:     phbRxeTceErrorStatus = 0000000000000000
[   99.151471170,3] PHB#0000[0:0]: phbRxeTceFrstErrorStatus = 0000000000000000
[   99.151517918,3] PHB#0000[0:0]:       phbRxeTceErrorLog0 = 0000000000000000
[   99.151561833,3] PHB#0000[0:0]:       phbRxeTceErrorLog1 = 0000000000000000
[   99.151614682,3] PHB#0000[0:0]:        phbPblErrorStatus = 0000000001000000
[   99.151663274,3] PHB#0000[0:0]:   phbPblFirstErrorStatus = 0000000001000000
[   99.151716727,3] PHB#0000[0:0]:          phbPblErrorLog0 = 0000000000000000
[   99.151762796,3] PHB#0000[0:0]:          phbPblErrorLog1 = 0000000000000000
[   99.151813691,3] PHB#0000[0:0]:      phbPcieDlpErrorLog1 = 0000000000000000
[   99.151858094,3] PHB#0000[0:0]:      phbPcieDlpErrorLog2 = 0000000000000000
[   99.151904253,3] PHB#0000[0:0]:    phbPcieDlpErrorStatus = 00be000000000000
[   99.151959774,3] PHB#0000[0:0]:       phbRegbErrorStatus = 0000004000000000
[   99.152015372,3] PHB#0000[0:0]:  phbRegbFirstErrorStatus = 0000004000000000
[   99.152068905,3] PHB#0000[0:0]:         phbRegbErrorLog0 = 8800006c00000000
[   99.152115691,3] PHB#0000[0:0]:         phbRegbErrorLog1 = 0000000007011000
[   99.152162310,3] PHB#0000[0:0]:                PEST[000] = a440002a00000000 8000000000000000
[   99.152218234,3] PHB#0000[0:0]:                PEST[001] = 8000000000000000 8000000000000000
[   99.152285858,3] PHB#0000[0:0]:                PEST[002] = 8000000000000000 8000000000000000
[   99.152350714,3] PHB#0000[0:0]:                PEST[003] = 8000000000000000 8000000000000000
[   99.152414534,3] PHB#0000[0:0]:                PEST[004] = 8000000000000000 8000000000000000
[   99.152474834,3] PHB#0000[0:0]:                PEST[005] = 8000000000000000 8000000000000000
[   99.152528675,3] PHB#0000[0:0]:                PEST[006] = 8000000000000000 8000000000000000
[   99.152589889,3] PHB#0000[0:0]:                PEST[007] = 8000000000000000 8000000000000000
[   99.152657446,3] PHB#0000[0:0]:                PEST[008] = 8000000000000000 8000000000000000
[   99.152720282,3] PHB#0000[0:0]:                PEST[1ff] = 3740002a03000000 0000000000000000
[    3.560406] EEH: Recovering PHB#0-PE#0
[    3.560433] EEH: PE location: UOPWR.D100029-Node0-SLOT1 PCIE 4.0 X16, PHB location: N/A
[    3.560473] EEH: Frozen PHB#0-PE#0 detected
[    3.560486] EEH: Call Trace:
[    3.560526] EEH: [00000000c094f14c] __eeh_send_failure_event+0x7c/0x160
[    3.560585] EEH: [00000000c2fbde4c] eeh_dev_check_failure+0x2c4/0x6a0
[    3.560634] EEH: [00000000eb293b00] amdgpu_device_rreg.part.0+0x160/0x1f0 [amdgpu]
[    3.560924] EEH: [0000000009854edf] psp_wait_for+0xac/0x130 [amdgpu]
[    3.561223] EEH: [0000000006086f20] psp_v11_0_mode1_reset+0xbc/0x130 [amdgpu]
[    3.561554] EEH: [00000000927ca5cd] psp_gpu_reset+0x88/0xd0 [amdgpu]
[    3.561868] EEH: [000000000d948d66] amdgpu_device_mode1_reset+0x148/0x180 [amdgpu]
[    3.562116] EEH: [00000000d607b75f] nv_asic_reset+0xbc/0x290 [amdgpu]
[    3.562414] EEH: [00000000893f34f2] amdgpu_device_init+0x172c/0x2300 [amdgpu]
[    3.562693] EEH: [00000000d8547fbc] amdgpu_driver_load_kms+0x30/0x1e0 [amdgpu]
[    3.562966] EEH: [000000008c9f0b1b] amdgpu_pci_probe+0x1f0/0x540 [amdgpu]
[    3.563210] EEH: [0000000067c06d95] local_pci_probe+0x68/0x110
[    3.563250] EEH: [000000004224f0ca] work_for_cpu_fn+0x38/0x60
[    3.563290] EEH: [00000000c5105116] process_one_work+0x2a4/0x570
[    3.563332] EEH: [00000000f81a86b6] worker_thread+0x280/0x5b0
[    3.563372] EEH: [00000000bf39fc31] kthread+0x120/0x130
[    3.563409] EEH: [0000000036d034ff] ret_from_kernel_thread+0x5c/0x64
[    3.852813] kernel BUG at drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c:593!
[    3.852840] Oops: Exception in kernel mode, sig: 5 [#1]
[    3.852856] LE PAGE_SIZE=4K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[    3.852884] Modules linked in: uas usb_storage sd_mod amdgpu(+) gpu_sched drm_buddy i2c_algo_bit drm_display_helper cec rc_core drm_ttm_helper ttm drm_kms_helper xhci_pci xhci_pci_renesas syscopyarea sysfillrect ahci sysimgblt fb_sys_fops libahci xhci_hcd libata drm vmx_crypto gf128mul usbcore scsi_mod drm_panel_orientation_quirks usb_common scsi_common agpgart dm_mirror dm_region_hash dm_log dm_mod btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic crc32c_vpmsum
[    3.853130] CPU: 0 PID: 23 Comm: kworker/0:0 Not tainted 6.0.13_1 #1
[    3.853162] Workqueue: events work_for_cpu_fn
[    3.853201] NIP:  c008000002cbb648 LR: c008000002c3cb50 CTR: c008000002cbb5f8
[    3.853241] REGS: c000000002527500 TRAP: 0700   Not tainted  (6.0.13_1)
[    3.853288] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002248  XER: 20040000
[    3.853339] CFAR: c008000002cbb6dc IRQMASK: 0
[    3.853339] GPR00: c008000002c3cb50 c0000000025277a0 c0080000033b1000 00feffffff900000
[    3.853339] GPR04: 00feffffff900000 c000000002527858 c000000002527860 c0000000024f0000
[    3.853339] GPR08: 0000000000000001 00fe000000000000 0040000000000002 c00800000318aef0
[    3.853339] GPR12: c008000002cbb5f8 c0000003ff7ef600 c000000016e86070 c000000016e86078
[    3.853339] GPR16: c000000016e86068 c000000016e98338 c000000016e86088 c000000016e86090
[    3.853339] GPR20: c000000016e86080 c008000003430dcc 0000000000000100 c000000016e97250
[    3.853339] GPR24: 0000000000000001 c0080000033c5dd0 c000000016e80000 c000000016e85208
[    3.853339] GPR28: c000000016e80000 ffffffffffffffff c000000002527860 c000000002527860
[    3.853711] NIP [c008000002cbb648] gmc_v10_0_get_vm_pde+0x50/0x120 [amdgpu]
[    3.854018] LR [c008000002c3cb50] amdgpu_gmc_get_pde_for_bo+0xa8/0x110 [amdgpu]
[    3.854326] Call Trace:
[    3.854348] [c0000000025277a0] [c0000000025277e0] 0xc0000000025277e0 (unreliable)
[    3.854389] [c0000000025277e0] [c008000002c3cb50] amdgpu_gmc_get_pde_for_bo+0xa8/0x110 [amdgpu]
[    3.854699] [c000000002527830] [c008000002c3cc08] amdgpu_gmc_pd_addr+0x50/0xa8 [amdgpu]
[    3.855008] [c000000002527870] [c008000002cb7b30] gfxhub_v2_0_gart_enable+0x48/0x11f0 [amdgpu]
[    3.855325] [c0000000025278d0] [c008000002cbce30] gmc_v10_0_hw_init+0x88/0x270 [amdgpu]
[    3.855651] [c000000002527960] [c008000002be4a9c] amdgpu_device_init+0x1ee4/0x2300 [amdgpu]
[    3.855968] [c000000002527ac0] [c008000002be6758] amdgpu_driver_load_kms+0x30/0x1e0 [amdgpu]
[    3.856240] [c000000002527b40] [c008000002bdae68] amdgpu_pci_probe+0x1f0/0x540 [amdgpu]
[    3.856532] [c000000002527be0] [c0000000008d6078] local_pci_probe+0x68/0x110
[    3.856583] [c000000002527c60] [c00000000017f5b8] work_for_cpu_fn+0x38/0x60
[    3.856634] [c000000002527c90] [c000000000184ee4] process_one_work+0x2a4/0x570
[    3.856684] [c000000002527d30] [c000000000185a30] worker_thread+0x280/0x5b0
[    3.856725] [c000000002527dc0] [c000000000191a70] kthread+0x120/0x130
[    3.856765] [c000000002527e10] [c00000000000cecc] ret_from_kernel_thread+0x5c/0x64
[    3.856807] Instruction dump:
[    3.856829] 7c7c1b78 fbe1fff8 7c9d2378 f821ffc1 e8850000 794a07c6 7cdf3378 614a0002
[    3.856876] 7d095039 41820074 788982a0 79298002 <0b090000> 893c0d44 2c090000 41820014
[    3.856935] ---[ end trace 0000000000000000 ]---
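
A dump like the one above is easier to read once the all-zero registers are filtered out. The following is a rough sketch for doing that, assuming the line layout is exactly as pasted here (timestamp and log level in brackets, then `PHB#nnnn[x:y]:`, a register name, `=`, and one or two hex words). The function names are made up for this example and it does not interpret what any bit means.

```python
import re

# Matches lines such as:
# [   99.150489148,3] PHB#0000[0:0]:                   lemFir = 0000000100280000
PHB_LINE = re.compile(
    r"^\[\s*[\d.]+,\d+\]\s+PHB#(?P<phb>[0-9a-fA-F]+)\[\d+:\d+\]:\s+"
    r"(?P<name>[\w\[\]]+)\s+=\s+(?P<values>[0-9a-fA-F ]+?)\s*$"
)


def parse_phb_dump(text):
    """Map register name -> list of integer values (PEST rows carry two)."""
    regs = {}
    for line in text.splitlines():
        match = PHB_LINE.match(line.strip())
        if match:
            words = match.group("values").split()
            regs[match.group("name")] = [int(word, 16) for word in words]
    return regs


def nonzero_registers(regs):
    """Keep only registers where at least one value is non-zero."""
    return {name: vals for name, vals in regs.items() if any(vals)}
```

Feeding the pasted console text to `parse_phb_dump` and then `nonzero_registers` leaves a much shorter list to compare between boots. Kernel lines such as the EEH trace do not match the pattern and are skipped.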

Fast reboot is disabled, and these were the firmware files I used:
Code:
navi10_asd.bin     navi14_gpu_info.bin  navi14_me_wks.bin   navi14_smc.bin
navi10_ta.bin      navi14_me.bin        navi14_pfp.bin      navi14_sos.bin
navi10_vcn.bin     navi14_mec2.bin      navi14_pfp_wks.bin  navi14_ta.bin
navi14_asd.bin     navi14_mec2_wks.bin  navi14_rlc.bin      navi14_vcn.bin
navi14_ce.bin      navi14_mec.bin       navi14_sdma1.bin
navi14_ce_wks.bin  navi14_mec_wks.bin   navi14_sdma.bin

I tried all manner of combinations of the Navi firmware, and the ones that did give me video in Petitboot would throw the same error.
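
When trying combinations like this, it can help to see at a glance which ASIC prefixes are in the set and which blocks are present under more than one prefix. Here is a small sketch that groups the file names from the list above; the helper names are hypothetical, and it only looks at names of the form `<asic>_<block>.bin`. It says nothing about which files the amdgpu driver actually requests for a given card.

```python
from collections import defaultdict

# File names copied from the listing above.
LISTED = """
navi10_asd.bin     navi14_gpu_info.bin  navi14_me_wks.bin   navi14_smc.bin
navi10_ta.bin      navi14_me.bin        navi14_pfp.bin      navi14_sos.bin
navi10_vcn.bin     navi14_mec2.bin      navi14_pfp_wks.bin  navi14_ta.bin
navi14_asd.bin     navi14_mec2_wks.bin  navi14_rlc.bin      navi14_vcn.bin
navi14_ce.bin      navi14_mec.bin       navi14_sdma1.bin
navi14_ce_wks.bin  navi14_mec_wks.bin   navi14_sdma.bin
""".split()


def group_by_asic(filenames):
    """Map ASIC prefix (e.g. 'navi14') -> sorted list of block names."""
    groups = defaultdict(list)
    for name in filenames:
        asic, sep, rest = name.partition("_")
        if sep and rest.endswith(".bin"):
            groups[asic].append(rest[: -len(".bin")])
    return {asic: sorted(blocks) for asic, blocks in groups.items()}


def overlapping_blocks(groups):
    """Return block names that appear under more than one ASIC prefix."""
    seen = defaultdict(set)
    for asic, blocks in groups.items():
        for block in blocks:
            seen[block].add(asic)
    return {block for block, asics in seen.items() if len(asics) > 1}
```

Running `overlapping_blocks(group_by_asic(LISTED))` on the list above shows where both a navi10 and a navi14 copy of the same block were supplied, which makes it easier to keep track of the combinations already tried.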


What is the Linux kernel version in firmware 2.10? Some cards are known to be buggy with old kernels; please refer to https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility_List/PCIe_Devices for compatibility.