Wx9100 on Talos II

rheaplex:
Current status:

Stock 64k page Fedora kernel: 5.14.10-300.fc35.ppc64le

Kernel arguments: root=UUID=73574303-f000-44ec-bbc2-7c86d0ab50d2 ro modprobe.blacklist=ast video=offb:off amdgpu.aspm=0
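
As a sanity check that the arguments above actually reach the running kernel, here is a minimal Python sketch that compares a command line against the three options discussed in this thread. The function names (`parse_cmdline`, `missing_params`, `read_running_cmdline`) and the `EXPECTED` table are made up for illustration, and the parser assumes plain whitespace-separated tokens with no quoted values.

```python
# Hypothetical helper, not part of any distro tooling.
# Assumption: tokens are separated by whitespace and contain no quoted spaces.

EXPECTED = {
    "modprobe.blacklist": "ast",   # keep the AST VGA driver from loading
    "video": "offb:off",           # disable the Open Firmware framebuffer
    "amdgpu.aspm": "0",            # turn off PCIe ASPM in amdgpu
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so root=UUID=... keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected name=value pairs that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [f"{k}={v}" for k, v in expected.items() if params.get(k) != v]


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently booted kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

On the booted system, `missing_params(read_running_cmdline())` returning an empty list would mean all three options were passed through by Petitboot.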

I removed the AST disable jumper that the box shipped with, used the AST VGA output until Petitboot, then switched to the AMD output after selecting the kernel to boot. No joy.  :(

I tried a cold boot, which I thought had worked before, but it didn't seem to help.

Current dmesg output for amdgpu on boot:

[    2.997064] [drm] amdgpu kernel modesetting enabled.
[    2.998294] amdgpu: CRAT table disabled by module option
[    2.998297] amdgpu: DSDT table not found for OEM information
[    2.998300] amdgpu: IO link not available for non x86 platforms
[    2.998302] amdgpu: IO link not available for non x86 platforms
[    2.998305] amdgpu: Virtual CRAT table created for CPU
[    2.998329] amdgpu: Topology: Add CPU node
[    2.998522] amdgpu 0000:03:00.0: enabling device (0540 -> 0542)
[    2.998537] [drm] initializing kernel modesetting (VEGA10 0x1002:0x6861 0x103C:0x0B0F 0x00).
[    2.998546] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    2.998563] [drm] register mmio base: 0x00000000
[    2.998567] [drm] register mmio size: 524288
[    2.998581] [drm] PCIE atomic ops is not supported
[    2.998593] [drm] add ip block number 0 <soc15_common>
[    2.998598] [drm] add ip block number 1 <gmc_v9_0>
[    2.998602] [drm] add ip block number 2 <vega10_ih>
[    2.998606] [drm] add ip block number 3 <psp>
[    2.998609] [drm] add ip block number 4 <gfx_v9_0>
[    2.998613] [drm] add ip block number 5 <sdma_v4_0>
[    2.998617] [drm] add ip block number 6 <powerplay>
[    2.998621] [drm] add ip block number 7 <dm>
[    2.998625] [drm] add ip block number 8 <uvd_v7_0>
[    2.998628] [drm] add ip block number 9 <vce_v4_0>
[    3.324728] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    3.324734] amdgpu: ATOM BIOS: 113-D0510300-100
[    3.326345] [drm] UVD(0) is enabled in VM mode
[    3.326348] [drm] UVD(0) ENC is enabled in VM mode
[    3.326350] [drm] VCE enabled in VM mode
[    3.326368] amdgpu 0000:03:00.0: amdgpu: MEM ECC is active.
[    3.326372] amdgpu 0000:03:00.0: amdgpu: SRAM ECC is not presented.
[    3.326377] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    3.326408] amdgpu 0000:03:00.0: amdgpu: VRAM: 16368M 0x000000F400000000 - 0x000000F7FEFFFFFF (16368M used)
[    3.326413] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    3.326417] amdgpu 0000:03:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[    3.326423] [drm] Detected VRAM RAM=16368M, BAR=16384M
[    3.326426] [drm] RAM width 2048bits HBM
[    3.326461] [drm] amdgpu: 16368M of VRAM memory ready
[    3.326464] [drm] amdgpu: 16368M of GTT memory ready.
[    3.326483] [drm] GART: num cpu pages 8192, num gpu pages 131072
[    3.326576] [drm] PCIE GART of 512M enabled.
[    3.326579] [drm] PTB located at 0x000000F400900000
[    3.331010] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[    3.341794] usb 1-3.1: New USB device found, idVendor=1d6b, idProduct=0104, bcdDevice= 1.00
[    3.341800] usb 1-3.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    3.341804] usb 1-3.1: Product: virtual_input
[    3.341807] usb 1-3.1: Manufacturer: OpenBMC
[    3.341810] usb 1-3.1: SerialNumber: OBMC0001
[    3.352927] amdgpu: hwmgr_sw_init smu backed is vega10_smu
[    3.380768] [drm] Found UVD firmware Version: 66.43 Family ID: 17
[    3.380775] [drm] PSP loading UVD firmware
[    3.392403] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[    3.392411] [drm] PSP loading VCE firmware
[    3.787815] usb 1-2.2: new full-speed USB device number 7 using xhci_hcd
[    3.787827] [drm] reserve 0x400000 from 0xf7fec00000 for PSP TMR
[    3.796806] input: OpenBMC virtual_input as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/usb1/1-3/1-3.1/1-3.1:1.0/0003:1D6B:0104.0001/input/input0
[    3.872788] [drm] kiq ring mec 2 pipe 1 q 0
[    3.873646] EEH: Recovering PHB#0-PE#0
[    3.873660] EEH: PE location: UOPWR.A100059-Node0-CPU1 Slot2 (16x), PHB location: N/A
[    3.873666] EEH: Frozen PHB#0-PE#0 detected
[    3.873670] EEH: Call Trace:
[    3.873673] EEH: [(____ptrval____)] __eeh_send_failure_event+0x7c/0x160
[    3.873684] EEH: [(____ptrval____)] eeh_dev_check_failure+0x2b4/0x670
[    3.873692] EEH: [(____ptrval____)] amdgpu_device_rreg.part.0+0x160/0x1f0 [amdgpu]
[    3.873916] EEH: [(____ptrval____)] gfx_v9_0_mqd_init.isra.0+0x314/0x748 [amdgpu]
[    3.874169] EEH: [(____ptrval____)] gfx_v9_0_hw_init+0x1e70/0x2f20 [amdgpu]
[    3.874427] EEH: [(____ptrval____)] amdgpu_device_init+0x1ec4/0x2160 [amdgpu]
[    3.874670] EEH: [(____ptrval____)] amdgpu_driver_load_kms+0x48/0x370 [amdgpu]
[    3.874890] EEH: [(____ptrval____)] amdgpu_pci_probe+0x174/0x330 [amdgpu]
[    3.875113] EEH: [(____ptrval____)] local_pci_probe+0x68/0x110
[    3.875122] EEH: [(____ptrval____)] work_for_cpu_fn+0x38/0x60
[    3.875128] EEH: [(____ptrval____)] process_one_work+0x294/0x580
[    3.875134] EEH: [(____ptrval____)] worker_thread+0x2b0/0x650
[    3.875140] EEH: [(____ptrval____)] kthread+0x17c/0x190
[    3.875145] EEH: [(____ptrval____)] ret_from_kernel_thread+0x5c/0x64
[    3.875151] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[    3.875157] EEH: Notify device drivers to shutdown
[    3.875162] EEH: Beginning: 'error_detected(IO frozen)'
[    4.120637] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[    4.120782] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[    4.120934] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v9_0> failed -110
[    4.121072] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[    4.121075] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[    4.121106] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[    4.122273] hid-generic 0003:1D6B:0104.0001: input,hidraw0: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-0003:01:00.0-3.1/input0
[    4.129995] input: OpenBMC virtual_input as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/usb1/1-3/1-3.1/1-3.1:1.1/0003:1D6B:0104.0002/input/input1
[    4.132757] hid-generic 0003:1D6B:0104.0002: input,hidraw1: USB HID v1.01 Mouse [OpenBMC virtual_input] on usb-0003:01:00.0-3.1/input1
[    4.139626] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = (____ptrval____); ring_buffer_end = (____ptrval____); write_frame = (____ptrval____)
[    4.139800] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[    4.139975] [drm] free PSP TMR buffer
[    4.139980] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = (____ptrval____); ring_buffer_end = (____ptrval____); write_frame = (____ptrval____)
[    4.140146] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[    4.162159] amdgpu: probe of 0000:03:00.0 failed with error -110
[    4.162221] BUG: Unable to handle kernel data access on read at 0xc00a0000c0aa0000
[    4.162226] Faulting instruction address: 0xc00000000002c030
[    4.162229] Oops: Kernel access of bad area, sig: 11 [#1]
[    4.162233] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[    4.162238] Modules linked in: amdgpu(+) drm_ttm_helper ttm mfd_core gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm i2c_core nvme vmx_crypto nvme_core drm_panel_orientation_quirks tg3 crc32c_vpmsum
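
To make the order of events in a log like this easier to see, here is a rough Python sketch that pulls out only the failure lines with their timestamps. The function name `first_failures` and the marker strings are my own choices for illustration, not anything defined by the kernel. In the output above, the EEH freeze (3.873666) comes before the KIQ ring test failure (4.120637), so the -110 (ETIMEDOUT) is probably a consequence of the frozen PE rather than the original cause, though that is only a reading of the timestamps.

```python
import re

# Matches the "[    3.873666] message" prefix that dmesg prints by default.
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")

# Substrings treated as failure markers; chosen by hand from the log above.
MARKERS = ("*ERROR*", "EEH: Frozen", "failed with error", "Fatal error")


def first_failures(log_text, markers=MARKERS):
    """Return (timestamp, message) tuples for failure lines, in log order."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and any(marker in m.group("msg") for marker in markers):
            hits.append((float(m.group("ts")), m.group("msg")))
    return hits
```

Feeding it the saved output of `dmesg` and printing the tuples gives a short timeline of what failed first.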




ClassicHasClass:
This is an older kernel. When I updated to F35, I got 5.14.18 to start with, and that's been fine with my WX7100 (T2). The only difference between my system and yours is that I use the HPT MMU instead of radix, but that shouldn't cause this problem. Are you able to update the kernel at all?

MPC7500:
This is weird, because tle has (or had) a Vega too, and it worked, also on Fedora:
https://forums.raptorcs.com/index.php/topic,47.0.html

rheaplex:
I tried Fedora 36 Beta as the host OS with kernel 5.17, which renders the desktop to AST but not to AMDGPU. I’ll head back to a newer 5.15, then try a 5.5 if that doesn’t help (Petitboot is on 5.5 and works with AMDGPU).

The Talos II arrived with a jumper attached to the onboard VGA disabler. I removed it to see if that helped with the suggested approach of getting to Petitboot with AST, then switching to AMDGPU to load the host OS, but it doesn’t seem to have.

The variables other than the kernel and its config are:

* The boot console in Petitboot is set to tty1.
* The VGA jumper is currently not set.
* The microDP output of the AMDGPU board is connected to the monitor via an HDMI converter, but that doesn’t seem to affect Petitboot.

I’m sure I’m missing something obvious either in the precise combination of configuration options I’m using or in some implicit subtlety of the suggested solutions that I’m failing to follow correctly.  :-*

MPC7500:
Sadly, I don't see any mistake in your setup. I would try the Void Linux live image.

Add
--- Code: ---modprobe.blacklist=ast video=offb:off amdgpu.aspm=0

--- End code ---
to the kernel command line and boot the live image.

It should work out of the box. If that also fails, I would guess that the GPU is faulty.
