Topic: Wx9100 on Talos II

rheaplex

Wx9100 on Talos II
« on: March 27, 2022, 05:54:18 pm »
Heya all.

(Thank you to MPC7500 for previous help on this.)

I have a Talos II with a WX9100 graphics card installed, direct from RaptorCS. The graphics card displays PetitBoot perfectly every time, and then almost always fails to come back up for the main OS.

I have tried many different things to fix this:
- Different kernel arguments? Nothing seems to help.
- Different distro and kernel versions? Nothing reliable.
- Different kernel page size? I thought that 5.5 with 4k worked well, until it didn't. Otherwise there was no correlation.
- Try nomodeset for late KMS? Passed nomodeset to the OS kernel: faster boot, but I can’t see amdgpu in lsmod.
- Fast reset causing problems? https://wiki.raptorcs.com/wiki/Enabling_Navi_10_On_Fedora_31#Disabling_fast-reset Followed guide, no change.
- PCI needs resetting? https://wiki.raptorcs.com/wiki/File:Pcie_hot_reset.sh Ran script with 0000:03:00.01 as the argument (the AMDGPU lspci entry). It jammed after printing “Removing 0000:03:00.01…”.
- USB hub causing issues? Removed during boot, no change.
- Electrical power load issues on CPUs loading main OS? Changed to two separate wall sockets for the power supplies, no change.
- Bad DP/HDMI connector? Changed converter and cable, no change.
- No access to firmware? Tried amdgpu as a module, or compiled in w/firmware in image. Rebuilt initramfs after it looked like it failed during install (the Fedora post-install script didn’t like the Petitboot version string being hex rather than decimal); it doesn’t boot.

I notice that the PetitBoot amdgpu module size is different from the main OS one in my current setup. Other than that I've got nothing.  :'(

uname -a:

Code: [Select]
Linux workstation 5.15.0-60.local.fc35.ppc64le #1 SMP Sat Mar 26 17:31:31 PDT 2022 ppc64le ppc64le ppc64le GNU/Linux
lspci:

Code: [Select]
0000:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0000:01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Vega 10 PCIe Bridge
0000:02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Vega 10 PCIe Bridge
0000:03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
0000:03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
0001:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0001:01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)
0002:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0003:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0004:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0004:01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe
0004:01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe
0005:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0005:01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0005:02:00.0 Multimedia video controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
0030:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0030:01:00.0 Audio device: Creative Labs EMU20k2 [Sound Blaster X-Fi Titanium Series] (rev 03)
0031:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0031:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
0032:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0033:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)

inxi:
Code: [Select]
System:
  Host: workstation Kernel: 5.15.0-60.local.fc35.ppc64le ppc64le bits: 64 Desktop: N/A
    Distro: Fedora release 35 (Thirty Five)
Machine:
  Type: PPC System: T2P9S01 REV 1.01 details: N/A
CPU:
  Info: 2x 18-core POWER9 altivec supported [MT MCP SMP] speed (MHz): avg: 2211
    min/max: 2166/3800
Graphics:
  Device-1: AMD Vega 10 XT [Radeon PRO WX 9100] driver: amdgpu v: kernel
  Device-2: ASPEED Graphics Family driver: N/A
  Display: server: X.org v: 1.20.14 driver: X: loaded: ati,radeon unloaded: fbdev,modesetting
    gpu: amdgpu
  Message: No GL data found on this system.
Network:
  Device-1: Broadcom NetXtreme BCM5719 Gigabit Ethernet PCIe driver: tg3
  Device-2: Broadcom NetXtreme BCM5719 Gigabit Ethernet PCIe driver: tg3
Drives:
  Local Storage: total: 1.36 TiB used: 320.2 GiB (22.9%)
Info:
  Processes: 1176 Uptime: 10m Memory: 125.41 GiB used: 2.73 GiB (2.2%) Init: systemd runlevel: 3
  Shell: Bash inxi: 3.3.13

dmesg output is attached. The interesting part seems to be after "[drm] amdgpu kernel modesetting enabled."

System config also attached.

Can anyone suggest a known good distro/kernel/firmware version/configuration that I can test against? Or is there anything I'm doing obviously wrong?

Thank you.

MPC7500

Re: Wx9100 on Talos II
« Reply #1 on: April 01, 2022, 07:01:52 am »
I would use the AST GPU until Petitboot and then switch to the AMD GPU.

xilinder

Re: Wx9100 on Talos II
« Reply #2 on: April 01, 2022, 08:44:16 am »
@rheaplex

"The graphics card displays PetitBoot perfectly every time, and then almost always fails to come back up for the main OS."

If you switch off the main power and let the system go completely dead, does it boot differently (better)?
Talos II 2x8, 32GB RAM, onboard Microsemi RAID,  AMD WX7100, J.Micron SATA/PATA PCIe adapter. Debian with Mate.

ClassicHasClass

Re: Wx9100 on Talos II
« Reply #3 on: April 01, 2022, 09:40:33 am »
Petitboot is almost certainly using an older kernel, too (which is usually the case).

What are your kernel boot arguments?

MPC7500

Re: Wx9100 on Talos II
« Reply #4 on: April 01, 2022, 05:53:08 pm »
If you're on kernel 5.14+ you have to add amdgpu.aspm=0 to the kernel command line.

rheaplex

Re: Wx9100 on Talos II
« Reply #5 on: April 13, 2022, 03:29:56 pm »
Current status:

Stock 64k page Fedora kernel: 5.14.10-300.fc35.ppc64le

Kernel arguments: root=UUID=73574303-f000-44ec-bbc2-7c86d0ab50d2 ro modprobe.blacklist=ast video=offb:off amdgpu.aspm=0
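To confirm that the arguments suggested in this thread actually reached the running kernel (rather than only the boot entry), /proc/cmdline can be compared against the expected set. This is a minimal sketch: the helper names parse_cmdline and missing_params and the WANTED table are made up for illustration, and values containing quoted spaces are not handled.

```python
from pathlib import Path

# Arguments suggested in this thread; adjust to taste (illustrative table).
WANTED = {
    "modprobe.blacklist": "ast",
    "video": "offb:off",
    "amdgpu.aspm": "0",
}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so root=UUID=... keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: dict) -> list:
    """Return names from `wanted` that are absent or set to a different value."""
    have = parse_cmdline(cmdline)
    absent = object()  # sentinel that never equals a real value
    return [name for name, value in wanted.items()
            if have.get(name, absent) != value]


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():
        print("missing:", missing_params(proc.read_text(), WANTED))
```

An empty list means every expected argument is present with the expected value.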

I removed the AST disable jumper switch that the box shipped with, and used the AST VGA until Petitboot, then after selecting the kernel to boot I switched to the AMD output. No joy.  :(

I tried a cold boot, which I thought had worked before, but it didn't seem to help.

Current dmesg output for amdgpu on boot:

[    2.997064] [drm] amdgpu kernel modesetting enabled.
[    2.998294] amdgpu: CRAT table disabled by module option
[    2.998297] amdgpu: DSDT table not found for OEM information
[    2.998300] amdgpu: IO link not available for non x86 platforms
[    2.998302] amdgpu: IO link not available for non x86 platforms
[    2.998305] amdgpu: Virtual CRAT table created for CPU
[    2.998329] amdgpu: Topology: Add CPU node
[    2.998522] amdgpu 0000:03:00.0: enabling device (0540 -> 0542)
[    2.998537] [drm] initializing kernel modesetting (VEGA10 0x1002:0x6861 0x103C:0x0B0F 0x00).
[    2.998546] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    2.998563] [drm] register mmio base: 0x00000000
[    2.998567] [drm] register mmio size: 524288
[    2.998581] [drm] PCIE atomic ops is not supported
[    2.998593] [drm] add ip block number 0 <soc15_common>
[    2.998598] [drm] add ip block number 1 <gmc_v9_0>
[    2.998602] [drm] add ip block number 2 <vega10_ih>
[    2.998606] [drm] add ip block number 3 <psp>
[    2.998609] [drm] add ip block number 4 <gfx_v9_0>
[    2.998613] [drm] add ip block number 5 <sdma_v4_0>
[    2.998617] [drm] add ip block number 6 <powerplay>
[    2.998621] [drm] add ip block number 7 <dm>
[    2.998625] [drm] add ip block number 8 <uvd_v7_0>
[    2.998628] [drm] add ip block number 9 <vce_v4_0>
[    3.324728] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    3.324734] amdgpu: ATOM BIOS: 113-D0510300-100
[    3.326345] [drm] UVD(0) is enabled in VM mode
[    3.326348] [drm] UVD(0) ENC is enabled in VM mode
[    3.326350] [drm] VCE enabled in VM mode
[    3.326368] amdgpu 0000:03:00.0: amdgpu: MEM ECC is active.
[    3.326372] amdgpu 0000:03:00.0: amdgpu: SRAM ECC is not presented.
[    3.326377] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    3.326408] amdgpu 0000:03:00.0: amdgpu: VRAM: 16368M 0x000000F400000000 - 0x000000F7FEFFFFFF (16368M used)
[    3.326413] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    3.326417] amdgpu 0000:03:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[    3.326423] [drm] Detected VRAM RAM=16368M, BAR=16384M
[    3.326426] [drm] RAM width 2048bits HBM
[    3.326461] [drm] amdgpu: 16368M of VRAM memory ready
[    3.326464] [drm] amdgpu: 16368M of GTT memory ready.
[    3.326483] [drm] GART: num cpu pages 8192, num gpu pages 131072
[    3.326576] [drm] PCIE GART of 512M enabled.
[    3.326579] [drm] PTB located at 0x000000F400900000
[    3.331010] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[    3.341794] usb 1-3.1: New USB device found, idVendor=1d6b, idProduct=0104, bcdDevice= 1.00
[    3.341800] usb 1-3.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    3.341804] usb 1-3.1: Product: virtual_input
[    3.341807] usb 1-3.1: Manufacturer: OpenBMC
[    3.341810] usb 1-3.1: SerialNumber: OBMC0001
[    3.352927] amdgpu: hwmgr_sw_init smu backed is vega10_smu
[    3.380768] [drm] Found UVD firmware Version: 66.43 Family ID: 17
[    3.380775] [drm] PSP loading UVD firmware
[    3.392403] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[    3.392411] [drm] PSP loading VCE firmware
[    3.787815] usb 1-2.2: new full-speed USB device number 7 using xhci_hcd
[    3.787827] [drm] reserve 0x400000 from 0xf7fec00000 for PSP TMR
[    3.796806] input: OpenBMC virtual_input as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/usb1/1-3/1-3.1/1-3.1:1.0/0003:1D6B:0104.0001/input/input0
[    3.872788] [drm] kiq ring mec 2 pipe 1 q 0
[    3.873646] EEH: Recovering PHB#0-PE#0
[    3.873660] EEH: PE location: UOPWR.A100059-Node0-CPU1 Slot2 (16x), PHB location: N/A
[    3.873666] EEH: Frozen PHB#0-PE#0 detected
[    3.873670] EEH: Call Trace:
[    3.873673] EEH: [(____ptrval____)] __eeh_send_failure_event+0x7c/0x160
[    3.873684] EEH: [(____ptrval____)] eeh_dev_check_failure+0x2b4/0x670
[    3.873692] EEH: [(____ptrval____)] amdgpu_device_rreg.part.0+0x160/0x1f0 [amdgpu]
[    3.873916] EEH: [(____ptrval____)] gfx_v9_0_mqd_init.isra.0+0x314/0x748 [amdgpu]
[    3.874169] EEH: [(____ptrval____)] gfx_v9_0_hw_init+0x1e70/0x2f20 [amdgpu]
[    3.874427] EEH: [(____ptrval____)] amdgpu_device_init+0x1ec4/0x2160 [amdgpu]
[    3.874670] EEH: [(____ptrval____)] amdgpu_driver_load_kms+0x48/0x370 [amdgpu]
[    3.874890] EEH: [(____ptrval____)] amdgpu_pci_probe+0x174/0x330 [amdgpu]
[    3.875113] EEH: [(____ptrval____)] local_pci_probe+0x68/0x110
[    3.875122] EEH: [(____ptrval____)] work_for_cpu_fn+0x38/0x60
[    3.875128] EEH: [(____ptrval____)] process_one_work+0x294/0x580
[    3.875134] EEH: [(____ptrval____)] worker_thread+0x2b0/0x650
[    3.875140] EEH: [(____ptrval____)] kthread+0x17c/0x190
[    3.875145] EEH: [(____ptrval____)] ret_from_kernel_thread+0x5c/0x64
[    3.875151] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[    3.875157] EEH: Notify device drivers to shutdown
[    3.875162] EEH: Beginning: 'error_detected(IO frozen)'
[    4.120637] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[    4.120782] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[    4.120934] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v9_0> failed -110
[    4.121072] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[    4.121075] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[    4.121106] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[    4.122273] hid-generic 0003:1D6B:0104.0001: input,hidraw0: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-0003:01:00.0-3.1/input0
[    4.129995] input: OpenBMC virtual_input as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/usb1/1-3/1-3.1/1-3.1:1.1/0003:1D6B:0104.0002/input/input1
[    4.132757] hid-generic 0003:1D6B:0104.0002: input,hidraw1: USB HID v1.01 Mouse [OpenBMC virtual_input] on usb-0003:01:00.0-3.1/input1
[    4.139626] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = (____ptrval____); ring_buffer_end = (____ptrval____); write_frame = (____ptrval____)
[    4.139800] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[    4.139975] [drm] free PSP TMR buffer
[    4.139980] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = (____ptrval____); ring_buffer_end = (____ptrval____); write_frame = (____ptrval____)
[    4.140146] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[    4.162159] amdgpu: probe of 0000:03:00.0 failed with error -110
[    4.162221] BUG: Unable to handle kernel data access on read at 0xc00a0000c0aa0000
[    4.162226] Faulting instruction address: 0xc00000000002c030
[    4.162229] Oops: Kernel access of bad area, sig: 11 [#1]
[    4.162233] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[    4.162238] Modules linked in: amdgpu(+) drm_ttm_helper ttm mfd_core gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm i2c_core nvme vmx_crypto nvme_core drm_panel_orientation_quirks tg3 crc32c_vpmsum
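For comparing boots across kernels, it can help to reduce a log like the one above to its key markers: the EEH (Enhanced Error Handling, the POWER PCI error-recovery layer) freeze, the first IP block whose hw_init failed, and the final probe error. The sketch below only matches message formats that appear in this log; the function name triage and the report keys are made up for illustration, and other kernel versions may word these messages differently.

```python
import re

# Patterns copied from the message formats in the log above.
EEH_FROZEN = re.compile(r"EEH: Frozen (PHB#\d+-PE#\d+) detected")
IP_BLOCK_FAILED = re.compile(r"hw_init of IP block <(\w+)> failed (-?\d+)")
PROBE_FAILED = re.compile(r"amdgpu: probe of (\S+) failed with error (-?\d+)")


def triage(log_text: str) -> dict:
    """Collect the first occurrence of each failure marker in a dmesg dump."""
    report = {"frozen_pe": None, "ip_block": None, "probe": None, "errors": 0}
    for line in log_text.splitlines():
        if "*ERROR*" in line:
            report["errors"] += 1  # count drm/amdgpu lines flagged *ERROR*
        m = EEH_FROZEN.search(line)
        if m and report["frozen_pe"] is None:
            report["frozen_pe"] = m.group(1)
        m = IP_BLOCK_FAILED.search(line)
        if m and report["ip_block"] is None:
            report["ip_block"] = (m.group(1), int(m.group(2)))
        m = PROBE_FAILED.search(line)
        if m and report["probe"] is None:
            report["probe"] = (m.group(1), int(m.group(2)))
    return report
```

Pass it the text of a saved dmesg dump. On Linux, error -110 is -ETIMEDOUT, which fits the sequence in the log: the EEH freeze on PHB#0-PE#0 comes first, and the KIQ ring test then times out.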





ClassicHasClass

Re: Wx9100 on Talos II
« Reply #6 on: April 13, 2022, 04:10:50 pm »
This is an older kernel. When I updated to F35, I got 5.14.18 to start with, and that's been fine with my WX7100 (T2). The only difference between my system and yours is that I use the HPT MMU instead of radix, but that shouldn't cause this problem. Are you able to update the kernel at all?

MPC7500

Re: Wx9100 on Talos II
« Reply #7 on: April 13, 2022, 05:23:39 pm »
This is weird, because tle has (or had) a Vega too, and it worked, also on Fedora:
https://forums.raptorcs.com/index.php/topic,47.0.html

rheaplex

Re: Wx9100 on Talos II
« Reply #8 on: April 13, 2022, 11:52:27 pm »
I tried Fedora 36 Beta as the host OS with kernel 5.17, which renders the desktop to AST but not to AMDGPU. I’ll head back to a newer 5.15, then try a 5.5 if that doesn’t help (Petitboot is on 5.5 and works with AMDGPU).

The Talos II arrived with a jumper attached to the onboard VGA disabler. I removed it to see if that helped with the suggested approach of getting to Petitboot with AST then switching to AMDGPU to load the host OS but it doesn’t seem to have.

The variables other than the kernel and its config are:
  • The boot console in Petitboot is set to tty1
  • The VGA jumper is currently not set.
  • The Mini DisplayPort output of the AMDGPU board is connected to the monitor via an HDMI converter, but that doesn’t seem to affect Petitboot.

I’m sure I’m missing something obvious either in the precise combination of configuration options I’m using or in some implicit subtlety of the suggested solutions that I’m failing to follow correctly.  :-*

MPC7500

Re: Wx9100 on Talos II
« Reply #9 on: April 14, 2022, 02:54:43 pm »
Sadly, I see no error. I would try the Live-Image of Void Linux.

Add
Code: [Select]
modprobe.blacklist=ast video=offb:off amdgpu.aspm=0
to the kernel command line and start the Live-Image.

It should work out of the box. If that also fails, I would guess that the GPU is faulty.

MPC7500

Re: Wx9100 on Talos II
« Reply #10 on: April 14, 2022, 02:58:40 pm »
I have it. Vega doesn't work with a 64K Kernel page size:
https://gitlab.freedesktop.org/drm/amd/-/issues/1446

Anyway, it should work with Void out of the box. Void uses a 4K kernel page size.
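Since the page size is the variable in question here, it is worth checking what a given kernel actually runs with instead of relying on the distro's reputation. A minimal sketch: resource.getpagesize() is standard Python on Unix, while the helper names format_page_size and page_size_from_oops are made up for illustration. The second helper reads the PAGE_SIZE field from a powerpc oops banner such as the "LE PAGE_SIZE=64K MMU=Radix ..." line in the dmesg output from reply #5.

```python
import re
import resource


def format_page_size(size_bytes: int) -> str:
    """Render a page size in bytes in the kernel's style, e.g. 65536 -> '64K'."""
    return f"{size_bytes // 1024}K"


def page_size_from_oops(line: str):
    """Return the PAGE_SIZE field of a powerpc oops banner line, or None."""
    match = re.search(r"PAGE_SIZE=(\d+K)", line)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Page size of the kernel this script is running under.
    print("running kernel page size:",
          format_page_size(resource.getpagesize()))
```

Run it once under each kernel being compared (Petitboot's, the distro's, the live image's) so the page size is recorded alongside the result of each boot attempt.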

rheaplex

Re: Wx9100 on Talos II
« Reply #11 on: April 14, 2022, 11:42:19 pm »
I tried the Void live image with the suggested command line. It didn’t work.

Just to make sure, I tried Void live with the AST VGA connection, and then I pulled out the WX9100 and replaced it with an old Nvidia card. In both cases Void booted successfully to the XFCE desktop.

So it looks like the card. Which is bizarre as it’s straight from Raptor and it works with Petitboot.

MPC7500

Re: Wx9100 on Talos II
« Reply #12 on: April 15, 2022, 06:22:31 am »
This is so weird. You're right. If the Vega works in Petitboot then the card is okay.
If you only use the Vega (with the disable jumper active), does it work then?
IIRC, the kernel on the Live-Image is 5.13, right?
Did you buy the Vega GPU from Raptor? If so, what is Raptor's response to the issue?

Edit: You could also try the older Void Live-Image from 2020 with Kernel 5.4. But both Kernels 5.13 and 5.4 should work. I will ping tle on Twitter. Maybe he has an idea.
« Last Edit: April 15, 2022, 06:29:21 am by MPC7500 »

rheaplex

Re: Wx9100 on Talos II
« Reply #13 on: April 18, 2022, 02:59:29 pm »
Here's what worked. Thank you to everyone who helped - one of the subtleties did seem to be booting all the way on the GPU. In a few months, when I have more time for it, I will start bisecting my way towards a more modern kernel, but regardless I'm going to stick with Debian for now.

This is one of those journeys where I have ended up back where I started but with much greater knowledge of an unfamiliar system.

Hardware configuration:

* VGA disable jumper applied.
* WX9100 connected to monitor via MiniDP -> HDMI.

Petitboot (etc.) configuration:

* Fast reset *not* disabled.
* Boot console: /dev/hvc0.
* (Current interface: /dev/tty0)

Host OS configuration:

* Debian 11.3 installed via hvc0 then switched to Linux kernel 5.5 from RaptorCS git compiled for POWER9 and 4k page sizes.
* Boot image rebuilt to ensure amdgpu module included.
* Kernel args added: modprobe.blacklist=ast video=offb:off console=tty0
* xorg.conf.d configuration snippet to set amdgpu rather than modesetting as the Xorg driver for the GPU:
  https://wiki.raptorcs.com/wiki/Troubleshooting/GPU#Alternative_Xorg_configuration_using_OutputClass_and_PrimaryGPU
* MATE desktop environment.

How to boot:

* Boot with GPU output connected to monitor. The fan will spin up, and the screen will go blue or black at various points in the process. When the keyboard state LEDs flash it's about to reach PetitBoot.
* If left alone it will autoboot all the way through to the desktop, although the signal will drop with a blue video screen briefly after kexec and the host OS kernel amdgpu module will show some alarming-looking errors (ring test failed, KCQ enable failed, etc.).
* Log in to your desktop environment via the session manager and enjoy!