Messages - rheaplex

16
GPU Compute / Accelerators / Re: Wx9100 on Talos II
« on: April 14, 2022, 11:42:19 pm »
I tried the Void live image with the suggested command line. It didn’t work.

Just to make sure, I tried Void live with the AST VGA connection, and then I pulled out the WX9100 and replaced it with an old Nvidia card. In both cases Void booted successfully to the XFCE desktop.

So it looks like the card. Which is bizarre as it’s straight from Raptor and it works with Petitboot.

17
GPU Compute / Accelerators / Re: Wx9100 on Talos II
« on: April 13, 2022, 11:52:27 pm »
I tried Fedora 36 Beta as the host OS with kernel 5.17, which renders the desktop to AST but not to AMDGPU. I’ll head back to a newer 5.15, then try a 5.5 if that doesn’t help (Petitboot is on 5.5 and works with AMDGPU).

The Talos II arrived with a jumper attached to the onboard VGA disabler. I removed it to see if that helped with the suggested approach of getting to Petitboot with AST, then switching to AMDGPU to load the host OS, but it doesn’t seem to have.

The variables other than the kernel and its config are:
  • The boot console in Petitboot is set to tty1
  • The VGA jumper is currently not set.
  • The microDP output of the AMDGPU board is connected to the monitor via an HDMI converter, but that doesn’t seem to affect Petitboot.

I’m sure I’m missing something obvious either in the precise combination of configuration options I’m using or in some implicit subtlety of the suggested solutions that I’m failing to follow correctly.  :-*

18
GPU Compute / Accelerators / Re: Wx9100 on Talos II
« on: April 13, 2022, 03:29:56 pm »
Current status:

Stock 64k page Fedora kernel: 5.14.10-300.fc35.ppc64le

Kernel arguments: root=UUID=73574303-f000-44ec-bbc2-7c86d0ab50d2 ro modprobe.blacklist=ast video=offb:off amdgpu.aspm=0
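
When arguments are edited in Petitboot or grub, it can be worth confirming that the running kernel actually received them. The following is a minimal sketch, not part of the original post: the helper names `parse_cmdline`, `missing_args`, and `read_running_cmdline` are hypothetical, and it assumes a Linux host where `/proc/cmdline` is readable.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value} mapping.

    Bare flags such as "ro" map to None. Quoted values that contain
    spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        # Split on the first "=" only, so root=UUID=... keeps its full value.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_args(cmdline, expected):
    """Return the expected tokens that do not appear on the command line."""
    present = set(cmdline.split())
    return [arg for arg in expected if arg not in present]


def read_running_cmdline(path="/proc/cmdline"):
    """Read the arguments the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

For example, `missing_args(read_running_cmdline(), ["modprobe.blacklist=ast", "video=offb:off", "amdgpu.aspm=0"])` would return an empty list if all three arguments above reached the kernel.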

I removed the AST disable jumper switch that the box shipped with, and used the AST VGA until Petitboot, then after selecting the kernel to boot I switched to the AMD output. No joy.  :(

I tried a cold boot, which I thought had worked before, but it didn't seem to help.

Current dmesg output for amdgpu on boot:

[    2.997064] [drm] amdgpu kernel modesetting enabled.
[    2.998294] amdgpu: CRAT table disabled by module option
[    2.998297] amdgpu: DSDT table not found for OEM information
[    2.998300] amdgpu: IO link not available for non x86 platforms
[    2.998302] amdgpu: IO link not available for non x86 platforms
[    2.998305] amdgpu: Virtual CRAT table created for CPU
[    2.998329] amdgpu: Topology: Add CPU node
[    2.998522] amdgpu 0000:03:00.0: enabling device (0540 -> 0542)
[    2.998537] [drm] initializing kernel modesetting (VEGA10 0x1002:0x6861 0x103C:0x0B0F 0x00).
[    2.998546] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    2.998563] [drm] register mmio base: 0x00000000
[    2.998567] [drm] register mmio size: 524288
[    2.998581] [drm] PCIE atomic ops is not supported
[    2.998593] [drm] add ip block number 0 <soc15_common>
[    2.998598] [drm] add ip block number 1 <gmc_v9_0>
[    2.998602] [drm] add ip block number 2 <vega10_ih>
[    2.998606] [drm] add ip block number 3 <psp>
[    2.998609] [drm] add ip block number 4 <gfx_v9_0>
[    2.998613] [drm] add ip block number 5 <sdma_v4_0>
[    2.998617] [drm] add ip block number 6 <powerplay>
[    2.998621] [drm] add ip block number 7 <dm>
[    2.998625] [drm] add ip block number 8 <uvd_v7_0>
[    2.998628] [drm] add ip block number 9 <vce_v4_0>
[    3.324728] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    3.324734] amdgpu: ATOM BIOS: 113-D0510300-100
[    3.326345] [drm] UVD(0) is enabled in VM mode
[    3.326348] [drm] UVD(0) ENC is enabled in VM mode
[    3.326350] [drm] VCE enabled in VM mode
[    3.326368] amdgpu 0000:03:00.0: amdgpu: MEM ECC is active.
[    3.326372] amdgpu 0000:03:00.0: amdgpu: SRAM ECC is not presented.
[    3.326377] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    3.326408] amdgpu 0000:03:00.0: amdgpu: VRAM: 16368M 0x000000F400000000 - 0x000000F7FEFFFFFF (16368M used)
[    3.326413] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    3.326417] amdgpu 0000:03:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[    3.326423] [drm] Detected VRAM RAM=16368M, BAR=16384M
[    3.326426] [drm] RAM width 2048bits HBM
[    3.326461] [drm] amdgpu: 16368M of VRAM memory ready
[    3.326464] [drm] amdgpu: 16368M of GTT memory ready.
[    3.326483] [drm] GART: num cpu pages 8192, num gpu pages 131072
[    3.326576] [drm] PCIE GART of 512M enabled.
[    3.326579] [drm] PTB located at 0x000000F400900000
[    3.331010] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[    3.341794] usb 1-3.1: New USB device found, idVendor=1d6b, idProduct=0104, bcdDevice= 1.00
[    3.341800] usb 1-3.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    3.341804] usb 1-3.1: Product: virtual_input
[    3.341807] usb 1-3.1: Manufacturer: OpenBMC
[    3.341810] usb 1-3.1: SerialNumber: OBMC0001
[    3.352927] amdgpu: hwmgr_sw_init smu backed is vega10_smu
[    3.380768] [drm] Found UVD firmware Version: 66.43 Family ID: 17
[    3.380775] [drm] PSP loading UVD firmware
[    3.392403] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[    3.392411] [drm] PSP loading VCE firmware
[    3.787815] usb 1-2.2: new full-speed USB device number 7 using xhci_hcd
[    3.787827] [drm] reserve 0x400000 from 0xf7fec00000 for PSP TMR
[    3.796806] input: OpenBMC virtual_input as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/usb1/1-3/1-3.1/1-3.1:1.0/0003:1D6B:0104.0001/input/input0
[    3.872788] [drm] kiq ring mec 2 pipe 1 q 0
[    3.873646] EEH: Recovering PHB#0-PE#0
[    3.873660] EEH: PE location: UOPWR.A100059-Node0-CPU1 Slot2 (16x), PHB location: N/A
[    3.873666] EEH: Frozen PHB#0-PE#0 detected
[    3.873670] EEH: Call Trace:
[    3.873673] EEH: [(____ptrval____)] __eeh_send_failure_event+0x7c/0x160
[    3.873684] EEH: [(____ptrval____)] eeh_dev_check_failure+0x2b4/0x670
[    3.873692] EEH: [(____ptrval____)] amdgpu_device_rreg.part.0+0x160/0x1f0 [amdgpu]
[    3.873916] EEH: [(____ptrval____)] gfx_v9_0_mqd_init.isra.0+0x314/0x748 [amdgpu]
[    3.874169] EEH: [(____ptrval____)] gfx_v9_0_hw_init+0x1e70/0x2f20 [amdgpu]
[    3.874427] EEH: [(____ptrval____)] amdgpu_device_init+0x1ec4/0x2160 [amdgpu]
[    3.874670] EEH: [(____ptrval____)] amdgpu_driver_load_kms+0x48/0x370 [amdgpu]
[    3.874890] EEH: [(____ptrval____)] amdgpu_pci_probe+0x174/0x330 [amdgpu]
[    3.875113] EEH: [(____ptrval____)] local_pci_probe+0x68/0x110
[    3.875122] EEH: [(____ptrval____)] work_for_cpu_fn+0x38/0x60
[    3.875128] EEH: [(____ptrval____)] process_one_work+0x294/0x580
[    3.875134] EEH: [(____ptrval____)] worker_thread+0x2b0/0x650
[    3.875140] EEH: [(____ptrval____)] kthread+0x17c/0x190
[    3.875145] EEH: [(____ptrval____)] ret_from_kernel_thread+0x5c/0x64
[    3.875151] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[    3.875157] EEH: Notify device drivers to shutdown
[    3.875162] EEH: Beginning: 'error_detected(IO frozen)'
[    4.120637] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[    4.120782] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[    4.120934] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v9_0> failed -110
[    4.121072] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[    4.121075] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[    4.121106] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[    4.122273] hid-generic 0003:1D6B:0104.0001: input,hidraw0: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-0003:01:00.0-3.1/input0
[    4.129995] input: OpenBMC virtual_input as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/usb1/1-3/1-3.1/1-3.1:1.1/0003:1D6B:0104.0002/input/input1
[    4.132757] hid-generic 0003:1D6B:0104.0002: input,hidraw1: USB HID v1.01 Mouse [OpenBMC virtual_input] on usb-0003:01:00.0-3.1/input1
[    4.139626] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = (____ptrval____); ring_buffer_end = (____ptrval____); write_frame = (____ptrval____)
[    4.139800] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[    4.139975] [drm] free PSP TMR buffer
[    4.139980] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = (____ptrval____); ring_buffer_end = (____ptrval____); write_frame = (____ptrval____)
[    4.140146] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[    4.162159] amdgpu: probe of 0000:03:00.0 failed with error -110
[    4.162221] BUG: Unable to handle kernel data access on read at 0xc00a0000c0aa0000
[    4.162226] Faulting instruction address: 0xc00000000002c030
[    4.162229] Oops: Kernel access of bad area, sig: 11 [#1]
[    4.162233] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[    4.162238] Modules linked in: amdgpu(+) drm_ttm_helper ttm mfd_core gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm i2c_core nvme vmx_crypto nvme_core drm_panel_orientation_quirks tg3 crc32c_vpmsum
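
In the log above, the EEH freeze on PHB#0-PE#0 (3.873666) is reported before the first amdgpu ring test failure (4.120637). A minimal sketch for pulling those lines out of a saved dmesg dump follows. It is an illustration only: the function names are hypothetical, and the filter (EEH lines, `*ERROR*` lines, and lines containing "failed") is an assumption about what is worth reading first, not a rule from the kernel documentation.

```python
import re

# Matches the "[    3.873666] " timestamp prefix that dmesg prints by default.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def interesting_lines(log_text):
    """Yield (timestamp, message) for EEH events and error/failure lines."""
    for line in log_text.splitlines():
        m = TIMESTAMP.match(line)
        if not m:
            continue  # skip lines without a timestamp prefix
        stamp, message = float(m.group(1)), m.group(2)
        if (message.startswith("EEH:")
                or "*ERROR*" in message
                or "failed" in message):
            yield stamp, message


def first_eeh_freeze(log_text):
    """Return the timestamp of the first 'EEH: Frozen' line, or None."""
    for stamp, message in interesting_lines(log_text):
        if message.startswith("EEH: Frozen"):
            return stamp
    return None
```

Comparing `first_eeh_freeze(...)` with the timestamp of the first `*ERROR*` line gives a quick view of which event was logged first on a given boot.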





19
GPU Compute / Accelerators / Wx9100 on Talos II
« on: March 27, 2022, 05:54:18 pm »
Heya all.

(Thank you to MPC7500 for previous help on this.)

I have a Talos II with a WX9100 graphics card installed, direct from RaptorCS. The graphics card displays Petitboot perfectly every time, and then almost always fails to come back up for the main OS.

I have tried many different things to fix this:
- Different kernel arguments? Nothing seems to help.
- Different distro and kernel versions? Nothing reliable.
- Different kernel page size? I thought that 5.5 with 4k worked well, until it didn't. Otherwise there was no correlation.
- Try nomodeset for late kms? Passed nomodeset to OS kernel, faster boot, can’t lsmod amdgpu though.
- Fast reset causing problems? https://wiki.raptorcs.com/wiki/Enabling_Navi_10_On_Fedora_31#Disabling_fast-reset Followed guide, no change.
- PCI needs resetting? https://wiki.raptorcs.com/wiki/File:Pcie_hot_reset.sh Ran script with 0000:03:00.01 as the argument (the AMDGPU lspci entry). It jammed after printing “Removing 0000:03:00.01…”.
- USB hub causing issues? Removed during boot, no change.
- Electrical power load issues on CPUs loading main OS? Changed to two separate wall sockets for the power supplies, no change.
- Bad DP/HDMI connector? Changed converter and cable, no change.
- No access to firmware? Try amdgpu as module, or compiled in w/firmware in image. Rebuilt initramfs after it looked like it failed during install (the Fedora post-install script didn’t like the Petitboot version string being hex rather than decimal); it doesn’t boot.
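
Before handing a device address to a reset script, it may help to sanity-check it. The lspci listing later in this post shows the card's two functions as 0000:03:00.0 and 0000:03:00.1. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the pattern assumes the four-digit-domain form that lspci prints on this machine (domain:bus:device.function, with a single-digit function 0-7).

```python
import os
import re

# domain:bus:device.function, e.g. 0000:03:00.0
# Domain: 4 hex digits (assumed), bus: 2 hex digits,
# device: 00-1f, function: one digit 0-7.
BDF = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[01][0-9a-fA-F]\.[0-7]")


def is_valid_bdf(address):
    """Return True if the string has the domain:bus:device.function shape."""
    return BDF.fullmatch(address) is not None


def device_present(address):
    """Return True if sysfs lists a PCI device at this address (Linux only)."""
    return is_valid_bdf(address) and os.path.isdir(
        os.path.join("/sys/bus/pci/devices", address)
    )
```

With this check, `0000:03:00.0` and `0000:03:00.1` pass, while a string with a two-digit function such as `0000:03:00.01` does not.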

I notice that the Petitboot amdgpu module size is different from the main OS one in my current setup. Other than that I've got nothing.  :'(

uname -a:

Code: [Select]
Linux workstation 5.15.0-60.local.fc35.ppc64le #1 SMP Sat Mar 26 17:31:31 PDT 2022 ppc64le ppc64le ppc64le GNU/Linux
lspci:

Code: [Select]
0000:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0000:01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Vega 10 PCIe Bridge
0000:02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Vega 10 PCIe Bridge
0000:03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
0000:03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
0001:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0001:01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11)
0002:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0003:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0004:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0004:01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe
0004:01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe
0005:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0005:01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0005:02:00.0 Multimedia video controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
0030:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0030:01:00.0 Audio device: Creative Labs EMU20k2 [Sound Blaster X-Fi Titanium Series] (rev 03)
0031:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0031:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
0032:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0033:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)

inxi:
Code: [Select]
System:
  Host: workstation Kernel: 5.15.0-60.local.fc35.ppc64le ppc64le bits: 64 Desktop: N/A
    Distro: Fedora release 35 (Thirty Five)
Machine:
  Type: PPC System: T2P9S01 REV 1.01 details: N/A
CPU:
  Info: 2x 18-core POWER9 altivec supported [MT MCP SMP] speed (MHz): avg: 2211
    min/max: 2166/3800
Graphics:
  Device-1: AMD Vega 10 XT [Radeon PRO WX 9100] driver: amdgpu v: kernel
  Device-2: ASPEED Graphics Family driver: N/A
  Display: server: X.org v: 1.20.14 driver: X: loaded: ati,radeon unloaded: fbdev,modesetting
    gpu: amdgpu
  Message: No GL data found on this system.
Network:
  Device-1: Broadcom NetXtreme BCM5719 Gigabit Ethernet PCIe driver: tg3
  Device-2: Broadcom NetXtreme BCM5719 Gigabit Ethernet PCIe driver: tg3
Drives:
  Local Storage: total: 1.36 TiB used: 320.2 GiB (22.9%)
Info:
  Processes: 1176 Uptime: 10m Memory: 125.41 GiB used: 2.73 GiB (2.2%) Init: systemd runlevel: 3
  Shell: Bash inxi: 3.3.13

dmesg output is attached. The interesting part seems to be after "[drm] amdgpu kernel modesetting enabled."

System config also attached.

Can anyone suggest a known good distro/kernel/firmware version/configuration that I can test against? Or is there anything I'm doing obviously wrong?

Thank you.

20
Talos II / Re: Radeon Pro WX 9100 GPU on Any Kernel > 5.5
« on: March 27, 2022, 05:01:00 pm »
Heya MPC7500.

Thank you for your help. Unfortunately I can still only get the GPU to work under the main OS intermittently if at all.

I'm going to move this to the GPU topic with more information.

21
Talos II / Radeon Pro WX 9100 GPU on Any Kernel > 5.5
« on: March 18, 2022, 08:00:06 pm »
Heya everyone.

I have a Talos II with an AMD Radeon PRO WX 9100 (Vega 10 XT, connected to a monitor via DP to HDMI). I've tried several distros on it, but currently it has Debian Bookworm installed (I know Debian best but I would happily switch to another distro if needed).

Everything is fine if I build and use the 5.5 kernel from Raptor's git site - I'm typing this on Firefox LTS in Cinnamon/Xorg. I've set the boot console to be VGA in Petitboot and added console=tty0 to the grub command line. Otherwise there's nothing different or special about this setup.

If I try any other kernel, with 4k or 64k pages and any combinations of kernel boot arguments, the screen drops out some time after the kexec message from Petitboot. I can ssh in, and see that the amdgpu module is loaded, but if I try to start Xorg it complains about no matching screens (or devices, if I remove the xorg conf snippet that specifies amdgpu as the default - 5.5 doesn't need this, though).
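
A loaded module does not by itself mean the driver bound to the card, so over ssh it can be useful to check both the module list and the DRM card nodes. This is a minimal sketch, assuming a Linux host with `/proc/modules` and `/sys/class/drm`; the function names are hypothetical and the sketch does not say which driver owns which card node.

```python
import os
import re


def loaded_modules(proc_modules_text):
    """Map module name -> use count from the text of /proc/modules.

    Each line starts with: name, size, use count. The use count is
    stored as None when it is not a plain number.
    """
    modules = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            count = int(fields[2]) if fields[2].isdigit() else None
            modules[fields[0]] = count
    return modules


def drm_cards(sysfs_dir="/sys/class/drm"):
    """List DRM card nodes (card0, card1, ...), skipping connector entries."""
    if not os.path.isdir(sysfs_dir):
        return []
    return sorted(
        name for name in os.listdir(sysfs_dir)
        if re.fullmatch(r"card\d+", name)
    )
```

If `amdgpu` appears in `loaded_modules(open("/proc/modules").read())` but `drm_cards()` shows no node for the card, that would be consistent with Xorg reporting no matching devices, though other causes are possible.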

I'm happy to use 5.5 for now but at some point I'll need to upgrade to a newer kernel.

Can anyone recommend another known-good combination of kernel and configuration for Vega 10?

Thank you.

- Rhea.
