Raptor Computing Systems Community Forums (BETA)
Software => User Zone => Topic started by: bernie on August 20, 2024, 08:12:07 pm
-
I have a Blackbird system in which I'd like to install an NVIDIA GT710. However, after installation, boot gets to the Petitboot menu, but then fails as it searches for boot options. The following is logged in the console:
[enP4p1s0f0] Configuring with DHCP
[ 7.117579] Oops: Exception in kernel mode, sig: 5 [#1]
[ 7.117744] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[ 7.117896] Modules linked in: sd_mod sg xhci_pci xhci_hcd ast ibmpowernv rtc_opal nouveau(+) at24 regmap_i2c drm_shmem_helper usbcore tg3 usb_common drm_exec gpu_sched drm_ttm_helper ahci ttm libahci drm_display_helper backlight mtdblock mtd_blkdevs ofpart powernv_flash mtd
[ 7.118470] CPU: 0 PID: 199 Comm: kworker/0:1 Not tainted 6.6.16-openpower1 #4
[ 7.118682] Hardware name: C1P9S01 REV 1.02 POWER9 0x4e1203 opal:skiboot-ecb1dc7 PowerNV
[ 7.118903] Workqueue: events work_for_cpu_fn
[ 7.119114] NIP: c0000000002beddc LR: c0000000002bedd8 CTR: c0000000005da76c
[ 7.119337] REGS: c000000009c0f7c0 TRAP: 0700 Not tainted (6.6.16-openpower1)
[ 7.119579] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28000288 XER: 00000000
[ 7.119820] CFAR: c0000000000d1dd4 IRQMASK: 0
[ 7.119820] GPR00: c0000000002bedd8 c000000009c0fa60 c000000000819c00 000000000000006d
[ 7.119820] GPR04: 00000000ffff7fff c000000009c0f890 0000000000000001 c0000000023e85d8
[ 7.119820] GPR08: 0000000ffbb30000 0000000000000027 0000000000000027 6536666461313030
[ 7.119820] GPR12: 0000000048000288 c000000fff7ff480 c0000000000aab34 c00000000ad59840
[ 7.119820] GPR16: 0000000000000000 0000000000000000 0000000000000000 c00000000330410d
[ 7.119820] GPR20: 000000007fffffff c000000006a56db0 c00000000910fa48 61c8864680b583eb
[ 7.119820] GPR24: c00000000910fa08 c000000006a56580 c0000000030f26b0 c000000016ff2348
[ 7.119820] GPR28: c000000016ffc130 c00000001adf7340 0000000000000000 c00000001adf6e40
[ 7.122259] NIP [c0000000002beddc] __list_del_entry_valid_or_report+0xec/0x134
[ 7.122558] LR [c0000000002bedd8] __list_del_entry_valid_or_report+0xe8/0x134
[ 7.122862] Call Trace:
[ 7.123156] [c000000009c0fa60] [c0000000002bedd8] __list_del_entry_valid_or_report+0xe8/0x134 (unreliable)
[ 7.123492] [c000000009c0fac0] [c008000001118280] list_del+0x74/0x84 [nouveau]
[ 7.123887] [c000000009c0faf0] [c00800000111852c] nvkm_mm_free+0x94/0x144 [nouveau]
[ 7.124283] [c000000009c0fb40] [c008000001115acc] nvkm_gpuobj_del+0x44/0x84 [nouveau]
[ 7.124697] [c000000009c0fb70] [c00800000111a724] nvkm_ramht_del+0x30/0x58 [nouveau]
[ 7.125101] [c000000009c0fba0] [c00800000119280c] nvkm_disp_dtor+0x30/0x1bc [nouveau]
[ 7.125542] [c000000009c0fc20] [c0080000011142bc] nvkm_engine_dtor+0x38/0x54 [nouveau]
[ 7.125964] [c000000009c0fc40] [c00800000111b13c] nvkm_subdev_del+0xdc/0x154 [nouveau]
[ 7.126386] [c000000009c0fcc0] [c00800000118cf40] nvkm_device_del+0x144/0x174 [nouveau]
[ 7.126843] [c000000009c0fd20] [c0080000011e47a0] nouveau_drm_probe+0x15c/0x22c [nouveau]
[ 7.127313] [c000000009c0fdb0] [c0000000002eb7d0] local_pci_probe+0x3c/0x80
[ 7.127702] [c000000009c0fe20] [c00000000009f3b8] work_for_cpu_fn+0x30/0x40
[ 7.128099] [c000000009c0fe50] [c0000000000a2d90] process_scheduled_works+0x1d0/0x28c
[Disk: sdb1 / 4e48a184-7db9-4084-ab58-f08d164df005]
Trisquel GNU/Linux, with Linux-Libre 5.15.0-117-generic (recovery mode)
Trisquel GNU/Linux, with Linux-Libre 5.15.0-117-generic
Trisquel GNU/Linux, with Linux-Libre 5.15.0-118-generic (recovery mode)
Trisquel GNU/Linux, with Linux-Libre 5.15.0-118-generic
(*) Trisquel GNU/Linux
[ 7.128512] [c000000009c0ff20] [c0000000000a32c8] worker_thread+0x244/0x288
[ 7.128923] [c000000009c0ff90] [c0000000000aac24] kthread+0xf8/0x100
[ 7.129329] [c000000009c0ffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
[ 7.129753] Code: 4be12fe9 60000000 0fe00000 4bffff74 e9460000 7c2a1840 41820020 3c62fff4 7d455378 386382e7 4be12fc1 60000000 <0fe00000> 4bffff4c e8a90008 7c255040
[ 7.130643] ---[ end trace 0000000000000000 ]---
[ 7.899186] sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 sda9 sda10 sda11 sda12 sda13 sda14 sda15 sda16 sda17 sda18 >
[ 7.903465] sd 2:0:0:0:
[ 7.962273] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[ 7.963812] device-mapper: ioctl: 4.48.0-ioctl (2023-03-01) initialised: dm-devel@redhat.com
[sdb3] Processing new Disk device
[ 8.097452] EXT4-fs (dm-2): orphan cleanup on readonly fs
Booting in 10 sec: [sdb1] Trisquel GNU/Linux[ 8.433254]
9 [ 9.433724] Kernel panic - not syncing: Fatal exception
[ 10.480719] Rebooting in 30 seconds..
[ 113.217585323,5] OPAL: Reboot request...
I've tried with and without the "disable onboard video" jumper installed. The system boots fine with a Radeon HD5450 installed, and the GT710 works fine in another system. What could be causing this problem?
-
It looks like nouveau is barfing, and it has a history of not working with non-4KB kernel page sizes. The developers do not show any indication that they want to fix that. As you're running with a 64KB page size, that is likely to cause problems. See if you can manage to blacklist nouveau and proceed with the boot sequence.
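To confirm which page size a running kernel uses, getconf PAGESIZE prints it in bytes (65536 for 64KB, 4096 for 4KB). A minimal sketch of turning that number into something readable; the helper name describe_page_size is made up for illustration:

```shell
# Hypothetical helper: classify a page size given in bytes.
describe_page_size() {
    case "$1" in
        4096)  echo "4K pages" ;;
        65536) echo "64K pages" ;;
        *)     echo "other page size: $1 bytes" ;;
    esac
}

# Usage on a running system:
#   describe_page_size "$(getconf PAGESIZE)"
```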
-
Thanks for your suggestion. I know how to blacklist a module in an operating system, but not for the boot firmware. Can you provide any reference or tips? I have compiled and installed the OpenPOWER firmware (I was unable to build the OpenBMC firmware). Do I need to make some change to the configuration there and rebuild?
Note: the system also boots with an NVIDIA GT1030. I was wondering if the issue was firmware, but the GT1030 requires firmware for the OS to use it, whereas the GT710 does not. Or, at least, it doesn't require proprietary firmware, as it works just fine in Trisquel and Parabola.
-
Ok, I figured it out... Documenting here in case someone (particularly myself) needs it in the future.
Blacklist the kernel module at boot time in nvram by:
1. Exiting to the Petitboot shell from the Petitboot menu
2. Executing nvram -p ibm,skiboot --update-config bootargs="modprobe.blacklist=nouveau"
3. Checking the config by executing nvram -p ibm,skiboot --print-config
4. Rebooting
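One caveat with the steps above: assuming --update-config stores the whole bootargs value as given, any arguments already saved there would be replaced. A minimal sketch of merging the new argument into an existing value first; merge_bootarg is a hypothetical helper, and "console=hvc0" is only an example value, not something from this system:

```shell
# Hypothetical helper: append one kernel argument to an existing bootargs
# string unless it is already present. Prints the merged string.
merge_bootarg() {
    current="$1"
    new="$2"
    case " $current " in
        *" $new "*) printf '%s\n' "$current" ;;       # already present
        "  ")       printf '%s\n' "$new" ;;            # nothing set yet
        *)          printf '%s\n' "$current $new" ;;   # append
    esac
}

# Usage in the Petitboot shell (copy the current value by hand from the
# --print-config output; "console=hvc0" is an assumed example):
#   nvram -p ibm,skiboot --update-config \
#       bootargs="$(merge_bootarg "console=hvc0" "modprobe.blacklist=nouveau")"
```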
I don't know if this will persist after a firmware update. Is there a way to make this permanent?
This resolved the issue with Petitboot, although when the OS boots, it can't initialise the card. The card works fine in my other desktop machine (amd64), but doesn't work in Trisquel, Fedora or Debian on my Blackbird system. In all of the distros, the Xorg log shows:
[ 139.646] (EE) [drm] Failed to open DRM device for pci:0000:01:00.0: -19
[ 139.647] (EE) open /dev/dri/card0: No such file or directory
Any clue as to what the solution might be for this issue? I don't believe it's firmware: I specifically bought this card because it doesn't require proprietary firmware, there's no mention of missing firmware in the Xorg log, and nonfree firmware is installed in Fedora and Debian.
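For reference, the -19 in the first Xorg error is -ENODEV ("No such device"), and the missing /dev/dri/card0 points the same way: no kernel DRM driver has bound to the card. A minimal sketch for listing which driver, if any, is bound to each DRM card through sysfs; list_drm_cards is a hypothetical helper, and the optional argument exists only so it can be pointed at a test directory instead of /sys:

```shell
# Hypothetical helper: print "cardN driver" for each DRM card found under
# the given sysfs root (default /sys), or "no DRM cards" if there are none.
list_drm_cards() {
    root="${1:-/sys}"
    found=0
    for card in "$root"/class/drm/card[0-9]*; do
        name=$(basename "$card")
        case "$name" in *-*) continue ;; esac        # skip connectors such as card0-HDMI-A-1
        [ -e "$card/device/driver" ] || continue     # no driver bound (or glob did not match)
        found=1
        printf '%s %s\n' "$name" "$(basename "$(readlink "$card/device/driver")")"
    done
    [ "$found" -eq 1 ] || echo "no DRM cards"
}

# Usage on a running system:
#   list_drm_cards
```

If the NVIDIA card does not show up here with nouveau next to it, Xorg has nothing to open, whatever the Xorg configuration says.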
-
Hi Bernie,
Did you ever get the GT710 working in the OS on your Blackbird? I've installed an NVIDIA Quadro K620 on mine, and it works in Wayland on Debian Testing, which started using 4k page sizes with 6.10. Maybe they integrated Daniel Pocock's patch (https://forums.raptorcs.com/index.php/topic,524.msg4365.html#msg4365)?
Unfortunately, after updating to kernel 6.11.9, the below errors appear after a reboot, and the OS won't load until I force a shutdown in the BMC and power it back on again.
nouveau 0000:01:00.0: disp: chid 0 stat 00001000 reason 1 [PUSHBUFFER_ERR] mthd 0000 data 00000400 code 00000002
nouveau 0000:01:00.0: DRM: core caps notifier timeout
nouveau 0000:01:00.0: disp: chid 1 stat 00001000 reason 1 [PUSHBUFFER_ERR] mthd 0000 data 00000400 code 00000002
nouveau 0000:01:00.0: DRM: core notifier timeout
nouveau 0000:01:00.0: DRM: base-0: timeout
In your other post (https://forums.raptorcs.com/index.php/topic,554.msg0), it sounds like compiling a kernel for POWER9 with 4k page sizes and VSX support should work, but I'm hoping to confirm this before I try it.
-
Yes, I have it working now.
Great to hear that Debian has switched to 4K page sizes. Sorry to hear about your issues with kernel 6.11.9, but I don't have any insights to help with that. You have a different error than the one I received when lacking VSX, so I think that's probably not the issue. The Debian kernel should have VSX enabled.
I'm running 6.6.47 on my Gentoo installation, and I also have Trisquel running with the linux-libre kernel version 6.11.3. That was a test version given to me via https://trisquel.info/en/forum/trisquel-nouveau-and-page-sizes-power9. I haven't upgraded to a later version yet as I've been working on my Gentoo install, but you could try that before attempting to compile your own: https://www.fsfla.org/ikiwiki/selibre/linux-libre/index.en.html.
-
Thank you. I'll try the Linux-libre kernel or will install Trisquel. Your success there (https://trisquel.info/en/forum/trisquel-nouveau-and-page-sizes-power9#comment-177733) is encouraging.
-
No problem. Just to be clear - the default kernel in Trisquel is still 64k. I'm not sure if they'll change it with the next release. Installing linux-libre should be the same on Debian or Trisquel.
-
That's good to know, thanks.
-
Hi Bernie,
I'm using the Nouveau driver on a GNU OS (Trisquel 11); however, it looks like Nouveau is not working well with 4K pages either, as dmesg shows what look like critical PCI memory allocation failures.
Could you please share the output of: sudo dmesg | grep failed
Thanks.
-
With the latest linux-libre kernel:
bernie@prosthetic-conscience-trisquel:~$ uname -a
Linux prosthetic-conscience-trisquel 6.16.5-gnu #1.0 SMP PREEMPT_DYNAMIC Tue Sep 27 12:35:59 EST 1983 ppc64le ppc64le ppc64le GNU/Linux
bernie@prosthetic-conscience-trisquel:~$ getconf PAGESIZE
4096
bernie@prosthetic-conscience-trisquel:~$ sudo dmesg | grep failed
[ 0.156899] pci 0000:00:00.0: bridge window [io size 0x1000]: failed to assign
[ 0.156911] pci 0000:00:00.0: bridge window [io size 0x1000]: failed to assign
[ 0.156989] pci 0000:01:00.0: BAR 5 [io size 0x0080]: failed to assign
[ 0.156998] pci 0000:01:00.0: BAR 5 [io size 0x0080]: failed to assign
[ 0.157096] pci 0002:00:00.0: bridge window [io size 0x1000]: failed to assign
[ 0.157105] pci 0002:00:00.0: bridge window [io size 0x1000]: failed to assign
[ 0.157130] pci 0002:01:00.0: BAR 4 [io size 0x0020]: failed to assign
[ 0.157138] pci 0002:01:00.0: BAR 0 [io size 0x0008]: failed to assign
[ 0.157146] pci 0002:01:00.0: BAR 2 [io size 0x0008]: failed to assign
[ 0.157154] pci 0002:01:00.0: BAR 1 [io size 0x0004]: failed to assign
[ 0.157162] pci 0002:01:00.0: BAR 3 [io size 0x0004]: failed to assign
[ 0.157171] pci 0002:01:00.0: BAR 4 [io size 0x0020]: failed to assign
[ 0.157179] pci 0002:01:00.0: BAR 0 [io size 0x0008]: failed to assign
[ 0.157187] pci 0002:01:00.0: BAR 2 [io size 0x0008]: failed to assign
[ 0.157195] pci 0002:01:00.0: BAR 1 [io size 0x0004]: failed to assign
[ 0.157203] pci 0002:01:00.0: BAR 3 [io size 0x0004]: failed to assign
[ 0.157615] pci 0005:00:00.0: bridge window [io size 0x1000]: failed to assign
[ 0.157624] pci 0005:00:00.0: bridge window [io size 0x1000]: failed to assign
[ 0.157638] pci 0005:01:00.0: bridge window [io size 0x1000]: failed to assign
[ 0.157646] pci 0005:01:00.0: bridge window [io size 0x1000]: failed to assign
[ 0.157675] pci 0005:02:00.0: BAR 2 [io size 0x0080]: failed to assign
[ 0.157683] pci 0005:02:00.0: BAR 2 [io size 0x0080]: failed to assign
-
The POWER PHBs don't support legacy I/O cycle accesses, so the above messages are expected (and harmless). Every modern device falls back to MMIO access, which is actually required in the PCIe specifications -- I/O access is only available at all due to legacy Intel 80x86 (!) compatibility requirements.
Is the driver failing in any other way? Failing to map MMIO or otherwise printing any messages about BAR size? Is there any chance the Nouveau developers have coded the driver to use only the legacy I/O accesses instead of MMIO?
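Since the [io ...] failures are expected on this platform, one way to look for anything else is to filter them out and see what remains. A minimal sketch; non_io_assign_failures is a hypothetical helper that reads dmesg text on standard input:

```shell
# Hypothetical helper: read dmesg text on stdin and keep only the
# "failed to assign" lines that are NOT legacy I/O-port resources.
non_io_assign_failures() {
    grep 'failed to assign' | grep -v '\[io ' || true
}

# Usage on a running system:
#   sudo dmesg | non_io_assign_failures
```

An empty result means every assignment failure was for legacy I/O space; any line that remains (for example a [mem ...] BAR) would be worth posting.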
-
No, for me the driver is working fine despite the failures shown in dmesg. I only posted it as requested by carlosgonz. Thanks for the explanation though, good to know that it's nothing to worry about.
-
Good to hear, thank you for the update!
-
Thank you, RCS, for letting us know that the output is normal.
I was just worried when I noticed those failed PCI assignments. I was testing an NVIDIA card, tried out SuperTuxKart and saw that the performance was terrible: only 6 fps at most in STK. So I started looking into what could be wrong and saw those PCI failures in dmesg, which made me think that Petitboot could be interfering with Nouveau, as Nouveau doesn't support 64k pages.
Nouveau and the NVIDIA card work OK; however, performance is not great, as I'm using an NVIDIA K620. Mesa 25.1 recently added support for using the Zink driver on Maxwell v1, so I hope to get better performance once I test it.
The other question I have for RCS is: why do those PCI I/O failures appear even if I remove the NVIDIA card?
Thank you, Bernie, for sharing your output, which is the same as mine.
-
OK, it looks like the poor performance of my NVIDIA card was because it had not been reclocked. I reclocked it to the maximum, and now the performance is amazing overall: STK now shows 50 fps, up from 15 fps, with the same graphics settings. I can even get close to 100 fps on a card from the same line with more cores, and it could work even better after switching to Zink.
EDIT:
blackbird:~$ sudo dmesg | grep FAULT
[ 14.728196] nouveau 0000:01:00.0: bus: MMIO read of 00000000 FAULT at 3e6684 [ PRIVRING ]
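For reference, nouveau normally exposes reclocking through a debugfs file named pstate: reading it lists the available performance levels, and writing a level ID back selects one. The exact steps used above are not stated, so this is only a sketch under assumptions: the path /sys/kernel/debug/dri/0/pstate (the index may differ, for example when the onboard AST device is card 0), the "ID: core ... memory ..." line format, and the highest level being listed last are all assumed, and highest_pstate is a hypothetical helper:

```shell
# Hypothetical helper: read pstate text on stdin and print the ID of the
# last listed performance level, ignoring the AC:/DC: status lines.
highest_pstate() {
    grep -v -E '^(AC|DC):' | grep -E '^[0-9a-fA-F]{2}:' | tail -n 1 | cut -d: -f1
}

# Usage as root (path and index are assumptions, see above):
#   level=$(highest_pstate < /sys/kernel/debug/dri/0/pstate)
#   echo "$level" > /sys/kernel/debug/dri/0/pstate
```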
-
Thanks for the tip, I didn't know about reclocking! Since my original post I'd found and installed a used 780 Ti, which performed better than the 710, but still couldn't handle SuperTuxKart with the highest graphics settings. After reclocking, I can get 30fps with graphics settings set to their highest, or about 110fps with the lowest graphics settings. Good to know!
-
Nice that you got reclocking working.
Bernie, I'm testing some GPUs and drivers to get better performance on the Raptor Blackbird with GNU systems. My Blackbird is the original Raptor SFF-mTX build, and I like this form factor; however, I don't have much space for big GPUs, so it only fits low-profile, single-slot cards. I have GPUs to test for compatibility with GNU OSes, such as the Intel Arc Pro A40, which is not working at the moment because it needs some patches to libdrm. This Arc GPU may work OK without firmware, but I haven't tested that yet.
What I'm using now is a Quadro K1200, which is Maxwell. Maxwell is newer than Kepler and still does not use signed firmware on the host. The K1200 works nicely and is compatible with NVK 1.4, whereas Kepler is only compatible with NVK 1.2.
Why is NVK so important? On GNU systems we do not have a libre driver for the hardware video decoder on NVIDIA GPUs; however, we can use the libre NVK not just for gaming but also for video decoding and encoding, for smooth, high-performance video playback and for any Vulkan-accelerated software. This is why Maxwell 1.0 shines.
So, Bernie, GPUs like the 780 Ti and 710 are still Kepler. However, there are high-performance Maxwell GPUs that do not need firmware and are compatible with NVK 1.4, as mine is: the NVIDIA GTX 750 Ti or NVIDIA GTX 750. These GPUs are the best of the best for newer features and performance on GNU OSes.
-
Interesting, I will keep a lookout for a 750 Ti to compare... On raw processing power, the 780 Ti seems like it should be better; I did not realise that the NVK version would make such a difference.
-
You know what, Bernie, there are some specific models of Pascal NVIDIA GPUs that work without signed firmware and can be reclocked. Pascal is newer than Maxwell.
Let me test first, then I'll let you know how WONDERFULLY it works on GNU systems.
Additionally, videolan-dav1d added more AV1 acceleration for PowerPC. This is nice because you can use PowerPC VSX 3.0 acceleration for video decoding, rather than the NVIDIA VPU, to watch YouTube or any other video fast and smoothly on PowerPC. As NVK support for this is still missing at the moment for Kepler and Maxwell, this is good news for PowerPC video performance.
https://code.videolan.org/videolan/dav1d/-/tags
-
Bernie,
GNU Linux-libre 7.0 (FSFLA) may be released next Monday or Tuesday. This version enables Nouveau with 64k pages on PPC, along with other improvements, so can you go ASAP to the same 64k thread on the Trisquel forum to get 64k enabled for the GNU Linux-libre 7.0 release?
Thank you,
-
Hi carlosgonz,
Sorry, I didn't see this message in time. That's exciting news, but I'd like to verify it works myself before recommending to Trisquel that they switch back. I'll update here once I've done so.
Regards,
Bernie
-
Bernie, it looks like 64k is still not in GNU Linux-libre 7.0 for Nouveau. I need to look into what happened.
Also, I got NVK working on PPC: https://trisquel.info/files/gnuark.png