With the Debian 12 (bookworm) 6.1 kernel package with a 4k page size, I started getting errors in the log; eventually Firefox would freeze, followed by the whole GUI (GNOME desktop). The system remains accessible over SSH.
After the first crash, I started radeontop and noticed up to 80% GPU VRAM utilisation with Firefox running.
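For anyone who wants to log the same figure over time without keeping radeontop open, here is a minimal Python sketch. It assumes the amdgpu driver exposes the mem_info_vram_used and mem_info_vram_total files under /sys/class/drm/card0/device; the card0 index and the function name vram_usage are my own assumptions and may need adjusting on other systems.

```python
from pathlib import Path

# Assumed location; the card index may differ on other machines.
DEFAULT_DEVICE_DIR = Path("/sys/class/drm/card0/device")


def vram_usage(device_dir=DEFAULT_DEVICE_DIR):
    """Return (used_bytes, total_bytes, percent_used) read from amdgpu sysfs files."""
    device_dir = Path(device_dir)
    # Both files contain a single decimal byte count.
    used = int((device_dir / "mem_info_vram_used").read_text().strip())
    total = int((device_dir / "mem_info_vram_total").read_text().strip())
    percent = 100.0 * used / total if total else 0.0
    return used, total, percent
```

Calling vram_usage() periodically and printing the result next to a timestamp should make it possible to line up VRAM growth with the errors in the journal.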
I rebooted into the bullseye kernel (5.10.46-4.1, built from my 4k page size branch) on the Debian 12 filesystem and it is running fine, with no crashes. That kernel has previously run for months on end on this platform with the bullseye filesystem.
I'm going to build the 6.7.12 kernel backport for bookworm with the 4k page size and try that as well.
I see it as a good thing that the platform still responds over SSH even when the GUI has crashed. I was able to reproduce the crash a couple of times.
journalctl captures a lot of errors like the following; sometimes they appear for hours before the crash eventually happens:
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
...
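To get a rough idea of how many of these messages accumulate before a crash, a small Python sketch like the following could be used. The function name count_cs_errors is hypothetical; the message text is copied from the log lines above, and the input is assumed to be plain journal text, one entry per line.

```python
# Message text copied from the log lines above.
CS_ERROR = "Not enough memory for command submission!"


def count_cs_errors(lines):
    """Count journal lines that contain the amdgpu command submission error."""
    return sum(1 for line in lines if CS_ERROR in line)
```

For example, passing sys.stdin to count_cs_errors in a script fed by `journalctl -k` would count the occurrences in the kernel log.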
The crash is captured too:
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 5 PID: 6577 at drivers/gpu/drm/ttm/ttm_bo.c:357 ttm_bo_release+0x538/0x5b0 [ttm]
kernel: Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge snd_seq_dummy snd_hrtimer snd_seq rfkill qrtr 8021q garp stp mrp llc overlay sunrpc binfmt_misc ext4 crc16 mbcache jbd2 uvcvideo videobuf2_vmalloc videobuf2_memops snd_usb_audio snd_hda_codec_hdmi videobuf2_v4l2 videobuf2_common snd_usbmidi_lib snd_hda_intel snd_rawmidi snd_intel_dspcfg videodev snd_seq_device evdev joydev snd_hda_codec mc snd_hda_core snd_hwdep snd_pcm sg snd_timer snd ofpart soundcore ipmi_powernv powernv_flash ctr ipmi_devintf at24 vmx_crypto mtd regmap_i2c ipmi_msghandler gf128mul opal_prd parport_pc lp parport fuse configfs loop ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress xts ecb uas usb_storage dm_crypt dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic
kernel: raid1 raid0 multipath linear md_mod hid_generic usbhid hid sd_mod amdgpu gpu_sched drm_buddy i2c_algo_bit drm_display_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt xhci_pci fb_sys_fops xhci_hcd nvme nvme_core t10_pi tg3 drm mpt3sas crc64_rocksoft_generic usbcore crc64_rocksoft crc_t10dif crct10dif_generic crc64 crct10dif_common drm_panel_orientation_quirks raid_class libphy usb_common scsi_transport_sas
kernel: CPU: 5 PID: 6577 Comm: Renderer Not tainted 6.1.0-21-powerpc64le-4k #1 Debian 6.1.90-1.1
kernel: Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1203 opal:skiboot-9858186 PowerNV
kernel: NIP: c00800000f4d2120 LR: c008000012fe7270 CTR: c00800000f4d2198
kernel: REGS: c00020001bc571d0 TRAP: 0700 Not tainted (6.1.0-21-powerpc64le-4k Debian 6.1.90-1.1)
kernel: MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 84824244 XER: 20040036
kernel: CFAR: c00800000f4d1c34 IRQMASK: 0
GPR00: c008000012fe7270 c00020001bc57470 c00800000f508800 c0000003b100c5c0
GPR04: 0000000000000000 0000000ffcac0000 0000000000041a8b 0000000000000000
GPR08: 0000000000041a8a c00020001af87738 0000000000000000 c008000013564c90
GPR12: c00800000f4d2198 c000000ffffdbf00 c00020001b054b80 c0002000ef821798
GPR16: 00000000002c6800 0000000000000001 0000000000000000 0000000000000000
GPR20: 00000000002c6880 c000200015080000 0000000000000071 00000000002c6800
GPR24: 0000000000000003 c00020001af87010 0000000000000000 000000003ee00000
GPR28: c0000003b100c458 c000200015080000 c000200015085508 c0000003b100c5c0
kernel: NIP [c00800000f4d2120] ttm_bo_release+0x538/0x5b0 [ttm]
kernel: LR [c008000012fe7270] amdgpu_bo_unref+0x38/0x60 [amdgpu]
kernel: Call Trace:
kernel: [c00020001bc57470] [c00800000f4d1f18] ttm_bo_release+0x330/0x5b0 [ttm] (unreliable)
kernel: [c00020001bc57500] [c008000012fe7270] amdgpu_bo_unref+0x38/0x60 [amdgpu]
kernel: [c00020001bc57530] [c0080000130105fc] amdgpu_vm_ptes_update+0xc24/0xc60 [amdgpu]
kernel: [c00020001bc576a0] [c00800001300935c] amdgpu_vm_update_range+0x304/0x880 [amdgpu]
kernel: [c00020001bc577c0] [c008000013009f04] amdgpu_vm_bo_update+0x2ec/0x630 [amdgpu]
kernel: [c00020001bc578e0] [c008000012ff0bcc] amdgpu_gem_va_ioctl+0x674/0x6b0 [amdgpu]
kernel: [c00020001bc57a20] [c008000012f07040] drm_ioctl_kernel+0x118/0x230 [drm]
kernel: [c00020001bc57a80] [c008000012f073b0] drm_ioctl+0x258/0x560 [drm]
kernel: [c00020001bc57bf0] [c008000012fc00b8] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu]
kernel: [c00020001bc57c40] [c0000000005493f4] sys_ioctl+0x744/0x1460
kernel: [c00020001bc57d40] [c00000000002afd8] system_call_exception+0x138/0x260
kernel: [c00020001bc57e10] [c00000000000c0f0] system_call_vectored_common+0xf0/0x280
kernel: --- interrupt: 3000 at 0x7fff8eb4433c
kernel: NIP: 00007fff8eb4433c LR: 00007fff8eb4433c CTR: 0000000000000000
kernel: REGS: c00020001bc57e80 TRAP: 3000 Not tainted (6.1.0-21-powerpc64le-4k Debian 6.1.90-1.1)
kernel: MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 44224840 XER: 00000000
kernel: IRQMASK: 0
GPR00: 0000000000000036 00007fff75b9bb20 00007fff8ec56f00 0000000000000053
GPR04: 00000000c0286448 00007fff75b9bc00 0000000000280000 00000002c5a00000
GPR08: 0000000000100000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff75ba68c0 00007fff75b9c028 00007fff75b9c178
GPR16: 00007fff75b9c508 0000000000000001 0000000000020000 00007fff75b9cd18
GPR20: 0000000000000001 0000000000020000 00007fff75b9be80 0000000000ea0000
GPR24: 0000000000000000 0000000000200000 0000000000000004 0000000000200000
GPR28: 0000000000000053 00000000c0286448 00007fff75b9bc00 00007ffefa5ca0a0
kernel: NIP [00007fff8eb4433c] 0x7fff8eb4433c
kernel: LR [00007fff8eb4433c] 0x7fff8eb4433c
kernel: --- interrupt: 3000
kernel: Instruction dump:
kernel: 4bfffe30 60000000 60000000 60420000 0fe00000 7c0802a6 fb610068 fba10078
kernel: f80100a0 60000000 60000000 60420000 <0fe00000> 4bffffe0 60000000 60420000
kernel: ---[ end trace 0000000000000000 ]---
kernel: [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-12)
kernel: Kernel attempted to read user page (0) - exploit attempt? (uid: 1000)
kernel: BUG: Kernel NULL pointer dereference on read at 0x00000000
kernel: Faulting instruction address: 0xc008000012fe8f30
kernel: Oops: Kernel access of bad area, sig: 11 [#1]
kernel: LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
kernel: Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge snd_seq_dummy snd_hrtimer snd_seq rfkill qrtr 8021q garp stp mrp llc overlay sunrpc binfmt_misc ext4 crc16 mbcache jbd2 uvcvideo videobuf2_vmalloc videobuf2_memops snd_usb_audio snd_hda_codec_hdmi videobuf2_v4l2 videobuf2_common snd_usbmidi_lib snd_hda_intel snd_rawmidi snd_intel_dspcfg videodev snd_seq_device evdev joydev snd_hda_codec mc snd_hda_core snd_hwdep snd_pcm sg snd_timer snd ofpart soundcore ipmi_powernv powernv_flash ctr ipmi_devintf at24 vmx_crypto mtd regmap_i2c ipmi_msghandler gf128mul opal_prd parport_pc lp parport fuse configfs loop ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress xts ecb uas usb_storage dm_crypt dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic
kernel: raid1 raid0 multipath linear md_mod hid_generic usbhid hid sd_mod amdgpu gpu_sched drm_buddy i2c_algo_bit drm_display_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt xhci_pci fb_sys_fops xhci_hcd nvme nvme_core t10_pi tg3 drm mpt3sas crc64_rocksoft_generic usbcore crc64_rocksoft crc_t10dif crct10dif_generic crc64 crct10dif_common drm_panel_orientation_quirks raid_class libphy usb_common scsi_transport_sas
kernel: CPU: 4 PID: 6567 Comm: firefox-es:cs0 Tainted: G W 6.1.0-21-powerpc64le-4k #1 Debian 6.1.90-1.1
kernel: Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1203 opal:skiboot-9858186 PowerNV
kernel: NIP: c008000012fe8f30 LR: c008000013030f54 CTR: c00000000098c350
kernel: REGS: c000200033e2ae10 TRAP: 0300 Tainted: G W (6.1.0-21-powerpc64le-4k Debian 6.1.90-1.1)
kernel: MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24224244 XER: 200400dd
kernel: CFAR: c008000013030f50 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
GPR00: c008000013030f54 c000200033e2b0b0 c00800001377c600 c000200015085508
GPR04: c0000003b100c400 0000000000000000 000000003ee00000 0000000000000080
GPR08: 0000000000001000 0000000000000000 c0000003cce4e800 c008000013565488
GPR12: c00000000098c350 c000000ffffdcc00 00000000002c6880 00000000002c6880
GPR16: 0000000000080000 c0000003b100c400 0000000000001000 c000200033e2b408
GPR20: 0000000000000000 c000200015080000 c0000003b100c400 0000000000000000
GPR24: 000000003ee00000 c0000003cce4e800 00000000000003f1 0000000000001000
GPR28: 000000003ee00000 c000200033e2b3e8 0000000000000080 0000000000000000
kernel: NIP [c008000012fe8f30] amdgpu_bo_gpu_offset_no_check+0x28/0x78 [amdgpu]
kernel: LR [c008000013030f54] amdgpu_vm_sdma_set_ptes+0x5c/0x1b0 [amdgpu]
kernel: Call Trace:
kernel: [c000200033e2b0b0] [c000200033e2b110] 0xc000200033e2b110 (unreliable)
kernel: [c000200033e2b0e0] [c008000013030f54] amdgpu_vm_sdma_set_ptes+0x5c/0x1b0 [amdgpu]
kernel: [c000200033e2b150] [c00800001303195c] amdgpu_vm_sdma_update+0x3b4/0x440 [amdgpu]
kernel: [c000200033e2b220] [c00800001300fd40] amdgpu_vm_ptes_update+0x368/0xc60 [amdgpu]
kernel: [c000200033e2b390] [c00800001300935c] amdgpu_vm_update_range+0x304/0x880 [amdgpu]
kernel: [c000200033e2b4b0] [c008000013009f04] amdgpu_vm_bo_update+0x2ec/0x630 [amdgpu]
kernel: [c000200033e2b5d0] [c008000012ff5a28] amdgpu_cs_ioctl+0x1610/0x22c0 [amdgpu]
kernel: [c000200033e2b880] [c008000012f07040] drm_ioctl_kernel+0x118/0x230 [drm]
kernel: [c000200033e2b8e0] [c008000012f073b0] drm_ioctl+0x258/0x560 [drm]
kernel: [c000200033e2ba50] [c008000012fc00b8] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu]
kernel: [c000200033e2baa0] [c0000000005493f4] sys_ioctl+0x744/0x1460
kernel: [c000200033e2bba0] [c00000000002afd8] system_call_exception+0x138/0x260
kernel: [c000200033e2be10] [c00000000000c0f0] system_call_vectored_common+0xf0/0x280
kernel: --- interrupt: 3000 at 0x7fff8eb4433c
kernel: NIP: 00007fff8eb4433c LR: 00007fff8eb4433c CTR: 0000000000000000
kernel: REGS: c000200033e2be80 TRAP: 3000 Tainted: G W (6.1.0-21-powerpc64le-4k Debian 6.1.90-1.1)
kernel: MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 48884842 XER: 00000000
kernel: IRQMASK: 0
GPR00: 0000000000000036 00007fff62bfe1f0 00007fff8ec56f00 0000000000000053
GPR04: 00000000c0186444 00007fff62bfe2f0 0000000000180000 00007fff62bfe478
GPR08: 0000000000100000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff62c068c0 00007fff7dc86600 0000000000000005
GPR16: 00007fff623f0000 0000000000000001 0000000000000031 0000000000000001
GPR20: fffffffffffffffd 0000000000000000 00007fff59870000 00007fff59860000
GPR24: 00007fff8b4e1000 00007fff62bfe428 00007fff62bfe448 0000000000000000
GPR28: 0000000000000053 00000000c0186444 00007fff62bfe2f0 00007fff62bfe2d0
kernel: NIP [00007fff8eb4433c] 0x7fff8eb4433c
kernel: LR [00007fff8eb4433c] 0x7fff8eb4433c
kernel: --- interrupt: 3000
kernel: Instruction dump:
kernel: 007936f8 00000000 3c4c0079 384236f8 7c0802a6 60000000 7c0802a6 fbe1fff8
kernel: f8010010 f821ffd1 e92301c8 e86301a8 <ebe90000> 80890010 3863aaf8 7bff83e4
kernel: ---[ end trace 0000000000000000 ]---
kernel:
kernel: note: firefox-es:cs0[6567] exited with irqs disabled
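Both traces above pass through amdgpu_vm_bo_update, amdgpu_vm_update_range and amdgpu_vm_ptes_update. To make that kind of comparison easier on longer logs, here is a small Python sketch that pulls the symbol and module names out of "Call Trace" lines. The function name call_trace_symbols and the regular expression are my own assumptions, based only on the line format seen in this log; other kernels or architectures may format these lines differently.

```python
import re

# Matches lines such as:
#   kernel: [c00020001bc57500] [c008000012fe7270] amdgpu_bo_unref+0x38/0x60 [amdgpu]
# The trailing "[module]" part is optional (built-in symbols have none).
FRAME_RE = re.compile(
    r"\[[0-9a-f]{16}\] \[[0-9a-f]{16}\] "
    r"(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def call_trace_symbols(lines):
    """Return a list of (symbol, module_or_None) for each stack frame line."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames
```

Frames without a symbol+offset (such as the raw address on the first frame of the second trace) are skipped by this sketch.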