Raptor Computing Systems Community Forums (BETA)

Software => Operating Systems and Porting => Topic started by: MauryG5 on January 15, 2022, 10:32:42 am

Title: New Kernel 5.16 and new problem
Post by: MauryG5 on January 15, 2022, 10:32:42 am
Dear Power folks, as you know the new Kernel 5.16 has just come out and, as usual, I wanted to try it immediately by compiling my own version.  From 5.14 onwards you have to set the AMD GPU parameter to 0, otherwise it won't start.  After doing all that and booting, this lovely message appeared, which you can see in the picture. At that point everything freezes and you have to reset and restart ... Is it really possible that every new Kernel has to bring a new problem ??? !!!!  I can't understand it!
Title: Re: New Kernel 5.16 and new problem
Post by: tle on January 16, 2022, 07:28:47 pm
I have not tried it yet, still waiting for the official build from the Fedora team.

However I do not think 5.16 is that much different from 5.14, because there aren't major changes that benefit PPC.

The AMDGPU driver is still very unreliable, so a system freeze is to be expected. What page size do you compile the kernel with?
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on January 17, 2022, 01:00:30 am
Ever since Daniel Pocock showed me the difference, I have always used 4K pages, and since then the Kernel has always worked on Navi 10.  Now I don't know what is missing; it seems that 5.16 brought some change that requires whatever is named on the boot screen you see ... I don't know ...
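For anyone unsure which page size their running kernel actually uses, here is a minimal sketch. It assumes a POSIX shell with getconf available; the helper name page_size_kib is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the running kernel's page size in KiB.
# `getconf PAGESIZE` reports bytes (4096 for 4K pages, 65536 for 64K pages).
page_size_kib() {
    bytes=$(getconf PAGESIZE) || return 1
    echo $((bytes / 1024))
}

echo "Running kernel page size: $(page_size_kib)K"
```

On a 4K-page kernel this should report 4, on a 64K-page kernel 64. At build time the choice is made with CONFIG_PPC_4K_PAGES or CONFIG_PPC_64K_PAGES.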
Title: Re: New Kernel 5.16 and new problem
Post by: Logout on January 17, 2022, 04:29:18 am
Quote from: tle
However I do not think 5.16 is that much different to 5.14 because there aren't major change that benefits PPC.

There should be support from 5.16 for KVM on Book3e PowerPC CPUs derived from e6500 like the T2080 in the future PowerPC notebook and current T2080 dev boards. That is no benefit for us (= POWER9 users), but certainly quite a step for people on other current PPC64 machines as there in fact are just the two groups - big-iron-derived POWER9 and embedded Book3e CPUs.
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on January 17, 2022, 01:17:39 pm
Quote from: tle
I have not tried it yet, still waiting for official build from Fedora team.

However I do not think 5.16 is that much different to 5.14 because there aren't major change that benefits PPC.

The AMDGPU is still very unreliable so system freeze is expected. What pagesize do you compile the kernel with?

Personally, I removed Fedora from the computer and put it aside; I don't use it anymore. It has been bad since it switched to Gnome 40, and they haven't improved anything on that front, so I gave up. Ubuntu is much better: it handles Gnome better in all its forms, it updates a lot, and with 22.04 it will have an optimization for POWER9. As far as I know it is currently the only distro that has this in the pipeline. It has some problems, like the unresolved bug in the Firefox history or the audio CDs that it does not read, but these are things that can be overlooked because the distro itself works perfectly, unlike Fedora or Debian ... In any case, the problem is again AMD GPU management: they changed something, and now we have to work out what it takes to get it started. Surely other people will notice the problem and start investigating ...
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on March 18, 2022, 01:20:05 pm
Guys, sorry, but is there still no news about the Kernel 5.16 problems with AMD Radeon graphics cards? It's been a while now, and I think 5.16 will have reached you via Fedora too ...
Title: Re: New Kernel 5.16 and new problem
Post by: SiteAdmin on March 19, 2022, 09:43:25 pm
The issues should be fixed in the upstream kernel GIT tree:

https://www.spinics.net/lists/amd-gfx/msg74710.html

https://lwn.net/Articles/886568/

Can you confirm the issues are resolved for you on 5.16.2 or higher?
Title: Re: New Kernel 5.16 and new problem
Post by: MPC7500 on March 20, 2022, 08:27:35 am
Kernel 5.16 has a different issue. Even AST GPU doesn't work.
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on March 20, 2022, 08:38:20 am
I didn't know about the AST GPU, to be honest. Personally I am still using 5.15, given the problems with 5.16, so I don't know whether anything has changed. I would like some confirmation before compiling a 5.16 version ...
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on March 22, 2022, 01:52:56 pm
I just tried to compile and install the new 5.17 kernel, but the problem is still the same as with 5.16: something is missing and it does not start ... this is the error message at startup ...
Title: Re: New Kernel 5.16 and new problem
Post by: SiteAdmin on March 22, 2022, 08:28:36 pm
Has this failure been reported upstream to the graphics maintainers?  If so, can you provide a link to the bug report?
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on March 23, 2022, 04:47:08 am
I don't know how to report the problem or where to do it. When I first noticed it on the first 5.16 release I started this discussion here on the forum, but unfortunately I don't know specifically how and where to report it ... I need your help with this ...
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on March 23, 2022, 08:02:37 am
Blackbird + Radeon 5700 XT + debian sid + kernel 5.16.16

Code: [Select]
[    3.072146] Unrecoverable VMX/Altivec Unavailable Exception f20 at c008000001f3357c
[    3.072170] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 [#1]
[    3.072196] LE PAGE_SIZE=4K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[    3.072212] Modules linked in: amdgpu(+) gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea sd_mod sysfillrect sysimgblt xhci_pci fb_sys_fops xhci_hcd nvme tg3 usbcore drm nvme_core crc32c_vpmsum libphy t10_pi ahci crc_t10dif crct10dif_generic ptp crct10dif_vpmsum usb_common libahci pps_core crct10dif_common drm_panel_orientation_quirks
[    3.072310] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.16.16 #1
[    3.072336] Workqueue: events work_for_cpu_fn
[    3.072361] NIP:  c008000001f3357c LR: c008000001f344cc CTR: 0000000000000000
[    3.072395] REGS: c000000001aa32b0 TRAP: 0f20   Not tainted  (5.16.16)
[    3.072428] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 84002220  XER: 00000004
[    3.072472] CFAR: c008000001f344c8 IRQMASK: 0
[    3.072472] GPR00: c008000001f344cc c000000001aa3550 c00800000224d000 0000000000000000
[    3.072472] GPR04: c000000004e80000 c000000004c8c800 c000000004c8d000 c0ccc80404c8b0c0
[    3.072472] GPR08: c0000003ff3a4328 0000000000000000 0000000000000000 c000000004c8cc00
[    3.072472] GPR12: c000000000477530 c000000001777000 c00000000016b338 c000000003e66120
[    3.072472] GPR16: c000000003e66128 c000000003e60000 c000000003e76ac0 c000000003e66138
[    3.072472] GPR20: c000000003e66140 c000000003e66130 0000000000000100 0000000000000001
[    3.072472] GPR24: 0000000000000001 c00800000225ba38 0000000000000001 c000000003e75bc1
[    3.072472] GPR28: c000000004e90000 c000000001aa37b0 c000000004e80000 c000000004c8c800
[    3.072726] NIP [c008000001f3357c] dcn20_resource_construct+0x44/0xf30 [amdgpu]
[    3.073010] LR [c008000001f344cc] dcn20_create_resource_pool+0x64/0x100 [amdgpu]
[    3.073291] Call Trace:
[    3.073299] [c000000001aa3550] [c008000001f344b0] dcn20_create_resource_pool+0x48/0x100 [amdgpu] (unreliable)
[    3.073585] [c000000001aa35d0] [c00800000205d7f0] dc_create_resource_pool+0x2f8/0x3a0 [amdgpu]
[    3.073860] [c000000001aa3600] [c00800000204e694] dc_create+0x1cc/0x680 [amdgpu]
[    3.074129] [c000000001aa36b0] [c008000001eb93c4] amdgpu_dm_init.isra.0+0x1ec/0x1dd0 [amdgpu]
[    3.074404] [c000000001aa3910] [c008000001ebafd0] dm_hw_init+0x28/0x60 [amdgpu]
[    3.074672] [c000000001aa3940] [c008000001c0ac9c] amdgpu_device_init+0x1d34/0x21b0 [amdgpu]
[    3.074899] [c000000001aa3a90] [c008000001c0c3b0] amdgpu_driver_load_kms+0x48/0x370 [amdgpu]
[    3.075118] [c000000001aa3b10] [c008000001c01e6c] amdgpu_pci_probe+0x264/0x400 [amdgpu]
[    3.075336] [c000000001aa3bc0] [c000000000802688] local_pci_probe+0x68/0x110
[    3.075365] [c000000001aa3c40] [c000000000158e98] work_for_cpu_fn+0x38/0x60
[    3.075391] [c000000001aa3c70] [c00000000015ec28] process_one_work+0x2a8/0x590
[    3.075428] [c000000001aa3d10] [c00000000015f810] worker_thread+0x2a0/0x610
[    3.075463] [c000000001aa3da0] [c00000000016b4e8] kthread+0x1b8/0x1c0
[    3.075488] [c000000001aa3e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
[    3.075507] Instruction dump:
[    3.075517] fb61ffd8 fbc1fff0 fbe1fff8 fa81ffa0 faa1ffa8 fac1ffb0 fae1ffb8 fb01ffc0
[    3.075549] fb21ffc8 fb41ffd0 fb81ffe0 fba1ffe8 <100004c4> 3920ff90 3940ff80 7c9f2378
[    3.075583] ---[ end trace d9884fa0e00b9a00 ]---
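
The bracketed word in the instruction dump marks the instruction at the faulting address (NIP). Below is a rough sketch of pulling it apart with shell arithmetic. The field layout (6-bit primary opcode, three 5-bit register fields, 11-bit extended opcode) is my reading of the Power ISA VX-form, and the variable names are made up for the example.

```shell
#!/bin/sh
# Last line of the "Instruction dump" from the oops above.
dump='fb21ffc8 fb41ffd0 fb81ffe0 fba1ffe8 <100004c4> 3920ff90 3940ff80 7c9f2378'

# The word in angle brackets is the instruction at the faulting address.
word=$(printf '%s\n' "$dump" | sed -n 's/.*<\([0-9a-f]*\)>.*/\1/p')

# Assumed VX-form layout: opcode(6) VRT(5) VRA(5) VRB(5) XO(11).
insn=$((0x$word))
primary=$(( insn >> 26 ))
vrt=$(( (insn >> 21) & 31 ))
vra=$(( (insn >> 16) & 31 ))
vrb=$(( (insn >> 11) & 31 ))
xo=$(( insn & 0x7ff ))

echo "word=0x$word primary=$primary vrt=$vrt vra=$vra vrb=$vrb xo=$xo"
```

Primary opcode 4 is where the vector (VMX/Altivec) instructions live, and if I read the ISA tables correctly, extended opcode 1220 with all register fields zero is vxor v0,v0,v0, a common idiom for zeroing a vector register. That would point to a compiler-generated vector instruction inside dcn20_resource_construct rather than hand-written Altivec code, but this is an inference from the dump, not something confirmed here.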
Title: Re: New Kernel 5.16 and new problem
Post by: MPC7500 on March 23, 2022, 11:42:51 am
This is exactly the error message I remember and this has nothing to do with amdgpu.
Code: [Select]
Unrecoverable VMX/Altivec Unavailable Exception
Title: Re: New Kernel 5.16 and new problem
Post by: ClassicHasClass on March 23, 2022, 12:36:45 pm
Seeing if @sharkcz knows anything about it.
Title: Re: New Kernel 5.16 and new problem
Post by: sharkcz on March 23, 2022, 12:46:56 pm
Quote from: ClassicHasClass
Seeing if @sharkcz knows anything about it.

Hmm, it doesn't ring a bell, but is it reproducible? I believe a standard kernel shouldn't have problems executing VMX instructions ...
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on March 23, 2022, 04:18:40 pm
Indeed, it seems strange to me too that there is suddenly a conflict between Altivec and the new kernels, because it would mean that the problem is specific to us on Power ... I don't know what is happening on x86 or Arm in this respect ...
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on March 23, 2022, 04:28:13 pm
stock debian sid kernel:

Code: [Select]
[    2.714570] Unrecoverable VMX/Altivec Unavailable Exception f20 at c008000002944d5c
[    2.714596] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 [#1]
[    2.714612] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[    2.714637] Modules linked in: amdgpu(E+) gpu_sched(E) i2c_algo_bit(E) drm_ttm_helper(E) ttm(E) xhci_pci(E) sd_mod(E) drm_kms_helper(E) xhci_hcd(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) tg3(E) nvme(E) libphy(E) usbcore(E) nvme_core(E) ptp(E) drm(E) t10_pi(E) ahci(E) pps_core(E) crc_t10dif(E) crct10dif_generic(E) usb_common(E) crct10dif_common(E) libahci(E) drm_panel_orientation_quirks(E)
[    2.714777] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G            E     5.16.0-5-powerpc64le #1  Debian 5.16.14-1
[    2.714821] Workqueue: events work_for_cpu_fn
[    2.714846] NIP:  c008000002944d5c LR: c008000002945c6c CTR: 0000000000000000
[    2.714871] REGS: c000000002f53290 TRAP: 0f20   Tainted: G            E      (5.16.0-5-powerpc64le Debian 5.16.14-1)
[    2.714919] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 84002220  XER: 00000004
[    2.714953] CFAR: c008000002945c68 IRQMASK: 0
[    2.714953] GPR00: c008000002945c6c c000000002f53530 c008000002c78000 0000000000000000
[    2.714953] GPR04: c000000034450000 c000000022eda000 c000000022eda800 c0a4ed2222ed30c0
[    2.714953] GPR08: c0000003fc2a2328 0000000000000000 0000000000000000 c000000022eda400
[    2.714953] GPR12: c00000000048a380 c0000000028b0000 c000000000176ac8 c00000000cfe5e80
[    2.714953] GPR16: c00000000cfe5e88 c00000000cfe0000 c00000000cff6820 c00000000cfe5e98
[    2.714953] GPR20: c00000000cfe5ea0 c00000000cfe5e90 0000000000000100 0000000000000001
[    2.714953] GPR24: 0000000000000001 0000000000000001 c008000002c86a60 c00000000cff0000
[    2.714953] GPR28: c000000034460000 c000000002f53790 c000000034450000 c000000022eda000
[    2.715175] NIP [c008000002944d5c] dcn20_resource_construct+0x44/0xef0 [amdgpu]
[    2.715438] LR [c008000002945c6c] dcn20_create_resource_pool+0x64/0x100 [amdgpu]
[    2.715724] Call Trace:
[    2.715741] [c000000002f53530] [c008000002945c50] dcn20_create_resource_pool+0x48/0x100 [amdgpu] (unreliable)
[    2.716014] [c000000002f535b0] [c008000002a6f540] dc_create_resource_pool+0x2f8/0x3a0 [amdgpu]
[    2.716274] [c000000002f535e0] [c008000002a60364] dc_create+0x1cc/0x650 [amdgpu]
[    2.716515] [c000000002f53690] [c0080000028ca584] amdgpu_dm_init.isra.0+0x1ec/0x1df0 [amdgpu]
[    2.716780] [c000000002f538f0] [c0080000028cc1b0] dm_hw_init+0x28/0x60 [amdgpu]
[    2.717045] [c000000002f53920] [c008000002619c78] amdgpu_device_init+0x1c00/0x2190 [amdgpu]
[    2.717270] [c000000002f53a70] [c00800000261b4f0] amdgpu_driver_load_kms+0x48/0x370 [amdgpu]
[    2.717513] [c000000002f53af0] [c008000002610ef4] amdgpu_pci_probe+0x2dc/0x4c0 [amdgpu]
[    2.717723] [c000000002f53bc0] [c00000000081c598] local_pci_probe+0x68/0x110
[    2.717750] [c000000002f53c40] [c0000000001644f8] work_for_cpu_fn+0x38/0x60
[    2.717776] [c000000002f53c70] [c00000000016a2ec] process_one_work+0x2ac/0x5a0
[    2.717804] [c000000002f53d10] [c00000000016aed0] worker_thread+0x2a0/0x610
[    2.717840] [c000000002f53da0] [c000000000176c78] kthread+0x1b8/0x1c0
[    2.717865] [c000000002f53e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
[    2.717902] Instruction dump:
[    2.717921] fb61ffd8 fbc1fff0 fbe1fff8 fa21ff88 fa81ffa0 faa1ffa8 fac1ffb0 fae1ffb8
[    2.717964] fb01ffc0 fb21ffc8 fb81ffe0 fba1ffe8 <100004c4> 3920ff90 3940ff80 7c9f2378
[    2.718007] ---[ end trace 30cf29bfebd0290d ]---

Will happily provide more info, if needed.
Title: Re: New Kernel 5.16 and new problem
Post by: SiteAdmin on March 23, 2022, 05:25:00 pm
Upstream bug report:

https://gitlab.freedesktop.org/drm/amd/-/issues/1949
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on March 23, 2022, 05:35:47 pm
Confirmed then, apparently: unfortunately the problem is specific to us on Power ...
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on March 24, 2022, 01:51:26 am
Thanks a lot for the bug report!
Looking forward to testing a patch that will hopefully resolve the regression.
Title: Re: New Kernel 5.16 and new problem
Post by: MPC7500 on March 24, 2022, 12:55:36 pm
Quote from: MPC7500
This is exactly the error message I remember and this has nothing to do with amdgpu.
Code: [Select]
Unrecoverable VMX/Altivec Unavailable Exception

I have to correct myself, AST is working on 5.16.x
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on March 24, 2022, 03:18:31 pm
Sorry guys, speaking of kernels: on Debian I was trying a different procedure to compile the Kernel. Until now I have always used make menuconfig to configure, followed by make -j32, make modules, make modules_install and make install, which compiles and installs the modules and kernel and automatically updates Grub. The only flaw of this procedure is that if you ask the system for all installed kernels, it never shows the one installed this way, so you cannot remove it with the usual purge command. You have to remove it manually, file by file, and the system never recognizes it as an officially installed kernel, even though it works fine in the end. If you instead use the command sudo make -j 32 KDEB_PKGVERSION=1.-maury.ppc64le deb-pkg, it creates all the packages with the .deb extension, you install them with dpkg -i, and the kernel is recognized by the system as well. Except that, while I managed to get this to work on Ubuntu, on Debian it always gives me this error:
make[2]: *** [debian/rules:7: build-arch] Error 2
dpkg-buildpackage: error: debian/rules binary subprocess returned exit status 2
make[1]: *** [scripts/Makefile.package:77: deb-pkg] Error 2
make: *** [Makefile:1576: deb-pkg] Error 2

I did some research on Google and saw that others solved it by installing some libraries, but nothing had any effect for me. They say to configure the CONFIG TRUSTED KEY option, but I have always handled that by deleting all the text, and it has always worked with the classic make -j32. Here, nothing: it keeps giving me that damn error ... I don't understand why it doesn't want to work on Debian ... Can you tell me what's missing or where the error is? Thank you
Title: Re: New Kernel 5.16 and new problem
Post by: SiteAdmin on March 24, 2022, 05:36:59 pm
Quote from: matgraf
Thanks a lot for the bug report!
Looking forward to test a patch which will hopefully resolve the regression.

They'd like a bisect on that bug report, any chance you could do that since you have a reproducible issue on your specific hardware?
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on March 25, 2022, 05:36:45 pm
Quote from: SiteAdmin
They'd like a bisect on that bug report, any chance you could do that since you have a reproducible issue on your specific hardware?

I will try. Never bisected before. Well, then git bisect start ...
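
For anyone else who has never bisected, here is a toy sketch of the mechanics. It builds a throwaway repository with ten commits, one of which introduces a "regression", and lets git bisect run find it. The repository, file name and test command are all made up for the demonstration; on a real kernel tree each step means building and booting the checked-out commit and marking it by hand with git bisect good or git bisect bad.

```shell
#!/bin/sh
# Toy demonstration of `git bisect run`; nothing here touches a kernel tree.
bisect_demo() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.name "Bisect Demo"
        git config user.email "demo@example.invalid"

        # Ten commits; commit 7 introduces the "regression" (the word BROKEN).
        i=1
        while [ "$i" -le 10 ]; do
            if [ "$i" -eq 7 ]; then
                echo "BROKEN" >> state.txt
            else
                echo "step $i" >> state.txt
            fi
            git add state.txt
            git commit -q -m "commit $i"
            i=$((i + 1))
        done

        # Arguments are: known-bad revision first, then known-good revision.
        git bisect start HEAD HEAD~9 >/dev/null

        # The test command must exit 0 for "good" and 1..127 (except 125) for "bad".
        git bisect run sh -c '! grep -q BROKEN state.txt' >/dev/null

        # refs/bisect/bad now points at the first bad commit.
        git log -1 --format=%s refs/bisect/bad
        git bisect reset >/dev/null 2>&1
    )
    rm -rf "$repo"
}

first_bad=$(bisect_demo)
echo "first bad: $first_bad"
```

The manual equivalent on a kernel checkout would be git bisect start, then git bisect bad and git bisect good on two known revisions (for example v5.16 and v5.15, assuming v5.15 boots and v5.16 does not), followed by one build-and-boot cycle per step.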
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on March 26, 2022, 01:18:43 pm
bisect of the error:

Code: [Select]
9fa0fb77132fe9e83f2b357fd5a2b16293a5b9ee is the first bad commit
commit 9fa0fb77132fe9e83f2b357fd5a2b16293a5b9ee
Author: Meenakshikumar Somasundaram <meenakshikumar.somasundaram@amd.com>
Date:   Tue Jan 26 15:15:33 2021 -0500

    drm/amd/display: USB4 DPIA enumeration and AUX Tunneling
   
    [WHY]
    To enable dc links for USB4 DPIA ports and AUX command tunneling
    for YELLOW_CARP_B0.
   
    [HOW]
    1) Created dc links for all USB4 DPIA ports in create_links().
       dc_link_construct() implementation is split for legacy DDC and DPIAs.
       As usb4 has no ddc, ddc->ddc_pin will be set to NULL for its dc link
       and this parameter will be used to identify the dc links as DPIA. The
       dc link for DPIA is further to be enhanced with implementation for link
       encoder and link initialization.
    2) usb4_dpia_count in struct resource_pool will be initialized to 4 in
       dcn31_resource_construct() if the DCN is YELLOW_CARP_B0.
    3) Enabled DMUB AUX via outbox for YELLOW_CARP_B0.
   
    Reviewed-by: Jimmy Kizito <Jimmy.Kizito@amd.com>
    Acked-by: Wayne Lin <Wayne.Lin@amd.com>
    Acked-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
    Acked-by: Harry Wentland <harry.wentland@amd.com>
    Signed-off-by: Meenakshikumar Somasundaram <meenakshikumar.somasundaram@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/display/dc/core/dc.c           | 32 +++++++++-
 drivers/gpu/drm/amd/display/dc/core/dc_link.c      | 71 +++++++++++++++++++++-
 drivers/gpu/drm/amd/display/dc/core/dc_link_ddc.c  |  3 +-
 drivers/gpu/drm/amd/display/dc/dcn31/dcn31_hwseq.c |  6 ++
 .../gpu/drm/amd/display/dc/dcn31/dcn31_resource.c  |  6 ++
 drivers/gpu/drm/amd/display/dc/inc/core_types.h    |  1 +
 drivers/gpu/drm/amd/display/dc/inc/dc_link_ddc.h   |  1 +
 drivers/gpu/drm/amd/display/dc/irq_types.h         |  5 +-
 8 files changed, 120 insertions(+), 5 deletions(-)

I am happy to do more.
Title: Re: New Kernel 5.16 and new problem
Post by: MPC7500 on March 26, 2022, 01:59:42 pm
Does the Kernel work then?
Title: Re: New Kernel 5.16 and new problem
Post by: SiteAdmin on March 26, 2022, 02:04:36 pm
@matgraf Awesome, thank you for that!  Let's see if upstream can work out how to fix it from there.
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on March 28, 2022, 03:37:24 pm
Hey guys, where are we with the Kernel fixes? I am ready to test as soon as you make something available ...
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 13, 2022, 01:07:25 pm
Guys, apparently we are stuck on the Kernel: still no news on a fix for the Power problem in Kernels 5.16 and 5.17, and 5.18 is on its way as well ...
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on April 17, 2022, 08:22:11 pm
@SiteAdmin I was able to pinpoint the error to a single line of code!
The last working commit is eabf2019b7e5bf8216e373a74e08f13ca6b6c550.
If I apply only the following part of the next commit, 9fa0fb77132fe9e83f2b357fd5a2b16293a5b9ee,
Code: [Select]
diff --git a/drivers/gpu/drm/amd/display/dc/inc/dc_link_ddc.h b/drivers/gpu/drm/amd/display/dc/inc/dc_link_ddc.h
index 4d7b271b6409..95fb61d62778 100644
--- a/drivers/gpu/drm/amd/display/dc/inc/dc_link_ddc.h
+++ b/drivers/gpu/drm/amd/display/dc/inc/dc_link_ddc.h
@@ -69,6 +69,7 @@ struct ddc_service_init_data {
        struct graphics_object_id id;
        struct dc_context *ctx;
        struct dc_link *link;
+       bool is_dpia_link;
 };
I get the following error:
Code: [Select]
[    1.813696] Unrecoverable VMX/Altivec Unavailable Exception f20 at c008000002933e0c
[    1.813720] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 [#1]
[    1.813731] LE PAGE_SIZE=4K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[    1.813742] Modules linked in: sd_mod amdgpu(+) gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt xhci_hcd fb_sys_fops nvme drm tg3 nvme_core usbcore crc32c_vpmsum t10_pi libphy crc_t10dif crct10dif_generic ahci ptp crct10dif_vpmsum pps_core libahci crct10dif_common drm_panel_orientation_quirks
[    1.813821] CPU: 0 PID: 237 Comm: kworker/0:3 Not tainted 5.15.0-rc2+ #1
[    1.813832] Workqueue: events work_for_cpu_fn
[    1.813862] NIP:  c008000002933e0c LR: c008000002934d3c CTR: 0000000000000000
[    1.813882] REGS: c0000000057c3250 TRAP: 0f20   Not tainted  (5.15.0-rc2+)
[    1.813901] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 84002240  XER: 00000021
[    1.813926] CFAR: c008000002934d38 IRQMASK: 0
[    1.813926] GPR00: c008000002934d3c c0000000057c34f0 c008000002c4b000 0000000000000000
[    1.813926] GPR04: c000000003e80000 c0000000058b5800 c0000000058b6000 c05c8b05058b18c0
[    1.813926] GPR08: c0000003ff3a4328 0000000000000000 005c8b05000000c0 78926402dcab8839
[    1.813926] GPR12: c00000000046dbf0 c000000001763000 c0000000038a0000 c0000000038a6120
[    1.813926] GPR16: c0000000038a6128 c0000000038a6118 c0000000038b6a88 c0000000038a6138
[    1.813926] GPR20: c0000000038a6140 c0000000038a6130 0000000000000100 0000000000000001
[    1.813926] GPR24: 0000000000000001 0000000000000001 0000000000000001 c000000034647bc0
[    1.813926] GPR28: c000000003e90000 c0000000057c37c8 c000000003e80000 c0000000058b5800
[    1.814077] NIP [c008000002933e0c] dcn20_resource_construct+0x44/0xf10 [amdgpu]
[    1.814269] LR [c008000002934d3c] dcn20_create_resource_pool+0x64/0x100 [amdgpu]
[    1.814437] Call Trace:
[    1.814452] [c0000000057c34f0] [c008000002934d20] dcn20_create_resource_pool+0x48/0x100 [amdgpu] (unreliable)
[    1.814645] [c0000000057c3570] [c008000002a5b520] dc_create_resource_pool+0x2f8/0x3a0 [amdgpu]
[    1.814827] [c0000000057c35a0] [c008000002a4cec8] dc_create+0x1d0/0xa80 [amdgpu]
[    1.815000] [c0000000057c36c0] [c0080000028b9ca4] amdgpu_dm_init.isra.0+0x1dc/0x1cb0 [amdgpu]
[    1.815177] [c0000000057c3920] [c0080000028bb7a0] dm_hw_init+0x28/0x60 [amdgpu]
[    1.815357] [c0000000057c3950] [c00800000260ab74] amdgpu_device_init+0x1d9c/0x21a0 [amdgpu]
[    1.815506] [c0000000057c3aa0] [c00800000260c200] amdgpu_driver_load_kms+0x48/0x370 [amdgpu]
[    1.815652] [c0000000057c3b20] [c008000002601c7c] amdgpu_pci_probe+0x1d4/0x370 [amdgpu]
[    1.815786] [c0000000057c3bc0] [c0000000007f6778] local_pci_probe+0x68/0x110
[    1.815819] [c0000000057c3c40] [c000000000158198] work_for_cpu_fn+0x38/0x60
[    1.815842] [c0000000057c3c70] [c00000000015df68] process_one_work+0x2a8/0x590
[    1.815866] [c0000000057c3d10] [c00000000015eb50] worker_thread+0x2a0/0x610
[    1.815899] [c0000000057c3da0] [c00000000016a7d4] kthread+0x184/0x190
[    1.815931] [c0000000057c3e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
[    1.815962] Instruction dump:
[    1.815978] fb61ffd8 fbc1fff0 fbe1fff8 fa81ffa0 faa1ffa8 fac1ffb0 fae1ffb8 fb01ffc0
[    1.816023] fb21ffc8 fb41ffd0 fb81ffe0 fba1ffe8 <100004c4> 3920ff90 3940ff80 7c9f2378
[    1.816058] ---[ end trace b28b055b72a2e4fc ]---
Title: Re: New Kernel 5.16 and new problem
Post by: ClassicHasClass on April 17, 2022, 11:07:05 pm
That seems a little odd. Literally just adding that bool to that struct makes it go crazy?
Title: Re: New Kernel 5.16 and new problem
Post by: rheaplex on April 18, 2022, 12:28:55 am
If the struct is allocated elsewhere with the old layout, or if the firmware is expecting it, that would explain it.
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on April 18, 2022, 04:07:39 am
The similar addition in the other file core_types.h does not trigger the error. This means that the following part of the problematic commit 9fa0fb77132fe9e83f2b357fd5a2b16293a5b9ee does not trigger the error:
Code: [Select]
diff --git a/drivers/gpu/drm/amd/display/dc/inc/core_types.h b/drivers/gpu/drm/amd/display/dc/inc/core_types.h
index ed09af238911..6fc6488c54c0 100644
--- a/drivers/gpu/drm/amd/display/dc/inc/core_types.h
+++ b/drivers/gpu/drm/amd/display/dc/inc/core_types.h
@@ -62,6 +62,7 @@ struct link_init_data {
        uint32_t connector_index; /* this will be mapped to the HPD pins */
        uint32_t link_index; /* this is mapped to DAL display_index
                                TODO: remove it when DC is complete. */
+       bool is_dpia_link;
 };
Title: Re: New Kernel 5.16 and new problem
Post by: ClassicHasClass on April 18, 2022, 05:05:17 pm
Maybe the problem is a mismatch between the struct definitions if it (or an equivalent object) is in different locations.
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on April 19, 2022, 10:48:30 am
Was able to fix the bug. Now I have a working kernel 5.17.3 !

Code: [Select]
diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
index 2a72517e2b28..1f83c7331b06 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
@@ -3721,7 +3721,7 @@ static bool dcn20_resource_construct(
        int i;
        struct dc_context *ctx = dc->ctx;
        struct irq_service_init_data init_data;
-       struct ddc_service_init_data ddc_init_data = {0};
+       struct ddc_service_init_data ddc_init_data;
        struct _vcs_dpi_soc_bounding_box_st *loaded_bb =
                        get_asic_rev_soc_bb(ctx->asic_id.hw_internal_rev);
        struct _vcs_dpi_ip_params_st *loaded_ip =
diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_resource.c b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_resource.c
index 8ca26383b568..f93b944f75fa 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_resource.c
@@ -2559,7 +2559,7 @@ static bool dcn30_resource_construct(
        int i;
        struct dc_context *ctx = dc->ctx;
        struct irq_service_init_data init_data;
-       struct ddc_service_init_data ddc_init_data = {0};
+       struct ddc_service_init_data ddc_init_data;
        uint32_t pipe_fuses = read_pipe_fuses(ctx);
        uint32_t num_pipes = 0;
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 19, 2022, 01:11:23 pm
Great, matgraf, great job! How can we take advantage of the fix now? Should we simply wait for the Linux Kernel team to fix it and then download the tarball from the Linux kernel archive site as usual, or do we need to do something else? Thanks
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on April 19, 2022, 03:33:29 pm
@MauryG5 Download a kernel source tree of choice and unpack it. Then download my patch and save it inside the top directory of the kernel source. Now apply the patch as follows from inside of the kernel source directory:
Code: [Select]
patch -p1 < fpu_exeption_fix_for_ppc64le_with_amdgpu.patch
Now configure and compile as usual.

Warning: I don't know exactly what my patch does. It was mere guesswork. I found the solution by looking at similar code in the file drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c and applying it to the other two files, after which the error was gone. I have no clue whether my patch has any other implications. Use at your own risk.
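
If you have not applied a patch before, a dry run first is a cheap safety net. The sketch below uses a toy tree and a toy patch (both made up for the example) in place of the kernel source and the real patch file; it assumes GNU patch, whose --dry-run option checks whether a patch would apply without changing any file.

```shell
#!/bin/sh
# Toy tree and patch, standing in for a kernel source tree and the real patch file.
work=$(mktemp -d)
mkdir -p "$work/tree/drivers"
printf 'struct init_data data = {0};\n' > "$work/tree/drivers/demo.c"

# Paths carry a/ and b/ prefixes, as in git-style diffs; -p1 strips them.
cat > "$work/demo.patch" <<'EOF'
--- a/drivers/demo.c
+++ b/drivers/demo.c
@@ -1 +1 @@
-struct init_data data = {0};
+struct init_data data;
EOF

(
    # Run from the top directory of the tree, as with the kernel source.
    cd "$work/tree" || exit 1
    # First pass changes nothing; it only reports whether the patch would apply.
    if patch -p1 --dry-run < ../demo.patch; then
        # Second pass actually modifies the file.
        patch -p1 < ../demo.patch
    fi
)
cat "$work/tree/drivers/demo.c"
```

The same two-step pattern works from the top directory of the unpacked kernel source with the real patch file name.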
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 19, 2022, 04:06:12 pm
Excuse me, but when you make a fix, shouldn't you tell the Kernel developers, warning them that a correction is needed for this architecture, so that the fix gets made for everyone?
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on April 19, 2022, 05:03:26 pm
Upstream is informed, see https://gitlab.freedesktop.org/drm/amd/-/issues/1949
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 20, 2022, 01:10:07 am
Well then, in theory the fix could already be in the next Kernel, 5.17.4, without the need to apply it manually ...
Title: Re: New Kernel 5.16 and new problem
Post by: ClassicHasClass on April 20, 2022, 11:44:59 am
I agree with the comments on that thread that skipping initialization is probably the wrong approach. The fact it works is likely a weird side effect. But great work on localizing where the problem is and I don't think it will take them long to find a better solution.
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 21, 2022, 01:46:23 pm
Hi matgraf, unfortunately, having never patched a kernel, I have no practice and I don't understand exactly where to apply this patch. I read in a guide on the net that it must be applied in the path /usr/src, but in the downloaded kernel there is only the usr directory; src is not there, and I don't understand where to apply the patch. That path exists in the system root directory, the one identified by the symbol /, but I can't find it in the directory of the Kernel I just downloaded ... Could you kindly explain how to do it? That way I too can start learning how to patch kernels ... Thanks
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 22, 2022, 03:28:30 pm
Never mind, matgraf, forget what I said: I only understood afterwards what you meant in your explanation and how to apply the patch. I am now up and running on version 5.17.4; your patch apparently works very well, congratulations. I'm testing it on Ubuntu and will then compile it for Debian. Thanks
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on April 22, 2022, 04:03:11 pm
Hi MauryG5, I am glad you got it running!
Using debian as well and compiled my latest kernel as follows:
Code: [Select]
cd
mkdir mylinux
cd mylinux
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.4.tar.xz
tar xf linux-5.17.4.tar.xz
wget http://ftp.de.debian.org/debian/pool/main/l/linux/linux-config-5.17_5.17.3-1_ppc64el.deb
dpkg-deb -x linux-config-5.17_5.17.3-1_ppc64el.deb ./
cp usr/src/linux-config-5.17/config.ppc64el_none_powerpc64le.xz ./
xz -d config.ppc64el_none_powerpc64le.xz
cd linux-5.17.4
curl -o fpu_exeption_fix_for_ppc64le_with_amdgpu.patch 'https://forums.raptorcs.com/index.php?action=dlattach;topic=332.0;attach=388'
patch -p1 < fpu_exeption_fix_for_ppc64le_with_amdgpu.patch
cp ../config.ppc64el_none_powerpc64le .config
echo CONFIG_PPC_4K_PAGES=y >> .config
make olddefconfig
make -j32 bindeb-pkg
sudo dpkg -i ../linux-headers-5.17.4_5.17.4-1_ppc64el.deb ../linux-image-5.17.4_5.17.4-1_ppc64el.deb
sudo reboot
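
Since the steps above append a line to .config and rely on make olddefconfig to sort out the result, it may be worth checking what actually ended up in the file before starting a long build. A minimal sketch follows; the helper name config_value is made up, and the demo runs on a throwaway file rather than a real kernel tree.

```shell
#!/bin/sh
# Hypothetical helper: report how a symbol is set in a kernel .config file.
# Prints the assigned value (y, m, ...), "n" for "# CONFIG_X is not set",
# or "absent" if the symbol does not appear at all.
# An explicit assignment wins over an "is not set" line; if there are
# several assignments, the last one is reported.
config_value() {
    file=$1
    sym=$2
    val=$(sed -n "s/^CONFIG_${sym}=//p" "$file" | tail -n 1)
    if [ -n "$val" ]; then
        echo "$val"
    elif grep -q "^# CONFIG_${sym} is not set" "$file"; then
        echo n
    else
        echo absent
    fi
}

# Self-contained demo on a throwaway file instead of a real kernel tree.
demo=$(mktemp)
printf '%s\n' 'CONFIG_PPC_64K_PAGES=y' '# CONFIG_PPC_4K_PAGES is not set' > "$demo"
echo "4K=$(config_value "$demo" PPC_4K_PAGES) 64K=$(config_value "$demo" PPC_64K_PAGES)"
```

In the kernel directory, running config_value .config PPC_4K_PAGES after make olddefconfig should print y if the 4K setting took effect.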
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 22, 2022, 04:38:07 pm
Hi matgraf, yes, I have just finished compiling and installing on Debian too. Everything went fine there as well, as you can see in the picture. I also used make bindeb-pkg after make menuconfig. Once it had created the .deb files, I installed all 4 with dpkg -i linux-*.deb, which installed the generated packages and then automatically updated Grub. Right now I'm writing from Debian with Kernel 5.17.4. Great job, thanks matgraf!

Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 25, 2022, 02:01:47 pm
Guys, I also wanted to point out 2 other things about this new Kernel 5.17. The first is that Debian now boots a few seconds faster, and it was already fast on its own. The second is that I no longer need that famous setting amdgpu.aspm=0. I don't know whether this is also due to our friend matgraf's patch or whether it is something they solved in the new kernel, but the fact is that it is no longer needed.
Title: Re: New Kernel 5.16 and new problem
Post by: matgraf on April 25, 2022, 03:39:23 pm
Didn't know I could ditch amdgpu.aspm=0 now. Thanks for the information! I always try to get as close to stock debian sid as possible, with as few tweaks as I can.
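
To check whether a given boot still carries the parameter, something like the following can be used. It is only a sketch: the helper name cmdline_has is made up, and it assumes the kernel command line is readable from /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel command line contains a parameter
# as a whole space-separated word.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Use the live command line when available, else an empty string.
current=$(cat /proc/cmdline 2>/dev/null || true)
if cmdline_has "$current" "amdgpu.aspm=0"; then
    echo "amdgpu.aspm=0 is still set on this boot"
else
    echo "amdgpu.aspm=0 is not set on this boot"
fi
```

If it is still set, the parameter has to be removed wherever the boot loader configuration adds it before the next boot will run without it.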
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on April 25, 2022, 03:52:07 pm
Yes matgraf, I confirm that I am no longer using that setting and the kernel boots normally. I thought you had tried it too, but in any case we are here to help each other and share information, so I'm happy the information was useful to you! Even on Ubuntu it boots without that line; you don't need it anymore! Now let's hope the official Kernel team integrates your fix into all future Kernels as soon as possible, so we won't have to apply it to every new kernel we compile.
Title: Re: New Kernel 5.16 and new problem
Post by: MauryG5 on May 28, 2022, 04:46:17 am
Hello everyone, I tested the new Kernel 5.18, just released, and I can tell you that you can finally compile and install the new kernel on Power without having to make any changes. They also fixed the last problem, the one for which our friend matgraf had to create the patch. Now you can compile and install your own custom Kernel without having to change anything. Tested on Ubuntu; in the next few days I will also install it as my stable kernel on Debian.