Raptor Computing Systems Community Forums (BETA)

Raptor Computing Systems Hardware => Blackbird => Topic started by: tle on August 27, 2020, 11:50:42 pm

Title: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: tle on August 27, 2020, 11:50:42 pm
Wondering if anyone has bumped into the issue with amdgpu crashing the kernel upon booting up? I am trying to isolate whether this issue occurs with all Radeon GPUs or not, or whether it could be related to something else.

I've reported the issue at https://bugzilla.redhat.com/show_bug.cgi?id=1855976 and https://gitlab.freedesktop.org/drm/amd/-/issues/1293

Many thanks in advance

P.S. My card is a Radeon R9 Nano, and the older 5.6 kernel works correctly.

** UPDATE **

I managed to get the card working by adding `amdgpu.dc=0` to the kernel parameters. Verified with 5.7.0.

However, there is likely another regression in kernels 5.8.x and 5.9.x, because the driver crashes instantly on loading.

** UPDATE **

Kernels 5.8, 5.9 and 5.11 run perfectly if the kernel is compiled with 4K pages.

** UPDATE **

Kernels 5.10.5 to 5.10.24 run perfectly with either 64K or 4K pages! So if you are on FC33, you don't have to do anything, it just works! :D
FYI Fedora 33 skips 5.10.24 to jump straight to 5.11.7. If you are looking for the F33 RPMs for the 5.10.x series, check out my fork at https://github.com/runlevel5/fedora-linux-kernel/releases/tag/5.10.24.f33

** UPDATE **

Kernel 5.11.12 or older (64K pages) still suffers the same issue. Please make sure you use 5.11.12 or newer
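
For anyone who wants to check a running system against the 5.10.x range reported above, here is a minimal sketch in Python. It only encodes the "5.10.5 to 5.10.24 works with either page size" report from this thread; the function names are hypothetical, and the release-string layout (for example `5.10.24-200.fc33.ppc64le`) is an assumption about how the distribution formats it.

```python
import re

# Range reported in this thread as working with both 4K and 64K pages.
# This is a user report, not an official compatibility statement.
GOOD_RANGE = ((5, 10, 5), (5, 10, 24))


def parse_kernel_release(release):
    """Return (major, minor, patch) from a kernel release string.

    Anything after the numeric prefix (distribution suffix) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_reported_good_range(release):
    """True if the release falls inside the range reported as working."""
    low, high = GOOD_RANGE
    return low <= parse_kernel_release(release) <= high
```

On a live system the release string can be obtained with `platform.release()` from the standard library and passed to `in_reported_good_range()`.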
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on August 28, 2020, 03:52:37 am
Hi TLE, I also reported the same problem to the Fedora team. Kernel 5.7.X still does not work with AMD Navi GPUs. We are now at version 5.7.15, or 5.7.14 if I remember correctly, but the kernel still does not boot with Navi. I have the Navi 10 Radeon 5700 XT. They haven't answered me yet. I don't know if any of you have received an answer about this problem, which has persisted for months on this kernel ...
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: MPC7500 on August 28, 2020, 10:13:13 am
Quote from: tle on August 27, 2020, 11:50:42 pm
I've reported the issue at https://bugzilla.redhat.com/show_bug.cgi?id=1855976

Maybe you would get more attention on this mailing list?
https://lists.ozlabs.org/listinfo/linuxppc-dev
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: ClassicHasClass on August 28, 2020, 11:14:42 am
No problem with my WX7100 (Polaris), so it may be specific to Navi cards.
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: MPC7500 on August 28, 2020, 11:35:02 am
I have a Navi too, with no problem, and the R9 Nano is a Fiji GPU, AFAIK.
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on August 28, 2020, 05:53:46 pm
Yes guys, the problem is unfortunately specific to the Navi 10 Radeon 5700 / 5700 XT and the Radeon R9 Nano. It does not occur on your cards. Let's hope they at least fix it for the 5.8.X kernel, because it seems to me that 5.7.X will stay as it is at this point ...
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: tle on August 28, 2020, 07:07:26 pm
I've just tried 5.8.4 (a testing kernel from the Bodhi build system) and yeah... the problem is still there.
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on August 29, 2020, 05:55:50 am
Still not solved with 5.8.X?  Sorry, but this is absurd. Is it possible that they are ignoring our complaints from so many sides?!
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: tle on September 11, 2020, 10:57:07 am
I can confirm this issue also occurs on Void Linux with kernel 5.8, so it is definitely an upstream regression.

Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: MPC7500 on September 11, 2020, 12:04:35 pm
Yes, since kernel 5.8.6 (https://gitlab.freedesktop.org/drm/amd/-/issues/1283) I also get a black screen.
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on September 11, 2020, 05:23:11 pm
Guys, at this point I think the only thing to do is report directly to the team that develops the Linux kernel. Something they introduced in version 5.7.X is broken, and since then they seem to be unaware that the kernel no longer works with the Navi or Nano cards ...
Title: Re: Fedora 32 Linux kernel 5.7.x crashes
Post by: tle on September 15, 2020, 08:20:31 pm
I've lodged a new ticket on the AMD DRM repo.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on September 19, 2020, 06:58:56 am
Thanks for what you do TLE, your contribution to the Power community is always valuable.  Let's hope someone acts, because unfortunately this problem has been going on for a long time now.  Every kernel I install through updates I have to delete immediately afterwards because it does not work. This is frustrating ...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on September 22, 2020, 06:32:27 pm
Did anybody test the Navi cards with 4k kernel page size, as described in the other thread (https://forums.raptorcs.com/index.php/topic,200.0.html)?

That fixed the issues for Nouveau
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MPC7500 on September 23, 2020, 03:22:04 am
I assume it has nothing to do with that, because with kernel 5.4 and the patch it worked perfectly. It's a FP regression (at least for Navi) (https://gitlab.freedesktop.org/drm/amd/-/issues/1215).
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on September 23, 2020, 04:19:49 am
Regressions can be correlated with any specific feature or aspect of the platform, they don't always arise spontaneously.

The 64k page size is a significant difference for low-level code in device drivers like GPUs.

Developers of the kernel and drivers normally make a series of unit tests and manual tests before releasing new code.  If they don't do any tests on systems with a 64k page size, using the same combination of CPU and GPU, then it is possible that all their tests appear to succeed and they release code including a regression.

Therefore, I highly recommend that somebody tests different permutations.  I only have the RX 580 for now so I can't test this with my own kernels.  I only have one ppc64el system for now, I plan to use it for other development but when I get to the point where I have multiple machines here then I could dedicate one to regression testing things like this.
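
As an illustration of what testing different permutations could look like in practice, here is a minimal Python sketch that enumerates a test matrix to fill in and attach to a bug report. The kernel versions, page sizes and cards are ones mentioned in this thread; the function name and the record layout are hypothetical and only one possible way to organise the results.

```python
from itertools import product

# Values mentioned in this thread; extend with whatever you can actually test.
KERNELS = ["5.6", "5.7", "5.8"]
PAGE_SIZES = ["4K", "64K"]
GPUS = ["R9 Nano (Fiji)", "RX 5700 XT (Navi 10)", "WX7100 (Polaris)"]


def build_test_matrix(kernels, page_sizes, gpus):
    """Return one record per combination, with the result left unfilled."""
    return [
        {"kernel": k, "page_size": p, "gpu": g, "result": None}
        for k, p, g in product(kernels, page_sizes, gpus)
    ]


matrix = build_test_matrix(KERNELS, PAGE_SIZES, GPUS)
for row in matrix:
    # result stays None until someone has actually booted that combination
    print(f"{row['kernel']:<5} {row['page_size']:<4} {row['gpu']:<22} {row['result']}")
```

Each tester only needs to fill in the rows matching the hardware they own, which keeps the individual effort small while still covering the combinations a developer would want to see.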
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on September 23, 2020, 06:48:48 am
What I don't understand is why they still haven't solved the problem, considering that we have been reporting it for several months on the official Fedora channels. TLE also reported it to AMD. What more should we do? I don't know...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on September 23, 2020, 12:12:18 pm
Developers are always busy.  We have lists of bugs and feature requests from many places.  We don't usually work through them in chronological order: they are prioritized in different ways, based on the urgency of an issue, the effort required to fix an issue, etc.

That said, developers like quick wins and low hanging fruit.  If people do some testing and prove which permutations of kernel settings, firmware and hardware are troublesome and which permutations are good and also provide log data, the developer behind the code might recognize what the problem is and make a quick fix for it.

If the developer has to obtain hardware and do the tests himself, he might lose a day on it, in fact, he might never get around to it.

To give a personal example, I often spend a few weeks working on a feature or major change to some code and then before making the official release, I look over the bug list for anything that is easy and I fix those things and include them in the release.  If a bug report doesn't have enough detail, I have to defer it to the next release cycle because I can't delay a release for something that I can't reproduce.

I personally have no plan to buy the RX 5700 right now, I was going to skip that generation and go directly onto Big Navi.  If somebody else wants to test with one of my kernels using 4k page size, I'm happy to provide some guidance.

If anybody has contacts at AMD to get sample hardware for developers under NDA, there are a few people, myself included, who are happy to test it and provide feedback and sometimes fixes.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on September 25, 2020, 12:52:40 am
I understand, thanks for your detailed explanation. From what I see you are also a developer for Power, which I am very pleased about. If I can ask you a slightly off-topic question, since you are a developer: how much software do we really have today that is developed natively on Power and therefore really exploits this architecture?
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on September 25, 2020, 03:10:12 pm
It is a good question

I don't claim to be an expert on POWER

On the other hand, I got my first computer, TRS-80 Color Computer 3, when I was about 10 and started learning the Motorola 6809.  This was really fortunate, because they used the Motorola chipset in my undergraduate studies and I had a huge advantage.

I go wherever a project takes me, from soldering together ham radio equipment to working in quantitative finance.

Most of the free, open source projects I work on are for communications.  In this domain, the highest priority is interoperability, it is no use if a user on one platform can't communicate with a user on another platform.  Metcalfe's law (https://en.wikipedia.org/wiki/Metcalfe%27s_law) tells us that the value of any communications system increases in proportion to the number of users squared.  This emphasizes how important it is for a network like SIP or XMPP to work across architectures.

Rather than designing software exclusively for POWER, my own goals typically involve designing or improving software so that it runs on any current or future platform.  This is an important goal.

Some of my recent activities include starting to investigate bugs in Blender (https://developer.blender.org/T80912) and generalizing that to GNU/Linux development (https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/OKKGSD5RRCIQTCHQPQDC4IHDGJMQEIJV/)

Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on September 25, 2020, 04:46:26 pm
Congratulations on your activities, very interesting. I presume you also know Motorola Solutions radio equipment well ... In any case, returning to our kernel discussion, the problem is not only on the 5700 but also on the other 5000-series cards, as well as on the Nano. I hope they solve it as soon as possible because we are already at version 5.8 and it still doesn't work ...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on September 27, 2020, 11:59:10 am

Motorola?   Talos II + RTL-SDR is my radio (https://danielpocock.com/quickstart-sdr-ham-radio-gqrx-gnu-radio/)

gqrx (and GNU Radio) just work after installing the packages. Please remember not to plug in the RTL-SDR dongle until after you install the packages.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MPC7500 on October 03, 2020, 05:16:01 pm
Kernel 5.8.12 works again for me. The sound is distorted for a short period of time. Almost perfect.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on October 04, 2020, 07:02:04 am
Well, let's hope that Navi 10 and the Nano also work at this point ... This morning I did various updates and saw kernel 5.8.12 but haven't downloaded it yet. I'll try it too and see what happens.  Thanks for reporting, MPC ...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on October 04, 2020, 10:54:34 am
I have just tried kernel 5.8.12: no luck, it doesn't work, the exact same problem on Navi 10. I'm starting to think that if it continues this way the problem will never be solved. It is incredible...! TLE, let us know if it works for you on the Nano; for me on Navi 10 it doesn't work, as usual....!
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: tle on October 24, 2020, 10:20:28 am
I have good news: I managed to get the driver working with the Linux 5.7.0 kernel by adding the `amdgpu.dc=0` parameter in GRUB2.

Unfortunately this trick no longer works with 5.8.x through 5.10.x.
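
To confirm that the `amdgpu.dc=0` parameter actually reached the booted kernel, the command line exposed at `/proc/cmdline` can be inspected. The following is a minimal Python sketch; the helper names are hypothetical, and returning the last occurrence of a repeated parameter is a choice made for this sketch, not a statement about how the kernel resolves duplicates.

```python
def get_kernel_param(cmdline, name):
    """Return the value of `name` from a kernel command line string.

    Returns None if the parameter is absent and '' for a bare flag.
    If the parameter appears more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

On a booted system, `get_kernel_param(read_cmdline(), "amdgpu.dc")` should return `"0"` if the workaround is in effect and `None` if the parameter was not passed.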
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on October 24, 2020, 03:13:22 pm

Does anybody have any idea what this means for Big Navi?

As they already merged patches into the kernel, should it just work out of the box?

The Big Navi launch is supposed to be on Wednesday, 28 October and we can potentially buy the cards in November.

Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: tle on October 25, 2020, 12:07:34 am
I think we could find out when one of us gets their hands on the new card next month....
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on November 12, 2020, 12:48:49 pm
Guys, I got an answer from the Fedora team about kernel 5.7/5.8 with AMD Navi 10 GPUs. They tell me that problems with AMD GPUs on our architecture are not handled by AMD, so it has to be one of us who finds them, like Daniel did on kernel 5.6. So unless one of our own people solves it first, there is unfortunately no way to fix the problem. A big problem, I would say ... We don't know when we will be able to get the bug fixed...
 :( :( :(
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on November 12, 2020, 12:53:30 pm
Did anybody try the kernel with 4k page size?  I feel it would be very useful to have feedback on that.

The Big Navi cards arrive on 18 November, next Wednesday and I'm probably going to try one of those in my system.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on November 12, 2020, 02:49:50 pm
I doubt the drivers needed to use Big Navi on Power will be available any time soon ... For Navi 10/14 we had to wait a long time ...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on November 12, 2020, 03:06:16 pm
Which drivers?

Something in the kernel or something in Xorg?

For some things, like OBS Studio, the code was there but nobody had created a package.  If it is a problem like that then it is very easy for me to make a package and put it in a repository.

If you can provide specific links to mailing list discussions or changelog entries, that can help me understand the actual issues involved.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on November 13, 2020, 04:08:52 am

I found a Phoronix article about RX 6000 support on GNU/Linux (https://www.phoronix.com/scan.php?page=article&item=radeon-rx-6000&num=1)

It looks like the issues are the same for every architecture, not only POWER9

Some things are already done, for example, the kernel 5.9 is in Fedora 33.  Debian stable-backports has 5.8 but sooner or later it will get 5.9.

The other things mentioned in the article (LLVM, GFX, Mesa) appear to be available in Debian unstable so it may be possible to simply run unstable or backport them to buster.

People can actually upgrade to Fedora 33, upgrade to Debian unstable or start running the newer kernel and/or preparing backports of these things before getting the RX 6000 series cards.  So you can see if everything else on your computer works normally with the new kernel and libraries before spending money on hardware.

Once you have all that software installed/upgraded, you may simply be able to insert the card and it works immediately, whether you are on x86 or POWER9
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on November 13, 2020, 07:07:44 am
I hope it is as you say.  Before we could use Navi 10 we had to wait months, and among other things a custom kernel had to be compiled before it became usable in the official one.  In any case, the Navi 10 problems on new kernels continue to this day, and they told me that it has to be one of you Power developers who finds the problem, because AMD doesn't even consider us ...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on November 13, 2020, 07:47:29 am
There are many bugs that are not so hard to fix.

Sometimes a bug that is only visible on one platform today reveals a quality issue that might cause future problems on other platforms.  Experienced developers usually know this and will be keen to resolve it if they have some hints from the people impacted by the bug.

It is related to the thread I started about having a critical mass of developers (https://forums.raptorcs.com/index.php/topic,225.0.html).  It is also a chicken-and-egg problem: the more developers tinkering with things on the platform, the more attractive it is for additional developers to join.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on November 13, 2020, 01:37:05 pm
I also tried writing to Daniel Kolesa, one of the developers who deals with kernel fixes on Power. I hope he can tell me something about it ...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: ClassicHasClass on November 14, 2020, 10:56:45 am
He's here (q66).
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MPC7500 on November 14, 2020, 11:41:18 am
He is mostly on IRC
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on November 14, 2020, 01:03:29 pm

I'm thinking about exchanging my Talos II for two of the Talos II Lite motherboards.  Then I could use one as my regular workstation and the other as both a test system and a spare.

If I do that then it will be much easier for me to test new GPUs and drivers and anything else that requires reboots, X server restarts and potential crashes.


Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on November 14, 2020, 05:17:29 pm
Great, do you think that by doing this you would also be able to find this damn bug that has been haunting us for months now?
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: ClassicHasClass on November 16, 2020, 12:44:45 pm

Quote from: pocock on November 14, 2020, 01:03:29 pm
I'm thinking about exchanging my Talos II for two of the Talos II Lite motherboards.  Then I could use one as my regular workstation and the other as both a test system and a spare.

If I do that then it will be much easier for me to test new GPUs and drivers and anything else that requires reboots, X server restarts and potential crashes.

I recommend something like this (or at least a test and a production system). My Blackbird serves as test so that I don't whack my daily driver.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on November 18, 2020, 06:47:28 am
Guys any news on Kernel 5.9.X on Fedora for Navi 10?  Will we ever get this damn bug fixed ?!
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on November 18, 2020, 09:46:11 am

The best place to ask about the Fedora kernel is on the Fedora developer or kernel mailing list (https://lists.fedoraproject.org/archives/).

It is also a good idea to ask about the page size, I feel workstation users will have better results using the 4K page size.  That is what I use for everything I do now so it is harder for me to assist people using the 64k page size.  Maybe they can offer different permutations of the kernel package so each user can choose, just like Debian.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on November 18, 2020, 02:58:23 pm
Pocock, do you think there will be full support for Navi 10 in the upcoming Debian 11?  Do you know anything about it?  Which kernel will it use?
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on November 18, 2020, 03:06:12 pm
Interesting question, I was looking at that myself the other day.  I notice people just started working on packaging the 5.10 kernel (https://salsa.debian.org/kernel-team/linux/-/commit/4ad7097da8516cc5e2894f298c7e39a744f0507c) so maybe that will be in Debian 11 by default.

llvm-11 and llvm-12 have both been packaged.

Mesa 20.2 is in unstable and 20.3-rc1 is in experimental (https://packages.qa.debian.org/m/mesa.html).  I was able to build 20.3-rc1 on buster so I suspect it has a good chance of coming through to stable for Debian 11.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on November 18, 2020, 04:05:25 pm
Ah good, so that is great news on the kernel front. Kernel 5.10 should by now have built-in support for these GPUs, I hope. I just hope it doesn't have bugs as well, otherwise it wouldn't make sense ...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: tle on December 17, 2020, 06:52:14 pm
Sadly 5.10 has not addressed the issue..
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on December 18, 2020, 03:21:47 am
Hi TLE, I also tweeted to Raptor and they told me that there is still no solution. All this is incredible: there is a bug that afflicts these GPUs and it is still not fixed after months and months ...  I don't understand what is so complicated, since everything used to work fine; how they could suddenly break everything is a mystery ... Navi 10 has been around for more than a year now and the price has also dropped ... Incredible ...
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on December 18, 2020, 03:35:21 pm
For the newer Big Navi cards, it looks like stock may not arrive in serious quantities until February or March.  If they work well then maybe it will be easier to just sell any 5700 XT cards and buy the newer cards.  16GB RAM will be really useful for those multi-monitor and multi-seat setups.

Looking at Debian, the Linux 5.9 kernel is now backported and all the related stuff like llvm and mesa builds as a backport too.  Fedora will have all that too.  That is at least 2 major operating systems that will be ready to run these cards in some form.
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on December 18, 2020, 04:17:42 pm
Hi Pocock, so do you think this damn bug that has plagued this kernel with Navi for months will finally be fixed? Another question for you Debian experts: do you think Debian 11 will have full support, so that with just the classic Xorg setting to enable Navi 10 it will work natively on Debian too?
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: pocock on December 19, 2020, 04:10:58 am
I don't know if it will be fixed.  If none of the developers responsible for that code are using POWER9 with Navi 10 then they may never be able to investigate the problem completely.

Debian 11 will include a newer kernel and newer components like mesa.  That means newer hardware is more likely to work but it is never a guarantee.  That said, you can now run the 5.9.6 kernel on Debian 10, buster, it is in backports:

https://packages.qa.debian.org/linux

Once again, I would really appreciate feedback from anybody who tries the Navi 10 cards with a kernel compiled for 4k page size.  That would reveal whether the problem only impacts the 64k page size.  It would provide another reason to ask distributions to change their default page size for POWER9 kernels from 64k to the more common 4k.  This was the issue for the NVIDIA Nouveau driver.
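
When reporting results, it helps to state which page size the running kernel actually uses. A minimal Python sketch for checking this follows; `os.sysconf("SC_PAGE_SIZE")` is a standard-library call available on POSIX systems, while the helper names are hypothetical.

```python
import os


def describe_page_size(size_bytes):
    """Return a short label such as '4K' or '64K' for a page size in bytes."""
    if size_bytes % 1024 == 0:
        return f"{size_bytes // 1024}K"
    return f"{size_bytes} bytes"


def current_page_size():
    """Page size of the running kernel in bytes (POSIX systems only)."""
    return os.sysconf("SC_PAGE_SIZE")


print("Kernel page size:", describe_page_size(current_page_size()))
```

The same value is printed by `getconf PAGESIZE` in a shell, so either can be quoted in a bug report alongside the kernel version and GPU model.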
Title: Re: [amdgpu] [Fiji] Fedora 32 Linux kernel 5.7.x crashes
Post by: MauryG5 on December 19, 2020, 07:39:51 am
I understand. Of course, this is not good news on the Fedora front. The main problem is that we have tried to talk to all the main parties, from the Fedora team to Raptor, but nobody can give us a solution ... Honestly I find this absolutely astonishing ... That said, does the kernel you mention need to be installed separately, or is it in the current 10.6 version of Debian?  I would like to try installing 10.6 and see how it behaves with Navi 10 ...