Raptor Computing Systems Community Forums (BETA)

Software => User Zone => Topic started by: FlyingBlackbird on February 01, 2020, 12:32:27 pm

Title: GPU DMA issue diagnoses and impact (missing DMA kernel support beyond 32 bits)
Post by: FlyingBlackbird on February 01, 2020, 12:32:27 pm
The Raptor wiki mentions:

Quote
All AMD GPUs currently have DMA issues (limited to 32-bit, which can cause crashes) due to missing Linux kernel support for DMA windows between 33 and 63 bits in length.
The root cause is GPU vendors (and occasionally some non-GPU vendors) cutting costs and only including 40-bit capable (Intel-style) DMA controllers.
A compatibility mode is expected to be included in Linux 5.4 and above that will resolve this issue

https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility_List/PCIe_Devices#Graphics_Cards

What I would like to understand:
Title: Re: GPU DMA issue diagnoses and impact (missing DMA kernel support beyond 32 bits)
Post by: madscientist159 on February 01, 2020, 05:04:24 pm
This is no longer an issue with kernel 5.4 and higher.
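
For anyone who wants to check which side of that 5.4 line their machine is on, here is a minimal Python sketch. The helper name `kernel_at_least` is made up for illustration. It only compares the upstream version number, so it cannot tell whether a distribution has backported the fix to an older kernel.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Return True if a release string such as '5.4.0-42-generic' is >= major.minor.

    Hypothetical helper for illustration; it looks only at the leading
    'major.minor' digits of the release string.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    release = platform.release()  # on Linux, the same string as `uname -r`
    verdict = "5.4 or newer" if kernel_at_least(release, 5, 4) else "older than 5.4"
    print(f"{release}: {verdict}")
```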
Title: Re: GPU DMA issue diagnoses and impact (missing DMA kernel support beyond 32 bits)
Post by: FlyingBlackbird on February 02, 2020, 04:05:06 am
Quote
This is no longer an issue with kernel 5.4 and higher.

Yes, I have read this, but I would like to understand the background so I can make decent decisions on future hardware purchases ("compatibility mode" sounds like "performance impact").
Title: Re: GPU DMA issue diagnoses and impact (missing DMA kernel support beyond 32 bits)
Post by: MPC7500 on February 02, 2020, 09:05:25 am
I'm not an expert on this topic (if one of my statements is wrong, please correct it ;)), but as far as I understand, the system crashes when the graphics card uses more than 4 GiB (32-bit), for example when you open a ton of Firefox windows. By the way, 40-bit is 1 TiB, not 128 GiB.

And even if a user were affected by this bug, he would not notice it.
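
The address-space sizes mentioned above are easy to verify with a little arithmetic. This is just a quick sketch; the function name `addressable_bytes` is made up for illustration. An n-bit address reaches 2^n bytes, so 32 bits gives 4 GiB and 40 bits gives 1 TiB (1024 GiB), while 128 GiB would correspond to only 37 bits.

```python
GIB = 1 << 30  # bytes in one GiB


def addressable_bytes(bits):
    """Number of bytes reachable with an address that is `bits` wide."""
    return 1 << bits


for bits in (32, 37, 40):
    print(f"{bits}-bit: {addressable_bytes(bits) // GIB} GiB")
# prints:
# 32-bit: 4 GiB
# 37-bit: 128 GiB
# 40-bit: 1024 GiB
```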
Title: Re: GPU DMA issue diagnoses and impact (missing DMA kernel support beyond 32 bits)
Post by: madscientist159 on February 02, 2020, 05:46:09 pm
Quote
Yes, I have read this but I would like to understand the background to make decent decisions on future hardware purchases ("compatibility mode" sounds like "performance impact")

No, it's much the same as "32-bit compatibility mode" in the x86 software context. The GPUs are broken, not quite spec compliant, so IBM had to introduce additional kernel handling to work around that. Therefore, "compatibility with old broken GPU hardware", or "compatibility mode" ;)

If they were not broken, the extra "compatibility" code would not be needed. You can of course run them in 32-bit mode, but when you exhaust the 32-bit space you'll get GPU memory allocation errors, and those might very well crash X or your applications (or the machine, since GPU drivers don't generally handle errors well).
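
To make the "exhaust the 32-bit space" point concrete, here is a deliberately simplified Python model of an address-width limit. It is not kernel code: `dma_bit_mask` and `reachable` are hypothetical names, loosely inspired by the idea behind the kernel's `DMA_BIT_MASK` macro, and the real kernel and IOMMU handling is considerably more involved. The sketch only shows that a buffer placed above the 4 GiB boundary is out of reach for a device limited to 32-bit addresses, while a 40-bit capable device can still address it.

```python
def dma_bit_mask(bits):
    """Highest address expressible with `bits` address bits (illustrative only)."""
    return (1 << bits) - 1


def reachable(addr, size, bits):
    """True if the whole buffer [addr, addr + size) fits below the address limit."""
    return addr + size - 1 <= dma_bit_mask(bits)


four_gib = 1 << 32
page = 4096

print(reachable(four_gib - page, page, 32))  # True: last page below 4 GiB
print(reachable(four_gib, page, 32))         # False: first page above 4 GiB
print(reachable(four_gib, page, 40))         # True: fine with 40-bit addressing
```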
Title: Re: GPU DMA issue diagnoses and impact (missing DMA kernel support beyond 32 bits)
Post by: JeremyRand on September 02, 2022, 08:49:47 pm
A few questions about this: