Author Topic: GPU DMA issue diagnosis and impact (missing DMA kernel support beyond 32 bits)

FlyingBlackbird

  • Full Member
  • ***
  • Posts: 102
  • Karma: +3/-0
The Raptor wiki mentions

Quote
All AMD GPUs currently have DMA issues (limited to 32-bit, which can cause crashes) due to missing Linux kernel support for DMA windows between 33 and 63 bits in length.
The root cause is GPU vendors (and occasionally some non-GPU vendors) cutting costs and only including 40-bit capable (Intel-style) DMA controllers.
A compatibility mode is expected to be included in Linux 5.4 and above that will resolve this issue

https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility_List/PCIe_Devices#Graphics_Cards

What I would like to understand:
  • How can I diagnose this (am I affected)?
  • What is the impact of this issue (crashes under which conditions)?

madscientist159

  • Raptor Staff
  • *****
  • Posts: 47
  • Karma: +11/-0
This is no longer an issue with kernel 5.4 and higher.
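As a rough first check for "am I affected?", the running kernel version can be compared against the 5.4 threshold named above. This is only a sketch: the function names are made up for illustration, and a version comparison cannot tell you whether a distribution backported the change to an older kernel or whether the compatibility mode is actually active for a given device.

```python
import re

# Threshold taken from the reply above ("kernel 5.4 and higher").
FIXED_IN = (5, 4)

def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.10.0-8-powerpc64le'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def has_dma_workaround(release):
    """True if the release is at or above the version named in this thread.

    Hypothetical helper: it only compares version numbers and does not
    inspect the kernel or any PCIe device.
    """
    return parse_kernel_release(release) >= FIXED_IN
```

On a live system the release string can be taken from `platform.release()` (the same value `uname -r` prints) and passed to `has_dma_workaround`.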

FlyingBlackbird

  • Full Member
  • ***
  • Posts: 102
  • Karma: +3/-0
Quote
This is no longer an issue with kernel 5.4 and higher.

Yes, I have read this, but I would like to understand the background to make decent decisions on future hardware purchases ("compatibility mode" sounds like "performance impact").

MPC7500

  • Hero Member
  • *****
  • Posts: 572
  • Karma: +40/-1
I'm not an expert on this topic (if one of my statements is wrong, please correct it ;)), but as far as I understand, the system crashes when the graphics card uses more than 4 GiB (the 32-bit limit), for example when you open a ton of Firefox windows. By the way, 40 bits is 1 TiB, not 128 GiB.

And even if a user were affected by this bug, they would not notice it.
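The sizes quoted above follow directly from powers of two, and a few lines of Python make the arithmetic easy to check. The helper names are invented for this illustration.

```python
def addressable_bytes(bits):
    """Number of distinct byte addresses reachable with an address of `bits` bits."""
    return 1 << bits

def format_binary_size(n):
    """Format a byte count using binary units (exact for powers of two)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"

for bits in (32, 37, 40):
    print(f"{bits}-bit: {format_binary_size(addressable_bytes(bits))}")
# 32-bit: 4 GiB
# 37-bit: 128 GiB
# 40-bit: 1 TiB
```

So 128 GiB would correspond to a 37-bit address, while a 40-bit one covers 1 TiB.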

madscientist159

  • Raptor Staff
  • *****
  • Posts: 47
  • Karma: +11/-0
Quote
Yes, I have read this, but I would like to understand the background to make decent decisions on future hardware purchases ("compatibility mode" sounds like "performance impact").

No, it's much the same as "32-bit compatibility mode" in the x86 software context.  The GPUs are broken, not quite spec compliant, so IBM had to introduce additional kernel handling to work around that.  Therefore, "compatibility with old broken GPU hardware", or "compatibility mode"  ;)

If they were not broken, the extra "compatibility" code would not be needed.  You can of course run them in 32-bit mode, but when you exhaust the 32-bit space you'll get GPU memory allocation errors and that very well might crash X or your applications (or the machine, since GPU drivers don't generally handle errors well).
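To picture what "exhausting the 32-bit space" means, here is a toy model of a device that can only address a limited number of bits. It is purely illustrative: the function names are hypothetical, and it does not reflect how the kernel, the GPU drivers, or the POWER9 I/O hardware actually allocate or translate addresses.

```python
def dma_mask(bits):
    """Highest address a device with `bits` address bits can reach."""
    return (1 << bits) - 1

def buffer_reachable(bus_addr, length, bits):
    """True if every byte of [bus_addr, bus_addr + length) fits under the mask."""
    if length <= 0:
        return False
    last_byte = bus_addr + length - 1
    return last_byte <= dma_mask(bits)

# A 4 KiB buffer that ends exactly at the 4 GiB boundary is still reachable
# with 32 address bits...
print(buffer_reachable(0xFFFF_F000, 0x1000, 32))   # True
# ...but one that crosses the boundary is not,
print(buffer_reachable(0xFFFF_F000, 0x2000, 32))   # False
# while a device with 40 usable address bits could reach it.
print(buffer_reachable(0xFFFF_F000, 0x2000, 40))   # True
```

In this simplified picture, an allocation that cannot be placed below the limit has nowhere to go, which corresponds to the allocation errors described above.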

JeremyRand

  • Newbie
  • *
  • Posts: 24
  • Karma: +8/-0
A few questions about this:

  • Does the Linux 5.4 compatibility mode fix all affected PCIe devices, or only AMD GPUs?  If the latter, does it apply to the Radeon driver, the AMDGPU driver, or both?
  • If I'm on Linux 5.4 or later, how would I test whether the compatibility mode is in use for a given PCIe device?
  • Am I correct in understanding that the compatibility mode is Linux-specific, and thus would need to be re-implemented in other kernels to fix the broken hardware in those kernels?
  • Does the compatibility mode make memory allocation behavior for affected PCIe devices more predictable, even slightly?  (E.g. does it artificially restrict the PCIe device's memory space, as seen by the host, to a 40-bit window, rather than the much larger 64-bit window that would otherwise exist?)  If so, doesn't this decrease security (e.g. by making ASLR less effective)?  (I'm not asking if this security degradation is significant to average users, just whether it exists at all.)