Topic: GPU DMA issue diagnosis and impact (missing DMA kernel support beyond 32 bits)

FlyingBlackbird

  • Jr. Member
  • **
  • Posts: 83
  • Karma: +0/-0
The Raptor wiki mentions

Quote
All AMD GPUs currently have DMA issues (limited to 32-bit, which can cause crashes) due to missing Linux kernel support for DMA windows between 33 and 63 bits in length.
The root cause is GPU vendors (and occasionally some non-GPU vendors) cutting costs and only including 40-bit capable (Intel-style) DMA controllers.
A compatibility mode is expected to be included in Linux 5.4 and above that will resolve this issue

https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility_List/PCIe_Devices#Graphics_Cards

What I would like to understand:
  • How can I diagnose this (am I affected)?
  • What is the impact of this issue (crashes under which conditions)?

madscientist159

  • Raptor Staff
  • *****
  • Posts: 41
  • Karma: +7/-0
This is no longer an issue with kernel 5.4 and higher.
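For anyone who wants to check their own machine against that statement, here is a minimal sketch that compares the running kernel release with 5.4. It only looks at the version string; it does not probe the GPU or the DMA configuration, and the helper names (`parse_kernel_release`, `has_dma_workaround`, `running_kernel_has_workaround`) are made up for this example.

```python
import platform
import re

# Kernel version named in this thread as the first one with the fix.
FIXED_IN = (5, 4)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.4.0-42-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_dma_workaround(release):
    """True if the release string is 5.4 or newer (version check only)."""
    return parse_kernel_release(release) >= FIXED_IN


def running_kernel_has_workaround():
    """Apply the same check to the kernel this script is running on."""
    return has_dma_workaround(platform.release())
```

On the machine in question, `running_kernel_has_workaround()` returning `False` would mean the kernel predates 5.4. Distribution kernels sometimes backport fixes, so an older version number is a hint rather than proof.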

FlyingBlackbird

Quote
This is no longer an issue with kernel 5.4 and higher.

Yes, I have read this but I would like to understand the background to make decent decisions on future hardware purchases ("compatibility mode" sounds like "performance impact")

MPC7500

  • Full Member
  • ***
  • Posts: 189
  • Karma: +11/-0
I'm not an expert on this topic (if any of my statements are wrong, please correct me ;)), but as far as I understand, the system crashes when the graphics card uses more than 4 GiB (the 32-bit limit), for example when you open a ton of Firefox windows. By the way, 40 bits is 1 TiB, not 128 GiB.

And even if a user were affected by this bug, they would not notice it.
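The sizes quoted above are easy to verify: an n-bit address covers 2^n bytes. A small sketch of that arithmetic follows (the function name `addressable_gib` is just for illustration); it also shows that 128 GiB would correspond to 37 bits, not 40.

```python
GIB = 2**30  # bytes in one GiB


def addressable_gib(bits):
    """Size in GiB of the address range covered by a `bits`-wide address."""
    return 2**bits // GIB


for bits in (32, 37, 40):
    print(f"{bits}-bit: {addressable_gib(bits)} GiB")
# 32-bit: 4 GiB
# 37-bit: 128 GiB
# 40-bit: 1024 GiB  (= 1 TiB)
```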

madscientist159

Quote
Yes, I have read this but I would like to understand the background to make decent decisions on future hardware purchases ("compatibility mode" sounds like "performance impact")

No, it's much the same as "32-bit compatibility mode" in the x86 software context. The GPUs are broken (not quite spec-compliant), so IBM had to introduce additional kernel handling to work around that. Hence "compatibility with old broken GPU hardware", or "compatibility mode" ;)

If they were not broken, the extra "compatibility" code would not be needed. You can of course run them in 32-bit mode, but when you exhaust the 32-bit space you'll get GPU memory allocation errors, and that may well crash X or your applications (or the machine, since GPU drivers generally don't handle errors well).