Raptor Computing Systems Community Forums (BETA)
Software => User Zone => Topic started by: FlyingBlackbird on February 01, 2020, 12:32:27 pm
-
The Raptor wiki mentions
All AMD GPUs currently have DMA issues (limited to 32-bit, which can cause crashes) due to missing Linux kernel support for DMA windows between 33 and 63 bits in length.
The root cause is GPU vendors (and occasionally some non-GPU vendors) cutting costs and only including 40-bit capable (Intel-style) DMA controllers.
A compatibility mode is expected to be included in Linux 5.4 and above that will resolve this issue.
https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility_List/PCIe_Devices#Graphics_Cards
What I would like to understand:
- How can I diagnose this (am I affected)?
- What is the impact of this issue (crashes under which conditions)?
-
This is no longer an issue with kernel 5.4 and higher.
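As a rough first check, the running kernel version can be compared against 5.4. The sketch below is only an illustration: the helper names `parse_kernel_version` and `has_compat_mode_kernel` are made up for this example, and a version check alone does not prove the workaround is active for a given device (distribution kernels may also carry backports or patches that this check cannot see).

```python
import platform
import re


def parse_kernel_version(release):
    """Return (major, minor) from a release string such as '5.4.0-42-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release string: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_compat_mode_kernel(release):
    """True if the release string is 5.4 or newer (hypothetical helper)."""
    return parse_kernel_version(release) >= (5, 4)


if __name__ == "__main__":
    release = platform.release()
    try:
        print(release, "->", has_compat_mode_kernel(release))
    except ValueError as err:
        # Non-Linux systems may report a release string in another format.
        print(err)
```

Tuple comparison keeps the check correct across major versions, so 6.1 counts as newer than 5.4 while 4.19 does not.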
-
This is no longer an issue with kernel 5.4 and higher.
Yes, I have read this, but I would like to understand the background to make decent decisions on future hardware purchases ("compatibility mode" sounds like "performance impact").
-
I'm not an expert on this topic (if one of my statements is wrong, please correct it ;)), but as far as I understand, the system crashes when the graphics card uses more than 4 GiB (the 32-bit limit), for example when you open a ton of Firefox windows. By the way, 40 bits is 1 TiB, not 128 GiB.
And even if a user were affected by this bug, they would not notice it.
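The sizes above follow directly from powers of two. Here is a minimal sketch of the arithmetic (the function name `addressable_bytes` is just illustrative); it also shows that 128 GiB would correspond to 37 address bits, not 40.

```python
GIB = 2 ** 30  # bytes in one GiB
TIB = 2 ** 40  # bytes in one TiB


def addressable_bytes(bits):
    """Number of bytes reachable with an address that is `bits` wide."""
    return 2 ** bits


if __name__ == "__main__":
    for bits in (32, 37, 40):
        size = addressable_bytes(bits)
        print(f"{bits}-bit window: {size // GIB} GiB")
```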
-
Yes, I have read this, but I would like to understand the background to make decent decisions on future hardware purchases ("compatibility mode" sounds like "performance impact").
No, it's much the same as "32-bit compatibility mode" in the x86 software context. The GPUs are broken, not quite spec compliant, so IBM had to introduce additional kernel handling to work around that. Therefore, "compatibility with old broken GPU hardware", or "compatibility mode" ;)
If they were not broken, the extra "compatibility" code would not be needed. You can of course run them in 32-bit mode, but when you exhaust the 32-bit space you'll get GPU memory allocation errors, and those might very well crash X or your applications (or the machine, since GPU drivers don't generally handle errors well).
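To make "exhausting the 32-bit space" concrete, here is a small sketch of how a buffer can be tested against an n-bit DMA address limit. It is a simplified model, not the kernel's code: `dma_bit_mask` mirrors the idea behind the Linux `DMA_BIT_MASK(n)` macro, and `fits_in_mask` plus the example addresses are assumptions made for illustration.

```python
def dma_bit_mask(bits):
    """Highest address reachable by a device with `bits` address bits."""
    if not 1 <= bits <= 64:
        raise ValueError("bits must be between 1 and 64")
    return (1 << bits) - 1


def fits_in_mask(bus_addr, size, bits):
    """True if the buffer [bus_addr, bus_addr + size) lies within the mask."""
    if size < 1:
        raise ValueError("size must be at least 1 byte")
    last_byte = bus_addr + size - 1
    return last_byte <= dma_bit_mask(bits)


if __name__ == "__main__":
    page = 0x1000
    # The last page below 4 GiB is still reachable in 32-bit mode.
    print(fits_in_mask(0xFFFF_F000, page, 32))
    # The first page above 4 GiB is not, but a 40-bit device can reach it.
    print(fits_in_mask(0x1_0000_0000, page, 32))
    print(fits_in_mask(0x1_0000_0000, page, 40))
```

In this model, any allocation that lands at or above the 4 GiB boundary is unusable for a device restricted to 32-bit DMA, which is where the allocation errors described above would come from.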
-
A few questions about this:
- Does the Linux 5.4 compatibility mode fix all affected PCIe devices, or only AMD GPUs? If the latter, does it apply to the Radeon driver, the AMDGPU driver, or both?
- If I'm on Linux 5.4 or later, how would I test whether the compatibility mode is in use for a given PCIe device?
- Am I correct in understanding that the compatibility mode is Linux-specific, and thus would need to be re-implemented in other kernels to fix the broken hardware in those kernels?
- Does the compatibility mode make memory allocation behavior for affected PCIe devices more predictable, even slightly? (E.g. does it artificially restrict the PCIe device's memory space, as seen by the host, to a 40-bit window, rather than the much larger 64-bit window that would otherwise exist?) If so, doesn't this decrease security (e.g. by making ASLR less effective)? (I'm not asking if this security degradation is significant to average users, just whether it exists at all.)