SMART repeatedly reporting temperature errors for NVMe SSD

jcsiblesei

« on: December 02, 2024, 10:07:21 am »
I just got a new Talos II TL2SV4 server a few days ago, with the 500GB Samsung internal NVMe storage option. Shortly after installing the OS, smartd started reporting "Device: /dev/nvme0, Critical Warning (0x02): Temperature" about once per day.
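
For anyone decoding the smartd message: the 0x02 value is a bitfield taken from the drive's SMART / Health Information log, and bit 1 is the temperature flag. A minimal Python sketch of that decoding follows. The table and function names are made up for illustration, and the descriptions paraphrase the NVMe base specification rather than quoting it.

```python
# Bits of the "Critical Warning" byte in the NVMe SMART / Health
# Information log. Descriptions are paraphrased, not quoted from the spec.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature above or below threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return a description for every bit that is set in the warning byte."""
    return [
        text
        for bit, text in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]


print(decode_critical_warning(0x02))  # ['temperature above or below threshold']
```

So the daily smartd report only says that the drive crossed one of its own temperature thresholds. It does not indicate a problem with spare capacity or media.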

Looking at "smartctl -a /dev/nvme0", with the system totally idle, I see temperatures like this:

Code:
Temperature Sensor 1:               45 Celsius
Temperature Sensor 2:               52 Celsius

After running "dd if=/dev/nvme0n1 of=/dev/null bs=1M" to generate load, within only a few seconds I see temperatures like this:

Code:
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               80 Celsius

These temperatures seem way too high. The BMC was also reporting a warning that "Temperature Pcie" was 49 C. I looked inside the case, and it looks like there's basically no way for any air to flow over the NVMe drive. It's all the way at the bottom against the edge of the case, with no ventilation around it.

Has anyone run into this before? What should I do about it?
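
The sensor readings quoted in this post can be pulled out of the smartctl text output programmatically, which helps when logging them over time. This is a rough sketch, assuming the sensor lines always look like the ones shown here. `parse_sensor_temps` is a hypothetical helper name, not part of smartmontools.

```python
import re

# Matches lines such as "Temperature Sensor 2:               80 Celsius".
SENSOR_LINE = re.compile(
    r"^Temperature Sensor (\d+):\s+(\d+) Celsius", re.MULTILINE
)


def parse_sensor_temps(smartctl_text):
    """Map sensor number to temperature in Celsius from `smartctl -a` text."""
    return {
        int(number): int(temp)
        for number, temp in SENSOR_LINE.findall(smartctl_text)
    }


# Sample taken from the under-load output quoted in this post.
sample = """\
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               80 Celsius
"""
print(parse_sensor_temps(sample))  # {1: 46, 2: 80}
```

On a live system the text could come from `subprocess.run(["smartctl", "-a", "/dev/nvme0"], capture_output=True, text=True).stdout`, which normally has to run as root.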

MPC7500

« Reply #1 on: December 02, 2024, 12:09:16 pm »
Which Samsung NVMe are you using?
In any case, I would install an NVMe that doesn't get that hot.

atomicdog

« Reply #2 on: December 02, 2024, 01:16:08 pm »
Is there a fan shroud you can adjust to change the airflow?
The 2U Supermicro chassis w/HDD that Raptor used has an adjustable fan shroud.

jcsiblesei

« Reply #3 on: December 02, 2024, 01:49:53 pm »
Quote from: MPC7500 on December 02, 2024, 12:09:16 pm
Which Samsung NVMe are you using?

It's a Samsung SSD 980 500GB.

Quote from: MPC7500 on December 02, 2024, 12:09:16 pm
In any case, I would install an NVMe that doesn't get that hot.

But this is the one that it shipped with.

Quote from: atomicdog on December 02, 2024, 01:16:08 pm
Is there a fan shroud you can adjust to change the airflow?
The 2U Supermicro chassis w/HDD that Raptor used has an adjustable fan shroud.

There is a fan shroud, but even if it weren't there at all, there wouldn't be a path for airflow. The airflow is blocked by the NVMe carrier card itself and the sides of the case, not by the shroud. (I have photos to show what I mean, but when I try to attach them, I get the message "The upload folder is full. Please try a smaller file and/or contact an administrator.")

MPC7500

« Reply #4 on: December 02, 2024, 03:52:30 pm »
You ordered the Talos II directly from RaptorCS?
In fact, the 980 should remain below 80°C.

jcsiblesei

« Reply #5 on: December 02, 2024, 03:54:08 pm »
Quote from: MPC7500 on December 02, 2024, 03:52:30 pm
You ordered the Talos II directly from RaptorCS?

Yes. I ordered it directly from RaptorCS, and I didn't make any hardware changes to it.

jcsiblesei

« Reply #6 on: December 13, 2024, 01:29:25 pm »
Updates:

I upgraded the SSD's firmware from 1B4QFXO7 to 3B4QFXO7. This didn't help at all.

I rearranged the PCIe cards in the host so that the NVMe carrier is in slot 1 (it was in slot 5 before), and slot 2 is empty (with a bracket with ventilation holes). Now there is a path for the fan to blow air right over the SSD. This did help: it now takes several minutes of constant reading before the temperature approaches 80C (it was less than 30 seconds before I made this change).

For comparison, I temporarily moved the SSD to another computer with better airflow, and no matter what I did, the temperature never went above 66C there. So even though it's overheating way less now, it's still concerning that it's overheating at all.
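
The before/after comparison above (under 30 seconds versus several minutes to approach 80C) can be measured more repeatably by polling the temperature while the read load runs. This is only a sketch: `seconds_until_threshold` is a hypothetical helper, and `read_temp` stands for any function that returns the current temperature in Celsius, for example one that parses smartctl output. The clock and sleep functions are parameters only so the loop can be tried without real hardware.

```python
import time


def seconds_until_threshold(read_temp, threshold_c=80, interval_s=5.0,
                            timeout_s=900.0, clock=time.monotonic,
                            sleep=time.sleep):
    """Poll read_temp() until it reaches threshold_c.

    Returns the elapsed seconds at the first reading at or above the
    threshold, or None if timeout_s passes without reaching it.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if read_temp() >= threshold_c:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)


# Dry run with made-up readings and a fake clock (no drive required).
fake_now = [0.0]
readings = iter([52, 60, 71, 80])
result = seconds_until_threshold(
    lambda: next(readings),
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
print(result)  # 15.0
```

Running the same measurement in each slot arrangement, with the dd read started just beforehand, would give comparable numbers for how much each airflow change helps.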

Borley

« Reply #7 on: December 13, 2024, 10:34:52 pm »
It is possible to find heat spreaders that adhere to the drive. I've gotten a few NVMe drives that included them, but never had to use them. Otherwise, this sounds like something a strategically placed 40/60/80mm fan could alleviate.