Messages - cy384

Pages: [1]
1
Firmware / Re: Firmware 2.10 for Talos-II and Blackbird available
« on: February 28, 2024, 11:03:17 am »
When I built the firmware (prior to this release) I used an Ubuntu 18.04 VM.

2
Blackbird / Re: Blackbird Cooling
« on: August 05, 2023, 08:01:03 pm »
Quote:
sorry to necro this thread, but where did you get those heatsinks from @cy384? I plan on trying a 16 core cpu in my blackbird and those heatsinks look like they'd work pretty well with my fan setup

I used this double-sided thermal tape and these 6.5mm x 6.5mm heatsinks, but they're both just generic Chinese parts, probably easy to find under many different "brands" on eBay or Amazon or whatever.  I used a flathead screwdriver to spread the tines of the heatsinks a bit.

I changed around my fan setup so the heatsinks get more airflow, but the power area still gets really hot and I don't think the power/thermal management stuff is being run optimally.  Fortunately/unfortunately it works well enough that I don't want to spend more time messing with it!

3
I got this error in my Blackbird's BMC log when there was a brief (like one second) power outage (brownout maybe?) last week.

4
Blackbird / Re: Blackbird Cooling
« on: October 20, 2022, 03:24:26 pm »
It looks like there are 7 of the voltage regulator chips, no idea how they're used/wired up though.

I've been reading the fan control code and it seems kinda crude: very on/off in design, based on events rather than target temperatures and dynamic control.  Under max load my CPU fan oscillates very noticeably between high and medium speeds.  I assume this is not an upstream concern, since nobody cares if a server is slightly louder, but it's unpleasant for a workstation.  Raptor spliced in some PID control but I don't think the tuning is ideal.

AFAICT the regulator temperature is considered for fan speed but only after it hits 85C.  Ref https://git.raptorcs.com/git/blackbird-openbmc/tree/meta-rcs/meta-blackbird/recipes-phosphor/fans/phosphor-fan-control-events-config-native/events.yaml
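
To make the contrast concrete, here is a minimal Python sketch of the two styles of control. It is not the phosphor-fan-control code or Raptor's PID patch. The thresholds, gains, duty values and names (`threshold_fan`, `PIFan`) are all made up for illustration, and only the P and I terms of a PID loop are shown.

```python
def threshold_fan(temp_c, current_duty, low=60.0, high=70.0):
    """Event-style control: jump between two fixed speeds at thresholds.

    The thresholds and duty values are hypothetical, not taken from events.yaml.
    """
    if temp_c >= high:
        return 100  # percent duty
    if temp_c <= low:
        return 50
    return current_duty  # between the thresholds: keep the current speed


class PIFan:
    """Proportional-integral control toward a target temperature."""

    def __init__(self, target_c=65.0, kp=4.0, ki=0.5, min_duty=30.0, max_duty=100.0):
        self.target_c = target_c
        self.kp = kp
        self.ki = ki
        self.min_duty = min_duty
        self.max_duty = max_duty
        self.integral = 0.0

    def update(self, temp_c, dt=1.0):
        error = temp_c - self.target_c
        self.integral += error * dt
        duty = self.min_duty + self.kp * error + self.ki * self.integral
        clamped = min(self.max_duty, max(self.min_duty, duty))
        if clamped != duty:
            # Simple anti-windup: discard this step's integration while saturated.
            self.integral -= error * dt
        return clamped
```

With the threshold version a temperature hovering around one of the limits flips the fan between two speeds, which matches the audible oscillation described above. The PI version changes the duty cycle in proportion to how far the temperature is from the target.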

5
Blackbird / Re: Blackbird Cooling
« on: October 13, 2022, 07:01:38 pm »
Got these nice little copper heatsinks installed and did some quick temperature testing... not a huge difference, unfortunately!  Need to redirect more airflow over them.

Edit: also, the voltage regulators seem to be TDA21472, with a maximum recommended temperature of 125C and thermal shutdown at 140C.  I guess running them hot is only moderately concerning.

6
Blackbird / Re: Blackbird Cooling
« on: October 08, 2022, 06:09:00 pm »
Just to share some numbers, here's what "sensors" reports at idle:

Code:
nvme-pci-0100
Adapter: PCI adapter
Composite:    +39.9°C  (low  = -273.1°C, high = +76.8°C)
                       (crit = +79.8°C)
Sensor 1:     +39.9°C  (low  = -273.1°C, high = +65261.8°C)

ibmpowernv-isa-0000
Adapter: ISA adapter
Chip 0 Vdd Remote Sense: 683.00 mV (lowest =  +0.67 V, highest =  +1.01 V)
Chip 0 Vdn Remote Sense: 674.00 mV (lowest =  +0.67 V, highest =  +0.67 V)
Chip 0 Vdd:              685.00 mV (lowest =  +0.68 V, highest =  +1.02 V)
Chip 0 Vdn:              675.00 mV (lowest =  +0.68 V, highest =  +0.68 V)
Chip 0 Core 0:            +44.0°C  (lowest = +22.0°C, highest = +65.0°C)
Chip 0 Core 4:            +44.0°C  (lowest = +22.0°C, highest = +65.0°C)
Chip 0 Core 8:            +44.0°C  (lowest = +23.0°C, highest = +65.0°C)
Chip 0 Core 12:           +44.0°C  (lowest = +23.0°C, highest = +65.0°C)
Chip 0 Core 16:           +45.0°C  (lowest = +21.0°C, highest = +65.0°C)
Chip 0 Core 20:           +45.0°C  (lowest = +23.0°C, highest = +67.0°C)
Chip 0 Core 24:           +45.0°C  (lowest = +23.0°C, highest = +66.0°C)
Chip 0 Core 28:           +44.0°C  (lowest = +24.0°C, highest = +68.0°C)
Chip 0 Core 32:           +44.0°C  (lowest = +22.0°C, highest = +67.0°C)
Chip 0 Core 36:           +44.0°C  (lowest = +22.0°C, highest = +67.0°C)
Chip 0 Core 40:           +44.0°C  (lowest = +22.0°C, highest = +66.0°C)
Chip 0 Core 44:           +44.0°C  (lowest = +22.0°C, highest = +67.0°C)
Chip 0 Core 48:           +45.0°C  (lowest = +23.0°C, highest = +66.0°C)
Chip 0 Core 52:           +45.0°C  (lowest = +23.0°C, highest = +67.0°C)
Chip 0 Core 56:           +45.0°C  (lowest = +23.0°C, highest = +68.0°C)
Chip 0 Core 60:           +45.0°C  (lowest = +24.0°C, highest = +68.0°C)
Chip 0 DIMM 0 :           +49.0°C  (lowest = +30.0°C, highest = +51.0°C)
Chip 0 DIMM 1 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 2 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 3 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 4 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 5 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 6 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 7 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 8 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 9 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 10 :           +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 11 :           +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 12 :          +50.0°C  (lowest = +30.0°C, highest = +53.0°C)
Chip 0 DIMM 13 :           +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 14 :           +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 DIMM 15 :           +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 Nest:              +45.0°C  (lowest = +23.0°C, highest = +63.0°C)
Chip 0 VRM VDD:           +53.0°C  (lowest = +35.0°C, highest = +90.0°C)
Chip 0 :                  32.00 W  (lowest =  28.00 W, highest = 156.00 W)
Chip 0 Vdd:                4.00 W  (lowest =   2.00 W, highest = 127.00 W)
Chip 0 Vdn:                9.00 W  (lowest =   7.00 W, highest =  11.00 W)
Chip 0 :                 326.27 kJ
Chip 0 Vdd:               47.31 kJ
Chip 0 Vdn:               89.58 kJ
Chip 0 Vdd:                6.38 A  (lowest =  +4.00 A, highest = +129.75 A)
Chip 0 Vdn:               14.38 A  (lowest = +11.50 A, highest = +17.38 A)

and I've attached an image of what the openbmc web interface reports.

Now that I look at it again, the "Temperature Pcie" might just be the NVMe drive.
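
For comparing runs, output like the above can be reduced to a few numbers with a short script. The following Python sketch parses the plain-text format shown in this post. The regular expression and the function names are illustrative assumptions based on these lines only, not part of lm-sensors, and readings of 0.0°C are skipped on the guess that they are unpopulated DIMM slots.

```python
import re

# Matches "label: value unit" at the start of a line, e.g.
# "Chip 0 VRM VDD:           +53.0°C  (lowest = ...)".
LINE_RE = re.compile(
    r"^(?P<label>.+?)\s*:\s+\+?(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>°C|mV|V|W|kJ|A)\b"
)

# A few lines copied from the output above.
SAMPLE = """\
Composite:    +39.9°C  (low  = -273.1°C, high = +76.8°C)
Chip 0 Core 0:            +44.0°C  (lowest = +22.0°C, highest = +65.0°C)
Chip 0 DIMM 1 :            +0.0°C  (lowest =  +0.0°C, highest =  +0.0°C)
Chip 0 VRM VDD:           +53.0°C  (lowest = +35.0°C, highest = +90.0°C)
Chip 0 :                  32.00 W  (lowest =  28.00 W, highest = 156.00 W)
"""


def parse_readings(text):
    """Return a list of (label, value, unit) for every line that has a reading."""
    readings = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            readings.append((m.group("label").strip(), float(m.group("value")), m.group("unit")))
    return readings


def temperatures(text):
    """Map label -> °C, skipping 0.0 readings (assumed to be empty DIMM slots)."""
    return {
        label: value
        for label, value, unit in parse_readings(text)
        if unit == "°C" and value != 0.0
    }


if __name__ == "__main__":
    temps = temperatures(SAMPLE)
    hottest = max(temps, key=temps.get)
    print(hottest, temps[hottest])
```

In practice the text would come from running `sensors` rather than from a pasted sample.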

7
Blackbird / Blackbird Cooling
« on: October 08, 2022, 01:35:46 pm »
I've been struggling a bit to figure out how to best cool my Blackbird.  Possibly this is made worse by my choice of case (smallest mATX I could find, a SAMA IM01) and the CPU (160W 16 core).  I am using the 3U heatsink module with the provided fan.  Main complaints:

* The CPU fan is oriented up/down and blows downward, about an inch away from whatever's in the first PCIe slot, so I flipped the fan around to blow upwards, which sends hot air over the voltage regulators.
* No heatsink on the voltage regulators, while the Talos and Talos Lite both have them.  The BB doesn't have holes to mount a heatsink there, either.
* I'm not sure where various heat sensors are, physically, on the board.  Where's the ambient temperature sensor?  The PCIe sensor?  The CPU ambient sensor?
* RAM slots are as close as physically possible to the CPU (good for signal integrity, bad for airflow).
* Voltage regulator temps do not seem to be considered in setting fan speed.
* Changing the cooling parameters requires recompiling firmware.

I don't really care about any of these except that the voltage regulators hit 90C within a minute under heavy load.  I've ordered some tiny little heatsinks that can be stuck directly to the chips but I'm wondering if anyone else has a nice solution here.  They're really low on the board and in an awkward spot.

I do have a 3D printer and will design some ducts/shrouds if I can't get temps low enough otherwise.

Attaching some pics for the curious.

8
Firmware / Re: Messing with WOF Tables
« on: October 07, 2022, 02:53:44 pm »
Quote:
Can you read out the processor frequencies and tell if it is boosting properly?  Do you see ~160W when at idle?  What about when running max threads off some heavy workload (e.g. Mersenne primes is a good one)?

I haven't looked closely at the frequencies, it idles around 30W and only approaches 160W under heavy load.

9
Firmware / Messing with WOF Tables
« on: October 06, 2022, 08:44:59 pm »
tl;dr I bought an unsupported CPU, which was mostly ok, and I tweaked some firmware to make it work properly

This will be a bit of a narrative, documenting it in case anyone else is ever in the same situation:

I saw an astonishingly cheap used POWER9 CPU on eBay and knew it was finally time to buy a Raptor Blackbird.  Specifically, I now have a 02CY231, which is a 16 core, 160W part (not one of the chips that Raptor sells).  I figured since the Blackbird is rated for 160W it should be fine, and it does work out of the box, except it would only hit 90W, which I assume leaves a lot of performance on the table (for the record, I believe my BB shipped with 2.00 firmware).  I spotted a section in the boot log like this:

Code:
  4.94593|================================================
  4.96605|Error reported by fapi2 (0x3300) EID 0x90000566
  4.98696|  No WOF table match found
  4.98697|  ModuleId   0x10 fapi2::MOD_FAPI2_PLAT_PARSE_WOF_TABLES
  4.98698|  ReasonCode 0x332d fapi2::RC_WOF_TABLE_NOT_FOUND
  4.98699|  UserData1  Number of cores : 0x00100002000000a0
  4.98700|  UserData2  WOF Power Mode (1=Nominal, 2=Turbo) : 0x000009c400000012
  4.98700|------------------------------------------------
  4.98701|  Callout type             : Procedure Callout
  4.98702|  Procedure                : EPUB_PRC_HB_CODE
  4.98702|  Priority                 : SRCI_PRIORITY_HIGH
  4.98703|------------------------------------------------
  4.98704|  Callout type             : Hardware Callout
  4.98705|  Target                   : Physical:/Sys0/Node0/Proc0
  4.98706|  Deconfig State           : NO_DECONFIG
  4.98706|  GARD Error Type          : GARD_NULL
  4.98707|  Priority                 : SRCI_PRIORITY_MED
  4.98707|------------------------------------------------
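
The two UserData words look like packed fields describing the table the firmware was searching for. The Python sketch below unpacks them under an assumed layout (16-bit core count, 16-bit power mode, 32-bit socket power, then a 32-bit frequency). That layout is a guess from the numbers in this log, not something confirmed from the hostboot source, and the low half of UserData2 is left uninterpreted.

```python
def split_u64(word):
    """Split a 64-bit value into (high 32 bits, low 32 bits)."""
    return (word >> 32) & 0xFFFFFFFF, word & 0xFFFFFFFF


def decode_wof_userdata(userdata1, userdata2):
    """Unpack the two words. The field layout here is an ASSUMPTION."""
    hi1, lo1 = split_u64(userdata1)
    hi2, lo2 = split_u64(userdata2)
    return {
        "cores": hi1 >> 16,          # assumed: top 16 bits of UserData1
        "power_mode": hi1 & 0xFFFF,  # assumed: next 16 bits (1=Nominal, 2=Turbo)
        "socket_power_w": lo1,       # assumed: low 32 bits of UserData1
        "frequency_mhz": hi2,        # assumed: high 32 bits of UserData2
        "userdata2_low": lo2,        # meaning not guessed here
    }


if __name__ == "__main__":
    # Values copied from the log above.
    print(decode_wof_userdata(0x00100002000000A0, 0x000009C400000012))
```

Read this way, the values come out as 16 cores, mode 2, 160 and 2500, which lines up with a 16 core 160W part.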

Ok, seems suspicious, but what's a WOF table? Apparently, it's a CSV file, containing specifications of frequencies and voltages to manage the CPU, which gets compiled into the PNOR image (they're named something like "WOF_V7_4_2_SFORZA_16_160_2500_TM.csv").  What's PNOR?  Early stage bootloader flash.  Fortunately(?) this is all open source and can in theory be modified to support my CPU, so I've been messing with this every evening this week.  Gotta love a long day at work messing with build systems followed by a long evening of messing with build systems.

The instructions on the wiki to build the firmware are basically solid, just replace "talos" with "blackbird" in the obvious places.  One gotcha is that you definitely want to compile on an older distro; I ran Ubuntu 18.04 in a VM to do this.  The other gotcha I ran into was this one ( https://forums.raptorcs.com/index.php/topic,241.0.html ) but as far as I can tell, you don't need to modify OpenBMC if you're just tweaking the WOF tables in the PNOR.

Anyway, I got the firmware building.  I dug around in the files it downloads and found a Raptor repository called "blackbird-xml" which contains the WOF tables; sure enough, it didn't contain any for a 16 core 160W chip.  I searched around and did find a repository on GitHub ( https://github.com/open-power/WOF-Tables ) with a bunch more, so I made a copy of "blackbird-xml" and added all the new WOF tables.  I changed the address of the repository in "machine-xml.mk" to point to mine, and added the commit hash for my changes to the "blackbird_defconfig" file.  I built and got a new error, like this:

Code:
ERROR: PnorUtils::checkSpaceConstraints: Image provided (/home/cy384/blackbird-op-build/output/host/powerpc64le-buildroot-linux-gnu/sysroot/openpower_pnor_scratch//wofdata.bin.ecc) has size (6285312) which is greater than allocated space (3145728) for section=WOFDATA.  Aborting! at /home/cy384/blackbird-op-build/output/host/powerpc64le-buildroot-linux-gnu/sysroot/hostboot_build_images/PnorUtils.pm line 462.

I assume there's either a hard limit or a configured limit on the size of the WOF table data in the PNOR, so I deleted all the WOF tables I didn't care about from my repository, updated the commit hash again, and it built successfully.
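
The numbers in the error show the scale of the problem: the allocated space is exactly 3 MiB, and the image was about twice that. Pruning by hand works, but the selection can also be sketched in Python. This assumes the CSV names end in `_<cores>_<watts>_<MHz>_<suffix>.csv`, inferred from the one example name above. The helper names and the second filename in the usage example are hypothetical.

```python
import re

ALLOCATED = 3145728   # bytes, from the error message
IMAGE_SIZE = 6285312  # bytes, from the error message

# Assumed naming pattern, inferred from "WOF_V7_4_2_SFORZA_16_160_2500_TM.csv".
WOF_NAME_RE = re.compile(r"_(?P<cores>\d+)_(?P<watts>\d+)_(?P<mhz>\d+)_[A-Za-z0-9]+\.csv$")


def wof_key(filename):
    """Return (cores, watts, mhz) parsed from a WOF table name, or None."""
    m = WOF_NAME_RE.search(filename)
    if m is None:
        return None
    return int(m.group("cores")), int(m.group("watts")), int(m.group("mhz"))


def select_tables(filenames, cores, watts):
    """Keep only the tables whose name matches the given core count and power."""
    selected = []
    for name in filenames:
        key = wof_key(name)
        if key is not None and key[0] == cores and key[1] == watts:
            selected.append(name)
    return selected


if __name__ == "__main__":
    print(ALLOCATED == 3 * 1024 * 1024)   # the limit is exactly 3 MiB
    print(IMAGE_SIZE - ALLOCATED)         # bytes over the limit
    names = [
        "WOF_V7_4_2_SFORZA_16_160_2500_TM.csv",
        "WOF_V7_4_2_SFORZA_4_90_2500_TM.csv",  # hypothetical name for illustration
    ]
    print(select_tables(names, cores=16, watts=160))
```

Whether the remaining tables fit still has to be checked by building, since the size in the error is for the ECC-wrapped binary rather than the CSV files.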

I followed the instructions on the wiki page to test out the new PNOR, and my BB booted without the WOF table error!  I did some load testing and sensors does report power usage near 160W, so I'm calling this a success.  The voltage regulators do get really spicy very quick, but that's a subject for another post.
