Topic: Asymmetric CPUs with the Same Core Count

amock

Asymmetric CPUs with the Same Core Count
« on: September 18, 2022, 09:35:19 pm »
I recently came across https://github.com/rigtorp/c2clat, ran it on my dual 18-core machine, and got an interesting result.  I've attached an image of the output.  The strange part is the bottom-left quadrant next to the origin, where the pattern doesn't match the top-right quadrant.  It reminded me that when I had looked at my CPU caches earlier, one CPU seemed to have more cache than the other.

Code: [Select]
Package L#0
      L3 L#0 (10MB) + L2 L#0 (512KB)
        L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#1)
          PU L#2 (P#2)
          PU L#3 (P#3)
        L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
          PU L#4 (P#4)
          PU L#5 (P#5)
          PU L#6 (P#6)
          PU L#7 (P#7)
      L3 L#1 (10MB) + L2 L#1 (512KB)
        L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
          PU L#8 (P#8)
          PU L#9 (P#9)
          PU L#10 (P#10)
          PU L#11 (P#11)
        L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
          PU L#12 (P#12)
          PU L#13 (P#13)
          PU L#14 (P#14)
          PU L#15 (P#15)
      L3 L#2 (10MB) + L2 L#2 (512KB)
        L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
          PU L#16 (P#16)
          PU L#17 (P#17)
          PU L#18 (P#18)
          PU L#19 (P#19)
        L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
          PU L#20 (P#20)
          PU L#21 (P#21)
          PU L#22 (P#22)
          PU L#23 (P#23)
      L3 L#3 (10MB) + L2 L#3 (512KB)
        L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
          PU L#24 (P#24)
          PU L#25 (P#25)
          PU L#26 (P#26)
          PU L#27 (P#27)
        L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
          PU L#28 (P#28)
          PU L#29 (P#29)
          PU L#30 (P#30)
          PU L#31 (P#31)
      L3 L#4 (10MB) + L2 L#4 (512KB)
        L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
          PU L#32 (P#32)
          PU L#33 (P#33)
          PU L#34 (P#34)
          PU L#35 (P#35)
        L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
          PU L#36 (P#36)
          PU L#37 (P#37)
          PU L#38 (P#38)
          PU L#39 (P#39)
      L3 L#5 (10MB) + L2 L#5 (512KB)
        L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
          PU L#40 (P#40)
          PU L#41 (P#41)
          PU L#42 (P#42)
          PU L#43 (P#43)
        L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
          PU L#44 (P#44)
          PU L#45 (P#45)
          PU L#46 (P#46)
          PU L#47 (P#47)
      L3 L#6 (10MB) + L2 L#6 (512KB)
        L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
          PU L#48 (P#48)
          PU L#49 (P#49)
          PU L#50 (P#50)
          PU L#51 (P#51)
        L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
          PU L#52 (P#52)
          PU L#53 (P#53)
          PU L#54 (P#54)
          PU L#55 (P#55)
      L3 L#7 (10MB) + L2 L#7 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#56 (P#56)
        PU L#57 (P#57)
        PU L#58 (P#58)
        PU L#59 (P#59)
      L3 L#8 (10MB) + L2 L#8 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#60 (P#60)
        PU L#61 (P#61)
        PU L#62 (P#62)
        PU L#63 (P#63)
      L3 L#9 (10MB) + L2 L#9 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#64 (P#64)
        PU L#65 (P#65)
        PU L#66 (P#66)
        PU L#67 (P#67)
      L3 L#10 (10MB) + L2 L#10 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#68 (P#68)
        PU L#69 (P#69)
        PU L#70 (P#70)
        PU L#71 (P#71)
compared to
Code: [Select]
Package L#1
      L3 L#11 (10MB) + L2 L#11 (512KB)
        L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
          PU L#72 (P#72)
          PU L#73 (P#73)
          PU L#74 (P#74)
          PU L#75 (P#75)
        L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
          PU L#76 (P#76)
          PU L#77 (P#77)
          PU L#78 (P#78)
          PU L#79 (P#79)
      L3 L#12 (10MB) + L2 L#12 (512KB)
        L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
          PU L#80 (P#80)
          PU L#81 (P#81)
          PU L#82 (P#82)
          PU L#83 (P#83)
        L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
          PU L#84 (P#84)
          PU L#85 (P#85)
          PU L#86 (P#86)
          PU L#87 (P#87)
      L3 L#13 (10MB) + L2 L#13 (512KB)
        L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
          PU L#88 (P#88)
          PU L#89 (P#89)
          PU L#90 (P#90)
          PU L#91 (P#91)
        L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
          PU L#92 (P#92)
          PU L#93 (P#93)
          PU L#94 (P#94)
          PU L#95 (P#95)
      L3 L#14 (10MB) + L2 L#14 (512KB)
        L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
          PU L#96 (P#96)
          PU L#97 (P#97)
          PU L#98 (P#98)
          PU L#99 (P#99)
        L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
          PU L#100 (P#100)
          PU L#101 (P#101)
          PU L#102 (P#102)
          PU L#103 (P#103)
      L3 L#15 (10MB) + L2 L#15 (512KB)
        L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
          PU L#104 (P#104)
          PU L#105 (P#105)
          PU L#106 (P#106)
          PU L#107 (P#107)
        L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
          PU L#108 (P#108)
          PU L#109 (P#109)
          PU L#110 (P#110)
          PU L#111 (P#111)
      L3 L#16 (10MB) + L2 L#16 (512KB)
        L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
          PU L#112 (P#112)
          PU L#113 (P#113)
          PU L#114 (P#114)
          PU L#115 (P#115)
        L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
          PU L#116 (P#116)
          PU L#117 (P#117)
          PU L#118 (P#118)
          PU L#119 (P#119)
      L3 L#17 (10MB) + L2 L#17 (512KB)
        L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
          PU L#120 (P#120)
          PU L#121 (P#121)
          PU L#122 (P#122)
          PU L#123 (P#123)
        L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
          PU L#124 (P#124)
          PU L#125 (P#125)
          PU L#126 (P#126)
          PU L#127 (P#127)
      L3 L#18 (10MB) + L2 L#18 (512KB)
        L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
          PU L#128 (P#128)
          PU L#129 (P#129)
          PU L#130 (P#130)
          PU L#131 (P#131)
        L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
          PU L#132 (P#132)
          PU L#133 (P#133)
          PU L#134 (P#134)
          PU L#135 (P#135)
      L3 L#19 (10MB) + L2 L#19 (512KB)
        L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
          PU L#136 (P#136)
          PU L#137 (P#137)
          PU L#138 (P#138)
          PU L#139 (P#139)
        L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
          PU L#140 (P#140)
          PU L#141 (P#141)
          PU L#142 (P#142)
          PU L#143 (P#143)

Does anyone else have asymmetric CPUs with the same core count?  For people with 18-core CPUs, what cache configuration do you have?

ClassicHasClass

Re: Asymmetric CPUs with the Same Core Count
« Reply #1 on: September 20, 2022, 11:01:25 am »
I don't have a system that big, but I don't remember seeing anything like that with my dual-8. Are you sure Hostboot didn't guard anything out? (At a BMC root prompt, try `pflash -P GUARD -c` to ensure there are no guard entries in the PNOR.)

amock

Re: Asymmetric CPUs with the Same Core Count
« Reply #2 on: September 20, 2022, 11:44:05 am »
I'm pretty sure nothing was guarded out.  I just cleared the guards now, I've cleared them before, and it has always been like this.  The 8-core and smaller CPUs all have 10MB of L3 cache per core, whereas the 18-core and larger have 10MB of L3 cache per core pair.  So it seems like one of my CPUs has some cores with unpaired cache (11 10MB caches in total, like the 22-core CPU would have) and the other just has the paired caches (9 10MB caches) as normal.
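
As a quick sanity check of those counts, here is a minimal sketch that tallies cores and L3 caches per package. The (number of L3 caches, cores per cache) pairs are read off the lstopo listings in the first post; the `packages` table and the `summarize` helper are illustrative names only, not output from any tool.

```python
# Each tuple is (number of L3 caches, cores sharing each L3),
# read by hand from the lstopo listings in the first post.
packages = {
    "Package L#0": [(7, 2), (4, 1)],  # L3 L#0-6 paired, L3 L#7-10 unpaired
    "Package L#1": [(9, 2)],          # L3 L#11-19, all paired
}
L3_MB = 10  # every L3 in the listing is reported as 10MB


def summarize(groups):
    """Return (cores, l3_caches, total_l3_mb) for one package."""
    l3_count = sum(n for n, _ in groups)
    cores = sum(n * per_cache for n, per_cache in groups)
    return cores, l3_count, l3_count * L3_MB


summary = {name: summarize(groups) for name, groups in packages.items()}
for name, (cores, l3s, mb) in summary.items():
    print(f"{name}: {cores} cores, {l3s} L3 caches, {mb} MB L3")

extra_mb = summary["Package L#0"][2] - summary["Package L#1"][2]
print(f"extra L3 on Package L#0: {extra_mb} MB")
```

Both packages come out at 18 cores, but with 11 versus 9 L3 caches, which is where the 20MB difference comes from.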

I think this also contributes to an issue I had when overclocking.  I used the overclock from the Raptor repos, and sometimes the machine would suddenly die under very heavy load.  My guess is that the CPU with the extra 20MB of cache drew more power than it should have and tripped something.

ejfluhr

Re: Asymmetric CPUs with the Same Core Count
« Reply #3 on: September 20, 2022, 08:15:01 pm »
Bonus cache!  Presumably one core in each of those pairs sharing an L2/L3 was bad in some way, and the die did not have enough fully working core pairs to make a matched set of 18, so some of the remaining unpaired cores were used to make up the difference.  Given the difficulty of yielding big dies on a modern technology node, reducing core count is common industry practice.  What's a little unusual here is that 2 cores share the cache, so when one gets knocked out, the other retains access to the full amount of cache.  If the cache had been reduced at all, that core might perform worse than the same workload running on only 1 of the 2 cores in a paired-cache config, so it gets to keep the full local cache without having to share any...yay!

Raptor identifies this config specifically as "unpaired" in their 4-core and 8-core CPU descriptions, e.g.:  https://raptorcs.com/content/CP9M31/intro.html
   4 cores per package
       3.2GHz base / 3.8GHz turbo (WoF)
       90W TDP
       All Core Turbo capable
       32KB L1 data cache + 32KB L1 instruction cache / core
       512KB unpaired L2 cache / core
       10MB unpaired L3 cache / core

Versus the larger core counts, e.g.:  https://raptorcs.com/content/CP9M36/intro.html
   18 cores per package
       2.8GHz base / 3.8GHz turbo (WoF)
       190W TDP
       32KB L1 data cache + 32KB L1 instruction cache / core
       512KB L2 cache / core pair
       10MB L3 cache / core pair


If you don't want the bonus cache, I am sure someone would trade CPUs with you.  ;-)

« Last Edit: September 20, 2022, 08:25:33 pm by ejfluhr »

ejfluhr

Re: Asymmetric CPUs with the Same Core Count
« Reply #4 on: September 20, 2022, 08:30:16 pm »
How do you read the purple and orange plot??

amock

Re: Asymmetric CPUs with the Same Core Count
« Reply #5 on: September 20, 2022, 09:19:14 pm »
> If you don't want the bonus cache, I am sure someone would trade CPUs with you.  ;-)

I'm very happy with my Talos II :D

> How do you read the purple and orange plot??

Darkest purple is the fastest, going up through light purple, then red, then orange, then yellow as the slowest.  There's a legend on the right side of the image, but you might have to scroll to see it.

ejfluhr

Re: Asymmetric CPUs with the Same Core Count
« Reply #6 on: September 29, 2022, 05:18:26 pm »
Does that indicate the caching benefit?  E.g., does the darkest purple run from the local L2, hence the lowest latency?  Then the light-but-still-solid purple would run from the shared 10MB L3, so it has the next-lowest latency.  The red/purple mix would be going out to DIMM memory attached to the same die, so it has greater latency still, and finally the orange would be DIMM memory on the other die, which hops across the socket-to-socket link and has the largest latency.

I don't get why some dark-purple boxes are bigger than others, though.  Also, why do only some of them have the solid light purple around them?  It doesn't all line up.

amock

Re: Asymmetric CPUs with the Same Core Count
« Reply #7 on: September 29, 2022, 10:29:50 pm »
The darkest purple is a core pair, which shares L2 and L3 cache.  I'm guessing that the lighter purple is neighboring cores, but I don't know enough about the CPU to say for sure.  The graph measures how long it takes to send a message between cores, so I don't think it would ever go out to RAM within a single CPU.  That means the other purplish area is just cores in the same CPU that are far apart, and the orange is cores on the other CPU.  There's a die picture at https://web.archive.org/web/20190325062541/https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf that I'm using to inform my speculation.

The small dark purple boxes just down and right from center are cores that have their own L2 and L3 cache instead of being paired, so each box is just the 4 threads of a single core instead of the 8 threads of a core pair.  If you ran this on a 4-core or 8-core machine, it should show all small boxes.  Also, some of the cores might not have a neighboring core, since only 18 of the potentially 24 cores are enabled.  There's also some error in the measurements, since it doesn't use a cycle-counting mechanism like on x86.
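
To make the box sizes concrete, here is a small sketch that derives which PU numbers share an L3, assuming 4 threads per core and the cache layout shown in the lstopo listings in the first post. The helper name `l3_groups` is made up for illustration.

```python
THREADS_PER_CORE = 4  # SMT4: the lstopo listing shows 4 PUs under every core


def l3_groups(first_pu, cores_per_l3):
    """Return (first_pu, last_pu) for each L3, given cores per L3 in listing order."""
    groups = []
    pu = first_pu
    for cores in cores_per_l3:
        size = cores * THREADS_PER_CORE
        groups.append((pu, pu + size - 1))
        pu += size
    return groups


# Layout read from the listings: package 0 has 7 paired + 4 unpaired L3s,
# package 1 has 9 paired L3s and starts at PU 72.
pkg0 = l3_groups(0, [2] * 7 + [1] * 4)
pkg1 = l3_groups(72, [2] * 9)

# The 4-thread groups are the ones expected to show up as small boxes.
small = [g for g in pkg0 + pkg1 if g[1] - g[0] + 1 == THREADS_PER_CORE]
print(small)  # [(56, 59), (60, 63), (64, 67), (68, 71)]
```

So on this machine the small boxes should cover PUs 56 through 71, all on the first package, and everything else should form 8-thread boxes.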

ejfluhr

Re: Asymmetric CPUs with the Same Core Count
« Reply #8 on: September 30, 2022, 06:11:02 pm »
> The graph is a measurement of how long it takes to send a message between cores,

Aha, that was a major piece of info I did not grasp.   Your other explanations all coincide nicely now with the picture.  Pretty interesting measurement!

Do you, or anyone, know if anyone has made a similar type of measurement of memory latency, including caching effects, from a single core to all memory in the system?

amock

Re: Asymmetric CPUs with the Same Core Count
« Reply #9 on: September 30, 2022, 10:36:29 pm »
> Do you, or anyone, know if anyone has made similar type of measurement on memory latency, including caching effects, from a single core to all memory in the system?

I don't know of any, but I see it done regularly for x86 systems, so there might be something out there that's easily adaptable to other systems.  It seems like many benchmarking tools on x86 use `RDTSC`, and I haven't found an exact equivalent for the Power ISA, but there seems to be a 500MHz counter, at least on POWER9, that I might try if I can find some simple code.  It's something I've thought about but haven't gotten around to yet.

ClassicHasClass

Re: Asymmetric CPUs with the Same Core Count
« Reply #10 on: October 01, 2022, 01:12:02 pm »
Do you mean the timebase register? It's not equivalent to the time-stamp counter on x86(_64), but it should be usable for similar purposes.

https://www.gnu.org/software/libc/manual/html_node/PowerPC.html

amock

Re: Asymmetric CPUs with the Same Core Count
« Reply #11 on: October 01, 2022, 09:28:06 pm »
> Do you mean the timebase register? Not equivalent with the time-stamp counter on x86(_64) but should be usable for similar purposes.
>
> https://www.gnu.org/software/libc/manual/html_node/PowerPC.html

Yes, thanks.  I couldn't remember the name, but it looks easy to use and good enough.

ejfluhr

Re: Asymmetric CPUs with the Same Core Count
« Reply #12 on: October 07, 2022, 01:05:39 pm »
A 500MHz counter gives 1/500MHz = 2ns of latency resolution?  Sounds cool!  What is typical same-die memory latency, ~70ns??  Local L3 would have to be on the order of ~10ns??

There could be sensitivity to core frequency, presuming the L3 cache and some of the internal transfer network are pipelined.  E.g., running at 3.8GHz turbo would give a better (lower) answer than 2.8GHz base.
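
The arithmetic above can be sketched as follows, assuming the timebase really ticks at exactly 500MHz. The 70ns and 10ns figures are the guesses from this post, not measurements, and the 38-cycle path length is a made-up number used purely to illustrate the frequency sensitivity.

```python
TIMEBASE_HZ = 500_000_000  # assumed: the 500MHz counter mentioned earlier in the thread


def tick_ns(hz=TIMEBASE_HZ):
    """Duration of one counter tick in nanoseconds."""
    return 1e9 / hz


def ticks_for(latency_ns, hz=TIMEBASE_HZ):
    """Number of counter ticks that a given latency spans."""
    return latency_ns * hz / 1e9


def cycles_to_ns(cycles, core_mhz):
    """Wall-clock time taken by a fixed number of core cycles at a given clock."""
    return cycles * 1000 / core_mhz


print(tick_ns())      # 2.0  (ns per tick)
print(ticks_for(70))  # 35.0 (ticks for the guessed ~70ns memory latency)
print(ticks_for(10))  # 5.0  (ticks for the guessed ~10ns L3 latency)

HYPOTHETICAL_CYCLES = 38  # made-up pipelined path length, for illustration only
base_ns = cycles_to_ns(HYPOTHETICAL_CYCLES, 2800)   # 2.8GHz base
turbo_ns = cycles_to_ns(HYPOTHETICAL_CYCLES, 3800)  # 3.8GHz turbo
print(f"{base_ns:.1f} ns at base, {turbo_ns:.1f} ns at turbo")
```

If those guesses are in the right ballpark, a single L3 access would span only about 5 ticks, so timing a long run of accesses and dividing by the count would probably give a more usable number than timing one access at a time.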