Resurrecting this thread for anyone interested/curious... I mentioned forever ago I'd post some findings comparing the Power9 with other CPUs for my CPU-based number crunching needs.
For my latest work, the single-threaded initial implementation took about half an hour to run on my Xeon desktop, and any experimentation meant really thinking it through before a run.
Once the calculations were broken down into chunks and run on all threads, the slowest Threadripper at work, a 3960X, took 20 seconds, and the fastest, an overclocked water-cooled 3990X, took 8 seconds! The 24 core Xeon took 1m13s.
In comparison my 144 core Power9 takes 19 seconds, but, since the algorithm is broken into 256 chunks, it processes the first 144 chunks followed by the remaining 112, with the processor showing 77% usage partway through (only 112 of 144 busy in the second pass), whereas the Threadrippers keep the CPU at 100%. Still, it's a good indication, and the machine compares well with the 3960X.
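For anyone who wants to check the chunk arithmetic, here is a rough sketch of it. It assumes one chunk per worker at a time, equal-cost chunks, and work handed out in simple waves, which may not be exactly how the real scheduler behaves; the function names are made up for illustration.

```python
def wave_sizes(n_chunks, n_workers):
    """Split n_chunks into successive waves of at most n_workers chunks.

    Assumption: each worker handles one chunk at a time and all chunks
    take roughly the same time, so work proceeds in whole waves.
    """
    full_waves, leftover = divmod(n_chunks, n_workers)
    return [n_workers] * full_waves + ([leftover] if leftover else [])


def wave_utilisation(n_chunks, n_workers):
    """Fraction of workers that are busy during each wave."""
    return [size / n_workers for size in wave_sizes(n_chunks, n_workers)]


# 256 chunks on 144 workers: one full wave, then a partial one.
print(wave_sizes(256, 144))                               # [144, 112]
print([f"{u:.1%}" for u in wave_utilisation(256, 144)])   # ['100.0%', '77.8%']
```

Under those assumptions the second wave keeps 112 of 144 workers busy, about 78%, which lines up with the roughly 77% usage seen partway through the run.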
All this said, the software was never optimised for Power (it has some SIMD for Intel), and I've found both Clang and GCC to be quite variable on Power. GCC 10 gave the best results, with Clang 15 the worst (27 seconds; I have a whole Clang/GCC rant for another day).