Messages - kth5

1
I had big-endian at first (under PowerVM), but on the re-install I ultimately decided to run little-endian using Opal, since that's also the primary build of ArchPOWER. I figured I have basically zero entitlements on the box anyway (original owner unknown, an eBay steal at just 250 EUR shipping included, so I'm not complaining).

I'll have a look at installing ppc64 on a spare drive once I either get to the datacenter or get remote management set up properly. ASMI unfortunately provides neither remote serial nor virtual disks. :(

In the end I want VMs for all three architectures though: ppc64le, ppc64, and 32-bit ppc, which works awesome on Blackbird.

2
General Hardware Discussion / KVM cross-endian guest on POWER8 / IBM S822L
« on: November 19, 2024, 02:08:49 am »
Hi everyone,

I recently acquired an IBM S822L dual-socket system and am now getting to the point of using KVM/libvirt to bring guests up on it. On my Blackbird I was used to running big-endian guests on a little-endian host without much of a performance penalty at all.

However, on the POWER8 it seems that no matter what I do, I get up to 50% steal time inside a VM when the bare-metal and the VM are otherwise completely idle. When I put any load on the VM it gets worse and jumps up to 80% or more making any vCPU config moot. This only happens when running a big-endian guest kernel and userland, which I need for my build-bots.
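For anyone wanting to reproduce the steal figures without top, here is a minimal sketch of how the percentage can be derived from two samples of the aggregate cpu line in /proc/stat inside the guest. The function names are made up for illustration; the field order (user, nice, system, idle, iowait, irq, softirq, steal) follows the proc man page.

```python
import time


def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def steal_percent(before, after):
    """Steal share, in percent, of all jiffies elapsed between two samples."""
    # Counter order: user nice system idle iowait irq softirq steal ...
    # Guest time is already accounted in user, so only the first eight are summed.
    delta = [new - old for old, new in zip(before[:8], after[:8])]
    total = sum(delta)
    return 100.0 * delta[7] / total if total else 0.0


def measure(interval=5.0, path="/proc/stat"):
    """Sample the counters twice, 'interval' seconds apart, and return steal %."""
    def sample():
        with open(path) as f:
            return parse_cpu_line(f.readline())

    first = sample()
    time.sleep(interval)
    return steal_percent(first, sample())
```

Calling measure() on an idle guest should land in the same ballpark as the st column in top, give or take sampling noise.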

A bit of background of what I'm dealing with here:
* dual POWER8E 10 core SMT8 / 256GB RAM running Opal on latest available firmware
* no entitlements beyond basic & micro LPAR and the usual AIX
* bare-metal runs ArchPOWER ppc64le with SMT switched off as required
* guests I'm testing with run ArchPOWER ppc64 or ppc64+32bit userland
* kernel version is 6.11.9 on bare-metal & guests
* QEMU 9.0.3/9.1.1 w/ libvirt 10.8.0
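Since the SMT setting on the host matters here, this is a small sketch for double-checking it from the CPU list in /sys/devices/system/cpu/online. It assumes the primary thread of each SMT8 core is the CPU id divisible by 8, which is how the numbering looks on my boxes; the helper names are hypothetical.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,8,16' or '0-3,8' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_is_off(online, threads_per_core=8):
    """True if only primary threads (ids divisible by threads_per_core) are online."""
    # Assumption: primary threads sit at multiples of threads_per_core.
    return all(cpu % threads_per_core == 0 for cpu in online)


def host_smt_is_off(path="/sys/devices/system/cpu/online"):
    """Read the online CPU list from sysfs and apply the check above."""
    with open(path) as f:
        return smt_is_off(parse_cpu_list(f.read()))
```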

I've tried the following all exhibiting the same steal time issue:
* single vCPU w/o threads
* single vCPU w/o threads pinned to a physical core & NUMA zone
* dual vCPU w/ 8 threads (SMT)
* dual vCPU w/ 8 threads pinned to a physical core & NUMA zone (SMT)
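To make the pinned variants concrete, here is a rough sketch that generates libvirt vcpupin elements for such a layout. It assumes every guest thread of a virtual core gets pinned to the one online (primary) host CPU of a physical core, which is how I understand pinning with host SMT off; the host CPU ids 8 and 16 in the usage note are placeholders, not my actual config.

```python
def pin_map(guest_cores, guest_threads, host_primary_cpus):
    """Return (vcpu, host_cpu) pairs, one virtual core per physical core."""
    if guest_cores > len(host_primary_cpus):
        raise ValueError("not enough host cores for one-to-one pinning")
    pins = []
    for core in range(guest_cores):
        for thread in range(guest_threads):
            # All threads of one virtual core share the same host CPU id
            # (assumption: only the primary thread is online on the host).
            vcpu = core * guest_threads + thread
            pins.append((vcpu, host_primary_cpus[core]))
    return pins


def render_vcpupin(pins):
    """Render the pairs as <vcpupin> lines for the <cputune> section."""
    return "\n".join(
        f"<vcpupin vcpu='{vcpu}' cpuset='{host}'/>" for vcpu, host in pins
    )
```

For the dual vCPU w/ 8 threads case, print(render_vcpupin(pin_map(2, 8, [8, 16]))) emits 16 lines to paste between the cputune tags of the domain XML.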

What I did not try is running actual PowerKVM. It seems too outdated for my liking and I'd rather have a recent package base as close to upstream as possible.

Running ppc64le in the guest does not show the problem at all and performs splendidly.

I'm a bit at a loss here and would rather not call it a lost cause just yet.  :(

3
Blackbird / Re: Did my Blackbird just die on me?
« on: July 19, 2022, 12:03:34 pm »
New PSU arrived (same model that was in there before) and the Blackbird revived itself. A few boots failed with PNOR checksum failures which probably stem from my attempts to update with flaky power... Anyway, here's to another 3 years of 24/7. :)

Thanks everyone and I hope this thread may help anyone else who runs into this.

4
Blackbird / Re: Did my Blackbird just die on me?
« on: July 15, 2022, 10:41:13 am »
I probably found the culprit, the PSU seems to be bad. Another otherwise working x86 machine also proves to be unstable with it even at lower loads than 250W (it's a 550W). I have another PSU coming, will report if that solved my issue.

5
Blackbird / Re: Did my Blackbird just die on me?
« on: July 15, 2022, 02:05:09 am »
It might also be worth putting behind a surge suppressor. Normally I wouldn't care so much but seeing as these parts cost what they do...

Well, too late now.  :P

6
Blackbird / Re: Did my Blackbird just die on me?
« on: July 15, 2022, 02:02:29 am »
Long story short: I had to reflash the BMC and OpenPOWER firmware.
https://wiki.raptorcs.com/wiki/Updating_Firmware

That was the only thing I had not tried. So once at my desk at work I did it remotely, only to find out that the BMC did not recover within 30 minutes after the reboot. I switched it off after approx. 35 minutes via the power strip (remotely accessible) and back on, to no avail.

Seems I may have bricked it fully now. :(

Once I get home it's time to hook up the serial again and see if there's any life still visible.

7
Blackbird / Did my Blackbird just die on me?
« on: July 14, 2022, 12:32:56 pm »
The other day I was logged in remotely and the box just went down. I could still reach the BMC and attempt to power it up, but to no avail. No Hostboot output on serial (via BMC) and no event logs on the BMC. Just plain nothing.

Once I got home I switched the box on manually via the switch. The fans started running at full tilt as usual, but after pretty much exactly 30s it switched off again without leaving a trace as to why in the event log on the BMC.

Then I removed all hardware but the CPU, one piece at a time with a boot attempt in between: same effect.

The only thing that looks obviously weird is a set of dmesg entries repeating every few seconds on the BMC:

Code:
[ 1367.988668] aspeed-g5-pinctrl 1e6e2000.syscon:pinctrl: request pin 26 (F20) for 1e780000.gpio:306
[ 1367.988711] Want SCU90[0x00000002]=0x1, got 0x0 from 0x063F0000
[ 1367.988731] Want SCU8C[0x00000200]=0x1, got 0x0 from 0x00000001
[ 1367.988746] Want SCU70[0x00200000]=0x1, got 0x0 from 0xF1105206
[ 1370.989477] aspeed-g5-pinctrl 1e6e2000.syscon:pinctrl: request pin 26 (F20) for 1e780000.gpio:306
[ 1370.989520] Want SCU90[0x00000002]=0x1, got 0x0 from 0x063F0000
[ 1370.989538] Want SCU8C[0x00000200]=0x1, got 0x0 from 0x00000001
[ 1370.989548] Want SCU70[0x00200000]=0x1, got 0x0 from 0xF1105206
[ 1373.990267] aspeed-g5-pinctrl 1e6e2000.syscon:pinctrl: request pin 26 (F20) for 1e780000.gpio:306
[ 1373.990311] Want SCU90[0x00000002]=0x1, got 0x0 from 0x063F0000
[ 1373.990330] Want SCU8C[0x00000200]=0x1, got 0x0 from 0x00000001
[ 1373.990342] Want SCU70[0x00200000]=0x1, got 0x0 from 0xF1105206

Do these mean anything or are we just talking verbosity?
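To at least sanity-check the numbers in those lines, here is a small sketch that parses them and recomputes the masked field from the raw register value. The reading of the format (want value, field read back, raw register contents) is my assumption from the shape of the message, and it says nothing about whether the mismatch is harmless.

```python
import re

# Matches e.g. "Want SCU90[0x00000002]=0x1, got 0x0 from 0x063F0000"
WANT_RE = re.compile(
    r"Want (?P<reg>SCU[0-9A-Fa-f]+)\[0x(?P<mask>[0-9A-Fa-f]+)\]"
    r"=0x(?P<want>[0-9A-Fa-f]+), got 0x(?P<got>[0-9A-Fa-f]+)"
    r" from 0x(?P<raw>[0-9A-Fa-f]+)"
)


def decode(line):
    """Parse one 'Want ...' line; return None if the line does not match."""
    m = WANT_RE.search(line)
    if m is None:
        return None
    mask = int(m.group("mask"), 16)
    raw = int(m.group("raw"), 16)
    want = int(m.group("want"), 16)
    if mask == 0:
        return None
    # Shift the masked bits down to bit 0 (assumed meaning of the 'got' value).
    shift = (mask & -mask).bit_length() - 1
    field = (raw & mask) >> shift
    return {
        "reg": m.group("reg"),
        "want": want,
        "got": int(m.group("got"), 16),
        "field": field,
        "satisfied": field == want,
    }
```

Run over the three distinct lines above, the recomputed field agrees with the logged got value each time, so the messages are at least self-consistent.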

I can upgrade PNOR etc from BMC without failure and read it back, so that's not it either.


Did my CPU just die, and if so, how the hell can I confirm this before I commit to another investment of hundreds of dollars? :(

8
I think it might be the controller as well. Do you have a PCI one you can try?

I do have one somewhere at the office. I don't want it to be a permanent thing since I quite like my accelerated dual-screen graphics with an NVMe drive as a build partition. :D
I'll see if I can get it over tomorrow to try.

The big question though: I have had my Blackbird since, I think, August 2019 and never tried connecting any kind of optical drive other than via USB. Could it be just my board, its production run, or a design flaw in the controller itself?
I'm fine without any kind of warranty process even if it were a possibility. It's just too bad I can't spare the 8x slot for another SATA controller without losing the NVMe.

9
Reviving this thread since I recently attempted to add a DVD-ROM drive I had lying around. So pardon me...

As I said, I attempted to add a SATA LG DVD-ROM (yes, ROM) and I could boot most times but it wouldn't mount any CD or DVD. When it did not boot, it just started to fail with those softreset FIS messages whenever I tried doing anything with it from the initramfs. Sometimes going so far as to disable other drives connected to one of the 4 SATA ports on my Blackbird. Curiously, the eject program did seem to do what it should.

So I thought to myself, this drive's probably trash after years in storage. So I just got a fresh LG DVD-RW drive and it does the same.

Worse, it almost always halts the boot process completely after kexec has launched the OS's kernel and the kernel tries to detect all drives. Once it reaches the port with the LG DVD-RW connected, it halts and will not expose any of the drives.

When the system DOES boot, which happens randomly, I see this in dmesg until it eventually downgrades to UDMA/33 and locks up SATA entirely.

Code:
[Wed Dec 16 21:44:55 2020] ahci 0002:01:00.0: AHCI 0001.0000 32 slots 4 ports 6 Gbps 0xf impl SATA mode
[Wed Dec 16 21:44:55 2020] ahci 0002:01:00.0: flags: 64bit ncq sntf led only pmp fbs pio slum part sxs
[Wed Dec 16 21:44:55 2020] scsi host0: ahci
[Wed Dec 16 21:44:55 2020] scsi host1: ahci
[Wed Dec 16 21:44:55 2020] scsi host2: ahci
[Wed Dec 16 21:44:55 2020] scsi host3: ahci
[Wed Dec 16 21:44:55 2020] ata1: SATA max UDMA/133 abar m2048@0x600c100010000 port 0x600c100010100 irq 30
[Wed Dec 16 21:44:55 2020] ata2: SATA max UDMA/133 abar m2048@0x600c100010000 port 0x600c100010180 irq 30
[Wed Dec 16 21:44:55 2020] ata3: SATA max UDMA/133 abar m2048@0x600c100010000 port 0x600c100010200 irq 30
[Wed Dec 16 21:44:55 2020] ata4: SATA max UDMA/133 abar m2048@0x600c100010000 port 0x600c100010280 irq 30
[Wed Dec 16 21:44:55 2020] nvme nvme0: 15/0/0 default/read/poll queues
[Wed Dec 16 21:44:55 2020]  nvme0n1: p1 p2
[Wed Dec 16 21:44:56 2020] random: fast init done
[Wed Dec 16 21:44:56 2020] ata1: SATA link down (SStatus 0 SControl 300)
[Wed Dec 16 21:44:56 2020] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[Wed Dec 16 21:44:56 2020] usb 1-1: New USB device found, idVendor=1a40, idProduct=0101, bcdDevice= 1.11
[Wed Dec 16 21:44:56 2020] usb 1-1: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[Wed Dec 16 21:44:56 2020] usb 1-1: Product: USB 2.0 Hub
[Wed Dec 16 21:44:56 2020] hub 1-1:1.0: USB hub found
[Wed Dec 16 21:44:56 2020] hub 1-1:1.0: 4 ports detected
[Wed Dec 16 21:44:56 2020] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Dec 16 21:44:56 2020] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Dec 16 21:44:56 2020] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[Wed Dec 16 21:44:56 2020] ata3.00: supports DRM functions and may not be fully accessible
[Wed Dec 16 21:44:56 2020] ata2.00: ATAPI: HL-DT-ST DVDRAM GH24NSD5, LV00, max UDMA/133
[Wed Dec 16 21:44:56 2020] ata3.00: ATA-11: Samsung SSD 860 EVO 500GB, RVT04B6Q, max UDMA/133
[Wed Dec 16 21:44:56 2020] ata3.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA
[Wed Dec 16 21:44:56 2020] ata2.00: configured for UDMA/133
[Wed Dec 16 21:44:56 2020] ata3.00: supports DRM functions and may not be fully accessible
[Wed Dec 16 21:44:56 2020] scsi 1:0:0:0: CD-ROM            HL-DT-ST DVDRAM GH24NSD5  LV00 PQ: 0 ANSI: 5
[Wed Dec 16 21:44:56 2020] ata3.00: configured for UDMA/133
[Wed Dec 16 21:44:56 2020] scsi 2:0:0:0: Direct-Access     ATA      Samsung SSD 860  4B6Q PQ: 0 ANSI: 5
[Wed Dec 16 21:44:56 2020] ata3.00: Enabling discard_zeroes_data
[Wed Dec 16 21:44:56 2020] sd 2:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/466 GiB)
[Wed Dec 16 21:44:56 2020] sd 2:0:0:0: [sda] Write Protect is off
[Wed Dec 16 21:44:56 2020] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
[Wed Dec 16 21:44:56 2020] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[Wed Dec 16 21:44:56 2020] ata4.00: ATA-10: ST4000DM004-2CV104, 0001, max UDMA/133
[Wed Dec 16 21:44:56 2020] ata4.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 32), AA
[Wed Dec 16 21:44:56 2020] sr 1:0:0:0: [sr0] scsi3-mmc drive: 10x/48x writer dvd-ram cd/rw xa/form2 cdda tray
[Wed Dec 16 21:44:56 2020] cdrom: Uniform CD-ROM driver Revision: 3.20
[Wed Dec 16 21:44:56 2020]  sda: sda1
[Wed Dec 16 21:44:56 2020] ata3.00: Enabling discard_zeroes_data
[Wed Dec 16 21:44:56 2020] sd 2:0:0:0: [sda] supports TCG Opal

<snip>

[Wed Dec 16 21:45:12 2020] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[Wed Dec 16 21:45:12 2020] ata2.00: configured for UDMA/133

<snip>

[Wed Dec 16 21:48:55 2020] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Wed Dec 16 21:48:55 2020] ata2.00: cmd a0/00:00:00:02:00/00:00:00:00:00/a0 tag 12 pio 16388 in
                                    Mode Sense(10) 5a 00 2a 00 00 00 00 00 02 00res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[Wed Dec 16 21:48:55 2020] ata2.00: status: { DRDY }
[Wed Dec 16 21:48:55 2020] ata2: hard resetting link
[Wed Dec 16 21:48:56 2020] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[Wed Dec 16 21:48:56 2020] ata2.00: configured for UDMA/133
[Wed Dec 16 21:48:56 2020] ata2: EH complete
[Wed Dec 16 21:49:36 2020] ata2.00: limiting speed to UDMA/100:PIO4
[Wed Dec 16 21:49:36 2020] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Wed Dec 16 21:49:36 2020] ata2.00: cmd a0/00:00:00:02:00/00:00:00:00:00/a0 tag 3 pio 16388 in
                                    Mode Sense(10) 5a 00 2a 00 00 00 00 00 02 00res 40/00:02:00:00:02/00:00:00:00:00/00 Emask 0x4 (timeout)
[Wed Dec 16 21:49:36 2020] ata2.00: status: { DRDY }
[Wed Dec 16 21:49:36 2020] ata2: hard resetting link
[Wed Dec 16 21:49:37 2020] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[Wed Dec 16 21:49:37 2020] ata2.00: configured for UDMA/100
[Wed Dec 16 21:49:37 2020] ata2: EH complete
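
To compare runs without eyeballing the whole log, a quick sketch like the following can tally the error-handling events per ATA port from saved dmesg output. The message prefixes it looks for are taken from the log above; the function and key names are just illustrative.

```python
import re
from collections import defaultdict

# Captures the port ("ata2") and the message, with or without a ".00" device suffix.
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)")
SPEED_PREFIX = "limiting speed to "


def summarize(lines):
    """Count frozen exceptions and hard resets, and collect speed downgrades."""
    summary = defaultdict(
        lambda: {"exceptions": 0, "resets": 0, "speed_limits": []}
    )
    for line in lines:
        m = PORT_RE.search(line)
        if m is None:
            continue
        port, msg = m.group(1), m.group(2)
        if msg.startswith("exception ") and "frozen" in msg:
            summary[port]["exceptions"] += 1
        elif msg.startswith("hard resetting link"):
            summary[port]["resets"] += 1
        elif msg.startswith(SPEED_PREFIX):
            summary[port]["speed_limits"].append(msg[len(SPEED_PREFIX):])
    return dict(summary)
```

Fed the log above, only ata2 (the optical drive) shows up in the result; ports without any of these events are left out.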


I also happen to have an NVMe drive in the 4x slot, but I don't think it has anything to do with this.

What I tried so far:
* the drive works in an x86 box (so did the old LG DVD-ROM I tried first)
* I removed any other drives from SATA and left just the drive, no dice
* swapped SATA cables thrice, no dice
* swapped the PSU connector for another branch, no dice
* upgraded to Linux 5.10.1, no dice

I have another 5.10.1 building with some legacy Marvell and OF ATA/PATA/SATA stuff but honestly, I think it's something to do with the controller I'm missing.

Only sr_mod-based devices seem affected; all other drives (good ol' spinning, SSHD & SSD) work fine.
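For reference, the SStatus values in the link lines can be decoded as below. This is a minimal sketch assuming the usual SATA status register layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11) and that the kernel prints the value in hex. It shows the optical drive linking at 1.5 Gbps while the other ports link at 6.0 Gbps, which may simply be the drive's native speed.

```python
# SPD field values to negotiated link speed (assumed standard encoding).
SPEEDS = {0: "no speed negotiated", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sstatus(text):
    """Split a hex SStatus value, as printed in dmesg, into its three fields."""
    value = int(text, 16)
    det = value & 0xF          # device detection / PHY state
    spd = (value >> 4) & 0xF   # negotiated interface speed
    ipm = (value >> 8) & 0xF   # interface power management state
    return {
        "det": det,
        "spd": spd,
        "ipm": ipm,
        "speed": SPEEDS.get(spd, "unknown"),
    }
```

So decode_sstatus("113") is the DVD drive on ata2 and decode_sstatus("133") is what the SSD and the hard disk report.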

EDIT:
Both drives work via a USB-to-SATA adapter on the same box with the same OS setup.
