Messages - pocock

Pages: [1] 2 3 ... 19

I already started another thread here about all the suspend modes, including suspend to RAM and the special modes of the POWER9 architecture.  I'm starting this new thread to focus on hibernate, in other words suspend-to-disk (STD).

There is a section about hibernation in the official Linux kernel guide.  They state:


This state (also referred to as Suspend-to-Disk or STD) offers the greatest energy savings and can be used even in the absence of low-level platform support for system suspend.

This means it doesn't matter whether the POWER9 CPU or the motherboard (Talos II, Blackbird, Condor) has any hardware support for hibernation.  The suspend and resume work is all done by the operating system.

The kernel guide goes on to state:

However, it requires some low-level code for resuming the system to be present for the underlying CPU architecture.

Is that code already available in any newer kernels?

Is that code available in any alternative operating system like FreeBSD or OpenBSD?

I checked one of my own systems and it doesn't appear to support hibernation, because disk is missing from the output:

$ cat /sys/power/state
freeze mem

Here is the same command on an Intel host.  Notice that disk is listed there:

$ cat /sys/power/state
freeze mem disk
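
The manual check above can be scripted.  This is only a minimal sketch, assuming the kernel lists the supported states as space-separated words in /sys/power/state, as in the two outputs shown.  The function name supports_sleep_state and its optional second argument are illustrative, not part of any existing tool.

```shell
#!/bin/sh
# supports_sleep_state STATE [FILE]
# Succeeds if STATE (for example "disk") is listed in FILE.
# FILE defaults to /sys/power/state; the override exists only so the
# function can be tried against a sample file.
supports_sleep_state() {
    state="$1"
    file="${2:-/sys/power/state}"
    # Return 2 if the file cannot be read (e.g. not a Linux host)
    [ -r "$file" ] || return 2
    for s in $(cat "$file"); do
        [ "$s" = "$state" ] && return 0
    done
    return 1
}

if supports_sleep_state disk; then
    echo "hibernate (disk) is listed"
else
    echo "hibernate (disk) is not listed"
fi
```

Matching whole words rather than using a plain grep avoids a false positive if a future state name happens to contain the string "disk".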

"dispute with Debian" is not correct.

If you look at the vote about Dr Richard Stallman, the majority of Debian Developers do not want these disputes at all and they voted not to comment on the dispute.

There is a hardcore group of people who run these disputes.  In many cases, they hide what they are doing from the rest of us.  Many of the victims are afraid to speak up.  For example, if somebody receives some email about the CoC, they usually just quit and go to work on something else.  Quitting like that is not an admission of wrongdoing; the victims just don't want to waste their time on these games.

If you can only access a particular application by running it from a web site in Chrome / Chromium then that is still very bad news.  It implies there is no equivalent application that you can run natively.

By way of example, people have been trying to promote free, open source webcam and chat solutions for many years but when the pandemic came along, a lot of users were willing to download the Zoom client or run the WebAssembly client in their browser and the free software community was completely left out in the cold.

Two things come to mind:

Debian involvement can be a burden for any developer, taking up a little bit more time and energy that could be used for other parts of the platform.

Tim's presence as a maintainer may be a hint that there is a shortage of other volunteers willing to do POWER related tasks in the Debian world.  In fact, all the big distributions have had problems that are discouraging volunteers, so it is not only an issue for POWER.

Unfortunately I cannot shut it down right now to test them in it.


I put them both into an HP Z6 G4 workstation and they worked immediately there, so I don't think they are faulty.  It looks more like a compatibility issue.

Is there any other data I can collect from the Talos II to help troubleshoot?

I can try to purchase an alternative PCIe card for these SSDs, I don't mind swapping the cards and testing other models


For my first workstation, I purchased two Samsung PM9A1 (1TB) SSDs and I purchased two of the IcyBox IB-PCI208-HS cards to put them into PCIe slots.  They worked immediately and I've had them for 11 months without any problem

Now I have purchased two of the PM9A1 (2TB) SSDs and two more identical IB-PCI208-HS cards.  These only show up intermittently, or not at all, when the machine boots.

In petitboot there is a lot of kernel error logging like this:

Code: [Select]
[  937.395593] EEH: Recovering PHB#1-PE#fd
[  937.395607] EEH: PE location: UOPWR.A100029-Node0-CPU1 Slot1 (8x), PHB location: N/A
[  937.395609] EEH: This PCI device has failed 1 times in the last hour and will  be permanently disabled after 5 failures.
[  937.395611] EEH: Notify device drivers to shutdown
[  937.395614] EEH: Beginning: 'error_detected(IO frozen)'
[  937.395620] PCI 0001:01:00.0#00fd: EEH: Invoking nvme->error_detected(IO frozen)
[  937.395627] nvme nvme0: frozen state error detected, reset controller
[  937.578659] PCI 0001:01:00.0#00fd: EEH: nvme driver reports: 'need reset'
[  937.578662] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[  937.578667] EEH: Collect temporary log
[  937.578698] EEH: of node=0001:01:00.0
[  937.578702] EEH: PCI device/vendor: ffffffff
[  937.578706] EEH: PCI cmd/status register: ffffffff
[  937.578707] EEH: PCI-E capabilities and status follow:
[  937.578722] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
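
When comparing slots or cards, it can help to reduce a long log like this to the devices involved.  The following is a rough sketch, assuming the log lines have the same "PCI <address>#<PE>: EEH: Invoking <driver>->error_detected" shape as the excerpt above.  The function name eeh_summary is made up for illustration.

```shell
# eeh_summary: read kernel log text on stdin and print, for each PCI device
# whose driver was notified of an EEH error, a count followed by the device
# address and the driver name.
eeh_summary() {
    sed -n 's/.*PCI \([0-9a-fA-F:.]*\)#[0-9a-fA-F]*: EEH: Invoking \([^ ]*\)->error_detected.*/\1 \2/p' |
        sort | uniq -c
}
```

A typical use would be dmesg | eeh_summary, run once after each change of slot or card, so the counts per device can be compared between attempts.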

I tried the following with no luck:

- removing the cards and re-inserting them in different slots

- removing the SSDs and putting them back into the cards

- upgrading the FPGA on the second workstation from 0xa to 0xc (v1.08) so that it matches the first workstation

- upgrading the PNOR on the second workstation to the v2.01 beta so that it matches the first workstation

- putting a new and bigger PSU on the second workstation (I needed to do this anyway for bigger GPU and more RAM)

- removing the GPU and everything else from the system and trying one SSD at a time


The wiki mentions that newer boards (v1.01) have an external oscillator for the FPGA

The oscillator is mentioned in this commit

If somebody has an older FPGA, e.g. 1.06 and they have a v1.01 motherboard, is there a pressing reason to upgrade the FPGA to 1.08?

Are there any particular problems that might occur if the user does not upgrade?


It does appear to be related to the jumper

With the jumper installed:
the views of petitboot on the VGA output and in the SSH session are identical.
When I type something in the SSH session, it appears on both the VGA output and the SSH console.

Without the jumper:
the VGA output and USB keyboard act as a terminal together.
The SSH / BMC terminal operates independently.  What I type there doesn't appear on the VGA output and, vice versa, what I type on the USB keyboard doesn't appear in the SSH view of petitboot.

It would be useful to have a note about this in the motherboard manual, to the effect that the jumper impacts the keyboard and not only the VGA


I have the jumper installed for disabling internal VGA, does that also serve to mask the keyboard in some way?


Connecting to the BMC with SSH and accessing the petitboot console with obmc-console-client

In petitboot, I choose the shell option

At the shell, I run dmesg

The dmesg output shows that usbhid is loaded and it shows the keyboard

Code: [Select]
[    6.004628] input:   USB Keyboard Consumer Control as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/usb1/1-4/1-4.2/1-4.2:1.1/0003:04D9:1702.0005/input/input11
[    6.004683] hid-generic 0003:04D9:1702.0005: input: USB HID v1.10 Device [  USB Keyboard] on usb-0003:01:00.0-4.2/input1
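
To pick lines like these out of a long dmesg listing, a simple filter is enough.  This is only a sketch, assuming the keyboard identifies itself with the word "Keyboard" as in the two lines above.  The function name find_usb_keyboards is illustrative.

```shell
# find_usb_keyboards: read kernel log text on stdin and print the lines
# where the input or HID layer registered a device described as a keyboard.
find_usb_keyboards() {
    grep -i -E '(hid-generic|usbhid|input:).*keyboard'
}
```

From the petitboot shell this would be used as dmesg | find_usb_keyboards.  A keyboard that reports a different product string would not match, so an empty result does not prove the keyboard is undetected.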

The numlock light works on the keyboard too, so the keyboard appears to have power.


If you make a fresh install of Debian with GNOME and then look at the default Settings panel for Power, you will see the default "Power Saving" setting.

The Debian wiki has some notes about systems which should never go to sleep.  In particular, they show some settings in systemd so it seems this can't just be fixed by changing the settings in GNOME

Code: [Select]
sudo systemctl mask

This default is one of many ways where systemd appears to diverge from a traditional UNIX-like system.

You can check the current status of a system by checking each of those targets with the status command:

Code: [Select]
$ sudo systemctl status
● - Sleep
     Loaded: loaded (/lib/systemd/system/; static)
     Active: inactive (dead)
       Docs: man:systemd.special(7)

and on a system where the target was disabled:

Code: [Select]
$ sudo systemctl status
     Loaded: masked (Reason: Unit is masked.)
     Active: inactive (dead)
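
The unit names are missing from the commands as shown above.  Assuming they are the four sleep-related targets that the Debian wiki lists (sleep.target, suspend.target, hibernate.target and hybrid-sleep.target), the check for each of them could be looped as in this sketch.  The function name report_sleep_targets and its optional argument are illustrative only.

```shell
# Assumed unit names; they do not appear in the commands shown above.
SLEEP_TARGETS="sleep.target suspend.target hibernate.target hybrid-sleep.target"

# report_sleep_targets [COMMAND]
# Print each target with the state reported by "systemctl is-enabled"
# (for example "static" or "masked").  COMMAND defaults to systemctl.
report_sleep_targets() {
    ctl="${1:-systemctl}"
    for unit in $SLEEP_TARGETS; do
        # is-enabled exits non-zero for masked units, so ignore the status
        state=$("$ctl" is-enabled "$unit" 2>/dev/null) || true
        printf '%s %s\n' "$unit" "${state:-unknown}"
    done
}
```

Calling report_sleep_targets with no argument on a system where the targets were masked should print "masked" next to each one, and "unknown" where systemctl is not available.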


I did a fresh Debian install on one system and then left it at the GNOME login screen while making a phone call.

When I came back, the screen was frozen and SSH was not working any more, but the CPUs and fans were still active.

I rebooted it, checked the log and found that it had tried to sleep and managed to become frozen.

It is probably a good idea to track this with bug reports in both Debian and GNOME and change their default settings on ppc64le.

However, is there anything that can be done at a lower level, for example, in the kernel, to reject the sleep attempt rather than letting the system get into this frozen state?  If that could make it appear stable for every OS it would be much better than fixing it in one distribution at a time.

It was kind of obvious to me what had happened before I even looked in the logs but I can imagine some users might get a fright

On one of my machines, I notice that it doesn't respond to the USB keyboard when I am in the bootloader.  I haven't tested the other machine yet.

Normally I control these workstations through SSH to the BMC and it is a long time since I tried using a directly connected keyboard so I don't even remember if this worked before

I tried two different keyboards.  Both keyboards work fine with other workstations.

When Linux and GNOME boot on the Talos II, the keyboard begins working.

In the bootloader config, I found an option to choose between hvc0 and tty1 consoles.  I don't know if that should make any difference but I tried both settings and it didn't help.

Operating Systems and Porting / Re: suspend, sleep, hibernate and resume?
« on: September 12, 2022, 01:16:41 am »

The Tweet says the Blackbird v1.02 motherboard supports it, but firmware and kernel changes are also required.

It would be useful to know which version of the Talos II ( / Lite) supports it, if any

Is there any easy way to identify the motherboard version without opening the machine?  Is it logged at bootup, available in the IPMI API or anything else?  Can we deduce the version from the serial number?
