Topics - pocock

1

I already started another thread here about all the suspend modes, including suspend-to-RAM and the special modes of the POWER9 architecture.  I'm starting this new thread to focus on hibernate, in other words, suspend-to-disk (STD).

There is a section about hibernation in the official Linux kernel guide.  They state:

Quote
Hibernation

This state (also referred to as Suspend-to-Disk or STD) offers the greatest energy savings and can be used even in the absence of low-level platform support for system suspend.

This means it doesn't matter whether POWER9 or the motherboard (Talos II, Blackbird, Condor) has any hardware support for hibernation.  The suspend and resume work is all done by the operating system.

The kernel guide goes on to state:

Quote
However, it requires some low-level code for resuming the system to be present for the underlying CPU architecture.

Is that code already available in any newer kernels?

Is that code available in any alternative operating system like FreeBSD or OpenBSD?

I checked one of my own systems and it doesn't appear to support it: disk is missing from the output.


$ cat /sys/power/state
freeze mem


Here is the same command on an Intel host.  Notice that disk is listed there:


$ cat /sys/power/state
freeze mem disk
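
The same check can be scripted.  This is only a minimal sketch, not anything from the kernel guide: it splits the contents of /sys/power/state and reports whether disk is among the advertised states.  The function names are made up for illustration.

```python
from pathlib import Path

def parse_sleep_states(text):
    """Split the contents of /sys/power/state into a set of state names."""
    return set(text.split())

def supports_hibernate(text):
    """True if the kernel advertises suspend-to-disk ("disk")."""
    return "disk" in parse_sleep_states(text)

def read_sleep_states(path="/sys/power/state"):
    """Read the advertised states; returns an empty set if the file is absent."""
    p = Path(path)
    return parse_sleep_states(p.read_text()) if p.exists() else set()

if __name__ == "__main__":
    states = read_sleep_states()
    print("advertised states:", " ".join(sorted(states)) or "(none)")
    print("hibernate supported:", "disk" in states)
```

With the two outputs shown above, supports_hibernate() would return False for the POWER9 system and True for the Intel host.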



2

For my first workstation, I purchased two Samsung PM9A1 (1TB) SSDs and two IcyBox IB-PCI208-HS cards to put them into PCIe slots.  They worked immediately and I've had them for 11 months without any problem.

Now I have purchased two of the PM9A1 (2TB) SSDs and two more identical IB-PCI208-HS cards.  These show up only intermittently, or not at all, when the machine boots.

In petitboot there is a lot of kernel error logging like this:

Code:
[  937.395593] EEH: Recovering PHB#1-PE#fd
[  937.395607] EEH: PE location: UOPWR.A100029-Node0-CPU1 Slot1 (8x), PHB location: N/A
[  937.395609] EEH: This PCI device has failed 1 times in the last hour and will  be permanently disabled after 5 failures.
[  937.395611] EEH: Notify device drivers to shutdown
[  937.395614] EEH: Beginning: 'error_detected(IO frozen)'
[  937.395620] PCI 0001:01:00.0#00fd: EEH: Invoking nvme->error_detected(IO frozen)
[  937.395627] nvme nvme0: frozen state error detected, reset controller
[  937.578659] PCI 0001:01:00.0#00fd: EEH: nvme driver reports: 'need reset'
[  937.578662] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[  937.578667] EEH: Collect temporary log
[  937.578698] EEH: of node=0001:01:00.0
[  937.578702] EEH: PCI device/vendor: ffffffff
[  937.578706] EEH: PCI cmd/status register: ffffffff
[  937.578707] EEH: PCI-E capabilities and status follow:
[  937.578722] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
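
When there is a lot of this logging, a short script can condense it.  The following is a rough sketch: the regular expressions are based only on the message formats in the excerpt above and the function name is hypothetical.  It collects the affected PEs, the highest failure count reported, and how many register dumps read back as ffffffff, which usually means the device is not answering on the bus at all.

```python
import re

# Patterns based on the message formats in the log excerpt above.
RECOVERING = re.compile(r"EEH: Recovering (PHB#\d+-PE#[0-9a-f]+)")
FAIL_COUNT = re.compile(r"has failed (\d+) times in the last hour")
REGISTER = re.compile(r"EEH: PCI (?:device/vendor|cmd/status register): ([0-9a-f]{8})")

def summarize_eeh(lines):
    """Collect affected PEs, the highest failure count, and all-ones register reads."""
    summary = {"pes": set(), "max_failures": 0, "all_ones_reads": 0}
    for line in lines:
        if m := RECOVERING.search(line):
            summary["pes"].add(m.group(1))
        if m := FAIL_COUNT.search(line):
            summary["max_failures"] = max(summary["max_failures"], int(m.group(1)))
        if (m := REGISTER.search(line)) and m.group(1) == "ffffffff":
            summary["all_ones_reads"] += 1
    return summary
```

It accepts any iterable of lines, for example the saved output of dmesg opened as a file.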

I tried the following with no luck:

- removing the cards and re-inserting them in different slots

- removing the SSDs and putting them back into the cards

- upgrading the FPGA on the second workstation from 0xa to 0xc (v1.08) so that it matches the first workstation

- upgrading the PNOR on the second workstation to the v2.01 beta so that it matches the first workstation

- putting a new and bigger PSU on the second workstation (I needed to do this anyway for bigger GPU and more RAM)

- removing the GPU and everything else from the system and trying one SSD at a time


3

The wiki mentions that newer boards (v1.01) have an external oscillator for the FPGA.

The oscillator is mentioned in this commit.

If somebody has a v1.01 motherboard with an older FPGA version, e.g. 1.06, is there a pressing reason to upgrade the FPGA to 1.08?

Are there any particular problems that might occur if the user does not upgrade?

4

I did a fresh Debian install on one system and then left it at the GNOME login screen while making a phone call.

When I came back, the screen was frozen and SSH was no longer working, but the CPUs and fans were still active.

I rebooted it, checked the log and found that it had tried to sleep and ended up frozen.

It is probably a good idea to track this with bug reports in both Debian and GNOME and change their default settings on ppc64le.

However, is there anything that can be done at a lower level, for example, in the kernel, to reject the sleep attempt rather than letting the system get into this frozen state?  If that could make it appear stable for every OS it would be much better than fixing it in one distribution at a time.
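
Short of a kernel change, one distribution-independent workaround on systemd-based systems is a sleep.conf drop-in that refuses every sleep mode.  This is not the kernel-level fix asked about above, only a sketch of a stopgap.  It assumes a systemd version recent enough to support the Allow* options in the [Sleep] section, and the file name mentioned in the comment is just an example.

```python
SLEEP_OPTIONS = (
    "AllowSuspend",
    "AllowHibernation",
    "AllowHybridSleep",
    "AllowSuspendThenHibernate",
)

def render_sleep_dropin(allow=False):
    """Return the text of a [Sleep] drop-in that allows or refuses every sleep mode."""
    value = "yes" if allow else "no"
    lines = ["[Sleep]"] + [f"{option}={value}" for option in SLEEP_OPTIONS]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Print the drop-in; as root, redirect it to a file such as
    # /etc/systemd/sleep.conf.d/10-no-sleep.conf (example name) to apply it.
    print(render_sleep_dropin(), end="")
```

The drop-in only affects sleep requests that go through systemd, so anything writing to /sys/power/state directly would not be stopped by it.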

It was kind of obvious to me what had happened before I even looked in the logs, but I can imagine some users might get a fright.

5
On one of my machines, I notice that it doesn't respond to the USB keyboard when I am in the bootloader.  I haven't tested the other machine yet.

Normally I control these workstations through SSH to the BMC, and it is a long time since I tried using a directly connected keyboard, so I don't even remember whether this worked before.

I tried two different keyboards.  Both keyboards work fine with other workstations.

When Linux and GNOME boot on the Talos II, the keyboard begins working.

In the bootloader config, I found an option to choose between hvc0 and tty1 consoles.  I don't know if that should make any difference but I tried both settings and it didn't help.

6

I decided to make another test of OpenBSD using the nightly build from 29 August that I found here.

The system is a Talos II with dual CPUs, 128GB RAM and an AMD Radeon RX 580 8GB.  Is this GPU currently supported on OpenBSD ppc?

I looked at the instructions here in the FAQ for X setup:

Code:
# rcctl enable xenodm
# rcctl start xenodm

# pkg_info

amdgpu-firmware-20220708 firmware binary images for amdgpu(4) driver


Looking at the log, the X server is failing to run:

Code:
cat /var/log/Xorg.0.log

[    19.833] (--) no aperture driver access: only wsfb driver useable
[    20.049] (EE)
Fatal server error:
[    20.049] (EE) xf86OpenConsole: No console driver found
Supported drivers: wscons
Check your kernel's console driver configuration and /dev entries(EE)
[    20.049] (EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
 for help.
[    20.049] (EE) Please also check the log file at
"/var/log/Xorg.0.log" for additional information.
[    20.049] (EE)
[    20.050] (EE) Server terminated with error (1). Closing log file.





7
Talos II / mixing memory sizes on the same CPU?
« on: August 30, 2022, 03:00:09 am »

I understand that each CPU has four memory channels; therefore, optimum performance is achieved when all four memory banks have an identical size and they are accessed in parallel.

Nonetheless, is it safe or practical to mix memory sizes in the memory banks on the same CPU?

For example, could the user have 2 x 16GB and 2 x 32GB connected to a single CPU?

Would the CPU ignore the extra capacity in the larger modules and use them all as 16GB?  Or would it somehow access the RAM using two channel speeds?

8
Talos II / different memory sizes on each CPU in multi-CPU systems?
« on: August 30, 2022, 02:56:21 am »

For dual CPU systems, is it necessary to have an identical memory configuration on each CPU?

For example, if a user has 4 x 16GB on CPU1, does that mean the user must have 4 x 16GB on CPU2?

Or can the user have 4 x 16GB on CPU1 and 4 x 32GB on CPU2?

9
This was a comment in the discussion about donating a Blackbird to a worthy cause.

It is probably not hard to scrape or aggregate issues from bug trackers in various free software projects and assemble them into some kind of dashboard or heat map to identify the pinch-points for widespread OpenPOWER use.

Scraping the data is rather easy: most bug trackers offer at least one well-known feed or export format such as RSS or iCalendar.

Identifying which issues relate to OpenPOWER depends on how diligent people are in tagging their bugs.

Deciding how to prioritize the issues on a dashboard or heat map may be more contentious as different people have different perspectives about which issues are important.
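
To give an idea of the scraping and tagging steps, here is a minimal sketch.  It assumes plain RSS 2.0 feeds that have already been downloaded, and the keyword list is a guess at how people tag such bugs rather than any agreed convention.  The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed keyword list; real tagging conventions differ between projects.
KEYWORDS = ("ppc64le", "ppc64el", "power9", "openpower")

def matching_items(rss_text, keywords=KEYWORDS):
    """Return (title, link) pairs for RSS items whose title or description mentions a keyword."""
    root = ET.fromstring(rss_text)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        haystack = f"{title} {description}".lower()
        if any(keyword in haystack for keyword in keywords):
            hits.append((title, item.findtext("link", default="")))
    return hits

def heat_map(feeds):
    """Map each tracker name to its number of matching items."""
    return {name: len(matching_items(text)) for name, text in feeds.items()}
```

A plain count per tracker is the simplest possible weighting; the question of how to prioritize would have to be layered on top of it.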

10

Consider the following setup:

main workstation:
- POWER9
- GPU
- OBS
- capture hardware

secondary workstation or laptop:
- x86
- untrusted OS (e.g. Windows)
- applications that you are required to run for a client or employer (e.g. a messaging or remote desktop app) or a video meeting system that requires WebAssembly

You can generate an RTP video stream from the main workstation, receive it on the secondary workstation and make it available as a virtual webcam for the software on that machine.

I opened a couple of discussions about the topic:

http://ffmpeg.org/pipermail/ffmpeg-user/2022-August/055275.html

https://obsproject.com/forum/threads/obs-streaming-udp-to-ffmpeg-to-v4l2-loopback.158367/
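
As a rough sketch of the receiving side, the helper below builds an ffmpeg command line.  It assumes the receiver runs Linux with the v4l2loopback module loaded and that the sender's stream is described by an SDP file; on Windows a different virtual camera mechanism would be needed.  The file name stream.sdp and the device /dev/video10 are placeholders.

```python
import shlex

def receiver_command(sdp_path="stream.sdp", device="/dev/video10"):
    """Build an ffmpeg command that reads an RTP stream described by an SDP file
    and writes the decoded video to a V4L2 loopback device."""
    return [
        "ffmpeg",
        "-protocol_whitelist", "file,udp,rtp",  # allow the SDP file and the RTP/UDP input
        "-i", sdp_path,
        "-pix_fmt", "yuv420p",   # a pixel format that webcam consumers commonly accept
        "-f", "v4l2", device,    # ffmpeg's V4L2 output device
    ]

if __name__ == "__main__":
    # Print the command so it can be inspected before running it.
    print(shlex.join(receiver_command()))
```

Building the command as a list keeps it easy to pass to subprocess.run() without shell quoting problems.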


11
Operating Systems and Porting / suspend, sleep, hibernate and resume?
« on: July 08, 2022, 08:24:44 am »

I saw a couple of threads with comments that the Raptor systems can't suspend or hibernate: this comment and this comment.

I searched the wiki and it didn't have any pages about suspend, sleep or hibernate.

Does anybody have more details about this?

In the event that hibernate is really impossible, are there any workarounds that people recommend for restoring the desktop to a previous state after a complete shutdown?  For example, I've seen some utilities that can reopen windows on the same workspaces and in the same places but this doesn't solve everything.


12
There are some hints about multi-seat setups in other topics in the forum.

Multi-seat can make the workstations more viable because the cost is shared between two or three users.

I had some discussions about this with a few people.  Most people are satisfied that POWER9 provides enough compute capacity but there were some practical concerns.

The type of user who would benefit from this setup is typically a developer, system administrator or IT support worker who already works in Linux.

By their nature, this type of user would like to tweak the system, for example, by installing a kernel module or a non-standard version of some development header files.

What level of isolation can be achieved between users in such scenarios?

For example, using virtualization, people can have their own kernels and separate GPU ports.

Using LXC and cgroups, people can have their own root filesystem but they share a kernel.

Has anybody tested these possibilities with POWER / Linux workstations or even on x86?

I found some links about the topic for some of the distributions that people are using:

Hardware passthrough in LXC or running a desktop in a cgroup

two X servers one graphics card

Debian - Multi Seat

Fedora - Multi Seat

Arch - Multi Seat

FreeBSD - Multi Seat

13
Has anybody tried any of these cards with a recent kernel?

Does the kernel command line option mentioned on the wiki here make any difference?

Code:
amdgpu.aspm=0
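
When comparing results, it helps to confirm that the option really reached the running kernel.  A small sketch, with a hypothetical helper name, that looks up a name=value parameter in a kernel command line string such as the contents of /proc/cmdline:

```python
def kernel_param(cmdline, name):
    """Return the value of name=value on a kernel command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None
```

For example, kernel_param(open("/proc/cmdline").read(), "amdgpu.aspm") returns the string "0" when the option above is active and None when it is missing.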

14
General Discussion / Areca Tri-mode HBA NVMe / SAS / SATA controllers
« on: October 17, 2021, 05:41:43 am »

Online store / catalog

These are not cheap but they look like interesting HBAs for people who have a lot of disks and want to link them all into their workstation or server through a single x16 slot.

Has anybody tried any of them?

How do they compare to other brands offering a Tri-mode solution?

15

Quattro 400 web page

Has anybody tried it?

It is not really fast enough for somebody who wants the newest SSDs like the Samsung 980 Pro; each of those can do 7GB/s.

For previous-generation SSDs operating up to 4GB/s this looks like an interesting HBA for software RAID, Btrfs or ZFS in an x8 slot.
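
The arithmetic behind that can be sketched as follows.  The per-lane rates and the 128b/130b encoding are standard PCIe figures, but the card's PCIe generation and the number of SSDs are assumptions that have to be supplied, since neither is stated here.  The results are theoretical maxima before protocol overhead.

```python
# Per-lane raw rates in GT/s; PCIe 3.0 and 4.0 both use 128b/130b encoding.
GT_PER_SECOND = {3: 8.0, 4: 16.0}
ENCODING_EFFICIENCY = 128 / 130

def link_bandwidth_gb_per_s(generation, lanes):
    """Theoretical one-direction bandwidth of a PCIe link in GB/s, before protocol overhead."""
    bits_per_second = GT_PER_SECOND[generation] * 1e9 * ENCODING_EFFICIENCY * lanes
    return bits_per_second / 8 / 1e9

def slot_is_bottleneck(generation, lanes, ssd_count, ssd_gb_per_s):
    """True if the SSDs together can outrun the slot."""
    return ssd_count * ssd_gb_per_s > link_bandwidth_gb_per_s(generation, lanes)
```

A PCIe 3.0 x8 link works out to a little under 8GB/s, so under these assumptions a single 7GB/s SSD would already come close to saturating it.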
