Messages - pocock

1
Unfortunately I cannot shut it down right now to test them in it

2

I put them both into an HP Z6 G4 workstation and they worked immediately there, so I don't think they are faulty. It looks more like a compatibility issue

Is there any other data I can collect from the Talos II to help troubleshoot?

I can try to purchase an alternative PCIe card for these SSDs. I don't mind swapping the cards and testing other models

3

For my first workstation, I purchased two Samsung PM9A1 (1TB) SSDs and two IcyBox IB-PCI208-HS cards to put them into PCIe slots.  They worked immediately and I've had them for 11 months without any problem

Now I have purchased two of the PM9A1 (2TB) SSDs and two more identical IB-PCI208-HS cards.  These show up only intermittently, or not at all, when the machine boots.

In petitboot there is a lot of kernel error logging like this:

Code: [Select]
[  937.395593] EEH: Recovering PHB#1-PE#fd
[  937.395607] EEH: PE location: UOPWR.A100029-Node0-CPU1 Slot1 (8x), PHB location: N/A
[  937.395609] EEH: This PCI device has failed 1 times in the last hour and will  be permanently disabled after 5 failures.
[  937.395611] EEH: Notify device drivers to shutdown
[  937.395614] EEH: Beginning: 'error_detected(IO frozen)'
[  937.395620] PCI 0001:01:00.0#00fd: EEH: Invoking nvme->error_detected(IO frozen)
[  937.395627] nvme nvme0: frozen state error detected, reset controller
[  937.578659] PCI 0001:01:00.0#00fd: EEH: nvme driver reports: 'need reset'
[  937.578662] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[  937.578667] EEH: Collect temporary log
[  937.578698] EEH: of node=0001:01:00.0
[  937.578702] EEH: PCI device/vendor: ffffffff
[  937.578706] EEH: PCI cmd/status register: ffffffff
[  937.578707] EEH: PCI-E capabilities and status follow:
[  937.578722] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
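
When the log is long, it can help to reduce it to which PE is being recovered and how many failures the kernel has counted so far. The following is only a rough sketch: the function names are made up for illustration, and the patterns assume the message wording shown in the log above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any existing tool) that read
# kernel log text on stdin and pull out the main EEH facts.

# Print the PHB/PE identifier from every "EEH: Recovering" line.
eeh_recovering() {
    sed -n 's/.*EEH: Recovering \(PHB#[0-9a-fA-F]*-PE#[0-9a-fA-F]*\).*/\1/p'
}

# Print the failure count from the most recent
# "has failed N times in the last hour" message, if any.
eeh_last_failure_count() {
    sed -n 's/.*EEH: This PCI device has failed \([0-9][0-9]*\) times.*/\1/p' | tail -n 1
}
```

For example, `dmesg | eeh_recovering | sort | uniq -c` would show how often each PE entered recovery, and `dmesg | eeh_last_failure_count` shows how close the device is to the 5-failure limit mentioned in the log.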

I tried the following with no luck:

- removing the cards and re-inserting them in different slots

- removing the SSDs and putting them back into the cards

- upgrading the FPGA on the second workstation from 0xa to 0xc (v1.08) so that it matches the first workstation

- upgrading the PNOR on the second workstation to the v2.01 beta so that it matches the first workstation

- putting a new and bigger PSU on the second workstation (I needed to do this anyway for bigger GPU and more RAM)

- removing the GPU and everything else from the system and trying one SSD at a time


4

The wiki mentions that newer boards (v1.01) have an external oscillator for the FPGA

The oscillator is mentioned in this commit

If somebody has an older FPGA version, e.g. 1.06, on a v1.01 motherboard, is there a pressing reason to upgrade the FPGA to 1.08?

Are there any particular problems that might occur if the user does not upgrade?

5

It does appear to be related to the jumper

With the jumper installed:
the view of petitboot on the VGA output and in the SSH session are identical.
When I type something in the SSH session it appears on both the VGA and in the SSH console

Without the jumper:
the VGA and USB keyboard act as a terminal together.
The SSH / BMC terminal operates independently: what I type there doesn't appear on the VGA and, vice versa, what I type on the USB keyboard doesn't appear in the SSH view of petitboot

It would be useful to have a note about this in the motherboard manual, to the effect that the jumper affects the keyboard and not only the VGA

6

I have the jumper installed for disabling internal VGA. Does that also serve to mask the keyboard in some way?

7

I connect to the BMC with SSH and access the petitboot console with obmc-console-client

In petitboot, I choose the shell option

At the shell, I run dmesg

The dmesg output shows that usbhid is loaded and it shows the keyboard

Code: [Select]
[    6.004628] input:   USB Keyboard Consumer Control as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/usb1/1-4/1-4.2/1-4.2:1.1/0003:04D9:1702.0005/input/input11
[    6.004683] hid-generic 0003:04D9:1702.0005: input: USB HID v1.10 Device [  USB Keyboard] on usb-0003:01:00.0-4.2/input1
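
To check quickly which HID devices got an input node, the dmesg output can be filtered. This is just a sketch under the assumption that the lines look like the hid-generic line above; the function name is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read dmesg-style text on stdin and print the
# name of each device from lines of the form
#   hid-generic <id>: input...: USB HID v<ver> <type> [<name>] on <path>
hid_input_devices() {
    sed -n 's/.*hid-generic [^ ]*: input.*: USB HID v[0-9.]* [A-Za-z]* [[] *\(.*\)] on .*/\1/p'
}
```

From the petitboot shell this could be run as `dmesg | hid_input_devices`. It only confirms that the kernel bound the device, not that petitboot is reading key presses from it.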

The numlock light works on the keyboard too, so the keyboard appears to have power

8

If you make a fresh install of Debian with GNOME and then look at the default Settings panel for Power, you will see the default "Power Saving" settings

The Debian wiki has some notes about systems which should never go to sleep.  In particular, they show some settings in systemd, so it seems this can't be fixed just by changing the settings in GNOME

Code: [Select]
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

This default is one of many ways where systemd appears to diverge from a traditional UNIX-like system.

You can check the current status of a system by checking each of those targets with the status command:

Code: [Select]
$ sudo systemctl status sleep.target
● sleep.target - Sleep
     Loaded: loaded (/lib/systemd/system/sleep.target; static)
     Active: inactive (dead)
       Docs: man:systemd.special(7)

and on a system where the target has been masked:

Code: [Select]
$ sudo systemctl status sleep.target
● sleep.target
     Loaded: masked (Reason: Unit sleep.target is masked.)
     Active: inactive (dead)
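
To check all four targets in one go, a small loop around `systemctl is-enabled` could be used. This is a sketch, not a tested recipe for every release: the function name and the `SYSTEMCTL` override variable are my own additions, there only so the loop can be tried without a live systemd.

```shell
#!/bin/sh
# Print "<unit> <state>" for each sleep-related systemd target.
# SYSTEMCTL is a hypothetical override hook; by default the real
# systemctl command is used.
sleep_target_states() {
    for unit in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
        # "systemctl is-enabled" prints a state such as "static" or
        # "masked". It exits non-zero for masked units, so the exit
        # status is deliberately ignored here.
        state=$("${SYSTEMCTL:-systemctl}" is-enabled "$unit" 2>/dev/null) || true
        printf '%s %s\n' "$unit" "${state:-unknown}"
    done
}
```

Running `sleep_target_states` after the mask command above should report each of the four units as masked.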

9

I did a fresh Debian install on one system and then left it at the GNOME login screen while making a phone call.

When I came back, the screen was frozen and SSH was no longer working, but the CPUs and fans were still active.

I rebooted it, checked the log and found that it had tried to sleep and had frozen instead.

It is probably a good idea to track this with bug reports in both Debian and GNOME and change their default settings on ppc64le.

However, is there anything that can be done at a lower level, for example, in the kernel, to reject the sleep attempt rather than letting the system get into this frozen state?  If that could make it appear stable for every OS it would be much better than fixing it in one distribution at a time.

It was kind of obvious to me what had happened before I even looked in the logs but I can imagine some users might get a fright
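
As a starting point for the kernel-level question, it is possible to see which sleep states the running kernel advertises by reading /sys/power/state. The sketch below only reports what is listed there, it does not block anything, and the function names are invented for this example.

```shell
#!/bin/sh
# Print the sleep states the kernel advertises. The path is a
# parameter (defaulting to the standard sysfs file) so the function
# can be tried against a sample file.
kernel_sleep_states() {
    state_file=${1:-/sys/power/state}
    if [ -r "$state_file" ]; then
        cat "$state_file"
    else
        echo "cannot read $state_file" >&2
        return 1
    fi
}

# Succeed if the given state (e.g. "mem" for suspend-to-RAM) is listed.
kernel_supports_state() {
    want=$1
    for s in $(kernel_sleep_states "${2:-/sys/power/state}" 2>/dev/null); do
        [ "$s" = "$want" ] && return 0
    done
    return 1
}
```

For example, `kernel_supports_state mem && echo "kernel offers suspend-to-RAM"`. If a state is listed on a board that cannot actually resume from it, that would be useful detail to include in the bug reports.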

10
On one of my machines, I notice that it doesn't respond to the USB keyboard when I am in the bootloader.  I haven't tested the other machine yet.

Normally I control these workstations through SSH to the BMC, and it is a long time since I tried using a directly connected keyboard, so I don't even remember if this worked before

I tried two different keyboards.  Both keyboards work fine with other workstations.

When Linux and GNOME boot on the Talos II, the keyboard begins working

In the bootloader config, I found an option to choose between hvc0 and tty1 consoles.  I don't know if that should make any difference but I tried both settings and it didn't help.

11
Operating Systems and Porting / Re: suspend, sleep, hibernate and resume?
« on: September 12, 2022, 01:16:41 am »

The Tweet says the Blackbird v1.02 motherboard supports it but firmware and kernel changes are also required

It would be useful to know which version of the Talos II (or Talos II Lite) supports it, if any

Is there any easy way to identify the motherboard version without opening the machine?  Is it logged at bootup, available in the IPMI API or anything else?  Can we deduce the version from the serial number?

12

I decided to make another test of OpenBSD using the nightly build from 29 August that I found here.

The system is Talos II, dual CPU, 128GB and AMD Radeon RX 580 8GB.  Is this GPU currently supported on OpenBSD ppc?

I looked at the instructions here in the FAQ for X setup

Code: [Select]
# rcctl enable xenodm
# rcctl start xenodm

# pkg_info

amdgpu-firmware-20220708 firmware binary images for amdgpu(4) driver


Looking at the log, the X server is failing to run

Code: [Select]
cat /var/log/Xorg.0.log

[    19.833] (--) no aperture driver access: only wsfb driver useable
[    20.049] (EE)
Fatal server error:
[    20.049] (EE) xf86OpenConsole: No console driver found
Supported drivers: wscons
Check your kernel's console driver configuration and /dev entries(EE)
[    20.049] (EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
 for help.
[    20.049] (EE) Please also check the log file at
"/var/log/Xorg.0.log" for additional information.
[    20.049] (EE)
[    20.050] (EE) Server terminated with error (1). Closing log file.





13
Talos II / mixing memory sizes on the same CPU?
« on: August 30, 2022, 03:00:09 am »

I understand that each CPU has four memory channels and that, therefore, optimum performance is achieved when all four memory banks have an identical size and they are accessed in parallel.

Nonetheless, is it safe or practical to mix memory sizes in the memory banks on the same CPU?

For example, could the user have 2 x 16GB and 2 x 32GB connected to a single CPU?

Would the CPU ignore the extra capacity in the larger modules and use them all as 16GB?  Or would it somehow access the RAM using two channel speeds?

14
Talos II / different memory sizes on each CPU in multi-CPU systems?
« on: August 30, 2022, 02:56:21 am »

For dual CPU systems, is it necessary to have an identical memory configuration on each CPU?

For example, if a user has 4 x 16GB on CPU1 does that mean the user must have 4 x 16GB on CPU2?

Or can the user have 4 x 16GB on CPU1 and 4 x 32GB on CPU2?

15

If you decide to loan it to hackerspaces near me, I'm happy to visit them, take it there in person and run a workshop on it.  I would then be willing to go back a few weeks later, collect it and take it to the next hackerspace or post it back
