Raptor Computing Systems Community Forums (BETA)
Raptor Computing Systems Hardware => Blackbird => Topic started by: tle on February 03, 2020, 06:27:41 am
-
UPDATE: The issue is still present in version 2.0 of the firmware.
Hi all,
I am trying to get a Gigabyte Radeon graphics card (GV-RXVEGA64GAMING OC-8GD) working at boot. I followed the instructions at https://wiki.raptorcs.com/wiki/Add_GPU_Firmware_To_BOOTKERNFW to bundle the firmware and then flash it to the PNOR:
$ sudo dnf install @development-tools
$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
$ mkdir -p /tmp/firmware/amdgpu
$ cp linux-firmware/amdgpu/vega10_* /tmp/firmware/amdgpu/
$ cd /tmp/firmware
$ mksquashfs * /tmp/firmware.bin -all-root -keep-as-directory
$ scp /tmp/firmware.bin root@bmc-ip-address:/tmp/firmware.bin
$ ssh root@bmc-ip-address
$ pflash -P BOOTKERNFW -e -p /tmp/firmware.bin
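A possible pre-flight check before flashing, as a minimal sketch. The helper names `check_staging` and `build_image` are hypothetical, and the staging path is assumed to be /tmp/firmware as in the commands above. The `-no-xattrs` and `-noappend` options are additions of this sketch, not part of the wiki procedure: `-no-xattrs` is included only because the kernel log further down reports squashfs xattr errors, and whether those errors are related to the hang is not established.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the staging tree holds at least one
# amdgpu/vega10_* firmware file, so an empty image is never flashed.
check_staging() {
    dir="$1"
    [ -d "$dir/amdgpu" ] || { echo "missing $dir/amdgpu" >&2; return 1; }
    for f in "$dir"/amdgpu/vega10_*; do
        [ -e "$f" ] && return 0
    done
    echo "no vega10_* files in $dir/amdgpu" >&2
    return 1
}

# Hypothetical helper: build the image without extended attributes, then
# list its contents so the layout can be inspected before copying to the BMC.
# -noappend overwrites an existing image instead of appending to it.
build_image() {
    dir="$1"; out="$2"
    check_staging "$dir" || return 1
    ( cd "$dir" && mksquashfs * "$out" -all-root -keep-as-directory -no-xattrs -noappend ) || return 1
    unsquashfs -l "$out"
}

# Example (not run automatically):
#   build_image /tmp/firmware /tmp/firmware.bin
```

If the listing printed by `unsquashfs -l` does not show the vega10 files under an amdgpu directory, the image is probably not laid out the way the boot kernel expects.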
When the system boots, no video output is detected on my screen. I switched back to the built-in HDMI output of the AST2500, and the system appears to hang with the following output:
--== Welcome to Hostboot hostboot-3beba24/hbicore.bin ==--
3.08320|secure|SecureROM valid - enabling functionality
8.62232|Booting from SBE side 0 on master proc=00050000
8.66832|ISTEP 6. 5 - host_init_fsi
9.20956|ISTEP 6. 6 - host_set_ipl_parms
9.80358|ISTEP 6. 7 - host_discover_targets
10.45804|HWAS|PRESENT> DIMM[03]=8080000000000000
10.45805|HWAS|PRESENT> Proc[05]=8000000000000000
10.45806|HWAS|PRESENT> Core[07]=5055500000000000
10.76196|ISTEP 6. 8 - host_update_master_tpm
10.83124|SECURE|Security Access Bit> 0x0000000000000000
10.83125|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
10.86796|ISTEP 6. 9 - host_gard
11.33699|HWAS|Deconfig HUID 0x00030000, Physical:/Sys0/Node0/DIMM0
11.33711|HWAS|FUNCTIONAL> DIMM[03]=0080000000000000
11.33712|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
11.33714|HWAS|FUNCTIONAL> Core[07]=5055500000000000
11.33914|ISTEP 6.11 - host_start_occ_xstop_handler
12.86084|ISTEP 6.12 - host_voltage_config
13.01363|ISTEP 7. 1 - mss_attr_cleanup
14.48736|ISTEP 7. 2 - mss_volt
14.84739|ISTEP 7. 3 - mss_freq
15.53796|ISTEP 7. 4 - mss_eff_config
16.81043|ISTEP 7. 5 - mss_attr_update
16.84049|ISTEP 8. 1 - host_slave_sbe_config
16.94450|ISTEP 8. 2 - host_setup_sbe
16.94554|ISTEP 8. 3 - host_cbs_start
16.94665|ISTEP 8. 4 - proc_check_slave_sbe_seeprom_complete
16.95431|ISTEP 8. 5 - host_attnlisten_proc
16.95527|ISTEP 8. 6 - host_p9_fbc_eff_config
16.96162|ISTEP 8. 7 - host_p9_eff_config_links
17.03091|ISTEP 8. 8 - proc_attr_update
17.03201|ISTEP 8. 9 - proc_chiplet_fabric_scominit
17.06735|ISTEP 8.10 - proc_xbus_scominit
17.10310|ISTEP 8.11 - proc_xbus_enable_ridi
17.12720|ISTEP 8.12 - host_set_voltages
17.14691|ISTEP 9. 1 - fabric_erepair
17.35688|ISTEP 9. 2 - fabric_io_dccal
17.37554|ISTEP 9. 3 - fabric_pre_trainadv
17.38003|ISTEP 9. 4 - fabric_io_run_training
17.38356|ISTEP 9. 5 - fabric_post_trainadv
17.38559|ISTEP 9. 6 - proc_smp_link_layer
17.39132|ISTEP 9. 7 - proc_fab_iovalid
17.84717|ISTEP 9. 8 - host_fbc_eff_config_aggregate
17.86589|ISTEP 10. 1 - proc_build_smp
18.32489|ISTEP 10. 2 - host_slave_sbe_update
20.82944|ISTEP 10. 4 - proc_cen_ref_clk_enable
20.96002|ISTEP 10. 5 - proc_enable_osclite
20.96099|ISTEP 10. 6 - proc_chiplet_scominit
21.12679|ISTEP 10. 7 - proc_abus_scominit
21.14613|ISTEP 10. 8 - proc_obus_scominit
21.14745|ISTEP 10. 9 - proc_npu_scominit
21.16494|ISTEP 10.10 - proc_pcie_scominit
21.22242|ISTEP 10.11 - proc_scomoverride_chiplets
21.22526|ISTEP 10.12 - proc_chiplet_enable_ridi
21.24276|ISTEP 10.13 - host_rng_bist
21.27032|ISTEP 10.14 - host_update_redundant_tpm
21.27243|ISTEP 11. 1 - host_prd_hwreconfig
21.87940|ISTEP 11. 2 - cen_tp_chiplet_init1
21.88167|ISTEP 11. 3 - cen_pll_initf
21.88472|ISTEP 11. 4 - cen_pll_setup
21.90654|ISTEP 11. 5 - cen_tp_chiplet_init2
21.90887|ISTEP 11. 6 - cen_tp_arrayinit
21.91180|ISTEP 11. 7 - cen_tp_chiplet_init3
21.91599|ISTEP 11. 8 - cen_chiplet_init
21.91904|ISTEP 11. 9 - cen_arrayinit
21.92274|ISTEP 11.10 - cen_initf
21.92493|ISTEP 11.11 - cen_do_manual_inits
21.92722|ISTEP 11.12 - cen_startclocks
21.93014|ISTEP 11.13 - cen_scominits
21.93381|ISTEP 12. 1 - mss_getecid
23.07910|ISTEP 12. 2 - dmi_attr_update
23.11676|ISTEP 12. 3 - proc_dmi_scominit
23.16648|ISTEP 12. 4 - cen_dmi_scominit
23.17511|ISTEP 12. 5 - dmi_erepair
23.19312|ISTEP 12. 6 - dmi_io_dccal
23.19612|ISTEP 12. 7 - dmi_pre_trainadv
23.19915|ISTEP 12. 8 - dmi_io_run_training
23.21495|ISTEP 12. 9 - dmi_post_trainadv
23.21729|ISTEP 12.10 - proc_cen_framelock
23.22026|ISTEP 12.11 - host_startprd_dmi
23.22243|ISTEP 12.12 - host_attnlisten_memb
23.22604|ISTEP 12.13 - cen_set_inband_addr
23.24306|ISTEP 13. 1 - host_disable_memvolt
24.17510|ISTEP 13. 2 - mem_pll_reset
24.24792|ISTEP 13. 3 - mem_pll_initf
24.30519|ISTEP 13. 4 - mem_pll_setup
24.33331|ISTEP 13. 6 - mem_startclocks
24.34941|ISTEP 13. 7 - host_enable_memvolt
24.35151|ISTEP 13. 8 - mss_scominit
25.70075|ISTEP 13. 9 - mss_ddr_phy_reset
26.14936|ISTEP 13.10 - mss_draminit
26.61721|ISTEP 13.11 - mss_draminit_training
27.51642|ISTEP 13.12 - mss_draminit_trainadv
27.61042|ISTEP 13.13 - mss_draminit_mc
27.78375|ISTEP 14. 1 - mss_memdiag
31.58001|ISTEP 14. 2 - mss_thermal_init
31.64733|ISTEP 14. 3 - proc_pcie_config
31.71197|ISTEP 14. 4 - mss_power_cleanup
31.71609|ISTEP 14. 5 - proc_setup_bars
31.77851|ISTEP 14. 6 - proc_htm_setup
31.79344|ISTEP 14. 7 - proc_exit_cache_contained
31.85528|ISTEP 15. 1 - host_build_stop_image
35.09151|ISTEP 15. 2 - proc_set_pba_homer_bar
35.15084|ISTEP 15. 3 - host_establish_ex_chiplet
35.16463|ISTEP 15. 4 - host_start_stop_engine
35.39656|ISTEP 16. 1 - host_activate_master
36.68418|ISTEP 16. 2 - host_activate_slave_cores
36.77981|ISTEP 16. 3 - host_secure_rng
36.80049|ISTEP 16. 4 - mss_scrub
36.83218|ISTEP 16. 5 - host_load_io_ppe
36.87857|ISTEP 16. 6 - host_ipl_complete
37.75469|ISTEP 18.11 - proc_tod_setup
38.02326|ISTEP 18.12 - proc_tod_init
38.05033|ISTEP 20. 1 - host_load_payload
39.03394|ISTEP 20. 2 - host_load_hdat
40.11956|ISTEP 21. 1 - host_runtime_setup
46.40398|htmgt|OCCs are now running in ACTIVE state
52.37718|ISTEP 21. 2 - host_verify_hdat
52.41944|ISTEP 21. 3 - host_start_payload
[ 53.303670341,5] OPAL skiboot-c81f9d6 starting...
[ 53.303673446,7] initial console log level: memory 7, driver 5
[ 53.303675558,6] CPU: P9 generation processor (max 4 threads/core)
[ 53.303677518,7] CPU: Boot CPU PIR is 0x0024 PVR is 0x004e1203
[ 53.303680283,7] OPAL table: 0x30103930 .. 0x30103f10, branch table: 0x30002000
[ 53.303683401,7] Assigning physical memory map table for nimbus
[ 53.303686056,7] Parsing HDAT...
[ 53.303687465,7] SPIRA-S found.
[ 53.303689697,6] BMC #0: HW version 3, SW version 2, chip DD1.0
[ 53.303770751,6] SP Family is ibm,ast2500,openbmc
[ 53.303776925,7] LPC: IOPATH chip id = 0
[ 53.303778302,7] LPC: FW BAR = f0000000
[ 53.303779833,7] LPC: MEM BAR = e0000000
[ 53.303781399,7] LPC: IO BAR = d0010000
[ 53.303782895,7] LPC: Internal BAR = c0012000
[ 53.303795512,7] LPC UART: base addr = 3f8 (3f8) size = 1 clk = 1843200, baud = 115200
[ 53.303798310,7] LPC: BT [0, 0] sms_int: 0, bmc_int: 0
[ 53.305153464,5] HDAT I2C: found e3p0 - unknown@18 dp:ff (ff:)
[ 53.305273899,5] HDAT I2C: found e3p1 - unknown@1c dp:ff (ff:)
[ 53.306035877,5] CHIP: Chip ID 0000 type: P9N DD2.30
[ 53.306474802,5] PLAT: Detected Blackbird platform
[ 53.306544669,5] PLAT: Detected BMC platform ast2500:openbmc
[ 53.323024841,5] CPU: All 32 processors called in...
[ 53.501903768,7] LPC: Routing irq 10, policy: 0 (r=1)
[ 53.501904733,7] LPC: SerIRQ 10 using route 0 targetted at OPAL
[ 54.509575320,5] HIOMAP: Negotiated hiomap protocol v2
[ 54.510806979,5] HIOMAP: Block size is 64KiB
[ 54.510832611,5] HIOMAP: BMC suggested flash timeout of 8s
[ 54.510876435,5] HIOMAP: Flash size is 64MiB
[ 54.510970596,5] HIOMAP: Erase granule size is 64KiB
[ 56.418150405,5] FLASH: Found system flash: (unnamed) id:0
[ 57.207119491,7] LPC: Routing irq 4, policy: 0 (r=1)
[ 57.207120295,7] LPC: SerIRQ 4 using route 1 targetted at OPAL
[ 57.207242358,5] OCC: All Chip Rdy after 0 ms
[ 58.009544728,3] STB: VERSION verification FAILED. log=0xffffffffffff8160
[ 59.124974141,3] STB: IMA_CATALOG verification FAILED. log=0xffffffffffff8160
[ 59.320998010,3] CAPP: Error loading ucode lid. index=203d1
[ 59.335020541,5] PCI: Resetting PHBs and training links...
[ 60.355590956,5] PCI: Probing slots...
[ 60.412005477,5] PCI Summary:
[ 60.412048918,5] PHB#0000:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..03 SLOT=CPU1 Slot2 (16x)
[ 60.412163589,5] PHB#0000:01:00.0 [SWUP] 1022 1470 R:c1 C:060400 B:02..03 LOC_CODE=CPU1 Slot2 (16x)
[ 60.412317704,5] PHB#0000:02:00.0 [SWDN] 1022 1471 R:00 C:060400 B:03..03
[ 60.412419910,5] PHB#0000:03:00.0 [LGCY] 1002 687f R:c1 C:030000 ( vga) LOC_CODE=CPU1 Slot2 (16x)
[ 60.412569858,5] PHB#0000:03:00.1 [EP ] 1002 aaf8 R:00 C:040300 (multimedia-device) LOC_CODE=CPU1 Slot2 (16x)
[ 60.412755922,5] PHB#0001:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=CPU1 Slot1 (8x)
[ 60.412866802,5] PHB#0001:01:00.0 [EP ] 144d a802 R:01 C:010802 ( mass-storage) LOC_CODE=CPU1 Slot1 (8x)
[ 60.413042976,5] PHB#0002:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin SATA
[ 60.413236298,5] PHB#0002:01:00.0 [LGCY] 1b4b 9235 R:11 C:010601 ( sata) LOC_CODE=Builtin SATA
[ 60.413376374,5] PHB#0003:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin USB
[ 60.413520796,5] PHB#0003:01:00.0 [EP ] 104c 8241 R:02 C:0c0330 ( usb-xhci) LOC_CODE=Builtin USB
[ 60.413672005,5] PHB#0004:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin Ethernet
[ 60.413905142,5] PHB#0004:01:00.0 [EP ] 14e4 1657 R:01 C:020000 ( ethernet) LOC_CODE=Builtin Ethernet
[ 60.414016347,5] PHB#0004:01:00.1 [EP ] 14e4 1657 R:01 C:020000 ( ethernet) LOC_CODE=Builtin Ethernet
[ 60.414222156,5] PHB#0004:01:00.2 [EP ] 14e4 1657 R:01 C:020000 ( ethernet) LOC_CODE=Builtin Ethernet
[ 60.414373675,5] PHB#0005:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..02 SLOT=BMC
[ 60.414454930,5] PHB#0005:01:00.0 [ETOX] 1a03 1150 R:04 C:060400 B:02..02 LOC_CODE=BMC
[ 60.414582932,5] PHB#0005:02:00.0 [PCID] 1a03 2000 R:41 C:040000 ( video) LOC_CODE=BMC
[ 60.424173661,5] IPMI: Resetting boot count on successful boot
[ 60.424270580,5] INIT: Waiting for kernel...
[ 73.210357140,3] STB: BOOTKERNEL verification FAILED. log=0xffffffffffff8160
[ 73.210954847,5] INIT: 64-bit LE kernel discovered
[ 73.358043529,5] INIT: Starting kernel at 0x20011000, fdt at 0x306f65d0 217248 bytes
[ 74.469358228,3] LPC[000]: Got SYNC no-response error. Error address reg: 0xd0010080
[ 74.469369652,6] IPMI: dropping non severe PEL event
[ 74.469419009,7] UART: IRQ functional !
[ 4.135656] IMC PMU (null) Register failed
[ 4.995288] kAFS: failed to register: -97
[ 5.393954] squashfs: SQUASHFS error: Xattrs in filesystem, these will be ignored
[ 5.393979] squashfs: SQUASHFS error: unable to read xattr id index table
[ 5.401894] udevd[1677]: specified group 'kvm' unknown
[ 5.410641] udevd[1678]: specified group 'kvm' unknown
nvram process returned non-zero exit status
dmesg: klogctl: Operation not permitted
[ 81.461400396,3] PHB#0000[0:0]: phbRegbFirstErrorStatus = 0000000000000000
[ 81.461461441,3] PHB#0000[0:0]: phbRegbErrorLog0 = 0000000000000000
[ 81.461522427,3] PHB#0000[0:0]: phbRegbErrorLog1 = 0000000000000000
[ 81.461585112,3] PHB#0000[0:0]: PEST[1ff] = 3740002a03000000 0000000000000000
cpu 0x0: Vector: 300 (Data Access) at [c0000001f96df950]
pc: c0080000022312fc: amdgpu_fence_process+0xc0/0x13c [amdgpu]
lr: c0080000022312c0: amdgpu_fence_process+0x84/0x13c [amdgpu]
sp: c0000001f96dfbd0
msr: 900000000280b033
dar: 8
dsisr: 80000
current = 0xc0000001f95b5900
paca = 0xc0000001ff7ff480 irqmask: 0x03 irq_happened: 0x01
pid = 1798, comm = kworker/0:4
Linux version 4.19.0-openpower1 (root@raptor-build-public-staging-01) (gcc version 6.5.0 (Buildroot 2019.02.1-05273-gef2bf42027)) #2 SMP Wed May 22 00:16:10 UTC 2019
enter ? for help
[c0000001f96dfc20] c008000002231640 amdgpu_fence_count_emitted+0x20/0x40 [amdgpu]
[c0000001f96dfc50] c0080000022d8fb4 amdgpu_uvd_idle_work_handler+0x90/0x160 [amdgpu]
[c0000001f96dfcb0] c0000000000943f8 process_one_work+0x204/0x32c
[c0000001f96dfd40] c000000000094af0 worker_thread+0x2d0/0x394
[c0000001f96dfdc0] c00000000009a780 kthread+0x14c/0x154
[c0000001f96dfe30] c00000000000b6b0 ret_from_kernel_thread+0x5c/0x6c
The full kernel log can be found at https://gist.github.com/runlevel5/89f87c2296f0f82f634921248309e7be
and the full OPAL log can be found at https://gist.github.com/runlevel5/94b44fa48cae84ffcb7d2f15fe4945a7
I am wondering whether the firmware is being loaded correctly. Is there anything I could do to determine the root cause?
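One way to narrow this down, sketched as a hypothetical helper: capture the boot kernel log and filter it for lines that mention amdgpu, firmware, or squashfs. The name `gpu_fw_clues` and the search pattern are assumptions, chosen from the messages visible in the log above. They are a starting point, not a complete diagnostic.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that hint at GPU firmware loading
# problems. Reads the files given as arguments, or stdin when none are given.
gpu_fw_clues() {
    grep -i -E 'amdgpu|firmware|squashfs' "$@"
}

# Example, from a shell on the boot kernel (dmesg may need root, as the
# "klogctl: Operation not permitted" line in the log above suggests):
#   dmesg | gpu_fw_clues
```

Applied to the log in this post, the filter would surface the two squashfs xattr errors and the amdgpu frames in the crash backtrace, which are the lines most directly tied to the firmware image and the driver.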
-
Have you already read this wiki entry (https://wiki.raptorcs.com/wiki/Troubleshooting/GPU) on getting AMDGPU to work under Linux?
-
@MPC7500 Yes, I have followed the instructions there:
> Bootloader does not show up on monitor(s) attached to a discrete GPU
> Most modern discrete GPUs require firmware. As Talos™ II is aimed at a security-conscious audience, we do not currently include GPU firmware in the production firmware images. Instructions are available in the Users Guide to add firmware for your GPU to the PNOR if needed. Note that any added firmware may be able to access and modify data associated with the affected device(s); we strongly recommend you perform a security risk analysis before loading any firmware, and select open firmware where/if it is available.
> If you are using a GPU that does not require firmware, or have already added any needed firmware files to the host PNOR, please ensure that the on-board VGA disable jumper (J10109) is capped. The bootloader output will preferentially show up on the on-board VGA port if it remains enabled.
> Alternatively, you either use a serial console or VGA monitor / adapter to interact with the bootloader.
I would like to know what the output looks like when the Linux vega10 firmware in the host PNOR is loaded correctly.
By the way, I also tried disabling the on-board VGA output by capping the jumper (J10109); nothing was displayed on my monitor.
NOTE: I have tested this graphics card in my Windows PC, and it works perfectly.
-
I don't have a discrete GPU myself, so I couldn't try it. But the output will be Petitboot. That's why you need the disable jumper; otherwise Petitboot will be shown on the AST GPU.
In the firmware directory you need two directories. amdgpu and radeon. Which files do you have in the directories?
But anyway, I would first try to get it working under Linux.
Maybe there is someone else who could help.
-
> In the firmware directory you need two directories. amdgpu and radeon. Which files do you have in the directories?
I only copied the `amdgpu/vega10_*` files. I did not add any firmware under `radeon` because I don't need any of it.
I can confirm the card runs fine under my Linux host.
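To answer the "which files do you have" question concretely, a small sketch that reports how many files each of the two directories holds in a staging tree. The helper name `fw_layout` is hypothetical, and whether a missing or empty radeon directory matters for a Vega card is not settled in this thread.

```shell
#!/bin/sh
# Hypothetical helper: report how many regular files each expected firmware
# directory holds in a staging tree (0 when the directory is absent).
fw_layout() {
    root="$1"
    for d in amdgpu radeon; do
        if [ -d "$root/$d" ]; then
            n=$(find "$root/$d" -type f | wc -l | tr -d ' ')
        else
            n=0
        fi
        echo "$d: $n"
    done
}

# Example (not run automatically):
#   fw_layout /tmp/firmware
```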
-
Great. Now the card only has to work at Petitboot?
-
> Great. Now the card only has to work at Petitboot?
Yes.
I think this issue could be resolved with a newer Linux kernel. On my host, the card was not working either (under Fedora 32 Rawhide running the 5.4.17 kernel). I managed to get it working by compiling my own Linux 5.5.0 kernel.
Let's wait for Raptor CS to bump the Linux version in their firmware.
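For anyone who wants to check a kernel against that observation, a minimal sketch. The helper names `version_at_least` and `kernel_base` are hypothetical, and the 5.5.0 threshold is only what worked on my host, not a confirmed minimum. It relies on `sort -V` from GNU coreutils, so it is meant for a regular Linux host rather than a minimal boot environment.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 is at least version $2.
# Uses version sort: if $2 sorts first (or the two are equal), $1 >= $2.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: strip a local suffix such as "-openpower1".
kernel_base() {
    printf '%s\n' "${1%%-*}"
}

# Example (not run automatically):
#   if version_at_least "$(kernel_base "$(uname -r)")" 5.5.0; then
#       echo "kernel is 5.5.0 or newer"
#   fi
```

By this comparison, the 4.19.0-openpower1 kernel shown in the crash output above is older than both the 5.4.17 kernel that failed on my host and the 5.5.0 kernel that worked.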