PCIe Troubleshooting Guide

If your Metis PCIe or M.2 card is not detected by lspci, or it is detected but not communicating properly with the host, the checks below will help you isolate and fix the problem.

PCIe device enumeration

Check lspci output

Run these two commands to list the devices the host detects, and check whether Metis is among them:

lspci
lspci -tv

If Metis is detected, you should see one of the following:

Device 1f9d:1100
Processing accelerators: Axelera AI Metis AIPU (rev 02)
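
To script this check, one option is to match on the numeric vendor:device ID, which lspci -nn prints whether or not the local PCI ID database knows the device name. The following is a minimal sketch; the function name has_metis is illustrative and not part of the Voyager SDK:

```shell
# has_metis: succeed if the lspci -nn output on stdin lists device 1f9d:1100.
# The function name is illustrative; only the ID comes from this guide.
has_metis() {
    grep -qi '1f9d:1100'
}
```

For example: lspci -nn | has_metis && echo "Metis detected"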

Update PCI IDs

If lspci or lspci -tv shows the device as:

Device 1f9d:1100

then this is your Metis card. The device name is missing because Axelera's vendor ID was registered after Ubuntu 22.04 was released, so the PCI ID list shipped with the system does not include it. Update the PCI IDs:

sudo update-pciids

Your Metis card will then appear as:

Processing accelerators: Axelera AI Metis AIPU (rev 02)

Remove and rescan with the axdevice tool

From your voyager-sdk folder, activate the virtual environment (source venv/bin/activate), then run:

axdevice --refresh

This command:

  • removes all Axelera PCIe / M.2 devices and triggers a rescan, and
  • reloads firmware for Axelera PCIe / M.2 devices.

Remove and rescan a PCIe device or bridge manually

Metis device. Find the port address of your Metis device with lspci and lspci -tv. To remove it (here the device is at 0000:01:00.0):

echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/remove > /dev/null

Then rescan:

echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null

PCIe bridge to Metis. Find the bridge address with lspci and lspci -tv. To remove it (here the bridge is at 0000:01:00.0):

echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/remove > /dev/null

Then rescan:

echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null
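
The remove and rescan steps can be wrapped in a small helper that first checks the address format, which guards against typos in the sysfs path. This is a sketch only: the function names and the SYSFS_PCI and SUDO variables are assumptions introduced here so the helper can be tried against a scratch directory, and they are not part of the Voyager SDK.

```shell
# Illustrative helpers. SYSFS_PCI and SUDO are hypothetical overrides that
# exist only so the functions can be exercised without real hardware.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"
SUDO="${SUDO-sudo}"

# is_bdf: succeed if $1 looks like a full PCI address, e.g. 0000:01:00.0.
is_bdf() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$'
}

# remove_and_rescan: remove the device at address $1, then rescan the bus.
remove_and_rescan() {
    is_bdf "$1" || { echo "not a PCI address: $1" >&2; return 1; }
    [ -e "$SYSFS_PCI/devices/$1/remove" ] ||
        { echo "no such device: $1" >&2; return 1; }
    echo 1 | $SUDO tee "$SYSFS_PCI/devices/$1/remove" > /dev/null || return 1
    echo 1 | $SUDO tee "$SYSFS_PCI/rescan" > /dev/null
}
```

For example: remove_and_rescan 0000:01:00.0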

Disable PCIe ASPM

Your system may put PCIe into a power-saving mode. To disable PCIe Active State Power Management (ASPM), edit the grub file:

sudo nano /etc/default/grub

Add the pcie_aspm=off parameter:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"

Save the file, then update grub and reboot:

sudo update-grub
sudo reboot
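
After the reboot, you can confirm that the parameter reached the kernel by inspecting /proc/cmdline. A minimal sketch, assuming a helper named has_kernel_param (an illustrative name, not an existing tool):

```shell
# has_kernel_param: succeed if parameter $1 appears as a whole word on the
# kernel command line. $2 is the file to read (defaults to /proc/cmdline).
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example: has_kernel_param pcie_aspm=off && echo "ASPM is disabled on the kernel command line"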

Disable d3cold_allowed

On some hosts you can control the 3.3 V supply to the PCIe slots with d3cold_allowed. When it is set to 1, the host supplies 3.3 V only if the link is up during enumeration, and removes it if the link goes down. Disabling this can help.

Find the bridge address with lspci and lspci -tv, then run the following, replacing xx and yy with the bus and device number of the empty-port bridge shown in lspci -tv:

echo 0 | sudo tee /sys/bus/pci/devices/0000:xx:yy.0/d3cold_allowed
echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null
lspci -tv
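
To see the current d3cold_allowed value of every PCI device before changing anything, a short loop over sysfs is enough. The sketch below assumes a helper named list_d3cold (an illustrative name); the optional directory argument exists only so it can be pointed at a test directory.

```shell
# list_d3cold: print "<address> <value>" for every PCI device that exposes
# d3cold_allowed. $1 is the devices directory (defaults to the real sysfs one).
list_d3cold() {
    dir="${1:-/sys/bus/pci/devices}"
    for f in "$dir"/*/d3cold_allowed; do
        [ -r "$f" ] || continue
        dev=${f%/d3cold_allowed}
        printf '%s %s\n' "${dev##*/}" "$(cat "$f")"
    done
}
```

For example: list_d3cold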

"Failed with return code -1400" when running inference

If you see this error when running inference.py:

[ERROR][axeWaitForCommandList]: Uio wait kernel failed with return code -1400.

From your voyager-sdk folder, activate the virtual environment (source venv/bin/activate), then run:

axdevice --refresh

Metis not detected after startup but detected after a reboot

This may be because the PCIe rescan service (pcie-check.service) is not enabled at boot. Enable it:

sudo systemctl enable pcie-check.service

The service will then run on all future boots.

Enable PCIe Shutdown Mode in the BIOS

Check whether your BIOS offers any of the following options, and enable whichever one it has:

  • "PCIe Slot Power Control"
  • "PCIe Power-On Reset"
  • "PCIe Cold Reset"
  • "PCIe Shutdown Mode"

PCIe Shutdown Mode (cold reset) is a feature on some boards that cuts power entirely to a PCIe slot during a reboot or reset. We have seen hosts where enabling it helped.

Metis shows up as Synopsys Silicon IP

Remove and rescan the PCIe device, as described above.

Kernel module: build, load, and versioning

Loading the driver with Secure Boot

Check whether Secure Boot is enabled:

sudo mokutil --sb-state
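
mokutil prints either "SecureBoot enabled" or "SecureBoot disabled". If you want to use the result in a script, a minimal sketch is shown here; the function name secure_boot_enabled is illustrative:

```shell
# secure_boot_enabled: succeed if the mokutil --sb-state output on stdin
# reports that Secure Boot is enabled.
secure_boot_enabled() {
    grep -q '^SecureBoot enabled'
}
```

For example: sudo mokutil --sb-state | secure_boot_enabled && echo "Secure Boot is on"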

If Secure Boot is enabled but the Metis kernel module has not been signed, you will see this error when loading the module:

$ sudo modprobe metis
modprobe: ERROR: could not insert 'metis': Key was rejected by service

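You have two options: disable Secure Boot in your system's UEFI firmware settings, or sign the Metis kernel module with a Machine Owner Key (MOK) enrolled via mokutil.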

Check the driver is loaded

Check your driver version:

cat /sys/class/metis/version

It displays the driver version, for example:

0.07.16

Share this version, together with your Voyager SDK version, with the Axelera AI Support Team.

You can also check whether the kernel module is loaded:

lsmod | grep metis
metis                  90112  0
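
A plain grep also matches any module whose name merely contains "metis", or that lists metis in its "Used by" column. For scripted checks, comparing the first column exactly is more reliable. A minimal sketch, with module_loaded as an illustrative name:

```shell
# module_loaded: succeed if module $1 is listed in the lsmod output on stdin.
# Only the first column (the module name) is compared, and it must match exactly.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}
```

For example: lsmod | module_loaded metis || echo "metis driver is not loaded"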

Remove an old driver and install the updated driver

Check your driver version:

cat /sys/class/metis/version
0.07.16

If an old driver version is shown, remove it:

sudo modprobe -r metis

Running the Voyager SDK installer installs the latest driver for your Voyager SDK version:

./install.sh --all --YES

Generate dmesg and look for axl and pci messages

Run sudo dmesg -T to print the kernel message buffer with human-readable timestamps. This is useful for debugging, especially the axl and pci entries. Save the output to a text or log file and share it with the Axelera AI Support Team.
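
When reading the log yourself, it can help to narrow it down to the relevant entries. The sketch below assumes a case-insensitive match on "axl" and "pci" is a reasonable filter; the function name and the output file name in the example are arbitrary. When contacting support, the full unfiltered log remains the more useful attachment.

```shell
# filter_axl_pci: keep only kernel log lines that mention axl or pci
# (case-insensitive). Reads dmesg output on stdin.
filter_axl_pci() {
    grep -iE 'axl|pci'
}
```

For example: sudo dmesg -T | filter_axl_pci > metis-dmesg.log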

Metis device enumeration and configuration

Disable IOMMU

Some Linux kernels (up to 5.15) have an IOMMU-related bug. On hosts running these kernels, disable the IOMMU by adding the intel_iommu=off (Intel) or amd_iommu=off (AMD) kernel parameter at boot.

Kernel 6.1 and later

From Linux kernel 6.1, this bug is fixed, so this workaround should no longer be required.
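
To decide whether the workaround applies to your host, compare the running kernel version against 6.1. A minimal sketch, assuming uname -r reports a version that starts with major.minor (as it does on mainstream distributions); the function name kernel_at_least is illustrative:

```shell
# kernel_at_least: succeed if kernel version $2 (default: the running kernel)
# is at least major.minor version $1, e.g. kernel_at_least 6.1.
kernel_at_least() {
    want_major=${1%%.*}
    want_minor=${1#*.}
    have=${2:-$(uname -r)}
    have_major=${have%%.*}
    rest=${have#*.}
    have_minor=${rest%%[!0-9]*}   # drop everything after the minor number
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}
```

For example: kernel_at_least 6.1 || echo "Kernel older than 6.1: consider the IOMMU workaround"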

On Ubuntu, edit the grub file:

sudo nano /etc/default/grub

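Add the parameter that matches your CPU. Set GRUB_CMDLINE_LINUX_DEFAULT only once; if the file contains both lines, the second overrides the first.

On an Intel host:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"

On an AMD host:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"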

Save the file, then update grub and reboot:

sudo update-grub
sudo reboot

Check the firmware version

From your voyager-sdk folder, activate the virtual environment (source venv/bin/activate), then run:

triton_multi_ctx --fwver

This displays the Metis firmware version loaded on your card, for example:

Firmware version: v1.2.5

Share this version, together with your Voyager SDK version, with the Axelera AI Support Team.

Check the board controller version

From your voyager-sdk folder, activate the virtual environment (source venv/bin/activate), then run:

triton_multi_ctx --bc-version

This displays the board controller version, for example:

zephyr version: baf12c979030
app name: board_bringup
app version: v1.0

Share this version, together with your Voyager SDK version, with the Axelera AI Support Team.

Cold boot via PCIe

If you have a PCIe connection to the Metis card but face communication issues, a cold boot via PCIe may help. From your voyager-sdk folder, activate the virtual environment (source venv/bin/activate), then run:

triton_multi_ctx --cold-boot 3

Check PCIe speed and lanes

Display the port of your Metis device:

lspci

Here the Metis device is at 0000:01:00.0. Inspect the link:

sudo lspci -s 0000:01:00.0 -vvv

The output shows the PCIe link established with the Metis card:

  • MSI information
  • Speed of the PCIe link (ideally Gen3, but it can work at lower speeds)
  • Number of PCIe lanes (ideally x4, but it works with a single lane)

Example output:

Capabilities: [50] MSI: Enable+ Count=32/32 Maskable+ 64bit+
Address: 00000000fee00818 Data: 0000
Masking: 00000000 Pending: 00000000

LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <4us, L1 <16us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+

In this example you can see 32 MSI vectors (Count=32/32), Gen3 speed (8GT/s), and 4 lanes (x4). Share this information with the Axelera AI Support Team.
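
The LnkCap line reports what the port supports, while the LnkSta line in the same lspci output reports the speed and width that were actually negotiated, so it is worth comparing the two. For reference, 2.5GT/s corresponds to Gen1, 5GT/s to Gen2, and 8GT/s to Gen3. The sketch below extracts both values from either line; the function name link_info is illustrative.

```shell
# link_info: print "<speed> <width>" from the lspci -vvv output on stdin.
# $1 selects the line: LnkCap (supported) or LnkSta (negotiated).
link_info() {
    sed -n "s/.*$1:.*Speed \([0-9.]*GT\/s\).*Width \(x[0-9]*\).*/\1 \2/p"
}
```

For example: sudo lspci -s 0000:01:00.0 -vvv | link_info LnkSta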

Known issues on specific host platforms

Memory overlays on your SOC

Symptoms

  • The board has a Rockchip chip or another SOC such as NXP or Broadcom.
  • A Metis M.2 card is installed and visible via lspci.
  • The driver installs successfully but the device does not work.

Identification

Obtain dmesg with sudo dmesg. You are affected if you find:

axl xxxx:xx:xx.x: Failed to request resources
axl: probe of xxxx:xx:xx.x failed with error -12

You may also see [disabled] messages in lspci for the RC (Root Complex) behind Metis:

lspci -s 0000:00:00.0 -vv
0000:00:00.0 PCI bridge: Rockchip Electronics Co., Ltd RK3588 (rev 01) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 253
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
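
To check a bridge for this symptom without reading the whole listing, you can count the disabled windows. A minimal sketch, with disabled_windows as an illustrative name; a non-zero count on the root complex port in front of Metis matches the output above:

```shell
# disabled_windows: count the "behind bridge: [disabled]" lines in the
# lspci -vv output on stdin and print the count.
disabled_windows() {
    grep -c 'behind bridge: \[disabled\]'
}
```

For example: lspci -s 0000:00:00.0 -vv | disabled_windows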

Explanation

The driver probe fails because the PCIe root complex port behind Metis does not have enough space for the Metis device's non-prefetchable memory. The default SOC device tree assigns less than 33 MB for non-prefetchable memory windows per PCIe root complex.

Solution

Increase this size in a custom device tree, or insert a PCIe overlay. Contact Axelera AI for further help, telling us which SOC your board uses.

"Failed to request resources" on ASUS Prime X570-P

You may have communication issues with the Metis card on the ASUS Prime X570-P board. This is specific to this board and can be solved.

Obtain dmesg with sudo dmesg. You are affected if you find:

axl xxxx:xx:xx.x: Failed to request resources

The ASUS Prime X570-P (AMD X570 chipset) has two PCIe x16 slots. These slots share PCIe lanes and are not designed to operate independently unless used for dual-GPU configurations. According to the motherboard manual:

  • If a GPU is installed in the first PCIe x16 slot, the second slot must either remain empty or contain a second GPU.
  • Non-GPU cards (such as accelerators or network adapters) in the second slot are detected by lspci, but memory access is disabled, rendering them unusable.

Solution

  1. Slot configuration:
    • Install the Metis card (accelerator) in the primary PCIe x16 slot.
    • Move the VGA/GPU card to the secondary PCIe x16 slot.
  2. BIOS settings — adjust the following for stability and correct device recognition:
    • CPU PCIE ASPM Mode: Auto → Disabled
    • PCIEx16_1 Bandwidth Bifurcation Configuration: Auto → x8/x8
    • PCIE Above 4G Decoding: Disabled → Enabled
    • PCIE Resize BAR Support: Disabled → Auto
    • PCIE SR-IOV Support: Auto → Disabled

With these BIOS settings applied, no additional kernel boot parameters should be required.