axelera.runtime API
axelera.runtime is the low-level Python API for loading compiled models and running them directly on the AIPU. It gives you fine-grained control over device selection, buffer management, and multi-instance execution.
For most applications, the InferenceStream API is simpler and handles all of this automatically. Use axelera.runtime when you need direct tensor-level access to the hardware.
import axelera.runtime as axr
Typical usage
import numpy as np
import axelera.runtime as axr
with axr.Context() as ctx:
    devices = ctx.list_devices()
    conn = ctx.device_connect(devices[0])
    model = ctx.load_model("build/yolov8n-coco/yolov8n-coco/1/model.json")
    instance = conn.load_model_instance(model, aipu_cores=4, double_buffer=True)
    # Prepare inputs as numpy arrays
    inputs = [np.zeros(t.unpadded_shape, dtype=np.float32) for t in model.inputs()]
    outputs = [np.zeros(t.shape, dtype=np.int8) for t in model.outputs()]
    instance.run(inputs, outputs)
Context
The root object. Manages all devices and resources. Use as a context manager or call release() explicitly.
ctx = axr.Context()
# or:
with axr.Context() as ctx:
    ...
Methods
| Method | Returns | Description |
|---|---|---|
| list_devices() | list[DeviceInfo] | Enumerate all connected Metis devices |
| device_connect(device, num_sub_devices=1) | Connection | Reserve a device (or sub-devices). Returns a Connection for loading models. |
| load_model(path) | Model | Load a compiled model.json file. The same Model can be loaded onto multiple connections. |
| configure_device(device, **kwargs) | bool | Apply a configuration property to a device. Returns True when complete, False if pending (poll with device_ready(); see the example below). |
| device_ready(device) | bool | Check if a configure_device() call has completed. |
| read_device_configuration(device) | dict[str, str] | Read all current configuration properties. |
| release() | None | Release all objects. Called automatically when used as a context manager. |
configure_device properties
| Property | Default | Description |
|---|---|---|
| clock_profile | 800 | Device clock in MHz |
| clock_profile_core_0 … clock_profile_core_3 | 800 | Per-core clock in MHz |
| mvm_utilisation_core_0 … mvm_utilisation_core_3 | 100 | Per-core MVM utilization limit (%) |
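The configure-then-poll flow works like this; a minimal sketch based on the methods above, where the 400 MHz value is purely illustrative:
import time
import axelera.runtime as axr
with axr.Context() as ctx:
    device = ctx.list_devices()[0]
    # Request a lower clock; False means the change is still being applied
    done = ctx.configure_device(device, clock_profile=400)
    while not done:
        time.sleep(0.1)
        done = ctx.device_ready(device)
    # Confirm which settings are now active
    print(ctx.read_device_configuration(device))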
Connection
A reserved connection to a device (or sub-device group). Created by ctx.device_connect().
Methods
| Method | Returns | Description |
|---|---|---|
| load_model_instance(model, **kwargs) | ModelInstance | Load a compiled model onto this connection for execution |
load_model_instance kwargs
| Property | Default | Description |
|---|---|---|
| aipu_cores | 0 | Number of AIPU cores / L2 resources. Set to the model's batch size. |
| num_sub_devices | 0 | Number of sub-devices. Set to the model's batch size (see the sketch after this table). |
| double_buffer | 0 | Enable double-buffering for higher throughput |
| input_dmabuf | 0 | Inputs are DMA buffer file descriptors instead of numpy arrays |
| output_dmabuf | 0 | Outputs are DMA buffer file descriptors instead of numpy arrays |
| device_profiling | 0 | Enable device-side profiling |
| host_profiling | 0 | Enable host-side profiling |
| elf_in_ddr | 1 | Set to 1 if the model was compiled with elf_in_ddr=True (the compiler default) |
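As a sketch of the sub-device path (the model path is illustrative, and passing num_sub_devices consistently to device_connect() and load_model_instance() is an assumption drawn from the tables above):
import axelera.runtime as axr
with axr.Context() as ctx:
    device = ctx.list_devices()[0]
    # Reserve one of the four AIPU cores as a sub-device
    conn = ctx.device_connect(device, num_sub_devices=1)
    model = ctx.load_model("build/mymodel/mymodel/1/model.json")  # illustrative path
    # Mirror the reservation when loading a batch-1 model
    instance = conn.load_model_instance(model, num_sub_devices=1)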
Model
Represents a loaded compiled model. Created by ctx.load_model().
Methods
| Method | Returns | Description |
|---|---|---|
| inputs() | list[TensorInfo] | Input tensor metadata |
| outputs() | list[TensorInfo] | Output tensor metadata |
Properties
| Property | Description |
|---|---|
| preamble_graph | Path to preamble ONNX file (host-executed prefix operations, if any) |
| postamble_graph | Path to postamble ONNX file (host-executed suffix operations, if any) |
| input_tensor_layout | Always NHWC in this version |
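To see what a loaded model expects, dump its tensor metadata and check for host-side ONNX graphs; a short sketch using the TensorInfo fields documented below:
import axelera.runtime as axr
with axr.Context() as ctx:
    model = ctx.load_model("build/yolov8n-coco/yolov8n-coco/1/model.json")
    for t in model.inputs():
        print("in ", t.name, t.shape, t.unpadded_shape, t.dtype, t.scale, t.zero_point)
    for t in model.outputs():
        print("out", t.name, t.shape, t.dtype, t.scale, t.zero_point)
    # Host-executed ONNX prefix/suffix graphs, if the compiler emitted any
    print(model.preamble_graph, model.postamble_graph, model.input_tensor_layout)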
ModelInstance
A model loaded onto a specific device connection. Created by conn.load_model_instance().
Methods
| Method | Returns | Description |
|---|---|---|
| run(inputs, outputs) | None | Execute one inference step |
inputs and outputs are lists of numpy arrays (or DMA buffer file descriptors when the corresponding dmabuf mode is enabled). Shapes must match the metadata returned by model.inputs() / model.outputs(). run() raises an exception on failure.
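A minimal inference loop that allocates buffers once and reuses them across calls; a sketch following the buffer shapes from the Typical usage example above, with frame acquisition and postprocessing left as placeholders:
import numpy as np
import axelera.runtime as axr
with axr.Context() as ctx:
    conn = ctx.device_connect(ctx.list_devices()[0])
    model = ctx.load_model("build/yolov8n-coco/yolov8n-coco/1/model.json")
    instance = conn.load_model_instance(model, aipu_cores=4, double_buffer=True)
    # Allocate once; run() writes results into the output buffers
    inputs = [np.zeros(t.unpadded_shape, dtype=np.float32) for t in model.inputs()]
    outputs = [np.zeros(t.shape, dtype=np.int8) for t in model.outputs()]
    for _ in range(100):
        # ... fill `inputs` with the next preprocessed frame here ...
        instance.run(inputs, outputs)
        # ... postprocess `outputs` here ...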
TensorInfo
Describes one input or output tensor, including quantization and padding metadata.
| Field | Type | Description |
|---|---|---|
| shape | tuple | Full tensor shape (including padding) |
| unpadded_shape | tuple | Shape without padding |
| dtype | np.dtype | Data type (default: np.int8) |
| name | str | Tensor name |
| padding | list of (start, end) tuples | Padding per dimension (numpy.pad format) |
| scale | float | Quantization scale |
| zero_point | int | Quantization zero-point |
| size | int | Size in bytes |
Quantize an input:
t = model.inputs()[0]
src = np.zeros(t.unpadded_shape, dtype=np.float32)
quant = np.round((src / t.scale) + t.zero_point).clip(-128, 127).astype(np.int8)
padded = np.pad(quant, t.padding, constant_values=t.zero_point)
Dequantize an output:
t = model.outputs()[0]
out = np.zeros(t.shape, dtype=np.int8)
# (populate out from instance.run())
depadded = out[tuple(slice(b, -e if e else None) for b, e in t.padding)]
dequant = (depadded.astype(np.float32) - t.zero_point) * t.scale
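Putting the two snippets together with run() gives a complete round trip for a single input and output tensor; a sketch, assuming the instance accepts the padded int8 buffers directly:
import numpy as np
import axelera.runtime as axr
with axr.Context() as ctx:
    conn = ctx.device_connect(ctx.list_devices()[0])
    model = ctx.load_model("build/yolov8n-coco/yolov8n-coco/1/model.json")
    instance = conn.load_model_instance(model, aipu_cores=4)
    tin, tout = model.inputs()[0], model.outputs()[0]
    src = np.random.rand(*tin.unpadded_shape).astype(np.float32)  # placeholder input
    quant = np.round((src / tin.scale) + tin.zero_point).clip(-128, 127).astype(np.int8)
    padded = np.pad(quant, tin.padding, constant_values=tin.zero_point)
    out = np.zeros(tout.shape, dtype=np.int8)
    instance.run([padded], [out])
    depadded = out[tuple(slice(b, -e if e else None) for b, e in tout.padding)]
    dequant = (depadded.astype(np.float32) - tout.zero_point) * tout.scale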
DeviceInfo
Returned by ctx.list_devices().
| Field | Type | Description |
|---|---|---|
| name | str | Device name, e.g. 'metis-0:3:0' |
| subdevice_count | int | Number of AIPU cores (4 for Metis) |
| board_type | BoardType | pcie, m2, devboard, sbc, etc. |
| firmware_version | str | Running firmware version |
| flashed_firmware_version | str | Version stored in flash |
| board_controller_firmware_version | str | Board controller firmware version |
| board_revision | int | Hardware board revision number |
| board_controller_board_type | str | Board controller's reported board type |
| max_memory | int | Maximum device memory in bytes. (Not populated in current implementation — always 0) |
| in_use | bool | Whether the device is currently reserved by a process. (Not populated in current implementation) |
| in_use_by | str | Name of the process holding the device. (Not populated in current implementation) |
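For example, to print a one-line summary of every connected board using the fields above:
import axelera.runtime as axr
with axr.Context() as ctx:
    for d in ctx.list_devices():
        print(f"{d.name}: {d.subdevice_count} cores, {d.board_type}, "
              f"firmware {d.firmware_version} (flashed {d.flashed_firmware_version})")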
Exceptions
| Exception | Description |
|---|---|
| axr.ConnectionError | Failed to connect to device |
| axr.DeviceInUse | Device is reserved by another process |
| axr.IncompatibleDevice | Model and device are incompatible |
| axr.InvalidArgument | Invalid parameter passed to an API call |
| axr.InvalidConfiguration | Device configuration is invalid |
| axr.InternalError | SDK internal error |
| axr.Pending | Async operation not yet complete |
| axr.UnknownError | Unclassified error |
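A typical guard around device acquisition, using the exception classes above (a sketch; the recovery actions are up to the application):
import axelera.runtime as axr
ctx = axr.Context()
try:
    device = ctx.list_devices()[0]
    conn = ctx.device_connect(device)
except axr.DeviceInUse:
    print("Device is already reserved by another process")
except axr.ConnectionError:
    print("Failed to connect to the device")
finally:
    ctx.release()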
See also
- InferenceStream API — higher-level Python API (recommended for most applications)
- axrunmodel — CLI tool built on the same runtime
- Model Formats — the model.json file this API loads