Basic Usage Tutorial: axelera.runtime Low-Level API
Introduction
This tutorial explains the building blocks of axelera.runtime - the objects you'll work with, how to structure your application for performance, and why things work the way they do. After completing this tutorial, you'll understand how to build robust, high-performance applications using the low-level API.
New to the API? Start with the Quick Start guide first, then return here to understand the "why" behind the patterns.
What you'll learn:
- The API object hierarchy and how objects relate to each other
- Why worker threads are essential for performance
- How to allocate AIPU cores efficiently (batch size, num_instances)
- The required Axelera-specific transformations (padding, quantization, depadding, dequantization)
- Resource management patterns to avoid leaks
Examples used in this tutorial:
- axruntime_quickstart.py - Minimal ImageNet classification showing the core pattern
- axruntime_yolo11.py - Production-ready YOLO detection with prefill/drain pattern
Foundational Concepts - Understanding the API structure
- Core API Concepts - Object hierarchy and resource ownership
- Understanding Padding and Quantization - Required transformations
Step-by-Step Walkthrough - Building an application
- Walkthrough: Basic Usage Example - 7 steps from initialization to cleanup
Performance & Best Practices
- Batch Size and Core Allocation - Maximizing hardware utilization
- Resource Management Best Practices - Avoiding leaks and crashes
Next Steps
- Next Steps - Advanced guides (double buffering, cascaded pipelines, etc.)
Prerequisites
- Completed the Quick Start guide
- SDK installed and virtual environment activated (source venv/bin/activate)
- Compiled model available (e.g., build/resnet50-imagenet-onnx/model.json)
  - You can download a pre-compiled model with axdownloadmodel resnet50-imagenet-onnx
Core API Concepts
Object Hierarchy
The axelera.runtime API follows a clear hierarchy:
Context (root object, manages all resources)
├── Model (loaded model metadata and weights)
└── Connection (reserved hardware resources)
    └── ModelInstance (executable model on hardware)
Key points:
- Context is the entry point - create it first
- Model contains the compiled model - can be reused for multiple instances
- Connection reserves AIPU cores on an Axelera device - one per ModelInstance
- ModelInstance is ready for inference - one per thread
Resource Ownership
The Axelera resource hierarchy follows a parent-child model:
- Parent releases children: Releasing a parent automatically releases all child resources
- Context controls lifecycle: Context manages when all Axelera resources are freed
- Explicit control: You decide when to create and release resources
Good pattern (context manager):
with Context() as ctx:
    model = ctx.load_model("model.json")
    # ... use model ...
# Context and all children automatically released here
Also valid (explicit release):
ctx = Context()
model = ctx.load_model("model.json")
# ... use model ...
ctx.release() # Releases Context, Model, and all other children
Foundational Concepts
Before diving into the walkthrough, understand these two critical concepts that affect how you structure your application.
Understanding Padding and Quantization
The Axelera Metis AIPU performs inference using INT8 quantized models with hardware-aligned tensors. When using the low-level API, you must handle these transformations manually (higher-level APIs do this automatically).
The Pipeline
Float32 Input → Pre-processing → Quantize → INT8 → Pad → INT8 Padded Input
↓
Inference (AIPU)
↓
Float32 Output ← Post-processing ← Dequantize ← Depad ← INT8 Padded Output
Here, pre-processing and post-processing include the normal pre- or post-processing you would do for this model on any platform, in addition to any operations contained in the pre- or post-amble ONNX graph.
Axelera-Required Processing:
- Input: Quantize (FP32 → INT8) then Pad (add channel alignment)
- Output: Depad (remove channel alignment) then Dequantize (INT8 → FP32)
Formulas
Input preprocessing:
# 1. Quantize: float32 → int8
quantized = np.round(normalized / input_info.scale + input_info.zero_point)
quantized = quantized.clip(-128, 127).astype(np.int8)
# Note: Normalization (e.g., /255.0) and quantization can often be fused into a single operation for efficiency
# 2. Pad: Add hardware alignment (skip batch dimension)
padded = np.pad(quantized, input_info.padding[1:],
                mode='constant', constant_values=input_info.zero_point)
Output postprocessing:
# 1. Depad: Remove alignment padding
depadded = output[tuple(slice(b, -e if e else None)
                        for b, e in output_info.padding)]
# 2. Dequantize: int8 → float32
result = (depadded.astype(np.float32) - output_info.zero_point) * output_info.scale
Helper functions: See axelera_quantize(), axelera_pad(), axelera_depad(), axelera_dequantize() in yolo_utils.py.
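The formulas above can be checked end to end with a small round trip. This is a minimal sketch, not the helper functions from yolo_utils.py: the `info` object is a hypothetical stand-in for a TensorInfo, and its scale, zero point, and padding values are made up for illustration.

```python
from types import SimpleNamespace

import numpy as np

# Hypothetical stand-in for a TensorInfo; real values come from model.inputs()[0].
info = SimpleNamespace(
    scale=1.0 / 255.0,
    zero_point=-128,
    padding=[(0, 0), (0, 0), (0, 0), (0, 1)],  # NHWC, one extra channel
)


def quantize_and_pad(normalized, info):
    """float32 HWC image -> padded int8 HWC buffer (no batch dimension)."""
    quantized = np.round(normalized / info.scale + info.zero_point)
    quantized = quantized.clip(-128, 127).astype(np.int8)
    return np.pad(quantized, info.padding[1:], mode="constant",
                  constant_values=info.zero_point)


def depad_and_dequantize(output, info):
    """Padded int8 NHWC tensor -> float32 tensor without padding."""
    depadded = output[tuple(slice(b, -e if e else None) for b, e in info.padding)]
    return (depadded.astype(np.float32) - info.zero_point) * info.scale


rng = np.random.default_rng(0)
image = rng.random((4, 4, 3), dtype=np.float32)    # already normalized to [0, 1)
padded = quantize_and_pad(image, info)              # shape (4, 4, 4), dtype int8
restored = depad_and_dequantize(padded[np.newaxis], info)[0]
max_error = float(np.abs(restored - image).max())   # rounding error, about scale / 2 at most
```

The padded channel is filled with the zero point, and the round trip loses at most about half a quantization step per value, which is the expected cost of rounding to INT8.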
Batch Size, AIPU Cores, and num_instances
Understanding how batch size relates to hardware utilization is crucial for performance.
Key facts:
- One Metis device has 4 AIPU cores
- Batch size is fixed at compile time (check model.inputs()[0].shape[0])
- A model with batch_size=N uses N cores per inference
The formula: num_instances = 4 cores // batch_size
Examples:
| Batch Size | Num Instances | Cores Used | Behavior |
|---|---|---|---|
| 1 | 4 | 4 | Process images individually across 4 parallel instances |
| 2 | 2 | 4 | Process 2 images at a time across 2 parallel instances |
| 4 | 1 | 4 | Process 4 images together in a single batched inference |
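The table rows follow directly from the formula. The sketch below reproduces them; `plan_instances` is a hypothetical helper name used only for illustration, and the 4-core figure is the one stated above for a single Metis device.

```python
AIPU_CORES_PER_DEVICE = 4  # one Metis device, as stated above


def plan_instances(batch_size, aipu_cores=AIPU_CORES_PER_DEVICE):
    """Return (num_instances, cores_used) for a model compiled with batch_size."""
    if batch_size < 1 or batch_size > aipu_cores:
        raise ValueError(f"batch_size must be between 1 and {aipu_cores}")
    num_instances = aipu_cores // batch_size
    return num_instances, num_instances * batch_size


plans = {b: plan_instances(b) for b in (1, 2, 4)}
for batch_size, (num_instances, cores_used) in plans.items():
    print(f"batch_size={batch_size}: {num_instances} instance(s), {cores_used} core(s)")
```

Because the division is integer division, a batch size that does not divide the core count evenly would leave cores unused under this formula.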
Why this matters:
- Each ModelInstance needs its own worker thread
- More instances = more parallelism = better CPU utilization while waiting for AIPU
- See Worker Thread Pattern for details
Example code: See axruntime_yolo11.py:356-384 for the calculation pattern.
Walkthrough: Basic Usage Example
Let's walk through the core API concepts using axruntime_quickstart.py and axruntime_yolo11.py as references. The Quick Start example shows the simplest integration pattern, while the YOLO11 example demonstrates production-ready patterns like prefill/drain pipelining.
Step 1: Creating Context and Loading Model
The first step is always creating a Context and loading your model:
with Context() as ctx:
    LOG.info(f"Loading model from: {model_path}")
    model = ctx.load_model(str(model_path))
What's happening:
- Context() initializes the runtime and prepares to access Axelera hardware
- ctx.load_model() loads a compiled model from a model.json file
- The with statement creates a Python context manager to ensure automatic cleanup when done, a best practice for resource management
Model file structure: Your model directory typically contains:
- model.json - Model metadata and configuration. When loading a model, this is the file path you need to specify.
- manifest.json - Tensor quantization parameters
- Model weights and compiled binary
For a complete initialization function with detailed step-by-step comments, see initialize_model() in axruntime_yolo11.py.
Step 2: Understanding TensorInfo
After loading the model, inspect its input and output tensors:
input_infos = model.inputs()
output_infos = model.outputs()
input_info = input_infos[0] # First input tensor
LOG.info(f"Input shape (with padding): {input_info.shape}")
LOG.info(f"Input shape (without padding): {input_info.unpadded_shape}")
LOG.info(f"Input padding: {input_info.padding}")
LOG.info(f"Input quantization: scale={input_info.scale}, zero_point={input_info.zero_point}")
TensorInfo contains:
| Property | Description | Example |
|---|---|---|
| shape | Full tensor shape including padding | (1, 224, 224, 4) |
| unpadded_shape | Logical shape without padding | (1, 224, 224, 3) |
| padding | Padding per dimension [(start, end), ...] | [(0, 0), (0, 0), (0, 0), (0, 1)] |
| scale | Quantization scale factor | 0.003921 |
| zero_point | Quantization zero point | 0 |
| dtype | Data type (always np.int8 for quantized) | int8 |
| size | Total size in bytes | 150528 |
Why two shapes?
- unpadded_shape: The actual image dimensions (what you resize to)
- shape: The buffer size you allocate (includes hardware alignment padding)
The information in TensorInfo allows you to easily perform the Axelera-specific pre- and post-processing steps described in Understanding Padding and Quantization.
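To make the relationship between the two shapes concrete, the sketch below derives the padded shape from the logical shape and the per-dimension padding. `TensorInfoSketch` is a hypothetical stand-in written for this example, not the runtime's TensorInfo class; it only mirrors the example values from the table.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class TensorInfoSketch:
    """Hypothetical stand-in holding two of the TensorInfo fields from the table."""
    unpadded_shape: tuple
    padding: tuple  # ((start, end), ...) per dimension

    @property
    def shape(self):
        # Padded extent = logical extent + padding on both sides.
        return tuple(n + b + e for n, (b, e) in zip(self.unpadded_shape, self.padding))


info = TensorInfoSketch(
    unpadded_shape=(1, 224, 224, 3),
    padding=((0, 0), (0, 0), (0, 0), (0, 1)),
)
padded_elements = prod(info.shape)             # elements in the buffer you allocate
logical_elements = prod(info.unpadded_shape)   # elements of actual image data
```

With these values, the channel dimension grows from 3 to 4, so a quarter of the allocated buffer is alignment padding.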
Step 3: Connecting to Device
Next, look at the model batch size and create Connections to the hardware:
aipu_cores = 4  # One Metis device has 4 AIPU cores
batch_size = input_info.shape[0]  # From model, fixed at model compile time
num_instances = aipu_cores // batch_size  # Calculate instances
# Create one connection per model instance
connections = []
for i in range(num_instances):
    conn = ctx.device_connect(device=None, num_sub_devices=batch_size)
    connections.append(conn)
Critical concept: Batch size and model instances
The batch size (read from the model info) determines how many ModelInstances we create using the formula num_instances = aipu_cores // batch_size. With batch_size=1 and 4 cores, we create 4 ModelInstances. See Batch Size, AIPU Cores, and num_instances for details on different batch sizes.
Parameters explained:
- device=None: Auto-select an available device
- num_sub_devices: Number of cores to reserve (matches batch_size)
Step 4: Loading ModelInstance
Create a ModelInstance for each Connection:
instances = []
for i, conn in enumerate(connections):
    instance = conn.load_model_instance(
        model,
        num_sub_devices=batch_size,
        aipu_cores=batch_size
    )
    instances.append(instance)
Parameters explained:
- model: The Model object to instantiate
- num_sub_devices: Number of AIPU cores for this instance (matches batch_size)
- aipu_cores: L2 memory allocation (1 core = 25% of device's L2 memory). Should match num_sub_devices.
Key points:
Each ModelInstance is independent and can run inference in parallel, and should only be used by one thread. Each ModelInstance is tied to the Connection that created it.
Step 5: Preparing Buffers and Preprocessing
Buffer allocation:
Pre-allocate buffers for inputs and outputs to avoid allocating memory during inference. Each model instance has its own buffers:
inputs = []
outputs = []
for i in range(num_instances):
    instance_inputs = [np.zeros(info.shape, info.dtype) for info in input_infos]
    inputs.append(instance_inputs)
    instance_outputs = [np.zeros(info.shape, info.dtype) for info in output_infos]
    outputs.append(instance_outputs)
Why separate buffers per instance? Multiple workers run in parallel in different threads, so each needs its own buffers to avoid data races.
Preprocessing:
Preprocessing splits into two parts:
General Preprocessing (model-specific):
- Resize, color conversion, normalization
Axelera-Required Processing (needed for axelera.runtime API):
- Quantization: Convert float32 → int8
- Padding: Add hardware channel alignment
- Preamble graph (if your model has one): Some models have compiler-extracted preprocessing operations
See Quick Start - Preprocessing for the complete pattern, or preprocess_yolo_detection() in yolo_utils.py for a production implementation with YOLO models. For preamble/postamble graphs, see ONNX Postamble Processing.
Step 6: Running Inference with Worker Threads
The examples use a worker thread pattern for maximum performance. The call to instance.run(inputs, outputs) is blocking: it waits until the AIPU finishes inference on the input and the result has been returned over the PCIe bus. In a simple single-threaded design, only one ModelInstance (and thus only the cores reserved for it) is ever in use at a time, which significantly limits throughput. We therefore recommend a multi-threaded design so that the OS scheduler can execute other tasks (including inference on other worker threads) while waiting for results.
Worker thread implementation:
The InferenceWorker class creates a dedicated thread for each ModelInstance:
- Each worker has its own input and output queues
- The worker's run() method loops continuously, calling instance.run() (the blocking inference call)
- While one worker waits for inference, the OS scheduler can run other workers
See Quick Start - Run Inference for the complete run_inference() implementation.
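The structure described above can be sketched without hardware. This is not the InferenceWorker class from the examples: `FakeInstance` and `WorkerSketch` are hypothetical names, and the blocking inference call is simulated with a short sleep so the threading behavior can be observed on any machine.

```python
import queue
import threading
import time


class FakeInstance:
    """Hypothetical stand-in for a ModelInstance; run() blocks like a real inference."""

    def run(self, inputs, outputs):
        time.sleep(0.05)            # simulate waiting for the device
        outputs[0] = inputs[0] * 2  # fake "result"


class WorkerSketch(threading.Thread):
    """One thread per instance, with its own input and output queues."""

    def __init__(self, instance):
        super().__init__(daemon=True)
        self.instance = instance
        self.in_q = queue.Queue()
        self.out_q = queue.Queue()

    def run(self):
        while True:
            item = self.in_q.get()
            if item is None:        # sentinel: shut down
                break
            outputs = [None]
            self.instance.run([item], outputs)  # blocking call
            self.out_q.put(outputs[0])

    def shutdown(self):
        self.in_q.put(None)


workers = [WorkerSketch(FakeInstance()) for _ in range(4)]
for w in workers:
    w.start()

for i, w in enumerate(workers):
    w.in_q.put(i)                   # all four "inferences" are now in flight
results = [w.out_q.get(timeout=5.0) for w in workers]

for w in workers:
    w.shutdown()
for w in workers:
    w.join(timeout=5.0)
```

Because each worker blocks in its own thread, the four simulated inferences overlap instead of running back to back, which is the effect the multi-threaded design is meant to achieve.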
Pipeline execution patterns:
Simple round-robin (used in Quick Start): Each frame is submitted to the next available worker and results are collected immediately. Simple but not maximum throughput.
Prefill/drain pattern (used in YOLO11 example): For maximum throughput:
- Prefill: Submit N frames to all workers before collecting any results
- Steady state: For each new frame, collect one result then submit the new frame
- Drain: After all frames submitted, collect the remaining N results
This overlaps preprocessing (CPU) with inference (AIPU) for 2-3x throughput improvement. See run_realtime_inference() in axruntime_yolo11.py for the complete implementation with detailed comments.
Why round-robin collection preserves order: Each worker only processes its assigned frames (Worker 0: frames 0,4,8... Worker 1: frames 1,5,9...), so polling workers in order (0→1→2→3→0...) reconstructs the original submission order.
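The prefill, steady-state, and drain phases, together with the ordering argument, can be sketched with stand-in workers. This is an illustration under assumptions, not run_realtime_inference() from axruntime_yolo11.py: the "inference" just squares the frame index, `run_prefill_drain` is a hypothetical name, and N is taken to be one in-flight frame per worker.

```python
import queue
import threading


def worker_loop(in_q, out_q):
    """Stand-in for a worker: 'inference' here just squares the frame index."""
    while (frame := in_q.get()) is not None:
        out_q.put((frame, frame * frame))


def run_prefill_drain(frames, num_workers=4):
    in_qs = [queue.Queue() for _ in range(num_workers)]
    out_qs = [queue.Queue() for _ in range(num_workers)]
    threads = [threading.Thread(target=worker_loop, args=(i, o), daemon=True)
               for i, o in zip(in_qs, out_qs)]
    for t in threads:
        t.start()

    results = []
    submitted = collected = 0
    try:
        for frame in frames:
            if submitted - collected >= num_workers:
                # Steady state: collect the oldest result before submitting.
                results.append(out_qs[collected % num_workers].get(timeout=5.0))
                collected += 1
            # Prefill happens implicitly for the first num_workers frames.
            in_qs[submitted % num_workers].put(frame)
            submitted += 1
        # Drain: collect whatever is still in flight.
        while collected < submitted:
            results.append(out_qs[collected % num_workers].get(timeout=5.0))
            collected += 1
    finally:
        for q in in_qs:
            q.put(None)             # sentinel: stop each worker
        for t in threads:
            t.join(timeout=5.0)
    return results


results = run_prefill_drain(range(10))
```

Frame k is always sent to worker k % num_workers and result k is always read from the same worker, so the results come back in submission order without any sorting step.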
Step 7: Postprocessing and Cleanup
Postprocessing:
Postprocessing splits into two parts:
Axelera-Required Processing (needed for axelera.runtime API):
- Depad: Remove hardware channel alignment padding
- Transpose: Convert NHWC → NCHW format
- Dequantize: Convert int8 → float32
- Postamble graph (if your model has one): Run extracted operations via ONNX Runtime or manual implementation
To check if your model has a postamble graph, look for postprocess_graph.onnx in your model directory or check the postamble_graph/postprocess_graph field in manifest.json. We provide code examples for handling all of these Axelera-specific steps in this documentation.
General Postprocessing (same as any AI framework):
- Model-specific operations: NMS for detection, argmax for classification, etc.
See Quick Start - Postprocessing for the complete pattern, or postprocess_yolo_detection() in yolo_utils.py for a production implementation. Our examples provide helper functions for all Axelera-required processing steps.
Note on YOLO11 Postamble Graph: The YOLO11 model has a postamble graph containing the detection head operations (DFL for box regression, sigmoid activation for classification, and box coordinate transformations). The axruntime_yolo11.py example implements these postamble operations manually in yolo_utils.decode_yolo_detections(), while axruntime_yolo11_onnxruntime.py uses ONNX Runtime to execute the postamble graph automatically. Both approaches produce identical results - the manual approach gives more control, while ONNX Runtime is simpler. See the Post-Processing with ONNX Postamble Graphs guide for details.
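The first three Axelera-required steps in this step (depad, transpose, dequantize) can be sketched on a small synthetic tensor, followed by a simple argmax as the model-specific part. This is an illustration only: `out_info` is a hypothetical stand-in for an output TensorInfo with made-up values, and `depad_transpose_dequantize` is not one of the helper functions shipped with the examples.

```python
from types import SimpleNamespace

import numpy as np

# Hypothetical output TensorInfo: NHWC tensor with the channel dimension padded 3 -> 8.
out_info = SimpleNamespace(
    scale=0.5,
    zero_point=-10,
    padding=[(0, 0), (0, 0), (0, 0), (0, 5)],
)


def depad_transpose_dequantize(raw, info):
    """Depad -> transpose NHWC to NCHW -> dequantize to float32."""
    depadded = raw[tuple(slice(b, -e if e else None) for b, e in info.padding)]
    nchw = depadded.transpose(0, 3, 1, 2)
    return (nchw.astype(np.float32) - info.zero_point) * info.scale


raw = np.full((1, 2, 2, 8), out_info.zero_point, dtype=np.int8)  # padded output buffer
raw[0, :, :, 1] = 20                                             # make channel 1 stand out
scores = depad_transpose_dequantize(raw, out_info)               # shape (1, 3, 2, 2)

# General postprocessing: pick the channel with the highest mean score.
best_channel = int(scores.mean(axis=(2, 3)).argmax(axis=1)[0])
```

Depadding before the transpose keeps the slice indices aligned with the padding list, which is expressed in the tensor's original NHWC layout.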
Cleanup:
finally:
    # Shut down workers
    for worker in workers:
        worker.shutdown()
    for worker in workers:
        worker.join(timeout=5.0)
# Context manager automatically releases:
# - ModelInstances
# - Connections
# - Model
# - Context
Using a Python context manager ensures all resources are freed, even if an exception occurs. See Quick Start - Cleanup for the complete implementation.
Resource Management Best Practices
Context Manager Pattern
Recommended: Always use the with statement for Context. This guarantees automatic cleanup, even if exceptions occur.
with Context() as ctx:
    model = ctx.load_model("model.json")
    conn = ctx.device_connect()
    instance = conn.load_model_instance(model)
    # ... use instance ...
# Everything automatically released here
Parent-Child Ownership
Remember the hierarchy:
Context
├── Model ─┐
└── Connection ─┐
└── ModelInstance
Rules:
- Releasing Context releases everything
- Releasing Connection releases ModelInstance
- You can release children independently of the parent
Example:
with Context() as ctx:
    model1 = ctx.load_model("model1.json")
    conn1 = ctx.device_connect()
    instance1 = conn1.load_model_instance(model1)
    # Use instance1...
    # Release instance1 and conn1, but keep Context and model1
    instance1.release()
    conn1.release()
    # Load a different model
    model2 = ctx.load_model("model2.json")
    conn2 = ctx.device_connect()
    instance2 = conn2.load_model_instance(model2)
    # Use instance2...
# Context releases model2, instance2, conn2 automatically
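The ownership rules can also be read as a small tree in which releasing a node first releases its children. The toy model below illustrates only that rule; `ResourceSketch` is a hypothetical class written for this example and says nothing about how the runtime implements release internally.

```python
class ResourceSketch:
    """Toy model of the ownership rules; not the library's implementation."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.released = False
        if parent is not None:
            parent.children.append(self)

    def release(self):
        if self.released:
            return                      # releasing twice is a no-op in this toy model
        for child in list(self.children):
            child.release()             # parent releases its children first
        self.released = True
        if self.parent is not None and self in self.parent.children:
            self.parent.children.remove(self)


ctx = ResourceSketch("Context")
model = ResourceSketch("Model", parent=ctx)
conn = ResourceSketch("Connection", parent=ctx)
instance = ResourceSketch("ModelInstance", parent=conn)

conn.release()                          # also releases instance; ctx and model survive
state_after_conn = {r.name: r.released for r in (ctx, model, conn, instance)}
ctx.release()                           # releases everything that is left
state_after_ctx = {r.name: r.released for r in (ctx, model, conn, instance)}
```

In this model, releasing the Connection takes the ModelInstance with it while the Context and Model stay usable, which mirrors the rules listed above.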
Explicit Release
If you are not using a context manager, always call release():
ctx = Context()
try:
    model = ctx.load_model("model.json")
    # ... use model ...
finally:
    ctx.release()  # Ensures cleanup even if exception occurs
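When several objects need an explicit release(), the standard library's contextlib.ExitStack is one way to get the same guarantee as try/finally. The sketch below is an alternative shown under assumptions: `FakeContext` is a hypothetical stand-in that only exposes release(), and the failure is simulated.

```python
from contextlib import ExitStack


class FakeContext:
    """Hypothetical stand-in exposing release(), like the objects in this tutorial."""

    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


ctx = FakeContext()
try:
    with ExitStack() as stack:
        stack.callback(ctx.release)     # runs on exit, even if the body raises
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass
released_after_error = ctx.released
```

Callbacks registered on an ExitStack run in reverse order of registration, so resources registered later are released before the ones they depend on.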
Next Steps
Now that you understand the basics, explore the task-based guides:
Performance Optimization:
- Double Buffering - DMA pipelining for maximum throughput
Multi-Model Applications:
- Cascaded Pipelines - Running multiple models in sequence
- Multiple Devices - Scaling across multiple Metis devices
Advanced Topics:
- ONNX Postamble Processing - Using compiler-extracted post-processing graphs
- API Reference - Complete API documentation
Questions?
For issues or feedback, visit the Axelera GitHub repository.