Cascaded Inference Pipelines

Overview

This guide explains how to run multiple models in sequence using the low-level axelera.runtime Python API. The example demonstrates a two-stage pipeline: YOLO11s object detection followed by YOLOv8s-pose pose estimation. If you haven't completed the Quick Start yet, start there to understand the basic API usage and core concepts.

Cascaded pipelines require careful AIPU core allocation, coordinate-space management, and an understanding of the low-level resource management APIs.

Prerequisites:

Completed Quick Start and Basic Usage Tutorial
Understand Context, Model, Connection, ModelInstance, and resource allocation
SDK virtual environment activated: source venv/bin/activate
Compiled models:
- Detection model: voyager-sdk/build/yolo11s-coco-onnx/1/model.json
- Pose model: voyager-sdk/build/yolov8spose-coco-onnx/1/model.json
- You can download both precompiled models with: axdownloadmodel yolo11s-coco-onnx yolov8spose-coco-onnx

Complete example: axruntime_cascaded_pipeline.py

Architecture Overview

┌─────────────────────────────────────────────────────┐
│  Input Image (original dimensions, e.g., 1920×1080) │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
    ┌───────────────────────────┐
    │   Stage 1: Detection      │
    │   Model: YOLO11s          │
    │   Cores: 2 (configurable) │
    │   Batch: 1 (configurable) │
    └────────────┬──────────────┘
                 │
                 │ Person boxes (in 640×640 model space)
                 ▼
    ┌───────────────────────────┐
    │  Scale coordinates to     │
    │  original image space     │
    │  (640×640 → 1920×1080)    │
    └────────────┬──────────────┘
                 │
                 │ Scaled boxes in image space
                 ▼
    ┌───────────────────────────┐
    │  Crop person ROIs from    │
    │  original image           │
    │  (e.g., 435×857 crops)    │
    └────────────┬──────────────┘
                 │
                 │ Person ROI images (variable sizes)
                 ▼
    ┌───────────────────────────┐
    │   Stage 2: Pose           │
    │   Model: YOLOv8s-pose     │
    │   Cores: 2 (configurable) │
    │   Preprocess: Stretch     │
    │   ROI → 640×640           │
    └────────────┬──────────────┘
                 │
                 │ Keypoints in 640×640 model space ⚠️
                 ▼
    ┌───────────────────────────┐
    │ ⚠️ CRITICAL STEP:         │
    │  Rescale coordinates      │
    │  640×640 → ROI size       │
    │  (e.g., 640×640 → 435×857)│
    └────────────┬──────────────┘
                 │
                 │ Keypoints in ROI space
                 ▼
    ┌───────────────────────────┐
    │  Apply NMS (filter        │
    │  duplicate detections)    │
    └────────────┬──────────────┘
                 │
                 │ Filtered keypoints
                 ▼
    ┌───────────────────────────┐
    │  Transform to image space │
    │  (ROI coords → full image)│
    └────────────┬──────────────┘
                 │
                 ▼
    ┌───────────────────────────┐
    │   Final Results           │
    │   Person boxes + Poses    │
    │   (all in image space)    │
    └───────────────────────────┘

Using Multiple Models

The basic API components are the same as for a single model: Context, Model, Connection, and ModelInstance. With multiple models, use them as follows:

Context: Only one Context for all models
Model: One Model object per model (2 in this example: YOLO11s and YOLOv8s-pose)
Connection and ModelInstance: Specific to each model instance (see Batch Size and AIPU Cores)

Implementation Pattern

In the example, there is a Python class for each pipeline stage (Stage1Detection and Stage2Pose). These classes own the Connection and ModelInstance objects for each stage loaded onto the AIPU. While this example shows a cascaded pipeline, the API usage is the same for independent models.

Batch Size and AIPU Cores Deep Dive

For the basics of batch size and the formula num_instances = aipu_cores / batch_size, see Basic Usage Tutorial - Batch Size.

Core Allocation Principle

When using multiple models (cascaded pipelines), you must divide the AIPU cores between the models. Each Metis device has 4 cores in total.

Key principle:

Total cores used = Σ over all models of (batch_size × num_instances) ≤ 4

The number of InferenceWorker threads should match the number of ModelInstances in use at once.
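One way to follow that rule is a worker-per-instance pattern. The sketch below uses only the Python standard library; `start_workers`, `stop_workers`, and the `run_inference` callable are hypothetical names standing in for whatever call executes a ModelInstance, and the InferenceWorker class in the example file may be structured differently.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a worker to exit


def start_workers(instances, run_inference, jobs, results):
    """Start one thread per instance; each pulls jobs until it sees _STOP."""
    def worker(instance):
        while True:
            job = jobs.get()
            if job is _STOP:
                break
            job_id, data = job
            # run_inference is a hypothetical callable: (instance, data) -> output
            results.put((job_id, run_inference(instance, data)))

    threads = [threading.Thread(target=worker, args=(inst,), daemon=True)
               for inst in instances]
    for t in threads:
        t.start()
    return threads


def stop_workers(threads, jobs):
    """Send one sentinel per worker, then wait for all of them to finish."""
    for _ in threads:
        jobs.put(_STOP)
    for t in threads:
        t.join()
```

Because the thread count is derived from the instance list, it always equals the number of ModelInstances in use. Results carry a job id because workers may finish out of order.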

Example Allocations

Core allocation should be based on performance testing. In general, heavier models should have more cores allocated.

Example allocations for 4 cores:

Configuration	Stage 1 Cores	Stage 2 Cores	Use Case
Balanced	2	2	Similar computation time per stage
Detection-heavy	3	1	Stage 1 processes full images, Stage 2 processes small ROIs
Pose-heavy	1	3	Few detections per frame, expensive pose estimation
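As a starting point before real profiling, the split can be estimated from per-stage timings. This is a rough sketch, not a tuning method from the SDK: `best_split` and its parameters are hypothetical, it assumes batch_size=1 for both models, and it assumes a stage's time scales as 1/cores, which measurements may contradict.

```python
def best_split(stage1_ms, stage2_ms, total_cores=4):
    """Return (stage1_cores, stage2_cores) minimising the slower stage's estimated time.

    stage1_ms / stage2_ms: measured per-frame time of each stage on ONE core.
    Assumes ideal 1/cores scaling; treat the result as a first guess only.
    """
    # Every split that gives each stage at least one core
    candidates = [(c, total_cores - c) for c in range(1, total_cores)]
    # The pipeline is limited by its slower stage, so minimise that maximum
    return min(candidates, key=lambda s: max(stage1_ms / s[0], stage2_ms / s[1]))


print(best_split(10, 10))  # (2, 2)
print(best_split(30, 8))   # (3, 1)
print(best_split(5, 40))   # (1, 3)
```

Stage 2 time per frame depends on how many ROIs Stage 1 produces, so measure it on representative input rather than on a single crop.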

Multi-Model Examples

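Each consecutive pair of rows below is one two-stage configuration; in every pair the allocated cores add up to 4.

Model	Batch Size	AIPU Cores Allocated	Num Instances	Notes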
Detection	1	2	2	Stage 1: 2 cores
Pose	1	2	2	Stage 2: 2 cores, total = 4
Detection	2	2	1	Stage 1: 2 cores
Pose	2	2	1	Stage 2: 2 cores, total = 4
Detection	1	3	3	Stage 1: 3 cores
Classifier	1	1	1	Stage 2: 1 core, total = 4
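The table rows can be checked against the core budget with a few lines of plain Python. This is a minimal sketch; `cores_used` and `check_budget` are illustrative helpers, not part of axelera.runtime.

```python
TOTAL_CORES = 4  # cores per Metis device, as stated above


def cores_used(allocations):
    """Sum batch_size * num_instances over all models."""
    return sum(batch_size * num_instances for batch_size, num_instances in allocations)


def check_budget(allocations, total_cores=TOTAL_CORES):
    """Return the cores used, or raise if the allocation exceeds the device."""
    used = cores_used(allocations)
    if used > total_cores:
        raise ValueError(f"{used} cores requested, only {total_cores} available")
    return used


# (batch_size, num_instances) pairs taken from the table rows above
print(check_budget([(1, 2), (1, 2)]))  # 4: detection + pose, batch size 1
print(check_budget([(2, 1), (2, 1)]))  # 4: detection + pose, batch size 2
print(check_budget([(1, 3), (1, 1)]))  # 4: detection + classifier
```

Running a check like this before loading any model turns an over-allocation into an immediate, readable error.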

Code Pattern

# The batch size is fixed when the model is compiled (see model.json);
# read it from the first input tensor. This example assumes batch_size == 1.
batch_size = model.inputs()[0].shape[0]

# Cores given to this model. In a two-model cascade on one device,
# the 4 available cores are shared between both models.
aipu_cores = 2

# Each instance occupies batch_size cores
num_instances = aipu_cores // batch_size

# Create connections - one per instance
connections = [
    ctx.device_connect(None, batch_size)
    for _ in range(num_instances)
]

# Load model instances
instances = [
    conn.load_model_instance(
        model,
        num_sub_devices=batch_size,
        aipu_cores=batch_size  # Cores per instance
    )
    for conn in connections
]

Important notes:

num_sub_devices = batch_size: reserves the correct number of sub-devices (a sub-device is another name for a core)
aipu_cores = batch_size: the two must be equal so that L2 resources are allocated correctly
Total cores used by a model = num_instances × batch_size
For each model, follow the same pattern shown in axruntime_yolo11.py
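The pattern above can be wrapped in one helper and called once per stage. This is a sketch: `load_stage` is a hypothetical name, it uses only the calls already shown in the code pattern, and rejecting a core count that is not a multiple of the batch size is a design choice of this sketch, made so that no core is silently left unused.

```python
def load_stage(ctx, model, aipu_cores):
    """Create the connections and model instances for one pipeline stage.

    ctx and model are the Context and Model objects described above.
    Returns (connections, instances).
    """
    batch_size = model.inputs()[0].shape[0]
    if aipu_cores < batch_size or aipu_cores % batch_size != 0:
        raise ValueError(
            f"aipu_cores={aipu_cores} must be a positive multiple of batch_size={batch_size}"
        )
    num_instances = aipu_cores // batch_size

    # One connection per instance, each reserving batch_size sub-devices
    connections = [ctx.device_connect(None, batch_size) for _ in range(num_instances)]
    instances = [
        conn.load_model_instance(model, num_sub_devices=batch_size, aipu_cores=batch_size)
        for conn in connections
    ]
    return connections, instances


# Usage for a 2+2 split (detection_model and pose_model are assumed to be loaded):
#   det_conns, det_instances = load_stage(ctx, detection_model, 2)
#   pose_conns, pose_instances = load_stage(ctx, pose_model, 2)
```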

Critical: Coordinate Space Rescaling

This rescaling step is required for ANY two-stage cascaded application where Stage 2 processes ROIs extracted from Stage 1 detections.

The Problem

When Stage 2 processes cropped ROIs, the model outputs are in the model's input space (e.g., 640×640), not the ROI's original dimensions. Without rescaling, coordinates will be misaligned with the actual image.

Coordinate Flow

1. Stage 1 detects objects in original image (e.g., 1920×1080)
   ↓
2. Extract ROI from original image (e.g., 435×857 person crop)
   ↓
3. Preprocessing STRETCHES ROI to model input size (640×640)
   ↓
4. Model outputs coordinates in 640×640 space ❌ WRONG SPACE!
   ↓
5. MUST RESCALE coordinates back to ROI dimensions (435×857) ✅
   ↓
6. Transform ROI coordinates to full image space
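The flow above can be followed with a single keypoint. The numbers below are illustrative only; the ROI origin (700, 150) is an assumed position inside a 1920×1080 image.

```python
# One keypoint followed through steps 3-6 (all numbers are illustrative).
roi_x1, roi_y1 = 700, 150          # top-left corner of the ROI in the full image
roi_w, roi_h = 435, 857            # ROI size from step 2
model_size = 640                   # model input size from step 3

kp_model = (320.0, 320.0)          # step 4: keypoint at the centre of model space

# Step 5: undo the stretch, separately for each axis
kp_roi = (kp_model[0] * roi_w / model_size, kp_model[1] * roi_h / model_size)

# Step 6: shift by the ROI origin to reach full-image coordinates
kp_image = (kp_roi[0] + roi_x1, kp_roi[1] + roi_y1)

print(kp_roi)    # (217.5, 428.5)
print(kp_image)  # (917.5, 578.5)
```

Skipping step 5 would place the point at (1020, 470) in the image, which is 102.5 pixels off horizontally and 108.5 pixels off vertically for this ROI.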

Implementation

See axruntime_cascaded_pipeline.py:415-446 for full implementation:

# After Stage 2 postprocessing, but before using coordinates
roi_height, roi_width = roi.shape[:2]  # Original ROI dimensions
model_size = 640  # Model input size

# Calculate scaling ratios
ratio_x = model_size / roi_width
ratio_y = model_size / roi_height

# Scale boxes from 640×640 back to ROI dimensions
if len(boxes) > 0:
    boxes[:, [0, 2]] /= ratio_x  # x1, x2
    boxes[:, [1, 3]] /= ratio_y  # y1, y2

# Scale keypoints from 640×640 back to ROI dimensions
if len(keypoints) > 0:
    keypoints[:, :, 0] /= ratio_x  # x coordinates
    keypoints[:, :, 1] /= ratio_y  # y coordinates
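The snippet above leaves coordinates in ROI space. The final transform to full-image space is a shift by the ROI's top-left corner, sketched below with NumPy. `roi_to_image` is a hypothetical helper, and the array layouts (x1, y1, x2, y2 in the first four box columns; x, y in the first two keypoint channels) are assumptions based on the indexing used above.

```python
import numpy as np


def roi_to_image(boxes, keypoints, roi_x1, roi_y1):
    """Shift ROI-space boxes and keypoints by the ROI's top-left corner.

    boxes: (N, >=4) array with x1, y1, x2, y2 in the first four columns.
    keypoints: (N, K, >=2) array with x, y in the first two channels.
    Returns shifted float32 copies; the inputs are left unchanged.
    """
    boxes = np.array(boxes, dtype=np.float32)        # np.array copies by default
    keypoints = np.array(keypoints, dtype=np.float32)
    if boxes.size:
        boxes[:, [0, 2]] += roi_x1  # x1, x2
        boxes[:, [1, 3]] += roi_y1  # y1, y2
    if keypoints.size:
        keypoints[:, :, 0] += roi_x1  # x coordinates
        keypoints[:, :, 1] += roi_y1  # y coordinates
    return boxes, keypoints
```

Use the clipped ROI origin here, that is, the corner actually used for the crop, not the raw Stage 1 box corner.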

Applicable to All Cascaded Applications

This rescaling applies to:

Detection → Pose estimation
Detection → Classification (if classifier returns spatial info)
Detection → Segmentation (masks in ROI space)
Detection → Re-identification (feature maps with spatial coordinates)

Without this rescaling, your Stage 2 outputs will be in the wrong coordinate system and unusable for visualization or downstream processing.

NMS for Duplicate Detection Filtering

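When Stage 2 processes ROIs, you may get duplicate detections for the same object. Apply NMS (Non-Maximum Suppression) to filter overlapping detections based on IoU (Intersection over Union). IoU is unchanged by the per-axis rescaling from model space to ROI space, because every box area and intersection area is multiplied by the same factor, so NMS keeps the same detections whether it runs before or after that rescaling.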

See axruntime_cascaded_pipeline.py:368-413 for full NMS implementation:

# After postprocessing, before coordinate rescaling
if len(boxes) > 0:
    # Sort by confidence
    sorted_indices = np.argsort(boxes[:, 4])[::-1]
    keep_indices = []

    while len(sorted_indices) > 0:
        current = sorted_indices[0]
        keep_indices.append(current)

        if len(sorted_indices) == 1:
            break

        # Compute IoU with remaining boxes
        current_box = boxes[current]
        remaining_boxes = boxes[sorted_indices[1:]]

        # ... IoU calculation (elided): yields `iou`, one value per remaining box ...

        # Keep boxes with IoU below threshold (0.45)
        mask = iou < 0.45
        sorted_indices = sorted_indices[1:][mask]

    # Filter both boxes and keypoints
    boxes = boxes[keep_indices]
    keypoints = keypoints[keep_indices]

When to use NMS:

Multiple overlapping detections on same object
Low confidence threshold produces many candidates
Stage 2 model outputs multiple predictions per ROI
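For reference, a self-contained greedy NMS with the IoU step written out might look like the sketch below. It is not the code from axruntime_cascaded_pipeline.py; `nms` is a hypothetical helper that assumes boxes are stored as x1, y1, x2, y2, score, matching the indexing in the excerpt above.

```python
import numpy as np


def nms(boxes, iou_threshold=0.45):
    """Greedy NMS. boxes: (N, 5) array of x1, y1, x2, y2, score. Returns kept row indices."""
    boxes = np.asarray(boxes, dtype=np.float32)
    if boxes.size == 0:
        return []
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    order = np.argsort(boxes[:, 4])[::-1]  # highest score first
    keep = []
    while order.size > 0:
        current, rest = order[0], order[1:]
        keep.append(int(current))
        # Intersection rectangle between the current box and every remaining box
        iw = np.clip(np.minimum(x2[current], x2[rest]) - np.maximum(x1[current], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[current], y2[rest]) - np.maximum(y1[current], y1[rest]), 0, None)
        inter = iw * ih
        union = areas[current] + areas[rest] - inter
        iou = inter / np.maximum(union, 1e-9)  # guard against zero-area boxes
        # Drop every remaining box that overlaps the current one too much
        order = rest[iou < iou_threshold]
    return keep
```

The returned indices can filter both arrays together, for example `keep = nms(boxes)` followed by `boxes = boxes[keep]` and `keypoints = keypoints[keep]`.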

Best Practices

Cascaded Pipelines

CRITICAL: Always rescale Stage 2 outputs from model space to ROI space
Apply NMS when Stage 2 produces multiple overlapping detections
Validate total cores ≤ 4 before loading models
Start with 2+2 core split and adjust based on profiling
Use batch_size=1 unless you have specific batching requirements
Profile each stage to identify bottlenecks (CPU preprocessing vs inference time)
Consider ROI count - many small ROIs may benefit from more Stage 2 cores
Clip ROI coordinates to image bounds to avoid crashes
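The last item can be sketched as a small cropping helper. `crop_roi` is a hypothetical name; it assumes the image is an H x W x C NumPy array and the box is given as x1, y1, x2, y2 in image space.

```python
import numpy as np


def crop_roi(image, box):
    """Crop box (x1, y1, x2, y2) from an H x W x C image, clipped to the image bounds.

    Returns (roi, (x1, y1)) with the clipped origin, or (None, None) when the
    clipped box is empty.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = (int(round(float(v))) for v in box[:4])
    # Clamp every coordinate into the valid pixel range
    x1, x2 = max(0, min(x1, w)), max(0, min(x2, w))
    y1, y2 = max(0, min(y1, h)), max(0, min(y2, h))
    if x2 <= x1 or y2 <= y1:
        return None, None
    return image[y1:y2, x1:x2], (x1, y1)
```

Returning the clipped origin alongside the crop keeps the later ROI-to-image transform consistent with the pixels that were actually cropped.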

Batch Size Configuration

Check compiled batch size from model.inputs()[0].shape[0]
Calculate num_instances correctly: aipu_cores // batch_size
Set num_sub_devices = batch_size when calling device_connect() and load_model_instance()
Set aipu_cores = batch_size when calling load_model_instance()
Handle batch_size > 1 in preprocessing (repeat or stack images)
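For the last item, one possible approach is to stack preprocessed images and pad the final batch by repetition. This is a sketch under assumptions: `make_batches` is a hypothetical helper, the images are assumed to be already preprocessed to identical shapes, and whether the runtime expects a single stacked array per batch is not confirmed here.

```python
import numpy as np


def make_batches(images, batch_size):
    """Group same-shaped images into arrays of shape (batch_size, ...).

    The last batch is padded by repeating its final image. Returns a list of
    (batch, valid_count) pairs so results for padded slots can be discarded.
    """
    batches = []
    for start in range(0, len(images), batch_size):
        chunk = list(images[start:start + batch_size])
        valid = len(chunk)
        chunk += [chunk[-1]] * (batch_size - valid)  # repeat last image as padding
        batches.append((np.stack(chunk), valid))
    return batches
```

With three ROIs and batch_size=2 this yields two batches; the second has valid_count 1, so only its first output should be kept.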

Related Documentation

Basic Usage Tutorial - Core API concepts and batch size fundamentals
Multiple Devices - Scaling cascaded pipelines across devices
API Reference - Complete API documentation

Last Updated: 2026-02-11
