Double Buffering for Performance (Advanced)

Overview

This guide covers double buffering, an advanced performance optimization for the low-level axelera.runtime Python API. Double buffering is a hardware feature that requires careful buffer management and result tracking. If you haven't completed the Quick Start yet, start there to understand the basic API usage.

Double buffering is a feature in the Metis runtime that enables pipelined inference execution. When enabled, it overlaps data transfer time with computation time for higher throughput.

Prerequisites:

Completed the Quick Start and Basic Usage Tutorial
Understanding of worker threads and inference patterns
Knowledge of threading, queue management, and buffer tracking
SDK virtual environment activated: source venv/bin/activate
Compiled model available (e.g., build/resnet50-imagenet-onnx/model.json)
- You can download a pre-compiled model with axdownloadmodel resnet50-imagenet-onnx

Download example model:

axdownloadmodel resnet50-imagenet-onnx

What is Double Buffering?

When double buffering is enabled:

While the AIPU processes inference N, the host can transfer data for inference N+1
This overlaps data transfer time with computation time
Results are delayed by 2 inference cycles per worker, so 2×N cycles with N round-robin workers

Trade-off: Increased latency (a 2-frame delay per worker) for higher throughput.

Pipeline Comparison

Without double buffering:

run(input0) → result0
run(input1) → result1
run(input2) → result2

With double buffering (single worker):

run(input0) → dummy0
run(input1) → dummy1
run(input2) → result0  # Result delayed by 2 cycles
run(input3) → result1
run(input4) → result2

With double buffering (N workers, round-robin):

# Worker 0: run(input0) → dummy0
# Worker 1: run(input1) → dummy1
# Worker 0: run(input2) → dummy2
# Worker 1: run(input3) → dummy3
# Worker 0: run(input4) → result0  # Delay = 2 × N cycles
# Worker 1: run(input5) → result1

Critical insight: With N workers, you must:

Drop the first 2×N results (dummy outputs)
Push 2×N extra dummy frames at the end to flush the pipeline
Buffer the input→output association to match results with correct inputs
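
The delay pattern above can be reproduced with a small pure-Python simulation. This is only a sketch: FakeWorker and run_round_robin are hypothetical names, not part of axelera.runtime, and each simulated output is simply the name of the frame it belongs to (None stands for a dummy output).

```python
from collections import deque


class FakeWorker:
    """Hypothetical stand-in for one double-buffered worker (not the real API)."""

    DEPTH = 2  # each run() returns the output of the run made 2 calls earlier

    def __init__(self):
        # Pre-filled with None so the first DEPTH calls return dummy outputs.
        self._in_flight = deque([None] * self.DEPTH)

    def run(self, frame):
        self._in_flight.append(frame)
        return self._in_flight.popleft()


def run_round_robin(frames, num_workers):
    """Feed frames to workers in round-robin order and return (raw, valid) outputs."""
    workers = [FakeWorker() for _ in range(num_workers)]
    num_to_drop = 2 * num_workers
    # Dummy frames (None) appended at the end flush the pipeline.
    padded = list(frames) + [None] * num_to_drop
    raw = [workers[i % num_workers].run(f) for i, f in enumerate(padded)]
    # The first 2×N outputs are warm-up dummies and are dropped.
    return raw, raw[num_to_drop:]


raw, valid = run_round_robin(["img0", "img1", "img2"], num_workers=2)
print(raw)    # [None, None, None, None, 'img0', 'img1', 'img2']
print(valid)  # ['img0', 'img1', 'img2']
```

With two workers, four dummy outputs come back before the first real one, and the four trailing dummy frames are what push the last real results out.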

When to Use Double Buffering

Use double buffering when:

Throughput is more important than latency
Processing batch workloads (e.g., video files, image directories)
DMA transfer time is significant compared to inference time
You can tolerate a result delay of 2×N frames (N = number of workers)

Don't use double buffering when:

Latency is more important than throughput
Processing single images or small batches
Avoiding application complexity is more important than maximizing throughput

Implementation

Complete example: axruntime_dblbuff.py

Run the example:

# With double buffering (default)
python examples/axruntime/python/axruntime_dblbuff.py \
    build/resnet50-imagenet-onnx/model.json images/ --repeat 10

# Without double buffering (for comparison)
python examples/axruntime/python/axruntime_dblbuff.py \
    build/resnet50-imagenet-onnx/model.json images/ --repeat 10 --no-double-buffer

Step 1: Enable double buffering when loading model instance

instances = [
    conn.load_model_instance(
        model,
        num_sub_devices=batch_size,
        aipu_cores=batch_size,
        double_buffer=True  # Enable double buffering
    )
    for conn in connections
]

Step 2: Calculate number of dummy results to drop

# With N workers and double buffering, first 2×N results are dummy
num_to_drop = len(workers) * 2 if double_buffer else 0

# Add dummy frames at end to flush pipeline
input_paths += [None] * num_to_drop
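
As a worked example of the arithmetic above, the sketch below computes how many dummy frames a run needs and what share of all submissions they make up. The helper name plan_submissions is hypothetical and exists only for illustration.

```python
def plan_submissions(num_images, num_workers, double_buffer=True):
    """Return (num_to_drop, total_submissions, dummy_share) for a run."""
    num_to_drop = 2 * num_workers if double_buffer else 0
    total = num_images + num_to_drop
    dummy_share = num_to_drop / total if total else 0.0
    return num_to_drop, total, round(dummy_share, 3)


print(plan_submissions(100, 4))                       # (8, 108, 0.074)
print(plan_submissions(10, 4))                        # (8, 18, 0.444)
print(plan_submissions(100, 4, double_buffer=False))  # (0, 100, 0.0)
```

The second call suggests why small batches benefit less: with 10 images and 4 workers, nearly half of all submissions are dummy frames.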

Step 3: Buffer the input→output association

import collections

awaiting_result = collections.deque()
out_frameno = 0  # Number of results popped so far

# prefill (set in the complete example) is how many frames are pushed before the first pop
for in_frameno, image_path in enumerate(input_paths):
    # Assumed round-robin selection, mirroring next_ready in Step 4
    next_available = in_frameno % len(workers)

    # Preprocess and submit (or submit an all-zero dummy frame)
    if image_path is None:
        inputs[next_available][0][:] = 0
    else:
        frame = preprocess(image_path, input_infos[0])
        inputs[next_available][0][:] = frame

    workers[next_available].push(image_path, inputs[next_available], outputs[next_available])

    if in_frameno >= prefill:
        # Retrieve the oldest pending result
        next_ready = out_frameno % len(workers)
        out_path, outs = workers[next_ready].pop()

        # Buffer the association
        awaiting_result.append(out_path)

        if out_frameno >= num_to_drop:
            # Valid result: it belongs to the path buffered num_to_drop pops earlier
            out_path = awaiting_result.popleft()
            postprocess(out_frameno - num_to_drop, count, out_path, outs[0], labels, output_infos[0])

        out_frameno += 1

Key points:

Track input paths separately from output results using awaiting_result deque
Skip first num_to_drop results - these are dummy frames from double buffer warm-up
Push dummy frames at the end to drain the pipeline
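
The association logic from the key points can be isolated into a small generator, which may make it easier to test without hardware. This is a sketch under assumptions: pair_delayed_outputs is a hypothetical helper, and the (path, output) tuples stand in for whatever the worker's pop() returns.

```python
from collections import deque


def pair_delayed_outputs(popped, num_to_drop):
    """Yield (path, output) pairs from a stream of popped (path, output) tuples.

    Each output is assumed to belong to the path that was popped
    num_to_drop pops earlier; the first num_to_drop outputs are discarded.
    """
    awaiting_result = deque()
    for out_frameno, (path, output) in enumerate(popped):
        awaiting_result.append(path)
        if out_frameno >= num_to_drop:
            yield awaiting_result.popleft(), output


# One worker, so num_to_drop = 2; None paths mark the flush frames.
popped = [("a.jpg", "dummy"), ("b.jpg", "dummy"), (None, "out_a"), (None, "out_b")]
pairs = list(pair_delayed_outputs(popped, num_to_drop=2))
print(pairs)  # [('a.jpg', 'out_a'), ('b.jpg', 'out_b')]
```

Because the deque always holds exactly num_to_drop pending paths once warm-up is over, the flush frames' None paths are never paired with a real output.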

Step 4: Drain remaining results

# Drain the remaining workers to extract the last N frames
for _ in range(prefill):
    next_ready = out_frameno % len(workers)
    out_path, outs = workers[next_ready].pop()
    awaiting_result.append(out_path)

    if out_frameno >= num_to_drop:
        out_path = awaiting_result.popleft()
        postprocess(out_frameno - num_to_drop, count, out_path, outs[0], labels, output_infos[0])

    out_frameno += 1

Best Practices

Always drop the first 2×N results where N = number of workers
Push 2×N dummy frames at the end to flush the pipeline
Use a deque or similar structure to track input→output association
Measure performance - not all workloads benefit equally
Consider latency requirements - a delay of 2×N frames may be unacceptable for real-time apps
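
To follow the "measure performance" advice, a simple timing harness along these lines can compare runs with and without double buffering. It is a minimal sketch: measure_fps is a hypothetical helper, and process_all is assumed to be any zero-argument callable that processes the whole frame set once.

```python
import time


def measure_fps(process_all, num_frames, warmup_runs=1, timed_runs=3):
    """Return frames per second based on the fastest of several timed runs."""
    for _ in range(warmup_runs):
        process_all()  # warm caches and lazy initialization before timing
    best = float("inf")
    for _ in range(timed_runs):
        start = time.perf_counter()
        process_all()
        best = min(best, time.perf_counter() - start)
    return num_frames / best


# Placeholder workload; replace with a function that runs the real pipeline.
fps = measure_fps(lambda: sum(range(100_000)), num_frames=10)
print(f"{fps:.1f} frames/s")
```

Taking the fastest run reduces noise from other processes on the host; running each configuration on the same images keeps the comparison fair.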

Performance Considerations

Throughput gains: Typically 10-30% improvement for workloads with significant DMA transfer overhead.

Latency penalty: Results delayed by 2×N frames (where N = number of workers).

Memory overhead: Requires buffering input associations (minimal).

When it helps most:

Large input/output tensors (significant transfer time)
Models with balanced compute vs. transfer time
Batch processing workloads

When it helps least:

Very fast models (compute time << transfer time)
Small tensors (minimal transfer overhead)
Single-image latency-critical applications
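
The cases above can be illustrated with an idealized model in which overlapping reduces the per-frame cost from transfer + compute to max(transfer, compute). This is a simplifying assumption that ignores host preprocessing, scheduling overhead, and the dummy frames, so it gives an upper bound on the speedup rather than a measurement. The function name and the timings are hypothetical.

```python
def overlap_speedup(transfer_ms, compute_ms):
    """Upper-bound speedup when transfer fully overlaps with compute."""
    sequential = transfer_ms + compute_ms       # no overlap: costs add up
    overlapped = max(transfer_ms, compute_ms)   # full overlap: slower stage dominates
    return sequential / overlapped


cases = {
    "balanced": (1.0, 1.0),
    "small tensors": (0.2, 1.0),
    "very fast model": (1.0, 0.05),
}
for name, (transfer_ms, compute_ms) in cases.items():
    print(f"{name}: {overlap_speedup(transfer_ms, compute_ms):.2f}x")
# balanced: 2.00x
# small tensors: 1.20x
# very fast model: 1.05x
```

In this model the gain is largest when the two stages take similar time and shrinks toward 1.0x as either stage dominates.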

Related Documentation

Basic Usage Tutorial - Understanding worker threads and inference patterns
API Reference - ModelInstance.load_model_instance() parameters

Last Updated: 2026-02-11
