Double Buffering for Performance (Advanced)
Overview
This guide covers double buffering, an advanced performance optimization for the low-level axelera.runtime Python API. Double buffering is a hardware feature that requires careful buffer management and result tracking. If you haven't completed the Quick Start yet, start there to understand the basic API usage.
Double buffering is a feature in the Metis runtime that enables pipelined inference execution. When enabled, it overlaps data transfer time with computation time for higher throughput.
Prerequisites:
- Completed Quick Start and Basic Usage Tutorial
- Understanding of worker threads and inference patterns
- Knowledge of threading, queue management, and buffer tracking
- SDK virtual environment activated:
source venv/bin/activate
- Compiled model available (e.g., build/resnet50-imagenet-onnx/model.json)
  - You can download a pre-compiled model with axdownloadmodel resnet50-imagenet-onnx
Download example model:
axdownloadmodel resnet50-imagenet-onnx
What is Double Buffering?
When double buffering is enabled:
- While the AIPU processes inference N, the host can transfer data for inference N+1
- This overlaps data transfer time with computation time
- Each worker returns its results 2 inference cycles late, so with N round-robin workers the overall delay is 2×N frames
Trade-off: Increased latency (a 2×N-frame delay) for higher throughput.
Pipeline Comparison
Without double buffering:
run(input0) → result0
run(input1) → result1
run(input2) → result2
With double buffering (single worker):
run(input0) → dummy0
run(input1) → dummy1
run(input2) → result0 # Result delayed by 2 cycles
run(input3) → result1
run(input4) → result2
With double buffering (N workers, round-robin; shown here for N = 2):
# Worker 0: run(input0) → dummy0
# Worker 1: run(input1) → dummy1
# Worker 0: run(input2) → dummy2
# Worker 1: run(input3) → dummy3
# Worker 0: run(input4) → result0 # Delay = 2 × N cycles
# Worker 1: run(input5) → result1
Critical insight: With N workers, you must:
- Drop the first 2×N results (dummy outputs)
- Push 2×N extra dummy frames at the end to flush the pipeline
- Buffer the input→output association to match results with correct inputs
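The three rules above can be checked without hardware. The following is a minimal sketch in plain Python, not SDK code: DelayedWorker is a hypothetical stand-in for a double-buffered worker that returns each result two calls late, and run_pipeline is an illustrative helper that skips the first 2×N outputs, appends 2×N flush frames, and uses a deque to pair each result with its input.

```python
import collections

_EMPTY = object()  # marks pipeline slots that have not been filled yet


class DelayedWorker:
    """Toy model of a double-buffered worker (hypothetical, not the SDK).

    run() returns the result for the input submitted two calls earlier,
    or None while the two-slot pipeline is still filling.
    """

    def __init__(self, depth=2):
        self._in_flight = collections.deque([_EMPTY] * depth)

    def run(self, value):
        self._in_flight.append(value)
        oldest = self._in_flight.popleft()
        return None if oldest is _EMPTY else f"result({oldest})"


def run_pipeline(frames, num_workers):
    """Round-robin frames over the workers and pair each result with its input."""
    workers = [DelayedWorker() for _ in range(num_workers)]
    num_to_drop = 2 * num_workers
    padded = list(frames) + [None] * num_to_drop  # flush frames at the end
    awaiting = collections.deque()  # inputs whose results are still in flight
    matched = []
    for frameno, frame in enumerate(padded):
        result = workers[frameno % num_workers].run(frame)
        awaiting.append(frame)
        if frameno >= num_to_drop:  # earlier outputs are warm-up dummies
            matched.append((awaiting.popleft(), result))
    return matched


pairs = run_pipeline(["a", "b", "c", "d", "e"], num_workers=2)
```

With two workers, each result arrives four submissions after its input, yet every pair in pairs links an input to its own result, and the flush frames never appear in the output.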
When to Use Double Buffering
Use double buffering when:
- Throughput is more important than latency
- Processing batch workloads (e.g., video files, image directories)
- DMA transfer time is significant compared to inference time
- You can tolerate a result delay of 2 frames per worker (2×N frames in total)
Don't use double buffering when:
- Latency is more important than throughput
- Processing single images or small batches
- Avoiding application complexity is more important than maximizing throughput
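To judge whether a batch is too small to benefit, it can help to count the dummy submissions that double buffering adds. The sketch below assumes the 2×N flush rule described in this guide; flush_overhead is an illustrative helper, not an SDK function.

```python
def flush_overhead(num_frames, num_workers, double_buffer=True):
    """Return (total submissions, fraction of submissions that are dummies).

    Assumes 2 dummy frames per worker are pushed to flush the pipeline.
    """
    num_dummy = 2 * num_workers if double_buffer else 0
    total = num_frames + num_dummy
    fraction = num_dummy / total if total else 0.0
    return total, fraction


small_total, small_fraction = flush_overhead(10, num_workers=4)
large_total, large_fraction = flush_overhead(1000, num_workers=4)
```

For 10 frames on 4 workers, 8 of the 18 submissions are dummies; for 1000 frames it is 8 of 1008, so the flush cost becomes negligible only for longer runs.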
Implementation
Complete example: axruntime_dblbuff.py
Run the example:
# With double buffering (default)
python examples/axruntime/python/axruntime_dblbuff.py \
build/resnet50-imagenet-onnx/model.json images/ --repeat 10
# Without double buffering (for comparison)
python examples/axruntime/python/axruntime_dblbuff.py \
build/resnet50-imagenet-onnx/model.json images/ --repeat 10 --no-double-buffer
Step 1: Enable double buffering when loading model instance
instances = [
    conn.load_model_instance(
        model,
        num_sub_devices=batch_size,
        aipu_cores=batch_size,
        double_buffer=True,  # Enable double buffering
    )
    for conn in connections
]
Step 2: Calculate number of dummy results to drop
# With N workers and double buffering, first 2×N results are dummy
num_to_drop = len(workers) * 2 if double_buffer else 0
# Add dummy frames at end to flush pipeline
input_paths += [None] * num_to_drop
Step 3: Buffer the input→output association
# workers, inputs, outputs, prefill and count are set up earlier in the complete example
awaiting_result = collections.deque()
out_frameno = 0
for in_frameno, image_path in enumerate(input_paths):
    # Round-robin: pick the worker that receives this frame
    next_available = in_frameno % len(workers)
    # Preprocess and submit (or submit a dummy frame)
    if image_path is None:
        inputs[next_available][0][:] = np.zeros_like(inputs[next_available][0])
    else:
        frame = preprocess(image_path, input_infos[0])
        inputs[next_available][0][:] = frame
    workers[next_available].push(image_path, inputs[next_available], outputs[next_available])
    if in_frameno >= prefill:
        # Retrieve the oldest outstanding result
        next_ready = out_frameno % len(workers)
        out_path, outs = workers[next_ready].pop()
        # Buffer the association
        awaiting_result.append(out_path)
        if out_frameno >= num_to_drop:
            # Now we have valid results - match with the correct input
            out_path = awaiting_result.popleft()
            postprocess(out_frameno - num_to_drop, count, out_path, outs[0], labels, output_infos[0])
        out_frameno += 1
Key points:
- Track input paths separately from output results using the awaiting_result deque
- Skip the first num_to_drop results; these are dummy frames from double-buffer warm-up
- Push dummy frames at the end to drain the pipeline
Step 4: Drain remaining results
# Drain the remaining workers to extract the last N frames
for _ in range(prefill):
    next_ready = out_frameno % len(workers)
    out_path, outs = workers[next_ready].pop()
    awaiting_result.append(out_path)
    if out_frameno >= num_to_drop:
        out_path = awaiting_result.popleft()
        postprocess(out_frameno - num_to_drop, count, out_path, outs[0], labels, output_infos[0])
    out_frameno += 1
Best Practices
- Always drop the first 2×N results where N = number of workers
- Push 2×N dummy frames at the end to flush the pipeline
- Use a deque or similar structure to track input→output association
- Measure performance - not all workloads benefit equally
- Consider latency requirements - the 2×N-frame delay may be unacceptable for real-time apps
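One way to act on the measurement advice is to time the same workload with and without double buffering. This is a minimal sketch using only the standard library; measure_throughput and the run_batch callable are hypothetical names, and run_batch is assumed to wrap whatever loop submits the frames and collects the results.

```python
import time


def measure_throughput(run_batch, num_frames):
    """Return frames per second for one call of run_batch().

    num_frames should count real frames only, not the dummy flush frames.
    """
    start = time.perf_counter()
    run_batch()
    elapsed = time.perf_counter() - start
    return num_frames / elapsed if elapsed > 0 else float("inf")


def relative_gain(fps_with, fps_without):
    """Fractional throughput gain of double buffering, e.g. 0.2 for +20%."""
    return fps_with / fps_without - 1.0


# Placeholder workload standing in for a real inference loop
fps = measure_throughput(lambda: sum(range(100_000)), num_frames=100)
```

Running each configuration several times and comparing the median reduces the influence of warm-up and system noise.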
Performance Considerations
Throughput gains: Typically 10-30% improvement for workloads with significant DMA transfer overhead.
Latency penalty: Results delayed by 2×N frames (where N = number of workers).
Memory overhead: Requires buffering input associations (minimal).
When it helps most:
- Large input/output tensors (significant transfer time)
- Models with balanced compute vs. transfer time
- Batch processing workloads
When it helps least:
- Very fast models (compute time << transfer time)
- Small tensors (minimal transfer overhead)
- Single-image latency-critical applications
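The lists above are consistent with a simple idealized model, offered here as an assumption rather than a description of the Metis runtime: without overlap each frame costs transfer time plus compute time, while with perfect overlap it costs only the larger of the two. The function names below are illustrative.

```python
def per_frame_time(transfer_ms, compute_ms, double_buffer):
    """Idealized steady-state time per frame in milliseconds."""
    if double_buffer:
        # Perfect overlap: the slower stage sets the pace
        return max(transfer_ms, compute_ms)
    # No overlap: the stages run back to back
    return transfer_ms + compute_ms


def ideal_speedup(transfer_ms, compute_ms):
    """Upper bound on the throughput ratio from overlapping the two stages."""
    serial = per_frame_time(transfer_ms, compute_ms, double_buffer=False)
    overlapped = per_frame_time(transfer_ms, compute_ms, double_buffer=True)
    return serial / overlapped


balanced = ideal_speedup(5.0, 5.0)          # equal transfer and compute
transfer_light = ideal_speedup(2.0, 8.0)    # modest transfer share
transfer_bound = ideal_speedup(10.0, 0.1)   # compute time << transfer time
```

Under this model the bound is highest when the two stages take equal time and approaches 1 when either stage dominates. Measured gains are likely to be lower than this upper bound.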
Related Documentation
- Basic Usage Tutorial - Understanding worker threads and inference patterns
- API Reference - ModelInstance.load_model_instance() parameters
Last Updated: 2026-02-11