
# Inference Configuration

How to tune the data transformation pipeline between the AIPU and your decoder using `handle_*` flags in model YAML files.

## Prerequisites

You must deploy your model before configuring inference. Deployment creates the compiled model artifacts (quantized model, preamble, postamble). This page covers the runtime behavior when you run inference on that deployed model.


## The problem these flags solve

AI models work with float32 tensors. The Metis AIPU works with quantized int8 data in a padded NHWC layout. Getting data onto the AIPU and back to your decoder therefore requires several transformations:

| Transformation | Direction | What it does |
|---|---|---|
| Quantize | Before AIPU | float32 → int8 |
| Pad | Before AIPU | Add alignment bytes |
| Depad | After AIPU | Remove alignment bytes |
| Dequantize | After AIPU | int8 → float32 |
| Transpose | After AIPU | NHWC → NCHW (if decoder expects it) |
| Postamble | After AIPU | Run compiler-extracted postprocessing |

The `handle_*` flags control who does each transformation — the framework (automatic) or your decoder (manual, but faster if you can fuse operations).
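For concreteness, the quantize and dequantize steps amount to the standard affine int8 mapping. A minimal numpy sketch — function names here are illustrative, not the framework's API:

```python
import numpy as np

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """float32 -> int8: q = round(x / scale) + zero_point, clamped to int8 range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """int8 -> float32: x = (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale
```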


## Data flow

```
Input (float32)
        ↓
┌─────────────────────────────────┐
│ PREAMBLE (optional)             │ ← handle_preamble
│ + QUANTIZE + PAD                │
└─────────────────────────────────┘
        ↓
╔═════════════════════════════════╗
║ AIPU (Metis)                    ║ ← int8 inference
╚═════════════════════════════════╝
        ↓ (int8, padded)
┌─────────────────────────────────┐
│ DEPAD + DEQUANTIZE + TRANSPOSE  │ ← handle_* flags
│ + POSTAMBLE (optional)          │
└─────────────────────────────────┘
        ↓
┌─────────────────────────────────┐
│ DECODER (task-specific)         │ ← float32 or int8+params
└─────────────────────────────────┘
        ↓
Output (results)
```
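In code terms, the post-AIPU half of this flow is a handful of conditional steps. The sketch below is illustrative only — every name is a hypothetical stand-in for framework internals, stubbed with numpy so it runs:

```python
import numpy as np

SCALE, ZERO_POINT = 0.05, 0   # arbitrary example quantization parameters

def depad_and_dequantize(q):  # int8 -> float32 (padding removal omitted in stub)
    return (q.astype(np.float32) - ZERO_POINT) * SCALE

def nhwc_to_nchw(t):          # layout change for decoders that expect NCHW
    return t.transpose(0, 3, 1, 2)

def run_postamble(t):         # compiler-extracted postprocessing (no-op stub)
    return t

def postprocess(raw, handle_dequant=True, handle_transpose=True,
                handle_postamble=True):
    """Apply only the transforms the framework is configured to handle;
    anything skipped becomes the decoder's responsibility."""
    if handle_dequant:
        raw = depad_and_dequantize(raw)
    if handle_transpose:
        raw = nhwc_to_nchw(raw)
    if handle_postamble:
        raw = run_postamble(raw)
    return raw
```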

## Quick start

For most users, the defaults work fine — you don't need to set anything. If you want to tune performance, start here:

```yaml
# Option 1: Framework handles everything (default behavior)
inference:
  handle_all: True

# Option 2: Decoder handles everything (best performance, requires custom decoder)
inference:
  handle_all: False

# Option 3: Fine-tune individual flags
inference:
  handle_dequantization_and_depadding: True
  handle_transpose: False
  handle_postamble: False
```

**Rule of thumb:** Start with `handle_all: True` (or just omit the flags entirely). Only switch to `False` or fine-tune after profiling shows a bottleneck.


## YAML reference

```yaml
inference:
  # Convenience flag — sets all 4 individual flags at once
  handle_all: True/False/None          # default: None (use individual settings)

  # Individual flags (each defaults to True)
  handle_dequantization_and_depadding: True/False
  handle_transpose: True/False
  handle_postamble: True/False

  # Performance tuning
  dequantize_using_lut: True/False     # default: True

  # ONNX Runtime config (only relevant when handle_postamble is False)
  postamble_onnx: "path/to/postamble.onnx"
  postamble_onnxruntime_intra_op_num_threads: 4
  postamble_onnxruntime_inter_op_num_threads: 4
```
> **Warning:** `handle_all` and the individual flags are mutually exclusive. Setting both raises an error — use one approach or the other.


## Flag reference

### `handle_all`

Sets all 4 individual flags at once.

| Value | Effect | Use when |
|---|---|---|
| `None` (default) | Each flag uses its own setting | Fine-tuning individual flags |
| `True` | Framework handles all transforms → decoder gets float32 | Starting out, large models |
| `False` | Decoder handles all transforms → decoder gets int8 + params | Production, performance-critical |

### `handle_dequantization_and_depadding`

| Value | Decoder receives | When to use |
|---|---|---|
| `True` (default) | float32 arrays | Most cases |
| `False` | int8 + quantization parameters | Decoder can fuse dequant with its own logic |
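This is where fusing pays off. A hypothetical detection decoder can move its float confidence threshold into the int8 domain once, filter there, and dequantize only the survivors — a sketch assuming standard affine quantization with a positive scale:

```python
import numpy as np

def decode_scores(q: np.ndarray, scale: float, zero_point: int,
                  conf_threshold: float = 0.5) -> np.ndarray:
    """Fuse dequantization with confidence filtering (illustrative decoder stage)."""
    # q >= q_threshold  <=>  (q - zero_point) * scale >= conf_threshold
    q_threshold = int(np.ceil(conf_threshold / scale)) + zero_point
    keep = q >= q_threshold                   # cheap int8 comparison, no dequant
    # Dequantize only the values that survived the threshold
    return (q[keep].astype(np.float32) - zero_point) * scale
```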

### `handle_transpose`

| Value | Decoder receives | When to use |
|---|---|---|
| `True` (default) | Expected layout (NCHW if needed) | Most cases |
| `False` | Raw NHWC from the AIPU | Decoder works with NHWC directly |

Transpose is expensive — skip it if your decoder doesn't need NCHW.
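A quick numpy illustration of the difference — the shape here is made up for the example:

```python
import numpy as np

nhwc = np.zeros((1, 80, 80, 85), dtype=np.float32)   # example AIPU output layout

# handle_transpose: True — the framework materializes an NCHW copy
nchw = np.ascontiguousarray(nhwc.transpose(0, 3, 1, 2))  # full memory copy

# handle_transpose: False — an NHWC-aware decoder indexes the raw tensor
# directly (a view, no copy), e.g. all channels at spatial position (y, x):
y, x = 40, 20
channels = nhwc[0, y, x, :]
```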

### `handle_postamble`

| Value | Decoder receives | When to use |
|---|---|---|
| `True` (default) | Final float32 results (postamble already applied) | Most cases |
| `False` | Pre-postamble output (decoder handles the rest) | Fusing postamble with decode logic |
> **Dependency:** When `handle_postamble=True`, the framework also forces `handle_dequantization_and_depadding=True` and `handle_transpose=True` (with a warning if you set them to `False`).
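If you take over the postamble, one plausible approach is to load the `postamble_onnx` file with ONNX Runtime yourself, mirroring the thread settings from the YAML reference above. A sketch, not the framework's own code:

```python
import onnxruntime as ort

# Mirror postamble_onnxruntime_*_num_threads from the YAML config
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
opts.inter_op_num_threads = 4

session = ort.InferenceSession("path/to/postamble.onnx", sess_options=opts)

def run_postamble(pre_postamble_output):
    # Assumes the postamble graph has a single input
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: pre_postamble_output})
```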

### `handle_preamble`

| Value | Effect | When to use |
|---|---|---|
| `True` (default) | Detects and applies special preprocessing patterns | Most models |
| `False` | Standard padding only | Override pattern detection |

> **Note:** Currently partially implemented — the runtime detects specific preprocessing patterns (e.g. YOLO preprocessing) but does not execute arbitrary preamble ONNX files.

### `dequantize_using_lut`

| Value | Method | Best for |
|---|---|---|
| `True` (default) | Lookup table (precomputed 256 int8 → float32 mappings) | `batch_size = 1` |
| `False` | Calculate on the fly: `(int8 - zero_point) * scale` | `batch_size > 1` |
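The lookup-table method exploits the fact that int8 has only 256 possible values, so every dequantized result can be precomputed once. A minimal sketch (illustrative names):

```python
import numpy as np

def build_dequant_lut(scale: float, zero_point: int) -> np.ndarray:
    """Precompute float32 values for all 256 possible int8 codes."""
    codes = np.arange(-128, 128, dtype=np.int32)
    return ((codes - zero_point) * scale).astype(np.float32)

def dequantize_lut(q: np.ndarray, lut: np.ndarray) -> np.ndarray:
    # int8 code c maps to index c + 128; one gather replaces a
    # subtract-and-multiply per element
    return lut[q.astype(np.int32) + 128]
```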

## Pipeline scenarios

### Simple: `handle_all: True`

```
Input → Preamble+Quantize+Pad → AIPU → Depad+Dequant+Transpose+Postamble → Decoder
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                       framework handles all of this
```

**Best for:** prototyping, large models, no custom decoder needed.

### Optimized: `handle_all: False`

```
Input → Preamble+Quantize+Pad → AIPU → Decoder
                                       ^^^^^^^
                                       decoder does everything (fused)
```

**Best for:** production, small models, real-time. Requires a custom C++ decoder.

### Hybrid: individual flags

```yaml
inference:
  handle_dequantization_and_depadding: True   # framework handles the complex part
  handle_transpose: False                     # decoder skips transpose (works with NHWC)
  handle_postamble: False                     # decoder fuses postamble with decode
```

**Best for:** production after profiling. The framework handles the hard part; the decoder optimizes what it can.


## Trade-off summary

| Approach | Simplicity | Performance | Custom code needed |
|---|---|---|---|
| `handle_all: True` | High | Good | None |
| `handle_all: False` | Low | Best | Full C++ decoder |
| Fine-tuned | Medium | Better | Selective |

Start with `handle_all: True`. Optimize only if measurements show it's needed.

