deploy.py

Compiles and deploys a model (or full pipeline) from a YAML definition file. Used when deploying custom weights or new model architectures.

./deploy.py <network-name-or-yaml-path>

After deployment, the compiled pipeline is cached in build/. Subsequent calls to inference.py with the same network name use the cached build without recompiling.
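For example, a first deployment followed by a cached run might look like this (a sketch of the caching behaviour described above; the inference.py invocation is illustrative and may take further arguments):

./deploy.py <network-name>       # compiles and caches under build/
./inference.py <network-name>    # reuses the cached build; no recompilation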


Output

Artifacts are written to build/<network-name>/<model-name>/, under a numbered version subdirectory:

build/
└── yolov8n-licenseplate/
    └── yolov8n-licenseplate/
        └── 1/
            ├── model.json       # pipeline descriptor
            ├── model.axmodel    # compiled AIPU binary
            └── manifest.json    # quantization parameters

See Model Formats for what each file contains.


Options

Option                  Description
------                  -----------
--build-root <path>     Write compiled output here instead of build/
--data-root <path>      Look for datasets here instead of data/
--num-cal-images <N>    Number of calibration images for quantization (default: 200; range: 100–400)
--aipu-cores <N>        Target N AIPU cores (1–4). Only valid for single-model networks.
--pipe <type>           Pipeline type: gst (default, AIPU), torch (CPU/ONNX), torch-aipu (Python + AIPU)
--mode <mode>           Deployment mode (see below)
--model <model>         Compile only the named model within the network; skip pipeline deployment
--models-only           Compile all models in the network; skip pipeline deployment
--pipeline-only         Deploy the pipeline only; skip model compilation (uses pre-compiled models)
--export                Produce a zip archive: <model>-prequantized.zip in QUANTIZE mode, <model>.zip in other modes. Saved to exported/.
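Options can be combined. For instance (network name and paths illustrative), targeting all four AIPU cores and writing output to a custom location:

./deploy.py yolov8n-licenseplate --aipu-cores 4 --build-root /tmp/mybuild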

--mode values

Mode            Description
----            -----------
PREQUANTIZED    (default) Use a pre-quantized model if available; otherwise quantize, then compile
QUANTIZE        Quantize only; do not compile. Outputs quantized_model_manifest.json.
QUANTCOMPILE    Quantize, then compile in one step
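A typical split workflow, based on the mode descriptions above (network name illustrative): quantize first to inspect the result, then let the default PREQUANTIZED mode pick up the quantized model and compile it:

./deploy.py yolov8n-licenseplate --mode QUANTIZE    # writes quantized_model_manifest.json
./deploy.py yolov8n-licenseplate                    # PREQUANTIZED: reuses the quantized model, then compiles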

Examples

Deploy a custom YAML:

./deploy.py customers/mymodels/yolov8n-licenseplate.yaml

Increase calibration images for better quantization accuracy:

./deploy.py customers/mymodels/yolov8n-licenseplate.yaml --num-cal-images 400

Quantize only (inspect before committing to full compile):

./deploy.py customers/mymodels/yolov8n-licenseplate.yaml --mode QUANTIZE
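Quantize and export a shareable archive (per the --export entry above, this should write <model>-prequantized.zip to exported/):

./deploy.py customers/mymodels/yolov8n-licenseplate.yaml --mode QUANTIZE --export

Deploy as a CPU/ONNX pipeline instead of the default GStreamer/AIPU one:

./deploy.py customers/mymodels/yolov8n-licenseplate.yaml --pipe torch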

All options:

./deploy.py --help

What happens during deployment

When you run ./deploy.py <network-name>, the following steps execute:

1. Calibration data preparation

The transforms in the YAML preprocess: section (resize, normalize, etc.) are applied to calibration images from the dataset (default: 200 images). These preprocessed images feed the quantization step. For example:

preprocess:
  - letterbox:
      width: 640
      height: 640
  - torch-totensor:
  - normalize:
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
Note: YAML preprocessing is used only during deployment to prepare calibration data. It is not part of the runtime inference pipeline; the GStreamer operators handle runtime preprocessing independently.

2. Compilation

The compiler takes the ONNX model plus calibration data and produces:

  • Quantized model — weights converted from FP32 to INT8
  • AIPU binary (.axmodel) — optimized for Metis hardware
  • Preamble/postamble — CPU-side transforms extracted by the compiler
  • Manifest — quantization parameters for runtime dequantization
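Assuming the example network from the Output section above, the results can be inspected on disk (a sketch; the exact file set follows the layout shown earlier):

ls build/yolov8n-licenseplate/yolov8n-licenseplate/1/
# manifest.json  model.axmodel  model.json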

3. Pipeline deployment

The compiled model is wrapped into a GStreamer pipeline descriptor (model.json) that the runtime can load directly.


See also