deploy.py
Compiles and deploys a model (or full pipeline) from a YAML definition file. Used when deploying custom weights or new model architectures.
./deploy.py <network-name-or-yaml-path>
After deployment, the compiled pipeline is cached in build/. Subsequent calls to inference.py with the same network name use the cached build without recompiling.
Output
Artifacts are written to build/<network-name>/<model-name>/:
build/
└── yolov8n-licenseplate/
    └── yolov8n-licenseplate/
        └── 1/
            ├── model.json       # pipeline descriptor
            ├── model.axmodel    # compiled AIPU binary
            └── manifest.json    # quantization parameters
See Model Formats for what each file contains.
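To check from a script whether a build is already cached, the layout above can be expressed as paths. This is a minimal sketch: `artifact_paths` and `missing_artifacts` are hypothetical helper names, and the `1` directory is assumed to be fixed, as it appears in the tree.

```python
from pathlib import Path

# File names taken from the tree above. The "1" directory is assumed constant.
ARTIFACT_NAMES = ("model.json", "model.axmodel", "manifest.json")


def artifact_paths(network, model, build_root="build"):
    """Return the expected artifact paths for one compiled model."""
    base = Path(build_root) / network / model / "1"
    return [base / name for name in ARTIFACT_NAMES]


def missing_artifacts(network, model, build_root="build"):
    """Return the expected artifacts that are not present on disk."""
    return [p for p in artifact_paths(network, model, build_root) if not p.is_file()]


if __name__ == "__main__":
    name = "yolov8n-licenseplate"
    for path in missing_artifacts(name, name):
        print(f"missing: {path}")
```

An empty result from `missing_artifacts` suggests the cached build is complete; it does not verify the contents of the files.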
Options
| Option | Description |
|---|---|
| --build-root <path> | Write compiled output here instead of build/ |
| --data-root <path> | Look for datasets here instead of data/ |
| --num-cal-images <N> | Number of calibration images for quantization (default: 200; range: 100–400) |
| --cal-seed <N> | Seed an RNG that shuffles the calibration image order. Omit for the default sorted-deterministic order; set a seed to sweep different orders for accuracy ablation. See Reproducibility. |
| --aipu-cores <N> | Target N AIPU cores (1–4). Only valid for single-model networks. |
| --pipe <type> | Pipeline type: gst (default, AIPU), torch (CPU/ONNX), torch-aipu (Python + AIPU) |
| --mode <mode> | Deployment mode (see below) |
| --model <model> | Compile only the named model within the network; skip pipeline deployment |
| --models-only | Compile all models in the network; skip pipeline deployment |
| --pipeline-only | Deploy the pipeline only; skip model compilation (uses pre-compiled models) |
| --export | Produce a zip archive. For QUANTIZE mode: <model>-prequantized.zip. For other modes: <model>.zip. Saved to exported/. |
--mode values
| Mode | Description |
|---|---|
| PREQUANTIZED | (default) Use a pre-quantized model if available; otherwise quantize then compile |
| QUANTIZE | Quantize only; do not compile. Outputs quantized_model_manifest.json. |
| QUANTCOMPILE | Quantize then compile in one step |
Examples
Deploy a custom YAML:
./deploy.py customers/mymodels/yolov8n-licenseplate.yaml
Increase calibration images for better quantization accuracy:
./deploy.py customers/mymodels/yolov8n-licenseplate.yaml --num-cal-images 400
Quantize only (inspect before committing to full compile):
./deploy.py customers/mymodels/yolov8n-licenseplate.yaml --mode QUANTIZE
All options:
./deploy.py --help
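When several runs differ only in one option, it can be convenient to generate the command lines from Python. The sketch below builds quantize-only commands for a set of `--cal-seed` values, using only flags from the Options table. `sweep_commands` and the `/tmp/cal-sweep` prefix are hypothetical names chosen for illustration.

```python
import shlex


def sweep_commands(yaml_path, seeds, build_root_prefix="/tmp/cal-sweep"):
    """Build one quantize-only deploy.py command per calibration seed."""
    commands = []
    for seed in seeds:
        commands.append([
            "./deploy.py", yaml_path,
            "--mode", "QUANTIZE",
            "--cal-seed", str(seed),
            # A separate build root per seed keeps the outputs from overwriting each other.
            "--build-root", f"{build_root_prefix}/seed-{seed}",
        ])
    return commands


if __name__ == "__main__":
    yaml_path = "customers/mymodels/yolov8n-licenseplate.yaml"
    for cmd in sweep_commands(yaml_path, [0, 1, 2]):
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```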
What happens during deployment
When you run ./deploy.py <network-name>, the following steps execute:
1. Calibration data preparation
The YAML preprocess: section (Resize, Normalize, etc.) is applied to calibration images from the dataset (default: 200 images). These preprocessed images are used for the quantization step.
preprocess:
  - letterbox:
      width: 640
      height: 640
  - torch-totensor:
  - normalize:
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
YAML preprocessing is used only during deployment to prepare calibration data. It is not part of the runtime inference pipeline — the GStreamer operators handle runtime pre-processing independently.
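As a rough illustration of the arithmetic behind the preprocess chain above, the sketch below computes the letterbox geometry and the per-channel normalization for a single pixel. It rests on assumptions not stated here: that letterbox preserves aspect ratio and centers the image in the padding, and that torch-totensor scales 8-bit values to [0, 1] as torchvision's ToTensor does. The function names are illustrative only.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def letterbox_geometry(src_w, src_h, dst_w=640, dst_h=640):
    """Return (scale, new_w, new_h, pad_left, pad_top) for an aspect-preserving fit.

    Assumes centered padding; the real operator may place padding differently.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return scale, new_w, new_h, pad_left, pad_top


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then apply (x - mean) / std per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb_uint8, MEAN, STD))


if __name__ == "__main__":
    print(letterbox_geometry(1920, 1080))
    print(normalize_pixel((128, 128, 128)))
```

For a 1920×1080 source, the fit is limited by width, so the image is resized to 640×360 and padded vertically to reach 640×640.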
2. Compilation
The compiler takes the ONNX model plus calibration data and produces:
- Quantized model: weights converted from FP32 to INT8
- AIPU binary (.axmodel): optimized for Metis hardware
- Preamble/postamble: CPU-side transforms extracted by the compiler
- Manifest: quantization parameters for runtime dequantization
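To make the FP32-to-INT8 conversion and the role of the quantization parameters concrete, here is a generic affine quantization sketch. It is not the compiler's actual scheme, which is not described here. The `scale` and `zero_point` parameters stand in for the kind of values a manifest might carry, and the function names are illustrative.

```python
def quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Map a float to a signed 8-bit integer: round(x / scale) + zero_point, clamped."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point=0):
    """Recover an approximate float from the quantized integer."""
    return (q - zero_point) * scale


if __name__ == "__main__":
    scale = 0.5  # illustrative value; real scales come from calibration
    for x in (1.3, -0.2, 100.0):
        q = quantize(x, scale)
        print(x, q, dequantize(q, scale))
```

The round trip is lossy: values are snapped to multiples of `scale`, and anything outside the representable range saturates at the clamp limits. This is why the calibration images, which determine the ranges, influence accuracy.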
3. Pipeline deployment
The compiled model is wrapped into a GStreamer pipeline descriptor (model.json) that the runtime can load directly.
Reproducibility
deploy.py is bit-for-bit reproducible by default. Two back-to-back runs against the same network and the same calibration image directory produce the same quantized_model.pt (identical SHA-256). The behaviour depends on whether --cal-seed is set:
| Knob | Behaviour |
|---|---|
| (omit --cal-seed) | Default. Calibration images are read in lexicographically sorted filename order with shuffle=False. Same inputs → same artifact every run. |
| --cal-seed <N> | The image order is shuffled by a torch RNG seeded with <N>. Two runs with the same <N> produce the same artifact; different <N> produce different artifacts. Use this to sweep calibration orders during accuracy experiments. |
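The two behaviours in the table can be sketched as follows. This is only an illustration of the ordering logic: it uses Python's standard `random` module as a stand-in for the torch RNG, so the permutations it produces will not match those of deploy.py. `calibration_order` is a hypothetical name.

```python
import random


def calibration_order(filenames, cal_seed=None):
    """Return the order in which calibration images would be consumed.

    Without a seed: lexicographically sorted, no shuffling.
    With a seed: the sorted list is shuffled by an RNG seeded with cal_seed.
    """
    ordered = sorted(filenames)
    if cal_seed is not None:
        # A private RNG instance avoids touching global random state.
        random.Random(cal_seed).shuffle(ordered)
    return ordered


if __name__ == "__main__":
    files = ["img_03.jpg", "img_01.jpg", "img_02.jpg", "img_04.jpg"]
    print(calibration_order(files))              # sorted order
    print(calibration_order(files, cal_seed=1))  # seeded shuffle
```

Sorting before shuffling matters: it makes the result independent of the order in which the filesystem lists the files, so the same seed yields the same order on every machine.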
Hash determinism (advanced)
deploy.py does not by itself control Python's per-process hash randomization (which affects, for example, set iteration order). The CLI cannot set PYTHONHASHSEED for its own process: the hash seed is fixed at interpreter startup, so assigning the variable from inside the running script has no effect. If you need hash-order determinism across runs (rare; the calibration path is already covered by sorted-glob + explicit shuffle), set the seed in the shell when invoking deploy.py:
PYTHONHASHSEED=0 ./deploy.py <network>
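The effect of setting the variable before the interpreter starts can be observed with a small experiment that is independent of deploy.py. The sketch launches child interpreters with and without PYTHONHASHSEED and compares the hash of the same string; `child_hash` is a hypothetical helper.

```python
import os
import subprocess
import sys


def child_hash(value, hash_seed=None):
    """Return hash(value) as computed by a fresh Python interpreter.

    hash_seed, if given, is passed to the child through PYTHONHASHSEED.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean slate
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Same seed in the environment at startup: identical hashes across processes.
    print(child_hash("abc", 0) == child_hash("abc", 0))
    # No seed: each process picks a random one, so the hashes usually differ.
    print(child_hash("abc"), child_hash("abc"))
```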
Validating reproducibility
To confirm your environment produces identical artifacts:
./deploy.py <network> --mode QUANTIZE --build-root /tmp/runA
./deploy.py <network> --mode QUANTIZE --build-root /tmp/runB
sha256sum /tmp/run{A,B}/<network>/<network>/quantized/quantized_model.pt
# both hashes must match
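The same comparison can be done from Python, for example inside a CI check. This sketch hashes two files with the standard `hashlib` module and reports whether they match. The function names are illustrative, and the paths passed on the command line are expected to be the two quantized_model.pt files from the runs above.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifacts_match(path_a, path_b):
    """True if both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)


if __name__ == "__main__" and len(sys.argv) == 3:
    ok = artifacts_match(sys.argv[1], sys.argv[2])
    print("reproducible" if ok else "MISMATCH")
    sys.exit(0 if ok else 1)
```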
See also
- Deploy Custom Weights — end-to-end walkthrough
- Compiler CLI — lower-level compiler without the YAML pipeline
- Model Formats — output file structure