
Model Zoo — Pre-trained Models

The Voyager Model Zoo is a collection of pre-trained AI models ready to run on Axelera hardware. When you run ./inference.py yolov5s-v7-coco usb:0, the model name (yolov5s-v7-coco) comes from the Model Zoo.

Listing available models

From the SDK root directory (with environment activated):

make

This lists three categories:

| Category | What it contains |
|---|---|
| ZOO | Individual models — one model, one task |
| REFERENCE APPLICATION PIPELINES | Multi-model pipelines (e.g., detection cascaded into pose estimation) |
| TUTORIALS | Example models used by the tutorial documentation |
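A quick way to check whether a particular architecture is available is to filter the listing with standard shell tools (the grep filter here is only an illustration, not an SDK feature):

# Show only YOLOv8 entries from the model listing
make | grep -i yolov8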

How model names work

Model names follow a pattern:

<architecture>-<dataset>[-<variant>]

Examples:

| Name | Architecture | Dataset | Notes |
|---|---|---|---|
| yolov5s-v7-coco | YOLOv5 small | COCO | v7 release of YOLOv5 |
| yolov8s-coco-onnx | YOLOv8 small | COCO | ONNX format |
| resnet50-imagenet | ResNet-50 | ImageNet | Classification model |

Task types

| Task | What it does | Example model |
|---|---|---|
| Object detection | Finds and labels objects with bounding boxes | yolov5s-v7-coco |
| Classification | Identifies what's in an image (single label) | resnet50-imagenet |
| Semantic segmentation | Labels every pixel by category | yolov8sseg-coco-onnx |
| Instance segmentation | Labels every pixel and distinguishes individual objects | yolov8sseg-coco-onnx |
| Keypoint detection | Finds body joints and pose landmarks | yolov8lpose-coco-onnx |
| Depth estimation | Estimates distance of each pixel from camera | fastdepth-nyuv2 |
| License plate recognition | Reads license plates | Available in Model Zoo |
| Face recognition | Identifies or verifies faces | Available in Model Zoo |

Running a model

# Object detection with USB camera
./inference.py yolov5s-v7-coco usb:0

# Classification with a video file
./inference.py resnet50-imagenet media/traffic1_1080p.mp4

# Headless benchmarking (no display)
./inference.py yolov8s-coco-onnx usb:0 --no-display --frames 1000

The first time you run a model, the SDK:

  1. Downloads the pre-trained weights (if not cached)
  2. Compiles the model for the AIPU
  3. Caches the compiled model for subsequent runs
  4. Runs inference

Subsequent runs skip steps 1-3 and start immediately.
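To see the caching effect for yourself, time a first run against an immediate repeat of the same command. This sketch only reuses the model, video file, and flags shown above, plus the shell's time keyword:

# First run: downloads weights and compiles the model for the AIPU
time ./inference.py yolov5s-v7-coco media/traffic1_1080p.mp4 --no-display --frames 100

# Second run: loads the cached compiled model and starts immediately
time ./inference.py yolov5s-v7-coco media/traffic1_1080p.mp4 --no-display --frames 100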

Datasets

Models are trained on specific datasets. The dataset name in the model identifier tells you what the model can recognize:

| Dataset | What it contains | Typical use |
|---|---|---|
| COCO | 80 object categories (person, car, dog, etc.) | General object detection |
| ImageNet | 1000 image categories | Image classification |
| VOC | 20 object categories | Object detection (smaller set) |

Non-redistributable datasets

Most datasets download automatically the first time you run a model. A few require manual registration and download due to licensing restrictions: obtain the archives listed below from the dataset providers, then place them in the specified directory within your SDK installation. If a required dataset is missing, the SDK raises an error that includes the expected path.

| Dataset | Archive | Destination directory |
|---|---|---|
| Cityscapes (val) | gtFine_val.zip | data/cityscapes |
| Cityscapes (val) | leftImg8bit_val.zip | data/cityscapes |
| Cityscapes (test) | gtFine_test.zip | data/cityscapes |
| Cityscapes (test) | leftImg8bit_test.zip | data/cityscapes |
| ImageNet (train) | ILSVRC2012_devkit_t12.tar.gz | data/ImageNet |
| ImageNet (train) | ILSVRC2012_img_train.tar | data/ImageNet |
| ImageNet (val) | ILSVRC2012_devkit_t12.tar.gz | data/ImageNet |
| ImageNet (val) | ILSVRC2012_img_val.tar | data/ImageNet |
| WiderFace (train) | widerface_train.zip | data/widerface |
| WiderFace (val) | widerface_val.zip | data/widerface |

You are responsible for adhering to the terms and conditions of each dataset's license.
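As a sketch of the manual step, assuming the archives have already been downloaded to ~/Downloads and the commands are run from the SDK root (both assumptions), placing the Cityscapes validation files would look like:

# Create the expected dataset directory and move the downloaded archives into it
mkdir -p data/cityscapes
mv ~/Downloads/gtFine_val.zip ~/Downloads/leftImg8bit_val.zip data/cityscapes/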


Performance characteristics

The tables below list all Model Zoo models for this SDK release. Columns:

  • Ref FP32 — accuracy of the original floating-point model
  • Accuracy loss — FP32 accuracy minus quantized int8 accuracy (lower is better)
  • Ref PCIe FPS — host throughput on Intel Core i9-13900K + Metis 1× PCIe card
  • Ref M.2 FPS — host throughput on Intel Core i5-1145G7E + Metis 1× M.2 card

Accuracy is measured using:

./inference.py <model> dataset --pipe=torch-aipu --no-display

Throughput is measured using a 720p h.264 video file:

./inference.py <model> media/traffic2_720p.mp4 --pipe=gst --no-display
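For example, substituting a concrete model from the tables below (and keeping the literal dataset source argument as shown above), the accuracy measurement for ResNet-50 is:

./inference.py resnet50-imagenet dataset --pipe=torch-aipu --no-display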

Image Classification

ModelONNXRepoResolutionDatasetRef FP32 Top1Accuracy lossRef PCIe FPSRef M.2 FPSModel license
DenseNet-121🔗🔗224x224ImageNet-1K74.440.86281156BSD-3-Clause
EfficientNet-B0🔗🔗224x224ImageNet-1K77.671.1214291450BSD-3-Clause
EfficientNet-B1🔗🔗224x224ImageNet-1K77.60.47972960BSD-3-Clause
EfficientNet-B2🔗🔗224x224ImageNet-1K77.790.46903863BSD-3-Clause
EfficientNet-B3🔗🔗224x224ImageNet-1K78.540.50787721BSD-3-Clause
EfficientNet-B4🔗🔗224x224ImageNet-1K79.270.71576436BSD-3-Clause
MobileNetV2🔗🔗224x224ImageNet-1K71.871.5036703638BSD-3-Clause
MobileNetV4-small🔗🔗224x224ImageNet-1K73.745.0749374807Apache 2.0
MobileNetV4-medium🔗🔗224x224ImageNet-1K79.040.9025172395Apache 2.0
MobileNetV4-large🔗🔗384x384ImageNet-1K82.920.95761460Apache 2.0
MobileNetV4-aa_large🔗🔗384x384ImageNet-1K83.221.96667391Apache 2.0
SqueezeNet 1.0🔗🔗224x224ImageNet-1K58.12.80953811BSD-3-Clause
SqueezeNet 1.1🔗🔗224x224ImageNet-1K58.191.8672987264BSD-3-Clause
Inception V3🔗🔗224x224ImageNet-1K69.850.251136636BSD-3-Clause
RegNetX-1_6GF🔗🔗224x224ImageNet-1K79.330.22695369BSD-3-Clause
RegNetX-400MF🔗🔗224x224ImageNet-1K74.480.361199636BSD-3-Clause
RegNetY-1_6GF🔗🔗224x224ImageNet-1K80.730.24595322BSD-3-Clause
RegNetY-400MF🔗🔗224x224ImageNet-1K75.630.131642975BSD-3-Clause
ResNet-18🔗🔗224x224ImageNet-1K69.760.3639043749BSD-3-Clause
ResNet-34🔗🔗224x224ImageNet-1K73.30.1222822075BSD-3-Clause
ResNet-50 v1.5🔗🔗224x224ImageNet-1K76.150.1819461756BSD-3-Clause
ResNet-101🔗🔗224x224ImageNet-1K77.370.791049673BSD-3-Clause
ResNet-152🔗🔗224x224ImageNet-1K78.310.23493261BSD-3-Clause
ResNet-10t🔗🔗224x224ImageNet-1K68.221.0652125015Apache 2.0
ResNeXt50_32x4d🔗🔗224x224ImageNet-1K77.610.08437236BSD-3-Clause
Wide ResNet-50🔗🔗224x224ImageNet-1K78.480.36436236BSD-3-Clause

Object Detection

ModelONNXRepoResolutionDatasetRef FP32 mAPAccuracy lossRef PCIe FPSRef M.2 FPSModel license
GELAN-s🔗🔗640x640COCO201746.412.99376237GPL-3.0
GELAN-m🔗🔗640x640COCO201750.861.06203148GPL-3.0
GELAN-c🔗🔗640x640COCO201752.30.49199144GPL-3.0
RetinaFace - Resnet50🔗🔗840x840WiderFace95.250.259051MIT
RetinaFace - mb0.25🔗🔗640x640WiderFace89.441.361020774MIT
SSD-MobileNetV1🔗🔗300x300COCO201724.77-0.0533563019Apache 2.0
SSD-MobileNetV2🔗🔗300x300COCO201719.250.8722612195Apache 2.0
YOLOv3🔗🔗640x640COCO201746.610.7916396AGPL-3.0
YOLOv5s-Relu🔗🔗640x640COCO201735.090.52785536AGPL-3.0
YOLOv5s-v5🔗🔗640x640COCO201736.180.37790526AGPL-3.0
YOLOv5n🔗🔗640x640COCO201727.720.871028656AGPL-3.0
YOLOv5s🔗🔗640x640COCO201737.250.80865824AGPL-3.0
YOLOv5m🔗🔗640x640COCO201744.940.85455322AGPL-3.0
YOLOv5l🔗🔗640x640COCO201748.670.84299204AGPL-3.0
YOLOv7🔗🔗640x640COCO201751.020.58212173GPL-3.0
YOLOv7-tiny🔗🔗416x416COCO201733.120.4914411110GPL-3.0
YOLOv7 640x480🔗🔗640x480COCO201750.780.52242164GPL-3.0
YOLOv8n🔗🔗640x640COCO201737.121.18834764AGPL-3.0
YOLOv8s🔗🔗640x640COCO201744.80.93643524AGPL-3.0
YOLOv8m🔗🔗640x640COCO201750.161.32242177AGPL-3.0
YOLOv8l🔗🔗640x640COCO201752.832.06181142AGPL-3.0
YOLOv8n-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset48.735.68269162AGPL-3.0
YOLOv8l-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset56.064.413619AGPL-3.0
YOLOX-s🔗🔗640x640COCO201739.24-0.81642423Apache-2.0
YOLOX-m🔗🔗640x640COCO201746.26-0.37349268Apache-2.0
YOLOX-x Human🔗🔗1440x800COCO201757.663.3821-MIT
YOLOv9t🔗🔗640x640COCO201737.811.25415247AGPL-3.0
YOLOv9s🔗🔗640x640COCO201746.281.12374237AGPL-3.0
YOLOv9m🔗🔗640x640COCO201751.242.29203148AGPL-3.0
YOLOv9c🔗🔗640x640COCO201752.672.35194150AGPL-3.0
YOLOv10n🔗🔗640x640COCO201738.080.74738561AGPL-3.0
YOLOv10s🔗🔗640x640COCO201745.740.45580461AGPL-3.0
YOLOv10b🔗🔗640x640COCO201751.790.45251217AGPL-3.0
YOLO11n🔗🔗640x640COCO201739.170.71759574AGPL-3.0
YOLO11s🔗🔗640x640COCO201746.540.55565426AGPL-3.0
YOLO11m🔗🔗640x640COCO201751.310.55269196AGPL-3.0
YOLO11l🔗🔗640x640COCO201753.230.49183125AGPL-3.0
YOLO11x🔗🔗640x640COCO201754.670.585331AGPL-3.0
YOLO11n-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset50.011.07250172AGPL-3.0
YOLO11l-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset56.411.083620AGPL-3.0
YOLO26n🔗🔗640x640COCO201740.181.95662487AGPL-3.0
YOLO26s🔗🔗640x640COCO201747.662.05498396AGPL-3.0
YOLO26m🔗🔗640x640COCO201752.452.14258192AGPL-3.0
YOLO26l🔗🔗640x640COCO201754.112.03179122AGPL-3.0
YOLO26x🔗🔗640x640COCO201756.922.435331AGPL-3.0
YOLO26n-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset49.413.12206139AGPL-3.0
YOLO26s-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset54.022.01167114AGPL-3.0
YOLO26m-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset56.661.725833AGPL-3.0
YOLO26l-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset57.351.053419AGPL-3.0
YOLO26x-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset58.45.3415-AGPL-3.0
YOLO-NAS S🔗🔗640x640COCO201747.06450318Apache-2.0
YOLO-NAS M🔗🔗640x640COCO201751.0285221Apache-2.0
YOLO-NAS L🔗🔗640x640COCO201751.7915796Apache-2.0

Semantic Segmentation

ModelONNXRepoResolutionDatasetRef FP32 mIoUAccuracy lossRef PCIe FPSRef M.2 FPSModel license
U-Net FCN 256🔗🔗256x256Cityscapes57.750.34249198Apache 2.0
U-Net FCN 512🔗512x512Cityscapes66.620.013419Apache 2.0

Instance Segmentation

ModelONNXRepoResolutionDatasetRef FP32 mAPAccuracy lossRef PCIe FPSRef M.2 FPSModel license
YOLOv8n-seg🔗🔗640x640COCO201729.980.92639433AGPL-3.0
YOLOv8s-seg🔗🔗640x640COCO201736.320.57482345AGPL-3.0
YOLOv8m-seg🔗🔗640x640COCO201740.390.65198156AGPL-3.0
YOLOv8l-seg🔗🔗640x640COCO201742.271.11167134AGPL-3.0
YOLO11n-seg🔗🔗640x640COCO201731.841.11598406AGPL-3.0
YOLO11l-seg🔗🔗640x640COCO201743.260.13156107AGPL-3.0
YOLO26n-seg🔗🔗640x640COCO201732.952.64516352AGPL-3.0
YOLO26s-seg🔗🔗640x640COCO201739.283.28385292AGPL-3.0
YOLO26m-seg🔗🔗640x640COCO201743.341.40201156AGPL-3.0
YOLO26l-seg🔗🔗640x640COCO201745.091.8714196AGPL-3.0
YOLO26x-seg🔗🔗640x640COCO201746.542.004628AGPL-3.0

Keypoint Detection

ModelONNXRepoResolutionDatasetRef FP32 mAPAccuracy lossRef PCIe FPSRef M.2 FPSModel license
YOLOv8n-pose🔗🔗640x640COCO201751.111.75822723AGPL-3.0
YOLOv8s-pose🔗🔗640x640COCO201760.652.98592471AGPL-3.0
YOLOv8m-pose🔗🔗640x640COCO201765.581.91231168AGPL-3.0
YOLOv8l-pose🔗🔗640x640COCO201768.391.47186145AGPL-3.0
YOLO11n-pose🔗🔗640x640COCO201751.153.23759532AGPL-3.0
YOLO11l-pose🔗🔗640x640COCO201767.443.14179122AGPL-3.0
YOLO26n-pose🔗🔗640x640COCO201757.666.54658450AGPL-3.0
YOLO26s-pose🔗🔗640x640COCO201763.615.12467359AGPL-3.0
YOLO26m-pose🔗🔗640x640COCO201769.544.83235166AGPL-3.0
YOLO26l-pose🔗🔗640x640COCO201771.053.02174120AGPL-3.0
YOLO26x-pose🔗🔗640x640COCO201772.7516.625130AGPL-3.0

Depth Estimation

ModelONNXRepoResolutionDatasetRef FP32 RMSEAccuracy lossRef PCIe FPSRef M.2 FPSModel license
FastDepth🔗🔗224x224NYUDepthV20.6574-0.0065974855MIT

License Plate Recognition

ModelONNXRepoResolutionDatasetRef FP32 WLAAccuracy lossRef PCIe FPSRef M.2 FPSModel license
LPRNet🔗94x24LPRNetDataset89.41.90102689335Apache-2.0

Image Enhancement (Super Resolution)

ModelONNXRepoResolutionDatasetRef FP32 PSNRAccuracy lossRef PCIe FPSRef M.2 FPSModel license
Real-ESRGAN-x4plus🔗🔗128x128SuperResolutionCustomSet128x12824.77--BSD-3-Clause

Face Recognition

ModelONNXRepoResolutionDatasetRef FP32 top1_avgAccuracy lossRef PCIe FPSRef M.2 FPSModel license
FaceNet - InceptionResnetV1🔗🔗160x160LFWTorchvisionPair98.350.001321720MIT

Re-Identification

ModelONNXRepoResolutionDatasetRef FP32 mAPAccuracy lossRef PCIe FPSRef M.2 FPSModel license
OSNet x1_0🔗🔗256x128Market1501ReIdDataset82.550.9317321770Apache-2.0
SBS50🔗🔗384x128Market1501ReIdDataset89.02-0.16666405Apache-2.0

Large Language Models

For usage details see the LLM Inference guide.

| Model | Max context (tokens) | Required PCIe card RAM |
|---|---|---|
| microsoft/Phi-3-mini-4k-instruct | 512 | 4 GB |
| microsoft/Phi-3-mini-4k-instruct | 1024 | 16 GB |
| microsoft/Phi-3-mini-4k-instruct | 2048 | 16 GB |
| meta-llama/Llama-3.2-1B-Instruct | 1024 | 4 GB |
| meta-llama/Llama-3.2-3B-Instruct | 1024 | 4 GB |
| meta-llama/Llama-3.1-8B-Instruct | 1024 | 16 GB |
| Almawave/Velvet-2B | 1024 | 4 GB |

Experimenting with optimized input shapes

Most models are trained on square inputs (640×640), but real-world video is often rectangular (16:9). Standard pipelines pad the input ("letterboxing"), forcing the model to process empty pixels.

By switching to a rectangular input shape that matches your video's aspect ratio, you can often achieve significant speedups with minimal accuracy impact. This is especially effective for fixed-camera applications like surveillance or traffic monitoring.
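To see where the waste comes from: a 1280×720 (16:9) frame scaled to fit a 640-pixel-wide input occupies only 360 rows, so a square 640×640 input spends 280 of its 640 rows (43.75%) on letterbox padding, a 640×480 input spends 120 rows (25%), and a 640×384 input only 24 rows (6.25%). The quick check below is plain arithmetic and does not depend on the SDK:

# Letterbox padding when a 16:9 frame is scaled to fit a 640-pixel-wide input
python3 - <<'EOF'
img_h = round(640 * 9 / 16)   # scaled image height: 360 rows
for h in (640, 480, 384):     # square vs rectangular input heights
    pad = h - img_h
    print(f"640x{h}: {pad} padded rows ({pad / h:.2%} of the input)")
EOF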

How to test

Export models with dynamic input shapes, then compare:

# Standard 640×640
./inference.py yolox-m-coco-onnx dataset --pipe=torch-aipu --no-display

# Rectangular 640×480
./inference.py yolox-m-coco-onnx-rect dataset --pipe=torch-aipu --no-display

Expected results

| Configuration | Input shape | Speedup | mAP impact | Best for |
|---|---|---|---|---|
| Standard | 640×640 | Baseline | Baseline | General purpose, diverse content |
| Optimized | 640×480 | +24% | −0.3% | Near-square content, balanced performance |
| Optimized | 640×384 | +47% | −2.0% | Landscape video (16:9), maximum throughput |

Custom weights

You can use your own trained weights with any model architecture. This involves updating the model's YAML configuration to point to your custom weight file. See Deploy Custom Weights for the full walkthrough.

See also