
Model Zoo — Pre-trained Models

The Voyager Model Zoo is a collection of pre-trained AI models ready to run on Axelera hardware. When you run ./inference.py yolov5s-v7-coco usb:0, the model name (yolov5s-v7-coco) comes from the Model Zoo.

Listing available models

From the SDK root directory (with environment activated):

make

This lists three categories:

| Category | What it contains |
|---|---|
| ZOO | Individual models — one model, one task |
| REFERENCE APPLICATION PIPELINES | Multi-model pipelines (e.g., detection cascaded into pose estimation) |
| TUTORIALS | Example models used by the tutorial documentation |
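A quick way to check whether a particular architecture is available is to filter the listing with standard shell tools (the grep filter here is only an illustration, not an SDK feature):

# Show only YOLOv8 entries from the model listing
make | grep -i yolov8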

How model names work

Model names follow a pattern:

<architecture>-<dataset>[-<variant>]

Examples:

| Name | Architecture | Dataset | Notes |
|---|---|---|---|
| yolov5s-v7-coco | YOLOv5 small | COCO | v7 release of YOLOv5 |
| yolov8s-coco-onnx | YOLOv8 small | COCO | ONNX format |
| resnet50-imagenet | ResNet-50 | ImageNet | Classification model |

Task types

| Task | What it does | Example model |
|---|---|---|
| Object detection | Finds and labels objects with bounding boxes | yolov5s-v7-coco |
| Classification | Identifies what's in an image (single label) | resnet50-imagenet |
| Semantic segmentation | Labels every pixel by category | yolov8sseg-coco-onnx |
| Instance segmentation | Labels every pixel and distinguishes individual objects | yolov8sseg-coco-onnx |
| Keypoint detection | Finds body joints and pose landmarks | yolov8lpose-coco-onnx |
| Depth estimation | Estimates distance of each pixel from camera | fastdepth-nyuv2 |
| License plate recognition | Reads license plates | Available in Model Zoo |
| Face recognition | Identifies or verifies faces | Available in Model Zoo |

Running a model

# Object detection with USB camera
./inference.py yolov5s-v7-coco usb:0

# Classification with a video file
./inference.py resnet50-imagenet media/traffic1_1080p.mp4

# Headless benchmarking (no display)
./inference.py yolov8s-coco-onnx usb:0 --no-display --frames 1000

The first time you run a model, the SDK:

  1. Downloads the pre-trained weights (if not cached)
  2. Compiles the model for the AIPU
  3. Caches the compiled model for subsequent runs
  4. Runs inference

Subsequent runs skip steps 1-3 and start immediately.
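To see the caching effect for yourself, time a first run against an immediate repeat of the same command. This sketch only reuses the model, video file, and flags shown above, plus the shell's time keyword:

# First run: downloads weights and compiles the model for the AIPU
time ./inference.py yolov5s-v7-coco media/traffic1_1080p.mp4 --no-display --frames 100

# Second run: loads the cached compiled model and starts immediately
time ./inference.py yolov5s-v7-coco media/traffic1_1080p.mp4 --no-display --frames 100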

Datasets

Models are trained on specific datasets. The dataset name in the model identifier tells you what the model can recognize:

| Dataset | What it contains | Typical use |
|---|---|---|
| COCO | 80 object categories (person, car, dog, etc.) | General object detection |
| ImageNet | 1000 image categories | Image classification |
| VOC | 20 object categories | Object detection (smaller set) |

Non-redistributable datasets

Most datasets download automatically the first time you run a model. A few require manual registration and download due to licensing restrictions: obtain the archives listed below from the dataset providers, then place them in the specified directory within your SDK installation. If a required dataset is missing, the SDK raises an error that includes the expected path.

| Dataset | Archive | Destination directory |
|---|---|---|
| Cityscapes (val) | gtFine_val.zip | data/cityscapes |
| Cityscapes (val) | leftImg8bit_val.zip | data/cityscapes |
| Cityscapes (test) | gtFine_test.zip | data/cityscapes |
| Cityscapes (test) | leftImg8bit_test.zip | data/cityscapes |
| ImageNet (train) | ILSVRC2012_devkit_t12.tar.gz | data/ImageNet |
| ImageNet (train) | ILSVRC2012_img_train.tar | data/ImageNet |
| ImageNet (val) | ILSVRC2012_devkit_t12.tar.gz | data/ImageNet |
| ImageNet (val) | ILSVRC2012_img_val.tar | data/ImageNet |
| WiderFace (train) | widerface_train.zip | data/widerface |
| WiderFace (val) | widerface_val.zip | data/widerface |

You are responsible for adhering to the terms and conditions of each dataset's license.
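As a sketch of the manual step, assuming the archives have already been downloaded to ~/Downloads and the commands are run from the SDK root (both assumptions), placing the Cityscapes validation files would look like:

# Create the expected dataset directory and move the downloaded archives into it
mkdir -p data/cityscapes
mv ~/Downloads/gtFine_val.zip ~/Downloads/leftImg8bit_val.zip data/cityscapes/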


Performance characteristics

The tables below list all Model Zoo models for this SDK release. Columns:

  • Ref FP32 — accuracy of the original floating-point model
  • Accuracy loss — FP32 accuracy minus quantized int8 accuracy (lower is better)
  • Ref PCIe FPS — host throughput on Intel Core i9-13900K + Metis 1× PCIe card
  • Ref M.2 FPS — host throughput on Intel Core i5-1145G7E + Metis 1× M.2 card

Accuracy is measured using:

./inference.py <model> dataset --pipe=torch-aipu --no-display

Throughput is measured using a 720p h.264 video file:

./inference.py <model> media/traffic2_720p.mp4 --pipe=gst --no-display
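For example, substituting a concrete model from the tables below (and keeping the literal dataset source argument as shown above), the accuracy measurement for ResNet-50 is:

./inference.py resnet50-imagenet dataset --pipe=torch-aipu --no-display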

Image Classification

ModelONNXRepoResolutionDatasetRef FP32 Top1Accuracy lossRef PCIe FPSRef M.2 FPSModel license
DenseNet-121🔗🔗224x224ImageNet-1K74.440.86281156BSD-3-Clause
EfficientNet-B0🔗🔗224x224ImageNet-1K77.671.1214291450BSD-3-Clause
EfficientNet-B1🔗🔗224x224ImageNet-1K77.60.47972960BSD-3-Clause
EfficientNet-B2🔗🔗224x224ImageNet-1K77.790.46903863BSD-3-Clause
EfficientNet-B3🔗🔗224x224ImageNet-1K78.540.50787721BSD-3-Clause
EfficientNet-B4🔗🔗224x224ImageNet-1K79.270.71576436BSD-3-Clause
MobileNetV2🔗🔗224x224ImageNet-1K71.871.5036703638BSD-3-Clause
MobileNetV4-small🔗🔗224x224ImageNet-1K73.745.0749374807Apache 2.0
MobileNetV4-medium🔗🔗224x224ImageNet-1K79.040.9025172395Apache 2.0
MobileNetV4-large🔗🔗384x384ImageNet-1K82.920.95761460Apache 2.0
MobileNetV4-aa_large🔗🔗384x384ImageNet-1K83.221.96667391Apache 2.0
SqueezeNet 1.0🔗🔗224x224ImageNet-1K58.12.80953811BSD-3-Clause
SqueezeNet 1.1🔗🔗224x224ImageNet-1K58.191.8672987264BSD-3-Clause
Inception V3🔗🔗224x224ImageNet-1K69.850.251136636BSD-3-Clause
RegNetX-1_6GF🔗🔗224x224ImageNet-1K79.330.22695369BSD-3-Clause
RegNetX-400MF🔗🔗224x224ImageNet-1K74.480.361199636BSD-3-Clause
RegNetY-1_6GF🔗🔗224x224ImageNet-1K80.730.24595322BSD-3-Clause
RegNetY-400MF🔗🔗224x224ImageNet-1K75.630.131642975BSD-3-Clause
ResNet-18🔗🔗224x224ImageNet-1K69.760.3639043749BSD-3-Clause
ResNet-34🔗🔗224x224ImageNet-1K73.30.1222822075BSD-3-Clause
ResNet-50 v1.5🔗🔗224x224ImageNet-1K76.150.1819461756BSD-3-Clause
ResNet-101🔗🔗224x224ImageNet-1K77.370.791049673BSD-3-Clause
ResNet-152🔗🔗224x224ImageNet-1K78.310.23493261BSD-3-Clause
ResNet-10t🔗🔗224x224ImageNet-1K68.221.0652125015Apache 2.0
ResNeXt50_32x4d🔗🔗224x224ImageNet-1K77.610.08437236BSD-3-Clause
Wide ResNet-50🔗🔗224x224ImageNet-1K78.480.36436236BSD-3-Clause

Object Detection

ModelONNXRepoResolutionDatasetRef FP32 mAPAccuracy lossRef PCIe FPSRef M.2 FPSModel license
GELAN-s🔗🔗640x640COCO201746.412.99376237GPL-3.0
GELAN-m🔗🔗640x640COCO201750.861.06203148GPL-3.0
GELAN-c🔗🔗640x640COCO201752.30.49199144GPL-3.0
RetinaFace - Resnet50🔗🔗840x840WiderFace95.250.259051MIT
RetinaFace - mb0.25🔗🔗640x640WiderFace89.441.361020774MIT
SSD-MobileNetV1🔗🔗300x300COCO201724.77-0.0533563019Apache 2.0
SSD-MobileNetV2🔗🔗300x300COCO201719.250.8722612195Apache 2.0
YOLOv3🔗🔗640x640COCO201746.610.7916396AGPL-3.0
YOLOv5s-Relu🔗🔗640x640COCO201735.090.52785536AGPL-3.0
YOLOv5s-v5🔗🔗640x640COCO201736.180.37790526AGPL-3.0
YOLOv5n🔗🔗640x640COCO201727.720.871028656AGPL-3.0
YOLOv5s🔗🔗640x640COCO201737.250.80865824AGPL-3.0
YOLOv5m🔗🔗640x640COCO201744.940.85455322AGPL-3.0
YOLOv5l🔗🔗640x640COCO201748.670.84299204AGPL-3.0
YOLOv7🔗🔗640x640COCO201751.020.58212173GPL-3.0
YOLOv7-tiny🔗🔗416x416COCO201733.120.4914411110GPL-3.0
YOLOv7 640x480🔗🔗640x480COCO201750.780.52242164GPL-3.0
YOLOv8n🔗🔗640x640COCO201737.121.18834764AGPL-3.0
YOLOv8s🔗🔗640x640COCO201744.80.93643524AGPL-3.0
YOLOv8m🔗🔗640x640COCO201750.161.32242177AGPL-3.0
YOLOv8l🔗🔗640x640COCO201752.832.06181142AGPL-3.0
YOLOv8n-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset48.735.68269162AGPL-3.0
YOLOv8l-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset56.064.413619AGPL-3.0
YOLOX-s🔗🔗640x640COCO201739.24-0.81642423Apache-2.0
YOLOX-m🔗🔗640x640COCO201746.26-0.37349268Apache-2.0
YOLOX-x Human🔗🔗1440x800COCO201757.663.3821-MIT
YOLOv9t🔗🔗640x640COCO201737.811.25415247AGPL-3.0
YOLOv9s🔗🔗640x640COCO201746.281.12374237AGPL-3.0
YOLOv9m🔗🔗640x640COCO201751.242.29203148AGPL-3.0
YOLOv9c🔗🔗640x640COCO201752.672.35194150AGPL-3.0
YOLOv10n🔗🔗640x640COCO201738.080.74738561AGPL-3.0
YOLOv10s🔗🔗640x640COCO201745.740.45580461AGPL-3.0
YOLOv10b🔗🔗640x640COCO201751.790.45251217AGPL-3.0
YOLO11n🔗🔗640x640COCO201739.170.71759574AGPL-3.0
YOLO11s🔗🔗640x640COCO201746.540.55565426AGPL-3.0
YOLO11m🔗🔗640x640COCO201751.310.55269196AGPL-3.0
YOLO11l🔗🔗640x640COCO201753.230.49183125AGPL-3.0
YOLO11x🔗🔗640x640COCO201754.670.585331AGPL-3.0
YOLO11n-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset50.011.07250172AGPL-3.0
YOLO11l-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset56.411.083620AGPL-3.0
YOLO26n🔗🔗640x640COCO201740.181.95662487AGPL-3.0
YOLO26s🔗🔗640x640COCO201747.662.05498396AGPL-3.0
YOLO26m🔗🔗640x640COCO201752.452.14258192AGPL-3.0
YOLO26l🔗🔗640x640COCO201754.112.03179122AGPL-3.0
YOLO26x🔗🔗640x640COCO201756.922.435331AGPL-3.0
YOLO26n-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset49.413.12206139AGPL-3.0
YOLO26s-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset54.022.01167114AGPL-3.0
YOLO26m-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset56.661.725833AGPL-3.0
YOLO26l-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset57.351.053419AGPL-3.0
YOLO26x-obb🔗🔗1024x1024DOTAv1DetectionOBBDataset58.45.3415-AGPL-3.0
YOLO-NAS S🔗🔗640x640COCO201747.06450318Apache-2.0
YOLO-NAS M🔗🔗640x640COCO201751.0285221Apache-2.0
YOLO-NAS L🔗🔗640x640COCO201751.7915796Apache-2.0

Semantic Segmentation

ModelONNXRepoResolutionDatasetRef FP32 mIoUAccuracy lossRef PCIe FPSRef M.2 FPSModel license
U-Net FCN 256🔗🔗256x256Cityscapes57.750.34249198Apache 2.0
U-Net FCN 512🔗512x512Cityscapes66.620.013419Apache 2.0

Instance Segmentation

ModelONNXRepoResolutionDatasetRef FP32 mAPAccuracy lossRef PCIe FPSRef M.2 FPSModel license
YOLOv8n-seg🔗🔗640x640COCO201729.980.92639433AGPL-3.0
YOLOv8s-seg🔗🔗640x640COCO201736.320.57482345AGPL-3.0
YOLOv8m-seg🔗🔗640x640COCO201740.390.65198156AGPL-3.0
YOLOv8l-seg🔗🔗640x640COCO201742.271.11167134AGPL-3.0
YOLO11n-seg🔗🔗640x640COCO201731.841.11598406AGPL-3.0
YOLO11l-seg🔗🔗640x640COCO201743.260.13156107AGPL-3.0
YOLO26n-seg🔗🔗640x640COCO201732.952.64516352AGPL-3.0
YOLO26s-seg🔗🔗640x640COCO201739.283.28385292AGPL-3.0
YOLO26m-seg🔗🔗640x640COCO201743.341.40201156AGPL-3.0
YOLO26l-seg🔗🔗640x640COCO201745.091.8714196AGPL-3.0
YOLO26x-seg🔗🔗640x640COCO201746.542.004628AGPL-3.0

Keypoint Detection

ModelONNXRepoResolutionDatasetRef FP32 mAPAccuracy lossRef PCIe FPSRef M.2 FPSModel license
YOLOv8n-pose🔗🔗640x640COCO201751.111.75822723AGPL-3.0
YOLOv8s-pose🔗🔗640x640COCO201760.652.98592471AGPL-3.0
YOLOv8m-pose🔗🔗640x640COCO201765.581.91231168AGPL-3.0
YOLOv8l-pose🔗🔗640x640COCO201768.391.47186145AGPL-3.0
YOLO11n-pose🔗🔗640x640COCO201751.153.23759532AGPL-3.0
YOLO11l-pose🔗🔗640x640COCO201767.443.14179122AGPL-3.0
YOLO26n-pose🔗🔗640x640COCO201757.666.54658450AGPL-3.0
YOLO26s-pose🔗🔗640x640COCO201763.615.12467359AGPL-3.0
YOLO26m-pose🔗🔗640x640COCO201769.544.83235166AGPL-3.0
YOLO26l-pose🔗🔗640x640COCO201771.053.02174120AGPL-3.0
YOLO26x-pose🔗🔗640x640COCO201772.7516.625130AGPL-3.0

Depth Estimation

ModelONNXRepoResolutionDatasetRef FP32 RMSEAccuracy lossRef PCIe FPSRef M.2 FPSModel license
FastDepth🔗🔗224x224NYUDepthV20.6574-0.0065974855MIT

License Plate Recognition

ModelONNXRepoResolutionDatasetRef FP32 WLAAccuracy lossRef PCIe FPSRef M.2 FPSModel license
LPRNet🔗94x24LPRNetDataset89.41.90102689335Apache-2.0

Image Enhancement (Super Resolution)

ModelONNXRepoResolutionDatasetRef FP32 PSNRAccuracy lossRef PCIe FPSRef M.2 FPSModel license
Real-ESRGAN-x4plus🔗🔗128x128SuperResolutionCustomSet128x12824.77--BSD-3-Clause

Face Recognition

ModelONNXRepoResolutionDatasetRef FP32 top1_avgAccuracy lossRef PCIe FPSRef M.2 FPSModel license
FaceNet - InceptionResnetV1🔗🔗160x160LFWTorchvisionPair98.350.001321720MIT

Re-Identification

ModelONNXRepoResolutionDatasetRef FP32 mAPAccuracy lossRef PCIe FPSRef M.2 FPSModel license
OSNet x1_0🔗🔗256x128Market1501ReIdDataset82.550.9317321770Apache-2.0
SBS50🔗🔗384x128Market1501ReIdDataset89.02-0.16666405Apache-2.0

Large Language Models

For usage details see the LLM Inference guide.

| Model | Max context (tokens) | Required PCIe card RAM |
|---|---|---|
| microsoft/Phi-3-mini-4k-instruct | 512 | 4 GB |
| microsoft/Phi-3-mini-4k-instruct | 1024 | 16 GB |
| microsoft/Phi-3-mini-4k-instruct | 2048 | 16 GB |
| meta-llama/Llama-3.2-1B-Instruct | 1024 | 4 GB |
| meta-llama/Llama-3.2-3B-Instruct | 1024 | 4 GB |
| meta-llama/Llama-3.1-8B-Instruct | 1024 | 16 GB |
| Almawave/Velvet-2B | 1024 | 4 GB |

Experimenting with optimized input shapes

Most models are trained on square inputs (640×640), but real-world video is often rectangular (16:9). Standard pipelines pad the input ("letterboxing"), forcing the model to process empty pixels.

By switching to a rectangular input shape that matches your video's aspect ratio, you can often achieve significant speedups with minimal accuracy impact. This is especially effective for fixed-camera applications like surveillance or traffic monitoring.
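To see where the waste comes from: a 1280×720 (16:9) frame scaled to fit a 640-pixel-wide input occupies only 360 rows, so a square 640×640 input spends 280 of its 640 rows (43.75%) on letterbox padding, a 640×480 input spends 120 rows (25%), and a 640×384 input only 24 rows (6.25%). The quick check below is plain arithmetic and does not depend on the SDK:

# Letterbox padding when a 16:9 frame is scaled to fit a 640-pixel-wide input
python3 - <<'EOF'
img_h = round(640 * 9 / 16)   # scaled image height: 360 rows
for h in (640, 480, 384):     # square vs rectangular input heights
    pad = h - img_h
    print(f"640x{h}: {pad} padded rows ({pad / h:.2%} of the input)")
EOF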

How to test

Export models with dynamic input shapes, then compare:

# Standard 640×640
./inference.py yolox-m-coco-onnx dataset --pipe=torch-aipu --no-display

# Rectangular 640×480
./inference.py yolox-m-coco-onnx-rect dataset --pipe=torch-aipu --no-display

Expected results

| Configuration | Input shape | Speedup | mAP impact | Best for |
|---|---|---|---|---|
| Standard | 640×640 | Baseline | Baseline | General purpose, diverse content |
| Optimized | 640×480 | +24% | −0.3% | Near-square content, balanced performance |
| Optimized | 640×384 | +47% | −2.0% | Landscape video (16:9), maximum throughput |

Custom weights

You can use your own trained weights with any model architecture. This involves updating the model's YAML configuration to point to your custom weight file. See Deploy Custom Weights for the full walkthrough.

See also