CompilerConfig Reference

Complete property listing for axelera.compiler.CompilerConfig. For usage examples and the compilation workflow, see Compiler Python API.

from axelera.compiler import CompilerConfig

config = CompilerConfig()
config.aipu_cores_used = 4
config.multicore_mode = "batch"
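When several properties are set at once, a small helper can catch misspelled names before compilation. The sketch below deliberately uses a stand-in class instead of the real CompilerConfig, because how CompilerConfig reacts to unknown attributes is not described on this page. `FakeConfig` and `apply_overrides` are hypothetical names; the defaults are copied from the property tables on this page.

```python
class FakeConfig:
    """Stand-in for CompilerConfig holding a few documented defaults (hypothetical)."""

    def __init__(self):
        self.aipu_cores_used = 1
        self.multicore_mode = "multiprocess"
        self.tiling_depth = 1


def apply_overrides(config, overrides):
    """Set each override on config, rejecting names the config does not define."""
    for name, value in overrides.items():
        if not hasattr(config, name):
            raise AttributeError(f"unknown property: {name!r}")
        setattr(config, name, value)
    return config


config = apply_overrides(FakeConfig(), {"aipu_cores_used": 4, "multicore_mode": "batch"})
print(config.aipu_cores_used, config.multicore_mode)
```

The same pattern should carry over to a real config object, assuming its properties are plain attributes as in the snippet above.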

Properties

Quantization

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| compiler_mode | CompilerMode | "quantize_and_lower" | Controls what the compiler does: quantize only, lower only, or both. |
| quantization_scheme | QuantizationScheme | "per_tensor_histogram" | Quantization algorithm for activations and weights. |
| quantize_dw_channel_wise | boolean | false | Quantize depthwise convolutions channel-wise instead of per-tensor. |
| onnx_opset_version | integer ≥ 17 | 17 | ONNX opset version used during PyTorch → ONNX conversion. |
| quantization_debug | boolean | false | Dump model after quantization for accuracy debugging. |
| remove_io_quantization | boolean | false | Remove quantize/dequantize ops from graph inputs and outputs. |
| quantized_graph_export | boolean | true | Export the quantized graph to a JSON file. |
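The difference between per-tensor and per-channel quantization, which the QuantizationScheme values and quantize_dw_channel_wise refer to, can be illustrated with a generic min-max scale calculation. This is a minimal sketch assuming symmetric signed 8-bit quantization; it is not the compiler's actual algorithm, the histogram observer is not shown, and `minmax_scale` is a hypothetical helper.

```python
def minmax_scale(values, qmax=127):
    """Symmetric scale that maps the largest magnitude onto qmax (assumed int8)."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak else 1.0


# Two output channels with very different value ranges (made-up numbers).
weights = [
    [0.5, -1.0, 0.25],    # channel 0
    [0.01, 0.02, -0.04],  # channel 1
]

# Per-tensor: one scale shared by every value in the tensor.
per_tensor = minmax_scale([w for channel in weights for w in channel])

# Per-channel: one scale per channel, so small-range channels keep resolution.
per_channel = [minmax_scale(channel) for channel in weights]

print(per_tensor, per_channel)
```

With a single shared scale, the small values in channel 1 are spread over far fewer integer levels than with their own scale, which is the usual motivation for per-channel schemes.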

Graph transformations

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| apply_pword_padding | boolean | true | Pad input channels to a multiple of PWORD (hardware requirement). |
| rewrite_concat_to_resadd | boolean | true | Convert concatenation ops to pad+shift+add (no native concat on hardware). |
| rewrite_dense_to_conv2d | boolean | true | Canonicalise nn.dense to conv2d with a 1×1 kernel. |
| remove_io_padding_and_layout_transform | boolean | true | Remove padding/transpose ops from the graph when the SDK pipeline handles them. Leave true unless running without the SDK preprocessing elements. |
| apply_arithmetic_simplification | boolean | true | Fast-math simplifications. May not preserve numerics exactly. |
| simplify_mac_before_after_lut | boolean | true | Optimize multiply-add ops around LUT operations. May not preserve numerics exactly. |
| validate_operators | boolean | true | Check operator compatibility with hardware before compilation. |
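The idea behind rewrite_concat_to_resadd can be shown with plain lists: concatenating two channel vectors gives the same result as zero-padding each one into its final position and adding them element-wise. This is only a sketch of the arithmetic identity on 1-D data; the compiler's real rewrite operates on quantized tensors and is not described on this page. `pad_shift` and `concat_as_add` are hypothetical names.

```python
def pad_shift(values, left, right):
    """Zero-pad a channel vector with `left` zeros before and `right` zeros after."""
    return [0] * left + list(values) + [0] * right


def concat_as_add(a, b):
    """Concatenate a and b along the channel axis using only padding and addition."""
    a_padded = pad_shift(a, 0, len(b))  # a keeps the first len(a) slots
    b_padded = pad_shift(b, len(a), 0)  # b is shifted past a
    return [x + y for x, y in zip(a_padded, b_padded)]


a = [1, 2, 3]
b = [4, 5]
print(concat_as_add(a, b))  # [1, 2, 3, 4, 5]
```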

Graph cleaner

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| run_graph_cleaner | boolean | true | Run the ONNX graph cleaner to remove pre/post-processing ops that belong on the host. |
| graph_cleaner_split_pre_post_processing | boolean | true | Split pre- and post-processing into separate passes in the cleaner. |
| graph_cleaner_condition | GraphCleanerCondition or null | null | Condition used to identify the split point. |
| graph_cleaner_node | GraphCleanerNode or null | null | Node type the condition applies to. |
| graph_cleaner_threshold | integer | 0 | Threshold value for the condition. |
| graph_cleaner_dump_core_onnx | string or null | null | Filename to save the core ONNX model after cleaning (for debugging). |
| graph_cleaner_dump_full_opt_onnx | string or null | null | Filename to save the full optimized ONNX model after cleaning. |
| remove_layout_transform_from_preamble | boolean | false | Remove transpose/reshape from model preamble for NHWC layouts. |
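To make the two GraphCleanerCondition values (listed under Enum definitions) more concrete, the sketch below picks a split candidate from a toy node list. The node dictionaries, the shapes, and `split_candidate` are all hypothetical, and graph_cleaner_threshold is left out because this page does not say how the threshold is applied.

```python
from math import prod

# Toy graph: each node has an op type and the shape of its weight tensor.
nodes = [
    {"name": "fc1", "op": "MatMul", "weight_shape": (512, 128)},
    {"name": "fc2", "op": "MatMul", "weight_shape": (1000, 32)},
    {"name": "head", "op": "Gemm", "weight_shape": (10, 1000)},
]


def split_candidate(nodes, condition, node_type):
    """Return the name of the node of `node_type` that maximises `condition`."""
    candidates = [n for n in nodes if n["op"] == node_type]
    if condition == "maximum_weight_tensor_size":
        key = lambda n: prod(n["weight_shape"])
    elif condition == "maximum_weight_tensor_first_dimension_size":
        key = lambda n: n["weight_shape"][0]
    else:
        raise ValueError(f"unknown condition: {condition!r}")
    return max(candidates, key=key)["name"]


print(split_candidate(nodes, "maximum_weight_tensor_size", "MatMul"))
print(split_candidate(nodes, "maximum_weight_tensor_first_dimension_size", "MatMul"))
```

The two conditions can select different nodes: here fc1 has the most weight elements, while fc2 has the largest first dimension.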

Multi-core and scheduling

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| aipu_cores_used | integer 1–4 | 1 | Number of AIPU cores to compile for. |
| multicore_mode | MulticoreMode | "multiprocess" | How cores share work. See Multi-Core Configuration. |
| pipeline_spatial_tiles | boolean | true | Software-pipeline tasks across height tiles. |
| pipeline_channel_tiles | boolean | true | Software-pipeline tasks across output channel tiles. |
| inter_operator_async | boolean | true | Asynchronous scheduling between sequential operators. |
| use_list_scheduler | boolean | false | Use resource-aware list scheduler for async scheduling. |
| unroll_prologue_epilogue | boolean | false | Unroll prologue/epilogue loops to expose more parallelism. |
| use_hw_tokens | boolean | true | Use hardware tokens for synchronisation. |
| group_ifdw_tasks | boolean | false | Hoist IFDW operations earlier for better IMC weightset utilization. |
| double_buffer | boolean | true | Double-buffer host↔device data transfers to hide latency. |
| host_processes_used | integer | 1 | Number of host processes for execution. |
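The documented range for aipu_cores_used and the MulticoreMode values can be checked before they are assigned to a config. This is a minimal sketch; `check_multicore` is a hypothetical helper, not part of the axelera.compiler API, and the limits are copied from this page.

```python
VALID_MODES = {"multiprocess", "multithread", "batch", "cooperative", "pipeline"}


def check_multicore(aipu_cores_used, multicore_mode):
    """Validate both settings against the ranges documented on this page."""
    if isinstance(aipu_cores_used, bool) or not isinstance(aipu_cores_used, int):
        raise TypeError("aipu_cores_used must be an integer")
    if not 1 <= aipu_cores_used <= 4:
        raise ValueError(f"aipu_cores_used must be 1-4, got {aipu_cores_used}")
    if multicore_mode not in VALID_MODES:
        raise ValueError(f"unknown multicore_mode: {multicore_mode!r}")
    return {"aipu_cores_used": aipu_cores_used, "multicore_mode": multicore_mode}


settings = check_multicore(4, "batch")
print(settings)
```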

Tiling and memory planning

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| max_memplan_attempts | integer ≥ 1 | 5 | Max iterations for the memory planner to find a valid configuration. |
| max_tiling_attempts | integer ≥ 1 | 8 | Max attempts to tile an operator to fit in memory. |
| force_h_tiling | integer or null | null | Force a specific height-dimension tiling factor. |
| force_oc_tiling | integer or null | null | Force a specific output-channel tiling factor. |
| tiling_depth | integer ≥ 1 | 1 | Max tiling depth. Set to 6 to enable depth-first scheduling; 1 disables it. |
| dfs_search_constraint | integer or null | null | Limit depth-first search space. Set to 1 for large networks. |

Memory hierarchy

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| enable_buffer_promotion | boolean | true | Allow buffers to be promoted from DDR → L2 → L1 for faster access. |
| split_buffer_promotion | boolean | false | Split promotion into two passes (L1 and L2) with DFS scheduling in between. |
| io_memory_pool | string | "global.ddr" | Initial home for I/O buffers. One of: global.ddr, global.l2, global.l1. |
| constant_memory_pool | string | "global.ddr" | Initial home for constant buffers. |
| workspace_memory_pool | string | "global.ddr" | Initial home for workspace buffers. |
| l1_constraint | integer or null | null | Max L1 memory the compiler may use (bytes). |
| l2_constraint | integer or null | null | Max L2 memory the compiler may use (bytes). |
| ddr_constraint | integer or null | null | Max DDR memory the compiler may use (bytes). |
| l1_size_used | integer | 4194304 | L1 memory available to the compiler (bytes). Default: 4 MB. |
| l2_size_used | integer | 33554432 | L2 memory available to the compiler (bytes). Default: 32 MB. |
| ddr_size_used | integer | 1073741824 | DDR memory available to the compiler (bytes). Default: 1 GB. |
| use_sysdma | boolean | false | Use System DMA for DDR→L2 weight transfers (Core DMA handles L2→L1). |
| dma_dual_channel | boolean | true | Enable dual-channel DMA optimization. |
| page_memory | boolean | true | Apply paging to reduce L2/DDR fragmentation. |
| elf_in_ddr | boolean | true | Store ELF file in DDR rather than L2. |
| stream_tasklist | boolean | true | Stream tasklist chunks into AIPU memory on-the-fly. |
| dpu_constants_home | string | "global.l2" | Where DPU constants reside. One of: global.ddr, global.l2. |
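The byte values in this table are exact powers of two, so "4 MB" here means 4 × 1024² bytes rather than 4,000,000. The short check below converts the three documented *_size_used defaults into binary units; `human` is a hypothetical helper used only for this illustration.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Defaults copied from the table above.
defaults = {
    "l1_size_used": 4194304,
    "l2_size_used": 33554432,
    "ddr_size_used": 1073741824,
}


def human(n):
    """Format a byte count using the largest binary unit that divides it exactly."""
    for unit, size in (("GiB", GIB), ("MiB", MIB), ("KiB", KIB)):
        if n % size == 0:
            return f"{n // size} {unit}"
    return f"{n} B"


for name, value in defaults.items():
    print(name, human(value))
```

The same arithmetic is handy when setting l1_constraint, l2_constraint, or ddr_constraint, which are also given in bytes.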

IMC and MVM

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| mvm_utilization_limit | float 0.125–1.0 | 1.0 | Fraction of MVM array active MACs. Reduce to lower power consumption. |
| enable_icr | boolean | true | In-Core Replication for layers with small output-channel counts. |
| icrx_force | integer | -1 | Force a specific ICR factor. -1 = automatic. |
| icrx_parallel_block_threshold | integer 1–4 | 4 | Maximum number of parallel IMC blocks for which ICR is still applied. |
| icrx_max_factor | integer 2–8 | 8 | Maximum ICR factor along the X image direction. |
| enable_swicr | boolean | true | Subword ICR for first layers with small input-channel counts. |
| imc_double_buffer_pipeline | boolean | false | Double-buffer weight loading in software-pipelined sections. |
| imc_double_buffer_sequential | boolean | false | Double-buffer weight loading in sequential IR sections. |
| dpu_allocation_algorithm | DPUAllocationAlgorithm | "try_all" | Register-allocation algorithm for the DPU vector unit. |
| softmax_neutral_value | float | -100000.0 | Padding value for softmax inputs. Chosen so the softmax LUT outputs zero for padding elements without affecting non-padding numerics. |
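The effect of a very negative padding value can be seen with an ordinary floating-point softmax: the exponential of the padding underflows to zero, so padded positions contribute nothing to the sum. This sketch uses `math.exp` rather than the hardware LUT, so it only illustrates the principle behind softmax_neutral_value.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


NEUTRAL = -100000.0  # default softmax_neutral_value from the table above

real = [1.0, 2.0, 3.0]
padded = real + [NEUTRAL, NEUTRAL]

print(softmax(real))
print(softmax(padded))
```

The first three outputs are identical in both cases, and the two padded positions come out as exactly 0.0.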

Output and paths

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| output_dir | path string | | Directory for compiler outputs and deployment artifacts. |
| remove_output_dir | boolean | false | Clean the output directory before compilation. |
| model_name | string | "" | Model name used in logging and to select model-specific optimization presets. |
| save_error_artifact | boolean | false | Save a ZIP archive with the lowered model and error messages on failure. |
| randomize_onnx_model | boolean | true | Randomise cached ONNX model weights before saving the error artifact. |

Hardware and runtime

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| frequency | integer, 20 MHz–800 MHz | 800000000 | Device clock frequency in Hz. |
| resources_used | float (0, 1] | 1.0 | Fraction of memory resources the compiled model may use. |
| input_dmabuf | boolean | false | Use DMA for input data transfer. |
| output_dmabuf | boolean | false | Use DMA for output data transfer. |
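Because frequency is given in Hz while its range is quoted in MHz, a conversion helper avoids off-by-a-factor mistakes. This is a minimal sketch; `mhz_to_hz` is a hypothetical function, and the limits are the 20 MHz–800 MHz range from the table above.

```python
MIN_HZ = 20_000_000   # 20 MHz
MAX_HZ = 800_000_000  # 800 MHz


def mhz_to_hz(mhz):
    """Convert MHz to an integer Hz value, enforcing the documented range."""
    hz = int(round(mhz * 1_000_000))
    if not MIN_HZ <= hz <= MAX_HZ:
        raise ValueError(f"frequency must be 20-800 MHz, got {mhz} MHz")
    return hz


print(mhz_to_hz(800))  # 800000000, the documented default
print(mhz_to_hz(400))  # 400000000
```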

Profiling and debugging

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| profiling_levels | list of ProfilingLevel | [] | Profiling trace levels to enable. Enabling multiple simultaneously can reduce accuracy. |
| profiling_drop_percentile | float | 0.25 | Drop top and bottom N% of profiling samples to remove outliers. |
| trace_tvm_passes | boolean | false | Trace TVM pass execution (start/end times, parent-child relationships). Output: pass_dependency_graph.json. |
| propagate_span_information | boolean | true | Propagate source-span information through the compiler. |
| model_debug_save_dir | path string | | Directory to save the quantized/optimized model for debugging. |
| quantization_debug | boolean | false | Dump the quantized model for accuracy measurement and debugging. |

Advanced — internal paths

These are resolved automatically by the SDK. Override only if your installation is non-standard or you are running the compiler outside the SDK environment.

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| compiler_dir | path string | | Root directory of the compiler package, used to locate internal resources and dependencies. |
| runtime_dir | path string | | Directory for the Axelera runtime. |
| device_dir | path string | | Directory for device resources. |

Advanced — memory layout constants

Hardware memory map values. Do not change these unless directed by Axelera support: incorrect values will prevent models from running.

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| l1_size_reserved | integer | 524288 | L1 memory reserved for the system (bytes). Default: 512 KB. |
| l2_size_reserved | integer | 1245184 | L2 memory reserved for the system (bytes). |
| l2_size_reserved_tasklist | integer | 1048576 | L2 memory reserved per core for the tasklist (bytes). Default: 1 MB. |
| ddr_size_max | integer | 1073741824 | Total DDR memory size (bytes). Default: 1 GB. |
| ddr_size_reserved | integer | 33554432 | DDR memory reserved for the system (bytes). Default: 32 MB. |
| l1_virtual_address | integer | 206175207424 | Virtual address of L1 memory. Copied from mmap_config.h. |
| l1_core0_physical_address | integer | 402653184 | Physical address of L1 memory in core 0. Copied from memorymap.h. Required because the AIPU simulator does not support virtual memory. |
| dpu_instructions_home | string | "default" | Where DPU instructions are placed. One of: "default", "l2". |
| dwpu_instructions_home | string | "default" | Where DWPU instructions are placed. One of: "default", "l2". |
| ignore_weight_buffers | boolean | true | Exclude weight buffers when determining tiling factors during memory scheduling. |
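The address and size constants are listed in decimal, which makes them hard to compare with header files that typically use hexadecimal. The snippet below simply reformats three of the documented values; it is a reading aid and makes no assumption about how the headers spell them.

```python
# Decimal values copied from the table above.
constants = {
    "l1_virtual_address": 206175207424,
    "l1_core0_physical_address": 402653184,
    "l2_size_reserved": 1245184,
}

# Render each constant as a hexadecimal literal.
as_hex = {name: f"{value:#x}" for name, value in constants.items()}

for name, text in as_hex.items():
    print(f"{name}: {constants[name]} = {text}")

# l2_size_reserved has no unit note in the table; in KiB it is:
l2_reserved_kib = constants["l2_size_reserved"] // 1024
print(l2_reserved_kib, "KiB")
```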

Enum definitions

CompilerMode

| Value | Description |
| --- | --- |
| "quantize_and_lower" | Quantize the model, then compile it to a hardware binary (default). |
| "quantize_only" | Quantize only; produces a model that runs on CPU for accuracy validation. |
| "lower_only" | Compile a pre-quantized model to a hardware binary without re-quantizing. |

MulticoreMode

| Value | Description |
| --- | --- |
| "multiprocess" | Each core runs a separate OS process (default). |
| "multithread" | Cores share a process with multiple threads. |
| "batch" | Different items in a batch run on different cores simultaneously. |
| "cooperative" | Cores cooperate on a single inference. |
| "pipeline" | The model is split across cores, with each core running one pipeline stage. |

See Multi-Core Configuration for when to use each mode.

QuantizationScheme

| Value | Description |
| --- | --- |
| "per_tensor_histogram" | Activations quantized per-tensor with histogram observer; weights per-channel with min-max (default). |
| "per_tensor_min_max" | Activations quantized per-tensor with min-max observer; weights per-channel with min-max. |
| "hybrid_per_tensor_per_channel" | Activations per-tensor (histogram), except depth-wise convolution inputs (per-channel min-max); weights per-channel min-max. |

DPUAllocationAlgorithm

| Value | Description |
| --- | --- |
| "try_all" | Try all algorithms in sequence until one succeeds (default). |
| "graph" | Graph-coloring register allocator. |
| "lazy" | Lazy (greedy) allocator. |
| "backjump_recursive" | Backtracking recursive allocator. |

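The "try_all" behaviour described above is a fallback chain: each allocator is attempted in turn and the first success wins. The sketch below shows that pattern with two dummy allocators. The functions, the error type, and the order in which they are tried are all assumptions for illustration; this page does not document the compiler's actual order or failure signalling.

```python
def try_all(allocators, program):
    """Run each (name, allocator) pair in order and return the first success."""
    errors = []
    for name, allocate in allocators:
        try:
            return name, allocate(program)
        except RuntimeError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all allocators failed: " + "; ".join(errors))


def graph(program):
    """Dummy allocator that always fails, to exercise the fallback."""
    raise RuntimeError("not enough registers")


def lazy(program):
    """Dummy allocator that assigns registers in order of appearance."""
    return {var: index for index, var in enumerate(program)}


winner, assignment = try_all([("graph", graph), ("lazy", lazy)], ["a", "b"])
print(winner, assignment)
```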
GraphCleanerCondition

| Value | Description |
| --- | --- |
| "maximum_weight_tensor_size" | Split on the node with the largest weight tensor. |
| "maximum_weight_tensor_first_dimension_size" | Split on the node with the largest first weight dimension. |

GraphCleanerNode

| Value | Description |
| --- | --- |
| "MatMul" | Apply the condition to MatMul nodes. |
| "Gemm" | Apply the condition to Gemm nodes. |
| "Clip" | Apply the condition to Clip nodes. |

ProfilingLevel

Trace-line type identifiers used in the profiling output file.

[?] · [B] · [PB] · [PE] · [K] · [M] · [T]