LLM Inference

Running Small Language Models (SLMs) on Metis hardware using the axllm command.

note

All models are precompiled. You cannot load arbitrary LLMs without prior compilation for the Metis platform. See the Model Zoo — Large Language Models for the full list of available precompiled language models and RAM requirements.


Quick start

# Single prompt
axllm llama-3-2-1b-1024-4core-static --prompt "Give me a joke"

# Interactive chat session
axllm llama-3-2-1b-1024-4core-static

# Web UI (opens Gradio interface)
axllm llama-3-2-1b-1024-4core-static --ui

Pipeline modes

| Mode | Description |
| --- | --- |
| `--pipeline=transformers-aipu` | (default) Runs the model on the Metis AIPU. All models are precompiled. |
| `--pipeline=transformers` | Runs on CPU/GPU via Hugging Face Transformers. For development and testing only, not for benchmarking. |

Some models require a 16GB AIPU card. Check the Model Zoo — LLM RAM requirements before selecting a model; axllm will also warn at model load time if the card does not have sufficient memory.
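The two pipelines take the same command line, so you can sanity-check a model off-device before running it on the AIPU. A sketch using only the flags documented above (the model name is the example from Quick start; requires an axllm installation, so not reproducible without the hardware/software stack):

```shell
# Default: run on the Metis AIPU
axllm llama-3-2-1b-1024-4core-static --prompt "Give me a joke"

# Development/testing only: same prompt on CPU/GPU via Hugging Face Transformers
axllm llama-3-2-1b-1024-4core-static --pipeline=transformers --prompt "Give me a joke"
```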


Usage modes

Single prompt

axllm <model-name> --prompt "Your prompt here"

Interactive CLI

axllm <model-name>

Type your messages at the prompt. The model maintains conversation history within the session. Type `exit` to quit.

Rich CLI (formatted output)

axllm <model-name> --rich-cli

Colorized output with Markdown formatting and chat bubbles.

Web UI

axllm <model-name> --ui         # public URL (accessible outside localhost)
axllm <model-name> --ui local   # local only

Launches a Gradio web interface. The terminal prints the URL — open it in a browser.

note

The public URL changes on each launch.


Options

Performance statistics

axllm <model-name> --show-stats --prompt "Hello"

Output:

Tokenization: 0.4ms | Prefill: 3.1us | TTFT: 0.573s | Gen: 2.111s | Tokens/sec: 9.95 | Tokens: 21
CPU %: 1.8%
Core Temp: 34.0°C
| Metric | Description |
| --- | --- |
| Tokenization | Time to tokenise the input |
| Prefill | Time for context setup |
| TTFT | Time to first token (includes model startup) |
| Gen | Total generation time |
| Tokens/sec | Generation throughput |
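If you want to log these numbers over many runs, the stats line is easy to split on its `key: value | key: value` layout. A minimal Python sketch (the parser is an illustration, not part of axllm; it assumes the line format shown above):

```python
# Example stats line as printed by `axllm --show-stats` (copied from above).
line = ("Tokenization: 0.4ms | Prefill: 3.1us | TTFT: 0.573s | "
        "Gen: 2.111s | Tokens/sec: 9.95 | Tokens: 21")

def parse_stats(line: str) -> dict:
    """Split a 'key: value | key: value' stats line into a dict of strings."""
    stats = {}
    for field in line.split("|"):
        key, _, value = field.strip().partition(": ")
        stats[key] = value
    return stats

stats = parse_stats(line)
print(stats["Tokens/sec"])  # "9.95"
```

From there you could strip the units and append the values to a CSV for comparing models or prompt lengths.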

System prompt

axllm <model-name> --system-prompt "You are a helpful assistant that always answers in bullet points."

Temperature

Controls randomness. 0 = deterministic. Higher = more creative.

axllm <model-name> --temperature 0.7 --prompt "Tell me a story."

Suggested range: 0.2 (focused) to 1.0 (creative). Default: 0.
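To see the effect in practice, run the same prompt at several settings and compare the outputs. A simple loop over the documented flag (model name as in the examples above; requires Metis hardware, so shown as a sketch only):

```shell
# Compare deterministic output (0) against increasingly creative settings
for t in 0 0.4 0.8; do
  echo "--- temperature $t ---"
  axllm llama-3-2-1b-1024-4core-static --temperature "$t" --prompt "Tell me a story."
done
```

At temperature 0 repeated runs should produce identical text; at higher values each run will differ.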

Combined example

axllm llama-3-2-1b-1024-4core-static \
  --system-prompt "You are a pirate." \
  --temperature 0.8 \
  --rich-cli

All options

axllm --help

Embedding files

Embedding files are generated offline by Axelera AI and bundled with each precompiled model. The axextractembeddings tool is for Axelera-internal model preparation — end users do not need to run it.


See also

  • Model Zoo — available precompiled language models
  • axmonitor — monitor memory and temperature during LLM inference
  • axdevice — verify your hardware has sufficient AIPU memory