Skip to main content

Host Sleep States Causing Inference Crashes

Some hosts are configured to enter sleep mode automatically, which can interrupt long inference runs. When the system enters the S3 sleep state, PCIe bus power may be cut to connected devices, causing communication failures with the accelerator.

Diagnosis

Check the kernel log for sleep-state transitions:

sudo dmesg | grep -i "S3"

If the system entered S3, you will see a line such as:

ACPI: PM: Preparing to enter system sleep state S3

Fix

Disable all sleep and hibernation modes by editing /etc/systemd/sleep.conf:

[Sleep]
AllowSuspend=no
AllowHibernation=no
AllowSuspendThenHibernate=no
AllowHybridSleep=no

Then reload the systemd configuration:

sudo systemctl daemon-reload