Host Sleep States Causing Inference Crashes
Some hosts are configured to enter sleep mode automatically, which can interrupt long inference runs. When the system enters the S3 (suspend-to-RAM) sleep state, power to PCIe devices may be cut, so the host loses communication with the accelerator and the run fails.
Diagnosis
Check the kernel log for sleep-state transitions:
sudo dmesg | grep -i "S3"
If the system entered S3, you will see a line such as:
ACPI: PM: Preparing to enter system sleep state S3
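A bare search for "S3" can also match unrelated log lines, such as device names. As a minimal sketch, the filter below matches only the sleep-state message shown above. The function name find_s3_entries is hypothetical, and the exact message prefix may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: print only the kernel log lines that announce an
# S3 transition. Reads log text on stdin, so it works with any log source.
# Exit status is non-zero when no matching line is found (grep behaviour).
find_s3_entries() {
    grep -i 'system sleep state S3'
}
```

Usage, assuming the current kernel ring buffer still holds the relevant messages: sudo dmesg | find_s3_entries. If the host has rebooted since the crash and a persistent journal is enabled, journalctl -k -b -1 | find_s3_entries searches the previous boot instead.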
Fix
Disable all sleep and hibernation modes by editing /etc/systemd/sleep.conf:
[Sleep]
AllowSuspend=no
AllowHibernation=no
AllowSuspendThenHibernate=no
AllowHybridSleep=no
Then reload the systemd configuration:
sudo systemctl daemon-reload
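To confirm that the edit was saved as intended, the four settings can be checked with a small script. This is a sketch under stated assumptions: the function name check_sleep_conf is hypothetical, it inspects only the one file it is given (not drop-in directories), and it accepts only the literal value "no" used in the fix above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every Allow* key from the fix
# is present and set to "no" in the given file.
check_sleep_conf() {
    conf="${1:-/etc/systemd/sleep.conf}"
    for key in AllowSuspend AllowHibernation AllowSuspendThenHibernate AllowHybridSleep; do
        # Anchor on "key =" so AllowSuspend does not match AllowSuspendThenHibernate.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*no[[:space:]]*$" "$conf"; then
            echo "missing or not disabled: ${key}" >&2
            return 1
        fi
    done
    return 0
}
```

Usage: check_sleep_conf && echo "sleep states disabled". To view the merged configuration including any drop-in files, systemd-analyze cat-config systemd/sleep.conf prints what systemd will read.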