Fixing Longhorn Volumes That Refuse to Attach
This document (000021788) is provided subject to the disclaimer at the end of this document.
Environment
Longhorn v1.2.0+
Situation
Longhorn volumes in Kubernetes clusters can sometimes fail to attach to pods, causing persistent issues where pods enter restart loops or remain in a pending state. These issues often occur in environments with node disruptions, incorrect scheduling, or replica faults. This article outlines common attachment issues and provides troubleshooting steps and resolutions to restore volume functionality.
Resolution
Scenario 1: Volumes Detach Unexpectedly and Won’t Reattach
- Restart the Deployment or StatefulSet to recreate the pods (see the example after this list).
- Ensure your Longhorn version is 1.2.0 or later, which includes automatic recreation of pods after unexpected detachments.
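For example, assuming a Deployment named web-app in the default namespace (both names are placeholders), the pods can be recreated with a rolling restart:

# Recreate all pods managed by the Deployment
kubectl -n default rollout restart deployment/web-app

# The equivalent for a StatefulSet
kubectl -n default rollout restart statefulset/web-app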
Scenario 2: Volumes Can't Attach Even After Pod Recreation
- Identify the affected PVC and its bound PersistentVolume (PV).
- Scale down the pods and ensure the volume is detached.
- Use kubectl -n longhorn-system edit volumes.longhorn.io <volume-name> to clear these fields:
  - spec.nodeID
  - status.currentNodeID
  - status.ownerID
  - status.pendingNodeID
- Reapply the changes and scale the pods back up (a sample command sequence follows this list).
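A minimal command sequence for this scenario, assuming a Deployment named my-app and a PVC named my-data-pvc in the my-namespace namespace (all placeholder names); for dynamically provisioned volumes, the Longhorn volume name normally matches the bound PV name:

# Find the PV bound to the affected PVC
kubectl -n my-namespace get pvc my-data-pvc -o jsonpath='{.spec.volumeName}'

# Scale the workload down so the volume can detach
kubectl -n my-namespace scale deployment/my-app --replicas=0

# Confirm the Longhorn volume reports a detached state before editing it
kubectl -n longhorn-system get volumes.longhorn.io <volume-name>

# Clear spec.nodeID, status.currentNodeID, status.ownerID and status.pendingNodeID, then save
kubectl -n longhorn-system edit volumes.longhorn.io <volume-name>

# Scale the workload back up so the volume attaches again
kubectl -n my-namespace scale deployment/my-app --replicas=1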
Scenario 3: Volumes Can’t Attach Due to Prior Attachments (RWO Limitation)
- Use spec.nodeName in the pod template to pin all pods accessing the volume to the same node (a sketch follows this list).
- Alternatively, use Pod Affinity to co-locate the pods.
- Consider migrating to RWX (ReadWriteMany) volumes via the Longhorn Share Manager (NFS-backed).
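A minimal sketch of the node-pinning approach, assuming a Deployment whose pods mount the volume, a node named worker-1, and a PVC named my-data-pvc (all hypothetical names):

# Pod template excerpt: every replica lands on the same node,
# so the RWO volume only ever needs to attach there
spec:
  template:
    spec:
      nodeName: worker-1            # hypothetical node name
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-data-pvc  # placeholder PVC name

Pod affinity achieves the same co-location without hard-coding a node name and is usually preferable when the node itself may change.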
Scenario 4: Faulted Replicas Blocking Attachments
- Try the Salvage option from the Longhorn UI.
- If that fails and the fault is acceptable (a command sketch follows this list):
  - Use kubectl -n longhorn-system edit replicas.longhorn.io <replica-name>
  - Clear the spec.failedAt field.
  - Force-reattach the volume (only if data integrity is not critical).
- For production workloads, consider using multiple replicas for fault tolerance.
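The affected replicas can be located and edited with kubectl; the volume and replica names below are placeholders, and replica names contain the name of the volume they belong to:

# List the replicas belonging to the affected volume
kubectl -n longhorn-system get replicas.longhorn.io | grep <volume-name>

# Clear the spec.failedAt field (set it to an empty string), then save
kubectl -n longhorn-system edit replicas.longhorn.io <replica-name>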
Cause
Longhorn volumes can become unresponsive or fail to attach due to inconsistencies in the control plane state or underlying replica issues. These problems typically arise when the system is interrupted or doesn't clean up volume metadata properly after node transitions, pod restarts, or other cluster events.
One common cause is that Longhorn retains outdated information about a volume's attachment status. For example, fields like nodeID, ownerID, or currentNodeID may still be set, causing Longhorn to incorrectly assume the volume is already in use even when it is not. This stale state can block new attachment attempts and leave pods stuck in a Pending state or a crash loop.
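Whether such stale state is present can be verified directly on the Longhorn volume resource; the volume name is a placeholder:

# Show the attachment-related fields of the Longhorn volume
kubectl -n longhorn-system get volumes.longhorn.io <volume-name> -o yaml | grep -E 'nodeID|ownerID'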
Another issue occurs when replicas are marked as faulted due to incomplete writes, running out of space, or disruptions during I/O. Even if the data is still usable, Longhorn may refuse to attach the volume for safety reasons unless it's manually salvaged or reset.
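Faulted replicas can also be identified from the command line; a replica whose spec.failedAt field is non-empty is considered failed:

# List replicas together with their failedAt timestamps
kubectl -n longhorn-system get replicas.longhorn.io -o custom-columns=NAME:.metadata.name,FAILEDAT:.spec.failedAt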
In environments using ReadWriteOnce (RWO) volumes, attachment failures can also happen when multiple pods or workloads try to access the volume from different nodes. If Longhorn believes the volume is still attached elsewhere, it will block new attachments to maintain data integrity.
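The access mode is declared on the PVC. A minimal sketch of an RWX claim served by Longhorn's Share Manager (the claim name and size are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # placeholder name
spec:
  accessModes:
    - ReadWriteMany            # RWX: may be attached from multiple nodes
  storageClassName: longhorn   # Longhorn's default StorageClass name
  resources:
    requests:
      storage: 10Gi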
Finally, during normal node operations like draining, cordoning, or rescheduling pods, volume metadata may become desynchronized. Longhorn might think the volume is still attached or being operated on, leading to attachment errors until the control plane state is corrected.
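One way to see this mismatch is to compare what Kubernetes and Longhorn each report about the attachment; both checks are read-only, and the volume name is a placeholder:

# Kubernetes-side view: one VolumeAttachment object per attached PV
kubectl get volumeattachments.storage.k8s.io

# Longhorn-side view: volume state and the node it believes it is attached to
kubectl -n longhorn-system get volumes.longhorn.io <volume-name>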
Additional Information
Longhorn Documentation: https://longhorn.io/docs/
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021788
- Creation Date: 11-Apr-2025
- Modified Date: 16-Apr-2025
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com