2026-01-26

Incident Report: Vaultwarden Storage Failure

A post-mortem analysis of a catastrophic storage failure, the limits of distributed replication, and the reality of self-hosted sovereignty.

Reliability is about preventing failure and, sometimes, about managing recovery. On January 26th, the physical infrastructure hosting my Vaultwarden instance suffered a power event that exposed a critical flaw in my redundancy strategy.

This is not a complaint, just an analysis of why "High Availability" is not "backup," and why sovereignty requires more than owning the hardware.

So, what happened?

A rapid series of power outages corrupted the distributed block storage system (Longhorn). Despite three replicas spread across different nodes, the corruption was replicated instantly to all of them.

Impact: Total loss of the persistent volume backing Vaultwarden's Postgres database.
Data Loss: 0% (Due to client-side caching and external backups).
Recovery Time: ~2 hours to rebuild the namespace state.

Context

The infrastructure relies on a k3s cluster using Longhorn for distributed block storage. This setup is designed to handle node failures (availability). However, it assumes a reasonably stable physical environment.

The event, a "flickering" power outage (ON-OFF-ON-OFF), hit the rack. By the time I manually cut power to protect the hardware, the filesystem was already corrupted.

Root cause

Why did 3 replicas fail?

It is easy to confuse Redundancy (Availability) with Resilience (Integrity). Longhorn replicates block-level data. When the filesystem corruption occurred on the primary write operation during the power spike, Longhorn's replication engine did exactly what it was designed to do: it replicated that corruption to all three nodes immediately.

Snapshot failure: Even the local snapshots were marked as corrupted because the parent volume metadata was unreadable.
Result: The PVC was effectively bound to a ghost. The data existed on disk, but the filesystem map was unrecoverable.
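
To make the distinction concrete, here is a toy model in Python. It is not how Longhorn's engine is implemented; the class and block names are invented for illustration. It only shows the principle described above: a replicated write copies bad data as faithfully as good data, while a point-in-time copy taken earlier is unaffected.

```python
# Toy model only: the names here are hypothetical and do not mirror Longhorn internals.

class ReplicatedVolume:
    def __init__(self, replica_count=3):
        # Each replica is a simple block -> data mapping.
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, block, data):
        # Synchronous replication: the same bytes go to every replica.
        for replica in self.replicas:
            replica[block] = data

    def read(self, block, replica_index=0):
        return self.replicas[replica_index][block]


def take_backup(volume):
    # Point-in-time copy stored outside the replicated volume.
    return dict(volume.replicas[0])


volume = ReplicatedVolume(replica_count=3)
volume.write("superblock", "valid-metadata")
backup = take_backup(volume)

# In this model, the power event turns the next write into garbage.
volume.write("superblock", "garbage")

replica_states = [volume.read("superblock", i) for i in range(3)]
print(replica_states)        # every replica holds the same corrupted block
print(backup["superblock"])  # the earlier backup still holds the valid block
```

Three replicas give three identical copies of the current state, whatever that state is. Only a copy from a different point in time can undo a bad write.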

The "sovereignty" lesson

Don't rely on assumptions:

Longhorn replicas won't necessarily save you.
Bitwarden client caches (on iOS or macOS) can't really serve as a "last resort" backup. While the Bitwarden client stores personal vault data locally for offline access, it does not store Organization secrets. When I exported my local vault "just in case," I found that the Organization items were empty.

Note: Organization keys are retrieved dynamically. If the server is down and the session token expires or the specific cache is cleared, that data is inaccessible from the client export.

Corrective measures

Recovery worked because I had an external backup less than 48 hours old. Relying on manual intervention, however, is not a strategy.
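
One way to stop relying on memory is to check backup age automatically. The sketch below is a minimal example, not what runs in my cluster: the 48-hour threshold is an assumption borrowed from this incident, and the function name is hypothetical. How the timestamp of the latest backup is obtained (bucket listing, file mtime) is left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, taken from the age of the backup that saved this incident.
MAX_BACKUP_AGE = timedelta(hours=48)


def backup_is_stale(latest_backup_at, now=None, max_age=MAX_BACKUP_AGE):
    """Return True when the newest backup is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - latest_backup_at > max_age


now = datetime(2026, 1, 26, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2026, 1, 25, 3, 0, tzinfo=timezone.utc)  # 33 hours old
stale = datetime(2026, 1, 23, 3, 0, tzinfo=timezone.utc)  # 81 hours old

print(backup_is_stale(fresh, now=now))  # False
print(backup_is_stale(stale, now=now))  # True
```

A check like this is only useful if a stale result triggers an alert that somebody actually sees.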

Implemented changes:

S3 Snapshots: Configured Longhorn to push backups to an object storage bucket.
Database dumps: Added a cronjob to export the Vaultwarden SQL database to flat file storage nightly.
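
The nightly dump can be sketched roughly as follows. This is an illustration under assumptions, not the exact job I deployed: the file naming scheme, the backup directory, and the connection URL are hypothetical. The pg_dump options used (--format=custom, --file, --dbname) are standard ones.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def dump_path(backup_dir, now=None):
    # Timestamped file name (hypothetical scheme) so dumps never overwrite each other.
    now = now or datetime.now(timezone.utc)
    return Path(backup_dir) / f"vaultwarden-{now:%Y%m%dT%H%M%SZ}.dump"


def build_dump_command(database_url, output_file):
    # Custom format is compressed and can be restored with pg_restore.
    return [
        "pg_dump",
        "--format=custom",
        f"--file={output_file}",
        f"--dbname={database_url}",
    ]


def run_dump(database_url, backup_dir):
    # Requires pg_dump on PATH and network access to the database.
    target = dump_path(backup_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_dump_command(database_url, target), check=True)
    return target
```

A scheduled job would call something like run_dump(database_url, "/backups") once per night, with the URL read from a secret instead of being hard-coded. The dump file still has to be copied off the cluster; a dump stored on the same Longhorn volume would share its fate.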

Conclusion

Sovereignty is a method, not a product. Owning the "Cloud" means you also own the outage. If you cannot recover your infrastructure from a bare-metal state using only external keys, you do not own it; you are just renting it from luck.