Cheers.dev backup strategy

Backing up Forgejo seems like it should be straightforward: copy the data directory somewhere safe and call it a day. For cheers.dev, that misses a few important pieces.

Here’s the deal: I couldn’t find clear guidance on backing up Forgejo without downtime. This is my solution. If I’m doing something dumb, send me an email or tweet me or something.

In the architecture post, I mentioned the backup setup but did not spell it out. The short version is that cheers.dev has three data paths:

SQLite, replicated with Litestream
Forgejo object data, stored in DigitalOcean Spaces
Git repositories, protected with frozen DigitalOcean block volume snapshots (daily)

and a bonus:

Forgejo dumps, saved offsite (weekly)

Figure: The cheers.dev backup paths, split by data type.

That sounds a bit fussy, and it is. Git hosting is one of those services where “mostly backed up” is usually the sentence right before a bad afternoon.

What needs to survive

Forgejo is small enough to run comfortably on a single-core droplet, but the data model is not one blob called “the Forgejo stuff.”

For cheers.dev, the important pieces are:

the Forgejo SQLite database
Git repositories on the attached block volume
LFS objects, packages, attachments, avatars, repo archives, Actions logs, and Actions artifacts in Spaces
deployment configuration in the cheers.dev (GitHub) repository
secrets and recovery material in 1Password

The first three are the live service data. The last two are what make a rebuild possible if the droplet is gone.

I care less about a perfect backup system than a predictable recovery path. Perfect backup systems mostly exist in vendor slides and incident reports. Each failure mode should be boring enough that I can follow the runbook instead of inventing one while stressed.

SQLite and Litestream

Forgejo uses SQLite in WAL mode.

I like SQLite here for the same reason I liked it in the architecture post: it keeps the stack small. I do not need a managed database, a second host, or another moving part just to run a small forge for friends and agents.

The database still needs its own recovery path, so Litestream replicates /data/gitea/gitea.db to a dedicated DigitalOcean Spaces bucket. That gives the database a much smaller recovery window than the daily filesystem snapshot cadence.

If the SQLite database is corrupted or lost, the restore path is:

Stop Forgejo.
Restore the database with Litestream.
Move the restored file back into place.
Start Forgejo again.
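
As a rough sketch of those four steps, assuming Forgejo runs as a Compose service named forgejo, the litestream binary and its config are available on the host, and the host-side database path is passed in (the /data/gitea/gitea.db path above may only exist inside the container). The function names are hypothetical, and the integrity check is an extra precaution rather than part of the steps listed above.

```python
import os
import sqlite3
import subprocess
from pathlib import Path


def integrity_ok(db_file: Path) -> bool:
    """Run SQLite's built-in integrity check against a database file."""
    conn = sqlite3.connect(str(db_file))
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]


def install_restored(restored: Path, live: Path) -> None:
    """Swap the restored file into place and drop stale WAL/SHM files."""
    # A leftover -wal or -shm file from the old database must not be
    # paired with the restored file, so remove them before the swap.
    for suffix in ("-wal", "-shm"):
        Path(str(live) + suffix).unlink(missing_ok=True)
    # os.replace is atomic only when both paths are on the same filesystem,
    # which is why the restore target sits next to the live database.
    os.replace(restored, live)


def restore_database(live: Path, service: str = "forgejo") -> None:
    """Stop Forgejo, restore with Litestream, move into place, start again."""
    restored = live.with_name(live.name + ".restored")
    subprocess.run(["docker", "compose", "stop", service], check=True)
    # `litestream restore -o OUT DB` writes the restored copy to OUT using
    # the replica configured for DB.
    subprocess.run(
        ["litestream", "restore", "-o", str(restored), str(live)], check=True
    )
    if not integrity_ok(restored):
        raise RuntimeError(f"restored database failed integrity check: {restored}")
    install_restored(restored, live)
    subprocess.run(["docker", "compose", "start", service], check=True)
```

Restoring to a sibling file first means a failed or corrupt restore never overwrites whatever is currently in place.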

Note: Litestream only covers the SQLite database. It does not back up Git repository objects on disk.

That boundary matters. Litestream protects users, issues, repo metadata, settings, and the rest of the SQLite-backed state. It does not protect the bare Git repositories, and forgetting that is how you end up with a healthy database pointing at repositories you no longer have.

Spaces for object data

Forgejo is configured to put the blob-like data in DigitalOcean Spaces.

That includes LFS objects, packages, attachments, avatars, repository archives, Actions logs, and Actions artifacts. Those are exactly the categories that can grow quietly while everything still looks fine from the web UI.

I like this split because it keeps the attached block volume focused on the filesystem-backed parts Forgejo expects to manage directly. The heavier object data goes to object storage where it belongs.

Spaces is the primary storage location for those objects. For now, recovery for that data depends on DigitalOcean’s object storage durability, bucket configuration, and whether I add an offsite copy later.

For the first version of cheers.dev, that is an acceptable tradeoff. For a bigger instance, or one with users who are not personally known to me, I would want a second copy outside DigitalOcean.

Note: this is an accepted gap, not a clever backup trick. Object storage is durable, but “stored in one provider” is still “stored in one provider.”

Volume snapshots for Git repositories

The Git repositories are the awkward part.

They live on the attached DigitalOcean block volume under /mnt/data/forgejo. They are not in Spaces, and they are not covered by Litestream. If that volume disappears, the repositories disappear with it unless I have a volume snapshot or another copy.

The current backup job creates daily DigitalOcean block volume snapshots of the Forgejo data volume. The timer runs at 09:00 UTC, keeps the newest seven automated snapshots, and leaves manual snapshots alone.
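
The retention rule can be sketched as a pure function over the snapshot list. This assumes automated snapshots are identified by the cheers-data-auto- name prefix and that each entry carries the name and created_at fields the DigitalOcean API returns. The function name is hypothetical.

```python
from datetime import datetime

AUTO_PREFIX = "cheers-data-auto-"  # assumed marker for automated snapshots
KEEP = 7


def _created(snapshot: dict) -> datetime:
    # The API returns timestamps like "2024-01-01T09:00:00Z". Older Python
    # versions do not parse the trailing "Z", so normalize it first.
    return datetime.fromisoformat(snapshot["created_at"].replace("Z", "+00:00"))


def snapshots_to_prune(snapshots: list, keep: int = KEEP) -> list:
    """Return automated snapshots beyond the newest `keep`, newest first.

    Snapshots without the automated prefix are never returned, so manual
    snapshots are left alone.
    """
    automated = [s for s in snapshots if s["name"].startswith(AUTO_PREFIX)]
    automated.sort(key=_created, reverse=True)
    return automated[keep:]
```

Keeping the selection separate from the delete calls makes it easy to check the rule against a fixed list before it is allowed to remove anything.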

The snapshot job does this:

Ask Forgejo to flush queues, best effort.
Arm a systemd dead-man timer that will unfreeze /mnt/data if the backup process dies.
Freeze /mnt/data with fsfreeze.
Ask the DigitalOcean API to create the volume snapshot.
Immediately unfreeze the filesystem.
Wait for the snapshot to finish.
Prune old automated snapshots, keeping the newest seven.
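
The ordering of the freeze-related steps can be sketched like this. It is a minimal outline under assumptions, not the actual script: the unit name and the 60-second dead-man delay are made up, the queue flush, the wait, and the prune are left out, and create_snapshot stands in for whatever makes the DigitalOcean API call.

```python
import subprocess
from typing import Callable, Sequence

MOUNT = "/mnt/data"
DEADMAN_UNIT = "cheers-unfreeze-deadman"  # hypothetical transient unit name
DEADMAN_DELAY = "60s"                     # assumed delay, not from the post


def shell(cmd: Sequence[str]) -> None:
    subprocess.run(list(cmd), check=True)


def snapshot_with_freeze(
    create_snapshot: Callable[[], str],
    run: Callable[[Sequence[str]], None] = shell,
) -> str:
    """Arm the dead-man timer, freeze, snapshot, and always unfreeze."""
    # A transient systemd timer that unfreezes the mount on its own,
    # even if this process is killed before it can clean up.
    run([
        "systemd-run", f"--unit={DEADMAN_UNIT}", f"--on-active={DEADMAN_DELAY}",
        "fsfreeze", "--unfreeze", MOUNT,
    ])
    try:
        run(["fsfreeze", "--freeze", MOUNT])
        try:
            # Only the request to start the snapshot happens while frozen.
            return create_snapshot()
        finally:
            run(["fsfreeze", "--unfreeze", MOUNT])
    finally:
        # Best effort: the timer may already have fired and gone away.
        try:
            run(["systemctl", "stop", f"{DEADMAN_UNIT}.timer"])
        except subprocess.CalledProcessError:
            pass
```

The nested try/finally means a failed freeze never triggers an unfreeze of a filesystem that was not frozen, while a failed API call still does.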

The freeze window is intentionally tiny. In the last run I did, the filesystem was frozen for about a second. That is the shape I want: make the snapshot consistent, but do not turn the Git host into an ice sculpture because an API call got weird.

Two implementation details matter here.

First, the DigitalOcean API call has a hard timeout while the filesystem is frozen. If the API does not respond quickly, the job fails and unfreezes the volume instead of hanging indefinitely.
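
A sketch of a time-bounded snapshot request using only the Python standard library. The 10-second default is an assumption, and the helper names are hypothetical. The endpoint is DigitalOcean's block storage snapshot call, POST /v2/volumes/{volume_id}/snapshots.

```python
import json
import urllib.request

API = "https://api.digitalocean.com/v2"


def build_snapshot_request(volume_id: str, name: str, token: str) -> urllib.request.Request:
    """Build the POST request that asks DigitalOcean to snapshot a volume."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{API}/volumes/{volume_id}/snapshots",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def create_snapshot(volume_id: str, name: str, token: str, timeout_s: float = 10.0) -> str:
    """Send the request and return the new snapshot's id.

    Raises on HTTP errors and timeouts, so the caller's cleanup path
    (unfreezing the filesystem) runs instead of waiting forever.
    """
    request = build_snapshot_request(volume_id, name, token)
    # Note: urlopen's timeout applies per blocking socket operation,
    # not to the total wall-clock time of the call.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)["snapshot"]["id"]
```

Because the socket timeout is not a strict overall deadline, a separate backstop for the unfreeze is still worth having.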

Second, the dead-man timer exists because trap handlers are great until the process gets killed hard. If the script dies between freeze and unfreeze, systemd still has a separate timer whose only job is to unfreeze the mount.

That is the kind of small paranoia I want in operational code. I like boring.

Recovery windows

The recovery windows are intentionally different for each data type.

The SQLite database has the smallest recovery window because Litestream replicates it continuously. The Git repositories have a daily recovery window because they depend on the block volume snapshots. The object data in Spaces relies primarily on Spaces durability unless I add a separate replication job.

Practically, that means:

database recovery should land near the Litestream replication point
repository recovery is bounded by the latest block volume snapshot, roughly twenty-four hours in the normal case
object storage recovery depends on the Spaces bucket surviving, unless I add offsite sync
full service recovery also depends on the deployment repo, 1Password, DigitalOcean, Cloudflare, Tailscale, Resend, and working admin access

That last bullet is not really a backup window, but it matters. A backup you cannot restore because DNS, credentials, or an access path is missing is not much of a backup.

Restore scenarios I care about

The recovery runbook breaks this down by failure mode. That works better than a single “restore cheers.dev” page when something is already on fire.

If the core droplet dies but the volume survives, the job is mostly rebuild work: create a replacement droplet in the same region and VPC, attach the surviving volume, reassign the reserved IP, restore Tailscale access, rsync the deployment config, restore .env from 1Password, and start the Compose stack.

If the block volume is lost, the job is to restore the newest cheers-data-auto-* snapshot into a new volume, attach it, mount it at /mnt/data, and then use Litestream if the database needs a newer restore point than the snapshot contains.

If only the SQLite database is broken, the job is smaller: stop Forgejo, restore the DB with Litestream, move it into place, and start Forgejo again.

If a single repository is corrupted or deleted, the current process is less elegant. Restore the latest volume snapshot to a second volume, mount it read-only, copy the target bare repository back into place, then detach the restore volume.
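
The copy-back step might look like the following sketch. It assumes the restore volume is already mounted read-only and that both paths point at bare repository directories. The function name and the .restoring and .broken suffixes are made up for illustration.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def restore_repo(snapshot_repo: Path, live_repo: Path) -> Optional[Path]:
    """Copy one bare repository from the restore volume into place.

    Returns the path where the previous (broken) copy was set aside,
    or None if the repository was missing entirely.
    """
    # Every bare repository has a HEAD file at its top level.
    if not (snapshot_repo / "HEAD").is_file():
        raise ValueError(f"{snapshot_repo} does not look like a bare repository")

    # Copy next to the target first, so a failed copy never leaves a
    # half-written repository at the live path.
    staging = live_repo.with_name(live_repo.name + ".restoring")
    shutil.copytree(snapshot_repo, staging, symlinks=True)

    set_aside = None
    if live_repo.exists():
        set_aside = live_repo.with_name(live_repo.name + ".broken")
        os.rename(live_repo, set_aside)
    os.rename(staging, live_repo)
    return set_aside
```

copytree does not preserve file ownership, so the restored directory may need its owner reset to whatever user Forgejo runs as before the repository is usable again.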

That last one is a good example of “fine for now, not ideal forever.” Per-repository restore would be nicer, but I am not going to design a larger backup system before the small one has proven it can actually restore.

Known gaps

I am not hiding these gaps:

The snapshot script should assert that /mnt/data is the expected mount before it freezes anything or asks DigitalOcean to snapshot the configured volume. The job snapshots the correct DigitalOcean volume by name, but I still want a local mountpoint or UUID guard so a missing mount cannot produce a misleadingly successful run.
OnFailure catches failed runs, not missing runs. If the systemd timer is disabled or never fires, the alert path does not help. That is a monitoring problem, but it still belongs in the backup story.
Object storage and volume snapshots do not have an offsite copy yet. Right now the setup is mostly DigitalOcean-internal, which is fine for a small friends instance but not the same thing as multi-provider disaster recovery.
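
The first two gaps could be narrowed with small checks along these lines. This is a sketch of one possible approach, not something the current setup does: the expected device name and the 26-hour staleness threshold are assumptions, and all function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def mount_source(mounts_text: str, mountpoint: str) -> Optional[str]:
    """Return the device mounted at `mountpoint`, given /proc/self/mounts text."""
    source = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Format: device mountpoint fstype options dump pass
        if len(fields) >= 2 and fields[1] == mountpoint:
            source = fields[0]  # the last matching entry is the visible one
    return source


def assert_expected_mount(mounts_text: str, mountpoint: str, expected_source: str) -> None:
    """Refuse to continue unless the expected device is mounted there."""
    actual = mount_source(mounts_text, mountpoint)
    if actual != expected_source:
        raise RuntimeError(
            f"{mountpoint} is backed by {actual!r}, expected {expected_source!r}"
        )


def snapshot_is_stale(
    newest_created_at: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: one day plus slack
) -> bool:
    """True when the newest automated snapshot is older than `max_age`."""
    created = datetime.fromisoformat(newest_created_at.replace("Z", "+00:00"))
    return now - created > max_age
```

The mount guard would run before the freeze. The staleness check only helps if it runs somewhere other than the backup timer itself, since a timer that never fires cannot report its own absence.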

Backups are claims. Restores are evidence.

Current shape

The current backup strategy is deliberately layered:

Litestream protects SQLite
Spaces holds the bulky Forgejo object data
daily frozen block volume snapshots protect Git repositories and filesystem data
1Password holds the secrets needed to rebuild
the deployment repo holds the service configuration
the runbook describes the recovery paths

None of this is exotic. That is the point.

I want cheers.dev to be small enough that I can understand it, but serious enough that losing the host is an inconvenience instead of a tragedy. The backup setup is not finished, and the restore drill still matters more than this post.

The shape is there now: separate the data types, give each one a recovery path, keep the blast radius understandable, and write the process down before future me has to reconstruct it from shell history and panic.