Disaster Recovery for a One-Server Laravel App: The 2am Restore

Your backup strategy is not a strategy until you've done the restore. I don't mean read through the steps — I mean actually spun up a blank box, followed your runbook, and watched the app come back. Most shops haven't. I know because I was one of them, until a managed hosting client's VPS provider had a catastrophic node failure and I was staring at a blank server at 1:47am.

That night taught me more about disaster recovery than the previous ten years of "yes we have backups" had.

What We're Protecting Against

For a one-server Laravel app, the failure modes that actually happen are:

Provider-level hardware failure or datacenter incident (rare, but unrecoverable without offsite backups)
Runaway migration or bad deploy that corrupts the database
Someone runs php artisan migrate:fresh on production (yes, this happens)
Disk fills up and MySQL or Postgres starts corrupting tables
Compromised server — you need to nuke it and restore clean

A one-server setup means your app, web server, database, queue workers, and scheduler all live together. The blast radius of any failure is total. You don't have a replica to promote. You don't have a snapshot you can roll back in thirty seconds. You have whatever you actually backed up, and however fast you can restore it.

The Backup Stack I Actually Use

For NWOS-managed clients on a single VPS, I run a layered approach:

1. Daily database dumps, shipped offsite

For Postgres:

#!/bin/bash
# /usr/local/bin/backup-db.sh

# Abort on the first failed step so a broken dump is never encrypted or uploaded
set -euo pipefail

DATE=$(date +%Y-%m-%d-%H%M)
APP="myclient"
BACKUP_DIR="/var/backups/db"
S3_BUCKET="s3://myclient-backups/db"

mkdir -p "$BACKUP_DIR"

pg_dump -U postgres -Fc "${APP}_production" > "${BACKUP_DIR}/${APP}-${DATE}.dump"

# Encrypt before shipping
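gpg --batch --yes --recipient backups@nwos.com \
    --output "${BACKUP_DIR}/${APP}-${DATE}.dump.gpg" \
    --encrypt "${BACKUP_DIR}/${APP}-${DATE}.dump"

# Remove the plaintext dump; the cleanup below only matches *.dump.gpg
rm -f "${BACKUP_DIR}/${APP}-${DATE}.dump"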

aws s3 cp "${BACKUP_DIR}/${APP}-${DATE}.dump.gpg" "${S3_BUCKET}/"

# Keep 7 days local, 90 days in S3
find "$BACKUP_DIR" -name "*.dump.gpg" -mtime +7 -delete

# The 90-day S3 expiry is a bucket lifecycle rule set in the S3 console, not a command in this script

For MySQL/MariaDB, swap pg_dump for:

mysqldump --single-transaction --quick --lock-tables=false \
    -u root -p"${DB_PASSWORD}" "${APP}_production" | gzip > "${BACKUP_DIR}/${APP}-${DATE}.sql.gz"

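The --single-transaction flag is non-negotiable for InnoDB. It dumps from one consistent snapshot without locking tables. Without it, on a live app, different tables are captured at different moments and the dump can be internally inconsistent.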
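
Whichever engine you dump, it's worth a cheap sanity check before the file leaves the box. The sketch below assumes a hypothetical helper, verify_dump (not part of the scripts above), that rejects an empty file and, for gzipped MySQL dumps, a corrupt archive:

```shell
#!/bin/bash
# verify_dump is a hypothetical helper, not part of the backup scripts above.
# Returns 0 if the file is non-empty and (for .gz files) a readable gzip archive.
verify_dump() {
    local file="$1"

    # -s is true only when the file exists and is larger than zero bytes
    if [ ! -s "$file" ]; then
        echo "empty or missing dump: $file" >&2
        return 1
    fi

    # gzip -t checks archive integrity without extracting anything
    case "$file" in
        *.gz)
            if ! gzip -t "$file" 2>/dev/null; then
                echo "corrupt gzip archive: $file" >&2
                return 1
            fi
            ;;
    esac
    return 0
}
```

Call it right after the dump step and before encrypting or uploading. It won't catch a mysqldump that died mid-pipe, because gzip still writes a valid archive for whatever it received; running the script with set -o pipefail covers that case.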

2. Filesystem snapshot via rsync to a second location

#!/bin/bash
# /usr/local/bin/backup-files.sh

rsync -az --delete \
    --exclude='node_modules' \
    --exclude='.git' \
    --exclude='storage/logs' \
    /var/www/myclient/ \
    backup-user@backup-host:/backups/myclient/files/

I back up /var/www, /etc/nginx, /etc/supervisor, and /etc/cron.d. The app code itself is in Git so it's mostly the .env, the storage/app uploads, and config files I care about. But grabbing the whole web root takes seconds and saves archaeology at 2am.
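
The script above only covers the web root. For the config directories, something like the following would do. It's a sketch: the destination path and the backup_configs function name are my assumptions here, and it expects /backups/myclient/etc to already exist on the backup host.

```shell
#!/bin/bash
# Assumed destination; adjust host and path to your own backup target.
# The parent directory must already exist on the backup host.
DEST="backup-user@backup-host:/backups/myclient/etc"

# Mirror each config directory into its own folder under DEST.
# With DRY_RUN=1 the rsync commands are printed instead of executed.
backup_configs() {
    local dir cmd
    for dir in /etc/nginx /etc/supervisor /etc/cron.d; do
        cmd=(rsync -az --delete "${dir}/" "${DEST}/$(basename "$dir")/")
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "${cmd[*]}"
        else
            "${cmd[@]}"
        fi
    done
}
```

Running DRY_RUN=1 backup_configs first shows exactly what will be mirrored before anything is copied.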

3. The runbook

This is the part people skip. Every client has a RUNBOOK.md in a private repo. Not because I'll forget — I won't — but because at 2am after being woken up, I want to follow a checklist, not think.

What the Restore Actually Looks Like

Here's the sequence I followed during that node failure incident, cleaned up into something repeatable.

Spin up the new server (15 minutes)

I use a standard Debian or Ubuntu LTS. I have an Ansible playbook that installs Nginx, PHP-FPM (pinned version), Postgres or MySQL, Redis, Supervisor, certbot, and a deploy user. Running it:

ansible-playbook -i hosts/production provision.yml

If you don't have this automated, the restore takes three hours instead of thirty minutes. That's the tax you pay. Write the playbook once.

Pull the app code (2 minutes)

cd /var/www
git clone git@github.com:nwos/myclient.git myclient
cd myclient
cp /root/restore/.env .env  # pre-staged from secrets manager or 1Password
composer install --no-dev --optimize-autoloader
npm ci && npm run build

Restore the database (5-10 minutes depending on size)

For Postgres:

# Pull from S3
aws s3 cp s3://myclient-backups/db/myclient-2025-01-14-0300.dump.gpg /tmp/

# Decrypt
gpg --output /tmp/myclient.dump --decrypt /tmp/myclient-2025-01-14-0300.dump.gpg

# Create the database
psql -U postgres -c "CREATE DATABASE myclient_production;"

# Restore
pg_restore -U postgres -d myclient_production /tmp/myclient.dump

For MySQL:

aws s3 cp s3://myclient-backups/db/myclient-2025-01-14-0300.sql.gz /tmp/

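# mysqldump of a single database omits CREATE DATABASE, so create it first
mysql -u root -p -e "CREATE DATABASE myclient_production;"
mysql -u root -p myclient_production < <(zcat /tmp/myclient-2025-01-14-0300.sql.gz)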

Restore uploads (varies)

rsync -az backup-user@backup-host:/backups/myclient/files/storage/app/ \
    /var/www/myclient/storage/app/

Wire up the app

php artisan config:cache
php artisan route:cache
php artisan view:cache
php artisan storage:link
chown -R www-data:www-data /var/www/myclient/storage
chown -R www-data:www-data /var/www/myclient/bootstrap/cache

Then copy the Nginx vhost config back from your backup, reload Nginx, restart PHP-FPM and Supervisor, update the DNS A record to the new IP, and you're waiting on TTL.
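
That paragraph compresses several commands. As a rough sketch of the service part, assuming Debian/Ubuntu unit names (nginx, php8.2-fpm, supervisor) and a hypothetical run wrapper that only prints commands when DRY_RUN=1:

```shell
#!/bin/bash
# run is a hypothetical wrapper: prints the command when DRY_RUN=1, runs it otherwise.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Validate the Nginx config first, then bring services back in order.
# Unit names are assumptions for Debian/Ubuntu; pass your PHP-FPM unit as $1.
restart_stack() {
    local php_service="${1:-php8.2-fpm}"
    run nginx -t || return 1             # stop here if the restored vhost is broken
    run systemctl reload nginx
    run systemctl restart "$php_service"
    run systemctl restart supervisor
}
```

Checking nginx -t before the reload matters most here: a vhost copied from backup may reference certificate paths that don't exist yet on the new box.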

Total wall-clock time in that real incident: 47 minutes from blank server to live app.

That's with a ~4GB Postgres database, about 12GB of file uploads, and me running this restore for the first time, mid-incident, on adrenaline. With a practiced runbook it should be 30 minutes or less for most apps.

The Gotchas That Will Bite You

You've never tested the GPG decryption key on a fresh machine. The private key lives... where? If the answer is "on the server that just died," you're done. The GPG private key needs to be in your secrets manager, in 1Password, somewhere that survives the server. I keep it in 1Password with a note on the passphrase.
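
One way to test this during a drill is to decrypt a real backup on the fresh machine and throw the output away. A minimal sketch, assuming the private key has already been imported there; check_decrypt is a hypothetical helper name:

```shell
#!/bin/bash
# check_decrypt is a hypothetical helper for restore drills.
# It tries to decrypt an encrypted backup and discards the plaintext.
# Exit codes: 0 = decrypts fine, 1 = gpg failed, 2 = sample file not readable.
check_decrypt() {
    local sample="$1"

    if [ ! -r "$sample" ]; then
        echo "no readable sample file: $sample" >&2
        return 2
    fi

    # Decrypted data goes to stdout, which is discarded; only the exit status matters
    if gpg --quiet --decrypt "$sample" > /dev/null; then
        echo "decrypt OK"
    else
        echo "decrypt FAILED: is the private key imported on this machine?" >&2
        return 1
    fi
}
```

Run it against the newest .dump.gpg pulled from S3, for example check_decrypt /tmp/myclient-2025-01-14-0300.dump.gpg. If gpg prompts for the passphrase, that's also the moment to find out whether you can actually produce it.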

Your .env is not backed up anywhere. The database dump is useless if you don't have the encryption keys, third-party API credentials, and APP_KEY that go with it. APP_KEY mismatch means every encrypted field in your database is garbage. Back up .env separately, encrypted, offsite. I use 1Password vaults per client.
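
Before restoring the database, a quick check that the staged .env actually carries an APP_KEY saves a confusing debugging session later. A small sketch; env_has_app_key is a hypothetical helper, and it only checks that the key is present and non-empty, not that it's the right one:

```shell
#!/bin/bash
# env_has_app_key is a hypothetical pre-restore check.
# Exit codes: 0 = APP_KEY set to a non-empty value, 1 = missing or empty,
# 2 = the .env file itself is not readable.
env_has_app_key() {
    local file="$1"

    [ -r "$file" ] || return 2

    # Match a line that starts with APP_KEY= followed by at least one character
    grep -Eq '^APP_KEY=.+' "$file"
}
```

For example: env_has_app_key /root/restore/.env || echo "staged .env has no APP_KEY".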

The PHP version on the new server doesn't match. Your Ansible playbook needs to pin the PHP version. If you restore a Laravel 10 app expecting PHP 8.2 onto a box that installs 8.3 by default, you might be fine, you might not. Pin it.
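
Pinning in Ansible is one half; checking the result on the new box is the other. A sketch of that check, with check_php_version as a hypothetical helper and 8.2 as the example version from above:

```shell
#!/bin/bash
# check_php_version is a hypothetical helper for the runbook.
# $1 = expected major.minor version, $2 = actual version (optional;
# if omitted, it is read from the php binary on PATH).
check_php_version() {
    local expected="$1"
    local actual="${2:-$(php -r 'echo PHP_MAJOR_VERSION . "." . PHP_MINOR_VERSION;')}"

    if [ "$actual" = "$expected" ]; then
        echo "PHP ${actual} OK"
    else
        echo "PHP mismatch: expected ${expected}, found ${actual}" >&2
        return 1
    fi
}
```

Running check_php_version 8.2 right after provisioning catches the mismatch before composer install does.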

pg_restore errors that aren't fatal look fatal at 2am. Postgres will emit warnings about roles not existing, sequences, extension ownership. Most of them are harmless. Know the difference before you're exhausted and second-guessing whether the restore worked.
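
One way to take the guesswork out is to capture pg_restore's stderr to a file (pg_restore ... 2> /tmp/restore.log) and summarize it. The sketch below assumes the "ERROR:  role "x" does not exist" wording that recent Postgres versions print; check it against your own version's output before trusting the counts. summarize_restore_log is a hypothetical helper.

```shell
#!/bin/bash
# summarize_restore_log is a hypothetical helper.
# It counts ERROR lines in a saved pg_restore log and separates out the
# missing-role errors, which are usually harmless on a fresh server.
summarize_restore_log() {
    local log="$1" total roles

    # grep -c prints 0 and exits non-zero when nothing matches, hence || true
    total=$(grep -c 'ERROR:' "$log" || true)
    roles=$(grep -c 'ERROR:.*role ".*" does not exist' "$log" || true)

    echo "errors=${total} missing_role=${roles} other=$((total - roles))"
}
```

Anything counted under other deserves a look before you point traffic at the box.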

You didn't back up the queue state. Jobs queued in Redis, Horizon snapshots, and any rows added to the jobs or failed_jobs tables since the last dump don't come back. That's usually acceptable. Just know you might have in-flight jobs that silently disappeared. For healthcare clients I document exactly what was in-flight during the outage window so we can manually reconcile.

DNS TTL is your enemy. If your TTL was 3600 when the incident started, you might be waiting an hour after restore before traffic hits the new server. Set TTL to 300 during normal operations. It costs nothing.
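
You can check the current TTL from any machine with dig. A small sketch, with app.example.com as a placeholder domain and both function names hypothetical. Note that a recursive resolver reports the remaining TTL on its cached record, so query the authoritative nameserver if you want the configured value.

```shell
#!/bin/bash
# Read `dig +noall +answer` output on stdin and print the TTL,
# which is the second column of the first answer line.
ttl_from_answer() {
    awk 'NR == 1 { print $2 }'
}

# Succeeds when the TTL is at or below the limit (default 300 seconds).
ttl_is_short() {
    local ttl="$1" max="${2:-300}"
    [ "$ttl" -le "$max" ]
}

# Usage (placeholder domain):
#   ttl=$(dig +noall +answer app.example.com A | ttl_from_answer)
#   ttl_is_short "$ttl" || echo "TTL is ${ttl}s, lower it before you need it"
```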

When This Approach Is Enough

For most of the apps I manage — a regional e-commerce store, a biotech LIMS integration, a healthcare patient portal — a 30-60 minute RTO (recovery time objective) with a 24-hour RPO (you lose at most a day of data) is acceptable. Nobody loves it, but they accept it when you explain the cost of the alternative.
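
A 24-hour RPO only holds if the newest backup really is less than a day old, and that's easy to monitor. A sketch of the arithmetic, with hypothetical function names; timestamps are Unix epoch seconds:

```shell
#!/bin/bash
# Age of a backup in whole hours (rounded down), given two epoch timestamps.
backup_age_hours() {
    local now="$1" backup="$2"
    echo $(( (now - backup) / 3600 ))
}

# Succeeds when the backup age in whole hours is at or below the RPO.
# Because hours are rounded down, the default of 24 tolerates up to 24h59m,
# which leaves slack for the time the nightly dump itself takes.
within_rpo() {
    local now="$1" backup="$2" rpo_hours="${3:-24}"
    [ "$(backup_age_hours "$now" "$backup")" -le "$rpo_hours" ]
}

# Usage with GNU stat (hypothetical file path):
#   within_rpo "$(date +%s)" "$(stat -c %Y /var/backups/db/latest.dump.gpg)" \
#       || echo "newest backup is older than the RPO"
```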

This approach is not enough when:

You have financial transaction data where even an hour of loss is a compliance problem
You're processing real-time orders and a 30-minute outage means thousands in lost revenue
You have regulatory SLAs that require documented sub-hour RTO

In those cases you need a hot standby, streaming replication, and a failover plan — which means you've outgrown the one-server architecture anyway. The answer isn't a better backup script, it's a different infrastructure topology.

What I'd Tell Past Me

Do the restore drill once a quarter. Spin up a $6 VPS, follow your runbook, verify the app works, destroy the server. It takes ninety minutes and it will find at least one thing that would have cost you hours during an actual incident.
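
For the "verify the app works" step, a scripted smoke test beats clicking around. A sketch using curl; the function names, domain, and IP are placeholders, and treating any 2xx or 3xx as healthy is my assumption (a redirect to the login page counts as alive).

```shell
#!/bin/bash
# Print the HTTP status code for a request; extra curl arguments pass through.
# curl prints 000 when the connection itself fails.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$@"
}

# Succeeds for any 2xx or 3xx status code.
is_healthy() {
    case "$1" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}

# Usage: hit the drill server by IP while DNS still points at production.
#   code=$(http_status --resolve app.example.com:443:203.0.113.10 https://app.example.com/)
#   is_healthy "$code" || echo "smoke test failed with status ${code}"
```

The --resolve option lets you test the restored box under its real hostname without touching DNS, which is exactly what you want in a drill.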

The backup script is twenty lines. The restore runbook is what actually matters. And the only way to know if your runbook works is to run it before you need it.