Postgres Replication Lag Burned Me: Read-After-Write Is Harder Than It Looks
I shipped a feature that worked perfectly in dev and fell apart in production. The culprit was replication lag I didn't account for.
I deployed a feature for a healthcare client that let patients update their contact information and immediately see a confirmation screen reflecting those changes. It worked flawlessly in every environment I tested. It broke in production within the first hour. The bug took me an embarrassing amount of time to track down, and the root cause was something I already knew about but hadn't respected: replication lag.
What Replication Lag Actually Means in Practice
If you're running Postgres with a primary and one or more read replicas — which you probably are if you're on RDS, Aurora, Supabase, Railway, or any managed host worth using — writes go to the primary and reads can be distributed to replicas. That's the whole point. Offload read traffic, keep the primary free for writes.
The problem is that replication isn't instantaneous. Postgres uses streaming replication by default, which is asynchronous. The primary commits a write, sends the WAL (write-ahead log) to replicas, and the replicas apply it. On a healthy, lightly loaded system that might happen in under a millisecond. On a busy system, or when network conditions aren't ideal, or when a replica is catching up after a restart, you might see lag of hundreds of milliseconds or even seconds.
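You can watch that lag directly. On Postgres 10 and later, the primary exposes per-standby status in pg_stat_replication — a sketch of the query I use (column availability depends on your Postgres version and whether the standby has been active recently):

```sql
-- Run on the PRIMARY: one row per connected standby.
SELECT client_addr,
       state,
       -- bytes of WAL the standby has not yet replayed
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_bytes_behind,
       write_lag,   -- time until standby received the WAL
       flush_lag,   -- time until standby flushed it to disk
       replay_lag   -- time until standby applied it (what your reads care about)
FROM pg_stat_replication;
```

replay_lag is the number that matters for read-after-write: it's how far behind a query on that standby actually is.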
Most of the time, you don't care. A user browsing a product catalog doesn't need to see data that's 200ms fresher. But the moment a user writes something and immediately reads it back, you have a read-after-write consistency problem. The write landed on the primary. The read went to a replica that hasn't caught up yet. The user sees stale data — or worse, sees that their change didn't happen at all.
That's exactly what happened with my patient portal. User submits a new phone number, write goes to primary, browser redirects to a confirmation page, that page fires a query to a replica, replica is 300ms behind, phone number looks unchanged. Patient calls the front desk, staff is confused, I get a support ticket.
How the Bug Hides From You
Dev and staging environments almost never reproduce this. You're typically running a single Postgres instance — no replicas, no lag. Every read hits the same server that just processed the write. Works every time.
In production with a managed host, your Laravel DB_READ_HOST, your PgBouncer config, or your RDS reader endpoint is silently routing reads to a replica. You added that months ago for performance reasons, felt good about it, moved on.
Here's the Laravel config that creates the problem invisibly:
// config/database.php
'pgsql' => [
    'driver' => 'pgsql',
    'read' => [
        'host' => [
            env('DB_READ_HOST_1', '10.0.1.52'),
            env('DB_READ_HOST_2', '10.0.1.53'),
        ],
    ],
    'write' => [
        'host' => env('DB_HOST', '10.0.1.10'),
    ],
    // ... rest of config
],
The moment you split read/write like this, every DB::select(), every Eloquent ->get(), every ->first() goes to a read host. Laravel handles this transparently. That's convenient right up until it isn't.
What I Do About It Now
There are a few strategies. I'll tell you which one I actually use and why.
Option 1: Use a synchronous commit or synchronous standby.
You can configure Postgres to wait for at least one replica to confirm before acknowledging a commit: set synchronous_standby_names to name your standbys, and set synchronous_commit = remote_apply. The remote_apply part matters — the default value of on only waits for the standby to flush the WAL to disk, not to apply it, so a read on the standby can still miss the write. With remote_apply, a commit doesn't return until the standby has applied the change and a read there will see it. This gives you strong consistency but you pay for it in write latency and complexity. I've never done this in production because the latency penalty isn't worth it for the types of apps I build. If you're running a financial ledger, maybe. For healthcare portals and e-commerce, no.
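For reference, the configuration is two lines on the primary — the standby names below are placeholders and have to match the application_name each standby reports:

```
# postgresql.conf on the primary (standby names are examples)
synchronous_standby_names = 'ANY 1 (replica_a, replica_b)'

# remote_apply: commit waits until a synchronous standby has APPLIED
# the WAL, so a follow-up read on that standby sees the write.
# The default 'on' only waits for the standby to flush it to disk.
synchronous_commit = remote_apply
```

Note that synchronous_commit can also be set per-session or per-transaction, so you can pay the latency cost only on the writes that need it.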
Option 2: Sticky sessions — route the user to the primary for a window after a write.
Store something in the session after a write that says "this user just wrote, send their reads to primary for the next N seconds." Laravel's built-in 'sticky' => true option does a version of this, but only within a single request cycle — it doesn't survive the redirect to the confirmation page, which is exactly the failing case here. Extending the window across requests works, but it's fiddly and easy to get wrong. If your app has lots of write paths, you'll forget one.
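If you do go this route, it can be sketched as Laravel middleware. The class name, session key, and five-second window below are all mine, not a standard, and it assumes the connection has 'sticky' => true so that recordsHaveBeenModified() actually reroutes reads to the write PDO:

```php
<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\DB;

// Sketch only: routes a user's reads to the primary for a short
// window after any write request. Requires 'sticky' => true on the
// connection for recordsHaveBeenModified() to take effect.
class StickToPrimary
{
    private const WINDOW_SECONDS = 5;

    public function handle(Request $request, Closure $next)
    {
        $wroteAt = $request->session()->get('wrote_at');

        if ($wroteAt && now()->timestamp - $wroteAt < self::WINDOW_SECONDS) {
            // Pretend a write already happened so sticky routing
            // sends this request's reads to the write connection.
            DB::connection()->recordsHaveBeenModified();
        }

        $response = $next($request);

        // Any mutating verb starts (or restarts) the window.
        if (in_array($request->method(), ['POST', 'PUT', 'PATCH', 'DELETE'], true)) {
            $request->session()->put('wrote_at', now()->timestamp);
        }

        return $response;
    }
}
```

The "you'll forget one" failure mode is visible right in the sketch: it keys off HTTP verbs, so a write buried inside a GET handler slips through.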
Option 3: Force specific queries to the primary connection.
This is what I actually do. Laravel exposes DB::connection('pgsql_primary') or you can use the ->onWriteConnection() trick on Eloquent. I define a named primary-only connection in my database config and I reach for it explicitly in any code path that follows a write.
// config/database.php — add a dedicated primary connection
'pgsql_primary' => [
    'driver' => 'pgsql',
    'host' => env('DB_HOST', '10.0.1.10'),
    'port' => env('DB_PORT', '5432'),
    'database' => env('DB_DATABASE', 'myapp'),
    'username' => env('DB_USERNAME', 'myapp'),
    'password' => env('DB_PASSWORD', ''),
    'charset' => 'utf8',
    'prefix' => '',
    'schema' => 'public',
],
Then in a controller action that does a write-then-read:
public function updateContact(UpdateContactRequest $request, Patient $patient): RedirectResponse
{
    // Write goes to primary via normal connection
    $patient->update([
        'phone' => $request->phone,
        'email' => $request->email,
    ]);

    // Immediately read back from the primary, not a replica
    $updated = Patient::on('pgsql_primary')->find($patient->id);

    return redirect()
        ->route('patient.profile', $patient)
        ->with('contact', $updated);
}
It's explicit. Anyone reading the code can see that we made a deliberate choice to read from primary here. It's not magic.
Option 4: Pass the data through, don't re-read it.
Honestly, in many cases the cleanest answer is: don't query again at all. You already have the data from the request. You know what you wrote. Just use it.
public function updateContact(UpdateContactRequest $request, Patient $patient): RedirectResponse
{
    $patient->update([
        'phone' => $request->phone,
        'email' => $request->email,
    ]);

    // Don't re-fetch. Flash the updated values directly.
    return redirect()
        ->route('patient.profile', $patient)
        ->with('success', 'Contact information updated.')
        ->with('phone', $request->phone)
        ->with('email', $request->email);
}
For simple confirmation screens this is the right call. It's faster too. You already have the data.
The Gotcha Nobody Talks About: Aurora's Lag Isn't Consistent
I run several clients on Aurora Postgres. Aurora is great — managed failover, fast storage, easy scaling. But Aurora's replication works differently from standard Postgres streaming replication. Aurora uses a shared storage layer with a separate redo log application process on readers. Under normal conditions Aurora reader lag is very low. Under load, or right after a writer instance does something expensive, I've seen reader lag spike to 2-3 seconds.
You can actually query it:
-- On an Aurora reader instance
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replica_lag_seconds;
I've wired that into a health-check endpoint for a couple of clients just to have visibility. If lag is above a threshold, I know something's wrong before users start complaining.
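The endpoint is essentially that query with a threshold around it. A sketch — the route path, 5-second threshold, and response shape are my choices, not a convention (and note that pg_last_xact_replay_timestamp() returns NULL when the query lands on a primary, which the null-coalesce handles):

```php
<?php
// routes/web.php — illustrative replica-lag health check

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Route;

Route::get('/health/replica-lag', function () {
    // Runs on the default (read) connection, i.e. a replica.
    $row = DB::selectOne(
        'SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag'
    );

    // NULL means no replay in progress (e.g. we hit the primary).
    $lag = (float) ($row->lag ?? 0.0);

    return response()->json(
        ['replica_lag_seconds' => $lag],
        $lag > 5.0 ? 503 : 200  // threshold is an example; tune per app
    );
});
```

Pointing your uptime monitor at this turns "a replica fell behind" into an alert instead of a support ticket.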
On standard RDS Postgres you can watch ReplicaLag in CloudWatch. But knowing the lag exists and knowing which application code paths are affected by it are two different problems.
When I'd Reach for Explicit Primary Reads
- Any read that immediately follows a write in the same request cycle.
- Financial totals or inventory counts shown right after a transaction.
- Auth flows — if a user creates an account and their next request checks if the account exists, you need the primary.
- Admin actions where a staff member modifies a record and the next page shows that record.
When I Wouldn't Bother
- Background jobs that run on a delay. By the time a queue worker picks up a job, replication has almost certainly caught up.
- List views, dashboards, analytics. Slightly stale data is fine. Users don't know the difference and you're not showing them their own just-submitted data.
- Read-heavy APIs that don't follow a write. Let the replicas do their job.
The Real Lesson
The real lesson isn't "replication lag is a problem" — I already knew that. The real lesson is that optimistic assumptions about infrastructure behavior compound quietly. I split read/write connections for performance, felt good about it, and didn't go back through every write-then-read code path to ask "is this one of the cases where lag matters?"
Now it's part of my code review checklist. Any time I see a write followed by a redirect followed by a query, I ask whether that query could be hitting a replica. Usually it can. And usually it matters.
Replication lag is one of those things where the default behavior is correct 95% of the time, and the 5% will confuse your users and make you feel like an idiot at 9pm. Know your write paths. Be explicit where it counts.
Need help shipping something like this? Get in touch.