log in
consulting hosting industries the daily tools about contact

Finding Your Real Ceiling Before Traffic Does

Reading MariaDB slow query log and PHP-FPM status in isolation gives you half the picture. Here's how I read them together to find the actual breaking point.

The server isn't slow yet, but it will be. That's the only thing I know for certain about any production system I've ever run. The question is whether you find the ceiling on your terms or whether a flash sale, a crawl spike, or a badly-timed cron job finds it for you at 11pm on a Friday.

I've done this wrong more times than I'd like to admit. I used to chase slow queries in isolation, tune indexes, feel good about it, and then get paged two months later when FPM ran out of workers and the query pool backed up anyway. Or I'd bump FPM workers, reduce contention, and never notice that two queries were still doing a full 4M-row scan on every page load. These two data sources answer different questions. Put them together and you get a real capacity number — not a guess.

What Each One Actually Tells You

The MariaDB slow query log tells you where the database is spending time. It records queries that exceed a threshold you set, along with execution time, rows examined, rows sent, and whether an index was used. It's a paper trail of your worst-case database behavior.

PHP-FPM's status endpoint tells you how many workers are busy right now, how many are idle, and how many requests are queued waiting for a worker. It's a real-time (or near-real-time) view of your application process pool.

The gap between them is where you find the actual ceiling. If FPM workers are maxing out but query times are fast, your bottleneck is CPU/concurrency — you're under-provisioned on workers or you have too much PHP work per request. If workers are idle but slow queries are piling up, your bottleneck is the database — more workers won't help and will probably hurt. If both are red at the same time, you have compounding problems and the fixes have to come in the right order.

Setting Up the Slow Query Log

This is the part people skip because it feels like overhead. The overhead is real but small, and completely worth it in production. In /etc/mysql/mariadb.conf.d/50-server.cnf (or wherever your MariaDB config lives):

[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5
log_queries_not_using_indexes = 1
min_examined_row_limit = 100

I set long_query_time to 0.5 seconds for most apps, not the default 10. Ten seconds is already a disaster. I want to see the queries that are trending toward disaster, not the ones that already arrived. The min_examined_row_limit = 100 keeps the log from filling up with trivial single-row lookups that just happen to skip an index on a tiny table.

Reload without a restart:

mysql -u root -p -e "SET GLOBAL slow_query_log = 1; SET GLOBAL long_query_time = 0.5;"

Then let it run for at least a full business day before you draw conclusions. An hour of log data will lie to you.

Reading the Log with pt-query-digest

pt-query-digest from Percona Toolkit is the right tool. Don't read the raw log — it's unreadable at any meaningful volume.

pt-query-digest /var/log/mysql/slow.log \
  --limit 20 \
  --output report \
  > /tmp/slow_digest.txt

What I'm looking for in the output:

  • Response time % — the top five queries by cumulative time, not by count. A query that runs 10,000 times at 50ms is worse than one that runs 5 times at 2 seconds.
  • Rows examined vs rows sent ratio — anything over 100:1 is a red flag. I've seen 40,000:1 in production. That's not a slow query, that's a crime.
  • EXPLAIN output if present — pt-query-digest can call EXPLAIN for you with --explain.

Here's a quick PHP snippet I use in a maintenance script to pull the top offenders via the information_schema.PROCESSLIST during a load test, since that gives me live in-flight data:

<?php

$pdo = new PDO(
    'mysql:host=127.0.0.1;dbname=information_schema',
    'monitor_user',
    getenv('DB_MONITOR_PASSWORD'),
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);

$stmt = $pdo->query("
    SELECT
        ID,
        USER,
        DB,
        COMMAND,
        TIME,
        STATE,
        LEFT(INFO, 120) AS query_preview
    FROM information_schema.PROCESSLIST
    WHERE COMMAND != 'Sleep'
      AND TIME > 1
    ORDER BY TIME DESC
    LIMIT 20
");

$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    printf(
        "[%ds] %s | %s | %s\n",
        $row['TIME'],
        $row['STATE'],
        $row['DB'],
        $row['query_preview']
    );
}

I'll run this in a loop during a load test with watch -n 2 php check_processlist.php. It's crude but it gives me a live view of what's actually in flight, not what was slow last Tuesday.

Exposing PHP-FPM Status

If you don't have FPM's status page wired up, do it now. In your FPM pool config (usually /etc/php/8.x/fpm/pool.d/www.conf):

pm.status_path = /fpm-status

Then in your Nginx config for localhost-only access:

location /fpm-status {
    access_log off;
    allow 127.0.0.1;
    deny all;
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}

Hit it with curl -s http://127.0.0.1/fpm-status?full and you get something like:

pool:                 www
process manager:      dynamic
start time:           ...
start since:          86400
accepted conn:        1842937
listen queue:         0
max listen queue:     14
listen queue len:     128
idle processes:       3
active processes:     9
total processes:      12
max active processes: 12
max children reached: 1

The number I care about most is listen queue. If that's ever above zero during normal operation, you've already run out of workers and requests are waiting. max listen queue is the high-water mark since FPM started — that's the one that'll surprise you. I've looked at systems where the owner had no idea the queue had peaked at 47 during last month's email blast.

max children reached being anything other than zero is also a warning. It means FPM hit its pm.max_children ceiling at least once and started refusing to spawn more.

Reading Them Together: The Real Ceiling

Here's the mental model I use. You have a pipeline:

HTTP request → FPM worker → PHP execution → DB query → response

Each stage can be the bottleneck. The question is which one becomes exhausted first under load.

I run a load test — I use wrk or k6, nothing fancy — and I watch both data sources simultaneously. One terminal running watch -n 1 'curl -s http://127.0.0.1/fpm-status', another tailing the slow query log or watching the processlist.

The scenarios I've hit on real client systems:

Scenario A — FPM-limited: Workers hit max_children, listen queue grows, DB is mostly idle. The fix is either more workers (if you have RAM and CPU headroom) or shorter PHP request time (cache more, do less per request). Adding DB indexes won't help here.

Scenario B — DB-limited: Workers are mostly idle, but the handful that are active are all blocked waiting on queries. DB thread count is climbing, slow log is lighting up. The fix is query optimization, indexing, or read replicas. Adding FPM workers will make this worse — more workers means more concurrent queries means more DB contention.

Scenario C — Both saturated: This is the bad one. It's usually masking a single root cause. I had a healthcare client last year where a reporting query was doing a 2M-row full scan on every dashboard load. That query took 8 seconds, held a worker for 8 seconds, and backed up the queue. FPM looked maxed, DB looked maxed, but fix the query and everything cleared. Don't throw resources at this until you've found the query.

The Numbers I Actually Track

For any app I'm capacity planning, I want to know:

  1. P95 query time from pt-query-digest (not average — average hides spikes)
  2. Peak max listen queue since last FPM restart
  3. max children reached count
  4. Worker saturation percentage = active / max_children at peak load
  5. Rows examined / rows sent ratio for top 10 queries by cumulative time

If P95 query time is under 100ms, worker saturation at peak is under 70%, listen queue is zero, and rows examined ratio is under 50:1, I feel reasonably good. None of those numbers are magic — they're starting points. The ratios matter more than the absolutes.

When This Matters Most

I do this analysis before any anticipated traffic event (launch, sale, press mention) and after any schema change or major feature release. I also do it when a client says "the site felt slow yesterday" — nine times out of ten, the slow query log has the answer, and FPM status tells me whether it was isolated or systemic.

I wouldn't bother with this level of rigor for an internal tool with 20 users. But if you're running anything with real traffic variability — e-commerce, anything with email-driven spikes, anything healthcare-adjacent where you can't have downtime — this is table stakes.

The slow query log and FPM status have been around forever. They're not glamorous. But I've never needed a fancy APM tool to diagnose a production outage when I had both of these configured and a baseline to compare against. Know your normal, and the abnormal finds you first.

Need help shipping something like this? Get in touch.