
Claude Prompt Caching: When It Pays and When It Doesn't

Anthropic's prompt caching sounds like free money. It mostly is — but the billing model has a trap that'll catch you if you're not paying attention.

Prompt caching in the Anthropic Claude API looked, at first glance, like a pure win. Cheaper reads, faster responses, same output quality. Then I looked at the write pricing and had to recalculate a few assumptions. It still pays for itself in the right situations — but "the right situations" is doing a lot of work in that sentence.

What Prompt Caching Actually Does

The short version: you mark part of your prompt with a cache_control breakpoint, Anthropic stores that chunk server-side for five minutes, and subsequent requests that hit the same cached prefix pay a reduced read rate instead of the full input token rate.

For Claude's Sonnet models (including the claude-sonnet-4-5 used in the code further down), as of this writing, that's roughly:

  • Normal input tokens: $3.00 / 1M tokens
  • Cache write tokens: $3.75 / 1M tokens (25% premium)
  • Cache read tokens: $0.30 / 1M tokens (90% discount)

So writing to cache costs more than a normal request. Reading from cache costs much less. The math only works if you read from that cache enough times within the five-minute TTL to amortize the write premium.

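Break-even is straightforward. If a write costs 1.25x normal and a read costs 0.1x normal, the question is how many reads it takes before total spend dips below what you'd pay without caching at all. One write plus n reads costs 1.25 + 0.1n in units of a normal request, against 1 + n uncached, so caching comes out ahead as soon as 0.25 < 0.9n. That holds from the very first read. One hit inside the TTL and you're already ahead; zero hits and you've overpaid by 25%.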

For a 10,000-token system prompt sent 10 times in five minutes:

  • Without caching: 10 × 10,000 × $0.000003 = $0.30
  • With caching: 1 write × 10,000 × $0.000003750 + 9 reads × 10,000 × $0.000000300 = $0.0375 + $0.027 = $0.0645

That's a real saving. But if those 10 requests happen over two hours instead of five minutes, the cache expires between them and you're paying the write premium repeatedly with no read benefit.
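The arithmetic above is easy to script so you can plug in your own numbers. This is a minimal sketch, assuming the Sonnet rates quoted earlier; the function names are made up for illustration, and the "all misses" case models the two-hour scenario where every request pays the write rate.

```php
<?php

// USD per token, from the per-million rates quoted above.
const INPUT_RATE       = 3.00 / 1000000;
const CACHE_WRITE_RATE = 3.75 / 1000000;
const CACHE_READ_RATE  = 0.30 / 1000000;

/** Cost of sending the same prefix $requests times with no caching. */
function costWithoutCache(int $prefixTokens, int $requests): float
{
    return $requests * $prefixTokens * INPUT_RATE;
}

/** Best case: the first request writes the cache, every later one reads it. */
function costWithCache(int $prefixTokens, int $requests): float
{
    if ($requests < 1) {
        return 0.0;
    }
    $write = $prefixTokens * CACHE_WRITE_RATE;
    $reads = ($requests - 1) * $prefixTokens * CACHE_READ_RATE;
    return $write + $reads;
}

/** Worst case: the cache expires every time, so every request is a write. */
function costAllMisses(int $prefixTokens, int $requests): float
{
    return $requests * $prefixTokens * CACHE_WRITE_RATE;
}

printf("Without caching: \$%.4f\n", costWithoutCache(10000, 10)); // $0.3000
printf("All cache hits:  \$%.4f\n", costWithCache(10000, 10));    // $0.0645
printf("All cache misses: \$%.4f\n", costAllMisses(10000, 10));   // $0.3750
```

The last line is the trap: with caching switched on and no hits, the same ten requests cost more than not caching at all.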

The Code

Here's how I'm setting this up in Laravel. I'm calling the Anthropic REST API directly through Laravel's HTTP client rather than going through an SDK, since there was no official first-party PHP client when I wrote this. It's just a thin wrapper around their REST API.

<?php

namespace App\Services;

use Illuminate\Support\Facades\Http;

class ClaudeService
{
    private string $apiKey;
    private string $baseUrl = 'https://api.anthropic.com/v1';
    private string $model = 'claude-sonnet-4-5';

    public function __construct()
    {
        $this->apiKey = config('services.anthropic.key');
    }

    public function chatWithCachedSystem(string $systemPrompt, array $messages): array
    {
        $response = Http::withHeaders([
            'x-api-key'         => $this->apiKey,
            'anthropic-version' => '2023-06-01',
            'anthropic-beta'    => 'prompt-caching-2024-07-31',
        ])->post("{$this->baseUrl}/messages", [
            'model'      => $this->model,
            'max_tokens' => 1024,
            'system'     => [
                [
                    'type' => 'text',
                    'text' => $systemPrompt,
                    'cache_control' => ['type' => 'ephemeral'],
                ],
            ],
            'messages' => $messages,
        ]);

        if ($response->failed()) {
            throw new \RuntimeException(
                'Claude API error: ' . $response->body()
            );
        }

        $data = $response->json();

        // Log cache performance so you can actually verify it's working
        $usage = $data['usage'] ?? [];
        \Log::debug('Claude cache usage', [
            'input_tokens'               => $usage['input_tokens'] ?? 0,
            'cache_creation_input_tokens' => $usage['cache_creation_input_tokens'] ?? 0,
            'cache_read_input_tokens'    => $usage['cache_read_input_tokens'] ?? 0,
            'output_tokens'              => $usage['output_tokens'] ?? 0,
        ]);

        return $data;
    }
}

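The anthropic-beta: prompt-caching-2024-07-31 header was required while prompt caching was in beta. Back then, leaving it out meant your cache_control blocks were silently ignored: no error, no warning, just full-price tokens every time. That one got me for longer than I'd like to admit. Prompt caching has since become generally available, so the header should no longer be necessary on current models, but confirm against the current API docs and your own usage logs before relying on that.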

The usage object in the response is your ground truth. Watch cache_creation_input_tokens vs cache_read_input_tokens. If you're only ever seeing creation tokens, either the cache is expiring between requests or the cached prefix is changing from one request to the next. Either way, you're paying the premium for nothing.
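If you'd rather track a hit rate than eyeball debug logs, something like the following works. It's a rough sketch: classifyCacheUsage and cacheHitRate are hypothetical helper names, the sample numbers are invented, and a request that both reads and writes is counted as a read here.

```php
<?php

/**
 * Classify one response's usage block as 'read', 'write', or 'none'.
 * The field names are the ones ClaudeService logs above.
 */
function classifyCacheUsage(array $usage): string
{
    $read    = (int) ($usage['cache_read_input_tokens'] ?? 0);
    $written = (int) ($usage['cache_creation_input_tokens'] ?? 0);

    if ($read > 0) {
        return 'read';   // at least part of the prompt came from cache
    }
    if ($written > 0) {
        return 'write';  // paid the write premium, nothing was read
    }
    return 'none';       // caching did not apply to this request
}

/** Fraction of requests that read from cache. */
function cacheHitRate(array $usages): float
{
    if ($usages === []) {
        return 0.0;
    }
    $hits = 0;
    foreach ($usages as $usage) {
        if (classifyCacheUsage($usage) === 'read') {
            $hits++;
        }
    }
    return $hits / count($usages);
}

// Invented sample data: one write, two reads, one request with no caching.
$samples = [
    ['input_tokens' => 40, 'cache_creation_input_tokens' => 12000],
    ['input_tokens' => 35, 'cache_read_input_tokens' => 12000],
    ['input_tokens' => 52, 'cache_read_input_tokens' => 12000],
    ['input_tokens' => 300],
];

printf("Hit rate: %.2f\n", cacheHitRate($samples)); // Hit rate: 0.50
```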

For a document analysis tool I built for a local biotech — they needed to run multiple queries against the same 40-page SOP documents — the usage log told the whole story. Early on, requests were spaced 8-10 minutes apart. Every single one was a cache write. I moved their workflow to batch the queries together and suddenly 80% of requests were cache reads. Monthly API spend dropped noticeably.
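The batching fix boils down to reordering work so queries against the same document run back-to-back. Here is a minimal sketch of that idea, assuming each queued job is an array with 'document' and 'question' keys. That shape and the function name are illustrative, not the actual client code.

```php
<?php

/**
 * Reorder jobs so that all questions about the same document are adjacent.
 * Documents keep the order in which they first appear; questions keep
 * their original order within each document.
 */
function groupJobsByDocument(array $jobs): array
{
    $groups = [];
    foreach ($jobs as $job) {
        // Hash the document text so the grouping key stays small.
        $key = hash('sha256', $job['document']);
        $groups[$key][] = $job;
    }

    $ordered = [];
    foreach ($groups as $group) {
        foreach ($group as $job) {
            $ordered[] = $job;
        }
    }
    return $ordered;
}

$queue = [
    ['document' => 'SOP A text', 'question' => 'q1'],
    ['document' => 'SOP B text', 'question' => 'q2'],
    ['document' => 'SOP A text', 'question' => 'q3'],
];

echo implode(', ', array_column(groupJobsByDocument($queue), 'question')), "\n";
// q1, q3, q2
```

With the queue ordered this way, the first question per document pays the write and the rest have a chance of landing inside the TTL.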

The Gotchas That Bit Me

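The five-minute TTL is short, though it isn't fixed at write time. Per Anthropic's documentation, the timer is refreshed each time the cached content is read, so steady traffic keeps a cache alive. What kills it is a gap. If your system prompt is 20,000 tokens and your user takes six minutes to type a follow-up message, with nothing else hitting that prefix in between, the cache is gone. You're writing again.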

Cache keys are based on exact content, position, and model. Change one character in your cached block, or switch models, and you get a new write. This matters if you're doing any dynamic interpolation in your system prompt. I had a client portal that was injecting the current date into the system prompt, a very common pattern, and the cache was useless because the prompt changed whenever the date string did. Moved the date to the user message instead, problem solved.
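The date fix looks roughly like this. It's a sketch only: buildRequestFields is a hypothetical helper that returns just the system and messages fields, and the prompt text and dates are placeholders.

```php
<?php

/**
 * Keep the cached system block byte-for-byte stable and push per-request
 * values (here, the date) into the user message, after the breakpoint.
 */
function buildRequestFields(string $staticSystemPrompt, string $question, string $today): array
{
    return [
        'system' => [[
            'type'          => 'text',
            'text'          => $staticSystemPrompt, // no interpolation here
            'cache_control' => ['type' => 'ephemeral'],
        ]],
        'messages' => [[
            'role'    => 'user',
            'content' => "Today's date: {$today}\n\n{$question}",
        ]],
    ];
}

$prompt  = 'You are the support assistant for the client portal.';
$monday  = buildRequestFields($prompt, 'What is due this week?', '2025-01-06');
$tuesday = buildRequestFields($prompt, 'What is due this week?', '2025-01-07');

// The cacheable part is identical across days; only the user message differs.
var_dump($monday['system'] === $tuesday['system']);     // bool(true)
var_dump($monday['messages'] === $tuesday['messages']); // bool(false)
```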

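The minimum cacheable size is 1,024 tokens on the Sonnet models; some other models, such as the Haiku line, have a higher minimum, so check the docs for the one you're using. Trying to cache a shorter block won't error. It'll just never cache. You won't know this unless you check the usage fields and notice cache_creation_input_tokens is always zero.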

Multiple cache breakpoints work, but they nest. You can mark up to four cache_control points. Each one caches everything from the start of the prompt up to that breakpoint. So if you have a 10,000-token document followed by a 500-token instruction block and you put breakpoints after each, the second breakpoint's cache includes the full 10,500 tokens, not just 500. Plan accordingly.
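To make the nesting concrete, here is a small sketch that takes the token count of each block ending in a breakpoint and returns how much each breakpoint actually caches. The function name is made up, and the token counts are assumed to come from elsewhere (the usage fields, for example), since this code does not tokenize anything.

```php
<?php

/**
 * Given per-block token counts, where each block ends in a cache_control
 * breakpoint, return the size of the prefix each breakpoint covers.
 */
function cachedPrefixSizes(array $blockTokenCounts): array
{
    if (count($blockTokenCounts) > 4) {
        throw new InvalidArgumentException('At most four cache breakpoints are allowed.');
    }

    $sizes   = [];
    $running = 0;
    foreach ($blockTokenCounts as $tokens) {
        $running += $tokens;  // each breakpoint covers everything before it
        $sizes[]  = $running;
    }
    return $sizes;
}

// 10,000-token document, then a 500-token instruction block.
echo implode(', ', cachedPrefixSizes([10000, 500])), "\n"; // 10000, 10500
```

One consequence worth noticing: if the instruction block changes, the first breakpoint still hits and only the second is rewritten. Put the most stable content first.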

Session continuity in multi-turn conversations requires care. If you're caching conversation history, the entire message array prefix up to the breakpoint has to be identical. If you summarize or truncate old messages, you've busted the cache. I handle this by keeping a raw unsummarized tail of the last N turns specifically for cache stability, and summarizing older turns into the system prompt block (which is separately cached).
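One way to cache conversation history is to treat the message array as append-only and put the breakpoint on the most recent message. The sketch below assumes raw messages with string content and relies on the API accepting content as an array of blocks; markLastMessageForCaching is a hypothetical helper, not part of any SDK.

```php
<?php

/**
 * Return a copy of $messages with a cache breakpoint on the last content
 * block of the last message. Earlier messages are left untouched so the
 * prefix stays identical from one turn to the next.
 */
function markLastMessageForCaching(array $messages): array
{
    if ($messages === []) {
        return $messages;
    }

    $last    = array_key_last($messages);
    $content = $messages[$last]['content'];

    // Plain-string content has nowhere to hang cache_control, so wrap it
    // in the block form first.
    if (is_string($content)) {
        $content = [['type' => 'text', 'text' => $content]];
    }

    $lastBlock = array_key_last($content);
    $content[$lastBlock]['cache_control'] = ['type' => 'ephemeral'];

    $messages[$last]['content'] = $content;
    return $messages;
}

$history = [
    ['role' => 'user',      'content' => 'Summarize section 2.'],
    ['role' => 'assistant', 'content' => 'Section 2 covers intake.'],
    ['role' => 'user',      'content' => 'And section 3?'],
];

$payloadMessages = markLastMessageForCaching($history);
```

Keep $history itself free of breakpoints and apply the helper only when building each request. That way old turns never carry stale cache_control markers that would change the prefix.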

When I'd Reach for This

Prompt caching is worth the setup work when you have a large, stable, reused context and multiple requests hitting it within a short window.

The sweet spots I've found:

  • Document Q&A. User uploads a contract or report, asks several questions about it. Cache the document, let the questions fly. Works extremely well.
  • RAG with retrieved context. If you're retrieving the same chunks repeatedly (common when users explore a topic), caching the retrieved documents cuts costs fast.
  • Multi-turn chat with long system prompts. Detailed personas, product catalogs, rule sets. Cache the system prompt, treat it as static.
  • Batch processing pipelines. If you're running a queue that processes similar documents back-to-back, structure it so the shared instruction block is cached and only the document content varies per request.

When I wouldn't bother:

  • Low-volume or infrequent requests. If users come back every few hours, you're just paying the write premium repeatedly.
  • Short system prompts under 1,024 tokens. Caching doesn't apply, full stop.
  • Highly dynamic prompts. Anything with per-request interpolation in the cached section kills the benefit.
  • One-shot scripts and batch jobs that run once. The write happens, nothing reads it, you paid extra for nothing.

The Broader Picture

I've become a bit allergic to LLM features that sound like pure wins. Prompt caching is genuinely useful — but it requires you to think about your access patterns the same way you'd think about database query caching. The TTL is short, the write premium is real, and the cache key is unforgiving about exact matches.

The usage fields in the response are non-negotiable to log. You cannot verify this is working without them, and "I think it's caching" is not the same as seeing cache_read_input_tokens: 18432 in your logs.

When the conditions are right, prompt caching is one of the more impactful cost levers I've found in the LLM space. It just has to earn its keep.
