OpenAI Embeddings: Why Cosine Similarity Alone Fails in Production Search

Cosine similarity is the first thing every embedding tutorial teaches you, and it works beautifully in the notebook demo. Then you ship it to a real user with a real corpus and you start getting complaints: results that are semantically "close" but completely wrong for the actual query. I've been burned by this twice now — once on a document retrieval system for a biotech client, once building internal search for a print management platform — and both times the fix was the same: cosine similarity is a starting point, not a finish line.

What Embeddings Actually Give You

OpenAI's text-embedding-3-small and text-embedding-3-large models convert text into high-dimensional vectors (1536 and 3072 floats by default, respectively) that encode semantic meaning. Texts that mean similar things end up geometrically close. That's genuinely useful.

Cosine similarity measures the angle between two vectors. It's cheap to compute, it's normalized (always -1 to 1), and it doesn't care about vector magnitude. For pure semantic closeness it's fine. The problem is that production search is never just about semantic closeness. It's about relevance, which is a different thing entirely.
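
To make "the angle between two vectors" concrete, here is a minimal sketch of the calculation in plain PHP. The function name cosineSimilarity is mine, not part of any library, and in production pgvector does this work inside Postgres rather than in application code.

```php
<?php

/**
 * Cosine similarity between two equal-length float vectors.
 * Returns a value in [-1, 1]; throws for mismatched or zero vectors.
 */
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b) || count($a) === 0) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA === 0.0 || $normB === 0.0) {
        throw new InvalidArgumentException('Cosine similarity is undefined for a zero vector.');
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Doubling every component leaves the angle unchanged, so the score stays
// at (approximately) 1.0: magnitude is ignored.
$sameDirection = cosineSimilarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]);

// Perpendicular vectors score 0, opposite vectors score -1.
$perpendicular = cosineSimilarity([1.0, 0.0], [0.0, 1.0]);
$opposite      = cosineSimilarity([1.0, 0.0], [-1.0, 0.0]);
```

Nothing in that function knows what the vectors represent, which is exactly the limitation the rest of this post is about.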

A document about "patient medication dosage" and a document about "drug overdose prevention" can have very high cosine similarity. Whether either one is the right result for the query "what's the max dose for metformin" depends on context, recency, source authority, and a dozen other signals that the cosine score knows nothing about.

The Setup I Actually Use

I store embeddings in Postgres via pgvector. For most of my clients that's already the primary database, so there's no additional infrastructure to sell them on.

Here's the migration:

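// database/migrations/2024_01_15_000000_add_embedding_to_documents.php
// Needs: use Illuminate\Database\Schema\Blueprint;
//        use Illuminate\Support\Facades\DB;
//        use Illuminate\Support\Facades\Schema;
public function up(): void
{
    DB::statement('CREATE EXTENSION IF NOT EXISTS vector');

    Schema::table('documents', function (Blueprint $table) {
        $table->integer('token_count')->nullable();
        $table->timestamp('embedded_at')->nullable();
        $table->index('embedded_at');
    });

    // Added with raw SQL; 1536 matches text-embedding-3-small's default output size.
    DB::statement(
        'ALTER TABLE documents ADD COLUMN embedding vector(1536)'
    );

    // IVFFlat picks its cluster centroids when the index is built. At this point
    // no row has a vector yet, so REINDEX once the documents have been embedded.
    DB::statement(
        'CREATE INDEX documents_embedding_idx ON documents
         USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)'
    );
}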

And the embedding generation job:

// app/Jobs/EmbedDocument.php
// Assumes the openai-php client, whose concrete class is OpenAI\Client.
public function handle(\OpenAI\Client $openai): void
{
    $response = $openai->embeddings()->create([
        'model' => 'text-embedding-3-small',
        'input' => $this->document->embedding_text, // pre-cleaned field
    ]);

    // pgvector accepts a vector as a text literal of the form '[0.1,0.2,...]'
    $vector = $response->embeddings[0]->embedding;
    $literal = '[' . implode(',', $vector) . ']';

    DB::statement(
        'UPDATE documents SET embedding = ?, token_count = ?, embedded_at = NOW() WHERE id = ?',
        [$literal, $response->usage->totalTokens, $this->document->id]
    );
}

Nothing exotic yet. Here's where most people stop — they write the search query, order by cosine distance, done. That's the mistake.

What a Pure Cosine Query Looks Like (and Why It's Not Enough)

$queryEmbedding = $this->embed($query); // same API call as above
$literal = '[' . implode(',', $queryEmbedding) . ']';

// This is the demo version. Don't ship this alone.
$results = DB::select("
    SELECT id, title, 1 - (embedding <=> ?) AS score
    FROM documents
    ORDER BY embedding <=> ?
    LIMIT 20
", [$literal, $literal]);

This will return the 20 vectors closest in angle to your query vector. That's it. No awareness of document age, source quality, exact keyword matches, user context, or whether the document was marked deprecated six months ago.

The Actual Production Query

What I ship now is a hybrid re-ranking approach. Rough steps: cast a wide net with vector search, then score and re-rank in PHP before returning results to the user.

public function search(string $query, int $userId): array
{
    $queryEmbedding = $this->embed($query);
    $literal = '[' . implode(',', $queryEmbedding) . ']';

    // Step 1: Pull a wider candidate set (top 60, not top 10)
    $candidates = DB::select("
        SELECT
            d.id,
            d.title,
            d.body_snippet,
            d.source_tier,       -- 1=authoritative, 2=contributed, 3=auto-generated
            d.published_at,
            d.is_deprecated,
            1 - (d.embedding <=> :vec) AS cosine_score,
            ts_rank(
                to_tsvector('english', d.title || ' ' || d.body_text),
                plainto_tsquery('english', :q)
            ) AS keyword_rank
        FROM documents d
        WHERE d.embedded_at IS NOT NULL
          AND d.is_deleted = false
        ORDER BY d.embedding <=> :vec2
        LIMIT 60
    ", ['vec' => $literal, 'q' => $query, 'vec2' => $literal]);

    // Step 2: Re-rank in PHP with a composite score
    $now = now();

    foreach ($candidates as &$row) {
        $recencyScore = $this->recencyScore($row->published_at, $now);
        $tierBoost    = match((int)$row->source_tier) {
            1 => 1.20,
            2 => 1.00,
            default => 0.75,
        };
        $deprecationPenalty = $row->is_deprecated ? 0.40 : 1.00;

        $row->final_score =
            ($row->cosine_score   * 0.55)
            + ($row->keyword_rank * 0.25)
            + ($recencyScore      * 0.20);

        $row->final_score *= $tierBoost * $deprecationPenalty;
    }
    unset($row);

    usort($candidates, fn($a, $b) => $b->final_score <=> $a->final_score);

    return array_slice($candidates, 0, 10);
}

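private function recencyScore(?string $publishedAt, \Carbon\Carbon $now): float
{
    if ($publishedAt === null) {
        return 0.0; // undated documents get no recency credit
    }
    // Clamp at zero: Carbon 3 returns signed differences, and a future-dated
    // document should not score above 1.0.
    $days = max(0, \Carbon\Carbon::parse($publishedAt)->diffInDays($now));
    // Exponential decay with a half-life of about 180 days (ln 2 / 180 ≈ 0.00385)
    return exp(-0.00385 * $days);
}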
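
The 0.00385 in that decay is not arbitrary: a half-life of 180 days means exp(-lambda * 180) = 0.5, so lambda = ln(2) / 180. If you want to experiment with other half-lives, something like the following sketch avoids hard-coding the constant. The names decayConstant and recencyFromAge are hypothetical helpers, not part of the search class above.

```php
<?php

/**
 * Decay constant for an exponential decay with the given half-life.
 * Solves exp(-lambda * halfLife) = 0.5, so lambda = ln(2) / halfLife.
 */
function decayConstant(float $halfLifeDays): float
{
    if ($halfLifeDays <= 0) {
        throw new InvalidArgumentException('Half-life must be positive.');
    }

    return M_LN2 / $halfLifeDays;
}

/** Recency score in (0, 1]; negative ages (future dates) are treated as 0 days. */
function recencyFromAge(float $ageDays, float $halfLifeDays = 180.0): float
{
    return exp(-decayConstant($halfLifeDays) * max(0.0, $ageDays));
}

$lambda = decayConstant(180.0);   // ln(2) / 180, roughly 0.00385
$fresh  = recencyFromAge(0.0);    // 1.0: published today
$halved = recencyFromAge(180.0);  // about 0.5: one half-life
$old    = recencyFromAge(360.0);  // about 0.25: two half-lives
```

A shorter half-life suits fast-moving content like release notes; a longer one suits reference material that ages slowly.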

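The weights (0.55, 0.25, 0.20) are not magic numbers: I tuned them against a small labeled set of "good" and "bad" results for that specific client. Your corpus will need different weights. That's the point: you have to tune. One thing to watch while tuning: cosine_score and the recency score both sit on a scale of roughly 0 to 1, but ts_rank is not normalized that way and is usually much smaller, so check its actual range on your corpus (or rescale it) before you read the 0.25 as a true 25% share.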
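
For a sense of what tuning against a labeled set can look like, here is one possible approach, a brute-force grid search, sketched in plain PHP. It is an illustration, not the exact procedure I used: the metric (how often a "good" result outscores a "bad" one), the function names, and the four sample rows are all assumptions made up for the example.

```php
<?php

/** Weighted sum of the three signals; mirrors the additive part of final_score. */
function compositeScore(array $row, array $w): float
{
    return $row['cosine'] * $w[0] + $row['keyword'] * $w[1] + $row['recency'] * $w[2];
}

/** Fraction of (good, bad) pairs where the good result outscores the bad one. */
function pairwiseAccuracy(array $labeled, array $w): float
{
    $good = array_filter($labeled, fn ($r) => $r['good']);
    $bad  = array_filter($labeled, fn ($r) => !$r['good']);
    $pairs = 0;
    $wins = 0;

    foreach ($good as $g) {
        foreach ($bad as $b) {
            $pairs++;
            if (compositeScore($g, $w) > compositeScore($b, $w)) {
                $wins++;
            }
        }
    }

    return $pairs === 0 ? 0.0 : $wins / $pairs;
}

/** Try every weight triple on a grid that sums to 1 and keep the best one. */
function tuneWeights(array $labeled, int $steps = 20): array
{
    $best = ['weights' => [1.0, 0.0, 0.0], 'accuracy' => -1.0];

    for ($i = 0; $i <= $steps; $i++) {
        for ($j = 0; $j <= $steps - $i; $j++) {
            $w = [$i / $steps, $j / $steps, ($steps - $i - $j) / $steps];
            $accuracy = pairwiseAccuracy($labeled, $w);
            if ($accuracy > $best['accuracy']) {
                $best = ['weights' => $w, 'accuracy' => $accuracy];
            }
        }
    }

    return $best;
}

// Made-up rows for illustration only: two results a reviewer marked good,
// two marked bad. Signals are assumed to be pre-scaled to 0..1.
$labeled = [
    ['cosine' => 0.80, 'keyword' => 0.90, 'recency' => 0.5, 'good' => true],
    ['cosine' => 0.70, 'keyword' => 0.80, 'recency' => 0.9, 'good' => true],
    ['cosine' => 0.85, 'keyword' => 0.10, 'recency' => 0.2, 'good' => false],
    ['cosine' => 0.90, 'keyword' => 0.05, 'recency' => 0.1, 'good' => false],
];

$cosineOnly = pairwiseAccuracy($labeled, [1.0, 0.0, 0.0]); // cosine alone ranks every bad row first
$best = tuneWeights($labeled);
```

On a set this small many weight triples tie, so treat the winner as a starting point and keep a held-out set of labeled queries to check that the weights generalize.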

Gotchas That Will Bite You

The IVFFlat index lies to you at small corpus sizes. pgvector's documentation suggests lists = rows / 1000 for tables up to about a million rows, so lists = 100 is sized for a corpus of roughly 100,000 rows, and in my experience you need at least 10,000 rows before the approximate index actually beats a sequential scan. Below that, the Postgres planner will often just choose a seq scan, and I've seen projects where nobody noticed because the query was still fast on 500 documents. Then the corpus hit 50k rows and latency spiked. Run EXPLAIN ANALYZE on the search query during load testing and confirm the plan actually uses the index.
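
One way to automate that check is to scan the plan text for the index name. The sketch below is a deliberately simple string match; vectorScanType is a hypothetical helper, and the Laravel wiring in the trailing comment assumes each EXPLAIN row exposes its text under the "QUERY PLAN" column.

```php
<?php

/**
 * Inspect EXPLAIN output lines and report how the vector search was executed.
 * Returns 'index' if the named index was scanned, 'seq_scan' if the plan
 * contains a sequential scan instead, or 'unknown' otherwise.
 */
function vectorScanType(array $planLines, string $indexName): string
{
    foreach ($planLines as $line) {
        if (str_contains($line, 'Index Scan using ' . $indexName)) {
            return 'index';
        }
    }

    foreach ($planLines as $line) {
        if (str_contains($line, 'Seq Scan on')) {
            return 'seq_scan';
        }
    }

    return 'unknown';
}

// Hypothetical wiring inside a load test:
// $rows  = DB::select('EXPLAIN ' . $searchSql, $bindings);
// $lines = array_map(fn ($r) => $r->{'QUERY PLAN'}, $rows);
// if (vectorScanType($lines, 'documents_embedding_idx') !== 'index') {
//     // log it, fail the test, or page someone
// }
```

It won't tell you whether recall is good, only whether the index is in play, but that alone would have caught the 50k-row surprise.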

Embedding drift is real. OpenAI has versioned their embedding models. text-embedding-ada-002 vectors are not compatible with text-embedding-3-small vectors. I keep a model_version column on every embedded row. If you ever switch models — or if OpenAI ever silently updates one — you need to re-embed everything. I learned this the hard way when I had a mixed corpus and couldn't figure out why similarity scores were completely nonsensical for about 15% of results.
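
A sketch of how that model_version column can be put to work: given rows pulled from the documents table, pick out the ones whose vectors came from a different model than the one currently in use. The constant and function names are hypothetical, and the rows are shown as plain arrays to keep the example self-contained.

```php
<?php

const CURRENT_EMBEDDING_MODEL = 'text-embedding-3-small';

/**
 * Return the ids of rows whose stored vectors were produced by a different
 * model (or have no recorded model) and therefore need re-embedding.
 */
function idsNeedingReembed(array $rows, string $currentModel = CURRENT_EMBEDDING_MODEL): array
{
    $stale = array_filter(
        $rows,
        fn (array $row) => ($row['model_version'] ?? null) !== $currentModel
    );

    return array_values(array_column($stale, 'id'));
}

$rows = [
    ['id' => 1, 'model_version' => 'text-embedding-3-small'],
    ['id' => 2, 'model_version' => 'text-embedding-ada-002'],
    ['id' => 3, 'model_version' => null],
];

$stale = idsNeedingReembed($rows); // ids 2 and 3 need a fresh embedding
```

The same column also belongs in the search query's WHERE clause, so that vectors from an old model are never compared against a query vector from the new one while a re-embed is still in progress.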

Short documents are poison. A two-sentence document embeds just fine, but it'll surface as a top result for almost anything tangentially related to its topic because there's so little semantic signal to discriminate against. I now enforce a minimum of ~150 tokens before embedding, and I repeat the document's title twice at the start of the embedding_text field to give the model a stronger anchor.
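
Roughly, the rule looks like the sketch below. Two assumptions to flag: the token count is a crude estimate of about four characters per token (a real tokenizer would be more accurate), and the minimum is applied to the body alone. Both function names are hypothetical.

```php
<?php

/** Very rough token estimate: about 4 characters per token for English text. */
function estimateTokens(string $text): int
{
    return (int) ceil(mb_strlen(trim($text)) / 4);
}

/**
 * Build the text that gets embedded: the title twice, then the body.
 * Returns null when the body is too short to be worth embedding.
 */
function buildEmbeddingText(string $title, string $body, int $minTokens = 150): ?string
{
    if (estimateTokens($body) < $minTokens) {
        return null;
    }

    $title = trim($title);

    return $title . "\n" . $title . "\n\n" . trim($body);
}

// A two-sentence body falls under the threshold and is skipped.
$skipped = buildEmbeddingText('Metformin dosing', 'Take with food. See label.'); // null
```

Documents that come back null still need to be findable somehow; in a hybrid setup the keyword path can cover them.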

Rate limits hit you at re-embed time, not initial embed time. When you need to re-embed 40,000 documents because you switched models, you're making 40,000 API calls. text-embedding-3-small is cheap ($0.02/million tokens) but the rate limits are real. I batch documents in groups of 100 (the API accepts an array), which cuts call count by 100x.

// Batch embedding — do this, not one-at-a-time
$inputs = $documents->pluck('embedding_text')->toArray();
$response = $openai->embeddings()->create([
    'model' => 'text-embedding-3-small',
    'input' => $inputs, // up to 2048 inputs per call
]);
// $response->embeddings is ordered to match $inputs
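
For a full re-embed you also need to chunk the corpus and keep each vector attached to the right document. Here is a minimal sketch of that loop, with the API call abstracted behind a callable so the chunking logic can be tested without network access. embedInBatches is a hypothetical helper; the callable in the trailing comment reuses the same client call shown above.

```php
<?php

/**
 * Embed texts in batches. $embedBatch is any callable that takes a list of
 * strings and returns one vector per string, in the same order.
 * Returns vectors keyed by the original keys of $texts (for example document ids).
 */
function embedInBatches(array $texts, callable $embedBatch, int $batchSize = 100): array
{
    $vectors = [];

    // preserve_keys = true keeps each document id attached to its text
    foreach (array_chunk($texts, $batchSize, true) as $chunk) {
        $ids = array_keys($chunk);
        $batchVectors = $embedBatch(array_values($chunk));

        if (count($batchVectors) !== count($ids)) {
            throw new RuntimeException('Embedding count does not match input count.');
        }

        foreach ($ids as $position => $id) {
            $vectors[$id] = $batchVectors[$position];
        }
    }

    return $vectors;
}

// Hypothetical wiring with the client used earlier in this post:
// $embedBatch = fn (array $inputs) => array_map(
//     fn ($item) => $item->embedding,
//     $openai->embeddings()->create([
//         'model' => 'text-embedding-3-small',
//         'input' => $inputs,
//     ])->embeddings
// );
// $vectors = embedInBatches($documents->pluck('embedding_text', 'id')->toArray(), $embedBatch);
```

The count check is cheap insurance: if a batch ever comes back short, you want a loud failure rather than vectors silently written to the wrong rows.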

Cosine similarity does not penalize off-topic documents that happen to share vocabulary. A document about "Python (the snake)" and a query about "Python (the language)" can score well because the word co-occurs in both training contexts. This is where the keyword rank signal in the hybrid approach earns its 25% weight — full-text search is dumb but it's honestly dumb, which is a useful property.

When I'd Reach for This (and When I Wouldn't)

I'd use OpenAI embeddings plus hybrid re-ranking for any internal knowledge base search over a few thousand documents, product catalog search where descriptions are prose-heavy, or any domain where synonyms and paraphrasing matter and your users can't be expected to use exact keywords.

I would not use it — or at least wouldn't use it alone — for short-form lookup (SKU search, username search, exact-match queries). Those are Postgres full-text or just a LIKE. I also wouldn't use it if the corpus changes faster than you can re-embed; embeddings are a batch artifact, not a real-time one. And I wouldn't use it if your data is so sensitive that you can't send it to OpenAI's API. In that case, look at running a local model via Ollama or llama.cpp and eating the infrastructure cost.

For a healthcare client where PHI is involved, I run a self-hosted nomic-embed-text model and keep the vectors in a private Postgres instance. The quality is slightly lower than text-embedding-3-large, but it ships legally, which is the only quality metric that matters in that context.

The Bottom Line

Embeddings are a genuinely useful tool and OpenAI's API makes them embarrassingly easy to call. The trap is mistaking "easy to call" for "easy to deploy correctly." Cosine similarity is the floor, not the ceiling — ship it as your only ranking signal and you're handing your clients a search box that looks smart and occasionally returns nonsense. Add recency, authority, keyword overlap, and business-specific penalties, tune the weights against real feedback, and you've got something worth putting in front of users.
