Whisper Transcription: API vs Self-Hosted vs Groq — Real Latency Benchmarks

I spent a few days last month wiring up audio transcription for a client — a telehealth platform that records short patient intake calls and needs the transcript available before the provider walks into the room. That constraint turned a simple "call the Whisper API" task into a real benchmark exercise. The numbers surprised me enough that I'm writing them down.

What Whisper Actually Does For You

Whisper is OpenAI's speech-to-text model. You hand it an audio file, it hands you back a transcript. That's the whole pitch. What makes it worth caring about is accuracy — it handles accents, crosstalk, medical terminology, and background noise better than anything I used before it. I integrated a competing STT service for a legal transcription client back in 2021 and spent weeks tuning it to stop mangling proper nouns. Whisper handles most of that out of the box.

The question isn't whether to use Whisper. For anything requiring real accuracy, it's the obvious choice right now. The question is where you run it, because latency across the three options differed by up to 7x in my tests, and that matters whenever someone is waiting on the transcript.

The Three Options

OpenAI's hosted API (api.openai.com/v1/audio/transcriptions) — the obvious starting point. You POST a file, you get a transcript. No infrastructure to manage. Priced at $0.006 per minute of audio.

Self-hosted Whisper — run the model on your own GPU (or CPU, if you're patient). OpenAI open-sourced the weights. You can run it via the openai-whisper Python package, or the much faster faster-whisper library which uses CTranslate2 under the hood.

Groq's Whisper endpoint — Groq built custom inference hardware (they call it the LPU) that runs Whisper extremely fast. Same API shape as OpenAI's, just pointed at api.groq.com. They're running whisper-large-v3-turbo as of this writing.

Benchmark Setup

I used a mix of audio clips — 30 seconds, 2 minutes, and 5 minutes — in MP3 format at 128kbps. Content was conversational English, recorded in a typical office environment. I ran 10 trials per clip per provider and took the median. Upload time is included in all numbers because in production, it's part of the wall-clock latency your user experiences.

My test machine is a DigitalOcean droplet in NYC. Self-hosted tests ran on an RTX 3090 in my office (not the droplet — I'm not running GPU VMs for a benchmark post).
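For anyone who wants to reproduce the method, the "N trials, take the median" loop can be sketched in a few lines of plain PHP. This is an illustration of the approach, not the exact script behind the numbers below; medianSeconds and timeTrials are names made up for this sketch, and the callable stands in for whichever provider round trip is being timed.

```php
<?php

/**
 * Median of a list of numbers (mean of the two middle values for even counts).
 */
function medianSeconds(array $samples): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('Need at least one sample.');
    }

    sort($samples);
    $count = count($samples);
    $mid   = intdiv($count, 2);

    return $count % 2 === 1
        ? (float) $samples[$mid]
        : ($samples[$mid - 1] + $samples[$mid]) / 2;
}

/**
 * Run $call $trials times and return the median wall-clock duration in seconds.
 * $call should perform the whole round trip: upload, transcription, response.
 */
function timeTrials(callable $call, int $trials = 10): float
{
    $durations = [];

    for ($i = 0; $i < $trials; $i++) {
        $start = hrtime(true);                        // monotonic clock, nanoseconds
        $call();
        $durations[] = (hrtime(true) - $start) / 1e9; // convert to seconds
    }

    return medianSeconds($durations);
}
```

Passing a closure such as fn () => $whisper->transcribe($path) times the full round trip, upload included. The median is used rather than the mean so that a single slow outlier (a cold connection, a network hiccup) doesn't skew the result.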

Audio Length	OpenAI API	Self-Hosted (faster-whisper, GPU)	Groq API
30 seconds	4.2s	2.1s	0.8s
2 minutes	11.4s	5.6s	1.9s
5 minutes	24.7s	12.3s	3.4s

The gap is not small. For a 5-minute clip, Groq is roughly 7x faster than the OpenAI API (3.4s vs 24.7s). Self-hosted on a decent GPU lands in the middle, at about 2x faster than OpenAI.
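The ratios are easy to check from the table. The sketch below recomputes the speedup over the OpenAI API and the "real-time factor" (seconds of audio handled per second of wall-clock time) for each row. The numbers are the medians from the table; the function names are just for this example.

```php
<?php

// Median wall-clock seconds from the table, keyed by clip length in seconds.
$medians = [
    30  => ['openai' => 4.2,  'self_hosted' => 2.1,  'groq' => 0.8],
    120 => ['openai' => 11.4, 'self_hosted' => 5.6,  'groq' => 1.9],
    300 => ['openai' => 24.7, 'self_hosted' => 12.3, 'groq' => 3.4],
];

/** How many times faster $candidate is than $baseline, to one decimal. */
function speedup(float $baseline, float $candidate): float
{
    return round($baseline / $candidate, 1);
}

/** Seconds of audio processed per second of wall-clock time, to one decimal. */
function realTimeFactor(int $audioSeconds, float $wallSeconds): float
{
    return round($audioSeconds / $wallSeconds, 1);
}

foreach ($medians as $audioSeconds => $row) {
    printf(
        "%3ds clip: Groq %.1fx and self-hosted %.1fx faster than OpenAI; Groq at %.1fx real time\n",
        $audioSeconds,
        speedup($row['openai'], $row['groq']),
        speedup($row['openai'], $row['self_hosted']),
        realTimeFactor($audioSeconds, $row['groq'])
    );
}
```

One thing the ratios make visible: Groq's lead grows with clip length, from about 5x on the 30-second clip to about 7x on the 5-minute one, while self-hosted stays close to 2x throughout.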

The Code

All three options end up as an HTTP call from Laravel, and the two hosted ones share the exact same request shape, so one class covers both. Here's the abstraction I built:

<?php

namespace App\Services;

use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Storage;

class WhisperTranscriptionService
{
    public function __construct(
        private readonly string $provider = 'openai' // 'openai' | 'groq'
    ) {}

    public function transcribe(string $storagePath, string $language = 'en'): string
    {
        $endpoint = match ($this->provider) {
            'groq'  => 'https://api.groq.com/openai/v1/audio/transcriptions',
            default => 'https://api.openai.com/v1/audio/transcriptions',
        };

        $apiKey = match ($this->provider) {
            'groq'  => config('services.groq.key'),
            default => config('services.openai.key'),
        };

        $model = match ($this->provider) {
            'groq'  => 'whisper-large-v3-turbo',
            default => 'whisper-1',
        };

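        // Resolve the absolute path on the default disk (assumes a local disk).
        $filePath = Storage::path($storagePath);
        $mimeType = mime_content_type($filePath) ?: 'application/octet-stream';

        // Multipart upload: the audio goes in the "file" part, options as form fields.
        $response = Http::withToken($apiKey)
            ->timeout(120)
            ->attach('file', fopen($filePath, 'r'), basename($filePath), ['Content-Type' => $mimeType])
            ->post($endpoint, [
                'model'           => $model,
                'language'        => $language,
                'response_format' => 'json',
            ]);

        // Throws a RequestException on 4xx/5xx so the calling job can retry.
        $response->throw();

        // The cast keeps a missing "text" key from violating the string return type.
        return (string) $response->json('text');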
    }
}

And a simple queued job that calls it:

<?php

namespace App\Jobs;

use App\Models\AudioRecording;
use App\Services\WhisperTranscriptionService;
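use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class TranscribeAudioRecording implements ShouldQueue
{
    // SerializesModels queues only the model's identifier and reloads it in the worker.
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;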

    public int $tries = 3;
    public int $timeout = 180;

    public function __construct(private readonly AudioRecording $recording) {}

    public function handle(WhisperTranscriptionService $whisper): void
    {
        $transcript = $whisper->transcribe($this->recording->storage_path);

        $this->recording->update([
            'transcript'       => $transcript,
            'transcribed_at'   => now(),
        ]);
    }

    public function failed(\Throwable $e): void
    {
        $this->recording->update(['transcription_failed' => true]);
        // alert, log, etc.
    }
}

Swapping providers is just a config change. I appreciate that Groq kept the API shape identical to OpenAI's — they clearly made a deliberate choice there.
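One detail worth spelling out: as written, the service's constructor defaults to 'openai', and Laravel's container will use that default whenever it auto-resolves the class for the job. To make the swap an actual config change, the provider name has to be read from config somewhere. A minimal sketch of one way to do that follows. The config key services.transcription.provider and the TRANSCRIPTION_PROVIDER env variable are assumptions made up for this example, not something from the original setup.

```php
<?php

namespace App\Providers;

use App\Services\WhisperTranscriptionService;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        // Without a binding, the container falls back to the constructor
        // default ('openai'), so the provider would never actually change.
        $this->app->bind(WhisperTranscriptionService::class, function () {
            // Hypothetical config key, e.g. backed in config/services.php by
            // 'transcription' => ['provider' => env('TRANSCRIPTION_PROVIDER', 'openai')]
            return new WhisperTranscriptionService(
                config('services.transcription.provider', 'openai')
            );
        });
    }
}
```

With something like this in place, flipping the env value between openai and groq is enough; the job and the service stay untouched.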

The Gotchas That Bit Me

OpenAI's 25MB file size limit is a real constraint. A 5-minute WAV file at CD quality blows right past that. You need to either compress before upload (MP3 at 64kbps is fine for speech) or chunk the audio. I've been using ffmpeg via Symfony Process to normalize everything to mono MP3 before it ever touches the API.
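The arithmetic behind that is worth seeing once. The sketch below estimates file sizes and builds an ffmpeg argument list for the mono MP3 normalization. It is an illustration, not the exact command from the client project: the function names are invented here, the limit is treated as 25 * 1024 * 1024 bytes (an assumption about how the cap is counted), and the ffmpeg build is assumed to include the libmp3lame encoder.

```php
<?php

// Assumed interpretation of the 25MB upload cap.
const UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024;

/** Size in bytes of uncompressed PCM audio (WAV payload, header ignored). */
function pcmBytes(int $seconds, int $sampleRate = 44100, int $bitsPerSample = 16, int $channels = 2): int
{
    return $seconds * $sampleRate * intdiv($bitsPerSample, 8) * $channels;
}

/** Approximate size in bytes of constant-bitrate audio at $kbps. */
function cbrBytes(int $seconds, int $kbps): int
{
    return intdiv($seconds * $kbps * 1000, 8);
}

/** Argument list for re-encoding any input to mono MP3 at $kbps with ffmpeg. */
function ffmpegMonoMp3Args(string $input, string $output, int $kbps = 64): array
{
    return [
        'ffmpeg',
        '-y',                      // overwrite the output file if it exists
        '-i', $input,
        '-ac', '1',                // downmix to mono
        '-codec:a', 'libmp3lame',  // MP3 encoder
        '-b:a', "{$kbps}k",        // target audio bitrate
        $output,
    ];
}
```

Five minutes of CD-quality stereo WAV comes to pcmBytes(300) = 52,920,000 bytes, roughly double the cap, while the same five minutes at 64kbps is cbrBytes(300, 64) = 2,400,000 bytes. The array from ffmpegMonoMp3Args can be handed straight to Symfony Process, for example (new Process($args))->mustRun().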

Groq's rate limits are aggressive on the free tier. The paid tier is reasonable, but during development I hit the audio minutes-per-day cap a few times. Not a production problem once you're on a paid plan, but it stung during testing. Also, Groq's file size limit is no more generous than OpenAI's (the same 25MB), but their turbo model is fast enough that you shouldn't need huge files anyway.
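If you do run into rate limiting, it helps not to burn all three job attempts in quick succession. Below is a minimal sketch of a delay calculation, assuming the provider answers with HTTP 429 and may include a Retry-After header given in whole seconds. The function name and the default base and cap values are arbitrary choices for this example.

```php
<?php

/**
 * Seconds to wait before retrying a rate-limited (HTTP 429) request.
 * Uses the server's Retry-After value when it is a plain number of seconds,
 * otherwise falls back to capped exponential backoff.
 */
function retryDelaySeconds(int $attempt, ?string $retryAfter = null, int $base = 5, int $cap = 300): int
{
    if ($retryAfter !== null && ctype_digit(trim($retryAfter))) {
        return min((int) trim($retryAfter), $cap);
    }

    // attempt 1 => base, attempt 2 => 2 * base, attempt 3 => 4 * base, ...
    return min($base * (2 ** max($attempt - 1, 0)), $cap);
}
```

In the job, one way to use it would be to catch Illuminate\Http\Client\RequestException, check that $e->response->status() is 429, and call $this->release(retryDelaySeconds($this->attempts(), $e->response->header('Retry-After'))). Released jobs still count against $tries, so that number may need raising.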

Self-hosted faster-whisper has a cold start problem if you're not keeping the model loaded in memory. First request after idle can take 8-12 seconds while the model loads. For a web-facing service that needs to respond quickly, you need a persistent worker process — not a PHP script that boots the Python interpreter per request. I used a simple FastAPI wrapper with the model loaded at startup, and called it from Laravel over HTTP. Works fine, but it's more infrastructure to babysit.

Self-hosted also means you own the GPU. That RTX 3090 in my office is fine for a benchmark but not a production deployment. A dedicated GPU instance on AWS or Lambda Labs adds up fast. For most clients I serve, the math doesn't close unless volume is very high — we're talking thousands of hours of audio per month.
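To put a rough number on "very high", here is the break-even arithmetic as a sketch. The API price is the $0.006 per minute quoted earlier. The GPU bill is a placeholder you would replace with a real quote: the $720 used in the note after the code is simply $1 per hour for a 30-day month, not a price from any provider.

```php
<?php

// USD per minute of audio on the hosted API, as quoted earlier in the post.
const API_PRICE_PER_AUDIO_MINUTE = 0.006;

/** Monthly API bill in USD for a given number of audio hours. */
function apiMonthlyCost(float $audioHours): float
{
    return $audioHours * 60 * API_PRICE_PER_AUDIO_MINUTE;
}

/** Audio hours per month at which a fixed GPU bill equals the API bill. */
function breakEvenAudioHours(float $gpuMonthlyCost): float
{
    return $gpuMonthlyCost / (60 * API_PRICE_PER_AUDIO_MINUTE);
}
```

At $0.006 per minute the API costs $0.36 per hour of audio, so 1,000 hours a month is about $360. With the placeholder $720 GPU bill, breakEvenAudioHours(720) comes out at 2,000 audio hours per month, and that is before counting any time spent operating the thing.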

Timestamps and word-level confidence: OpenAI's API gives you segment-level timestamps with response_format=verbose_json, and Groq supports this too. OpenAI's endpoint can also return word-level timestamps if you add timestamp_granularities[]=word. Self-hosted faster-whisper gives you more control still, including word-level timestamps with a probability for each word. For the telehealth use case, segment timestamps were enough for the provider to scan the transcript. If you're building a synchronized transcript viewer (think: click a word, seek the audio), self-hosted gives you more knobs.
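For the segment-level case, turning a verbose_json response into a scannable transcript takes very little code. The sketch below assumes the decoded response carries a segments list whose entries have start (seconds) and text keys, which is the documented shape of verbose_json. The helper names are invented for this example.

```php
<?php

/** Format a second offset as m:ss for display next to a transcript line. */
function formatTimestamp(float $seconds): string
{
    $total = (int) floor($seconds);

    return sprintf('%d:%02d', intdiv($total, 60), $total % 60);
}

/**
 * Turn a decoded verbose_json response into "[m:ss] text" lines.
 * Returns an empty list if the response has no segments.
 */
function segmentLines(array $verboseJson): array
{
    $lines = [];

    foreach ($verboseJson['segments'] ?? [] as $segment) {
        // Segment text usually arrives with a leading space, hence the trim.
        $lines[] = sprintf(
            '[%s] %s',
            formatTimestamp((float) $segment['start']),
            trim($segment['text'])
        );
    }

    return $lines;
}
```

In the service above this would mean sending 'response_format' => 'verbose_json' instead of 'json' and passing $response->json() to segmentLines rather than returning only the text field.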

When I'd Reach For Each

OpenAI API — default choice for anything where latency isn't critical, volume is moderate, and I don't want to own infrastructure. A weekly batch job transcribing recorded webinars? OpenAI. A legal client archiving depositions overnight? OpenAI. The developer experience is good, uptime is excellent, and $0.006/min is genuinely cheap.

Groq API — any time a human is waiting on the result. The telehealth client I mentioned is now on Groq. The provider gets a transcript in under 2 seconds for a typical 90-second intake clip. That's the difference between a workflow that feels instant and one that requires a loading spinner. I'd also reach for Groq if I'm building something conversational — a voice bot, a real-time note-taker, anything where the transcription is in the hot path.

Self-hosted — honestly, a narrow window. It makes sense if you have regulatory requirements that prohibit sending audio to third-party APIs (certain HIPAA situations, some government work), if you need word-level timestamps and maximum control, or if you're already running GPU infrastructure for other models and the marginal cost is low. For a greenfield project, the operational overhead isn't worth it unless you're at serious scale or have hard data residency requirements.

One thing I haven't done yet but want to: test self-hosted with faster-whisper on an L4 or A10G instance in the same datacenter as my app server to get a fairer network comparison. My gut says it would close some of the gap with Groq, but probably not all of it.

Closing

Groq's latency advantage here is real and it's not marginal — if you're building anything where a human waits on the transcript, it changes the product feel entirely. The OpenAI API is still my default for async workloads because the reliability record is better and I've been burned by newer providers' uptime before. Self-hosted Whisper is a solution looking for a specific problem; don't reach for it unless you have a concrete reason to own the infrastructure.