Transcription: Whisper, the verbatim prompt, and device fallback
Transcription is the front of the pipeline: faster-whisper produces the
word-level timestamps every detector builds on. The logic lives in
asr.py:transcribe (asr.py:34). The README's "Transcription device" section
covers the user-facing install/GPU story; this is the maintainer's view of the
three non-obvious choices in this module.
Model, device, compute type
transcribe(path, model_name="large-v3", verbatim=True, device="auto",
compute_type="auto") maps directly to the CLI's --model, --device, and
--compute-type. faster_whisper.WhisperModel is lazy-imported inside the
function (asr.py:51) — it's heavy, and the pure-logic test suite must import
the rest of the package without paying for it (see the README's Tests section).
large-v3 is the default for best accuracy; medium.en trades some accuracy for
speed, and small.en is faster still. The model choice matters for detection
quality: a larger model produces tighter word boundaries, which directly
improves the intra-word and overlong detectors (both reason about word
duration). device and compute_type are passed straight through to WhisperModel.
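A minimal sketch of the lazy-import pattern described above. The function name load_model is hypothetical and the body is simplified; only the defaults and the pass-through of device and compute_type are taken from the text. Calling it requires faster-whisper to be installed, but defining it does not.

```python
def load_model(model_name="large-v3", device="auto", compute_type="auto"):
    """Hypothetical helper: build the model, importing the library on first use."""
    # Imported here rather than at module top, so that importing the
    # surrounding module (for example from a pure-logic test) stays cheap.
    from faster_whisper import WhisperModel

    # device and compute_type go straight through to the constructor.
    return WhisperModel(model_name, device=device, compute_type=compute_type)
```

The trade-off is that a missing or broken install surfaces at the first call rather than at import time, which is acceptable here because only the transcription path needs the library.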
The verbatim prompt (asr.py:VERBATIM_PROMPT)
Whisper was trained to produce readable transcripts, so by default it silently
cleans up disfluencies — which would leave the word-list detector
(detection.md, pass 1) with nothing to match. Two settings on the
model.transcribe call (asr.py:57–62) bias it the other way:
- initial_prompt=VERBATIM_PROMPT: a short primer ("Um, uh, er, erm, ah, hmm. … Verbatim transcription including all filler words and disfluencies.") that conditions the model to emit fillers as tokens.
- condition_on_previous_text=False: otherwise each segment is conditioned on the model's own prior output, which dilutes the prompt's influence as transcription proceeds.
Together these make the cheap, exact word-list detector viable; the three acoustic detectors are the safety net for fillers Whisper still drops or fuses despite the prompt.
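The call shape can be sketched as follows. This is an illustration, not the code in asr.py: transcribe_words and StubModel are hypothetical names, PROMPT is an abbreviated stand-in for VERBATIM_PROMPT, and word_timestamps=True is assumed because the detectors need word-level timing. The stub only mimics the return shape of WhisperModel.transcribe so the sketch runs without downloading a model.

```python
from types import SimpleNamespace

# Abbreviated stand-in; the real text is VERBATIM_PROMPT in asr.py.
PROMPT = "Um, uh, er, erm, ah, hmm."


def transcribe_words(model, path):
    """Return (word, start, end) tuples using the two verbatim settings."""
    segments, _info = model.transcribe(
        path,
        word_timestamps=True,              # assumed: detectors need word timing
        initial_prompt=PROMPT,             # primes the model to keep fillers
        condition_on_previous_text=False,  # keeps the prompt from being diluted
    )
    return [(w.word, w.start, w.end) for seg in segments for w in seg.words]


class StubModel:
    """Stand-in for faster_whisper.WhisperModel; records the keyword arguments."""

    def transcribe(self, audio, **kwargs):
        self.kwargs = kwargs
        word = SimpleNamespace(word=" um", start=0.0, end=0.3)
        return iter([SimpleNamespace(words=[word])]), None


stub = StubModel()
words = transcribe_words(stub, "talk.wav")
```

With a real WhisperModel in place of the stub, the same function would yield the word list that the word-list detector scans.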
CUDA → CPU fallback (asr._is_recoverable_cuda_error)
The common failure mode for faster-whisper on a GPU box is a machine that has
an NVIDIA GPU and driver but lacks the CUDA runtime wheels (libcublas,
libcudnn). With device="auto" the backend picks the GPU and then raises a bare
RuntimeError. It does so lazily, on the first encode(), not at model
construction (asr.py:64–65). That's why the entire segment iteration is wrapped
inside the _run closure: a try/except around the constructor alone would never
see the error.
The recovery (asr.py:77–91):
- _run(device) is attempted first.
- On RuntimeError, the message is matched against _RECOVERABLE_CUDA_MARKERS ("cublas", "cudnn", "cuda"; asr.py:26). These substrings cover the observed cublas/cudnn load failures and most other GPU-side CUDA errors (driver too old, OOM).
- The retry on CPU happens only when device == "auto", i.e. the user let erm choose. It prints a warning to stderr and re-runs on CPU.
- An explicit --device cuda is treated as a demand: the real error is re-raised, never silently downgraded.
So auto always completes (GPU when it works, CPU otherwise, with a warning),
cpu skips the GPU probe entirely, and cuda surfaces failures. The README's
"Transcription device" section documents how to install the CUDA wheels to make
the GPU path actually load.