Voxtral transcribes at the speed of sound

By Simon Willison's Weblog February 4, 2026 · Edited February 5, 2026

Voxtral transcribes at the speed of sound (https://mistral.ai/news/voxtral-transcribe-2)

Mistral just released Voxtral Transcribe 2 - a family of two new models, one open weights, for transcribing audio to text. This is the latest in their Whisper-like model family, and a sequel to the original Voxtral which they released in July 2025 (https://simonwillison.net/2025/Jul/16/voxtral/).

Voxtral Realtime - official name Voxtral-Mini-4B-Realtime-2602 - is the open weights (Apache-2.0) model, available as an 8.87GB download from Hugging Face (https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602).

You can try it out in this live demo (https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtime). Don’t be put off by the “No microphone found” message: clicking “Record” should prompt your browser to request microphone permission and then start the demo working. I was very impressed by the demo - I talked quickly and used jargon like Django and WebAssembly and it correctly transcribed my speech within moments of me uttering each sound.

The closed weight model is called voxtral-mini-latest and can be accessed via the Mistral API, using calls that look something like this:

curl -X POST "https://api.mistral.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -F model="voxtral-mini-latest" \
  -F file=@"Pelican talk at the library.m4a" \
  -F diarize=true \
  -F context_bias="Datasette" \
  -F timestamp_granularities="segment"
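
If you expect to run that request against more than one file, it may be worth wrapping it in a small shell function. This is only a sketch: the function name voxtral_transcribe and its two arguments are hypothetical, the endpoint, model name and form fields are copied from the call above, and the extra curl options (--fail, -sS) are just there to surface HTTP errors. It assumes context_bias can be left out of the request, which I have not verified against the API documentation.

```shell
# Hypothetical helper wrapping the transcription request shown above.
# Usage: voxtral_transcribe AUDIO_FILE [CONTEXT_BIAS]
voxtral_transcribe() {
  audio_file="$1"
  context_bias="${2:-}"

  # Fail early instead of sending a request that cannot succeed.
  if [ -z "${MISTRAL_API_KEY:-}" ]; then
    echo "MISTRAL_API_KEY is not set" >&2
    return 1
  fi
  if [ ! -f "$audio_file" ]; then
    echo "No such audio file: $audio_file" >&2
    return 1
  fi

  # Same form fields as the original curl call.
  set -- -F model="voxtral-mini-latest" \
         -F file=@"$audio_file" \
         -F diarize=true \
         -F timestamp_granularities="segment"

  # Assumption: context_bias is optional, so only send it when provided.
  if [ -n "$context_bias" ]; then
    set -- "$@" -F context_bias="$context_bias"
  fi

  # --fail makes curl exit non-zero on an HTTP error; -sS hides the
  # progress meter but still prints error messages.
  curl --fail -sS -X POST "https://api.mistral.ai/v1/audio/transcriptions" \
    -H "Authorization: Bearer $MISTRAL_API_KEY" \
    "$@"
}
```

Calling voxtral_transcribe "Pelican talk at the library.m4a" "Datasette" should then send the same request as the curl command above and print the response to standard output.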

It’s priced at $0.003/minute, which is $0.18/hour.
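
For a rough budget, the arithmetic is just minutes multiplied by the per-minute rate. Here is a minimal sketch using awk; voxtral_cost is a hypothetical helper name, and the rate is the $0.003/minute figure quoted above, so adjust it if the pricing changes.

```shell
# Estimate the API transcription cost in USD for a given number of minutes.
# The rate is the $0.003/minute price quoted above.
voxtral_cost() {
  awk -v minutes="$1" 'BEGIN { printf "%.3f\n", minutes * 0.003 }'
}

voxtral_cost 60   # one hour, prints 0.180
voxtral_cost 90   # a 90-minute talk, prints 0.270
```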

The Mistral API console now has a speech-to-text playground (https://console.mistral.ai/build/audio/speech-to-text) for exercising the new model and it is excellent. You can upload an audio file and promptly get a diarized transcript in a pleasant interface, with options to download the result in text, SRT or JSON format.

Via Hacker News (https://news.ycombinator.com/item?id=46886735)

Tags: ai (https://simonwillison.net/tags/ai), generative-ai (https://simonwillison.net/tags/generative-ai), llms (https://simonwillison.net/tags/llms), hugging-face (https://simonwillison.net/tags/hugging-face), mistral (https://simonwillison.net/tags/mistral), speech-to-text (https://simonwillison.net/tags/speech-to-text)

Reference: https://simonwillison.net/2026/Feb/4/voxtral-2/#atom-everything