Don't be confused if it says "no microphone": the moment you click the record button it will request browser permission and then start working.
I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?
The dataset is ~100 8kHz call recordings with gnarly UK accents (which I consider to be the final boss of English-language ASR). It seems like it's SOTA.
Where it does seem to fall down is the latency distribution, but I'm testing against the API. Running it locally would presumably improve that?
I tried English + Polish:
> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.
Amazon's transcription service is $0.024 per minute, a pretty big difference: https://aws.amazon.com/transcribe/pricing/
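To put that per-minute rate in more familiar units, here is a quick sketch. Only the $0.024 figure comes from the comment above; the function name and the 60-minute example are just illustrative, and you can pass any other provider's rate as `per_minute` to compare.

```python
from decimal import Decimal

# USD per minute, as quoted in the comment above (not re-checked against the pricing page)
AWS_TRANSCRIBE_PER_MINUTE = Decimal("0.024")

def cost_usd(minutes: int, per_minute: Decimal = AWS_TRANSCRIBE_PER_MINUTE) -> Decimal:
    """Cost of transcribing `minutes` of audio at a flat per-minute rate."""
    return per_minute * minutes

print(cost_usd(60))  # 1.440, i.e. about $1.44 for an hour of audio
```

Decimal is used here so the result prints exactly rather than with floating-point noise.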
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
~9GB model.
But whatever I tried, it could not recognise my Ukrainian and would default to Russian, producing an absolutely ridiculous transcription. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in the training material and zero Ukrainian. Made me really sad.
We need better independent comparisons to see how it performs against the latest Qwen3-ASR, and so on.
I can no longer take at face value the cherry-picked comparisons from companies showing off their new models.
For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.
Is it better? Worse? Why do they only compare to gpt4o mini transcribe?
If you transcribe a minute of conversation, you'll have something like 5 words transcribed wrongly. In an hour-long podcast, that is 300 wrongly transcribed words.
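A rough sketch of that arithmetic, in case anyone wants to plug in their own numbers. The 5 errors per minute is from the comment above; the 150 words-per-minute speaking rate is my assumption (it is not stated in the thread), so the implied error rate is only a ballpark.

```python
def wrong_words(minutes: float, errors_per_minute: float = 5.0) -> float:
    """Expected number of wrongly transcribed words over a recording."""
    return minutes * errors_per_minute

def implied_wer(errors_per_minute: float = 5.0, words_per_minute: float = 150.0) -> float:
    """Word error rate implied by an errors-per-minute figure.

    words_per_minute is an assumed conversational speaking rate.
    """
    return errors_per_minute / words_per_minute

print(wrong_words(60))         # 300.0 wrong words in an hour
print(f"{implied_wer():.1%}")  # 3.3% under the assumed 150 wpm
```

A faster speaker pushes the implied rate down and a slower one pushes it up, so the per-minute figure only means much alongside a speaking rate.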
"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"
So, you don't mean 'try this out', you mean 'buy this product'.
Let's not act like it's a free sampler.
I can't comment on the model: I'm not giving them money.
What estimates do others use?
[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...
This combo has almost unbeatable accuracy and it rejects noises in the background really well. It can even reject people talking in the background.
The only better thing I've seen is the Ursa model from Speechmatics. Not open weights, unfortunately.