In OCR, even when the characters are poorly scanned, the deep domain understanding these large multi modal AIs have allows it to understand what the document actually meant - this is going to be order id because in the million invoices I have seen before order id is normally below order date - etc. The same issue is going to be there in ASR also is my worry.
So far, the best I have found while testing models for my language learning app (Copycat Cafe) is Soniox. All others performed badly for non native accents. The worst were whisper-based models because they hallucinate when they misunderstand and tend to come up with random phrases that have nothing to do with the topic.
>Timestamps/Speaker diarization. The model does not feature either of these.
What a shame. Is whisperx still the best choice if you want timestamps/diarization?
It has the most crisp, steady P50 of any external service I've used in a long time.
This is a good option. Will check it out.
Accurate and fast model, very happy with it so far!