Hacker News

GLM-OCR: Accurate × Fast × Comprehensive

201 points by ms7892 | 58 comments
There are a bunch of new OCR models.

I’ve also heard very good things about these two in particular:

- LightOnOCR-2-1B: https://huggingface.co/lightonai/LightOnOCR-2-1B

- PaddleOCR-VL-1.5: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5

The OCR leaderboards I’ve seen leave a lot to be desired.

With the rapid release of so many of these models, I wish there were a better way to know which ones are actually the best.

I also feel like most/all of these models don’t handle charts, other than to maybe include a link to a cropped image. It would be nice for the OCR model to also convert charts into markdown tables, but this is obviously challenging.

by coder543
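For anyone who wants to try the two checkpoints linked in the comment above: both ship as Hugging Face models, and one minimal way to poke at them is to serve one with vLLM and call its OpenAI-compatible endpoint. This is only a sketch — the model name, file name, and prompt wording below are illustrative, each model card documents the prompt format it actually expects, and chart-to-markdown-table output is exactly the part that isn't guaranteed to work.

```python
# Minimal sketch: query a locally served OCR VLM through vLLM's
# OpenAI-compatible API and ask for a markdown rendition of a chart.
# Assumes the server was started with something like:
#   vllm serve lightonai/LightOnOCR-2-1B
# The prompt text is illustrative; check the model card for the format
# each model actually expects.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("chart.png", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="lightonai/LightOnOCR-2-1B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Transcribe this page to markdown. If it contains a "
                     "chart, render the underlying values as a markdown table."},
        ],
    }],
)
print(response.choices[0].message.content)
```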
So many OCR models have been released in the past few months, all VLMs, and yet none of them handle Korean well. Every time I try with a random screenshot (not an A4 document) they just fail at a "simple" task. And funnily enough, Qwen3 8B VL is the best model and usually gets it right (although I couldn't get the bboxes quite right). Even funnier, whatever runs locally on an iPhone's CPU is insanely good, same with Google's OCR API. I don't know why we don't get more of the traditional OCR stuff. PaddlePaddle v5 is the closest I could find. At this point, I feel like I might be doing something wrong with those VLMs.
by alaanor
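A cheap baseline for the "traditional OCR" route mentioned above is Tesseract with the Korean language pack. This is a minimal sketch — it assumes the system tesseract binary and the kor traineddata are installed, and it won't match Apple's or Google's engines — but it is handy as a sanity check against VLM output.

```python
# Minimal sketch: traditional OCR baseline for Korean screenshots using
# Tesseract. Requires the tesseract binary plus the "kor" language data.
from PIL import Image
import pytesseract

img = Image.open("screenshot.png")  # placeholder file name

# "kor+eng" lets Tesseract handle mixed Korean/English UI text.
text = pytesseract.image_to_string(img, lang="kor+eng")
print(text)

# image_to_data also returns per-word bounding boxes and confidences,
# which is the part the VLMs in this thread often struggle to produce.
data = pytesseract.image_to_data(img, lang="kor+eng",
                                 output_type=pytesseract.Output.DICT)
print(list(zip(data["text"], data["conf"]))[:10])
```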
This is actually the thing I really desperately need. I'm routinely analyzing contracts that were faxed to me, scanned with monstrously poor resolution, wet signed, all kinds of shit. The big LLM providers choke on this raw input and I burn up the entire context window for 30 pages of text. Understandable evals of the quality of these OCR systems (which are moving wicked fast) would be helpful...

And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.

by aliljet
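For the eval gap described above, one low-tech option is to hand-transcribe a few representative pages and score each OCR system against them with character error rate. A minimal sketch in plain Python (no dependencies; the file names are placeholders):

```python
# Minimal sketch: score an OCR transcript against a hand-made ground truth
# using character error rate (edit distance / reference length).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

reference = open("page_groundtruth.txt", encoding="utf-8").read()
hypothesis = open("page_ocr_output.txt", encoding="utf-8").read()
print(f"CER: {cer(reference, hypothesis):.2%}")
```

A low CER still won't tell you whether the one character that flipped was in "4 units vacant" or "10 days to respond", so spot-checking the critical fields separately is probably unavoidable.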
Text me back when there's a working PDF to EPUB conversion tool. I've been waiting (and searching for one) long enough. :D

EDIT: https://github.com/overcuriousity/pdf2epub looks interesting.

by mikae1
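On the PDF-to-EPUB wish: once any of these OCR models has produced markdown, pandoc will take it the rest of the way. A rough sketch of that second half of the pipeline, assuming the pandoc binary is installed and that per-chapter markdown files already came out of an OCR step (the file names and title are placeholders):

```python
# Minimal sketch: stitch per-chapter markdown produced by an OCR model
# into an EPUB with pandoc. Assumes the pandoc binary is installed.
import glob
import subprocess

chapters = sorted(glob.glob("chapter_*.md"))  # output of the OCR step

subprocess.run(
    ["pandoc", *chapters,
     "-M", "title=My Scanned Book",  # EPUBs want a title in the metadata
     "--toc",                        # build a table of contents from headings
     "-o", "book.epub"],
    check=True,
)
```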
This might be a niche question, but does glm-ocr (or other libraries) have the ability to extract/interpret QR code data?
by surfacedamage
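On the QR question: these OCR models generally just transcribe text (or emit a cropped-image placeholder), so decoding is usually handed to a classical library instead. A minimal sketch with OpenCV's built-in detector (the file name is a placeholder):

```python
# Minimal sketch: decode QR codes with OpenCV's built-in detector rather
# than relying on the OCR model itself.
import cv2

img = cv2.imread("scan_page.png")  # placeholder file name
detector = cv2.QRCodeDetector()

# detectAndDecode returns (decoded_text, corner_points, straightened_code);
# decoded_text is an empty string when no QR code is found.
text, points, _ = detector.detectAndDecode(img)
if text:
    print("QR payload:", text)
else:
    print("No QR code detected")
```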
What's the current SOTA for Japanese and Korean OCR? BalloonsTranslator has a great workflow but the models are pretty old.
by ThrowawayTestr
I've been trying different OCR models on what should be a very simple case: subtitles (clean, machine-rendered text). While all the models do very well (95+% accuracy), I haven't seen one that doesn't occasionally make very obvious mistakes. Maybe it will take a different approach to get the last 1%...
by ks2048
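One cheap way to chase that last 1% is to run two different models over the same subtitle frames and only review the lines where they disagree. A minimal sketch, assuming each model's output has been dumped one subtitle per line (the file names are placeholders):

```python
# Minimal sketch: diff two models' transcripts of the same subtitles and
# surface only the disagreements for manual review.
import difflib

with open("model_a.txt", encoding="utf-8") as f:
    lines_a = f.read().splitlines()
with open("model_b.txt", encoding="utf-8") as f:
    lines_b = f.read().splitlines()

for i, (a, b) in enumerate(zip(lines_a, lines_b), 1):
    if a != b:
        sim = difflib.SequenceMatcher(None, a, b).ratio()
        print(f"line {i} (similarity {sim:.2f}):")
        print(f"  A: {a}")
        print(f"  B: {b}")
```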
Is it possible for such a small model to outperform Gemini 3, or is this a case of benchmarks not reflecting reality? I would love to be hopeful, but so far an open-source model has never been better than a closed one, even when the benchmarks said it was.
by rdos
Has anyone experimented with using VLMs to detect "marks"? Thinking of pen/pencil-based markings like underlines, circles, checkmarks... Can these models do it?
by sinandrei
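One way to probe this with a general VLM is to ask for a structured JSON answer and then spot-check it against the page. The sketch below is purely illustrative — the model name is a placeholder, the prompt is an assumption, and whether any given model can localize handwritten marks reliably is exactly the open question being asked.

```python
# Minimal sketch: ask a vision-capable model to report pen/pencil marks as
# JSON via an OpenAI-style chat endpoint (hosted, or a local server via
# base_url). Model name and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY; use base_url=... for a local server

with open("annotated_page.png", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    "List every handwritten mark on this page (underline, circle, "
    "checkmark, marginal line). Reply with JSON only, as a list of "
    '{"type": ..., "nearby_text": ..., "bbox": [x0, y0, x1, y1]}.'
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }],
)
# In practice you would strip any code fences and json.loads() the reply,
# then spot-check the reported boxes against the page.
print(response.choices[0].message.content)
```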
I tested this pretty extensively and it has a common failure mode that prevents me from using it: extracting footnotes and similar apparatus from the full text of academic works. For some reason, many of these models are trained in a way that results in these sections being excluded, despite their often containing important details and context. Both versions of DeepSeek-OCR have the same problem. Of the others I've tested, dot-ocr in layout mode works best (but is slow), followed by Datalab's Chandra model (which is larger and has bad license constraints).
by bugglebeetle
[dead]
by raphaelmolly8