VLM4OCR
vlm4ocr is a toolkit for optical character recognition (OCR) with vision language models (VLMs). It includes three components:
- Web Application for drag-and-drop access
- CLI for command line access
- Python package for Python access
What's new in v0.5.0
- BBox output mode: OCREngine(output_mode="bbox", ...) returns text with bounding-box coordinates and labels. Leave user_prompt empty for full-text bbox OCR, or set it to a free-text instruction (e.g., "patient name and DOB") for targeted extraction. The built-in format registry covers Qwen3-VL, Gemma 3/4, and GPT-4.1. See Quick Start.
- Web app BBox tab: visualize bounding boxes directly in the browser with an Image | Raw response toggle. Batch mode emits annotated PNGs and a consolidated JSON per file.
- OCRPage dataclass: OCRResult.pages entries are now OCRPage dataclasses with .text, .bboxes, .image_width, .image_height, and a .plot_bboxes() helper. Dict-style access still works.
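The idea of a dataclass that still accepts dict-style access can be illustrated with a minimal sketch. This is not vlm4ocr's implementation: PageSketch is a hypothetical stand-in, and the shape of each bbox entry (a dict with "label" and "box" keys) is an assumption made only for the example. The field names mirror the attributes listed above.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class PageSketch:
    """Hypothetical stand-in for a page record; not vlm4ocr's OCRPage."""

    text: str
    # Assumed bbox shape for illustration only: {"label": str, "box": [x1, y1, x2, y2]}
    bboxes: List[Dict[str, Any]] = field(default_factory=list)
    image_width: int = 0
    image_height: int = 0

    def keys(self) -> List[str]:
        # Field names double as the dict-style keys.
        return [f.name for f in fields(self)]

    def __getitem__(self, key: str) -> Any:
        # Support page["text"] alongside page.text so dict-based callers keep working.
        if key not in self.keys():
            raise KeyError(key)
        return getattr(self, key)


page = PageSketch(
    text="Jane Doe",
    bboxes=[{"label": "patient name", "box": [10, 20, 110, 40]}],
    image_width=800,
    image_height=600,
)

# Attribute access and dict-style access return the same object.
assert page.text == page["text"]
# Because keys() and __getitem__ are defined, dict(page) also works.
as_dict = dict(page)
```

Under this pattern, code written against plain dicts (page["text"]) and code written against the dataclass (page.text) can coexist during a migration.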