VLM4OCR

vlm4ocr is a toolkit for optical character recognition (OCR) with vision language models (VLMs). It includes three components:

What's new in v0.5.0

  • BBox output mode: OCREngine(output_mode="bbox", ...) returns text with bounding-box coordinates and labels. Leave user_prompt empty for full-text bbox OCR, or set it to a free-text instruction (e.g., "patient name and DOB") for targeted extraction. The built-in format registry covers Qwen3-VL, Gemma 3/4, and GPT-4.1. See Quick Start.
  • Web app — BBox tab — visualize bounding boxes directly in the browser with an Image | Raw response toggle. Batch mode emits annotated PNGs and a consolidated JSON per file.
  • OCRPage dataclass: OCRResult.pages entries are now OCRPage dataclasses with .text, .bboxes, .image_width, .image_height, and a .plot_bboxes() helper. Dict-style access still works.
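
To illustrate how a dataclass can keep dict-style access working alongside attribute access, here is a minimal, self-contained sketch. It does not import vlm4ocr and is not the library's actual implementation: PageSketch is a hypothetical stand-in for OCRPage, and the shape of each bbox entry (a dict with "label" and "bbox" keys) is an assumption made only for this example.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class PageSketch:
    """Hypothetical stand-in for OCRPage (not vlm4ocr's real class)."""
    text: str
    # Assumed entry shape for illustration: {"label": str, "bbox": [x1, y1, x2, y2]}
    bboxes: List[Dict[str, Any]] = field(default_factory=list)
    image_width: int = 0
    image_height: int = 0

    def keys(self) -> List[str]:
        # Field names in declaration order, so dict(page) also works.
        return [f.name for f in fields(self)]

    def __getitem__(self, key: str) -> Any:
        # Support page["text"] in addition to page.text.
        if key not in self.keys():
            raise KeyError(key)
        return getattr(self, key)


page = PageSketch(
    text="Jane Doe",
    bboxes=[{"label": "patient name", "bbox": [10, 20, 110, 40]}],
    image_width=800,
    image_height=600,
)

# Attribute access and dict-style access return the same object.
same_text = page.text == page["text"]
as_dict = dict(page)
```

Code written against the older dict-based pages (for example, page["text"]) would keep running under a design like this, while new code can use attributes.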