Quick Start
Installation
Python package is available on PyPi.
Quick start
In this demo, we use a locally deployed vLLM OpenAI compatible server to run Qwen2.5-VL-7B-Instruct. For more inference APIs and VLMs, please see VLMEngine.
We define OCR engine and specify output formats.
from vlm4ocr import OCREngine
# Image/PDF paths
image_path = "/examples/synthesized_data/GPT-4o_synthesized_note_1_page_1.jpg"
pdf_path = "/examples/synthesized_data/GPT-4o_synthesized_note_1.pdf"
# Define OCR engine
ocr = OCREngine(vlm_engine, output_mode="markdown")
Full text OCR
Run OCR sequentially
We run OCR sequentially (process one image at a time) for single or multiple files. This approach is suitable for testing or processing small-scaled requests.
# OCR for a single image
ocr_results = ocr.sequential_ocr(image_path, verbose=True)
# OCR for a single pdf (multiple pages)
ocr_results = ocr.sequential_ocr(pdf_path, verbose=True)
# OCR for multiple image and pdf files
ocr_results = ocr.sequential_ocr([pdf_path, image_path], verbose=True)
# Inspect OCR results
len(ocr_results) # 2 files
ocr_results[0].input_dir # input dir
ocr_results[0].filename # input filename
ocr_results[0].status # OCR result status: 'success'
len(ocr_results[0]) # PDF file number of pages
ocr_text = ocr_results[0].to_string() # OCR text (all pages concatenated)
Run OCR concurrently
For high-volume OCR tasks, it is more efficient to run OCR concurrently. The example below concurrently processes 4 images/pages at a time and write outputs to file whenever a file has finished.
import asyncio
async def run_ocr():
response = ocr.concurrent_ocr([image_path_1, image_path_2], concurrent_batch_size=4)
async for result in response:
if result.status == "success":
filename = result.filename
ocr_text = result.to_string()
with open(f"{filename}.md", "w", encoding="utf-8") as f:
f.write(ocr_text)
asyncio.run(run_ocr())
Key information extraction with JSON
In some use cases, we are only interested in a specific set of key information from the OCR results. Processing the entire OCR text is inefficient. We can directly extract the key information using the output_mode="JSON". To use the JSON extraction feature, a custom user prompt that defines the JSON structure is required. The example below demonstrates how to extract key information from images and PDFs.
Run OCR sequentially
import json
user_prompt = """
Your output should include keys: "Patient", "MRN".
For example:
{
"Patient": "John Doe",
"MRN": "12345"
}
"""
ocr = OCREngine(vlm_engine=vlm, output_mode="JSON", user_prompt=user_prompt)
ocr_results = ocr.sequential_ocr([image_path_1, image_path_2], verbose=True)
for result in ocr_results:
for page_num, page in enumerate(result.pages):
print(json.loads(page['text']))
with open(f"{result.filename}_page_{page_num}.json", "w", encoding="utf-8") as f:
json.dump(json.loads(page['text']), f, indent=4)
Run OCR concurrently
import asyncio
import json
user_prompt = """
Your output should include keys: "Patient", "MRN".
For example:
{
"Patient": "John Doe",
"MRN": "12345"
}
"""
ocr = OCREngine(vlm_engine=vlm, output_mode="JSON", user_prompt=user_prompt)
async def run_ocr():
response = ocr.concurrent_ocr([image_path_1, image_path_2], concurrent_batch_size=4)
async for result in response:
if result.status == "success":
filename = result.filename
for page_num, page in enumerate(result.pages):
with open(f"{filename}_page_{page_num}.json", "w", encoding="utf-8") as f:
json.dump(json.loads(page['text']), f, indent=4)
print(f"Saved {filename}_page_{page_num}.json")
else:
print(f"Error processing {result.filename}: {result.error}")
asyncio.run(run_ocr())
Few-shot examples
Few-shot examples can be provided to improve the accuracy. Below are examples of how to include few-shot examples in the OCR engine.
Few-shot examples for full-text OCR
First, we prepare a list of few-shot examples. Each example is an object of FewShotExample that contains an input image (PIL.Image.Image) and the corresponding expected output text. Note that the output text should be the exact text you expect the VLM to generate for the given input image. Do not include any additional explanations or variations. Few-shot examples can also include a max_dimension_pixels parameter to resize the image while maintaining the aspect ratio. This is useful when the original image size exceeds the VLM's maximum input size. Few-shot examples can also include a rotate_correction parameter to automatically correct the image orientation before feeding it to the VLM.
The few-shot example images and text are available in the examples/synthesized_data/few_shot_examples/ folder in this repository.
import os
from PIL import Image
from vlm4ocr import FewShotExample
# Load few-shot examples
example_1_image = Image.open(os.path.join("examples", "synthesized_data", "few_shot_examples", "images", "template_1_sample_4_poor.JPG"))
with open(os.path.join("examples", "synthesized_data", "few_shot_examples", "ground_truth", "template_1_sample_4_poor.txt"), "r") as f:
example_1_text = f.read()
example_2_image = Image.open(os.path.join("examples", "synthesized_data", "few_shot_examples", "images", "template_3_sample_4_poor.JPG"))
with open(os.path.join("examples", "synthesized_data", "few_shot_examples", "ground_truth", "template_3_sample_4_poor.txt"), "r") as f:
example_2_text = f.read()
few_shot_examples = [
FewShotExample(image=example_1_image, text=example_1_text, max_dimension_pixels=512),
FewShotExample(image=example_2_image, text=example_2_text, max_dimension_pixels=512)
]
We load the target image for OCR.
image_path = os.path.join("examples", "synthesized_data", "few_shot_examples", "images", "template_6_sample_4_poor.JPG")
As before, we define the VLM engine and OCR engine.
from vlm4ocr import VLLMVLMEngine, OCREngine
# Define VLM engine
vlm_engine = VLLMVLMEngine(model="Qwen/Qwen2.5-VL-7B-Instruct")
# Define OCR engine
ocr = OCREngine(vlm_engine, output_mode="text")
But this time, we pass the few-shot examples to the OCR methods.
# OCR for a single image
ocr_results = ocr.sequential_ocr(image_path, max_dimension_pixels=512, verbose=True, few_shot_examples=few_shot_examples)
Few-shot examples for key information extraction with JSON
import os
from PIL import Image
from vlm4ocr import FewShotExample
# Load few-shot examples
example_1_image = Image.open(os.path.join("examples", "synthesized_data", "few_shot_examples", "images", "template_1_sample_4_poor.JPG"))
with open(os.path.join("examples", "synthesized_data", "few_shot_examples", "ground_truth", "template_1_sample_4_poor.json"), "r") as f:
example_1_text = f.read()
example_2_image = Image.open(os.path.join("examples", "synthesized_data", "few_shot_examples", "images", "template_3_sample_4_poor.JPG"))
with open(os.path.join("examples", "synthesized_data", "few_shot_examples", "ground_truth", "template_3_sample_4_poor.json"), "r") as f:
example_2_text = f.read()
few_shot_examples = [
FewShotExample(image=example_1_image, text=example_1_text, max_dimension_pixels=512),
FewShotExample(image=example_2_image, text=example_2_text, max_dimension_pixels=512)
]
We load the target image for OCR.
image_path = os.path.join("examples", "synthesized_data", "few_shot_examples", "images", "template_6_sample_4_poor.JPG")
We define the VLM engine, JSON extraction schema, and OCR engine.
from vlm4ocr import VLLMVLMEngine, OCREngine
# Define VLM engine
vlm_engine = VLLMVLMEngine(model="Qwen/Qwen2.5-VL-7B-Instruct")
# Define JSON extraction schema
user_prompt = """
Your output should include keys: "Patient", "MRN".
For example:
{
"Patient": "John Doe",
"MRN": "12345"
}
"""
# Define OCR engine
ocr = OCREngine(vlm_engine, output_mode="JSON", user_prompt=user_prompt)
But this time, we pass the few-shot examples to the OCR methods.