CLI
The command line interface (CLI) provides an easy way to batch-process images, PDFs, and TIFFs in a directory.
Installation
Install the Python package from PyPI; the CLI tool is installed automatically with it.
Quick Start
Run OCR on all supported file types in the /examples/synthesized_data/ folder with a locally deployed Qwen2.5-VL-7B-Instruct and write the results as markdown. OCR results and a log file (enabled by --log) are written to --output_path. --concurrent_batch_size sets the number of images/pages processed concurrently.
# OpenAI compatible API
vlm4ocr --input_path /examples/synthesized_data/ \
--output_path /examples/ocr_output/ \
--skip_existing \
--output_mode markdown \
--log \
--max_dimension_pixels 4000 \
--vlm_engine openai_compatible \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--api_key EMPTY \
--base_url http://localhost:8000/v1 \
--concurrent_batch_size 4
Use gpt-4o-mini to process a PDF. Since --output_path is not specified, outputs and logs will be written to the current working directory.
# OpenAI API
export OPENAI_API_KEY=<api key>
vlm4ocr --input_path /examples/synthesized_data/GPT-4o_synthesized_note_1.pdf \
--output_mode HTML \
--log \
--vlm_engine openai \
--model gpt-4o-mini \
--concurrent_batch_size 4
Use a locally deployed Qwen3.5-35B-A3B via vLLM. Pass --extra_body to disable the model's thinking mode:
vlm4ocr --input_path /examples/synthesized_data/ \
--output_path /examples/ocr_output/ \
--skip_existing \
--output_mode markdown \
--log \
--vlm_engine vllm \
--model Qwen/Qwen3.5-35B-A3B \
--base_url http://localhost:8000/v1 \
--max_new_tokens 16384 \
--temperature 0.7 \
--top_p 0.8 \
--presence_penalty 1.5 \
--extra_body '{"top_k": 20, "min_p": 0.0, "chat_template_kwargs": {"enable_thinking": false}}' \
--concurrent_batch_size 4
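Shell quoting of the JSON passed to --extra_body is easy to get wrong. The following Python sketch builds the same argument from a dictionary and quotes it for a POSIX shell. It only formats the string and does not call vlm4ocr; the variable names are illustrative.

```python
import json
import shlex

# Same fields as in the command above; adjust them for your model.
extra_body = {
    "top_k": 20,
    "min_p": 0.0,
    "chat_template_kwargs": {"enable_thinking": False},
}

# json.dumps produces valid JSON (Python False becomes JSON false).
extra_body_json = json.dumps(extra_body)

# shlex.quote wraps the string so a POSIX shell passes it as one argument.
cli_fragment = f"--extra_body {shlex.quote(extra_body_json)}"
print(cli_fragment)
```

The printed fragment can be pasted into the command in place of the hand-written --extra_body line.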
Extract structured JSON from a document by describing the fields in a prompt file:
vlm4ocr --input_path /examples/invoice.pdf \
--output_mode JSON \
--user_prompt_file /examples/prompts/invoice_schema.txt \
--output_path /examples/ocr_output/ \
--vlm_engine openai \
--model gpt-4o \
--concurrent_batch_size 4
Detect and localize all text regions (full-text bbox OCR) across a directory of images:
vlm4ocr --input_path /examples/images/ \
--output_mode bbox \
--output_path /examples/bbox_output/ \
--vlm_engine openai \
--model gpt-4o \
--concurrent_batch_size 4
Target specific fields (targeted bbox extraction) with a user prompt:
vlm4ocr --input_path /examples/images/ \
--output_mode bbox \
--user_prompt "Extract all price and total amount fields." \
--output_path /examples/bbox_output/ \
--vlm_engine openai \
--model gpt-4o \
--concurrent_batch_size 4
Usage
The CLI parameters are grouped into categories to manage the OCR process.
Input/Output Options
- --input_path: Specify a single input file or a directory with multiple files for OCR.
- --output_mode: Output format. One of text, markdown, HTML, JSON, or bbox. (default: markdown)
  - JSON: extracts structured data; requires --user_prompt or --user_prompt_file to define the JSON structure.
  - bbox: detects and localizes text regions. Writes one .json file per document (bounding box coordinates and text) and one annotated .png per page. Without a user prompt, performs full-text OCR over the whole page; with a user prompt, targets the described fields.
- --output_path: If input_path is a directory of multiple files, this should be an output directory. If input is a single file, this can be a full file path or a directory. If not provided, results are saved to the current working directory.
- --skip_existing: Skip processing files that already have OCR results in the output directory.
Image Processing Parameters
- --rotate_correction: Apply automatic rotation correction. Choices: tesseract (requires Tesseract OCR), vlm (prompts the configured VLM). Omit to disable. (default: disabled)
- --max_dimension_pixels: Maximum dimension (width or height) in pixels for input images. Images larger than this will be resized while maintaining aspect ratio. (default: 4000)
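To see what the --max_dimension_pixels limit means for a given image, the following Python sketch computes the size an image would have after being scaled to fit while keeping its aspect ratio. The function name `fit_within` is hypothetical, and the exact rounding and resampling used by the tool are not stated here, so treat the numbers as approximate.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_dimension_pixels: int = 4000) -> Tuple[int, int]:
    """Return (width, height) scaled so the longer side fits the limit."""
    longest = max(width, height)
    if longest <= max_dimension_pixels:
        # Already within the limit: no resizing needed.
        return width, height
    # Scale both sides by the same factor to keep the aspect ratio.
    new_width = max(1, round(width * max_dimension_pixels / longest))
    new_height = max(1, round(height * max_dimension_pixels / longest))
    return new_width, new_height


print(fit_within(6000, 3000))  # (4000, 2000)
print(fit_within(800, 600))    # (800, 600)
```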
VLM Engine Selection
- --vlm_engine: VLM backend. One of openai, azure_openai, ollama, openai_compatible, vllm, sglang, or openrouter.
- --model: VLM model name.
- --max_new_tokens: Maximum output tokens. (default: 4096)
- --temperature: Sampling temperature. (default: model default)
- --top_p: Sampling top-p. (default: model default)
- --presence_penalty: Presence penalty. (default: model default)
- --extra_body: Additional request body fields as a JSON string. Useful for engine-specific parameters such as top_k, min_p, or chat_template_kwargs (e.g. to disable thinking mode on Qwen3 models).
OpenAI & OpenAI-Compatible Options (openai, openai_compatible, vllm, sglang, openrouter)
- --api_key: API key. Can also be set via OPENAI_API_KEY (OpenAI) or OPENROUTER_API_KEY (openrouter) environment variable. Optional for vllm and sglang (local servers typically don't require one).
- --base_url: Base URL for the API. Required for openai_compatible. Defaults: vllm → http://localhost:8000/v1, sglang → http://localhost:30000/v1, openrouter → https://openrouter.ai/api/v1.
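The --base_url rules above can be summarized in a short Python sketch. This is not the tool's implementation; `resolve_base_url` is a hypothetical helper, and it assumes that an explicitly passed --base_url overrides the engine default.

```python
from typing import Optional

# Defaults as listed above. Engines missing from this table (e.g. openai)
# have no default documented in this section.
DEFAULT_BASE_URLS = {
    "vllm": "http://localhost:8000/v1",
    "sglang": "http://localhost:30000/v1",
    "openrouter": "https://openrouter.ai/api/v1",
}


def resolve_base_url(vlm_engine: str, base_url: Optional[str] = None) -> Optional[str]:
    """Pick the base URL for an engine; assumes an explicit value wins."""
    if base_url:
        return base_url
    if vlm_engine == "openai_compatible":
        raise ValueError("--base_url is required for openai_compatible")
    # Returns None when no default is listed for the engine.
    return DEFAULT_BASE_URLS.get(vlm_engine)


print(resolve_base_url("vllm"))
print(resolve_base_url("sglang", "http://gpu-box:30000/v1"))
```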
Azure OpenAI Options
- --azure_api_key: Azure API key. Can also be set via AZURE_OPENAI_API_KEY.
- --azure_endpoint: Azure endpoint URL. Can also be set via AZURE_OPENAI_ENDPOINT.
- --azure_api_version: Azure API version. Can also be set via AZURE_OPENAI_API_VERSION.
Ollama Options
- --ollama_host: Ollama host URL. (default: http://localhost:11434)
- --ollama_num_ctx: Context length. (default: 4096)
- --ollama_keep_alive: Keep-alive seconds. (default: 300)
OCR Engine Parameters
- --user_prompt: Custom user prompt as an inline string. For multi-line prompts prefer --user_prompt_file. If both are provided, --user_prompt takes precedence with a warning.
- --user_prompt_file: Path to a text file containing the user prompt. Useful for long prompts or JSON schema definitions.
- --system_prompt: Custom system prompt as an inline string. Overrides the built-in default for the selected output mode. If both --system_prompt and --system_prompt_file are provided, --system_prompt takes precedence with a warning.
- --system_prompt_file: Path to a text file containing the system prompt.
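The precedence between an inline prompt and a prompt file, as described above, can be sketched in Python as follows. `resolve_prompt` is a hypothetical helper written for illustration; the warning text and the UTF-8 file encoding are assumptions.

```python
import warnings
from pathlib import Path
from typing import Optional


def resolve_prompt(inline: Optional[str], file_path: Optional[str]) -> Optional[str]:
    """Return the prompt to use; the inline string wins over the file."""
    if inline is not None and file_path is not None:
        # Both given: keep the inline prompt and warn, as the options describe.
        warnings.warn("Both an inline prompt and a prompt file were given; using the inline prompt.")
        return inline
    if inline is not None:
        return inline
    if file_path is not None:
        # Encoding is an assumption of this sketch.
        return Path(file_path).read_text(encoding="utf-8")
    # Neither given: the caller falls back to the built-in default.
    return None


print(resolve_prompt("Extract all price and total amount fields.", None))
```

The same rule applies to both the user prompt pair and the system prompt pair.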
Processing Options
- --concurrent_batch_size: Number of images/pages to process concurrently. Set to 1 for sequential processing. (default: 4)
- --max_file_load: Number of input files to pre-load. Set to -1 for automatic: 2 × concurrent_batch_size. (default: -1)
- --log: Write logs to a timestamped file in the output directory.
- --debug: Enable debug-level logging on the console (and in the log file if --log is active).
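As a rough illustration of these two options, the Python sketch below resolves the automatic --max_file_load value and caps the number of in-flight pages with a semaphore. It is not how the tool is implemented: `resolve_max_file_load`, `process_pages`, and the sleep that stands in for a VLM request are all hypothetical.

```python
import asyncio
from typing import List, Tuple


def resolve_max_file_load(max_file_load: int, concurrent_batch_size: int) -> int:
    """-1 means automatic: twice the concurrent batch size."""
    return 2 * concurrent_batch_size if max_file_load == -1 else max_file_load


async def process_pages(pages: List[str], concurrent_batch_size: int) -> Tuple[List[str], int]:
    """Process pages with at most concurrent_batch_size in flight; return results and peak."""
    semaphore = asyncio.Semaphore(concurrent_batch_size)
    in_flight = 0
    peak = 0

    async def ocr_page(page: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a VLM request
            in_flight -= 1
            return f"text of {page}"

    results = await asyncio.gather(*(ocr_page(p) for p in pages))
    return list(results), peak


pages = [f"page{i}" for i in range(1, 11)]
results, peak = asyncio.run(process_pages(pages, concurrent_batch_size=4))
print(len(results), "pages processed; peak concurrency:", peak)
print("files pre-loaded:", resolve_max_file_load(-1, 4))
```

With a batch size of 1 the same loop processes pages strictly one after another.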
Output Files
| --output_mode | Output per document |
|---|---|
| markdown | <filename>_ocr.md |
| HTML | <filename>_ocr.html |
| text | <filename>_ocr.txt |
| JSON | <filename>_ocr.json |
| bbox | <filename>_ocr.json + <filename>_ocr_page1.png, _page2.png, … |
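For scripts that collect results afterwards, the naming pattern in the table above can be expressed as a small Python helper. `expected_outputs` is hypothetical, and because this section does not say whether <filename> includes the original extension, the sketch takes it verbatim as a string.

```python
from typing import List

# Extension per output mode, taken from the table above.
OUTPUT_EXTENSIONS = {
    "markdown": "md",
    "HTML": "html",
    "text": "txt",
    "JSON": "json",
    "bbox": "json",
}


def expected_outputs(filename: str, output_mode: str, num_pages: int = 1) -> List[str]:
    """List the output file names expected for one document."""
    outputs = [f"{filename}_ocr.{OUTPUT_EXTENSIONS[output_mode]}"]
    if output_mode == "bbox":
        # bbox mode also writes one annotated PNG per page, numbered from 1.
        outputs += [f"{filename}_ocr_page{page}.png" for page in range(1, num_pages + 1)]
    return outputs


print(expected_outputs("note", "markdown"))
print(expected_outputs("scan", "bbox", num_pages=2))
```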
The bbox JSON structure is: