usage: vlm4ocr [-h] --input_path INPUT_PATH
               [--output_mode {markdown,HTML,text,JSON,bbox}]
               [--output_path OUTPUT_PATH] [--skip_existing]
               [--rotate_correction {tesseract,vlm}]
               [--max_dimension_pixels MAX_DIMENSION_PIXELS] --vlm_engine
               {openai,azure_openai,ollama,openai_compatible,vllm,sglang,openrouter}
               --model MODEL [--max_new_tokens MAX_NEW_TOKENS]
               [--temperature TEMPERATURE] [--top_p TOP_P]
               [--presence_penalty PRESENCE_PENALTY] [--extra_body EXTRA_BODY]
               [--api_key API_KEY] [--base_url BASE_URL]
               [--azure_api_key AZURE_API_KEY]
               [--azure_endpoint AZURE_ENDPOINT]
               [--azure_api_version AZURE_API_VERSION]
               [--ollama_host OLLAMA_HOST] [--ollama_num_ctx OLLAMA_NUM_CTX]
               [--ollama_keep_alive OLLAMA_KEEP_ALIVE]
               [--user_prompt USER_PROMPT]
               [--user_prompt_file USER_PROMPT_FILE]
               [--system_prompt SYSTEM_PROMPT]
               [--system_prompt_file SYSTEM_PROMPT_FILE]
               [--concurrent_batch_size CONCURRENT_BATCH_SIZE]
               [--max_file_load MAX_FILE_LOAD] [--log] [--debug]
VLM4OCR: Perform OCR on images, PDFs, or TIFF files using Vision Language
Models. Processing is concurrent by default.
options:
-h, --help show this help message and exit
Input/Output Options:
--input_path INPUT_PATH
Path to a single input file or a directory of files.
(required)
--output_mode {markdown,HTML,text,JSON,bbox}
Output format. 'JSON' requires --user_prompt or
--user_prompt_file to define the JSON structure.
'bbox' writes a per-document <basename>_ocr.json plus
per-page <basename>_ocr_page<N>.png annotated images.
(default: markdown)
--output_path OUTPUT_PATH
Optional: Path to save OCR results. If input_path is a
directory of multiple files, this should be an output
directory. If input is a single file, this can be a
full file path or a directory. If not provided,
results are saved to the current working directory (or
a sub-directory for logs if --log is used). (default:
None)
--skip_existing Skip processing files that already have OCR results in
the output directory. (default: False)
Image Processing Parameters:
--rotate_correction {tesseract,vlm}
Rotation correction method for input images.
'tesseract' requires Tesseract OCR to be installed.
'vlm' asks the configured VLM engine to determine the
rotation. Omit this option to disable rotation
correction. (default: None)
--max_dimension_pixels MAX_DIMENSION_PIXELS
Maximum dimension (width or height) in pixels for
input images. Images larger than this will be resized
to fit within this limit while maintaining aspect
ratio. (default: 4000)
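
The resize rule described for --max_dimension_pixels can be sketched as below. This is only an illustration of "fit within the limit while maintaining aspect ratio"; the function name fit_within and the rounding behaviour are assumptions, not the tool's actual implementation.

```python
def fit_within(width: int, height: int, max_dimension_pixels: int = 4000) -> tuple[int, int]:
    """Return a size whose longer side is at most max_dimension_pixels.

    Hypothetical helper: vlm4ocr's own resizing code may round differently.
    """
    longest = max(width, height)
    if longest <= max_dimension_pixels:
        # Already within the limit, so the size is returned unchanged.
        return width, height
    # Scale both sides by the same factor so the aspect ratio is preserved.
    new_width = max(1, round(width * max_dimension_pixels / longest))
    new_height = max(1, round(height * max_dimension_pixels / longest))
    return new_width, new_height


print(fit_within(6000, 3000))  # (4000, 2000)
print(fit_within(1200, 800))   # (1200, 800)
```
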
VLM Engine Options:
--vlm_engine {openai,azure_openai,ollama,openai_compatible,vllm,sglang,openrouter}
VLM engine to use. (required)
--model MODEL Model identifier for the VLM engine. (required)
--max_new_tokens MAX_NEW_TOKENS
Maximum number of new tokens the VLM may generate.
(default: 4096)
--temperature TEMPERATURE
Sampling temperature. (default: None)
--top_p TOP_P Nucleus sampling threshold (top_p). (default: None)
--presence_penalty PRESENCE_PENALTY
Presence penalty. (default: None)
--extra_body EXTRA_BODY
Extra body parameters as a JSON string (e.g.
'{"chat_template_kwargs": {"enable_thinking":
false}}'). (default: None)
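
Because --extra_body takes a JSON string, it can be easier to build the value from a dictionary than to type it by hand. The sketch below uses only the standard library and reproduces the example value shown above; how vlm4ocr forwards the parsed object to the engine is not shown here.

```python
import json
import shlex

# The same structure as the example in the help text.
extra_body = {"chat_template_kwargs": {"enable_thinking": False}}

# json.dumps produces valid JSON (note the lowercase `false`).
extra_body_arg = json.dumps(extra_body)

# shlex.quote wraps the string so a POSIX shell passes it as one argument.
print("--extra_body", shlex.quote(extra_body_arg))
# --extra_body '{"chat_template_kwargs": {"enable_thinking": false}}'

# Round-trip check: the string parses back to the original dictionary.
assert json.loads(extra_body_arg) == extra_body
```
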
OpenAI & OpenAI-Compatible Options:
--api_key API_KEY API key. (default: None)
--base_url BASE_URL Base URL for OpenAI-compatible services. (default:
None)
Azure OpenAI Options:
--azure_api_key AZURE_API_KEY
Azure API key. (default: None)
--azure_endpoint AZURE_ENDPOINT
Azure endpoint URL. (default: None)
--azure_api_version AZURE_API_VERSION
Azure API version. (default: None)
Ollama Options:
--ollama_host OLLAMA_HOST
Ollama host URL. (default: http://localhost:11434)
--ollama_num_ctx OLLAMA_NUM_CTX
Context length (num_ctx) for Ollama, in tokens.
(default: 4096)
--ollama_keep_alive OLLAMA_KEEP_ALIVE
Ollama keep_alive duration, in seconds. (default: 300)
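
As a rough illustration of how the required options and the Ollama options combine, the sketch below assembles an argument list using only flags listed in this help text. The helper name build_ollama_command, the input path "scans/" and the model name "example-vision-model" are placeholders, not values taken from the tool.

```python
import shlex


def build_ollama_command(input_path, model, host="http://localhost:11434",
                         num_ctx=4096, keep_alive=300):
    """Build a vlm4ocr argument list for the ollama engine (hypothetical helper)."""
    return [
        "vlm4ocr",
        "--input_path", input_path,      # required
        "--vlm_engine", "ollama",        # required
        "--model", model,                # required
        "--ollama_host", host,
        "--ollama_num_ctx", str(num_ctx),
        "--ollama_keep_alive", str(keep_alive),
    ]


# Placeholder path and model name, chosen only for illustration.
cmd = build_ollama_command("scans/", "example-vision-model", num_ctx=8192)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) unchanged, which avoids shell quoting issues.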
OCR Engine Parameters:
--user_prompt USER_PROMPT
Custom user prompt (inline string). For longer prompts
use --user_prompt_file. If both are given,
--user_prompt takes precedence and a warning is
issued. (default: None)
--user_prompt_file USER_PROMPT_FILE
Path to a text file containing the user prompt.
(default: None)
--system_prompt SYSTEM_PROMPT
Custom system prompt (inline string). For longer
prompts use --system_prompt_file. If both are given,
--system_prompt takes precedence and a warning is
issued. (default: None)
--system_prompt_file SYSTEM_PROMPT_FILE
Path to a text file containing the system prompt.
(default: None)
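
The precedence between an inline prompt and a prompt file can be pictured with the sketch below. It is an assumption-laden illustration: resolve_prompt is a hypothetical helper, and the UTF-8 encoding and the exact warning text are guesses rather than documented behaviour.

```python
import warnings
from pathlib import Path
from typing import Optional


def resolve_prompt(inline: Optional[str], file_path: Optional[str],
                   name: str = "user_prompt") -> Optional[str]:
    """Pick the prompt text: the inline value wins over the file (hypothetical helper)."""
    if inline is not None:
        if file_path is not None:
            # Both were supplied: keep the inline prompt and warn about the file.
            warnings.warn(f"Both --{name} and --{name}_file given; using --{name}.")
        return inline
    if file_path is not None:
        # Assumed encoding; the tool may read the file differently.
        return Path(file_path).read_text(encoding="utf-8")
    return None  # neither given, so the caller falls back to its default prompt
```
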
Processing Options:
--concurrent_batch_size CONCURRENT_BATCH_SIZE
Number of images/pages to process concurrently. Set to
1 for sequential processing of VLM calls. (default: 4)
--max_file_load MAX_FILE_LOAD
Number of input files to pre-load. Set to -1 to choose
the value automatically as 2 * concurrent_batch_size.
(default: -1)
--log Enable writing logs to a timestamped file in the
output directory. (default: False)
--debug Enable debug level logging for console (and file if
--log is active). (default: False)
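
The automatic setting for --max_file_load amounts to a small piece of arithmetic, sketched below. The function name resolve_max_file_load is hypothetical; only the rule (-1 means 2 * concurrent_batch_size) comes from the help text.

```python
def resolve_max_file_load(max_file_load: int = -1, concurrent_batch_size: int = 4) -> int:
    """Return the number of files to pre-load (hypothetical helper)."""
    if max_file_load == -1:
        # Automatic configuration: twice the concurrent batch size.
        return 2 * concurrent_batch_size
    return max_file_load


print(resolve_max_file_load())        # 8, using the defaults (-1 and 4)
print(resolve_max_file_load(-1, 16))  # 32
print(resolve_max_file_load(10, 16))  # 10, an explicit value is kept
```
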