usage: vlm4ocr [-h] --input_path INPUT_PATH
               [--output_mode {markdown,HTML,text}]
               [--output_path OUTPUT_PATH] [--skip_existing]
               [--rotate_correction]
               [--max_dimension_pixels MAX_DIMENSION_PIXELS] --vlm_engine
               {openai,azure_openai,ollama,openai_compatible} --model MODEL
               [--max_new_tokens MAX_NEW_TOKENS] [--temperature TEMPERATURE]
               [--api_key API_KEY] [--base_url BASE_URL]
               [--azure_api_key AZURE_API_KEY]
               [--azure_endpoint AZURE_ENDPOINT]
               [--azure_api_version AZURE_API_VERSION]
               [--ollama_host OLLAMA_HOST] [--ollama_num_ctx OLLAMA_NUM_CTX]
               [--ollama_keep_alive OLLAMA_KEEP_ALIVE]
               [--user_prompt USER_PROMPT]
               [--concurrent_batch_size CONCURRENT_BATCH_SIZE]
               [--max_file_load MAX_FILE_LOAD] [--log] [--debug]
VLM4OCR: Perform OCR on images, PDFs, or TIFF files using Vision Language
Models. Processing is concurrent by default.

options:
  -h, --help            show this help message and exit

Input/Output Options:
  --input_path INPUT_PATH
                        Path to a single input file or a directory of files.
                        (default: None)
  --output_mode {markdown,HTML,text}
                        Output format. (default: markdown)
  --output_path OUTPUT_PATH
                        Optional: path to save OCR results. If input_path is
                        a directory, this should be an output directory. If
                        input_path is a single file, this can be a full file
                        path or a directory. If not provided, results are
                        saved to the current working directory (or a
                        sub-directory for logs if --log is used). (default:
                        None)
  --skip_existing       Skip processing files that already have OCR results
                        in the output directory. (default: False)
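
For instance, the Input/Output flags can be combined to OCR a whole directory and resume a partially completed run. The `./scans` and `./ocr_out` paths and the model name below are illustrative placeholders, not values taken from this help text:

```shell
# OCR every file in ./scans to HTML, skipping files already converted.
# Paths and model name are placeholders; adjust for your setup.
vlm4ocr \
  --input_path ./scans \
  --output_mode HTML \
  --output_path ./ocr_out \
  --skip_existing \
  --vlm_engine ollama \
  --model llama3.2-vision
```

Note that the required `--vlm_engine` and `--model` flags must always be supplied, even when only the I/O options are of interest.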

Image Processing Parameters:
  --rotate_correction   Enable automatic rotation correction for input
                        images. This requires Tesseract OCR to be installed
                        and configured correctly. (default: False)
  --max_dimension_pixels MAX_DIMENSION_PIXELS
                        Maximum dimension (width or height) in pixels for
                        input images. Images larger than this will be resized
                        to fit within this limit while maintaining aspect
                        ratio. (default: 4000)
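
A sketch of the image-processing flags on a single scan. The input path and model name are placeholders, and `--rotate_correction` assumes a working Tesseract install:

```shell
# Auto-correct rotated pages and downscale anything larger than 2000 px.
# ./scan.tiff and the model name are placeholders.
vlm4ocr \
  --input_path ./scan.tiff \
  --rotate_correction \
  --max_dimension_pixels 2000 \
  --vlm_engine ollama \
  --model llama3.2-vision
```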

VLM Engine Options:
  --vlm_engine {openai,azure_openai,ollama,openai_compatible}
                        VLM engine to use. (default: None)
  --model MODEL         Model identifier for the VLM engine. (default: None)
  --max_new_tokens MAX_NEW_TOKENS
                        Maximum number of new tokens the VLM may generate.
                        (default: 4096)
  --temperature TEMPERATURE
                        Sampling temperature. (default: 0.0)
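
A minimal invocation using the OpenAI engine might look like the following. The model identifier `gpt-4o` and the `OPENAI_API_KEY` environment variable are assumptions for illustration; substitute your own model and credentials:

```shell
# Single-file OCR via the OpenAI engine with deterministic sampling.
# Model name and key variable are placeholders.
vlm4ocr \
  --input_path ./invoice.pdf \
  --vlm_engine openai \
  --model gpt-4o \
  --api_key "$OPENAI_API_KEY" \
  --temperature 0.0 \
  --max_new_tokens 8192
```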

OpenAI & OpenAI-Compatible Options:
  --api_key API_KEY     API key. (default: None)
  --base_url BASE_URL   Base URL for OpenAI-compatible services. (default:
                        None)

Azure OpenAI Options:
  --azure_api_key AZURE_API_KEY
                        Azure API key. (default: None)
  --azure_endpoint AZURE_ENDPOINT
                        Azure endpoint URL. (default: None)
  --azure_api_version AZURE_API_VERSION
                        Azure API version. (default: None)
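
With the `azure_openai` engine, `--model` typically names the Azure deployment. The endpoint, key variable, API version, and deployment name below are all placeholders:

```shell
# OCR via an Azure OpenAI deployment; every Azure-specific value here
# is a placeholder for your own resource configuration.
vlm4ocr \
  --input_path ./contract.pdf \
  --vlm_engine azure_openai \
  --model my-gpt4o-deployment \
  --azure_api_key "$AZURE_OPENAI_API_KEY" \
  --azure_endpoint https://my-resource.openai.azure.com \
  --azure_api_version 2024-02-15-preview
```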

Ollama Options:
  --ollama_host OLLAMA_HOST
                        Ollama host URL. (default: http://localhost:11434)
  --ollama_num_ctx OLLAMA_NUM_CTX
                        Context window length for Ollama. (default: 4096)
  --ollama_keep_alive OLLAMA_KEEP_ALIVE
                        Seconds to keep the model loaded after a request
                        (Ollama keep_alive). (default: 300)
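
The Ollama flags can point the tool at a remote server and tune model residency. The host address and model name are placeholders:

```shell
# Use an Ollama server on another machine, with a larger context window
# and the model kept loaded for 10 minutes between requests.
# Host and model name are placeholders.
vlm4ocr \
  --input_path ./notes.png \
  --vlm_engine ollama \
  --model llama3.2-vision \
  --ollama_host http://192.168.1.50:11434 \
  --ollama_num_ctx 8192 \
  --ollama_keep_alive 600
```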

OCR Engine Parameters:
  --user_prompt USER_PROMPT
                        Custom user prompt. (default: None)

Processing Options:
  --concurrent_batch_size CONCURRENT_BATCH_SIZE
                        Number of images/pages to process concurrently. Set
                        to 1 for sequential VLM calls. (default: 4)
  --max_file_load MAX_FILE_LOAD
                        Number of input files to pre-load. Set to -1 to use
                        2 * concurrent_batch_size automatically. (default:
                        -1)
  --log                 Enable writing logs to a timestamped file in the
                        output directory. (default: False)
  --debug               Enable debug-level logging for the console (and the
                        log file if --log is active). (default: False)
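
Putting the processing options together, a large batch run against a self-hosted OpenAI-compatible server might look like this. The paths, base URL, and model name are placeholders, and with the default `--max_file_load -1` this configuration pre-loads up to 16 files (2 * 8):

```shell
# Batch-OCR a directory with 8 concurrent VLM calls and file logging.
# Paths, base URL, API key, and model name are placeholders.
vlm4ocr \
  --input_path ./archive \
  --output_path ./archive_ocr \
  --vlm_engine openai_compatible \
  --base_url http://localhost:8000/v1 \
  --api_key dummy \
  --model Qwen2-VL-7B-Instruct \
  --concurrent_batch_size 8 \
  --log
```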