LLM Inference Engine
We provide a common interface so that different LLM inference engines can be used in the information extraction workflow. The built-in engines are VLLMInferenceEngine, SGLangInferenceEngine, OpenAIInferenceEngine, AzureOpenAIInferenceEngine, OpenRouterInferenceEngine, HuggingFaceHubInferenceEngine, OllamaInferenceEngine, LlamaCppInferenceEngine, and LiteLLMInferenceEngine. For customization, see Customize inference engine below. Inference engines accept an LLMConfig object in which sampling parameters (e.g., temperature, top-p, top-k, maximum new tokens) and reasoning configuration (e.g., OpenAI o-series models, Qwen3) can be set.
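For example (a minimal sketch, assuming every engine accepts the config keyword as described above), the same config object can be reused across engines; the classes and parameter names below appear in the sections that follow.
from llm_ie.engines import OpenAIInferenceEngine, OllamaInferenceEngine, BasicLLMConfig
# One config object holding the sampling parameters...
config = BasicLLMConfig(temperature=0.0, max_new_tokens=1024)
# ...can be passed to any built-in engine.
openai_engine = OpenAIInferenceEngine(model="gpt-4.1-mini", config=config)
ollama_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0", config=config)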
vLLM
The vLLM support follows the vLLM OpenAI-compatible server. For more parameters, please refer to the vLLM documentation. Below are examples for different models.
Meta-Llama-3.1-8B-Instruct
Start the server from the command line.
CUDA_VISIBLE_DEVICES=<GPU#> vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--api-key MY_API_KEY \
--tensor-parallel-size <# of GPUs to use>
Use CUDA_VISIBLE_DEVICES to specify which GPUs to use and set --tensor-parallel-size accordingly. The --api-key is optional.
The default port is 8000; use --port to change it.
Define inference engine
from llm_ie.engines import VLLMInferenceEngine
inference_engine = VLLMInferenceEngine(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
The model must match the model name used when starting the server.
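If the server was started with --api-key or a non-default --port, pass those values to the engine as well. The sketch below assumes the constructor accepts base_url and api_key keyword arguments (mirroring the OpenAI-compatible examples later on this page); check the API reference for the exact parameter names.
from llm_ie.engines import VLLMInferenceEngine
# Assumed keyword arguments: base_url points at the vLLM server, api_key matches --api-key.
inference_engine = VLLMInferenceEngine(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                                       base_url="http://localhost:8000/v1",
                                       api_key="MY_API_KEY")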
Qwen3-30B-A3B (hybrid thinking mode)
Start the server from the command line. Specify --reasoning-parser qwen3 to enable the reasoning parser.
vllm serve Qwen/Qwen3-30B-A3B \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--reasoning-parser qwen3
from llm_ie.engines import VLLMInferenceEngine, Qwen3LLMConfig
# Thinking mode
inference_engine = VLLMInferenceEngine(model="Qwen/Qwen3-30B-A3B",
config=Qwen3LLMConfig(thinking_mode=True, temperature=0.6, top_p=0.95, top_k=20))
# Non-thinking mode
inference_engine = VLLMInferenceEngine(model="Qwen/Qwen3-30B-A3B",
config=Qwen3LLMConfig(thinking_mode=False, temperature=0.7, top_p=0.8, top_k=20))
Qwen3-30B-A3B-Thinking-2507
Start the server from the command line. Specify --reasoning-parser qwen3 to enable the reasoning parser.
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--reasoning-parser qwen3
from llm_ie.engines import VLLMInferenceEngine, ReasoningLLMConfig
inference_engine = VLLMInferenceEngine(model="Qwen/Qwen3-30B-A3B-Thinking-2507",
config=ReasoningLLMConfig(temperature=0.6, top_p=0.95, top_k=20))
gpt-oss-120b
Start the server from the command line. Specify --reasoning-parser openai_gptoss to enable the reasoning parser.
vllm serve openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--reasoning-parser openai_gptoss
from llm_ie.engines import VLLMInferenceEngine, ReasoningLLMConfig
inference_engine = VLLMInferenceEngine(model="openai/gpt-oss-120b",
config=ReasoningLLMConfig(temperature=1.0, top_p=1.0, top_k=0))
SGLang
The SGLang support follows the SGLang OpenAI-compatible APIs. For more parameters, please refer to the SGLang documentation. Below are examples for different models.
Meta-Llama-3.1-8B-Instruct
Start the server from the command line.
CUDA_VISIBLE_DEVICES=<GPU#> python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--api-key MY_API_KEY \
--tensor-parallel-size <# of GPUs to use>
Use CUDA_VISIBLE_DEVICES to specify which GPUs to use and set --tensor-parallel-size accordingly. The --api-key is optional.
The default port is 30000; use --port to change it.
Define inference engine
from llm_ie.engines import SGLangInferenceEngine
inference_engine = SGLangInferenceEngine(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
The model must match the model name used when starting the server.
Qwen3-30B-A3B (hybrid thinking mode)
Start the server from the command line. Specify --reasoning-parser qwen3 to enable the reasoning parser.
python3 -m sglang.launch_server \
--model-path Qwen/Qwen3-30B-A3B \
--reasoning-parser qwen3 \
--tensor-parallel-size <# of GPUs to use> \
--context-length 32000
from llm_ie.engines import SGLangInferenceEngine, Qwen3LLMConfig
# Thinking mode
inference_engine = SGLangInferenceEngine(model="Qwen/Qwen3-30B-A3B",
config=Qwen3LLMConfig(thinking_mode=True, temperature=0.6, top_p=0.95, top_k=20))
# Non-thinking mode
inference_engine = SGLangInferenceEngine(model="Qwen/Qwen3-30B-A3B",
config=Qwen3LLMConfig(thinking_mode=False, temperature=0.7, top_p=0.8, top_k=20))
Qwen3-30B-A3B-Thinking-2507
Start the server from the command line. Specify --reasoning-parser qwen3-thinking to enable the reasoning parser.
python3 -m sglang.launch_server \
--model-path Qwen/Qwen3-30B-A3B-Thinking-2507 \
--reasoning-parser qwen3-thinking \
--tensor-parallel-size <# of GPUs to use> \
--context-length 32000
from llm_ie.engines import SGLangInferenceEngine, ReasoningLLMConfig
inference_engine = SGLangInferenceEngine(model="Qwen/Qwen3-30B-A3B-Thinking-2507",
config=ReasoningLLMConfig(temperature=0.6, top_p=0.95, top_k=20))
gpt-oss-120b
Start the server from the command line. Specify --reasoning-parser gpt-oss to enable the reasoning parser.
python3 -m sglang.launch_server \
--model-path <model path> \
--served-model-name openai/gpt-oss-120b \
--reasoning-parser gpt-oss \
--tensor-parallel-size <# of GPUs to use>
from llm_ie.engines import SGLangInferenceEngine, ReasoningLLMConfig
inference_engine = SGLangInferenceEngine(model="openai/gpt-oss-120b",
config=ReasoningLLMConfig(temperature=1.0, top_p=1.0, top_k=0))
OpenAI API & Compatible Services
In bash, save the API key to the environment variable OPENAI_API_KEY.
In Python, create the inference engine and specify the model name. For the available models, refer to the OpenAI documentation. For more parameters, see the OpenAI API reference.
OpenAI models
gpt-4.1-mini
from llm_ie.engines import OpenAIInferenceEngine, BasicLLMConfig
inference_engine = OpenAIInferenceEngine(model="gpt-4.1-mini",
config=BasicLLMConfig(temperature=0.0, max_new_tokens=1024))
o-series reasoning models
For OpenAI reasoning models (o-series), pass an OpenAIReasoningLLMConfig object to the OpenAIInferenceEngine constructor.
from llm_ie.engines import OpenAIInferenceEngine, OpenAIReasoningLLMConfig
inference_engine = OpenAIInferenceEngine(model="o4-mini",
config=OpenAIReasoningLLMConfig(reasoning_effort="low"))
OpenAI compatible services
For OpenAI-compatible services (OpenRouter, for example), specify the base_url:
from llm_ie.engines import OpenAIInferenceEngine, BasicLLMConfig
inference_engine = OpenAIInferenceEngine(base_url="https://openrouter.ai/api/v1", model="meta-llama/llama-4-scout",
config=BasicLLMConfig(temperature=0.0, max_new_tokens=1024))
Azure OpenAI API
In bash, save the endpoint and API key to the environment variables AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY.
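The variables can also be set from Python before the engine is created; the values below are placeholders.
import os
# Placeholders; replace with your Azure OpenAI resource endpoint and key.
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<your-resource>.openai.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = "<your API key>"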
In Python, create the inference engine and specify the model name. For the available models, refer to the Azure OpenAI documentation. For more parameters, see the Azure OpenAI reference.
gpt-4.1-mini
from llm_ie.engines import AzureOpenAIInferenceEngine, BasicLLMConfig
inference_engine = AzureOpenAIInferenceEngine(model="gpt-4.1-mini", config=BasicLLMConfig(temperature=0.0, max_new_tokens=1024))
o-series reasoning models
For reasoning models (o-series), pass an OpenAIReasoningLLMConfig object to the AzureOpenAIInferenceEngine constructor.
from llm_ie.engines import AzureOpenAIInferenceEngine, OpenAIReasoningLLMConfig
inference_engine = AzureOpenAIInferenceEngine(model="o1-mini",
config=OpenAIReasoningLLMConfig(reasoning_effort="low"))
OpenRouter
We provide an interface for the OpenRouter service. To use OpenRouter, sign up on their website and get an API key. For more details, refer to the OpenRouter documentation.
In bash, save the API key to the environment variable OPENROUTER_API_KEY.
Meta-Llama-3.1-8B-Instruct
Define inference engine
import os
from llm_ie.engines import OpenRouterInferenceEngine
inference_engine = OpenRouterInferenceEngine(model="meta-llama/llama-3.1-8b-instruct")
The model must match the model identifier listed on OpenRouter.
Qwen3-30B-A3B (hybrid thinking mode)
Define inference engine
from llm_ie.engines import OpenRouterInferenceEngine, Qwen3LLMConfig
# Thinking mode
inference_engine = OpenRouterInferenceEngine(model="qwen/qwen3-30b-a3b",
config=Qwen3LLMConfig(thinking_mode=True, temperature=0.6, top_p=0.95, top_k=20))
# Non-thinking mode
inference_engine = OpenRouterInferenceEngine(model="qwen/qwen3-30b-a3b",
config=Qwen3LLMConfig(thinking_mode=False, temperature=0.7, top_p=0.8, top_k=20))
Qwen3-30B-A3B-Thinking-2507
Define inference engine
from llm_ie.engines import OpenRouterInferenceEngine, ReasoningLLMConfig
inference_engine = OpenRouterInferenceEngine(model="qwen/qwen3-30b-a3b-thinking-2507",
config=ReasoningLLMConfig(temperature=0.6, top_p=0.95, top_k=20))
gpt-oss-120b
Define inference engine
from llm_ie.engines import OpenRouterInferenceEngine, ReasoningLLMConfig
inference_engine = OpenRouterInferenceEngine(model="openai/gpt-oss-120b",
config=ReasoningLLMConfig(temperature=1.0, top_p=1.0, top_k=0))
Huggingface_hub
The model can be a model id hosted on the Hugging Face Hub or a URL to a deployed Inference Endpoint. Refer to the Inference Client documentation for more details.
from llm_ie.engines import HuggingFaceHubInferenceEngine
inference_engine = HuggingFaceHubInferenceEngine(model="meta-llama/Meta-Llama-3-8B-Instruct")
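A deployed Inference Endpoint can be used the same way by passing its URL as the model; the URL below is a placeholder.
from llm_ie.engines import HuggingFaceHubInferenceEngine
# Placeholder URL; replace with your deployed Inference Endpoint.
inference_engine = HuggingFaceHubInferenceEngine(model="https://<your-endpoint>.endpoints.huggingface.cloud")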
Ollama
The model_name must match a name on the Ollama library. Use the command ollama ls to check your local model list. num_ctx determines the context length the LLM considers during text generation. Empirically, a longer context length gives better performance while consuming more memory and increasing computation. keep_alive regulates the lifespan of the LLM in memory: it is the number of seconds to keep the model loaded after the last API call. The default is 5 minutes (300 seconds).
from llm_ie.engines import OllamaInferenceEngine
inference_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0", num_ctx=4096, keep_alive=300)
Llama-cpp-python
The repo_id and gguf_filename must match the ones on the Hugging Face repo to ensure the correct model is loaded. n_ctx determines the context length the LLM considers during text generation. Empirically, a longer context length gives better performance while consuming more memory and increasing computation. Note that when n_ctx is less than the prompt length, Llama.cpp throws an exception. n_gpu_layers indicates the number of model layers to offload to GPU; the default is -1, which offloads all layers (the entire LLM). Flash attention (flash_attn) is supported by Llama.cpp. verbose indicates whether model information should be displayed. For more input parameters, see 🦙 Llama-cpp-python.
from llm_ie.engines import LlamaCppInferenceEngine
inference_engine = LlamaCppInferenceEngine(repo_id="bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF",
gguf_filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
n_ctx=4096,
n_gpu_layers=-1,
flash_attn=True,
verbose=False)
LiteLLM
LiteLLM is an adapter project that unifies many proprietary and open-source LLM APIs. Popular inference servers, including OpenAI, Hugging Face Hub, and Ollama, are supported via its interface. For more details, refer to the LiteLLM GitHub page.
To use LiteLLM with LLM-IE, import the LiteLLMInferenceEngine and follow LiteLLM's model naming convention.
import os
from llm_ie.engines import LiteLLMInferenceEngine
# Hugging Face serverless inference (requires HF_TOKEN; it can also be exported in the shell)
os.environ["HF_TOKEN"] = "<your Hugging Face token>"
inference_engine = LiteLLMInferenceEngine(model="huggingface/meta-llama/Meta-Llama-3-8B-Instruct")
# OpenAI GPT models (requires OPENAI_API_KEY)
os.environ["OPENAI_API_KEY"] = "<your OpenAI API key>"
inference_engine = LiteLLMInferenceEngine(model="openai/gpt-4o-mini")
# OpenAI-compatible local server
inference_engine = LiteLLMInferenceEngine(model="openai/Llama-3.1-8B-Instruct", base_url="http://localhost:8000/v1", api_key="EMPTY")
# Ollama
inference_engine = LiteLLMInferenceEngine(model="ollama/llama3.1:8b-instruct-q8_0")
Test inference engine configuration
To test the inference engine, use the chat() method.
from llm_ie.engines import OllamaInferenceEngine
inference_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")
inference_engine.chat(messages=[{"role": "user", "content":"Hi"}], verbose=True)
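The chat() method also supports streaming. A minimal sketch (per the InferenceEngine interface below, stream=True returns a generator; the exact chunk format depends on the engine):
from llm_ie.engines import OllamaInferenceEngine
inference_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")
# stream=True yields output chunks in real time instead of returning a single string.
for chunk in inference_engine.chat(messages=[{"role": "user", "content": "Hi"}], stream=True):
    print(chunk)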
Customize inference engine
The abstract class InferenceEngine defines the interface and the required chat() method. Inherit this class to support a customized API.
import abc
from typing import Any, Dict, Generator, List, Union

from llm_ie.engines import LLMConfig  # base configuration class; import it when subclassing outside the package


class InferenceEngine:
    @abc.abstractmethod
    def __init__(self, config: LLMConfig, **kwrs):
        """
        This is an abstract class to provide interfaces for LLM inference engines.
        Children classes that inherit this class can be used in extractors. Must implement the chat() method.

        Parameters:
        ----------
        config : LLMConfig
            the LLM configuration. Must be a child class of LLMConfig.
        """
        return NotImplemented

    @abc.abstractmethod
    def chat(self, messages: List[Dict[str, str]],
             verbose: bool = False, stream: bool = False) -> Union[str, Generator[Dict[str, str], None, None]]:
        """
        This method inputs chat messages and outputs LLM generated text.

        Parameters:
        ----------
        messages : List[Dict[str, str]]
            a list of dicts with "role" and "content". role must be one of {"system", "user", "assistant"}.
        verbose : bool, Optional
            if True, LLM generated text will be printed in the terminal in real time.
        stream : bool, Optional
            if True, returns a generator that yields the output in real time.
        """
        return NotImplemented

    def _format_config(self) -> Dict[str, Any]:
        """
        This method formats the LLM configuration with the correct keys for the inference engine.

        Return : Dict[str, Any]
            the config parameters.
        """
        return NotImplemented
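As a minimal sketch of a custom engine, the class below forwards chat messages to an OpenAI-compatible endpoint using the openai client. It assumes InferenceEngine and the config classes are importable from llm_ie.engines as in the examples above; the constructor arguments, the BasicLLMConfig fallback, and the hard-coded sampling parameters are illustrative assumptions, not part of the library.
from typing import Dict, Generator, List, Optional, Union

from openai import OpenAI  # assumption: the target API is OpenAI-compatible
from llm_ie.engines import InferenceEngine, LLMConfig, BasicLLMConfig


class MyAPIInferenceEngine(InferenceEngine):
    """Hypothetical engine that wraps an OpenAI-compatible chat endpoint."""
    def __init__(self, model: str, base_url: str, api_key: str = "EMPTY",
                 config: Optional[LLMConfig] = None, **kwrs):
        self.model = model
        self.config = config if config is not None else BasicLLMConfig()
        self.client = OpenAI(base_url=base_url, api_key=api_key)

    def chat(self, messages: List[Dict[str, str]],
             verbose: bool = False, stream: bool = False) -> Union[str, Generator[Dict[str, str], None, None]]:
        # Non-streaming only in this sketch; map self.config to API parameters as needed.
        response = self.client.chat.completions.create(model=self.model,
                                                       messages=messages,
                                                       temperature=0.0)  # illustrative; derive from self.config in practice
        text = response.choices[0].message.content
        if verbose:
            print(text)
        return text
The custom engine can then be passed to extractors in the same way as the built-in engines.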