LLM Inference Engine
We provide an interface for different LLM inference engines to work in the information extraction workflow. The built-in engines are LiteLLMInferenceEngine, OpenAIInferenceEngine, AzureOpenAIInferenceEngine, HuggingFaceHubInferenceEngine, OllamaInferenceEngine, and LlamaCppInferenceEngine. For customization, see Customize inference engine below. Inference engines accept an LLMConfig object in which sampling parameters (e.g., temperature, top-p, top-k, maximum new tokens) and reasoning configuration (e.g., for OpenAI o-series models, Qwen3) can be set.
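For example, sampling parameters can be packaged in a config object and passed to any engine. The snippet below is a minimal sketch: it assumes a concrete LLMConfig subclass named BasicLLMConfig with temperature and max_new_tokens parameters, so check your installed version for the exact class and parameter names.
from llm_ie.engines import OllamaInferenceEngine, BasicLLMConfig  # BasicLLMConfig name/path is an assumption

# Hypothetical: greedy decoding with a cap on generated tokens
config = BasicLLMConfig(temperature=0.0, max_new_tokens=1024)
inference_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0", config=config)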
LiteLLM
LiteLLM is an adapter project that unifies many proprietary and open-source LLM APIs. Popular inference servers, including OpenAI, Hugging Face Hub, and Ollama, are supported via its interface. For more details, refer to the LiteLLM GitHub page.
To use LiteLLM with LLM-IE, import the LiteLLMInferenceEngine and follow LiteLLM's model naming convention.
import os
from llm_ie.engines import LiteLLMInferenceEngine

# Hugging Face serverless inference (set your token here or export HF_TOKEN beforehand)
os.environ['HF_TOKEN'] = "<your_huggingface_token>"
inference_engine = LiteLLMInferenceEngine(model="huggingface/meta-llama/Meta-Llama-3-8B-Instruct")

# OpenAI GPT models
os.environ['OPENAI_API_KEY'] = "<your_openai_api_key>"
inference_engine = LiteLLMInferenceEngine(model="openai/gpt-4o-mini")

# OpenAI-compatible local server
inference_engine = LiteLLMInferenceEngine(model="openai/Llama-3.1-8B-Instruct", base_url="http://localhost:8000/v1", api_key="EMPTY")

# Ollama
inference_engine = LiteLLMInferenceEngine(model="ollama/llama3.1:8b-instruct-q8_0")
OpenAI API & Compatible Services
In bash, save the API key to the environment variable OPENAI_API_KEY.
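Alternatively, the key can be set from within Python before the engine is created; the snippet below uses the standard os.environ interface, from which the OpenAI client reads OPENAI_API_KEY (replace the placeholder with your own key).
import os

# Set the API key for the current process
os.environ["OPENAI_API_KEY"] = "<your_openai_api_key>"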
In Python, create the inference engine and specify the model name. For the available models, refer to the OpenAI webpage. For more parameters, see the OpenAI API reference.
from llm_ie.engines import OpenAIInferenceEngine
inference_engine = OpenAIInferenceEngine(model="gpt-4o-mini")
For OpenAI reasoning models (o-series), pass an OpenAIReasoningLLMConfig object to the OpenAIInferenceEngine constructor.
from llm_ie.engines import OpenAIInferenceEngine, OpenAIReasoningLLMConfig
inference_engine = OpenAIInferenceEngine(model="o1-mini",
                                         config=OpenAIReasoningLLMConfig(reasoning_effort="low"))
For OpenAI-compatible services (OpenRouter, for example):
from llm_ie.engines import OpenAIInferenceEngine
inference_engine = OpenAIInferenceEngine(base_url="https://openrouter.ai/api/v1", model="meta-llama/llama-4-scout")
Azure OpenAI API
In bash, save the endpoint name and API key to the environment variables AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY.
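The same can be done from within Python; the Azure OpenAI client reads these environment variables, and the placeholders below are yours to replace.
import os

# Endpoint and key for the Azure OpenAI resource
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<your-resource-name>.openai.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = "<your_azure_openai_api_key>"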
In Python, create the inference engine and specify the model name. For the available models, refer to the Azure OpenAI documentation. For more parameters, see the Azure OpenAI reference.
from llm_ie.engines import AzureOpenAIInferenceEngine
inference_engine = AzureOpenAIInferenceEngine(model="gpt-4o-mini")
For reasoning models (o-series), pass an OpenAIReasoningLLMConfig object to the AzureOpenAIInferenceEngine constructor.
from llm_ie.engines import AzureOpenAIInferenceEngine, OpenAIReasoningLLMConfig
inference_engine = AzureOpenAIInferenceEngine(model="o1-mini",
                                              config=OpenAIReasoningLLMConfig(reasoning_effort="low"))
Huggingface_hub
The model can be a model id hosted on the Hugging Face Hub or a URL to a deployed Inference Endpoint. Refer to the Inference Client documentation for more details.
from llm_ie.engines import HuggingFaceHubInferenceEngine
inference_engine = HuggingFaceHubInferenceEngine(model="meta-llama/Meta-Llama-3-8B-Instruct")
Ollama
The model_name must match a name on the Ollama library. Use the command line ollama ls to check your local model list. num_ctx determines the context length the LLM will consider during text generation. Empirically, a longer context length gives better performance while consuming more memory and increasing computation. keep_alive regulates the lifespan of the LLM: it indicates the number of seconds to keep the LLM loaded after the last API call. The default is 5 minutes (300 seconds).
from llm_ie.engines import OllamaInferenceEngine
inference_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0", num_ctx=4096, keep_alive=300)
vLLM
vLLM support follows the OpenAI-Compatible Server. For more parameters, please refer to the vLLM documentation.
Start the server
CUDA_VISIBLE_DEVICES=<GPU#> vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --api-key MY_API_KEY --tensor-parallel-size <# of GPUs to use>
Use CUDA_VISIBLE_DEVICES to specify which GPUs to use, and set --tensor-parallel-size accordingly. The --api-key is optional. The default port is 8000; --port sets a different port.
Define inference engine
from llm_ie.engines import OpenAIInferenceEngine
inference_engine = OpenAIInferenceEngine(base_url="http://localhost:8000/v1",
                                         api_key="MY_API_KEY",
                                         model="meta-llama/Meta-Llama-3.1-8B-Instruct")
The model must match the repo name specified when starting the server.
Llama-cpp-python
The repo_id and gguf_filename must match the ones on the Hugging Face repo to ensure the correct model is loaded. n_ctx determines the context length the LLM will consider during text generation. Empirically, a longer context length gives better performance while consuming more memory and increasing computation. Note that when n_ctx is less than the prompt length, Llama.cpp throws an exception. n_gpu_layers indicates the number of model layers to offload to GPU; the default is -1, which offloads all layers (the entire LLM). Flash attention (flash_attn) is supported by Llama.cpp. verbose indicates whether model information should be displayed. For more input parameters, see 🦙 Llama-cpp-python.
from llm_ie.engines import LlamaCppInferenceEngine
inference_engine = LlamaCppInferenceEngine(repo_id="bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF",
                                           gguf_filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
                                           n_ctx=4096,
                                           n_gpu_layers=-1,
                                           flash_attn=True,
                                           verbose=False)
Test inference engine configuration
To test the inference engine, use the chat() method.
from llm_ie.engines import OllamaInferenceEngine
inference_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")
inference_engine.chat(messages=[{"role": "user", "content":"Hi"}], verbose=True)
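chat() also supports streaming: per the InferenceEngine interface below, calling it with stream=True returns a generator of chunk dicts. The exact keys of each chunk depend on the engine, so the loop below is only a sketch.
for chunk in inference_engine.chat(messages=[{"role": "user", "content": "Hi"}], stream=True):
    # Each chunk is a dict; the exact keys depend on the engine
    print(chunk)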
Customize inference engine
The abstract class InferenceEngine defines the interface and the required method chat(). Inherit from this class to support a customized API.
class InferenceEngine:
    @abc.abstractmethod
    def __init__(self, config:LLMConfig, **kwrs):
        """
        This is an abstract class to provide interfaces for LLM inference engines.
        Children classes that inherit this class can be used in extractors. Must implement chat() method.

        Parameters:
        ----------
        config : LLMConfig
            the LLM configuration. Must be a child class of LLMConfig.
        """
        return NotImplemented

    @abc.abstractmethod
    def chat(self, messages:List[Dict[str,str]],
             verbose:bool=False, stream:bool=False) -> Union[str, Generator[Dict[str, str], None, None]]:
        """
        This method inputs chat messages and outputs LLM generated text.

        Parameters:
        ----------
        messages : List[Dict[str,str]]
            a list of dict with role and content. role must be one of {"system", "user", "assistant"}
        verbose : bool, Optional
            if True, LLM generated text will be printed in terminal in real-time.
        stream : bool, Optional
            if True, returns a generator that yields the output in real-time.
        """
        return NotImplemented

    def _format_config(self) -> Dict[str, Any]:
        """
        This method formats the LLM configuration with the correct keys for the inference engine.

        Return : Dict[str, Any]
            the config parameters.
        """
        return NotImplemented
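As an illustration, the sketch below subclasses InferenceEngine to wrap a hypothetical HTTP chat endpoint. The endpoint URL, the response schema, the config attribute name, and the import path for InferenceEngine and LLMConfig are assumptions; adapt them to your own service and installed version.
from typing import Any, Dict, Generator, List, Union

import requests  # assumed dependency for the hypothetical HTTP backend

from llm_ie.engines import InferenceEngine, LLMConfig  # import path assumed; check your installed version


class MyHTTPInferenceEngine(InferenceEngine):
    """A minimal custom engine that posts chat messages to a hypothetical REST endpoint."""

    def __init__(self, base_url: str, config: LLMConfig = None, **kwrs):
        self.base_url = base_url.rstrip("/")
        self.config = config
        self.formatted_params = self._format_config()

    def chat(self, messages: List[Dict[str, str]],
             verbose: bool = False, stream: bool = False) -> Union[str, Generator[Dict[str, str], None, None]]:
        # Hypothetical endpoint and payload schema; replace with your service's actual API
        response = requests.post(f"{self.base_url}/chat",
                                 json={"messages": messages, **self.formatted_params},
                                 timeout=60)
        response.raise_for_status()
        text = response.json().get("text", "")
        if verbose:
            print(text)
        if stream:
            # Simplest possible generator: yield the whole response as one chunk
            # (chunk dict keys are illustrative; match whatever your downstream code expects)
            def _gen():
                yield {"text": text}
            return _gen()
        return text

    def _format_config(self) -> Dict[str, Any]:
        # Map LLMConfig fields to whatever keys the backend expects; empty if no config was given
        if self.config is None:
            return {}
        return getattr(self.config, "params", {})  # attribute name is an assumption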