LLM Inference Engine

We provide an interface for different LLM inference engines to work in the information extraction workflow. The built-in engines include VLLMInferenceEngine, SGLangInferenceEngine, OpenAIInferenceEngine, AzureOpenAIInferenceEngine, OpenRouterInferenceEngine, HuggingFaceHubInferenceEngine, OllamaInferenceEngine, LlamaCppInferenceEngine, and LiteLLMInferenceEngine. For customization, see Customize inference engine below. Inference engines accept an LLMConfig object where sampling parameters (e.g., temperature, top-p, top-k, maximum new tokens) and reasoning configuration (e.g., OpenAI o-series models, Qwen3) can be set.
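
All engines follow the same pattern: construct the engine with a model identifier and, optionally, an LLMConfig subclass carrying the sampling parameters. Below is a minimal sketch using the OpenAI engine (it assumes OPENAI_API_KEY is set as described in the OpenAI section; engine-specific examples follow).

from llm_ie.engines import OpenAIInferenceEngine, BasicLLMConfig

# Construct an engine; sampling parameters are set through an LLMConfig subclass
inference_engine = OpenAIInferenceEngine(model="gpt-4.1-mini",
                                         config=BasicLLMConfig(temperature=0.0, max_new_tokens=1024))

# Quick sanity check (see "Test inference engine configuration" below)
inference_engine.chat(messages=[{"role": "user", "content": "Hi"}], verbose=True)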

vLLM

vLLM support follows the OpenAI Compatible Server. For more parameters, please refer to the vLLM documentation. Below are examples for different models.

Meta-Llama-3.1-8B-Instruct

Start the server in command line.

CUDA_VISIBLE_DEVICES=<GPU#> vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --api-key MY_API_KEY \
    --tensor-parallel-size <# of GPUs to use>
Use CUDA_VISIBLE_DEVICES to specify the GPUs to use, and set --tensor-parallel-size accordingly. The --api-key flag is optional. The default port is 8000; use --port to change it.

Define inference engine

from llm_ie.engines import VLLMInferenceEngine

inference_engine = VLLMInferenceEngine(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
The model must match the repo name specified when starting the server.
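
Because the vLLM server exposes an OpenAI-compatible endpoint, it can also be reached through OpenAIInferenceEngine by setting base_url. This is a sketch, assuming the server above listens on localhost:8000 and OPENAI_API_KEY is exported with the value passed to --api-key:

from llm_ie.engines import OpenAIInferenceEngine

# Point the generic OpenAI engine at the local vLLM server
inference_engine = OpenAIInferenceEngine(base_url="http://localhost:8000/v1",
                                         model="meta-llama/Meta-Llama-3.1-8B-Instruct")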

Qwen3-30B-A3B (hybrid thinking mode)

Start the server in command line. Specify --reasoning-parser qwen3 to enable the reasoning parser.

vllm serve Qwen/Qwen3-30B-A3B \
    --tensor-parallel-size 4 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.9 \
    --reasoning-parser qwen3
Define inference engine
from llm_ie.engines import VLLMInferenceEngine, Qwen3LLMConfig

# Thinking mode
inference_engine = VLLMInferenceEngine(model="Qwen/Qwen3-30B-A3B", 
                                       config=Qwen3LLMConfig(thinking_mode=True, temperature=0.6, top_p=0.95, top_k=20))
# Non-thinking mode
inference_engine = VLLMInferenceEngine(model="Qwen/Qwen3-30B-A3B", 
                                       config=Qwen3LLMConfig(thinking_mode=False, temperature=0.7, top_p=0.8, top_k=20))

Qwen3-30B-A3B-Thinking-2507

Start the server in command line. Specify --reasoning-parser qwen3 to enable the reasoning parser.

vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 \
   --tensor-parallel-size 4 \
   --enable-prefix-caching \
   --reasoning-parser qwen3
Define inference engine
from llm_ie.engines import VLLMInferenceEngine, ReasoningLLMConfig

inference_engine = VLLMInferenceEngine(model="Qwen/Qwen3-30B-A3B-Thinking-2507", 
                                       config=ReasoningLLMConfig(temperature=0.6, top_p=0.95, top_k=20))

gpt-oss-120b

Start the server in command line. Specify --reasoning-parser openai_gptoss to enable the reasoning parser.

vllm serve openai/gpt-oss-120b \
    --tensor-parallel-size 4 \
    --enable-prefix-caching \
    --reasoning-parser openai_gptoss
Define inference engine
from llm_ie.engines import VLLMInferenceEngine, ReasoningLLMConfig

inference_engine = VLLMInferenceEngine(model="openai/gpt-oss-120b", 
                                       config=ReasoningLLMConfig(temperature=1.0, top_p=1.0, top_k=0))

SGLang

SGLang support follows the OpenAI APIs. For more parameters, please refer to the SGLang documentation. Below are examples for different models.

Meta-Llama-3.1-8B-Instruct

Start the server in command line.

CUDA_VISIBLE_DEVICES=<GPU#> python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --api-key MY_API_KEY \
    --tensor-parallel-size <# of GPUs to use>
Use CUDA_VISIBLE_DEVICES to specify the GPUs to use, and set --tensor-parallel-size accordingly. The --api-key flag is optional. The default port is 30000; use --port to change it.

Define inference engine

from llm_ie.engines import SGLangInferenceEngine

inference_engine = SGLangInferenceEngine(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
The model must match the repo name specified when starting the server.

Qwen3-30B-A3B (hybrid thinking mode)

Start the server in command line. Specify --reasoning-parser qwen3 to enable the reasoning parser.

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B \
    --reasoning-parser qwen3 \
    --tensor-parallel-size <# of GPUs to use> \
    --context-length 32000 
Define inference engine
from llm_ie.engines import SGLangInferenceEngine, Qwen3LLMConfig

# Thinking mode
inference_engine = SGLangInferenceEngine(model="Qwen/Qwen3-30B-A3B", 
                                         config=Qwen3LLMConfig(thinking_mode=True, temperature=0.6, top_p=0.95, top_k=20))
# Non-thinking mode
inference_engine = SGLangInferenceEngine(model="Qwen/Qwen3-30B-A3B", 
                                         config=Qwen3LLMConfig(thinking_mode=False, temperature=0.7, top_p=0.8, top_k=20))

Qwen3-30B-A3B-Thinking-2507

Start the server in command line. Specify --reasoning-parser qwen3-thinking to enable the reasoning parser.

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B-Thinking-2507 \
    --reasoning-parser qwen3-thinking \
    --tensor-parallel-size <# of GPUs to use> \
    --context-length 32000
Define inference engine
from llm_ie.engines import SGLangInferenceEngine, ReasoningLLMConfig

inference_engine = SGLangInferenceEngine(model="Qwen/Qwen3-30B-A3B-Thinking-2507", 
                                         config=ReasoningLLMConfig(temperature=0.6, top_p=0.95, top_k=20))

gpt-oss-120b

Start the server in command line. Specify --reasoning-parser gpt-oss to enable the reasoning parser.

python3 -m sglang.launch_server \
  --model-path <model path> \
  --served-model-name openai/gpt-oss-120b \
  --reasoning-parser gpt-oss \
  --tensor-parallel-size <# of GPUs to use>
Define inference engine
from llm_ie.engines import SGLangInferenceEngine, ReasoningLLMConfig

inference_engine = SGLangInferenceEngine(model="openai/gpt-oss-120b", 
                                         config=ReasoningLLMConfig(temperature=1.0, top_p=1.0, top_k=0))

OpenAI API & Compatible Services

In bash, save the API key to the environment variable OPENAI_API_KEY.

export OPENAI_API_KEY=<your_API_key>

In Python, create the inference engine and specify the model name. For the available models, refer to the OpenAI webpage. For more parameters, see the OpenAI API reference.

OpenAI models

gpt-4.1-mini

from llm_ie.engines import OpenAIInferenceEngine, BasicLLMConfig

inference_engine = OpenAIInferenceEngine(model="gpt-4.1-mini", 
                                         config=BasicLLMConfig(temperature=0.0, max_new_tokens=1024))

o-series reasoning models

For OpenAI reasoning models (o-series), pass an OpenAIReasoningLLMConfig object to the OpenAIInferenceEngine constructor.

from llm_ie.engines import OpenAIInferenceEngine, OpenAIReasoningLLMConfig

inference_engine = OpenAIInferenceEngine(model="o4-mini", 
                                         config=OpenAIReasoningLLMConfig(reasoning_effort="low"))

OpenAI compatible services

For OpenAI compatible services (OpenRouter for example):

from llm_ie.engines import OpenAIInferenceEngine, BasicLLMConfig

inference_engine = OpenAIInferenceEngine(base_url="https://openrouter.ai/api/v1", model="meta-llama/llama-4-scout",
                                         config=BasicLLMConfig(temperature=0.0, max_new_tokens=1024))

Azure OpenAI API

In bash, save the endpoint name and API key to the environment variables AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY.

export AZURE_OPENAI_API_KEY="<your_API_key>"
export AZURE_OPENAI_ENDPOINT="<your_endpoint>"

In Python, create the inference engine and specify the model name. For the available models, refer to the OpenAI webpage. For more parameters, see the Azure OpenAI reference.

gpt-4.1-mini

from llm_ie.engines import AzureOpenAIInferenceEngine, BasicLLMConfig

inference_engine = AzureOpenAIInferenceEngine(model="gpt-4.1-mini", config=BasicLLMConfig(temperature=0.0, max_new_tokens=1024))

o-series reasoning models

For reasoning models (o-series), pass an OpenAIReasoningLLMConfig object to the AzureOpenAIInferenceEngine constructor.

from llm_ie.engines import AzureOpenAIInferenceEngine, OpenAIReasoningLLMConfig

inference_engine = AzureOpenAIInferenceEngine(model="o1-mini", 
                                              config=OpenAIReasoningLLMConfig(reasoning_effort="low"))

OpenRouter

We provide an interface for the OpenRouter service. To use OpenRouter, sign up on their website and get an API key. For more details, refer to OpenRouter.

In bash, save the API key to the environment variable OPENROUTER_API_KEY.

export OPENROUTER_API_KEY=<your_API_key>

Meta-Llama-3.1-8B-Instruct

Define inference engine

from llm_ie.engines import OpenRouterInferenceEngine

inference_engine = OpenRouterInferenceEngine(model="meta-llama/llama-3.1-8b-instruct")
The model must match the repo name specified on OpenRouter.

Qwen3-30B-A3B (hybrid thinking mode)

Define inference engine

from llm_ie.engines import OpenRouterInferenceEngine, Qwen3LLMConfig

# Thinking mode
inference_engine = OpenRouterInferenceEngine(model="qwen/qwen3-30b-a3b", 
                                             config=Qwen3LLMConfig(thinking_mode=True, temperature=0.6, top_p=0.95, top_k=20))
# Non-thinking mode
inference_engine = OpenRouterInferenceEngine(model="qwen/qwen3-30b-a3b", 
                                             config=Qwen3LLMConfig(thinking_mode=False, temperature=0.7, top_p=0.8, top_k=20))

Qwen3-30B-A3B-Thinking-2507

Define inference engine

from llm_ie.engines import OpenRouterInferenceEngine, ReasoningLLMConfig

inference_engine = OpenRouterInferenceEngine(model="qwen/qwen3-30b-a3b-thinking-2507",
                                             config=ReasoningLLMConfig(temperature=0.6, top_p=0.95, top_k=20))

gpt-oss-120b

Define inference engine

from llm_ie.engines import OpenRouterInferenceEngine, ReasoningLLMConfig

inference_engine = OpenRouterInferenceEngine(model="openai/gpt-oss-120b", 
                                             config=ReasoningLLMConfig(temperature=1.0, top_p=1.0, top_k=0))

Huggingface_hub

The model can be a model id hosted on the Hugging Face Hub or a URL to a deployed Inference Endpoint. Refer to the Inference Client documentation for more details.

from llm_ie.engines import HuggingFaceHubInferenceEngine

inference_engine = HuggingFaceHubInferenceEngine(model="meta-llama/Meta-Llama-3-8B-Instruct")

Ollama

The model_name must match a name on the Ollama library. Use the command ollama ls to check your local model list. num_ctx determines the context length the LLM considers during text generation. Empirically, a longer context length gives better performance, but consumes more memory and increases computation. keep_alive controls how long the model stays loaded after the last API call, in seconds. The default is 5 minutes (300 seconds).

from llm_ie.engines import OllamaInferenceEngine

inference_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0", num_ctx=4096, keep_alive=300)

Llama-cpp-python

The repo_id and gguf_filename must match the ones on the Hugging Face repo to ensure the correct model is loaded. n_ctx determines the context length the LLM considers during text generation. Empirically, a longer context length gives better performance, but consumes more memory and increases computation. Note that when n_ctx is less than the prompt length, Llama.cpp throws an exception. n_gpu_layers indicates the number of model layers to offload to GPU. The default is -1, which offloads all layers (the entire LLM). Flash attention (flash_attn) is supported by Llama.cpp. verbose indicates whether model information should be displayed. For more input parameters, see 🦙 Llama-cpp-python.

from llm_ie.engines import LlamaCppInferenceEngine

inference_engine = LlamaCppInferenceEngine(repo_id="bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF",
                                           gguf_filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
                                           n_ctx=4096,
                                           n_gpu_layers=-1,
                                           flash_attn=True,
                                           verbose=False)

LiteLLM

LiteLLM is an adapter project that unifies many proprietary and open-source LLM APIs. Popular inference servers, including OpenAI, Hugging Face Hub, and Ollama, are supported via its interface. For more details, refer to the LiteLLM GitHub page.

To use LiteLLM with LLM-IE, import the LiteLLMInferenceEngine and follow LiteLLM's model naming convention.

import os
from llm_ie.engines import LiteLLMInferenceEngine

# Hugging Face serverless inference (requires HF_TOKEN)
os.environ['HF_TOKEN'] = "<your_HF_token>"
inference_engine = LiteLLMInferenceEngine(model="huggingface/meta-llama/Meta-Llama-3-8B-Instruct")

# OpenAI GPT models (requires OPENAI_API_KEY)
os.environ['OPENAI_API_KEY'] = "<your_API_key>"
inference_engine = LiteLLMInferenceEngine(model="openai/gpt-4o-mini")

# OpenAI compatible local server
inference_engine = LiteLLMInferenceEngine(model="openai/Llama-3.1-8B-Instruct", base_url="http://localhost:8000/v1", api_key="EMPTY")

# Ollama
inference_engine = LiteLLMInferenceEngine(model="ollama/llama3.1:8b-instruct-q8_0")

Test inference engine configuration

To test the inference engine, use the chat() method.

from llm_ie.engines import OllamaInferenceEngine

inference_engine = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")
inference_engine.chat(messages=[{"role": "user", "content":"Hi"}], verbose=True)
The output should look something like the following (it may vary by LLM and version):

'How can I help you today?'
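
chat() also accepts stream=True (see the interface in Customize inference engine below), in which case it returns a generator that yields output in real-time. A minimal sketch that prints each chunk as it arrives (the exact chunk format depends on the engine):

for chunk in inference_engine.chat(messages=[{"role": "user", "content": "Hi"}], stream=True):
    print(chunk)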

Customize inference engine

The abstract class InferenceEngine defines the interface and the required chat() method. Inherit from this class to support a custom API.

import abc
from typing import Any, Dict, Generator, List, Union


class InferenceEngine:
    @abc.abstractmethod
    def __init__(self, config:LLMConfig, **kwrs):
        """
        This is an abstract class to provide interfaces for LLM inference engines.
        Child classes that inherit this class can be used in extractors. Must implement the chat() method.

        Parameters:
        ----------
        config : LLMConfig
            the LLM configuration. Must be a child class of LLMConfig.
        """
        return NotImplemented


    @abc.abstractmethod
    def chat(self, messages:List[Dict[str,str]], 
             verbose:bool=False, stream:bool=False) -> Union[str, Generator[Dict[str, str], None, None]]:
        """
        This method inputs chat messages and outputs LLM generated text.

        Parameters:
        ----------
        messages : List[Dict[str,str]]
            a list of dict with role and content. role must be one of {"system", "user", "assistant"}
        verbose : bool, Optional
            if True, LLM generated text will be printed in terminal in real-time.
        stream : bool, Optional
            if True, returns a generator that yields the output in real-time.  
        """
        return NotImplemented

    def _format_config(self) -> Dict[str, Any]:
        """
        This method formats the LLM configuration with the correct keys for the inference engine.

        Return : Dict[str, Any]
            the config parameters.
        """
        return NotImplemented
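
As a concrete illustration, below is a minimal sketch of a custom engine that wraps an OpenAI-compatible endpoint with the official openai client. Only the InferenceEngine interface above comes from the library; the class name, the extra constructor keywords (model, base_url, api_key), the assumption that InferenceEngine and LLMConfig are importable from llm_ie.engines, and the config attributes read via getattr are all illustrative and should be adapted to your API and installed version.

from typing import Any, Dict, Generator, List, Union

from openai import OpenAI
from llm_ie.engines import InferenceEngine, LLMConfig


class MyOpenAICompatibleEngine(InferenceEngine):
    def __init__(self, config: LLMConfig, model: str,
                 base_url: str = "http://localhost:8000/v1", api_key: str = "EMPTY", **kwrs):
        # Store the config and create a client for the OpenAI-compatible endpoint
        self.config = config
        self.model = model
        self.client = OpenAI(base_url=base_url, api_key=api_key)

    def _format_config(self) -> Dict[str, Any]:
        # Map LLMConfig fields to this API's parameter names.
        # The attribute names below are assumptions; adapt them to the config class you use.
        return {"temperature": getattr(self.config, "temperature", 0.0),
                "max_tokens": getattr(self.config, "max_new_tokens", 512)}

    def chat(self, messages: List[Dict[str, str]],
             verbose: bool = False, stream: bool = False) -> Union[str, Generator[Dict[str, str], None, None]]:
        # Only the non-streaming path is shown in this sketch
        response = self.client.chat.completions.create(model=self.model,
                                                       messages=messages,
                                                       **self._format_config())
        text = response.choices[0].message.content
        if verbose:
            print(text)
        return text

The custom engine can then be tested with chat() exactly as shown in Test inference engine configuration above.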