LLM Configuration

In some cases, it is helpful to adjust LLM sampling parameters (e.g., temperature, top-p, top-k, maximum new tokens) or to use reasoning models (e.g., OpenAI o-series models, Qwen3), which require special treatment in the system prompt, user prompt, and sampling parameters. For example, OpenAI o-series reasoning models disallow passing a system prompt or setting a custom temperature. Another example is Qwen3's hybrid thinking mode: the special tokens "/think" and "/no_think" must be appended to user prompts to control the reasoning behavior. The table below summarizes which configuration class to use for which models.

Config class             | LLMs
-------------------------|---------------------------------------------------------------------
BasicLLMConfig           | Most non-reasoning LLMs (e.g., Llama4, Qwen3-30B-A3B-Instruct-2507, ...)
ReasoningLLMConfig       | Most reasoning LLMs (e.g., Qwen3-30B-A3B-Thinking-2507, gpt-oss-120b, ...)
Qwen3LLMConfig           | Qwen3 hybrid thinking (e.g., Qwen3-30B-A3B, Qwen3-32B, ...)
OpenAIReasoningLLMConfig | OpenAI API reasoning models (e.g., the "o" series: o1, o3, o4, ...)

Setting sampling parameters

LLM sampling parameters such as temperature, top-p, top-k, and maximum new tokens can be set by passing an LLMConfig instance to the InferenceEngine constructor.

    from llm_ie.engines import OpenAIInferenceEngine, BasicLLMConfig
    
    config = BasicLLMConfig(temperature=0.2, max_new_tokens=4096)
    inference_engine = OpenAIInferenceEngine(model="gpt-4o-mini", config=config)
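
Other sampling parameters listed above (e.g., top-p) can be passed the same way. The keyword name below is an assumption for illustration; check the BasicLLMConfig signature in your installed version:

    from llm_ie.engines import OpenAIInferenceEngine, BasicLLMConfig

    # top_p is an assumed keyword name; verify against your llm-ie version.
    config = BasicLLMConfig(temperature=0.2, top_p=0.9, max_new_tokens=4096)
    inference_engine = OpenAIInferenceEngine(model="gpt-4o-mini", config=config)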
    

Reasoning models

To use reasoning models such as the OpenAI o-series (e.g., o1, o3, o3-mini, o4-mini), some special handling is required. We provide dedicated configuration classes for these models.

General reasoning models

Most reasoning models can be configured with ReasoningLLMConfig. By specifying the start and end thinking tags, the reasoning tokens are excluded from the model response but kept in the messages log (when return_messages_log=True).

    from llm_ie.engines import OpenAIInferenceEngine, ReasoningLLMConfig

    llm = OpenAIInferenceEngine(base_url="http://localhost:8000/v1",
                                model="Qwen/Qwen3-30B-A3B-Thinking-2507",
                                api_key="EMPTY",
                                config=ReasoningLLMConfig(thinking_token_start="<think>",
                                                          thinking_token_end="</think>",
                                                          temperature=0.8,
                                                          max_tokens=8192))
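
For intuition, the tag-based separation boils down to splitting the raw completion on the configured tags. Below is a minimal sketch of that idea, not llm-ie's actual implementation; it assumes the same tags passed to ReasoningLLMConfig above:

    import re

    def split_reasoning(raw: str, start: str = "<think>", end: str = "</think>"):
        """Separate reasoning tokens from the final answer (illustrative sketch)."""
        match = re.search(re.escape(start) + r"(.*?)" + re.escape(end), raw, flags=re.DOTALL)
        if match is None:
            return "", raw  # no thinking block found
        reasoning = match.group(1).strip()
        response = (raw[:match.start()] + raw[match.end():]).strip()
        return reasoning, response

    reasoning, answer = split_reasoning("<think>Scan for drug names.</think>Aspirin 81 mg daily.")
    print(answer)  # -> "Aspirin 81 mg daily."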
    

OpenAI o-series reasoning models

The OpenAI o-series reasoning model API does not allow setting a system prompt; content that would normally go in the system prompt should be included in the user prompt instead. Custom temperature values are also not allowed. We provide a dedicated configuration class, OpenAIReasoningLLMConfig, for these models.

    from llm_ie.engines import OpenAIInferenceEngine, OpenAIReasoningLLMConfig
    
    inference_engine = OpenAIInferenceEngine(model="o1-mini", 
                                             config=OpenAIReasoningLLMConfig(reasoning_effort="low"))
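
Conceptually, such a configuration rewrites the request to satisfy the o-series constraints: system content is folded into the user prompt, and no custom temperature is sent. A rough sketch of that transformation (illustrative only, not llm-ie's internal code):

    # Illustrative sketch of the constraints described above.
    def adapt_for_o_series(messages: list) -> list:
        """Fold system content into the first user message, since the API rejects system prompts."""
        system_parts = [m["content"] for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
        if system_parts and rest and rest[0]["role"] == "user":
            rest[0] = {"role": "user",
                       "content": "\n\n".join(system_parts) + "\n\n" + rest[0]["content"]}
        return rest  # the request is then sent without a temperature parameter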
    

Qwen3 (hybrid thinking mode)

Note: this does NOT work for the Qwen3 2507 models. For those, use ReasoningLLMConfig (Thinking variants) or BasicLLMConfig (Instruct variants) instead.

Qwen3 manages reasoning behavior in a special way: the same model has a thinking mode and a non-thinking mode, controlled through the prompt template. When the special token "/think" is appended to the user prompt, the model generates thinking tokens in a <think>...</think> block. When the special token "/no_think" is appended, the model generates an empty <think>...</think> block. We provide a dedicated configuration class, Qwen3LLMConfig, for these models.

    from llm_ie.engines import OpenAIInferenceEngine, Qwen3LLMConfig
    
    # Thinking mode
    llm = OpenAIInferenceEngine(base_url="http://localhost:8000/v1", 
                                model="Qwen/Qwen3-30B-A3B", 
                                api_key="EMPTY", 
                                config=Qwen3LLMConfig(thinking_mode=True, temperature=0.8, max_tokens=8192))
    
    # Non-thinking mode
    llm = OpenAIInferenceEngine(base_url="http://localhost:8000/v1", 
                                model="Qwen/Qwen3-30B-A3B", 
                                api_key="EMPTY", 
                                config=Qwen3LLMConfig(thinking_mode=False, temperature=0.0, max_tokens=2048))
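
Under the hood, hybrid thinking control is purely prompt-level: the appropriate control token is appended to the user prompt. A sketch of the idea (illustrative only; Qwen3LLMConfig handles this for you):

    # Illustrative sketch of what thinking_mode toggles at the prompt level.
    def apply_thinking_mode(user_prompt: str, thinking_mode: bool) -> str:
        """Append Qwen3's hybrid-thinking control token to the user prompt."""
        return f"{user_prompt} {'/think' if thinking_mode else '/no_think'}"

    print(apply_thinking_mode("Extract all medications.", thinking_mode=False))
    # -> "Extract all medications. /no_think"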