
Chunkers API

This module provides classes for splitting documents into manageable units for processing by LLMs and for providing context to those units.

Unit Chunkers

Unit chunkers determine how a document is divided into smaller pieces for frame extraction. Each piece is a FrameExtractionUnit.

llm_ie.chunkers.UnitChunker

UnitChunker()

Bases: ABC

Abstract base class for frame extraction unit chunkers. A unit chunker splits a document into units (e.g., sentences), and the LLM processes the document one unit at a time.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
    """
    This is the abstract class for frame extraction unit chunker.
    It chunks a document into units (e.g., sentences). LLMs process unit by unit. 
    """
    pass

chunk

chunk(text: str) -> List[FrameExtractionUnit]
Parameters:

text : str The document text.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text:str) -> List[FrameExtractionUnit]:
    """
    Parameters:
    ----------
    text : str
        The document text.
    """
    # Abstract method: concrete chunkers must override it.
    raise NotImplementedError
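
The interface above can be illustrated with a custom subclass. The following is a minimal sketch, not part of llm_ie: `ParagraphUnitChunker` is a hypothetical name, and `FrameExtractionUnit` and `UnitChunker` are simplified stand-ins defined here so the example runs on its own. It assumes a unit only needs `start`, `end`, and `text`.

```python
import re
from abc import ABC
from dataclasses import dataclass
from typing import List


@dataclass
class FrameExtractionUnit:
    # Stand-in for llm_ie's FrameExtractionUnit (assumed fields only).
    start: int
    end: int
    text: str


class UnitChunker(ABC):
    # Stand-in for llm_ie.chunkers.UnitChunker.
    def chunk(self, text: str) -> List[FrameExtractionUnit]:
        raise NotImplementedError


class ParagraphUnitChunker(UnitChunker):
    """Hypothetical chunker: one unit per paragraph (blank-line separated)."""

    def chunk(self, text: str) -> List[FrameExtractionUnit]:
        units = []
        pos = 0
        # Each match is a paragraph separator: a newline, optional
        # whitespace, and another newline.
        for sep in re.finditer(r"\n\s*\n", text):
            if sep.start() > pos:
                units.append(FrameExtractionUnit(pos, sep.start(), text[pos:sep.start()]))
            pos = sep.end()
        # Trailing paragraph after the last separator, if any.
        if pos < len(text):
            units.append(FrameExtractionUnit(pos, len(text), text[pos:]))
        return units


text = "First paragraph.\n\nSecond paragraph,\nstill second."
units = ParagraphUnitChunker().chunk(text)
```

Each unit keeps character offsets into the original document, so `text[unit.start:unit.end] == unit.text` holds for every unit, as it does for the built-in chunkers shown on this page.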

llm_ie.chunkers.WholeDocumentUnitChunker

WholeDocumentUnitChunker()

Bases: UnitChunker

This class chunks the whole document into a single unit (no chunking).

Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
    """
    This class chunks the whole document into a single unit (no chunking).
    """
    super().__init__()

chunk

chunk(text: str) -> List[FrameExtractionUnit]
Parameters:

text : str The document text.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text:str) -> List[FrameExtractionUnit]:
    """
    Parameters:
    ----------
    text : str
        The document text.
    """
    return [FrameExtractionUnit(
        start=0,
        end=len(text),
        text=text
    )]

llm_ie.chunkers.SentenceUnitChunker

SentenceUnitChunker()

Bases: UnitChunker

This class uses the NLTK PunktSentenceTokenizer to chunk a document into sentences.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
    """
    This class uses the NLTK PunktSentenceTokenizer to chunk a document into sentences.
    """
    super().__init__()

chunk

chunk(text: str) -> List[FrameExtractionUnit]
Parameters:

text : str The document text.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text:str) -> List[FrameExtractionUnit]:
    """
    Parameters:
    ----------
    text : str
        The document text.
    """
    sentences = []
    # span_tokenize yields (start, end) character offsets for each sentence.
    for start, end in self.PunktSentenceTokenizer().span_tokenize(text):
        sentences.append(FrameExtractionUnit(
            start=start,
            end=end,
            text=text[start:end]
        ))
    return sentences

llm_ie.chunkers.TextLineUnitChunker

TextLineUnitChunker()

Bases: UnitChunker

This class chunks a document into lines.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
    """
    This class chunks a document into lines.
    """
    super().__init__()

chunk

chunk(text: str) -> List[FrameExtractionUnit]
Parameters:

text : str The document text.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text:str) -> List[FrameExtractionUnit]:
    """
    Parameters:
    ----------
    text : str
        The document text.
    """
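    lines = text.split('\n')
    line_units = []
    start = 0
    for line in lines:
        # Character offsets of this line in the original text (end is exclusive).
        end = start + len(line)
        line_units.append(FrameExtractionUnit(
            start=start,
            end=end,
            text=line
        ))
        # Skip the '\n' separator that split() removed.
        start = end + 1
    return line_units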
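
To make the offset arithmetic concrete, here is a small self-contained sketch that repeats the same loop as a standalone function. `chunk_lines` is a hypothetical helper, and `FrameExtractionUnit` is a simplified stand-in with only the three fields used above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameExtractionUnit:
    # Stand-in for llm_ie's FrameExtractionUnit (assumed fields only).
    start: int
    end: int
    text: str


def chunk_lines(text: str) -> List[FrameExtractionUnit]:
    """Same logic as TextLineUnitChunker.chunk, written as a plain function."""
    units = []
    start = 0
    for line in text.split("\n"):
        end = start + len(line)
        units.append(FrameExtractionUnit(start=start, end=end, text=line))
        start = end + 1  # step over the newline character
    return units


text = "BP 120/80\n\nHR 72"
spans = [(u.start, u.end, u.text) for u in chunk_lines(text)]
# spans == [(0, 9, "BP 120/80"), (10, 10, ""), (11, 16, "HR 72")]
```

Two consequences of this logic are worth noting: a blank line produces an empty unit, and because the split is on `'\n'` only, text with `\r\n` line endings keeps the `\r` at the end of each unit.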

Context Chunkers

Context chunkers determine what contextual information is provided to the LLM alongside a specific FrameExtractionUnit.

llm_ie.chunkers.ContextChunker

ContextChunker()

Bases: ABC

Abstract base class for context chunkers. Given a frame extraction unit, a context chunker returns the context for that unit.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
    """
    This is the abstract class for context chunker. Given a frame extraction unit,
    it returns the context for it.
    """
    pass

chunk

chunk(unit: FrameExtractionUnit) -> str
Parameters:

unit : FrameExtractionUnit The frame extraction unit.

Return : str The context for the frame extraction unit.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, unit:FrameExtractionUnit) -> str:
    """
    Parameters:
    ----------
    unit : FrameExtractionUnit
        The frame extraction unit.

    Return : str 
        The context for the frame extraction unit.
    """
    # Abstract method: concrete chunkers must override it.
    raise NotImplementedError

llm_ie.chunkers.NoContextChunker

NoContextChunker()

Bases: ContextChunker

This class does not provide any context.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
    """
    This class does not provide any context.
    """
    super().__init__()

fit

fit(text: str, units: List[FrameExtractionUnit])
Parameters:

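text : str The document text. Not used by this chunker.

units : List[FrameExtractionUnit] The list of frame extraction units. Not used by this chunker.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def fit(self, text:str, units:List[FrameExtractionUnit]):
    """
    Parameters:
    ----------
    text : str
        The document text. Not used by this chunker.
    units : List[FrameExtractionUnit]
        The list of frame extraction units. Not used by this chunker.
    """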
    pass

llm_ie.chunkers.WholeDocumentContextChunker

WholeDocumentContextChunker()

Bases: ContextChunker

This class provides the whole document as context.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
    """
    This class provides the whole document as context.
    """
    super().__init__()
    self.text = None

fit

fit(text: str, units: List[FrameExtractionUnit])
Parameters:

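text : str The document text, stored as the context.

units : List[FrameExtractionUnit] The list of frame extraction units. Not used by this chunker.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def fit(self, text:str, units:List[FrameExtractionUnit]):
    """
    Parameters:
    ----------
    text : str
        The document text, stored as the context.
    units : List[FrameExtractionUnit]
        The list of frame extraction units. Not used by this chunker.
    """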
    self.text = text
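
Context chunkers are used in two steps: `fit` once per document, then `chunk` once per unit. The sketch below illustrates that pattern for the whole-document case. It is not the library's code: the `chunk` method of this class is not shown on this page, so returning the stored text is an assumption, and `WholeDocumentContextSketch` and `FrameExtractionUnit` are stand-ins defined here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FrameExtractionUnit:
    # Stand-in for llm_ie's FrameExtractionUnit (assumed fields only).
    start: int
    end: int
    text: str


class WholeDocumentContextSketch:
    """Hypothetical stand-in mirroring WholeDocumentContextChunker."""

    def __init__(self):
        self.text: Optional[str] = None

    def fit(self, text: str, units: List[FrameExtractionUnit]) -> None:
        # Same as the fit() shown above: remember the document.
        self.text = text

    def chunk(self, unit: FrameExtractionUnit) -> str:
        # Assumption: every unit gets the full document as context.
        if self.text is None:
            raise RuntimeError("fit() must be called before chunk()")
        return self.text


text = "Line one\nLine two"
units = [FrameExtractionUnit(0, 8, "Line one"), FrameExtractionUnit(9, 17, "Line two")]

chunker = WholeDocumentContextSketch()
chunker.fit(text, units)                        # once per document
contexts = [chunker.chunk(u) for u in units]    # once per unit
```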

llm_ie.chunkers.SlideWindowContextChunker

SlideWindowContextChunker(window_size: int)

Bases: ContextChunker

This class provides a sliding-window context, for example the 2 sentences before and after a unit sentence (±2).

Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self, window_size:int):
    """
    This class provides a sliding window context. For example, +-2 sentences around a unit sentence. 
    """
    super().__init__()
    self.window_size = window_size
    self.units = None

fit

fit(text: str, units: List[FrameExtractionUnit])
Parameters:

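text : str The document text. Not used by this chunker.

units : List[FrameExtractionUnit] The list of frame extraction units. They are stored in sorted order.

Source code in package/llm-ie/src/llm_ie/chunkers.py
def fit(self, text:str, units:List[FrameExtractionUnit]):
    """
    Parameters:
    ----------
    text : str
        The document text. Not used by this chunker.
    units : List[FrameExtractionUnit]
        The list of frame extraction units. They are stored in sorted order.
    """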
    self.units = sorted(units)
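
The `chunk` method of this class is not shown here, so the following is only a sketch of how a sliding-window lookup could work on top of the sorted units. Several details are assumptions: that `window_size` counts units on each side, that the unit itself is included in the context, and that neighbors are joined with a single space. `SlideWindowContextSketch` and the orderable `FrameExtractionUnit` stand-in are hypothetical names defined for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(order=True)
class FrameExtractionUnit:
    # Stand-in for llm_ie's FrameExtractionUnit; order=True makes
    # sorted() compare units by (start, end, text).
    start: int
    end: int
    text: str


class SlideWindowContextSketch:
    """Hypothetical stand-in mirroring SlideWindowContextChunker."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.units: Optional[List[FrameExtractionUnit]] = None

    def fit(self, text: str, units: List[FrameExtractionUnit]) -> None:
        # Same as the fit() shown above: keep units in document order.
        self.units = sorted(units)

    def chunk(self, unit: FrameExtractionUnit) -> str:
        # Assumed behavior: return the unit plus up to window_size
        # neighbors on each side, clipped at the document boundaries.
        i = self.units.index(unit)
        lo = max(0, i - self.window_size)
        hi = min(len(self.units), i + self.window_size + 1)
        return " ".join(u.text for u in self.units[lo:hi])


text = "A. B. C. D. E."
# Units are passed out of order on purpose; fit() sorts them.
units = [FrameExtractionUnit(i, i + 2, text[i:i + 2]) for i in (6, 0, 12, 3, 9)]

chunker = SlideWindowContextSketch(window_size=1)
chunker.fit(text, units)
middle = chunker.chunk(FrameExtractionUnit(6, 8, "C."))  # "B. C. D."
first = chunker.chunk(FrameExtractionUnit(0, 2, "A."))   # "A. B."
```

Sorting in `fit` is what makes the neighbor lookup meaningful: the window is defined by position in document order, not by the order in which units were passed in.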