Chunkers API
This module provides classes for splitting documents into manageable units for processing by LLMs and for providing context to those units.
Unit Chunkers
Unit chunkers determine how a document is divided into smaller pieces for frame extraction. Each piece is a FrameExtractionUnit.
llm_ie.chunkers.UnitChunker
Bases: ABC
This is the abstract base class for frame extraction unit chunkers.
It chunks a document into units (e.g., sentences); LLMs then process the document unit by unit.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
"""
This is the abstract class for frame extraction unit chunker.
It chunks a document into units (e.g., sentences). LLMs process unit by unit.
"""
pass
chunk
abstractmethod
chunk(
text: str, doc_id: str = None
) -> List[FrameExtractionUnit]
Parameters:
text : str
The document text.
Source code in package/llm-ie/src/llm_ie/chunkers.py
@abc.abstractmethod
def chunk(self, text:str, doc_id:str=None) -> List[FrameExtractionUnit]:
"""
Parameters:
----------
text : str
The document text.
"""
return NotImplemented
chunk_async
async
chunk_async(
text: str, doc_id: str = None, executor=None
) -> List[FrameExtractionUnit]
Asynchronous version of the chunk method.
Source code in package/llm-ie/src/llm_ie/chunkers.py
async def chunk_async(self, text:str, doc_id:str=None, executor=None) -> List[FrameExtractionUnit]:
"""
asynchronous version of chunk method.
"""
loop = asyncio.get_running_loop()
return await loop.run_in_executor(executor, self.chunk, text, doc_id)
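Example:
To add a custom chunking strategy, subclass UnitChunker and implement chunk. The sketch below is a minimal, hypothetical example that splits a document into fixed-size character windows; it assumes FrameExtractionUnit can be imported from llm_ie.chunkers (adjust the import if the class lives in another module) and uses only the constructor arguments shown in the source above.
```python
import uuid
from typing import List

# Assumption: FrameExtractionUnit is importable from llm_ie.chunkers;
# adjust the import if the class lives in another module.
from llm_ie.chunkers import UnitChunker, FrameExtractionUnit


class FixedSizeUnitChunker(UnitChunker):
    """Hypothetical chunker that splits a document into fixed-size character windows."""
    def __init__(self, size: int = 500):
        super().__init__()
        self.size = size

    def chunk(self, text: str, doc_id: str = None) -> List[FrameExtractionUnit]:
        doc_id = doc_id if doc_id is not None else str(uuid.uuid4())
        units = []
        for start in range(0, len(text), self.size):
            end = min(start + self.size, len(text))
            units.append(FrameExtractionUnit(doc_id=doc_id, start=start,
                                             end=end, text=text[start:end]))
        return units
```
The inherited chunk_async then works without changes, since it simply runs chunk in an executor.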
llm_ie.chunkers.WholeDocumentUnitChunker
WholeDocumentUnitChunker()
Bases: UnitChunker
This class chunks the whole document into a single unit (no chunking).
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
"""
This class chunks the whole document into a single unit (no chunking).
"""
super().__init__()
chunk
chunk(
text: str, doc_id: str = None
) -> List[FrameExtractionUnit]
Parameters:
text : str
The document text.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text:str, doc_id:str=None) -> List[FrameExtractionUnit]:
"""
Parameters:
----------
text : str
The document text.
"""
return [FrameExtractionUnit(
doc_id=doc_id if doc_id is not None else str(uuid.uuid4()),
start=0,
end=len(text),
text=text
)]
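Example:
A minimal usage sketch (the sample text is illustrative, and attribute access on FrameExtractionUnit via start, end, and text is assumed from the constructor arguments above).
```python
from llm_ie.chunkers import WholeDocumentUnitChunker

text = "Patient presents with chest pain. No prior cardiac history."
chunker = WholeDocumentUnitChunker()
units = chunker.chunk(text, doc_id="note-001")

assert len(units) == 1                 # the whole document becomes a single unit
print(units[0].start, units[0].end)    # 0 and len(text), assuming start/end attributes
```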
llm_ie.chunkers.SentenceUnitChunker
Bases: UnitChunker
This class uses the NLTK PunktSentenceTokenizer to chunk a document into sentences.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
"""
This class uses the NLTK PunktSentenceTokenizer to chunk a document into sentences.
"""
super().__init__()
chunk
chunk(
text: str, doc_id: str = None
) -> List[FrameExtractionUnit]
Parameters:
text : str
The document text.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text:str, doc_id:str=None) -> List[FrameExtractionUnit]:
"""
Parameters:
----------
text : str
The document text.
"""
doc_id = doc_id if doc_id is not None else str(uuid.uuid4())
sentences = []
for start, end in self.PunktSentenceTokenizer().span_tokenize(text):
sentences.append(FrameExtractionUnit(
doc_id=doc_id,
start=start,
end=end,
text=text[start:end]
))
return sentences
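Example:
A minimal usage sketch (NLTK must be installed; attribute access on FrameExtractionUnit is assumed as above).
```python
from llm_ie.chunkers import SentenceUnitChunker

text = "The patient denies fever. She reports a mild cough. No dyspnea."
units = SentenceUnitChunker().chunk(text, doc_id="note-001")

for unit in units:
    print(f"[{unit.start}:{unit.end}] {unit.text}")  # one line per sentence span
```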
llm_ie.chunkers.SeparatorUnitChunker
SeparatorUnitChunker(sep: str)
Bases: UnitChunker
This class chunks a document by the provided separator.
Parameters:
sep : str
The separator string (e.g., "\n\n" to split on blank lines).
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self, sep:str):
"""
This class chunks a document by separator provided.
Parameters:
----------
sep : str
a separator string.
"""
super().__init__()
if not isinstance(sep, str):
raise ValueError("sep must be a string")
self.sep = sep
chunk
chunk(
text: str, doc_id: str = None
) -> List[FrameExtractionUnit]
Parameters:
text : str
The document text.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text:str, doc_id:str=None) -> List[FrameExtractionUnit]:
"""
Parameters:
----------
text : str
The document text.
"""
doc_id = doc_id if doc_id is not None else str(uuid.uuid4())
paragraphs = text.split(self.sep)
paragraph_units = []
start = 0
for paragraph in paragraphs:
end = start + len(paragraph)
paragraph_units.append(FrameExtractionUnit(
doc_id=doc_id,
start=start,
end=end,
text=paragraph
))
start = end + len(self.sep)
return paragraph_units
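Example:
For example, splitting on blank lines yields one unit per paragraph (a minimal sketch).
```python
from llm_ie.chunkers import SeparatorUnitChunker

text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunker = SeparatorUnitChunker(sep="\n\n")
units = chunker.chunk(text)  # doc_id defaults to a random UUID

print(len(units))  # 3
```
Note that the separator itself is excluded from every unit, so the character spans of consecutive units are not contiguous.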
llm_ie.chunkers.TextLineUnitChunker
Bases: UnitChunker
This class chunks a document into lines.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
"""
This class chunks a document into lines.
"""
super().__init__()
chunk
chunk(
text: str, doc_id: str = None
) -> List[FrameExtractionUnit]
Parameters:
text : str
The document text.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text:str, doc_id:str=None) -> List[FrameExtractionUnit]:
"""
Parameters:
----------
text : str
The document text.
"""
doc_id = doc_id if doc_id is not None else str(uuid.uuid4())
lines = text.split('\n')
line_units = []
start = 0
for line in lines:
end = start + len(line)
line_units.append(FrameExtractionUnit(
doc_id=doc_id,
start=start,
end=end,
text=line
))
start = end + 1
return line_units
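Example:
A minimal usage sketch (equivalent to splitting on "\n"; the text attribute on FrameExtractionUnit is assumed as above).
```python
from llm_ie.chunkers import TextLineUnitChunker

text = "Allergies: penicillin\nMedications: aspirin\nDiagnosis: angina"
units = TextLineUnitChunker().chunk(text)

print([unit.text for unit in units])  # one unit per line
```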
llm_ie.chunkers.LLMUnitChunker
LLMUnitChunker(
inference_engine: InferenceEngine,
prompt_template: str = None,
system_prompt: str = None,
)
Bases: UnitChunker
This class prompts an LLM for document segmentation (e.g., sections, paragraphs).
Parameters:
inference_engine : InferenceEngine
The LLM inference engine object.
prompt_template : str, optional
The prompt template that defines how to chunk the document. If not provided, a packaged default template is used. The template must instruct the LLM to output JSON in the following schema:
```json
[
    {
        "title": "<your title here>",
        "anchor_text": "<the anchor text of the chunk here>"
    },
    {
        "title": "<your title here>",
        "anchor_text": "<the anchor text of the chunk here>"
    }
]
```
system_prompt : str, optional
The system prompt.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self, inference_engine:InferenceEngine, prompt_template:str=None, system_prompt:str=None):
"""
This class prompts an LLM for document segmentation (e.g., sections, paragraphs).
Parameters:
----------
inference_engine : InferenceEngine
the LLM inferencing engine object.
prompt_template : str
the prompt template that defines how to chunk the document. Must define a JSON schema with
```json
[
{
"title": "<your title here>",
"anchor_text": "<the anchor text of the chunk here>"
},
{
"title": "<your title here>",
"anchor_text": "<the anchor text of the chunk here>"
}
]
```
system_prompt : str, optional
The system prompt.
"""
self.inference_engine = inference_engine
if prompt_template is None:
file_path = importlib.resources.files('llm_ie.asset.default_prompts').joinpath("LLMUnitChunker_user_prompt.txt")
with open(file_path, 'r', encoding="utf-8") as f:
self.prompt_template = f.read()
else:
self.prompt_template = prompt_template
self.system_prompt = system_prompt
chunk
chunk(text, doc_id=None) -> List[FrameExtractionUnit]
Parameters:
text : str
the document text.
doc_id : str, optional
the document id.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def chunk(self, text, doc_id=None) -> List[FrameExtractionUnit]:
"""
Parameters:
-----------
text : str
the document text.
doc_id : str, optional
the document id.
"""
doc_id = doc_id if doc_id is not None else str(uuid.uuid4())
user_prompt = apply_prompt_template(prompt_template=self.prompt_template, text_content=text)
messages = []
if self.system_prompt is not None:
messages.append({'role': 'system', 'content': self.system_prompt})
messages.append({'role': 'user', 'content': user_prompt})
gen_text = self.inference_engine.chat(messages=messages)
header_list = extract_json(gen_text=gen_text["response"])
units = []
start = 0
prev_end = 0
for header in header_list:
if "anchor_text" not in header:
Warning.warn(f"Missing anchor_text in header: {header}. Skipping this header.")
continue
if not isinstance(header["anchor_text"], str):
Warning.warn(f"Invalid anchor_text: {header['anchor_text']}. Skipping this header.")
continue
start = prev_end
# find the first instance of the anchor text in the rest of the text
end = text.find(header["anchor_text"], start)
# if not found, skip this header
if end == -1:
continue
# if start == end (empty text), skip this header
if start == end:
continue
# create a frame extraction unit
units.append(FrameExtractionUnit(
doc_id=doc_id,
start=start,
end=end,
text=text[start:end]
))
prev_end = end
# add the last section
if prev_end < len(text):
units.append(FrameExtractionUnit(
doc_id=doc_id,
start=prev_end,
end=len(text),
text=text[prev_end:]
))
return units
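Example:
A hedged usage sketch: engine construction is not shown here, so `engine` below is a placeholder for any configured InferenceEngine instance, and `report_text` stands in for the document string.
```python
from llm_ie.chunkers import LLMUnitChunker

engine = ...         # placeholder: any configured InferenceEngine instance
report_text = "..."  # placeholder: the document to segment

chunker = LLMUnitChunker(inference_engine=engine)  # uses the packaged default prompt template
units = chunker.chunk(report_text, doc_id="report-001")

for unit in units:
    print(unit.text[:80])  # first characters of each LLM-identified section
```
Because segmentation relies on the LLM returning valid anchor texts, headers whose anchor_text cannot be found in the document are skipped (see the source above).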
Context Chunkers
Context chunkers determine what contextual information is provided to the LLM alongside a specific FrameExtractionUnit.
llm_ie.chunkers.ContextChunker
Bases: ABC
This is the abstract base class for context chunkers. Given a frame extraction unit,
it returns the context for that unit.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
"""
This is the abstract class for context chunker. Given a frame extraction unit,
it returns the context for it.
"""
pass
fit
abstractmethod
fit(text: str, units: List[FrameExtractionUnit])
Parameters:
text : str
The document text.
Source code in package/llm-ie/src/llm_ie/chunkers.py
@abc.abstractmethod
def fit(self, text:str, units:List[FrameExtractionUnit]):
"""
Parameters:
----------
text : str
The document text.
"""
pass
fit_async
async
fit_async(
text: str,
units: List[FrameExtractionUnit],
executor=None,
)
Asynchronous version of the fit method.
Source code in package/llm-ie/src/llm_ie/chunkers.py
async def fit_async(self, text:str, units:List[FrameExtractionUnit], executor=None):
"""
asynchronous version of fit method.
"""
loop = asyncio.get_running_loop()
return await loop.run_in_executor(executor, self.fit, text, units)
chunk
abstractmethod
chunk(unit: FrameExtractionUnit) -> str
Parameters:
unit : FrameExtractionUnit
The frame extraction unit.
Return : str
The context for the frame extraction unit.
Source code in package/llm-ie/src/llm_ie/chunkers.py
@abc.abstractmethod
def chunk(self, unit:FrameExtractionUnit) -> str:
"""
Parameters:
----------
unit : FrameExtractionUnit
The frame extraction unit.
Return : str
The context for the frame extraction unit.
"""
return NotImplemented
chunk_async
async
chunk_async(
unit: FrameExtractionUnit, executor=None
) -> str
Asynchronous version of the chunk method.
Source code in package/llm-ie/src/llm_ie/chunkers.py
async def chunk_async(self, unit:FrameExtractionUnit, executor=None) -> str:
"""
asynchronous version of chunk method.
"""
loop = asyncio.get_running_loop()
return await loop.run_in_executor(executor, self.chunk, unit)
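Example:
To supply a custom context strategy, subclass ContextChunker and implement fit and chunk. A minimal, hypothetical sketch that always returns the leading characters of the document as context (the FrameExtractionUnit import path is an assumption, as above).
```python
from typing import List

# Assumption: FrameExtractionUnit is importable from llm_ie.chunkers.
from llm_ie.chunkers import ContextChunker, FrameExtractionUnit


class LeadingTextContextChunker(ContextChunker):
    """Hypothetical context chunker that returns the first n_chars of the document."""
    def __init__(self, n_chars: int = 500):
        super().__init__()
        self.n_chars = n_chars
        self.prefix = None

    def fit(self, text: str, units: List[FrameExtractionUnit]):
        self.prefix = text[:self.n_chars]

    def chunk(self, unit: FrameExtractionUnit) -> str:
        return self.prefix
```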
llm_ie.chunkers.NoContextChunker
Bases: ContextChunker
This class does not provide any context.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
"""
This class does not provide any context.
"""
super().__init__()
fit
fit(text: str, units: List[FrameExtractionUnit])
Parameters:
text : str
The document text.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def fit(self, text:str, units:List[FrameExtractionUnit]):
"""
Parameters:
----------
text : str
The document text.
"""
pass
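Example:
NoContextChunker is appropriate when each unit is self-contained. A minimal sketch (the chunk implementation is not shown in this section; it is expected to return an empty context).
```python
from llm_ie.chunkers import SentenceUnitChunker, NoContextChunker

text = "The patient denies fever. She reports a mild cough."
units = SentenceUnitChunker().chunk(text)

context_chunker = NoContextChunker()
context_chunker.fit(text, units)           # no-op
context = context_chunker.chunk(units[0])  # expected to be empty: no extra context
```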
llm_ie.chunkers.WholeDocumentContextChunker
WholeDocumentContextChunker()
Bases: ContextChunker
This class provides the whole document as context.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self):
"""
This class provides the whole document as context.
"""
super().__init__()
self.text = None
fit
fit(text: str, units: List[FrameExtractionUnit])
Parameters:
text : str
The document text.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def fit(self, text:str, units:List[FrameExtractionUnit]):
"""
Parameters:
----------
text : str
The document text.
"""
self.text = text
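Example:
A minimal usage sketch; fit stores the document text, and chunk (not shown in this section) is expected to return it for every unit.
```python
from llm_ie.chunkers import SentenceUnitChunker, WholeDocumentContextChunker

text = "The patient denies fever. She reports a mild cough."
units = SentenceUnitChunker().chunk(text)

context_chunker = WholeDocumentContextChunker()
context_chunker.fit(text, units)
context = context_chunker.chunk(units[0])  # expected: the full document text
```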
llm_ie.chunkers.SlideWindowContextChunker
SlideWindowContextChunker(window_size: int)
Bases: ContextChunker
This class provides a sliding-window context, for example ±2 sentences around the unit sentence.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def __init__(self, window_size:int):
"""
This class provides a sliding window context. For example, +-2 sentences around a unit sentence.
"""
super().__init__()
self.window_size = window_size
self.units = None
fit
fit(text: str, units: List[FrameExtractionUnit])
Parameters:
units : List[FrameExtractionUnit]
The list of frame extraction units.
Source code in package/llm-ie/src/llm_ie/chunkers.py
def fit(self, text:str, units:List[FrameExtractionUnit]):
"""
Parameters:
----------
units : List[FrameExtractionUnit]
The list of frame extraction units.
"""
self.units = sorted(units)
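Example:
A minimal usage sketch pairing it with a sentence unit chunker; chunk (not shown in this section) is expected to return the neighboring sentences around the unit as a single string.
```python
from llm_ie.chunkers import SentenceUnitChunker, SlideWindowContextChunker

text = ("Sentence one. Sentence two. Sentence three. "
        "Sentence four. Sentence five.")
units = SentenceUnitChunker().chunk(text)

context_chunker = SlideWindowContextChunker(window_size=2)
context_chunker.fit(text, units)            # sorts and stores the units
context = context_chunker.chunk(units[2])   # expected: sentences around "Sentence three."
```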