java.lang.Object
org.elasticsearch.xpack.core.inference.chunking.WordBoundaryChunker
All Implemented Interfaces:
Chunker

public class WordBoundaryChunker extends Object implements Chunker
Breaks text into smaller strings (chunks) on word boundaries. Whitespace is preserved and placed at the start of the following chunk rather than at the end of the current one. If a chunk ends on a punctuation mark, the punctuation is included in the next chunk. The overlap value must not exceed chunkSize / 2, which avoids the complexity of tracking the start positions of multiple chunks within a single chunk.
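The windowing behaviour described above can be illustrated with a minimal, self-contained sketch. This is not the WordBoundaryChunker implementation: the class WordWindowSketch, the record Offset and the helper containsLetterOrDigit are hypothetical names, java.text.BreakIterator stands in for whatever boundary detection the real class uses, and the overlap limit is assumed to be "at most chunkSize / 2".

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a simplified word-window chunker, not the Elasticsearch class.
class WordWindowSketch {

    /** Hypothetical offset pair: [start, end) into the input string. */
    record Offset(int start, int end) {}

    static List<Offset> chunk(String input, int chunkSize, int overlap) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        if (overlap < 0) {
            throw new IllegalArgumentException("overlap must be non-negative");
        }
        // Assumed limit: overlap may be at most half of chunkSize.
        if (overlap > chunkSize / 2) {
            throw new IllegalArgumentException("overlap must not exceed chunkSize / 2");
        }

        List<Offset> chunks = new ArrayList<>();
        if (input.isEmpty()) {
            return chunks;
        }

        // Record the end offset of every segment that looks like a word.
        List<Integer> wordEnds = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(input);
        int segStart = boundaries.first();
        for (int segEnd = boundaries.next(); segEnd != BreakIterator.DONE; segEnd = boundaries.next()) {
            if (containsLetterOrDigit(input, segStart, segEnd)) {
                wordEnds.add(segEnd);
            }
            segStart = segEnd;
        }

        int wordCount = wordEnds.size();
        int step = chunkSize - overlap; // always >= 1 because of the checks above
        for (int first = 0; ; first += step) {
            int last = Math.min(first + chunkSize, wordCount);
            // A chunk starts where the previous word ended, so leading whitespace
            // and punctuation travel with the following chunk.
            int start = first == 0 ? 0 : wordEnds.get(first - 1);
            int end = last == wordCount ? input.length() : wordEnds.get(last - 1);
            chunks.add(new Offset(start, end));
            if (last == wordCount) {
                break;
            }
        }
        return chunks;
    }

    private static boolean containsLetterOrDigit(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (Character.isLetterOrDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps.";
        for (Offset o : chunk(text, 3, 1)) {
            System.out.println(o + " -> [" + text.substring(o.start(), o.end()) + "]");
        }
    }
}
```

In this sketch, chunking "The quick brown fox jumps." with chunkSize 3 and overlap 1 yields two chunks, "The quick brown" and " brown fox jumps.": the second chunk begins with the whitespace that followed the previous window and repeats one word. Capping the overlap at half the chunk size keeps the step between windows at least as large as the overlap, so any word appears in at most two chunks.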
  • Constructor Details

    • WordBoundaryChunker

      public WordBoundaryChunker()
  • Method Details

    • chunk

      public List<Chunker.ChunkOffset> chunk(String input, ChunkingSettings chunkingSettings)
      Breaks the input text into small chunks as dictated by the chunking settings.
      Specified by:
      chunk in interface Chunker
      Parameters:
      input - Text to chunk
      chunkingSettings - The chunking settings that configure chunkSize and overlap
      Returns:
      List of offsets, one per chunk of the input text
    • chunk

      public List<Chunker.ChunkOffset> chunk(String input, int chunkSize, int overlap)
      Breaks the input text into small chunks as dictated by chunkSize and overlap.
      Parameters:
      input - Text to chunk
      chunkSize - The number of words in each chunk
      overlap - The number of words shared between consecutive chunks. Must be non-negative; 0 means no overlap.
      Returns:
      List of offsets, one per chunk of the input text
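Because both chunk methods return Chunker.ChunkOffset values rather than strings, callers that need the chunk text have to slice the original input. The sketch below shows one way to do that. It assumes each offset carries a start index and an exclusive end index; the record Offset and the class ChunkTextSketch are hypothetical stand-ins rather than the library types, and the sample offsets are made up for illustration.

```java
import java.util.List;

// Illustrative only: turns offset pairs back into chunk strings.
class ChunkTextSketch {

    /** Hypothetical stand-in for Chunker.ChunkOffset; start/end are an assumption. */
    record Offset(int start, int end) {}

    /** Slices the input once per offset, treating end as exclusive. */
    static List<String> toText(String input, List<Offset> offsets) {
        return offsets.stream()
            .map(o -> input.substring(o.start(), o.end()))
            .toList();
    }

    public static void main(String[] args) {
        String input = "The quick brown fox jumps.";
        // Hand-written sample offsets, not produced by WordBoundaryChunker.
        List<Offset> offsets = List.of(new Offset(0, 15), new Offset(9, 26));
        toText(input, offsets).forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

Keeping offsets instead of copied strings means overlapping chunks do not duplicate the shared text in memory until a caller actually needs the substring.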