chunk

package
v0.1.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 1, 2026 License: MIT Imports: 3 Imported by: 0

Documentation

Overview

Package chunk splits extracted text into overlapping chunks suitable for RAG (Retrieval-Augmented Generation) and full-text search indexing.

Splitting strategy:

  1. Split on paragraph boundaries first (double newline)
  2. If paragraphs exceed max tokens, split on sentence boundaries
  3. If sentences exceed max tokens, split on word boundaries
  4. Apply configurable overlap between consecutive chunks
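
Steps 1 to 3 of the strategy above can be sketched as a cascade of fallbacks. This is a minimal illustration and not the package's implementation: the names splitPieces, splitSentences, splitWords and approxTokens are hypothetical, token counts are approximated by counting whitespace-separated words, and sentence ends are detected naively by trailing punctuation. Overlap (step 4) and the packing of pieces into chunks are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// approxTokens is a hypothetical stand-in for a token counter.
// Assumption: one whitespace-separated word counts as one token.
func approxTokens(s string) int { return len(strings.Fields(s)) }

// splitSentences ends a sentence after any word that finishes with '.', '!' or '?'.
// This is a naive rule chosen for the sketch; it collapses runs of whitespace.
func splitSentences(p string) []string {
	var out, cur []string
	for _, w := range strings.Fields(p) {
		cur = append(cur, w)
		if strings.TrimRight(w, ".!?") != w {
			out = append(out, strings.Join(cur, " "))
			cur = nil
		}
	}
	if len(cur) > 0 {
		out = append(out, strings.Join(cur, " "))
	}
	return out
}

// splitWords groups words into runs of at most maxTokens words. maxTokens must be > 0.
func splitWords(s string, maxTokens int) []string {
	words := strings.Fields(s)
	var out []string
	for start := 0; start < len(words); start += maxTokens {
		end := start + maxTokens
		if end > len(words) {
			end = len(words)
		}
		out = append(out, strings.Join(words[start:end], " "))
	}
	return out
}

// splitPieces tries paragraphs first, then sentences, then words,
// so that every returned piece has at most maxTokens approximate tokens.
func splitPieces(text string, maxTokens int) []string {
	var pieces []string
	for _, para := range strings.Split(text, "\n\n") {
		para = strings.TrimSpace(para)
		if para == "" {
			continue
		}
		if approxTokens(para) <= maxTokens {
			pieces = append(pieces, para)
			continue
		}
		for _, sent := range splitSentences(para) {
			if approxTokens(sent) <= maxTokens {
				pieces = append(pieces, sent)
			} else {
				pieces = append(pieces, splitWords(sent, maxTokens)...)
			}
		}
	}
	return pieces
}

func main() {
	text := "Short paragraph.\n\nOne two three. Four five six seven eight nine."
	for _, p := range splitPieces(text, 4) {
		fmt.Printf("%q\n", p)
	}
}
```

Falling back one level at a time keeps natural boundaries intact wherever they fit, and only cuts mid-sentence when a single sentence is larger than the limit.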

Functions

func CountTokens

func CountTokens(text string) int

CountTokens returns the token count of text. It is the exported version of the counter, intended for use outside the package.

func EstimateTokens

func EstimateTokens(text string) int

EstimateTokens estimates the GPT-style token count of text. It uses a rough heuristic: for English, one token is about 0.75 words, or about 4 characters.
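
To make the heuristic concrete, here is a small sketch of how both rules of thumb could be turned into an estimate. The function estimateTokens is hypothetical and is not the package's code; in particular, taking the larger of the word-based and character-based figures is an assumption made for this example, since the documentation does not say how the two rules are combined.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// estimateTokens is a hypothetical illustration of the documented heuristic.
// Assumption: the word-based and character-based estimates are combined by
// taking the larger one, which errs on the side of overestimating.
func estimateTokens(text string) int {
	if strings.TrimSpace(text) == "" {
		return 0
	}
	words := len(strings.Fields(text))
	chars := utf8.RuneCountInString(text)

	byWords := (words*4 + 2) / 3 // words / 0.75, rounded up
	byChars := (chars + 3) / 4   // chars / 4, rounded up

	if byWords > byChars {
		return byWords
	}
	return byChars
}

func main() {
	// 4 words and 19 characters: the word rule gives 6, the character rule gives 5.
	fmt.Println(estimateTokens("the quick brown fox"))
}
```

Estimates like this are only approximate; text with long identifiers, URLs, or non-English content can tokenize very differently under a real tokenizer.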

Types

type Chunk

type Chunk struct {
	Index       int    // 0-based position in the sequence
	Text        string // chunk text content
	TokenCount  int    // approximate token count
	OverlapPrev int    // how many tokens overlap with the previous chunk
}

Chunk is one text fragment with metadata.

func Split

func Split(text string, opts Options) []Chunk

Split divides text into overlapping chunks.
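
The overlap between consecutive chunks, and the way it could be recorded in OverlapPrev, can be illustrated with a sliding window over words. This is a sketch under stated assumptions rather than what Split actually does: the Chunk type below is a local copy of the documented struct, slidingChunks is a hypothetical helper, one word is treated as one token, and paragraph and sentence boundaries are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk mirrors the documented struct so the example is self-contained.
type Chunk struct {
	Index       int
	Text        string
	TokenCount  int
	OverlapPrev int
}

// slidingChunks is a hypothetical helper: it emits windows of at most
// maxTokens words, each starting overlap words before the previous window ends.
// It returns nil when the parameters cannot make progress.
func slidingChunks(words []string, maxTokens, overlap int) []Chunk {
	if maxTokens <= 0 || overlap < 0 || overlap >= maxTokens {
		return nil
	}
	var chunks []Chunk
	step := maxTokens - overlap
	for start := 0; start < len(words); start += step {
		end := start + maxTokens
		if end > len(words) {
			end = len(words)
		}
		prev := 0
		if start > 0 {
			prev = overlap // the first `overlap` words repeat the previous chunk's tail
		}
		chunks = append(chunks, Chunk{
			Index:       len(chunks),
			Text:        strings.Join(words[start:end], " "),
			TokenCount:  end - start,
			OverlapPrev: prev,
		})
		if end == len(words) {
			break // the last window reached the end of the input
		}
	}
	return chunks
}

func main() {
	words := strings.Fields("a b c d e f g h i j")
	for _, c := range slidingChunks(words, 4, 1) {
		fmt.Printf("%d %q tokens=%d overlap=%d\n", c.Index, c.Text, c.TokenCount, c.OverlapPrev)
	}
}
```

Repeating the tail of one chunk at the head of the next means a sentence that straddles a boundary still appears whole in at least one chunk, which tends to help retrieval.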

type Options

type Options struct {
	// MaxTokens is the maximum number of tokens per chunk. Default: 512.
	MaxTokens int
	// OverlapTokens is the number of tokens to overlap between chunks. Default: 64.
	OverlapTokens int
	// MinChunkTokens is the minimum chunk size; shorter chunks are merged. Default: 32.
	MinChunkTokens int
}

Options configures the chunking behaviour.
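
The documented defaults suggest that unset fields are filled in before splitting. The sketch below shows one way that could work. It is an assumption, not confirmed behaviour: the Options type is a local copy of the documented struct, withDefaults is a hypothetical helper, and treating zero or negative values as "unset" is a guess about how the package interprets them.

```go
package main

import "fmt"

// Options mirrors the documented struct so the example is self-contained.
type Options struct {
	MaxTokens      int
	OverlapTokens  int
	MinChunkTokens int
}

// withDefaults is a hypothetical helper that replaces unset (non-positive)
// fields with the documented defaults: 512, 64 and 32.
func withDefaults(o Options) Options {
	if o.MaxTokens <= 0 {
		o.MaxTokens = 512
	}
	if o.OverlapTokens <= 0 {
		o.OverlapTokens = 64
	}
	if o.MinChunkTokens <= 0 {
		o.MinChunkTokens = 32
	}
	return o
}

func main() {
	// Only MaxTokens is set; the other fields fall back to the defaults.
	fmt.Printf("%+v\n", withDefaults(Options{MaxTokens: 256}))
}
```

One consequence of this zero-means-default convention is that an overlap of 0 cannot be requested explicitly; whether the real package behaves this way is not stated in the documentation.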
