corpse

package module
v0.0.0-...-1637694
Published: Sep 29, 2025 License: MIT Imports: 7 Imported by: 2

README

[!CAUTION] This module has moved to github.com/lrstanley/x/text/corpse and is maintained there, as it is not production ready and is only an experiment.



✨ Features

  • Text Vectorization: Convert text documents into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency). A rough sketch of the weighting follows this list.
  • Term Processing:
    • Built-in tokenization for text processing.
    • Support for extensible term filtering (stop words, lemmatization, stemming, etc).
    • Configurable term pruning to remove common or rare terms (minimum and maximum document frequency).
  • Vector Management:
    • Configurable vector size limits.
    • Automatic term frequency tracking.
    • Efficient memory usage with object pooling.
  • Search Capabilities:
    • Simple integration with HNSW (Hierarchical Navigable Small World) graphs for fast similarity search, as well as other search algorithms.
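
As a rough illustration of the weighting behind these vectors, here is a sketch of the classic TF-IDF formulation. This is illustrative only; the exact variant and any smoothing corpse applies internally may differ.

package main

import (
    "fmt"
    "math"
)

// tfidf sketches the classic TF-IDF weight for a single term: the term's
// frequency within a document, scaled down for terms that appear in many
// documents across the corpus.
func tfidf(termCountInDoc, termsInDoc, totalDocs, docsWithTerm int) float64 {
    tf := float64(termCountInDoc) / float64(termsInDoc)
    idf := math.Log(float64(totalDocs) / float64(docsWithTerm))
    return tf * idf
}

func main() {
    // A term appearing twice in a 10-term document, found in 3 of 100 documents.
    fmt.Printf("%.4f\n", tfidf(2, 10, 100, 3))
}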

⚠ Limitations

  • Designed for addition-only vectorization. If you want to remove or update documents, you'll need to re-index the entire corpus. The upside is reduced memory usage.
  • Designed primarily for in-memory vectorization. If you need clustering or other advanced features, use a proper vector database (and something like LLM-based embeddings).
  • I'm not an expert in text embedding, so there may be better ways to do this.

⚙ Usage

$ go get github.com/lrstanley/corpse

# If you want lemmatization or stemming. Note that these support English only.
# See the source if you want to use a different language.
$ go get github.com/lrstanley/corpse/lemm
$ go get github.com/lrstanley/corpse/stem

package main

import (
    "fmt"
    "github.com/coder/hnsw"
    "github.com/lrstanley/corpse"
    "github.com/lrstanley/corpse/lemm"
    "github.com/lrstanley/corpse/stem"
)

func main() {
    vectorSize := 25

    // Initialize a corpus with custom options.
    corp := corpse.New(
        corpse.WithMaxVectorSize(vectorSize),
        corpse.WithTermFilters(
            lemm.NewTermFilter(), // Add lemmatization.
            stem.NewTermFilter(), // Add stemming.
            corpse.StopTermFilter([]string{ // Remove common stop words.
                "the", "and", "is", "in", "etc...",
            }),
        ),
        corpse.WithPruneHooks(
            // Remove terms that appear in more than 85% of documents.
            corpse.PruneMoreThanPercent(85),
        ),
    )

    // Index your documents
    documents := map[string]string{
        "brown-fox":     "The quick brown fox jumps over the lazy dog.",
        "yellow-fox":    "The slow yellow fox jumps over the fast cat.",
        "foo-bar":       "Foo bar@baz",
        "walking-store": "I was walking to the store. Alphabetically, working, testing, and so on.",
        "lorem-ipsum":   "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
    }

    for _, doc := range documents {
        corp.IndexDocument(doc)
    }

    // Create a search graph.
    graph := hnsw.NewGraph[string]()
    graph.M = vectorSize // Should match your vector size.

    // Add documents to the graph.
    for id, doc := range documents {
        graph.Add(hnsw.MakeNode(id, corp.CreateVector(doc)))
    }

    // Search for similar documents.
    query := "yellow fox"
    results := graph.Search(corp.CreateVector(query), 2)

    for _, result := range results {
        fmt.Println(result.Key)
    }
}

For more advanced examples, check out the examples directory.
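
If the search structure you use requires every vector to have the same length, CreatePaddedVector can be swapped in for CreateVector. A minimal sketch, reusing the corp, documents, and graph values from the example above (only the vector calls change):

// Add documents using fixed-length (padded) vectors.
for id, doc := range documents {
    graph.Add(hnsw.MakeNode(id, corp.CreatePaddedVector(doc)))
}

// The query vector must be padded the same way.
results := graph.Search(corp.CreatePaddedVector("yellow fox"), 2)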

🙋‍♂️ Support & Assistance

  • ❤ Please review the Code of Conduct for guidelines on ensuring everyone has the best experience interacting with the community.
  • 🙋‍♂️ Take a look at the support document for tips on how to ask the right questions.
  • 🐞 For all features/bugs/issues/questions/etc, head over here.

🤝 Contributing

  • ❤ Please review the Code of Conduct for guidelines on ensuring everyone has the best experience interacting with the community.
  • 📋 Please review the contributing doc for submitting issues/a guide on submitting pull requests and helping out.
  • 🗝 For anything security related, please review this repository's security policy.

⚖ License

MIT License

Copyright (c) 2025 Liam Stanley <[email protected]>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Also located here

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DefaultTokenizer

func DefaultTokenizer(text string) iter.Seq[string]
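
DefaultTokenizer returns an iterator over the terms of the given text, which can be ranged over directly. A minimal sketch (the exact splitting and normalization rules are whatever the package implements):

package main

import (
    "fmt"

    "github.com/lrstanley/corpse"
)

func main() {
    for term := range corpse.DefaultTokenizer("The quick brown fox jumps over the lazy dog.") {
        fmt.Println(term)
    }
}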

func IsNoMatchVector

func IsNoMatchVector(vector []float32) bool

IsNoMatchVector returns true if the vector didn't match any terms.

func VectorSubCount

func VectorSubCount(vector []float32) (count int)

VectorSubCount returns the number of non-zero values in the vector.
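
As an illustration, these two helpers might be used to guard a search when a query shares no terms with the corpus. A minimal sketch (the document text and options are placeholders):

package main

import (
    "fmt"

    "github.com/lrstanley/corpse"
)

func main() {
    corp := corpse.New(corpse.WithMaxVectorSize(25))
    corp.IndexDocument("the quick brown fox jumps over the lazy dog")

    vec := corp.CreateVector("purple elephant") // Terms that were never indexed.
    if corpse.IsNoMatchVector(vec) {
        fmt.Println("query matched no indexed terms; skipping search")
        return
    }
    fmt.Printf("query matched %d terms\n", corpse.VectorSubCount(vec))
}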

Types

type Corpus

type Corpus struct {
	// contains filtered or unexported fields
}

Corpus stores term frequencies across all documents.

func New

func New(options ...Option) *Corpus

New creates a new corpus with the given options.

func (*Corpus) CreatePaddedVector

func (c *Corpus) CreatePaddedVector(text string) []float32

CreatePaddedVector creates a vector with the maximum potential vector size, padding with zeros if the actual vector is smaller. Only needed if the graph you use to compare vectors does not support sparse vectors, as padded vectors use more memory.

This is concurrent-safe.

func (*Corpus) CreateVector

func (c *Corpus) CreateVector(text string) []float32

CreateVector creates a TF-IDF vector for the given text. Note that ALL documents must be indexed before generating vectors and adding them to a graph. Note that the returned vector will not be padded. See [CreatePaddedVector] if you need a constant-sized vector.

This will automatically call Corpus.Prune if there are any new documents that have been indexed since the last prune.

This is concurrent-safe.

func (*Corpus) GetDocumentCount

func (c *Corpus) GetDocumentCount() int

GetDocumentCount returns the number of documents that have been indexed.

func (*Corpus) GetTermFrequency

func (c *Corpus) GetTermFrequency() map[string]int

GetTermFrequency returns a snapshot of the term frequencies. Note that because [CreateVector] calls Corpus.Prune before creating vectors, if you invoke this before [CreateVector], you may receive terms that have not yet been pruned by [PruneHook]s. In that case, call Corpus.Prune manually before this function.

func (*Corpus) GetUsedCapacity

func (c *Corpus) GetUsedCapacity() (percent int)

GetUsedCapacity returns the percentage of the corpus capacity that is used. You can use this to determine if you are getting close to the max vector size. If you do go above capacity, all vectors will be calculated with the first X terms (sorted), where X is the max vector size, and you will lose corpus information. Make sure to call Corpus.Prune before checking this.
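
A minimal sketch of using these inspection helpers together, pruning manually before reading anything (the documents and options are placeholders):

package main

import (
    "fmt"

    "github.com/lrstanley/corpse"
)

func main() {
    corp := corpse.New(corpse.WithMaxVectorSize(25))
    corp.IndexDocument("the quick brown fox")
    corp.IndexDocument("the slow yellow fox")

    corp.Prune() // Prune manually before inspecting the corpus.

    fmt.Println("documents indexed:", corp.GetDocumentCount())
    fmt.Println("distinct terms:", len(corp.GetTermFrequency()))

    if corp.GetUsedCapacity() > 90 {
        fmt.Println("close to the max vector size; consider more aggressive pruning")
    }
}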

func (*Corpus) IndexDocument

func (c *Corpus) IndexDocument(text string)

IndexDocument indexes a document, calculating occurrences of each term. Note that you should call this for ALL documents before creating vectors for your documents (or search queries).

This is concurrent-safe.
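
Since indexing is concurrent-safe, documents can be indexed from multiple goroutines, as long as all indexing finishes before any vectors are created. A minimal sketch (the options and documents are placeholders):

package main

import (
    "sync"

    "github.com/lrstanley/corpse"
)

func main() {
    corp := corpse.New(corpse.WithMaxVectorSize(25))
    docs := []string{"first example document", "second example document", "third example document"}

    var wg sync.WaitGroup
    for _, doc := range docs {
        wg.Add(1)
        go func(text string) {
            defer wg.Done()
            corp.IndexDocument(text)
        }(doc)
    }
    wg.Wait() // Index everything before creating any vectors.
}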

func (*Corpus) Prune

func (c *Corpus) Prune()

Prune runs all prune hooks, removing terms of less importance from the corpus. This is automatically run by Corpus.CreateVector if there are any new documents that have been indexed since the last prune. Run it manually if you don't plan to invoke Corpus.CreateVector immediately after indexing all documents. Do not run this until you have indexed all documents.

This is concurrent-safe.

func (*Corpus) Reset

func (c *Corpus) Reset()

Reset resets the corpus to its initial state.

type Option

type Option func(*Corpus)

func WithMaxVectorSize

func WithMaxVectorSize(size int) Option

WithMaxVectorSize sets the maximum potential vector size.

func WithPruneHooks

func WithPruneHooks(hooks ...PruneHook) Option

WithPruneHooks allows adding hooks, which are run before vectorization, that remove terms from the corpus. This can be used to remove terms that appear in either too few or too many documents, to reduce the size of the corpus.

func WithTermFilters

func WithTermFilters(filters ...TermFilter) Option

WithTermFilters allows adding filters to the tokenizer iterator, for example:

  • stopword removal
  • lemmatization
  • stemming

Order of operations: tokenizer -> filter (1st call) -> filter (2nd call) -> ... -> filter (n-th call)
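
For instance, a chain might normalize case before dropping short terms and stop words; a minimal sketch using only filters provided by this package (the stop-word list is illustrative):

package main

import (
    "strings"

    "github.com/lrstanley/corpse"
)

func main() {
    corp := corpse.New(
        corpse.WithTermFilters(
            corpse.TermFilterFunc(strings.ToLower),        // 1st: normalize case.
            corpse.WithMinLenTermFilter(3),                // 2nd: drop very short terms.
            corpse.StopTermFilter([]string{"the", "and"}), // 3rd: drop stop words.
        ),
    )
    corp.IndexDocument("The quick brown fox AND the lazy dog")
}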

func WithTokenizer

func WithTokenizer(tokenizer Tokenizer) Option

type PruneHook

type PruneHook func(documents int, termFreq map[string]int) (toRemove []string)

func PruneLessThan

func PruneLessThan(count int) PruneHook

PruneLessThan is a PruneHook that removes terms that appear in fewer than the given number of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.

func PruneLessThanPercent

func PruneLessThanPercent(percent int) PruneHook

PruneLessThanPercent is a PruneHook that removes terms that appear in less than the given percentage of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.

func PruneMoreThan

func PruneMoreThan(count int) PruneHook

PruneMoreThan is a PruneHook that removes terms that appear in more than the given number of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.

func PruneMoreThanPercent

func PruneMoreThanPercent(percent int) PruneHook

PruneMoreThanPercent is a PruneHook that removes terms that appear in more than the given percentage of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.
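
Custom hooks can also implement the PruneHook signature directly. A minimal sketch of a hook that drops terms seen in only one document, assuming termFreq maps each term to the number of documents containing it (similar in spirit to PruneLessThan(2)):

package main

import "github.com/lrstanley/corpse"

// pruneSingletons removes terms that appear in only a single document.
func pruneSingletons(documents int, termFreq map[string]int) (toRemove []string) {
    for term, freq := range termFreq {
        if freq <= 1 {
            toRemove = append(toRemove, term)
        }
    }
    return toRemove
}

func main() {
    corp := corpse.New(corpse.WithPruneHooks(pruneSingletons))
    corp.IndexDocument("first example document")
    corp.IndexDocument("second example document")
    corp.Prune()
}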

type TermFilter

type TermFilter func(iter.Seq[string]) iter.Seq[string]

func StopTermFilter

func StopTermFilter(words []string) TermFilter

StopTermFilter removes stop words from the tokenizer iterator (i.e. ignores them).

func TermFilterFunc

func TermFilterFunc(filter func(string) string) TermFilter

TermFilterFunc is a helper function that creates a TermFilter from a function that transforms a single term. If the filter returns an empty string, the term is skipped.
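
For example, a minimal sketch of a filter that keeps only purely alphabetic terms by returning an empty string for everything else:

package main

import (
    "unicode"

    "github.com/lrstanley/corpse"
)

func main() {
    onlyLetters := corpse.TermFilterFunc(func(term string) string {
        for _, r := range term {
            if !unicode.IsLetter(r) {
                return "" // Returning "" skips the term entirely.
            }
        }
        return term
    })

    corp := corpse.New(corpse.WithTermFilters(onlyLetters))
    corp.IndexDocument("foo bar@baz 123")
}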

func WithMaxLenTermFilter

func WithMaxLenTermFilter(maxLen int) TermFilter

WithMaxLenTermFilter removes terms that are longer than the given length.

func WithMinLenTermFilter

func WithMinLenTermFilter(minLen int) TermFilter

WithMinLenTermFilter removes terms that are shorter than the given length.

type Tokenizer

type Tokenizer func(text string) iter.Seq[string]
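
A custom Tokenizer is just a function returning an iterator of terms, passed in via WithTokenizer. A minimal sketch of a lowercasing whitespace tokenizer:

package main

import (
    "iter"
    "strings"

    "github.com/lrstanley/corpse"
)

// whitespaceTokenizer lowercases the text and splits it on whitespace.
func whitespaceTokenizer(text string) iter.Seq[string] {
    return func(yield func(string) bool) {
        for _, term := range strings.Fields(strings.ToLower(text)) {
            if !yield(term) {
                return
            }
        }
    }
}

func main() {
    corp := corpse.New(corpse.WithTokenizer(whitespaceTokenizer))
    corp.IndexDocument("The quick brown fox")
}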

Directories

Path Synopsis
internal
lemm module
stem module
