corpse

package module
v0.0.0-...-1637694
Published: Sep 29, 2025 License: MIT Imports: 7 Imported by: 2

README

[!CAUTION] This module has moved to github.com/lrstanley/x/text/corpse and is maintained there, as it is not production ready and is only an experiment.



✨ Features

  • Text Vectorization: Convert text documents into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency). A rough sketch of the weighting follows this list.
  • Term Processing:
    • Built-in tokenization for text processing.
    • Support for extensible term filtering (stop words, lemmatization, stemming, etc).
    • Configurable term pruning to remove common or rare terms (minimum and maximum document frequency).
  • Vector Management:
    • Configurable vector size limits.
    • Automatic term frequency tracking.
    • Efficient memory usage with object pooling.
  • Search Capabilities:
    • Simple integration with HNSW (Hierarchical Navigable Small World) graphs for fast similarity search, as well as other search algorithms.
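
As a rough illustration of the weighting behind these vectors, here is a sketch of the classic TF-IDF formulation. This is illustrative only; the exact variant and any smoothing corpse applies internally may differ.

package main

import (
    "fmt"
    "math"
)

// tfidf sketches the classic TF-IDF weight for a single term: the term's
// frequency within a document, scaled down for terms that appear in many
// documents across the corpus.
func tfidf(termCountInDoc, termsInDoc, totalDocs, docsWithTerm int) float64 {
    tf := float64(termCountInDoc) / float64(termsInDoc)
    idf := math.Log(float64(totalDocs) / float64(docsWithTerm))
    return tf * idf
}

func main() {
    // A term appearing twice in a 10-term document, found in 3 of 100 documents.
    fmt.Printf("%.4f\n", tfidf(2, 10, 100, 3))
}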

⚠ Limitations

  • Designed for addition-only vectorization. If you want to remove or update documents, you'll need to re-index the entire corpus. The upside is reduced memory usage.
  • Designed primarily for in-memory vectorization. If you need clustering or other advanced features, use a proper vector database (and something like LLM-based embeddings).
  • I'm not an expert in text embedding, so there may be better ways to do this.

⚙ Usage

$ go get github.com/lrstanley/corpse

# If you want lemmatization or stemming. Note that these support English only.
# See the source if you want to use a different language.
$ go get github.com/lrstanley/corpse/lemm
$ go get github.com/lrstanley/corpse/stem

package main

import (
    "fmt"
    "github.com/coder/hnsw"
    "github.com/lrstanley/corpse"
    "github.com/lrstanley/corpse/lemm"
    "github.com/lrstanley/corpse/stem"
)

func main() {
    vectorSize := 25

    // Initialize a corpus with custom options.
    corp := corpse.New(
        corpse.WithMaxVectorSize(vectorSize),
        corpse.WithTermFilters(
            lemm.NewTermFilter(), // Add lemmatization.
            stem.NewTermFilter(), // Add stemming.
            corpse.StopTermFilter([]string{ // Remove common stop words.
                "the", "and", "is", "in", "etc...",
            }),
        ),
        corpse.WithPruneHooks(
            // Remove terms that appear in more than 85% of documents.
            corpse.PruneMoreThanPercent(85),
        ),
    )

    // Index your documents
    documents := map[string]string{
        "brown-fox":     "The quick brown fox jumps over the lazy dog.",
        "yellow-fox":    "The slow yellow fox jumps over the fast cat.",
        "foo-bar":       "Foo bar@baz",
        "walking-store": "I was walking to the store. Alphabetically, working, testing, and so on.",
        "lorem-ipsum":   "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
    }

    for _, doc := range documents {
        corp.IndexDocument(doc)
    }

    // Create a search graph.
    graph := hnsw.NewGraph[string]()
    graph.M = vectorSize // Should match your vector size.

    // Add documents to the graph.
    for id, doc := range documents {
        graph.Add(hnsw.MakeNode(id, corp.CreateVector(doc)))
    }

    // Search for similar documents.
    query := "yellow fox"
    results := graph.Search(corp.CreateVector(query), 2)

    for _, result := range results {
        fmt.Println(result.Key)
    }
}

For more advanced examples, check out the examples directory.
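
If the search structure you use requires every vector to have the same length, CreatePaddedVector can be swapped in for CreateVector. A minimal sketch, reusing the corp, documents, and graph values from the example above (only the vector calls change):

// Add documents using fixed-length (padded) vectors.
for id, doc := range documents {
    graph.Add(hnsw.MakeNode(id, corp.CreatePaddedVector(doc)))
}

// The query vector must be padded the same way.
results := graph.Search(corp.CreatePaddedVector("yellow fox"), 2)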

🙋‍♂️ Support & Assistance

  • ❤ Please review the Code of Conduct for guidelines on ensuring everyone has the best experience interacting with the community.
  • 🙋‍♂️ Take a look at the support document for tips on how to ask the right questions.
  • 🐞 For all features/bugs/issues/questions/etc, head over here.

🤝 Contributing

  • ❤ Please review the Code of Conduct for guidelines on ensuring everyone has the best experience interacting with the community.
  • 📋 Please review the contributing doc for submitting issues/a guide on submitting pull requests and helping out.
  • 🗝 For anything security related, please review this repository's security policy.

⚖ License

MIT License

Copyright (c) 2025 Liam Stanley <[email protected]>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Also located here

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DefaultTokenizer

func DefaultTokenizer(text string) iter.Seq[string]
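
DefaultTokenizer returns an iterator over the terms of the given text, which can be ranged over directly. A minimal sketch (the exact splitting and normalization rules are whatever the package implements):

package main

import (
    "fmt"

    "github.com/lrstanley/corpse"
)

func main() {
    for term := range corpse.DefaultTokenizer("The quick brown fox jumps over the lazy dog.") {
        fmt.Println(term)
    }
}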

func IsNoMatchVector

func IsNoMatchVector(vector []float32) bool

IsNoMatchVector returns true if the vector didn't match any terms.

func VectorSubCount

func VectorSubCount(vector []float32) (count int)

VectorSubCount returns the number of non-zero values in the vector.
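
As an illustration, these two helpers might be used to guard a search when a query shares no terms with the corpus. A minimal sketch (the document text and options are placeholders):

package main

import (
    "fmt"

    "github.com/lrstanley/corpse"
)

func main() {
    corp := corpse.New(corpse.WithMaxVectorSize(25))
    corp.IndexDocument("the quick brown fox jumps over the lazy dog")

    vec := corp.CreateVector("purple elephant") // Terms that were never indexed.
    if corpse.IsNoMatchVector(vec) {
        fmt.Println("query matched no indexed terms; skipping search")
        return
    }
    fmt.Printf("query matched %d terms\n", corpse.VectorSubCount(vec))
}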

Types

type Corpus

type Corpus struct {
	// contains filtered or unexported fields
}

Corpus stores term frequencies across all documents.

func New

func New(options ...Option) *Corpus

New creates a new corpus with the given options.

func (*Corpus) CreatePaddedVector

func (c *Corpus) CreatePaddedVector(text string) []float32

CreatePaddedVector creates a vector with the maximum potential vector size, padding with zeros if the actual vector is smaller. Only needed if the graph you use to compare vectors does not support sparse vectors, as padded vectors use more memory.

This is concurrent-safe.

func (*Corpus) CreateVector

func (c *Corpus) CreateVector(text string) []float32

CreateVector creates a TF-IDF vector for the given text. Note that ALL documents must be indexed before generating vectors and adding them to a graph. Note that the returned vector will not be padded. See [CreatePaddedVector] if you need a constant-sized vector.

This will automatically call Corpus.Prune if there are any new documents that have been indexed since the last prune.

This is concurrent-safe.

func (*Corpus) GetDocumentCount

func (c *Corpus) GetDocumentCount() int

GetDocumentCount returns the number of documents that have been indexed.

func (*Corpus) GetTermFrequency

func (c *Corpus) GetTermFrequency() map[string]int

GetTermFrequency returns a snapshot of the term frequencies. Note that because [CreateVector] calls Corpus.Prune before creating vectors, if you invoke this before [CreateVector], you may receive terms that have not yet been pruned by [PruneHook]s. In that case, call Corpus.Prune manually before this function.

func (*Corpus) GetUsedCapacity

func (c *Corpus) GetUsedCapacity() (percent int)

GetUsedCapacity returns the percentage of the corpus capacity that is used. You can use this to determine if you are getting close to the max vector size. If you do go above capacity, all vectors will be calculated with the first X terms (sorted), where X is the max vector size, and you will lose corpus information. Make sure to call Corpus.Prune before checking this.
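
A minimal sketch of using these inspection helpers together, pruning manually before reading anything (the documents and options are placeholders):

package main

import (
    "fmt"

    "github.com/lrstanley/corpse"
)

func main() {
    corp := corpse.New(corpse.WithMaxVectorSize(25))
    corp.IndexDocument("the quick brown fox")
    corp.IndexDocument("the slow yellow fox")

    corp.Prune() // Prune manually before inspecting the corpus.

    fmt.Println("documents indexed:", corp.GetDocumentCount())
    fmt.Println("distinct terms:", len(corp.GetTermFrequency()))

    if corp.GetUsedCapacity() > 90 {
        fmt.Println("close to the max vector size; consider more aggressive pruning")
    }
}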

func (*Corpus) IndexDocument

func (c *Corpus) IndexDocument(text string)

IndexDocument indexes a document, calculating occurrences of each term. Note that you should call this for ALL documents before creating vectors for your documents (or search queries).

This is concurrent-safe.
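
Since indexing is concurrent-safe, documents can be indexed from multiple goroutines, as long as all indexing finishes before any vectors are created. A minimal sketch (the options and documents are placeholders):

package main

import (
    "sync"

    "github.com/lrstanley/corpse"
)

func main() {
    corp := corpse.New(corpse.WithMaxVectorSize(25))
    docs := []string{"first example document", "second example document", "third example document"}

    var wg sync.WaitGroup
    for _, doc := range docs {
        wg.Add(1)
        go func(text string) {
            defer wg.Done()
            corp.IndexDocument(text)
        }(doc)
    }
    wg.Wait() // Index everything before creating any vectors.
}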

func (*Corpus) Prune

func (c *Corpus) Prune()

Prune runs all prune hooks, removing terms of less importance from the corpus. This is automatically run by Corpus.CreateVector if there are any new documents that have been indexed since the last prune. Run it manually if you don't plan to invoke Corpus.CreateVector immediately after indexing all documents. Do not run this until you have indexed all documents.

This is concurrent-safe.

func (*Corpus) Reset

func (c *Corpus) Reset()

Reset resets the corpus to its initial state.

type Option

type Option func(*Corpus)

func WithMaxVectorSize

func WithMaxVectorSize(size int) Option

WithMaxVectorSize sets the maximum potential vector size.

func WithPruneHooks

func WithPruneHooks(hooks ...PruneHook) Option

WithPruneHooks allows adding hooks, which are run before vectorization, that remove terms from the corpus. This can be used to remove terms that appear in either too few or too many documents, to reduce the size of the corpus.

func WithTermFilters

func WithTermFilters(filters ...TermFilter) Option

WithTermFilters allows adding filters to the tokenizer iterator, for example:

  • stopword removal
  • lemmatization
  • stemming

Order of operations: tokenizer -> filter (1st call) -> filter (2nd call) -> ... -> filter (n-th call)
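
For instance, a chain might normalize case before dropping short terms and stop words; a minimal sketch using only filters provided by this package (the stop-word list is illustrative):

package main

import (
    "strings"

    "github.com/lrstanley/corpse"
)

func main() {
    corp := corpse.New(
        corpse.WithTermFilters(
            corpse.TermFilterFunc(strings.ToLower),        // 1st: normalize case.
            corpse.WithMinLenTermFilter(3),                // 2nd: drop very short terms.
            corpse.StopTermFilter([]string{"the", "and"}), // 3rd: drop stop words.
        ),
    )
    corp.IndexDocument("The quick brown fox AND the lazy dog")
}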

func WithTokenizer

func WithTokenizer(tokenizer Tokenizer) Option

type PruneHook

type PruneHook func(documents int, termFreq map[string]int) (toRemove []string)

func PruneLessThan

func PruneLessThan(count int) PruneHook

PruneLessThan is a PruneHook that removes terms that appear in fewer than the given number of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.

func PruneLessThanPercent

func PruneLessThanPercent(percent int) PruneHook

PruneLessThanPercent is a PruneHook that removes terms that appear in less than the given percentage of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.

func PruneMoreThan

func PruneMoreThan(count int) PruneHook

PruneMoreThan is a PruneHook that removes terms that appear in more than the given number of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.

func PruneMoreThanPercent

func PruneMoreThanPercent(percent int) PruneHook

PruneMoreThanPercent is a PruneHook that removes terms that appear in more than the given percentage of documents. Keep in mind that if you happen to have very few documents, this may remove all terms.
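
Custom hooks can also implement the PruneHook signature directly. A minimal sketch of a hook that drops terms seen in only one document, assuming termFreq maps each term to the number of documents containing it (similar in spirit to PruneLessThan(2)):

package main

import "github.com/lrstanley/corpse"

// pruneSingletons removes terms that appear in only a single document.
func pruneSingletons(documents int, termFreq map[string]int) (toRemove []string) {
    for term, freq := range termFreq {
        if freq <= 1 {
            toRemove = append(toRemove, term)
        }
    }
    return toRemove
}

func main() {
    corp := corpse.New(corpse.WithPruneHooks(pruneSingletons))
    corp.IndexDocument("first example document")
    corp.IndexDocument("second example document")
    corp.Prune()
}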

type TermFilter

type TermFilter func(iter.Seq[string]) iter.Seq[string]

func StopTermFilter

func StopTermFilter(words []string) TermFilter

StopTermFilter removes stop words from the tokenizer iterator (i.e. ignores them).

func TermFilterFunc

func TermFilterFunc(filter func(string) string) TermFilter

TermFilterFunc is a helper function that creates a TermFilter from a function that transforms a single term. If the filter returns an empty string, the term is skipped.
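
For example, a minimal sketch of a filter that keeps only purely alphabetic terms by returning an empty string for everything else:

package main

import (
    "unicode"

    "github.com/lrstanley/corpse"
)

func main() {
    onlyLetters := corpse.TermFilterFunc(func(term string) string {
        for _, r := range term {
            if !unicode.IsLetter(r) {
                return "" // Returning "" skips the term entirely.
            }
        }
        return term
    })

    corp := corpse.New(corpse.WithTermFilters(onlyLetters))
    corp.IndexDocument("foo bar@baz 123")
}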

func WithMaxLenTermFilter

func WithMaxLenTermFilter(maxLen int) TermFilter

WithMaxLenTermFilter removes terms that are longer than the given length.

func WithMinLenTermFilter

func WithMinLenTermFilter(minLen int) TermFilter

WithMinLenTermFilter removes terms that are shorter than the given length.

type Tokenizer

type Tokenizer func(text string) iter.Seq[string]
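
A custom Tokenizer is just a function returning an iterator of terms, passed in via WithTokenizer. A minimal sketch of a lowercasing whitespace tokenizer:

package main

import (
    "iter"
    "strings"

    "github.com/lrstanley/corpse"
)

// whitespaceTokenizer lowercases the text and splits it on whitespace.
func whitespaceTokenizer(text string) iter.Seq[string] {
    return func(yield func(string) bool) {
        for _, term := range strings.Fields(strings.ToLower(text)) {
            if !yield(term) {
                return
            }
        }
    }
}

func main() {
    corp := corpse.New(corpse.WithTokenizer(whitespaceTokenizer))
    corp.IndexDocument("The quick brown fox")
}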

Directories

Path Synopsis
internal
lemm module
stem module
