sentencepiece

package module
v0.0.0-...-ca555f6
Published: Oct 25, 2025 License: Apache-2.0 Imports: 10 Imported by: 0

README

go-sentencepiece


This is a pure Go implementation of encoding and decoding text with the SentencePiece tokenizer.

"Encoding" is the operation used to split text into tokens, using a trained tokenizer model. "Decoding" is the reverse process - converting a list of tokens into the original text.

SentencePiece is a general family of tokenizers that is configured by a protobuf configuration file. This repository currently focuses on implementing just the functionality required to reproduce the tokenization of Gemma models (the same tokenizer is used for Google's proprietary Gemini family of models).

This implementation supports both BPE (Byte Pair Encoding) and UNIGRAM tokenization algorithms:

  • BPE: Uses an iterative merge algorithm to combine frequent pairs of tokens
  • UNIGRAM: Uses Viterbi decoding to find the optimal tokenization path
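
To give a feel for the BPE side, here is a minimal, self-contained sketch of the iterative merge idea: repeatedly find the adjacent pair with the best merge priority and join it. The merge table below is a toy example for illustration only (a real model derives merges and scores from its trained vocabulary, and an optimized implementation uses a priority queue rather than a linear scan).

```go
package main

import "fmt"

// mergeRank maps a concatenated adjacent pair to its merge priority;
// a lower rank merges earlier. This toy table is illustrative only.
var mergeRank = map[string]int{
	"lo": 0, "low": 1, "er": 2, "lower": 3,
}

// bpeEncode splits text into single characters, then greedily applies
// the best-ranked merge until no mergeable pair remains.
func bpeEncode(text string) []string {
	var pieces []string
	for _, r := range text {
		pieces = append(pieces, string(r))
	}
	for {
		best, bestRank := -1, int(^uint(0)>>1) // max int
		for i := 0; i+1 < len(pieces); i++ {
			if rank, ok := mergeRank[pieces[i]+pieces[i+1]]; ok && rank < bestRank {
				best, bestRank = i, rank
			}
		}
		if best < 0 {
			return pieces
		}
		merged := pieces[best] + pieces[best+1]
		pieces = append(pieces[:best+1], pieces[best+2:]...)
		pieces[best] = merged
	}
}

func main() {
	// Merges l+o, then lo+w, then e+r, then low+er.
	fmt.Println(bpeEncode("lower")) // [lower]
}
```

UNIGRAM takes a different route: instead of greedy merges, it scores every possible segmentation and uses Viterbi decoding to pick the highest-probability path.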

Current status

This package should be ready to use for encoding text into tokens using the Gemma tokenizer; it's been reasonably optimized and extensively tested against the SentencePiece Python bindings (see system_test.go in this repository).

If you find any problems or discrepancies, please open an issue.

Tokenizer configuration

The configuration file for the tokenizer is a protobuf (structured data, serialized in the protocol buffer format) that describes a trained tokenizer model; it includes the complete learned vocabulary used for tokenization, as well as other configuration information.

It is not part of this repository. Please fetch it from the official Gemma implementation repository. The NewProcessor* constructors expect to read this file.

Developing

A protobuf is used to configure the tokenizer. The structure of the protobuf is described by the internal/model/sentencepiece_model.proto file, which is vendored from https://github.com/google/sentencepiece.

To re-generate the *.pb.go file from it:

$ cd internal/model
$ ./gen.sh

The configuration protobuf itself is obtained as described in the Tokenizer configuration section. All tests require the MODELPATH env var to point to a local copy of the tokenizer configuration file.
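
For example, a test run might look like this (the model path below is a placeholder; substitute the location of your local copy):

```shell
MODELPATH=/path/to/tokenizer.model go test ./...
```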

Online demo

To see an in-browser demo of this tokenizer in action, visit https://eliben.github.io/go-sentencepiece/

The Go code is compiled to WebAssembly and loaded from a small JS program to allow interactive encoding of text.

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ModelInfo

type ModelInfo struct {
	VocabularySize        int
	BeginningOfSentenceID int
	EndOfSentenceID       int
	UnknownID             int
	PadID                 int
}

ModelInfo stores information about the model proto loaded by the processor.

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

Processor represents a SentencePiece processor (tokenizer). A Processor converts input text into a sequence of tokens LLMs use, and back. The mapping between token IDs and the text they represent is read from the model proto (provided to the constructor); it's the same between all calls to the Encode method.

The term "processor" comes from the original C++ SentencePiece library and its Python bindings.

func NewProcessor

func NewProcessor(protoReader io.Reader) (*Processor, error)

NewProcessor creates a new Processor from a reader with the protobuf data.

func NewProcessorFromPath

func NewProcessorFromPath(protoFile string) (*Processor, error)

NewProcessorFromPath creates a new Processor from a file path to the protobuf data.

func (*Processor) Decode

func (proc *Processor) Decode(ids []int) string

Decode translates a list of IDs produced by [Encode] back into the string it represents.

Example
protoFile := os.Getenv("MODELPATH")
if protoFile == "" {
	log.Println("Need MODELPATH env var to run example")
	return
}

proc, err := sentencepiece.NewProcessorFromPath(protoFile)
if err != nil {
	log.Fatal(err)
}

ids := []int{17534, 2134}
text := proc.Decode(ids)

fmt.Println(text)

func (*Processor) DecodeTokens

func (proc *Processor) DecodeTokens(tokens []Token) string

DecodeTokens is a convenience wrapper around [Decode], accepting a list of tokens as returned by [Encode]. It only uses the ID fields of tokens to decode the text.
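
Since only the ID fields matter, DecodeTokens behaves like extracting the IDs and calling Decode. A minimal sketch of that extraction, using a local mirror of the Token type purely for illustration:

```go
package main

import "fmt"

// Token mirrors the package's Token type for illustration only.
type Token struct {
	ID   int
	Text string
}

// idsOf pulls out just the ID fields; conceptually,
// proc.DecodeTokens(tokens) is proc.Decode(idsOf(tokens)).
func idsOf(tokens []Token) []int {
	ids := make([]int, len(tokens))
	for i, t := range tokens {
		ids[i] = t.ID
	}
	return ids
}

func main() {
	tokens := []Token{{ID: 1, Text: "a"}, {ID: 2, Text: "b"}}
	fmt.Println(idsOf(tokens)) // [1 2]
}
```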

func (*Processor) Encode

func (proc *Processor) Encode(text string) []Token

Encode tokenizes the input text and returns a list of Tokens.

Example
protoFile := os.Getenv("MODELPATH")
if protoFile == "" {
	log.Println("Need MODELPATH env var to run example")
	return
}

proc, err := sentencepiece.NewProcessorFromPath(protoFile)
if err != nil {
	log.Fatal(err)
}

text := "Encoding produces tokens that LLMs can learn and understand"
tokens := proc.Encode(text)

for _, token := range tokens {
	fmt.Println(token)
}

func (*Processor) ModelInfo

func (proc *Processor) ModelInfo() *ModelInfo

ModelInfo returns information about the loaded proto model file.

type Token

type Token struct {
	ID   int
	Text string
}

Token represents a single token from the input text. ID is a unique token identifier that the model uses in its internal representation. Text is the piece of text this token represents.

func (Token) String

func (t Token) String() string

Directories

Path Synopsis
internal
cmd/dumper (command)
cmd/wasm (command): Main binary for exposing the go-sentencepiece functionality in the browser via WASM.
priorityqueue: Package priorityqueue provides a generic priority queue with Insert, PopMax, and RemoveFunc operations.
