sentencepiece

package module
v0.0.0-...-ca555f6
Published: Oct 25, 2025 License: Apache-2.0 Imports: 10 Imported by: 0

README

go-sentencepiece


This is a pure Go implementation of encoding and decoding text with the SentencePiece tokenizer.

"Encoding" is the operation used to split text into tokens, using a trained tokenizer model. "Decoding" is the reverse process - converting a list of tokens into the original text.

SentencePiece is a general family of tokenizers that is configured by a protobuf configuration file. This repository currently focuses on implementing just the functionality required to reproduce the tokenization of Gemma models (the same tokenizer is used for Google's proprietary Gemini family of models).

This implementation supports both BPE (Byte Pair Encoding) and UNIGRAM tokenization algorithms:

  • BPE: Uses an iterative merge algorithm to combine frequent pairs of tokens
  • UNIGRAM: Uses Viterbi decoding to find the optimal tokenization path
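
To give a feel for the BPE side, here is a minimal, self-contained sketch of the iterative merge idea: repeatedly find the adjacent pair with the best merge priority and join it. The merge table below is a toy example for illustration only (a real model derives merges and scores from its trained vocabulary, and an optimized implementation uses a priority queue rather than a linear scan).

```go
package main

import "fmt"

// mergeRank maps a concatenated adjacent pair to its merge priority;
// a lower rank merges earlier. This toy table is illustrative only.
var mergeRank = map[string]int{
	"lo": 0, "low": 1, "er": 2, "lower": 3,
}

// bpeEncode splits text into single characters, then greedily applies
// the best-ranked merge until no mergeable pair remains.
func bpeEncode(text string) []string {
	var pieces []string
	for _, r := range text {
		pieces = append(pieces, string(r))
	}
	for {
		best, bestRank := -1, int(^uint(0)>>1) // max int
		for i := 0; i+1 < len(pieces); i++ {
			if rank, ok := mergeRank[pieces[i]+pieces[i+1]]; ok && rank < bestRank {
				best, bestRank = i, rank
			}
		}
		if best < 0 {
			return pieces
		}
		merged := pieces[best] + pieces[best+1]
		pieces = append(pieces[:best+1], pieces[best+2:]...)
		pieces[best] = merged
	}
}

func main() {
	// Merges l+o, then lo+w, then e+r, then low+er.
	fmt.Println(bpeEncode("lower")) // [lower]
}
```

UNIGRAM takes a different route: instead of greedy merges, it scores every possible segmentation and uses Viterbi decoding to pick the highest-probability path.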

Current status

This package should be ready to use for encoding text into tokens using the Gemma tokenizer; it's been reasonably optimized and extensively tested against the SentencePiece Python bindings (see system_test.go in this repository).

If you find any problems or discrepancies, please open an issue.

Tokenizer configuration

The configuration file for the tokenizer is a protobuf (structured data, serialized in the protocol buffer format) that describes a trained tokenizer model; it includes the complete learned vocabulary used for tokenization, as well as other configuration information.

It is not part of this repository. Please fetch it from the official Gemma implementation repository. The NewProcessor* constructors expect to read this file.

Developing

A protobuf is used to configure the tokenizer. The structure of the protobuf is described by the internal/model/sentencepiece_model.proto file, which is vendored from https://github.com/google/sentencepiece.

To re-generate the *.pb.go file from it:

$ cd internal/model
$ ./gen.sh

The configuration protobuf itself is obtained as described in the Tokenizer configuration section. All tests require the MODELPATH env var to point to a local copy of the tokenizer configuration file.
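
For example, a test run might look like this (the model path below is a placeholder; substitute the location of your local copy):

```shell
MODELPATH=/path/to/tokenizer.model go test ./...
```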

Online demo

To see an in-browser demo of this tokenizer in action, visit https://eliben.github.io/go-sentencepiece/

The Go code is compiled to WebAssembly and loaded from a small JS program to allow interactive encoding of text.

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ModelInfo

type ModelInfo struct {
	VocabularySize        int
	BeginningOfSentenceID int
	EndOfSentenceID       int
	UnknownID             int
	PadID                 int
}

ModelInfo stores information about the model proto loaded by the processor.

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

Processor represents a SentencePiece processor (tokenizer). A Processor converts input text into a sequence of tokens LLMs use, and back. The mapping between token IDs and the text they represent is read from the model proto (provided to the constructor); it's the same between all calls to the Encode method.

The term "processor" comes from the original C++ SentencePiece library and its Python bindings.

func NewProcessor

func NewProcessor(protoReader io.Reader) (*Processor, error)

NewProcessor creates a new Processor from a reader with the protobuf data.

func NewProcessorFromPath

func NewProcessorFromPath(protoFile string) (*Processor, error)

NewProcessorFromPath creates a new Processor from a file path to the protobuf data.

func (*Processor) Decode

func (proc *Processor) Decode(ids []int) string

Decode translates a list of IDs produced by [Encode] back into the string it represents.

Example
protoFile := os.Getenv("MODELPATH")
if protoFile == "" {
	log.Println("Need MODELPATH env var to run example")
	return
}

proc, err := sentencepiece.NewProcessorFromPath(protoFile)
if err != nil {
	log.Fatal(err)
}

ids := []int{17534, 2134}
text := proc.Decode(ids)

fmt.Println(text)

func (*Processor) DecodeTokens

func (proc *Processor) DecodeTokens(tokens []Token) string

DecodeTokens is a convenience wrapper around [Decode], accepting a list of tokens as returned by [Encode]. It only uses the ID fields of tokens to decode the text.
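
Since only the ID fields matter, DecodeTokens behaves like extracting the IDs and calling Decode. A minimal sketch of that extraction, using a local mirror of the Token type purely for illustration:

```go
package main

import "fmt"

// Token mirrors the package's Token type for illustration only.
type Token struct {
	ID   int
	Text string
}

// idsOf pulls out just the ID fields; conceptually,
// proc.DecodeTokens(tokens) is proc.Decode(idsOf(tokens)).
func idsOf(tokens []Token) []int {
	ids := make([]int, len(tokens))
	for i, t := range tokens {
		ids[i] = t.ID
	}
	return ids
}

func main() {
	tokens := []Token{{ID: 1, Text: "a"}, {ID: 2, Text: "b"}}
	fmt.Println(idsOf(tokens)) // [1 2]
}
```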

func (*Processor) Encode

func (proc *Processor) Encode(text string) []Token

Encode tokenizes the input text and returns a list of Tokens.

Example
protoFile := os.Getenv("MODELPATH")
if protoFile == "" {
	log.Println("Need MODELPATH env var to run example")
	return
}

proc, err := sentencepiece.NewProcessorFromPath(protoFile)
if err != nil {
	log.Fatal(err)
}

text := "Encoding produces tokens that LLMs can learn and understand"
tokens := proc.Encode(text)

for _, token := range tokens {
	fmt.Println(token)
}

func (*Processor) ModelInfo

func (proc *Processor) ModelInfo() *ModelInfo

ModelInfo returns information about the loaded proto model file.

type Token

type Token struct {
	ID   int
	Text string
}

Token represents a single token from the input text. ID is a unique token identifier that the model uses in its internal representation. Text is the piece of text this token represents.

func (Token) String

func (t Token) String() string

Directories

Path Synopsis
internal
cmd/dumper (command)
cmd/wasm (command): Main binary for exposing the go-sentencepiece functionality in the browser via WASM.
priorityqueue: Package priorityqueue provides a generic priority queue with Insert, PopMax, and RemoveFunc operations.
