Documentation
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type ModelInfo
type ModelInfo struct {
	VocabularySize        int
	BeginningOfSentenceID int
	EndOfSentenceID       int
	UnknownID             int
	PadID                 int
}
ModelInfo stores information about the model proto loaded by the processor.
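One way the special-token fields might be used is to filter special IDs out of a sequence before further processing. The sketch below is only an illustration: it declares a local copy of the struct so that it compiles on its own, the ID values are made up, and `isSpecial` and `stripSpecial` are hypothetical helpers, not part of the package.

```go
package main

import "fmt"

// ModelInfo is a local mirror of the documented struct so that this
// sketch compiles without the real package.
type ModelInfo struct {
	VocabularySize        int
	BeginningOfSentenceID int
	EndOfSentenceID       int
	UnknownID             int
	PadID                 int
}

// isSpecial is a hypothetical helper: it reports whether id matches one of
// the special token IDs recorded in info.
func isSpecial(info ModelInfo, id int) bool {
	switch id {
	case info.BeginningOfSentenceID, info.EndOfSentenceID, info.UnknownID, info.PadID:
		return true
	}
	return false
}

// stripSpecial is a hypothetical helper: it returns a copy of ids with all
// special token IDs removed.
func stripSpecial(info ModelInfo, ids []int) []int {
	out := make([]int, 0, len(ids))
	for _, id := range ids {
		if !isSpecial(info, id) {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	// Made-up values; a real model proto defines its own IDs.
	info := ModelInfo{
		VocabularySize:        1000,
		BeginningOfSentenceID: 2,
		EndOfSentenceID:       1,
		UnknownID:             3,
		PadID:                 0,
	}
	fmt.Println(stripSpecial(info, []int{2, 17, 42, 1, 0})) // prints [17 42]
}
```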
type Processor
type Processor struct {
	// contains filtered or unexported fields
}
Processor represents a SentencePiece processor (tokenizer). A Processor converts input text into a sequence of tokens that LLMs use, and converts token sequences back into text. The mapping between token IDs and the text they represent is read from the model proto provided to the constructor, so it stays the same across all calls to the Encode method.
The term "processor" comes from the original C++ SentencePiece library and its Python bindings.
func NewProcessor
NewProcessor creates a new Processor from a reader with the protobuf data.
func NewProcessorFromPath
NewProcessorFromPath creates a new Processor from a file path to the protobuf data.
func (*Processor) Decode
Decode translates a list of IDs produced by Encode back into the string it represents.
Example
	protoFile := os.Getenv("MODELPATH")
	if protoFile == "" {
		log.Println("Need MODELPATH env var to run example")
		return
	}
	proc, err := sentencepiece.NewProcessorFromPath(protoFile)
	if err != nil {
		log.Fatal(err)
	}
	ids := []int{17534, 2134}
	text := proc.Decode(ids)
	fmt.Println(text)
func (*Processor) DecodeTokens
DecodeTokens is a convenience wrapper around Decode, accepting a list of tokens as returned by Encode. It only uses the ID fields of tokens to decode the text.
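The wrapper pattern described above can be sketched as follows. This is an illustration under assumptions, not the library's implementation: `Token` here is a local stand-in (its `Text` field is an assumption; only the ID field is mentioned in the documentation), and `decode` uses a toy two-entry vocabulary instead of a model proto.

```go
package main

import (
	"fmt"
	"strings"
)

// Token is a local stand-in for the package's token type. Only ID is used
// for decoding; Text is an assumed extra field.
type Token struct {
	ID   int
	Text string
}

// vocab is a toy ID-to-text table standing in for the model proto.
var vocab = map[int]string{5: "Hello", 9: " world"}

// decode concatenates the text of each ID. IDs missing from the toy
// vocabulary contribute nothing.
func decode(ids []int) string {
	var sb strings.Builder
	for _, id := range ids {
		sb.WriteString(vocab[id])
	}
	return sb.String()
}

// decodeTokens extracts the IDs and delegates to decode, ignoring every
// other field of the tokens.
func decodeTokens(tokens []Token) string {
	ids := make([]int, len(tokens))
	for i, t := range tokens {
		ids[i] = t.ID
	}
	return decode(ids)
}

func main() {
	tokens := []Token{
		{ID: 5, Text: "ignored"},
		{ID: 9, Text: "also ignored"},
	}
	fmt.Println(decodeTokens(tokens)) // prints Hello world
}
```

Because only the IDs are read, tokens whose other fields were modified or left empty after encoding would still decode to the same text in this sketch.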
func (*Processor) Encode
Encode tokenizes the input text and returns a list of Tokens.
Example
	protoFile := os.Getenv("MODELPATH")
	if protoFile == "" {
		log.Println("Need MODELPATH env var to run example")
		return
	}
	proc, err := sentencepiece.NewProcessorFromPath(protoFile)
	if err != nil {
		log.Fatal(err)
	}
	text := "Encoding produces tokens that LLMs can learn and understand"
	tokens := proc.Encode(text)
	for _, token := range tokens {
		fmt.Println(token)
	}
Directories
| Path | Synopsis |
|---|---|
| internal | |
| cmd/dumper (command) | |
| cmd/wasm (command) | Main binary for exposing the go-sentencepiece functionality in the browser via WASM. |
| priorityqueue | Package priorityqueue provides a generic priority queue with Insert, PopMax, and RemoveFunc operations. |
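To make the priorityqueue synopsis concrete, here is a minimal sketch of a generic max-priority queue with the three named operations, built on a binary heap. The signatures, the comparison-function design, and the `New` and `Len` helpers are assumptions for illustration; the internal package's actual API is not shown on this page and may differ.

```go
package main

import "fmt"

// PriorityQueue is a binary max-heap ordered by a caller-supplied comparison.
type PriorityQueue[T any] struct {
	items []T
	cmp   func(a, b T) int // positive when a has higher priority than b
}

// New creates an empty queue that uses cmp to order elements.
func New[T any](cmp func(a, b T) int) *PriorityQueue[T] {
	return &PriorityQueue[T]{cmp: cmp}
}

// Len returns the number of queued elements.
func (pq *PriorityQueue[T]) Len() int { return len(pq.items) }

// Insert adds v and restores the heap property by sifting it up.
func (pq *PriorityQueue[T]) Insert(v T) {
	pq.items = append(pq.items, v)
	pq.up(len(pq.items) - 1)
}

// PopMax removes and returns the highest-priority element. The second
// result is false when the queue is empty.
func (pq *PriorityQueue[T]) PopMax() (T, bool) {
	var zero T
	if len(pq.items) == 0 {
		return zero, false
	}
	top := pq.items[0]
	last := len(pq.items) - 1
	pq.items[0] = pq.items[last]
	pq.items = pq.items[:last]
	pq.down(0)
	return top, true
}

// RemoveFunc deletes every element for which rm returns true, then rebuilds
// the heap bottom-up.
func (pq *PriorityQueue[T]) RemoveFunc(rm func(T) bool) {
	kept := pq.items[:0]
	for _, v := range pq.items {
		if !rm(v) {
			kept = append(kept, v)
		}
	}
	pq.items = kept
	for i := len(pq.items)/2 - 1; i >= 0; i-- {
		pq.down(i)
	}
}

// up moves the element at index i toward the root while it outranks its parent.
func (pq *PriorityQueue[T]) up(i int) {
	for i > 0 {
		parent := (i - 1) / 2
		if pq.cmp(pq.items[i], pq.items[parent]) <= 0 {
			return
		}
		pq.items[i], pq.items[parent] = pq.items[parent], pq.items[i]
		i = parent
	}
}

// down moves the element at index i toward the leaves while a child outranks it.
func (pq *PriorityQueue[T]) down(i int) {
	n := len(pq.items)
	for {
		largest := i
		for _, c := range []int{2*i + 1, 2*i + 2} {
			if c < n && pq.cmp(pq.items[c], pq.items[largest]) > 0 {
				largest = c
			}
		}
		if largest == i {
			return
		}
		pq.items[i], pq.items[largest] = pq.items[largest], pq.items[i]
		i = largest
	}
}

func main() {
	pq := New(func(a, b int) int { return a - b })
	for _, v := range []int{5, 1, 9, 4, 7} {
		pq.Insert(v)
	}
	pq.RemoveFunc(func(v int) bool { return v%2 == 0 }) // drops 4
	for pq.Len() > 0 {
		v, _ := pq.PopMax()
		fmt.Println(v) // pops 9, 7, 5, 1 in that order
	}
}
```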