Documentation ¶
Overview ¶
Package msdoc provides comprehensive functionality for reading, parsing, and creating Microsoft Word .doc files.
This package implements the complete MS-DOC binary file format specification, allowing extraction of text content, metadata, embedded objects, VBA macros, and formatting from Word 97-2003 documents. It also supports creating new documents and modifying existing ones, including support for encrypted/password-protected documents.
Basic reading usage:
doc, err := msdoc.Open("document.doc")
if err != nil {
	log.Fatal(err)
}
defer doc.Close()
text, err := doc.Text()
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)
metadata := doc.Metadata()
fmt.Printf("Title: %s\n", metadata.Title)
fmt.Printf("Author: %s\n", metadata.Author)
Reading encrypted documents:
doc, err := msdoc.OpenWithPassword("encrypted.doc", "password123")
if err != nil {
	log.Fatal(err)
}
defer doc.Close()
Creating new documents:
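writer := msdoc.NewWriter()
writer.SetTitle("My Document")
writer.SetAuthor("John Doe")
writer.AddParagraph("Hello, World!")
// Save writes the document to disk; check the error instead of discarding it.
if err := writer.Save("output.doc"); err != nil {
	log.Fatal(err)
}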
Package msdoc also supports creating and modifying .doc files.
Write support covers document creation, text insertion, formatting, and generation of the OLE2 compound document structure according to the MS-DOC specification.
Example usage:
writer := msdoc.NewWriter()
writer.SetTitle("My Document")
writer.SetAuthor("John Doe")
writer.AddParagraph("Hello, World!")
// Add formatted text
charProps := &formatting.CharacterProperties{
	Bold:     true,
	FontSize: 24, // 12pt (the value is in half-points)
}
writer.AddFormattedText("Bold text", charProps, nil)
// Save writes the document to disk; check the error instead of discarding it.
if err := writer.Save("output.doc"); err != nil {
	log.Fatal(err)
}
This implementation provides complete document creation capabilities including text content, formatting, metadata, and proper OLE2 structure generation.
Index ¶
- func NewWriter() *writer.DocumentWriter
- type Document
- func (d *Document) Close() error
- func (d *Document) GetAllVBAModules() ([]string, error)
- func (d *Document) GetEmbeddedObject(position uint32) (*EmbeddedObject, error)
- func (d *Document) GetEmbeddedObjects() (map[uint32]*EmbeddedObject, error)
- func (d *Document) GetFormattedText() ([]*TextRun, error)
- func (d *Document) GetVBACode(moduleName string) (string, error)
- func (d *Document) GetVBAProject() (*VBAProject, error)
- func (d *Document) HasEmbeddedObjects() bool
- func (d *Document) HasMacros() bool
- func (d *Document) IsEncrypted() bool
- func (d *Document) MarkdownText() (string, error)
- func (d *Document) Metadata() *Metadata
- func (d *Document) Text() (string, error)
- type DocumentWriter
- type EmbeddedObject
- type Metadata
- type TextRun
- type VBAProject
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func NewWriter ¶
func NewWriter() *writer.DocumentWriter
NewWriter creates a new document writer for creating .doc files.
Types ¶
type Document ¶
type Document struct {
// contains filtered or unexported fields
}
Document represents a loaded Microsoft Word .doc file. It provides methods for extracting text content, metadata, embedded objects, macros, and formatting information. It also supports decryption of encrypted documents.
func Open ¶
Open reads and parses the given .doc file. It prepares the document for further operations like text extraction.
The file must be a valid Microsoft Word .doc file (Word 97-2003 format). For encrypted documents, use OpenWithPassword instead.
Returns an error if the file cannot be opened, is not a valid .doc file, or if the internal OLE2 structure is corrupted.
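As an illustration of the "not a valid .doc file" case, the sketch below checks for the 8-byte signature that begins every OLE2 compound file. The helper name looksLikeOLE2 is hypothetical and not part of msdoc; how Open itself validates input is not shown in this documentation.

```go
package main

import (
	"bytes"
	"fmt"
)

// ole2Signature is the magic number at the start of every OLE2 compound
// file, the container format used by Word 97-2003 .doc files.
var ole2Signature = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}

// looksLikeOLE2 reports whether header begins with the OLE2 signature.
// It is a hypothetical helper for illustration, not an msdoc function.
func looksLikeOLE2(header []byte) bool {
	return bytes.HasPrefix(header, ole2Signature)
}

func main() {
	// A header that starts with the signature, followed by arbitrary bytes.
	good := append(append([]byte{}, ole2Signature...), 0x00, 0x00)
	fmt.Println(looksLikeOLE2(good))

	// A ZIP header (as used by .docx files) does not match.
	fmt.Println(looksLikeOLE2([]byte("PK\x03\x04")))
}
```

A signature match only shows that the container is OLE2. Other formats, such as .xls, share the same container, so a real check would still need to look for the WordDocument stream.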
func OpenWithPassword ¶
OpenWithPassword opens an encrypted (password-protected) .doc file and decrypts it with the provided password.
Returns an error if the file cannot be opened, is not a valid .doc file, the password is incorrect, or if decryption fails.
func (*Document) Close ¶
func (d *Document) Close() error
Close closes the underlying .doc file and releases associated resources. It is safe to call Close multiple times.
func (*Document) GetAllVBAModules ¶
func (d *Document) GetAllVBAModules() ([]string, error)
GetAllVBAModules returns the names of all VBA modules in the document.
func (*Document) GetEmbeddedObject ¶
func (d *Document) GetEmbeddedObject(position uint32) (*EmbeddedObject, error)
GetEmbeddedObject returns a specific embedded object by position.
func (*Document) GetEmbeddedObjects ¶
func (d *Document) GetEmbeddedObjects() (map[uint32]*EmbeddedObject, error)
GetEmbeddedObjects returns all embedded objects in the document.
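Because the result is a map keyed by position, Go's range order over it is not deterministic. A minimal sketch of visiting the entries in ascending position order follows; sortedPositions is a hypothetical helper, and plain strings stand in for the *EmbeddedObject values, whose fields are not documented here.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedPositions returns the keys of a position-keyed map in ascending
// order. It is generic so it also accepts map[uint32]*EmbeddedObject.
func sortedPositions[T any](objs map[uint32]T) []uint32 {
	positions := make([]uint32, 0, len(objs))
	for pos := range objs {
		positions = append(positions, pos)
	}
	sort.Slice(positions, func(i, j int) bool { return positions[i] < positions[j] })
	return positions
}

func main() {
	// Stand-in data: in real use this map would come from
	// doc.GetEmbeddedObjects().
	objs := map[uint32]string{512: "chart", 16: "image", 128: "worksheet"}
	for _, pos := range sortedPositions(objs) {
		fmt.Printf("position %d: %s\n", pos, objs[pos])
	}
}
```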
func (*Document) GetFormattedText ¶
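func (d *Document) GetFormattedText() ([]*TextRun, error)
GetFormattedText extracts text together with its formatting information. It returns a slice of TextRun pointers, each holding a run of text and the formatting applied to it.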
func (*Document) GetVBACode ¶
func (d *Document) GetVBACode(moduleName string) (string, error)
GetVBACode returns the VBA code for a specific module.
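The macro methods can be combined to collect the source of every module. The sketch below works against a small interface that copies the HasMacros, GetAllVBAModules, and GetVBACode signatures from the index, on the assumption that *Document satisfies it. collectMacros and fakeDoc are hypothetical names, and fakeDoc is only an in-memory stand-in so the example runs without a .doc file.

```go
package main

import (
	"fmt"
	"sort"
)

// macroSource lists the VBA-related methods shown in the index.
// *msdoc.Document is assumed to satisfy it.
type macroSource interface {
	HasMacros() bool
	GetAllVBAModules() ([]string, error)
	GetVBACode(moduleName string) (string, error)
}

// collectMacros returns a map from module name to source code.
// It returns a nil map when the document has no macros.
func collectMacros(src macroSource) (map[string]string, error) {
	if !src.HasMacros() {
		return nil, nil
	}
	names, err := src.GetAllVBAModules()
	if err != nil {
		return nil, fmt.Errorf("list modules: %w", err)
	}
	code := make(map[string]string, len(names))
	for _, name := range names {
		text, err := src.GetVBACode(name)
		if err != nil {
			return nil, fmt.Errorf("read module %q: %w", name, err)
		}
		code[name] = text
	}
	return code, nil
}

// fakeDoc is a hypothetical in-memory stand-in used only for this demo.
type fakeDoc map[string]string

func (f fakeDoc) HasMacros() bool { return len(f) > 0 }

func (f fakeDoc) GetAllVBAModules() ([]string, error) {
	names := make([]string, 0, len(f))
	for name := range f {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}

func (f fakeDoc) GetVBACode(name string) (string, error) {
	text, ok := f[name]
	if !ok {
		return "", fmt.Errorf("module %q not found", name)
	}
	return text, nil
}

func main() {
	doc := fakeDoc{"ThisDocument": "Sub AutoOpen()\nEnd Sub"}
	code, err := collectMacros(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for name, text := range code {
		fmt.Printf("%s: %d bytes\n", name, len(text))
	}
}
```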
func (*Document) GetVBAProject ¶
func (d *Document) GetVBAProject() (*VBAProject, error)
GetVBAProject extracts the VBA project from the document. Returns an error if the document does not contain macros.
func (*Document) HasEmbeddedObjects ¶
func (d *Document) HasEmbeddedObjects() bool
HasEmbeddedObjects returns true if the document contains embedded objects.
func (*Document) IsEncrypted ¶
func (d *Document) IsEncrypted() bool
IsEncrypted returns true if the document is encrypted.
func (*Document) MarkdownText ¶
func (d *Document) MarkdownText() (string, error)
MarkdownText extracts the document text with hyperlinks formatted as Markdown links.
func (*Document) Metadata ¶
func (d *Document) Metadata() *Metadata
Metadata extracts comprehensive metadata from the document.
This method parses both the SummaryInformation and DocumentSummaryInformation streams to extract document properties such as title, author, creation date, company, manager, and many other standard and custom properties.
The current implementation handles all standard OLE property types as well as custom properties.
It returns a Metadata structure populated with whatever information is available, and it never returns an error.
func (*Document) Text ¶
func (d *Document) Text() (string, error)
Text extracts the plain text content from the document.
This method parses the document's piece table to reconstruct the original text from potentially fragmented pieces stored throughout the file. It handles both ANSI and Unicode text encodings as specified in the MS-DOC format.
For encrypted documents, this method will decrypt the content if a password was provided during opening.
Returns an error if:
- The document is encrypted but no password was provided or decryption failed
- The piece table is corrupted or invalid
- Required streams (WordDocument, Table) cannot be read
- Text data extends beyond stream boundaries
For documents with no text content, returns an empty string with no error.
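The reconstruction described above can be sketched as follows. This is a simplified illustration, not the package's implementation: the piece struct and assemble function are hypothetical, the offset and compression flag are taken as already decoded (in the real format they are packed into a single field), and 8-bit characters are mapped straight to code points, which ignores the few bytes that MS-DOC maps to other Unicode characters.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
	"unicode/utf16"
)

// piece is a simplified, hypothetical piece descriptor.
type piece struct {
	charCount  int    // number of characters in the piece
	offset     uint32 // byte offset of the piece's text in the stream
	compressed bool   // true: one byte per character; false: UTF-16LE
}

// assemble concatenates the text of all pieces in order. It returns an
// error if a piece points outside the stream.
func assemble(stream []byte, pieces []piece) (string, error) {
	var sb strings.Builder
	for i, p := range pieces {
		width := 2
		if p.compressed {
			width = 1
		}
		start := int(p.offset)
		end := start + p.charCount*width
		if p.charCount < 0 || start > len(stream) || end < start || end > len(stream) {
			return "", fmt.Errorf("piece %d: text extends beyond stream boundaries", i)
		}
		raw := stream[start:end]
		if p.compressed {
			// Approximation: treat each byte as a Latin-1 code point.
			for _, b := range raw {
				sb.WriteRune(rune(b))
			}
			continue
		}
		units := make([]uint16, p.charCount)
		for j := range units {
			units[j] = binary.LittleEndian.Uint16(raw[2*j:])
		}
		sb.WriteString(string(utf16.Decode(units)))
	}
	return sb.String(), nil
}

func main() {
	// Seven 8-bit characters followed by five UTF-16LE characters.
	stream := []byte("Hello, ")
	stream = append(stream, 'W', 0, 'o', 0, 'r', 0, 'l', 0, 'd', 0)
	pieces := []piece{
		{charCount: 7, offset: 0, compressed: true},
		{charCount: 5, offset: 7, compressed: false},
	}
	text, err := assemble(stream, pieces)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

The bounds check corresponds to the "text data extends beyond stream boundaries" error case, and an empty piece list yields an empty string with no error.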
type DocumentWriter ¶
type DocumentWriter = writer.DocumentWriter
DocumentWriter provides functionality for creating and modifying .doc files. It is an alias for writer.DocumentWriter, which keeps the public API clean.
func NewDocumentWriter ¶
func NewDocumentWriter() *DocumentWriter
NewDocumentWriter creates a new document writer for creating .doc files. This function replaces the previous stub implementation with full functionality.
type EmbeddedObject ¶
type EmbeddedObject = objects.EmbeddedObject
EmbeddedObject represents an object embedded in the document. This is an alias for objects.EmbeddedObject.
type Metadata ¶
type Metadata = metadata.DocumentMetadata
Metadata holds comprehensive document metadata information. This is an alias for metadata.DocumentMetadata for backward compatibility.
type TextRun ¶
type TextRun = formatting.TextRun
TextRun represents a run of text with consistent formatting. This is an alias for formatting.TextRun.
type VBAProject ¶
type VBAProject = macros.VBAProject
VBAProject represents a VBA project contained in the document. This is an alias for macros.VBAProject.