msdoc

package
v0.0.0-...-227fc60 Latest
Warning

This package is not in the latest version of its module.

Published: Sep 15, 2025 License: MIT Imports: 15 Imported by: 0

Documentation

Overview

Package msdoc provides comprehensive functionality for reading, parsing, and creating Microsoft Word .doc files.

This package implements the complete MS-DOC binary file format specification, allowing extraction of text content, metadata, embedded objects, VBA macros, and formatting from Word 97-2003 documents. It can also create new documents and modify existing ones. Encrypted and password-protected documents are supported as well.

Basic reading usage:

doc, err := msdoc.Open("document.doc")
if err != nil {
	log.Fatal(err)
}
defer doc.Close()

text, err := doc.Text()
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)

metadata := doc.Metadata()
fmt.Printf("Title: %s\n", metadata.Title)
fmt.Printf("Author: %s\n", metadata.Author)

Reading encrypted documents:

doc, err := msdoc.OpenWithPassword("encrypted.doc", "password123")
if err != nil {
	log.Fatal(err)
}
defer doc.Close()

Creating new documents:

writer := msdoc.NewWriter()
writer.SetTitle("My Document")
writer.SetAuthor("John Doe")
writer.AddParagraph("Hello, World!")
if err := writer.Save("output.doc"); err != nil {
	log.Fatal(err)
}

Write support covers document creation, text insertion, formatting, and generation of the OLE2 compound document container according to the MS-DOC specification.

Creating a document with formatted text:

writer := msdoc.NewWriter()
writer.SetTitle("My Document")
writer.SetAuthor("John Doe")
writer.AddParagraph("Hello, World!")

// Add formatted text
charProps := &formatting.CharacterProperties{
	Bold:     true,
	FontSize: 24, // 12pt
}
writer.AddFormattedText("Bold text", charProps, nil)

// Save reports failures through its error result; check it.
if err := writer.Save("output.doc"); err != nil {
	log.Fatal(err)
}

This implementation provides complete document creation capabilities including text content, formatting, metadata, and proper OLE2 structure generation.
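The `FontSize: 24, // 12pt` line in the example suggests that font sizes are given in half-points, which is also how the MS-DOC format stores character size. A minimal sketch of the conversion follows; `halfPoints` is a hypothetical helper for illustration and is not part of msdoc.

```go
package main

import (
	"fmt"
	"math"
)

// halfPoints converts a non-negative font size in points to half-point
// units, rounding to the nearest half point.
// Hypothetical helper, not part of the msdoc API.
func halfPoints(points float64) uint16 {
	return uint16(math.Round(points * 2))
}

func main() {
	fmt.Println(halfPoints(12))   // 24
	fmt.Println(halfPoints(10.5)) // 21
}
```

Under that assumption, a value such as `halfPoints(12)` could be passed wherever the example uses the literal 24.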

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewWriter

func NewWriter() *writer.DocumentWriter

NewWriter creates a new document writer for creating .doc files.

Types

type Document

type Document struct {
	// contains filtered or unexported fields
}

Document represents a loaded Microsoft Word .doc file. It provides methods for extracting text content, metadata, embedded objects, macros, and formatting information. It also supports decryption of encrypted documents.

func Open

func Open(filename string) (*Document, error)

Open reads and parses the given .doc file. It prepares the document for further operations like text extraction.

The file must be a valid Microsoft Word .doc file (Word 97-2003 format). For encrypted documents, use OpenWithPassword instead.

Returns an error if the file cannot be opened, is not a valid .doc file, or if the internal OLE2 structure is corrupted.

func OpenWithPassword

func OpenWithPassword(filename, password string) (*Document, error)

OpenWithPassword opens an encrypted or password-protected .doc file using the provided password.

Returns an error if the file cannot be opened, is not a valid .doc file, the password is incorrect, or if decryption fails.

func (*Document) Close

func (d *Document) Close() error

Close closes the underlying .doc file and releases associated resources. It is safe to call Close multiple times.
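The "safe to call multiple times" guarantee is commonly implemented with sync.Once. The following is a sketch of that general pattern only, with hypothetical types, and does not describe how Document.Close is actually written.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// onceCloser wraps an io.Closer so that the underlying Close runs at most
// once; later calls return the result of the first call.
// Hypothetical type for illustration, not part of msdoc.
type onceCloser struct {
	inner io.Closer
	once  sync.Once
	err   error
}

func (c *onceCloser) Close() error {
	c.once.Do(func() { c.err = c.inner.Close() })
	return c.err
}

// countingCloser records how often Close was called on it.
type countingCloser struct{ calls int }

func (c *countingCloser) Close() error {
	c.calls++
	return nil
}

func main() {
	inner := &countingCloser{}
	doc := &onceCloser{inner: inner}
	fmt.Println(doc.Close(), doc.Close()) // both calls succeed
	fmt.Println(inner.calls)              // the underlying Close ran once
}
```

With this pattern a deferred Close stays harmless even when the caller also closes the document explicitly on an early return path.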

func (*Document) GetAllVBAModules

func (d *Document) GetAllVBAModules() ([]string, error)

GetAllVBAModules returns the names of all VBA modules in the document.

func (*Document) GetEmbeddedObject

func (d *Document) GetEmbeddedObject(position uint32) (*EmbeddedObject, error)

GetEmbeddedObject returns a specific embedded object by position.

func (*Document) GetEmbeddedObjects

func (d *Document) GetEmbeddedObjects() (map[uint32]*EmbeddedObject, error)

GetEmbeddedObjects returns all embedded objects in the document.
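Because the result is a map keyed by position, Go iterates over it in an unspecified order. A small sketch for visiting objects in ascending position order follows. `sortedPositions` is a hypothetical helper, and plain strings stand in for the `*EmbeddedObject` values.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedPositions returns the keys of a position-keyed map in ascending
// order. Hypothetical helper, not part of msdoc.
func sortedPositions[T any](objects map[uint32]T) []uint32 {
	positions := make([]uint32, 0, len(objects))
	for pos := range objects {
		positions = append(positions, pos)
	}
	sort.Slice(positions, func(i, j int) bool { return positions[i] < positions[j] })
	return positions
}

func main() {
	// Stand-in data; a real caller would pass the map returned by
	// GetEmbeddedObjects.
	objects := map[uint32]string{912: "chart", 40: "image", 305: "spreadsheet"}
	for _, pos := range sortedPositions(objects) {
		fmt.Printf("%d: %s\n", pos, objects[pos])
	}
}
```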

func (*Document) GetFormattedText

func (d *Document) GetFormattedText() ([]*TextRun, error)

GetFormattedText extracts text together with its formatting information. It returns a slice of TextRun pointers, each holding a run of text and the formatting applied to it.

func (*Document) GetVBACode

func (d *Document) GetVBACode(moduleName string) (string, error)

GetVBACode returns the VBA code for a specific module.

func (*Document) GetVBAProject

func (d *Document) GetVBAProject() (*VBAProject, error)

GetVBAProject extracts the VBA project from the document. Returns an error if the document does not contain macros.

func (*Document) HasEmbeddedObjects

func (d *Document) HasEmbeddedObjects() bool

HasEmbeddedObjects returns true if the document contains embedded objects.

func (*Document) HasMacros

func (d *Document) HasMacros() bool

HasMacros returns true if the document contains VBA macros.
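The macro-related methods combine naturally: check HasMacros, list the modules, then fetch each module's code. The sketch below codes against a small interface that copies the three method signatures documented here, so that it can run against a fake. `macroSource`, `collectMacros`, and `fakeDoc` are hypothetical names; a `*Document` is expected to satisfy the interface if the signatures above are accurate.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// macroSource lists the macro-related methods documented for *Document.
type macroSource interface {
	HasMacros() bool
	GetAllVBAModules() ([]string, error)
	GetVBACode(moduleName string) (string, error)
}

// collectMacros returns module name -> source code, or nil when the
// document has no macros. Hypothetical helper, not part of msdoc.
func collectMacros(src macroSource) (map[string]string, error) {
	if !src.HasMacros() {
		return nil, nil
	}
	names, err := src.GetAllVBAModules()
	if err != nil {
		return nil, fmt.Errorf("listing VBA modules: %w", err)
	}
	code := make(map[string]string, len(names))
	for _, name := range names {
		text, err := src.GetVBACode(name)
		if err != nil {
			return nil, fmt.Errorf("reading module %q: %w", name, err)
		}
		code[name] = text
	}
	return code, nil
}

// fakeDoc is an in-memory stand-in for a document with macros.
type fakeDoc map[string]string

func (f fakeDoc) HasMacros() bool { return len(f) > 0 }

func (f fakeDoc) GetAllVBAModules() ([]string, error) {
	names := make([]string, 0, len(f))
	for name := range f {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}

func (f fakeDoc) GetVBACode(name string) (string, error) {
	text, ok := f[name]
	if !ok {
		return "", errors.New("no such module")
	}
	return text, nil
}

func main() {
	macros, err := collectMacros(fakeDoc{"Module1": "Sub Hello()\nEnd Sub"})
	fmt.Println(len(macros), err)
}
```

Checking HasMacros first avoids treating a macro-free document as an error case.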

func (*Document) IsEncrypted

func (d *Document) IsEncrypted() bool

IsEncrypted returns true if the document is encrypted.

func (*Document) MarkdownText

func (d *Document) MarkdownText() (string, error)

MarkdownText extracts the document text with hyperlinks formatted as Markdown links.
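Word stores a hyperlink as a field in the text stream: a begin mark (0x13), the instruction `HYPERLINK "url"`, a separator (0x14), the visible text, and an end mark (0x15). The sketch below shows one way such fields could be turned into Markdown links. It handles only simple, non-nested fields with a quoted URL, and it is an illustration rather than the package's actual algorithm. `fieldsToMarkdown` and `hyperlinkTarget` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	fieldBegin = 0x13 // start of a field
	fieldSep   = 0x14 // separates the instruction from the visible result
	fieldEnd   = 0x15 // end of a field
)

// hyperlinkTarget extracts the quoted URL from a HYPERLINK instruction.
func hyperlinkTarget(instr string) (string, bool) {
	instr = strings.TrimSpace(instr)
	if !strings.HasPrefix(instr, "HYPERLINK") {
		return "", false
	}
	rest := strings.TrimSpace(strings.TrimPrefix(instr, "HYPERLINK"))
	if !strings.HasPrefix(rest, `"`) {
		return "", false
	}
	end := strings.Index(rest[1:], `"`)
	if end <= 0 {
		return "", false
	}
	return rest[1 : 1+end], true
}

// fieldsToMarkdown rewrites HYPERLINK fields as [text](url) and replaces
// any other field by its visible result. Nested fields are not handled.
func fieldsToMarkdown(text string) string {
	var out strings.Builder
	for {
		begin := strings.IndexByte(text, fieldBegin)
		if begin < 0 {
			break
		}
		sep := strings.IndexByte(text[begin:], fieldSep)
		end := strings.IndexByte(text[begin:], fieldEnd)
		if end < 0 {
			break // unterminated field: keep the rest unchanged
		}
		out.WriteString(text[:begin])
		if sep >= 0 && sep < end {
			instr := text[begin+1 : begin+sep]
			result := text[begin+sep+1 : begin+end]
			if url, ok := hyperlinkTarget(instr); ok {
				fmt.Fprintf(&out, "[%s](%s)", result, url)
			} else {
				out.WriteString(result)
			}
		}
		text = text[begin+end+1:]
	}
	out.WriteString(text)
	return out.String()
}

func main() {
	raw := "See \x13 HYPERLINK \"https://example.com\" \x14the site\x15 for details."
	fmt.Println(fieldsToMarkdown(raw))
}
```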

func (*Document) Metadata

func (d *Document) Metadata() *Metadata

Metadata extracts comprehensive metadata from the document.

This method parses both the SummaryInformation and DocumentSummaryInformation streams to extract document properties such as title, author, creation date, company, manager, and many other standard and custom properties.

Extraction covers all standard OLE property types as well as custom properties.

Metadata returns a Metadata structure populated with whatever information is available. It has no error return.

func (*Document) Text

func (d *Document) Text() (string, error)

Text extracts the plain text content from the document.

This method parses the document's piece table to reconstruct the original text from potentially fragmented pieces stored throughout the file. It handles both ANSI and Unicode text encoding as specified in the MS-DOC format.

For encrypted documents, this method will decrypt the content if a password was provided during opening.

Returns an error if:

  • The document is encrypted but no password was provided or decryption failed
  • The piece table is corrupted or invalid
  • Required streams (WordDocument, Table) cannot be read
  • Text data extends beyond stream boundaries

For documents with no text content, returns an empty string with no error.
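The piece-table reconstruction described for Text can be sketched as follows. In MS-DOC, each piece covers a range of character positions and points at a file offset. Bit 30 of that offset marks 8-bit ("compressed") text stored at half the offset value, and otherwise the text is UTF-16LE. The sketch is simplified and is not the package's actual code: the `piece` type and function names are hypothetical, pieces are assumed to be already parsed, and compressed bytes are mapped directly to code points, which ignores the handful of byte values the specification remaps.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
	"unicode/utf16"
)

const compressedBit = 1 << 30 // fCompressed flag inside the raw offset

// piece describes one entry of an already parsed piece table.
type piece struct {
	cpStart, cpEnd uint32 // character positions covered by the piece
	fc             uint32 // raw offset value, including the fCompressed bit
}

var errOutOfBounds = errors.New("text data extends beyond stream boundaries")

// pieceText decodes the characters of a single piece from the stream.
func pieceText(stream []byte, p piece) (string, error) {
	if p.cpEnd < p.cpStart {
		return "", errors.New("piece has negative length")
	}
	n := int(p.cpEnd - p.cpStart)
	offset := int(p.fc & (compressedBit - 1)) // low 30 bits hold the offset
	if p.fc&compressedBit != 0 {
		offset /= 2 // compressed pieces store twice the real offset
		if offset+n > len(stream) {
			return "", errOutOfBounds
		}
		runes := make([]rune, n)
		for i, b := range stream[offset : offset+n] {
			runes[i] = rune(b) // simplification: direct byte-to-code-point mapping
		}
		return string(runes), nil
	}
	if offset+2*n > len(stream) {
		return "", errOutOfBounds
	}
	units := make([]uint16, n)
	for i := range units {
		units[i] = binary.LittleEndian.Uint16(stream[offset+2*i:])
	}
	return string(utf16.Decode(units)), nil
}

// extractText concatenates all pieces in character-position order.
func extractText(stream []byte, pieces []piece) (string, error) {
	var sb strings.Builder
	for _, p := range pieces {
		s, err := pieceText(stream, p)
		if err != nil {
			return "", err
		}
		sb.WriteString(s)
	}
	return sb.String(), nil
}

func main() {
	// "Hi" as UTF-16LE at offset 0, then " there" as 8-bit text at offset 4.
	stream := []byte{'H', 0, 'i', 0}
	stream = append(stream, " there"...)
	pieces := []piece{
		{cpStart: 0, cpEnd: 2, fc: 0},
		{cpStart: 2, cpEnd: 8, fc: 4*2 | compressedBit},
	}
	text, err := extractText(stream, pieces)
	fmt.Printf("%q %v\n", text, err)
}
```

The bounds checks correspond to the "text data extends beyond stream boundaries" error case listed above.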

type DocumentWriter

type DocumentWriter = writer.DocumentWriter

DocumentWriter provides functionality for creating and modifying .doc files. This is an alias for writer.DocumentWriter that keeps the public API clean.

func NewDocumentWriter

func NewDocumentWriter() *DocumentWriter

NewDocumentWriter creates a new document writer for creating .doc files. This function replaces the previous stub implementation with full functionality.

type EmbeddedObject

type EmbeddedObject = objects.EmbeddedObject

EmbeddedObject represents an object embedded in the document. This is an alias for objects.EmbeddedObject.

type Metadata

type Metadata = metadata.DocumentMetadata

Metadata holds comprehensive document metadata information. This is an alias for metadata.DocumentMetadata for backward compatibility.

type TextRun

type TextRun = formatting.TextRun

TextRun represents a run of text with consistent formatting. This is an alias for formatting.TextRun.

type VBAProject

type VBAProject = macros.VBAProject

VBAProject represents a VBA project contained in the document. This is an alias for macros.VBAProject.
